Metagenomics post processing (Data visualization and analysis)

Metagenomics post-processing refers to the set of analytical and interpretative steps applied after raw sequencing data have been quality-controlled, assembled, and taxonomically or functionally profiled. This stage focuses on transforming complex metagenomic outputs into biologically meaningful insights through data visualization, statistical analysis, and integrative interpretation.

Visual example of fungal metagenomics data
Visual example of fungal metagenomics data

This tutorial combines the workflows for analyzing WHO Fungal Priority Pathogens (WHO FPPL) and Antimicrobial Resistance (AMR) Genes into a single, cohesive guide. It assumes you have already processed your raw reads into count tables (as covered in the previous data processing step).

Glimpse of R program & R studio

R program is a free, open-source integrated development environment (IDE) for the R programming language, widely used for data analysis, visualization, and statistical interpretation.

Install R
Install Rstudio
Tutorial for basic R in ecology

Please check our video tutorial WHO FPPL visualization ▶️here and downloading our training data 📁here.
Additional information for R color palette here and ggplot info1 / info2.

Fungal Metagenomics Data Visualization Pipeline

Step 0: Open Rstudio cloud and Launch Console

Landing Page

Once you log in to Rstudio cloud, your web browser should bring up a similar window as the picture shown above. Click the button on the top right corner to create a new Rstudio project. Then, the next step is to click “Terminal” which should look like a picture below after you click on it.

R script: T1_MGS data processing_HPCC.Rmd

Prerequisites Ensure you have the following R packages installed and loaded. These libraries are essential.

library(tidyr)
library(tidyverse)
library(stringr)
library(readr)
library(dplyr)

Step 1: Data preparation

Put all txt file of kraken result into one folder Adjust the folder & pattern to your files (e.g., RK1.txt, RB1.txt, …)

## 1. Path to folder containing all report files
folder <- "C:/Users/ASUS/Downloads/TrainingFAILSAFE/Training data/sequence"

## 2. List all txt files in that folder

files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
## 3. Derive sample IDs from file names (strip path and extension)
sample_ids <- tools::file_path_sans_ext(basename(files))

# 4. Function to read one kraken report (all ranks)
read_kraken_report <- function(file, sid){
  df <- read.table(file,
                   sep = "\t",
                   header = FALSE,
                   quote = "",
                   comment.char = "",
                   stringsAsFactors = FALSE,
                   col.names = c("Percentage","Reads_in_clade","Reads_direct","Rank_code","NCBI_taxid","Taxon"))
  
  df$Taxon <- trimws(df$Taxon)
  
  # Now we just attach the counts column for this sample
  df <- df %>%
    group_by(Percentage, Reads_direct, Rank_code, NCBI_taxid, Taxon) %>%
    summarise(!!sid := sum(Reads_in_clade), .groups = "drop")
  
  return(df)
}

## 5. Read and combine all files into one table
counts <- Reduce(function(x, y) full_join(x, y,
                                          by = c("Percentage","Reads_direct","Rank_code","NCBI_taxid","Taxon")),
                 Map(read_kraken_report, files, sample_ids))

## 6. Replace NAs with zeros
counts[is.na(counts)] <- 0
counts_long <- counts %>%
  tidyr::pivot_longer(
    cols = -c(Percentage, Reads_direct, Rank_code, NCBI_taxid, Taxon),
    names_to = "SampleID",
    values_to = "Reads_in_clade"
  )
## save taxa as csv
readr::write_csv(counts_long, "HPCC_training.csv")

Step 2: Combining sequence & mapping file

## 1. read mapping file (.txt, tab-delimited)
map <- readr::read_tsv("C:/Users/ASUS/Downloads/TrainingFAILSAFE/Training data/training.mapping_file.txt")
## 2. merge mapping file with sample data (.txt, tab-delimited)
merged_data <- counts_long %>%
  left_join(map, by = "SampleID")

quick checks

unique(merged_data$SampleID)     # should be RK1..SS3
head(merged_data)                # Taxon should be scientific names now

Save metadata

## save metadata as csv
write.csv(merged_data, "training.metadata.csv", row.names=FALSE)

R script: T2_MGS Fungal WHO Taxonomicbar_HPCC.Rmd

Load the package

library(dplyr)
library(tidyr)
library(stringr)
library(readr)
library(reshape2)
library(ggplot2)
library(ggtext)
library(scales)
library(colorspace)

Step 1: Load the data from combining kraken + mapping file

## load metadata
metadata <- read.csv("C:/Users/ASUS/Downloads/TrainingFAILSAFE/training.metadata.csv", stringsAsFactors = FALSE)
metadata %>%
  count(SampleID) %>%
  filter(n > 1)
metadata %>% select(SampleID, crop) %>% distinct() %>% count(SampleID)

Step 2: Merge Taxa and Select Only FPPL

## 1. load taxa
taxa_long <- read.csv("C:/Users/ASUS/Downloads/TrainingFAILSAFE/HPCC_training.csv", stringsAsFactors = FALSE)
## 2. Merge counts with metadata (only SampleID & crop needed)
counts_meta <- taxa_long %>%
  left_join(
    metadata %>% select(SampleID, crop) %>% distinct(),
    by = "SampleID"
  )
## 3. Replace any marks in the taxa
counts_meta <- counts_meta %>%
  mutate(Taxon = str_replace_all(Taxon, "\\[|\\]", ""))
## 4. Select only WHO FPPL (case-insensitive, matches full names or genera)
FPPL <- c(
        #Critical group
          "Candida albicans",
          "Candida auris",
          "Aspergillus fumigatus",
          "Cryptococcus neoformans",
        #High group
          "Candida tropicalis",
          "Nakaseomyces glabratus",
          "Candida parapsilosis",
        #Eumycetoma causative group
          "Madurella",
          "Medicopsis romeroi",
          "Falciformispora",
          "Trematosphaeria grisea",
          "Chaetomium atrobrunneum",
          "Exophiala",
          "Cladophialophora bantiana",
          "Corynespora cassiicola",
          "Curvularia lunata",
          "Nigrograna mackinnonnii",
          "Phialophora verrucosa",
          "Pseudochaetosphaeronema larense",
          "Rhytidhysteron rufulum",
          "Neoscytalidium dimidiatum",
          "Pseudallescheria boydii",
          "Acremonium",
          "Microascus gracilis",
          "Microsporum",
          "Neocosmospora cyanescens",
          "Paecilomyces variotii",
          "Phaeoacremonium krajdenii",
          "Phaeoacremonium parasiticum",
          "Aspergillus candidus",
          "Aspergillus flavus",
          "Aspergillus nidulans",
          "Aspergillus niger",
          "Aspergillus sydowii",
          "Aspergillus terreus",
          "Aspergillus ustus",
          "Tricophyton interdigitale",
          "Tricophyton rubrum",
          "Sarocladium kiliense",
          "Diaporthe phaseolorum",
          "Pleurostoma ochraceum",
          "Lomentospora prolificans",
          "Pichia kudriavzevii",
          "Cryptococcus gattii",
          "Fusarium",
        #Mucorales
          "Mucor",
          "Rhizopus",
          "Apophysomyces",
          "Benjaminiella",
          "Cokeromyces",
          "Parasitella",
          "Pilaira",
          "Actinomucor",
          "Helicostylum",
          "Thamnidium",
          "Backusella",
          "Sporodiniella",
          "Blakeslea", 
          "Choanephora",
          "Gilbertella", 
          "Mycotypha", 
          "Pilobolus", 
          "Gongronella",
          "Absidia", 
          "Cunninghamella", 
          "Hesseltinella", 
          "Chlamydoabsidia", 
          "Halteromyces",
          "Lichtheimia",
          "Circinella", 
          "Rhizomucor",
          "Thermomucor", 
          "Zychaea", 
          "Dichotomocladium", 
          "Fennellomyces", 
          "Phascolomyces", 
          "Syncephalastrum", 
          "Phycomyces", 
          "Spinellus", 
          "Saksenaea", 
          "Radiomyces",
        #Medium group
          "Scedosporium",
          "Talaromyces marneffei",
          "Histoplasma",
          "Pneumocystis jirovecii", 
          "Coccidioides", 
          "Paracoccidioides")
## 5. Detect and filter FPPL taxa
pattern <- paste0("\\b(", paste(gsub(" ", "\\\\s+", FPPL), collapse="|"), ")\\b")
fppl_meta <- counts_meta %>%
  filter(str_detect(str_to_lower(Taxon), str_to_lower(pattern)))

Choosing only species - you can adjust to genus

## 6. Create Genus and Species columns + rename counts
fppl_meta <- fppl_meta %>%
  mutate(Genus = word(Taxon, 1),
         Species = word(Taxon, 1, 2)) %>%
  rename(Count = Reads_in_clade)   # <-- adjust this name if different
  

Step 3.1: Visualization 1 - Shows crop aggregate

Bargraph visualization

## 1. Add a Genus column (first word of the taxon)
fppl_meta <- fppl_meta %>%
  mutate(Genus = word(Taxon, 1),
         Species = word(Taxon, 1, 2))  # Genus + species epithet
## 2. Summarize total counts per crop × sample × species

per_crop <- fppl_meta %>%
  group_by(crop, SampleID, Species, Genus) %>%
  summarise(Count = sum(Count, na.rm = TRUE), .groups = "drop") %>%
  group_by(crop, SampleID) %>%

## 3. Thresholding for rare taxa    
  mutate(
    Prop     = Count / sum(Count, na.rm = TRUE),
    Species2 = ifelse(Prop < 0.005, "<0.5% abund.", Species)
  ) %>%
  group_by(crop, SampleID, Species2, Genus) %>%
  summarise(Prop = sum(Prop), .groups = "drop") %>%
  # --- NEW STEP: repair NAs in Species2 using Genus ---
  mutate(
    Species2 = dplyr::case_when(
      Species2 == "<0.5% abund." ~ Species2,                     # keep rare bin
      is.na(Species2) & !is.na(Genus) ~ paste0(Genus, " sp."),   # e.g. Coccidioides → Coccidioides sp.
      is.na(Species2) ~ "Unidentified species",                  # no genus info
      TRUE ~ Species2
    )
  )

optional based on condition:

> 1. Threshold <0.5%
  mutate(
    Prop     = Count / sum(Count, na.rm = TRUE),
    Species2 = ifelse(Prop < 0.005, "<0.5% abund.", Species)
  ) %>%
  group_by(crop, SampleID, Species2, Genus) %>%
  summarise(Prop = sum(Prop), .groups = "drop") %>%
  # --- NEW STEP: repair NAs in Species2 using Genus ---
  mutate(
    Species2 = dplyr::case_when(
      Species2 == "<0.5% abund." ~ Species2,                     # keep rare bin
      is.na(Species2) & !is.na(Genus) ~ paste0(Genus, " sp."),   # e.g. Coccidioides → Coccidioides sp.
      is.na(Species2) ~ "Unidentified species",                  # no genus info
      TRUE ~ Species2
    )
  )


> or (Prop < 0.01, "<1% abund.", Species)

> 2. No threshold     
  mutate(
    Prop     = Count / sum(Count, na.rm = TRUE),
    Species2 = Species              # keep all species names
  ) %>%
  group_by(crop, SampleID, Species2, Genus) %>%
  summarise(Prop = sum(Prop), .groups = "drop") %>%
  # --- Fix NA species using Genus ---
  mutate(
    Species2 = case_when(
      is.na(Species2) & !is.na(Genus) ~ paste0(Genus, " sp."),
      is.na(Species2) ~ "Unidentified species",
      TRUE ~ Species2
    )
  )
## 4. Replace missing species names
per_crop <- per_crop %>%
  mutate(Species2 = ifelse(is.na(Species2), "Unidentified species", Species2))

## 5. Grouping & Data Normalization. 
per_crop_avg <- per_crop %>%
  group_by(crop, Species2) %>%
  summarise(Prop = mean(Prop, na.rm = TRUE), .groups="drop") %>%
  group_by(crop) %>%
  mutate(Prop = Prop / sum(Prop))
unique(per_crop_avg$Species2)
## 6. Define custom colors by species (grouped by genus)
species_colors <- c(
  # ===== Abundance category =====
  "<0.5% abund."              = "#d9d9d9",  # light grey

  # ===== Apophysomyces (teal) =====
  "Apophysomyces sp."         = "#80cdc1",
  "unclassified Apophysomyces"= "#35978f",

  # ===== Exophiala (purple) =====
  "Exophiala aquamarina"      = "#d4b9da",
  "Exophiala calicioides"     = "#c994c7",
  "Exophiala sp."             = "#df65b0",
  "unclassified Exophiala"    = "#dd1c77",

  # ===== Fusarium (red / magenta gradient) =====
  "Fusarium albidum"          = "#fee0d2",
  "Fusarium buharicum"        = "#fcbba1",
  "Fusarium buxicola"         = "#fc9272",
  "Fusarium concolor"         = "#fb6a4a",
  "Fusarium decemcellulare"   = "#ef3b2c",
  "Fusarium dimerum"          = "#cb181d",
  "Fusarium fujikuroi"        = "#a50f15",
  "Fusarium sambucinum"       = "#67000d",
  "Fusarium solani"           = "#980043",
  "Fusarium sp."              = "#ce1256",
  "Fusarium staphyleae"       = "#e7298a",
  "unclassified Fusarium"     = "#f768a1",

  # ===== Other FPPL taxa =====
  "Kendrickiella phycomyces"  = "#9ecae1",  # blue
  "Madurella fahalii"         = "#8c510a",  # brown
  "Neoscytalidium dimidiatum" = "#5ab4ac",  # teal
  "Rhytidhysteron rufulum"    = "#01665e",  # dark teal

  # ===== Unknown =====
  "Unidentified species"      = "black"
)

## 7. Apply in ggplot
ggplot(per_crop_avg, aes(x = crop, y = Prop, fill = Species2)) +
  geom_col(width = 0.8) +
  scale_y_continuous(labels = scales::percent, limits = c(0,1)) +
  scale_fill_manual(values = species_colors) +
  labs(x = "Crop", y = "Relative abundance (100%)", fill = "Species") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5)) +
  guides(fill = guide_legend(nrow = 30))
jpeg ("C:/Users/ASUS/Downloads/TrainingFAILSAFE/Result/TrainingAgregate.jpg", units="in", width = 8.8, height = 8.5, res = 1000)
ggplot(per_crop_avg, aes(x = crop, y = Prop, fill = Species2)) +
  geom_col(width = 0.8) +
  scale_y_continuous(labels = scales::percent, limits = c(0,1)) +
  scale_fill_manual(values = species_colors) +
  labs(x = "Crop", y = "Relative abundance (100%)", fill = "Species") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5)) +
  guides(fill = guide_legend(nrow = 30))
dev.off()

Extract percentage data

per_crop_avg <- per_crop %>%
  group_by(crop, Species2) %>%
  summarise(Prop = mean(Prop, na.rm = TRUE), .groups="drop") %>%
  group_by(crop) %>%
  mutate(Prop = Prop / sum(Prop))

View the percentage data

# Save as CSV
write.csv(per_crop_avg, "Training-Agregate_percentage.csv", row.names = FALSE)

Step 3.2: Visualization 2 - Shows sample replication

Bargraph visualization

## 1. Add a Genus column (first word of the taxon)
fppl_meta <- fppl_meta %>%
  mutate(Genus = word(Taxon, 1),
         Species = word(Taxon, 1, 2))  # Genus + species epithet
## 2. Summarize total counts per crop × sample × species
per_sample <- fppl_meta %>%
  group_by(SampleID, crop, Species, Genus) %>%
  summarise(Count = sum(Count, na.rm = TRUE), .groups="drop") %>%
  group_by(SampleID, crop) %>%
## 3. Thresholding for rare taxa 
    mutate(Prop = Count / sum(Count, na.rm = TRUE), Species2 = ifelse(Prop < 0.005, "<0.5% abund.", Species)) %>% group_by(crop, SampleID, Species2, Genus) %>% summarise(Prop = sum(Prop), .groups="drop")   # keep all species names

#optional based on condition:

# 1 mutate(Prop = Count / sum(Count, na.rm = TRUE), Species2 = ifelse(Prop < 0.01, "<1% abund.", Species)) %>% group_by(crop, SampleID, Species2, Genus) %>% summarise(Prop = sum(Prop), .groups="drop")
# or 0.005, "<0.5% abund."
# 2   mutate(Prop = Count / sum(Count, na.rm = TRUE), Species2 = Species) keep all species names
## 4. Replace missing species names
per_sample <- per_sample %>%
  mutate(Species2 = ifelse(is.na(Species2), "Unidentified species", Species2))

## 5. Grouping & Data Normalization.
per_sample_avg <- per_sample %>%
  group_by(SampleID, crop, Species2) %>%
  summarise(Prop = mean(Prop, na.rm = TRUE), .groups="drop") %>%
  group_by(SampleID, crop) %>%
  mutate(Prop = Prop / sum(Prop))

unique(per_sample$Species2)
## 5. remove "Unidentified species"
per_sample <- per_sample %>%
  filter(Species2 != "Unidentified species")
## 6. Define custom colors by species (grouped by genus)
species_colors <- c(
  "<0.5% abund."               = "#d9d9d9",
  "Apophysomyces sp."          = "#80cdc1",
  "Exophiala aquamarina"       = "#d4b9da",
  "Fusarium albidum"           = "#fee0d2",
  "Fusarium buharicum"         = "#fcbba1",
  "Fusarium buxicola"          = "#fc9272",
  "Fusarium decemcellulare"    = "#ef3b2c",
  "Fusarium dimerum"           = "#cb181d",
  "Fusarium fujikuroi"         = "#a50f15",
  "Fusarium sambucinum"        = "#67000d",
  "Fusarium solani"            = "#980043",
  "Fusarium sp."               = "#ce1256",
  "Fusarium staphyleae"        = "#e7298a",
  "Kendrickiella phycomyces"   = "#9ecae1",
  "Madurella fahalii"          = "#8c510a",
  "Neoscytalidium dimidiatum"  = "#5ab4ac",
  "unclassified Apophysomyces" = "#35978f",
  "unclassified Fusarium"      = "#f768a1",
  #"Unidentified species"       = "black",
  "Exophiala calicioides"      = "#c994c7",
  "Exophiala sp."              = "#df65b0",
  "Fusarium concolor"          = "#fb6a4a",
  "unclassified Exophiala"     = "#dd1c77",
  "Rhytidhysteron rufulum"     = "#01665e"
)
## 7. Priority group mapping
priority_map <- c(
  "<0.5% abund."               = "Unidentified",
  "Apophysomyces sp."          = "High",
  "Exophiala aquamarina"       = "High",
  "Fusarium albidum"           = "High",
  "Fusarium buharicum"         = "High",
  "Fusarium buxicola"          = "High",
  "Fusarium decemcellulare"    = "High",
  "Fusarium dimerum"           = "High",
  "Fusarium fujikuroi"         = "High",
  "Fusarium sambucinum"        = "High",
  "Fusarium solani"            = "High",
  "Fusarium sp."               = "High",
  "Fusarium staphyleae"        = "High",
  "Kendrickiella phycomyces"   = "High",
  "Madurella fahalii"          = "High",
  "Neoscytalidium dimidiatum"  = "High",
  "unclassified Apophysomyces" = "High",
  "unclassified Fusarium"      = "High",
  "Unidentified species"       = "Unidentified",
  "Exophiala calicioides"      = "High",
  "Exophiala sp."              = "High",
  "Fusarium concolor"          = "High",
  "unclassified Exophiala"     = "High",
  "Rhytidhysteron rufulum"     = "High"
)

## 8. Grouping & Data Normalization.
per_sample_avg_fppl <- per_sample %>%
  filter(Species2 %in% names(priority_map)) %>%
  group_by(SampleID, crop) %>%
  mutate(Prop = Prop / sum(Prop)) %>%  # renormalize
  ungroup() %>%
  mutate(
    Priority = priority_map[Species2],
    Species_priority = paste0("[", Priority, "] ", Species2)
  )
unique(per_sample_avg_fppl$Species_priority)
## 9. Reorder legend: Critical → High → Medium
priority_order <- c("Critical", "High", "Medium", "Unidentified")

# a. Make Priority an ordered factor
per_sample_avg_fppl <- per_sample_avg_fppl %>%
  dplyr::mutate(
    Priority = factor(Priority, levels = priority_order)
  )

# b. Get species in the order of Priority
species_order <- per_sample_avg_fppl %>%
  dplyr::filter(!is.na(Priority)) %>%      # drop species without FPPL group
  dplyr::distinct(Species_priority, Priority) %>%
  dplyr::arrange(Priority) %>%
  dplyr::pull(Species_priority)

# c. Apply that order to Species_priority
per_sample_avg_fppl <- per_sample_avg_fppl %>%
  dplyr::mutate(
    Species_priority = factor(Species_priority, levels = species_order)
  )
# 10. Define the overlapping species (between your color list and your data)
valid_species <- intersect(names(species_colors), per_sample_avg_fppl$Species2)

# 11. Build the color vector with matching names
species_colors_with_priority <- setNames(
  species_colors[valid_species],  # 35 colors (for example)
  paste0("[", priority_map[valid_species], "] ", valid_species)  # 35 names
)
colnames(per_sample_avg_fppl)
unique(per_sample_avg_fppl$crop)
unique(per_sample_avg_fppl$Species_priority)[1:5]
trialplot <- ggplot(per_sample_avg_fppl, aes(x = SampleID, y = Prop, fill = Species_priority)) +
  geom_col(width = 0.9) +
  facet_grid(~ crop, scales = "free_x", space = "free_x") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(
    values = species_colors_with_priority,
    na.value = "grey80",            # fallback color if anything is unmatched
    drop = FALSE                    # keep all legend levels
  ) +
  labs(
    x = "Sample",
    y = "Relative abundance",
    fill = "WHO FPPL Species"
  ) +
  theme_bw() +
  theme(
    strip.text = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 15, angle = 45, hjust = 1, color = "black"),
    axis.text.y = element_text(size = 15, color = "black"),
    panel.grid.minor = element_blank(),
    legend.title = element_text(size = 16, face = "bold"),
    legend.text  = element_text(size = 14, face = "italic", color = "black"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold")
  ) +
  guides(fill = guide_legend(ncol = 1, override.aes = list(size = 4)))
trialplot
jpeg ("C:/Users/ASUS/Downloads/TrainingFAILSAFE/Result/TrainingComplete_persample.jpeg", units="in", width = 11.5, height = 9.8, res = 1200)
trialplot
dev.off()

Extract percentage data

per_sample_avg.data <- per_sample_avg_fppl %>%
  group_by(SampleID, crop, Species2) %>%
  summarise(Prop = mean(Prop, na.rm = TRUE), .groups="drop") %>%
  group_by(SampleID, crop) %>%
  mutate(Prop = Prop / sum(Prop))

View the percentage data

# Save as CSV
write.csv(per_sample_avg.data, "Training-persample_percentage.csv", row.names = FALSE)

R script: T3_ResGen FungAMR data_HPCC.Rmd

#Install only for the first time
install.packages(c("httr", "jsonlite"))

library(tidyr)
library(tidyverse)
library(stringr)
library(readr)
library(tidyverse)
library(vegan)       # for ordination, diversity
library(phyloseq)    # for integration (optional)
library(ComplexHeatmap) # for heatmap visualization
library(microbiome)
library(purrr)
library(dplyr)
library(tools)
library(httr)
library(jsonlite)

Step 1: Data preparation

Put all txt file of kraken result into one folder Adjust the folder & pattern to your files (e.g., RK1.txt, RB1.csv, …)

## 1. Path to folder containing all CSV report files
folder <- "C:/Users/ASUS/Downloads/TrainingFAILSAFE/Training data/FungAMR"

## 2. List all .csv files in that folder
files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
## 3. Define the expected columns (in order)
expected_cols <- c(
  "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
  "qstart", "qend", "sstart", "send", "evalue", "bitscore",
  "qlen", "slen", "stitle"
)
# 4. Safely read and combine all CSVs
mmseqs_all <- map_dfr(files, function(f) {
  if (file.info(f)$size > 0) {  # skip empty files
    df <- read.csv(f, header = TRUE, stringsAsFactors = FALSE)
    
    # Optional: check that all expected columns are present
    missing_cols <- setdiff(expected_cols, names(df))
    if (length(missing_cols) > 0) {
      warning("File ", basename(f), " is missing columns: ",
              paste(missing_cols, collapse = ", "))
    } else {
      # keep columns in a consistent order
      df <- df[, expected_cols]
    }
    
    # Add SampleID from filename (without extension)
    df$SampleID <- tools::file_path_sans_ext(basename(f))
    return(df)
  }
  return(NULL)
})
# 5. Inspect the combined dataframe
head(mmseqs_all)
str(mmseqs_all)
names(mmseqs_all)
# 6. Retrieve the species name in dataframe
mmseqs_all <- mmseqs_all %>%
  mutate(
    Species = str_extract(stitle, "(?<=OS=)[^\\(]+"),
    Species = str_trim(Species)
  )
# 7.1 Clean the species name in dataframe
mmseqs_all <- mmseqs_all %>%
  mutate(
    Species = stitle,
    # 1. cut everything before OS=
    Species = sub(".*OS=", "", Species),
    # 2. remove UniProt-style metadata if present
    Species = sub(" OX=.*", "", Species),
    Species = sub(" GN=.*", "", Species),
    Species = sub(" PE=.*", "", Species),
    Species = sub(" SV=.*", "", Species),
    # 3. cut any strain info in parentheses
    Species = sub(" \\(.*", "", Species),
    # 4. trim whitespace
    Species = trimws(Species)
  )
print(mmseqs_all$Species)
# 7.2 Retrieve the genus name in dataframe
mmseqs_all <- mmseqs_all %>%
  mutate(Genus = sub(" .*", "", Species))
# 8. Filter the percent identic genes
mmseqs_filtered <- mmseqs_all %>%
  filter(pident >= 70)
#percentage optional   filter(pident >= 90)

Step 2: Data alignment from UniProt Database

# 9. Retrieve gene accession
mmseqs_filtered <- mmseqs_filtered %>%
  mutate(Accession = str_extract(sseqid, "(?<=[\\|])[A-Z0-9]+(?=\\|)"))
print(mmseqs_filtered$Accession)
# 10. Retrieving gene information from UniProt Database
#a. Add gene that wanted to search
accessions <- unique(mmseqs_filtered$Accession)
#b. UniProt search
get_uniprot_batch <- function(accessions_batch) {
  # Build query of the form: accession:D5MTG0 OR accession:S5UGJ8 ...
  query_str <- paste(sprintf("accession:%s", accessions_batch), collapse = "+OR+")
  
  url <- paste0(
    "https://rest.uniprot.org/uniprotkb/search?query=",
    query_str,
    "&fields=accession,protein_name,gene_primary,organism_name&format=json&size=500"
  )
  
  resp <- httr::GET(url)
  
  # Stop if HTTP error (e.g. no internet)
  httr::stop_for_status(resp)
  
  res <- jsonlite::fromJSON(content(resp, "text", encoding = "UTF-8"))
  
  # If no results, return empty tibble
  if (is.null(res$results) || length(res$results) == 0) {
    warning("No UniProt results for batch: ", paste(accessions_batch, collapse = ", "))
    return(tibble(
      Accession = character(),
      Gene      = character(),
      Protein   = character(),
      Organism  = character()
    ))
  }
  
  # Extract fields safely
  df <- res$results %>%
    mutate(
      Accession = primaryAccession,
      Gene = sapply(
        genes,
        function(g) {
          if (!is.null(g$geneName$value) && length(g$geneName$value) > 0) {
            g$geneName$value
          } else {
            NA_character_
          }
        }
      ),
      Protein  = proteinDescription$recommendedName$fullName$value,
      Organism = organism$scientificName
    ) %>%
    select(Accession, Gene, Protein, Organism)
  
  return(df)
}
#c. Retrieving Uniprot information
get_uniprot_info <- function(accessions, batch_size = 50) {
  acc_unique <- unique(accessions)
  
  # Split accessions into batches of 'batch_size'
  batches <- split(acc_unique, ceiling(seq_along(acc_unique) / batch_size))
  
  # Fetch each batch and bind rows
  results_list <- lapply(batches, get_uniprot_batch)
  uniprot_df   <- bind_rows(results_list)
  
  return(uniprot_df)
}
#d. Extract unique accessions
accessions <- unique(mmseqs_filtered$Accession)
#e. Run UniProt annotation
uniprot_info <- get_uniprot_info(accessions)

head(uniprot_info)
#f. Merge back into your main dataframe & view result
mmseqs_annotated <- mmseqs_filtered %>%
  left_join(uniprot_info, by = "Accession")

head(mmseqs_annotated)
#g. Remove the NA genes
mmseqs_annotated <- mmseqs_annotated %>%
  filter(!is.na(Protein) & Protein != "" & Protein != "NA")
#h. Change the NA gene to Putative gene
mmseqs_annotated <- mmseqs_annotated %>%
  mutate(Gene = ifelse(is.na(Gene) | Gene == "", 
                         "Putative gene", 
                         Gene))

Step 3: Data analysis

# Data set calculation
# 11. Count AMR genes per sample
amr_counts <- mmseqs_annotated %>%
    count(SampleID, Gene, Species, Protein, name = "Hits")
# 12. Eliminate fungal non WHO-FPPL
amr_counts <- amr_counts %>%
  filter(!grepl("^Alternaria", Species))
# 13. Combining sequence & mapping file
## read mapping file (.txt, tab-delimited)
map <- readr::read_tsv("D:/OneDrive/Dokumen/R/R data/FAILSAFE/Thailand/SakhonNakhonApril2025.mapping_file.txt")
# 14. Merge AMR gene abundance across crops / environments
amr_by_crop <- amr_counts %>%
  left_join(map, by="SampleID") %>%
  group_by(crop, Gene, Species, Protein) %>%
  summarise(TotalHits = sum(Hits), .groups="drop")

Step 4.1: Data visualization - Agregate

Bargraph visualization

barplot <- ggplot(amr_by_crop, aes(x = crop, y = TotalHits, fill = Species)) +
  geom_col(position = "stack") +
  facet_wrap(~ Gene, scales = "free_y") +
  theme_bw() +
  labs(
    title = "Species Contribution to AMR Genes Across Crops",
    x = "Crop",
    y = "Total AMR Gene Hits"
  )
barplot
jpeg ("C:/Users/ASUS/Downloads/TrainingFAILSAFE/Result/FungAMRHeatBarplotTraining.jpg", units="in", width = 12, height = 8.5, res = 800)
barplot

Heatmap visualization

Trialplot <- ggplot(amr_by_crop, aes(x = Protein, y = crop, fill = TotalHits)) +
  geom_tile(color = "gray80") +
  scale_fill_viridis_c() +
  facet_grid(Species ~ Gene, scales = "free") +
  theme_bw() +
  theme(
    axis.text.x  = element_text(angle = 45, hjust = 1, size = 7),
    axis.text.y  = element_text(size = 8),

    strip.text.y = element_text(size = 8, angle = 0, hjust = 1, face = "bold"),
    strip.text.x = element_text(size = 8, face = "bold"),
    panel.spacing.y = unit(0, "cm"),
    panel.spacing.x = unit(0, "cm")
  )
Trialplot
jpeg ("C:/Users/ASUS/Downloads/TrainingFAILSAFE/Result/FungAMRHeatMapTraining.jpg", units="in", width = 12, height = 8.5, res = 800)
Trialplot

Step 4.2: Data visualization - per-sample

amr_by_sample <- amr_counts %>%
  group_by(SampleID, Species, Protein) %>%
  summarise(TotalHits = sum(Hits), .groups = "drop")
amr_by_sample <- amr_by_sample %>%
  mutate(
    SampleID = factor(SampleID, levels = sort(unique(SampleID))),
    Species  = factor(Species)
  )
Heattrial <- ggplot(amr_by_sample, aes(x = SampleID, y = Species, fill = TotalHits)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "red") +
  facet_wrap(~ Protein, scales = "free_y") +
  theme_bw() +
  theme(
    axis.text.x  = element_text(angle = 45, hjust = 1),
    strip.text   = element_text(face = "bold", size = 9),
    panel.spacing = unit(0.5, "lines")
  ) +
  labs(
    title = "AMR Gene Hits by Sample, Species and Protein Function",
    x     = "Sample (replicates)",
    y     = "Species",
    fill  = "Hits"
  )
Heattrial


jpeg ("C:/Users/ASUS/Downloads/TrainingFAILSAFE/Result/FungAMRHeatMapPer-sample.jpg", units="in", width = 10, height = 4.5, res = 800)
Heattrial

Fungal Metagenomics Data Pipeline Summary