To enable accurate profiling of phages in whole microbial community shotgun (whole-genome sequencing, WGS) sequence datasets from stool samples, a more comprehensive database of gut phages (and a small number of eukaryotic viruses) was compiled from publicly available resources18,19,20,21. This database, the Trove of Gut Virus Genomes (TGVG; Methods), consists of genomes from 110,296 viral species-level genome bins (SGBs; clustered at 95% average nucleotide identity and 85% alignment fraction, per community standard25) derived from human-gut metagenome studies. Genomes from 42.6% of the viral SGBs are predicted to be 90–100% complete and the database contains numerous phages infecting all major taxa of gut bacteria (Extended Data Fig. 1). Therefore, it is important to note that the majority of the viral SGBs in TGVG are from incomplete genomes.
To facilitate comparable detection of viral and cellular genomes in WGS data, we developed a marker gene approach. This approach utilizes the concept that some genes are more taxonomically informative than others and, therefore, requires genomes to be represented by a subset of genes that are species-specific and invariable. For each viral SGB in TGVG, essential genes (that is, those involved in virion structure, genome packaging and genome replication) were annotated as potential markers. After de-replication, 416,428 unique viral marker genes were detected. The 49,111 virus SGBs with four or more unique marker genes were used for taxonomic profiling. We developed a bioinformatics tool, Marker-MAGu, that leverages marker genes and taxonomic identities from MetaPhlAn 4 and TGVG to generate trans-kingdom taxonomic profiles for gut metagenomes (Fig. 1a,b; https://github.com/cmmr/Marker-MAGu). Given that WGS data represent sequences from double-stranded DNA molecules, this tool and database focuses on double-stranded DNA viruses.
Fig. 1: Marker-MAGu enables trans-kingdom taxonomic profiling of gut metagenomes.a, Schematic of the Marker-MAGu database structure. b, Marker-MAGu profiling ruleset. c, Average Bray–Curtis dissimilarity between adjacent samples for all participants with at least ten samples for virome and bacteriome (n = 566). ****P < 1 × 10−4, Wilcoxon test. d–f, Example prevalence plots of phages (bottom) and their putative host bacteria (top) in the genera Enterocloster (d), Bacteroides (e) and Faecalibacterium (f) in individual study participants during development as measured by Marker-MAGu. The dot size represents the relative abundance of the taxa. d, Bacteria, n = 4; viruses, n = 19. e, Bacteria, n = 6; viruses, n = 17. f, Bacteria, n = 1, viruses, n = 12. c–f, The boxplot centre line is the median, the box boundaries (hinges) correspond to the first and third quartiles, and the whiskers extend to the largest/smallest value no further than 1.5× the interquartile range (IQR) from the hinge.
Simulated read data from bacteria and phages showed that Marker-MAGu has high specificity at all coverage levels for bacteria and viruses along with high sensitivity starting at 0.5× the average read depth (Extended Data Fig. 2a–c). Archaea and micro-eukaryotes in the MetaPhlAn 4 database were detectable but we only focused on bacteria and viruses in this Article. Using real TEDDY WGS data, Marker-MAGu returned a consistent ratio of viral SGBs to bacterial SGBs (Extended Data Fig. 2d). This ratio was only slightly affected by the number of reads in a sample (Extended Data Fig. 2e; coefficient of multiple correlation (R) = 0.089). As expected, Marker-MAGu closely recapitulated MetaPhlAn 4 abundance measurements from sequencing of a bacterial mock community, with slightly lower sensitivity and slightly better specificity (Extended Data Fig. 2f,g).
One hundred metagenomic datasets from the TEDDY study were selected at random for in-depth analysis of virus SGB detection comparing Marker-MAGu to the field standard of alignment of reads to whole virus genomes. Exemplar genomes from all Marker-MAGu viral SGBs were used as the genome database and the breadth and depth of genome coverage per sample were obtained. On the one hand, read alignment to whole genomes without strict breadth of coverage thresholding can lead to false positives. On the other hand, high breadth of coverage is a reliable indication of the presence of a genome; however, in a typical metagenomic library, only viruses that are present at very high abundance will achieve this (for example ≥75% coverage breadth26), leading to false negatives. Extended Data Fig. 3a shows an example of error avoidance enabled by Marker-MAGu. In the top row (sample 4,098,288), reads aligned across the whole genome and Marker-MAGu detected this genome/viral SGB. In the bottom row (sample 6,146,150), reads only aligned to the flanks, which probably encode integrases or bacterial genes that were not removed by pruning. Marker-MAGu correctly failed to detect this genome.
We then considered all the viral SGBs/exemplars with any reads aligning in terms of whether Marker-MAGu did or did not detect these. We expected many false positive genomes/viral SGBs to show low breadth of coverage but high depth of coverage, with true positive alignments showing an even spread of reads across the genome. These patterns were observed in Extended Data Fig. 3b with Marker-MAGu ‘Not-detected’ versus ‘Detected’, respectively. This suggests that Marker-MAGu focuses on true positive signals. Encouragingly, >95% of genomes with ≥70% breadth of coverage were detected (Extended Data Fig. 3c). Finally, we also noted that most detected genomes/viral SGBs have low overall breadth of coverage (median = 28%), which shows the sensitivity of this tool (Extended Data Fig. 3d).
Viruses have a higher turnover rate than the bacteria in the gutTo calculate viral and bacterial SGB prevalence and abundance in 12,262 WGS samples from 887 participants of the TEDDY study, sequencing reads (average of 12.0 million reads per sample) were analysed using the Marker-MAGu marker gene approach. With WGS data, we expected to detect virus genomes inside virions as well as dormant and actively replicating virus genomes inside host cells. Trans-kingdom taxonomic profiles for each participant showed a dynamic interplay between phages and their host bacteria, often revealing successive waves of viral SGBs or viral SGB communities affecting single host bacterial SGBs (Fig. 1d–f). Community change was indeed higher from sample to sample for viral SGBs than bacterial SGBs (Fig. 1c).
Across the entire dataset, 1,709 different bacterial SGBs and 15,693 viral SGBs were detected at least once. Bacterial SGBs (Fig. 2a) had higher saturation based on a rarefaction curve, with similar observations at the genus level. Virus clusters (VCs) are computationally imputed genus-level genome clusters. Although most bacterial and viral SGBs were detected in only one or a few human participants, the skew towards rareness was greater for viral SGBs (Fig. 2b), underpinning the individual-specific nature of the virome. Relatedly, when all longitudinal samples were combined by participant, the Bray–Curtis dissimilarity of the virome was greater between participants than the bacteriome (Fig. 2c) and the ɑ diversity of the virome was on average greater than the bacteriome (Fig. 2d).
Fig. 2: Virome and bacteriome community change.a, Rarefaction curves of viral and bacterial taxa. Ribbons represent minimum and maximum taxa across 50 simulations. b, Distribution of viral and bacterial SGBs in the study participants. The number of participants with any detectable levels of each SGB was determined. Bins (0–50 participants, 51–100 participants and so on) were plotted as a percentage of total SGBs of each type on the y axis (log scale) and trendlines were compared with emtrends (linear model, Student’s t-test). c, All versus all Bray–Curtis distance for bacteriome and virome communities after averaging the microorganism abundance across all available samples. d, Bacteriome and virome ɑ-diversity measurements in gut communities. Each dot represents the average value from a participant. The boxplot centre line is the median, the left and right hinges correspond to the first and third quartiles, and the whiskers extend to the largest/smallest value no further than 1.5× the IQR from the hinge. Wilcoxon test. e, SGB ‘persistence’ within study participants. The data were filtered to exclude participants with <10 samples. Violins represent the density of values from each observation of an SGB in a participant (number of times SGB detected ÷ number of total samples analysed). The solid black lines mark the 50% quantile and the dotted lines are the 25% and 75% quantile marks. Wilcoxon test. f, Cumulative number of viral and bacterial SGBs per participant. Dots depicting the viral and bacterial SGBs of individual study participants, detected across all samples, are connected by a grey line. g, Heaps’ law ɑ values were plotted for each participant on bacterial and viral SGBs to determine ecosystem openness (coloured lines). The genome is considered open if ɑ < 1 (shown by a solid black line). ****P < 1 × 10−4.
The longitudinal nature of the data revealed that the virome has a higher participant-specific diversity and turnover rate than the bacteriome. First, it was observed that viral SGBs are detected, on average, in a smaller percentage of a participant’s samples than bacterial SGBs (Fig. 2e). This supports the idea of a higher turnover rate. A plot of the number of unique samples per participant indicated that new viral SGBs seem to accumulate at a nearly linear rate, whereas new bacterial SGB accumulation approaches saturation (Fig. 2f). This was confirmed by the observed lower Heaps’ law ɑ value for viral SGBs per participant (Fig. 2g).
Similar to previous reports on the bacteriome and virome in early childhood, we observed that the ɑ diversity of viruses and bacteria increased during infancy and for at least the first three years (Extended Data Fig. 4a,b)14, and this was comparable for participants in all four countries of origin (Extended Data Fig. 4c,d). Furthermore, t-distributed stochastic neighbour embedding (t-SNE) analysis (often used to represent relationships between high dimensional datapoints) of all samples demonstrated that samples from early infancy partitioned into many clusters, whereas samples taken from older children converged (Extended Data Fig. 4e). Samples did not partition by eventual type 1 diabetes diagnosis in this analysis (Extended Data Fig. 4e). There is some debate about whether Crassvirales (also known as crAss-like phages) increase or decrease in abundance from infancy to early childhood14,19. In TEDDY, 504 putative Crassvirales viral SGBs were detected in the TGVG (Methods) and a clear increase of abundance and prevalence of Crassvirales over time, with a plateau perhaps being reached 1,000 days after birth, was observed in the TEDDY cohort (Extended Data Fig. 4f). Similarly, the ‘original’ crAss-like phage, now known as Carjivirus communis, was frequently detected (Extended Data Fig. 4g) and long-lived in some patients, being detectable in over 20 longitudinal samples (Extended Data Fig. 4h).
Ecological succession of common gut viral and bacterial SGBsTo understand temporal trends in the TEDDY cohort, we parsed Marker-MAGu reports for all 12,262 sequencing libraries to limit further analyses to SGBs detected in ≥100 samples. This collection consists of 446 bacterial SGBs and 1,291 viral SGBs. By calculating the relative abundance of these common SGBs from 0 to 1,400 days of life across all samples, eight ‘temporal subsets’ of co-abundant microorganisms were separated through latent-class trend analysis27 (Fig. 3) using reasonable, if somewhat subjective, criteria. Specifically, eight clusters yielded locally optimal scores on these data with multiple clustering metrics when 2–12 possible clusters were compared (Extended Data Fig. 5). Interestingly, the temporal subsets describe SGBs that peak at different host ages (Fig. 3a,b). Subsets 1 and 2 included SGBs that were present immediately after birth and declined thereafter. Subsets 3–6 included SGBs that peaked at 100–200, 300–400, 400–600 and 800–1,100 days, respectively, after birth and Subsets 7 and 8 included SGBs that continually increased over the days of life covered in this cohort. As expected based on earlier studies12,13, common bacteria in the earliest subsets included Bifidobacterium breve and Bifidobacterium longum, whereas later subsets included Bacteroidales such as Phocaeicola vulgatus, Bacteroides uniformis and Alistipes onderdonkii along with Faecalibacterium species (Supplementary Table 1).
Fig. 3: Global patterns of viral and bacterial SGBs during development.a, Temporal subsets of prevalent viral SGBs drawn in separate boxes (left) and the temporal cluster membership (right). b, Same as a but with bacterial SGBs. a,b, The grey lines are individual viral SGBs and the coloured lines are average lines. c, Day of life abundance data were used for t-SNE analysis of all prevalent SGBs. d, Correlation between prevalent viral and bacterial SGBs. Given that the host prediction of phages is most accurate at the genus level, the temporal data from a and b were compared for each viral SGB and all bacterial SGBs from the putative host genus, and the best correlation was plotted. The boxplot centre line is the median, the lower and upper hinges correspond to the first and third quartiles, and the whiskers extend to the largest/smallest value no further than 1.5× the IQR from the hinge.
The viral and bacterial temporal abundance data were compared, and it was found that the viral SGBs were almost always well-correlated with their putative bacterial hosts (Fig. 3c and Extended Data Fig. 6) and better correlated with predicted hosts than other taxa (Extended Data Fig. 6a). The correlation between each viral SGB and a bacterial SGB in its predicted host genus was typically one of the best (if not the best) correlation in the set (Extended Data Fig. 6b). Although this is expected of obligate parasites, this finding validated the approach used here. The major exception to this correlation was between Lactococcus bacteria and their phages. A definitive explanation cannot be offered but it is notable that Lactococcus bacteria are commonly ingested due to their role in making dairy fermentation products such as cheese.
Virus data increases discrimination of samples from different countriesThe TEDDY study cohort consists of children who are vulnerable to the development of type 1 diabetes (Methods) and reside in four western countries (Germany, Finland, Sweden and USA). To understand how type 1 diabetes develops, each of the 114 TEDDY study participants that developed type 1 diabetes was matched—based on geographical location, sex and family history of type 1 diabetes—with one of 114 participants who did not develop type 1 diabetes24. This nested case–control study design was used here to assess microbial communities during the development of type 1 diabetes. Similar to studies of bacterial community composition in the TEDDY cohort as well as metastudies of type 1 diabetes, the abundance of individual viral and/or bacterial SGBs could not reliably discriminate between children who developed type 1 diabetes and those who did not (Fig. 4(right))13,28. However, random forest classifiers demonstrated that both bacteriome and virome data had discriminatory power to separate samples geographically, with the virome data outperforming the bacteriome data (virome was better in 2/4 countries, bacteriome was better in 0/4 countries) and a combination of both types of SGBs outperforming either measure alone (4/4 countries). Quantification of the most important features (SGBs) by country and day of life demonstrated geographical differences for many of these features (Extended Data Fig. 7). For example, the Lachnospiraceae bacterium SGB4754 is most abundant in the USA, Microbacterium SGB53643 is most abundant in Germany and Bacteroides ovatus is most abundant in Finland (Extended Data Fig. 7).
Fig. 4: Machine learning on virus and bacteria abundance data.Boxes represent data from 50 permutations of a random forest model on the same data split into train and test groups (70% and 30%) wherein all samples from each participant were always kept in the same group. Each iteration was run with a different random seed. NS, not significant (P ≥ 0.5), ****P < 1 × 10−4; Wilcoxon test. The boxplot centre line is the median, the lower and upper hinges correspond to the first and third quartiles, and the whiskers extend to the largest/smallest value no further than 1.5× the IQR from the hinge. IA, islet autoimmunity; T1D, type 1 diabetes.
Differences in rate of community change between groups in TEDDY studyAnalysis of the viral and bacterial community change over time in each participant was conducted by calculating the Bray–Curtis dissimilarity between each sample and the next, proceeding temporally. Interestingly, the average Bray–Curtis dissimilarity was modestly lower in participants who went on to develop type 1 diabetes than those who did not for both the virome and bacteriome measurements (Fig. 5a,d). Furthermore, when looking at comparisons across age, the trend was statistically significant only between 400 and 700 days of life (approximately the second year of life; Fig. 5b,e), whereas the median age of diagnosis of type 1 diabetes was 2.4 yr (about 880 days)24. Some differences in average Bray–Curtis dissimilarity were also seen between participants from different countries (Fig. 5c,f). Measurements of the relative combined abundance of the taxa from different temporal subsets (see Fig. 3a,b) revealed that samples (particularly 700–1,400 days of life) from participants diagnosed with type 1 diabetes had differential abundances of temporal subsets 2, 5, 6 and 7 at various ages (Extended Data Fig. 8).
Fig. 5: Comparison of community change and diversity.a, Virome average Bray–Curtis dissimilarity metrics for type 1 diabetes (T1D) using nested case–control design pairing each participant who developed type 1 diabetes with a control participant. Each participant is represented by a dot (n = 228). Case–control pairs are joined by lines. b, Samples were binned by day of life (rounded) of collection and plotted by Bray–Curtis dissimilarity to the follow-up sample when both participants in the pair had ≥1 sample from that time period (n = 228). c, Virome average Bray–Curtis dissimilarity metrics by country (not nested case–control; n = 566). d–f, As per a–c, respectively, but for the bacteriome. NS, not significant (P ≥ 0.05), *P < 0.5, **P < 0.01, ***P < 0.001; paired two-sided Wilcoxon test. The boxplot centre line is the median, the lower and upper hinges correspond to the first and third quartiles, and the whiskers extend to the largest/smallest value no further than 1.5× the IQR from the hinge.
留言 (0)