SNV-FEAST: microbial source tracking with single nucleotide variants

SNV-FEAST algorithm

Here we adapt FEAST to accept SNV abundance instead of species abundance as input. A computational challenge in using SNVs instead of species as input to FEAST is that SNVs contribute a significantly larger feature space. The number of different species comprising a microbiome can range from a few hundred to a few thousand, while the number of possible SNVs for a given species alone can be in the thousands [10]. This difference in the number of input features can result in FEAST runtimes that last several hours instead of a few minutes and memory-intensive storage of read counts at all sites of variation.

We devised a likelihood-based approach for selecting a set of informative or “signature” SNVs for a given source tracking analysis, allowing us to overcome the time and memory-intensive challenges of utilizing SNV-level data. We identify these informative SNVs by computing a signature score (Fig. 1A) (see the “Methods” section) that quantifies the extent to which SNVs in the sink are most likely derived from one of the potential sources. This is analogous to identifying SNVs private to sources and their sinks, but more generalized to include SNVs that may be found in multiple sources, albeit at higher frequency in one of the potential sources (see the “Methods” section).

To compute a signature score for a given SNV, two hypotheses are compared for each potential source: (1) that one source solely explains the observed allele counts in the sink and (2) all sources except that one source collectively explain the observed allele counts in the sink. For each hypothesis, we calculate the binomial log-likelihood for the estimate of the allele frequency in the sink, θ.

Hypothesis 1: Source i with allele frequency \(_\) explains the allele counts in the sink.

Hypothesis 2: A combination of all other sources except i (sources \(j\ne\)i) explains the observed allele count distribution in the sink. The estimate of the sink allele frequency is computed using a mixture of the allele frequencies \(_\) from those sources. The mixing parameter \(}_\) is learned using Sequential Least Squares Programming with the constraint that \(\sum\limits_}_=1\).

$$\widehat=\sum\limits_}__$$

The binomial log-likelihood is calculated as follows, where there are n reads with the reference allele and m reads with the alternative allele in the sink.

$$LL\left(\widehat\right)=n\mathit\widehat+m\mathit\left(1-\widehat\right)$$

A log-likelihood ratio representing the support for hypothesis 1 relative to hypothesis 2 is calculated per site per potential source. The maximum log-likelihood ratio per site is the signature score for that SNV, representing how favorably one of the sources explains the sink over all other sources. Signature SNVs are those with scores greater than two standard deviations over the mean signature score computed for all SNVs (see the “Methods” section).

Evaluation of SNV-FEAST in simulations

To compare the accuracy of species-FEAST and SNV-FEAST, we performed simulations mimicking mother-infant transmissions with the goal of estimating contributions of different sources to an infant sink. Our simulations tested the ability of SNVs and species to recapitulate the true source composition in synthetic samples comprised of a mixture of reads drawn from multiple real fecal adult samples. To construct these synthetic infant microbiomes, we mixed metagenomic data from mothers sampled in a mother-infant dataset [11] at various proportions as described below (see the “Methods” section).

The difficulty of source tracking increases with the number of contributing sources [2]. Thus, we simulate infants that have a small (≤ 5) versus large (6–10) number of contributing sources (Additional file 1: Table S1), including an unknown source (e.g., a randomly selected unrelated mother). Known source contributions to the simulated gut microbiome sample of the infant varied between 1 and 90% while the unknown contribution varied between 10 and 90%. The unknown source was not presented to FEAST as a potential known source.

Additionally, not all species in a mother are transmitted to the infant [5, 7, 12,13,14]. Thus, in our simulations, species transmission rates were determined using a beta distribution, which is a natural model for values between (0, 1) and often proposed for microbial abundance data [15,16,17,18] (see the “Methods” section). We therefore consider four simulated scenarios: a combination of low versus high number of sources and low versus high transmission rates (see the “Methods” section).

Figure 2 compares the performance of SNV-FEAST and species-FEAST in estimating the true contribution of sources. FEAST using SNVs has equal if not better performance than species in most scenarios and performs especially well when transmission rates are low and unknown source proportions are high. SNVs have a lower root mean squared error (RMSE) compared to species in three of the four scenarios and higher Pearson correlation between true and estimated contributions in all four scenarios. The difference in these correlations for SNVs versus species is significant in all four cases when using a paired Wilcoxon signed rank test (high transmission: p-value = 0.00560, 0.00251 for small and large number of sources, low transmission: p-value = 0.00024, 0.002340 for small and large number of sources). These results suggest that SNVs may offer useful signatures of transmission.

Fig. 2figure 2

Ability of SNV and species-FEAST to recapitulate true contributions in simulations. Estimated known and unknown source proportions for infant microbiomes simulated with in silico mixtures of real maternal fecal microbiomes under different scenarios: either a small number of contributing sources (≤ 5) or large number of sources (6–11), and a high transmission rate of species or low transmission rate. The transmission rate is the probability of an infant being colonized by a given species, simulated using a beta distribution centered on the relative abundance of species in sources (see the “Methods” section). Twenty-three infants were simulated with five or fewer sources and 19 infants were simulated with a large number of sources (Table S1). The black line indicates the ground truth for proportions. For each simulated infant, there are 11 points plotted, whereby 10 correspond to known sources (some of which have zero contribution), and one corresponds to an unknown source which is indicated by hollow circles in the plot

To assess whether all species and all signatures SNVs in the sink are needed for accurate source tracking, we varied the proportion of species (10%, 50%, or 100%) and SNVs (10%, 50%, or 100%) included as inputs to the algorithm (Additional file 1: Fig. S1). We used Pearson correlation between the true and estimated proportions to represent the accuracy of SNV-FEAST. When decreasing the percentage of SNVs used, there is no statistically significant change in the performance. However, when decreasing the percentage of species used, there are statistically significant decreases in the performance (Additional file 1: Fig. S1).

To illustrate the advantage of SNV-FEAST over traditional strain tracking approaches such as inStrain [6], we used the same synthetic communities produced in the above simulation for inStrain profiling between each infant and each of their potential contributing sources (Additional file 1: Fig. S2). InStrain computes a popANI score, which represents the average nucleotide identity between two different metagenomic samples for a given species. As per the inStrain paper, popANI values > 99.999% represent the same strain being shared between samples for a given species (see the “Methods” section). However, this approach provides a binarization as to whether or not a strain was transmitted and does not account for the relative abundance of the strain in the sink. Thus, we computed the fraction of each infant’s species that have popANI \(\ge\) 99.999%, with each potential source.

As expected, both SNV-FEAST and inStrain produce estimates of sharing that correlate positively with the ground truth mixture proportions of the contributing source samples in each infant (Additional file 1: Fig. S2). We found inStrain results yielded a 0.742 Pearson correlation (p-value < 1 × 10−12) with the true mixture proportions, whereas SNV-FEAST has a 0.866 Pearson correlation (p-value < 1 × 10−12) with the true proportions. The higher correlation values for SNV-FEAST likely reflect that relative abundances of strains and their genomic identities are simultaneously taken into account for source tracking, whereas inStrain only accounts for genomic identities. Finally, several of the shared species in the simulations had popANI values < 99.999%, reflecting the complex mixtures from multiple sources.

We next compared SNV-FEAST with the strain tracking procedure in Nayfach et al. [3]. Again, we used the same synthetic communities produced in the simulation to determine marker alleles as defined in Nayfach et al. [3] (see the “Methods” section). Here a marker allele is determined to be a SNV that is private to mother, infant, or the mother-infant dyad, and absent from the background population, which consisted of other samples in the dataset as well as samples from US adults in the Human Microbiome Project [19, 20] (see the “Methods” section). Species with \(\ge\) 5% marker allele sharing between mother and infant were deemed to share a strain (see the “Methods” section). We found a high correlation between the true mixture proportions (on x-axis) and the percentage of species with transmission events (y-axis) (Pearson correlation 0.915, p-value < 1 × 10−16) (Additional file 1: Fig. S3A). The higher correlation for the Nayfach et al. [3] approach compared to the inStrain approach possibly reflects horizontal gene transfers between lineages residing in infants and mothers. By contrast, there was a lower correlation between the true mixture proportions (x-axis) and the sharing for all marker alleles across species present in the infant (y-axis) and (0.575 Pearson correlation, p-value < 1 × 10−16) (Additional file 1: Fig. S3B).

Source tracking in infants over the first year of life

Having assessed the abilities of SNV-FEAST in synthetic data, we next estimated the contribution from the true mother over time to the true infant with SNV and species-FEAST in the Bäckhed et al. [11] dataset. This dataset is composed of metagenomic samples from infants collected at 4 days, 4 months, and 12 months after birth, as well as their mothers at the time of delivery. Previous analyses on this data have shown that even while species similarity increases, infants and their mothers share fewer proportions of strains over time as revealed by sharing of SNVs private to mother-infant dyads [3]. Thus, SNVs belonging to strains shared only by the infant and their mother may be more informative of the true source compared to species. Here we sought to test whether SNV and species-FEAST recapitulate these results (see the “Methods” section).

In applying FEAST to the Bäckhed et al. [11] dataset, we estimated the proportion of the infant sample at birth attributable to their own mother. For 4-month-old infants, we estimated the proportion attributable to the mother and itself at birth. For 12-month-old infants, we estimated the proportion attributable to the mother and itself at birth and 4 months [2]. This allowed “unknown” to be more strictly defined as the component of the infant microbiome that could not be explained by the mother. It also allowed us to better discern if completely new strains were acquired at the 4th and 12th months of life (that were not already acquired during previous life stages).

First, consistent with previous findings made with species and SNVs [3], species-FEAST estimates an increasing contribution of the mother over time (t-test p-value = 5.1 × 10−4), but SNV-FEAST estimates a decrease over time (p-value = 0.063) (Fig. 3).

Fig. 3figure 3

Source tracking in the infant gut microbiome over the first year of life. Species- and SNV-FEAST were applied to Backhed et al. 2019 data to estimate the contribution of A, B mother, C, D unrelated mothers, and E, F unknown sources to infants sampled at birth, 4 months, and 12 months. The black line and inset statistics pertain to the linear regression fit for the source estimates as a function of age of the infant. G, H are swapped source tracking analyses with mother and infant swapped when using species-FEAST and SNV-FEAST, respectively. Additional file 1: Fig. S4 shows the species that were included in species-FEAST and species that had SNVs included in SNV-FEAST. Additional file 1: Fig. S5 shows the estimate of the unknown component when previous time points of the infant are excluded from the sources

Second, we assessed the ability of species and SNV-FEAST to distinguish the true mother from three randomly selected unrelated mothers. Species-FEAST estimates an increasing contribution of unrelated mothers over time (t-test p-value = 0.014) while SNV-FEAST estimates no significant change over time (t-test p-value = 0.59) (Fig. 3). The increase in contribution from unrelated mothers with species-FEAST does not suggest that these particular unrelated mothers are seeding the infant. Rather, the opposing trend observed with SNVs suggests that similarity at the species level is consistent with the maturation of the infant microbiome over time.

Finally, we estimated contributions from unknown sources, i.e., the proportion of the infant microbiome not explainable by the true mother, the three randomly selected unrelated mothers, or any previous time point. Species-FEAST estimates a sharp decline in the contribution of unknown sources over the first year of life (t-test p-value = 7.1 × 10−12) (Fig. 3). This significant decrease in unknown at the species level reflects the infant microbiome maturation over the first year of life. By contrast, SNV-FEAST estimates little change in the contribution of unknown sources (t-test p-value = 0.49) (Fig. 3). Note that this unknown component reflects what was gained since a previous time point. In other words, at 12 months, the infant on average acquired the same fraction of unknown as it did at 4 months and birth. When source tracking is run without including previous time points as sources, the unknown component increases over the first year of life for SNVs only (Additional file 1: Fig. S5).

Next, we sought to understand the effect of swapping sink and source in the re-analysis of Bäckhed et al. [11] data. In Fig. 3G and H, the infant at birth is the potential source and the mother is the sink. The estimated contribution from baby to mother is significantly smaller (species-FEAST: 11.9 difference, Wilcoxon rank sum test p-value = 0.013; SNV-FEAST: 16.0 difference, p-value = 2.2 × 10−5) compared to that of mother to baby. This trend may be suggestive, but is not conclusive, of directionality, whereby a less diverse source is seeded by a more diverse source.

Contribution of the NICU-built environment to infant microbiomes

Next, we re-analyzed a metagenomic dataset studying the contribution of the hospital environment to the infant gut microbiome in the neonatal intensive care unit (NICU) [21]. This dataset is composed of microbiomes of infant stool, as well as the NICU rooms of the same infants at frequently touched surfaces, sink basins, the floor, and isolette-top sampled over an 11-month period [21]. We applied SNV and species-FEAST to assess the contribution of the infant’s own NICU room as well as a different NICU room in the vicinity to the infant’s gut microbiome (see the “Methods” section).

Concordant with the findings of Brooks et al., both SNV and species-FEAST detected that the most common source contributing to the infant microbiome was the floor and isolette-top from the infant’s own room (Fig. 4A, B). SNV-FEAST found Infant 18 also had large contributions from their own room’s touched surfaces at multiple time points (Fig. 4B), which is consistent with a finding by Brooks et al. that three strains found in Infant 18 perfectly matched (> 99.999% average nucleotide identity) strains found in the touched surfaces samples of Infant 18’s own room. Lastly, both species-FEAST and SNV-FEAST found Infant 6’s microbiome was explained almost entirely by samples from a different room with SNV-FEAST finding a sizeable contribution from both the floor and isolette top and the sink basin in this different room. This is concordant with Brooks et al.’s finding of multiple cases of strain sharing across rooms of Infant 6 and 12 for the different surfaces. FEAST with both data types can quantify the extent to which Infant 6’s microbiome was influenced by strains present in the built environment.

Fig. 4figure 4

Source tracking of infant gut microbiome in the NICU. A Species-FEAST and B SNV-FEAST applied to infants in the NICU. Each bar represents one sampling day in the NICU stay of an infant. Infants 3 and 6 stayed in the same room, but at different times. The same applies to Infants 12 and 18. The contribution of a different room was determined by using samples from Infant 12’s room for Infants 3 and 6, and samples from Infants 6’s room for Infants 12 and 18 for each of the categories of surfaces per infant: touched surface, sink basin, or floor and isolette top surface. The asterisks represent the result of a paired Wilcoxon signed rank test indicating whether the total contribution of surfaces from the infant’s own room was higher than contributions from the other room. Iterative swapping of the infant sink and each potential source for source tracking with C species-FEAST and D SNV-FEAST. The first column shows source tracking results in which the infant was treated as the sink. In each column after the first column, a different environmental source was swapped with the infant and treated as a sink. The brackets indicate the pairs of results that are compared using a paired Wilcoxon signed rank test. For all results, the following symbols represent the results of the statistical test: **** for p-value < 0.0001, *** for p-value < 0.001, ** for p-value < 0.001, * for p-value < 0.05, and n.s. for p-value > 0.05

Through application of SNV and species-FEAST, we can quantify any time trends for the influence of the built environment on the infant microbiome (Fig. 4A, B). SNV-FEAST more consistently finds that contribution from the infant’s own room exceeds contributions from a different room over time (paired Wilcoxon signed rank test for same room > different room: Infant 3: p-value = 1.95 × 10−9, Infant 6: 1.0, Infant 12: 3.05 × 10−5, Infant 18: 3.81 × 10−6) as compared to species-FEAST (Infant 3: p-value = 0.41, Infant 6: 1.0, Infant 12: 5.8 × 10−4, Infant 18: 3.81 × 10−6). Interestingly, species-FEAST assigns one dominant source primarily, whereas SNV-FEAST more often finds a combination of sources for a given sample.

Additionally, both SNV and species-FEAST estimated a large unknown component for all four infants, with Infant 18 showing the largest mean unknown component across the NICU stay based on SNVs (Additional file 1: Fig. S6). This unknown component is important because it signifies the extent to which other sources such as the mother and diet impact infant gut colonization.

We then asked the question is the infant more explained by the built environment rather than vice-versa, the built environment is more explained by the infant. We tested this by swapping the infant and each of the three built environment sources (Fig. 4C, D). The estimated contribution of room to infant is significantly higher than the estimated contribution of infant to room, but this asymmetry is more pronounced with SNV-FEAST. SNV-FEAST showed significantly higher contribution of room to infant for two of the three surface types (floor and isolette top: Wilcoxon rank sum test p-value = 7.00 × 10−9, touched surface: p-value = 0.0058, sink basin: p-value = 0.274) while species-FEAST found this to be true for one of the three surface types (floor and isolette top: Wilcoxon rank sum test p-value = 7.1 × 10−5, touched surface: p-value = 0.968, sink basin: p-value = 0.998). Interestingly, the built environments of different rooms highly resemble each other. This is especially apparent with species-FEAST, suggestive of similar ecological forces operating in similar built environments. By contrast, SNV-FEAST reveals a higher diversity of contributing sources of the built environment samples to other NICU-built environments, once again highlighting the utility of performing source tracking with SNVs.

Global source tracking of ocean microbiomes

The ocean microbiome is a complex community that displays biogeography at the species and functional levels [3, 22]. To further understand global patterns of ocean microbiomes, we applied SNV and species-FEAST to the Tara Oceans microbiome dataset [22]. In the source tracking context, rather than defining sharing as evidence of a transmission event (which is more likely in mother-infant data), estimated source contributions at best explain the extent to which a given ocean sample resembles other ocean samples. On one extreme, an ocean sample might be entirely explainable by a single ocean’s samples, and at the other extreme, an ocean sample might be explainable by multiple oceans at the same time. Another alternative is for an ocean sample to not be explainable by any of the provided sources, resulting in a high unknown component and potentially suggesting high endemism. These source tracking estimates could be indicative of the extent to which oceans mix or may be reflective of similar niches.

Tara Oceans is composed of 182 whole metagenomic sequencing samples derived from 64 stations at multiple depths. Previous research indicates that temperature is one of the highest drivers of variability in microbial composition in the ocean [22, 23]. For this reason, we restricted the source tracking analysis to sinks and sources from the same temperature and depth range: above 20 °C and within an average of 5 m below the surface.

First, we performed source tracking between oceans using SNV and species-FEAST. We treated each station around the world as a sink and estimated the contribution of different oceans around the world to that sink (see the “Methods” section). Unknown represents any portion of the microbiome in these sink samples that cannot be explained by any of the provided source samples. We found that species and SNV-FEAST estimate different amounts of sharing between oceans, where SNVs estimate a higher unknown on average, potentially indicative of endemism. The finding that SNV-FEAST estimates a higher unknown contribution on average is most evident in the North Pacific, North Atlantic, South Atlantic, and Mediterranean oceans (Additional file 1: Fig. S7). Additionally, in some oceans, SNVs identify contributions from oceans that species-FEAST does not detect (Fig. 5, Additional file 1: Fig. S7). For example, in applying FEAST to Indian Ocean samples, we find that there is measurable sharing of microbes with the Mediterranean Sea, but this is not detected with species (Fig. 5C). Such differences were found in samples from other oceans as well (Additional file 1: Fig. S7).

Fig. 5figure 5

Microbial source tracking in the Tara Oceans dataset with SNV and species-FEAST. A World map indicating the location of sampling sites. Source tracking estimates for the contribution of different oceans to the B South Pacific (n = 16) and C Indian Oceans (n = 16) are depicted with vertical bars. In each experiment, all stations around the world excluding those from the “sink” ocean are treated as potential sources. Light blue, for example, represents the total contribution of the four stations from the Mediterranean Sea that had samples in the surface layer that were also greater than 20 °C in temperature

Next, we assessed whether source tracking estimates display a distance-decay relationship. Previous studies found that genetic distance, such as that represented by fixation index FST, increases with geographic distance between populations [24, 25]. Based on these findings, our expectation was that samples that are further away from a given station will have reduced resemblance to that station. To assess this distance-decay relationship, we plotted pairwise source tracking results across all possible pairs of ocean samples (Fig. 6A, B). We found that indeed as the distance increases, the % explainability of a given source ocean decreases − 0.23% per thousand km according to species-FEAST (t-test p-value < 1 × 10–16), and − 0.5% per thousand km according to SNV-FEAST (t-test p-value = 0.0018). The steeper slope for SNV-FEAST suggests that SNVs may be more sensitive to distance decay signals on a global level.

Fig. 6figure 6

Source tracking with ocean samples. Distance decay in the contribution of a “source” ocean to a “sink” ocean when using A species-FEAST and B SNV-FEAST. In each experiment, only stations from one ocean were considered sources for a given sink station. For example, when performing source tracking between the Mediterranean and North Atlantic, for each Mediterranean station, the 10 available North Atlantic stations were considered potential sources. Thus, plotted are 10 points for a given Mediterranean sink, where each point represents the contribution of a source station from the North Atlantic to the Mediterranean sink station in question. Shown in the inset text are the slope and t-test p-value for the slope. C and D are flipped source tracking analysis with the Red Sea and Mediterranean, as well as the South Pacific Ocean and North Atlantic Ocean using species-FEAST and SNV-FEAST, respectively

Finally, we investigated whether some oceans have higher estimated contributions to other oceans than vice versa, potentially indicative of the directionality of transmissions (though see the “Discussion” section). Specifically, we investigated the relationship between the Red Sea to the Mediterranean Sea (Fig. 6C, D). Migration from the Red Sea to the Mediterranean, known as Lessepsian migration, is well-documented for not only microorganisms but also macroorganisms like fish [26,27,28]. Additionally, recent studies may suggest that anti-Lessepsian migration of bacteria (Mediterranean to the Red Sea) is more common than Lessepsian migration [29]. Studies find that the Mediterranean has brine pools that produce a similar environment to the Red Sea’s [30], allowing for bacteria from the MS to potentially thrive in the RS.

By swapping the Red Sea and Mediterranean as source and sink, we found that there was indeed a significant difference in the estimated contribution from one direction to another with SNVs but not species (Fig. 6C, D). SNV-FEAST found the Mediterranean explained an average of 15% of the Red Sea, while the Red Sea explained an average of 1.8% of the Mediterranean (Wilcoxon rank sum test, p-value = 0.02), consistent with anti-Lessepsian migration. Meanwhile, a similar analysis with species-FEAST found the Mediterranean explained 2.5% of the Red Sea and the Red Sea explained 4.9% of the Mediterranean (Wilcoxon rank sum test, p-value = 0.25). In a similar analysis between North Atlantic and South Pacific, we found that both species and SNVs supported significantly greater contributions from the North Atlantic to the South Pacific, with SNV-FEAST estimating a greater contribution (17%, Wilcoxon rank sum test p-value = 5.1 × 10−11) than species-FEAST (10%, Wilcoxon rank sum test p-value = 1.8 × 10−4). The same analysis performed in the other oceans is presented in Additional file 1: Fig. S8.

Together, these results suggest that on average, SNV and species FEAST generate similar source tracking results in the Tara Oceans dataset, with SNVs displaying stronger signals of endemism, distance-decay relationships, and potential directionality of transmission.

留言 (0)

沒有登入
gif