The pan-Draft module is integrated within gapseq, a bioinformatics pipeline designed for the automated de novo reconstruction of ready-to-use GEMs. Given a genomic sequence, the pipeline uses two modules (i.e., find and find-transport) to perform a sequence similarity search against databases collecting gene-protein-reaction annotation rules and a third module (i.e., draft) to reconstruct a preliminary metabolic network [6]. By processing the output of the find, draft, and find-transport commands, pan-Draft aims to reconstruct high-quality draft metabolic models from several incomplete MAGs having a potentially complementary gene content through a comparative pan-reactome analysis. Briefly, a network analysis is performed using a minimum reaction frequency (MRF) threshold to generate a species-level draft model. Simultaneously, reaction weights, used in the gapfilling step, are defined depending on respective reaction frequencies (further details in the Methods section). The method is applicable to any set of prokaryotic genomes, whether they include isolates or MAGs only. However, its primary relevance is conceived for situations where standard genome-centric metagenomics struggles to accurately recover high-quality MAGs for any species of interest. The computational cost of pan-Draft is limited to a few minutes, making it easily applicable to pre-existing GEM collections (Additional file 1: Fig. S1). Nevertheless, the reconstruction of large GEM catalogs remains a demanding process, which can limit the use of this method on personal computers for sequence collections including hundreds of MAGs (Additional file 1: Table S1).
In order to identify the target studies that can benefit from species-level model integration methods and to validate the approach, two among the largest available MAG datasets deriving from highly different environments were used. The first is the Unified Human Gastrointestinal Genome catalog (UHGG v.2.0.1), which collects 289,232 genomes clustered into 4744 SGBs. The UHGG dataset represents a host-associated microbial domain populating either a strictly anaerobic or microaerobic environment [25]. The second dataset, the Ocean Microbiomics Database (OMD, v 1.1), collects 34,815 genomes clustered into 8308 SGBs [26]. This collection, in comparison to the former, includes a more complex and less explored microbiome spanning a wider spectrum of environmental conditions and ecological heterogeneity. Despite this variability, the majority of its metagenomes (approximately 58%) were sampled from the epipelagic zone, reflecting microbiomes adapted to aerobic conditions. With the goal of exploiting the redundancy of multiple MAGs, we counted the number of SGBs with a minimum of 30 associated MAGs clustered at 95% ANI in either dataset, resulting in 450 and 135 SGBs, respectively. Notably, the majority of these SGBs lack an isolated representative (see columns 5–6 in Table 1), highlighting both the gap existing between cultured and uncultured microorganisms and the exploitable genomic redundancy for uncultured taxa. Specifically, 375 and 126 SGBs had no isolated representative in the UHGG and the OMD catalogs (Table 1).
Table 1 Summary of the OMD and UHGG catalog characteristics. A minimum of one and ten isolate/s were used to filter SGBs with (w/) reference genome/s in the OMD and UHGG catalogs, respectively. In each cell is reported the number of SGBs and the corresponding number of associated genomes. The set of SGBs used in this work is highlighted by an asteriskThe SGBs having at least one available reference genome allowed us to validate the pan-GEM reconstruction method. To do this, the databases underwent filtering to select non-redundant genomes on the basis of a 99.9% ANI and to exclude those originating simultaneously from the same species and sample [25]. Afterwards, SGBs with a minimum of 30 MAGs, at least one reference genome in the OMD, and at least ten reference genomes in the UHGG (Fig. 1) were further selected. The filtering resulted in 75 SGBs including 62,034 MAGs and 4311 reference genomes for the UHGG database, and nine SGBs having 472 MAGs and 16 references for the OMD catalog (Additional file 2: Table S1 and Table S2). Both datasets included only MAGs with a completeness level above 50% and contamination below 5%.
Fig. 1Flowchart showing the dataset filtering steps and the pan-Draft workflow. Set of filtering criteria implemented on the UHGG and OMD dataset to select SGBs with non-redundant genomic information for pan-Draft validation. A minimum threshold of 30 MAGs was imposed to reconstruct a set of draft models through gapseq. pan-Draft is integrated within the reconstruction pipeline by processing the draft models and the genetic evidence of reactions. The information used for each genome of a SGB to generate the pan-GEM is highlighted by the red box
Role of genome completeness in model reconstruction: the foundation for a “pan” perspectiveTo define a baseline for pan-GEM reconstruction, we first analyzed the relationship between MAG completeness and draft GEM quality in the two datasets. The selected MAGs covered a wide spectrum of completeness in both environments, ranging from 50 to 100%. The maximal completeness of MAGs for individual SGBs lacking cultured representatives also showed significant variability. In several cases, these MAGs did not reach 90% of completeness (Fig. 2a, b). Specifically, OMD had 59 SGBs (0.7% of the total) with maximal MAG completeness levels below 90%, whereas UHGG had 28 (0.6% of the total—Table 1). Traditional de novo draft GEM reconstruction is expected to yield highly gapped models in these cases. To assess the extent of these gaps, we computed the correlation between the MAG completeness level and the quality of the reconstructed GEMs. The latter was quantified by comparing the presence and absence of reactions shared between GEMs reconstructed from individual MAGs (MAG-GEMs) and GEMs of reference genomes (iso-GEMs). This similarity was expressed as the Matthews correlation coefficient (MCC). By fitting a generalized additive model to the data (Fig. 2c, d) a strong and statistically significant positive correlation emerged (Spearman correlation, rho = 0.78, P-value < 2.2e − 16) between the genome completeness and the MCC, confirming the need of a method to overcome current completeness limits. Interestingly, significant variability in this relationship was also identified across bacterial phyla (Fig. 2c, Additional File 1: Fig. S2).
Fig. 2Potential improvement of the species-level metabolic model reconstruction approach. Completeness level of the representative MAGs for SGBs with no reference genome and at least 30 MAGs in the corresponding database (a, b). On the lower panels, a generalized additive model was fitted to the similarity measure of the gapfilled models reconstructed for 84 SGBs with varying completeness level compared with the species-representative reactome gold standard (c, d). Data regarding the UHGG and the OMD databases are presented in panels a, c and b, d, respectively
Selection of species pan-reactomeGenerating species-level metabolic models could require combining draft GEMs reconstructed from contaminated MAGs. Consequently, a strategy is needed to exclude exogenous reactions from the pool of endogenous SGB-specific enzymes. Moreover, correctly predicted reactions (i.e., those actually belonging to a species pan-reactome) that are present only within small sub-populations should be preferentially filtered out to generate a draft representative of a species metabolism. To achieve this, we defined a MRF threshold necessary to discriminate whether a reaction should be directly included in the draft model or not and empirically determined its optimal value. Specifically, the draft pan-GEM for each species was reconstructed by selecting all available MAGs and varying the MRF parameter from 0 to 100%. The pan-GEM quality was once again quantified by comparing the model reaction set with the species reactome gold standard.
Results highlighted a certain degree of variability in optimal MRF thresholds between the UHGG and OMD datasets and across SGBs (Fig. 3a, b). Nevertheless, in both cases, a threshold above 20% tends to underestimate the size of the SGB reactome, as indicated by the steep decrease in MCC for MRF thresholds between 20 and 100% in all the species. Below that value, the MCC curves flatten, arguably due to the simultaneous effect of contaminant reactions and reactions associated with accessory metabolism present at different frequencies in the species sub-populations. As a trade-off, the optimal MRF value was thus obtained by averaging the MCC across all the species in any of the two datasets and selecting the threshold maximizing this metric. The optimal MRF threshold was thus equal to 6% for both the UHGG and the OMD catalogs (Fig. 3a, b). Overall, these results suggest that a threshold within the range of 5–10% allows the reconstruction of solid pan-GEMs for SGBs across different and distant taxonomic groups.
Fig. 3Minimum reaction frequency threshold definition for draft pan-GEM reconstruction. Quality of the models generated at the species level with a selected MRF ranging from 0 to 100%. The gray and dark vertical dashed lines highlight the MRF maximizing the MCC for each SGB and across all the species, respectively. The black solid line depicts the average MCC across species in the UHGG (a) and the OMD (b) dataset. Boxplots show the relationship between the optimal MRF and the number of MAGs used to reconstruct the draft pan-GEM (c), as well as the effect of varying MRF on the quality of the pan-GEMs (d). Both boxplots show the distribution across all analyzed SGBs. The results are the median (c) and average (d) of 10 iterations per SGB with a random selection of the initial MAGs. The colors in c show the median number of MAGs in each SGB corresponding to the selected MRF threshold, computed as the number of MAGs in the subset times the MRF
Further validation of the identified optimal threshold was performed by testing MRFs ranging from 0 to 100% on MAG subsamples of variable size. Specifically, MAGs were randomly sampled to form batches of n = 2, 5, 10, 15, 30, 60, 100, and 1000 genomes, iterating ten times for each SGB, and the obtained MCCs were averaged over the iterations for a robust characterization. The results confirmed that the optimal MRF lies between 5 and 10% for batches including tens to one thousand MAGs and suggested that it roughly approaches 1/n with a decreasing batch size below 15 MAGs (Fig. 3c). This was in part expected given that, with few MAGs, unique differences between genomes become proportionally more frequent and, at the same time, contaminant and accessory reactions get less distinguishable based on their frequency. With n MAGs, these reactions typically have a frequency of 1/n and the mean optimal MRF lies just below this value, essentially including all the reactions in at least one MAG-GEM (Fig. 3c). As a result, most pan-GEMs reconstructed from fewer than 10 MAGs incorporate the union of all reactions in the associated MAG-GEMs. In contrast, when the number of available MAGs is 15 or above, a low MRF threshold is essential to filter out (rarer) contaminant reactions and retain accessory ones.
Such a transition can be further appreciated when observing the effect of five different MRF thresholds on the quality of pan-GEMs reconstructed with an increasing number of MAGs (Fig. 3d). The results clearly show a drastic model improvement using few MAGs (2–5) as compared to a single MAG, with the MCC nearing its maximum at 30 MAGs and showing marginal improvements using larger batches. In the extreme case of employing a 0% threshold—i.e., including all the reactions present in any available MAG—the performance is the best achievable one at low MAG numbers, whereas it yields a MCC drop of 0.02 ± 0.02 using 30 MAGs and 0.17 ± 0.06 using 1000 MAGs. Thus, while this permissive approach can be safely used when having genome sequences from isolated microorganisms [34, 36], it would lead to a false positive increase of 65.78 ± 41.11 and 424.73 ± 147.63 reactions with 30 and 1000 MAGs, respectively.
In the following analyses, a 6% MRF threshold was set as standard to reconstruct the species-level GEMs. Based on the hypothesis that the contaminating genetic material present in a MAG is not redundant across the species, this threshold provides a high level of confidence to exclude contaminating reactions when dealing with lowly contaminated (estimated level < 5%) MAGs.
Properties of species-level metabolic modelsHaving established an optimal pan-reactome selection criterion, we set out to characterize the generated pan-GEMs. On average, they possessed 3.1 ± 2.1% more metabolites and 4.0 ± 2.7% more reactions than their counterparts obtained from reference genomes (Fig. 4a–b). Thus, as also visible in Fig. 3, they largely reflected the corresponding iso-GEMs in terms of metabolic network components. In contrast, MAG-based GEMs of origin contained up to 64% fewer metabolites and 75% fewer reactions.
Fig. 4Structural and metabolic features of pan-GEMs. The upper boxplots summarize the ratios of the number of metabolites (a), the number of reactions (b), and the number of reactions with associated genetic information (c) in MAG-GEMs and pan-GEMs, relative to the corresponding median values of iso-GEMs. Red dots indicate the statistics for the pan-GEMs reconstructed with all the MAGs of an SGB, either in the UHGG (red boxes) or OMD dataset (light blue boxes). The lower boxplot shows the functional classification of reactions grouped into pan-reactome categories by MetaCyc pathways ontology (d). Core, shell, and cloud consist of all reactions with a frequency higher than 95%, between 5 and 95%, and below 5%, respectively, in all the GEMs of any SGB
The most represented functional super-groups of metabolic reactions pertained to biosynthesis and degradation (Fig. 4d). In particular, core metabolism had numerous reactions for the biosynthesis of lipids, nucleotides, secondary metabolites, and cofactors and for the degradation of carbohydrates (Additional file 1: Fig. S3). Both amino acid biosynthesis and degradation were highly represented pathways. Most of these biomolecules are involved in primary metabolism and have been previously linked with core genome functions by pan-genome analysis [37]. Shell and cloud reactions generally followed the same trends of core reactions, with some exceptions where cloud reactions were over-represented as compared to the core: biosynthesis of storage compounds and polyamines and degradation of fatty acids and lipids, aromatic compounds, and amino acids.
Species-level metabolic models show improved network structureTo assess the extent of improvement in the quality of a pan-GEM compared to the individual GEMs of the same species, a second set of model reconstructions was generated. MAGs were divided into five sets based on equally spaced intervals over their completeness range and, for each SGB, 30 MAGs were randomly sampled for 10 times within every available completeness interval. Those MAG sets were then used for MAG-GEM and pan-GEM reconstruction while recording model quality statistics for each set (Fig. 5). In this way, multiple pan-GEMs were generated for every SGB, each time simulating the discovery of a different MAG set in a predefined completeness interval. As a result, the variability in model quality was found to be highly dependent on genome completeness levels. More specifically, the variability of the MCC calculated for GEMs obtained considering all the species under investigation tends to decrease as completeness increases (Fig. 5a). This was partially expected since pathway inference and gapfilling are susceptible to the specific combination of genes identified in an individual MAG. Conversely, the quality of pan-GEMs generated iteratively using 30 random MAGs associated with the same SGB of the same cluster of completeness had a strongly reduced variability and an enhanced structure. The MCC had a steadily increasing value, with the maximum associated with the MAGs belonging to the best cluster (90–100% completeness). Most notably, the highest improvement during pan-GEM generation was obtained with highly incomplete genomes. Therefore, the improvement rate was inversely dependent on the completeness of the MAGs utilized. Interestingly, the quality of pan-GEMs was comparable with that of the corresponding iso-GEMs, even when starting from low-quality MAGs with completeness levels between 50 and 60% (Fig. 5b). This observation held true for 74 SGBs, but not for Escherichia coli, which was processed independently because the average quality of its iso-GEMs was predicted to be lower compared to the iso-GEMs of the other species (see Additional file 1: 2.4 Linking Escherichia coli genomic variability to model quality: insights from pan-reactome content [38, 39]). Specifically, the pan-GEMs showed a slight overestimation of the reaction content of individual iso-GEMs. This overestimation was attributed to the species’ wide genetic variability and the large number of available isolates (Additional file 1: Fig. S4). Despite this difference, the result for all species clearly highlights the potential of the proposed approach to generate highly accurate species-level GEMs from fragmented MAGs with a low completeness level (Fig. 5b, Additional file 1: Fig. S5).
Fig. 5Structural quality improvement obtained for the pan-GEMs in comparison to the MAG models. Graphs showing the changes in quality between the pan-GEM and MAGs’ models in relation to the reference reaction set. Global results are reported for all SGBs except E. coli (a) and for single species (b). The heatmap reports explicitly the average difference between the MCC of each species pan-GEM and of the GEM from its most complete MAG. Species-level model improvements were analyzed based on MAG completeness. Completeness level values highlight the thresholds used to subset the MAGs into five groups. Only SGB with more than 30 MAGs in each subset were depicted. All statistics are calculated on gapfilled models
At a finer scale, similar patterns can be observed across the 75 SGBs having at least 30 MAGs per completeness interval (Fig. 5b, Additional file 1: Fig. S6). The results showed that the vast majority of pan-GEMs exhibited larger model quality improvements at lower MAG completeness, thus suggesting that pan-Draft is suitable across taxonomic groups. Certain species, such as those belonging to the genus Ruminococcus, Streptococcus, and Faecalibacillus, demonstrated a pronounced progressive gradient of improvement rate across the various completeness levels. Moreover, all the species showed positive improvement for highly complete MAGs, which despite being minor are worth considering. Overall, these results highlight the improvement brought by pan-GEMs as compared to the models obtained from the most complete MAGs, i.e., those that would be most likely selected in a typical metagenome modeling study.
Species-level metabolic models show higher performance in predicting fermentation by-productsNext, we sought to test whether the obtained pan-GEMs could more accurately reproduce fermentation capabilities of gut microbial species compared to traditional MAG- and iso-GEMs. The fermentative potential for a given compound was defined as the capacity of a species to secrete that compound over a pre-defined flux threshold normalized over its growth rate (details in the “Methods” section). To assess this, a comparative analysis of the predictive accuracy of in silico simulations was conducted for eight common anaerobic fermentation products using metabolic model reconstructions for 41 different species of the UHGG catalog (Fig. 6). Validation was performed on selected organisms having: (i) at least a set of pan-GEMs among the 75 SGBs previously reconstructed, (ii) fermentation products experimentally described in literature or in the NJC19 dataset (Additional File 2: Table S3) [40]. The selected metabolites include common anaerobic fermentation end-products, such as acetate, butyrate, lactate, and ethanol. The evaluation involved quantifying the percentage of correctly and incorrectly estimated end-products based on the total number of predictions performed for all the models of a species (Additional File 2: Table S5). Specifically, the confusion matrix per SGB was determined by summing up all the predictions belonging to the same category (e.g. true positive (TP), true negative (TN), false positive (FP), or false negative (FN)). Afterwards, the average percentage of TP and TN metabolites across species was estimated through the normalization of the confusion matrix of each SGB by the total number of tests in that group (i.e., = num. GEMs * num. tested compounds; Additional file 2: Table S6). Also in this case, to investigate the influence of the MAG completeness level on the prediction performance, the species-level models were divided into five sets with increasing source MAG quality. According to the minimize-total-flux (MTF) analysis and flux variability analysis (FVA), the pan-GEMs consistently outperformed the GEMs of individual MAGs with completeness levels between 50 and 90% (Fig. 6a, b). In comparison to high-quality MAG-GEMs (i.e. completeness levels > 90%), the number of correct positive predictions by pan-GEMs showed a small decline, but the values remained comparable to those of iso-GEMs. A more detailed analysis revealed that models of highly incomplete MAGs (completeness level between 50 and 60%) exhibited a high rate of FP (Additional File 1: Fig. S7), whereas corresponding pan-GEMs tended to reduce the number of scattered FPand consolidate the correct predictions. Indeed, the variability of predictions is reduced because a high percentage of pan-GEMs converge to the same by-product. Statistically significant differences in MCC were observed for all MTF and FVA simulations (paired t-test, P < 0.05—Additional File 2: Table S4). On the contrary, the differences were not significant for iso-GEMs compared to the other two groups of models sourced by genomes with completeness above 90%.
Fig. 6Improvement in fermentation product prediction by the pan-GEMs in comparison to MAG- and iso-GEMs. Results of the fermentation product test under anaerobic growth obtained with FVA (circles) and MTF flux balance analysis (triangular). Data summarized at the species level (gray symbols) show in percentage the number of TP (a), TN (b), FP (c), and FN (d) obtained across all genomes within a species. Similarly, the overall performance is summarized with the MCC (e). Colored dots and error bars show the median, 1st, and 3.rd quartile of all species within the subset obtained according to the MAG completeness level. The number of bacterial organisms tested for each completeness level subset is reported above the box group (n. sp.). Statistically significant differences in the MCC distribution, as determined by a paired t-test, are indicated as follows: * P < 0.05, ** P < 0.01
Finally, a comparison against GEMs from the MIGRENE collection was performed to determine whether pan-GEMs reached state-of-the-art performances. This test involved pan-GEMs from 36 SGBs with species names matching those in the MIGRENE collection and assessing the same fermentation products as before (Additional File 2: Table S3). pan-GEMs demonstrated superior performance, showing consistently more correct positive and negative predictions for both FVA and MTF analysis (Additional File 1: Fig. S8c). Specifically, pan-GEMs appeared to predict well the secretion of small molecules such as acetate, butyrate, and ethanol but showed low sensitivity for succinate, propionate, and lactate (Additional File 1: Fig. S8b). In contrast, MIGRENE’s models estimated more TPs for these latter metabolites but produced numerous erroneous estimations for hydrogen and butyrate (Additional File 1: Fig. S8a). Although pan-Draft showed lower sensitivity for a subset of by-products in FVA, the net MCC improvement was 0.20 and 0.18 for FVA and MTF, respectively. Overall, these results highlight the capacity of pan-Draft to enhance the prediction accuracy of the main fermentation products of representative bacteria of the human gut microbiome during anaerobic growth.
留言 (0)