Metagenomic analysis of the interaction between the gut microbiota and colorectal cancer: a paired-sample study based on the GMrepo database

General characteristics of microbiome distribution

Before the matching procedure, a total of 709 metagenomic samples (318 patients with CRC and 391 healthy controls from 6 projects (PRJDB4176, PRJEB10878, PRJEB27928, PRJEB7774, PRJNA397219, and PRJNA447983)) were identified based on the study criteria. After matching, we included data from 86 patients with CRC and 86 healthy controls in the subsequent data analysis. The baseline information of the matched samples is presented in Table 1, Additional file 1: Table S1, and Additional file 2: Figure S1.

Table 1 Baseline characteristics of patients in the GMrepo database

A total of 484 microbial species were identified. Figure 1A and Additional file 1: Table S2 show the top 20 predominant microbial species in both the CRC and control groups. Among these species, Akkermansia muciniphila (A. muciniphila) (median relative abundance 4.93% vs. 1.79%, Wilcoxon signed-rank test, p = 0.014) was significantly enriched in the CRC group, whereas Faecalibacterium prausnitzii (F. prausnitzii) (6.15 vs. 8.57%, z = −2.741, p = 0.006) and Eubacterium rectale (E. rectale) (3.62% vs. 5.32%, z = −1.985, p = 0.047) were more abundant in the control group. Additional file 3: Figure S2 presents the microbiome distribution of species in different subgroups according to age, region, and sex.

Fig. 1figure 1

A The 100% stacked column chart of the relative abundance at the species level in CRC patients and healthy controls. The X-axis represents the CRC and control groups. The value of each species percentage in the Y-axis represents the mean value of relative abundance from each CRC and control cohort. The relative abundance represents the percentage of each species per sample. B Alpha diversity was evaluated by the Shannon index, Pielou evenness, Simpson index, and Equitability evenness. The solid lines indicate the faecal samples from CRC patients and their matched healthy controls. The difference in alpha diversity was calculated by the Wilcoxon signed‒rank test. C PCoAs of Bray‒Curtis distances on species composition, calculated between CRC and healthy controls. Each dot represents a patient with CRC and controls. Points clustered in light blue and pink eclipses represent the gut microbial composition of the CRC and controls, respectively. The boxplots around the PCoA plot represent the Bray‒Curtis distances of Axis1 (the top boxplot) and Axis2 (the right-sided boxplot). Differences in Bray‒Curtis distances of both Axis1 and Axis2 were calculated between the CRC and controls using the Wilcoxon Mann‒Whitney test, and p < 0.05 was considered statistically significant. D Visualization of dispersion differences in microbial composition between the CRC and control groups. Points in light blue and pink represent the gut microbial composition of the CRC and controls, respectively. The red and dark blue points indicate the centroids of all microbial species in each CRC and control group, respectively. The dashed line represents the spatial distance between the centroids and each sample. PERMANOVA was performed, and p < 0.05 was considered statistically significant. CRC: colorectal cancer; PCoAs: principal coordinate analyses

The alpha diversity of the gut microbiota in the CRC groups was not different from that in the control groups (Fig. 1B). For the beta diversity, there was no difference in Bray‒Curtis distances of both Axis1 (p = 0.690) and Axis2 (p = 0.450) in CRC versus controls during PCoA, indicating a similar species composition between the two groups (Fig. 1C). The PERMANOVA test, however, reached a significant difference (p = 0.031) between the CRC and control groups (Fig. 1D), possibly due to the dispersion differences between the CRC and control groups, but such a difference is not visually significant and might not be clinically relevant.

The geographic differences in the microbiome distribution is presented in Fig. 2. The distribution of the microbiome varied in different countries (Fig. 2A). At the phylum level, we noticed a consistent trend of the increased abundance of Verrucomicrobia and Euryarchaeota in the CRC groups in all six countries (Fig. 2B). In terms of certain species, F. prausnitzii had decreased abundance, while A. muciniphila had increased abundance in the CRC samples compared to that in the control samples in all subgroup regions (Fig. 2C, Additional file 1: Table S3).

Fig. 2figure 2

A Geographic distribution of samples from six different countries. Each pie chart represents the composition of the top 10 most abundant species identified in all samples distributed in CRC patients from each country. B The 100% stacked column chart of the relative abundance at the phylum level in CRC patients and healthy controls. The X-axis represents the subgroups of countries. The value of each phylum percentage in the Y-axis represents the mean relative abundance from each CRC and control cohort. The relative abundance represents the percentage of each species made of the organism per sample. C Heatmap visualization of the mean intestinal microbiota abundance in CRC patients and healthy controls based on region differences. Each column represents one subgroup based on the status of the disease in different countries. Each row represents one species ranking in the top 20 of the average relative abundance per subgroup. Values of the average relative abundance were normalized by the Z scores method. The colour scale was set based on the specific value of the average relative abundance after the Z score transformation, with red for relatively high abundance (Z scores > 0) and blue for low abundance (Z scores < 0). The greater the weights of the absolute Z score-transformed abundance values, the deeper the colour of the squares. The original values of average relative abundance per subgroup (before Z score transformation) are shown on the right of the colour scale (0, 0.2, 0.4…,1.2)

Within all metagenomic samples, 87 species exclusively existed in the CRC group, and 30 were in the control groups (Fig. 3A, Additional file 1: Table S4–S5). Among these 87 species, Collinsella tanakaei (C. tanakaei) and Clostridium hylemonae (C. hylemonae) were found in the CRC samples in five countries (Fig. 3B). The exclusive species were ranked by their accumulated abundance in all samples and are listed in Fig. 3C, D. To further identify the distribution characteristics of C. tanakaei and C. hylemonae, we correlated these two species with age and BMI. We found that C. tanakaei had a significant positive correlation with age (Spearman Rho = 0.331, p < 0.05) (Fig. 3E). The distributions of C. tanakaei from different countries are further presented in Fig. 3F. All CRC individuals with C. tanakaei tended to be elderly. No significant correlations were detected between C. tanakaei and BMI (Spearman Rho = −0.223, p = 0.985), C. hylemonae and age (Spearman Rho = 0.121, p = 0.475), or C. hylemonae and BMI (Spearman Rho = -0.120, p = 0.807).

Fig. 3figure 3

A Venn diagrams illustrating the number of species in CRC patients (pink) and healthy controls (light green). B UpSet plot of differentially distributed taxa. The left graph represents the total number of differently distributed species (X-axis) in different countries (Y-axis). The right graph represents the intersection of sets of species in multiple countries. Each column corresponds to a country or set of countries (dots connected by lines below the X-axis) containing the same species. The number of species in each set appears above the column, while countries shared are indicated in the graphic below the column. C Histograms of species exclusively present in healthy individuals. The X-axis represents species that are exclusively shared in healthy individuals (with an accumulated relative abundance of more than 0.5% of each sample in healthy groups). The Y-axis represents a percentage of accumulated relative abundance, which is a measure of the proportions of the microbiota composed of the organism in the healthy control group. D Histograms of species exclusively present in CRC individuals. The X-axis represents species that are exclusively shared in the CRC groups with an accumulated relative abundance of more than 0.5% of each sample in the CRC groups. The Y-axis represents the percentage of relative abundance in the CRC and control groups. E Scatter plots of C. tanakaei and age. The Spearman rank correlation test, p-value < 0.05 was considered statistically significant). F The age distributions of individuals with expression of C. tanakaei in the feceal samples

Identification of microbial markers for CRC diagnosis

Using the Wilcoxon signed-rank test, we identified 44 species with significantly different abundances between CRC patients and healthy patients with log2Fold-change greater than 1.0, including 5 significantly enriched species and 39 significantly depleted species in CRC versus control patients (Fig. 4A, Additional file 1: Table S6). To further determine whether the microbial species could be used as identification biomarkers for distinguishing CRC samples, we established random forest models to classify CRC patients and healthy controls. Internal cross-validation was performed. The training and validation cohort was a 7:3 split of the original data. We calculated different AUC indexes by integrating different numbers of taxa with the highest model-building importance and lowest inner subcategory bias (Additional file 4: Figures S3: A-C). The top 50 microbial species are listed in Additional file 1: Table S7. The random forest models were then established by integrating different numbers of variables from the top 11 to 50 significant microbes. The AUC, sensitivity, and specificity of each model were determined and are presented in Additional file 4: Figure S3D. The median value of the AUC was 0.812 in the training cohort and 0.790 in the validation set. The model containing the top 30 microbial species showed the highest AUC (0.887) values in the training cohort, but its performance did not demonstrate superiority in the validation cohort (AUC = 0.788). Visually, the model performance tended to be more stable by integrating the top 30–50 microbial biomarkers than the top 11–29 biomarkers. Thus, we further compared the median AUC, sensitivity, and specificity of the above two model categories (Additional file 1: Tables S8-S9). The median AUC of models containing the top 30–50 species was significantly higher in the training (0.804 versus 0.784, Wilcoxon T, p < 0.001) and validation (0.804 versus 0.784, Wilcoxon T, p < 0.001) cohorts than that with the top 11–29 biomarkers. There was the same trend of a significantly higher median sensitivity in both the training (90.5% versus 76.9%, Wilcoxon T, p < 0.001) and validation cohorts (90.5% versus 85.7%, Wilcoxon T, p = 0.003). The specificity showed no significant difference.

Fig. 4figure 4

A Volcano plot. The log2-fold-change indicates the mean relative abundance for each species. Each dot represents one species. The grey dots represent species with no significant expression difference or species with |log2 Fold-Change|≤ 1.0 (CRC versus controls), the red dots represent depleted taxa in CRC groups compared with controls, and the light blue dots represent enriched taxa. The Wilcoxon-signed rank test was performed, and p < 0.05 was considered statistically significant. B Venn diagrams illustrating the number of species identified by the primary random forest model, including the top 50 microbial markers (pink) and the 44 species by the Wilcoxon signed rank test (light blue). C The paired scatter plot of the 14 commonly identified species between CRC and controls. D The area under the curve (AUC) of different models. CRC: colorectal cancer; Sens: sensitivity, Spec: specificity

Next, we identified the 14 differentiated microbial species using the Wilcoxon signed rank test and the random forest algorithm. (Fig. 4B–C) Then, we established a relatively simplified model by incorporating only these 14 markers. The simplified model showed no significant differences in ROC, sensitivity, and specificity with the primary models developed from the random forest models. (Fig. 4D).

Relationship of gut microbiota and age or BMI in CRC groups

To track the influence of age- or BMI-related differences in the gut microbiota in the CRC groups, we first applied Spearman’s correlation test. Dorea longicatena (D. longicatena) (Rho = 0.30, p = 0.001), Blautia obeum (B. obeum) (Rho = 0.38, p < 0.001), Adlercreutzia equolifaciens (A. equolifaciens) (Rho = 0.42, p < 0.001), and Eubacterium hallii (E. hallii) (Rho = 0.38, p < 0.001) were positively correlated with BMI changes in CRC patients, whereas the other 22 strains were negatively correlated with BMI (Fig. 5A, Additional file 1: Table S10). Regarding the relationship with age, increased enrichment of Alistipes obesi (Rho = 0.30, p = 0.001) and Turicibacter sanguinis (T. sanguinis) (Rho = 0.27, p = 0.005) was significantly correlated with increased age, whereas Bifidobacterium pseudocatenulatum (Rho = −0.28, p = 0.004) and Coprococcus comes Rho = −0.27, p = 0.005) were negatively correlated with age in the CRC cohort.

Fig. 5figure 5

A Heatmap reporting correlation coefficients (Rho) and p values for the correlation of species and age/BMI. The row represents certain species, and the column represents age and BMI. The bar on the right side shows the colour scale reflecting the Rho values. Positive correlations of a species with age or BMI (Rho > 0) are presented as red-scale squares, whereas negative correlations (Rho < 0) are presented as blue squares. The greater the weight of the absolute Rho value is, the deeper the colour bar. "*" represents the p value of the correlation. (The Spearman rank correlation test, *, **, *** stands for p value < 0.01, 0.005 and 0.001, respectively). B, C Boxplot of species abundance distributed in different sub-age groups (B) and sub-BMI groups (C). Boxplot displays the median of relative abundances (%) with their interquartile range. The upper and lower edges of the box represent the maximum and minimum relative abundance of each microbe, respectively. Relative abundance (%) means the percentage of a microbial species composed of the organism. The difference in the relative abundance of species was compared in both sub-age and sub-BMI groups using the Kruskal‒Wallis test. Pairwise comparisons within the subgroups were calculated using the Wilcoxon Mann‒Whitney U test. A p value < 0.05 was considered statistically significant. (*, **, *** for p values < 0.01, 0.005 and 0.001, respectively)

We further tracked the principal microbial strains in the CRC group associated with age and BMI. In the sub-age groups, the distributions of 19 species were significantly different among the three groups with p < 0.05 by the Kruskal‒Wallis test (Additional file 5: Figure S4A). Among these species, we found a trend of increased abundance in Prevotella stercorea (Kruskal_wallis Test, p· = 0.025), Fusobacterium necrophorum (p = 0.025), T. sanguinis (p = 0.012), and Propionibacterium freudenreichii (p = 0.019), which were associated with stepped growth of age in the subgroups (Fig. 5B). Bacteroides uniformis (B. uniformis) had an increased abundance in the three sub-age groups but did not reach statistical significance. To further identify microbes significantly associated with age, we utilized a random forest algorithm (Additional file 1: Figure S4B). B. uniformis was identified as one of the top 12 candidate microbes significantly associated with age (with a value of 18.0 for the increase in node purity).

Considering that the number of patients with obesity (BMI exceeding 30) was only two in the current CRC cohort, we slightly adjusted the cut-off value in BMI subgroups for better statistical comparisons. With stepped growth of BMI in the three subgroups, there was a trend of an increased abundance of D. longicatena (Kruskal‒Wallis test, p = 0.001), E. hallii (p = 0.001), A. equolifaciens (p = 0.006), B. obeum (p = 0.007), and Ruminococcus sp. (p = 0.007), while there was a trend of decreased abundance of Coprobacillus sp. (p = 0.002), Clostridium citroniae (p = 0.015), Bacteroides vulgatus (p = 0.015), and Bacteroides thetaiotaomicron (p = 0.002) (Fig. 5C, Additional file 6: Figure S5A). The top 12 important BMI-related species are presented in Fig. 4C, in which the abundance of Ruminococcus sp, D. longicatena, Anaerostipes hadrus, A. equolifaciens, and E. rectale tended to increase after BMI exceeded 30 kg/m2. (Additional file 6: Figure S5B).

Establishment of the microbial interaction network

Network analysis was performed based on the relative abundance of 484 taxa in both groups to explore the co-occurrence relationship among these microbial species (Fig. 6). The network was composed of 97 nodes with a total of 235 links. Regarding CRC-enriched specifics, Parvimonas micra (P. micra) showed the most frequent interaction with other species: Holdemanella biformis (positive relation, clustering coefficient (cc) 0.40), Desulfovibrio piger, (positive relation, cc 0.09), Burkholderiales bacterium (negative relation, cc 0.06), and Methanobrevibacter smithii (negative relation, cc 0.03). Peptostreptococcus stomatis was negatively related to Desulfovibrio desulfuricans (cc 0.67), which was also positively related to another CRC-enriched species, Ruminococcus callidus (cc 0.33).

Fig. 6figure 6

Co-occurrence network visualization of the interactions among different species. The lines connecting nodes (edges) represent a positive (pink) or negative (light blue) co-occurrence relationship. The width of the edges represents the strength of the correlation, and the size of the circle represents the degree of interaction with other species. The red colour represents significant CRC-enriched species. The yellow colour represents species with no significant association with CRC status

留言 (0)

沒有登入
gif