Characterize molecular signatures and establish a prognostic signature of gastric cancer by integrating single-cell RNA sequencing and bulk RNA sequencing

3.1 Single-cell profile in gastric cancer

Analysis and correction of scRNA-seq data from gastric cancer (Fig. 1A) were performed and clinical information of there patients was summarized in supplemental file (Table.S1). 14 principal components were selected then for clustering and 16 cellclusters were obtained (Fig. 1B, D). Nine cell types were annotated in the end. Clusters 0 and 11 represented B cells. Clusters 1, 2, 3, 4, 5, and 13 were identified as T cells. Cluster 6 was identified as monocytes. Cluster 7 represented epithelial cells. Cluster 8 contained endothelial cells. Cluster 9 contained dendritic cells. Cluster 10 corresponded to macrophages. Cluster 14 consisted of NK cells. Smooth muscle cells were found in clusters 12 and 15 (Fig. 1E). Except for the T and B cell subgroups, the other cells displayed pronounced heterogeneity across various tumor samples (Fig. 1F).

Fig. 1figure 1

Single-cell transcriptome atlas of gastric cancer. A Cell Quality control and correction of samples (before: higher row; after: lower row). B Principal component analysis (PCA) and top 20 PCs are shown. C Number of cells in each cluster. D, E The UMAP(uniform manifold approximation and projection) plot: cells are classified into 16 clusters and annotated to 9 cell types in the end. F The percent of different types of cells in different samples

Through the utilization of the FindAllMarkers function, we identified a total of 1645 marker genes for nine cell subpopulations and marker genes for each cell type were shown in the plot (Fig. 2A, B). The ssGSEA algorithm was employed to compute ssGSEA scores for each cell type in samples from the TCGA STAD dataset. Furthermore, we compared ssGSEA scores for each cell type between tumor and normal samples. Notably, significant differences (P < 0.05) were detected in five distinct cell types: endothelial cells, monocytes, macrophages, NK cells, and smooth muscle cells. These five cell types were categorized as core cells (Fig. 2C). GO and KEGG functional enrichment analyses were performed on the marker genes associated with these five cell types (Fig. 2D, E). Notably, various functional pathways, such as the TNF signaling pathway, NF-kappa B signaling pathway, and ECM-receptor pathway, have been previously reported to be implicated in tumor progression and prognosis. We present corroborative evidence at the single-cell level in our research.

Fig. 2figure 2

Clarification of key cell types and functional enrichment analysis of core marker genes. A Heatmap of top 10 marker genes in each type of cells. B Differentially expressed top genes in different cells. C Comparison of ssGSEA (single sample GSEA analysis) enrichment score between tumor and control samples. D The GO terms of BP categories enrichment of core marker genes. E KEGG enrichment of core marker genes

Cell-Cell communication analysis of the previously identified nine cell types was carried out utilizing CellChat [7] (Fig. 3A). The findings revealed elevated frequency and strength of interactions between smooth muscle cells and other cell types, and endothelial cells display the next highest level of interaction. Pseudo-time analysis of the monocyte and macrophage cell subtypes was conducted by using the monocle2 algorithm (Fig. 3B). It demonstrated the existence of two differentiation branches, with monocytes undergoing a unidirectional differentiation process towards macrophages.

Fig. 3figure 3

Cell-Cell communication and cell development trajectory between different celltypes. A Number (left) and weights (right) of interactions among different cells. B Trajectory analysis in monocytes and macrophages, including pseudotime ordering of cells (left) and cell types (right)

3.2 Identification and functional enrichment analysis of STAD- related genes in bulk RNAseq data

Differential expression analysis was performed for all genes based on tumor and normal subgroup [FDR < 0.05; |log2FC|>0.58 (log2(1.5)], and 4516 differentially expressed genes were obtained, of which 2188 were up-regulated and 2328 down-regulated(Fig. 4A). The top 10 marker genes shown here, such as TFF1 and PGC, are correlated to the gastric mucosal barrier and cellular protection, and the expression of TFF1 was indicated to be reduced in some gastric cancer patients, particularly in some precancerous lesions, such as atypical hyperplasia(Fig. 4B). Other marker genes are also implicated in immune regulation and the tumor microenvironment.

Fig. 4figure 4

Differentially expressed candidate genes. A Differential analysis between tumor and control form TCGA-STAD bulk RNAseq samples, the top genes ranked by padj and logFC were labeled as top DEGs. The more red colors represent more significant difference. B The GC scRNAseq Umap (uniform manifold approximation and projection) plot of the marker genes associate with different cell types

Next, the sample groups were utilized as phenotype data for WGCNA analysis. Module detection was carried out via hierarchical clustering and dynamic tree-cutting functions (Fig. 5A). 25 modules were partitioned in total distinguished with different colors (Fig. 5B). 19 modules of all exhibited significant correlations with the STAD phenotype (P < 0.05), encompassing a total of 8186 genes (Fig. 5C). Particularly, the blue module showed the highest degree of correlation with the phenotype (correlation value = 0.8) (Fig. 5D).

Fig. 5figure 5

Screening STAD-related candidate genes through WGCNA. A Analysis of network topology for various soft-thresholding powers. B Clustering dendrogram of 25 co-expression gene modules with similar expression patterns. C Module-trait relationships of 25 gene modules. D The relationship between Module Membership (MM) in blue module and Gene Significance (GS) for STAD,.

We initially derived a set of 1203 marker genes specific to five distinct core cell types through the application of scRNA-seq analysis and identified 4516 genes displaying differential expression patterns using bulk RNA sequencing. Additionally, a comprehensive assessment of 8186 genes originating from 19 gene modules exhibiting significant associations with STAD was performed. The confluence of these three datasets resulted in the identification of a cohort of 226 potential candidate genes in the end (Fig. 6A). Furthermore, these candidate genes underwent scrutiny via protein-protein interaction (PPI) analysis (combined_score > 0.4) and resulted in a PPI network consisting of 198 nodes and 1188 edges (Fig. 6B). We note that key nodes such as COL1A1, COL1A2, JUN, MMP9, APOE, and FOS displayed strongest strength of interactions in the network, suggesting that these genes are key genes associated with other proteins in gastric cancer.

Fig. 6figure 6

Discovery of candidate genes and functional analysis. A Venn plot of screening final candidate genes. B Protein-protein interaction network of final candidate genes. The color gradient, ranging from lighter to darker, represents the node’s degree from lower to higher. C Enrichment analysis for final candidates and top 10 enriched pathways are shown

To further understand the biological functions of these differential genes in the gastric cancer, we did a comprehensive functional enrichment analysis of the aforementioned candidate genes by using the Metascape database. It revealed a significant enrichment of the 226 genes across 843 Gene Ontology (GO) bioprocess pathways, 72 KEGG pathways, and 107 Reactome pathways (Fig. 6C). Candidate genes are highly enriched in pathways including metabolism of substances, cellular interactions between intracellular and extracellular components and immune system. Scavenging by class A receptors showed the highest enrichment score.

3.3 Identification of marker differentially expressed genes associated with prognosis and establishment of the prognostic signature

In order to identify the characteristic DEGs that are associated with prognosis, we set up a training set to identify the candidate genes, a test set and an external validation set to evaluate and validate the model. In the TCGA training set, we executed a univariate Cox regression analysis by integrating the selected candidate genes with clinical features; 40 genes were identified that are significantly correlated to prognosis (Fig. 7A). We further conducted supplementary screening to discern nine genes that manifested a robust correlation with prognostic outcomes by using a LASSO 10-fold cross-validation approach ( Fig. 7B). The aforementioned nine genes were subjected to stepwise multivariable Cox regression analysis (Fig. 7C), resulting in the construction of a prognostic risk model comprised of four genes which highly expressed in gastric cancer and lead to poor prognosis (Fig. 7D). The mathematical formulation for computing the Riskscore is outlined as follows:

Fig. 7figure 7

Construction and validation of gene-related risk model. A Univariate Cox regression analysis of final candidate genes. B Further filtering through LASSO regression analysis. C Multivariate Cox analysis. D Expression level of SNCG, RNASE1, FKBP10, CDL4A1 between tumor group and normal group in TCGA cohort. EG Kaplan Meier plot and predictive survival rates of 1, 3, 5 years in training, test and validation cohort, respectively

Riskscore = COL4A1 * 0.318587101 + FKBP10 * 0.227780778 + RNASE1 * 0.207764548 + SNCG * 0.181031788.

TCGA samples were divided into high-risk and low-risk groups based on the median RiskScore. To avoid introducing biases into the results, we have provided the percentage of cases corresponding to each histological subtype in the baseline patient characteristics of the GSE62254 and TCGA-STAD cohort according to risk group in Supplementary Table S2. We used chi-square tests to compare the differences in Lauren’s histological subtype between samples from the high- and low-risk group and found no statistical difference. The application of Kaplan-Meier survival curves yielded evidence of a comparatively unfavorable prognosis for the high-risk group (p-value < 0.05) (Fig. 7E), and ROC curves [14] were subsequently employed to determine the area under the curve (AUC) values for 1, 3, and 5-year intervals and corresponding values of 0.679, 0.691, and 0.810, respectively. Upon subjecting the model to further validation and evaluation through the utilization of TCGA test data and an external validation dataset from the GEO database (Fig. 7F, G), consistent outcomes were obtained and AUC values for the 1, 3, and 5-year intervals within the TCGA test set were computed as 0.605, 0.601, and 0.633 respectively, while the corresponding values for the GEO external validation set were observed to be 0.611, 0.610, and 0.609.

Then we explored the correlation between risk scores and pertinent clinical characteristics (Wilcox test), encompassing Age, Gender, Stage, Grade, and tumor TNM stage (Fig. 8). The risk score is associated with patient tumor stage, grade, size, and extent of the primary tumor. Patients in different risk subgroups do not exhibit significant clinical characteristic differences, further highlighting that the risk score is an independent prognostic factor influencing clinical outcomes in patients with gastric cancer.

Fig. 8figure 8

Clinical relevance of gene-related risk score. A Comparison of age between high- and low- risk subgroups. BG Comparison of risk scores among different levels of clinical characteristics (subtitle of X axis) and proportion of clinical characteristics both in high- and low-risk subgroups

3.4 Construction of a nomogram based on the prognostic signature

Through univariate Cox analysis of the RiskScore risk assessment and clinical features (Fig. 9A), it was observed that specific clinical characteristics such as Age, Stage, AJCC_T, AJCC_N, AJCC_M, and Riskscore were significantly associated with prognosis (p-value < 0.05). These prognostic factors were further subjected to stepwise multivariable Cox regression (Fig. 9B), resulting in the construction of a clinical risk model consisting of Age, AJCC_N, AJCC_M, and RiskScore, and the nomoScore for each patient, and the points of each clinical characteristic as depicted in the column chart (Fig. 9C, Supplementary Table S3), with a model C-index of 0.67. The nomoScore were calculated for TCGA STAD tumor samples based on the nomogram, which allowed the division of all samples into high- and low-nomoScore groups using the median nomoScore as a cutoff. The Kaplan-Meier curves were used to compare the survival differences between the two groups of patients (p-value < 0.001) (Fig. 9D, E), revealing poorer prognosis in the high-risk group. The ROC curves were employed to calculate the AUC values at 1, 3, and 5 years, resulting in values of 0.661, 0.721, and 0.709, respectively. Subsequently, the clinical risk model was evaluated using calibration and decision curve analysis (DCA) curves (Fig. 9F, G) [16], indicating favorable predictive performance of the model.

Fig. 9figure 9

Construction and Evaluation of a Nomogram. A Univariate Cox regression analysis of riskScore and clinical characteristics. B Multivariate Cox regression analysis of riskScore and clinical characteristics. C Nomogram of clinical risk model. D Kaplan Meier plot of high- and low-nomoScore subgroups. E The ROC curve of 1, 3, 5 years of the Nomogram. F Calibration plots of the nomogram for predicting the probability of 1, 3, 5 years survival. G The DCA curves of the Nomogram compared for 1-, 3-, and 5-year overall survival in STAD, respectively

3.5 The difference of molecular function between the high- and low-risk groups

Through the application of the GSVA algorithm, we conducted a comparison of enrichment scores for the HALLMARK pathway gene sets in the high- and low-risk subgroups (Fig. 10A), elucidating noteworthy variations in the enrichment scores of specific pathways between these subgroups and shedding light on the latent molecular mechanisms contributing to prognostic discrepancies in different risk subgroups. The most pronounced enrichment discrepancies between different groups are observed in the pathways of Epithelial-mesenchymal transition and angiogenesis, underscoring the divergent characteristics of different risk subgroups in terms of tumor metastasis and invasion. Concurrently, GSEA enrichment analysis of KEGG pathways was performed on samples from the high- and low-risk subgroups (Fig. 10B, C), elucidating that 56 KEGG pathways were enriched in the high-risk group and 19 KEGG pathways were enriched in the low-risk group. Intriguingly, the ECM-receptor interaction pathway was enriched in the high-risk group. It manifests the invasion and metastasis of gastric cancer involve the interaction between the extracellular matrix (ECM) and cell membrane receptors and has been demonstrated to play a significant role in the infiltration of gastric cancer.

Fig. 10figure 10

Enrichment analysis between high- and low-risk subgroups. A Enrichment analysis with hallmark gene sets. B, C Top 5 enriched pathways in high- and low-risk subgroups, respectively

3.6 The Immune infiltration of M2-macrophages was different between high- and low-risk subgroups

Immune infiltration analysis was performed on each sample using different algorithms, and the differences in immune cell abundance and immune matrix scores between high- and low-risk subgroups were compared using the Wilcox. test (Fig. 11A). The CIBERSORT analysis revealed disparities in 6 types of immune cells (p-value < 0.05). CIBERSORT analysis (Fig. 11A) revealed significant differences in six immune cell types (p-value < 0.05). Intriguingly, the expression of monocytes and macrophages M2 varied across different risk groups, suggesting a potential mechanism of immune suppression and tumor promotion in high- risk subgroups. Through ssGSEA analysis (Fig. 11B), significant differences (p-value < 0.05) in 15 immune cell types were observed between high and low-risk groups. ESTIMATE analysis (Fig. 11C) demonstrated that ESTIMATEScore, StromalScore, and ImmuneScore all exhibited significant differences (p-value < 0.05) between high and low-risk groups.

Fig. 11figure 11

Correlation of RiskScorewith immune targets. A, B Abundance of different types of immune cells in subgroups with high and low RiskScore by using CIBERSORT (A) and ssGSEA (B). C Comparison of immune scores and stromal scores between high- and low-risk subgroups. D Expression of HLA family genes in different subgroups. E Expression of immune checkpoint genes in different subgroups

Extracting immune checkpoint gene and HLA family gene expression data from the subtype samples (Fig. 11D, E), we compared the differential gene expression between high and low- risk subgroups. Notably, we observed significant differences in expression levels of 20 immune checkpoint genes and 1 HLA family gene between the high and low- risk subgroups.

3.7 Somatic mutation analysis

Based on somatic mutation data from the TCGA STAD database, we conducted a comparison of the top 20 genes with the highest mutation frequencies between high and low-risk subgroups, followed by an analysis of the co-mutation correlations among these top 20 genes (Fig. 12A–D). Through the Wilcoxon sum-rank test, a significant difference in Tumor Mutational Burden (TMB) was identified between high and low-risk subgroups (Fig. 12E). Compared to the high-risk group, the low-risk group exhibited higher TMB and a more intricate interplay among mutated genes. TP53 mutations demonstrated more pronounced mutual exclusivity with other genes in both subgroups, with the most significant association with ARID1A mutations. Samples were categorized into high-TMB and low-TMB groups based on the median TMB, and clinical outcomes were analyzed through Kaplan-Meier curves between high and low-risksubgroups as well as between high- and low-TMB subgroups (Fig. 12F). The results indicated that TMB level was not significantly associated with patient prognosis under the conditions of RiskScore stratification.

Fig. 12figure 12

Genetic landscape in high- and low-risksubgroups. A Somatic mutations in high-risk subgroup. B Co-mutations in high-risk subgroup. C Somatic mutations in low-risk subgroup. D Co-mutations in low-risk subgroup. E Tumor mutational burden (TMB) in different subgroups. F Kaplan Meier plot of high- and low-risk subgroup together with different levels of TMB.

3.8 Drug sensitivity prediction and immunotherapy prediction

The sensitivity of each patient to chemotherapy drugs was estimated using the GDSC2 database and CTRP2 database (Fig. 13A). The Wilcoxon test was used to compare the differences in drug sensitivity between the high and low-risk subgroups. 139 of 198 drugs in the GDSC2 database exhibited significant differences in IC50 values between the high and low-risk subgroups. 320 of 545 drugs in the CTRP2 database showed significant differences in IC50 values between 2 subgroups. Sensitivity differences were observed for 5-Fluorouracil, Cisplatin, and Oxaliplatin, which are all part of the first-line systemic chemotherapy drugs for gastric cancer, indicating that the low-risk group is more sensitive to these first-line chemotherapy regimens.

Fig. 13figure 13

Therapeutic agents in high- and low-risk subgroups. A Drug Sensitivity in different subgroups, 5-Fluorouracil, Cisplatin and Oxaliplatin are shown in sequence. B Tumor immune dysfunction and exclusion (TIDE) score in different subgroups. C Clinical prediction of anti-PD1 and anti-CTLA4 in different subgroups

To comprehensively analyze the therapeutic effects of immune-targeted drugs in different subgroups, we employed the TIDE database to predict the degree of benefit that patients may obtain from ICI treatment (Fig. 13B). We found a significant increase in TIDE scores among patients in the high-risk subgroup compared to the low-risk subgroup, indicating that patients in the high-risk subgroup might have a lower likelihood of benefiting from ICI treatment. Then, we predicted the response of patients to anti-PD1 and anti-CTLA4 treatments by using the Submap algorithm (Fig. 13C), revealing that patients in the high-nomoScore group might experience a more significant benefit from anti-CTLA4 treatment (Bonferroni-corrected P = 0.014).

留言 (0)

沒有登入
gif