The crucial prognostic signaling pathways of pancreatic ductal adenocarcinoma were identified by single-cell and bulk RNA sequencing data

Identification of DEGs based on scRNA-seq and GEO datasets, and establishing prognostic model

The full-text analysis flow is shown in Fig. 1. To reveal the differences in gene expression profiles between tumor cells and normal ductal cells in PDAC, we downloaded single-cell transcriptome sequencing dataset from Genome Sequence Archive (GSA). A total of 24 PDAC (38 201 cells) specimens were included to construct gene-cell expression matrix. After cells filtering, normalization, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction, 19 original clusters were identified (Fig. 2A). According to signature genes of each cell type reported previously (Chen et al. 2021b; Elyada et al. 2019; Peng et al. 2019), these clusters were classified into ten known cell types, including type 1 ductal, type 2 ductal, acinar, endocrine, endothelial, fibroblast, stellate, macrophage, T and B cells (Fig. 2B). We subsequently identified cluster-specific marker genes by conducting differential gene expression analysis to characterize the identity of each cell cluster (Fig. 2C, Supplementary Material, Fig. S1A). It was observed that type 2 ductal cells exhibited significantly higher expression of reported poor prognosis PDAC markers, such as CEACAM1/5/640 and KRT19. In addition, we compared the composition of cells in 24 PDAC patients (Supplementary Material, Fig. S1B). The result showed that there were differences in the composition of tumor cells in different patients, while ductal cells and fibroblasts were the main cells, and lymphocytes accounted for a small proportion. To further integrate patient prognosis with single-cell data, we used the Scissor algorithm to identify the cell populations most relevant to prognosis at the cellular level (Sun et al. 2022). Based on the signs of the estimated regression coefficients, the cells with non-zero coefficients can be indicated as Scissor positive (Scissor+) cells and Scissor negative (Scissor-) cells, which are positively and negatively associated with the phenotype of interest, respectively, and differential genes between the two populations of cells are calculated using the “FindMarkers” algorithm (Fig. 2D). Four GEO datasets (GSE62165, GSE71989, GSE16515 and GSE91035) were normalized separately, and the differential genes between tumor and normal tissues were analyzed by “limma” R package, and the intersection of the differential genes of the four datasets was taken. Finally, 101 down-regulated genes and 234 up-regulated genes were obtained by intersection of the differential genes obtained in the scRNA-seq data and the differential genes in the GEO datasets (Fig. 2E). The PDAC cohort in TCGA was randomly assigned to the training set and the validation set according to 7:3. We performed Unicox regression analysis on 335 differentially expressed genes (DEGs) found in training set, found 145 DEGs significantly related to prognosis (P < 0.01). Then, Lasso regression (Supplementary Material, Fig. S1C–D) and multivariate cox regression models were used to further screen the prognostic genes. Ultimately, 12 genes, including TRIM29, S100A14, PLAUR, TAP2, ZWINT, MPZL2, SERPINB5, TWIST1, MMP14, PLAU, IL22RA1 and TPX2, were confirmed as independent prognostic DEGs, and the coefficients of their constructed risk score of the multivariate cox model are shown in the Fig. 2F. All 12 gene met the proportional hazards assumption using Schoenfeld residuals (Supplementary Material, Fig. S1E). The ROC curve analysis showed that the area under the curve of the model in the training set and validation set was 0.76 and 0.67, respectively (Supplementary Material, Fig. S1F). The coefficients of the individual prognostic genes in the model are shown in Fig. 2F, and the survival curve showed that the OS of patients in the high-risk group was poor in the training set and validation set (Fig. 2G). The calibration curves of the model in the validation dataset, and it showed good correlation between nomogram-predicted OS and actual OS, indicating the accuracy of the prognostic model (Fig. 2H).

Fig. 1figure 1

Graphical scheme describing the study design. Step1: The most and least correlated cell populations with prognostic phenotypes from 24 PDAC patients’ single-cell data were obtained by “Scissor” algorithm, and the DEGs between the two groups were obtained by the “FindMarkers” algorithm. Step2: The differential genes between PDAC and normal pancreas in four GEO datasets were analyzed, and then the common DEGs were identified by aggregate ranks. Step3: The key genes of the multivariate cox model were constructed based on TCGA data, and the internal and external data were used to diagnose the model. The correlation of modeled gene expression was analyzed to identify key genes. The model was used to divide the patients in different datasets into high and low risk groups, and the gene mutation landscape and immune infiltration were compared between the groups. The expression of modeling genes in cells was analyzed in the single-cell data to identify the pathways most relevant to prognosis. Step4: The expression of prognostic related genes in control cells and cells with different malignant degrees was detected, and the expression of pathway-related genes in different cells was verified

Fig. 2figure 2

Screening of differential genes and construction of prognostic model. (A, B) The t-SNE plot showing the original cluster (A) and named cell subpopulations (B). C Heatmap showing the expression level of known cell-type-specific markers to demonstrate the identity of each cluster. D The t-SNE visualization of the Scissor selected cells. The red and blue dots are cells associated with the prognosis of tumor. E Venn diagram of the intersection of differential genes from the four GEO data and the intersection of differential genes derived from the scRNA-seq data and those derived from the GEO data. F Bar graph of coefficients from the multivariate cox regression model. G Survival curves of high-risk and low-risk groups in the training and testing sets. H Calibration curves of cox regression model for predicting one-year and three-year survival rates

Validation of diagnostic model

Prognostic models were validated on internal and external data sets. Time-dependent receiver operating characteristic curve (ROC) was used to evaluate the accuracy of predicting 1-year, 3-year, and 5-year OSs. The area under curve (AUC) values were 0.844, 0.870 and 0.961 in the internal validation set (Fig. 3A). To further evaluate the accuracy and reliability of the prognostic model, we retrieved gene expression matrix and clinical follow-up data of PDAC from GSE62452 GSE57495, and ICGC (PACA_CA) as an external validation set. Their 1-/3-/5-year AUC values were 0.814/0.836/0.789, 0.789/0.751/0.783 and 0.901/0.736/0.679 (Fig. 3B-D). Subjects in the high-risk group had significantly shorter OS than those in the low-risk group based on four prognosis-related signatures in all external validation sets for GSE62452 (P = 0.0011), GSE57495 (P = 0.012), and PACA_CA (P = 0.02) (Fig. 3E-G). The risk plots and signature genes expression heatmaps were generated to show detailed survival outcomes of each patient in the external validation cohorts (Fig. 3H–I, Supplementary Material, Fig. S2A).

Fig. 3figure 3

Internal and external datasets were used to validate the prognostic model. AD 1-year, 3-year, and 5-year time-ROC curves evaluate the risk stratification ability and predictive ability of the constructed risk model in the TCGA validation data, GSE62452, GSE57495 and ICGC PADA_CA cohorts. EG The Kaplan–Meier survival curves show the clinical relevance of the PDAC signature on three independent datasets (GSE62452, GSE57495 and ICGC PADA_CA). Tick marks indicate censoring events. The statistical p-values were determined by the two-tailed log-rank sum test. HI Risk plots to illustrate the survival status of each sample and signature genes expression heatmaps in the GSE62452 and GSE57495 cohorts

Clinical relevance, nomogram and mutation landscape between high- and low-risk groups

We performed single factor and multi-factor Cox analyses to determine whether the risk score could be an independent prognostic factor for PDAC patients compared with other common clinicopathological parameters. We observed that the risk score could serve as an independent prognostic factor for these individuals (Fig. 4A). The Unicox results showed that risk score (HR: 4.486, 95% CI: 2.452–7.563, P < 0.0001) were significantly correlated to the OS of subjects. Multivariable Cox regression analysis of the model showed that except S100A14, TAP2, MPZL2 and TRIM29, other genes were significantly correlated with survival time (Fig. 4B-C). Next, we investigated the relationship between the risk score and clinicopathological features, suggesting that TNM stage were not significantly associated with the risk score (Supplementary Material, Fig. S2B). Furthermore, we established the easy-to-use and clinically adaptable prognostic nomogram. The subject with higher total points was associated with worse 1-year and 5-year OSs (Fig. 4D). Afterward, waterfall plots of the PDAC cohort as a whole and the high- and low-risk PDAC groups were generated to explore the detailed mutational profiles between the two groups (Supplementary Material, Fig. S2C-D). We found that KRAS, TP53, and CDKN2A were the most frequently mutated genes in the high- and low-risk groups, and the gene mutation frequency in the high-risk group was higher than that in the low-risk group. Fisher test found that KRAS and TP53 were significantly mutated genes between the two groups (Supplementary Material, Fig. 2E).

Fig. 4figure 4

Unicox and multicox analysis, and nomogram established. A The forest plots show the hazard ratios and 95% confidence intervals for the risk-score and additional clinical features according to the Unicox model. Squares represent the hazard ratios and the horizontal bars extend from the lower limits to the upper limits of the 95% confidence intervals of the estimates of the hazard ratios. The statistical p-values were determined by the two-tailed Wald test. B The forest plots show the hazard ratios and 95% confidence intervals for the risk-score and additional clinical features according to the multivariable Cox model. C The forest plots show the hazard ratios and 95% confidence intervals for the multivariable Cox model. D The prognosis nomogram was drawn to predict 1-year and 5-year OSs for PDAC. Red point on the diagram to illustrate patients through calculation of the nomogram survival probability and the probability of less than 1 year and 5 years

Differential gene expression analysis and chemotherapy drug sensitivity between high-risk group and low-risk group

The “limma” algorithm was employed to calculate the differential expression of genes between the high-risk and low-risk groups in the TCGA dataset. Additionally, the expression levels of genes associated with prognosis were examined (Fig. 5A-B). The results showed that not all modeling genes were differentially expressed between high- and low-risk groups. Furthermore, GO and KEGG enrichment analysis of 167 gene signatures was performed by Cluster Profiler packages in R project (Fig. 5C, Supplementary Material, Fig. S3A). PI3K-Akt signaling pathway, collagen-containing extracellular matrix, and calcium-dependent protein binding were significantly enriched. Chemotherapy drugs, such as Gemcitabine (Qi et al. 2023) and Docetaxel (Fan et al. 2020), have remained the mainstay for the treatment of PDAC. Poor prognosis has been associated with chemoresistance. Therefore, we further predicted the chemotherapy response of the two risk subgroups to common chemotherapy drugs. As shown in Fig. 5D, a significantly higher estimated IC50s for six chemotherapy drugs (Docetaxel, Gemcitabine, Paclitaxel, Doxorubicin, Sunitinib and Erlotinib) of high-risk group when compared with low-risk group, which indicate that low-risk patients can benefit from the chemotherapy agents. In addition, we also observed the sensitivity of patients to chemotherapy drugs in the GSE62452 cohort (Supplementary Material, Fig. S3B), and found that the low-risk group was more sensitive to Gemcitabine, Paclitaxel, Doxorubicin, and Sunitinib, while the high-risk group was more sensitive to Docetaxel and Erlotinib.

Fig. 5figure 5

Differential genes, related pathways and drug sensitivity between high and low risk groups were analyzed. A The volcano plot of differential gene expressions in high-risk versus low-risk. The two vertical dashed lines represent ± ln (2.17) fold-changes in gene expression, and the horizontal dashed line denotes FDR cutoff 0.05. The FDR was the adjusted p-value calculated by the two-tailed Wilcoxon rank-sum test. B Heatmap of high- and low-risk group differences in genetic analysis. C BP, CC and MF analysis of differentially expressed genes between high- and low-risk groups. D Drug sensitivity analysis of high- and low-risk groups in TCGA cohort. Estimated IC50 for Docetaxel, Gemcitabine, Paclitaxel, Doxorubicin, Sunitinib, and Erlotinib

The immune microenvironment and immune checkpoints of high- and low risk groups

We analyzed the differences of tumor microenvironment in the high- or low-risk groups. The tumor purity, immune score, stromal score and ESTIMATE score were calculated by ESTIMATE algorithm, which calculated based on single-sample gene set enrichment analysis (Fig. 6A). Except for stroma score, significant differences were found in immune score, tumor purity and ESTIMATE score between the high and low risk groups. Based on the results of previous studies on pan-cancer immune genes (Charoentong et al. 2017), ssGSEA was employed to analyze the abundance of different immune cells in both the high-risk and low-risk groups in TCGA (Fig. 6B) and the GEO cohort (Supplementary Material, Fig. S4A-B). The results revealed significant differences in the abundance of macrophages, monocytes, CD56+ natural killer cells, activated CD4 T cells, Th17 cells, and Th2 cells between the high-risk and low-risk groups within the TCGA cohort. Additionally, macrophages, monocytes, activated CD4 T cells, and Th2 cells exhibited significant differences in both GEO cohorts (GSE62452 and GSE57495). Furthermore, dendritic cells (both immature and activated) were identified in the analysis of both cohorts. Lastly, we also investigated the association between the signature genes and 28 tumor-infiltrating lymphocytes of TCGA dataset (Fig. 6C). Spearman’s correlation analysis revealed that PLAU, MMP14, TWIST1, and TAP2 was positively correlated with the expression level of most tumor-infiltrating lymphocytes, while TPX2, SERPINB5, MPZL2, and S100A14 was negatively correlated with monocytes, plasmacytoid dendritic cells, type 1 T helper cells, regulatory T cells, and eosinophils. Anti-cancer immune response can be conceptualized as a series of stepwise events referred to as the Cancer-Immunity Cycle (Xu et al. 2018). At the same time, the expression level of immune checkpoint related molecules is closely related to the outcome of immunotherapy. Therefore, we summarized 74 pan-cancer immune checkpoint genes and molecules related to tumor immune cycle found in previous studies (Park et al. 2021), and visualized the expression levels of these molecules in high- and low-risk groups of TCGA, GEO and ICGC cohorts (Fig. 6D-E, Supplementary Material, Fig. S4D-E). Compared with low-risk group, ARG2, NECTIN3, HAVCR1, CD274, CD160, NOS3, EDNRB, CCL2, and CXCL8 were significantly higher expression in high-risk group. As for the expression level of common immune checkpoint related genes of TCGA, GEO and ICGC cohorts, the results revealed that a higher risk score was significantly associated with up-regulation of CD274 (PD-L1), CD44, CD70, CD276, and TNFSF9 and down-regulation of CD160, ADORA2A, and CD200 in TCGA dataset.

Fig. 6figure 6

Immunity landscape and check point inhibitors genes expression pattern of high- and low-risk groups. A The ESTIMATE algorithm was used to calculate the immune score, ESTIMATE score, tumor purity score and stroma score between high- and low-risk groups. B Immune cells infiltration levels in the low- and high-risk groups estimated by ssGSEA. C Spearman analysis plots of correlation between signature gene expression and immune cells infiltration levels. D Heatmap of Cancer-Immunity Cycle-related genes and immune checkpoint gene expression patterns in the high- and low-risk groups. E Boxplot of representative immune-related genes expression between high- and low-risk groups

Taken together, the above results showed that there were differences in tumor purity and immunity level between the high- and low-risk groups, and there were significant differences in macrophages, monocytes and CD4+ activated T cells between the two groups. At the same time, the modeling genes were correlated with the expression levels of immune cells, suggesting that the survival difference between the high- and low-risk groups may be induced by the difference in the immune level of myeloid cells within tumor.

Expression analysis, PPI network identification and cytological experiment verification of prognostic related molecules

In order to further clarify the expression of the 12 prognostic genes, we compared the expression of genes in TCGA and GEO datasets (GSE62452, GSE57495), and found that most genes were significantly different between the two groups (Fig. 7A-B, Supplementary Material, Fig. S5A). The distribution of prognostic genes in cells was further verified in single-cell data, and it was found that these genes were mainly highly expressed in ductal type 2 cells, fibroblasts, acinar cells and macrophages (Fig. 7C). The list of 12 signature genes was uploaded to STRING online database and then the PPI network was visualized by Cytoscape (Fig. 7D). The nine genes with the highest contribution degree of nodes in the network were selected as the hub genes of the whole pathway for protein interaction pathway analysis (Fig. 7E). As can be seen from the plot, EGFR, PLG, PLAU and PLAUR have the largest contribution in the protein–protein interaction network, which may be the key regulators in determining the prognosis of patients, which suggest that the differences in prognosis may be due to these kinds of cells. Subsequently, we determined the expression levels of prognostic related genes by real-time PCR in healthy control cells and pancreatic cancer cells with different degrees of malignance (Fig. 8A-L). The results showed that except for SERPINB5, S100A14, IL22RA1 and MPZL2, the expression levels of other genes were increased along with the increase of cell malignancy. The expression of 10 signature genes was also validated in the tumor samples and normal samples in HPA database (Supplementary Material, Fig. S5B). The result showed that except for TWIST1 and IL22RA1, other ten protein expression levels were significantly increased in PDAC tissues compared to their normal tissues.

Fig. 7figure 7

Expression analysis and PPI network identification. AB The boxplot shows the expression levels of signature genes in TCGA and GSE62452 cohorts. C Density plot of expression distribution of 12 prognostic related genes in single cell data. D According to the contribution degree to build the PPI network nodes. Each node represents a protein, while each edge represents the interaction between two proteins, and the greater the contribution degree, the greater the node. E Identification of hub genes network by cytoHubba app in Cytoscape

Fig. 8figure 8

Cytological experiment verification of prognostic related molecules. A–L Real-time PCR was used to detect IL22RA1 (A), TWIST1 (B), ZWINT (C), SERPINB5 (D), MMP14 (E), PLAU (F), PLAUR (G), MPZL2 (H), S100A14 (I), TPX2 (J), TAP2 (K), and TRIM29 (L) expression in HPDE6-C7 (Normal epithelial cells), PANC-1 (Metastatic type cell), CFPAC-1 (Metastatic type cell), Capan-1 (Carcinoma in situ) and MIA PaCa-2 (Carcinoma in situ) cells

Communication analysis of cell subsets

To further investigate the interactions between cell subsets that are associated with prognosis, cell communication of 10-cell subsets was analyzed by Cellchart (Fig. 9A–C). From Fig. 9A, it can be seen that the interaction of these 10 subsets changes in the number and intensity of ligand‒receptor interactions, and strong cell communication exists between fibroblasts, acinar, type 2 ductal cells, and epithelial cells. When examining individual signaling pathways or ligand-receptor mediated cell interactions, the collagen pathway was found to be highly interactive between fibroblasts and type 2 ductal cells, type 1 ductal cells and acinar cells (Fig. 9D, Supplementary Material, Fig. S6A). Further analysis of the ligand-receptor interactions between cells and their contributions to the collagen signaling pathway showed that macrophages had strong interactions with ductal cells and epithelial cells, and CD44 molecules and COL1A2 molecules contributed the most to the pathway (Fig. 9E, Supplementary Material, Fig. S6B).

Fig. 9figure 9

Cell subsets communicate with their receptors. A Network diagram of the number of cell-to-cell interactions. B Network diagram of the strength of cell-to-cell interactions. C Plot of the intensity of interactions between a single cell and other cells. D Heatmap of collagen pathway interaction strength between cells. The vertical axis is the cell that emits the signal, the horizontal axis is the cell that receives the signal, and the accumulation of the intensity of the vertical and horizontal axes is shown on the upper side. E Bubble plots of signal intensities of collagen pathway ligands and receptors between cell subsets

Cytological validation of cell interaction pathways

Since the collagen formation pathway was found to be closely related to fibroblasts in the single-cell analysis results, we further verified the correlation between the expression levels of prognostic related molecules and fibroblasts in the TIMER2.0 database. (Fig. 10A). The results showed that the expression of 6 genes were significantly positively correlated with the infiltration of fibroblasts in the tumor. We further examined the expression of crucial molecules in the collagen formation pathway in pancreatic cancer cell lines (Fig. 10B-I). The results showed that ITGA1, ITGB1, ITGB8 and SDC1 were highly expressed in tumor cells when compared with control group. These results suggested that the prognosis of patients with PDAC was likely to be related to the interaction between fibroblasts and ductal cells, and the interaction may be mainly mediated by collagen formation.

Fig. 10figure 10

Correlation between the level of fibroblast infiltration and the expression of prognostic molecules and cytologic verification of the expression levels of key molecules in the cellular collagen formation pathway. A 6 signature genes correlate with tumor purity and is significantly positively associated with fibroblast infiltrates using the TIMER2.0 database. B–I Bar plot to verify the expression levels of crucial molecules in the collagen formation pathway, including COL1A1, COL1A2, ITGA1, ITGA10, ITGB1, ITGB8, SCD1, SCD4, in normal pancreatic ductal cells and cells with different degrees of malignancy

留言 (0)

沒有登入
gif