Understanding the heterogeneity in liver hepatocellular carcinoma with a special focus on malignant cell through single-cell analysis

2.1 Acquisition and processing of transcriptome data

The transcriptome data for LIHC was obtained from the TCGA database, consisting of clinical data of 368 patients. This dataset was used as the training group to construct our model. All transcriptome data was transformed into Transcripts Per Million (TPM) format and then log2 transformed for subsequent analyses. Additionally, the LIRI-JP dataset from the ICGC database, consisting of 232 samples, was used as the verification group. The GSE14520 (T = 225, N = 220) and GSE45267 (T = 48, N = 39) LIHC chip data from the GEO database were included in the set of differential gene analysis. The normalizeBetweenArrays function in the limma package was employed to correct for any variations in the chip data. Furthermore, the Combat algorithm from the sva package was utilized for the correction of different batches of data.

2.2 Acquisition and processing of single cell data

The single-cell dataset used in this study was obtained from GSE149614 in the GEO database, which comprised 10 primary tumor samples. Data analysis was conducted using R software (version 4.1.3), with the Seurat package used for analysis. Only cells with a mitochondrial content of less than 20% under the condition of cytoplasmic control were included in the analysis. The UMI count and gene count were limited to a range of 200–50,000 and 200–8000, respectively. Data normalization, selection of highly variable genes (2000), and data transformation [to eliminate the effect of the cell cycle, utilizing the parameter vars.to.regression = c(‘S.Score’, ‘G2M.Score’)] was performed using the NormalizeData, FindVariableFeatures, and ScaleData functions from the Seurat package. To account for batch effects, the harmony algorithm was applied. Dimensionality reduction techniques, such as UMAP and tSNE, and the Louvain clustering algorithm, were employed using the corresponding functions provided by the Seurat package. Differential gene expression analysis was conducted using the FindAllMarkers function, with parameters set to p-value < 0.05, log2 fold change > 0.25, and an expression ratio greater than 0.1.

2.3 Cell annotation analysis

To characterize the cell types present in our dataset, we employed specific marker genes for epithelial cells (‘EPCAM’, ‘KRT18’, ‘KRT19’, ‘CDH1’), fibroblasts (‘DCN’, ‘THY1’, ‘COL1A1’, ‘COL1A2’), endothelial cells (‘PECAM1’, ‘CLDN5’, ‘FLT1’, ‘RAMP2’), T cells (‘CD3D’, ‘CD3E’, ‘CD3G’, ‘TRAC’), NK cells (‘NKG7’, ‘GNLY’, ‘NCAM1’, ‘KLRD1’), B cells (‘CD79A’, ‘IGHM’, ‘IGHG3’, ‘IGHA2’), and mast cells (‘KIT’, ‘MS4A2’, ‘GATA2’). Based on these markers, we performed separate clustering analyses for epithelial cells to explore tumor heterogeneity. Various charts, including UMAP, tSNE, histograms, and heat maps, were generated to visualize the results.

2.4 Subgroup and CNV analysis of epithelial cells

To further dissect the subgroups within the epithelial cell population, we conducted a separate analysis using the standard Seurat workflow. Subgroup classification was based on the copy number variation (CNV) score of each subgroup, with TSNE plots used for visualization. Additionally, to investigate CNVs within the tumor and epithelial cell subsets, we employed the InferCNV software. This analysis, using endothelial cells as a reference, aimed to identify malignant cells and evaluate the CNV scores of the different epithelial cell subgroups.

2.5 Pseudo-time analysis of epithelial cells

We employed the monocle2 software to perform pseudo-time analysis of the epithelial cell subsets. The dimension reduction algorithm used was DDRTree, and default parameters were applied for the remaining steps. This analysis aimed to uncover the differentiation processes of these cells.

2.6 Transcription factor analysis of epithelial cells

To investigate the involvement of transcription factors in the epithelial cell subsets, we utilized the SCENIC software. This analysis utilized the RcisTarget and GRNBoost motif databases with default parameters. The RcisTarget package was employed to identify overexpressed transcription factor binding motifs within the gene list. The AUCell software package was used to score the activity of each regulator group for each cell type.

2.7 Cell-to-cell communication analysis

To assess potential intercellular communication, we employed the CellChat package. The CellChat function was used to import the normalized gene expression matrix and create the CellChat object. Subsequently, the identifyOverExpressedGenes, identifyOverExpressedInteraction, and ProjectData functions were utilized with default parameters for data preprocessing. The computeCommunProb, filterCommunication, and computeCommunProbPathway functions were employed to identify potential ligand-receptor interactions. Finally, the aggregateNet function was utilized to generate the cell communication network.

2.8 Analysis of gene expression differences and enrichment analysis

The limma software was utilized to calculate the differential expression of genes between tumor and adjacent samples in the GEO and TCGA datasets. A significance threshold of p < 0.05 was applied to identify DEGs. The clusterProfiler package was then used to perform enrichment analysis on the up-regulated and down-regulated genes. The algorithm used was GSEA, and the functional databases utilized were Gene Ontology Biological Process (GOBP) and KEGG. The gene sets used to define functional signatures were sourced from the msigdb database. The results of the enrichment analysis were visualized using the enrichplot package.

2.9 Establishment of tumor-related risk characteristics

First, the intersection of the differentially expressed genes in the epithelial cells, GEO dataset, and TCGA dataset was obtained to identify a set of genes relevant to tumor biology. Single-factor Cox analysis was then utilized to extract genes from this set that had prognostic value. Further screening of these prognostic genes was performed using Lasso regression to establish a prognostic model. This model assigns a risk score to each patient, which is used to stratify them into high-risk and low-risk groups based on the median score in the TCGA cohort. We used the “timeROC” package to draw Receiver Operating Characteristic Curve(ROC) to verify the testing efficiency of the model. The area under the curve (AUC) greater than 0.6 was considered to have good testing efficiency.

2.10 Prediction of immunotherapy response

To predict the response to immunotherapy, datasets GSE35640 (melanoma), GSE91061 (melanoma), IMvigor210 (urothelial carcinoma, UC), and GSE126044 (non-small cell lung cancer, NSCLC) were collected. The risk model scores were calculated for each dataset, and these scores were used to predict the response to immunotherapy.

2.11 Tumor immune infiltration analysis and tumor immunity profiling (TIP) analysis

We utilized the IOBR package to assess the extent of immune cell infiltration in LIHC patients from the TCGA database. We employed six evaluation methods, namely CIBERSORT, quanTIseq, MCPcounter, xCell, EPIC, and Estimate, to determine the immune cell composition within the TME. These results were utilized to generate a heat map, quantifying the relative abundance of immune cell infiltration in the TME. Additionally, the Estimate algorithm provided insights into the relative proportions of stromal cells, immune cells, and tumor cells, enabling comparisons across different risk categories. TIP analysis was conducted using the online URL (http://biocc.hrbmu.edu.cn/TIP/).

2.12 Drug sensitivity and mutation analysis

We employed the ‘oncoPredict’ R package to calculate the IC50 of commonly used chemotherapeutic drugs. This analysis allowed us to evaluate the relationship between risk scores and drug sensitivity. The Wilcoxon rank sum test was employed to compare IC50 values between the two risk groups. Mutation data for LIHC were obtained from the TCGA GDC database, and analyzed using the maftools package. Visualization of the mutation data was performed using the oncoPrint function from the ComplexHeatmap package.

2.13 Enrichment analysis of single-sample gene sets

To assess the enrichment score of gene sets in individual samples, we employed single-sample Gene Set Enrichment Analysis (ssGSEA). In this study, ssGSEA analysis was utilized to determine the risk score for each patient.

2.14 Statistical analysis

All data processing, statistical analysis, and visualization were conducted using R 4.1.3 software. The correlation between continuous variables was assessed using the Pearson correlation coefficient. Categorical variables were compared using the chi-square test, while continuous variables were compared using the Wilcoxon rank sum test or t-test. Cox regression and Kaplan–Meier analysis were performed using the survival package. In this study, the standard of statistical significance p value was: p < 0.05 was considered to be statistically significant.

留言 (0)

沒有登入
gif