Construction of a 10-gene prognostic score model of predicting recurrence for laryngeal cancer

Data source

The mRNA sequencing data of head and neck samples (including 604 samples) were obtained from The Cancer Genome Atlas (TCGA) database (https://tcga-data.nci.nih.gov/docs/publications/tcga/?) based on the Illumina HiSeq 2000 RNA Sequencing platform. The positions of the 604 samples were in the alveolar ridges (n = 18), tongue roots (n = 30), buccal mucosa (n = 22), floor of the mouth (n = 67), hard palates (n = 8), hypopharynx (n = 9), larynx (n = 138), lips (n = 3), mouth (n = 38), tongue (n = 158), oropharynx (n = 10), and tonsils (n = 45). The rest of the samples were from uncertain tumor locations. Among the 138 throat samples, we screened 82 LC samples with recurrence and prognosis information (28 and 54 samples with and without recurrence, respectively) in our study.

Additionally, we searched for validation dataset using the keyword “larynx cancer” from the National Center for Biotechnology Information Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo/). The screening standards were as follows: (1) gene expression profile data, (2) the samples were from the tumor tissue specimen of patients, (3) human expression profile data, and (4) the samples with information of recurrent or non-recurrent prognosis. Two validation datasets were obtained. One was GSE27020 that composed of 109 LC tissue samples (34 and 75 samples with and without recurrence, respectively) based on the Affymetrix Human Genome U133A, the Array platform, and the other one was GSE25727 that included 56 LC tissue samples (17 and 39 samples with and without recurrence, respectively) based on the Illumina HumanRef-8 WG-DASLv3.0 platform.

Screening of differentially expressed genes

A meta-analysis on TCGA dataset, GSE27020 and GSE25727, was conducted using an ES function of MetaDE [8] (version: 1.0.5, https://cran.r-project.org/web/packages/MetaDE) in R3.4.1 to screen the differentially expressed genes (DEGs) [9]. Subsequently, we screened for DEGs [9] that showed consistent expression in these two datasets between samples with recurrence and those without recurrence by calculating the tau2, Q, and Qpval values (criterion for judgment; tau2 = 0 indicates that each research object is homogeneous and unbiased; the statistic Q obeys the Chi-square test with a degree of freedom of k-1, whereas Qpval  > 0.05 indicates that each research object is homogeneous and unbiased). Then, the false discovery rate (FDR) value was obtained using multiple test corrections. FDR < 0.05 showed that the difference was significant. Additionally, each dataset was calculated to express the fold change, after which several parameters were selected, and the threshold value was set. The set parameters were as follows: (1) To ensure that the source of each selected characteristic gene was homogeneous and unbiased (that the expression of each featured gene in each data set was consistent), Tau2 = 0 and Qpval  > 0.05 were selected as homogeneity test parameters. (2) FDR < 0.05 was selected as the significant threshold of expression difference between the genomes. (3) After screening with log2 FC, the genes having similar direction of difference (with the same symbol of log2 FC) were retained. After combining multiple screening parameters, we set the selection of threshold parameters:

I. We ensure that the source of each selected characteristic gene is homogeneous and unbiased, that is, the expression in each data set is consistent, so tau2 = 0 and Qpval > 0.05 are selected as homogeneity test parameters;

II. FDR < 0.05 was considered as the threshold of significant difference in expression between gene groups;

III. Combined with Log2FC for screening, we retained genes with the same direction of difference (consistent Log2FC symbols) in the three datasets.

The threshold was set to a false discovery rate < 0.05. Then, the Gene Ontology Biology Process (GO-BP) annotation analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were conducted for these DEGs with consistency.

Establishment and verification of a risk assessment model

On the basis of the DEGs, we conducted univariate Cox regression analysis in the survival package [10] (version 2.4, http://bioconductor.org/packages/survivalr/) to screen DEGs significantly related to the prognosis in TCGA dataset. The multivariate Cox regression analysis was then used to screen DEGs that can be used as independent prognostic factors. The log-rank P < 0.05 was also regarded as the threshold of significant correlation.

Furthermore, the Cox proportional hazard model [11] based on the L1-penalized (Lasso) in the penalized package (version 0.950; http://bioconductor.org/packages/penalized/) [12] of the R3.4.1 language was used to screen out the optimized prognostic-associated signature DEG combinations based on the aforementioned DEGs related to the prognosis [13]. Then, on the basis of the prognostic coefficient of prognosis-related DEGs, the prognostic score (PS) prediction model was established in the training dataset using the following formula:

where βDEGs refer to the prognostic coefficient of the optimized DEGs in the Lasso algorithm and ExpDEGs represent the expression of the corresponding DEGs in the training dataset.

With the median PS as the dividing point, the samples in TCGA training dataset were further categorized into high- and low-risk groups. After that, the Kaplan–Meier (KM) [14] survival curve in the R3.4.1 language survival package (version 2.41–1) [10] was used to measure the association between the risk model and prognosis. Simultaneously, we screened these optimized DEGs from the validation dataset (GSE25727 and GSE27020). Then, the PS score of each sample was obtained using the PS calculation method. The validation dataset samples were also separated into high- and low-risk sample groups in the same manner as in TCGA dataset samples. Thereafter, the KM curve method of the survival package (version 2.41–1) [10] in the R3.4.1 language was used to evaluate the relationship between the high- and low-risk groups, compared with the actual survival prognosis information from the validation dataset samples.

Screening of independent prognostic clinical factors for performance evaluation

Combining the clinical factors including recurrence, age, gender, pathologic (M, N, and T), pathologic stage, grade, alcohol history, angiolymphatic invasion, and perineural invasion in TCGA (Additional file 1: Table S1), we used univariate and multivariate Cox regression analysis methods in the R3.4.1 language survival package (version 2.41–1) [10] to screen the independent prognostic clinical factors. The threshold was set to log-rank P < 0.05. Next, to further explore the correlation between independent factors and prognosis, the nomogram with 3- and 5-year survival rates was constructed using the RMS software package (version 5.1.2; https://cran.r-project.org/web/packages/rms/index.html) in R3.4.1 [9, 15].

Next, the PS and risk models were compared using the area under the receiver-operating characteristic curve (AUROC) [14] and the concordance index (C-index). Additionally, the AUROC is a quantitative indicator of the receiver-operating characteristic (ROC) curve, which was calculated using the pROC in the R3.4.1 language (version 1.14.0, https://cran.r-project.org/web/packages/pROC/index.html). In contrast, the C-index is referred to as the scores of all individual pairs correctly sorted on the basis of the Harrell C statistics [16] to predict the survival time [17]. It was calculated using the survcomp package (http://www.bioconductor.org/packages/release/bioc/html/survcomp.html) in the R3.4.1 language.

留言 (0)

沒有登入
gif