Screening of Serum miRNAs as Diagnostic Biomarkers for Lung Cancer Using the Minimal-Redundancy-Maximal-Relevance Algorithm and Random Forest Classifier Based on a Public Database

Background: Lung cancer is one of the deadliest cancers, and early diagnosis can substantially improve patient survival. We aimed to screen serum miRNAs as diagnostic biomarkers for patients with lung cancer. Methods: A total of 416 significantly differentially expressed miRNAs were identified using the limma package, and the features were then ranked by the minimal-redundancy-maximal-relevance method. Incremental feature selection with a random forest (RF) classifier was used to choose, among the top 5 miRNAs, the combination with the optimum predictive performance. The performance of the RF classifier based on the top 5 miRNAs was analyzed using the receiver operator characteristic (ROC) curve. Afterward, the classification effect of the 5-miRNA combination was validated through principal component analysis and hierarchical clustering analysis. Expression of the top 5 miRNAs in lung cancer patients and normal people was compared based on the GSE137140 dataset and validated by qPCR. Hierarchical clustering analysis was used to analyze the similarity of the expression profiles of the 5 miRNAs. ROC analysis was undertaken on each miRNA. Results: The final top 5 miRNAs yielded a Matthews correlation coefficient of 0.988 and an area under the curve (AUC) of 0.996. The 5 feature miRNAs were capable of distinguishing most cancer patients from normal people. Furthermore, except for miR-6875-5p, which was lowly expressed in lung cancer patients, the other 4 miRNAs were all highly expressed in cancer patients. Performance analysis revealed that their AUC values were 0.92, 0.96, 0.94, 0.95, and 0.93, respectively. Conclusion: Overall, the 5 feature miRNAs screened here are expected to be effective biomarkers for lung cancer.

© 2022 The Author(s). Published by S. Karger AG, Basel

Introduction

Lung cancer is an aggressive tumor that causes a large number of deaths [1, 2]. Despite considerable progress thus far, more investigations into early diagnosis and novel targets are still warranted [3]. In recent years, the research and development of molecular biomarkers has become a crucial part of optimizing early diagnosis of lung cancer [4]. Blood-derived biomarkers are expected to become mainstream owing to their stability and efficiency. Blood-derived miRNAs display excellent diagnostic performance as a new biomarker type [5]. For instance, Tobias et al. [6] published a multicenter cohort study on the early diagnosis of lung cancer and built a 15-miRNA signature with machine learning. Although they thoroughly analyzed differences in miRNA expression patterns between cancer patients and the normal population, relying on a gradient algorithm alone is a limitation [6].

Hanchuan Peng first proposed the minimal-redundancy-maximal-relevance (mRMR) algorithm in 2005 as a way to select informative features from large feature sets [7]. mRMR can recognize genes with independent effects and remove redundant features among the selected genes [8], so that a feature set with the maximum relevance and minimum redundancy is obtained [9]. In 2008, Zhang et al. [10] established different classifiers based on feature genes screened with mRMR and ReliefF to distinguish disease from normal status. Lu et al. [11] employed mRMR and a simple decision tree model to predict benign and malignant ovarian tumors, with rather favorable performance. In addition, mRMR has been applied to the prediction of lysine ubiquitination [12], protein-protein interaction [13], and human immunodeficiency virus progression-associated genes [14, 15]. In general, mRMR is suitable for screening feature genes of tumors from large gene sets.

Random forest (RF) is a classifier consisting of several decision trees [16]; it is an ensemble learning method that is efficient and helps prevent overfitting [17]. This method has been employed to resolve many bioinformatics problems, such as the identification of proteins and peptides [18] and the prediction of transcription factor binding [17]. This paper utilized this method to develop a novel diagnostic approach for lung cancer.

In this study, mRMR was used for feature ranking of differentially expressed miRNAs, and the top 5 miRNAs were identified. Subsequently, the optimal feature combination was selected using incremental feature selection (IFS), and a receiver operator characteristic (ROC) curve was used to examine the selected classifier. In addition, principal component analysis (PCA) and hierarchical clustering analysis revealed that the top 5 miRNAs could effectively distinguish lung cancer samples from normal samples. Finally, this finding was further verified by measuring miRNA expression in cancer and normal samples. In general, these 5 feature miRNAs were screened and validated by combining bioinformatics methods and qPCR detection. Their diagnostic performance suggests that they could work as efficient serum biomarkers for lung cancer.

Methods

Acquisition of miRNA Expression Data and Differential Expression Analysis

Expression data of serum miRNAs in lung cancer and nontumor groups were accessed from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/). The dataset (GSE137140), comprising serum miRNA expression profiles of 1,566 preoperative lung cancer patients and 2,178 nontumor controls, was generated on the GPL21263 platform. Data were standardized with the limma package [19] and then subjected to differential expression analysis (logFC >1.5, FDR <0.05).
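The thresholding step can be illustrated with a minimal Python sketch. It is not the limma workflow used here (limma is an R package); it only shows how a list of log fold changes and raw p values might be filtered, assuming a Benjamini-Hochberg FDR adjustment and assuming the fold-change cutoff applies to the absolute value of logFC, neither of which is stated explicitly above. The function names and the toy values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (FDR), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_differential(names, log_fc, p_values, fc_cutoff=1.5, fdr_cutoff=0.05):
    """Keep features with |logFC| > fc_cutoff and FDR < fdr_cutoff (assumed rule)."""
    fdr = benjamini_hochberg(p_values)
    return [
        name
        for name, fc, q in zip(names, log_fc, fdr)
        if abs(fc) > fc_cutoff and q < fdr_cutoff
    ]


# Hypothetical toy values, not taken from GSE137140.
names = ["miR-A", "miR-B", "miR-C", "miR-D"]
log_fc = [2.3, -1.9, 0.4, 1.7]
p_values = [0.001, 0.004, 0.0005, 0.2]
selected = select_differential(names, log_fc, p_values)
print(selected)
```

In this toy input, miR-C fails the fold-change cutoff and miR-D fails the FDR cutoff, so only the first two entries are retained.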

Feature Ranking Using mRMR Algorithm

mRMR is capable of selecting informative features for clinical feature classification [7]. The algorithm works by finding features with the maximum relevance to the output while maintaining the minimum redundancy among features. Given two random variables x and y with marginal probability density functions p(x) and p(y) and joint probability density function p(x, y), their mutual information is defined as follows (the larger I(x; y), the stronger the relevance between the two variables):

$$I(x; y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \tag{1}$$

Based on the function above, max-relevance satisfies the following formula:

$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c) \tag{2}$$

where x_i denotes a feature in the feature set S and c denotes the class label. However, when features in S are highly correlated with one another, removing one of them has little effect on classification performance. Thus, the minimum redundancy criterion was introduced to reduce the relevance among features:

$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j) \tag{3}$$

where x_i and x_j denote features in S. The mRMR algorithm is constructed by combining max D and min R, which is simplified by defining the operator ϕ:

$$\max \phi(D, R), \quad \phi = D - R \tag{4}$$

Here, remarkably differentially expressed miRNAs were taken as the feature set and ranked using mRMR.
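As an illustration of how a ranking of this kind can be obtained, the following Python sketch implements a greedy mRMR ranking with the difference criterion ϕ = D − R. It is a simplified sketch rather than the implementation used in this study: it assumes that expression values have already been discretized (for example into low/high bins), it estimates mutual information from empirical frequencies, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), n_xy in count_xy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), with probabilities as counts / n
        mi += (n_xy / n) * math.log(n_xy * n / (count_x[x] * count_y[y]))
    return mi


def mrmr_score(name, selected, features, relevance):
    """phi = D - R for one candidate, given the already selected features."""
    if not selected:
        return relevance[name]
    redundancy = sum(
        mutual_information(features[name], features[s]) for s in selected
    ) / len(selected)
    return relevance[name] - redundancy


def mrmr_rank(features, labels, k):
    """Greedily pick k feature names; ties are broken alphabetically."""
    relevance = {
        name: mutual_information(values, labels) for name, values in features.items()
    }
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        best = max(
            sorted(remaining),
            key=lambda name: mrmr_score(name, selected, features, relevance),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical discretized toy data: "a_copy" duplicates "a" and is fully redundant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 0, 1, 0, 1, 1, 0, 1],
}
ranking = mrmr_rank(features, labels, 3)
print(ranking)
```

In the toy data, the duplicated feature is as relevant as the original but is ranked last, because its redundancy with the already selected feature outweighs its relevance.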

Construction of the RF Classifier and IFS

The mRMR algorithm ranked the feature miRNAs by importance but did not determine which combination of miRNAs had the optimum classification performance. Thus, a series of nested feature miRNA sets Fn = [f1, f2, …, fn] (fi: the i-th ranked feature miRNA; n: integer from 1 to 5) was built from the top 5 miRNAs ranked by mRMR. Afterward, the Python package scikit-learn (sklearn) was applied to establish an RF classifier [20] for each set (F1, F2, …, F5). RF is an ensemble algorithm based on decision trees: each decision tree classifies the samples, and the final output is determined by voting over the trees. Because the two groups had unequal sample numbers, the Python package imbalanced-learn (imblearn) was used to balance the training data by upsampling [21]. To assess classification performance, we calculated the 10-fold cross-validation Matthews correlation coefficient (MCC) of each RF classifier on the training set and drew the IFS curve [22]. According to the MCC values in the IFS curve, the optimum feature miRNA combination was chosen. In addition, the ROC curve of the 5-miRNA RF classifier was drawn with the Python package sklearn to evaluate the classification performance [23].
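The IFS procedure and the MCC metric can be sketched in a few lines of Python. This is a library-free illustration under stated assumptions, not the sklearn/imblearn pipeline used in this study: the classifier is abstracted as a hypothetical `evaluate` callback that returns a score for a feature subset (in this study, that score would be the 10-fold cross-validated MCC of an RF classifier), and the scores in the usage example are made up.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def incremental_feature_selection(ranked_features, evaluate):
    """Score the nested prefixes F1, F2, ... and return the best one plus the curve.

    `evaluate` is a hypothetical callback: subset (list of names) -> score.
    On ties, the smallest subset reaching the best score is returned.
    """
    curve = []
    for n in range(1, len(ranked_features) + 1):
        curve.append((n, evaluate(ranked_features[:n])))
    best_n, _ = max(curve, key=lambda item: item[1])
    return ranked_features[:best_n], curve


# Usage with made-up scores standing in for cross-validated MCC values.
ranked = ["m1", "m2", "m3"]
fake_scores = {1: 0.80, 2: 0.90, 3: 0.85}
best_subset, ifs_curve = incremental_feature_selection(
    ranked, lambda subset: fake_scores[len(subset)]
)
print(best_subset, ifs_curve)
```

Plotting the second element of each `ifs_curve` entry against the first gives an IFS curve of the kind described above.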

PCA and Hierarchical Clustering Analysis

PCA reduces the feature dimensionality of data while retaining as much of the variation in the data as possible [24]. In this algorithm, new variables, namely principal components, are generated as linear combinations of the original features. The classification effect can be validated by observing the data distribution in the space spanned by these principal components. Using the FactoMineR package, the 5 feature miRNAs in this investigation were transformed into PC1 and PC2, which made up a 2-dimensional plane, and the distribution of each sample on this plane was then observed. Additionally, hierarchical clustering analysis was performed on the samples using the pheatmap package according to the expression of the 5 feature miRNAs [25]. In this way, we verified whether cancer and normal samples could be classified based on the expression of the 5 miRNAs.

Research Subjects and Sample Collection

One hundred lung cancer patients and 80 normal people seen from June 2020 to June 2021 at the 900 Hospital of the Joint Logistics Team were enrolled as subjects. Their clinical information and serum samples were gathered. None of the patients had received any treatment at the time of diagnosis, which was made according to the American Joint Committee on Cancer: The 7th Cancer Staging Manual [26]. Peripheral blood (5 mL) was collected from each subject in a 5-mL clotting tube (Greiner Bio-One, Austria). Sera were separated by centrifugation and stored at −80°C for miRNA extraction. The above steps were approved by the Ethics Committee of the 900 Hospital of the Joint Logistics Team.

qPCR

According to the manufacturers' instructions, TRIzol (Invitrogen, USA) was applied to isolate total RNA from sera. The miScript II RT Kit (QIAGEN, Germany) was used to reverse transcribe total RNA into cDNA. The cDNA was amplified and quantified on a Bio-Rad CFX96 qPCR instrument using miScript SYBR green (Bio-Rad, USA). Primers (see Table 1) for miR-1290, miR-663a, miR-3192-5p, miR-1343-3p, miR-6875-5p, and U6 (internal reference) were all procured from GENEWIZ, China. The relative expression of miRNAs was calculated by the 2^−ΔΔCT method.
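For readers unfamiliar with the 2^−ΔΔCT calculation, a minimal Python sketch follows. The function name and the Ct values are hypothetical and serve only to show the arithmetic: each target Ct is first normalized to the internal reference (U6 here), and the sample is then compared with a control (calibrator) sample.

```python
def relative_expression(ct_target, ct_reference, ct_target_control, ct_reference_control):
    """Relative expression by the 2^-ddCt method.

    ct_target / ct_reference: Ct of the miRNA and of the internal reference
    in the sample of interest; the *_control arguments are the same two
    measurements in the control (calibrator) sample.
    """
    delta_ct_sample = ct_target - ct_reference
    delta_ct_control = ct_target_control - ct_reference_control
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: the target amplifies 2 cycles earlier in the
# patient sample than in the control, at equal U6 levels.
fold_change = relative_expression(
    ct_target=24.0, ct_reference=18.0,
    ct_target_control=26.0, ct_reference_control=18.0,
)
print(fold_change)  # 4.0
```

A difference of two cycles corresponds to a fourfold higher relative expression, because each cycle is assumed to double the amount of product.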

Table 1.

Statistics and Analysis

GraphPad Prism (GraphPad Software, USA) was applied for statistical processing. The t test was employed to assess the significance of expression differences, with p < 0.05 considered significant. The ROC analysis module of GraphPad Prism was employed to analyze the predictive performance of each miRNA. An asterisk (*) denotes p < 0.05.
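The AUC reported by a single-marker ROC analysis has a simple interpretation that can be sketched in Python: it equals the probability that a randomly chosen patient has a higher marker value than a randomly chosen control, with ties counted as one half. The sketch below is not GraphPad Prism's implementation, and the expression values are hypothetical.

```python
def roc_auc(scores_positive, scores_negative):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. This pairwise form is O(n * m) and is meant for
    illustration on small samples.
    """
    wins = 0.0
    for p in scores_positive:
        for q in scores_negative:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical relative expression values for an upregulated marker.
tumor = [2.1, 3.0, 1.2]
normal = [1.0, 1.5, 0.8]
auc = roc_auc(tumor, normal)
print(round(auc, 3))  # 0.889
```

For a marker that is downregulated in patients, such as miR-6875-5p here, the same function yields a value below 0.5 unless the scores are negated or the groups are swapped, so the direction of the marker has to be fixed before AUC values are compared.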

Results

Feature Ranking and Expression Analysis of the Differentially Expressed miRNAs

Analysis of the expression profiles of serum miRNAs revealed 416 significantly differentially expressed miRNAs (online suppl. Table 1; for all online suppl. material, see www.karger.com/doi/10.1159/000525316). These miRNAs were ranked with the mRMR method according to feature importance. The top 5 miRNAs were used in the subsequent analyses (Table 2).

Table 2.

Top 5 miRNAs by mRMR feature selection

Evaluation of the Classification Performance of the Top 5 miRNAs

This part focused on the potential of the top 5 feature miRNAs as serum biomarkers for lung cancer. To this end, the IFS curve of the RF classifiers based on the different feature sets (F1, F2, …, F5) and the ROC curve were used to analyze classification accuracy. In the IFS curve, the RF classifier based on the top 5 miRNAs displayed the maximum MCC value (0.988) (Fig. 1a). According to the ROC curve, the AUC value was 0.996 (Fig. 1b). Taken together, these analyses demonstrated that the RF classifier based on the top 5 miRNAs could effectively distinguish cancer samples from normal ones.

Fig. 1.

Evaluation of the classification performance of top 5 miRNAs. a IFS curve of the RF classifier based on different feature sets. X axis: miRNA numbers; Y axis: MCC value. b ROC curve evaluated the classification performance of the classifier. X axis: 1 − specificity, false-positive rate; Y axis: sensitivity, true-positive rate.

PCA and Hierarchical Clustering Analysis

To further verify whether these 5 feature miRNAs could efficiently classify samples, PCA and hierarchical clustering analysis were undertaken. As shown by the PCA results, the two kinds of samples could be distinguished in the 2-dimensional plane consisting of PC1 and PC2 (Fig. 2a). Hierarchical clustering analysis likewise showed that the expression of the 5 miRNAs could distinguish most samples by type (Fig. 2b). Next, the expression of these miRNAs in the samples was analyzed. MiR-6875-5p was downregulated in tumor samples, while the others were upregulated (Fig. 3a–e). Altogether, these 5 feature miRNAs could efficiently distinguish cancer samples from normal ones.

Fig. 2.

PCA and hierarchical clustering analysis for validation of the classification performance of the 5 feature miRNAs. a Clustering of samples in the 2-dimensional plane. Red dots: nontumor samples. Blue triangles: tumor samples. b Hierarchical clustering analysis of the expression of these 5 feature miRNAs. Blue: nontumor group, red: tumor group.

Fig. 3.

Analysis of top 5 miRNA expression based on GSE137140 dataset. a–e Expression of miR-6875-5p, miR-3192-5p, miR-1290, miR-663a, and miR-1343-3p, respectively. Green: normal samples. Pink: tumor samples.

Validation of the Expression and Predictive Performance of Top 5 miRNAs

To verify whether these miRNAs could be used as serum biomarkers for lung cancer, we collected clinical information (Table 3) and serum samples of subjects (100 tumor patients and 80 normal people). Expression of these miRNAs was measured with qPCR. Except for miR-6875-5p, which showed markedly low expression in the tumor group (Fig. 4a), the other miRNAs (miR-3192-5p, miR-1290, miR-663a, miR-1343-3p) displayed markedly high expression compared with the normal group (Fig. 4b–e). These results were consistent with the expression patterns of the top 5 miRNAs observed in the GSE137140 dataset (Fig. 3). Next, hierarchical clustering analysis revealed that most cancer patients could be well distinguished from normal people (Fig. 4f). The ROC curve of each miRNA was drawn to analyze the corresponding predictive performance. The AUC values of miR-6875-5p, miR-3192-5p, miR-1290, miR-663a, and miR-1343-3p were 0.92, 0.96, 0.94, 0.95, and 0.93, respectively (Fig. 5a–e). In conclusion, each feature miRNA exhibited a favorable predictive performance for lung cancer.

Table 3.

Clinical information of subjects

Fig. 4.

qPCR and hierarchical clustering analyses of 5 feature miRNAs. a–e Relative expression of miR-6875-5p, miR-3192-5p, miR-1290, miR-663a, and miR-1343-3p, respectively. f Hierarchical clustering analysis of relative expression of 5 feature miRNAs. Pink: tumor, blue: normal. *p < 0.05.

Fig. 5.

ROC analysis of 5 feature miRNAs. a–e ROC curves of disease diagnosis based on the expression of miR-6875-5p, miR-3192-5p, miR-1290, miR-663a, and miR-1343-3p, respectively. X axis: 1-specificity, Y axis: sensitivity.

Discussion

Evidence indicates a strong association between the expression of certain miRNAs and cancer development. Zhang et al. [27] discovered a high miR-31 level in colorectal carcinoma tissue and its association with radiotherapy sensitivity. This miRNA is a promising prognostic biomarker for colorectal carcinoma, which was also supported by their subsequent cellular and molecular experiments. Zhou et al. [28] applied expression detection, cell viability experiments, and protein detection to reveal the promoting effect of miR-665 in ovarian cancer, suggesting its potential as a predictive biomarker. Generally, different miRNAs might serve as diagnostic or predictive biomarkers of cancers. In this study, five serum miRNAs, miR-1290, miR-663a, miR-3192-5p, miR-1343-3p, and miR-6875-5p, were obtained through differential expression analysis and the mRMR method. MiR-1343-3p has been reported to be highly expressed in lung cancer patients [29], which coincides with our analysis. The study by Yining Wu and colleagues found that miR-1290 is a potential diagnostic and prognostic marker of lung adenocarcinoma [30]. Huang et al. [31] discovered that miR-663a is a potential biomarker for the diagnosis of osteosarcoma. Neda Gilani's work demonstrated that miR-1343-3p can be used as a reliable biomarker for gastric cancer diagnosis [32]. These results suggest that these 3 miRNAs can be used as potential markers for cancer diagnosis. Our results likewise identified 5 miRNAs (including the above 3) as indicators for lung cancer diagnosis, revealing their potential as diagnostic markers for lung cancer.

Early screening of high-risk lung cancer populations by detecting circulating miRNAs has been extensively applied. In 2010, Bianchi et al. [33] developed a 34-miRNA-based signature for early diagnosis of non-small-cell lung cancer. The researchers constructed a predictive model using linear discriminant analysis and hierarchical clustering analysis and undertook early screening of lung cancer in a large-scale high-risk population. The accuracy of that model was 80%, and the AUC of the ROC curve was 0.89, suggestive of a rather stable predictive performance. Thereafter, in 2015, that team reduced the 34 miRNAs to 13 miRNAs and established a 13-miRNA model with a good early screening performance [34]. However, with the application of machine learning and the spread of molecular diagnostics, more suitable algorithms might exist for feature gene screening and model construction for lung cancer. Different from Fabrizio Bianchi's research, this investigation used the mRMR method to screen feature miRNAs, and the RF classifiers based on these miRNAs showed good predictive performance. Lastly, the performance was further validated by detecting the expression of these miRNAs in clinical samples.

Overall, this investigation acquired serum miRNA expression profiles from the Gene Expression Omnibus database and used the mRMR method to explore feature miRNAs of lung cancer. Afterward, the 5-miRNA-based RF classifier showed optimum classification performance as determined by IFS curve analysis, and this was also validated by the ROC curve. Next, PCA and hierarchical clustering analysis further demonstrated that the 5 miRNAs could effectively distinguish between lung cancer samples and normal samples. Lastly, the expression patterns of these miRNAs were validated by qPCR in blood samples, and the predictive performance of each miRNA was verified by its ROC curve. Although this investigation determined lung cancer-related feature miRNAs and their expression patterns in lung cancer patients using bioinformatics analysis and qPCR detection, some limitations should still be acknowledged. This investigation did not include cellular or animal experiments, and thus the specific mechanisms could not be explored in depth. Such experiments will be designed in our future work to clarify the specific effects of these 5 feature miRNAs in lung cancer.

Statement of Ethics

Our study did not require an ethical board approval and consent to participate statement because it did not contain human or animal trials.

Conflict of Interest Statement

The authors have no conflicts of interest to declare.

Funding Sources

The authors declare there is no funding used in the article.

Author Contributions

Conceptualization: Xiaoyan Huang. Data curation: Xi Chen. Formal analysis: Xiaoyan Huang. Methodology: Xiong Chen. Validation: Xiaoyan Huang. Investigation: Xi Chen. Writing – original draft: Wenling Wang. Writing – review and editing: Wenling Wang and Xiong Chen.

Data Availability Statement

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

This article is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC). Usage and distribution for commercial purposes requires written permission.
