Integrating machine learning with mendelian randomization for unveiling causal gene networks in glioblastoma multiforme

3.1 Differential expression and batch effect correction

The analysis began by addressing batch effects between the training datasets, GSE116520 and GSE34152. Before batch correction, significant differences were observed in gene expression distributions across the two datasets, as illustrated by the boxplots in Fig. 1A. Principal component analysis (PCA) further highlighted the batch effect, with samples from GSE116520 and GSE34152 clustering separately in the PCA plot (Fig. 1C). To correct for these discrepancies, the ComBat function from the sva package was applied. Post-correction, the gene expression distributions were more consistent across datasets, as shown in the adjusted boxplots (Fig. 1B), and the PCA plot demonstrated a more cohesive clustering of samples, indicating effective batch correction (Fig. 1D).

Fig. 1figure 1

Batch Effect Correction and Differential Gene Expression Analysis. A Boxplot of gene expression data from GSE116520 and GSE34152 datasets before batch correction. The differences between the datasets are apparent. B Boxplot of gene expression data from GSE116520 and GSE34152 after batch effect correction using the ComBat method, showing improved alignment across datasets. C Principal component analysis (PCA) plot before batch correction, displaying clear separation between GSE116520 and GSE34152 datasets. D PCA plot after batch correction, illustrating improved alignment and reduced batch effects, with samples from both datasets more closely clustered. E Volcano plot showing differentially expressed genes (DEGs) between glioblastoma multiforme (GBM) and control tissues in the training set. Genes with |logFC|> 1 and adj.P.Val < 0.05 are highlighted in red (upregulated) and green (downregulated). F Heatmap of the top DEGs across samples, highlighting distinct expression patterns between GBM and control groups in the training set. Each column represents a sample, and each row represents a DEG

Subsequent differential expression analysis between GBM and adjacent normal tissues identified a total of 286 DEGs with |logFC|> 1 and adjusted p-value < 0.05 (Supplementary Table 1). The volcano plot in Fig. 1E displays the distribution of these DEGs, with significantly upregulated genes marked in red and downregulated genes in green. The heatmap in Fig. 1F provides a visual summary of the expression patterns of the top DEGs across samples, clearly distinguishing GBM from normal tissues.

3.2 WGCNA and module detection

To identify gene modules associated with GBM, WGCNA was performed. The optimal soft-thresholding power was determined to be 8, ensuring that the network followed a scale-free topology (Fig. 2A). Using this power, an adjacency matrix was constructed, and the topological overlap matrix (TOM) was calculated. Hierarchical clustering based on TOM dissimilarity identified several gene modules, each represented by a unique color in the dendrogram (Fig. 2B).

Fig. 2figure 2

WGCNA and Functional Enrichment Analysis of GBM-Related Modules. A Scale independence and mean connectivity plot used to determine the optimal soft-thresholding power in WGCNA. The soft-threshold power was chosen at 8, ensuring a scale-free network topology. B Gene dendrogram and module colors generated by hierarchical clustering of genes based on topological overlap. Modules are represented by different colors, with each branch of the dendrogram representing a module. C Heatmap showing the correlation between module eigengenes and clinical traits (GBM vs. control). The yellow module exhibited the highest positive correlation with the GBM trait. D Bar plot depicting the gene significance across modules, with the yellow module showing the highest gene significance for GBM. E Venn diagram illustrating the overlap between differentially expressed genes (DEGs) identified in the study and the yellow module from WGCNA. A total of 7 genes were common between the DEGs and the yellow module. F Circos plot visualizing the Gene Ontology (GO) enrichment analysis of the common genes, highlighting the most significant biological processes related to GBM, including cell adhesion and axon development. G Bar plot of GO terms enriched in the common genes, with the top enriched terms involved in cell adhesion and neuronal differentiation. H KEGG pathway enrichment analysis of the common genes, showing significant pathways such as viral life cycle, cardiomyopathies, and ECM-receptor interaction

The module-trait relationship heatmap (Fig. 2C) indicated that the yellow module exhibited the most significant correlation with GBM (correlation coefficient = 0.49, p = 3e−04). This module was selected for further analysis due to its strong association with GBM. The gene significance within this module was notably higher than in other modules, as shown in the bar plot (Fig. 2D), further supporting its relevance to GBM.

3.3 Functional enrichment analysis of the most significant GBM-associated module

The yellow module was subjected to functional enrichment analysis to explore its biological implications in GBM. A Venn diagram (Fig. 2E) identified seven intersecting genes between the yellow module and the DEGs identified earlier (Supplementary Table 2). These intersecting genes were analyzed for their involvement in biological processes, cellular components, and molecular functions through GO enrichment analysis. The circular plot (Fig. 2F) highlights the significant GO terms, including pathways related to cell adhesion, neuron differentiation, and protein ubiquitination.

Additionally, KEGG pathway enrichment analysis revealed that these genes were significantly involved in pathways such as the viral life cycle of HIV-1, cardiomyopathies, and ECM-receptor interaction (Fig. 2H). The dot plot (Fig. 2G) illustrates the specific GO terms enriched within the yellow module, underscoring its potential role in GBM pathogenesis through processes like integrin binding and axon development.

3.4 Evaluation of the top-performing machine learning model

A total of 113 machine learning models (Supplementary Table 3) were developed and evaluated using the training dataset, which consisted of the merged GSE116520 and GSE34152 datasets, and external validation datasets, including GSE7696, GSE90886, and GSE109857. The model with the highest (AUC was selected for further evaluation.

Figure 3A shows a heatmap comparing the AUC values across different models and cohorts. The Ridge regression model combined with the Elastic Net (Enet) algorithm emerged as the top-performing model with the highest AUC values across all datasets. This model achieved an AUC of 0.897 (95% CI 0.808–0.966) on the training set (Fig. 3B), demonstrating strong discriminatory power between GBM and control samples.

Fig. 3figure 3

Evaluation of the Machine Learning Model for GBM Classification. A Heatmap displaying the AUC values of 113 machine learning models across the training set (GSE116520 and GSE34152) and external validation sets (GSE7696, GSE90886, and GSE109857). The Ridge model, which showed the highest AUC, was selected for further evaluation. B ROC curve for the Ridge model on the training set, demonstrating an AUC of 0.897 with a 95% confidence interval (CI) of 0.808–0.966, indicating strong predictive performance. C ROC curve for the Ridge model on the GSE109857 validation set, yielding an AUC of 0.872 with a 95% CI of 0.798–0.940, confirming the model's generalizability to external data. D ROC curve for the Ridge model on the GSE7696 validation set, showing an AUC of 0.984 with a 95% CI of 0.950–1.000, further supporting the model's robustness. E ROC curve for the Ridge model on the GSE90886 validation set, with a lower AUC of 0.704 and a 95% CI of 0.432–0.914, indicating variable performance across different datasets. F Confusion matrix for the Ridge model on the training set, highlighting the model's accuracy in classifying GBM (Treat) and control samples. The model correctly classified 17 GBM samples and 23 control samples, with 10 misclassifications. G Confusion matrix for the Ridge model on the GSE109857 validation set, demonstrating its high sensitivity and specificity, correctly classifying 147 GBM samples and 5 control samples. H Confusion matrix for the Ridge model on the GSE7696 validation set, where the model achieved perfect classification with no misclassifications. I Confusion matrix for the Ridge model on the GSE90886 validation set, showing some misclassifications, consistent with the lower AUC observed in the ROC analysis

The model's performance was further validated on the external datasets. In the GSE109857 dataset, the model achieved an AUC of 0.872 (95% CI 0.798–0.940) (Fig. 3C), while in the GSE7696 dataset, the model performed exceptionally well with an AUC of 0.984 (95% CI 0.950–1.000) (Fig. 3D). However, in the GSE90886 dataset, the model's performance was lower, with an AUC of 0.704 (95% CI 0.432–0.914) (Fig. 3E).

To further assess the model's classification accuracy, confusion matrices were generated for the training set and each validation cohort (Fig. 3F–I). In the training set (Fig. 3F), the model correctly classified 23 control samples and 17 GBM samples, with four control samples misclassified as GBM and six GBM samples misclassified as control. In the GSE109857 cohort (Fig. 3G), the model showed strong classification accuracy, correctly identifying 147 GBM samples and five control samples, with minimal misclassification. The GSE7696 cohort (Fig. 3H) demonstrated a similar pattern, with only a few misclassifications. The GSE90886 cohort (Fig. 3I) exhibited slightly higher misclassification rates, reflecting the lower AUC observed in this dataset.

Overall, the Ridge regression model combined with Enet demonstrated robust performance across multiple datasets, making it a promising tool for distinguishing GBM from control samples. The use of ROC curves and confusion matrices provided a comprehensive evaluation of the model's accuracy and generalizability.

3.5 Identification and validation of key prognostic markers using ridge regression model

The Ridge regression model, which was identified as the top-performing method, was further utilized to identify and validate key prognostic markers in GBM. Differential expression analysis identified 286 DEGs, of which significant genes are highlighted in Fig. 4A. The volcano plot (Fig. 4A) illustrates the distribution of genes with significant differential expression between GBM and control samples, highlighting genes such as ANXA2P1, FOXO4, ITGA10, KLHL3, MAP1A, PTTG3P, and SFRP2 as key markers (Supplementary Table 4).

Fig. 4figure 4

Identification and Validation of Key Genes Associated with GBM. A Volcano plot showing differentially expressed genes (DEGs) between GBM and control samples. The highlighted genes—MAP1A, FOXO4, KLHL3, ITGA10, SFRP2 (downregulated in GBM), and ANXA2P1, PTTG3P (upregulated in GBM)—were selected for further analysis based on their significant log fold changes (logFC) and adjusted p-values. B Box plots depicting the expression levels of the selected genes (ANXA2P1, FOXO4, ITGA10, KLHL3, MAP1A, PTTG3P, SFRP2) in GBM (Treat) and control samples. All selected genes show statistically significant differential expression between the two groups (***p < 0.001, **p < 0.01). C Correlation matrix visualizing the relationships between the expression levels of the selected genes. The matrix shows both the correlation coefficients and the statistical significance of these correlations, highlighting strong positive and negative correlations among certain gene pairs. D ROC curves for each selected gene, demonstrating their predictive ability for distinguishing between GBM and control samples. The AUC values indicate the model's classification performance for each gene, with MAP1A showing the highest AUC of 0.883. E Gene interaction network generated using GeneMANIA, illustrating the connections between the selected genes and other associated genes. The network includes various types of interactions, such as physical interactions, co-expression, and shared protein domains, with functions primarily related to the negative regulation of kinase activity, proteolysis, and protein phosphorylation

In Fig. 4B, the expression levels of these seven significant DEGs are compared between GBM and control samples, showing consistent differential expression patterns, with all selected genes showing statistically significant differences (p < 0.001). The correlation plot in Fig. 4C further examines the relationships between these key genes, revealing notable correlations such as a strong positive correlation between ANXA2P1 and ITGA10, and a strong negative correlation between FOXO4 and PTTG3P.

The diagnostic performance of each key marker was assessed using ROC curves (Fig. 4D), where the AUC values ranged from 0.722 for SFRP2 to 0.883 for MAP1A. These AUC values indicate that these genes have strong potential as diagnostic markers for GBM, with MAP1A, KLHL3, and FOXO4 showing particularly high discriminatory power.

Lastly, a gene interaction network analysis was conducted using GeneMANIA to explore the functional associations between these key markers (Fig. 4E). The network revealed that these genes are involved in critical biological processes, including the negative regulation of kinase activity and protein phosphorylation. The analysis also highlighted significant physical interactions, co-expression patterns, and shared pathways among these genes, further supporting their role in GBM pathogenesis.

3.6 Immune infiltration analysis in GBM

Immune infiltration analysis was conducted using the CIBERSORT algorithm to evaluate the relative proportions of 22 immune cell types within GBM and control samples (Supplementary Table 5). Figure 5A illustrates the distribution of these immune cell types across the samples, showing distinct differences in immune cell composition between GBM and control groups. Notably, GBM samples exhibited higher proportions of M0 and M2 macrophages, as well as activated memory CD4 T cells, compared to the control group.

Fig. 5figure 5

Immune Infiltration Analysis in GBM. A The bar plot illustrates the relative proportions of various immune cell types within the GBM (Treat) and control groups, as calculated by the CIBERSORT algorithm. This comparison shows distinct differences in the immune landscape between the two groups, highlighting the increased presence of certain immune cell types in GBM samples. B The heatmap shows the correlation coefficients between different immune cell types in GBM and control samples. Positive and negative correlations are depicted, indicating potential interactions or mutual exclusivity between specific immune cell populations. C Box plots display the fraction of individual immune cell types across the GBM and control groups, emphasizing significant differences in the infiltration levels of certain immune cells, such as macrophages and T cells, between the two conditions. D A lollipop plot ranking the immune cell types based on their correlation coefficients with the expression of key genes in GBM, as determined through correlation analysis. This plot identifies the immune cells most strongly associated with gene expression changes in GBM, with macrophages M1 and M2 showing notable correlations. E A cell–cell interaction network, illustrating the associations between selected key genes (ANXA2P1, FOXO4, ITGA10, KLHL3, MAP1A, PTTG3P, SFRP2) and immune cell types. The network highlights significant interactions and their directions (positive or negative), providing insights into the complex immune microenvironment in GBM

The correlation heatmap in Fig. 5B reveals the relationships among the different immune cell types. Strong positive correlations were observed between activated memory CD4 T cells and M1 macrophages, as well as between resting memory CD4 T cells and M2 macrophages, indicating potential interactions or co-regulation within the tumor microenvironment. A detailed comparison of the immune cell fractions between GBM and control samples is presented in Fig. 5C. This analysis highlights significant differences in the fractions of various immune cells, particularly an increased presence of M2 macrophages and a reduction in resting CD4 memory T cells in GBM samples. Figure 5D further explores the correlation between key immune cells and gene expression, with macrophages (M1 and M2) and dendritic cells showing the highest correlation coefficients. These correlations suggest that the presence of these immune cells may be linked to specific gene expression profiles in GBM, reflecting their potential role in tumor progression or immune evasion. Finally, the network plot in Fig. 5E integrates the identified key genes with the immune cell infiltration data. The plot shows significant associations between the expression of key genes such as ANXA2P1, FOXO4, ITGA10, and KLHL3, and the infiltration of specific immune cell types. For instance, ANXA2P1 and ITGA10 are positively correlated with M2 macrophage infiltration, further underscoring the immunosuppressive environment in GBM.

These findings provide a comprehensive overview of the immune landscape in GBM and highlight potential interactions between immune cells and key genes that may contribute to the tumor's immunosuppressive microenvironment.

3.7 Mendelian randomization analysis of KLHL3 and glioblastoma Risk

MR analysis was conducted to investigate the causal relationship between the expression of specific genes and glioblastoma risk, focusing on seven genes identified as potential candidates based on previous analyses. These genes were screened for their relevance using a combination of differential expression analysis, machine learning model evaluation, and functional enrichment studies. Among these, KLHL3 was selected for detailed MR analysis due to its significant differential expression and robust association with glioblastoma risk factors identified in earlier steps.

Figure 6A illustrates the MR analysis results using various methods, including Inverse Variance Weighted (IVW), MR Egger, Weighted Median, and Simple Mode. The consistent slope across these methods for KLHL3 suggests a protective effect against glioblastoma, with the IVW method showing a particularly strong association. The forest plot in Fig. 6B presents the MR effect sizes for each SNP used as an instrumental variable, highlighting KLHL3's impact on glioblastoma risk. Notably, the red markers indicate significant SNPs detected by the MR Egger method, suggesting some degree of pleiotropy, while the black markers represent the IVW estimates. Figure 6C provides a leave-one-out sensitivity analysis, which further confirmed the robustness of the MR findings by showing stable estimates even when individual SNPs were excluded from the analysis. This reinforces the reliability of the observed association between KLHL3 expression and glioblastoma risk. Finally, Fig. 6D depicts the funnel plot for detecting directional pleiotropy. The symmetry around the MR Egger line suggests no significant pleiotropic effects, strengthening the validity of KLHL3 as a causal factor in reducing glioblastoma risk (Supplementary Table 6).

Fig. 6figure 6

Mendelian Randomization Analysis of KLHL3 in Glioblastoma. A The scatter plot displays the SNP effect on KLHL3 expression versus the SNP effect on glioblastoma risk using different Mendelian Randomization (MR) methods, including inverse variance weighted, MR Egger, weighted median, and simple mode. The consistent slope across these methods suggests a potential protective effect of KLHL3 expression against glioblastoma risk. B The forest plot illustrates the MR effect sizes of individual SNPs associated with KLHL3 expression on glioblastoma risk, with results shown for both the MR Egger and inverse variance weighted methods. The negative effect sizes for certain SNPs further indicate a potential protective role of KLHL3 in glioblastoma. C The leave-one-out sensitivity analysis identifies the impact of each SNP on the overall MR estimate. The analysis shows that excluding any single SNP does not significantly alter the protective effect, indicating the robustness of the MR findings. D The funnel plot evaluates the potential for directional pleiotropy in the MR analysis. The symmetric distribution of points around the inverse variance weighted estimate line suggests a low likelihood of pleiotropy, supporting the validity of the causal inference made between KLHL3 expression and glioblastoma risk

In summary, KLHL3 was selected for MR analysis from seven potential genes based on its strong differential expression and association with glioblastoma. The results indicate that higher genetically predicted KLHL3 expression may reduce glioblastoma risk, with minimal evidence of pleiotropy or confounding.

3.8 Functional enrichment analysis of KLHL3

To explore the functional implications of KLHL3 expression in glioblastoma, both GSEA and GSVA were performed, focusing on the pathways associated with differential KLHL3 expression levels.

In the GSEA analysis, significant enrichment was observed in both high and low KLHL3 expression groups. Figure 7A, B present the GSEA results for GO terms. Specifically, the high expression group showed significant enrichment in processes related to the regulation of mucins, glomerular basement membrane, and other differentiation-related processes (Fig. 7A). Conversely, in the low expression group, there was notable enrichment in pathways involving the extracellular matrix, basement membrane, and structural constituents of the cytoskeleton (Fig. 7B). Similarly, the KEGG pathway analysis further elucidated these associations. Figure 7C, D highlight the pathways enriched in high and low KLHL3 expression groups, respectively. High KLHL3 expression correlated with pathways such as cell signaling, immune cell adhesion, and phagocytosis, indicating an active role in immune-related processes (Fig. 7C). On the other hand, low KLHL3 expression was associated with pathways like complement and coagulation cascades, ECM-receptor interaction, and ribosome biogenesis, suggesting its involvement in fundamental cellular processes and extracellular interactions (Fig. 7D).

Fig. 7figure 7

Gene Set Enrichment Analysis (GSEA) and Gene Set Variation Analysis (GSVA) for KLHL3 in Glioblastoma. A GSEA results for the high expression group of KLHL3 reveal significant enrichment in pathways related to mucin transport, cell differentiation, and glucocorticoid response. The enrichment plots show a strong correlation between high KLHL3 expression and these biological processes. B GSEA results for the low expression group of KLHL3 indicate significant enrichment in pathways involved in extracellular matrix organization and structural components of the basement membrane. These pathways are particularly associated with low KLHL3 expression, suggesting a potential link to tumor microenvironment alterations. C KEGG pathway analysis in the high expression group highlights significant enrichment in pathways related to signaling in the immune system and phagocytosis. These pathways are associated with immune responses and cellular processes that may contribute to glioblastoma progression. D KEGG pathway analysis in the low expression group identifies enrichment in pathways related to complement and coagulation cascades and Fc receptor-mediated signaling. These results suggest that low KLHL3 expression may influence the immune landscape and coagulation processes in glioblastoma. E GSVA results for KLHL3 show the differential activity of various pathways across samples. The analysis reveals upregulated and downregulated pathways in relation to KLHL3 expression levels, providing insights into its role in glioblastoma biology. F Further GSVA analysis confirms the differential expression of key signaling pathways, categorizing them as either upregulated or downregulated in glioblastoma. This analysis underscores the impact of KLHL3 on critical biological processes that could be pivotal in tumor progression

GSVA results provided further insights, as illustrated in Fig. 7E, F. The analysis revealed distinct pathway activity differences between high and low KLHL3 expression groups. The t-value bar plots (Fig. 7E, F) summarized the upregulated and downregulated pathways, indicating that KLHL3 significantly influences a variety of biological processes, particularly those related to cell adhesion, immune response, and extracellular matrix organization.

These findings collectively suggest that KLHL3 plays a critical role in glioblastoma by modulating key biological pathways, with differential expression leading to distinct functional outcomes in tumor biology.

留言 (0)

沒有登入
gif