Disruption of metazoan gene regulatory networks in cancer alters the balance of co-expression between genes of unicellular and multicellular origins

Tracking changes in co-expression between unicellular and multicellular genes in cancer

Alterations to the structure of metazoan GRNs that evolved to support multicellularity should manifest as differences in patterns of co-expression of unicellular and multicellular genes in tumors relative to normal somatic tissues. To investigate, we constructed a three-step pipeline to combine sequence orthology data with gene co-expression networks to study how genes of unicellular or multicellular origin are co-regulated in cancer (Fig. 1A).

Fig. 1figure 1

WGCNA-derived gene co-expression modules in tumor and normal samples. A Identification and classification of gene co-expression modules. (1) Phylostratigraphy was applied to assign unicellular or multicellular origins to genes. (2) WGCNA was used to cluster genes into co-expression modules based on correlation in expression. (3) Modules were classified according to the degree of enrichment of genes of unicellular or multicellular origin. B Summary of number of modules per TCGA cohort, with points colored according to tumor type. C Mean number of genes in tumor and normal modules per TCGA cohort. Color key is the same as in (B). D Distribution of numbers of unicellular-enriched, multicellular-enriched, and mixed UC-MC modules across TCGA cohorts. E KEGG and reactome terms found to be most frequently enriched in modules across all TCGA cohorts. The total numbers of tumor types containing a unicellular, mixed UC-MC, or multicellular module enriched for each term are shown on the x-axis

We first assigned evolutionary ages to human genes using phylostratigraphy, which employs phylogenetics and multiple sequence alignments to estimate the point in evolutionary time where a gene emerged [6]. Phylostratigraphic analysis revealed cancer genes tend to be highly conserved and enriched in particular for genes that emerged in unicellular organisms and during the early stages of metazoan evolution [28]. We divided species into 16 clades representing the major stages of evolution from single-celled organisms to modern humans (Fig. 1A (1)), as we have previously done [5]. Briefly, for each human gene, protein sequence alignments from OrthoMCL [29] were used to identify the most distant clade from humans in which species with high-confidence orthologs could be found (the “Methods” section). Genes with orthologs in bacteria or single-cell eukaryotes (clades 1–3) were designated as having unicellular origins (UC genes). Genes with orthologs in other multicellular species only were classified as being of multicellular origin (MC genes). Clade assignments are provided in Additional file 1: Dataset S1.

Next, we inferred the structure and topology of GRNs in both tumors and normal somatic tissues by applying the weighted gene co-expression network analysis (WGCNA) algorithm [30]. WGCNA identifies “modules” of genes with correlated expression across multiple samples (Fig. 1A (2)), implicating underlying coordinated regulatory mechanisms exist that connect those genes into identifiable co-expression modules. WGCNA co-expression modules were generated for each of the 31 solid tumor types with available RNA sequencing data in The Cancer Genome Atlas (TCGA). For the 16 TCGA cohorts with data for at least 10 normal samples, we generated both tumor and normal gene co-expression modules (bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), esophageal carcinoma (ESCA), head and neck squamous cell carcinoma (HNSC), kidney chromophobe (KICH), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), prostate adenocarcinoma (PRAD), rectum adenocarcinoma (READ), stomach adenocarcinoma (STAD), thyroid carcinoma (THCA), uterine corpus endometrial carcinoma (UCEC)). Here, gene expression data from matched pairs of tumor-normal samples were used to identify co-expression modules. For cohorts with no normal samples available, only tumor co-expression modules were built, using gene expression data from all available tumor samples.

Finally, co-expression modules were classified according to the prevalence of unicellular or multicellular genes (Fig. 1A; the “Methods” section) into three categories. Modules enriched in UC genes were defined as being unicellular-enriched, while modules enriched in MC genes were defined as multicellular-enriched. Modules not enriched in either UC or MC genes were placed into the mixed UC-MC category and represent regions of human GRNs where there is communication between genes of different evolutionary ages.

In total, 16 to 92 modules were obtained per TCGA tumor-type cohort, for a total of 1503 modules (337 from normal tissues and 1166 from tumor tissues). On average, there were more modules in tumor samples (37.61 per cohort) than in normal samples (21.06 per cohort) (Fig. 1B) (two-sided Wilcoxon test p-value = 4.77 × 10−5). This trend held even when only cohorts with matched tumor and normal samples were considered (two-sided Wilcoxon test p = 6.87 × 10−5) (Additional file 2: Fig. S1). The number of modules detected per cohort was not a function of sample size (Additional file 2: Fig. S2). Tumor modules were typically about half the size of normal modules, with an average of 163.4 genes per tumor module versus an average of 278.1 genes per normal module (two-sided Wilcoxon test p-value = 1.63 × 10−5) (Fig. 1C). Again, this was true when comparing only cohorts with matched normal samples (two-sided Wilcoxon test p-value = 6.87 × 10−5) (Additional file 2: Fig. S3) and was independent of cohort size (Additional file 2: Fig. S2). Generally, the modules obtained from WGCNA provide comprehensive coverage of the transcriptome from TCGA cohorts, with more than 90% of genes assigned to modules in 28 of 31 tumor types (Additional file 2: Fig. S4). These patterns are consistent with the greater heterogeneity seen between tumors and the breakdown of links between different GRN regions in cancer.

The higher number of co-expression modules derived from tumor samples was due to an excess of modules belonging to the mixed UC-MC category, which contains a balanced mix of unicellular and multicellular genes. An average of 24.97 mixed UC-MC modules were obtained per TCGA tumor cohort, ranging from 6 to 82 per tumor type (Fig. 1D, Additional file 2: Table S1). Collectively, 774 of the 1166 (66.38%) modules found across all tumor cohorts combined were mixed UC-MC. In marked contrast, only 137/337 (40.65%) of the combined modules from all normal cohorts together were mixed UC-MC (Additional file 2: Fig. S5) (Fisher exact test p = 4.87 × 10−17, odds ratio = 2.88 (2.23–3.73)). Unicellular-enriched modules comprised 185/1166 (15.87%) of modules from tumor samples whereas 207/1166 (17.75%) of tumor modules were multicellular-enriched (Additional file 2: Table S1). In contrast, 84/337 (24.93%) modules from normal cohorts were unicellular-enriched while 116/337 (34.42%) were multicellular-enriched, significantly different proportions from tumor modules (Fisher exact test p = 1.46 × 10−16). Most tumor cohorts had similar numbers of unicellular-enriched (mean of 5.97) and multicellular-enriched (mean of 6.68) modules, comparable to normal cohorts, which averaged 5.25 unicellular-enriched and 7.25 multicellular-enriched modules, respectively (Fig. 1D and Additional file 2: Table S1). These patterns hold even if the after connections below the median edge weight (i.e., the 50% weakest edges) are removed from each module (Additional file 2: Fig. S6A).

To understand their functional roles, all co-expression modules were tested for enrichment of terms from the Kyoto Encyclopedia of Genes and Genomes [31] and Reactome [32] using gProfiler [33]. Unicellular-enriched modules were associated with basic cellular processes such as translation, metabolism, and cell cycle, consistent with the highly conserved nature of the genes in those modules. Modules from the multicellular-enriched category were largely associated with functions specific to multicellularity, such as immunity and the extracellular matrix (Fig. 1E). Interestingly, mixed UC-MC modules were enriched for terms related to transcription and mRNA processing as well as core processes such as metabolism and cell cycle, consistent with this class of modules representing processes used by metazoans regulate gene expression of unicellular pathways. Shared genes were observed between modules across TCGA cohorts to a greater extent than would be expected by chance (Additional file 2: Fig. S7), further arguing for common biological drivers behind their formation.

Gene co-expression analysis suggests GRNs in cancers can be divided into regions predominantly composed of genes from unicellular or multicellular ancestors or regions connecting both types of genes. The differences in the prevalence and size of mixed UC-MC modules between tumor and normal cohorts suggest the onset of cancer is accompanied by substantial changes in the co-expression of unicellular and multicellular genes.

Gene co-expression analysis implicates substantial rewiring of regulatory links between unicellular and multicellular genes in cancer

To assess the frequency and magnitude of rewiring between UC and MC genes into new co-expression modules, we devised a metric to measure the extent to which modules in tumors contain unique assemblages of genes relative to their matched normal counterparts. Our metric, which we refer to as the “novelty” score, is designed to capture the relative degree of change in the co-expression between conditions represented by a given module. Briefly, it reflects the number of genes within a given tumor module that were also found together in a module from the corresponding normal tissue type (Fig. 2A; the “Methods” section). Based on this score, tumor modules were designated as low, medium, or high novelty based on their overlap with normal modules from the corresponding tissue type. Low novelty modules preserve correlation in expression between tumor and normal tissues; for example, transcriptional programs that retain activity in tumors. Medium or high novelty modules have fewer co-occurring pairs with normal modules and thus better capture co-expression patterns present in tumors but not in the corresponding normal tissue.

Fig. 2figure 2

Measuring divergence in co-expression between WGCNA modules from tumors and normal tissues. A Demonstration of calculation of module novelty scores based on overlap in gene content between tumor and normal co-expression modules. B Fraction of UC-enriched (pink), mixed UC-MC (green), and MC-enriched (blue) modules with high, moderate, or low novelty, combined across all 16 TCGA cohorts with matching normal tissue data. C Distribution of novelty scores for tumor modules enriched in unicellular genes (UC-enriched; pink), containing a mix of unicellular and multicellular genes (mixed UC-MC; green) or enriched in multicellular genes (MC-enriched; blue). Data are combined across all 16 TCGA cohorts with matching normal tissue data. D Average edge weights for the edges between pairs of unicellular genes (UC-UC; pink), between unicellular and multicellular genes (UC-MC; green), and between pairs of multicellular genes (MC-MC; blue), for all tumor modules from all TCGA cohorts combined. E Single-sample GSEA scores for tumor modules with high (blue), medium (yellow), and low (gray) novelty scores, across all 16 TCGA cohorts with matching normal tissue data

Novelty of modules was tightly associated with module age. Overall, 56.66% of all UC-enriched modules and 45.85% of MC-enriched tumor modules across all cohorts were low novelty (Fig. 2B), indicating correlations in expression between pairs of genes that are both from unicellular phylostrata or both from multicellular phylostrata are more likely to be preserved in cancer. Mixed UC-MC tumor modules had higher novelty scores than modules enriched for UC or MC genes (one-sided Wilcoxon test p-values = 1.22 × 10−11 and 4.80 × 10−5, respectively) (Fig. 2C, Additional file 2: Fig. S8). Averaging across cohorts, 41.66% of mixed UC-MC modules were high novelty (Fig. 2B), forming the majority of the high novelty tumor modules identified in tumor cohorts (Additional file 2: Fig. S9). Mixed UC-MC modules still had the highest novelty after connections below the median edge weight were removed from each module (Additional file 2: Fig. S6B).

Mixed UC-MC modules appear to capture new gene co-expression patterns not present in normal tissue, indicating regulatory links between UC and MC genes are disrupted and rewired during tumor development. We hypothesized if there is rewiring of gene regulatory networks during tumor development, there would be substantial heterogeneity in co-expression of UC and MC genes between patients. In agreement, edge weights assigned by WGCNA between UC and MC genes (UC-MC) were weaker than edges between unicellular genes (UC-UC) or between multicellular genes (MC-MC) when we averaged the strength of these connection types in each module (one-sided Wilcoxon p-values < 2.2 × 10−16 in both cases), indicating higher levels of heterogeneity and/or transcriptional plasticity in the co-expression of unicellular and multicellular genes in cancer (Fig. 2D). Modules in all gene age enrichment categories contain both unicellular and multicellular genes, meaning while connections between unicellular and multicellular genes are enriched in mixed UC-MC modules, they also contribute significantly to UC-enriched and MC-enriched modules (Additional file 2: Fig. S10).

Finally, measuring the combined expression of the genes in each module in tumors using single-sample gene set enrichment analysis (ssGSEA) [34] revealed high novelty modules had the highest ssGSEA scores, followed by medium then low novelty modules (Fig. 2E). On average, in the upper quartile of ssGSEA scores in each tumor type, 31.53% of modules were high novelty modules, 27.03% of medium novelty modules, compared to just 20.79% of low novelty modules (Additional file 2: Fig. S11). This shows high and medium novelty modules result from active gene expression, as opposed to peripheral contamination from noise in the data.

Overall, these findings indicate that the greatest differences in the co-expression between tumors and normal samples occur between genes found in mixed unicellular-multicellular modules. Mixed UC-MC modules display both high levels of expression in tumors and major differences in the co-expression patterns from normal tissues, as evidenced by their higher novelty scores. This fits a model where the balance in co-expression between unicellular and multicellular genes is consistently altered in tumors through changes in the underlying gene regulatory networks.

The presence of somatic mutations impacts the composition and topology of tumor gene co-expression modules

Cancer progression is defined by the accumulation of somatic mutations that enhance fitness, often by causing dramatic changes to the regulation of transcription. Our prior work showed somatic mutations to key hubs in GRNs disrupt communication between unicellular and multicellular genes [24]. To examine how somatic mutations affect the co-expression of unicellular and multicellular genes, we overlaid data on copy number alterations, coding insertions and deletions (indels), and predicted deleterious point mutations from TCGA onto modules generated by WGCNA.

First, we calculated the percentage of known cancer drivers from the COSMIC Cancer Census gene list [35] in each tumor co-expression module. We matched genes with the tumor type in which they are considered to be drivers. Novelty score was associated with a fraction of known driver genes, with high novelty modules containing the highest percentage of COSMIC Cancer Census genes (Fig. 3A) compared to medium (one-sided Wilcoxon test p-value = 7.01 × 10−7) and low novelty modules (p = 9.49 × 10−12). Furthermore, there was an overall strong correlation between module novelty score and percentage of known cancer drivers (Additional file 2: Fig. S12) (Spearman rho, 0.72; p-value = 4.52 × 10−23). Gene age enrichment category was also associated with the presence of known cancer genes, as mixed UC-MC modules had a higher percentage of COSMIC Cancer Census genes than UC-enriched or MC-enriched modules (one-sided p-values = 0.11 and 0.032, respectively) (Fig. 3B). These findings are consistent with somatic driver mutations leading to the formation of novel gene expression associations between unicellular and multicellular genes.

Fig. 3figure 3

Locations and frequencies of driver genes within tumor gene co-expression modules. A Percentages of the COSMIC Cancer Census gene list that are found in high (blue), medium (yellow), or low (gray) novelty modules, across tumor modules with matched normal samples. B Percentage of the COSMIC Cancer Census genes found in UC-enriched (pink), mixed UC-MC (green), or MC-enriched (blue) modules, across tumor modules with matched normal samples. C Normalized degree of recurrently amplified genes within WGCNA co-expression modules from normal (left) and tumor (right) cohorts, colored according to tumor type. D Gene age enrichment distribution for modules where > 10% of genes are recurrently amplified (top) or recurrently deleted (bottom) within the indicated tumor types. Bars show the overall percentages of such modules that are UC-enriched, mixed UC-MC, or MC-enriched. E Novelty score distribution for modules where > 10% of genes are recurrently amplified (top) or recurrently deleted (bottom) within the indicated tumor types. F Degree centrality in normal modules versus tumor modules for known driver genes in breast cancer. Points colored by the type of recurrent alteration observed in that gene in the BRCA TCGA cohort. G CIRCOS plot of the “purple” module from the TCGA low-grade glioma cohort. Points show genomic locations of genes with unicellular (red) or multicellular (blue) origins. Gray lines indicate cis and trans connections within and between chromosomes for the 100 most strongly co-expressed gene pairs in this module. Locations of the most highly connected genes (EGFR, CLIC1, ERBB2, and MYL12A) are indicated, with the strongest 50 connections to the EGFR oncogene highlighted in orange

Given the importance of copy number aberrations (CNAs) in driving gene expression changes in cancer, we assessed the prevalence of recurrently amplified or deleted genes, i.e., those gained/lost in > 10% of patients within a given subtype (the “Methods” section). Tumor modules harboring recurrently amplified genes tended to be more central in the network (i.e., have stronger edge weights) overall than did modules without any recurrently mutated genes (one-sided Wilcoxon test p-value = 0.024) (Fig. 3C) suggesting amplifications enhance the strength and consistency across patients of co-expression between unicellular and multicellular genes. In contrast, deletions did not have consistent effects on gene centrality in tumor co-expression modules (Additional file 2: Fig. S13).

The enrichment of recurrently mutated genes in co-expression modules from tumors was primarily due to their associations with amplifications and deletions. Of the 337 modules derived from normal tissues, only 2 (0.59%) were at least 10% of genes recurrently amplified in cancer. In contrast, 51 of 1166 (4.37%) tumor modules contained ≥ 10% genes recurrently amplified in the corresponding cancer type (one-sided Fisher exact p = 1.56 × 10−4, odds ratio = 0.13 (0.00–0.43)). Likewise, only 1 normal module was enriched in genes affected by deletions in cancer while across all tumor types 39/1166 (3.34%) of co-expression modules contained > 0% recurrently deleted genes (one-sided Fisher exact p = 4.32 × 10−4, odds ratio = 0.086 (0.00–0.43)). These enrichments are consistent with the capacity for copy number alterations (CNAs) to dramatically elevate or reduce the expression of genes in cancer [36].

The greatest prevalence of recurrent copy number alterations was seen in high novelty modules from the mixed UC-MC category. In seven of nine tumor types with modules with more than 10% of genes harboring recurrent amplifications, the majority of these modules were of the mixed UC-MC category. Similarly, of the modules associated with deletions, the majority were mixed UC-MC in eight of the nine cohorts with modules with more than 10% of genes harboring recurrent deletions (Fig. 3D). Amplifications were also more prevalent in modules with high novelty scores. Across all nine tumor cohorts, high novelty modules comprised 50% or more of the modules containing recurrently amplified genes (Fig. 3E). In contrast, modules with recurrently deleted genes had a broader mix of novelty scores (Fig. 3E). Therefore, some amplifications appear to cause altered co-expression of unicellular and multicellular genes, leading to the formation of high novelty mixed UC-MC modules.

We assessed how the impact of recurrently mutated genes on the transcriptomes in tumors may be mediated by their topological positions within co-expression networks represented by our WGCNA modules. We hypothesized genes driving the formation of distinct co-expression modules in tumors migrate to more central and interconnected positions compared to their original location in normal modules. Conversely, genes involved in the fragmentation of normal modules during tumorigenesis would become less connected and migrate to the periphery of tumor modules. To test this, we calculated the centrality of each gene after normalizing edge weights across modules (the “Methods” section), revealing structural differences between tumor and normal modules.

Marked changes in the centrality of genes between normal and tumor co-expression modules in key driver genes from the breast cancer (BRCA) cohort of TCGA. Multiple known breast cancer driver genes had large changes in centrality (Fig. 3F) that aligned with their established roles in tumorigenesis. Oncogenes (e.g., NOTCH1 and ERBB2) tended to increase in centrality in tumor modules while tumor suppressors (e.g., BAP1, AKT1, and EP300) decreased in centrality. Similar findings emerged in other solid tumor types (Additional file 2: Fig. S14).

Across all cohorts, amplifications had the strongest changes in centrality. Recurrently amplified genes increased centrality in 20 of the 27 TCGA cohorts where they could be found (Additional file 2: Fig. S15) suggesting that upon amplification, they drive form key hubs in distinctive new co-expression modules in tumors. Deletions were also tied to changes in gene centrality, but the impact of gene deletions appears to be context-dependent as the effects on centrality vary by cohort (Additional file 2: Fig. S16). The centrality of recurrently deleted genes was on average reduced from normal tissues to tumors in 15 of the 29 TCGA cohorts examined but increased instead in 10 cohorts. It is important to note here that WGCNA builds modules around both positive and negative correlations in expression between genes [30]. The loss of strong regulators can result in strong negative correlations captured by co-expression patterns specific to certain tumor types.

Gain or loss of large chromosomal segments can lead to simultaneous but coincidental expression changes to multiple linked genes in addition to the selected driver gene or genes [37]. While that may explain some associations between recurrently amplified genes and co-expression modules in tumors, we observed many cases where recurrent amplifications were linked to co-expression changes involving loci across the genome. An example is the “purple” module from the low-grade glioma (LGG) cohort of TCGA, which harbors the EGFR oncogene, a well-known driver of disease progression in glioma [38] (Fig. 3G). Higher activity of this high novelty mixed UC-MC module was associated with poor survival in the LGG cohort (Additional file 2: Fig. S17). The LGG purple module contained genes from every chromosome except chrY and strong co-expression of EGFR with genes across the genome (Fig. 3G). The most highly connected gene in this module is CLIC1, a chloride ion channel with high expression and potential therapeutic and prognostic value in adult glioma [39, 40]. Our work links CLIC1 expression with EGFR activity in LGG.

The extent of altered co-expression of unicellular and multicellular genes increases during tumor progression

Widespread changes in co-expression patterns between unicellular and multicellular genes in tumors indicate this phenomenon is a general feature of solid cancers. Rewiring of links connecting unicellular and multicellular in GRNs could result in the emergence of many features that accompany tumorigenesis [11, 23]. If so, such changes to the co-expression of unicellular and multicellular genes would be expected to become more pronounced as tumors progress to more malignant states.

To investigate, we tracked the differences in gene co-expression between distinct stages of tumor progression, by sorting samples according to tumor grade and degree of malignancy (Methods). Comparisons were performed in three solid tumor RNA sequencing datasets: (1) data from TCGA prostate cancer (PRAD) cohort and sets of (2) benign skin nevi and melanoma samples [41] as well as (3) malignant pheochromocytomas and benign pheochromocytomas [42]. Gene co-expression modules were generated using WGCNA for each category of samples within each dataset. To assess how gene co-expression changed during tumor progression, we used our novelty score metric to compare modules from high-grade or malignant samples to those from low-grade or benign samples from the same tumor type. Where available, module novelty scores were also computed for benign and low-grade tumors against matched normal tissues, to understand how co-expression changed during the early stages of tumor development. A comparison of gene age enrichments in modules was performed to determine whether the greatest changes in co-expression occurred in UC-enriched, mixed UC-MC, or MC-enriched modules.

For the TCGA PRAD cohort, novelty scores were calculated for modules from high-grade primary tumors (annotated as grade groups 4 and 5) compared to the modules derived from low-grade primary tumors (grade groups 1–3), as well for low-grade tumor modules against modules derived from adjacent non-diseased prostate gland tissue (the “Methods” section). The greatest increase in novelty scores occurred in high-grade primary prostate tumors relative to low-grade tumors (Fig. 4A). In contrast, the novelty scores for modules from low-grade tumors were relatively modest compared to modules from normal tissue samples. Overall, just 23.08% of low-grade modules had medium or high novelty, but 75.96% of high-grade modules had medium or high novelty (Fig. 4B), indicating the majority of changes to co-expression occur during later stages of development of primary tumors in prostate cancer. In both comparisons, mixed UC-MC modules had the highest novelty scores, demonstrating the greatest rewiring in co-expression during prostate cancer progression happens between unicellular and multicellular genes (Fig. 4A). High-grade modules were predominantly mixed UC-MC across all novelty categories (21.31%, 36.61%, and 37.70% high-grade modules were of low, medium, and high novelty modules and mixed UC-MC) were mixed UC-MC across all novelty categories (Fig. 4C).

Fig. 4figure 4

Changes in co-expression between unicellular and multicellular genes during tumor progression. A Novelty scores for unicellular-enriched (pink), mixed unicellular-multicellular (green), and multicellular-enriched (blue) co-expression modules derived from low-grade prostate tumors (Gleason grade groups 1 and 2) as compared to co-expression modules from normal (non-diseased) prostate tissue (left) and for co-expression modules from high-grade tumors (Gleason grade groups 4 and 5) as compared to co-expression modules from low-grade prostate tumors (right). Data from TGCA PRAD cohort. B Number of UC-enriched, mixed UC-MC, and MC-enriched modules with low, medium, or high novelty scores from low-grade (top) and high-grade (bottom) prostate tumors. C Percentage of co-expression modules with low, medium, or high novelty scores from a comparison of low-grade prostate tumors to normal prostate tissues (gold) and from a comparison of high-grade to low-grade prostate tumors (blue), stratified by gene age enrichment category (UC-enriched, mixed UC-MC, or MC-enriched). D Novelty score distribution for UC-enriched (pink), mixed UC-MC (green), or MC-enriched (blue) co-expression modules derived from melanoma as compared to co-expression modules derived from benign nevi, using data from Badal et al. E Percentage of co-expression modules with low, medium, or high novelty scores for benign tumors compared to normal adrenal gland tissue (pink) and malignant tumors compared to benign tumors (blue), stratified by gene age enrichment category and pheochromocytoma subtype. F Schematic of the process of using random forest model to classify adult glioma samples as LGG or GBM based on ssGSEA score of WGCNA modules derived from the TCGA glioblastoma cohort. G Power of co-expression modules derived from the TCGA glioblastoma cohort to distinguish low-grade from glioblastoma samples, expressed as distribution of mean decrease in Gini coefficient for 100 replicates of a random forest model. Gene age enrichment category is indicated by color: UC-enriched (pink), mixed UC-MC (green), or MC-enriched (blue)

To understand how co-expression of unicellular and multicellular genes differs between benign and malignant lesions, we analyzed a cohort published by Badal et al. [41], which contains RNA sequencing data from 50 primary melanomas of the skin and 27 benign skin growths known as nevi. We found mixed UC-MC modules from melanomas had significantly higher novelty scores when compared to modules derived from benign nevi than did unicellular-enriched or multicellular-enriched modules (Fig. 4D) (one-sided Wilcoxon test p-values = 8.62 × 10−4 and 0.059, respectively).

To investigate the full spectrum of co-expression changes during the progression from normal tissue to benign lesions to tumors, we created separate sets of WGCNA modules from RNA sequencing data from benign and malignant pheochromocytoma samples [42], as well as healthy adrenal tissue (the “Methods” section). Tumor samples were further divided into the MAML, SDH, and VHL subtypes, which have distinct driver mutations. Across all subtypes, modules derived from benign tumors showed low novelty as compared to modules from normal adrenal glands, regardless of gene age enrichment. However, novelty scores were markedly higher when comparing modules between malignant and benign pheochromocytomas with mixed UC-MC modules on average having higher novelty than unicellular-enriched and multicellular-enriched modules (Fig. 4E) (Wilcoxon test one-sided p-values = 5.88 × 10−8, 5.15 × 10−5, 9.62 × 10−15, comparing mixed UC-MC with UC-enriched in MAML, SDH, and VHL samples, respectively, and 0.0019, 0.10, 1.27 × 10−13, comparing mixed UC-MC with MC-enriched in MAML, SDH, and VHL samples, respectively). The melanoma and pheochromocytoma data thus support the notion gene regulatory connections between unicellular and multicellular genes become progressively more altered as tumor cells develop into more malignant states.

Mixed unicellular-multicellular gene co-expression modules best distinguish low-grade from high-grade glioma tumors

The excess of high-novelty mixed unicellular-multicellular modules in high-grade and malignant tumors indicated the degree of alteration of co-expression of unicellular and multicellular genes distinguishes tumors at different stages of progression. We next hypothesized that these co-expression modules would provide significant discriminant power to classify samples based on their degree of malignancy. To test this, we focused on the low-grade glioma (LGG) and glioblastoma (GBM) datasets of TCGA. LGGs almost inevitably progress to the high-grade and fatal GBM. LGGs and GBMs thus represent linear stages in glioma evolution with clear differences in the degree of malignancy. ssGSEA scores were calculated for the 23 WGCNA modules obtained from the GBM cohort, to measure the combined level of expression of genes within each module in each of the 512 samples from the LGG cohort and the 153 samples from the GBM cohort. Module ssGSEA scores were then used to train an RF model to distinguish between LGG and GBM samples from TCGA (Fig. 4F). RF classification provides a means to determine which co-expression modules are most closely associated with disease stage in glioma. One hundred independent iterations were performed using random splits of training and testing samples each time (the “Methods” section).

The median area under the curve (AUC) of the receiver operating characteristic (ROC) curves for the 100 RF models based on GBM module ssGSEA scores was 0.814 (interquartile range, 0.818–0.894). This was comparable to the performance of RF models trained on the status of mutations normally used to grade adult glioma (median AUC = 0.814), though models based on mutations had much more variable performance (interquartile range, 0.524–0.879) (Additional file 2: Fig. S18). Thus, changes to gene co-expression have the power to accurately classify LGG from GBM samples across patients.

We expected mixed UC-MC modules would be most distinctive between LGG and GBM and therefore carry the greatest weight in the RF classifiers. To investigate, we calculated the mean decrease in the Gini coefficient for each GBM WGCNA module, a measure of the importance of a given variable to an RF model. In total, 11 of the 22 GBM modules had a mean decrease in the Gini coefficient greater than 5, indicating they made a substantial contribution to the predictive power of the RF model (Fig. 4G). Of these, 8 modules were mixed UC-MC with 40–60% genes from multicellular phylostrata (72.7%; Fisher test p-value = 0.0296). These results were subsequently validated by random forests trained and tested on orthogonal RNA sequencing data from GBM and LGG patients from the Chinese Glioma Genome Atlas (CGGA). ssGSEA scores for TCGA GBM modules could also distinguish LGG from GBM samples in the CGGA (median AUC = 0.667, IQR 0.640–0.717) (Additional file 2: Fig. S18). In the CGGA RF models, 6 of the 8 modules (75%) with a decrease in Gini > 5 had a mixed UC-MC composition (Additional file 2: Fig. S19). All 6 of these mixed UC-MC modules also had a decrease in Gini > 5 in the TCGA RF models.

The bulk of the predictive power of our gene co-expression-based RF classifiers thus derives from the mixed UC-MC class of WGCNA modules. This underscores the importance of altered co-expression and therefore altered regulatory links between genes of unicellular and multicellular origins in glioma as it progresses from low-grade to high-grade disease. Such changes are biologically meaningful, as distinctive patterns of gene co-expression clearly distinguish LGGs from GBMs in two independent cohorts.

留言 (0)

沒有登入
gif