Influence network model uncovers relations between biological processes and mutational signatures

Performance evaluation for GeneSigNet Evaluation of the method on simulated data

GeneSigNet infers a directed network representing information flow between genes’ expression and signatures’ exposure. We compared the performance of GeneSigNet to four competing approaches which, similarly to GeneSigNet, strive to predict influence flow between network nodes. We compare GeneSigNet to the following key approaches: (1) the independent component analysis-based method LiNGAM [19], (2) the partial correlation-based heuristic GeneNet [20] previously proposed to discover the causal structure in high-dimensional genomic data [21], (3) the Sparse Partial Correlation Selection (SPCS) approach [21] and (4) ElasticNet [38]. Additionally, we included (5) the regression tree-based approach used by GENIE3 method [39] for the inference of Gene Regulatory Networks (GRN)  [39, 40, 41]. We note that the task of GRN inference is different from inferring influence graphs considered in this study. However, since the method produces a fully connected directed weighted graph we used a procedure that reduces bidirectional edges to oriented edges by retaining the heavier edge in the pair. Since GENIE3 is the winning gene expression-based GRN reconstruction method [42], this allowed us to test if a basic adaptation of a GRN reconstruction method can provide a competitive solution to our problem.

To generate data, we implemented the data simulation schema provided in LiNGAM [19]. This approach starts with the construction of a lower triangular weight matrix representing the directed weighted interactions among the nodes (random variables) in a directed acyclic graph (DAG). This matrix is then used as the representation of the dependency patterns in the generation of the data set (for detailed description of the data generation, see Additional file 1: Section S1.5). We simulated a set for 100 random variables (nodes) with 1000 samples from multivariate distributions. The weight matrix used in the simulation contains 364 directed edges.

The performances of the methods are summarized in Fig. 4. GeneSigNet, GeneNet, LiNGAM, SPCS and ElasticNet construct a single network that is the output of the method-specific optimization procedure, thus their performances are represented by single points in Fig. 4A. The outcome of the regression tree approach can be defined differently depending on edge-weight cut-offs and its performance is thus represented by a set of points for various cut-offs. The results were similar for all cut-offs and for the remaining measures we report a representative result corresponding to 520 edges (big green triangle in Fig. 4A). GeneSigNet had the highest precision in predicting the direction of the influence. The F-score (the harmonic mean of precision and recall), was also the highest for GeneSigNet although only slightly better than the second best method, GeneNet. In addition, we also evaluated the methods on their ability to recover the values of the lower triangular matrix used to simulate the data. This measure was proposed in [19] and measures the correlation between edge weights used for the simulation and weights recovered by the algorithm. Despite inferring a smaller number of edges the GeneSigNet’ ability to reconstruct the input matrix was equal or better relative to other methods, suggesting that GeneSigNet indeed captured dominating influences using a smaller number of edges.

Fig. 4figure 4

Evaluation of the results on simulated data. In A, the horizontal axis denotes the number of selected edges in the prediction. The vertical axis denotes the precision of the compared methods. In B, we report comparisons based on F-score, ability to reproduce the influence weight values used to generate the simulation data (correlation), and the number of the inferred edges

Evaluation of the directionality inference on real cancer data

We also compared the performance of the methods on breast cancer (BRCA) and lung cancer (LUAD) data sets (for details, see the “Materials” section). The LiNGAM method was excluded from this evaluation, since it cannot deal with large genome-wide datasets. In this setting the true direction of most of the inferred edges is, unfortunately, unknown. Therefore, we leverage the fact that if a method infers a relation between a transcription factor (TF) and a target gene (TG) then the edge should be directed from TF to TG (under the assumption that TG is not a TF). Leveraging this principle, we evaluated the precision of the directionality inference utilising the ChEA database containing directed protein-DNA interactions [27]. Specifically, for each method, we used the set of ChEA edges that overlap with the edges inferred by the method. Since the number of edges overlapping with ChEA set was different in each experiment, we used the percentage of overlap as the reference to ensure fairness of the comparison (horizontal axis in Fig. 5). GeneSigNet had the highest precision. Surprisingly, GeneNet and the regression tree methods resulted in worse than random performance in the evaluation on the BRCA data. Unfortunately, in the case of real data, the weights of the reference edges are unknown so we do not have a measure corresponding to the correlation function used on simulated data.

Fig. 5figure 5

Correctness evaluation of directionality inference on real data sets. In A, the horizontal axis denotes the percent of ChEA edges for which a method made a prediction (out of all ChEA edges recovered by the method) whereas the vertical axis denotes the fraction of correctly predicted directions for these edges. In B, we provide the numerical values of the results visualized in A

Overall the relations between the methods’ performances observed in this evaluation were very similar to the trends observed on simulated data supporting the choice of our network reconstruction strategy.

Finally, we tested the robustness of our new partial higher moment strategy used in the second step of GeneSigNet, by performing a bootstrap sampling which is described in Additional file 1: Section S1.3. The approach was highly reproducible as summarized in Additional file 1: Section S1.3 and Fig. S1. GeneSigNet compared favorably to the competing approaches.

The GeneSigNet method was implemented in python, and installable package, source codes and the data sets used for and generated during this study are available at the Github site https://github.com/ncbi/GeneSigNet [37].

Analysis of the relations between mutational signatures and molecular pathways in breast cancer

We utilized breast cancer (BRCA) data collection obtained from ICGC which includes 266 cancer samples providing both whole-genome sequencing data and gene expression data (for details, see the “Materials” section: Breast cancer data). The breast cancer genomes harbor mutations mainly contributed by 6 COSMIC mutational signatures — SBS1, 2, 3, 5, 8, and 13. We further refined the mutational signatures based on mutation density and sample correlations. The mutations in BRCA are characterized by occurrences of short highly mutated regions whose origin is believed to be different compared to sparse mutations [8, 9, 43, 44, 45]. The information available from whole-genome sequencing allows for distinguishing these two types of mutation patterns and to treat such dense and sparse mutation regions differently. The post-processing of mutational signatures resulted in 6 signature groups that we use for subsequent analysis to construct the GSN – SBS1, APOBEC-C (clustered SBS2 and SBS13 corresponding to APOBEC hypermutation), APOBEC-D (SBS2 corresponding to disperse APOBEC mutations), DSB (SBS3 and clustered SBS8), SBS5, and SBS8D (dispersed SBS8). In addition to gene expressions and exposures of mutational signatures, we included a node indicating the binary status of homologous recombination deficiency (HRD) as it is assumed to lead to specific patterns of mutational signatures in BRCA [32]. We applied GeneSigNet to construct a GSN for genes, mutational signatures, and HRD status, and to find relations between these features.

Consistency of GeneSigNet results with current knowledge

Many relations uncovered with GeneSigNet are consistent with our current knowledge on mutational signatures, confirming the validity of our method. In particular, it is well appreciated that homologous recombination (HR) plays an important role in the double-strand break (DSB) repair mechanism and that HR deficiency is associated with the DSB signature [46]. Indeed, our network correctly predicted a strong positive influence from HRD status to the DSB signature (Fig. 6). In addition, GeneSigNet identified the known negative impact of BRCA1 expression on the DSB signature which is also consistent with the role of BRCA1 in HRD [46]. Furthermore, GeneSigNet captured the impact of HRD on chromosome separation, reflecting the role of homologous recombination in maintaining genomic stability [47, 48], and identified the association of APOBEC-D with telomere maintenance, consistent with the well recognized role of APOBEC mutagenesis in replication [49, 50].

Fig. 6figure 6

Subnetworks of GSN for BRCA centered on (induced by) MutStates associated with Homologous Recombination, APOBEC and SBS8. Edge and node colors are as in Fig. 1. In boxes, there are the names of the genes adjacent to a given MutState with edge weight cut-off (\(|w_|\ge 0.01\)). The genes in bold are discussed in more detail in the text and the genes having bidirected interactions with MutStates are underlined. If the adjacent genes are enriched with specific GO pathways (\(q-value<0.01\)) then only the pathway genes are provided in the box. The HRD status dominantly contributes to the mutation strength in the DSB repair MutState. Increased DSB exposure leads to an increase of the exposures of APOBEC-C and (indirectly) APOBEC-D MutStates. SBS8 mutations are linked to the deficiency of nucleotide excision repair. The thickness of the arrows represents the edge weight (average edge weight if multiple genes are in a box). An extended subnetwork including all MutStates and an extended list of genes, and GO-terms are provided in Additional file 1: Fig. S3, Additional file 2: Table S2, and Additional file 3: Table S4

Interestingly, our method linked SBS8 to the nucleotide excision repair (NER) pathway (Fig. 6). The etiology of this signature has remained unknown until a recent experimental study linked it to the NER pathway as well [25]. This demonstrates the power of the GeneSigNet method to uncover non-obvious relationships.

Untangling the interactions between APOBEC and DSB processes

Previous studies speculated that APOBEC related mutational signatures can arise in multiple different scenarios. First, double-strand breaks (DSB) created by the homologous recombination deficiency (HRD) provide mutational opportunities for APOBEC enzymes to act on the ssDNA regions, resulting in clustered APOBEC mutations [45, 51, 52]. In another scenario, a recent study attributed APOBEC-mediated hypermutations to the normal activity of mismatch repair which also involves creating ssDNA regions, generating “fog” APOBEC mutations [44]. The complex interplay between APOBEC activities and other DNA repair mechanisms is yet to be elucidated.

Focusing on the interactions of APOBEC signatures with the other MutStates and genes, we observe that GeneSigNet supports a positive influence of the DSB on APOBEC-C MutState, consistent with the assumption that double-strand breaks provide an opportunity for APOBEC mutations. Additionally, our analysis reveals that the expression level of the APOBEC3B enzyme is associated with the strength of the DSB signature. Indeed, a previous study proposed that APOBEC3 proteins are recruited to DSB sites to participate in the DSB repair process [14]. Thus, DSB contributes to an increase in APOBEC-C strength by two different mechanisms: (i) increased mutation opportunity due to ssDNA created by DSB and (ii) increased mutation probability due to increased APOBEC3B expression. Note that increased APOBEC expression would also increase APOBEC mutations in the “fog” regions proposed in [44].

On the other hand, the activity of APOBEC-D is positively influenced by APOBEC-C activity, without direct relation to DSB. In fact, GeneSigNet inferred a negative influence from HR status to APOBEC-D MutState, confirming that different mutagenic processes are involved in clustered and dispersed APOBEC mutations (Fig. 6).

APOBEC hypermutation activates regulatory T cells — implications for immunotherapy

Interestingly, GO enrichment analysis of the genes associated with APOBEC mutational signatures (genes influenced by APOBEC-C MutState) revealed significant enrichment in positive regulation of regulatory T cells (Tregs) differentiation (Fig. 6). Tregs, a subtype of T cells that suppress the immune response, are important for maintaining cell homeostasis and self-tolerance but can also interfere with anti-tumor immune response [53]. The top three genes (FOXP3, BCL6, and LILRB2) positively influenced by APOBEC-C signature are all related to such inhibitory mechanism to immune response [54, 55, 56]. FOXP3 is a transcriptional regulator playing a crucial role in the inhibitory function of Tregs. BCL6 is also essential for the stability of Tregs that promotes tumor growth. LILRB2 is a receptor for class I MHC antigens and is involved in the down-regulation of the immune response and the development of immune tolerance.

Our results help to understand a complicated role of APOBEC mutagenesis holds for immunotherapy. For example, patients with cancers displaying a high mutation burden are likely to produce tumor-associated neoantigens (mutated peptides presented at their surface) allowing them to benefit from immunotherapy [57]. In particular, the APOBEC mutational signature was identified as a potential predictive marker for immunotherapy response in some cancers [58, 59]. Yet, cells carrying a high mutation burden often develop mechanisms of immune tolerance involving activation of Tregs to protect themselves from the destruction [60, 61]. Consequently, observed increased number of Tregs in response to high APOBEC mutations may lead to resistance to immune checkpoint inhibitors [62, 63]. Thus, our finding suggests that a combined strategy targeting Tregs in addition to immune checkpoint inhibitors would be most beneficial for a better outcome in APOBEC hypermutated breast cancer tumors.

Analysis of the relations between mutational signatures and molecular pathways in Lung Adenocarcinoma

We next analyzed lung adenocarcinoma (LUAD) data using 466 cancer samples from the TCGA project. The exposure levels of 6 COSMIC mutational signatures (SBS1, 2, 4, 5, 13, and 40) present in the exome sequencing data were integrated with the RNAseq expression data of 2433 genes belonging to the DNA metabolic and immune system processes in Gene Ontology terms to uncover influence between signatures and genes (for details, see the “Materials” section: Lung adenocarcinoma data).

GeneSigNet uncovers immune response due to smoking

Two prominent mutational signatures in LUAD, SBS4 and SBS5, are assumed to result from exogenous causes [8]. SBS4 is associated specifically with exposure to cigarette smoking in lungs. SBS5 is known to accompany the smoking signature but it is also present in many other cancer types. Previous studies suggested that cigarette smoking stimulates an inflammatory response [10]. Consistent with these findings, the genes identified by GeneSigNet as influenced by SBS4 and SBS5 MutStates are indeed enriched with immune response genes (Fig. 7).

Fig. 7figure 7

Subnetwork of LUAD GSN centered on MutStates known to be related to smoking and APOBEC. The meaning of colors and boxes is the same as in Figs. 1 and 6. An extended subnetwork including all MutStates and an extended list of genes, and GO-terms are provided in Additional file 1: Fig. S4, Additional file 2: Table S3, and Additional file 3: Table S5

GeneSigNet also identified the influence of the signatures SBS4 and SBS5 on two APOBEC signatures — SBS2 and SBS13. The APOBEC signatures are associated with immune response and this relationship is consistent with the previously proposed immune activation due to smoking exposure [64]. In addition, GeneSigNet correctly captured the association of SBS13 (consequently SBS2) with the expressions of APOBEC3B and APOBEC3A enzymes, and also identified the association of SBS13 with pyrimidine related catabolic processes, potentially reflecting the fact that SBS13 involves a pyrimidine to pyrimidine mutation (Fig. 7).

Finally, tobacco smoking is known to induce GPR15-expressing T cells; although the exact role of GPR15 in response to smoking is yet to be elucidated [65]. Therefore, we investigated whether the results of GeneSigNet provide additional insights into this relation. Consistently with previous studies, GeneSigNet inferred a strong association between GPR15 and SBS4 without resolving the direction (see also Additional file 1: Fig. S4). Next, we analyzed the influence that GPR15 has on other nodes of the GSN network. The results of GeneSigNet suggest that GPR15 is involved in the negative regulation of several genes related to chemotaxis, including IL10, a cytokine with potent anti-inflammatory properties, and has a positive impact on lymphocyte migration and leukocyte mediated cytotoxicity (Fig. 8).

Fig. 8figure 8

Subnetwork of Lung Adenocarcinoma GSN centered on the node representing GPR15 gene. Expression of the GPR15 gene contributes to the activation of immune responses. The meaning of edge and node colors, and boxes is the same as in Figs. 1 and 6. An extended subnetwork including all MutStates and an extended list of genes is provided as Additional file 1: Fig. S4

GeneSigNet points to the role of DNA geometric changes for APOBEC signature SBS2

As discussed earlier, APOBEC can only act on single-stranded DNA (ssDNA). Interestingly, one of the GO terms associated with SBS2 MutState identified by GeneSigNet is DNA geometric change (Fig. 7). DNA geometric changes are local changes of DNA conformation such as bulky DNA adducts (a type of DNA damage due to exposure to cigarette smoke) or DNA secondary structures such as Z-DNA, cruciforms, or quadruplexes. Indeed, these structures often involve the formation of ssDNA regions which, in turn, provide mutation opportunities for APOBEC enzymes [66, 67, 68]. The formation of DNA secondary structures is often associated with DNA supercoiling — a form of DNA stress that is resolved by Topoisomerase 1 (TOP1). Interestingly, GeneSigNet identified a negative influence of TOP1 expression on one of the genes (XPA) contributing to this GO term. This suggests a relation between DNA stress mediated by TOP1 and APOBEC activity.

留言 (0)

沒有登入
gif