Pan-cancer predictions of transcription factors mediating aberrant DNA methylation

Processing of the TCGA methylation data

To study DNA methylation changes in cancer, we retrieved 8425 raw methylation datasets, generated from Illumina Infinium HumanMethylation450 BeadChip (HM450), for 32 available cancer types from the TCGA resource. We processed the data with the ChAMP pipeline [19, 20] and performed normalization using noob [21] as implemented in minfi [22, 23].

We trained a quadratic discriminant analysis aiming to classify the cancer and healthy samples in two groups for each cancer type and discarded samples that were misclassified (Fig. 1a). Based on the number and dispersion of samples, we retained 13 cancer types for further analysis: bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), cholangiocarcinoma (CHOL), colon adenocarcinoma (COAD), head and neck squamous cell carcinoma (HNSC), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), prostate adenocarcinoma (PRAD), thyroid carcinoma (THCA) and uterine corpus endometrial carcinoma (UCEC) (Additional file 1: Table S1). In the case of BRCA, representative of other cancer types, we observe distinct clustering of the healthy and cancer samples and an expected broader heterogeneity of cancer samples (Fig. 1a). The results of this analysis can be visualized interactively for all cancer types on our webserver http://bardet.u-strasbg.fr/cancermethtf/ in the section “Data”.

Fig. 1figure 1

Identification of BRCA DMRs. a Principal component analysis of the BRCA cancer (red) and healthy (blue) samples and visualization of the samples discarded using a quadratic discriminant analysis (cross). b CpG and G + C content of BRCA DMRs. Density lines are represented in blue. Two categories, CpG-poor and CpG-rich, were defined according to a threshold following y = − 1.4(x − 0.38) + 0.51. c Distance of BRCA DMRs to their closest gene TSS. Two categories, proximal and distal, were defined according to a 2 kilobase (Kb) threshold. d Levels of methylation in hyper-methylated BRCA DMRs (red) or hypo-methylated DMRs (blue) in cancer versus healthy samples in the different categories. Only DMRs with at least 20% methylation change and a starting methylation mean in healthy samples above 50% for DMRs hypo-methylated in cancer and below 50% for hyper-methylated ones are shown. e Number of BRCA hyper- or hypo-methylated DMRs in the different categories. f Example of hyper- and hypo-methylated BRCA DMRs

Identification of differentially methylated regions in cancer

We identified DMRs between the cancer and healthy samples for each cancer type by first calling differentially methylated cytosines using limma [24] and then DMRs using a modified version of DMRcate [25]. Briefly, we only used cytosines not located in exons, which are not expected to contain TF-binding sites, and searched for DMRs containing at least two significant CpGs that were consistently hypo- or hyper-methylated.

We further classified DMRs according to their genomic features. Observation of the DMRs’ CpG and G + C contents revealed two distinct categories that we termed CpG-poor and CpG-rich (Fig. 1b). Observation of the distance of the DMRs to their closest gene transcription start site (TSS) also revealed two distinct categories that we termed proximal and distal (Fig. 1c). These features enable us to minimize the possible biases of the subsequent TF motif analysis due to the preference of some TF to bind specific genomic locations.

We then selected DMRs with a minimum length of 200 bp, expected for TF-binding sites, at least 20% methylation change and a starting methylation mean in healthy samples above 50% for DMRs hypo-methylated in cancer and below 50% for hyper-methylated ones (Fig. 1d). Finally, we took advantage of recently available chromatin accessibility ATAC-seq data in TCGA cancer samples [18] expecting putative TF-binding sites to be located in open chromatin ATAC-seq peaks. Therefore, we further selected hypo-methylated DMRs in cancer if they overlapped at least one ATAC-seq peak in the corresponding cancer samples and if hyper-methylated DMRs did not overlap any peak.

The vast majority of DMRs were hyper-methylated and located in CpG-rich regions (Fig. 1e), which was expected since the TCGA methylation array probes are enriched at gene promoters [26], usually CpG-rich, and changes in DNA methylation in cancer have previously been described as hyper-methylated in gene promoters. A substantial number of DMRs were also found as hypo-methylated including in CpG-poor regions (Fig. 1e), which likely represent enhancer regulatory regions bound by TFs. Examples of a hyper-methylated CpG-rich proximal DMR, i.e., promoter and a hypo-methylated CpG-poor distal DMR are shown (Fig. 1f). The results of these analyses can be visualized interactively for all cancer types on our webserver http://bardet.u-strasbg.fr/cancermethtf/ in the section “Differentially methylated regions”.

Prediction of transcription factors driving DMRs across cancer types

In order to identify TFs driving DNA methylation changes in cancer, we search for potential TF-binding sites in DMRs. TFs bind to DNA through the recognition of short DNA sequences called motifs. We therefore performed an enrichment analysis of known TF motifs using our recently developed approach TFmotifView [27]. We extracted 4928 TF motifs from a manually annotated review [28], that we could group by similarity into 434 clusters and that represent 1048 distinct TFs. TF motif logos and clusters can be visualized on our webserver http://bardet.u-strasbg.fr/cancermethtf/ in the section “TF motif clusters”. Since hundreds of thousands of TF motif occurrences can be found over the genome, we computed an enrichment of how many of our DMRs contain at least one occurrence of each motif compared to control regions with similar genomic features or contrasted hypo-methylated DMRs with hyper-methylated DMRs from the same category. We then derived a hypergeometric p-value for each motif enrichment.

We searched for TF motifs in DMRs from all genomic categories and first focused on the hypo-methylated DMRs located in CpG-poor regions distal from genes TSS representing putative enhancer regions compared to control regions. Across almost all cancer types, the motifs from the cluster JUN/FOS were highly enriched except for BRCA and PRAD (Fig. 2a). Those TFs compose the Activator Protein 1 (AP-1) family that is involved in differentiation, proliferation, apoptosis and well known in tumorigenesis [29].

Fig. 2figure 2

TF motif enrichment in hypo-methylated, CpG-poor, distal DMRs. a Pan-cancer motif enrichment. Heatmap of best enriched motifs across all cancer types using a p-value threshold of 10–3 and selecting one motif per TF using the best p-value summed across all cancers. Motif cluster and CpG content are shown. b Pan-cancer TF expression. Heatmap of the expression of corresponding TFs using positive mean FPKM difference between cancer and healthy samples. Motifs highlighted in bold with black squares have matching motif enrichment and TF upregulation. c BRCA motif enrichment. Motif enrichment in BRCA DMRs corresponding to the BRCA column in a. Motif p-values (point color) are computed using an hypergeometric test using the number of regions that have at least a motif compared to the fold enrichment over control regions. Each point represents one of the 4928 motifs used. FOX and GATA clusters are highlighted (including several FOXA1 or GATA3 motif points). d BRCA TF expression. TF expression enrichment in cancer compared to healthy samples (log2 mean FPKM). Each point represents one of the 1048 TFs used colored according to their differential expression p-value. FOX and GATA clusters are highlighted. e Expression of FOXA1 and GATA3 TFs. Dot plot showing all samples FPKM values for FOXA1 and GATA3 in cancer compared to healthy samples corresponding to the mean value shown in d

Several other TF motifs, sometimes from the same motif cluster and/or TF family, were enriched in specific cancer types, for example the FOX cluster in BRCA (Fig. 2a). In order to disentangle which specific TF among its motif cluster could drive the cancer DMRs, we integrated matching expression available from TCGA. We hypothesized that hypo-methylated DMRs, methylated in healthy samples and unmethylated in cancer samples, could be regulated in trans by TFs not or lowly expressed in healthy samples and overexpressed in cancer samples therefore binding specifically and driving hypo-methylation in cancer. We then searched for TFs whose expression was upregulated in the cancer samples compared to the healthy samples (Fig. 2b) and selected TFs that have both their motif enriched and higher expression in cancer (Fig. 2a,b, black squares). The two most enriched TF motifs with matching upregulation were FOXA1 and GATA3 in BRCA (motif p-value < 10–5 and < 10–3, respectively; expression p-value < 10–20 and < 10–27, respectively). The FOX and GATA motif clusters, containing many putative motifs, represented most of the motifs enriched in BRCA hypo-methylated DMRs located in CpG-poor regions distal from genes TSS (Fig. 2c). Out of their corresponding TFs, FOXA1 and GATA3 genes were the most upregulated in BRCA cancer samples (Fig. 2d,e). Both FOXA1 and GATA3 TFs are known markers in breast cancer [30].

When searching for motifs enriched in hypo-methylated CpG-poor DMRs either distal or proximal, the following TFs showed both motif enrichment and TF upregulation (Fig. 2a,b and Additional file 1: Figure S1): FOXA1, GATA3, RFX5 and TFAP2A in BRCA; TCF3, GRHL2, SNAI2, PATZ1 and BATF in BLCA; JUN, JUNB, FOSL1 and BATF in COAD; RUNX3, FOSL1 and BATF in HNSC; NFIX, BACH1, JUN, BATF and BATF3 in KIRC; NFIB and BATF in KIRP; FOXA1 and FOXA2 in LIHC; HSF1, FOSL1 and BATF in LUAD; NFE2L2, HSF1, GRHL1, GRHL2, TFAP2A, TFAP2C and FOSL2 in LUSC; FOXA1 and TWIST1 in PRAD; ASCL2, SOX9, CEBPB and ESRRA in UCEC. Only very few hyper-methylated DMRs were identified as CpG-poor both distal and proximal and therefore only few TFs showed both motif enrichment and TF downregulation (Additional file 1: Figure S1): KLF5 in BRCA; FOXA2 in CHOL; ARID5B in KIRC; STAT3 in PRAD; RXRG, EGR2 and ZNF263 in UCEC. Many of those TFs have previously been involved in cancer.

When searching for motifs enriched in CpG-rich DMRs, likely to contain different TF motifs due to their distinct sequence content, we contrasted the motif content of hypo- versus hyper-methylated DMRs. Hypo-methylated DMRs were enriched for similar motifs in CpG-rich categories than in CpG-poor categories (Additional file 1: Figure S2). Hyper-methylated CpG-rich DMRs were consistently enriched in G + C-rich low complexity motifs such as EGR1 or KLF in most cancer types (Fig. 3a,b and Additional file 1: Figure S2). Both EGR and KLF motif clusters, containing many putative motifs, were enriched in BRCA hyper-methylated DMRs (Fig. 3c). Out of their corresponding TFs, EGR1 and KLF10 genes were downregulated in BRCA cancer samples (Fig. 3d,e). EGR1 has been shown to have significant tumor suppressor properties in many types of cancer [31, 32] and different KLF TFs have been involved in a large number of cancers [33].

Fig. 3figure 3

TF motif enrichment in hyper-methylated, CpG-rich, distal DMRs. a Pan-cancer motif enrichment. Heatmap of best enriched motifs across all cancer types using a p-value threshold of 10–3 and selecting one motif per TF using the best p-value summed across all cancers. Only motifs with matching expression downregulation are shown. Motif cluster and CpG content are shown. b Pan-cancer TF expression. Heatmap of the expression of corresponding TFs using negative mean FPKM difference between cancer and healthy samples. All motifs represented here have matching expression and TF downregulation. c BRCA motif enrichment. Motif enrichment in BRCA DMRs corresponding to the BRCA column in a. Motif p-values (point color) are computed using an hypergeometric test using the number of regions that have at least a motif compared to the fold enrichment over control regions. Each point represents one of the 4928 motifs used. EGR and KLF clusters are highlighted (including several EGR1 or KLF10 motif points). d BRCA TF expression. TF expression enrichment in cancer compared to healthy samples (log2 mean FPKM). Each point represents one of the 1048 TF used colored according to their differential expression p-value. EGR and KLF clusters are highlighted. e Expression of EGR1 and KLF10 TFs. Dot plot showing all samples FPKM values for EGR1 and KLF10 in cancer compared to healthy samples corresponding to the mean value shown in d

The results of all motif analyses for DMRs in all categories can be visualized interactively for all cancer types on our webserver http://bardet.u-strasbg.fr/cancermethtf/ in the section “TF motif enrichment and expression”.

Correlation between DNA methylation and binding of FOXA1 and GATA3 in breast cancer cell lines

We next set out to validate experimentally if the TFs FOXA1 and GATA3 drive changes in DNA methylation in common breast cancer cell lines. We used the HCC1954 cell line derived from a primary breast tumor and the hTERT-HME1 cell line as normal mammary epithelial cells. These cell lines have been derived from women of similar age (61 and 53 years old, respectively) and hTERT-HME1 cells were derived from a healthy donor and have a normal karyotype. HCC1954 cells show high expression of FOXA1 and GATA3 at both mRNA and protein levels compared to hTERT-HME1 normal cells (Fig. 4a,b) and therefore recapitulate well their expression observed in the TCGA BRCA samples. We performed whole genome bisulfite sequencing (WGBS) on those cell lines (Additional file 1: Figure S3a and Table S2) and searched for DMRs using DSS [34] (Additional file 1: Table S3). We identified 145,826 hypo-methylated DMRs and 121,090 hyper-methylated DMRs in HCC1954 cancer cells compared to hTERT-HME1 normal cells. Of the TCGA BRCA DMRs and although the data have more limited genome coverage, 144 out of 616 hypo-methylated DMRs (23%) and 1013 out of 1457 hyper-methylated DMRs (69%) overlapped hypo- or hyper-methylated DMRs in HCC1954 versus hTERT-HME1, respectively (see examples in Fig. 4c,d).

Fig. 4figure 4

FOXA1 and GATA3 bind hypo-methylated regions in HCC1954 breast cancer cells. a Expression levels of FOXA1 and GATA3 in hTERT-HME1 and HCC1954 cells by RT-qPCR (mean ± SEM, n = 3 independent replicates; relative to RPL13A expression). b Protein levels of FOXA1 and GATA3 in hTERT-HME1 and HCC1954 cells by western blotting. GAPDH was used as an internal control for equal loading. Stars indicate nonspecific bands. c Genome browser view (chr14:75521286-75521809) of an hypo-methylated DMR in HCC1954 breast cancer cells compared to hTERT-HME1 normal cells matching a TCGA BRCA hypo-methylated DMR, FOXA1 ChIP-seq signal in HCC1954 cells in two replicates and location of FOXA1 motifs. d Genome browser view (chr2:27209983-27210462) as in c matching a GATA3 binding sites and motif. e Number of FOXA1 and GATA3 binding peaks in the different genomic categories: CpG-poor distal from gene TSS, CpG-poor proximal, CpG-rich distal or CpG-rich proximal. f Boxplots of mean DNA methylation in HCC1954 and hTERT-HME1 cells in 200 bp windows around FOXA1 or GATA3 HCC1954 peak summits that contain at least 2 CpGs and overlapping a matching FOXA1 or GATA3 motif (FOXA1 n = 4709; GATA3 n = 1671) or in all 200 bp consecutive windows along the hg38 genome containing at least 2 CpGs (n = 5,867,466). Wilcoxon p-values are indicated

We then looked for FOXA1 and GATA3 binding sites in HCC1954 breast cancer cells by performing chromatin immunoprecipitation sequencing (ChIP-seq) using FOXA1 and GATA3 antibodies in two biological replicates each (Additional file 1: Table S4). After peak calling using peakzilla [35] and filtering out false-positive peaks due to genomic amplification in the HCC1954 breast cancer cells, we obtained 13,753 and 14,257 peaks for FOXA1 replicates samples and 2095 and 3751 peaks for GATA3 replicates samples (see examples in Fig. 4c,d and Additional file 1: Table S4). Since the ChIP signal correlated well between replicates (Additional file 1: Figure S4a), we merged peak regions for further analyses yielding 16,323 peak regions for FOXA1 and 3,949 peak regions for GATA3. Although some binding sites were shared between FOXA1 and GATA3, the majority were distinct (Additional file 1: Figure S4b). Further, 53% of FOXA1 peak regions and 69% of GATA3 peak regions contained the FOXA1 and GATA3 motifs, respectively, which is a usual fraction found in TF ChIP-seq peaks. We next categorized the peak regions according to their genomic features and found that the majority of FOXA1 and GATA3 binding sites were located in CpG-poor regions distal from gene TSS, as expected for TFs binding distal regulatory regions (Fig. 4e and Additional file 1: Figure S4c,d).

Finally, we investigated the methylation patterns at FOXA1 and GATA3 binding sites. We could show that they were located in regions with low DNA methylation levels in HCC1954 cancer cells (Fig. 4f), expected for TF-binding sites in active regulatory regions. Furthermore, FOXA1 and GATA3 binding sites were significantly more methylated in hTERT-HME1 normal cells lacking FOXA1 and GATA3 expression (Fig. 4f and Additional file 1: Figure S4e). Indeed, we found that 20% and 23% of direct FOXA1 and GATA3 peak regions containing CpGs overlapped DMRs hypo-methylated in HCC1954 cancer cells compared to hTERT-HME1 normal cells (see examples in Fig. 4c, d). Of interest, 269 and 293 TCGA BRCA hypo-methylated DMRs have FOXA1 or GATA3 motifs, respectively, (representing 39% and 42% of all 695). Although we do not expect all motif occurrences in a genome to be bound, we found that 66 and 35 of direct FOXA1 and GATA3 peak regions in HCC1954 cells, respectively, overlapped those TCGA BRCA hypo-methylated DMRs (representing 24% and 12%; see examples in Fig. 4c,d).

Altogether, these data identify FOXA1 and GATA3 binding sites that correlate with DNA hypo-methylation in HCC1954 cancer cells. These regions represent putative regions where DNA hypo-methylation could be mediated by FOXA1 or GATA3 binding, although other FOX and GATA TFs are expressed in HCC1954 cells [4].

FOXA1 and GATA3 mediate DNA hypo-methylation in HCC1954 cells

We next sought to determine if FOXA1 and GATA3 are causally involved in driving hypo-methylation in HCC1954 breast cancer cells. To test this, we used CRISPR/Cas9 to knockout (KO) FOXA1 or GATA3 in HCC1954 cancer cells. We generated two independent KO lines which were validated by Sanger sequencing and Western blot (Fig. 5a,b and Additional file 1: Figure S5a,b). We performed WGBS in two independent FOXA1 and GATA3 KO clones (Additional file 1: Fig. S3b, c and Table S2, 3) and could observe a gain of DNA methylation in FOXA1 or GATA3 KO cells compared to wildtype (WT) HCC1954 cells at FOXA1 or GATA3 binding sites, respectively (Fig. 5c–f and Additional file 1: Figure S5c-f). We found 82 FOXA1 binding peak regions located in FOXA1 KO hyper-methylated DMRs (see example in Fig. 5e) and 30 GATA3 binding peak regions located in GATA3 KO hyper-methylated DMRs (see examples in Fig. 5f), which we expect to result from a direct consequence of FOXA1 or GATA3 removal. We indeed find that FOXA1 or GATA3 binding sites are significantly enriched in FOXA1 or GATA3 KO hyper- over hypo-methylated DMRs (6.8-fold enrichment with hypergeometric p-value < 10–13 for FOXA1 and 15-fold with p-value < 10–6 for GATA3). Further, 49% of FOXA1 and 43% for GATA3 binding peak regions in KO hyper-methylated DMRs also overlapped HCC1954 hypo-methylated DMRs compared to hTERT-HME1 cells, which is significant over shuffled HCC1954 DMRs (4.4-fold enrichment with hypergeometric p-value < 10–7 for FOXA1 and 6.5-fold with p-value < 10–3 for GATA3). This shows that FOXA1 and GATA3 do maintain hypo-methylated regions in cancer compared to normal cells at a subset of their binding sites.

Fig. 5figure 5

Gain of DNA methylation upon FOXA1 or GATA3 removal in HCC1954 breast cancer cells. a Western blot analysis of FOXA1 protein levels in parental HCC1954 cells (WT), a control clone (Ctrl) and two FOXA1 KO clones (KO1 and KO2). GAPDH was used as an internal control for equal loading. b Western blot analysis of GATA3 protein levels in parental HCC1954 cells (WT), a control clone (Ctrl) and two GATA3 KO clones (KO1 and KO2). GAPDH was used as an internal control for equal loading. Star indicates nonspecific bands. c DNA methylation levels in HCC1954 FOXA1 KO cells compared to HCC1954 cells (mean across samples) in 200 bp windows around FOXA1 peak summits that contain at least 2 CpGs and overlapping a matching FOXA1 motif (n = 4473). d DNA methylation levels in HCC1954 GATA3 KO cells compared to HCC1954 cells around GATA3 peak summits as in c. (n = 1598). e Genome browser view (chr19:16093263-16093982) of an hyper-methylated DMR in HCC1954 FOXA1 KO cells compared to HCC1954 cells matching hypo-methylated DMR in HCC1954 vs HME1 and FOXA1 binding sites and motif. f Genome browser views (chr1:214377303-214378552 and chr8:42296869-42299784) as in e of two hyper-methylated DMRs in HCC1954 GATA3 KO cells compared to HCC1954 cells

Finally, we sought to determine if GATA3 is able to bind methylated DNA and induce demethylation in hTERT-HME1 normal cells that do not express GATA3. To this end, we generated hTERT-HME1 cells with stable overexpression of GATA3 and validated it by Western blot (Fig. 6a). We then profiled GATA3 binding by ChIP-qPCR and observed binding at two regions of interest that are bound by GATA3 in HCC1954 cells and gain methylation in GATA3-KO HCC1954 cells (Figs. 5f, 6b). We then profiled DNA methylation at those sites by bisulfite sequencing and observed demethylation upon GATA3 overexpression in hTERT-HME1 cells (Fig. 6c). This shows that overexpressed GATA3 can indeed mediate DNA hypo-methylation in normal cells at sites bound in breast cancer cells.

Fig. 6figure 6

Loss of DNA methylation upon GATA3 overexpression in hTERT-HME1 normal cells. a Western blot analysis of GATA3 protein levels in HCC1954 cells and hTERT-HME1 cells stably transfected with a plasmid containing the GATA3 ORF (HME1-GATA3) or non-transfected (HME1-WT). GAPDH was used as a control for equal loading. b ChIP-qPCR analysis for GATA3 binding at two positive control regions CR1 and CR2 (regions unmethylated in hTERT-HME1 and HCC1954 cells with GATA3 motif and binding in HCC1954 cells), at two regions of interest ROI1 and ROI2 from Fig. 5f (regions methylated in hTERT-HME1 cells, unmethylated in HCC1954 cells, with GATA3 motif and binding in HCC1954 cells, and gaining methylation in GATA3 KO clones), and at two negative control regions NC1 (region methylated in hTERT-HME1 cells, unmethylated in HCC1954 cells with no GATA3 motif and binding in HCC1954 cells), NC2 (intergenic region unmethylated in hTERT-HME1 and HCC1954 cells with no GATA3 motif and binding in HCC1954 cells). ChIP was performed with antibody against GATA3 or a control IgG antibody. ChIP signals are represented as percentage input. Data represent mean ± SEM (n = 3 biological replicates). c Patterns of CpG methylation of ROI1, ROI2 and NC1 by bisulfite sequencing analysis in HME1 WT and HME1 overexpressing GATA3 (HME1-GATA3). White circles represent unmethylated CpGs and black circles represent methylated CpGs

留言 (0)

沒有登入
gif