Mutations in Noncoding Cis-Regulatory Elements Reveal Cancer Driver Cistromes in Luminal Breast Cancer

Introduction

Breast cancer is the second leading cause of death in women in North America (1). Currently, treatment decisions rely on the histology and the expression of three proteins: estrogen receptor (ER) α, progesterone receptor (PR), and HER/neu (HER2; ref. 1). Approximately 80% of all breast cancers are of the luminal (ER+) subtype with luminal B (ER/PR+ HER2+) being the most proliferative and aggressive and luminal A (ER/PR+, HER2−) being the most predominant. Sixty-five percent of luminal breast tumors are ER+PR+; together, this luminal subtype makes up 52% of all breast cancers (2, 3). Large-scale analysis of whole-genome sequencing (WGS) in breast tumors has identified 99 driver genes with recurrent protein-coding alterations (4, 5) as well as a high number of mutations within the noncoding genome (4). Noncoding mutations can alter the transcription factor binding to the DNA and affect enhancer–promoter interactions to perturb gene expression (6–12). However, the inclusion of noncoding mutations to find cancer drivers remains a challenge in ER+PR+ luminal breast cancer that needs to be addressed to comprehensively resolve the role of genetic variants in oncogenesis.

The noncoding genome is known to harbor many of cis-regulatory elements, defined as binding sites for transcription factors involved in transcriptional regulation by serving as promoters, enhancers, or anchors of chromatin interactions (13). In luminal breast cancer, cis-regulatory elements are bound by key transcription factors, including ER, FOXA1, and GATA3 which have a role in maintaining the luminal phenotype as well as the growth and differentiation of breast epithelium (14). Disruption of either of these transcription factors or their binding sites can affect their binding to the chromatin (15), which can modulate downstream gene expression. A subset of transcription factors active in luminal breast cancer are classified as driver genes due to an enrichment of mutations within protein coding regions (16–18).

Mutations within regulatory elements of enhancers and promoters can be responsible for the development of disorders with the same magnitude as mutations affecting protein-coding genes (6, 19–24). A classic example of this is the TERT promoter which is frequently mutated across several cancer types as a mechanism for telomerase reactivation (25); it has been observed in 71% of sporadic melanoma and 60% to 75% of glioblastomas (6, 19–24). Variants within the TERT promoters also lead to an increased risk of breast and ovarian cancer development (26). Analysis of the Pan-Cancer Analysis of Whole Genomes (PCAWG) project showed that the long tail of infrequent noncoding mutations in promoters and distal regulatory elements converged to pathways and molecular interaction networks of oncogenic processes (10). Zhu and colleagues found frequently mutated regulatory elements in cancer genomes that interact with target genes via long-range chromatin interactions (10).

The sum of all regulatory elements bound by a transcription factor in a given cell-type has been referred to as a “cistrome” (27). Analysis of mutations across the cistromes of prostate cancer revealed a high frequency of mutations within the binding sites of key transcription factors including FOXA1, HOXB13, and androgen receptor (AR; ref. 7). In luminal breast cancer, Bailey and colleagues found 7 functionally validated mutations within the cis-regulatory elements of ESR1 that altered gene expression (8), while Cowper-Sallari and colleagues found that risk-associated SNPs in the cistrome of FOXA1 modulated the expression of downstream target genes (15). These studies highlight the key, albeit underappreciated role that cis-regulatory elements and cistromes play in tumorigenesis.

Within this study, we drew parallels to the approaches to finding driver mutations between the coding and noncoding genome. In keeping with the terminology established by The Cancer Genome Atlas (TCGA) and the PCAWG we define cancer drivers as units of the genome that are enriched in mutations more than expected by chance (5, 28). Similar to looking for hotspots of mutations within individual exons, we first focused on individual cis-regulatory elements across accessible chromatin regions. Proceeding to a broader scale, akin to looking at multiple exons that make up a gene, we explore for mutations across the cistromes of transcription factors in accessible chromatin regions of luminal breast cancer. Together, our study identified cancer driver cistromes assigned to transcription factors essential to luminal breast cancer.

Material and MethodsPatient tumor samples

Twenty-six primary tumors with cancer cell content higher than 60% estimated by a central pathologist using hematoxylin and eosin (H&E) staining were obtained from surgical specimens of patients with ER+PR+ invasive ductal carcinoma. Patients' consent and tumor stratification were obtained through University Health Network (UHN) living biobank under REB # 16–5524.

Tumor processing and Assay for Transposase-Accessible Chromatin using sequencing library preparation

Breast tumors were minced into small pieces and digested at 37°C, in mammary Epicult (STEMCELL Technologies,) media supplemented with 10% FBS (WISENT) and collagenase (STEMCELL Technologies), and further dissociated in 5 mg/mL dispase for 2 minutes. Cells were counted and live cells sorted into two populations, immune and malignant cells enriched using sytox blue (ThermoFisher Scientific) and anti-CD45 antibody (ThermoFisher Scientific). Five to 50,000 were used for Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) library preparation as described previously (29). Briefly, cells were lysed for 5 minutes followed by transposase reaction and library amplification using Nextera DNA Library Prep Kit (Illumina). Libraries were then size-selected (240–360 bp) using PippinHT (Sage Science) and sequenced (NextSeq 550) using 50 bp single reads at a coverage of 40 to 80 million reads (Supplementary Table S1).

ATAC-seq and data analysis

Reads were aligned to hg19 using bowtie2/2.0.5 using default parameters. Aligned reads were then filtered by removing duplicated and mitochondrial reads using samtools/0.1.18. We then used MACS2/2.0.10 (30) to call accessible chromatin peaks using the following parameters: macs2 callpeak -t -g hs –keep-dup all -n -B –nomodel –SPMR -q 0.005 –outdir . Peaks from all of the samples were then merged to create a catalog of accessible chromatin using bedtools merge using the following parameters: cat *_peaks.narrowPeak | bedtools sort -i stdin | bedtools merge -i stdin > Catalogue.bed.

We then generated a signal matrix by mapping the max peak signal from each sample to the full catalog of accessible chromatin using the map tool from bedtools v2.27.1.

Enrichment of genomic features in accessible chromatin regions

The accessible chromatin regions from ATAC-seq, represented using a BED file, were used as input for cis-regulatory element annotation system (CEAS) v1.0.2 (31) along with hg19 refGene, running the default chromatin immunoprecipitation (ChIP) Region Annotation and Gene-centered Annotation modules. Similarity between ATAC-profiles was estimated using all unique peaks in a pair-wise comparison between samples. A cosine similarity was used to negate the differences in global peak amplitudes and compare the relative amplitudes.

Motif enrichment

We analyzed motif enrichment using CentriMo from the Meme-suite tool version 4.9.0_4 and as a reference, we used the JASPAR_CORE_2016.meme database. This analysis was run on multiple catalogs. First, we run PM-Lum and TCGA_Lum catalogues using as a background a catalogue of publicly available DNaseI sensitive sites identified in several cell lines. The DNaseI sensitive sites were downloaded from the Encyclopedia of DNA Elements (ENCODE). Next, we ran the same analysis on PM_Lum accessible chromatin regions that overlapped mutations from International Cancer Genome Consortium (ICGC)_EU and ICGC_US datasets using as a background the full PM_Lum Catalogue.

Hotspot of cis-regulatory, significantly-mutated elements

In order to identify mutation enrichment within noncoding regions, we developed an algorithm that uses an exact binomial test for each region of interest against a sample-wide noncoding background mutation rate (https://github.com/pughlab/BCa_ATACSEQ_Project/tree/main/HoRSE). We first define the search space as the overlap between cis-regulatory elements and the ATAC catalog, as well as separate variants into noncoding and coding based on the University of California Santa Cruz (UCSC) hg19 known gene annotations. By tiling a 5 kb window across the cis-regulatory elements for the search space, we fit the number of variants found within the tiled cis-regulatory elements to a poisson model to estimate the average background mutation rate. We also used a 5 kb sliding window approach to identify the loci within each cis-regulatory element with the highest mutation burden. The highest mutation burdens were compared with the background mutation rate using an exact binomial test and corrected for multiple hypothesis testing using an FDR correction.

Mutation enrichment at motif sites

To analyze the enrichment of mutations at motif sites, we used a modified version of the previously published tool mutation enrichment at motif sites (modMEMOS; https://github.com/pughlab/BCa_ATACSEQ_Project/tree/main/modMEMOS; ref. 7; Supplementary Fig. S3). First, similar to the previous version, we scanned for motif sites using either the PM_Lum ATAC-seq Catalog or PM_Lum ATAC-seq Catalog that overlap publicly available ChIP sequencing (ChIP-seq) data run on MCF7 using MOODS/1.9.2 tool (32). The previously published version of MEMOS assumed a normal distribution of number of mutations, however, due to the low number of mutations within cis-regulatory elements, we adopted a poisson distribution to better fit our data. Additionally, MEMOS established the null distribution of mutation enrichment by randomly sampling from the entire genome followed by adding a flanking region, resulting in the potential for the background region to include the target regions. We address this by adding the maximum flanks (1,000 bp) to all motif recognition sites first, and then restricting sampling to all regions that do not overlap the ENCODE blacklist regions as well as all original motifs ± 1,000 bps. Finally, MEMOS estimates its P value for motif enrichment by calculating the distance of the number of mutations within the target cistromes from the standardized mean of the null distribution. Due to the low number of mutations within some of our cistromes, we opted for a confidence interval approach by resampling the target and background regions, followed by calculating mutation enrichment within the resampled regions and estimating the effect size of enrichment using Cohen D.

From a technical perspective of modMEMOS, we added a flanking region (0–1,000 bp) to Motif sites/ChIP peak centers using Bedtools slop and resampled the resulting bedfiles 100 times, taking 80% of the bedfiles each time. In parallel, we generated a background bedfile by randomly shuffling all of the motif sites ± 1,000 bp while excluding the Motif sites/peak center ± flanking region as well as the ENCODE blacklist regions. Similar to Motif sites/ChIP peak centers, the background bedfile was resampled 100 times, taking 80% of the regions each time. Taking into consideration our regions of interest and background file we identified the regions that overlapped mutations from ICGC-EU and US datasets, and counted the number of mutations for each transcription factor site and flanking region. Finally, we compared the mutation counts from the region of interest with the background and calculated Cohen D using the following equation: “Mean difference/pooled SD”. We determined the enrichment threshold based on the Cohen D median.

Intra-genomic replicates

To predict the effect of single-nucleotide variant (SNV) on transcription factors binding affinity, we run the Intragenomic replicates (IGR) tool (15). In summary, IGR uses ChIP-seq data of the transcription factor of interest to analyze the change in signal intensity in regions harboring SNVs compared with surrounding regions. Herein, we analyzed the binding affinity of the transcription factor that binds sites found to be enriched in mutation. Our regions of interest were the transcription factor binding sites ± 100 bp flanking regions. We used the ICGC-EU mutation dataset as the SNV file.

Essentiality screens

Project Achilles genome-wide short hairpin RNA (shRNA) essentiality screen data was downloaded from the DepMap portal, specifically the “Achilles” dataset (33, 34). The analysis was focused on breast cancer cell lines that showed consistency in subtyping according to all three genesets PAM50, SCMOD2, and SCMGENE. The probability of essentiality was used as a score 1 being most essential and 0 nonessential.

Identifying luminal-specific essentiality

Enrichment of essentiality for one breast cancer type compared with the rest was calculated using an approach inspired by gene set enrichment analysis (GSEA; ref. 35). The probability of essentiality (Pe) values were assigned a direction based on whether they were part of the cancer type of interest (COI; positive) or not (negative). A curve was fitted to the ordered Pe list and the AUC was calculated. An exact p value for each cancer type was calculated using a permutation test (n_perm = 1,000) where the cancer type index was randomized and the AUCs recalculated. All p values were corrected for multiple testing using FDR. The standardized AUC was calculated based on a minimum (min)/maximum (max) AUC range, where the min is defined as Pe = −1 for all non-COIs and Pe = 0 for all COIs, while the max has Pe = 0 for all non-COIs and Pe = 1 for all COIs.

Ethics approval and consent to participate

The UHN Ethics Board operates in compliance with the Tri-Council Policy Statement reviewed and approved this project REB #16–5524.

Availability of data and material

The ATAC-seq raw data generated from our PM_Lum cohorts were uploaded to European Genome-Phenome Archive (EGA) under accession code: EGAS00001005235. Availability of codes: https://github.com/pughlab/BCa_ATACSEQ_Project

ResultsComprehensive chromatin accessibility analysis in primary ER+PR+ luminal breast cancer

To identify cis-regulatory elements, we used ATAC-seq (29, 36) to map the accessible chromatin of 26 luminal primary ER+PR+ invasive ductal carcinomas breast tumors freshly collected at the Princess Margaret Cancer Centre (PM_Lum; n = 26; Supplementary Table S1). To enrich for malignant cells, we used flow cytometry to sort cells from dissociated tumors using the anti-CD45RO (anti-CD45) antibody (Fig. 1A). In the immune-depleted (CD45−) cancer cells, we identified a catalogue of 99,516 (41.37 Mb) unique nonduplicated regions found in accessible chromatin as defined by ATAC-seq peak coverage called using MACS2 (ref. 30; Supplementary Table S2). Furthermore, we observed that 98% of peaks from our catalog were found in more than half of the PM_Lum cohort (Supplementary Fig. S1A). Moreover, we examined the size distribution of regions of accessible chromatin in the PM_Lum samples compared with our catalog, we show that the size distribution seen in catalog is similar to the one seen in the individual samples (Supplementary Fig. S1B, top). This similarity suggests that our catalog is recapitulating an accurate ATAC-profile to those seen in our samples, rather than being skewed to smaller peaks. To profile this similarity further, we calculated the fraction of overlap between our catalogue and PM_Lum cohort and show an average of 58.7% ± 12.4% overlap (Supplementary Fig. S1B, bottom).

Figure 1.Figure 1.Figure 1.

Identifying chromatin accessibility in ER+PR+ breast cancer. A, Primary tumors were minced and dissociated for subsequent flow sorting into immune and epithelial cell populations, followed by ATAC-seq profiling. B, Heatmap showing similarities between ER+PR+ accessible chromatin profiles. Cosine similarity analysis was calculated using comparing all chromatin accessibility of samples with each other. Bar plot showing number of called peaks per sample. C, Bar plot showing the number of accessible chromatin regions from TCGA_Lum datasets that overlapped PM_Lum in blue and the ones unique to each cohort in red. D, A graph showing the chromatin accessibility saturation curve. A nonlinear regression model analysis was performed using the number of unique ATAC peaks discovered in each sample to estimate the percentage of accessible chromatin mapped in PM_Lum (Purple; n = 26 samples), TCGA_Lum (Blue; n = 41 samples), and luminal cell lines (MCF7 and T47D, Orange; n = 2). E, Percentage of distribution of mapped accessible chromatin regions within the genome. The cis-regulatory element annotation system (CEAS) is utilized to perform genomic distribution analysis of the accessible chromatin region mapped by ATAC-seq. **p value < 0.001, two-sided t test; The box plot ranges are Q1, Median, and Q3; the whiskers are ± 1.5x the IQR. F, Bar plot showing p values for cosine similarities between PM_Lum and TCGA_Lum in comparison with immune cells' accessible chromatin. Red dotted line represents p value = 0.01, two-sided t test. G, Lollipop graph showing enriched motif families in ER+PR+ breast tumors (p value < 0.01, Fisher exact test). The catalog of 26 ATAC-seq data was used. Enrichment of motifs within ATAC-seq regions against DNaseI hypersensitive sites from several cell lines was computed. Motif families were obtained using the Jaspar database. The size of the circles represents the number of target peaks for each motif.

To examine the quality of our data, we ran a similarity pair-wise comparison between accessible chromatin profiles using cosine similarity metric. Our data indicated a high degree of agreement of accessible chromatin distributions between our PM_Lum samples (Cosine similarity μSc = 0.82 ± 0.07; Cosine similarity; Fig. 1B). To identify whether our catalog of accessible chromatin regions was representative of other ER+PR+ breast tumors, we leveraged TCGA ATAC-seq data derived from nonenriched bulk ER+PR+ tumor tissues (n = 41; TCGA_Lum; ref. 37). Compared with our cohort, TCGA_Lum showed a higher number of unique accessible chromatin regions (272,291 peaks; 289.89Mb) that encompassed 93.6% (93,172/99,516 peaks) of our PM_Lum catalogue. Of note, the PM_Lum accessible chromatin regions represented only 34.2% of the TCGA_Lum catalog (Fig. 1C), suggesting our depletion of immune cells may have enhanced the signal specific to cancer cells. Consistent with this observation, we estimated that our analysis led to the mapping of 88% of accessible chromatin within our cohort of 26 samples while the TCGA_Lum cohort reached similar saturation (87%) with 41 samples (Fig. 1D). Thus, we established a catalog from our PM_Lum cohort of high-confident accessible chromatin regions that were found across our cohort and illustrate a high level of robustness by being found almost entirely within the independent TCGA_Lum catalog.

We next characterized the genomic distribution of accessible chromatin regions across different genomic features [e.g., promoters, distal regions, coding exons untranslated regions (UTR), and intronic regions) within the PM_Lum and TCGA_Lum catalogs. Using the CEAS tool to estimate the relative enrichment level of accessible regions in gene features (31), we found that on average 36% of cis-regulatory elements mapped to promoters, 23% to introns, 23% to intergenic regions, 14% to UTR, and 3% coding exons (Fig. 1E). Using this same approach, we found that the accessible chromatin regions captured in the ATAC-seq data from the TCGA_Lum cohort had a similar distribution of intergenic regions (PM_Lum = 23%, TCGA_Lum = 26%; P = 0.10, two-sided t test) and coding exons (PM_Lum = 3%, TCGA_Lum = 3%; P = 0.42, two-sided t test). However, in contrast to the PM_Lum cohort the TCGA_Lum shows a higher distribution to introns (TCGA_Lum = 39%, PM_Lum = 23%; P < 0.001, two-sided t test) and a lower distribution to promoters (TCGA_Lum = 26%, PM_Lum = 36%; P < 0.001, two-sided t test) and UTRs (TCGA_Lum = 6%, PM_Lum = 14%; P < 0.001, two-sided t test; Fig. 1E). To assess the concordance of these patterns with two widely used cell lines, we examined the genomic distribution of accessible chromatin using DNase I hypersensitive sites from both MCF7 and T47D lines. This analysis found, relative to both cohorts of primary tumours, the cell lines had a higher percentage of open chromatin regions in distal intergenic regions and lower percentage in promoter regions (Fig. 1E). Since the accessible chromatin regions differed significantly from our set of primary tumours, we did not include them in subsequent analyses. Thus, our results highlight that both PM_Lum and TCGA_Lum accessible chromatin catalogs favor noncoding regions, where most accessible regions are found in the promoter, intergenic, and intronic sequences as opposed to the coding exons.

Considering that we used cell sorting to exclude immune cells from our tumor samples using an anti-CD45 antibody, we examined whether the difference in accessible chromatin profiles that we saw between TCGA_Lum and PM_Lum was due to immune infiltration. We tested for this immune infiltrate by comparing the similarity of the PM_Lum and TCGA_Lum profiles to a known immune reference comprised of publicly available chromatin accessibility data (DNaseI) from 12 immune-cell types [trophoblast, CD1c+, myeloid progenitors, CD14+ monocytes, T helper17, T helper1, T helper2, CD8+α T cells, naive thymocytes T cells, CD4+ α-β T cells, natural killer (NK) cells, and B cells]. Our results showed that the TCGA_Lum chromatin accessibility profile was significantly more similar to the accessible chromatin profile for 9 of the 12 immune-cell types (trophoblast, CD1c+, myeloid progenitors, CD14+ monocytes, T helper17, T helper1, T helper2, CD8+α T cells, naive thymocytes T cells) compared with the PM_Lum profile (Fig. 1F; P < 0.001, one-sided t test). The accessible chromatin profile for 3 of the 12 immune cells tested (CD4+ α-β T cells, NK, and B cells) were not significant given the fact that CD45RO is not expressed in CD4+ T cells, NK, and B cells (38). Altogether, our data suggests that although there are similarities between TCGA_Lum and PM_Lum, the cell sorting performed on our PM_Lum cohort led to a depletion of immune cells, resulting in a more cancer-cell–enriched accessible chromatin catalog.

Cis-regulatory elements work through the recruitment of transcription factors that bind to unique DNA recognition sequences. We therefore assessed the sequence composition of accessible chromatin regions from ER+PR+ breast tumors through DNA recognition motif enrichment analysis. Using the JASPAR database as a reference for motif recognition sites and the pan-cancer ENCODE DNase I hypersensitive sites as a background, we utilized the CentriMo method to identify 40 significantly enriched DNA recognition motif families, 6 of which are known to play an important role in luminal breast cancers: AP-2 (TFAP2A), Forkhead (FOXA1), STAT (STAT3), C/EBP (CREBBP), NR1 (RORA), GATA (GATA3; refs. 15, 39–41; P < 0.001; Fisher exact test; Fig. 1G; Supplementary Table S3.1). To corroborate our findings, we performed a similar DNA recognition motif enrichment analysis on the TCGA_Lum catalog. We identified 57 DNA recognition motif families enriched in this cohort; 33 of 57 overlapped with the motifs enriched in our PM_Lum catalogue, 24 of 57 were unique to the TCGA catalogue, and 7 of 40 (HSF, MyoD/ASC, RHR, MADS, NFAT, NF, and, B-ATF) were found only in the PM-Lum catalogue (Supplementary Fig. S1C; Supplementary Table S3.2). All of which have been linked to breast cancer development, tumor invasion, and drug resistance (39, 42–45). Together, these results demonstrated that our PM_Lum catalog defines a broad spectrum of motif recognition sites, 82.5% of which are also found in the TCGA_Lum catalogue and 6 which are established markers of luminal breast cancer biology, thus reflecting the luminal breast cancer specificity of our catalog.

Individual cis-regulatory elements are rarely recurrently mutated

The enrichment of mutations within promoters and enhancers of key breast cancer genes, such as TERT (6, 24) and FOXA1 (18), suggests potential for recurrent mutations in additional regulatory regions. To search for other mutations in cis-regulatory elements in ER+PR+ breast cancer, we integrated our PM_Lum catalogue with somatic mutations from 348 ER+PR+ breast cancers in two WGS breast studies (ICGC-EU; ref. 4; n = 306 and ICGC-US; ref. 46; n = 42). Of the 1,048,537 mutations found across WGS of ER+PR+ breast cancer samples from ICGC-EU and ICGC-US, an average of 1.7% (ICGC-US = 1.76%; ICGC-EU = 1.78%; 0.7%–3.4%; n_SNVs: min = 4,295, max = 35,650) were detected within our PM_Lum catalog, which comprises 1.3% of the genome (Fig. 2A). To identify whether our PM_Lum catalog captured mutations specific to ER+PR+ breast cancers, we compared the localization of mutations to 19 ICGC WGS cancer cohorts (Supplementary Table S4). We found that these additional 19 cancer types all had significantly lower fractions of mutations overlapping our PM_Lum catalog as compared with ER+PR+ breast cancer samples, with the exception of BOCA, PAEN-AU, and PRAD-UK (Fig. 2A). We then performed the same analysis using the TCGA_Lum catalog of accessible chromatin and found similar results, with luminal breast tissue having a higher percentage of mutations localized to this region when compared with other tissues (P < 0.01, two-sided t test; Supplementary Fig. S2A). These results highlight that mutations with luminal breast cancers are predominantly found within our accessible chromatin catalog, thus setting the stage for interpreting mutations in cis-regulatory elements relevant to luminal breast cancer biology.

Figure 2.Figure 2.Figure 2.

Mutation enrichment at cis-regulatory elements in ER+PR+ breast cancer. A, Box plot showing the percentage of regions from PM_Lum catalog overlapping mutation calls from WGS from multiple cancer types. The box plot ranges are Q1, Median, and Q3; The whiskers are ± 1.5x the IQR. B and C, Manhattan plots indicating regulatory regions significantly enriched in mutations using our in-house algorithm. The PM_Lum catalogue was used as accessible chromatin targets and the ICGC_EU WGS (B) or ICGC_US (C) was used as mutation calls. Dotted lines indicate q = < 0.01, exact binomial test.

To identify highly mutated regulatory elements in ER+PR+ breast cancer, we analyzed frequently mutated regulatory elements using the ActiveDriverWGS method (10). Restricting our analysis to our PM_Lum catalog as the target regions, we found no driver mutations after multiple testing correction using two separate datasets, ICGC-EU (Supplementary Fig. S2B, Supplementary Table S5.1) and ICGC-US (Supplementary Fig. S2C, Supplementary Table S5.2) WGS data (q < 0.01; FDR). By running a similar analysis on the TCGA_Lum catalog, ActiveDriverWGS identified one highly mutated distal region (chr10:8115662–8116163) using ICGC-EU (Supplementary Fig. S2D) and none using ICGC-US (Supplementary Fig. S2E). Although ActiveDriverWGS is a robust tool for calling drivers in regulatory elements, it takes a conservative one-to-one approach between mutations and active elements, negating the cumulative effect of multiple mutations within a hotspot region. To address these limitations, we designed an algorithm Hotspot of cis-Regulatory, Significantly-mutated Elements (HoRSE) that relaxes the stringency of ActiveDriverWGS by looking for clusters of hotspot mutations within regulatory elements against a background of global and local somatic mutation rates (Supplementary Fig. S2F; Online methods). Using HoRSE, we found 5 unique cis-regulatory elements enriched for somatic mutations across ICGC-EU and -US (nICGC-EU = 5, nICGC_US = 1; q < 0.01, exact binomial test) with PLEKHS1 being the only cis-regulatory element significantly enriched in both WGS cohorts (ICGC-EU: n = 12/308; ICGC-US: n = 6/42; Fig. 2B and C; Supplementary Table S5.4). Two of the somatic mutations identified within the PLEKHS1 promoter are thought to be attributed to APOBEC DNA-editing activity (4, 47). In the ICGC-EU dataset, we identified 4 cis-regulatory elements enriched for somatic mutation in addition to PLEKHS1 (Promoters of INTS2; n = 6/308, APLP1; n = 6/308, and CCDC107/RMRP; ref. 18, 47; n = 6/308, and Distal Region: chr11:129512774–129513782; n = 7/308; Fig. 2B and C; Supplementary Table S5.3). We calculated the number of samples within our cohort with ATAC-seq coverage of these mutationally enriched regions and show that they are within accessible chromatin seen across several samples (PLEKHS1; n = 4/26; INTS2, n = 15/26; APLP1, n = 15/26; CCDC107/RMRP, n = 25/26; chr11:129512774–129513782; n = 15/26; Supplementary Table S6). Additionally, by applying our algorithm on the regions covered by the TCGA_Lum catalog, we revealed 19 significantly mutated regions in the ICGC-EU dataset regions including CCDC107/RMRP (Supplementary Fig. S2G) and 3 regions in ICGC-US (promoter: RARA and 2 distal regions: chr8:98131092–98131993 and chr17:38603438–3860433; Supplementary Fig. S2H). Our results highlight the small number of recurrent mutational hotspots across all the cis-regulatory elements of luminal breast cancer. Thus, similar to how the search for driver genes is hindered when focusing on single exons, our results show the hunt for cancer drivers within individual cis-regulatory elements may be too limiting resulting in the few observed recurrently mutated regions.

Noncoding mutations reveal cancer driver cistromes in luminal breast cancer

The genome can be looked at as a collection of cis-regulatory elements that can be organized into cistromes, based either on the DNA recognition sequence content or on actual occupancy by transcription factors. As our previous analysis highlights the limitations of identifying drivers using individual cis-regulatory elements, our next step was to assess the presence of cancer driver cistromes in ER+PR+ luminal breast tumors. First, we measured the enrichment for DNA recognition motifs within the PM_Lum catalog of cis-regulatory elements found to be mutated in primary luminal breast tumors from the ICGC-EU and -US studies. This revealed significant enrichment for several DNA recognition motifs related to the JUN, FOS, Forkhead, NFAT, POU, and REL families of transcription factors across both ICGC dataset (Fig. 3A). The NF1, C2H2, IRF, and HD-CUT DNA recognition motifs were uniquely enriched in cis-regulatory elements mutated based on the ICGC-EU dataset (Fig. 3A).

Figure 3.Figure 3.Figure 3.

Mutation analysis at recognition sites of motifs enriched in ER+PR+ breast cancer. A, Lollipop graph showing enriched motif families in PM_Lum catalog overlapping SNVs from ICGC-EU (red) and ICGC-US (blue) against the total PM_Lum catalog (p value < 0.01; grey: p value > 0.01, Fisher exact test). B and C, graph (top) and heatmaps (bottom) showing the enrichment of mutations at DNA recognition sites found to be significantly enriched in the PM_Lum catalog using ICGC-EU (B) and ICGC-US (C) mutation calls. Cohen D was calculated based on resampling and the value indicates significant enrichment. The red dotted line indicates Cohen D median.

To focus on DNA recognition motif-based cistromes relevant to luminal breast cancer, we subdivided cis-regulatory elements from our catalogue of accessible chromatin regions based on the presence of DNA recognition motifs enriched in mutated cis-regulatory elements across both ICGC-EU and ICGC-US datasets, namely JUN, FOS, Forkhead, NFAT, POU, or REL. We calculated the frequency of mutations across varying window sizes (0 to 1,000 bp) around the cis-regulatory elements from each of the motif-based cistromes using modMEMOS (modMEMOS and Flanking Regions; Supplementary Fig. S3; Online methods; refs. 7, 18). We estimate the effect size of mutation enrichment in DNA recognition motifs compared with a background model using Cohen D, a statistical value that represents the standardized difference between two means. Using a window of 50 bp flanking the motif recognition sites, as defined by the work from Mazrooei and colleagues (7, 18), we found an enrichment for mutations near the JUN, FOS, and Forkhead motif-based cistromes in both ICGC-EU (Fig. 3B) and ICGC-US data sets (Fig. 3C). Additionally, cis-regulatory elements proximal to POU motif cistrome were found to be enriched in mutations uniquely in the ICGC-EU dataset (Fig. 3B). These results suggest that noncoding mutations preferentially accumulate across cis-regulatory elements that harbor specific DNA recognition motifs, namely JUN, FOS, or Forkhead motifs.

Given that transcription factors of the same family can bind the same DNA recognition motif, we explored the transcription factor–based cistromes to examine whether these variants are targeting transcription factor–binding sites specific to breast tumors. We leveraged the publically available collection of ChIP-seq datasets of transcription factors (n = 48) and cofactors (n = 30) from the luminal breast cancer cell line (MCF7; ref. 48) to identify luminal specific transcription factor–based cistromes. We first clustered all cistromes according to their similarity in ChIP-seq signal across our catalog of cis-regulatory elements from luminal breast tumors and identified 7 distinct clusters (Supplementary Fig. S4), including one consisting of the ER, FOXA1, and GATA3 transcription factors (TFs_1). We next used modMEMOS to quantify the enrichment of mutations over these cistromes. Using the mutation calls from the ICGC-EU dataset, we identified 28 cancer driver cistromes (AHR, AR, CEBPB, CREBBP, CTCF, ELF1, ER, FOSL2, FOXA1, FOXM1, GABPA, GATA3, JUND, MAX, MYC, NR2F2, REST, TCF12, TEAD4, TFAP2A, TFAP2C, and ZNF217; Fig. 4A). We further refine these transcription factor–based cistromes by including only the cis-regulatory elements that harbor a matched DNA recognition site for the designated transcription factor family. Using modMEMOs on these DNA recognition site-specific transcription factor–based cistromes, we identified 10 cancer driver cistromes (CTCF, ELF1, ER, FOSL2, FOXA1, FOXM1 GATA3, JUND, TFAP2A, and TFAP2C) that are enriched in mutations in both the ICGC-EU (Fig. 4B) and ICGC-US (Fig. 4C) datasets. Consistent with the motif-based cistromes that we identified as cancer drivers (Fig. 3B and C), we observed a similar enrichment of most, but not all transcription factor–based cistromes that compose each motif family (Forkhead, JUN, and FOS), with the exception of the REL motif family (Supplementary Fig. S5). Furthermore, we found that in the majority of cases, not all transcription factor–based cistromes for a given motif family defined cancer driver cistromes (e.g., Forkhead motif family). Rather mutations were found to be enriched in specific transcription factor–based cistromes (Supplementary Fig. S5). Altogether, our results highlight that key transcription factor–based cistromes are cancer drivers independent of their motif families or based on their similarity to other cistromes, indicating the mutations are selectively enriched within specific driver cistromes.

Figure 4.Figure 4.Figure 4.

High enrichment of mutations at cistromes of key transcription factors involved in ER+PR+ breast cancer. Heatmaps showing enrichment of mutations at ChIP-seq peak centers and flanking regions (0–1000 bp) using ICGC-EU WGS dataset (A), transcription factor–binding sets using ICGC-EU (B), and ICGC-US WGS datasets (C). Cohen D was calculated based on resampling and the value indicates significant enrichment [Enrichment > Median (Cohen D)]. Transcription factors showing a consensus in mutation enrichment are marked in bold.

We next examined if the noncoding mutations within or flanking (100 bp) the DNA recognition motif found within the cancer driver cistrome for CTCF, TFAP2C, GATA3, FOXA1, ER, FOSL2, JUND, TFAP2A, ELF1, and FOXM1 could alter transcription factor binding to the chromatin. Using the IGR method (15) predicted that less than 40% of the noncoding mutations could alter the binding intensity of any of these transcription factors to the chromatin (CTCF: Down = 36%, Up = 15%; TFAP2C: 31%,28%; GATA3: 19%,10%; FOXA1: 14%,17%; ER: 18%,9%; FOSL2: 27%,13%; and JUND: 14%,5%; TFAP2A: 32%,20%; ELF1: 33%,18%; FOXM1:20%,13%; Supplementary Fig. S6). These results argue that despite the enrichment of mutations observed over transcription factor–based cistromes, only a minority of these mutations can directly impact the binding affinity of transcription factors to cis-regulatory elements.

Cancer driver cistromes correspond to transcription factors essential to luminal breast cancer

To better understand why specific transcription factor–based cistromes are enriched for noncoding mutations in luminal breast cancer, we examined whether this enrichment reflected the dependency to some as opposed to all transcription factors expressed in luminal breast cancer. Using the genome-wide shRNA essentiality screen data from luminal breast cancer cell lines generated as part of the DepMap project (33, 34), we found that 4 of the 10 transcription factors linked to cancer driver transcription factor cistromes were exclusively essential in luminal breast cancers (GATA3, ESR1, FOXA1, TFAP2A) and 5 additional transcription factors were essential in all breast cancers, regardless of subtype (CTCF, FOXM1, TFAP2C, JUND, and FOSL2; Fig. 5, P < 0.05). ELF1 was the only transcription factor linked to a cancer driver cistrome not essential in luminal breast cancer cells (Fig. 5). While we found that the CREBBP and CEBPG transcription factors were essential preferentially in luminal breast cancer, we did not identify these transcription factor cistromes as cancer drivers as they were only significantly enriched in mutations in the ICGC-EU dataset. Altogether these results support the identification of cancer driver cistromes based on transcription factors that are essential to the growth of luminal breast cancer.

Figure 5.Figure 5.Figure 5.

Cancer driver cistromes are of transcription factors essential to luminal breast tumors. A heatmap showing the probability of the essentiality of the transcription factor in several breast cancer cell lines with different subtypes (Luminal, triple-negative breast cancers (TNBC), and HER2). Column annotation indicates the enrichment of mutations at binding sites ± 50 bp, and rows annotation shows cell line subtype.

Discussion

Our study depicts the cancer driver cistromes specific to luminal ER+PR+ breast cancers as identified by an enrichment of noncoding mutations flanking DNA recognition motifs of cis-regulatory elements accessible in luminal breast tumors. Using flow-sorting to enrich the cancer cell population, we generated a robust catalogue of luminal-enriched accessible chromatin regions. Within this catalogue, we identified seven recurrently mutated cis-regulatory elements that occur at a low frequency. By expanding our search to transcription factor–based cistromes, we identified 10 cancer drivers and showed that a minority of the noncoding mutations can directly impact the transcription factor binding to cis-regulatory elements. Finally, we show that 9 out of the 10 transcription factor cistromes are essential to breast cancer, and 4 of which are specific to luminal breast cancer.

Somatic variants and genomic rearrangements affecting the protein-coding regions of luminal breast cancers have been well characterized (4, 5, 49, 50), these regions account for less than 2% of the genome (51, 52). The importance of acquired genetic variants found in cis-regulatory elements is highlighted in a luminal breast cancer study by Bailey and colleagues (8) and across multiple breast cancer subtypes by Rheinbay and colleagues (18). Bailey and colleagues identified several somatic mutations with functional consequences within the promoters and enhancers that regulate the ESR1 gene (8). The study by Rheinbay and colleagues describes somatic mutations across several promoters, including FOXA1, and their effect on gene expression (18). Our analysis of the mutation burden within luminal ER+PR+ breast cancer cis-regulatory elements yielded only seven significant hits. Across both the ICGC-US and -EU cohorts, we found significant enrichment of mutations in the PLEKHS1 promoter that is likely a result of APOBEC DNA-editing activity (4), however, m

留言 (0)

沒有登入
gif