Analysis of archaic human haplotypes suggests that 5hmC acts as an epigenetic guide for NCO recombination

Archaic linkage blocks

The linkage analysis was carried out with the genotype data of 99 individuals of the Luhya people from Webuye, Kenya (LWK) [15, 16]. Only those SNPs were considered, for which the derived alleles were found in the genome of the Denisovan hominid [14] and the ancestral allele was still present in chimpanzees [17]. This resulted in a dataset of 1.9 million “archaic” SNPs. As an East-African population, the LWK genomes were unaffected by late interbreeding with Denisova h [14]. Therefore, most of the derived alleles of these SNPs should have been introduced approximately between 5.0 and 0.5 Mya (i.e., after the separation from chimpanzees but prior to the separation from Denisova h.) (Fig. 1a). This provides a time window of more than 500,000 years in which the archaic haplotypes, formed by the derived alleles, could be rearranged by recombination (Fig. 1b). The sequence of these “derived” haplotypes is preserved in the Denisova genome, while the alleles of the corresponding ancestral haplotype can be defined by the sequence of allelic states in the chimpanzee genome.

Fig. 1figure 1

Archaic linkage blocks. a Timeline for the formation of archaic haplotypes. As a rough estimate, the chimpanzee line separated from the h. sapiens line about 5 million years ago, while the Denisova h. split off around 500,000 years ago. This provides a time window of 0.5–5 million years, at which point mutations accumulated in the genome of the common ancestor of Denisova h. and H. sapien. The derived “archaic” haplotypes formed by these mutations were preserved in Denisova and serve as a reference for the recombination events rearranging these haplotypes in the human during a period of at least 500,000 years after separation. b Formation of archaic linkage blocks. NCO recombination between the derived archaic haplotype (blue), consisting of the derived alleles of these SNPs (solid circles), and the corresponding ancestral haplotype (green) created a set of haplotype variants that are still present in modern humans. In the given example the archaic linkage block consists of 6 SNPs of which SNPs 1, 3, and 6 are still in perfect linkage (block SNPs). The derived alleles of these SNPs form the core haplotype (hap), which represents the remaining fragment of the derived archaic haplotype. SNPs 2, 4, and 5 are singletons whose alleles had been shifted between haplotypes by NCO recombination. Their association to the derived haplotype is indicated by the linkage disequilibrium (D’a,hap) between their derived alleles (a) and the core haplotype (hap). c Example of an archaic linkage block. The panel shows the location of the SNPs of an archaic linkage block in reference to the intron/exon structure of the ESYT2 gene. In this visualization, the association of the derived alleles to the core haplotype (r2 = 1) is indicated by the respective r2a,hap value. The horizontal line connecting the block SNPs indicates the location of the core haplotype; single vertical lines indicate the NCO rate and location of the singleton SNPs. d Haplotype composition of the linkage block. The panel displays the haplotype composition of the ESYT2 linkage block defined for the 99 individuals of the LWK cohort. The presence of derived alleles (red circles) and ancestral alleles (white circles) for each haplotype variant is indicated. Numbers indicate the absolute frequency of the haplotype in the cohort; downward and upward arrows on top of the plot respectively indicate retracting and expanding derived alleles, a double arrow indicates a neutral allele that switched the position in several haplotypes without affecting its allele frequency; horizontal lines refer to block SNPs

Due to CO recombination, the 1,897,400 archaic SNPs of the LWK cohort were broken down into 237,312 discrete blocks (archaic linkage blocks), each comprising 2 to more than 500 SNPs (Table 1). While the majority of the archaic SNPs in the LWK cohort were found to be still arranged in the original allelic order defined by the Denisovan haplotypes (block SNPs), 876,329 SNPs were identified as singleton SNPs, presumably the result of NCO recombination (Fig. 1b). Concordant with the GC bias of gene conversion [8, 22,23,24], the majority of these singleton SNPs had indeed transmitted G:C over A:T (Additional file 1: Fig. S1a). This applied for 57% of the non-CpG and for 62% of the CpG SNPs, confirming that the allelic rearrangements of singleton SNPs in our database are likely the result of gene conversion events.

Table 1 Archaic SNPs in the LWK population (1000 genome project). The table indicates the average number of block SNPs and singletons in the archaic linkage blocks of LWK and summarizes the distribution of the archaic alleles in structurally defined genomic sub-regions as well as in regions defined by ChIP-seq tracks. Numbers indicate the absolute count of archaic SNPs detected in the region; numbers in brackets indicate the percentage in reference to the total amount of the respective SNP type

One example of an archaic linkage block is shown in Fig. 1c. The large block consists of 28 block SNPs and 10 singleton SNPs. They are located close to the 3′ end of ESYT2 and cover about 10kb, including 5 exons, of the gene. The haplotype variants formed by these SNPs are shown in Fig. 1d. Intact versions of archaic and ancestral haplotypes are generally the most common haplotype variants found in archaic linkage blocks (Additional file 1: Fig. S1b). In the case of the ESYT2 block, 46 of the haplotypes in the LWK cohort are identical to Denisovans, while 55 matched chimpanzees, representing perfectly preserved derived and ancestral haplotypes, respectively. Each of the remaining 97 haplotypes had allelic swaps in up to 3 SNP positions due to NCO recombinations.

The NCO recombination rates of these singleton SNPs were estimated by determining the linkage disequilibrium (LD) of their derived allele (a) in reference to the respective core haplotype (hap) (Fig. 1b). As r2 is biased by the allele frequencies, for this study D' has been selected as a measure for the LD. In the following, 1 – D’2a,hap was therefore used to indicate the “NCO rate” of these singleton SNPs, a proxy of their recombination rate.

Chromatin and genic sub-regions

In an initial attempt to identify links between NCO recombination rate and “function,” we correlated the NCO rate (1 – D’2a,hap) of our archaic singleton SNPs with “Genome-wide annotation of variants”-scores (GWAVA) [25]. GWAVA is a tool which aims to predict the functional impact of non-coding genetic variants. It is based on a wide range of annotations of non-coding elements (largely from ENCODE/GENCODE), along with genome-wide properties such as evolutionary conservation and GC content. As the genetic variation is also likely to be affected by selection, in parallel, we correlated the data set with “Fitness consequence of functional annotation scores” (fitCons) [26], which are indicative of the putative fitness contribution of a SNP. Although direct correlations with the entire data set were weak (rho values of the Spearman’s rank correlation were only 0.07 for GWAVA and 0.01 for fitCons scores), a correlation with the average scores of NCO rate bins revealed a significant trend for GWAVA-scores and no association with fitCons scores (Fig. 2a). The failure to detect any major impact by natural selection was not due to compromised databases, as a strong, locus-specific, association with fitCon scores was detected, when the allele set was restricted to SNPs located in the MHC genes (Additional file 1: Fig. S2).

Fig. 2figure 2

Correlation of the NCO recombination rate with functional genomic parameters. a Function vs. fitness. The NCO rate 1 – D’2a,hap was used as a proxy for the NCO recombination rate. The correlation of this parameter of about 1.9 million archaic SNPs is shown for function-associated GWAVA scores (left panel) and fitness-associated fitCons scores. The plot displays the average scores for about 1.0 million block SNPs (red dot) and 900,000 singleton SNPs (blue dots). The singleton SNPs were binned according to the 1 – D’2a,hap value defined for the LWK population. Dashed horizontal lines indicate the average scores of the entire set of archaic SNPs. Asterisks indicate significant differences between data point pairs (p < 0.05; Mann-Whitney U test). b Meiotic CO recombination hotspots. The solid line represents the average NCO rate (1 – D’2a,hap) plotted in reference to the distance of the alleles to meiotic CO recombination hotspots (defined by DMC1 ChIP-Seq tracks of human testis [20]). The dashed line of the second peak represents the hypothetical NCO rate in reference to a second recombination hotspot located at the average distance of about 67 kb. c Gene boundaries. The average NCO rate (1 – D’2a,hap), is plotted in reference their distance to the boundary of the closest annotated gene. Peaks in NCO rate are evident at both 5′- and 3′-boundaries (indicated by dashed vertical lines). Gene regions are marked in red, intergenic regions in blue; dashed horizontal line represents the average NCO rate of the entire set of archaic alleles. (d) Genic sub-regions. Gene regions (red) were further delineated into their sub-regions (5′UTRs, introns, splice sites, exons and 3′UTRs); intergenic regions (blue) were divided according to their distance to the boundaries both upstream and downstream of the genic region (0–5 kb, 5–50 kb, >50kb). Bars represent the average NCO rate (1 – D’2a,hap) in the respective region. Asterisks indicate significant differences between the indicated bars (p < 0.05; Mann-Whitney U test). e Open chromatin. The line chart depicts the association of the NCO recombination rate with open chromatin. The average NCO rate (1 – D’2a,hap) is plotted in reference to the distance of the alleles to the boundary of open chromatin (defined by ENCODE tracks of DNase I sensitivity). Gray bars indicate the number of archaic SNPs in the respective distance-bin

To further characterize the functional link between NCO recombination and “function,” we correlated the NCO rate parameter 1 – D’2a,hap with annotated regions in the genome (Table 1). A particularly strong influence on the NCO recombination rate was also expected to be observed for meiotic recombination hotspots. While these hotspots primarily represent pre-determined breakpoints for CO recombination, they often also serve as initiation sites of NCO recombination [8, 27, 28]. DMC1 is a homolog of the bacterial strand exchange protein RecA and acts as a specific marker of these hotspots [29, 30]. The NCO rate correlation with DMC1 loci in human testis defined by ChIP-Seq [20] indeed revealed extremely high 1 – D’2a,hap values for alleles located proximal to the binding sites of DMC1 (Fig. 2b). The effect, however, was very local: at a distance of approximately 1000 bp from the DMC1 binding sites, the NCO rate declined to half. With an average spacing of about 65kb, the influence of hotspots is, therefore, restricted to less than 5% of the SNPs (Table 1). Due to the erosion of PRDM9 motifs over time [31] and the incomplete coverage of the various PRDM9 alleles [32], the detected number of SNPs affected by these hotspots is likely to be an underestimate.

CO hotspots are mostly located in regions that direct genetic recombination away from functional genomic elements [33] with reduced rates observed in exons [3, 6]. Contrary to this notion, for NCO derived singleton SNPs a correlation with the distance to the closest gene revealed a particularly sharp peak of NCO rate proximal to the 5′ boundary and, to a weaker extent, also at the 3′ boundary of annotated gene regions (Fig. 2c). Further delineation of genic sub-regions corroborated this finding with high NCO rate values observed in the 5′- and 3′-UTRs as well as in the respective flanking regions located upstream and downstream of the genes (Fig. 2d). We also observed a particularly high NCO rate in exons, followed by splice sites and introns (Fig. 2d). In intergenic regions, the NCO rate dropped with increasing distance to the gene boundaries. A correlation with ENCODE tracks of DNase I sensitivity [18, 19] further revealed an increase in NCO rate in regions of open chromatin (Fig. 2e). This applies particularly for alleles in longer open stretches where a sharp NCO rate increase was observed for alleles located more than 500bp from the boundaries. However, this set comprised of less than 0.5% of the total SNPs. Furthermore, in line with prior reports [34], the recombination rate was also found to be directly associated with the local GC content and CpG islands (Additional file 1: Fig. S3a, b).

Epigenetic marks

To determine if the NCO recombination rate is influenced by epigenetic histone modifications, we superimposed the NCO rate (1 – D’2a,hap) with encode ChIP-Seq tracks of various histone modifications (H3K27me3, H3K4me1, H3K4me3, H3K4me2, H3K9ac, H3K27ac, H3K79me2, H3K36me3, H3K9me3). In order to assess their impact independently for each genomic location, separate NCO rate correlations were carried out for seven genic sub-regions (intergenic, < 5kb upstream, 5′ UTR, intronic, exonic, 3′-UTR, and > 5kb downstream regions).

Notably, all of the analyzed histone marks showed association with the NCO rate in at least some of the genic subregions (Fig. 3a, Additional file 3: Table S2). Direction and extent, however, varied strongly with the type of methylation. Depending on the general direction, the impact of the modification could be classified either as “recombination-promoting” (H3K27me3 >> H3K4me1, H3K9ac, > H3K4me3, H3K4me2 > H3K27ac) or “recombination-repressing” (H3K9me >> H3K36me3, H3K79me2). For individual histone marks, the strongest positive association was observed for H3K27me3, while the strongest negative association was observed for H3K9me3. The highest NCO rate values were detected in bivalent regions—tagged by both H3K27me3 and H3K4me3 marks (Fig. 3a, Additional file 1: Fig. S3c). Bivalent histone modifications are hypothesized to “poise” silenced genes for rapid activation [35], a state, which, according to this data, seems to prime the loci for NCO recombination.

Fig. 3figure 3

Correlation of the NCO recombination rate with epigenetic marks. a Heatmap and cumulative scores of histone and DNA marks. The impact of epigenetic marks on the NCO recombination rate was assessed by determining the average NCO rate (1 – D’2a,hap) for alleles located on tracks of histone marks (bivalent H3K27me3 & H3K4me1, H3K27me3, H3K4me1, H3K4me3, H3K4me2, H3K9ac, H3K27ac, H3K79me2, H3K36me3, H3K9me3), of 5-methylcytosine (5mC), and 5-hydroxymethylcytosine (5hmC), as well as of open chromatin defined by DNase I sensitivity. To avoid any bias due to the genomic location of the SNPs, the data set was divided into the respective genic sub-region (intergenic, upstream > 5 kb, 5′ UTR, intronic, exonic, 3′ UTR, downstream < 5 kb). Each combination of epigenetic mark and genic sub-region were tested independently using Mann-Whitney U test to determine if the overlap with the respective mark led to a significant difference in the NCO rate (Additional file 3: Table S2). Left panel: The color code in the heat map indicates the deviation from the average the NCO rate Δ(1 - D’2a,hap) of the respective sub-region in reference to regions without any mark (ranging from − 0.1 (blue) to 0.1 (red)). “ns” indicates a non-significant association; solid horizontal line separates recombination promoting marks (green arrow) from recombination repressing marks (red arrow). Right panel: The bars represent the average increase in the NCO rate score for each of the marks across the entire genome. ChIP-Seq tracks of histone marks were compiled from ENCODE data of various tissues and cell lines while tracks of 5mC, DNase I, and 5hmC marks were derived from the embryonic stem cell H1 [1, 21]. CO recombination hotspots marked by DCM1 tracks were excluded from the analysis. b Fold change in track overlap. The fold change in the overlap with 5hmC tracks (red), H3K27me3/H3K4me3-defined bivalent regions (orange), open chromatin-related DNase I tracks (blue) and 5mC-free tracks (green) and 5mC-containing tracks (gray) in response to the increase in the average NCO rate (1 – D’2a,hap) is shown. The dots indicate the fold change for the binned NCO rate average. Only singleton SNPs are shown. Dashed lines represent linear regressions; slope coefficient (m) and r2 value are indicated. Fold change for bivalency and DNase I was compiled from the ENCODE data set while 5mC and 5hmC were compiled from data provided for the H1 stem cell

To analyze also epigenetic DNA modifications, ChIP-Seq tracks for 5-methyl-cytosine (5mC) and 5-hydroxymethyl-cytosine (5hmC) were downloaded together with matching DNase I-sensitivity tracks. In contrast to the histone tracks, which were compiled from a number of different cell lines, the DNA tracks were derived from for the embryonic stem cell H1 [1, 21]. In line with expectation, an increased NCO rate was observed in open DNase I-assessable chromatin, while reduced recombination rates were detected in regions enriched for 5mC, indicating the closed state of chromatin (Fig. 3a). The strongest association however was observed for 5hmC, an intermediate of oxidative C-demethylation and marker of recently opened chromatin (Fig. 3a). While DNase I showed a positive association in nearly all genic sub-regions (Fig. 3a, left panel), the average score was substantially weaker than the score of H3K4me1, H3K27me1, and of bivalent regions (Fig. 3a, right panel), suggesting that “openness” alone may not be sufficient to promote recombination.

In all analyzed sub-regions, the gain in NCO rate was the highest when the alleles overlapped with 5hmC tracks (Fig. 3a). The average score of 5hmC in fact eclipsed the scores of all other tested markers, suggesting that 5hmC was the marker most closely associated with the local NCO rate (Fig. 3a, right panel). Plots of the NCO rate vs. the fold change of the track overlap revealed a nearly linear correlation for all markers. The coefficient was 1.09 for 5hmC and 0.85 for bivalency, compared to only 0.4 and 0.24 for DNase I and 5mC, respectively, confirming the strong impact of 5hmC and bivalency on the recombination rate (Fig. 3b).

NCO rate and 5hmC content in epigenetic sub-regions

To confirm the impact of the epigenetic environment on the local NCO rate we used an independent data set from the NIH Roadmap Epigenomics Consortium that comprises the epigenomes of 111 tissues, cells, and cell lines [1]. It allows annotation of 15 epigenetically defined sub-regions defined mostly by the presence of characteristic histone marks (Additional file 1: Fig. S4). Based on this annotation, NCO recombination rates (as defined by NCO rate) could be determined independently for each of these sub-regions. These rates could then be correlated with the fraction of open and closed chromatin (defined by DNase I and 5mC, respectively) and the overlap with 5hmC tracks (Fig. 4a). Strikingly similar patterns emerged when comparing NCO rate in the various epigenetic regions (Fig. 4a, upper panel) with the relative track-overlaps of 5hmc (middle panel). The Spearman correlation between the 5hmC distribution and NCO rate revealed r = 0.88 and p = 2E-16. In line with the prior data the association with DNase I-defined open chromatin was weaker (r = 0.67, p = 0.008). The same applied also for the 5mC tracks, which inversely mirrored the DNAse I pattern (r = − 0.65, p = 0.011).

Fig. 4figure 4

Association of 5hmC and NCO recombination rates. a Association in epigenetically defined regions. NCO rate, chromatin state and DNA marks are shown for 15 epigenetic sub-regions comprising active transcription start sites (TssA), TssA flanking regions (TssAFlnk), transcribed regions (Tx), weak transcribed regions (TxWk), Tx flanking regions (TxFlnk), gene-associated enhancer (EnhG), enhancer (Enh), zinc finger nuclear factor regions (ZNF), repeats (Rpts), heterochromatin (Het), bivalent transcriptional start site (TssBiv), bivalent flanking regions (BivFlnk), bivalent enhancer (EnhBiv), polycomb repressed region (ReprPC), weak polycomb repressed region (ReprPCWk) and quiescent regions (Quies) [1]. Histone marks are shown in Additional file 1: Fig. S4. The calculations were carried out separately for each of 111 reference genomes using the entire data set of 1.9 million archaic SNPs. The boxplots indicate for each region the average NCO rate (1 − D’2a,hap), as well as the average track overlap of these SNPs with 5hmC marks, DNase I-defined, and 5mC-defined open chromatin. The NCO rate was calculated using the data from the LWK cohort, while 5mC, DNase I, and 5mC tracks were defined for the H1 stem cell line [1, 21]. The r- and p-values refer to a Spearman rank correlation carried out between these relative frequencies of the marks and the respective NCO rate data. b Correlation of NCO recombination rate with the 5hmC frequency. The plot shows the average NCO rate and corresponding track overlap of the SNP with 5hmC marks for each of the 15 epigenetically defined subregions. The line was derived by Pearson correlation, the p and r2 values are indicated

The direct correlation of the between NCO rate of the subregions and the respective 5hmC frequency revealed an r2 value of 0.67 (Fig. 4b). Within these subregions, the highest NCO rate and 5hmC frequency values were detected in bivalent regions marked by the presence of both H3k27me3 and H3K4me3. This applied for “poised” versions of promoter (TssBiv), enhancer (EnhBiv), and flanking regions (BivFlank), while the “non-poised” active counterparts TssA, EnhG, Enh, and TssAFlnk were ranking substantially lower. Slightly lower values were also detected in Polycomb-repressed regions (ReprPC), while the lowest values were detected in ZNF/Repeat and Heterochromatin regions. Low values were also detected for alleles in transcribed regions (Tx, TxWk, TxFlnk). This however may be due to the fact that these regions largely comprise of introns, which are associated with low NCO rate (compare Fig. 2d).

A more detailed analysis on the frequency of 5hmC marks revealed that they were indeed strongly enriched in all regions identified before to be associated with increased recombination rates. This applies for bivalent regions, open chromatin, CpG islands (CGI), and DMC1 tracks (Additional file 1: Fig. S5a). Plots representing the association with the GC content further revealed similar maxima curves for both 5hmc enrichment and NCO rate (Additional file 1: Fig. S5b, left panel) and also a peak in 5hmC depositions in CpG islands closely matched the sharp NCO rate increase observed at these sites (Additional file 1: Fig. S5b, right panel). Analysis of the 5hmC density in genic regions revealed that it spiked at the 5′- and, to a lesser extent, the 3′-boundaries of genes (Additional file 1: Fig. S5c, left panel), closely mirroring the NCO rate profile shown in Fig. 2c. Also, the breakdown of genic sub-regions revealed a similar pattern as described for NCO rate: the 5hmC density was highest at 5′-UTR and gradually decreased in intergenic regions with increasing distance to the gene boundaries. It was low in introns, increased in splice sites, and was highest in exons (Additional file 1: Fig. S5c, right panel). Thus, 5hmC seems to be prominently present in all the functionally relevant loci that, according to our analyses, are associated with an increased NCO recombination rate.

留言 (0)

沒有登入
gif