Integrative analysis of transcriptomic and epigenomic data reveals distinct patterns for developmental and housekeeping gene regulation

Definition of DEGs and SEGs

To identify the differentially expressed genes (DEGs) in each of the three lineages, and one set of similarly expressed genes (SEGs), the strategies as outlined below were used.

To identify DEGs:

(a)

A gene is differentially expressed between two lineages if the difference in its expression is >3-fold with a p-value < 0.01 as determined using the MAD-score estimate [100]. We require that a gene is differentially expressed when compared to both the other lineages.

(b)

To avoid multiple pairwise comparisons with subjective thresholds, we adopted the Berger-Parker Dominance index [101]. It is widely used in ecology to measure variability of species and it is also used by Illumina for base call quality identification (GATK documentation [102]). For any gene g with gene expression GE = (GE1,GE2,GE3) (GEi is gene expression in each lineage here), the Dominance index between values GE1, GE2, and GE3 is the ratio of max(GEi) to the sum of all three values:

$$Dom=max(GEi)/\sum GEi$$

A well-established threshold Dom = 0.6 [102,103,104] indicating a strong dominance of one lineage was applied to define DEGs.

(c)

Unsupervised clustering using seq_monk [94, 105] with default parameters for DESeq2 [106] application.

The SEGs across all three lineages were obtained by the following strategies:

(a)

Minimization of Dominance index across three lineage’s values, as low as Dom = 0.34. Note that completely even distribution is indicated by Dom = 0.33, and we use 0.34 as threshold to indicate almost even distribution. Setting this Dom threshold, we aimed to obtain an amount of SEGs comparable with the amount of DEGs. Therefore, a threshold Dom = 0.34 indicating an almost even dominance of all three lineages for a gene was applied to define SEGs.

(b)

Minimization of MAD-score across three lineage’s values.

Each DEG-identification strategy resulted in three sets of DEG genes (ectoderm, endoderm and mesoderm uniquely expressed genes). The sets were of comparable sizes (around 700 genes for ectoderm and mesoderm, and around 1000 genes for endoderm). The lists of genes for each strategy were highly overlapping, their Jaccard similarity was around 0.9. Moreover, all further analysis results were valid for each strategy’s set of genes. Sets of known marker genes for each lineage [47, 107] were used to optimise and calibrate our chosen DEG sets. Based on this criterion, the Dominance index-derived DEG sets were chosen to capture the maximum number of known marker genes for gene sets of roughly the same size. We used the Dominance index distribution to determine the high dominance value, Dhigh = 0.6 [102], which we picked as threshold for highly dominant (in one lineage) genes, DEGs. For SEG identification, we collected all genes with almost minimal Dominance equal Dlow = 0.34 (equally valued across the three lineages) here.

To determine our threshold significance, we ran a permutation test. For SEGs, we randomly and uniformly picked one gene expression level within each lineage (across all expressed genes at least in one layer) 1000 times, creating 1000 random three-tuples, which are our simulated genes’ GE. Then we compute Dom for each simulated gene.

The p-value of our hypothesis H0 for SEGs (H0: a value Dom = 0.34 can be obtained by chance) is estimated as the proportion of permutations that give a Dom value ≤ 0.34. Here, the null hypothesis can be rejected with a p-value of 0.001 since none of the permutation gives a Dom value ≤ 0.34.

For DEG’s threshold Dom = 0.6, we conducted a permutation test similarly, but we sampled from a set of genes expressed in one of the layers (because our DEGs should be expressed in a corresponding layer). For example, we sample from a set of genes with GE > 2 in ectoderm for the ectoderm layer set. In this case our null hypothesis is that Dom = 0.6 for ectoderm layer genes can be reached by chance for any permuted gene with GE_ect > GE_mes and GE_ect > GE_end (according to DEG definition). In contrast, we have shown that it is very unlikely to get value Dom = 0.6 for permuted GE layers of the set with maximal GE in the ectoderm layer, p = 0.0012.

Genome architecture and transcriptome architecture

The genome architecture is characterised by the distance between each gene i and its nearest neighbour, di = min(|TSSi−1 − TSSi|, |TSSi+1 − TSSi|), where TSS is the most 5′ annotated transcription start site and genes have been ordered by their TSS. The transcriptome architecture is similarly defined, but we exclude genes whose expression level is below a threshold when calculating distances. Threshold values were from [−3, −2, −1, 0, 1, 2, 3] in units of log2 RPKM.

To evaluate the differences in nearest neighbour distances for DEG and SEG gene sets we used the Mann-Whitney test at significance p < 0.05.

GO analysis

We used four different GO enrichment tools: panther [108], Gorilla [109], goliath, and g:Profiler [110] to evaluate the categories enriched amongst similarly and differentially expressed genes. Default parameters were applied. All four methods were consistent and the results from g:Profiler are reported in Table 1, Additional file 1: Table S2 and Fig. S2E.

3D organisation: TADs and genes

TAD annotations were used from [29, 33]. We computed the numbers of SEGs and DEGs within each TAD. We then computed the lengths of DEG- and SEG-containing TADs, and gene density within them as the number of genes divided by the length of the TAD.

We compared gene density and lengths of DEG and SEG-containing TADs and tested significance of their differences with the Mann-Whitney test at p < 0.05 level.

Chromatin accessibility

To identify accessible regions, we divided the genome into non-overlapping 100-bp windows and we computed a number of GC dinucleotides in each window. We also computed the accessibility level within each window, Aj (j = 1,2,3 for the three lineages) as a percentage of all GC-methylated dinucleotides counts divided by the total number of GC dinucleotides.

Similarly to DEGs vs SEGs, we applied the Dominance index [101] strategy to call differentially accessible regions (DARs) for ectoderm, endoderm, and mesoderm and similarly accessible regions (SARs) within each 100-bp window:

where A1, A2, and A3 are the accessibility levels for the three lineages, defined as the fraction of accessible (methylated here) GC to all GC in the window [47].

We selected windows with high dominance of one lineage level, A*, over other lineage levels. The dominance threshold Dhigh = 0.6 was selected according to [102,103,104]. We required Dlow = 0.34 to define SARs. We computed the permutation test at p < 0.0001 level to ensure significance of the dominance threshold. All further analysis (see sections about linking DEGs and DARs) was done for data collected by Dominance index strategy.

QC of accessibility data and the NMT_seq GC bias

At the first step, we filtered out the windows with insufficient coverage (fewer than 25 reads), to avoid calling low-confidence DARs. We computed the number of GCs in non-overlapping 100-bp windows throughout the genome, based on sequenced read data. Note that for NOME_seq technology, methylated GC means that the area around it is occupied by a nucleosome. We denote the number of reads in each window by Ai and the number of GCs by Ci. We filtered out the relative coverage-unbalanced and GC-number-unbalanced windows. The set of QC tests/filters and parameters are as follows:

(a)

Ai > 25 and Ai/Ci > 2

(b)

Relative coverage balance between lineages, adjusted for the cell number in each set. We required that variability between the three values for coverage in a window is less than one standard deviation over mean:

dij = (covi − covj), i,j in are differences between coverages for three lineages in a window

$$|max |_| - mean(_,_,_)| < std(_,_,_)$$

(c)

GC number balance between lineages within corresponding windows. We required that variability between the three values for GC numbers in a window is less than one standard deviation over mean.

dif_numGCij = (numGCi − numGCj), i,j in are differences between GC numbers for three lineages in a window

$$|max |dif\_numG_| - mean(nG_,numG_,numG_) | < 3*std(numG_,numG_,numG_)$$

For all windows, both DARs and SARs, we computed the medians of GC counts (Additional file 1: Fig. S5B). The SAR set was subdivided into highly accessible (HA) and low accessible (LA), HA accessibility threshold is more than 35% of accessibility in a window, while LA is less or equal 35%. The distributions of retained and filtered out windows depending on GC counts and minimal coverage threshold are shown in Additional file 1: Fig. S5A.

Count of DAR and SARs around TSS (normalised by number of expressed genes) and H3K27ac

We analysed the spatial distribution of accessible chromatin regions (ACR) around TSSs for both DEGs and SEGs. We fixed the vicinity of a gene to be XK bp (X is 20,40,80,120,…300,400, ‘K’ means kilobases here), then ACR regions were counted within (−XK,XK) intervals around TSS. The value was then normalised by the number of genes in the region of interest.

We introduced an accessibility index to measure and visualise distribution of DARs and SARs around DEGs and SEGs:

$$\mathrm=\frac_\left(_\right)/_\left(_\right)}_},$$

where \(_(_)\) = number of accessible regions in bin j of vicinity \(_\),\(_\) = vicinity of (differentially or similarly) expressed gene i, centred at the transcription start site of that gene;\(_(_)\) = number of expressed genes in vicinity \(_\);\(_\)= number of (differentially or similarly) expressed genes in the given set.

Since we defined DARs in a lineage-specific way, the mesoderm-accessible DARs were counted around mesoderm upregulated DE genes, and correspondingly, ectoderm and endoderm-accessible DARs were counted around ectoderm or endoderm upregulated DE genes. We computed DAR/SAR frequency in a large fixed vicinity of DEGs/SEGs TSS, the size of vicinities mimicking the TAD’s size ranges, from 20 K, 40 K,….up to 1 MB for some chromosomes.

We performed a permutation test to assess lineage specificity by swapping DAR sets and comparing the resulting clustering distribution around non-matched DEGs TSS. We computed a background frequency of DARs for all genes within the genome.

We used a set of Chip-Seq-derived lineage-specific H3K27ac (annotation and data from [111]), for the same E7.5 day of embryo development. We computed a frequency of matched DARs around the H3K27ac. To test the specificity of DAR clustering around lineage-specific H3K27ac, we ran a permutation test swapping DARs across lineages. We applied the widely used differentially methylated regions finding method Defiant [112] and compared with our approach by using H3K27ac sets as markers for optimal performance. Our method was more sensitive: we retrieved twice as many H3K27ac peaks (for the roughly same amount of DhMR regions), compared to Defiant.

Correlation of chromatin accessibility with gene expressionChromatin abundance coefficient

To correlate DAR/SAR frequency to GE level, we developed the chromatin abundance coefficient (CAC) for a set of genes. To compute CAC, the number of open chromatin windows (DARs or SARs) was calculated across the TSS vicinity Vi for each gene gi in a set first, N(Vi). Then the number is normalised by the count of expressed genes in this vicinity, Ng(Vi) (normalised accessible region frequency):

$$\mathrmVi = N(_) / Ng(_)$$

It is then compared with the average gene expression (within the same normalised frequency counts) across genes having GEi, by computing a Pearson correlation coefficient:

$$\mathrm= Pearson\, correlation \,(N(_) / Ng(_), \,mean(G_))$$

If the CAC value is positive and high (R2 > 0.7), we define the corresponding sets of genes and DARs as linked.

Determining ‘domains of influence’ for DEGs-DARs, and linked gene-DAR combinations

We searched what TSS vicinity ranges give high or low correlations with GE, and at what vicinity the correlation vanishes. Assuming that the majority of DARs within the vicinity are associated with their corresponding genes, we identify a vicinity of TSS giving maximal correlation. The zone of maximal influence, Z*, is defined as below:

$$Z^* = argmax (Rk=}(}) | Zk \subset \)$$

where Rk and Zk = , Rk = correlation coefficient (for number of DARs from Zk and GE of genes from a given gene set). We compute the ‘Gene Expression-DARs frequency’ correlations for different upstream/downstream fixed zones. We computed the CAC correlation for SEGs and SARs in a similar way.

H3K27ac and DEGs

We computed the normalised frequency of H3K27ac enhancers (annotation from [18], data from [36]) around the TSS vicinity of DEGs, similarly to DEG-DARs above. We compared these frequencies with GE expressions to determine if there is a positive correlation. We computed ‘zones of enhancer influence’ for the H3K27ac sets in a similar way to above.

Algorithm to link DEGs and DARs

We searched a range of vicinities for the one giving maximal CAC (while keeping high enough DAR density) correlation of average gene expression and DAR’s frequency. We do it separately for each corresponding lineage and chromosome. We assume that the majority of these maximally correlated DARs are putative enhancers for their target genes within a given region of high correlation (domain of influence).

The algorithm to link DEGs and DARs is as follows:

(1)

Retrieve the TSS vicinity zone giving maximal CAC correlation coefficient.

(2)

Retrieve the corresponding genes, TADs (containing these genes) and their linked DARs within this zone, excluding pairs that are found in different TADs.

We retrieved linked DARs for each gene from DEG to make a catalogue (Additional files 2, 3 and 4) of differentially expressed genes and their differentially accessible chromatin regions, sitting in the corresponding TADs. The requirement of exclusion of DARs that are found in different TADs made the majority of DARs localised in the 100 K vicinity of TSS, with only around 15% of them spreading further than 100 kB.

DhMR and ShMR detection, and clustering around DARs, H3K27ac and SARs

We applied the same procedures to find DhMRs as we did for DARs, but with a wider window of 500 bp. Likewise to DARs and SARs, we applied the Dominance index [101] strategy to call differentially hypomethylated regions (DhMRs) and similarly DNA hypomethylated regions (ShMRs) within each 500-bp window. Dominance index of methylation level across the three per-lineage values, within each window:

$$Dom =\mathrm(_)/\sum _.$$

where M1,M2,M3 are the fraction of methylated CpG within a window in ectoderm, endoderm and mesoderm tissues. We used a threshold Dom = 0.6 [102, 103] to classify a DhMR as differentially hypomethylated, and Dom = 0.34 to classify ShMR as similarly hypomethylated, likewise identification of DAR/SAR. We fixed the vicinity = 10,000 bp around DAR and SAR central point, and computed the number of DhMRs and ShMRs within this range. We computed the Jaccard index for DARs and DhMR and for ShMRs and SARs. We ran a set of permutation tests to ensure significance of DhMR/ShMR clustering at p < 0.01.

Difference in DEG and SEG regulation with respect to TF distal and proximal bindingDesign: choosing the regions for TFBS enrichment analysis

To search for TF binding motifs, we pooled all DARs/DhMRs sequences (putative enhancers) into three lineage-specific groups (Additional file 1: Fig. S7A top: pink, green, blue). We also took DNA sequences from 100 bp upstream and 50 bp downstream of TSS [113] to form three pools of core promoters (DEG CPs). Similarly, SARs were combined into one group and promoters of SEGs into another (SEG CPs) (Additional file 1: Fig. S7A bottom: orange). We investigated the motif enrichment of DNA within these eight groups (Additional file 1: Fig. S7B), and enriched motifs were ranked based on the p-values of their overrepresentation. Next, we filtered corresponding TFs by their expression in our transcriptome data sets.

We hypothesised that promoter-enhancer specificity is manifested by higher degree of similarity within than between regulatory neighbourhoods, e.g. DEG enhancers are more similar to DEG promoters than to SEG promoters. To test how different is DEG’s and SEG’s promoter-enhancer specificity with respect to TF binding, we computed RMSE-based weighted similarity scores within and between the DEG and SEG motif repertoires. We also computed promoter-promoter and enhancer-enhancer specificity (similarity scores, defined as rmse = root mean square error) within and between the DEG and SEG neighbourhoods separately (Additional file 1: Table S3).

We test the hypothesis that TF binding is similar between DEGs and SEGs. Our alternative hypothesis is that it is different (therefore specific, see definition [3]).

We define a region to be a ‘putative DEG enhancer’ if it is lineage-specific, accessible and hypomethylated. There are 960 ectoderm-specific, 5230 endoderm-specific and 1382 mesoderm-specific putative enhancers. We define a region to be ‘putative SEG enhancers’ if it is similarly accessible and hypomethylated across all three lineages (6354 regions). Collectively, we have four sets of putative enhancers: three sets of DEGs, and one set of SEG’s enhancers (Additional file 1: Fig. S7B). In line with [17], we define ‘core promoter (CP)’ as the region [−100, 50] relative to the TSS. We have four DEG CP sets and one SEG CP set (Additional file 1: Fig. S7B). We measure the similarity of two sets of regions based on the similarity of their TFBS repertoires.

We say that a local set, consisting of a promoter with associated putative enhancers (or other promoters), which are likely to be 3D-close, constitutes a regulatory neighbourhood of the given promoter.

Obtaining enriched motifs (TFs repertoires) within putative enhancers and promoters

We ran five motif enrichment tools (each with default parameters) to check a consistency of motif search. We use the intersection of motif searches: Homer [56], RSAT [114], GREAT [115], DMINDA2 [116], AME meme suit [117]—on our eight sets of putative enhancer and core promoters (the list of these motifs is presented in Additional file 5), and obtained eight lists of significantly enriched TF motifs, Additional file 1: Fig. S7A, with a p-value < 0.001. Each motif within a list is ranked in ascending order based on p-values (Additional file 1: Fig. S7C).

We filtered the lists of enriched motifs by their gene expression in our data (GE in log2(RPKM), GE > 0), discarding around 25% of motifs corresponding to non-expressed genes. We sorted filtered TF motifs according to where their corresponding genes were expressed: predominantly in one lineage (DEG) or almost evenly across all three (SEG).

Measurement of difference of TF repertoire

We measure TF difference by overall weighted motif similarity between each pair of the eight TF lists. We extracted the union of all motifs within all eight region sets, which is our main feature. It includes 277 TFBS motifs (Additional file 5). For each set (e.g. ectoderm enhancers), we ranked motifs according to their frequency. Because the lists of motifs are of different length, we ranked motifs by quartiles of the list, e.g. 1,2,3,4. For example, Maz and Zn281 are ranked as 1 (in first list frequency quartile) in ectoderm enhancers, but as 2 in ectoderm promoters, while Po3f1 is ranked 2 in ectoderm enhancers, and ranked 0 (not present) in promoters. We represent each set of enhancers and CPs as a feature-vector of its list’s rank-values for each feature entry or zeros, if a motif is missing. We compute a pairwise weighted similarity between ranked lists of TF motifs (feature-score vectors) as RMSE of fitting feature-scores to the line \(y=x\) (a model assuming the lists were equal in frequency) for each two vectors:

$$RMSE =\sqrt_^}-fi)}(\widehat-fi)}$$

where \(\widehat_}\) are the points of the line \(y=x\), and \(fi\) are the motif’s feature-scores for both vectors. In this way, we account for motifs that are absent in one of the lists, and on the degree of score (reflecting motif’s frequency) for those which are common between the lists. The higher the RMSE, the smaller the similarity between two TF lists. We first computed overall similarity within and between DEG and SEG enhancers and promoters as a median value across pairwise similarity for each lineage pair (Table S3). Then we computed the difference between overall similarities of DEG-DEG and DEG-SEG sets. We tested how significant this difference is by both t-test and permutation tests. We ran a set of rank permutation tests within each list of motifs per promoter or enhancer regions to check a significance of similarity difference. We permuted each motif’s ranks and each motif’s occurrence, simulating DNA sequence shuffling while preserving the same GC-richness (by drawing from GC-rich motifs for SEG enhancers and promoters).

We computed overall similarity between and within enhancers and CPs: we compared core promoters of DEGs and SEGs, then corresponding putative enhancers, and finally, promoters-enhancers (Table S3).

Retrieving motifs contributing to enhancer-promoter difference between DEGs (developmental) and SEG (housekeeping) regulation

We defined two distinct sets of motifs, common between enhancer-core promoters within DEGs and within SEGs, by selecting motifs that occurred across lists of DEG and SEG regulatory regions at high (the first p-value quartile) scores. We compared them with those found in [51].

In-depth analysis of motifs’ features contributing to sequence-encoded enhancer-promoter specificity

This analysis includes the following features of motifs: nucleotide content, complexity, motif length.

We define a single-nucleotide content, (Pa,Pc,Pg,Pt), of a motif as a proportion of occurrence A,C,G or T in motif’s sequence, such as

$$Px= count( x)/length(motif), x\mathrm\},},},}\}$$

A di-nucleotide content is a count of each adjacent pair divided by the length of the motif −1.

$$Pxy= count( xy)/(length(motif)-1), \,xy\mathrm\},},},}\dots .},},},}\}$$

A di-nucleotide entropy of a motif is given by the formula below

$$entropy=-\sum Pxy(log2(Pxy))$$

where summation goes over all nucleotide adjacent pairs in a motif. We take the entropy value as a measure of a motif’s complexity.

We studied the motifs specific to SAR-SEG promoters and enhancers (intersection of high-ranked motifs) and specific for DAR-DEG promoters and enhancers (union of all three DAR-DER high-ranked motif intersections). We ran non-parametric statistical tests (Wilcoxon and Kruskal–Wallis) to infer significance of the differences.

We also computed the percentage of known TFs [75] in DEGs and SEGs, and average percentage across all mouse genes. We used a one tailed Fisher test to determine significance.

Difference within DEG’s putative enhancers: inferring lineage-specific driver TFs

We searched for TF motifs which could contribute to differences in developmental regulation across lineages. We focus on DEG’s putative enhancers, as in Additional file 1: Fig. S7B bottom. We defined lists of distinct motifs within DEG enhancers. We retrieved lineage-specific-enriched TF motif repertoires, and the most significant ones are listed in Fig. 5B.

Confirming per-lineage difference and function by literature search

We compared our results of enriched TFs in Fig. 5B with TFs which are reported to be important in lineage specification and pattern formation. We checked if the TFs—producing genes for the enriched motifs—were expressed mostly in DEGs or in SEGs. We confirmed from the literature what motifs were pioneer (happened to be DEG-produced), and which were not (SEG-produced) (Additional file 1: Table S4 for references).

Data used

Pseudo-bulk per ectoderm, endoderm, mesoderm for transcriptome, chromatin accessibility, and Methylome for E7.5, from scNMT_seq, as in [18]. The set of H3K27ac is from [36] and H3K4me3 is from [18]. Raw sequencing data together with processed files (RNA counts, CpG methylation reports, GpC accessibility reports) are available in the Gene Expression Omnibus under accession number GSE121708. Data can be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/scnmt_gastrulation.

留言 (0)

沒有登入
gif