Profiling the immune epigenome across global cattle breeds

Sample collection and cell isolation

The Holstein Friesian, N’Dama and Nelore cattle were sampled at The Royal (Dick) School of Veterinary Studies at the University of Edinburgh (Edinburgh, UK), ILRI (Nairobi, Kenya) and the Centro Avançado de Pesquisa Tecnológica do Agronegócio de Bovinos de Corte, Sertãozinho (São Paulo, Brazil), respectively. All sampled animals were female, but due to the availability of animals to be sampled at each location, animals could only be matched by age within sites. At the time of sampling, the Holstein Friesian cattle were approximately 10 months old, the N’Dama cattle were approximately 48 months old and the Nelore cattle were between 117 and 141 months old.

The ATAC-seq, RRBS and RNA-seq reference profiles of purified cell subsets were generated from the same nine cattle (three cattle of each breed), and these cattle will subsequently be referred to as Holstein Friesian, N’Dama and Nelore 1, 2 and 3. The methylomes of fifteen cellular mixtures were also generated to assess how accurately their composition could be computed using the reference profiles. Six of the cellular mixtures were generated in vitro by mixing purified cell populations from Holstein Friesian cattle 4 and 7 at defined proportions. The remaining nine cellular mixtures were blood samples collected from Holstein Friesian 4, 5 and 6; N’Dama 1, 2 and 3; and Nelore 1, 2 and 3. Lastly, a pilot comparative analysis of RRBS and Whole Genome Bisulfite Sequencing (WGBS) used cell populations isolated from Holstein Friesian 2, 3, 8 and 9.

Blood was sampled via jugular venepuncture into blood bags or 60 ml syringes containing citrate-phosphate-dextrose solution with adenine. Peripheral blood mononuclear cells (PBMCs) were obtained from blood by density gradient centrifugation: 30 ml blood was overlaid onto 20 ml Ficoll Plaque Plus (GE Healthcare) in a 50-ml centrifuge tube and centrifuged at 800×g for 30 min with no brakes. Cells at the interface layer were collected and washed once in phosphate-buffered saline (PBS). To lyse red blood cells, cells were incubated in 5 ml ammonium chloride lysis buffer (0.1X 0.175M Tris pH 7.4 + 0.9X 0.16M Ammonium Chloride) at room temperature for 5 min, followed by centrifugation for 10 min at 300×g. Cells were then washed twice and resuspended in PBS.

To enrich for granulocytes, following Ficoll density gradient centrifugation, all layers above the RBC layer were removed by pipetting, leaving approximately 5 ml of solution above the RBC layer to ensure it was not disrupted. To lyse RBCs, 10 ml of the RBC layer was incubated with 40 ml ammonium chloride lysis buffer (0.1X 0.175M Tris pH 7.4 + 0.9X 0.16M Ammonium Chloride) at room temperature for 5–10 min, followed by centrifugation for 10 min at 500×g. Cells were then washed three times with PBS.

Flow cytometry analysis and cell sorting

The antibody panels used for flow cytometry are detailed in Additional file 2: Table S6. Due to the difference in flow cytometers that were accessible in each sampling location, amendments to fluorophores used in the screening and sorting of cells from each breed were required to ensure compatibility with the lasers and filters available. Nevertheless, the markers used to define each population of interest were unchanged and equivalent levels of purity were often achieved (Additional file 1: Fig. S7, Additional file 2: S7).

Antibody dilutions were prepared in FACS buffer (PBS + 0.5% BSA + 2mM EDTA). Following antibody staining, cells were resuspended in FACS Collection Buffer (PBS + 2% BSA + 2 mM EDTA). To sort Holstein Friesian and N’Dama cells, 1 μg/ml DAPI was added to the FACS Collection Buffer. The cell sorter used for the Nelore cells did not have a violet laser; hence, DAPI was not used. Fluorescence-activated cell sorting (FACS sorting) of Holstein Friesian, N’Dama and Nelore cells was performed on a Becton Dickinson FACSAria III Cell Sorter, a Becton Dickinson Influx Cell Sorter and a Becton Dickinson FACSAria II Cell Sorter, respectively. Granulocytes were sorted using FACS based on their forward and side scatter profile. A representative gating strategy used for sorting is shown in Additional file 1: Fig. S7. Purity of sorted cell populations was validated by post-sort flow cytometric analysis of 1000 events. All cells were stained and sorted within 9 h of blood collection and kept on ice between processing steps. Sorting was performed to > 94% purity (Additional file 2: Table S7), and then cells were pelleted at 350×g for 5 min and resuspended in an appropriate volume of PBS for downstream processing.

Prior to FACS sorting of NK cells and CD8 T cells, a negative pre-sort using anti-IgG magnetic MACS beads (Miltenyi Biotech) was performed to remove B cells, CD4 T cells, monocytes and γδ T cells and consequently enrich for NK cells and CD8 T cells. For this, up to 109 PBMCs were resuspended in culture media (RPMI 1640 Medium + 10% FBS) and incubated with IL-A58, IL-A12, IL-A24 and GB21A antibodies (2 μg/ml; Additional file 2: Table S6) for 15 min at 4 °C. Following two washes with culture media, cells were resuspended in 10 μl immunomagnetic anti-IgG beads (Miltenyi Biotec) per 1 × 107 cells and incubated for 10 min at 4 °C. Cells were then washed and resuspended in 2 ml culture media. Approximately 2.5 × 108 cells were passed through a 40-μm sterile filter onto each LD column (Miltenyi Biotec), which was placed on quadroMACS magnet and the flow through containing the negatively sorted cells collected. Following antibody staining, MACS-sorted cells were then FACS sorted as previously described.

To isolate adequate amounts of RNA from Holstein Friesian NK cells for sequencing, RNA was pooled across three or four NK cell isolations for each animal. To further enrich for NK cells prior to FACS-sorting, T cells were removed during the MACS pre-sort by the addition of 2 μg/ml MM1A to PBMCs prior to addition of the anti-IgG beads.

Due to a fault with the cell sorter, B cells from N’Dama 2 could not be sorted by FACS; thus, this sample was isolated using MACS positive selection. For this, 2.5 × 107 PBMCs were resuspended in MACS buffer (PBS/0.5% BSA), and cells were incubated with 0.3 mg/ml IL-A58 for 15 min at 4 °C. Cells were washed, resuspended in 25 μl anti-IgG beads (Miltenyi Biotec) and incubated for 10 min at 4 °C. Cells were then washed and resuspended in 500 μl MACS buffer. Cells were passed through a 40-μm sterile filter onto a LS column placed on a quadroMACS magnet (Miltenyi Biotec). After washing the column twice with 500 μl MACS buffer, the column was removed from the magnet, and 1 ml MACS buffer was added to the column. The bound cells were flushed out of the column using a plunger, and their cell purity was assessed by post-sort flow cytometric analysis.

Cells recovered from each sorting session were used to generate paired RRBS and ATAC-seq data. The one exception to this was the Nelore 1 granulocytes, where the cells for RRBS and ATAC-seq were isolated on different dates.

Spike-in of P815 mouse cells

To check for any global changes in methylation or chromatin, cultured P815 mouse cells were washed with PBS and spiked-in to the FACS-sorted Holstein Friesian cells at a 1:10 ratio, respectively, to generate the single cell type reference ATAC-seq and RRBS data. P815 is a mastocytoma cell line derived by methylcholanthrene treatment of a male DBA/2 mouse [69]. The identity of the cells was not authenticated and cultures were not tested for mycoplasma contamination.

Generation of artificial in vitro cellular mixtures and screening of lysed whole blood

In vitro cellular mixtures were prepared using FACS-sorted cells from Holstein Friesian cattle at 1:1, 4:1 and 2:3 ratios of monocytes to CD4 T cells and B cells to γδ T cells, yielding a total of six artificial mixtures. To prepare whole blood for flow cytometry, RBCs were lysed using lysis buffer as described for the enrichment of granulocytes. The lysed whole blood samples were screened using the same antibodies as used for FACS, with granulocytes identified based on their forward-side scatter. Flow cytometry analyses were performed on a Becton Dickinson LSRFortessa X-20, a Becton Dickinson FACSCanto and a Becton Dickinson FACSMelody Cell Sorter for Holstein Friesian, N’Dama and Nelore cattle, respectively. The proportions of each cell type were used as the reference values to assess the accuracy in lysed blood deconvolution using CIBERSORTx. However, when calculating the sum of all the cell type proportions within PBMCs, 10–36% of cells were of an unknown cell type or were positive for multiple of the cell surface markers screened within a single antibody panel. As no reference DNA methylation profiles were generated for these cell populations, the fraction of these cells within lysed blood could not be estimated. Therefore, to enable the CIBERSORTx predicted proportions to be compared to the flow cytometry proportions, the proportion of unknown cells and cells positive for multiple markers were distributed across the characterised cell types based on their proportions within lysed blood (Additional file 2: Table S8).

DNA extraction

DNA was extracted from sorted cells, in vitro artificial mixtures and lysed blood samples using a QIAGEN DNeasy blood and tissue kit with proteinase K and RNase treatment (QIAGEN), following the manufacturer’s recommendations. The concentration of the genomic DNA was determined using a Qubit 3.0 Fluorimeter (Invitrogen). Genomic DNA was checked for degradation and contamination by gel electrophoresis.

Selection of method to study DNA methylation at base pair resolution

We carried out a pilot comparative analysis of RRBS and WGBS data to determine which method captured the highest proportion of CpG sites at sufficient coverage. For this, genomic DNA from CD4 T cells, monocytes, γδ T cells and CD8 T cells was submitted to Diagenode, Belgium, for their RRBS service, which included library preparation using Diagenode’s Premium RRBS Kit and 50 bp single-end sequencing on an Illumina HiSeq 3000 instrument. Genomic DNA from monocytes was also submitted to BGI, Hong Kong, for their WGBS service, including library construction and 150 bp paired-end sequencing on an Illumina HiSeq 4000 instrument. Briefly, WGBS library preparation included fragmentation of genomic DNA to 100–300 bp using sonication. This was followed by DNA end repair, in addition to 3′-adenine overhangs, and ligation of methylated sequencing adapters. DNA was denatured and bisulfite converted using the ZYMO EZ DNA Methylation-Gold kit. Libraries were then desalted, size selected, PCR amplified and size selected again.

We found that for similar sequencing costs the two approaches covered a similar number of CpG sites with ≥ 10x coverage, but RRBS captured more sites within CGIs and required a smaller amount of DNA (Additional file 1: Fig. S8, Additional file 2: Table S9). Therefore, RRBS was used to generate all subsequent DNA methylation data.

Reduced representation bisulfite sequencing

DNA from sorted cells, in vitro artificial mixtures and lysed blood samples was submitted to Diagenode for their RRBS service. This service included Qubit sample quantification, DNA quality assessment using the Fragment AnalyzerTM and library preparation using Diagenode’s Premium RRBS Kit. Briefly, sample preparation involved digestion of 100 ng DNA using MspI, followed by end repair and addition of methylated control DNA and unmethylated control DNA. Adapters were ligated to the fragments and AMPure beads were used to remove adapter dimers. Samples were quantified using qPCR and pooled. DNA was bisulfite converted and the optimal number of cycles for the enrichment PCR was determined using qPCR. PCR was then performed to amplify DNA fragments, followed by clean-up using AMPure beads. DNA was quantified using Qubit and the fragment size was monitored on a 2100 Bioanalyser (Agilent). The bisulfite-converted DNA was then sequenced on a HiSeq 3000 instrument using 50 bp single-end sequencing.

ATAC sequencing

Fifty thousand sorted cells in PBS were pelleted in a 96-well v-bottomed plate by centrifugation at 500×g for 2 min at 4 °C, and the supernatant was carefully removed. Cell pellets were lysed in 100 μl cold lysis buffer (10 mM Tris hydrochloride, pH 7.4, 10 mM sodium chloride, 3 mM magnesium chloride, 0.1% IGEPAL CA-630) [70]. Immediately following lysis, nuclei were spun at 500×g for 10 min at 4 °C. Supernatant was removed by pipetting, and nuclei were resuspended in 50 μl transposase mixture (25 μl 2x Tagment DNA buffer, 2.5 μl TDE1 Tagment DNA (Illumina) and 22.5 μl nuclease-free water) and disrupted by pipetting. Lysed cells were transferred to 1.5 ml microcentrifuge tubes and incubated for 30 min at 37 °C in an Eppendorf ThermoMixer with agitation at 300 rpm. Transposed DNA was purified using a QIAGEN MinElute Reaction Cleanup Kit according to the manufacturer’s recommendations and eluted in 14 μl of nuclease-free water. Transposed fragments were amplified using v2_Ad1.1 (index i5) and v2_Ad2.1 - v2_Ad2.12 (index i7) primers from [71]. To determine the number of PCR cycles required, qPCR reactions were carried out in duplicate 10 μl reactions using 0.5 μl transposed DNA, 1x NEBNext High-Fidelity PCR Master Mix (NEB), 1.25 μM dual-index PCR primers, 0.5X SYBR Green I (Invitrogen) and 15 μM ROX reference dye (Agilent Technologies). Samples were incubated at 72°C for 5 min, then 98 °C for 30 s, followed by thermal cycling for 30 cycles at 98 °C for 10 s, 63 °C for 30 s and 72 °C for 1 min. The normalised reporter signal was plotted against the cycle number, and the cycle number that corresponded to ¼ of maximum fluorescent intensity was determined. The remaining 12.5 μl undiluted transposed DNA was then amplified by the determined number of PCR cycles in 50 μl reactions using 1X NEBNext High-Fidelity PCR Master Mix and 1.385 μM dual-index PCR primers. Libraries were amplified for a total of 11–18 cycles. Amplified DNA was purified using a QIAGEN MinElute PCR Purification Kit, following the manufacturer’s protocol, and eluted in 20 μl nuclease-free water.

Samples were purified using 1.4x AMPure XP beads (Beckman Coulter) to remove DNA fragments below 150–200 bp and eluted in 50 μl nuclease-free water. Additional size selection was performed to remove large DNA fragments (> 1 kb). For this, 0.5x AMPure beads were added to the sample to bind larger DNA fragment, which were discarded. 0.9x AMPure beads were added to the supernatant and DNA was eluted in 20 μl 0.1x TE buffer. Libraries were quantified using a Qubit 3.0 Fluorimeter (Invitrogen), and the insert size was assessed on a 2200 TapeStation System (Agilent Technologies) using High Sensitivity D1000 ScreenTape and Reagents (Agilent Technologies). Samples were pooled into multiple pools based on barcode compatibility. All Holstein Friesian ATAC-seq libraries were submitted for 75 bp paired-end sequencing on an Illumina HiSeq 4000 instrument to yield at least 96M + 96M reads per sample.

To increase the read coverage of ten of the Holstein Friesian ATAC-seq samples, the samples were pooled and re-sequenced using 50 bp paired-end sequencing on an Illumina NovaSeq 6000 instrument to yield approximately 350M + 350M reads per sample. These re-sequenced samples were the granulocyte samples from each animal, the monocyte and γδ T cell samples from Holstein Friesian 2 and 3, the CD4 T cell samples from Holstein 1 and 3 and the NK cell sample from Holstein Friesian 1. The N’Dama and Nelore pooled libraries were submitted for 50 bp paired-end sequencing on a NovaSeq 6000 instrument to yield approximately 83M + 83M reads and 300M + 300M reads per sample, respectively.

Transcriptome sequencing

Sorted cells were pelleted at 300×g for 5 min and resuspended in 700 μl Tri reagent for every 5 × 106 cells. Samples were transferred to QIAshredders (Qiagen), which were centrifuged at 12,000×g for 2 min. RNA was then isolated from the homogenates using a miRNeasy mini kit (Qiagen) with on-column DNase digestion, following the manufacturer’s recommendations. RNA quality was assessed using the 2200 TapeStation System. RNA samples were prepared for sequencing using TruSeq stranded mRNA-seq library preparation by Edinburgh Genomics. Samples were sequenced on an Illumina HiSeq 4000 instrument using 150 bp paired-end sequencing to yield at least 80M + 80M reads per sample.

Whole genome sequencing

Blood was collected into PAXgene blood DNA tubes (Qiagen) and genomic DNA was isolated using a PAXgene Blood DNA Kit (Qiagen), following the manufacturer’s instructions. DNA concentration was quantified on a Qubit 3.0 Fluorimeter (Invitrogen). The purified DNA was submitted to Edinburgh Genomics for TruSeq nano DNA library preparation and whole genome sequencing (WGS) on an Illumina HiSeqX using 150 bp paired-end sequencing to yield ~15x coverage.

RRBS read alignment and methylation calling

Quality assessment was carried out using FASTQC [72]. The Trim Galore software [73] was used to trim the 3′ end on all reads using a minimum Phred quality score of 20 and to remove adapter contamination. To remove potential methylation-biased bases from the MspI digestion end-repair reaction, a further 2 bp was trimmed from the 3′ end of adapter-trimmed reads given no trimming was performed based on score quality. As a spike-in of mouse cells had been used for some samples, trimmed reads were mapped to the concatenated bovine ARS-UCD1.2 (GCA_002263795.2; [74]) and mouse GRCm38.p5 (GCA_000001635.7) reference genomes using Bismark [75]. The minimum alignment score function was set at L,0,−0.2. Reads aligning to the cow autosomes, X chromosome and mitochondrial genome were extracted prior to extraction of the methylation calls. After adapter trimming and filtering out of low quality data, between 8 and 27 million RRBS reads per reference sample uniquely mapped to the Bos taurus (ARS-UCD1.2) reference genome, resulting in a total of 940 million clean single-end reads across all reference RRBS samples (Additional file 2: Table S1). There were between 13 and 31 million filtered uniquely mapped reads per mixture RRBS sample (Additional file 2: Table S5). Following assessment of the percentage methylation across each possible position in the read, the methylation calls for the first 5 bp from the 5′ end and last 3 bp from the 3′ end of reads were ignored to remove potential methylation bias.

ATAC-seq read alignment and peak calling

Trim Galore software was used to remove Nextera adapters and read pairs where at least one of the two sequences became shorter than 20 bp and to trim the 3′ end on all reads using a minimum Phred quality score of 20. Sequences were mapped to the concatenated bovine ARS-UCD1.2 (GCA_002263795.2) extended with the Y chromosome from the Btau_5.0.1 assembly [76] and mouse GRCm38.p5 (GCA_000001635.7) reference genomes using Bowtie2 [77]. The alignments settings -D 20 -R 3 -N 0 -L 20 -i S,1,0.50 were applied. The maximum fragment length for valid paired-end alignments was 2000 bp, and at most, 10 distinct, valid alignments were reported for each read using the parameter -k 10. For the Holstein Friesian samples that were sequenced on the HiSeq and NovaSeq sequencers, reads were then merged into a single file. Reads aligned to the cow autosomes and X chromosome were extracted and unmapped reads removed. Mapping of ATAC-seq data resulted in a total of 3.5 billion paired-end bovine reads (64 million on average per sample) (Additional file 2: Table S2). The fragment size distributions of the ATAC-seq libraries were then calculated using Picard [78] and for each sample the insert size counts were normalised by dividing by the total number of counts.

Further filtering of reads was performed simultaneously with peak calling using Genrich (available at https://github.com/jsh58/Genrich). Genrich was run separately for each breed-specific cell type. Genrich was run in ATAC-seq mode, PCR duplicates were removed, and peaks called with a false discovery rate (FDR)-adjusted P-value above 0.05 were excluded. In ATAC-seq mode, Genrich analyses 100 bp intervals centred on the transposase cut sites. The distance of the midpoint of each read to the nearest Ensembl-annotated TSSs was then calculated using bedtools [79].

RNA-seq data analysis

Kallisto [80] was used to index the ARS-UCD1.2 (GCA_002263795.2) reference cDNA sequences and to quantify the abundance of transcripts by pseudoalignment of reads to the reference with 100 bootstraps. Following pseudoalignment, between 39 and 139 million Holstein Friesian and Nelore mRNA-seq reads were mapped to the reference (Additional file 2: Table S3).

Whole genome sequencing variant calling

Alignment and variant calling was performed for the nine cattle used in this project, alongside a further 289 cattle (see Availability of Data and Materials) (using the same methods as described in [35]). Variants were then filtered using plink 1.90b4 [81]. Firstly, animals with missing variant call frequencies > 0.1 were removed (--mind 0.1). Next, variants with missing call frequencies above 0.05 and variants with a minor allele frequency below 5% were removed (--geno 0.05 --maf 0.05). Lastly, principal components were computed using the --pca parameter for the remaining variants, where the number of components was equal to one less than the number of animals included in the analysis.

Quantification of ATAC-seq reads and RRBS percentage methylation

The coordinates of CGIs across the bovine genome (ARS-UCD1.2) were calculated using EMBOSS [82] and promoters were defined as 1000 bp upstream and 500 bp downstream of Ensembl-annotated TSSs.

The number of Genrich output ATAC-seq reads that overlapped CGIs and promoters were then counted using bedtools where at least 50% overlap was achieved. The read counts were normalised to reads per kilobase of feature per million reads mapped (RPKM) by first dividing by the region length and then by the total number of Genrich-output reads, divided by 1 million.

For analysis of CpG methylation levels, percentage methylation of individual CpG sites was calculated using the cytosine report files from the Bismark methylation extractor and the Methylkit package in R [83]. CpG sites with high (above 99.9th percentile of coverage in each sample) and low (below 10x coverage) read coverage were excluded. CpG site coverage was normalised using a scaling factor based on median CpG coverage. CpG site percentage methylation was then calculated by dividing the number of reads containing a cytosine at a given CpG site by the total coverage of that CpG site. CGI and promoter percentage methylation were calculated using Methylkit by dividing the number of reads containing a cytosine within a region by the region coverage.

Calculation of the distance between genomic features

The least genomic distance between two genomic features (CpG site, TSS, CGI, promoter) was calculated using the R package valr. Where a feature was a region, the distance was calculated from the start or end of the region to the nearest upstream or downstream feature.

Principal component analysis and sample correlation

The ATAC-seq PCA was performed using normalised read counts at CGIs where the sum across all samples was > 50 RPKM and where the counts were scaled across samples using the scale function from base R. The RRBS PCA was conducted using the percentage methylation of CpG sites covered by at least 10 reads in all samples. The RNA-seq PCA was performed using the transcripts per million values for each sample.

Spearman’s rank correlation between samples of each data type was calculated using the same inputs as used for the PCA, and samples were clustered using complete linkage hierarchical clustering. Further hierarchical clustering based on percentage methylation was performed using CpG sites with over 10x coverage in all samples that had a standard deviation across samples within the upper 50% quantile of all sites. Spearman’s rank correlation between samples was then calculated and samples were clustered using the Ward1 method.

Annotation of CpG sites and islands

The percentage of CpG sites overlapping the promoter (defined as 1000 bp upstream and downstream of the TSS), exon, intron or intergenic regions was calculated using the R package genomation [84]. For CpG sites that overlapped multiple genomic features, precedence was given as follows: promoter > exon > intron. The percentage of CpG sites overlapping CGIs or CGI shores was calculated using Methylkit.

Differential methylation analysis of CpG islands

Methylkit was used to perform CGI differential methylation analysis using a chi-squared test with correction for overdispersion for the following pairwise comparisons: (i) all samples compared between cattle subspecies, (ii) all cell types compared between each breed and (iii) all breeds compared between each cell type. For the analysis between cattle lineages, the Holstein Friesian and N’Dama samples, both representing breeds of the taurine lineage, were compared to the indicine Nelore samples, where cell type was fitted as a covariate. To compare all cell types between breeds, pairwise comparisons were performed where each breed was compared to the other two breeds, with cell type again fitted as a covariate. Differential methylation analysis was also performed for all breeds between cell types, with breed fitted as a covariate. Following the differential methylation test, a sliding linear model was used to adjust P values to q values.

An Upset plot was then generated to explore the relationship between pairwise comparisons. For this, a significantly differentially methylated CGI was assigned 1 where the q value was ≥ 0.01, and the methylation difference was ≥ 25%; otherwise, a CGI was assigned 0. An UpSet plot showing the top 40 sets containing the highest number of CGIs was created using the R package ComplexHeatmap.

CGIs were then restricted to those that were significantly differentially methylated between Holstein Friesian and Nelore and/or between taurine (Holstein Friesian and N’Dama) and Nelore. CGIs were further restricted to those within 2 kb of a TSS and where the mean expression of the nearest gene was >10 TPM across all samples.

Linking genetic and epigenetic divergence

Plink version 1.90p was used to calculate IBS scores at each CGI, having excluded all variant sites where more than one animal had an uncalled genotype. IBS scores were calculated both between all pairs of populations, as well as within each of the three populations separately. Only CGIs with at least one CpG site covered by at least five reads were retained, with the small number of islands with a negative IBS score also being excluded. Multiple linear regression was used to test for associations between IBS scores and methylation divergence while controlling for within population IBS scores and the number of variants in the CGI. P values were converted to q values to account for multiple testing. To test whether variants disrupting CpG sites were solely driving the association between IBS and methylation divergence, we first identified all CpG sites in the genome using the Biostrings R package. The bedtools subtract function was then used to exclude variants intersecting one of these sites and IBS and q values recalculated as before.

To calculate FST scores, we compiled cohorts of Holstein Friesian, N’Dama and Bos indicus whole genome sequence derived genotypes from previous studies. As well as the 9 animals from this study, we used data for 33 Holstein-Friesian and 10 N’Dama animals from [35] and 13 N’Dama from [85] (see Availability of Data and Materials). Due to the comparative limited availability of publicly available Nelore genomes, we used data from 14 indicine animals from [35]. These are shown in purple in Fig. 1B and as can be seen cluster very closely to the three studied Nelore. In total, this gave 36 Holstein Friesian samples, 26 N’Dama samples and 17 indicus samples between which we calculated FST scores at each variant using vcftools having first excluded variants with more than 20% missing genotypes. As for the IBS analysis, only CGIs with at least one CpG site covered by at least five reads were retained. FST scores were then averaged across the variants within each CGI, and islands with less than four variants were excluded. Mean FST scores at the remaining islands were then correlated to their absolute methylation differences for each pair of populations using a Spearman’s rank correlation.

Correlation between RNA-seq gene expression, ATAC-seq chromatin accessibility and RRBS percentage methylation

For Holstein Friesian and Nelore separately, the ATAC-seq RPKM values and the RNA-seq TPM values were averaged across samples at each promoter and transcript, respectively. Non-coding transcripts were removed from the analysis as they were not profiled using mRNA-seq. Spearman’s rank correlation between log10-transformed values was then calculated.

To compare between RNA-seq gene expression and percentage methylation, the nearest annotated TSS to each CGI was identified. To ensure transcripts were not associated with multiple CGIs, only the CGIs nearest the transcript were retained in the analysis, but no minimum distance threshold was applied. CGIs associated with non-coding transcripts were removed from the analysis. For each breed, the percentage methylation of each CGI and the TPM values of each transcript were averaged across samples. The CGIs were then split into ten bins based on their percentage methylation and the RNA-seq values for each bin were plotted. For the same CGIs, ATAC-seq RPKM values at CGIs were also averaged across samples and plotted against bins of corresponding CGI percentage methylation.

Unsupervised clustering of CpG islands based on their methylation and chromatin accessibility profiles

The following steps were performed separately for each breed. Firstly, the percentage methylation and ATAC-seq RPKM values at corresponding CGIs were combined into a single matrix, where CGIs were required to contain at least one CpG site covered by ≥ 5 RRBS reads. CGIs were then restricted to those covered in all samples, and any CGIs with ≥ 50 ATAC-seq RPKM were removed, as they were generally uncorrelated to percentage methylation. The percentage methylation and ATAC-seq RPKM values were scaled within a sample across CGIs. The CGIs were then clustered using the R package mclust 5.4.5, which uses finite Gaussian mixture modelling [86]. Based on the Bayesian information criterion scores for all the available models, the VVV model was used with six components. The RNA-seq TPM values of the corresponding transcripts were then appended to the corresponding CGIs. Non-coding transcripts and associated CGIs were removed from the analysis.

To compare the CGI classification between breeds, CGIs were restricted to those containing at least one CpG site covered by ≥ 5 RRBS reads and ≤ 50 ATAC-seq RPKM in every sample across breeds. The number of CGIs falling within each combination of clusters across breed pairs was then calculated, where the clusters were ordered by increasing median percentage methylation within a breed. The sharing of CGIs between different breed clusters was then visualised as a circos plot using the Circos application [87]. GO term enrichment analysis was performed for CGIs within 2 kb of a TSS that were classified in Holstein Friesian cluster 3 and N'Dama cluster 6. Since a high proportion of these genes had human orthologs, functional annotation analyses were performed using the GENE2FUNC function of the web-based platform FUMA v1.3.6a [88]. FUMA was run using default settings with Benjamini–Hochberg multiple testing correction.

GO term enrichment analysis was also performed on the nearest genes to the CGIs within each breed-specific cluster. For this, transcripts were restricted to those where the TSS was situated inside or within ± 10 bp of their associated CGI, as these could be more reliably associated with the given transcript than a TSS further from a CGI. Enrichment analyses were performed for each cluster using FUMA, with genes across all clusters for a given breed as background.

Individual CpG site differential methylation, CIBERSORTx application and signature matrix generation

Differentially methylated CpG sites were identified in pairwise comparisons of unmixed, individual cell types using Methylkit, which uses a chi-square test with correction for overdispersion. Differential methylation statistics were calculated using (i) the Holstein Friesian reference samples, (ii) the Nelore reference samples, (iii) Holstein Friesian and N’Dama reference samples and (iv) the reference samples from the three cattle breeds. Differential methylation was not performed on the N’Dama samples alone, since for some cell types only one or two samples were collected.

CpG sites were selected where the differential methylation q value was ≤ 0.1 across any cell type comparison, with no restriction on the percentage methylation difference. A matrix containing the percentage methylation at these CpG sites in each sample of the individual cell types was used to generate a reference file for upload to CIBERSORTx [89]. The percentage methylation values of the cellular mixtures at the same CpG sites were extracted to generate a corresponding mixture file. Only CpG sites that were covered by all the samples were retained in the reference and mixture files, which were then uploaded to CIBERSORTx. The CIBERSORTx signature matrix consisting of ‘barcode’ CpG sites was generated from the reference file using default CIBERSORTx parameters (300–500 barcode CpG sites were considered for each cell type). To impute the fraction of cell types within the cellular mixtures, CIBERSORTx was run using 1000 permutations. Deconvolution accuracy was determined by Spearman’s rank correlation between CIBERSORTx predicted and expected proportions. For the lysed blood samples, the expected proportions were measured using flow cytometry, and for the artificial mixtures, the expected proportions were the proportions at which the two cell types were admixed.

留言 (0)

沒有登入
gif