Enricherator: A Bayesian method for inferring regularized genome-wide enrichments from sequencing count data

Understanding protein-DNA interactions and their impacts on biological processes requires knowledge of where along the genome these interactions occur and how strong they are. A natural extension of knowing where a protein interacts with a genome is the question of how changes in genotype or environmental conditions lead to changes in protein-DNA interactions. Although ChIP-seq has provided researchers with the data necessary to identify regions of the genome bound by a protein, current computational methods struggle to provide accurate estimates of binding enrichment and of confidence in those estimates (for reviews, see [1, 5]). Difficulties in calculating enrichments frustrate downstream analyses for which accurate estimates of enrichment and confidence are necessary, such as peak calling and differential binding analysis. Furthermore, current state-of-the-art computational tools for peak calling and for comparison of protein occupancy across conditions of interest often make assumptions about peak shape, with the result that no one tool performs robustly across the wide variety of peak shapes encountered in ChIP-seq data collected from various types of proteins (for review, see [5]). We assert that many of the problems noted above arise from the lack of a statistical framework that properly handles the physical structure of sequencing fragments represented by ChIP-seq data and the discrete nature of the count data inherent in using sequencing technologies, as well as from the interpretive issues that arise from performing these analyses in the context of an entire genome.

We developed a computational tool called Enricherator to estimate enrichments of targeted/enriched genomic loci and confidence in those enrichments. Enricherator accomplishes this goal in the following ways: 1) using a negative binomial likelihood to appropriately handle sequencing count data, 2) locally pooling information based on sequencing library fragment sizes in input and enrichment sequencing data, 3) applying a shrinkage prior to enrichment scores to moderate potential false positives due to the large number of loci considered, 4) explicitly fitting a term to allocate sequencing fragments between signal and noise, and 5) using variational Bayes to enable inference of genome-wide point estimates and credible intervals of enrichment with reasonable computing requirements. Enricherator is not a peak caller. Rather, it provides high-quality estimates of enrichment that can be used to perform downstream analyses such as peak calling. The statistical model used by Enricherator was inspired by the approach to RNA-seq data analysis implemented in the popular DESeq2 package [6]. A preliminary version of Enricherator was applied to RNA:DNA hybrid pulldown data in [3] after we found that existing ChIP-seq analysis workflows were unsuitable for our needs. Since that publication, we have made the following major methodological improvement to Enricherator: sequencing fragment counts are now treated as having been drawn from a mixture model, with the first component of the mixture representing the sequencing fragments of biological interest (signal), and the second component representing sequencing fragments from noise.
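To make the likelihood described above concrete, the following is a minimal Python sketch of a negative binomial log-probability in the mean/dispersion parameterization used by DESeq2-style models, combined into a two-component signal/noise mixture. It is an illustration only and is not taken from the Enricherator source code: the function names, the shared dispersion `alpha`, the mixture weight `w`, and the example means are all assumptions made for this sketch, and the pooling and shrinkage-prior parts of the model are omitted.

```python
import math


def nb_logpmf(k, mu, alpha):
    """Log-probability of count k under a negative binomial with mean mu
    and dispersion alpha, so that variance = mu + alpha * mu**2."""
    r = 1.0 / alpha  # "size" parameter of the negative binomial
    return (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )


def mixture_loglik(k, mu_signal, mu_noise, alpha, w):
    """Log-likelihood of count k under a two-component mixture.

    Hypothetical form: with probability w the fragments come from the
    signal component (mean mu_signal), otherwise from the noise component
    (mean mu_noise). Requires 0 < w < 1.
    """
    a = math.log(w) + nb_logpmf(k, mu_signal, alpha)
    b = math.log(1.0 - w) + nb_logpmf(k, mu_noise, alpha)
    m = max(a, b)  # log-sum-exp for numerical stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))


# Illustrative values only: a locus with 40 fragments, where the signal
# component expects 4-fold more fragments than the noise component.
example = mixture_loglik(k=40, mu_signal=40.0, mu_noise=10.0, alpha=0.1, w=0.7)
```

In a full model, a quantity such as `mu_signal` would be tied to the input abundance and a log-scale enrichment parameter at each locus; that link is left out here because its exact form is not described in this section.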

We have applied Enricherator to HBD-seq, RNA polymerase ChIP-seq, and other ChIP-seq datasets, and have found that it improves our enrichment inferences in three major ways. First, Enricherator decreases false positive identifications of enriched regions of the genome, likely through its combination of signal-to-noise allocation, usage of a shrinkage prior on enrichment estimates, and pooling of information across nearby genomic loci. Second, because Enricherator uses Bayesian methods, it enables direct calculation of contrasts between conditions of interest and of confidence in those contrasts, a feature we have found extremely useful in comparing RNA:DNA hybrid enrichments across genotypes [3] and in comparing protein occupancy across different environmental conditions or genotypes. Third, Enricherator provides evidence ratios (the Bayesian analog of a test statistic used in frequentist hypothesis testing) indicating the strength of evidence for enrichment above a user-defined value, which we use to perform fully Bayesian peak calling. We envision Enricherator being broadly useful to the biology community as a method of first choice to estimate enrichments genome-wide. Therefore, we provide a publicly available containerized version of Enricherator. The source code for Enricherator, instructions for pulling the container from Singularity Container Services, and instructions for its use can be found at its GitHub repository (https://github.com/jwschroeder3/enricherator).
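As an illustration of how contrasts and evidence ratios can be computed from posterior draws, the sketch below assumes one common definition of an evidence ratio: the posterior odds that a quantity exceeds a threshold, estimated as the number of draws above the threshold divided by the number at or below it. Whether Enricherator uses exactly this definition is not stated here, and the function names and example values are hypothetical.

```python
def contrast_draws(draws_a, draws_b):
    """Posterior draws of the difference in log-enrichment between two
    conditions, computed draw by draw (assumes paired draws)."""
    return [a - b for a, b in zip(draws_a, draws_b)]


def evidence_ratio(draws, threshold):
    """Posterior odds that the quantity exceeds `threshold`.

    Assumed definition: (# draws > threshold) / (# draws <= threshold).
    Returns infinity when every draw exceeds the threshold.
    """
    above = sum(1 for d in draws if d > threshold)
    below = len(draws) - above
    if below == 0:
        return float("inf")
    return above / below


# Illustrative draws of log2 enrichment at one locus in two conditions.
wild_type = [1.8, 2.1, 2.4, 1.9, 2.2]
mutant = [0.9, 1.2, 1.0, 1.6, 0.7]

# Evidence that wild-type enrichment exceeds 2-fold (log2 value of 1).
er_enriched = evidence_ratio(wild_type, threshold=1.0)

# Evidence that enrichment is higher in wild type than in the mutant.
er_contrast = evidence_ratio(contrast_draws(wild_type, mutant), threshold=0.0)
```

A locus could then be called as enriched when its evidence ratio exceeds a user-chosen cutoff, which is one way a peak-calling step might be built on top of the enrichment estimates.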
