Deep learning in regulatory genomics: from identification to design

Since the turn of the century, genomics has emerged as a data-driven discipline [1] that aims to elucidate the function of all nucleotide sequences using high-throughput technologies such as genome sequencing and transcriptome profiling. Deep learning is a data-driven technology that has achieved great success in the artificial intelligence community, notably in computer vision and natural-language processing (NLP) [2]. In plant biology, deep learning is starting to be used in a wide range of fields, including plant breeding [3, 4, 5, 6, 7] and fruit taste [8]. Genomics and deep learning, both being data-driven, are a natural match. Indeed, their combined use has already achieved considerable progress over the past decade in regulatory genomics [9, 10•, 11], gene expression modeling [12••, 13, 14], and cancer diagnosis [15]. We therefore focus this review on the combination of these approaches.

Regulatory genomics refers to the study of functional noncoding DNA that contributes to the regulation of gene expression. Its simplest units are transcription factor-binding sites (TFBSs) and cis-regulatory elements (CREs), which are often 5–20-bp DNA fragments recognized by a specific transcription factor (TF) protein [7]. In 2015, the pioneering work of DeepBind was the first successful application of deep learning in genomics and almost completely solved the long-standing problem of TFBS prediction [10]. Built from these basic TFBS units, larger genomic regions assembled from combinations of spaced TFBSs are called cis-regulatory modules; these include both gene-proximal promoters and distal enhancers [16]. Such elements are believed to act as master regulators of target gene expression and are naturally core objects of regulatory genomics [17]. Deep learning usually characterizes promoters and enhancers by modeling their associated epigenomic signals, including chromatin accessibility and histone modifications [11, 18].

A considerable number of elegant reviews have comprehensively described the fundamental network structures of deep learning, including fully connected neural networks (FCNNs), convolutional neural networks (CNNs), and recurrent neural networks, and have shown how to apply these modeling approaches to regulatory genomics problems [1, 4, 9, 19, 20, 21]. For example, DeepBind, the first successful case, took short DNA fragments (of varying lengths, 14–101 bp) as inputs and their binding intensities as outputs to learn the adjustable parameters of the filter matrices in the CNN layer and the weight matrices in the FCNN layer [10]. However, recent advances point to two important emerging research trends: (i) accurate modeling of more complex inputs, namely very long regulatory DNA sequences, which requires more complex model architectures assembled from modules or blocks [12••, 22•]; and (ii) increased attention to the biological interpretability of the models [23•, 24••, 25, 26]. Interpretability helps to define the critical nucleotide bases with regulatory effects, which are ideal targets for downstream bioengineering applications such as drug targets in humans [27] and breeding-by-editing in plants [4].
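To make the CNN-plus-FCNN idea concrete, the following is a minimal sketch of the forward pass of a DeepBind-style model: one-hot encoding of the DNA input, a convolutional scan with motif-like filters, rectification, global max pooling, and a single linear output layer. It is an illustration under stated assumptions rather than the published DeepBind implementation; all function names, the toy filter, and the parameter values are hypothetical, and the parameters are set by hand here instead of being learned from binding intensities.

```python
BASES = "ACGT"  # channel order assumed for the one-hot encoding


def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Unknown bases such as N become all-zero rows.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv_scan(x, filt):
    """Slide one filter (a width x 4 matrix) along the encoded sequence.

    Returns one score per window; the list is empty if the sequence
    is shorter than the filter.
    """
    width = len(filt)
    scores = []
    for start in range(len(x) - width + 1):
        score = 0.0
        for k in range(width):
            for c in range(4):
                score += filt[k][c] * x[start + k][c]
        scores.append(score)
    return scores


def predict(seq, filters, thresholds, fc_weights, fc_bias):
    """Forward pass: convolution -> rectification -> max pooling -> linear layer."""
    x = one_hot(seq)
    features = []
    for filt, threshold in zip(filters, thresholds):
        # Rectification: keep only window scores above the filter's threshold.
        activations = [max(0.0, s - threshold) for s in conv_scan(x, filt)]
        # Global max pooling: the best-matching window represents the filter.
        features.append(max(activations, default=0.0))
    # Fully connected output layer (a single linear unit in this sketch).
    return fc_bias + sum(w * f for w, f in zip(fc_weights, features))


# Hypothetical hand-set parameters: one filter that rewards the 4-mer "TGAC".
toy_filter = [
    [0.0, 0.0, 0.0, 1.0],  # position 1 prefers T
    [0.0, 0.0, 1.0, 0.0],  # position 2 prefers G
    [1.0, 0.0, 0.0, 0.0],  # position 3 prefers A
    [0.0, 1.0, 0.0, 0.0],  # position 4 prefers C
]
filters, thresholds = [toy_filter], [2.0]
fc_weights, fc_bias = [0.5], 0.1

with_motif = predict("AATGACAA", filters, thresholds, fc_weights, fc_bias)
without_motif = predict("AAAAAAAA", filters, thresholds, fc_weights, fc_bias)
print(with_motif, without_motif)
```

With these toy parameters, the sequence containing the motif receives a higher predicted score than the one without it. In a trained model, the filter and weight matrices would instead be fitted to measured binding intensities, and because the convolution and pooling steps do not depend on input length, the same parameters apply to fragments of varying length.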

Here, we first review recent innovations in model architecture for deep learning in regulatory genomics, and then summarize existing biological interpretability methods for the identification of CREs. Finally, we discuss how to employ deep learning models and interpretability methods as we move from identification to design of genomic regulatory elements.
