Species-aware DNA language models capture regulatory elements and their evolution

Using language models as an alignment-free method to capture the conservation and evolution of regulatory elements

Sequence alignment is a well-known and highly effective method to study the evolution of biological sequences. It can be used to detect homologies, find conserved subsequences, and pinpoint sequence motifs. We first assessed whether sequence alignments could be a viable starting point to capture conserved regulatory sequence elements in 3′ regions across large evolutionary distances. To this end, we aligned annotated 3′ untranslated regions (UTRs) and, as a control, coding sequences of Saccharomyces cerevisiae to the genomes of a variety of fungal species (Fig. 1A). Coding sequences could be successfully aligned even between highly diverged species. In contrast, the ability to align 3′ regions almost completely disappeared beyond the Saccharomyces genus.
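A minimal sketch of how such an alignability comparison could be set up with command-line BLAST is shown below; the file names, target genome, and E-value cutoff are illustrative assumptions rather than the exact pipeline used here.

```python
# Hedged sketch (not the exact pipeline): count how many S. cerevisiae CDS vs. 3' UTR
# queries find a BLAST hit in another fungal genome. File names and the E-value
# cutoff are placeholder assumptions.
import subprocess

def count_queries_with_hits(query_fasta: str, genome_fasta: str, evalue: float = 1e-3) -> int:
    """Build a nucleotide BLAST database for the target genome, then count distinct queries with >= 1 hit."""
    subprocess.run(
        ["makeblastdb", "-in", genome_fasta, "-dbtype", "nucl", "-out", "target_db"],
        check=True,
    )
    result = subprocess.run(
        ["blastn", "-query", query_fasta, "-db", "target_db",
         "-evalue", str(evalue), "-outfmt", "6 qseqid"],
        check=True, capture_output=True, text=True,
    )
    # Tabular output holds one line per alignment; deduplicate to count aligned queries.
    return len(set(result.stdout.split()))

if __name__ == "__main__":
    cds_hits = count_queries_with_hits("scer_cds.fa", "target_genome.fa")
    utr_hits = count_queries_with_hits("scer_3utr.fa", "target_genome.fa")
    print(f"CDS queries with a hit: {cds_hits}; 3' UTR queries with a hit: {utr_hits}")
```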

Fig. 1

Masked language modeling can serve as an alternative to alignments, which struggle to capture the conservation of regulatory elements over large evolutionary distances. A BLAST hits of S. cerevisiae CDS and 3′ UTRs in other fungal species. B Regions 3′ of the stop codon of CBP3 orthologues in different fungal species. Instances of Puf3-like motifs (TGTA*ATA) are indicated in red, and a star indicates experimental evidence of Puf3 binding. Puf3 binding appears to be conserved, whereas the location of the motif is not. C Masked language models are neural networks trained to reconstruct masked nucleotides from context. We illustrate this with the example of a Puf3 motif (TGTAAATA), where the second-to-last T has been hidden. Since this motif is highly conserved, the model may learn that, given this context, a T is most likely. For each masked nucleotide, the model returns a probability distribution over the letters A, C, G, and T. We can extract sequence representations from the model by pooling the hidden representations of the last four layers. The architecture of the LM corresponds to DNABERT [14], with the modification that we make the model species-aware by providing a token denoting the species of origin of the sequence. D We train language models on hundreds of highly diverged fungi. In each genome, we locate the annotated coding sequences and extract the non-coding sequences immediately before the start codon (5′ region) and immediately after the stop codon (3′ region). We train separate models for each region. Each model is trained on more than 10 million sequences

Nevertheless, many regulatory sequence elements of 3′ UTRs are conserved far beyond the genus boundary. The 3′ region of the cytochrome b assembly factor CBP3 illustrates this. Experiments have shown that the RNA-binding protein Puf3 binds the 3′ UTR of this gene in S. cerevisiae and in the far-diverged (about 500 Mya [20]) Neurospora crassa [21]. Moreover, we found the Puf3 consensus binding motif 3′ of the stop codon of the CBP3 homolog in almost all yeasts. However, we observed that the motif appears to be highly mobile, complicating alignment (Fig. 1B). This example illustrates the need and potential for alignment-free approaches to leverage conservation across large evolutionary distances.

In principle, masked language models should be able to leverage evolutionary information without requiring alignment, because their representation of sequence is more flexible and expressive, relaxing the rigid order constraints that sequence alignments impose. Specifically, we expect that nucleotides with regulatory function should be more conserved, within and particularly across species, and therefore easier to reconstruct when masked than the remaining background non-coding sequence.
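As a concrete illustration, the hedged sketch below shows how masked-nucleotide reconstruction could be queried for a Puf3-like motif. The checkpoint name, the single-nucleotide tokenization, and the way the species token is passed are assumptions about the interface, not the implementation used in this work.

```python
# Minimal sketch of masked-nucleotide reconstruction with a BERT-style DNA LM.
# The checkpoint name is hypothetical; the paper's model follows DNABERT plus a
# species token, so the real tokenization and input format may differ.
from transformers import pipeline

MODEL_NAME = "example/species-aware-3prime-lm"   # hypothetical checkpoint
fill_mask = pipeline("fill-mask", model=MODEL_NAME)

species = "saccharomyces_cerevisiae"             # assumed species token of the sequence
context = list("ACGTTGTAAATAGGCA")               # 3' region containing a Puf3-like motif

# Hide the second-to-last nucleotide of the motif (the "T" in TGTAAATA).
masked_index = "".join(context).index("TGTAAATA") + 6
context[masked_index] = fill_mask.tokenizer.mask_token

# The model returns a probability for each candidate nucleotide at the masked slot.
for prediction in fill_mask(" ".join([species] + context), top_k=4):
    print(prediction["token_str"], round(prediction["score"], 3))
```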

Before masked language models were applied in genomics, other methods addressed the lack of alignments for non-coding sequences. Generally, these methods either make use of motif representations learned from experimental data in model organisms to investigate genomes of other species [22, 23] or are based on k-mer enrichments [24,25,26,27,28]. The language modeling approach is similar to the latter, as learning to predict masked nucleotides from context requires capturing subtle patterns of co-occurrence. However, in contrast to these approaches, language models implicitly model such patterns using flexible, nonlinear functions, which enables them to learn informative representations of their input. Moreover, the attention mechanism guides the model towards the parts of the input relevant for a particular prediction. This, in theory, allows capturing high-order dependencies between nucleotides or motifs without requiring a tabulation of all possible dependencies or explicit assumptions regarding their shape and nature.

To explicitly test this, we trained masked LMs (Fig. 1C) on non-coding regions adjacent to genes extracted from a vast multispecies dataset comprising 806 fungal species separated by more than 500 million years of evolution. We trained distinct models for the 1000 nucleotides 5′ of the start codon (5′ region) and the 300 nucleotides 3′ of the stop codon (3′ region) of all annotated coding sequences (Fig. 1D). The 5′ region typically contains the 5′ UTR and the promoter of the gene [29]. The 3′ region typically contains the 3′ UTR [30]. Hence, we expect these regions to be enriched for non-coding sequences and to capture both transcriptional regulatory elements and post-transcriptional regulatory elements involved in mRNA stability and translation. Importantly, no annotation of UTRs, promoters, transcription start sites, or polyA sites was provided to the model. Models were trained with (species LM) or without (agnostic LM) species tokens in the input. However, we provided no information about the phylogeny of the species, nor did we indicate to the models which sequences flank orthologous genes. The models were trained on all extracted sequences jointly. We focused on fungi because many fungal genomes are available, they evolve quickly, and their transcriptional control generally makes less use of the extreme long-range interactions that are difficult to model with current approaches. We held out the entire Saccharomyces genus to test the generalization performance of our model on the well-studied species Saccharomyces cerevisiae.
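A simplified sketch of how these flanking regions could be extracted from a genome FASTA and a GFF3 annotation is given below. It treats each CDS line as a single gene and uses placeholder file names, so it approximates rather than reproduces the actual preprocessing.

```python
# Hedged sketch: extract the 1000 nt upstream of the start codon (5' region) and the
# 300 nt downstream of the stop codon (3' region) for each annotated CDS.
# Simplification: each CDS feature line is treated as one gene; a real pipeline would
# group exons per transcript. File names are placeholders.
from pyfaidx import Fasta

FIVE_PRIME_LEN, THREE_PRIME_LEN = 1000, 300

genome = Fasta("species_genome.fa")
five_prime, three_prime = [], []

with open("species_annotation.gff3") as gff:
    for line in gff:
        if line.startswith("#"):
            continue
        chrom, _, feature, start, end, _, strand, *_ = line.rstrip("\n").split("\t")
        if feature != "CDS":
            continue
        start, end = int(start), int(end)
        if strand == "+":
            up = genome[chrom][max(0, start - 1 - FIVE_PRIME_LEN):start - 1].seq
            down = genome[chrom][end:end + THREE_PRIME_LEN].seq
        else:  # reverse strand: take the reverse complement of the flanks
            up = genome[chrom][end:end + FIVE_PRIME_LEN].reverse.complement.seq
            down = genome[chrom][max(0, start - 1 - THREE_PRIME_LEN):start - 1].reverse.complement.seq
        five_prime.append(up)     # sequence 5' of the start codon
        three_prime.append(down)  # sequence 3' of the stop codon
```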

Language models reconstruct known binding motifs in an unseen species

Binding motifs, which represent the sequence specificities of DNA- and RNA-binding proteins (RBPs), are considered to be the “atomic units of gene expression” [31]. Accordingly, to assess whether LMs can capture aspects of the regulatory code in an alignment-free manner, we first needed to verify that they capture important known motifs.

To test this, we analyzed to what extent our 3′ and 5′ LMs could reconstruct nucleotides in the held-out species S. cerevisiae. We compared the reconstruction obtained by our LMs with a number of baselines, including approaches based on k-mer frequencies and alignment of species beyond the genus. We also computed the reconstruction achieved by aligning S. cerevisiae with other Saccharomyces species, which can be regarded as an estimate of the upper range of achievable reconstruction.

We found that all approaches—except the alignment of close species—perform similarly when we compute the reconstruction accuracy over all nucleotides (Fig. 2A and Additional file 1: Fig S1). This changes drastically when considering the reconstruction of known motifs. Instances matching the Puf3 consensus motif, for example, are reconstructed almost as well by the species-aware 3′ model as by the alignment of close species (Fig. 2A), strongly outperforming the alignment with distant species and the other baselines. We observe similar results for other RBP motifs, as well as for the consensus motifs of a number of transcription factors (TFs) in the 5′ region (Additional file 1: Fig S1). Remarkably, by applying Modisco clustering [32] to all nucleotides weighted by their information content as computed by our LMs, we were able to recover some of these motifs de novo (Fig. 2B, Additional file 1: Fig S2). This approach appears to recover a similar number of known motifs as STREME [33], a dedicated motif-finding tool (Additional file 1: Motif Discovery), although developing a general-purpose motif finder based on the LM was beyond the scope of this paper.
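The comparison of motif-level and overall reconstruction accuracy, including a binomial test of the kind used to flag significant differences, could be computed along the lines of the sketch below; the sequences and predictions are toy stand-ins for the per-nucleotide LM outputs.

```python
# Hedged sketch: compare reconstruction accuracy on nucleotides inside consensus-motif
# matches (e.g., Puf3's TGTA.ATA) with accuracy over all nucleotides, and test whether
# the LM beats a baseline with a binomial test. Arrays are toy stand-ins.
import re
import numpy as np
from scipy.stats import binomtest

def motif_positions(seq: str, motif_regex: str) -> np.ndarray:
    """Boolean mask over the sequence marking nucleotides inside motif matches."""
    mask = np.zeros(len(seq), dtype=bool)
    for m in re.finditer(motif_regex, seq):
        mask[m.start():m.end()] = True
    return mask

def accuracy(pred: np.ndarray, true: np.ndarray, mask=None) -> float:
    if mask is not None:
        pred, true = pred[mask], true[mask]
    return float((pred == true).mean())

seq = "ACGTTGTAAATACCGT"
true = np.array(list(seq))
lm_pred = np.array(list("ACGTTGTAAATACAGT"))        # hypothetical LM reconstructions
baseline_pred = np.array(list("ACGTTGTCAATACAGT"))  # hypothetical k-mer baseline

puf3 = motif_positions(seq, "TGTA.ATA")
print("LM accuracy, Puf3 sites:", accuracy(lm_pred, true, puf3))
print("LM accuracy, all sites:", accuracy(lm_pred, true))

# Binomial test: are the LM's successes at motif positions compatible with the
# baseline's accuracy at those positions, used as the null success probability?
n = int(puf3.sum())
k = int((lm_pred[puf3] == true[puf3]).sum())
p_null = accuracy(baseline_pred, true, puf3)
print(binomtest(k, n, p_null, alternative="greater"))
```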

Fig. 2

Language models reconstruct likely regulatory sequences in a held-out species and recover known binding motifs. A Reconstruction accuracy for nucleotides within instances of RNA-binding protein consensus motifs and across all nucleotides in S. cerevisiae 3′ UTR sequences (those longer than 300 bp have been truncated). We compare the agnostic and species 3′ LM to a variety of baselines. The dashed line represents the accuracy achieved by the intra-genus alignment. Star indicates that the species LM significantly (P < 0.05, binomial test) outperforms the best baseline. For Puf3 and Whi3, Modisco clustering on the species LM reconstructions recovers the motif (depicted above the respective plots). B A sample of known transcription factor motifs recovered by applying Modisco clustering to the 5′ species LM reconstructions, (manually) matched to the respective high-confidence PWM from the YeTFaSCo [34] database

For many of the motifs we tested, the species-aware language models reconstructed motif instances slightly better than their already strong agnostic counterparts. In sum, our analysis clearly demonstrates that language models trained on a large compendium of highly diverged genomes are (1) able to learn conserved regulatory elements and (2) able to transfer this knowledge to unseen species.

Context-sensitive reconstruction of motifs is predictive of in vivo binding

It has been shown for many genomes that only a fraction of the instances of a particular motif is functional, i.e., binds the respective protein in vivo [35,36,37]. This is because, among other reasons, binding also depends on the context of the motif. Therefore, we next sought to evaluate whether our models can recognize these relationships.

An important piece of context for many motifs is their position with respect to certain genomic landmarks. One example of this is that RBP motifs can only be functional if they are located in a transcribed region. Another example is that, in yeast, TF binding sites tend to be located relatively close to the transcription start site (TSS) [38]. As a first test of whether our LMs are capable of locating these genomic landmarks—despite their location not having been indicated during training—we computed the actual and predicted nucleotide biases as a function of the distance to the TSS (imputed using CAGE data [39]) and of the distance to the end of the 3′ UTR [30]. In both cases, we observed that the LMs track local changes in nucleotide biases (Fig. 3A, Additional file 1: Fig S3).
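The observed versus predicted nucleotide biases could be tabulated as in the sketch below, assuming the 5′ regions have been anchored at the TSS and the per-position LM probabilities are available; the data here are random placeholders.

```python
# Sketch with assumed inputs: compute observed vs. LM-predicted nucleotide frequencies
# per position relative to the TSS. `seqs` are TSS-anchored 5' regions and `probs`
# holds the LM's per-position probability distributions over A/C/G/T.
import numpy as np

def observed_bias(sequences: list[str]) -> np.ndarray:
    """Per-position observed frequencies of A, C, G, T (positions x 4)."""
    arr = np.array([list(s) for s in sequences])
    return np.stack([(arr == b).mean(axis=0) for b in "ACGT"], axis=1)

def predicted_bias(predicted_probs: np.ndarray) -> np.ndarray:
    """Average the LM's predicted distributions across sequences (positions x 4)."""
    return predicted_probs.mean(axis=0)

# Toy data: 100 sequences of length 50 with random 'predictions'.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGT"), size=50)) for _ in range(100)]
probs = rng.dirichlet(np.ones(4), size=(100, 50))
obs, pred = observed_bias(seqs), predicted_bias(probs)
print(obs.shape, pred.shape)  # (50, 4) each; compare column-wise along the TSS axis
```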

Fig. 3

Reconstruction of motifs depends on the context and predicts whether a motif instance will be bound in vivo. A Actual and predicted (by the 5′ species LM) nucleotide biases as a function of the distance to the TSS (imputed using CAGE data). The model keeps track of local variations in nucleotide biases. B Reconstruction fidelity (log-likelihood of the individual observed nucleotides, averaged per motif instance, according to the 3′ species LM) of instances of the Puf3 motif (TGTAAATA), as a function of the distance to the end of the annotated 3′ UTR. The predictions of the model for masked nucleotides are indicated for two instances of the motif (blue circles). Reconstruction fidelity is notably degraded beyond the 3′ UTR end (P = 2.2 × 10⁻¹⁵, Mann–Whitney U). C ROC curve evaluating to what extent the reconstruction fidelity of our 3′ LMs, as well as the phastCons conservation score, can serve as a predictor of whether a Puf3 motif instance is within or beyond the 3′ UTR boundary. The LMs greatly outperform the conservation score. D Reconstruction fidelity (log-likelihood of the observed nucleotides according to the 5′ species LM) of instances of the Tbf1 consensus motif (ARCCCTA), as a function of the distance to the closest 3′ TSS (imputed using CAGE data). Blue indicates that the motif instance was bound in vivo according to ChIP-exo data. Motif instances located around −100 to −250 nt from the TSS are better reconstructed than those further away or in the 5′ UTR (P = 1.2 × 10⁻¹¹, Mann–Whitney U). E ROC curve evaluating to what extent the reconstruction fidelity of our 5′ LMs, as well as the phastCons conservation score and an expert-curated PWM, can serve as a predictor of whether a Tbf1 motif instance is bound in vivo. The LMs again greatly outperform the alternative methods

To explicitly test whether the 3′ LM locates and accounts for genomic landmarks when reconstructing motifs, we compared the reconstruction of instances of the Puf3 consensus motif located within annotated 3′ UTR regions to that of instances located beyond the 3′ UTR yet still within 300 bp of the stop codon. We found that motif instances within the annotated 3′ UTR are reconstructed significantly better (Fig. 3B–C), consistent with the function of RBPs. In contrast, the phastCons [40] score, an alignment-based measure of conservation, appeared to be a poor predictor of whether a Puf3 site falls within a 3′ UTR. We repeated this analysis for the other 3′ UTR motifs, finding similar results (Additional file 1: Fig S4A, B). For a TF, we expected this relationship to be reversed, and indeed the E-box motif was reconstructed better when found outside of 3′ UTR regions (Additional file 1: Fig S4C).
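The within-versus-beyond-UTR comparison and the corresponding ROC analysis could be implemented as in the sketch below, where the per-instance reconstruction fidelities are toy stand-ins for scores derived from the LM's per-nucleotide log-probabilities.

```python
# Hedged sketch of the motif-context comparison: average per-instance log-likelihoods
# of Puf3 matches inside vs. beyond the annotated 3' UTR, test the difference
# (Mann-Whitney U), and use the score as a within/beyond classifier (ROC AUC).
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

def instance_loglik(per_nt_logprob: np.ndarray, start: int, length: int = 8) -> float:
    """Mean log-likelihood of the observed nucleotides across one motif instance."""
    return float(per_nt_logprob[start:start + length].mean())

# Toy stand-ins for per-instance scores (in practice produced with instance_loglik).
rng = np.random.default_rng(1)
within_utr = rng.normal(-0.7, 0.4, size=60)   # hypothetical: better reconstructed
beyond_utr = rng.normal(-1.4, 0.4, size=40)   # hypothetical: worse reconstructed

stat = mannwhitneyu(within_utr, beyond_utr, alternative="greater")
print("Mann-Whitney U p-value:", stat.pvalue)

labels = np.concatenate([np.ones_like(within_utr), np.zeros_like(beyond_utr)])
scores = np.concatenate([within_utr, beyond_utr])
print("ROC AUC (fidelity as within-UTR predictor):", roc_auc_score(labels, scores))
```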

Having shown that the models did not simply learn a mere lexicon of over-represented motifs, but instead seemed to account for the context in which motif instances occur, we next asked whether the reconstruction fidelity could predict whether a motif is bound in vivo. To this end, we compared the reconstruction of Puf3 motif instances located 3′ of genes that have been experimentally verified to bind Puf3p [41] to that of instances without verified binding. Strikingly, we found that the reconstruction fidelity serves as a predictor of whether a gene containing a Puf3 consensus motif is bound by Puf3p—despite our models never having been exposed to binding data (Additional file 1: Fig S5).

We repeated a similar analysis for Tbf1 (Fig. 3D) and a variety of other transcription factors (Additional file 1: Fig S6, S7). Specifically, we evaluated the reconstruction of the consensus binding motifs of these TFs as a function of their distance to the closest upstream TSS. We observed that reconstruction improved when the motifs were located at a biologically plausible distance [38] from the TSS. Moreover, the reconstruction fidelity is highly predictive of in vivo binding of motif instances as measured by ChIP-exo [42], outperforming the phastCons score and, in some cases, expert-curated PWMs constructed using binding data (Fig. 3E, Additional file 1: Fig S6, S7) [34].

Certain motifs are known to be associated with specific groups of co-regulated genes. An example in S. cerevisiae is the Rap1 motif, which is found primarily in the promoters of ribosomal protein genes [43, 44]. Accordingly, we find that instances of the Rap1 consensus motif tend to be better reconstructed if they are found within 1 kb 5′ of a ribosomal protein gene (Additional file 1: Fig S8A). In other words, the reconstruction of the Rap1 motif serves as an indirect predictor of whether a gene belongs to the ribosomal protein module. We performed a similar analysis for the RRPE motif, which is primarily found near genes involved in ribosome biogenesis, and obtained similar results (Additional file 1: Fig S8B).

Overall, this analysis demonstrates that LMs do not just learn a lexicon of conserved motifs, but additionally pick up on correlations between motifs and their context that are predictive of whether motif instances are bound in vivo. Notably, this is achieved purely from genomic sequences, without requiring any additional experimental data during training. This suggests that the attention mechanism provides an effective way to integrate motif interactions, although it is difficult to disentangle exactly which interactions are learned. Finally, the ability to outperform the phastCons conservation score shows the advantages of an alignment-free approach.

Language models account for changes in the regulatory code between species

Our previous analyses have focused on the held-out species S. cerevisiae. However, one of the main use cases we envision for genomic LMs is to serve as a method to explore understudied species. We thus analyzed how the LMs perform when evaluated across fungal species.

In Fig. 1B, we showed that the Puf3 motif, but not its location, is conserved in the 3′ regions of CBP3 homologs. We applied our 3′ model to these sequences and found that, in most species, the Puf3 motif tends to be well reconstructed compared to the background (Fig. 4A). We next applied Modisco clustering to the predictions of the model on the 3′ regions of all CBP3 homologs in our dataset. We recovered the Puf3 motif, as well as two versions of the Puf4 motif (Additional file 1: Fig S9A) [45]. This indicates that our method allows motif discovery not just across genes in one organism, but also for individual genes across organisms.

Fig. 4

LMs can trace the movement and disappearance of motifs across species and they account for the evolution of the transcription initiation mechanism. A We applied the 3′ species LM to the 3′ region of CBP3 homologs in a number of fungal species (compare Fig. 1B). Darker color indicates that the model assigns a higher probability to the correct nucleotide at that position. In most species, the Puf3 motif instance (delineated with black bars) is reconstructed notably better than the remaining nucleotides. Star indicates that this difference in reconstruction is significant (P < 0.05, Mann–Whitney U). Species with gray background were held out during LM training. B We computed the reconstruction fidelity (log-likelihood) achieved by the 5′ species LM for instances of the S. cerevisiae consensus Rap1 motif (CAYCCRTACAY) and for instances matching shuffled versions of this motif in 60 fungal species. The difference in reconstruction between the true and shuffled motif instances, expressed as log2 fold change, is plotted against the −log10 P-value of this difference, computed using a Mann–Whitney U test. We observe that in species that have no BLAST match to S. cerevisiae Rap1p, the reconstruction fidelity of the S. cerevisiae Rap1 motif is generally not much better than that of shuffled versions thereof, indicating that the model correctly accounts for species context when reconstructing motifs. C Reconstruction fidelity (log-likelihood of the observed nucleotides according to the 5′ species LM) of instances of the TATA box (TATAWAWR), as a function of the distance to the closest 3′ TSS (imputed using CAGE data). Positive values indicate that the TATA box instance is located in the 5′ UTR. We observe that in S. pombe, the TATA box is best reconstructed when located at about −30 bp relative to the TSS. In S. cerevisiae, which uses a scanning mechanism to initiate transcription and therefore allows more flexible positioning of the TATA box, the model reconstructs TATA boxes well overall, but somewhat better when located 50 to 120 bp 5′ of the TSS

Over evolutionary timescales, certain motifs may change drastically or disappear entirely, particularly if the binding protein evolves or is lost. As an example, we considered Rap1p, a conserved protein known to control telomere length [46]. However, as noted previously, Rap1p additionally acts as a regulator of ribosomal protein expression in certain parts of the yeast lineage, a change that is associated with the acquisition of a transactivation domain by Rap1p [44]. To determine whether our models can reflect such changes, we analyzed the reconstruction of the S. cerevisiae Rap1 motif in 60 fungal species. Specifically, we computed for each species the difference in reconstruction between instances of the consensus motif and instances corresponding to shuffled versions thereof. This procedure controls for GC content and for general differences in reconstruction fidelity between species. We found that in species close to S. cerevisiae, the motif instances are reconstructed significantly better than the shuffled instances (Fig. 4B). By contrast, in species where we could not find a significant BLAST match to S. cerevisiae Rap1p, we observed no such enrichment. An example is Y. lipolytica, a species known to lack a Rap1p homolog [47]. We performed a similar analysis for two other motifs, finding consistent results (Additional file 1: Fig S9B–C).
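The per-species motif-versus-shuffled comparison could be sketched as follows, with the per-instance fidelities taken here as mean probabilities of the observed nucleotides (positive values, so the log2 fold change is well defined); the inputs are hypothetical stand-ins.

```python
# Hedged sketch: compare reconstruction fidelity of true consensus-motif instances with
# instances matching shuffled versions of the motif, per species, reporting a log2 fold
# change and a one-sided Mann-Whitney U p-value. Letter shuffling preserves composition
# and thus controls for GC content.
import random
import numpy as np
from scipy.stats import mannwhitneyu

def shuffled_motifs(motif: str, n: int = 10, seed: int = 0) -> list[str]:
    """Generate n distinct letter-shuffled versions of a consensus motif."""
    rng = random.Random(seed)
    variants = set()
    while len(variants) < n:
        letters = list(motif)
        rng.shuffle(letters)
        candidate = "".join(letters)
        if candidate != motif:
            variants.add(candidate)
    return sorted(variants)

def motif_vs_shuffled(true_scores: np.ndarray, shuffled_scores: np.ndarray):
    """log2 fold change of mean fidelity and one-sided Mann-Whitney U p-value."""
    lfc = float(np.log2(true_scores.mean() / shuffled_scores.mean()))
    p = mannwhitneyu(true_scores, shuffled_scores, alternative="greater").pvalue
    return lfc, p

# Toy example with hypothetical per-instance fidelities in one species.
rng = np.random.default_rng(4)
true_scores = rng.uniform(0.4, 0.9, size=80)
shuffled_scores = rng.uniform(0.2, 0.6, size=400)
print(shuffled_motifs("CAYCCRTACAY", n=3))
print(motif_vs_shuffled(true_scores, shuffled_scores))
```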

In addition to motif evolution, the proper context of motifs can change as well. A famous example is the positional constraint on the TATA box. In most eukaryotes, including the Schizosaccharomyces yeasts, the TATA box is preferentially located about 30 bp 5′ of the TSS [29]. Budding yeasts, however, use a scanning mechanism to initiate transcription; the position of the TATA box in these species is therefore more flexible, but usually falls between 50 and 120 bp 5′ of the TSS [29]. To test whether our model correctly recapitulates these constraints, we located all instances of TATAWAWR in the respective species. We then assessed how well the model reconstructs these nucleotides as a function of the distance to the closest upstream TSS [39]. We found that in S. pombe, the reconstruction fidelity of the TATA box peaks notably around 30 bp 5′ of the TSS, whereas instances located further away or beyond the TSS are generally reconstructed poorly (Fig. 4C). In S. cerevisiae, on the other hand, TATA boxes were generally well reconstructed (likely also reflecting the AT bias of the budding yeasts), but we observed a peak in the region 120 to 50 bp 5′ of the TSS. Thus, the model applies specific constraints when reconstructing motifs in a way that reflects the evolution of the initiation code.

One feature of the species LM is that it learns a representation for each species in the training dataset. To explore what information is contained in these representations, we first performed principal component analysis (PCA) on the species representations of the 5′ species LM. We found that the first principal component, which explains about 10% of the variance, is highly correlated with the GC content of the respective species (r = −0.86, Additional file 1: Fig S10). Next, we found that species of the same taxonomic class tend to be closer together than those of different classes (Additional file 1: Fig S11, S12). This suggests that the representations encode features that at least partially correspond to the fungal taxonomy.
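A minimal version of this analysis is sketched below, assuming a matrix of learned species-token embeddings and per-species genomic GC fractions; both are random placeholders here.

```python
# Sketch with assumed inputs: project species-token embeddings with PCA and correlate
# the first principal component with per-species GC content.
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
species_embeddings = rng.normal(size=(200, 768))   # placeholder embeddings (species x dim)
gc_content = rng.uniform(0.3, 0.6, size=200)       # placeholder genomic GC fractions

pca = PCA(n_components=10)
pcs = pca.fit_transform(species_embeddings)
print("variance explained by PC1:", pca.explained_variance_ratio_[0])

r, p = pearsonr(pcs[:, 0], gc_content)
print(f"correlation of PC1 with GC content: r={r:.2f}, p={p:.2g}")
```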

Next, we investigated whether changes in motif preferences are also captured by the species representation. To this end, we considered the previously characterized differential preferences of TF motifs in nucleosome-depleted regions between S. cerevisiae and C. albicans [48, 49]. Specifically, nucleosome depletion in S. cerevisiae was associated with the Reb1 motif, whereas in C. albicans it was associated with the k-mer aCACGACc (“Tsankov” motif here and elsewhere), which to our knowledge is not known to have any role in S. cerevisiae. We compared how the reconstruction accuracy achieved by the 5′ species LM for these motifs is affected by swapping the species token. We found that with our S. cerevisiae proxy token (K. africana; S. cerevisiae itself was held out during training and thus has no token), the species LM reconstructed Reb1 instances better, both in S. cerevisiae and in C. albicans 5′ regions (Additional file 1: Fig S13). Conversely, with the C. albicans token, the C. albicans-specific Tsankov motif was reconstructed better in both species.

In conclusion, we find considerable evidence that LMs can account for changes in the regulatory code and learn meaningful motif-context relationships in a species-specific manner.

Species-aware language model representations encode biologically meaningful features and are directly predictive of many molecular phenotypes

While our previous analyses demonstrated that the reconstructions of the LMs are already very informative and could potentially be used to explore regulation in understudied species, we can also extract the learned sequence representations and leverage them to predict gene-expression-related traits in a supervised fashion. We note that this makes most sense for data-constrained tasks, such as gene-level measures of expression or half-life, where there can only be as many data points as there are unique genes or transcripts. Accordingly, to test the predictive power of the representations themselves, we selected several gene-level omics assays, including RNA half-life measurements in S. cerevisiae and S. pombe, RNA-seq-based gene expression in S. cerevisiae, and microarray measures of condition-specific gene expression for a number of yeast species [50,51,52,53,54]. We additionally included three reporter assays testing 3′ sequences [55, 56] and promoter sequences in isolation [57].

We then used our masked language models to generate sequence representations for the different sequences assayed in the respective experiments (Methods). For each task, we trained linear models in cross-validation using the LM representations as input. We used linear models specifically to ensure that the predictive power derives mostly from the LM and not from a fine-tuning procedure or a heavily engineered nonlinear fit with many tunable hyperparameters. As a baseline for what can be achieved with “naive” sequence representations, we also trained linear models on k-mer counts. Furthermore, where available, we compared against the state of the art.
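The evaluation setup could look like the sketch below: cross-validated ridge regression on LM embeddings versus k-mer counts for a gene-level target such as mRNA half-life. The embeddings, sequences, and targets are random placeholders standing in for the real data.

```python
# Hedged sketch of the downstream evaluation: ridge regression in cross-validation on
# LM sequence embeddings vs. on naive k-mer counts, predicting a gene-level target.
from itertools import product
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

def kmer_counts(seqs: list[str], k: int = 4) -> np.ndarray:
    """Count all k-mers in each sequence (naive baseline representation)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = np.zeros((len(seqs), len(kmers)))
    for row, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            j = index.get(s[i:i + k])
            if j is not None:
                counts[row, j] += 1
    return counts

# Toy stand-ins for real data.
rng = np.random.default_rng(3)
seqs = ["".join(rng.choice(list("ACGT"), size=300)) for _ in range(500)]
lm_embeddings = rng.normal(size=(500, 768))     # placeholder LM representations
half_life = rng.normal(size=500)                # placeholder target values

model = RidgeCV(alphas=np.logspace(-2, 3, 20))
for name, X in [("LM embeddings", lm_embeddings), ("4-mer counts", kmer_counts(seqs))]:
    r2 = cross_val_score(model, X, half_life, cv=5, scoring="r2")
    print(f"{name}: mean CV R^2 = {r2.mean():.3f}")
```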

We found that LM sequence representations outperform simple sequence representations based on k-mer counts across all tasks, a finding consistent with previous work [16]. Remarkably, the species LM performed best on all tasks (Fig. 5A, Additional file 1: Table S1). On mRNA half-life prediction, a task where most of the signal comes from the coding sequence, it performed on par with expert hand-crafted features (Cheng et al. [50]) (better, but not significantly so). Furthermore, simple ridge regressions trained on species-aware representations significantly outperformed hyperparameter-optimized deep neural networks (Zrimec et al. [53]) on gene expression prediction from non-coding sequence. This clearly shows that species LMs learn sequence representations rich in information without requiring labeled data.

Fig. 5

Sequence representations of the species LM outperform other methods on a variety of downstream tasks. A Performance (R²) of linear models trained on language model embeddings compared to state-of-the-art models and k-mer count regressions, for which the best-performing k is shown. Star indicates that the species LM significantly (P < 0.05) outperforms the second best. B Effect of mutation of 3′ sequences on expression. Observed log2 fold changes, as measured in Shalem et al. [55], are well predicted by the species LM representation. C Observed and predicted effects of mutation on expression as a function of distance to the stop codon for the YDR131C 3′ sequence. D Motifs recovered through in silico mutagenesis followed by Modisco clustering on our linear model for the S. cerevisiae half-life task. Motifs with a negative effect on half-life are depicted upside down. We recover 2 of the 4 motifs found by Cheng et al. [50]: the Puf3 motif and the Whi3 motif. Additionally, we find two motifs not found by this previous analysis, the Puf4 motif and the efficiency element, both of which have known effects on RNA stability

One of the datasets we considered, the Shalem et al. 3′ MPRA [55], used a tiled mutagenesis design to measure the effect of individual subsequences of native 3′ sequences on expression. Here, the species LM performs extremely well, outperforming the agnostic model by 20 percentage points of explained variation (Fig. 5B, C). The results of this controlled experiment demonstrate that supervised models trained on species-LM sequence representations can capture causal determinants of expression found in 3′ UTR sequences.

While LM representations prove advantageous on species included in the dataset (S. pombe), we emphasize that they generalize to unseen species (S. cerevisiae). To ensure that LM performance is independent of dataset composition, and to ascertain that training on more species is beneficial, we repeated these analyses using different models pre-trained on different sets of species. We found tha
