Current limitations in predicting mRNA translation with deep learning models

Experimental methods

We outline the experimental procedure for RNA and ribosome footprint sequencing of HepG2 cells.

Cell culture

The HepG2 cell line was obtained from the laboratory of Dr. Salvatore Piscuoglio (DBM, Basel) and was cultured in Dulbecco’s Modified Eagle Medium (DMEM) containing 4.5 g/l glucose, 10% fetal calf serum, 4 mM L-glutamine, 1X NEAA, 50 U/ml penicillin and 50 µg/ml streptomycin at \(37\,^{\circ}\textrm{C}\) and 5% \(\textrm{CO}_2\). Cells were passaged every 3–4 days.

Cell lysis and collection

Cells were grown in 15-cm dishes to 70–80% confluency. Medium was replenished 3 h prior to cell lysis. Cycloheximide (CHX) was added to a final concentration of 100 µg/ml to arrest elongating ribosomes. Medium was immediately discarded and cells were washed once with ice-cold PBS containing 100 µg/ml CHX. Five hundred microliters of lysis buffer (20 mM Tris-HCl pH 7.5, 100 mM NaCl, 10 mM \(\textrm{MgCl}_2\), 1% Triton X-100, 2 mM dithiothreitol (DTT), 100 µg/ml CHX, 0.8 U/µl RNasin Plus RNase inhibitor (Promega), 0.04 U/µl Turbo DNase (Invitrogen), and EDTA-free protease inhibitor cocktail (Roche)) was added directly to the cells on the Petri dish. Cells were scraped and collected into 1.5 ml tubes. Samples were then incubated for 5 min at \(4\,^{\circ}\textrm{C}\) with continuous rotation (60 rpm), passed 10 times through a 23G needle, and incubated again for 5 min at \(4\,^{\circ}\textrm{C}\) with continuous rotation (60 rpm). Lysates were clarified by centrifugation at 3000×g for 3 min at \(4\,^{\circ}\textrm{C}\). Supernatants were centrifuged again at 10,000×g for 5 min at \(4\,^{\circ}\textrm{C}\).

Ribosome footprint sequencing

The ribosome footprint sequencing protocol was adapted from protocols described in Refs. [18, 19, 44]. An amount of lysate equivalent to 8 OD260 units was treated with 66 U RNase I (Invitrogen) for 45 min at \(22\,^{\circ}\textrm{C}\) in a thermomixer with mixing at 1000 rpm. Then, 200 U SUPERase·In RNase inhibitor (20 U/µl, Invitrogen) was added to each sample. Digested lysates were loaded onto 10–50% home-made sucrose density gradients in open-top polyclear centrifuge tubes (Seton Scientific). Tubes were centrifuged at 35,000 rpm (210,100×g) for 3 h at \(4\,^{\circ}\textrm{C}\) (SW-41 Ti rotor, Beckman Coulter ultracentrifuge). Samples were fractionated using the Piston Gradient Fractionator (Biocomp Instruments) at 0.75 ml/min by monitoring A260 values. Thirty fractions of 0.37 ml were collected in 1.5 ml tubes, flash frozen, and stored at \(-80\,^{\circ}\textrm{C}\). The fractions (typically 3 or 4) corresponding to the digested monosome peak were pooled. RNA was extracted using the hot acid phenol/chloroform method. The ribosome-protected RNA fragments (28–32 nt) were selected by electrophoresis on 15% polyacrylamide urea TBE gels and visualized with SYBR Gold Nucleic Acid Gel Stain (ThermoFisher Scientific). Size-selected RNA was dephosphorylated with T4 PNK (NEB) for 1 h at \(37\,^{\circ}\textrm{C}\). RNA was purified using the acid phenol/chloroform method. Depletion of rRNA was performed using the riboPOOL kit (siTOOLs biotech) from 433 ng of RNA according to the manufacturer’s instructions. Libraries were prepared using the SMARTer smRNA-Seq Kit for Illumina (Takara) following the manufacturer’s instructions from 15 ng of RNA. Libraries were purified by electrophoresis on 8% polyacrylamide TBE gels and sequenced on the Illumina NextSeq 500 sequencer in the Genomics Facility Basel (Department of Biosystems Science and Engineering (D-BSSE), ETH Zürich).

RNA-sequencing

RNA was extracted from 15 µl of cell lysate using the Direct-zol RNA Microprep Kit (Zymo Research) following the manufacturer’s instructions and including DNase treatment for 15 min at room temperature. Samples were eluted with 15 µl nuclease-free water. The RNA integrity numbers (RIN) of the samples were between 9.9 and 10.0, measured using High Sensitivity RNA ScreenTape (TapeStation system, Agilent). RNA was quantified using a Qubit Flex fluorometer (Thermo Fisher Scientific). Libraries were prepared using the SMART-seq Stranded for total RNA-seq kit (Takara) from 5 ng of RNA and sequenced on the Illumina NextSeq 500 sequencer in the Genomics Facility Basel (Department of Biosystems Science and Engineering (D-BSSE), ETH Zürich).

Data sets

The data sets used in this study are as follows.

Optimus50

Constructs consisted of 25 nts of identical sequence (for PCR amplification) followed by a 50-nt-long random 5′UTR sequence upstream of the GFP coding region. Their sequences and associated mean ribosome load (MRL) measurements were obtained from the GEO repository, accession number GSE114002 [45]. Non-sequential features were computed and annotated for each sequence with a python script. The normalized 5′UTR folding energy was determined with the RNAfold program from the ViennaRNA package [46]. The G/C fraction was calculated using the biopython package [47]. The numbers of out-of-frame (OOF) and in-frame (IF) uAUGs were calculated with standard python methods. ORF/UTR length and number of exons were identical for all constructs in this data set and therefore uninformative. Following [14], we held out the 20,000 5′UTRs with the highest mRNA-seq coverage for testing and kept the rest for training.
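For illustration, a minimal sketch of this per-sequence feature computation is shown below. It assumes the ViennaRNA Python bindings (rather than calling the RNAfold program directly) and biopython's gc_fraction helper; normalizing the folding energy by the 5′UTR length and the frame convention for uAUGs are our assumptions, not details taken from the original script.

```python
# Hedged sketch of the non-sequential feature computation, assuming the
# ViennaRNA Python bindings and biopython >= 1.80 (gc_fraction).
import RNA                               # ViennaRNA Python bindings
from Bio.SeqUtils import gc_fraction     # biopython's G/C fraction helper

def non_sequential_features(utr: str) -> dict:
    utr = utr.upper().replace("U", "T")
    # Minimum free energy of the folded 5'UTR; normalization by length is an assumption.
    _, mfe = RNA.fold(utr.replace("T", "U"))
    # Count upstream AUGs and classify them by frame relative to the main start codon:
    # a uAUG whose distance to the start codon is a multiple of 3 is in-frame (IF).
    n_if, n_oof = 0, 0
    pos = utr.find("ATG")
    while pos != -1:
        if (len(utr) - pos) % 3 == 0:
            n_if += 1
        else:
            n_oof += 1
        pos = utr.find("ATG", pos + 1)
    return {
        "norm_folding_energy": mfe / len(utr),
        "gc_fraction": gc_fraction(utr),
        "n_if_uaug": n_if,
        "n_oof_uaug": n_oof,
        "utr_length": len(utr),
    }
```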

Optimus100

Constructs were made from random sequences, human 5′UTRs of suitable size (25–100 nts), their single nucleotide polymorphism-containing variants, and 3′-terminal fragments of longer 5′UTRs. MRL measurements were done as for the Optimus50 data set. Sequences and associated MRL estimates were obtained from the GEO repository, accession number GSE114002 [45]. The non-sequential features were computed just as for Optimus50, with the UTR length being an additional degree of freedom. The 5000 5′UTRs with the highest mRNA-seq coverage were held out for testing, just as in [14].

Human genetic variants

Sample et al. [14] extracted 3577 5′UTR SNPs from the ClinVar database [31] and constructed variant 5′UTRs containing these SNPs. These variants were transfected into HEK 293 cells, and the respective MRL was measured as described in the paragraph about Optimus50. We also appended non-sequential features as outlined there, with the UTR length as an additional variable. The sequences and MRL values were downloaded from the GEO repository, accession number GSE114002 [45].

Yeast50

Yeast cells were transformed with plasmids containing the HIS3 ORF preceded by a 5′UTR from a pool of \(\sim 500,000\) randomized 50-nt-long sequences and were grown in media lacking histidine. The growth rate is directly controlled by the amount of His3 protein, which in turn is controlled only by the 5′UTR sequence. The data were obtained from GEO, accession number GSE104252 [48]. The calculation of non-sequential features followed the same procedure as for Optimus50. The top 5% of 5′UTRs by read coverage were used for testing.

DART

We downloaded the training data from Suppl. Tab. S2 of [15]. Non-sequential features were calculated as for Optimus50. Since uAUGs are mutated in this data set to avoid ambiguity in the translation start site, we did not include the numbers of OOF or IF uAUGs in the list of non-sequential features to learn from. Also, DART uses a luciferase reporter that includes only the initial portion of the coding sequence, so neither the number of exons nor the CDS length is meaningful; we therefore did not include these features either. This initial CDS segment is available as a separate column in their Suppl. Tab. S2. We used three different random 10% splits of the data for testing.

Human mRNA sequences

The human transcript sequences were retrieved from Ensembl [49] with pybiomart. We used the GRCh38.105 annotation and the GRCh38.dna_sm.primary_assembly.fa primary assembly file. The human transcriptome sequences were assembled with gffread version 0.12.7 [50].

Yeast mRNA sequences

We used the R64.1.1 yeast genome [51] with the R64.1.1.110 annotation from the Saccharomyces cerevisiae Genome Database (SGD). We enriched this annotation with the longest annotated transcripts from TIF-seq [52], which provided us with 5′UTR sequences. The yeast transcriptome was then generated with gffread [50].

Yeast TE data

We used ribosome footprinting (GSM2278862, GSM2278863, GSM2278864) and RNA sequencing data (GSM2278844, GSM2278845, GSM2278846) from the control experiments performed in [19], downloaded from the European Nucleotide Archive, accession PRJNA338918 [53]. The ribo-seq analysis was conducted as in [54]; the RNA-seq analysis was performed using zarp [55]. All non-sequential features (log ORF length, UTR length, G/C-content fraction of the UTR, number of exons, number of OOF uAUGs, number of IF uAUGs, normalized 5′UTR folding energy) were computed or extracted from the genome annotation. The 10% of transcripts with the highest transcript integrity number (TIN) were used for testing.

HEK 293 TE data

Ribo-seq and mRNA-seq data were obtained from the European Nucleotide Archive, accession PRJNA591214 [56]. The ribo-seq analysis was conducted as in [54]; the RNA-seq analysis was performed as in [55]. For the calculation of the translation efficiency, we took into account only RNA-seq and ribo-seq reads mapping to the CDS, not to the entire transcript. For stringency in the attribution of reads to mRNAs, we calculated relative isoform abundances by running salmon [57] on the RNA-seq samples and selected the most abundant isoform as the representative to which the RNA-seq and ribo-seq reads were mapped. The 10% of transcripts with the highest TIN (squared average over the three replicates) were used for testing.
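The following is a hedged sketch of this CDS-based translation efficiency calculation, assuming a per-transcript count table with hypothetical columns ribo_cds and rna_cds holding the CDS-restricted read counts of the representative isoform; the CPM normalization and the filtering of unsupported transcripts are illustrative choices, not the exact pipeline.

```python
# Sketch of a CDS-based translation efficiency (TE) calculation:
# reads are counted on the CDS of the representative isoform only,
# normalized to library size (CPM), and the log2 ratio is taken.
import numpy as np
import pandas as pd

def translation_efficiency(counts: pd.DataFrame) -> pd.Series:
    """counts: one row per transcript, columns 'ribo_cds' and 'rna_cds' (raw counts)."""
    cols = ["ribo_cds", "rna_cds"]
    cpm = counts[cols] / counts[cols].sum() * 1e6
    # Keep transcripts with support in both libraries to avoid division by zero.
    cpm = cpm[(cpm["rna_cds"] > 0) & (cpm["ribo_cds"] > 0)]
    return np.log2(cpm["ribo_cds"] / cpm["rna_cds"])
```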

HepG2 TE data

We followed the experimental procedure outlined in the “Experimental methods” section above. The rest of the analysis was done as for the HEK 293 TE data. The data were deposited in the European Nucleotide Archive under accession PRJNA1045106 [58].

ClinVar data

We downloaded the ClinVar database VCF file (vcf_GRCh38 [32]). With bedtools intersect (v2.30) [59], we identified ClinVar variants located in annotated genes and kept only variants within annotated 5′UTRs. With a python script, we calculated the coordinates of the polymorphisms on all affected transcripts. Then, we constructed the variant 5′UTRs in the human transcriptome (created with gffread [50] from the GRCh38.105 Ensembl annotation) and extracted the coding regions. This left us with 84,127 mutated transcripts. Next, we computed the non-sequential features as for Optimus50. We predicted the variant TE with the transfer-learning version of TranslateLSTM (pre-trained on the Optimus100 data set and trained on the endogenous HEK 293 data). Matching the transcript variants and predictions to transcripts for which we have TE measurements left us with 7238 transcripts.
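For illustration, the sketch below (not the authors' actual script) substitutes a single-nucleotide variant into a 5′UTR once its position has been mapped onto the transcript; the 0-based coordinate convention and all names are hypothetical.

```python
# Illustrative sketch: apply a ClinVar single-nucleotide variant to a 5'UTR
# given a 0-based, transcript-relative coordinate.
def apply_snv(utr_seq: str, tx_pos: int, ref: str, alt: str) -> str:
    # Sanity check: the reference allele must match the annotated sequence.
    assert utr_seq[tx_pos : tx_pos + len(ref)] == ref, "REF allele mismatch"
    return utr_seq[:tx_pos] + alt + utr_seq[tx_pos + len(ref):]

# Example: a C>T substitution at transcript position 12 of a toy 5'UTR.
variant_utr = apply_snv("GGCACGATCGTACGGATCC", 12, "C", "T")
```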

Model architectures

We implemented previously published models that predict the translation output from 5′UTR sequences according to their descriptions in the respective studies. We used tensorflow 2.10 together with CUDA toolkit 11.8.0 on NVIDIA TITAN X GPUs with 12 GB of graphics memory.

Optimus 5-Prime

Optimus 5-Prime was the first neural network trained to predict translation initiation efficiency [14]. It consists of three convolutional layers with 120 8-nt-long filters each. All of them use a ReLU activation function and are followed by two dense layers: one reducing the dimensionality to 40, again with a ReLU nonlinearity, and a final dense layer producing a single number. The last two layers are separated by a dropout layer that stochastically ignores 20% of the input signals during training. The configuration allowing predictions for 5′UTRs up to 100 nts in length has 714,681 parameters. Two different configurations of Optimus 5-Prime were proposed: one trained on a pool of \(\sim 280,000\) 5′UTR sequences of 50 nts and another trained on a pool of \(\sim 105,000\) 5′UTRs of 25–100 nts. Variable lengths were handled by anchoring the 5′UTRs at their 3′ end (adjacent to the start codon) and padding the 5′ end with 0s in the one-hot encoded representation. Since endogenous 5′UTRs vary widely in length, we used the latter configuration and data set, considering it more realistic; however, this configuration is also larger. To run the Optimus models on the local scientific computing infrastructure, the model training was re-implemented in a python script rather than in a jupyter notebook as in the git repository cited in [14].
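A minimal Keras sketch of this architecture, based only on the description above, is given below; the padding mode and other hyperparameters are assumptions, and the exact settings are those of the repository cited in [14].

```python
# Sketch of the Optimus 5-Prime architecture: three Conv1D layers with
# 120 filters of width 8 and ReLU activations, a dense layer of width 40,
# 20% dropout, and a single-output dense layer.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_optimus_5prime(max_len: int = 100) -> tf.keras.Model:
    inp = layers.Input(shape=(max_len, 4))          # one-hot 5'UTR, zero-padded at the 5' end
    x = inp
    for _ in range(3):
        x = layers.Conv1D(120, 8, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(40, activation="relu")(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(1)(x)                        # predicted mean ribosome load
    return models.Model(inp, out)

model = build_optimus_5prime()
model.compile(optimizer="adam", loss="mse")
```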

Framepool

Framepool [16] overcomes the technical limitation on 5′UTR length. While also relying on convolutional layers and a final two-layer perceptron, Framepool introduced an operation called “framewise pooling.” This was motivated by previous observations that out-of-frame uORFs have a strong impact on the translation output. Framewise pooling pools the output of the convolutional layers separately for the +0, +1, and +2 frames. The subsequent multi-layer perceptron (MLP) takes as input the average and the maximum of each of the three pools (per convolutional filter). This makes the input of the final MLP independent of the input length and thus allows for varying UTR lengths from a technical standpoint. When trained on the same data sets as Optimus 5-Prime, Framepool achieved better performance on data of varying UTR length. The number of parameters in Framepool, 282,625, is only about a third of what Optimus 5-Prime requires for UTR lengths \(\le 100\) nt. We obtained Framepool from the git repository referenced in [16]. A python script adapted our input data to the format expected by the model.
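The sketch below illustrates the framewise pooling operation: the convolutional output is split by reading frame and mean- and max-pooled per frame and per filter, yielding a fixed-size representation regardless of 5′UTR length. Anchoring the frames at the 3′ end (start codon) is our assumption about the frame convention, not necessarily the one used in [16].

```python
# Sketch of framewise pooling on a (batch, length, filters) tensor.
import tensorflow as tf

def framewise_pool(conv_out: tf.Tensor) -> tf.Tensor:
    length = tf.shape(conv_out)[1]
    # Frame index of each position, counted backwards from the start codon.
    frame = tf.math.floormod(length - 1 - tf.range(length), 3)
    pooled = []
    for f in range(3):
        mask = tf.cast(tf.equal(frame, f), conv_out.dtype)[tf.newaxis, :, tf.newaxis]
        masked = conv_out * mask
        n = tf.maximum(tf.reduce_sum(mask, axis=1), 1.0)
        pooled.append(tf.reduce_sum(masked, axis=1) / n)   # per-frame mean
        pooled.append(tf.reduce_max(masked, axis=1))       # per-frame max
    return tf.concat(pooled, axis=-1)   # (batch, 3 frames x 2 statistics x filters)
```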

MTtrans

While CNNs are generally not a natural choice for modeling sequences of variable length, recurrent neural networks (RNNs) were developed exactly for this purpose. Conventional RNNs suffer from the so-called vanishing gradient problem, whereby memory of distant context is lost. Moreover, they can only memorize the left-side context, since they process sequences from left to right. These problems are solved by long short-term memory (LSTM) units [25] and bidirectional layers. However, as there is no correspondence between output cells and positions in the sequence, the interpretability of this type of model is more challenging. MTtrans [17] has been proposed as an attempt to get the best of both the CNN and LSTM worlds. It follows the general idea of detecting motifs, with four convolutional layers stacked on top of each other, and batch normalization, L2 regularization, and dropout layers in between to avoid over-fitting and ensure stability. This component of MTtrans is called the “shared encoder” and is followed by two bidirectional gated recurrent unit (GRU) [60] layers and two dense layers that make the final prediction. GRUs are quite similar to LSTMs, but they do not feature an output gate [25, 60] and therefore have fewer weights to adjust than LSTM layers. This second component of MTtrans is called the “task-specific tower,” because it is re-trained for each data set (task), while the encoder is shared across all tasks. By training the encoder on data sets from different organisms and cells, the authors aim to capture general features of translation that apply to all of the studied systems. This is an example of transfer learning, hence the “trans” in the name MTtrans. MTtrans appears to be considerably bigger than its two predecessors, with \(\sim 2,100,000\) parameters. A re-evaluation of the results in [17] was unfortunately not possible, since the code in the provided github repository was still a work in progress. We therefore attempted to reconstruct MTtrans in our own ML framework, but quote the numbers reported in [17] wherever available.
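Our reconstruction attempt followed the layout described above; a hedged sketch is shown below, in which the filter numbers, kernel sizes, layer widths, and regularization constants are our own assumptions rather than values taken from [17].

```python
# Sketch of an MTtrans-like model: a shared convolutional encoder
# (four Conv1D blocks with batch normalization, L2 regularization, and
# dropout) followed by a task-specific tower of two bidirectional GRU
# layers and two dense layers.
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_mttrans_like(max_len: int = 100, n_filters: int = 128) -> tf.keras.Model:
    inp = layers.Input(shape=(max_len, 4))
    x = inp
    for _ in range(4):                                     # "shared encoder"
        x = layers.Conv1D(n_filters, 8, padding="same",
                          kernel_regularizer=regularizers.l2(1e-4))(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.Dropout(0.2)(x)
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)   # "task-specific tower"
    x = layers.Bidirectional(layers.GRU(64))(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1)(x)
    return models.Model(inp, out)
```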

TranslateLSTM

The start of the coding region has a biased nucleotide composition that also plays a role in translation initiation (cf. [61]). Feeding the first 100 nts of the ORF into another bidirectional LSTM model therefore provides additional information about initiation likelihood. These three components, a bidirectional LSTM for the 5′UTR, a bidirectional LSTM for the beginning of the ORF, and the non-sequential features, can then be concatenated into one large model. There is, of course, a lot of redundancy in these inputs, as the folding energy, G/C content, and length of the 5′UTR are all determined by its nucleotide sequence. One way to mitigate this redundancy is to use high dropout rates after the final bidirectional LSTM layer of both RNNs (5′UTR and ORF). For training from scratch, we used dropout rates of 0.8 for the 5′UTR model and 0.2 for the CDS model. After concatenating the numerical input data with the outputs of the two bidirectional LSTMs, a final dense layer computes a single number, the logarithm of the translation efficiency, scaled to a Gaussian distribution with unit variance and expectation value 0. The network was implemented in python, using the keras API with a tensorflow backend [62, 63]. We used the adam algorithm [64] for training, with the default learning rate of 0.001, which proved superior even to more sophisticated learning rate schedules. Beyond dropout layers, overfitting is prevented by imposing an early stopping criterion. To this end, we used a keras callback object, which monitors the validation loss and terminates training once it sees a consistent increase in the validation loss over a given number of epochs (“patience” parameter). We set the patience to 10 epochs and restored the best weights within these 10 epochs after termination. Of the randomly reshuffled training data, 20% served as validation data. To further improve the performance of the model, we pretrained it on the variable-length Optimus100 data set before training on the endogenous data. In that scenario, we used a slightly lower dropout rate of 0.5 for the 5′UTR LSTM.
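A minimal Keras sketch of TranslateLSTM is shown below; the dropout rates, optimizer, early-stopping settings, and validation split follow the description above, whereas the LSTM layer widths and input lengths are illustrative assumptions.

```python
# Sketch of TranslateLSTM: a bidirectional LSTM over the one-hot 5'UTR,
# a bidirectional LSTM over the first 100 nt of the CDS, and the
# non-sequential features, concatenated before a single dense output.
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def build_translate_lstm(utr_len: int = 100, cds_len: int = 100, n_feat: int = 7):
    utr_in = layers.Input(shape=(utr_len, 4), name="utr")
    cds_in = layers.Input(shape=(cds_len, 4), name="cds_start")
    feat_in = layers.Input(shape=(n_feat,), name="non_sequential")
    utr_x = layers.Dropout(0.8)(layers.Bidirectional(layers.LSTM(64))(utr_in))
    cds_x = layers.Dropout(0.2)(layers.Bidirectional(layers.LSTM(64))(cds_in))
    merged = layers.Concatenate()([utr_x, cds_x, feat_in])
    out = layers.Dense(1)(merged)       # standardized log translation efficiency
    model = models.Model([utr_in, cds_in, feat_in], out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# model.fit(..., validation_split=0.2, callbacks=[early_stop])
```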

Linear models

As the translation initiation efficiency was reported to be explained, to a large extent, by a small number of mRNA features [6], we included in our study two variants of a small linear model. The features were as follows. First, upstream open reading frames (uORFs) have been reported in many studies to reduce the initiation of translation at the main ORF [65]. The effect was found to be largely due to uORFs that are in a different frame than the main ORF, which we have referred to as “out-of-frame ORFs” or “out-of-frame AUGs,” because the presence and position of stop codons matching these ORFs is not generally considered. Thus, one of the linear models included the numbers of out-of-frame and in-frame AUGs, while the other included only the former. The secondary structure of the 5′ cap-proximal region of the mRNA is known to interfere with the binding of the eIF4F cap-binding complex [5], and thus a weak positive correlation has been observed between the free energy of folding of the first 80 5′UTR nts and the translation initiation efficiency of yeast mRNAs [6,7,8]. A smaller impact on yeast translation has also been attributed to the ORF length (negative correlation with TIE) [6, 66], the 5′UTR length, and the G/C content [6]. For human cells, the number of exon-exon junctions has also been linked to TIE [23]. Additional file 1: Fig. S6 shows density plots of these parameters, comparing the major data sets we used, i.e., the three MPRA data sets, DART, and the two endogenous ones.

The linear models are compellingly simple: they have only as many parameters as the features they cover, plus a single global bias term. For instance, the linear model describing the Optimus50 data set consists of weights multiplying the normalized 5′UTR folding energy, the G/C content, and the numbers of IF and OOF upstream AUGs, plus the bias term, for a total of five parameters.
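As an illustrative sketch, such a five-parameter model for the Optimus50 features could be fitted with scikit-learn as follows; the feature column order is an arbitrary choice.

```python
# Five-parameter linear model: four feature weights plus the intercept (bias).
import numpy as np
from sklearn.linear_model import LinearRegression

FEATURES = ["norm_folding_energy", "gc_fraction", "n_if_uaug", "n_oof_uaug"]

def fit_linear_model(features: np.ndarray, mrl: np.ndarray) -> LinearRegression:
    """features: (n_sequences, 4) array with columns ordered as in FEATURES."""
    model = LinearRegression()          # fit_intercept=True provides the bias term
    model.fit(features, mrl)
    return model
```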
