RNAprofiling 2.0: Enhanced cluster analysis of structural ensembles

The method of sampling RNA secondary structures from the Boltzmann distribution [1], [2] under the nearest neighbor thermodynamic model (NNTM) [3] provides critical base pairing alternatives to the minimum free energy (MFE) configuration. Such information can be essential to understanding how RNA sequences fold and how these important molecules function. Yet, the power of ensemble analysis can only be realized by identifying the underlying patterns in a sufficiently large set of suboptimal structures.
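
To make concrete why structures other than the MFE matter, the following minimal sketch computes Boltzmann probabilities for a toy set of three structures. The free energies are hypothetical values chosen for illustration, and the normalization runs over these three structures only, whereas the true partition function sums over all structures compatible with the sequence.

```python
import math

# Gas constant in kcal/(mol*K) times 37 degrees C in kelvin.
RT = 0.0019872 * 310.15


def boltzmann_probabilities(energies):
    """Return normalized Boltzmann weights for a list of free energies (kcal/mol)."""
    lowest = min(energies)
    # Shift by the lowest energy so the exponentials stay well scaled.
    weights = [math.exp(-(e - lowest) / RT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical free energies: the MFE structure and two suboptimal ones.
energies = [-30.0, -29.5, -29.0]
probs = boltzmann_probabilities(energies)

for e, p in zip(energies, probs):
    print(f"dG = {e:6.1f} kcal/mol  probability = {p:.3f}")
```

Even in this toy setting, structures within 1 kcal/mol of the MFE carry a substantial share of the probability, which is the motivation for analyzing a sample rather than a single prediction.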

RNAprofiling, or just profiling for short, refers to the overall cluster analysis method that organizes and analyzes a collection of secondary structures according to a set of features. It was developed [4] to identify the dominant combinations of base pairing signals in the Boltzmann ensemble. RNAprofiling 1.0 (denoted here Pv1) consistently achieves high sample compression together with low information loss on “small” sequences, on the order of 100 nucleotides (nt). We present here an updated version, RNAprofiling 2.0 (denoted Pv2), which can mine a stable, informative structural signal from Boltzmann samples of much longer sequences.

In contrast to other cluster analysis methods like Sfold [5] and RNAshapes [6], Pv2 does not generate the sample to be analyzed. Rather, it takes a sample as input and so can leverage the ensemble analysis power of state-of-the-art software packages like RNAstructure [7] and ViennaRNA [8]. We demonstrate here that Pv2 will reliably report the high probability base pairing combinations for sequences up to 600 nt.

We note that the signal from the Boltzmann ensemble at the substructural unit level, i.e., the features being considered, remains strong well beyond 1000 nt. However, the probability of different combinations of these units, i.e., their profiles, decays with sequence length. Like prediction accuracy, this decay reflects the NNTM itself and the sampling method employed. Given a particular Boltzmann sample as input, Pv2 outputs high quality information in a useful quantity for further hypothesis generation.
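
The distinction between features and profiles can be illustrated with a deliberately simplified sketch. It is not the Pv2 algorithm: here a feature is taken to be an exact helix (a run of stacked base pairs) that appears in at least a fixed fraction of the sample, and a profile is the set of features a structure contains. The function names, the `min_feature_freq` threshold, and the toy dot-bracket sample are all assumptions made for illustration.

```python
from collections import Counter


def base_pairs(db):
    """Return the (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for j, ch in enumerate(db):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            pairs.append((stack.pop(), j))
    return pairs


def helices(db):
    """Group stacked pairs into helices keyed by (outer i, outer j, length)."""
    pairs = set(base_pairs(db))
    found = set()
    for i, j in pairs:
        if (i - 1, j + 1) in pairs:
            continue  # not the outermost pair of its helix
        length = 1
        while (i + length, j - length) in pairs:
            length += 1
        found.add((i, j, length))
    return frozenset(found)


def profile_sample(sample, min_feature_freq=0.5):
    """Select frequent helices as features and count the resulting profiles."""
    n = len(sample)
    per_structure = [helices(s) for s in sample]
    helix_counts = Counter(h for hs in per_structure for h in hs)
    features = {h for h, c in helix_counts.items() if c / n >= min_feature_freq}
    # A structure's profile is the subset of features it contains.
    profiles = Counter(frozenset(hs & features) for hs in per_structure)
    return features, profiles


# Hypothetical sample of four structures for a 13 nt sequence.
sample = [
    "((...))......",
    "((...)).(...)",
    "((...)).(...)",
    "...((....))..",
]
features, profiles = profile_sample(sample)

for profile, count in profiles.most_common():
    print(sorted(profile), count / len(sample))
```

In this toy sample each feature occurs in at least half of the structures, while no single profile is guaranteed to reach that frequency; the same effect, amplified over many more features, is why profile probabilities fall as sequences lengthen.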

As described, the content of that information is determined directly from the input sample. When introduced [4], it was established that RNAprofiling provides complementary information to both Sfold and RNAshapes. Moreover, a thorough analysis [9] compared the three, where Pv1 analyzed Boltzmann samples generated by GTfold [10]. It was found that all three improved over the MFE, but there was no clear advantage among cluster analysis methods in terms of base pair prediction accuracy.

In terms of efficiency, for a sequence of length ∼600 nt, Pv2 analyzes a Boltzmann sample of 10,000 structures in about 20 seconds, with the sample generation taking about 5 seconds. Shorter sequences and/or smaller samples take correspondingly less time to analyze, e.g., about 2 seconds to analyze a sample of 1,000 structures for a sequence of ∼200 nt. In contrast [9], Sfold takes about 25 seconds (sampling + analysis) at this scale (∼200 nt, 1,000 structures), as does RNAshapes.

Regardless of which cluster analysis method is used, there are two key points for experimentalists [9]. First, as is well known to the ribonomics community, prediction quality improves if more than one conformation is considered. Second, the quality is substantially enhanced if the conformations are initially considered at lower granularity/higher abstraction. This supports a multilayered approach to RNA secondary structure determination where an early computational step identifies critical structural differences “to be vetted by further computational analysis, experimental testing, and/or biological insight.”

The new version of RNAprofiling presented here significantly enhances the method’s ability to do just that. The new code is freely available at github.com/gtDMMB/RNAprofilingV2 under a GPLv2 license and can be run online through the rnaprofiling.gatech.edu website.
