Nucleic acid structure prediction


Nucleic acid structure prediction is a computational method to determine secondary and tertiary nucleic acid structure from its sequence. Secondary structure can be predicted from one or several nucleic acid sequences. Tertiary structure can be predicted from the sequence, or by comparative modeling.
The problem of predicting nucleic acid secondary structure is dependent mainly on base pairing and base stacking interactions; many molecules have several possible three-dimensional structures, so predicting these structures remains out of reach unless obvious sequence and functional similarity to a known class of nucleic acid molecules, such as transfer RNA or microRNA, is observed. Many secondary structure prediction methods rely on variations of dynamic programming and therefore are unable to efficiently identify pseudoknots.
While the methods are similar, there are slight differences in the approaches to RNA and DNA structure prediction. In vivo, DNA structures are more likely to be duplexes with full complementarity between two strands, while RNA structures are more likely to fold into complex secondary and tertiary structures such as in the ribosome, spliceosome, or transfer RNA. This is partly because the extra oxygen in RNA increases the propensity for hydrogen bonding in the nucleic acid backbone. The energy parameters are also different for the two nucleic acids. The structure prediction methods can follow a completely theoretical approach, or a hybrid one incorporating experimental data.

Single sequence structure prediction

A common problem for researchers working with RNA is to determine the three-dimensional structure of the molecule given only a nucleic acid sequence. However, in the case of RNA much of the final structure is determined by the secondary structure or intra-molecular base pairing interactions of the molecule. This is shown by the high conservation of base pairings across diverse species.

Thermodynamic models

Secondary structure of small RNA molecules is largely determined by strong, local interactions such as hydrogen bonds and base stacking. Summing the free energy for such interactions should provide an approximation for the stability of a given structure. To predict the folding free energy of a given secondary structure, an empirical nearest-neighbor model is used. In the nearest neighbor model the free energy change for each motif depends on the sequence of the motif and of its closest base-pairs. The energy of fragments are added together to get the total free energy. The structure with the minimal free energy is the most stable.
Examples of RNA models include:
  • Turner 1999 and Turner 2004.
  • Andronescu 2007 parameters for RNA with any of the four canonical bases. This is derived from earlier wet lab experiments and from other forms of RNA structure data.
  • Langdon 2018 parameters for RNA with any of the four canonical bases.
As of 2020, most software packages default to Turner 2004. A 2020 benchmark shows that Turner 2004 is the worst of the four for RNA-RNA interaction tasks, with the best outperforming it by 5 pairs. It is known that Langdon 2018 is the best for structure prediction.

The most stable structure

The simplest way to find the MFE structure would be to generate all possible structures and calculate the free energy for it, but the number of possible structures for a sequence increases exponentially with the length of RNA: number of secondary structures = N, N- number of nucleotides. For longer molecules, the number of possible secondary structures is huge: a sequence of 100 nucleotides has more than 1025 possible secondary structures. This calls for smarter methods.

Dynamic programming algorithms

Most popular methods for predicting RNA and DNA's secondary structure involve dynamic programming. One of the early attempts at predicting RNA secondary structure was made by Ruth Nussinov and co-workers who developed a dynamic programming-based algorithm that maximized the length and number of a series of "blocks". Each "block" required at least two nucleotides, which reduced the algorithm's storage requirements over single base-matching approaches. Nussinov et al. later published an adapted approach with improved performance that increased the RNA size limit to ~1,000 bases by folding increasingly sized subsections while storing the results of prior folds, now known as the Nussinov algorithm. In 1981, Michael Zuker and Patrick Stiegler proposed a refined approach with performance comparable to Nussinov et al.'s solution but with the additional ability to find also find "suboptimal" secondary structures.
Dynamic programming algorithms provide a means to implicitly check all variants of possible RNA secondary structures without explicitly generating the structures. First, the lowest conformational free energy is determined for each possible sequence fragment starting with the shortest fragments and then for longer fragments. For longer fragments, recursion on the optimal free energy changes determined for shorter sequences speeds the determination of the lowest folding free energy. Once the lowest free energy of the complete sequence is calculated, the exact structure of RNA molecule is determined.
Dynamic programming algorithms are commonly used to detect base pairing patterns that are "well-nested", that is, form hydrogen bonds only to bases that do not overlap one another in sequence position. Secondary structures that fall into this category include double helices, stem-loops, and variants of the "cloverleaf" pattern found in transfer RNA molecules. These methods rely on pre-calculated parameters which estimate the free energy associated with certain types of base-pairing interactions, including Watson-Crick and Hoogsteen base pairs. Depending on the complexity of the method, single base pairs may be considered, and short two- or three-base segments, to incorporate the effects of base stacking. This method cannot identify pseudoknots, which are not well nested, without substantial algorithmic modifications that are computationally very costly.

Predicting pseudoknots

One of the issues when predicting RNA secondary structure is that the standard free energy minimization and statistical sampling methods can not find pseudoknots. The major problem is that the usual dynamic programing algorithms, when predicting secondary structure, consider only the interactions between the closest nucleotides, while pseudoknotted structures are formed due to interactions between distant nucleotides. Rivas and Eddy published a dynamic programming algorithm for predicting pseudoknots. However, this dynamic programming algorithm is very slow. The standard dynamic programming algorithm for free energy minimization scales O in time, while the Rivas and Eddy algorithm scales O in time. This has prompted several researchers to implement versions of the algorithm that restrict classes of pseudoknots, resulting in performance gains. For example, pknotsRG tool includes only the class of simple recursive pseudoknots and scales O in time.
The general problem of pseudoknot prediction has been shown to be NP-complete.

Suboptimal structures

The accuracy of RNA secondary structure prediction from one sequence by free energy minimization is limited by several factors:
  1. The free energy value list in a nearest neighbor model may be incomplete or inaccurate.
  2. * Modified bases are often assumed to fold in the same way as the unmodified version, but they actually have differences ranging from subtle to major.
  3. Not all known RNA folds in such a way as to conform with the thermodynamic minimum.
  4. Some RNA sequences have more than one biologically active conformation.
For this reason, the ability to predict structures which have similar low free energy can provide significant information. Such structures are termed suboptimal structures. MFOLD is one program that generates suboptimal structures.
A collection of structures can be represented by a "dot plot", a visualization of a square matrix containing the frequency of two bases being paired. This is essentially a type of heat map.

Boltzmann ensemble

A collection of possible structures forms an ensemble. By sampling from the ensemble according to the Boltzmann distribution, many possible structures can be returned.
The states in an Boltzmann ensemble can also be represented by a "dot plot", with magnitudes being the probability of finding any two base being paired in the ensemble. These magnitudes implicitly represent thermodynamic data as well.

Comparative secondary structure prediction

Multiple evolutionarily related RNA sequences share patterns of covariation between sites. If two sites are often seen changing in sync, there is a high possibility that there is a structurally required hydrogen bond between those positions. This offers an additional source of information for what the true structure could look like.
In general, the problem of alignment and consensus structure prediction are closely related. A few different approaches to the prediction of consensus structures can be distinguished:
  1. Folding of alignment
  2. Alignment of singular predicted structures, in some cases jointly with sequence alignment
  3. Simultaneous alignment of sequence and ensembles of predicted structures
Each of these steps can also be mixed and matched in an iterative fashion.

Align then fold

A practical heuristic approach is to use multiple sequence alignment tools to produce an alignment of several RNA sequences, to find consensus sequence and then fold it. Covariation can be extracted from a multiple sequence alignment of multiple homologous RNA sequences with related but dissimilar sequences. The quality of the alignment determines the accuracy of the consensus structure model. Consensus sequences are folded using various approaches similarly as in individual structure prediction problem:
  • Pfold implements SCFGs. Covariation is converted into a "dot plot" of base pair probabilities, which is considered on top of the usual single-sequence version of Pfold.
  • RNAalifold from the Vienna suite is essentially an alignment-based variant of RNAfold from the same suite. Both use a thermodynamic folding approach and can run in two modes: a MFE mode, which returns a single structure, and a partitioning mode, which gives several possible structures. The score of a consensus structure is a combination of average thermodynamic energies and extra points derived from co-variation.
  • * A linear-time variant is found as LinearAlifold.
  • ILM uses combination of thermodynamics and mutual information content scores. Like suggested in its name, it is based on Nussinov style loop-matching between predicted structures. In each iteration, a "best" helix/stem is found in the predicted secondary structure and the rest of the structure is folded again. This allows the prediction of pseudoknoted structures.