Skip Navigation


Nucleic Acids Research Advance Access originally published online on January 7, 2009
Nucleic Acids Research 2009 37(5):1378-1386; doi:10.1093/nar/gkn987
This Article
Right arrow Abstract Freely available
Right arrow Print PDF (338K) Freely available
Right arrow Screen PDF (219K) Freely available
Right arrowOA All Versions of this Article:
37/5/1378    most recent
gkn987v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Commercial Re-use Guidelines
for Open Access NAR Content
Google Scholar
Right arrow Articles by Smit, S.
Right arrow Articles by Heringa, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Smit, S.
Right arrow Articles by Heringa, J.
Related Collections
Right arrow Nucleic acid structure
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Nucleic Acids Research, 2009, Vol. 37, No. 5 1378-1386
© 2009 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Computational Biology

RNA structure prediction from evolutionary patterns of nucleotide composition

S. Smit1,2,*, R. Knight3 and J. Heringa1

1Centre for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit, 1081 HV Amsterdam, 2Centre for Medical Systems Biology, Niels Bohrweg 2, 2300 RA Leiden, The Netherlands and 3Department of Chemistry and Biochemistry, University of Colorado, Boulder CO 80309, USA

*To whom correspondence should be addressed. Tel: +31 020 5983714; Fax: +31 020 5987653; Email: S.Smit{at}few.vu.nl

Received September 11, 2008. Revised November 21, 2008. Accepted November 21, 2008.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 FUNDING
 REFERENCES
 
Structural elements in RNA molecules have a distinct nucleotide composition, which changes gradually over evolutionary time. We discovered certain features of these compositional patterns that are shared between all RNA families. Based on this information, we developed a structure prediction method that evaluates candidate structures for a set of homologous RNAs on their ability to reproduce the patterns exhibited by biological structures. The method is named SPuNC for ‘Structure Prediction using Nucleotide Composition’. In a performance test on a diverse set of RNA families we demonstrate that the SPuNC algorithm succeeds in selecting the most realistic structures in an ensemble. The average accuracy of top-scoring structures is significantly higher than the average accuracy of all ensemble members (improvements of more than 20% observed). In addition, a consensus structure that includes the most reliable base pairs gleaned from a set of top-scoring structures is generally more accurate than a consensus derived from the full structural ensemble. Our method achieves better accuracy than existing methods on several RNA families, including novel riboswitches and ribozymes. The results clearly show that nucleotide composition can be used to reveal the quality of RNA structures and thus the presented technique should be added to the set of prediction tools.


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 FUNDING
 REFERENCES
 
The discovery of many noncoding RNAs, involved in catalysis and gene regulation, has intensified the attention for RNA research 1. To understand the structural, functional and mechanistic properties of these molecules, structure prediction is an essential tool. Efforts aimed at predicting RNA structures from sequence information started over three decades ago 2, and the field is still under rapid development as evidenced by the many prediction methods developed in recent years [reviewed in (3–9)]. Differences between methods arise from the type of input (single versus multiple sequences, aligned versus unaligned sequences), their prediction target (base pair types and structural topology), and the kinds of evidence they use for the prediction (e.g. covariance, thermodynamics or experimental data). Free energy minimization methods are extensively developed, but their accuracy is limited due to several factors including our incomplete knowledge of the RNA folding process (5,6,8). Comparative sequence methods are in general more accurate than single-sequence minimum free energy methods. When many (hundreds or thousands) of sequences are available, very accurate structural models can be derived [over 97% in the ribosomal RNA 10]. However, when a limited number of sequences is available, as is often the case, accuracy typically ranges from 30%–90%, depending on the number, length and similarity of the sequences in the alignment 5.

Here we present a novel approach to RNA structure prediction for a set of homologous sequences, exploiting an information source thus far unused for this purpose: evolutionary patterns of nucleotide composition. We assess the quality of candidate structures in an ensemble using generic compositional patterns exhibited by biological structures. Our method fits in with two prediction strategies that are being explored in recent years: mining the information provided by an ensemble of (suboptimal) RNA structures and combining multiple information sources in a single prediction method. The former strategy has led to efficient algorithms for generating all suboptimal structures within a certain energy range 11 and for statistical sampling from the complete Bolzmann ensemble 12. The second strategy emerged because the use of a single source often does not lead to sufficient accuracy of the predictions when a limited number of sequences is available. The more powerful methods at this moment combine evidence from various sources. Examples are RNAalifold 13, RNAstructure 14, and Bayesfold 15, using covariation, thermodynamics, and experimental data. In addition, important advances in structure prediction are made due to the use of novel information sources, such as abstract shapes 16 as implemented in the RNAshapes package (17,18) or ‘nucleotide cyclic motifs’ 19 as implemented in the MC-Fold and MC-Sym pipeline 20.

To complement current structure prediction methods, we describe how nucleotide composition can aid structure prediction. Different classes of structural RNAs are known to have certain compositional biases 21 and within an RNA family the composition of structural elements changes in consistent ways throughout the evolution 22. Building on these observations, we found certain characteristics of these patterns that are shared between all RNA families. The key idea of our method is to use these characteristic patterns of nucleotide composition, known to be exhibited by biological sequences and structures, to distinguish between realistic and unrealistic foldings. The method is built around a scoring function that captures the universal properties. It is used to evaluate many structures in an ensemble. Structures more similar to the true structure will display the expected trends and will receive a more positive evaluation.

In this work we address the following questions. Can the patterns of nucleotide composition be used to identify the most realistic structures in an ensemble? Is a consensus structure calculated from top-scoring structures more accurate than the consensus from the whole ensemble? How do predictions using the nucleotide composition as the only source compare to predictions from existing structure prediction methods? First we will introduce the concept of nucleotide composition and the generic features. Then we will explain how this information is used for structure prediction. The research questions will be addressed in a performance test involving a number of different RNA families, ranging from the hammerhead ribozyme to the large subunit of the ribosomal RNA.


    MATERIALS AND METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 FUNDING
 REFERENCES
 
We have developed a method for RNA structure prediction using characteristic patterns of nucleotide composition as observed in biological structures. The core of the method is a scoring function that describes these patterns. The prediction method is coined SPuNC for ‘Structure Prediction using Nucleotide Composition’. The SPuNC algorithm consists of the following steps: 1 given a multiple sequence alignment, generate an ensemble of candidate structures, 2 score all structures in the ensemble with a scoring function, and 3 return top-scoring structure(s) or consensus structure. In this section we will describe the compositional patterns and their generic features, the construction and application of the scoring function, the ensemble generation, and the performance measurements of the algorithm. A web interface to the method, the source code, and the datasets used in this study are available at www.ibi.vu.nl/programs/spuncwww.

Nucleotide composition
The combination of an alignment and a structure that the sequences fold into can be described in terms of nucleotide composition, which can be calculated for different parts of the RNA structure. We distinguish four structural elements in RNA structures (Figure 1A and B). The category ‘stem’ contains all paired residues, the category ‘loop’ contains all unpaired residues that connect the upstream and downstream half of a helix, ‘bulge’ contains all unpaired regions that connect exactly two helices, and ‘other’ contains all other unpaired residues, including multi-helix junctions, ends and pseudoknotted regions. This classification scheme is chosen over the more coarse-grained paired/unpaired scheme and the more detailed six-way classification 22, because it distinguishes between three important unpaired categories and it can handle pseudoknots unlike the six-way classification (in other words, it can be applied to any collection of base pairs in which each base has at most one pairing partner).


Figure 1
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1. RNA structure and nucleotide composition. (A) Alignment of three RNA sequences with a reference structure (both vienna structure and pairing mask) and the corresponding structural classification. (B) Secondary structure diagram indicating four different structural categories: S = stem, L = loop, B = bulge, O = other. (C) RNA composition space, containing the nucleotide composition of 80 SSU rRNA sequences decomposed under the E. coli reference structure [source: Comparative RNA Web 23]. The space contains five distributions: S = stem, L = loop, B = bulge, O = other, T = total sequence. The inset shows the compositional patterns generated by an incorrect structure, containing more scatter and overlap in the loops and bulges and increased scatter and lower GC content in the stems.

 
The nucleotide composition of a structural element, which includes all residues in the sequence classified as such, can be calculated as the fraction of each of the four bases U, C, A and G in this element (degenerate bases, gaps, and other unknown characters are ignored in this calculation). A vector of these four fractions (for example U = 0.1, C = 0.2, G = 0.3 and A = 0.4) can be plotted in composition space, also known as the RNA simplex (Figure 1 C).

Composition can be measured in three directions. Traditionally nucleotide composition is described as GC content which is the fraction of G + C. Similarly, one can calculate the fraction of U + C and the fraction of U + G. These three fractions together give a full description of the composition of a set of residues, and form three orthogonal axes in composition space. Coordinates along these axes can be used to perform calculations on the data.

Nucleotide composition changes over evolutionary time. A single sequence in combination with its structure corresponds to five dots in composition space: one for each structural element (stem, loop, bulge and other) and one for the composition of the full sequence, which is structure independent. When this data is plotted for multiple sequences that fold into the same structure, composition space contains five distributions showing the variation in composition across the sequences (Figure 1 C). In this study, an RNA structure is described by these five distributions.

Generic patterns
For the purpose of structure prediction we are interested in compositional features that are shared by all RNA families. Therefore, building on the general observations made on RNA composition 21, we analyzed the evolutionary patterns of nucleotide composition in many RNA families. Even though the exact location and variation of the compositional distributions within the RNA simplex are family-specific, we identified several generic features, extending the observations made in several ribosomal RNA families 22.

From the multi-family survey it became apparent that the distribution of the stems has a very specific shape and location in composition space. It is an axis-like distribution along Chargaff's; axis (GC axis) with large variation in GC content, and little variation in UG and UC content, describing wobble and nonstandard base pairs respectively. The distribution is often biased towards the G-U edge, caused by G-U wobble pairs, and towards the A-G edge, because base pairs involving As and Gs are more common than base pairs involving Us and Cs. The unpaired regions form very tight and distinct distributions. In many families there is a clear separation between loops and bulges and unpaired regions have a low GC content in comparison to the stems.

The quantification of these trends was not straightforward. Since the exact location and variation differs between families, absolute values to describe these properties are not applicable to all RNAs. The critical step towards describing the universal properties was the realization that real biological structures exhibit particular compositional patterns relative to other (incorrect) candidate structures for the input alignment. Thus, the underlying idea is that when an alignment is correctly decomposed according to the true biological structure, we will observe the characteristic tight distributions, but an incorrect structure will produce scatter in all distributions and overlap between the unpaired regions (this difference between ‘right’ and ‘wrong’ is visualized in Figure 1 C and its inset). Based on this idea, we designed a scoring function that can be used to evaluate candidate structures as to their ability to reproduce the expected patterns.

Scoring function design
Decomposing an alignment according to a given structure results in five compositional distributions in composition space as described above. These distributions can be quantitatively summarized by ‘compositional properties’, such as the mean value or variation along one of the compositional axes, e.g. the standard deviation of the stems along the GC axis. Differences between RNA structures will lead to differences in the compositional patterns, which will be observable by differences in the values of the compositional properties.

We systematically compared real biological structures with incorrect candidate structures on 31 compositional properties in three well-studied RNA families: phenylalanyl-tRNA, bacterial 5S rRNA and 16S rRNA from gamma-proteobacteria. We tested both the mean and SD along each axis in each of four structural elements (24 properties) and seven composite properties, combining multiple elements and/or axes. For each property we calculated the distribution (mean and SD) of values in several ensembles of candidate structures, the Z-score of the real structure (with respect to this distribution), and similarly the Z-scores of the 10 most accurate structures in each ensemble. Finally, we selected five properties that showed consistently low (<–0.5) or high (>+0.5) Z-scores with little variation across all families (listed in Table 1 left side, the reference Z-scores and weights will be explained below).


View this table:
[in this window]
[in a new window]

 
Table 1. Compositional properties

 
The selected compositional properties and their Z-scores observed in the training data are combined into a single scoring function. This function, when applied to a novel alignment with unknown reference structure and an ensemble of candidate structures, assigns a score to each ensemble member indicating its ability to reproduce the expected compositional patterns. Specifically, the score is the weighted root mean square deviation (RMSD) from the reference Z-scores. We experimented with two kinds of reference Z-scores (trained and extreme) and weighted versus unweighted compositional properties, as specified in Table 1. Trained Z-scores are the average Z-scores observed in the training set. Extreme Z-scores are rough extrapolations of the observed Z-scores, indicating that the Z-score should be very low (–2), low (–1) or very high (+2). The weights that we tested are defined as one over the SD of the observed Z-scores in the training set and basically indicate how reliable the reference Z-score is: properties showing high variation in the training set receive a lower weight. Equation 1 specifies the exact scoring function.

Formula 1 (1)
Thus, for a given sequence alignment and ensemble of candidate structures, we first calculate the distribution of values for each property p (five in this case) in the ensemble. Subsequently, for each ensemble member m and for each property p, we calculate the Z-score of the member (zm). We then sum the squared distance of zm to the reference Z-score for the property (zref) weighted by wp over all properties. Finally, we take the mean and the square root of this sum.

Scoring function application
The scoring function assigns an RMSD score to each member of an ensemble of structures. Realistic structures will receive a low score and more unrealistic structures a higher score. The scoring function thus roughly arranges the structures in the ensemble from more accurate to less accurate. Note that, a perfect correlation between RMSD score and accuracy is not necessary and would be impossible to achieve: optimizing the properties and parameters for a specific family would improve the correlation, but would diminish the generality of the scoring function. For this reason, reporting the single best-scoring structure is not informative, so we report the average accuracy of a percentage of top-scoring structures, determined by a cutoff (top-cutoff, usually 5% or 10%). In addition, we report the accuracy of a consensus structure calculated from the top-scoring set (see Consensus calculation section). This structure contains the most reliable base pairs from the given set of top-scoring ensemble members.

In Figure 2 we visualize the application of the scoring function using 5S rRNA, one of the families in the training set, as a specific example. Given an alignment of 20 sequences, we first sample a thousand structures from the Boltzmann ensemble (see Ensemble generation section) and keep the unique structures (963 structures). Next, we calculate an RMSD score for each structure in the ensemble. For both the full ensemble and the 10% highest-scoring structures (96 structures), Figure 2 shows the accuracy distribution of all structures in the set, the mean accuracy of the structures, and the accuracy of the consensus structure calculated from the set (see Prediction accuracy section below for details on the reported accuracies). The clear shift in distribution and the improved mean and consensus accuracy in the top-scoring set relative to the full ensemble demonstrates the ability of the scoring function to select the most realistic structures from the ensemble. Since the 5S rRNA data was used for training the algorithm, this example merely shows that the scoring function can memorize learned values. The generalization of the scoring function will be presented in the Results section.


Figure 2
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2. Application of the scoring function. Data shown is for an alignment of 20 5S rRNA sequences and an ensemble of 963 unique structures. The scoring function used the extreme Z-scores and weighted properties. Consensus structures were calculated with a top-cutoff of 10% and a bp-cutoff of 0.4. The accuracy of the structures is expressed by the correlation coefficient (CC). In both sets, the mean accuracy is indicated with a dotted line and the accuracy of the consensus structure with a solid line. The height of a bar corresponds to the fraction of structures in the corresponding accuracy range, and all bars within one set add up to 1.0.

 
Prediction accuracy
The accuracy of a predicted structure with respect to the reference structure can be determined both at the base-pair level and at the level of the structural classification. The base-pair similarity involves three numbers: true positives (TP) are base pairs that occur both in the prediction and in the reference, false positives (FP) are predicted base pairs that are not in the reference structure, and false negatives (FN) are base pairs in the true structure that were not predicted. These values can be combined into three accuracy metrics (24,5,7). Sensitivity (SEN), defined as TP/(TP + FN), is the fraction of the true structure that is predicted correctly. The positive predictive value (PPV), defined as TP/(TP + FP), reports what fraction of the predicted base pairs is correct. The correlation coefficient (CC), an approximation of Matthew's; correlation coefficient, combines these two values in Formula 25. The classification similarity (CS) expresses the topological similarity between two structures. It is defined as the fraction of positions in the classification strings with identical classification. The classification similarity is relevant in this study, because the patterns of nucleotide composition are calculated from the structural classification, and a single base-pair change may have a large effect on the overall topology of the structure.

Consensus calculation
Selecting the most reliable base pairs, such as by calculating a centroid structure, or eliminating unreliable base pairs (based on their base pair probabilities) is a proven concept in RNA structure prediction (26,27). We select the most reliable base pairs by calculating a consensus structure given a set of structures. The first step is to list the frequency of occurrence of each base pair in the set of structures. The consensus structure contains all base pairs that occur with a certain frequency or higher as determined by the ‘base-pair cutoff’ (bp-cutoff). Eligible base pairs are added to the consensus structure one by one from high to low frequency if both the 5' and 3' position are not in the consensus yet. In this way, the consensus structure is free of conflicts, i.e. each base interacts with at most one other base, but might include pseudoknots. At a very high (strict) bp-cutoff only very reliable base pairs will be included in the consensus, leading to a high PPV, but a lower SEN. In contrast, a more relaxed cutoff will result in a higher SEN, but comes with the risk of including false predictions (lower PPV). By testing the consensus accuracy at different cutoff values (0 to 1, steps of 0.05) in several ensembles, we determined that a bp-cutoff of 0.4 results in the best accuracy (CC). The consensus structures in this study are therefore calculated at this cutoff value.

Ensemble generation
Given an RNA alignment, the first step in the procedure is to create an ensemble of candidate structures to be evaluated. For this task, we used the program RNAsubopt 11 from the Vienna RNA package 28 with the -p option to sample suboptimal structures proportional to their Boltzmann weights 12. For each alignment we randomly sampled a thousand structures and removed duplicates from the set. The true structure is not necessarily present in this set. The arbitrary limit of a thousand structures is chosen to limit the CPU time spent on sampling for long sequences. However, the SPuNC web interface provides three different sampling methods and gives the user control over the sample size and removal of duplicate structures.

The prediction power of our method is limited by those RNA structures present in the ensemble. Ideally the structures in the ensemble are distributed over the full range of accuracy up to 100%. For shorter sequences the current sampling methods suffice to generate an ensemble with these characteristics. However, for longer sequences (such as RNase P, and SSU/LSU rRNA) these methods produce insufficient coverage of the spectrum. For these cases we generated, in addition to the original ensemble, an extended ensemble using knowledge of the reference structure. Specifically, we constrained RNAsubopt with constraints fixing between 5% and 95% of either the unpaired or paired positions in the true structure. The extended ensemble is a merge of 500 constrained and 500 unconstrained structures (unique structures only). The purpose of applying our method to the extended ensemble is to show its potential if such extended ensembles could be generated without knowledge of the true structure.

Method performance
The input for our method is an alignment of related RNA sequences that presumably fold into the same structure. We tested the method on 15 different alignments from multiple sources (29,23,30,31), covering a wide range of RNA families (Table 2). The data are divided into three sets. The training set (T) contains three families, which are also used to derive the rules. The validation set (V) contains four families with ‘new’ data. The datasets in the benchmark group (B) are used for performance comparison of the algorithm. These sets were taken directly from the BRaliBase I benchmark study 5, although the sequences were realigned using MUSCLE 32 and the sequence corresponding to the reference structure was added in case no perfect match was found. The reference structures were either consensus structures that came with the alignments or base pair selections, using RNAview 33 and MC-Annotate 34, from experimentally determined structures downloaded from the RCSB Protein Data Bank 35. Pseudoknots were included in the reference, whereas noncanonical base pairs were excluded.


View this table:
[in this window]
[in a new window]

 
Table 2. Alignments and reference structures

 

    RESULTS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 FUNDING
 REFERENCES
 
Here, we report the accuracy of our method on the 15 alignments specified in Table 2. The data are divided into a training, a validation and a benchmark group. The results are calculated using the five compositional properties specified in Table 1. In Table 3 we report the accuracy of the results obtained using extreme reference Z-scores and weighted compositional properties.


View this table:
[in this window]
[in a new window]

 
Table 3. Prediction accuracy (CC) for 15 alignments using extreme reference Z-scores and weighted properties in the scoring function, a top-cutoff of 0.1, and a bp-cutoff of 0.4

 
Scoring function identifies realistic structures
The scoring function succeeds in identifying the most realistic structures from an ensemble. This becomes immediately clear from comparing the mean accuracy of all structures in the ensemble with the mean accuracy of the set of top-scoring structures (Table 3 accuracy column 1 and 3). In the training set the average increase in accuracy is 20.7% (even 25.9% when including the extended ensembles). The method does not only perform well on the training group, as might be expected, but also generalizes to RNA families not used to derive the rules. In the validation dataset the average increase in accuracy is 13.9% with an increase above 20% for the hammerhead ribozyme and the TPP riboswitch. The results are least convincing for the benchmark set, where the average increase is only 8% (or 13.7% when using the extended ensembles). The first reason is that our method does not perform well on the high similarity tRNA dataset, because there is no enough diversity among the sequences to produce any compositional patterns. The second reason for the lower performance is the quality of the ensembles for the large molecules (RNase P, SSU rRNA, LSU rRNA). In the original ensembles, not using any information from the true structure, the most accurate structures have a CC of 0.736, 0.667 and 0.548 respectively. Possibly the sample size needs to be increased for molecules of this size, because a sample of a thousand structures covers a relatively smaller part of structure space than an equally sized sample for a shorter molecule. The method performs much better on the extended ensembles where the accuracy of the most accurate structure is above 0.9.

Improved consensus from top-scoring structures
It has been shown that the centroid structure of a set of suboptimal structures (defined as the structure with the shortest total distance to all structures in the set) is more accurate than the minimum free energy (MFE) prediction 26. Our results confirm this finding: the ensemble consensus (similar to a centroid structure) is more accurate than the MFE prediction in all datasets (Table 3 accuracy column 2 and 5). Moreover, we address the question whether the consensus structure calculated from the top-scoring structures (‘top consensus’) is more accurate than the consensus structure calculated from the full ensemble (‘ensemble consensus’). As described above, the average accuracy of top-scoring structures is higher than the average accuracy of all structures in the ensemble, and thus we expect the consensus from this set to be also more accurate than the consensus from the full set.

For the majority of datasets in our test the consensus from top-scoring structures identified by the scoring function is significantly more accurate than the ensemble consensus (Compare accuracy column 2 and 4 in Table 3). The average increase is 10.9% (15.8%) and 9.1% in the training and validation set respectively. The improvement is most pronounced in datasets with shorter sequences and a diverse ensemble, such as for 5S rRNA, the TPP riboswitch and the glmS ribozyme. For tRNA-PHE, the hammerhead ribozyme, and the purine riboswitch there are fewer structures in the ensemble which are all highly accurate. For these datasets the top consensus has about the same accuracy as the ensemble consensus, because little or no improvement over the ensemble consensus was possible in the first place. Again we obtain mixed results on the benchmark dataset. Performance is poor on the high-similarity tRNA alignment, because of reasons outlined above. For the three large molecules (RNase P, SSU rRNA and LSU rRNA) we note that the ensemble consensus is more accurate in case of the original ensemble generated by RNAsubopt, and that the top consensus is better in case the extended ensemble is used. The consensus calculation seems to be mainly hampered by the composition of the ensemble, since the scoring function rejects the least accurate structures for both ensembles and changing the bp-cutoff did not affect the results. This underlines the fact that our method would clearly benefit from improved ensemble-generation methods for these large molecules.

Accuracy compared to other methods
It is important to compare the performance of a novel structure prediction method to that of existing approaches. A full benchmark study is outside the scope of this work, but we provide two ways to put the accuracy of our method in perspective. First, we applied our method to the datasets provided by the BRaliBase I benchmark study 5, and thus the accuracies reported here can be compared to those in the original benchmark. However, this comparison can be approximate at best, because the sequences are realigned using MUSCLE 32, we use the pseudoknotted structure as reference for all datasets, we do not adjust the reference structure to the alignment via the consistency criterion, and we do not distinguish inconsistent, contradicting and compatible false positives. To provide a more direct framework in addition to this approximate comparison, we applied both RNAfold (28,36,37) and RNAalifold 13 to all alignments in our data collection (Table 3 last two columns). RNAfold predicts the minimum free, energy structure for a single sequence. This type of method achieves in general lower accuracies than multiple-sequence comparative approaches, and we thus expect our method to make more accurate predictions than RNAfold. In contrast, RNAalifold is a top-of-the-line prediction method for multiple-sequence alignments combining thermodynamics and covariation. RNAalifold outperformed most other methods tested in the BRaliBase I benchmark and is thus a good target for our method.

The most remarkable result is that the consensus from top-scoring structures is more accurate than the RNAalifold prediction for several families: the 5S rRNA, the glmS ribozyme and both riboswitches. Our method matches the results of RNAalifold for the tRNA alignment in the training set and the medium-similarity tRNA alignment in the benchmark set. On the hammerhead ribozyme our method achieves good accuracy (0.889), but not as good as RNAalifold (1.0). For the longer sequences our method is hampered by the insufficient coverage in the ensemble, as indicated above. We do demonstrate however that our method has the ability to enhance structure prediction for long molecules when diverse enough ensembles can be generated, but that is of limited practical value at the moment. It emphasizes the need to create more diverse ensembles that include structures of higher accuracy.

Additional results
Table 3 contains the results using the extreme reference Z-scores and weighted addition in the scoring function, because they produced the most accurate predictions. We provide the results for the other three method settings through the SPuNC website. The performance of the method was actually quite consistent across the different settings. The accuracy difference between trained and extreme references was slightly larger than between weighted and unweighted addition. The positive predictive value (PPV) is generally higher than the sensitivity (SEN). The results based on classification similarity (CS) are in agreement with those based on correlation coefficient (CC). Further experiments are necessary to find the optimal Z-score references and weights in the scoring function.

For completeness we also calculated the correlation between the RMSD score and the accuracy (specifically CC) for each dataset, even though the overall correlation does not influence the prediction results as long as low-scoring structures are highly accurate. The results are available through the web interface. In general the observed correlations between score and accuracy (CC) are relatively weak but highly significant (strongest correlation coefficient observed is –0.821, r2 = 0.67).


    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 FUNDING
 REFERENCES
 
The structure-prediction method introduced in this paper uses evolutionary patterns of nucleotide composition to assess the quality of candidate structures for a multiple-sequence alignment. At the heart of our method lies a scoring function that assigns a score to each ensemble member, reflecting its ability to produce the patterns observed in biological structures. The performance test of our method showed that the scoring function is able to identify the most realistic structures in an ensemble. In addition, the prediction accuracy was further improved by calculating a consensus structure from the top-scoring ensemble members. Our method performed well on a wide variety of RNA families, including several novel types of RNA such as riboswitches and ribozymes.

The SPuNC algorithm takes full advantage of the topology of the molecule by exploiting signal from distinct unpaired regions in addition to that from paired regions. Also beneficial is the insensitivity to the exact alignment at each sequence position as long as the residues are classified in the correct structural element. The negative side effect of this feature is that the method can not distinguish between two structures that differ in their base pairs but result in the same classification string, but no practical consequences hereof were observed. Our method has the potential to incorporate pseudoknots and noncanonical base pairs, although adjustments to the ensemble-generation methods and the scoring function would be necessary to fully benefit from these features.

The current scoring function was designed using the compositional patterns as the only source of evolutionary signal. The purpose was to demonstrate that simple, biologically acceptable rules—stems vary mostly in GC content and are GC-rich, unpaired regions are GC-poor and have rather consistent composition over evolutionary time—can be used for structure prediction. Within this framework, the five selected compositional properties have been shown to result in accurate predictions, in some cases even outperforming existing prediction methods. However, further research is necessary to optimize the points of reference and weights in the scoring function.

Rather than capitalizing on this information source in an isolated fashion, future efforts should aim at combining compositional patterns with other sources of information. A promising direction is the incorporation of the scoring function, along with other observations, in a Bayesian framework such as BayesFold 15. The Bayesian inference approaches are becoming more feasible with advances in computational power and already have been shown to be valuable in RNA structure prediction 9. The scores based on the patterns of nucleotide composition could potentially function as prior probabilities on the ensemble members or as a set of observations used to update the posterior probabilities.

Along with optimizations in the scoring function, our method will benefit from improved ensemble-generation methods. The statistical sampling of RNA secondary structures is a major improvement over minimum free energy and near-optimal structure predictions in terms of the coverage of the energy spectrum, but further improvements can be made 9. As becomes clear from our study, a lack of accurate structures in the ensemble, as observed for long sequences such the RNAs in both ribosomal subunits, limits the predictive power of our method. A possible strategy to calculate improved ensembles is to collect all viable base pair regions (with high probability) and to combine them into new structures (which might include pseudoknots in addition). Another option to improve the performance of our method on longer sequences, which does not rely on novel methods for ensemble generation, would be to apply a genetic algorithm: start with a random sample of structures, score each ensemble member and select the most realistic structures (low RMSD scores), let them ‘reproduce’ into a new ensemble by constrained folding, repeat, and progress towards the true structure in this manner.

In conclusion, evolutionary patterns of nucleotide composition should be added to the toolbox for structure prediction. The presented method reaches encouraging accuracies on a wide variety of RNA families. Especially the good result on the riboswitches and ribozymes support the value of this method, because many more of these novel RNAs await discovery and structural characterization.


    FUNDING
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 FUNDING
 REFERENCES
 
This work is supported by the Centre for Medical Systems Biology (CMSB). The CMSB is supported by a Genomics Centre of Excellence grant from the Netherlands Genomics Initiative, which is funded by the Dutch government.

Conflict of interest statement. None declared.


    ACKNOWLEDGEMENTS
 
We are grateful to Sébastien Lemieux for providing a standalone implementation of the MC-Annotate algorithm. We also want to thank Dave Mathews and the members of the Heringa Lab and Knight Lab for their feedback on our structure prediction method.


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 FUNDING
 REFERENCES
 

  1. Eddy SR. Non-coding RNA genes and the modern RNA world. Nat. Rev. Genet. (2001) 2:919–929.[CrossRef][Web of Science][Medline]

  2. Tinoco I Jr, Uhlenbeck OC, Levine MD. Estimation of secondary structure in ribonucleic acids. Nature (1971) 230:362–367.[CrossRef][Medline]

  3. Zuker M. Calculating nucleic acid secondary structure. Curr. Opin. Struct. Biol. (2000) 10:303–310.[CrossRef][Web of Science][Medline]

  4. Major F, Griffey R. Computational methods for RNA structure determination. Curr. Opin. Struct. Biol. (2001) 11:282–286.[CrossRef][Web of Science][Medline]

  5. Gardner PP, Giegerich R. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics (2004) 5:140.[CrossRef][Medline]

  6. Mathews DH. Revolutions in RNA secondary structure prediction. J. Mol. Biol. (2006) 359:526–532.[CrossRef][Web of Science][Medline]

  7. Mathews DH, Turner DH. Prediction of RNA secondary structure by free energy minimization. Curr. Opin. Struct. Biol. (2006) 16:270–278.[CrossRef][Web of Science][Medline]

  8. Reeder J, Hochsmann M, Rehmsmeier M, Voss B, Giegerich R. Beyond Mfold: recent advances in RNA bioinformatics. J. Biotechnol. (2006) 124:41–55.[CrossRef][Web of Science][Medline]

  9. Ding Y. Statistical and Bayesian approaches to RNA secondary structure prediction. RNA (2006) 12:323–331.[Abstract/Free Full Text]

  10. Gutell RR, Lee JC, Cannone JJ. The accuracy of ribosomal RNA comparative structure models. Curr. Opin. Struct. Biol. (2002) 12:301–10.[CrossRef][Web of Science][Medline]

  11. Wuchty S, Fontana W, Hofacker IL, Schuster P. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers (1999) 49:145–65.[CrossRef][Web of Science][Medline]

  12. Ding Y, Lawrence CE. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res. (2003) 31:7280–7301.[Abstract/Free Full Text]

  13. Hofacker IL, Fekete M, Stadler PF. Secondary structure prediction for aligned RNA sequences. J. Mol. Biol. (2002) 319:1059–1066.[CrossRef][Web of Science][Medline]

  14. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl Acad. Sci. (2004) 101:7287–7292.[Abstract/Free Full Text]

  15. Knight R, Birmingham A, Yarus M. BayesFold: rational 2° folds that combine thermodynamic, covariation, and chemical data for aligned RNA sequences. RNA (2004) 10:1323–36.[Abstract/Free Full Text]

  16. Giegerich R, Voss B, Rehmsmeier M. Abstract shapes of RNA. Nucleic Acids Res. (2004) 32:4843–4851.[Abstract/Free Full Text]

  17. Reeder J, Giegerich R. Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction. Bioinformatics (2005) 21:3516–3523.[Abstract/Free Full Text]

  18. Steffen P, Voss B, Rehmsmeier M, Reeder J, Giegerich R. RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics (2006) 22:500–503.[Abstract/Free Full Text]

  19. Lemieux S, Major F. Automated extraction and classification of RNA tertiary structure cyclic motifs. Nucleic Acids Res. (2006) 34:2340–2346.[Abstract/Free Full Text]

  20. Parisien M, Major F. The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature (2008) 452:51–55.[CrossRef][Web of Science][Medline]

  21. Schultes E, Hraber PT, LaBean TH. Global similarities in nucleotide base composition among disparate functional classes of single-stranded RNA imply adaptive evolutionary convergence. RNA (1997) 3:792–806.[Abstract]

  22. Smit S, Yarus M, Knight R. Natural selection is not required to explain universal compositional patterns in rRNA secondary structure categories. RNA (2006) 12:1–14.[Abstract/Free Full Text]

  23. Cannone JJ, Subramanian S, Schnare MN, Collett JR, D'S;ouza LM, Du Y, Feng B, Lin N, Madabusi LV, Muller KM, et al. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics (2002) 3:2.[CrossRef][Medline]

  24. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics (2000) 16:412–424.[Abstract/Free Full Text]

  25. Gorodkin J, Stricklin SL, Stormo GD. Discovering common stem-loop motifs in unaligned RNA sequences. Nucleic Acids Res. (2001) 29:2135–2144.[Abstract/Free Full Text]

  26. Ding Y, Chan CY, Lawrence CE. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA (2005) 11:1157–1166.[Abstract/Free Full Text]

  27. Mathews DH. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA (2004) 10:1178–1190.[Abstract/Free Full Text]

  28. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatsh. Chem (1994) 125:167–188.[CrossRef]

  29. Szymanski M, Barciszewska MZ, Erdmann VA, Barciszewski J. 5S Ribosomal RNA Database. Nucleic Acids Res. (2002) 30:176–178.[Abstract/Free Full Text]

  30. Sprinzl M, Vassilenko KS. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. (2005) 33. (Database issue):D139–D140.

  31. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. (2005) 33. (Database issue):D121–D124.

  32. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. (2004) 32:1792–1797.[Abstract/Free Full Text]

  33. Yang H, Jossinet F, Leontis N, Chen L, Westbrook J, Berman H, Westhof E. Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res. (2003) 31:3450–3460.[Abstract/Free Full Text]

  34. Gendron P, Lemieux S, Major F. Quantitative analysis of nucleic acid three-dimensional structures. J. Mol. Biol. (2001) 308:919–936.[CrossRef][Web of Science][Medline]

  35. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. (2000) 28:235–242.[Abstract/Free Full Text]

  36. Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. (1981) 9:133–48.[Abstract/Free Full Text]

  37. McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers (1990) 29:1105–19.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow Print PDF (338K) Freely available
Right arrow Screen PDF (219K) Freely available
Right arrowOA All Versions of this Article:
37/5/1378    most recent
gkn987v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Commercial Re-use Guidelines
for Open Access NAR Content
Google Scholar
Right arrow Articles by Smit, S.
Right arrow Articles by Heringa, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Smit, S.
Right arrow Articles by Heringa, J.
Related Collections
Right arrow Nucleic acid structure
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?