Nucleic Acids Research, 2003, Vol. 31, No. 9 e49
© 2003 Oxford University Press
On the normalization of RNA equilibrium free energy to the length of the sequence
Bioinformatics Program, Boston University, Boston, MA 02215, USA
*To whom correspondence should be addressed. Tel: +1 617 353 5463; Fax: +1 617 353 5462; Email: dp{at}bu.edu
Received November 17, 2002; Revised February 19, 2003; Accepted March 3, 2003
| ABSTRACT |
|---|
|
|
|---|
There is no universal definition of stability for RNA secondary structures. Here we present an approach that is based on normalization of the equilibrium free energy to the length of the sequence: a segment of RNA is said to be stable if the ratio of the equilibrium free energy to the length of the segment is greater than a certain threshold value. Discarding the segments whose normalized equilibrium free energies are smaller than the threshold allows us to view the secondary structure at different levels of stability. Confined to only highly stable structures, the algorithm for secondary structure prediction admits a number of simplifications that make it computationally tractable for large sequences and advantageous over most other methods on a genome-wide scale. This method was applied to the Caenorhabditis elegans genome to localize the regions that encode stable secondary structures. In particular, 36 of 56 previously reported micro-RNAs were localized to 4% of the genome. A fraction of long (
400 nt) stable inverted repeats in the genomic sequence of C.elegans was found. Their distribution is very uneven, and skewed towards the ends of chromosomes. This method can be used for genome-wide detection of transcription termination signals, putative micro-RNAs, and other regulatory elements that involve stable RNA secondary structures. | INTRODUCTION |
|---|
|
|
|---|
Existing RNA folding algorithms fall into two major classes: MFOLD-type algorithms and covariance models. MFOLD-type algorithms find structures with the lowest equilibrium free energy and share a common dynamic programming core, while covariance models recognize positions that are covarying to maintain base complementarity and use the strategy of hidden Markov models (1). The dynamic programming approach to RNA secondary structure prediction was pioneered
30 years ago. Nussinov et al. presented a simple free energy function that is minimized when the secondary structure contains the maximum number of complementary base pairs (2). Tinoco et al. calculated the free energy as a sum of independent energies for each of the loops in the structure, rather than for each of the complementary base pairs (3). Zuker and co-workers introduced a more realistic free energy function that is based on experimentally determined thermodynamic parameters and takes into account coaxial stacking energies (4,5). Zukers algorithm requires O(n3) time and O(n2) memory for a sequence of length n but does not allow for pseudoknots. The output of an MFOLD-type algorithm is a list of intramolecular base pairings (i.e. the secondary structure) that corresponds to the lowest equilibrium free energy. The structure is derived from the recursively calculated dynamic programming matrix, whose entries are the minimal folding energies of all segments of the sequence. This matrix also contains information about all suboptimal structures. Suboptimal structures have energies that are nearly equal to the lowest equilibrium free energy and, therefore, are almost as thermodynamically stable as the optimal structure. Some of the base pairs are present in all or in the majority of suboptimal structures, while other pairings are specific for only a few of them. This suggests that the secondary structure consists of core and variable parts.
The definition of the core part, as being the segment that has the largest negative equilibrium free energy, is confronted with the following difficulty: longer segments have lower free energy because they have more bases to pair, and the segment with the largest negative free energy turns out to be the entire sequence. This leads us to the idea of normalization of the equilibrium free energy to the length of the segment, i.e. in order to treat all parts of the sequence equitably, we need to divide the equilibrium free energy of each segment by an appropriate quantity that is a function of length. In this work, we explore only the case when this function is linear, since it is the most appropriate normalization in the context of stability: the minimum equilibrium free energy over all possible sequences of length n grows linearly with n. A linear normalization factor also has an advantage of scaling the free energies from zero to one, and allows analytical development of statistical frameworks. The maximum number of base pairs approximation, which is discussed in the next section, makes this method computationally tractable for large sequences and advantageous over most other methods on a genome-wide scale.
| METHODS |
|---|
|
|
|---|
The free energies do not need to be calculated separately for each segment of the sequence, as they are already encrypted in the dynamic programming matrix provided by a secondary structure prediction tool. Denote the entries of this matrix by Eij. Once Eij are known, we divide them by the lengths of corresponding segments. This transforms Eij to
ij, where
The quantity
ij has units of linear density of free energy and is referred to as the stability factor. If we now choose a threshold
for the stability factor and drop all segments that have values of
ij smaller than
, then the rest of the sequence will correspond to the stable part of the secondary structure. This stable part depends on
and is called
-stable. Stable structures with different levels of stability are related to each other monotonically: if
1 >
2, then the
1-stable structure (as a set of base pairs) is a subset of the
2-stable structure.
Before elaborating on these ideas, let us make a number of simplifications. First, if we are interested in highly stable secondary structures, i.e. ones with only few bases unpaired, then we do not need much accuracy in the prediction of the equilibrium free energy and can roughly approximate it (up to a constant factor) with the maximum number of paired bases. This facilitates an increase in speed and the ability to manipulate the matrix Eij analytically. Once the regions that have high density of equilibrium free energy are found, their secondary structures can be refined with MFOLD. Later we show that stemloops are, in general, more stable than the other types of unknotted secondary structures. Branching structures usually have large unpaired regions at the points where they branch and, therefore, have a smaller percentage of paired bases per unit length. By the same reasoning, the most stable stemloops are ones that have long stems and short loops. We make use of these observations by considering only non-branching secondary structures. This allows us to implement the calculations of Eij in quadratic time and space using a simplified version of the original Nussinov algorithm (2). If we confine ourselves to the stemloops whose lengths are uniformly bounded by a constant k, then we would not need Eij for |j i| > k. In this case, calculation of Eij can be done in linear time and space.
From now on, we consider only non-branching secondary structures and assume that Eij is the maximum number of complementary base pairs in the segment between bases i and j. It is clear that 0
Eij
j i + 1 and, therefore, 0
ij
1 for all i and j. This shows that the natural range of the stability threshold is from zero to one. As we will see later, the typical values of stability threshold for highly stable structures (i.e. ones of interest to us) vary from 0.6 to 0.9.
Approximation of the free energy with the maximum number of complementary bases allows us to answer important statistical questions. First, we estimate the probability distribution function of equilibrium free energies. Then we calculate the probability of occurrence of an n-long and
-stable subsequence in a random sequence of a given length. From these results, we infer the critical length of a stemloop, whose presence in a sequence of given length is essentially non-random. Namely, the critical length (in bases) of a stemloop n0 at the significance level
is given by the formula
where
Here N is the sequence length,
is the stability threshold, e is the base of the natural logarithm, and p is the parameter that describes the nucleotide composition of the sequence (p = 0.25 for uniform distribution of bases). The reader is referred to the Appendix for a detailed derivation of these formulae. Table 1 gives an example of calculations of critical stemloop lengths for p = 0.25 and N = 108.
|
| RESULTS |
|---|
|
|
|---|
Based on the method described above, we developed STLS which is a program for prediction of stable RNA secondary structures. The input of STLS consists of three components: nucleotide sequence, stability threshold and other parameters. The first two arguments have obvious meaning. A description of other parameters is found in the STLS manual. The output of STLS tabulates stability factors, positions and secondary structures of all
-stable segments. A copy of STLS can be obtained from our website http://genomics10.bu.edu/dp/rna/stls/ or from the authors on request. In this section, we give a number of tests for STLS and also list some of its applications.
To test the accuracy of STLS predictions, we generated 700 random 400 nt long sequences with uniform distribution of bases. For each sequence, we took the secondary structure Ss predicted by STLS and compared it with the secondary structure Sv predicted by the VIENNA algorithm (6). As a quantitative measure of accuracy, we used
, the ratio of the number of base pairs that belong to both Ss and Sv to the number of base pairs that belong to Ss, i.e.
= | Ss
Sv|/| Ss|, where |S| denotes the cardinality of a set S. In other words,
is the specificity of the prediction of STLS with respect to the prediction of VIENNA. For
= 0.68, 0.72, 0.78 and 0.80, we obtained the following values of
: 0.88, 0.91, 0.98 and 0.99, respectively. This indicates that the predictions of STLS are consistent with the predictions of the traditional method if the stability threshold is high enough.
Another accuracy test was performed on the set of 56 micro-RNAs in Caenorhabditis elegans. Recall that micro-RNAs are small (
22 nt) regulatory RNAs that are generated from the common stemloop precursors by the process requiring Dicer, a protein that cleaves the stemloop (7). These precursors are structures with short loops and long stems, which have only a few small bulges or internal loops. This study was motivated by the challenge to determine whether STLS can find micro-RNA sequences in the C.elegans genome. The complete sequence of C.elegans was obtained from the WORMBASE website (8) and then scanned with STLS using stability thresholds varying from 0.68 to 0.82. For
= 0.68, it yielded approximately 44 000 stable stemloops, which together correspond to
4% of the genome. It turned out that 36 of 56 micro-RNAs (64.3%) were on our list.
However, micro-RNAs correspond to only a small fraction of the stemloops found by STLS. It is interesting to explore the rest of the list. We examined all 44 000 stemloops for their lengths, locations in chromosomes and membership in annotated functional parts of the genome. The distribution of length (Fig. 1) revealed that most of them are short (<100 nt). However, there is a large fraction of very long stable stemloops (
400 nt), which can be seen at higher thresholds (
= 0.82). In the next experiment, we investigated the spatial density of stable stemloops. Figure 2 shows that this distribution is very uneven and biased towards the ends of chromosomes. To verify this observation numerically, we subdivided all chromosomes into 250 bins of equal length and calculated the number of stemloops ni, and GC content ßi for each bin. The values of the Pearson correlation coefficient r (for ni versus 2·|ßi 0.5|), Spearman rank correlation r' and
2 statistics (for ni) are given in Table 3. The P-values for the
2 test with n = 249 degrees of freedom were computed using approximation by normal distribution with mean n and standard deviation
2n.
|
|
|
In an attempt to elucidate possible functions of the stemloops found, we mapped them to annotated introns, exons, intergenic regions, and 5'- and 3'-untranslated regions (UTRs). The quality of this analysis, however, is very dependent on the quality of the predictions of intronexon boundaries and UTRs. As the actual length of the UTRs was not known, we used 180 nt long sequences upstream and downstream of the corresponding genes. The corresponding densities of the stable stemloops in introns, exons, genes (introns + exons), intergenic regions, 5'- and 3'-UTRs were 23.4, 11.6, 16.3, 17.4, 7.9 and 10.8 bases of stemloops per 1000 of bases of sequence, respectively. Note that introns have higher, while UTRs have lower values of stemloop density compared with exons and intergenic regions.
| DISCUSSION |
|---|
|
|
|---|
A common argument against the maximum number of base pairs approach to the RNA secondary structure prediction problem is that the strands of a helix are held together by coaxial stacking interactions of paired bases rather than by hydrogen bonds. In the VIENNA package, the energies of helices are calculated by adding stacking energies for each pair of neighboring base pairs. However, if we set the stability threshold at 0.78 or higher, then the predictions of both VIENNA and STLS are essentially the same (specificity
= 98%), although STLS does not take stacking energies into account. This happens because at high value of threshold, we a priori confine ourselves to very long helices. In any double-stranded structure of length n there are n base pairings, and n 1 stacking interactions. Certainly, as n increases, the differences between stacking energies of different base pairs average out, yielding a quantity that is proportional to the number of stacking interactions. Therefore, when the normalization is performed, the discrepancy between a purely base pairing energy function and the energy function that also takes stacking energies into account decreases as n gets larger and (n 1)/n becomes closer to 1. This explains why lengthy stemloops were predicted correctly by STLS at high threshold levels. Of course, the predictions of STLS and VIENNA differ more significantly when
is smaller than 0.78, which has to do with the fact that stacking does not treat all combinations of bases equally, while STLS scores them the same. Note that one of the advantages of STLS is an increase in calculation speed (
5000-fold for a 500 nt long sequence). Thus, in the cases when lower thresholds are needed, one can use STLS for preliminary identification of the stable regions, and then refine their secondary structures with VIENNA or MFOLD.
In the experiment with micro-RNAs, we were able to localize
64% of micro-RNAs (7) to a list of 44 000 segments (
4% of the genome) without any prior knowledge about their positions and relying solely on the hypothesis that they belong to stable stemloops. This result, of course, is a usual interplay between sensitivity and specificity: the sensitivity drops dramatically when the specificity increases (Table 2). There were no micro-RNAs found for
= 0.76. This fact might indicate that the stemloop precursors of micro-RNAs have to be stable but slightly imperfect in order to be recognized by the cellular machinery and bypass degradation pathways.
|
While micro-RNAs are needed to repress the translation of a target gene, the function of the other stemloops, as well as their origin, remains unclear. Figure 1 shows that most of them have a length of
100 nt, but there is also a fraction of longer (
400 nt) stemloops. According to Table 1, the presence of these long stemloops is qualified as essentially non-random. It is remarkable that they were seen at all four thresholds, separating from the short fractions when the threshold increases. The similarity of the structures of the peaks in Figure 1b, c and d suggests that these very long stemloops are threshold independent.
From now on, we set
= 0.82. The regularity of the peaks in Figure 1d suggests that such a family of stemloops might appear as a result of gene duplication or be developed by exogenous factors such as multiple incorporation of double-stranded genetic material into the same spot on DNA. The latter hypothesis is supported by the fact that the abscissae of the peaks in Figure 1 are almost multiples of each other. However, the spatial correlation between the locations of stemloops (Fig. 2) is indicative of generation through tandem duplication. Indeed, repetitive elements often have inverted repeats associated with them (in the form of long terminal repeats), and as such would be expected to score highly in STLS.
It is very remarkable that the distribution of stable stemloops in C.elegans chromosomes is uneven and biased towards their ends (see Fig. 2 and P-values in Table 3). This finding is in good agreement with the results of Surzycki and Belknap on the distribution of MITE-like repeats in C.elegans (9) and with the fact that central regions of autosomes (chromosomes IV) have a higher density of genes than their arms (10). One may expect a greater probability of paired bases in sequences that are either GC-rich or GC-poor, since the probability of any two bases being complementary is higher in such sequences. Our analysis (Pearson and Spearman statistics in Table 3) showed that this did not contribute significantly.
Also, it is not entirely clear whether the clusters of stable stemloops carry out any biological function. Recall that the density of the stable stemloops in introns and intergenic regions is at least twice as high as in UTRs. This observation is not surprising because the UTRs usually contain important cis-elements that are responsible for proteinRNA interactions; secondary structures potentially would compete with or disrupt such interactions and therefore might be expected to be selected against in the UTR sequence.
Conclusions
We present an efficient method for prediction of stable RNA secondary structure. The key part of the method relies on the normalization of the equilibrium free energy to the length of the sequence. In the class of stable secondary structures, the algorithm was shown to have good performance for long sequences, and the results are consistent with other RNA secondary structure prediction methods. Using this method, we located the regions in the C.elegans genomic sequence that encode stable secondary structures and characterized their distributions. In particular, we localized 64% of micro-RNAs previously reported by Lau in
4% of the genome relying solely on the property of micro-RNAs belonging to the stable stemloops. We report that there is a fraction of long (
400 nt) stable stemloops in the C.elegans genome; their distribution is very uneven and skewed towards the ends of chromosomes. The method we developed can be used for the detection of transcription termination signals and putative micro-RNAs, as well as many other regulatory elements that correspond to stable secondary structure.
| ACKNOWLEDGEMENTS |
|---|
D.D.P. wishes to thank Chris Sander, Debbie Marks, Charles DeLisi and Robert Berwick for their valuable comments and insightful discussions. This work is supported by the Bioinformatics Program at Boston University.
| APPENDIX |
|---|
|
|
|---|
We recall the formal language of secondary structures. Let X = (x1,...xn) be a sequence of letters from the alphabet
= {A,C,T,G}. A secondary structure is a set S of pairs (i,j) with 1
i
j
n such that for all (i,j), (i',j')
S, the condition i = i' implies j = j', and vice versa. The relationship
, where (i',j')
(i,j) if i < i' < j' < j, defines a partial order on S. A secondary structure is said to be non-branching if
is a linear order, i.e. if
' or
'
for all
,
'
S. Consider the following additive energy function
where e(.,.) is a scoring matrix. For simplicity, now we assume that e(x,y) = 2 if x and y are WatsonCrick complementary bases and e(x,y) = 0 otherwise. We define
E(X) = masx{E(X,S)|S is non branching}
i.e. E(X) is the maximum number of complementary bases in the sequence X over all possible non-branching secondary structures.
For a given sequence X, the value of E(X) is calculated recursively by the formula
Eij = max{Ei+1j, Eij1, Ei+1j1 + e(xi,xj)},
where Eij = E((xi,...,xj)), i,j = 1...n. Then the matrix Eij is transformed to the matrix
ij using equation 1, and then all regions that have
ij greater than a certain threshold value are identified. If the length of a stemloop has a prior upper bound d, then in place of Eij we consider the matrix E'ik = Eii+k, which is also expressed recursively as
E'ik = max{E'i+1k1, E'ik1, E'i+1k2 + e(xi,xi+k)},
where i = 1...n d and k = 1...d, and then it is transformed to the matrix
ij.
Now we want to calculate the probability of observing a stable stemloop in a random sequence. Suppose that X is a sequence of independent random letters x1,...,xn that came from the same (but not necessarily uniform) distribution. Let p denote the probability that two independent random letters from this distribution are complementary. Then En = E(X) is a random number that depends only on n and p.
We are ready to estimate the probability distribution of En. Define Pn(k) as the probability that En
k. We may assume that k is an even integer. For non-branching structures, the event {En
k} can be decomposed into the sum of smaller mutually exclusive events. Namely, the outmost arc of a non-branching secondary structure that connects bases i and j can be placed in n + l 1 different ways, where l = j i + 1, and there are n possibilities for choosing the value of l. Since the probability that the outmost arc has length l is smaller than or equal to p, we get
Applying equation 4 to itself recursively s times, we get
This process stops when k = 2s. The last term in the product, i.e. the sum of ls1 ls over ls = k,...ls1, is equal to 1/2 (ls1 k)2. The next innermost sum, which after change of index becomes the sum of (ls2 k i)i2 over i = 0,...,ls2 k, is calculated using the following continuous approximation
Thus, after s iterations, we get
Using the Stirling formula, we have
where e is the base of the natural logarithm. Equivalently, if we denote k/n by
, then
where
(
) is given by equation 3. Note that Pn(
) is the probability that a random sequence of length n has a stability factor greater than or equal to
. Now we fix
and consider Pn(
) as a function of n. The Inequality 5 gives us an upper limit estimate for Pn(
). We are interested in a range of
such that (
(
))n decays when n increases. This condition holds only if
0 = e
p/(1 + e
p). Particularly, for the uniform distribution, we have p = 0.25 and
0 = 0.57. The approximation for k! assumes that the values of k and, therefore, of n are large enough. Thus, Inequality 5 effectively estimates only the tail of the function Pn(
).
It trivially follows from Inequality 5 that if N>>n then the probability that a random sequence of length N contains an n-long and
-stable subsequence is N · Pn(
). Now we are interested in the critical value of n, at which the presence of an n-long and
-stable subsequence in a sequence of length N can be considered as non-random. Simple algebra proves that if N>>n, N(
(
))n<<1 and n is large enough, then the critical value of n at significance level
is given by equation 2.
| REFERENCES |
|---|
|
|
|---|
- Durbin,R., Eddy,S.R., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis. Cambridge University Press.
- Nussinov,R., Pieczenik,G., Griggs,J.R. and Kleitman,D.J. (1978) Algorithms for loop matching. J. Appl. Math., 35, 6882.
- Tinoco,I.,Jr, Uhlenbeck,O.C. and Levine,M.D. (1971) Estimation of secondary structures of ribonucleic acids. Nature, 230, 362367.[CrossRef][Medline]
- Walter,A.E., Turner,D.H., Kim,J., Lyttle,M.H., Muller,P., Mathews,D.H. and Zuker,M. (1994) Coaxial stacking of helixes enhances binding of oligoribonucleotides and improves predictions of RNA folding. Proc. Natl Acad. Sci. USA, 91, 92189222.
[Abstract/Free Full Text] - Mathews,D.H., Sabina,J., Zucker,M. and Turner,H. (1999) Expanded sequence dependence of thermodynamic parameters provides robust prediction of RNA secondary structure. J. Mol. Biol., 288, 911940.[CrossRef][Web of Science][Medline]
- Hofacker,I.L., Fontana,W., Stadler,P.F., Bonhoeffer,L.S., Tacker,M. and Schuster,P. (1994) Fast folding and comparison of RNA secondary structures. Monatsh. Chem., 125, 167188.[CrossRef]
- Lau,N.C., Lim,L.P., Weinstein,E.G. and Bartel,D.P. (2001) An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science, 294, 858862.
[Abstract/Free Full Text] - Stein,L., Sternberg,P., Durbin,R., Thierry-Mieg,J. and Spieth,J. (2001) WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res., 29, 8286.
[Abstract/Free Full Text] - Surzycki,S.A. and Belknap,W.R. (2000) Repetitive-DNA elements are similarly distributed on Caenorhabditis elegans autosomes. Genetics, 97, 245249.
- Duret,L., Marais,G. and Biémont,C. (2000) Transposons but not retrotransposons are located preferentially in regions of high recombination rate in Caenorhabditis elegans. Genetics, 156, 16611669.
[Abstract/Free Full Text]
This article has been cited by other articles:
![]() |
E. de Wit, S. E.V. Linsen, E. Cuppen, and E. Berezikov Repertoire and evolution of miRNA genes in four divergent nematode species Genome Res., November 1, 2009; 19(11): 2064 - 2074. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. L. S. Ng and S. K. Mishra De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures Bioinformatics, June 1, 2007; 23(11): 1321 - 1330. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. NG Kwang Loong and S. K. Mishra Unique folding of precursor microRNAs: Quantitative evidence and implications for de novo identification RNA, February 1, 2007; 13(2): 170 - 187. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. B. Carlini Context-Dependent Codon Bias and Messenger RNA Longevity in the Yeast Transcriptome Mol. Biol. Evol., June 1, 2005; 22(6): 1403 - 1411. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





