RAGA: RNA sequence alignment by genetic algorithmCédric Notredame1,*, Emmet A. O'Brien1,2 and Desmond G. Higgins1,2
1EMBL Outstation-The European Bioinformatics Institute, Welcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK and 2Department of Biochemistry, University College, Cork, Ireland
Received July 23, 1997;Revised and Accepted October 1, 1997
ABSTRACT
We describe a new approach for accurately aligning two homologous RNA sequences when the secondary structure of one of them is known. To do so we developed two software packages, called RAGA and PRAGA, which use a genetic algorithm approach to optimize the alignments. RAGA is mainly an extension of SAGA, an earlier package for multiple protein sequence alignment. In PRAGA several genetic algorithms run in parallel and exchange individual solutions. This method allows us to optimize an objective function that describes the quality of a RNA pairwise alignment, taking into account both primary and secondary structure, including pseudoknots. We report results obtained using PRAGA on nine test cases of pairs of eukaryotic small subunit rRNA sequence (nuclear and mitochondrial).
INTRODUCTION
Most methods of alignment are based on the primary structure of the sequences to be analysed (1 ). Alignment may be straightforward when the primary structure is conserved but becomes less and less accurate as the evolutionary distance increases. In the case of RNA it may be possible to use secondary structure information to supplement the weak primary structure information. Such alignments, using primary and secondary constraints, have been built for rRNAs (2 ,3 ). Their construction is at least partially manual and is usually based on identification of sets of correlated mutations which suggest secondary structure interactions.
One justification for such methods is the fact that accurate alignment is still the main non-experimental way to establish a reliable secondary structure for a long RNA molecule. The only other alternative is ab initio prediction. Several techniques of this type have been developed over time (4 ,5 ), but they recently received renewed attention through the use of stochastic heuristic-based approaches, like simulated annealing (6 ,7 ) and genetic algorithms (8 ,9 ). Nevertheless, they remain limited by the fact that our understanding of the in vivo folding process is still incomplete. In contrast, homology analysis based on alignments does not have these limitations. Multiple alignments reveal the positions of the sequences on which some constraints exist, regardless of the actual cause of these constraints. Several algorithms have been developed for aligning RNA sequences taking into account primary and secondary information. Some methods attempt to simultaneously align and fold sequences (10 -12 ). Their main drawback is that they remain limited to sets of short sequences (<200 nt long).
To reduce the complexity of this problem it is also possible to align a sequence (or a set of sequences) of unknown structure to some pre-established reference master structure. Such alignments include non-local interactions and their solution has been shown to be NP hard (13 ). Nevertheless, for small sequences sensible results can be obtained. Several methods of this type, based on the use of stochastic context free grammar (SCFG) for the description of RNA non-pseudoknotted secondary structure, have been described (11 ,14 ,15 ). Pseudoknots, however, are important motifs in RNA folding (16 ) and recently some new results have been obtained on this aspect of RNA analysis. This includes the work of Tabaska and Stormo (17 ) for aligning an RNA sequence to a pseudoknotted structure in polynomial time. Unfortunately, all these methods have the limitation of being computationally very expensive and therefore remain restricted to small sequences (<200 nt long).
This problem can be partially overcome by heuristic methods. Corpet and Michot have described an approach of this type (18 ). In this case a heuristic allows identification of the portions of an alignment that can be made without using secondary structure information. The remaining portions, if they are small enough, can then be aligned using non-local interactions. This is done with a specialized dynamic programming algorithm. Although this algorithm is less efficient than that described for SCFG-based alignments, the heuristic filtering makes it possible, in some cases, to align long RNA molecules (e.g. >1500 nt). At the moment this is largely beyond the scope of any SCFG-based algorithm. Unfortunately, the algorithm cannot deal with very divergent sequences and does not support the computation of pseudoknots. As opposed to the SCFG-based scoring scheme, that used by Corpet and Michot has no real theoretical justification. Nevertheless, it has the merit of being conceptually simple as well as leading to computation of sensible alignments (as judged by comparison with established reference alignments) (18 ).
For this reason we took the overall approach and scoring scheme of Corpet and Michot but used a genetic algorithm (GA) to carry out the optimization. This has two significant advantages. Firstly, in the GA context there is no difference between the handling of pseudoknots and any other secondary structure. Secondly, it is possible to attempt to find alignments between much longer sequences, such as complete small subunit rRNAs, which can be >2000 nt in length. The genetic algorithms (GA) (19 ,20 ), like simulated annealing (21 ) or Gibbs sampling (22 ), is a stochastic optimization technique. It involves an attempt at optimizing some cost function (objective function, OF) by modifying and combining a population of solutions (individuals). GAs do not guarantee an optimal solution, but are known to perform well with combinatorial or enumeration problems.
We based our approach directly on a previous package, SAGA (Sequence Alignment by Genetic Algorithm) (23 ). This algorithm was improved and parallelized. A suitable cost function that describes the quality of a RNA alignment was introduced, mostly based on the function described by Corpet and Michot (18 ). The package was name PRAGA for Parallel RNA Alignment by Genetic Algorithm. We compared this algorithm with traditional techniques of sequence alignment and, to some extent, with the program RNAlign (18 ).
METHODS
The aim is to align two related sequences of RNA, knowing the secondary structure (master structure) of one of them (master sequence), in order to predict the position of these structural elements in the second sequence (slave sequence). In the correct alignment elements of the two sequences sharing the same structure and/or sequence should be aligned (Fig. 1 a-c). A measure (OF) can be designed that allows evaluation of the quality of such an alignment. This measure takes into account the quality of the sequence alignment and the stability of the folding induced by the master sequence onto the slave sequence. To produce the best scoring alignment according to this measure we used a GA. We also describe a parallel GA that we have designed to gain some speed over a serial one. The results obtained with several sets of sequences were compared with established reference alignments of the same sequences, using three comparison methods.
RESULTS
Dynamic programming reference
For each test case a pairwise alignment was produced using dynamic programming with local gap penalties. Another was made without local penalties using ClustalW. We compared these alignments to their reference using m1, m2, m3 and found that alignments made with local penalties were ~10% (as measured with m1 and m3) more similar to the reference than alignments made without local gap penalties. We then measured the distance between the sequences in each reference alignment, as described in the previous section. Similarity to the reference alignment, using measures m1, m2 and m3, was plotted against these distances. As expected, we find that there is a clear correlation between the level of identity of the sequences aligned and the similarity of their pairwise alignment to the reference. In order to improve on these results we introduced secondary structure information into the alignment procedure and used PRAGA to do so.
Figure 3. Complexity. (a) Time (in generations) required to find a solution as a function of [lambda]. The distances are as in Table 1 and measure the distance between the slave and the master sequence. Four test cases producing alignments of similar length were used (6, 5, 4 and 3, which have lengths comprised between 900 and 1000 nucleotides). (b) Time (in generations) needed to find an optimum as a function of alignment length. For each of the four test cases measures were made varying [lambda]. The four test cases used (6, 8, 7 and 9) have a comparable distance between the master and the slave sequence (1.23-1.33). In our system (see Methods) the time required for one generation was ~54 s.
Efficiency and accuracy of PRAGA
Since the optimization procedure is central to our work, we analysed PRAGA for its ability to perform this task. We looked at two criteria: the accuracy of optimization and the consistency of the results. Our algorithm, being a stochastic heuristic, can be expected to give different results when run several times with the same set of parameters. In order to have a program that is as reliable as possible one would like to minimize the level of variation from one run to another.
Figure 4. Evaluation of optimum [lambda] on test case 7. (a) m1 and m3 (see Methods) were measured on the alignment produced by PRAGA with different values of [lambda]. (b) As (a) but with m2.
First, we checked, through the use of crossovers and mutations, that our program was able to reproduce the patterns of gaps and matches present in any of the reference alignments. We did so by using as an OF the measure of overall identity (m1) between a PRAGA alignment and the reference alignment. For all the test cases the GA was able to produce alignments 100% identical to the reference. Several runs were made for each test case that showed a total consistency in the scores. This is a good sign that PRAGA has the potential to explore the whole solution space when aligning two sequences.
Since dynamic programming with local gap penalties is equivalent to the OF described in the method with [lambda] = 0, we checked that when using such an OF PRAGA was able to reproduce the dynamic programming alignments. In all cases it managed to produce alignments having exactly the same score as the dynamic programming reference. Here again we found a very good consistency from run to run (<0.1% deviation). When looking at the similarity between these alignments and the reference we found that the deviation was significantly higher (2.2% on m1, 2.1% on m3, 0.1 residues on m2). The highest variations were found for alignments where the two sequences aligned shared a low level of identity. This fact is not surprising, since it is well known that several alternative alignments of the same sequences can share the same score (36 ). This is simply a consequence of the OF properties.
Figure 5. Optimal values of [lambda] as a function of the slave/master distance. For each test case the values of [lambda] leading to the best alignment were measured on plots similar to that shown in Figure 4 and plotted against the slave/master distance. [
In a second stage PRAGA was tested with values of [lambda] set between 1 and 9. Each test case was analysed, four runs being made with each value of [lambda]. Since no mathematical optimal solution was available to serve as a benchmark for these alignments, we focused our analysis on the consistency of the program. We found that overall the deviation of the score of equivalent alignments was <0.5%. This deviation tended to remain constant with different values of [lambda]. The deviation of the score of the comparison with the reference alignment was higher (3.2% on m1, 2.1% on m2 and 0.4 residues on m3) and tended to increase slightly with higher values of [lambda]. In order to verify that the use of dynamic programming (DPAN) was not adding some uncontrolled bias, most of these experiments were repeated while switching off the DPAN seeding and DPAN mutation described in Methods. Results obtained in that way were consistent with the rest of our experiment. This also allowed us to confirm that the use of DPAN gives an ~3-fold speed-up to the optimization procedure and does not create any premature convergence problem.
Finally, an attempt was made to establish the complexity of the algorithm as a function of the different parameters (Fig. 3 a and b). Due to the properties of the OF the time needed to compute one generation increases linearly with the average length of the alignments in the population. This average length is roughly similar to the length of a regular dynamic programming alignment. For a typical test case (7 in Table 1 ) the average time needed for one generation is of the order of 54 s CPU time. The number of generations needed to reach an optimal solution, however, is a function of several factors, including the value of [lambda], the length of the alignment and the similarity of the sequences. Our experiments show that the level of similarity has a significant effect on the time requirement. This is in agreement with previous observations made on protein sequences using a similar model of alignment (23 ). The complexity of the gap pattern (as seen from the point of view of the operators) tends to increase with the distance between the two sequences, making it harder for the GA to find the right configuration.
Figure 6. Comparison of PRAGA and dynamic programming with local gap penalties using the m3 measure. Each point corresponds to one of the test cases in Table 1. PRAGA alignments were obtained using an optimal value for [lambda].
We also found (Fig. 3 a) that for a fixed length and a fixed level of identity the number of generations needed to reach the optimum increases with [lambda]. However, more remarkable is that for [lambda] = 0 the number of generations required tends to be independent of the alignment length. This means that under these conditions the time requirement increases almost linearly with the sequence length (this observation holds when seeding is done in a random way). In theory this is a clear improvement over dynamic programming, which requires at least a quadratic amount of time. In practice, however, the overhead is so large that the sequences to be aligned would need to be extremely long (>10 000 nt) for this speed-up to become really noticeable and we still need to check that this linearity holds for such long sequences.
Tuning of [lambda]
Corpet and Michot described their OF as giving the best results with [lambda] = 3. Since we aligned sequences with a wide range of identity, it was important to know whether [lambda] should be set to a value that is a function of the distance between the slave and the master sequence. For each of our seven test cases the accuracy, as measured by m1, m2 and m3, was plotted against [lambda]. Most of the graphs show reasonable continuity, as shown in Figure 4 a and b. We deduced an optimal value for [lambda] from each of these graphs and found the results to be mostly consistent with each similarity measure used (m1, m2 or m3). Figure 5 is a plot of `optimal [lambda]' against the slave/master distance. It shows that the value of [lambda] should roughly reflect the level of identity between the two sequences analysed. It should be higher for sequences of low identity and low for very similar sequences.
Our results also indicate that [lambda] is quite a robust parameter and that a variation of one or two around the optimal value has little effect on the actual quality of the alignment. Such a robustness means that one can perform a dynamic programming alignment beforehand, measure the distance (using Kimura or any other scheme) and deduce from this a reasonable [lambda]. For instance, [lambda] should be set to 1 for closely related sequences (distance <0.5 estimated substitutions/site), 3 for more distantly related pairs (distance <1) and 5 for more remote homologues (distance >1).
Comparison with dynamic programming
The new alignments generated with PRAGA were compared with those obtained by dynamic programming. In all cases (Fig. 6 , Table 1 ) we found that using our method leads to a significant improvement over the dynamic programming approach regardless of the comparison measure. Although both methods follow the same trend and decrease in accuracy when the slave/master distance increases, PRAGA is clearly less affected than dynamic programming. The accuracy of the alignments produced by PRAGA is also clearly a function of the slave/master structural similarity. For instance, let us consider test cases 6-9. These alignments have comparable distances (1.23-1.33 estimated substitutions/site), therefore it seems that the factor responsible for the lower accuracy observed for 6 and 8 is mostly due to the fact that in these alignments the level of conservation of secondary structure is lower than for 7 and 9 (see column `pairs' in Table 1 ).
Comparison with RNAlign
An attempt was made to align each of the nine test cases using the program RNAlign (18 ). This was done on a Pentium PC with 64 MB memory. Only two of the test cases (3 and 5) could be aligned successfully. All the others caused the machine to run out of memory or the program to issue a warning message. For 3 and 5 we tuned [lambda] as we did for PRAGA. The optimal values found were the same as those reported with the GA. We found RNAlign alignments to be roughly similar to PRAGA with the three measures (for instance measure m3 gave 89.3% for test case 3 and 68.6 for 5). These results are quite consistent with those obtained with PRAGA (Table 1 ) and constitute one more piece of evidence that the optimization procedure is accurately performed by our program.
The reason why the other test cases could not be aligned has to do with the way RNAlign works. It first produces a dynamic programming alignment with local gap penalties. In the second stage it identifies some `anchor points' in this alignment. These are regions of the alignment that can be considered as correctly aligned, using some conservative evaluation scheme. During the last stage, considering the area in between the anchor points, RNAlign performs a complex dynamic programming that takes into account both primary and secondary structure constraints. This dynamic programming is very intensive in terms of time [O(M2N3)] and memory [O(M2N2)]. This means that when trying to align sequences with low levels of identity the setting of anchor points is difficult and leads the program to re-align stretches of the alignment much too long to be handled in that way. In practice, the longer the sequences, the more similar they need to be for RNAlign to align them. To overcome these limitations the authors use a multiple alignment (Bank) instead of a single master sequence. From this multiple alignment they remove areas of very low identity by a semi-automatic method and then use this reduced profile in RNAlign.
Computation of pseudoknots
Pseudoknots are structures that involve interaction of a loop with a domain on the 3'- or 5'-side of its stem (16 ,37 ). They can be considered as RNA tertiary motifs. Computationally, prediction of pseudoknots is very difficult using traditional approaches (38 ). In their method Corpet and Michot (18 ) had to exclude pseudoknots. It is interesting to notice that in the case of PRAGA there is no real distinction between a pseudoknot and any other type of Watson-Crick interaction. This means that the algorithm should have no more difficulty in aligning pseudoknots than normal stems. In the previous experiments, in order to remain consistent with RNAlign, we excluded these interactions from the master structure. In order to demonstrate the ability of PRAGA to deal with such structures we re-introduced some of them. We only considered pseudoknots involving more than two residues and associated through Watson-Crick interactions. By doing so it is possible to add 6 bp to the previously used structure. These base pairs are boxed in green in Figure 7 a and b.
Figure 7.Comparison of dynamic programming (a) and PRAGA (b) on test case 9. The boxed stems indicate stems that were included in the master structure (Homo sapiens mitochondrion). Blue portions are the elements not shared by the two structures (and therefore not part of the master structure). A stem is considered correctly aligned if at least one position of its alignment is strictly identical to the reference alignment. This scheme does not take pairing into account. This explains why it is possible to have only the left or the right strand of a stem correctly aligned. The pseudoknots boxed in green are those that were used in the alignment (see text).
The GA was then used with this new master structure, setting [lambda] to the optimal value previously reported. The experiment was performed on four of the test cases (4, 6, 7 and 9) and the results are given in Table 2 . They show unambiguously that our program can efficiently use pseudoknot information in order to improve the alignment. It should be noted that the computation of pseudoknots has no noticeable effect on the algorithmic complexity previously discussed. The fact that even without having pseudoknots present in the master structure (Table 2 , PRAGA-PN) PRAGA improves the alignment is due to the constraint imposed by other structures in the vicinity of these pseudoknots (see Fig. 7 ).
Alignments were made incorporating into the master structure some of the positions known to form a pseudoknot (green boxes in Fig. 7). These positions make a total of six new pairs of nucleotides. The alignments were compared to the reference for their accuracy on the newly added positions. m2 and m3 were calculated on the new pairs. The results (Struc +PN) were compared with those obtained by dynamic programming with local gap penalties (DP) and by PRAGA without the pseudoknot information (Struc -PN).
DISCUSSION
PRAGA is a powerful tool for RNA alignment. Using existing structures allowed us to predict quite accurately the core structure of several ribosomal sequences, even those remotely related to the sequence for which the master structure was known. We also show that this type of analysis can capture some of the tertiary properties of the folding, such as pseudoknot interactions.
The next step with PRAGA will be implementation of an OF that allows the use of a whole alignment instead of a single master sequence. We are currently investigating ways of maximizing the information that can be extracted from such alignments (sequence weighting, local penalties, local substitution schemes, use of secondary structure, etc.). Using a GA gives us a lot of freedom in the design of the OF. In practice, almost any type of constraint or information can be built into an OF and used for optimization purposes. This could include, for instance, SCFG-based functions, which have sounder theoretical justifications than the function we have been using here (39 ). Another possible extension of PRAGA would be to use the alignment to predict non-conserved stems between correctly aligned core stems with traditional folding prediction methods (4 ). We are currently investigating ways of defining the local reliability of a given alignment in order to produce a complete structure.
The RAGA algorithm itself is mostly adapted from SAGA (23 ). This means that the operators used by RAGA were initially designed for aligning protein sequences (linear alignments). It is surprising that these operators do so well in a context where long range non-linear pairings are involved. We believe that this can be explained by the balance between potentially disruptive operators, such as crossovers, and the ability of the other operators (mutations) to fix these disruptions. This balance is automatically maintained by dynamic scheduling of the operators usage probability. It should be noted that the uniform crossover is much less disruptive than would be other forms of exchange, such as the one point crossover described for SAGA. In many cases the uniform crossover tends to respect long range interactions, simply because they often belong to stable parts of the alignment and are likely to be widespread in the population. This approach could probably be taken further and new operators could be designed with respect to the RNA tree structures (40 ) or using stochastic context-free grammar.
A very important aspect of PRAGA is its ability to deal with noise. By noise we mean unconserved secondary elements that are present in the master structure and absent in the slave structure. We have shown that although such elements significantly decrease the performance of our algorithm, overall, even in these cases, PRAGA remains significantly better than any other alternative we know of. In the future we will focus our attention on defining a better OF that would be able to discriminate this type of event, therefore minimizing their negative effect.
Although our algorithm has been specifically designed for aligning rRNA, it can in theory be applied to any structured RNA. An aspect of PRAGA that still needs to be studied would be its ability to discriminate between RNA structural homologues. The program is too slow to be used for scanning databases as it is now, but since it is a GA, it can also produce sub-optimal solutions in a shorter time. In our experience the quality of the solution found by PRAGA in the first 10% of a run is usually a good indicator of the overall score that may be reached after further optimization. This property could be used to design a `short PRAGA' as some sort of filter, later focusing the long runs on candidates having a potentially good structural match with some query sequence.
ACKNOWLEDGEMENTS
This work was partly supported by a grant from the EC Biotechnology Program (BIO4-CT95 0130). We wish to thank Burkhart Rost and Miguel Andrade for useful comments. We also wish to thank the three anonymous referees for their useful remarks and interesting suggestions.
REFERENCES
1 Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443-453.MEDLINE Abstract
2 Michot,B., Qu,L.H. and Bachellerie,J.P. (1990) Eur. J. Biochem., 188, 219-229.MEDLINE Abstract
5 Gouy,M. (1987) In Bishop,M.J. and Rawlings,C.J. (eds), Nucleic Acid and Protein Sequence Analysis: A Practical Approach. IRL Press, Oxford, UK, pp. 259-284.
6 Schmitz,M. and Steger,G. (1996) J. Mol. Biol., 255, 254-266.MEDLINE Abstract
7 Shapiro,B.A. and Wu,J.C. (1996) Comput. Applicat. Biosci., 12, 171-180.
8 Gultayaev,A.P., van Batenburg,F.D.H. and Pleij,C.W.A. (1995) J. Mol. Biol., 250, 37-51.
13 Lathrop,R.H. (1994) Protein Engng, 7, 1059-1068.
14 Sakakibara,Y., Brown,M., Underwood,R.C., Mian,I.S. and Haussler,D. (1994) In 27th Hawaii International Conference on System Sciences. IEEE Computer Society Press, Los Alamitos, CA, pp. 284-293.
15 Lefebvre,F. (1995) In ISMB-95. AAAI Press, CA, pp. 222-230.
26 Gerstein,M. and Levitt,M. (1996) In Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 59-67.