ESSA: an integrated and interactive computer tool for analysing RNA secondary structure
F. Chetouani, P. Monestié1, P. Thébault, C. Gaspin1 and B. Michot*
Laboratoire de Biologie Moléculaire Eucaryote du C.N.R.S., Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse Cedex, France and 1Station de Biométrie et d'Intelligence Artificielle, I.N.R.A., Chemin de Borde-Rouge, Auzeville BP 27, 31326 Castanet-Tolosan Cedex, France
Received April 17, 1997; Revised and Accepted July 7, 1997
ABSTRACT
With ESSA, we propose an approach to RNA secondary structure analysis based on extensive viewing within a friendly graphical interface. This computer program is organized around the display of folding models produced by two complementary methods suited to drawing long RNA molecules. Any feature of interest can be managed directly on the display and highlighted by a rich combination of colours and symbols, with emphasis given to structural probe accessibilities. ESSA also includes a word searching procedure allowing easy visual identification of structural features, even complex and degenerate ones. Analysis functions make it possible to calculate the thermodynamic stability of any part of a folding using several models and to compare homologous aligned RNAs in both primary and secondary structure. The predictive capacities of ESSA, which brings together the experimental, thermodynamic and comparative methods, are increased by coupling it with a program dedicated to RNA folding prediction based on constraint management and propagation. The potential of ESSA is illustrated by the identification of a possible tertiary motif in the LSU rRNA and the visualization of a pseudoknot in S15 mRNA.
INTRODUCTION
Understanding RNA structure-function relationships requires complex working strategies which make extensive use of computer programs in addition to bench experiments. They involve structural and functional predictions deduced from sequence analysis and interpreted in the light of a wide and diverse body of knowledge provided in part by computer approaches. The determination of the 3D folding of Group I ribozymes is probably the best example of a structured RNA analysis strategy (1). The first step consists of the prediction of a reliable secondary structure folding, which can be defined as the set of C:G and A:U Watson-Crick pairs and G:U wobble pairs that allows a readable representation in two dimensions. Three approaches have been developed, all of which start from the primary structure and aim to identify the set of hydrogen-bonded nucleotides involved in stabilizing the folding. The measure of nucleotide accessibility to chemical and enzymatic structure probes allows the identification of paired and unpaired positions (2,3), thermodynamic optimization proposes optimal and suboptimal foldings (4), whereas comparative analysis consists of a systematic search for compensatory mutations in an alignment of homologous RNA sequences from several organisms (5-7). These three approaches provide different structural information and have their own limitations, but the third one has the main advantage of pointing directly to biologically significant structural features. Nevertheless, the determination of the secondary structure folding often requires the conjunction of these three complementary methods. Thus, structural information from different origins must be used simultaneously in order to converge towards a model that is in agreement with all available data. This has led to the consideration of RNA modelling as a constraint satisfaction problem (8,9).
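The pairing rules and the comparative criterion described above can be illustrated with a minimal Python sketch. The function names `can_pair` and `is_compensatory` and the toy alignment are assumptions made for illustration; they are not part of ESSA or of any program cited here.

```python
# Canonical pairs as defined in the text: C:G and A:U Watson-Crick, G:U wobble.
CANONICAL = {
    ("C", "G"), ("G", "C"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def can_pair(a, b):
    """Return True if nucleotides a and b form a canonical or wobble pair."""
    return (a.upper(), b.upper()) in CANONICAL


def is_compensatory(aligned_seqs, i, j):
    """Return True if alignment columns i and j (0-based) support a pair.

    Hypothetical criterion: both columns vary across the sequences,
    yet the two positions can pair in every sequence.
    """
    pairs = {(s[i].upper(), s[j].upper()) for s in aligned_seqs}
    return len(pairs) > 1 and all(can_pair(a, b) for a, b in pairs)


# Toy alignment of three gap-free sequences (invented data).
alignment = ["GGCAAAGCC", "GACAAAGUC", "GGUAAAACC"]
print(is_compensatory(alignment, 1, 7))  # columns 1 and 7 co-vary
print(is_compensatory(alignment, 0, 8))  # columns 0 and 8 are invariant
```

Under this criterion an invariant pair gives no comparative evidence, which is why the second call reports no compensatory change even though G and C can pair.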
The second step is the representation of these secondary structure folding models, which must be comprehensive enough to serve as a support for the evaluation, refinement and management of folded RNA, but also for their own interpretation with a view to predicting new structure-function relationships according to the user's knowledge. Further analyses include comparison of models, searching for functional structural motifs, checking for pseudoknots and possible alternative interactions, and identification of co-varying positions and higher order interactions. Interpretation of results necessitates the integration of many different types of information, encompassing all those related to the structure and the function of the molecule studied or of homologous molecules. Among them, accessibility to enzymatic and chemical probes, position and type of modified nucleotides, and RNA-RNA and RNA-protein inter-molecular contacts are essential. The extensive viewing and management of this knowledge is a prerequisite for pointing to those structural features which could play biological roles. This reveals the need for highly interactive and integrated computer programs which allow an easy representation of RNA secondary structures and the direct investigation of the model produced through a set of diversified analysis functions.
Several programs are available, each dedicated to one of the numerous problems posed by RNA secondary structure modelling. Thus, some programs predict RNA folding or probabilities of pairing from individual sequences with thermodynamic rules (10-12). Others focus on the representation and display of secondary structure (13,14) or the identification of co-varying positions, Watson-Crick or not, in aligned sequence datafiles (15-17). The management and analysis of specialized and aligned sequence files such as those containing ribosomal RNAs have also led to the development of specific software (18,19). Unfortunately, each of these programs has its own application field and works independently, usually under a different operating system, whereas they should be closely linked to fit the needs of RNA analysis. Software packages rarely include several of these programs. The GCG package (20) connects thermodynamic secondary structure prediction programs with several modes of representation, whereas a more recent program (21) combines viewing of the structure and the use of probabilities of pairing.
The aim of ESSA is to propose an interactive approach to RNA secondary structure analysis which integrates the representation of secondary structures, the visualization of various types of information, and analysis functions, with a view to covering the most important aspects of a better understanding of structure-function relationships. ESSA was also designed to communicate with other programs dedicated to RNA secondary structure. More particularly, a communication protocol was developed with SAPSSARN (9), a program which relies on the probabilities of pairing (11) associated with constraint management and propagation to help with the prediction of RNA folding. We present here the numerous applications of ESSA and give two examples of its predictive possibilities. The first concerns a search in the LSU rRNA for a structural motif involved in a tertiary contact first identified in autocatalytic group I introns (22). The second illustrates its capacity to fold RNA and predict complex interactions owing to ESSA-SAPSSARN communication. We show how a pseudoknot involving a conformational switch in the mRNA of ribosomal protein S15 (23) can be viewed.
MATERIALS AND METHODS
Various import formats are supported by ESSA including RNAsearch (13), RNAd2 (14), FoldRNA of the GCG package (20) and RNAlign (24). More specifically, ESSA exchanges data with SAPSSARN through a communication protocol based on a client-server model. The X server acts as an intermediary between the ESSA and SAPSSARN client applications. The data exchanged are the sequence and the secondary structure: once both applications have been started and the sequence is known, each modification of the secondary structure in either application is recorded as a request to the other and is managed by the server.
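The exchange described above can be sketched as two clients kept in step by an intermediary. The following Python sketch is purely illustrative: the classes `Relay` and `Client`, the request tuple `(op, i, j)` and the in-memory forwarding are assumptions, since the message format of the ESSA-SAPSSARN protocol is not given here and the real intermediary is the X server.

```python
class Relay:
    """Hypothetical stand-in for the server that passes requests between clients."""

    def __init__(self):
        self.clients = []

    def register(self, client):
        self.clients.append(client)
        client.relay = self

    def forward(self, sender, request):
        # Deliver the request to every client except the one that issued it.
        for client in self.clients:
            if client is not sender:
                client.receive(request)


class Client:
    """Hypothetical application holding one copy of the secondary structure."""

    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence
        self.pairs = set()  # base pairs stored as (i, j) with i < j
        self.relay = None

    def edit(self, op, i, j):
        """Modify the local structure and record the change as a request."""
        self._apply(op, i, j)
        self.relay.forward(self, (op, i, j))

    def receive(self, request):
        # Apply a remote change without forwarding it again (no echo).
        self._apply(*request)

    def _apply(self, op, i, j):
        pair = (min(i, j), max(i, j))
        if op == "add":
            self.pairs.add(pair)
        elif op == "remove":
            self.pairs.discard(pair)
        else:
            raise ValueError("unknown operation: " + op)


relay = Relay()
viewer = Client("viewer", "GGGAAACCC")
folder = Client("folder", "GGGAAACCC")
relay.register(viewer)
relay.register(folder)
viewer.edit("add", 0, 8)
print(folder.pairs)  # the pair added in the viewer is now known to the folder
```

Applying remote requests without re-sending them is one simple way to prevent the two applications from echoing the same change back and forth.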
RESULTS
ESSA relies on the representation of RNA folding, a central core which offers a first set of functions dedicated to the management of secondary structures (Fig. 1). This display also serves as a support for highlighting remarkable structural features and for updating diverse knowledge via the keyboard. Running on this core, a set of analysis functions was developed allowing searches for structural motifs, calculation of the thermodynamic stability of any substructure and folding comparisons. Moreover, ESSA communicates by files with other programs dedicated to RNA secondary structure. Thus, the comparative analysis functions provided by our program are linked with specialized and structured databanks produced by RNAlign. ESSA can also read RNA secondary structure models computed by several other programs, among them the outputs from FoldRNA. Finally, we have developed a communication protocol allowing a real-time exchange of information with SAPSSARN. This program integrates the thermodynamic approach with probabilities of pairing and the propagation of constraints given by the user, with a view to offering an effective tool for structure prediction.
DISCUSSION
Secondary structure determination and representation are key steps of RNA analysis. Not only do these folding models serve as working hypotheses for their own structural refinement, but they should also lead to a better knowledge of higher order structural interactions and to the prediction of structural features and molecular mechanisms involved in RNA function. ESSA is the first package which proposes an integration of several crucial aspects of RNA folding analysis (Fig. 1). A comprehensive visual representation of secondary structure constitutes the core part of ESSA, around which are organized editing facilities and a set of analysis functions allowing structural motif identification, thermodynamic calculation and secondary structure comparison. This makes ESSA the first program which integrates the three approaches currently used to predict secondary and tertiary contacts. Moreover, the communication protocol developed between ESSA and SAPSSARN enhances the predictive aspect of each of these two programs: ESSA benefits from the constraint satisfaction approach whereas SAPSSARN becomes more readable owing to a clear display of its results. Finally, a high level of interactivity, in a very natural and intuitive way, allows the user to drive each stage of a working session according to his knowledge.
The two methods implemented for displaying secondary structure cover the main aspects required by biologists: RNA foldings are easy to produce, they highlight the main features of the RNA, and long molecules are easy to manage (Fig. 3). The complementary features of these two programs make them well adapted to different application fields. The fully automatic program (13) is suited to fast production of untangled secondary structure models. It is also helpful for removing overlaps in the interactive approach by giving a first readable, untangled draft version. Nevertheless, it is sensitive to the complexity of the folding, which can be defined as a function of (i) the number of multibranched loops, (ii) their size and (iii) the number of branches rooted on each of them. Thus it is possible that, for a given configuration of the parameter set, no solution can be computed. This was the case for numerous sequences extracted from the datafile containing the universal core of secondary structure of the LSU rRNA. By contrast, the interactive approach (14) works whatever the length of the molecule and its folding complexity. It is better adapted to the evolutionary approach since it allows subdomains that are homologous among related molecules of different species to be drawn with similar orientations, in order to emphasize structural homologies even if long insertions/deletions interrupt conserved domains.
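The three ingredients of folding complexity listed above can be extracted from a structure with a short Python sketch. It assumes a dot-bracket string without pseudoknots as input, a notation that is not mentioned in this paper, and the function names are hypothetical. How the three quantities would be combined into a single complexity measure is not specified here, so the sketch only reports them.

```python
def pair_table(dot_bracket):
    """Map each paired position to its partner (0-based indices)."""
    stack, table = [], {}
    for k, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(k)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            i = stack.pop()
            table[i], table[k] = k, i
    if stack:
        raise ValueError("unbalanced structure")
    return table


def multibranch_loops(dot_bracket):
    """List every multibranched loop with its branch count and unpaired size.

    The branch count includes the helix that closes the loop.
    """
    table = pair_table(dot_bracket)
    loops = []
    for i, j in sorted(table.items()):
        if i > j:
            continue  # visit each pair once, from its 5' side
        k, inner, unpaired = i + 1, 0, 0
        while k < j:
            if k in table:       # start of an inner helix: skip over it
                inner += 1
                k = table[k] + 1
            else:                # unpaired nucleotide inside the loop
                unpaired += 1
                k += 1
        if inner >= 2:
            loops.append({"closing_pair": (i, j),
                          "branches": inner + 1,
                          "unpaired": unpaired})
    return loops


print(multibranch_loops("((..((...))..((...))..))"))
```

The number of entries gives quantity (i), the `unpaired` field quantity (ii) and the `branches` field quantity (iii).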
The palette of colours and symbols allows the user to create his own code to identify data as diverse as tertiary contacts, crosslinked nucleotides, inter-molecular interactions involving either RNA-protein or RNA-RNA contacts, and any other structural and functional features able to help in the interpretation of folding in terms of structure-function relationships. This integration of diverse knowledge greatly increases the predictive value of the secondary structure model. For instance, the superposition of modified nucleotides along the universal core of secondary structure of rRNA molecules, together with the viewing of long complementarities with small nucleolar RNAs (snoRNAs), was at the basis of the recent finding of their guide function in the 2'-O-methylation of rRNA (27,28), a role which was later confirmed by in vivo experiments (28,29). These labelling functions are particularly suited to visualizing probe accessibilities. Such information facilitates the determination of RNA secondary structure and the interpretation of computed solutions, as illustrated in the case of S15 mRNA (Fig. 6). It becomes essential when chemical probing is not restricted to Watson-Crick pairings but is performed to identify the more diverse hydrogen bonds involved in tertiary interactions.
The interest of these editing and viewing tools for a better understanding of structure-function relationships was enhanced by the integration of several more advanced analysis functions. The sequence motif search function takes advantage of the visual representation of labelled RNA folding to let the user estimate by eye, owing to his expertise, the significance of the occurrences. It allows the identification of complex structural motifs by searching sequentially for each sequence segment of the query structural feature. The visual inspection of the structural environment of each occurrence makes it relatively straightforward to estimate how far it departs from the searched motif. In particular, insertions/deletions are easy to appreciate (Fig. 5) whereas they would have been difficult to take into account in an automatic approach such as those developed to scan databases (30,31) or based on a measure of similarity (32). Thus, despite its simplicity, this method has a high predictive value, in particular through its coupling with aligned and structured datafiles, which increases the significance of occurrences by integrating the constraints exerted during the course of evolution.
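The search for one degenerate sequence segment can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than ESSA's procedure: the function `find_word` is hypothetical, degenerate positions are written with the standard IUPAC nucleotide codes, and the query and sequence are toy data.

```python
import re

# Standard IUPAC nucleotide codes, written for RNA (U instead of T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def find_word(sequence, word):
    """Return the 0-based start of every occurrence of a degenerate word.

    Overlapping occurrences are reported, thanks to the lookahead.
    """
    pattern = "".join("[" + IUPAC[c] + "]" for c in word.upper())
    return [m.start() for m in re.finditer("(?=" + pattern + ")", sequence.upper())]


print(find_word("GGAAACCGCAAGG", "GNRA"))
```

A motif made of several segments could be searched by calling `find_word` once per segment and then inspecting, on the drawn structure, which combinations of hits sit in a compatible structural environment, which is the part the text leaves to the user's eye.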
Using this approach we identified in the LSU rRNA a structural motif closely related to an 11 nt motif previously demonstrated to be involved in 3D interactions with a GAAA terminal loop in group I introns (22), group II introns (33) and in the RNA component of RNase P of Bacillus megaterium (34). Surprisingly, whereas this motif is present several times in group I introns, we found it only once along the entire LSU rRNA conserved core. This suggests a different mode of evolution for structural motifs involved in 3D interactions in rRNA, perhaps biased by the presence of numerous RNA binding proteins. The strong evolutionary constraint exerted on the key features of the LSU rRNA group I-like motif supports a probable essential role in the assembly of ribosomes or in their functioning. It could be a key feature in the 3D organization of this region, which directly binds the ribosomal protein L23 (Fig. 5a) (35,36). Accordingly, the identification of its interacting partner would be an invaluable help. The differences observed in the receptor between the LSU rRNA and the group I intron could reflect variations in its interacting substrate. Alternatively, these differences could be induced by a contact with L23. A SELEX approach (37) associated with a search for covarying positions should help in identifying the substrate, whereas NMR should reveal the precise spatial organization of this loop.
The most widely used tool for determining secondary structure folding is certainly the thermodynamic approach. Although it often fails to predict the overall folding of an RNA, this method provides important local information about the potential to form short stems or helical regions. ESSA, which gives the possibility of modifying a folding, can also test alternative interactions in terms of free energy difference with the original model. Thus, by choosing among the various thermodynamic models proposed (38,39) and by adjusting, if necessary, the thermodynamic parameters, the user can compensate for the partial understanding of the different parameters acting on RNA folding. In contrast, the comparative approach relies on the construction of aligned sequence files followed by a search for compensatory changes (40,41 for the most recent reviews). Once a secondary structure is determined for an aligned family of homologous sequences, it can be included in the alignment, giving rise to a structured and specialized databank. Programs were recently developed to automatically add new sequences to this particular format by aligning them using both sequence and secondary structure homologies (24). ESSA optimizes the use of these files through four modes of secondary structure extraction and consensus display. The derivation of secondary structure models on the one hand and the identification of tertiary contacts on the other often necessitate the simultaneous use of information coming from the three approaches as complementary constraints to drive RNA folding. This has prompted us to develop communication between ESSA and SAPSSARN, to allow the user to participate directly in the computational folding of RNAs. He can thus manage interactively any kind of structural constraint, either within the ESSA display or within the SAPSSARN matrix. These constraints are then propagated within the SAPSSARN matrix, resulting in the elimination of forbidden pairs.
For example, when the pseudoknot constraints are removed in SAPSSARN, any pseudoknots found are visualized in ESSA, as illustrated by the S15 mRNA analysis (Fig. 6). Accordingly, this communication also allows us to address the question of RNA 3D interactions and of their dynamics by pointing to alternative interactions.
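The elimination of forbidden pairs, and the effect of lifting the pseudoknot constraint, can be illustrated with a minimal Python sketch. It is not the SAPSSARN algorithm: the functions `crosses` and `propagate`, the representation of candidates as a set of `(i, j)` tuples with i < j, and the two rules applied (a nucleotide pairs at most once, crossing pairs are forbidden unless pseudoknots are allowed) are simplifying assumptions.

```python
def crosses(p, q):
    """Return True if pairs p and q interleave (i < k < j < l), i.e. form a pseudoknot."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l


def propagate(candidates, fixed, allow_pseudoknots=False):
    """Remove candidate pairs made impossible by fixing one pair.

    candidates: set of (i, j) tuples with i < j
    fixed:      the pair imposed by the user
    """
    kept = set()
    for cand in candidates:
        if cand == fixed:
            kept.add(cand)
            continue
        if set(cand) & set(fixed):
            continue  # shares a nucleotide with the fixed pair
        if not allow_pseudoknots and crosses(cand, fixed):
            continue  # would create a pseudoknot
        kept.add(cand)
    return kept


candidates = {(2, 10), (3, 9), (5, 14), (2, 12), (11, 15)}
print(sorted(propagate(candidates, (2, 10))))
print(sorted(propagate(candidates, (2, 10), allow_pseudoknots=True)))
```

With the pseudoknot constraint lifted, the crossing pair (5, 14) survives propagation, which corresponds to the situation in which a pseudoknot becomes visible on the display.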
Our ultimate goal is to make ESSA a unique tool for analysing RNAs, from their sequences to the production of 3D models. Toward this aim we are currently integrating programs devoted to the identification of tertiary contacts, building on the strong visualization capabilities of ESSA. We also plan to develop communication with programs dedicated to 3D RNA reconstruction. The selection function of ESSA will allow the extraction of basic structural features in order to build their 3D structures separately and then assemble these pieces of the 3D RNA puzzle (42).
ACKNOWLEDGEMENTS
We thank Dr Jean Pierre Bachellerie and Prof. Eric Westhof for their constant interest and support. This work was financially supported in part by grants from the Groupement d'Intérêt Public, Groupement de Recherches et d'Etude sur les Génomes (GIP GREG) and from the French Ministry of Education and Research to F.C. and B.M. (ACC SV 13 and 07).
2 Ehresmann,B., Ehresmann,C., Romby,P., Mougel,M., Baudin,F., Westhof,E. and Ebel,J.P. (1990) in Hill,W.E., Dahlberg,A., Garrett,R.A., Moore,B., Schlessinger,D. and Warner,J.R. (eds), The Ribosome, Structure, Function and Evolution. American Society for Microbiology, Washington DC, pp. 148-159.
5 Woese,C.R. and Pace,N.R. (1993) in Gesteland,R.F. and Atkins,J.F. (eds) The RNA World, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, pp. 91-117.
9 Gaspin,C. and Westhof,E. (1995) J. Mol. Biol., 254, 163-174.
10 Gouy,M. (1987) in Bishop,M.J. and Rawlings,C.J. (eds), Nucleic Acid and Protein Sequence Analysis, A Practical Approach. IRL Press, Oxford, Washington DC, pp. 259-284.
40 Gutell,R.R. (1996) in Zimmermann,R.A. and Dahlberg,A.E. (eds), Ribosomal RNA: Structure, Evolution, Processing and Function in Protein Biosynthesis. CRC Press, Boca Raton, Florida, pp. 111-128.
41 Michel,F. and Costa,M. (1997) RNA Structure and Function. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, in press.
42 Westhof,E., Masquida,B. and Jaeger,L. (1996) Folding Des., 1, R78-R88.