RNA base-amino acid interaction strengths derived from structures and sequences
Brooke Lustig*, Shalini Arora and Robert L. Jernigan1
Department of Chemistry, San Jose State University, San Jose, CA 95192, USA and 1Laboratory of Experimental and Computational Biology, National Cancer Institute, NIH, Building 12B, Room B116, 12 South Drive, MSC 5677, Bethesda, MD 20892-5677, USA
Received April 28, 1997; Accepted May 14, 1997
ABSTRACT
We investigate RNA base-amino acid interactions by counting their contacts in structures and their implicit contacts in various functional sequences where the structures can be assumed to be preserved. These frequencies are cast into equations to extract relative interaction energetics. Previously we used this approach in considering the major groove interactions of DNA, and here we apply it to the more diverse interactions observed in RNA. The structures considered are three different tRNA synthetase complexes, the U1A spliceosomal protein with an RNA hairpin and the BIV TAR-Tat complex. We use binding data for the base frequencies for the seryl, aspartyl and glutaminyl tRNA-synthetase and U1 RNA-protein complexes. We compare the results for atoms of RNA bases, usually in the major groove, with the previously reported DNA major groove peptide contacts. There are strong similarities between the rank orders of interacting bases in the DNA and the RNA cases. The apparent strongest RNA interaction observed is between arginine and guanine, which was also one of the strongest DNA interactions. The similar data for base atomic interactions, whether base paired or not, support the importance of strong atomic interactions over local structure considerations, such as groove width and α-helicity.
INTRODUCTION
The problem of understanding RNA-protein interactions is important because RNA is more involved in function than is DNA. However, the larger structural variations manifested in RNA compared to DNA make its study more difficult. The complexity of the problem resembles the case of protein-protein interactions, which has met with some recent success using approaches (1-3) similar to that taken below. In some ways, the difference between DNA and RNA binding sites corresponds to a difference in dimensionality: the DNA double helical structure is nearly one dimensional, whereas RNA presents highly variable surfaces for interaction, more similar to protein structures. One way of comprehending the complexities of such structures is to deconstruct them in terms of the interactions that stabilize them. Furthermore, if there were sufficiently large numbers of diverse structures, then the effects of both the RNA and the protein structure would be averaged out, and the dominant atomic interactions would become evident. Here we compile and analyze the structural data available for RNA-protein structures to learn about their interactions. This is done at a coarse-grained level of base-amino acid pairs rather than detailed individual atomic pairs. The principal difficulty in learning about RNA-protein interactions remains the fact that there are relatively few available structures.
Interactions between RNA and proteins ought to provide a variety of interactions similar to that between pairs of proteins. For proteins there are potentially 20 types of residues to interact compared to the four nucleotide bases. Two questions arise. How do the RNA-protein interfaces achieve a comparable level of diversity for recognition? Are there important structural motifs for binding between proteins and RNA? The search for protein binding motifs has proven to be almost pointless, since a wide variety of protein structure elements are now known to interact with DNA. The much greater diversity of RNA structures would seem to make the dominance of only a few structural motifs even less likely. However, this does not preclude the occurrence of dominant motifs of atomic interactions. So, for RNA-protein binding, we are simply going to look at the frequencies of base-amino acid interactions, without consideration of their structural context.
How do the structural differences between RNA and DNA affect protein binding? RNA has additional features, in comparison to DNA, that facilitate specific recognition by proteins. The additional G·U base pair type, beyond the canonical A·U and G·C types of base pairs, adds to the diversity of possible interactions. Furthermore, the available RNA structures already show a remarkable variety of other ways in which bases can hydrogen bond to one another, e.g., triplets, purine-purine and pyrimidine-pyrimidine pairs. In addition, there are unpaired bases in bulges and loops that can interact with amino acids. So, the bases themselves do offer a rich diversity for binding to the 20 types of amino acids. Also, some amino acids are capable of interacting simultaneously with several base pairs, so this provides a further variety on the RNA surface for protein interactions. Overall, this catalog of potential interacting RNA surface features affords a sufficient number of ways to achieve their specific recognition by proteins. But, as we will see below, highly favorable interactions can dominate amino acid-base interactions.
For a given protein binding site on RNA, how variable can its sequences be? From analyses of DNA-protein binding sequences (4), it appears that the strengths of individual interacting pairs are not so critical. DNA binding sequence frequencies indicate that some interacting bases can be replaced. In part, this may reflect the replacement of one hydrogen bond acceptor or donor by a similar one from another base. However, there is also the possibility that substitutions can be energetically compensating, i.e., a more weakly binding base might be acceptable, if another simultaneous base substitution elsewhere in the binding site were made with a stronger binder. Are RNA-protein binding sequences similarly variable?
The advent of sequence libraries for selecting active binding sequences is having a major impact on the study of these systems. The present approach could be applied directly to assist in the design of better combinatorial libraries. Other approaches applying pattern recognition methods are being developed to design and analyze combinatorial libraries for peptides and related polymers (5,6). Another useful approach has been to examine and analyze nature's functional combinatorial libraries by aligning and determining DNA base preferences from variant sequence data (7).
Others have also been cataloging the interactions found from the limited set of three-dimensional DNA-protein structures as determined by X-ray crystallography (8; Mandel-Gutfriend,Y., Margalit,H., Jernigan,R.L. and Zhurkin,V.B., personal communication). But the present approach goes beyond the strictly structural to include additional information both from binding data and from sequence variability. We have previously derived self-consistent normalized relative energies for each of the four DNA bases interacting in the major groove with a specific amino acid (4) by using an extensive set of data collected from combinatorial multiplex DNA binding of zinc finger domains (9). The five strongest interactions found were: Lysine·guanine, Lysine·thymine, Arginine·guanine, Aspartic acid·cytosine and Asparagine·adenine. These relative energies correlated well with those derived from DNA binding data for Cro and λ repressors and the R2R3 c-Myb protein domain (10-12), as well as similar interaction energies derived directly from frequencies of bases determined to be in contact with particular amino acids in the bacteriophage λ operator sequences.
A major objective of the present work is to calculate RNA-protein potentials. Those for major groove interactions can be compared directly with those derived for the major groove of DNA. RNA differs from DNA in some ways, but since we consider multiple structures and only relative values among the four bases, many differences, such as those arising from the greater stiffness of the RNA backbone relative to DNA, might be unimportant. Some differences remain, but the present considerations are only semi-quantitative, and we will be comparing only the strongest effective interactions. The present considerations include data from BIV TAR-Tat binding, where NMR was used to identify specific base-amino acid contacts (13,14). In addition, we use the more extensive data for RNA base sequence frequencies of acyl tRNAs and U1 RNAs (15,16) at positions identified by X-ray as being in specific contact with particular amino acids (17-22).
METHODS AND RESULTS
We use frequencies of contacts between bases and amino acids to derive relative interaction energies from the acyl tRNA-synthetase and U1A spliceosomal protein-RNA complexes, as we did earlier for zinc fingers interacting with DNA (4). First we calculate the logarithms of frequencies for all occurrences of a j-type base interacting specifically with an I-type amino acid so that the interaction energy $e_{Ij}$ is of the form
$$e_{Ij} \sim -\ln f_{Ij} \tag{1}$$
where $f_{Ij}$ is the sum over all the sets of the relative frequencies in which a base type j interacts with all occurrences of a residue type I. For each of the four bases, the relative interaction energies are then normalized as
$$\sum_{j} \ln f_{Ij} = 0 \tag{2}$$
This corresponds to a reference state that shifts the values so that the mean for the four bases is zero.
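As an illustration of equations 1 and 2, the following minimal Python sketch converts a set of four summed base frequencies for one amino acid type into mean-centred relative energies. The function name `relative_energies`, the unit proportionality constant and the example frequencies are assumptions made for illustration; they are not values or code from this work.

```python
import math

BASES = ("A", "C", "G", "U")

def relative_energies(freqs):
    """Return relative energies e_Ij for one amino acid type I.

    freqs maps each base j to its summed relative frequency f_Ij.
    All four values must be positive; math.log raises ValueError otherwise.
    The proportionality constant in equation 1 is taken as 1 (an assumption).
    """
    logs = {b: math.log(freqs[b]) for b in BASES}
    mean_log = sum(logs.values()) / len(BASES)
    # Subtracting the mean log-frequency enforces equation 2:
    # the shifted log-frequencies (and hence the energies) sum to zero.
    return {b: -(logs[b] - mean_log) for b in BASES}

# Hypothetical frequencies, for illustration only (not data from the paper).
example = {"A": 0.10, "C": 0.10, "G": 0.70, "U": 0.10}
energies = relative_energies(example)
for base in BASES:
    print(f"{base}: {energies[base]:+.3f}")
```

In this sketch the most frequently observed base receives the most negative (most favorable) relative energy, and only differences among the four bases carry meaning.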
The U1A spliceosomal protein-RNA hairpin structure from X-ray indicates 14 base-amino acid contacts at Arginine52·G16, Arginine52·A6, Glutamic acid19·U7, Asparagine16·G9, Asparagine15·G9, Lysine80·U8, Asparagine16·U8, Glutamine85·*C10, Tyrosine86·C10, Lysine88·C10, Aspartic acid92·C12, Serine91·A11, Threonine89·A11, and Aspartic acid90·C12 (22); these include both peptide side chain and peptide backbone interactions. The corresponding collected frequencies for the RNA sequences (16) are utilized. The binding domain of the protein is primarily at the loop A6 through C15 of the 21 base synthetic RNA hairpin loop. The amino acid types of the contacts identified by X-ray are considered here to be conserved (23). The sets of four relative base-amino acid energies are explicitly derived using equations 1 and 2. Stacking and hydrophobic interactions have not been considered here. The only major groove contact was reported for Arginine52·G16.
We have used similar sequence data (15,24) and structures for the seryl (17), aspartyl (18) and glutaminyl (19,20) tRNA-synthetase complexes. They present a more diverse set of interactions than the U1A spliceosomal protein with RNA, since almost half are anticodon loop contacts involving Glutamic acid188·*UAsp34, Arginine119·UAsp35, Glutamine138·UAsp35; Alanine414·CGln34, Arginine341·UGln35, Glutamine517·UGln35, Arginine520·UGln35 and Arginine402·GGln36. The remaining contacts are found at Glutamic acid327·GAsp73, Asparagine330·AAsp72; Alanine555·GSer19, Glutamine545·GSer47a, Glutamine545·GSer47n and also include the two major groove contacts at Asparagine330·UAsp1 and Asparagine330·AAsp72. This calculation also includes in the same way the data for the U1 RNA-protein case.
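One possible reading of how the frequencies are pooled across contacts, following the definition of the summed frequency for a residue type I and base type j given with equation 1, can be sketched as below. The function `pooled_frequencies`, the toy alignment and the contact list are hypothetical and serve only to show the bookkeeping; the actual sequence compilations and contact assignments are those cited in the text.

```python
from collections import Counter, defaultdict

BASES = "ACGU"

def pooled_frequencies(contacts, alignment):
    """Sum per-position relative base frequencies by amino acid type.

    contacts  : list of (amino_acid_type, column_index) pairs, one per
                structurally identified base-amino acid contact.
    alignment : list of equal-length RNA sequences (functional variants).
    Returns {amino_acid_type: {base: summed relative frequency}}.
    """
    pooled = defaultdict(lambda: dict.fromkeys(BASES, 0.0))
    for aa, col in contacts:
        # Count the bases seen at the contacted position across all variants.
        counts = Counter(seq[col] for seq in alignment if seq[col] in BASES)
        total = sum(counts.values())
        if total == 0:
            continue  # no usable sequence data at this position
        for base in BASES:
            pooled[aa][base] += counts[base] / total
    return {aa: dict(freqs) for aa, freqs in pooled.items()}

# Hypothetical toy data, for illustration only.
alignment = ["GAUC", "GACC", "GGUC", "AAUC"]
contacts = [("Arg", 0), ("Arg", 1), ("Asn", 2)]
pooled = pooled_frequencies(contacts, alignment)
print(pooled)
```

Each contact contributes one set of four relative frequencies, so an amino acid type seen in several contacts accumulates several sets before the logarithm of equation 1 is taken. Bases never observed at any contact keep a summed frequency of zero and would need separate handling before taking logarithms.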
Relative interactions for individual amino acids with the four bases are shown in Figure 1 for the combined data from the U1 RNA-spliceosomal protein and tRNA-synthetase structures and sequences. Also shown for comparison are the strong DNA interactions derived previously for major groove interactions. The RNA cases include diverse interactions, and only the first two cases, designated by Rm and Nm, are for major groove interactions. For RNA major groove interactions, the most favored pair is arginine with guanine, which was also one of the five strongest pairs for DNA. In the case of non-major groove interactions in RNA, the other strongest interactions seen in this figure are Arginine·uridine, Lysine·cytosine, Aspartic acid·cytosine, Glutamic acid·guanine, Alanine·guanine, Tyrosine·cytosine and Serine· or Threonine·adenosine.
DISCUSSION
It is noteworthy that there is a clear correlation between the relative interaction energies for the DNA and RNA major groove atomic contacts with arginine and asparagine. This suggests, given their importance in DNA base-amino acid interactions (4), that simple charge or hydrogen bond considerations are the explanation for the sequence dependence of RNA base-amino acid contact preferences rather than a dependence on the RNA structure. Inspection of the interaction energies is generally consistent with simple base and amino acid charge considerations. Also it is significant that the relative interaction energies for exclusively non-major groove RNA base-amino acid contacts appear to be completely different in character from those associated with major groove DNA and RNA. This is consistent with the lack of specificity noted for minor groove interactions in DNA (31) and RNA (19).
Focusing on the more extensive data in Figure 1 for RNA, the dominant major groove interaction is Arginine·guanine and for the non-major groove cases Arginine·uracil, Lysine·cytosine, Aspartic acid·cytosine, Glutamic acid·guanine, Alanine·guanine, Tyrosine·cytosine and Serine/Threonine·adenine. The importance of arginine in specific and non-specific binding has been previously noted (32). We limit the category of specific interactions to base-amino acid contacts. We have not considered other non-specific interactions involving phosphates and riboses here because preliminary analyses showed no significant DNA or RNA sequence specificity (Lustig,B., unpublished results; 33). Shi and Berg (33) have shown that zinc finger RNA does not differ from zinc finger DNA in sequence, but in RNA has an enhanced binding which may involve increased interactions with phosphates or 2'-OHs. Our results suggest that there are diverse ways to obtain specific interactions but that there are some dominant interactions. Remarkably, several of these stand out already in the present limited data. It must be noted, however, that the interactions present in the structures here are likely to provide an incomplete list of all RNA interactions.
The specific occurrence of Arginine·guanine pairs in several recent structures (25-29) is noteworthy. The occurrence of this pair in diverse structural contexts is particularly important. For example, in the HIV-2 Tar-Argininamide complex (25), this pair occurs with the argininamide stacked between U and A bases where the U is also involved in a U·A·U triplet. In other cases arginine was shown even to cause conformational transitions in DNA (26). And, in another RNA study (29), arginines which were originally in α-helices still bind even when the helices have been disrupted by changing the peptide sequences. This suggests that the arginine pair of hydrogen bonds formed with N7 and O6 of guanine is extremely strong. Perhaps these are sufficiently strong that they form in spite of the structural context. The present approach can readily treat this class of strong interactions, even when the data are limited. In other less favored cases, averaging over sufficient numbers of structures is required.
A clearer, more precise elucidation of RNA base-amino acid interactions requires a more extensive set of structures or experimental data such as those that could be derived from combinatorial RNA-protein binding studies for a variety of well characterized three-dimensional structures. Ultimately the present type of results could be utilized for sequence design in a variety of problems. If a given surface region of protein were targeted for binding to a new RNA, then the protein sequence could be utilized directly to suggest the composition of RNAs that would be likely to bind most specifically. Sequences with such a composition could be screened experimentally with an appropriately designed RNA combinatorial library (34).
REFERENCES
1 Wallqvist,A., Jernigan,R.L. and Covell,D.G. (1995) Protein Sci. 4, 1881-1903.
2 Laskowski,R.A., Thornton,J.M., Humblet,C. and Singh,J. (1996) J. Mol. Biol. 259, 175-201.
3 Miyazawa,S. and Jernigan,R.L. (1996) J. Mol. Biol. 256, 623-644.
23 Watson,J.D., Hopkins,N.H., Roberts,J.W., Steitz,J.A. and Weiner,A.M. (1987) Molecular Biology of the Gene, Fourth Ed. Benjamin-Cummings, Menlo Park, CA.