Comparative sequence analysis of ribonucleases HII, III, II, PH and D
I. Saira Mian*
Sinsheimer Laboratories, University of California Santa Cruz, Santa Cruz, CA 95064, USA and Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Received March 19, 1997; Revised and Accepted May 29, 1997
ABSTRACT
Escherichia coli ribonucleases (RNases) HII, III, II, PH and D have been used to characterise new and known viral, bacterial, archaeal and eucaryotic sequences similar to these endo- (HII and III) and exoribonucleases (II, PH and D). Statistical models, hidden Markov models (HMMs), were created for the RNase HII, III, II, PH and D families as well as a double-stranded RNA binding domain present in RNase III. Results suggest that the RNase D family, which includes Werner syndrome protein and the 100 kDa antigenic component of the human polymyositis scleroderma (PMSCL) autoantigen, is a 3' -> 5' exoribonuclease structurally and functionally related to the 3' -> 5' exodeoxyribonuclease domain of DNA polymerases. Polynucleotide phosphorylases and the RNase PH family, which includes the 75 kDa PMSCL autoantigen, possess a common domain, suggesting similar structures and mechanisms of action for these 3' -> 5' phosphorolytic enzymes. Examination of HMM-generated multiple sequence alignments for each family suggests amino acids that may be important for their structure, substrate binding and/or catalysis.
INTRODUCTION
As the details of RNA metabolism have emerged, there has been a concomitant increase in interest in the enzymes that carry out these events. Although it has been suggested that proteasomes, multiprotein complexes involved in processing and turnover of cellular proteins, could also be involved in cellular RNA breakdown and RNA processing (1), one group of enzymes has long been known to be important in such events. Ribonucleases (RNases) are enzymes involved in many functions such as RNA processing, stability, turnover and degradation (reviewed in 2,3). For example, messenger RNA (mRNA) stability influences gene expression in virtually all organisms from bacteria to mammals, and the abundance of a particular mRNA can fluctuate manyfold following a change in its half-life without any alteration in transcription (reviewed in 4). Another testament to the general importance of these enzymes is evidence that self-incompatibility in flowering plants involves an RNase (reviewed in 5,6).
The focus of this work is identification and characterisation of new and known viral, bacterial, archaeal and eucaryotic sequences similar to Escherichia coli RNases HII, III, II, PH and D using the recently developed statistical modelling method of hidden Markov models (HMMs) (7-10). A double-stranded (ds) RNA binding domain present in RNase III is also examined. An HMM of the type created and used here is a sequence of nodes in which each node corresponds to a column in a multiple sequence alignment for a family of related sequences. The HMM technique allows identification, modelling and analysis of the core elements of a family likely to be determinants of the folding, structure and function of that family. For the RNases examined here, the results can provide guidance for further experimental and theoretical work as well as insights into the relationships within and between the different families.
RNases HII, III (also called RNase C), II (also called RNase B), PH and D were selected for study because of their important roles in many organisms (reviewed in 2,3,11,12). In particular, they act on a wide spectrum of substrates and include both endo- (HII and III) and exoribonucleases (II, PH and D). RNase HII degrades the RNA moiety of RNA-DNA hybrids (13,14). Processing of ribosomal RNA precursors (pre-rRNAs) and of some mRNAs requires the ds specific RNase III (15). RNase PH is both a phosphorolytic nuclease that removes nucleotides following the CCA terminus of tRNA and a nucleotidyltransferase which adds nucleotides to the ends of RNA molecules by using nucleoside diphosphates as substrates (16,17). RNase II and polynucleotide phosphorylase (PNPase) are the two principal nucleases involved in processive 3' -> 5' degradation of single-stranded (ss) mRNA (see, for example, ref. 18). RNases II, PH and D are three of at least five 3' -> 5' nucleases required for 3' processing of tRNA precursors (pre-tRNAs) (12,19). A number of these RNases also have a role in the efficacy of some therapeutic molecules. Antisense agents such as antisense oligonucleotides and ribozymes bind to DNA or RNA sequences and block the synthesis of cellular or viral proteins by interfering with transcription and translation (reviewed in 20). Antisense oligonucleotides form stable duplexes that are substrates for cleavage by RNase H, which, like RNase HII, acts on RNA-DNA duplexes. In addition, RNase II and PNPase appear to be the major nucleases that degrade hammerhead ribozymes (21) as well as RNA-OUT, a 69 nucleotide antisense RNA that regulates Tn10/IS10 transposition (22). Thus, studies of these RNases may yield insights into intracellular degradation of foreign RNAs and subsequent development of more stable ribozymes and antisense molecules.
Furthermore, the five RNase families examined here provide a glimpse into the myriad of roles that RNases play in how cells grow, differentiate and respond to their environment.
METHODS
Escherichia coli RNases HII, III, II, PH and D were used as query sequences in database searches performed with the BLAST suite of programs (23) run with default parameters against a merged, non-redundant collection of sequences derived from PIR, SwissProt and translated GenBank. Database sequences were considered to exhibit a statistically significant similarity to the query if the smallest sum probability P(N) was <= 0.05, P(N) being the lowest probability ascribed to any set of high scoring segment pairs for each database sequence. Partial sequences, fragments and expressed sequence tags (ESTs) identified by these BLAST database searches were retained but not employed for HMM training and database discrimination experiments. HMMs were trained for proteins belonging to the RNase HII, III, II, PH and D families as well as a ds RNA binding domain present in RNase III by the procedure outlined below.
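The selection rule described above can be illustrated with a short filter. This is only a sketch: it assumes the BLAST hits have already been parsed into records, and the `Hit` class, its field names and the example values are hypothetical rather than any BLAST output format.

```python
from dataclasses import dataclass

P_CUTOFF = 0.05  # threshold on the smallest sum probability P(N), from the text


@dataclass
class Hit:
    """One database sequence reported by a search (hypothetical record)."""
    name: str
    p_n: float         # smallest sum probability P(N) for this sequence
    is_partial: bool   # True for partial sequences, fragments and ESTs


def split_hits(hits, p_cutoff=P_CUTOFF):
    """Return (training, retained_only) lists of statistically significant hits.

    Significant full-length hits are candidates for HMM training; significant
    partial sequences are kept but excluded from training.
    """
    significant = [h for h in hits if h.p_n <= p_cutoff]
    training = [h for h in significant if not h.is_partial]
    retained_only = [h for h in significant if h.is_partial]
    return training, retained_only


# Made-up example hits, for illustration only.
hits = [
    Hit("full_length_homologue", 1e-30, False),
    Hit("est_fragment", 2e-4, True),
    Hit("unrelated_protein", 0.7, False),
]
training, retained_only = split_hits(hits)
print([h.name for h in training])
print([h.name for h in retained_only])
```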
An HMM was created using the SAM (Sequence Alignment and Modeling Software System) suite (7,24) running on a MASPAR MP-2204 with a DEC Alpha 3000/300X frontend at the University of California Santa Cruz (UCSC). In an HMM, use of a match state indicates that a sequence has a residue in that column whereas use of a delete state denotes that the sequence does not. Insert states allow sequences to have additional residues between columns and represent regions of the sequence that are not part of the core elements of the family being modelled. To improve the ability of the HMM to generalise, that is, to fit sequences not employed for training, Dirichlet mixture priors (25,26) were employed. Free Insertion Modules (FIMs) were utilised to allow the HMM to model a region or motif within a larger sequence. Multiple models were trained and the best used for further studies.
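The roles of the match, delete and insert states described above can be made concrete by rendering one sequence as an alignment row from a given state path. The sketch below is illustrative only and is not the SAM implementation: the function name, the path encoding ('M', 'D', 'I') and the rendering convention (upper case for match, lower case for insert, '-' for delete) are assumptions chosen for this example.

```python
def align_row(sequence, path):
    """Render a sequence as an alignment row, given its HMM state path.

    `path` is a string over {'M', 'D', 'I'} (an assumed encoding):
      'M' (match)  consumes one residue and places it in a model column,
      'I' (insert) consumes one residue lying between model columns,
      'D' (delete) consumes no residue; the column is left empty.
    """
    row = []
    pos = 0
    for state in path:
        if state == "M":
            row.append(sequence[pos].upper())
            pos += 1
        elif state == "I":
            row.append(sequence[pos].lower())
            pos += 1
        elif state == "D":
            row.append("-")
        else:
            raise ValueError(f"unknown state {state!r}")
    if pos != len(sequence):
        raise ValueError("path does not consume the whole sequence")
    return "".join(row)


# Seven residues, six model columns: two inserted residues, one deleted column.
print(align_row("GDEAKLH", "MMIIMDMM"))  # prints GDeaK-LH
```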
Any sequence can be compared to a model by calculating the likelihood that the sequence was generated by that model. Taking the negative (natural) logarithm of this likelihood gives the NLL score. For sequences of equal length, the NLL score measures how 'far' they are from the model and can be used to select sequences from the same family. To assess the specificity and sensitivity of an HMM, it can be used in database discrimination experiments to distinguish sequences that belong to the family used to train it from those that do not. The programme hmmscore was used to evaluate how much better a sequence fits a model than some underlying background distribution or null model (NULL) and to assess the significance of the resultant score. Database searching using the HMM involves computing log-odds (NLL-NULL) scores (27,28) for all sequences in a non-redundant protein database obtained from the NCI (29) and updated weekly at UCSC. Taking into account the number of sequences in this database (~211 000 different proteins in late 1996) and an expected number of false positives of 0.01, a significant log-odds score is 22.6. Scores higher than this value denote fewer expected false positives. A database search was performed and, based upon examination of the log-odds scores and an HMM-generated alignment, new family members were identified, added to the training set and the HMM retrained. This cycle of 'search, align and retrain' was repeated until no new sequences were identified in databases up to December 1996. This final HMM was utilised to generate a multiple sequence alignment of the final training set and the partial sequences retained from the initial BLAST searches.
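The relationship between the NLL score and the log-odds score can be illustrated with a deliberately simplified model. The sketch below assumes a match-only profile (no insert or delete states) and a uniform null model, so it is far simpler than the HMMs and the hmmscore programme used in this work; the function names and toy probabilities are hypothetical. The log-odds value is reported here so that a positive number means a better fit to the model than to the null, which is an assumed sign convention.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def nll(sequence, columns):
    """Negative natural log-likelihood of a sequence under a match-only profile.

    `columns` is a list of dicts, one per position, mapping residue -> probability.
    """
    if len(sequence) != len(columns):
        raise ValueError("this toy model needs exactly one residue per column")
    return -sum(math.log(col[res]) for res, col in zip(sequence, columns))


def uniform_column():
    """Null-model column: every amino acid equally likely."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def peaked_column(favoured, weight=0.81):
    """Toy model column: `weight` on one residue, the remainder shared equally."""
    rest = (1.0 - weight) / (len(AMINO_ACIDS) - 1)
    return {aa: (weight if aa == favoured else rest) for aa in AMINO_ACIDS}


# Toy four-column family model and a null model of the same length.
model = [peaked_column(aa) for aa in "GDEA"]
null = [uniform_column() for _ in range(4)]


def log_odds(sequence):
    """NLL under the null minus NLL under the model (positive = fits the model)."""
    return nll(sequence, null) - nll(sequence, model)


for seq in ("GDEA", "WWWW"):
    print(seq, round(log_odds(seq), 2))
```

In this toy setting a sequence matching the favoured residues scores above zero and an unrelated sequence scores below zero; a real search would compare such scores against a significance threshold of the kind quoted above.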
RESULTS
An aim of this study was to train and use HMMs that minimised the numbers of false positives and false negatives. Amongst ~211 000 different proteins, sequences that were not part of their respective training set had log-odds scores <15.0 whereas training set sequences had scores >31.0 (RNase HII), >25.0 (RNase III), >27.2 (ds RNA binding domain), >60.4 (RNase II), >26.6 (RNase PH) and >21.1 (RNase D). For all six families, inspection of the HMM-generated alignments and examination of the log-odds scores suggested there were no false positives amongst sequences with log-odds scores >21.1 and that such sequences could be classified as being members of the family being modelled. However, it cannot be assumed that there are no false negatives amongst sequences with log-odds scores <15.0. There may be remote homologues that have diverged to such a degree that the current HMMs, which may be too specific (overfit the data), are unable to classify them as belonging to a particular family. Further generalisation of the HMMs is required to detect such distant family members.
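The score ranges reported above suggest a simple three-way reading of a log-odds score. The sketch below uses the two cutoffs quoted in the text; the function name, the label strings and the treatment of the gap between 15.0 and 21.1 as 'unresolved' are assumptions made for illustration, since no sequences were reported in that interval.

```python
MEMBER_CUTOFF = 21.1      # no false positives were seen above this score
NON_MEMBER_CUTOFF = 15.0  # sequences outside the training sets scored below this


def classify(log_odds_score):
    """Label a sequence from its log-odds score against one family HMM."""
    if log_odds_score > MEMBER_CUTOFF:
        return "member"
    if log_odds_score < NON_MEMBER_CUTOFF:
        # Not proof of non-membership: remote homologues may score this low.
        return "not detected"
    return "unresolved"


for score in (60.4, 18.0, 10.0):
    print(score, classify(score))
```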
HMM-generated multiple alignments of members of the six families examined are shown in Figures 1-6, which were produced using ALSCRIPT (30). § denotes new members of a family identified here and ‡ partial sequences retained from BLAST searches but not employed for HMM training. Existing members of the RNase III (15,31,32), ds RNA binding domain (33,34) and RNase II (35-40) families have been described elsewhere. Subsequent discussions will focus on new family members. Invariant positions are defined as those residues conserved across all the sequences in an alignment; their locations are marked by filled triangles. Amino acids conserved in the majority of sequences are highlighted and columns that are predominantly hydrophobic are boxed. Columns containing '.' correspond to insert states and numbers indicate the lengths of insertions in sequences at that position (if present).
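The definition of an invariant position given above translates directly into a column scan of an alignment. The following is a minimal sketch, assuming the alignment is available as equal-length strings in which upper-case letters are aligned residues and '-' or '.' mark gaps and insert columns; the function name and the toy alignment are hypothetical and are not taken from Figures 1-6.

```python
def invariant_columns(rows):
    """Return (index, residue) for columns holding the same residue in every row.

    Gap characters and lower-case (inserted) residues never count as invariant.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all alignment rows must have the same length")
    result = []
    for i in range(width):
        column = {row[i] for row in rows}
        if len(column) == 1:
            residue = column.pop()
            if residue.isalpha() and residue.isupper():
                result.append((i, residue))
    return result


# Toy three-sequence alignment, for illustration only.
rows = ["GDEAK-LH", "GDQ.KALH", "GNE.K-LH"]
print(invariant_columns(rows))  # prints [(0, 'G'), (4, 'K'), (6, 'L'), (7, 'H')]
```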
DISCUSSION
Figure 1 shows new eucaryotic RNase HII family members (yeast, 12:Sc_N2369, 13:Sp_C4G902; worm, 14:Ce_T13H52; mammals, 15:Mm_ESTs, 16:Hs_EST). Since RNase HII acts on RNA-DNA duplexes, these proteins may be involved in DNA replication as well as being candidates for mediating the effect of antisense oligonucleotides.
Figure 2 shows new RNase III family members from bacteria (10:My_ORF, 12:Si_ORF) and eucarya (yeast, 16:Sp_C8A4.08C; worm, 17:Ce_K12H4.8, 20:Ce_F26E4.b; mammals, 21:Mm_EST, 22:Hs_ESTs). A S.cerevisiae RNase III (RNase RNT1; 18:Sc_RNT1) cleaves pre-rRNA at a U3 snoRNP-dependent site (15), suggesting that some of the other eucaryotic sequences may be important in pre-rRNA processing. Schizosaccharomyces pombe and Caenorhabditis elegans each have two RNase III members, suggesting involvement in processing different pre-rRNA sites or other RNAs. Three positions in RNase III have been mutated (31,42,43). The first, an invariant Gly (glycine) important for activity in two different sequences, occurs in a highly conserved octapeptide that contains three of the four invariant residues. A second occurs at a variable position. The third is a conserved, functionally important Glu (glutamic acid) present in all RNase III members apart from a bacterium (4:Bs_RNIII) where it is Lys (lysine). In E.coli RNase III, an E -> K,A mutation uncouples substrate binding from cleavage, so it is unclear whether the Bacillus subtilis sequence, which has a naturally occurring Lys at this position, behaves in a similar manner.
ACKNOWLEDGEMENTS
This work was supported by NIH grant number GM17129 and by the Director, Office of Energy Research, Office of Health and Environmental Research, Division of the US Department of Energy under Contract No. DE-AC03-76F00098. The data and multiple alignments are available in electronic form upon request.
REFERENCES
13 Tomasiewicz,H. and McHenry,C. (1987) J. Bacteriol., 169, 5735-5744.
14 Itaya,M. (1990) Proc. Natl. Acad. Sci. USA, 87, 8587-8591.
15 Elela,S., Igel,H. and Ares,M.,Jr (1996) Cell, 85, 115-124.
16 Jensen,K.F., Andersen,J.T. and Poulsen,P. (1992) J. Biol. Chem., 267, 17147-17152.
17 Kelly,K.O. and Deutscher,M.P. (1992) J. Biol. Chem., 267, 17153-17158.
18 Coburn,G. and Mackie,G. (1996) J. Biol. Chem., 271, 15776-15781.
19 Zhang,J. and Deutscher,M. (1988) J. Biol. Chem., 263, 17909-17912.
20 Putnam,D. (1996) Am. J. Health-System Pharmacy, 53, 151-160.
21 Wang,J., Qiu,L., Wu,E. and Drlica,K. (1996) J. Bacteriol., 178, 1640-1645.
22 Pepe,C., Maslesa-Galic,S. and Simons,R. (1994) Mol. Microbiol., 13, 1133-1142.
23 Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990) J. Mol. Biol., 215, 403-410.
24 Hughey,R. and Krogh,A. (1996) Comp. Appl. Biosci., 12, 95-107. The hidden Markov model software can be accessed from http://www.cse.ucsc.edu/research/compbio/sam.html.
25 Brown,M., Hughey,R., Krogh,A., Mian,I., Sjölander,K. and Haussler,D. (1993) Intelligent Systems Mol. Biol., 1, 47-55.
29 NCI (1996) NRP (Non-Redundant Protein) and NRN (Non-Redundant Nucleic Acid) Database. Distributed on the Internet via anonymous FTP from ftp.ncifcrf.gov, under the auspices of the National Cancer Institute's Frederick Biomedical Supercomputing Center.