Nucleic Acids Research, 2002, Vol. 30, No. 7 1427-1464
© 2002 Oxford University Press
Comparative genomics and evolution of proteins involved in RNA metabolism
National Center for Biotechnology Information, National Library of Medicine, 8600 Rockville Pike, Building 389, National Institutes of Health, Bethesda, MD 20894, USA
Received November 1, 2001; Revised December 20, 2001; Accepted January 2, 2002.
ABSTRACT
RNA metabolism, broadly defined as the compendium of all processes that involve RNA, including transcription, processing and modification of transcripts, translation, RNA degradation and its regulation, is the central and most evolutionarily conserved part of cell physiology. A comprehensive, genome-wide census of all enzymatic and non-enzymatic protein domains involved in RNA metabolism was conducted by using sequence profile analysis and structural comparisons. Proteins related to RNA metabolism comprise from 3 to 11% of the complete protein repertoire in bacteria, archaea and eukaryotes, with the greatest fraction seen in parasitic bacteria with small genomes. Approximately one-half of protein domains involved in RNA metabolism are present in most, if not all, species from all three primary kingdoms and are traceable to the last universal common ancestor (LUCA). The principal features of LUCA's RNA metabolism system were reconstructed by parsimony-based evolutionary analysis of all relevant groups of orthologous proteins. This reconstruction shows that LUCA possessed not only the basal translation system, but also the principal forms of RNA modification, such as methylation, pseudouridylation and thiouridylation, as well as simple mechanisms for polyadenylation and RNA degradation. Some of these ancient domains form paralogous groups whose evolution can be traced back in time beyond LUCA, towards low-specificity proteins, which probably functioned as cofactors for ribozymes within the RNA world framework. The main lineage-specific innovations of RNA metabolism systems were identified. The most notable phase of innovation in RNA metabolism coincides with the advent of eukaryotes and was brought about by the merger of the archaeal and bacterial systems via mitochondrial endosymbiosis, but also involved emergence of several new, eukaryote-specific RNA-binding domains. Subsequent, vast expansions of these domains mark the origin of alternative splicing in animals and probably in plants. In addition to the reconstruction of the evolutionary history of RNA metabolism, this analysis produced numerous functional predictions, e.g. of previously undetected enzymes of RNA modification.
INTRODUCTION
All cells synthesize a vast array of RNAs, using DNA or RNA templates, through a nucleoside polymerization reaction catalyzed by RNA polymerases (1). The mRNAs are templates for the ribosomal synthesis of proteins. Ribosomal RNAs are central to the functions of the ribosome, such as recognition and positioning of the mRNA and formation of the peptide bond during protein synthesis, whereas tRNAs are adaptors that deliver aminoacyl units to the site of protein synthesis and read the genetic code during translation through complementary pairing with codons in mRNA. In addition to these ubiquitous RNAs that are embedded in the Central Dogma of molecular biology (2), there is a plethora of other RNAs whose occurrence ranges from universality to a presence in only one of the terminal lineages of life. These include, among others, the ubiquitous signal recognition particle RNA involved in secretion, the nearly universal RNase P ribozyme, the small guide RNAs of eukaryotes and archaea that participate in processing and modification events to produce mature mRNAs, rRNAs and tRNAs, the bacterial tmRNA involved in protein degradation, the telomerase RNA from eukaryotes that acts as the template for the synthesis of chromosomal termini, the guide RNAs of trypanosomes involved in RNA editing, the small temporal (st) RNA, such as Lin-4, implicated in post-transcriptional regulation in animals, and the animal RoX1/2 and XIST RNAs, which contribute to chromosomal organization (1,3–8). From the time an RNA chain is elongated by the RNA polymerase to its ultimate destruction by ribonucleases, it undergoes interactions with numerous proteins that either form a variety of ribonucleoprotein (RNP) complexes or catalyze various reactions that modify the RNA's composition or structure. This complex set of processes centered around RNA–protein and RNA–RNA interactions constitutes what can be termed RNA metabolism.
Thus defined, RNA metabolism is an integral part of the basal processes of molecular biology, namely transcription, translation and secretion, as well as numerous other cellular systems that employ RNAs in various capacities. The diversity of functional contexts notwithstanding, a number of computational analyses of proteins involved in RNA metabolism have brought out several unifying themes in the form of domains that bind RNA molecules and/or catalyze reactions of RNA processing and modification across these different contexts. This justifies the above definition of RNA metabolism and calls for a synthetic treatment of the cellular processes that involve RNA. Several previous computational analyses have considered specific aspects of RNA metabolism and concentrated on the identification of previously undetected domains in proteins involved in these processes (9–23). The results from such studies cast light on the early evolution of life, the last universal common ancestor (LUCA), the events surrounding the divergence of the major lineages of life, and potentially even the transition from the ancient RNA world to the modern-type, protein-dominated cellular systems.
In order to obtain a comprehensive view of the evolution of RNA metabolism, we conducted a large-scale computational analysis of the proteins involved in RNA metabolism. This analysis was chiefly based on detection of statistically significant similarities through sequence and structure comparisons, determination of orthologous and paralogous relationships between proteins, and utilization of contextual information derived from domain fusions, operon organization and phyletic patterns. This allowed us to define the major transitions and relative temporal order in the evolution of the principal branches of RNA metabolism and to gain some insights into the earliest phases of life's evolution. Using the parsimony principle, we reconstructed the probable repertoire of genes and functions related to RNA metabolism that were present in LUCA. The analysis also enabled systematization of the vast amounts of information on RNA metabolism that have become available through genome sequencing, and produced structural and functional predictions that might facilitate further experimental studies on RNA metabolism.
MATERIALS AND METHODS
Eighteen complete bacterial genomes [Aquifex aeolicus (Aae), Bacillus subtilis (Bs), Borrelia burgdorferi (Bb), Campylobacter jejuni (Cj), Chlamydia trachomatis (Ct), Deinococcus radiodurans (Dr), Escherichia coli (Ec), Haemophilus influenzae (Hi), Helicobacter pylori (Hp), Mycobacterium tuberculosis (Mt), Neisseria meningitidis (Nm), Pseudomonas aeruginosa (Pa), Rickettsia prowazekii (Rp), Synechocystis PCC6803 (Ssp), Thermotoga maritima (Tm), Treponema pallidum (Tp), Ureaplasma urealyticum (Uu) and Xylella fastidiosa (Xf)], seven complete archaeal genomes [Aeropyrum pernix (Ap), Archaeoglobus fulgidus (Af), Halobacterium sp. NRC-1 (Hsp), Methanobacterium thermoautotrophicum (Mta), Methanococcus jannaschii (Mj), Pyrococcus horikoshii (Ph) and Thermoplasma acidophilum (Ta)] and six (nearly) complete eukaryotic genomes [Arabidopsis thaliana (At), Caenorhabditis elegans (Ce), Drosophila melanogaster (Dm), Homo sapiens (Hs), Saccharomyces cerevisiae (Sc) and Schizosaccharomyces pombe (Sp)] were investigated. The sequence data for all genomes were obtained using the Genome Division of the Entrez system at the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/Entrez/Genome/main_genomes.html).
The domains listed in Table 1 were included in this study. A set of representative sequences was chosen for each domain, and position-specific scoring matrices (PSSMs or profiles) were generated by running the PSI-BLAST program (24–26) against the non-redundant protein sequence database at the NCBI, with the expectation (E) value of 0.01 typically used as the cut-off for including sequences into the profile. PSI-BLAST searches (E value = 0.01) using the constructed profiles were run against the protein sets from each of the genomes included in the study, and lists of all proteins containing the given domain were compiled. After verifying the presence of the target domain through examination of the conservation of the salient amino acid sequence and structural motifs, other domains present in the respective proteins were identified by running PSI-BLAST searches with these sequences as queries and by comparing them with libraries of domain-specific profiles using the NCBI CD-search (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml) and an additional profile collection (L.Aravind, unpublished data) or by using hidden Markov models (HMMs) for conserved domains (27). The proteins were then clustered by sequence similarity using the BLASTCLUST program (I.Dondoshansky, Y.I.Wolf and E.V.Koonin, unpublished data). Multiple alignments, whenever deemed necessary, were constructed using the T_coffee program (28) and phylogenetic trees were constructed using the PHYLIP and MOLPHY packages (29–31). Protein secondary structure was predicted using the PHD program (32) and structure coordinate files were handled using the Swiss-PDB Viewer program (33).
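The compilation of per-genome domain lists described above can be illustrated with a minimal sketch. It assumes that the search results have already been exported as simple (genome, protein, domain, E value) records; the record layout, the function name `compile_domain_lists` and the example identifiers are hypothetical and are not part of the original pipeline. Only the cut-off of 0.01 is taken from the text, and treating it as inclusive is an assumption.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.01  # cut-off stated in the text


def compile_domain_lists(hits, cutoff=E_VALUE_CUTOFF):
    """Group proteins by (domain, genome), keeping hits with E <= cutoff.

    `hits` is an iterable of (genome, protein_id, domain, e_value) tuples;
    this layout is an assumption made for illustration only.
    """
    lists = defaultdict(set)
    for genome, protein_id, domain, e_value in hits:
        if e_value <= cutoff:
            lists[(domain, genome)].add(protein_id)
    # Sort identifiers so that the output is deterministic.
    return {key: sorted(ids) for key, ids in lists.items()}


# Hypothetical records, not real search output.
example_hits = [
    ("Ec", "protA", "PUA", 1e-12),
    ("Ec", "protB", "PUA", 0.5),    # above the cut-off, discarded
    ("Mj", "protC", "PUA", 3e-5),
    ("Ec", "protA", "THUMP", 2e-4),
]
domain_lists = compile_domain_lists(example_hits)
```

In the study itself, each candidate was additionally verified by examining the conservation of salient sequence and structural motifs; that manual step is not modelled here.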
RESULTS AND DISCUSSION
Scope and approach
A meaningful computational analysis of proteins involved in RNA metabolism requires a clear definition of the components under investigation so that it does not draw in every other cellular protein based on an indirect connection with such a pivotal class of molecules as RNA. We restrict our scope essentially to those proteins that more or less directly interact with every known type of RNA from the time it is synthesized by the RNA polymerase until its ultimate degradation by nucleases. Briefly, this includes (i) certain components of the transcription machinery itself that directly interact with the transcript in the process of elongation and termination; (ii) proteins involved in the processes that, at least in eukaryotes, occur shortly after transcription, namely polyadenylation and capping; (iii) various complexes and enzymes involved in the maturation of RNAs, including numerous enzymes that catalyze covalent modification of RNA (e.g. methylation) and the complex splicing machinery involved in pre-mRNA processing in eukaryotes; (iv) translation and its regulation; (v) post-transcriptional gene regulation (PTGR), which, in its simplest form, involves various RNases that catalyze mRNA degradation [a more complex form of such regulation in eukaryotes is post-transcriptional gene silencing (PTGS)]; (vi) proteins that interact with diverse RNAs during maturation of functional complexes, such as the ribosome, the signal recognition particle, RNase P, SnuRPs, SnoRPs, telomerase, the SsrA particle and the Ro-yRNA particles. Not included in this analysis are proteins that regulate transcription through interaction with DNA and generic structural proteins of various complexes, such as those containing WD40 or tetratricopeptide (TPR) repeats, which have similar roles in protein–protein interaction in both RNA metabolism and other systems.
Proteins involved in RNA metabolism were collected through a systematic survey of the literature. This was followed by profile sequence analysis using PSI-BLAST to identify the domains present in these proteins. Once identified, these domains were categorized into two principal classes: (i) enzymatic domains and (ii) interaction domains. The latter class mainly consists of non-catalytic, RNA-binding domains (RBDs) and some protein–protein interaction domains that are predominantly associated with the formation of multisubunit complexes involved in RNA metabolism. Table 1 shows the list of the principal domains included in this analysis. One or more PSI-BLAST PSSMs that ensured complete coverage without inclusion of false positives were prepared for each of the domains and a representative set of complete proteomes (see Materials and Methods) sampled across the three primary kingdoms (bacteria, archaea and eukaryotes) was searched with these profiles to detect all occurrences of each domain. The proteins recovered from all the proteomes were then pooled together and potential orthologous sets were delineated by clustering with BLASTCLUST. These groups of orthologs were corrected and optimized using the symmetry of recovery in single-pass BLAST searches (34) and phylogenetic tree construction and analysis using the minimum evolution (least squares) and maximum likelihood methods (29–31). The domain architecture of each individual protein was then determined by using libraries of PSSMs and HMMs. Finally, we attempted to reconstruct the conservation patterns of functional complexes and pathways across the entire phyletic range of analyzed genomes by combining the results of protein domain analysis with experimental evidence extracted from the literature.
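The symmetry-of-recovery criterion can be sketched as follows, under the assumption that it is read as a reciprocal best-hit test between two genomes. This reading, the function name `reciprocal_best_hits` and the protein identifiers are illustrative assumptions; the exact procedure of reference 34 is not reproduced here.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted pairs (a, b) in which a and b are each other's best hit.

    Each argument maps a query protein to its best-scoring hit in the
    other genome (hypothetical, precomputed from single-pass searches).
    """
    pairs = []
    for a, b in best_a_to_b.items():
        # Keep the pair only if the search in the reverse direction
        # recovers the original query.
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical best-hit tables for two genomes.
best_ec_to_mj = {"ec1": "mj1", "ec2": "mj2", "ec3": "mj2"}
best_mj_to_ec = {"mj1": "ec1", "mj2": "ec2"}
candidate_orthologs = reciprocal_best_hits(best_ec_to_mj, best_mj_to_ec)
```

Here `ec3` is not paired with `mj2` because the reverse search recovers `ec2` instead; such asymmetric cases are the ones that would call for inspection of phylogenetic trees.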
The most likely points of origin of domains and individual protein families involved in RNA metabolism were inferred from the patterns of phyletic distribution and phylogenetic tree topology and on the basis of the parsimony principle. If a particular domain or protein family is widely represented in all three primary kingdoms, bacteria, archaea and eukaryotes, the most parsimonious scenario of evolution points to its presence in LUCA. This conclusion is reinforced when the phylogenetic tree for the family in question conforms to the standard model topology, with bacterial and archaeo-eukaryotic primary clades (35,36). Conversely, the derivation of a family in LUCA or earlier was considered less likely when a fundamentally different topology was observed, such as grouping of bacteria with eukaryotes. In such a case, a (pre)LUCA origin of the given family would require the extra assumption of displacement of the ancestral form with the bacterial one in eukaryotes, which makes a bacterial or archaeal origin with subsequent dissemination by horizontal gene transfer a viable alternative. Along similar lines, the parsimony principle dictates that, for example, when a domain or a family is widely represented in bacteria and eukaryotes, but is only sporadically encountered in archaea, the most likely scenario involves derivation within the bacterial kingdom and independent acquisitions by eukaryotes and archaea via horizontal gene transfer. Below, in the discussion of the evolution of domains and protein families, we follow these principles of phylogenetic inference, not necessarily referring to them explicitly. All conclusions arrived at with this approach are necessarily probabilistic, but this appears to be the best we can do when reconstruction of ancient evolutionary events is concerned.
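The phyletic-pattern part of this reasoning can be written down as a small rule-based sketch. The numerical thresholds for "widely represented" and "sporadic", the function name `infer_origin` and the example patterns are assumptions introduced for illustration; the archaeo-eukaryotic rule is added by analogy with the bacterial one, and tree topology, which the text uses as an additional criterion, is not modelled.

```python
def infer_origin(fractions, wide=0.7, sporadic=0.3):
    """Suggest a point of origin from a phyletic pattern.

    `fractions` maps each primary kingdom to the fraction of surveyed
    genomes that encode the family. The thresholds `wide` and `sporadic`
    are arbitrary illustrative values, not taken from the study.
    """
    b = fractions.get("bacteria", 0.0)
    a = fractions.get("archaea", 0.0)
    e = fractions.get("eukaryotes", 0.0)

    if b >= wide and a >= wide and e >= wide:
        return "LUCA"
    if b >= wide and e >= wide and a <= sporadic:
        return "bacterial origin with horizontal transfer"
    if a >= wide and e >= wide and b <= sporadic:
        # Rule added by analogy; an assumption of this sketch.
        return "archaeo-eukaryotic origin"
    present = [name for name, f in
               (("bacteria", b), ("archaea", a), ("eukaryotes", e)) if f > 0]
    if len(present) == 1:
        return present[0] + "-specific"
    return "unresolved"


# Hypothetical patterns over the 18 bacterial, 7 archaeal and
# 6 eukaryotic genomes surveyed.
universal = {"bacteria": 18 / 18, "archaea": 7 / 7, "eukaryotes": 6 / 6}
bacterial = {"bacteria": 16 / 18, "archaea": 1 / 7, "eukaryotes": 6 / 6}
```

With these inputs, `infer_origin(universal)` returns `"LUCA"` and `infer_origin(bacterial)` returns the bacterial-origin scenario. Any such output is a hypothesis to be checked against the tree for the family in question.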
In a number of cases, detection of homologs of proteins involved in RNA metabolism required additional correction because a subset of RNA-interacting domains are also involved in DNA binding. We utilized a variety of inputs from experimental studies, phylogenetic relationships and contextual information to exclude those domains and proteins that are primarily involved in DNA binding and metabolism. Nevertheless, a relatively small fraction of the detected domains and proteins either are indeed bifunctional, being involved in both RNA and DNA metabolism, or cannot be assigned specific function with confidence due to insufficient information; such proteins were included in the present analysis for the sake of completeness.
Phyletic patterns and genome-wide demography of protein domains involved in RNA metabolism
We delineated domains involved in RNA metabolism as described above and conducted a survey of their demography across the genomes of representative organisms considered in this study. This overall demographic survey revealed a number of general trends in the evolution of these domains (Fig. 1A–D). The most notable, if not unexpected, feature was the separation of the three primary kingdoms by specific phyletic patterns of many domains. A large set of enzymes and interaction domains are present universally and, in all likelihood, are part of the LUCA inheritance (Fig. 1A and C). However, another substantial fraction of the domains involved in RNA metabolism appear to have evolved in a particular superkingdom or lineage, with the greatest number of lineage-specific inventions found in eukaryotes (Fig. 1D). Many eukaryote-specific domains belong to ancient folds, but acquired their RNA-related function only in the eukaryotic lineage. Examples of such exaptation of ancient domains for functions in RNA metabolism include the mRNA-capping enzyme that was derived, at the onset of eukaryotic evolution, from the more ancient DNA ligases (37–39), and the lariat-debranching enzyme that was derived from the ubiquitous calcineurin-like phosphoesterases (40,41). Similarly, superfamily (SF)-I helicases were recruited for important RNA-related functions, such as nonsense-mediated decay, only in eukaryotes, although several such helicases function in bacterial DNA recombination and repair. Some eukaryote-specific enzymes, such as the RNA-dependent RNA polymerase involved in PTGS and the Kem1/Rat1 family of 5'→3' nucleases, have large, complex catalytic domains that so far could not be traced to any ancient enzymatic fold. Although structural innovation is less common in prokaryotes than it is in eukaryotes, there are a few enzymes, for example, the RNase domains of the RNaseE/G superfamily, that appear to be innovations of the bacterial lineage.
The interaction domains also show a strong trend of eukaryote-specific innovation, the most prominent one being the RNA recognition motif (RRM), which apparently was derived from a more ancient nucleic acid-binding fold with a characteristic four-stranded core found in diverse DNA- and RNA-binding domains (Table 1). Another theme seen in eukaryotes is the recruitment of α-helical superstructures, such as the TPR-like fold (the HAT repeat module found in RNA processing proteins), the pumilio (PUM) repeat (42,43), and the NIC domains (16) for functions in RNA metabolism. This parallels the widespread utilization of these α-helical repeat modules in a number of other contexts in eukaryotes. Many of the distinct, small RBDs that evolved in eukaryotes, such as CCCH, Zn knuckle, C2H2-, LRP1- and C4-Little fingers, utilize the common theme of stabilization through metal-chelating cysteines and histidines (Fig. 1D). This type of structure is ancient, with numerous Zn-ribbon modules found in archaea (44), but many of these metal- and RNA-binding domains seem to have evolved de novo in eukaryotes, given that utilization of metal coordination to stabilize the core of a domain requires relatively few evolutionary changes, namely the emergence of a strategically placed set of metal-chelating residues.
Another major pattern in the phyletic distribution is the presence of numerous catalytic and interaction domains that are shared by eukaryotes and bacteria, to the exclusion of archaea (Fig. 1A–D). Another distinct set of domains is solely shared by archaea and eukaryotes, which supports the chimeric origin of the eukaryotic systems of RNA metabolism. A subset of proteins containing domains shared by eukaryotes and bacteria function in the mitochondria and chloroplasts that have descended from endosymbiotic bacteria. This is reflected in the larger average number of proteins with such a phyletic pattern in plants that have two distinct endosymbiont organelles, mitochondria and chloroplasts. However, several domains with a bacterio-eukaryotic distribution pattern function in non-organellar contexts, such as cytoplasmic RNA degradation. Enzymes of apparent bacterial origin recruited for cytoplasmic functions include several superfamilies of RNases, such as the 3'→5' exonucleases (45). Of the domains with an archaeo-eukaryotic phyletic pattern, several are involved in core processes, such as RNA maturation, e.g. the tRNA endonucleases, and translation, e.g. PIWI (14), pelota and SUI1 domains (9).
Most of the domains involved in ancient functions, such as RNA modification enzymes and RBDs associated with RNA modification, translation and transcription (Table 1 and Fig. 1), are present in nearly constant numbers in all life forms, except that eukaryotes often have more paralogs, partly owing to the presence of organelles derived from bacteria. Eukaryotes show a striking expansion of ancient SFII RNA helicases and, to a lesser extent, of other ancient catalytic domains, such as SFI helicases, GTPases, Rossmann-fold methylases, 3'→5' exonucleases, RNase III and deaminases. A corresponding expansion of non-catalytic domains is mainly restricted to those newly invented or recruited in eukaryotes, including RRM, CCCH, Zn-Knuckle and G-patch. The advent of these RBDs correlates with the emergence of eukaryote-specific functional systems, such as pre-mRNA splicing, PTGR, and mRNA editing and modification (Fig. 1).
These observations indicate that 40–45 of the approximately 100 principal domains associated with RNA metabolism originated at early stages of evolution, prior to LUCA. These domains were associated with the most ancient and conserved cellular functions, such as translation, transcription and some forms of RNA modification. The next phase of innovation marked the separation of the bacterial and archaeo-eukaryotic lineages and saw the origin of some proteins, which are involved in basic cellular functions, but are specific to one of these lineages. Finally, with the emergence of the chimeric eukaryotic lineage, domains from both the bacterial and the archaeo-eukaryotic precursor were incorporated into the eukaryotic RNA metabolism pathways. In addition, eukaryotes also invented several new domains and recruited or expanded preexisting ones, concomitant with the origin of new RNA processing systems that were largely absent in prokaryotes. No archaea-specific domains involved in RNA metabolism were identified. This might reflect the retention of most core archaeal systems in eukaryotes, which makes the corresponding domains archaeo-eukaryotic in distribution. In addition, archaea could possess some distinct domains that were not detectable through homology and remain unknown due to the paucity of experimental studies in archaeal systems.
The surveyed organisms dedicate, approximately, between 3 and 11% of their proteomes to RNA metabolism, with the highest fraction, predictably, seen in parasitic bacteria with small genomes and the lowest fraction in multicellular eukaryotes and complex bacteria. Generally, this seems to reflect (i) the central place that RNA metabolism systems occupy in all cells, compared with the substantially more variable systems of transcription, replication or DNA repair, and (ii) a more or less linear growth of the number of proteins involved in RNA metabolism with the increase of the total number of encoded proteins in free-living organisms. Below we discuss in detail specific trends in evolution of catalytic and interaction domains involved in RNA metabolism.
Evolutionary histories of catalytic domains involved in RNA metabolism
RNA modifying enzymes. Cellular RNAs are subject to a number of post-transcriptional modifications that involve modification of the bases and sugars or synthesis of non-canonical bases or nucleotides (46–48). The direct nucleotide modifications include methylation of bases and sugars on N, C or O atoms, deamination and demethylation, whereas formation of non-canonical bases includes thiouridylation, pseudouridylation, thioadenylation, dihydrouridylation, and synthesis of archaeosine and queuine.
Methylases. The most common among RNA modifications are the numerous methylations of all types of RNA molecules (46). The RNA methylases come in two major classes (Table 1): (i) the Rossmann-fold methylases, which include the majority of N-, C- and O-methylases that modify both sugars and bases in RNA, and (ii) the recently described SPOUT (49) superfamily, which consists of the m1G-specific methylase TrmD (50,51), the 2'-O-methylguanosine-specific methylase SpoU (52–54), and several other poorly characterized predicted RNA methylases. The SPOUT superfamily is traceable to LUCA, but the evolution of these methylases is not considered here in detail because it has recently been described elsewhere (49).
The methylases of the Rossmann-fold class share a six-stranded Rossmann-fold core with the dinucleotide-binding dehydrogenases and are distinguished from them by a methylase-specific 7th strand (20,55). This class contains the great majority of the known methylases that participate in almost every conceivable methylation reaction in biological systems, and RNA specificity appears to have emerged on multiple occasions among them. We sought to resolve the evolutionary relationships among Rossmann-fold RNA methylases using a combination of conventional phylogenetic trees and cladistic analysis based on specific shared sequence motifs (Fig. 2). Several distinct lineages of dedicated RNA methylases can be detected; some of the corresponding protein families also include related DNA methylases. The RNA methylases, typically, are highly conserved and are often associated with specific RBDs, which distinguish them from the DNA methylases; many of the latter are large proteins occurring in restriction-modification operons with a sporadic phyletic distribution. The largest monophyletic superfamily of nucleic acid methylases is that of the base N-methylases (the BNM superfamily). These methylases are characterized by a shared derived character, the [N/D]PP[Y/F] motif at the end of strand 4, which is associated with base specificity (Fig. 2). Phylogenetic analysis helped in identifying several distinct families within the BNM superfamily, and most of these families can be distinguished by specific derived characters in the above motif. Within the BNM superfamily, two families, namely the HemK family (19) and the MJ0438 family of predicted methylases containing the RNA-binding THUMP domain (12), are represented in all three primary kingdoms and are thus traceable to LUCA. Along with several other related families with more restricted phyletic patterns, these families form a large assemblage of (predicted) purine N-methylases with the NPP[Y/F] motif associated with strand 4.
Some of the smaller families appear to be more closely related to either the HemK or the MJ0438 family and might have emerged from them through duplications much later in evolution. The RsmC family methylases that methylate G1207 in 16S rRNA (56) and the RsmD, YfiC and YbiN families are bacteria-specific elaborations that are related to the HemK family, whereas the MJ0046 family apparently was derived from the HemK family in the archaeo-eukaryotic lineage. The MJ0438-related elaborations, namely the MJ0710 and MJ0284 lineages, are present in archaea and eukaryotes. The YhhF and MJ1273 families, which are restricted in their distribution to bacteria and archaea, respectively, also belong to this assemblage, but do not show a specific relationship with either the HemK or the MJ0438 family. The functions of the HemK and MJ0438 families are poorly characterized, but their nearly universal conservation pattern suggests a role in purine methylation in rRNA. In Rickettsia, the HemK methylase is fused with another methyltransferase of a different family, MicA (Fig. 2). This suggests that these two methylases coordinately function in rRNA methylation.
The next major assemblage within the BNM superfamily is distinguished by the motif DPP followed by a polar residue (typically R) after strand 4. One of the main families within this assemblage is the Trm2 family, which is involved in methylation of U54 in tRNA at the 5 position (57). This family with its pan-bacterial distribution appears to have emerged early in bacterial evolution and apparently was subsequently transferred to the eukaryotic lineage through the mitochondrial symbiosis. Certain bacteria encode an additional methylase family of this assemblage, TrmA, which has the same specificity (58), and appears to have branched off the more widespread Trm2 family. Similarly, eukaryotes have their own, specific methylase family related to the Trm2 family proteins and typified by CG3808 from Drosophila. Another prominent group within this assemblage is the MJ1653 family that shows a fusion to the RNA-binding PUA domain and is widespread in both archaea and bacteria. Families with a more restricted distribution, which are probably more recent offshoots of this lineage, include the YcbY family seen only in some bacteria and the archaeal MJ1233 family (Fig. 2).
The last major group of the BNM assemblage comprises the methylases with a circularly permuted methylase domain. All members of this group that are widespread in prokaryotes are DNA adenine methylases associated with restriction–modification systems. In eukaryotes, this group diversified into three distinct families of adenine mRNA methylases (59) typified by the yeast proteins Kar4p and Ime4p, and Drosophila CG14906 (lost in S.cerevisiae), respectively (Fig. 2). In these families, the motif associated with strand 4 assumes the form [D/E]PPW, which is shared with DNA adenine methylases, such as MunI.
The SUN superfamily is the next major assemblage of Rossmann-like fold RNA methylases, which is the sister group of the BNM superfamily (Fig. 2) and has the diagnostic motif DAPC associated with strand 4. The Sun family enzymes, which methylate rRNA at the cytosine 5 position (23), are represented in all three primary kingdoms, consistent with their presence in LUCA. The SUN superfamily has undergone extensive radiation in archaea and eukaryotes, giving rise to two distinct families prior to the separation of eukaryotes and archaea and the eukaryote-specific Nop1 family involved in rRNA and snU RNA methylation (60).
The Erm1/KsgA family that has the motif NLP[Y/F] associated with strand 4 is another close sister group of the BNM superfamily (Fig. 2). These methylases are conserved in all life forms and are responsible for diadenine 2-methylation in rRNA (61), which suggests the presence of this modification in LUCA. The archaeo-eukaryotic Trm5 tRNA methylase family and the archaea-specific MJ1557 family also have a similar form of the strand-4 motif, suggesting that these families form a monophyletic superfamily with the KsgA family (Fig. 2).
Generically related to the BNM, SUN and KsgA-Trm5-like superfamilies are two methylase groups with a more restricted distribution. One of these is the bacterial YqlF family, which has an N-terminal S4 domain and a strand-4 motif of the form D[V/L]DF. Thus, this family shares the conserved D or N followed by two small residues and the predicted base-interacting aromatic or hydrophobic residue with the former superfamilies. The second group, the Uvi22 superfamily, also has a similar strand-4 motif, but has a unique insert of two small amino acids prior to the conserved D at the end of strand 4. While none of the members of this superfamily has been experimentally characterized as an RNA methylase, the presence of the characteristic form of the above-mentioned strand-4 motif supports this function. Additionally, one of the yeast members of this family is fused to an RNA deaminase (see below), suggesting a role in RNA modification (Fig. 2). Among the bacteria, this superfamily is restricted to the proteobacteria (conserved in all α-proteobacteria), whereas it vastly expanded into several distinct families in eukaryotes. This pattern, taken together with phylogenetic analysis results (data not shown), suggests an origin from the mitochondrial endosymbiont. Members of this superfamily might represent a major, as yet unexplored group of eukaryotic nucleic acid methylases.
Sequence evidence and the distinct form of the strand-4 motif suggest that all methylase superfamilies described above descended from a common RNA-methylating ancestor well before the emergence of LUCA. Structural comparisons reveal even deeper links, suggesting that these methylases, in turn, form a higher-order monophyletic group with the FtsJ superfamily of methylases involved in 2'-O-methylation of uridine in LSU rRNA (62) (Fig. 2). The FtsJ/RrmJ family proper is represented in all three primary kingdoms, which points to its presence in LUCA. Several other related families, such as YgdE in bacteria and at least four distinct eukaryotic families, including two animal-specific ones, were derived at various later points in evolution, probably from an FtsJ-like precursor. Some of these, e.g. the Spb1 family, might methylate snoRNAs (63), suggesting that other, unexplored specificities exist within this family of methylases. Structural comparisons indicate that the group of RNA methylases closest to the FtsJ superfamily is the Fibrillarin/Nop1 family, which is involved in snoRNA methylation (64). This family is restricted to the archaeo-eukaryotic lineage and might have been derived from the FtsJ superfamily through extreme divergent evolution. The archaeo-eukaryotic Trm1 methylase family and the MicA family shared by bacteria and eukaryotes appear to comprise another monophyletic group, which appears to be a sister group of all of the rRNA methylases described above (Fig. 2). Both these families share a similar form of the strand-4 motif with the signature DP followed by an aromatic and then by a small residue. Trm1 functions as a tRNA N2,N2-dimethylguanosine-26 methyltransferase (65,66) and MicA probably performs a similar, although not identical, role in bacteria and eukaryotic mitochondria. These two families might represent the archaeo-eukaryotic and bacterial branches, respectively, of an ancestral methylase that was represented in LUCA.
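The strand-4 signatures quoted above can be written as simple patterns, which may help when screening a sequence by eye. The following is a minimal sketch, not the profile-based method used in this analysis: the function name `scan_strand4_motifs`, the residue classes chosen for "aromatic" (F, W, Y) and "small" (G, A, S, C, T), and the example sequences are all assumptions made for illustration. A four-residue match is expected to occur by chance in many unrelated proteins, so a hit here would only flag a candidate region for closer inspection.

```python
import re

# Residue classes are assumptions for illustration; the text does not define them.
AROMATIC = "FWY"
SMALL = "GASCT"

# Strand-4 motif forms transcribed from the text (one pattern per family).
STRAND4_MOTIFS = {
    "SUN": re.compile(r"DAPC"),
    "Erm1/KsgA": re.compile(r"NLP[YF]"),
    "YqlF": re.compile(r"D[VL]DF"),
    # "DP followed by an aromatic and then by a small residue"
    "Trm1/MicA": re.compile(rf"DP[{AROMATIC}][{SMALL}]"),
}


def scan_strand4_motifs(sequence):
    """Return {family: [0-based start positions]} for each motif found."""
    seq = sequence.upper()
    hits = {}
    for family, pattern in STRAND4_MOTIFS.items():
        positions = [match.start() for match in pattern.finditer(seq)]
        if positions:
            hits[family] = positions
    return hits


if __name__ == "__main__":
    # Made-up peptide fragments, not real protein sequences.
    print(scan_strand4_motifs("MKTNLPYGG"))
    print(scan_strand4_motifs("AAADVDFAAADPFG"))
```

Because the patterns ignore structural context, a real screen would also need to confirm that the match falls at the end of strand 4 of a Rossmann-like fold.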
All the other groups of RNA methylases appear to have been derived, independently, on more than one occasion in evolution, from within the vast assemblage of small molecule and protein methylases. None of these families is traceable to LUCA; instead, they are restricted in their distribution to only one or two of the primary kingdoms. Two of these families, the Abd1p family that methylates the eukaryotic mRNA cap, and Yml014w family that is fused, in some cases, to the AlkB domain (see below), have a dyad of aromatic residues in the 4th position after the end of strand 4. This feature suggests their derivation from within the vast class of small molecule methylases. The Yml014w family has additionally lost the polar residue (D/N) at the end of strand 4. Also derived from within this small molecule methylase assemblage is the family typified by the plant Corymbosa2/Hen-1 protein. Predicted methylases of this family are present in the crown group eukaryotes and in some bacteria, such as Streptomyces and Nostoc, and retain a single aromatic residue in the 4th position after the end of strand 4. The plant representatives of this family are fused to an N-terminal RNA-binding LA domain and a double-stranded RNA-binding domain (dsRBD) (Table 1 and Fig. 2), which suggests that these proteins are RNA methylases that probably methylate substrates containing double-stranded regions (see below). The GCD14 family of methylases (67,68), which methylate A58 of tRNAs in position 1, was derived in the archaeo-eukaryotic lineage and is more closely related to protein arginine and carboxyl group methylases than to other RNA methylases. These methylases have been sporadically transferred to bacteria, such as M.tuberculosis and A.aeolicus. They are distinguished by the presence of a distinct C-terminal domain similar to the transcript cleavage factor GreA (69). 
This family appears to have undergone a duplication in eukaryotes, giving rise to a paralog, GCD10, whose methylase domain shows a disruption of the Rossmann-fold loop and the strand-4 region. The RrmA family that methylates G745 in position 1 in LSU rRNA (70) is another family that appears to have been derived from the small molecule methylases late in bacterial evolution, followed by inter-bacterial dispersion via horizontal transfer.
Thus, Rossmann-fold methylases appear to have been recruited for RNA methylation at an early stage of evolution, well before LUCA. From this ancient, ancestral methylase, the significant majority of the RNA methylases, including the five to six aforementioned methylase families that were probably already present in LUCA, were derived. Extensive duplication, later in evolution, particularly in eukaryotes, resulted in the formation of several more families within this large, monophyletic assembly of RNA methylases. Additionally, lineage-specific RNA methylases were apparently derived independently, on multiple occasions, from within the small molecule and protein methylase clade. At early stages of their evolution, RNA methylases formed stable fusions with several distinct RBDs, such as the S4, PUA (9), TRAM (11), THUMP (12), NusB and a potential OB-fold domain (in Trm5) (71) (Fig. 2). In addition, in eukaryotes, fusions of RNA methylases to eukaryotic-specific RBDs, including RRM and CCCH domains in the TrmA-family methylases and a G-patch domain (18) in the FtsJ family, were detected. These fusions appear to have emerged relatively late in eukaryotic evolution and probably participate in the methylation of eukaryote-specific snRNAs. Most of these pan-bacterial families of methylases appear to have been horizontally transferred to the eukaryotic genomes as a consequence of organellar endosymbiosis, resulting in a bacterial-eukaryotic distribution pattern. The identification of several uncharacterized RNA methylase groups in this analysis (Table 1) may help in further investigations of the diversity of this crucial RNA modification.
Pseudouridine synthases. The modified base pseudouridine is synthesized by pseudouridine synthases via in situ isomerization of uridines in tRNAs, rRNAs and eukaryotic snRNAs, such as U5 and U3 (46,72). Pseudouridine synthases belong to two apparently unrelated superfamilies, one of which (Type I PSUS) includes the four principal ancient families, RluD, RsuA, TruB and MJ0041, whereas the other superfamily (Type II PSUS) consists of a single ancient lineage typified by TruA (22,73,74) (Fig. 3). Type II PSUS are present in a single copy in all proteomes, except for eukaryotes that have at least three enzymes of this superfamily. Within the Type I PSUS superfamily, the TruB family is traceable to LUCA; several members of this family are fused to a PUA domain, suggesting that this was the ancestral PSUS Type I domain architecture. The RluD and RsuA families originated in bacteria; each family includes several members containing the S4 RBD (9), which was probably present in the ancestor of these families, but was subsequently lost on multiple secondary occasions. Conversely, the THUMP-domain-containing MJ0041 family of PSUS appears to be an innovation specific to the archaeal lineage. The RluD family has been secondarily transferred to the eukaryotes, probably via the pro-mitochondrial endosymbiont. Type I PSUS are predicted to adopt an α+β fold; the crystal structure of the Type II PSUS shows the presence of a core RRM-like fold common to several ancient nucleic acid-binding domains (75). This, taken together with the use of guide RNAs by the eukaryotic PSUS, suggests that Type II PSUS might have evolved from an ancient RBD that functioned in conjunction with a ribozyme, with a gradual shift of the active site from the RNA to the protein component.
Enzymes involved in base thiolation. A variety of thio-bases are represented in cellular RNAs, the most common ones being 2- or 4-thiouridines and their derivatives, and 2-methylthioadenine derivatives. The methylthioadenines are typically additionally modified with bulky adducts, such as threonine or 4-hydroxyisopentene in the N6 position. Recently, the enzyme responsible for adenine thiolation in E.coli, MiaB (76), has been identified and shown to consist of a C-terminal RNA-binding TRAM domain and an N-terminal biotin synthase-like, metal cluster-containing catalytic domain that is predicted to catalyze sulfur insertion via SAM-dependent organic radical generation (11,77,78). MiaB-like proteins are universally present in all life forms, indicating their origin prior to LUCA. Several organisms encode more than one version of this enzyme, which appear to have diversified through early duplications; these multiple forms might differentially function in the synthesis of different 2-methylthioadenine derivatives, such as 2-methylthio-N6-threonyl carbamoyladenosine and 2-methylthio-N6-methyladenosine (46).
Thiouridine synthase (ThioUS; ThiI protein in E.coli) is involved in the synthesis of 4-thiouridine in tRNAs and has a core PP-ATPase domain (79), which catalyzes adenylylation of the 4-carbonyl group of uridine, followed by sulfur insertion catalyzed by a rhodanese-like enzyme (80,81). This rhodanese-like enzyme either comprises a distinct domain of the ThiI protein or functions as a stand-alone protein. 2-Thiouridine is universally present in tRNA, and 2-thiouridine derivatives, typically containing an additional modification of a methyl or aminomethyl group in position 5, are also common. One of the enzymes involved in 2-thiouridine synthesis, TrmU, has been identified (82). This protein contains a PP-ATPase domain with an unusual conserved cysteine dyad inserted after strand 3 in the PP-loop domain. This suggests that syntheses of 2-thiouridine and 4-thiouridine follow similar biochemistry, which involves activation of the carbonyl group by adenylylation. In TrmU-like enzymes, the internal conserved cysteines might directly participate in sulfur insertion as a functional counterpart of the separate rhodanese-like domain, which is required for 4-thiouridine formation.
Previously, we predicted that the MJ0066 family represents a novel family of archaeal ThioUS, on the basis of the fusion of a PP-ATPase domain with a PUA domain (9). Here, we systematically investigated other PP-ATPase families that potentially could be involved in thiouridine or thiocytidine synthesis by examining fusions with RBDs, association with the ribosomal super-operon and conserved phyletic patterns typical of RNA metabolism proteins. As a result, the MTH271-MJ1157 family, which showed fusions with the KH and Zn-ribbon domains, and the MJ0690 family, which is associated with the ribosomal super-operon in different archaeal genomes, emerged as candidates for these functions (Fig. 3). Furthermore, the MesJ family, which is closely related to the TrmU family, is universally conserved in all bacteria and potentially also could be involved in base thiolation.
The ThiI-family proteins contain an N-terminal THUMP domain and are bifunctional proteins that additionally participate in thiamin biosynthesis (80,81,83). These proteins are ubiquitous in archaea, but sporadic in bacteria, suggesting that they originated in archaea, with several subsequent horizontal transfers to bacteria. In contrast, the TrmU family is absent in archaea, but is nearly universal in bacteria and eukaryotes, suggesting origin in the bacterial lineage, followed by an early transfer to the eukaryotes, probably via the pro-mitochondrial route. These phyletic patterns do not seem to be consistent with the universal distribution of the simple 2-thiouridine modification in tRNAs (46,84). The only predicted universal ThioUS are the members of the MJ1157 subfamily (Fig. 3) of the MTH271-MJ1157 family containing N- and C-terminal Zn-ribbon domains. The universal distribution, taken together with the distinct bacterial and archaeo-eukaryotic clades detected during the phylogenetic analysis of this family (data not shown), suggests that these enzymes are the 2-thiouridine synthases, whereas TrmU is likely to be specifically involved in 5-methyl-2-thiouridine biosynthesis. The presence of a conserved cysteine dyad insert in the PP-ATPase domain of the MTH271-MJ1157 family, similar to TrmU, might indicate an analogous catalytic mechanism. Archaea and eukaryotes have additional subfamilies of the MTH271-MJ1157 family (Fig. 3) that, along with the MJ0066 family, could compensate for the absence of TrmU or ThiI in some of these lineages. The sporadic presence of ThiI in bacteria suggests that it might be substituted by a more widespread, but so far unidentified 4-thiouridine synthase specific to bacteria; the conserved MesJ protein could be one candidate for this function.
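The reasoning from phyletic patterns used in this paragraph (ubiquitous in one kingdom and sporadic in another suggests origin in the former with horizontal transfer to the latter; presence in all three suggests an ancient, possibly LUCA-derived, family) can be illustrated with a toy classifier. This is only a sketch under stated assumptions and is not the parsimony-based reconstruction performed in this study: the function name `classify_phyletic_pattern`, the 0.8 cut-off for "widespread" and the presence fractions in the example are hypothetical values chosen for illustration, not data from this analysis.

```python
KINGDOMS = ("bacteria", "archaea", "eukaryotes")


def classify_phyletic_pattern(presence, widespread=0.8):
    """Summarize a phyletic pattern.

    presence: dict mapping kingdom name to the fraction (0..1) of sampled
    genomes that encode a member of the family. The `widespread` cut-off
    is an arbitrary, hypothetical threshold.
    """
    wide = [k for k in KINGDOMS if presence.get(k, 0.0) >= widespread]
    rare = [k for k in KINGDOMS if 0.0 < presence.get(k, 0.0) < widespread]
    if len(wide) == len(KINGDOMS):
        return "universal: candidate for presence in LUCA"
    if not wide:
        return "no kingdom-wide conservation: origin unresolved"
    label = "widespread in " + " and ".join(wide)
    if rare:
        label += ("; sporadic in " + " and ".join(rare)
                  + " (possible horizontal transfer)")
    return label


if __name__ == "__main__":
    # Illustrative fractions only, loosely mimicking the patterns described
    # for ThiI-like and TrmU-like families.
    print(classify_phyletic_pattern(
        {"bacteria": 0.2, "archaea": 1.0, "eukaryotes": 0.0}))
    print(classify_phyletic_pattern(
        {"bacteria": 0.95, "archaea": 0.0, "eukaryotes": 0.9}))
```

A summary of this kind cannot by itself distinguish horizontal transfer from lineage-specific loss; that distinction requires phylogenetic analysis of the family in question.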
Queuosine and archaeosine synthases. The 7-deazaguanosines, queuosine and archaeosine, found, respectively, in bacteria and eukaryotes, and in archaea, are incorporated into tRNA through transglycosylation (46,84). The queuosine transglycosylase, Tgt (85,86), is present in a single copy in most bacteria, whereas eukaryotes, with the exception of yeast, encode two forms of this enzyme. Archaea have two distinct proteins of the archaeosine transglycosylase family, which are distantly related to Tgt (87). The MJ1022 subfamily so far has been found only in Euryarchaea; A.fulgidus, in addition, has a single copy of queuosine transglycosylase, apparently a lineage-specific acquisition from bacteria (Fig. 3). The complementary distribution of queuosine and archaeosine transglycosylases suggests that they originally diverged from a common ancestor with a TIM barrel fold (88), concomitantly with the split of the bacterial and archaeo-eukaryotic lineages. In archaea, the catalytic domain was fused with the RNA-binding PUA domain and this form of archaeosine transglycosylase underwent a duplication in Euryarchaea (Fig. 3). In eukaryotes, acquisition of the bacterial queuosine synthase through horizontal transfer from the pro-mitochondrion probably resulted in displacement of the ancestral archaeo-eukaryotic archaeosine synthase, with a further duplication leading to the forms involved in modification of organellar and cytoplasmic tRNAs.
RNA deaminases. RNA deaminases are responsible for the synthesis of certain modified nucleosides, such as inosine, and for base conversions during various RNA editing reactions. The cytidine deaminase family includes generic enzymes that catalyze generation of uridine from cytidine. In yeast, these enzymes are responsible for C→U editing (89), suggesting that they might perform a similar function in many, if not all, eukaryotes. Plants show an expansion of a specialized form of this family, with an N-terminal inactive deaminase domain, in addition to the C-terminal active one; conceivably, these proteins might be involved in a plant-specific form of regulated RNA editing. Deaminases of the Tad2p family, which generate uracil from cytosine and inosine from adenosine in the wobble position of tRNAs (90,91), are present in most bacteria and all eukaryotes, but not in archaea (Fig. 3). The Tad3p family, which comprises the second subunit of the inosine-generating deaminase, is eukaryote specific. The combination of Tad2p and Tad3p probably confers the specificity that differentiates this enzyme from generic cytosine deaminases. The eukaryote-specific Tad1p family of deaminases (92) is involved in inosine generation at A37 of tRNAAla and in adenine editing of mRNAs in animals (93,94). The animal versions typically have the characteristic dsRBD fused to the catalytic domain, whereas one of the vertebrate paralogs contains a winged helix-turn-helix domain (Fig. 3). Cytosine deaminases of the vertebrate-specific APOBEC family are involved in C→U editing and are represented by at least eight paralogs in mammals (95). These enzymes appear to have been recently derived from the cytidine deaminases through rapid divergent evolution. The deaminases related to the RibD protein, which is involved in riboflavin biosynthesis, are fused to a Type I PSUS in S.cerevisiae and to a potential RNA methylase in S.pombe, suggesting that, similarly to cytidine deaminases, they might be involved in specific editing processes (Figs 2 and 3).
Specific RNA deaminases of known families are nearly absent in archaea. The corresponding functions might have been taken over by unrelated, still unknown enzymes or, at least in some cases, could be provided by related enzymes of the deoxycytidine deaminase family that are present in some archaea. This phyletic pattern suggests a bacterial origin for at least two of the major deaminase lineages, cytidine deaminases and cytosine deaminases. Following their acquisition by eukaryotes from the bacterial endosymbiont, cytosine deaminase underwent duplication to give rise to the two A→I deaminases involved in wobble-specific inosine synthesis. Additionally, members of both the cytidine and cytosine deaminase lineages were independently recruited for mRNA editing in vertebrates and possibly in other eukaryotic lineages (Fig. 3).
Dihydrouridine synthases. Dihydrouridine synthases are poorly characterized enzymes that synthesize dihydrouridine through the reduction of the aromatic ring of uracil. This base is widely found in tRNAs from all three primary kingdoms and in LSU rRNA from prokaryotes (96,97). The yeast dihydrouridine synthase Dus1p belongs to the superfamily of FAD-binding TIM barrel oxidoreductases typified by dihydroorotate dehydrogenase (98). This enzyme is universally represented in eukaryotes and bacteria, but completely missing in archaea. Eukaryotes have four main lineages within this family, which are typified by the yeast proteins Dus1p, Smm1p, Ylr405wp and Ylr401cp. The members of the first three families typically show fusions with the LRP1 Zn-finger, dsRBD and CCCH RBDs, respectively (Fig. 3); these RBDs probably target dihydrouridine synthases to specific sites in the substrate RNAs. Bacteria have at least three principal lineages of dihydrouridine synthases typified by the YhdG, YohI and YjbN proteins from E.coli (Fig. 3). The phyletic pattern of dihydrouridine synthases suggests that this enzyme emerged early in bacterial evolution and was transferred to eukaryotes, probably via the endosymbiotic route. The diversification of dihydrouridine synthases into multiple forms apparently occurred independently in bacteria and eukaryotes. Dihydrouridine has been detected in tRNAs of T.acidophilum and M.thermoautotrophicum, but appears to be missing in other archaea studied to date (99,100). Hence, at least in those archaea that appear to contain this modification, an alternative as yet undiscovered enzyme is likely to be present.
NTP-dependent enzymes involved in RNA metabolism
In addition to the PP-loop ATPases discussed above in the context of base modification, a variety of ATP- and GTP-utilizing enzymes of the P-loop NTPase fold are involved in RNA modification, processing and splicing and especially in translation itself. In addition, aminoacyl-tRNA synthetases (aaRS), which belong to two other distinct, ancient classes of ATP-utilizing enzymes, are central to the translation process. Evolutionary relationships of aminoacyl-tRNA synthetases have been examined in detail in several recent studies (10,36,101,102). Here, we briefly summarize the evolutionary history of the vast class of P-loop NTPases in the context of their repeated utilization in RNA metabolism.
GTPases. P-loop GTPases are among the central, most ancient components of RNP complexes and at least nine distinct GTPases associated with different aspects of translation are traceable to LUCA. These include the four translation factors involved in initiation and elongation, two distinct versions of the OBG family of GTPases containing the RNA-binding TGS domain, the circularly permuted YlqF-like GTPases, and two GTPases associated with the signal recognition particle and its receptor. The first seven of these families belong to a large assemblage of GTPases related to the translation factors (the TRAFAC class), whereas the remaining two are members of the signal recognition/MinD/BioD (SIMIBI) class of GTPases and related ATPases (103). These two classes correspond to the first fundamental split in the evolution of GTPases and, because both classes include proteins involved in translation, it appears likely that the primordial GTPase was a component of an ancient RNP complex that functioned as a generic regulator of translation. Even prior to LUCA, the GTPases had diversified through several duplications to perform more specific, essential functions in translation and secretion. After the radiation of the major lineages of life, many GTPases were recruited for specific functions within the translation system, such as translation termination and RNA modification and processing. The Era family GTPases (104), which contain a C-terminal domain that is a topologically rearranged version of the KH domain, the PseudoKH domain (105), and the TrmE (ThdF) family were derived in bacteria within the TRAFAC class of GTPases and participate in rRNA and tRNA modification. TrmE is involved in the synthesis of the modified nucleotide 5-methylaminomethyl-2-thiouridine in tRNAs (106). The archaeo-eukaryotic Clp1 GTPase family of the SIMIBI class was recruited to participate in polyadenylation site selection (107). 
In eukaryotes, a distinct paralogous derivative of the universal translation factor EF-2, typified by Snu114p, acquired a new function in splicing as a component of the U5 RNP (103). Further details of GTPase evolution are presented elsewhere (108).
RNA helicases. The next major class of P-loop NTPases that are associated with RNA metabolism are RNA helicases and related ATPases. The known RNA helicases of cellular life forms belong to two major superfamilies, SFI and SFII, that descend from an ancient common ancestor antedating LUCA. This ancestral helicase contained two distinct α/β domains that are present in both SFI and SFII (109). The N-terminal domain is a classic P-loop ATPase domain that belongs to the RecA-like subclass of P-loop domains (110,111). The C-terminal domain appears to represent an extremely divergent P-loop domain that might have evolved through an ancient duplication of the N-terminal domain, followed by extreme sequence divergence, which probably accompanied a functional shift to single-strand nucleic acid binding. The extant lineages of SFI and SFII helicases include both DNA and RNA helicases, and other nucleic acid-dependent ATPases. Among the helicases involved in RNA metabolism, SFII occupies a more prominent position than SFI; SFII helicases are much more prevalent in eukaryotes than in bacteria (Fig. 4). Seven major families of SFII helicases have experimentally characterized or clearly predicted roles in RNA metabolism. Two of these, namely the eIF4A-DeaD family (with the classic DEAD motif in the Walker B site) and the Ski2p-Lhr family, are widespread in all three primary kingdoms, which points to their presence in LUCA. Within the eIF4A-DeaD family, the orthologous group typified by the bacterial DeaD protein, which is involved in translation regulation (112), is widely represented in bacteria and archaea and might be the form closest to the ancestor of this family. In eukaryotes, this family has vastly expanded to include at least 30 distinct lineages, with almost 25 of them traceable to the common ancestor of the crown group (Fig. 4). Most members of this expanded helicase subfamily are subunits of pre-mRNA splicing complexes, whereas some others, such as Rrp3p (113), function in other RNA processing pathways, and Upf1p is involved in mRNA degradation (114). 
The pan-eukaryotic translation initiation factor eIF4A appears to be the direct equivalent of the prokaryotic DeaD-like helicases, and its function in eukaryotes might be an extension of the ancient role of these helicases in regulatory unwinding of mRNA secondary structure. Proteobacteria have a lineage-specific expansion of the DeaD lineage, with additional orthologous groups, such as RhlE and RhlB (115), whereas most of the other bacteria have only a single member.
The Ski2p-LHR family is a much smaller family whose ancestral form probably was involved in RNA degradation and processing (116,117). Archaea typically have three distinct helicases of this family, whereas eukaryotes have four members of the Ski2p-Mtr4p-like subfamily, all of which apparently function in conjunction with the exosomal nucleases in RNA degradation (Fig. 4). Another eukaryote-specific orthologous group within this family includes Brr2p-like proteins, which contain two helicase and sec63 domains and are involved in both cytoplasmic RNA processing and splicing as a component of U5 snRNP (118). One orthologous group within the Ski2p-LHR family, which is typified by mus308 of D.melanogaster and MJ1401 of M.jannaschii, appears to have been recruited for DNA-related functions in the archaeo-eukaryotic lineage and, in eukaryotes, shows a fusion to a DNA polymerase domain (119,120).
The remaining families of SFII helicases involved in RNA metabolism show purely eukaryotic, bacterial or bacterio-eukaryotic distribution. The Suv3 family involved in mitochondrial RNA degradation (121) and the CAF family involved in PTGS are small groups that are restricted to eukaryotes and appear to function in eukaryote-specific regulatory processes (see below). The Prp2p-Mle subfamily is found in both bacteria and eukaryotes. Eight distinct orthologous groups can be delineated within this family in eukaryotes, with the majority involved in splicing, including Prp2p, Prp16p, Prp43p, Prp22p and Mle (122). The HrpA/B proteins are bacterial representatives of the Prp2p-Mle family that are present only in proteobacteria, spirochetes and Deinococcus, which suggests dissemination via horizontal transfer among bacteria, although the initial direction of horizontal transfer responsible for the bacterio-eukaryotic distribution remains uncertain. The SecA family proteins are ubiquitous in bacteria and plants and have been shown to possess RNA helicase activity (123-125). However, the role of this activity in vivo remains unclear because SecA also has a well characterized function as an ATP-dependent translocase involved in protein secretion. The RecQ family of SFII helicases is unusual in that these proteins have functions in both DNA repair and RNA metabolism. This family is represented only in bacteria and eukaryotes, with a single horizontal transfer into the crenarchaeon A.pernix. This distribution suggests that the RecQ family originally evolved in bacteria and was subsequently acquired by eukaryotes from the pro-mitochondrial endosymbiont, which was followed by extensive diversification into at least five distinct orthologous, eukaryote-specific groups. 
Many members of this family share a predicted RBD, the HRDC domain (126), with the RNase D family of nucleases, suggesting that the ancestral function of the RecQ family helicases might have been in RNA metabolism, with a subsequent shift to DNA-related functions. A member of this family from Neurospora has been shown to have a role in RNA metabolism, in particular PTGS (127). Orthologs of this protein are present in other eukaryotes; furthermore, fusion of the RecQ family helicases with the Zn-knuckle and the F-box domains in plants and animals (see Figs 4 and 6) indicates that this family might have more extensive RNA-related functions than presently conceived.
Several SFI helicases are implicated in RNA-related functions in eukaryotes; they all belong to the Smubp-Sen1p family, which is conserved throughout the archaeo-eukaryotic lineage and in a few bacteria (128). This family includes both DNA and RNA helicases and probably emerged early during the evolution of the archaeo-eukaryotic clade, rather than in LUCA. All archaeal members of the Smubp-Sen1p family are orthologs of the eukaryotic Smubp, which is a DNA-binding protein (129). However, the presence of the single-stranded nucleic acid-binding R3H domain (15) in some of the eukaryotic members of this family might point to an undiscovered role in RNA metabolism (see Figs 4 and 6). All known eukaryotic SFI RNA helicases of the Smubp-Sen1p family were derived after the divergence of eukaryotes from the common ancestor with archaea (Fig. 4). Five distinct lineages of RNA helicases of this family emerged prior to the divergence of the crown group eukaryotes and include proteins involved in a variety of functions, such as snoRNA maturation [Sen1p (130)], mRNA degradation [Nam7p (131)] and PTGS [Sde3 (132)] (Fig. 4). One of the eukaryotic SFI lineages, represented by the S.pombe SPCC1739.03 and its orthologs, is closely related to the NAM7p subfamily and is an uncharacterized group of predicted RNA helicases, which, on the basis of their phyletic pattern (133), are likely to participate in PTGS (Fig. 4). Another distinct pan-eukaryotic family, typified by the Aquarius protein (134), is predicted to include inactive helicases as indicated by the disruption of the P-loop and Walker B motifs; these proteins probably function as RNA-binding regulators, rather than as enzymes. Two small, lineage-specific expansions of these helicases were detected in Arabidopsis and C.elegans, typified by the F1E22.14 (eight members) and K08D10.5 (six members) proteins, respectively; these might represent specific adaptations for antiviral response or related processes. 
Most of the other SFI families, such as the RecD family, appear to have evolved in bacteria and are known to be involved only in DNA repair and recombination (119).
The PhoH family of ATPases (135) evolved in bacteria, apparently through the loss of the C-terminal α/β domain that was present in the common ancestor of SFI and SFII helicases. A role in RNA metabolism is strongly suggested by the presence of RNA-binding PIN and KH domains in different members of the PhoH family (Fig. 4). There are two orthologous groups of PhoH-like ATPases, typified by PhoH and YlaK, respectively, that evolved as a result of an early duplication in the bacterial lineage. The PhoH proteins could either function as helicases or could be involved in ATP-dependent dynamics of as yet uncharacterized RNP complexes in bacteria.
Miscellaneous P-loop NTPases involved in RNA metabolism. In addition to the above, well characterized classes of P-loop NTPases involved in RNA metabolism, several others have less common and less thoroughly understood RNA-related functions. The most notable of such groups includes the PilT ATPases, which form a distinct class within the P-loop fold and appear to be a sister group to the ABC class (D.D.Leipe, E.V.Koonin and L.Aravind, unpublished data). The PilT ATPases implicated in RNA metabolism appear to be predominantly an archaeal innovation and are typified by MJ1533 and its orthologs that are highly conserved in archaea (136). These proteins combine the PilT ATPase domain with RNA-binding PIN and KH domains. In bacteria, a group of PilT ATPases is present sporadically in Bacillus and Synechocystis and forms fusions with the RNA-binding R3H domain. These ATPases might represent a novel class of RNA helicases or could participate in other ATP-dependent reactions of RNA metabolism.
Some kinases of the P-loop fold, such as polynucleotide kinases, also participate in RNA metabolism. A generic polynucleotide kinase that probably acts on both DNA and RNA seems to be conserved in all eukaryotes except for S.cerevisiae (137-139). Additionally, some lineage-specific P-loop kinases are implicated in RNA metabolism on the basis of suggestive domain fusions, including the kinase fused to yeast RNA ligase (140,141) and the animal-specific hnRNP-U (SAF-A) proteins, which contain a SAP domain, and might function as chromatin-bound polynucleotide kinases in pre-mRNA splicing (142,143). P-loop kinase domains are also fused to the ligase-related nucleotidyltransferase domains of the capping enzyme in trypanosomes (144).
The P-loop proteins of the MiaA family modify adenines, chiefly in tRNAs, through the addition of bulky adducts, such as isopentene, in position 6, using organic phosphates, e.g. dimethylallyl diphosphate, as donors of the modifying groups (145,146). These enzymes are distantly related to the AAA+ class of P-loop ATPases and are nearly ubiquitous in bacteria and eukaryotes, which is consistent with the phyletic pattern of 6-isopentenyl adenines in tRNA. MiaA probably evolved in the common ancestor of bacteria and was acquired by eukaryotes from the promitochondrial endosymbiont. On the basis of operon organization, it can be predicted that, at least in certain bacteria, such as proteobacteria, Aquifex and Synechocystis, MiaA utilizes the Hfq protein (the bacterial homolog of the eukaryotic SM proteins) as an RNA-binding subunit.
Other enzymes of RNA metabolism
At least 15 superfamilies of RNases are involved in a variety of processes, such as maturation of tRNAs and rRNAs, polyadenylation site-specific cleavage of mRNAs, and RNA degradation in various contexts and cellular compartments. A detailed evolutionary classification of RNases has been published recently (45), and therefore individual groups of these enzymes are not discussed here in detail. However, we cover some specific aspects of their evolution when reconstructing the evolution of individual functional systems in RNA metabolism (see below).
In addition, a number of other enzymes that form relatively small families, sometimes with restricted phyletic distribution, are involved in RNA metabolism. One such group is the RNA ligases that are related to the DNA ligases and appear sporadically in cellular life forms. The fungi possess an RNA ligase, which is required for the maturation of tRNAs and for non-spliceosomal mRNA maturation (147), whereas in trypanosomes RNA ligases participate in mRNA editing (38,148). Homologs of these RNA ligases are encoded by several DNA viruses, including phage T4, baculoviruses and entomopox viruses (38). This observation, together with the sporadic distribution of RNA ligases, might suggest that cellular organisms acquired these enzymes independently from DNA viral sources. Additionally, a variety of other nucleotidyltransferases are involved in non-templated polymerization of ribonucleotides during polyadenylation of mRNAs, CCA addition in tRNAs and RNA editing. All these enzymes have the DNA polymerase β-fold (149) and are considered in greater detail below in the context of evolution of the capping and polyadenylation systems.
Cyclic phosphodiesterases of the LigT superfamily hydrolyze 2'-5' phosphoesters in various contexts in RNA metabolism. The most conserved of these enzymes form the core LigT family, which apparently evolved in the archaeo-eukaryotic lineage, with a few transfers into bacteria; the animal members of this family have a fusion with the RNA-binding KH domain. They apparently catalyze hydrolysis of ADP-ribose 1'',2''-cyclic phosphate that is formed as an intermediate in tRNA processing (150,151). Additional members of this superfamily, which are not orthologs of LigT, were identified as fusions with RNA ligases in yeast, in RNA viral polyproteins, and as stand-alone proteins in Arabidopsis (L.Aravind and E.V.Koonin, unpublished observations); these proteins might have related phosphodiesterase activities in RNA metabolism.
The Macro domain (first detected in vertebrate macrohistone 2) is another highly conserved phosphoesterase that is involved in Appr-1''-p-processing (152), as part of tRNA maturation. Macro domain phosphoesterases are conserved across the three superkingdoms of life, which is compatible with the presence of such an enzyme in LUCA. Finally, several families of enzymes, such as the enigmatic RNA-dependent RNA polymerases (153–156), and AlkB-like oxoglutarate-dependent dioxygenases (157), show a limited phyletic distribution. Most of these are known or predicted components of the eukaryotic post-transcription regulatory systems and are further explored below in the context of evolution of these functional systems.
Evolutionary history and trends of non-catalytic domains involved in RNA metabolism
Approximately 50 major superfamilies of non-catalytic domains, primarily RNA-binding ones, are implicated in RNA metabolism (Fig. 1A and B and Table 1). In addition, several conserved domains are found exclusively in ribosomal proteins. Below we consider some of the general and specific features of the natural history of these domains that emerge from a detailed analysis of their phyletic patterns combined with attempts at evolutionary classification.
Evolutionary mobility of domains. RBDs show remarkable diversity in terms of domain architectures. Several RBDs, such as ribosomal protein L30 and the SRP14-domain, typically occur as stand-alone proteins and in a single copy per genome. At the other end of the spectrum are promiscuous domains, such as RRM, which display over 35 distinct multidomain architectures and are found in combination with up to 20 different domains (Figs 5–7). These observations suggest major differences in evolutionary mobility among RBDs. Certain highly conserved, ancient RBDs, such as L30, S6 and SmpB, appear to have largely stabilized in specific functional niches in the ribosome or in lineage-specific RNP complexes and are not typically recruited to roles in more general contexts related to RNA metabolism. In contrast, some other conserved domains found in ribosomal proteins, such as S1 (158), KOW (13) and S4 domains (9), have been recruited for a variety of other functions which involve RNA binding. Some of these domains (KOW, S4), along with other mobile RBDs, such as EMAP, PUA, PIN, TRAM, THUMP, TGS, N-OB, NusB (9–12,71,136) and several conserved domains found in aaRS (10), form a group of moderately mobile, ancient domains. The majority of the fusions that involved these domains appear to have evolved close to the origin of one of the superkingdoms or, in some cases, even in or prior to LUCA. Most of these architectures show remarkable parallelism of fusions of different RBDs to various RNA modification and processing enzymes. It appears that these RBDs emerged at early stages of evolution and, shortly after their origin, formed fusions that facilitated the delivery of diverse catalytic activities to RNA and hence were maintained in most lineages.
These moderately mobile domains formed lineage-specific fusions on relatively rare occasions, such as those of N-OB and EMAP to the C-termini of plant and vertebrate TyrRS, respectively (10), or the fusion of TRAM to an FtsJ-like methylase in Thermoplasma (11).
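
The two measures of mobility used above (the number of distinct multidomain architectures in which a domain occurs, and the number of different partner domains with which it is combined) can be illustrated with a minimal Python sketch. This is not the procedure used in this study: the function name `domain_mobility`, the input layout and the toy protein set are hypothetical and serve only to make the counting explicit. The sketch assumes that each protein has already been annotated with its ordered list of domains.

```python
from collections import defaultdict


def domain_mobility(architectures):
    """Count, for every domain, (distinct multidomain architectures, distinct partners).

    `architectures` maps a protein identifier to its domains in N- to C-terminal
    order. This input layout is an assumption made for illustration only.
    """
    multidomain = defaultdict(set)  # domain -> set of architectures with >1 domain
    partners = defaultdict(set)     # domain -> set of other domains it co-occurs with
    for arch in architectures.values():
        arch = tuple(arch)
        for dom in set(arch):
            seen = multidomain[dom]  # ensures stand-alone domains are also reported
            if len(arch) > 1:
                seen.add(arch)
            partners[dom].update(d for d in arch if d != dom)
    return {dom: (len(multidomain[dom]), len(partners[dom])) for dom in multidomain}


# Hypothetical toy annotations (not real data from any genome).
toy = {
    "p1": ("L30",),         # stand-alone domain
    "p2": ("RRM", "RRM"),   # tandem repeat counts as a multidomain architecture
    "p3": ("RRM", "KH"),
    "p4": ("PIN", "KH"),
    "p5": ("RRM", "KH"),    # same architecture as p3, counted once
    "p6": ("RRM", "SAP"),
}

mobility = domain_mobility(toy)
for dom in sorted(mobility):
    n_arch, n_partners = mobility[dom]
    print(f"{dom}: {n_arch} multidomain architectures, {n_partners} partner domains")
```

In this toy set, a stand-alone domain such as L30 scores zero on both measures, whereas a promiscuous domain such as RRM scores highest. Treating a tandem repeat as a multidomain architecture is a design choice of the sketch and could be changed by filtering on the number of distinct domains instead.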