Identification of new RNA modifying enzymes by iterative genome search using
known modifying enzymes as probes
Claes Gustafsson, Ralph Reid1, Patricia J. Greene and Daniel V. Santi*
Departments of Pharmaceutical Chemistry and of Biochemistry and Biophysics, University of California, San Francisco, CA 94143-0448, USA and 1Biomolecular Resource Center, University of California, San Francisco, CA 94143-0541, USA
Received June 14, 1996; Revised and Accepted August 12, 1996
ABSTRACT
The complete nucleotide sequences of the Haemophilus influenzae and Mycoplasma genitalium genomes and the partially sequenced Escherichia coli chromosome were analyzed to identify open reading frames (ORFs) likely to encode RNA modifying enzymes. The protein sequences of known RNA modifying enzymes from three families (m5U methyltransferases, [Psi] synthases and 2'-O methyltransferases) were used as probes to search sequence databases for homologs. ORFs identified as homologous to the initial probes were retrieved and used as new probes against the databases in an iterative manner until no more homologous ORFs could be identified. Using this approach, we have identified two new m5U methyltransferases, five new [Psi] synthases and four new 2'-O methyltransferases in E.coli. Many of the ORFs found in E.coli have direct genetic counterparts (orthologs) in one or both of H.influenzae and M.genitalium. Since there is a near-complete knowledge of RNA modifications in E.coli, functional activities of the proteins encoded by the identified ORFs were proposed based on the level of conservation of the ORFs and the modified nucleotides.
INTRODUCTION
As of April 1996, high throughput genomic sequencing has provided hundreds of viral genome sequences, ~20 organellar sequences and the complete nucleotide sequences of two free-living organisms: Haemophilus influenzae (1.8 Mbp; 1) and Mycoplasma genitalium (0.6 Mbp; 2). Also, 74% of the Escherichia coli chromosome (4.7 Mbp) has been reported in the E.coli database collection release 25 (January 1996; 3). The numerous open reading frames (ORFs) now identified bring the genome projects to the next level of analysis: identifying the functions of the uncharacterized ORFs. A practical first approach towards this objective involves prediction of function through homology comparisons of known proteins to uncharacterized ORFs. Indeed, such comparisons of E.coli ORFs have led to assignments of many general functions (4).
RNA modifications have been well characterized in E.coli. Mature RNA contains many modified nucleotides, of which three are 5-methyluridines (m5U), seventeen are pseudouridines (5-ribosyluridine, [Psi]) and seven are 2'-O methylated nucleosides (Nm, where N denotes A, C, G or U) (reviewed in 5). Six of the enzymes that catalyze formation of the m5U, [Psi] and Nm nucleotide modifications have been identified in E.coli.
In this paper, we attempt to identify ORFs likely to encode RNA modifying enzymes. We used the amino acid sequences of eight known RNA modifying enzymes, six from E.coli and one each from Streptomyces azureus and Saccharomyces cerevisiae, as probes to search the databases for homologous ORFs. The probes represent three types of RNA modifying enzymes: m5U methyltransferases, pseudouridine synthases and 2'-O methyltransferases. By iterative homology searches, we have identified ORFs in E.coli, H.influenzae and M.genitalium likely to encode enzymes with similar function. These ORFs, together with knowledge of RNA modifications in E.coli, allowed us to predict specific substrates for many of the ORF-encoded proteins. In E.coli, eleven ORFs which are likely to code for RNA modifying enzymes were found in addition to the six previously characterized. These seventeen enzymes could account for most or all of the three m5Us, seventeen [Psi]s and seven Nms present in E.coli rRNA and tRNA. We also assigned the direct genetic counterparts, or orthologs, of the ORFs found in E.coli to ORFs present in H.influenzae and, where applicable, M.genitalium.
The genomic search procedure described here exploited the following information: (i) knowledge of a set of related end products, the production of which requires a set of potentially related unknown enzymes, and (ii) the amino acid sequence of one or more related enzymes to use as initial probes. This procedure should be generally applicable to other situations which meet similar criteria.
MATERIALS AND METHODS
Amino acid sequences of proteins of known function (Table 1) were used as probes in searches for homologous sequences present in the GenBank database (National Center for Biotechnology Information), the SwissProt database (EBI EMBL) and The Institute for Genomic Research (TIGR) databases for Haemophilus influenzae Rd and Mycoplasma genitalium. The initial searches were carried out using the BLAST program (6) or the GRASTA program [a modification of FASTA (7)]. These programs are part of the network services provided by GenBank and TIGR, respectively. ORFs having a probability of an accidental match of 10^-4 or less to the respective probe were retrieved from the databases and further analyzed.
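The retrieval cutoff can be illustrated with a minimal Python sketch. It assumes that each search hit has already been parsed into a record carrying the chance-match probability reported by the search program; the `Hit` record, the `retain_hits` function and the ORF names are hypothetical and are not part of BLAST or GRASTA.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One database hit (hypothetical record layout)."""
    probe: str          # name of the probe sequence
    orf: str            # identifier of the matched ORF
    probability: float  # probability of an accidental match


THRESHOLD = 1e-4  # cutoff stated in the text


def retain_hits(hits: List[Hit], threshold: float = THRESHOLD) -> List[Hit]:
    """Keep hits whose accidental-match probability is at or below the cutoff."""
    return [h for h in hits if h.probability <= threshold]


# Made-up example values, for illustration only.
hits = [
    Hit("trmA", "orf_A", 3e-12),
    Hit("trmA", "orf_B", 1e-4),
    Hit("trmA", "orf_C", 0.02),
]
kept = retain_hits(hits)
print([h.orf for h in kept])
```

A hit exactly at the cutoff is retained, matching the "10^-4 or less" wording.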
Table 1. Genes encoding known RNA modifying enzymes used as probes in the database searches

Organism            Gene        Activity (a)                     Acc. number (b)
E.coli              trmA        tRNA U54 -> m5U                  P23003
E.coli              truA        tRNA U38, U39, U40 -> [Psi]      P07649
E.coli              truB        tRNA U55 -> [Psi]                P09171
E.coli              rluA        23S U746, tRNA U32 -> [Psi]      P39219
E.coli              rsuA        16S U516 -> [Psi]                P33918
E.coli              spoU        tRNA G18 -> Gm                   P19396
St.azureus (c)      tsr         23S A1067 -> Am                  P18644
Sa.cerevisiae (c)   PET56 (d)   23S G2251 -> Gm                  431760

(a) Numbering and designation of RNA according to E.coli.
(b) SwissProt accession number.
(c) St. denotes Streptomyces and Sa. denotes Saccharomyces.
(d) The PET56 protein methylates the yeast mitochondrial 23S equivalent.
Detailed analysis of the ORFs retrieved in the GRASTA and BLAST searches was performed using sequence analysis programs that are part of the Wisconsin sequence analysis package (8). PILEUP was used to create multiple sequence alignments of related sequences. LINEUP was used as screen editor to edit the multiple sequence alignments, and PRETTY was used to display the sequence alignments. The levels of similarity and identity between the probe used and the identified sequences were determined using the GAP program. The gap creation penalty was set to 3.0 and the gap extension penalty was set to 0.1. Before the final sequence analysis, the compiled amino acid sequences were truncated uniformly at the C- and N-terminal ends to provide sequences of similar overall length as shown in the figures (Figs 2, 3 and 4). The PILEUP program was also used to generate and plot a graph drawn in unrooted tree format (dendrogram) which shows the clustering relationships used to create the alignments.
The multiple sequence alignments were used to identify conserved amino acid sequence motifs. ORFs having the conserved motifs were considered true homologs and were used as probes in an iterative manner, i.e. the identified ORFs were used to search the databases for additional homologs using the same criteria as with the initial probes. As more ORFs were added to the separate sequence alignments (one alignment for each family of enzymes), the identity of conserved sequence motifs was further improved. The identities of the motifs were further refined by aligning homologous sequences from organisms other than E.coli, H.influenzae and M.genitalium found in the GenBank database. These sequences, many of which are incomplete ORFs, have not been included in the alignment figures of this paper.
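The iterative procedure amounts to expanding a set of probes until no new homolog is returned. The Python sketch below illustrates only that control flow. The `search` callable is a hypothetical stand-in for one database search followed by the motif check (it is not an interface of BLAST, GRASTA or the Wisconsin package), and the toy hit table is made up.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def iterative_search(initial_probes: Iterable[str],
                     search: Callable[[str], Iterable[str]]) -> Set[str]:
    """Collect every ORF reachable from the initial probes.

    `search(probe)` must return the ORFs accepted as true homologs of
    `probe`; each newly found ORF is queued and used as a probe in turn.
    """
    found: Set[str] = set(initial_probes)
    queue = deque(found)
    while queue:                      # stops when no new ORFs are found
        probe = queue.popleft()
        for orf in search(probe):
            if orf not in found:      # use each ORF as a probe only once
                found.add(orf)
                queue.append(orf)
    return found


# Toy hit table (made-up names) standing in for real database searches.
TOY_HITS: Dict[str, List[str]] = {
    "rluA": ["orf1", "orf2"],
    "orf1": ["rluA", "orf3"],
    "orf2": [],
    "orf3": ["orf1"],
}


def toy_search(probe: str) -> List[str]:
    return TOY_HITS.get(probe, [])


family = iterative_search(["rluA"], toy_search)
print(sorted(family))
```

Here "orf3" is reached only through "orf1", which is the situation the iteration is meant to capture: a homolog too distant to be detected by the initial probe.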
Escherichia coli and H.influenzae ORFs were considered as orthologous gene pairs when the ORFs paired together strongly in the dendrogram generated by PILEUP with branch points well separated from the next most distant branch points. In the comparisons presented here, the strongly paired ORFs had an identity score >30% and a similarity score >50% as determined by the GAP program. E.coli and M.genitalium ORFs were considered as orthologous gene pairs when the ORFs paired together strongly in the dendrogram and had an identity score >20% and a similarity score >40%. In this context, identity means having the same amino acid at the same position, whereas similarity means having a similar amino acid, as defined by the GAP program, at the same position in the two ORFs. The identity and similarity analyses were based on the truncated ORFs as noted above.
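The numeric part of these criteria can be sketched in Python as follows. This is an illustration under stated assumptions, not a reimplementation of GAP: the similarity groups are hypothetical (GAP derives similarity from its scoring matrix), percentages are taken over positions aligned without gaps, and the dendrogram pairing criterion is not modeled. The two aligned strings are made up.

```python
from typing import Tuple

# Hypothetical groups of "similar" amino acids, for illustration only.
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "NQ", "ST", "AG"]

# Identity and similarity cutoffs (%) stated in the text, keyed by the
# organism compared against E.coli.
THRESHOLDS = {"H.influenzae": (30.0, 50.0), "M.genitalium": (20.0, 40.0)}


def _similar(a: str, b: str) -> bool:
    """True if residues are identical or share a similarity group."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def identity_similarity(aln_a: str, aln_b: str) -> Tuple[float, float]:
    """Percent identity and similarity over gap-free aligned positions."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0, 0.0
    identical = sum(a == b for a, b in pairs)
    similar = sum(_similar(a, b) for a, b in pairs)
    return 100.0 * identical / len(pairs), 100.0 * similar / len(pairs)


def passes_thresholds(identity: float, similarity: float, organism: str) -> bool:
    """Check only the numeric cutoffs; dendrogram pairing is not modeled."""
    min_identity, min_similarity = THRESHOLDS[organism]
    return identity > min_identity and similarity > min_similarity


# Made-up aligned fragments ("-" marks a gap).
identity, similarity = identity_similarity("MKL-VRDE", "MRLAVKDD")
print(round(identity, 1), round(similarity, 1))
print(passes_thresholds(identity, similarity, "H.influenzae"))
```

A pair with 25% identity and 45% similarity would pass the M.genitalium cutoffs but fail the stricter H.influenzae cutoffs.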
All rRNA designations and nucleotide numbering reflect the E.coli equivalent rRNA and nucleotide position, respectively.
RESULTS AND DISCUSSION
[Psi] Synthases
Four E.coli [Psi] synthases have been identified (Table 1). The product of the truA gene, TruA (also known as HisT or [Psi] synthase I), converts U residues to [Psi] in the anticodon arm of some tRNAs (15). The truB gene product, TruB, forms [Psi] at U55 in the T-arm of all E.coli tRNAs (16). The product of the rsuA gene, RsuA, introduces the only [Psi] found in 16S rRNA (17). The product of the rluA gene, RluA, has two enzymatic activities: it catalyzes [Psi] formation at U746 in domain II of 23S rRNA and also catalyzes [Psi] formation at U32 in some tRNAs (18).
The amino acid sequences of the four known [Psi] synthases were used as probes for iterative searching of the genome sequences of E.coli, H.influenzae and M.genitalium. The search using the truA and truB probes identified an ortholog for each in H.influenzae, and an ortholog for truA, but not truB, in M.genitalium; the search did not identify any non-orthologous homologous ORFs. Searches using the rluA and rsuA probes yielded two families of homologs, one to each probe. A distant but distinct homology exists between the rluA and rsuA families (Figs 1B and 3).
(a) The H.influenzae ORFs are denoted according to Fleischmann et al. (1).
(b) The percentages of identity and similarity of the corresponding amino acid sequence to the E.coli ortholog are calculated using the GAP program from the GCG package, which aligns sequences using the Needleman-Wunsch algorithm. The ORFs were truncated at the N- and C-terminal ends as shown in Figure 2 before determining the identity and similarity.
(c) The M.genitalium ORFs are denoted according to Fraser et al. (2).
(d) The assignment of ORFs MG209 and MG370 as orthologs of yceC and yfiI is speculative (see text).
(a) The H.influenzae ORFs are denoted according to Fleischmann et al. (1).
(b) The percentages of identity and similarity of the corresponding amino acid sequence to the E.coli ortholog are calculated using the GAP program from the GCG package, which aligns sequences using the Needleman-Wunsch algorithm. The ORFs were truncated at the N- and C-terminal ends as shown in Figure 2 before determining the identity and similarity.
(c) The M.genitalium ORFs are denoted according to Fraser et al. (2).
The rluA subfamily has five homologs in the completely sequenced H.influenzae genome, four of which have orthologs in E.coli. Although an E.coli ortholog of the fifth rluA homolog found in H.influenzae has not been found, it may exist in the 26% of the E.coli chromosome that remains to be sequenced. Two rluA homologs were identified in M.genitalium: MG209 and MG370. The ORF MG209 is the more conserved of these two and is slightly more related to the E.coli ORF yceC than to the other rluA homologous ORFs (Figs 1B and 3; Table 3); however, the sequence conservation is not strong enough to assign clear orthology.
The rsuA subfamily has three homologs in E.coli, three in H.influenzae and none in M.genitalium. Two of the E.coli ORFs, yciL and rsuA, have apparent orthologs in H.influenzae. The remaining rsuA homologs in E.coli (yjbC) and in H.influenzae (HI42) are not orthologous to each other and have no apparent orthologs in the other sequences examined here (Figs 1B and 3; Table 3). The similarity of yceC, yfiI and yjbC to rluA and rsuA has previously been noted (4).
An alignment was made of the amino acid sequences of the six rsuA homologs from E.coli and H.influenzae and the eleven rluA homologs from E.coli, H.influenzae and M.genitalium. Upon aligning the two subgroups of [Psi] synthases, three conserved sequence motifs (motif I: 1-K-P-x3-2, motif II: R-L-D-x2-T-x-G-2-2-2-h and motif III: G-5-x2-1-2-R) were found in both sets of [Psi] synthases (Fig. 3).
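Motifs written in this notation can be turned into regular expressions for scanning candidate ORFs, as in the Python sketch below. Several assumptions are made: uppercase letters are read as fixed residues, `x` as any residue and `xN` as N consecutive arbitrary residues. The digit and `h` symbols stand for residue classes whose definitions are not reproduced in this text, so the class map in the example is a placeholder chosen purely for illustration, and the scanned sequence is made up.

```python
import re
from typing import Dict, Pattern


def motif_to_regex(motif: str, classes: Dict[str, str]) -> Pattern[str]:
    """Translate a hyphen-separated motif into a compiled regular expression.

    `classes` maps class symbols (e.g. "1", "2", "h") to the residues
    they allow; the caller must supply the real definitions.
    """
    parts = []
    for token in motif.split("-"):
        if token == "x":
            parts.append(".")                      # any single residue
        elif token.startswith("x") and token[1:].isdigit():
            parts.append(".{%s}" % token[1:])      # N arbitrary residues
        elif len(token) == 1 and token.isalpha() and token.isupper():
            parts.append(token)                    # fixed residue
        elif token in classes:
            parts.append("[" + classes[token] + "]")
        else:
            raise ValueError(f"unknown motif token: {token!r}")
    return re.compile("".join(parts))


# Placeholder class definitions (assumptions, NOT taken from the paper).
EXAMPLE_CLASSES = {"1": "DE", "2": "ILMV"}

motif_I = motif_to_regex("1-K-P-x3-2", EXAMPLE_CLASSES)
match = motif_I.search("AAEKPGSTLCC")   # made-up sequence
print(motif_I.pattern, match.group() if match else None)
```

Passing the class map as a parameter keeps the translation independent of any particular definition of the residue classes.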
A total of 17 [Psi]s are known to be present in E.coli RNA. Mature tRNA has seven [Psi] nucleotides. Three enzymes (RluA, TruA and TruB), which together catalyze five of the seven tRNA modifications, have been characterized (Table 1). The enzymes catalyzing the two remaining modifications in tRNA have not been identified. 16S rRNA has a single [Psi] at nucleotide 516 which is formed by RsuA (17). 23S rRNA has nine [Psi]s which, with one exception ([Psi]955), are located at the peptidyl transferase center (19). RluA catalyzes formation of [Psi]746, but enzymes for the eight remaining [Psi]s in 23S rRNA ([Psi]955, [Psi]1911, m3[Psi]1915, [Psi]1917, [Psi]2457, [Psi]2504, [Psi]2580 and [Psi]2605) have not been identified. Thus, there are a total of ten [Psi] nucleotides for which the modifying enzymes have not been identified. We have identified five new ORFs in E.coli predicted to encode RNA [Psi] synthases. The difference in numbers (ten [Psi] nucleotides versus five [Psi] synthases) may be explained by (a) enzymes having multiple substrates as with TruA and RluA, (b) E.coli genes not yet sequenced, or (c) genes not related to the major rsuA/rluA branch of [Psi] synthases as is the case for truA and truB.
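The bookkeeping in this paragraph can be checked with a short Python tally. The counts are taken directly from the text; the dictionary names are ours.

```python
# [Psi] residues per RNA species, as stated in the text.
PSI_SITES = {"tRNA": 7, "16S rRNA": 1, "23S rRNA": 9}

# Sites whose enzyme is already known: RluA (U32), TruA (U38, U39, U40)
# and TruB (U55) in tRNA; RsuA (U516) in 16S rRNA; RluA (U746) in 23S rRNA.
ASSIGNED_SITES = {"tRNA": 5, "16S rRNA": 1, "23S rRNA": 1}

NEW_PSI_SYNTHASE_ORFS = 5  # new E.coli ORFs predicted in this work

total_sites = sum(PSI_SITES.values())
unassigned = {rna: PSI_SITES[rna] - ASSIGNED_SITES[rna] for rna in PSI_SITES}
total_unassigned = sum(unassigned.values())

print(total_sites, unassigned, total_unassigned)
# Unassigned sites left over if each new ORF covered exactly one site.
print(total_unassigned - NEW_PSI_SYNTHASE_ORFS)
```

The tally gives 17 sites in total and 10 without a known enzyme, so with five new ORFs at least some enzymes must act on more than one site, or further enzymes remain to be found.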
We assume that ortholog pairs of enzymes from E.coli and H.influenzae that have the highest homology, and that also have close homologs in M.genitalium, catalyze formation of the modified nucleotides present at the most conserved locations. The most conserved [Psi] nucleotides found in ribosomal RNA are [Psi]2580 located in domain V of 23S rRNA, and two [Psi]s clustered in domain IV ([Psi]1915 and [Psi]1917) (20,21). Thus, we predict that one of the two E.coli ORFs with the most conserved orthologs (yceC and yfiI) encodes the enzyme that catalyzes formation of [Psi]2580 and that the other encodes the enzyme that catalyzes formation of one or both of the conserved [Psi]1915 and [Psi]1917 in domain IV (Table 3). It is noteworthy that TruA can modify up to three closely spaced Us in the anticodon arm of some tRNAs, thus providing precedent for multiple [Psi] modifications at closely spaced positions by a single enzyme.
2'-O Methyltransferases
The sequences of three genes encoding 2'-O methyltransferases are available: spoU in E.coli, encoding the tRNA Gm18 methyltransferase (C. Gustafsson, unpublished), tsr in Streptomyces azureus, encoding the thiostrepton resistance marker 23S rRNA Am1067 methyltransferase (22), and PET56 in yeast, encoding the 23S rRNA Gm2251 methyltransferase (23) (Table 1). The amino acid sequences of these three gene products were used as probes for iterative searching of the genome sequences of E.coli, H.influenzae and M.genitalium to identify other ORFs encoding enzymes that catalyze methylation of the 2' hydroxyl of the ribose in RNA.
Ten previously uncharacterized ORFs were found in the search which, after assignment of orthologous pairs, corresponded to four new 2'-O methyltransferases. The four previously uncharacterized ORFs in E.coli all had orthologs present in H.influenzae, and two of them (yjfH and yibK) also had orthologs present in M.genitalium. The two ORFs in M.genitalium were orthologs of the two most homologous gene pairs in E.coli and H.influenzae. The probe spoU did not have an ortholog in either H.influenzae or M.genitalium (Table 4; Fig. 1C).
An alignment made of the three known and four newly identified 2'-O methyltransferases revealed three motifs found in all of the ORFs (Fig. 4). One of the motifs, motif II (h-2-h-G-x-E-x2-G-2), consists of a series of bulky aliphatic amino acid residues followed by two conserved glycines, resembling an AdoMet binding motif (12). Two additional conserved motifs were found, motif I (3-x-N-x-G-x3-R) located at the N-terminus of the sequences and motif III (2-P-x6-S-2-N-2) located at the C-terminus (Fig. 4).
A total of seven 2'-O modified nucleotides have been found in E.coli RNA. One is in 16S rRNA (m4Cm1402) and three are in 23S rRNA (Gm2251, Cm2498 and Um2552). There are also three 2'-O modified nucleotides in tRNA; however, two of the three, Um32 and Cm32, are both pyrimidine nucleotides, occur at the same position in different tRNAs, and are likely to be formed by the same enzyme. Thus, we propose that there are six 2'-O methyltransferases in E.coli which catalyze the seven RNA modifications. One, spoU, has been previously identified and we have here identified four previously uncharacterized ORFs as putative 2'-O methyltransferases. The remaining enzyme may be (a) encoded in the part of the E.coli genome not yet sequenced, (b) accounted for by one enzyme having multiple target substrates, or (c) a member of another 2'-O methyltransferase family. The similarity of lasT, yibK and yfiF to spoU has previously been noted (4).
Since the two most highly conserved putative 2'-O methyltransferases, yibK and yjfH, are the only ORFs within this family present in M.genitalium, we suggest they encode the enzymes catalyzing the 2'-O methylations 23S Gm2251 and Um2552, which are the only modified nucleotides found in all organisms so far analyzed (20). Since the yibK ortholog set is phylogenetically more closely related to the guanosine methylase spoU (Fig. 1C), it probably encodes a guanosine methylase. Thus, we propose that yibK encodes the 23S rRNA Gm2251 methyltransferase and yjfH consequently encodes the 23S rRNA Um2552 methyltransferase.
We are currently experimentally testing the functional predictions described in this paper. So far, we have cloned and expressed three of the E.coli ORFs described above: ygcA, yceC and yfiF. Although the specific bases modified have not yet been identified, we have determined that each of the three enzymes encoded by these ORFs does indeed catalyze the formation of the predicted RNA modifications.
ACKNOWLEDGEMENT
This work was supported by grant GM-51232 from the National Institutes of Health.
4 Koonin,E.V., Tatusov,R.L. and Rudd,K.E. (1995) Proc. Natl Acad. Sci. USA, 92, 11921-11925.
5 Björk,G.R. (1996) In Neidhardt,F.C. (ed.), Escherichia coli and Salmonella. Cellular and Molecular Biology. American Society for Microbiology, Washington, DC, pp. 861-886.
6 Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) J. Mol. Biol., 215, 403-410.
7 Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444-2448.
8 Genetics Computer Group, (1994). Version 8, Madison, Wisconsin, USA.
9 Björk,G.R. and Isaksson,L.A. (1970) J. Mol. Biol., 51, 83-100.
10 Gustafsson,C., Lindström,P.H., Hagervall,T.G., Esberg,K.B. and Björk,G.R. (1991) J. Bacteriol., 173, 1757-1764.