tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic
sequence
tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence
Todd M.
Lowe
and
Sean R.
Eddy*
Department of Genetics, Washington University School of Medicine, 660 South
Euclid, Box 8232,
St Louis
, MO 63110,
USA
Received December 2, 1996;
Revised and Accepted January 17, 1997
ABSTRACT
We describe a program, tRNAscan-SE, which identifies 99-100% of transfer RNA genes in DNA sequence while giving less than
one false positive per 15 gigabases. Two previously described tRNA detection
programs are used as fast, first-pass prefilters to identify candidate tRNAs, which are then analyzed by a
highly selective tRNA covariance model. This work represents a practical
application of RNA covariance models, which are general, probabilistic
secondary structure profiles based on stochastic context-free grammars. tRNAscan-SE searches at
~ 30 000 bp/s. Additional extensions to tRNAscan-SE detect unusual tRNA homologues such as selenocysteine tRNAs, tRNA-derived repetitive elements and tRNA pseudogenes.
INTRODUCTION
Transfer RNA (tRNA) genes are the single largest gene family. A typical
eukaryotic genome contains hundreds of tRNA genes; the human genome contains an
estimated 1300 (
1
). In a time when complete genomes are being sequenced, one would like to have
an accurate means of tRNA gene identification. The tRNA repertoire of an
organism affects the codon bias seen in highly expressed protein coding genes.
In extreme cases, selective pressure for extremely high or low genomic GC
content may have caused loss of a tRNA, producing an unassigned codon (
2
,
3
). Suppressor tRNAs are important genetic loci in many model organisms. In
addition to authentic tRNA genes, tRNA-derived short interspersed nuclear elements (SINEs) have been identified
in rodents and other mammals as likely mobile genetic elements (
4
,
5
). Detection and discrimination of these elements from true tRNAs is a desirable
feature of tRNA identification methods.
It is commonly believed that the best RNA gene detection methods are custom-written programs that search for one type of RNA gene exclusively (
6
). Numerous tRNA search programs key on primary sequence patterns and/or
secondary structure specific to tRNAs (
7
-
13
). Why bother with specialized tRNA-detection software instead of using a fast, commonly available similarity
search program such as BLAST (
14
) or FASTA (
15
)? Since many functional RNA genes tend to conserve a common base-paired secondary structure better than a consensus primary sequence, the
accuracy of RNA similarity searching is much improved by including secondary
structure elements. A group of generalized RNA gene search tools look for
specific combinations of primary and secondary structure motifs specified by
the user (
16
-
21
), although tRNA `descriptors' in these pattern-matching languages have typically under performed custom-written programs.
tRNAscan 1.3 by Fichant and Burks (
12
) is perhaps the most widely used tRNA detection program. It identifies ~97.5% of true tRNA genes and gives 0.37 false positives per million base
pairs (Mbp) (
12
). The algorithm uses a hierarchical, rule-based system in which each potential tRNA must exceed empirically
determined similarity thresholds for two intragenic promoters, plus have the
ability to form base pairings present in tRNA stem-loop structures. The false positive rate of tRNAscan has been acceptable
for small genomes, but for larger eukaryotic genomes it becomes a significant
problem. It will produce ~1100 false positive tRNAs for the human genome (0.37 false positives/Mbp
for 3000 Mbp); given that there are about 1300 true tRNAs in the genome, almost
half of the tRNAs predicted by tRNAscan will be false positives.
Pavesi and colleagues have developed a different tRNA detection algorithm (
13
) which searches exclusively for linear sequence signals in the form of
eukaryotic RNA polymerase III promoters and terminators. The sensitivity and
selectivity of this algorithm is roughly comparable to tRNAscan 1.3 in
detection of eukaryotic tRNAs. Notably, the Pavesi algorithm identifies tRNAs
not detected by tRNAscan 1.3, and vice versa (
13
). The combined sensitivities of these two programs exceed 99%; however, the
combined false positive rate is about five times that of tRNAscan alone.
Eddy and Durbin (
22
) have developed a general RNA structure similarity search method employing
probabilistic RNA structural profiles, or `covariance models'. Covariance
models are able to capture both primary consensus and secondary structure
information through the use of stochastic context-free grammars (SCFGs;
22
-
24
). Much like sequence profiles (
25
,
26
), covariance models are constructed from multiple sequence alignments.
Sequences are searched against a given covariance model using a three-dimensional dynamic programming algorithm, similar to a Smith-Waterman alignment but including base-pairing terms as well. RNA covariance models have the
advantages of high sensitivity, high specificity and general applicability to
any RNA sequence family of interest, obviating the need for custom-written software for each RNA family. However, covariance model dynamic
programming algorithms are almost prohibitively CPU intensive. A tRNA
covariance model identifies >99.98% of true tRNAs, with a false positive rate
of <0.2/Mbp (
22
), but searching the human genome with a tRNA covariance model would take about
nine and a half CPU years (based on benchmarks on an SGI Indigo2 R4400/200 CPU,
140 SPECint92).
We describe here a program, tRNAscan-SE, that combines three tRNA search methods to attain the specificity of
covariance model analysis with the speed and sensitivities of optimized
versions of tRNAscan 1.3 and the Pavesi search algorithm. tRNAscan-SE detects 99-100% of true tRNAs, giving fewer than one false positive per
fifteen billion nucleotides of random sequence, at ~1000-3000 times the speed of searching with tRNA covariance models.
Additional extensions to tRNAscan-SE allow detection and accurate secondary structure prediction of unusual
tRNA species including both prokaryotic and eukaryotic selenocysteine tRNA
genes, as well as tRNA-derived repetitive elements and pseudogenes.
METHODS
tRNAscan-SE input consists of DNA or RNA sequences in FASTA format. tRNA
predictions are output in tabular, ACeDB, or an extended format including tRNA
secondary structure information. tRNAscan-SE does no tRNA detection itself, but instead negotiates the flow of
information between three independent tRNA prediction programs, performs some
post-processing and outputs the results (Fig.
1
).
RESULTS
A summary of the overall sensitivity, selectivity and search speed for the four
tRNA search programs tested is shown in Table
1
. The number of true positives is based on the percentage of tRNAs detected
within a test set taken from the Sprinzl tRNA database (Table
2
). The false positive rate is based on analysis of randomly generated sequence
data (Table
4
). The search speeds for the various programs are shown for a scan of the
current
C.elegans
genomic sequences averaging 30 Kbp per clone. tRNAscan 1.3 search speed
decreases approximately linearly with length. Search speed for tRNAscan-SE is approximately constant, but varies based on tRNA density within the
sequence.
True positives are based on detection rates within a non-organellar, non-viral subset of the Sprinzl tRNA database (Table 2). False positive
rates are estimates based on searches of randomly generated human sequence
(Table 4). Search speeds are from a search of 58.4 Mbp of
C.elegans
cosmid sequences on a Silicon Graphics Indigo2 R4400 200 Mhz workstation.
a
EufindtRNA is based on the Pavesi search algorithm which was designed to detect
eukaryotic tRNAs only; searching only eukaryotic tRNAs, EufindtRNA has a 98.6%
true positive detection rate (Table 2).
Sensitivity
tRNAscan-SE was shown to be more sensitive than tRNAscan 1.3 by several measures,
the first being a search of the Sprinzl and GenBank databases subsets (Table
2
). In the Sprinzl test set, tRNAscan-SE detected 586 of 589 known tRNAs (99.5%), versus 560 of 589 (95.1%) for
tRNAscan 1.3. Of all 1144 non-organellar tRNAs in the complete Sprinzl database, tRNAscan-SE fails to recognize seven. One was a eukaryotic sequence from
Trypanosoma brucei
(Sprinzl ID DT6050, GenBank TBTRNA3) which has been previously noted by Pavesi
et al.
(
13
) as being missed by both tRNAscan 1.3 and the Pavesi search algorithm. The
other six tRNAs missed by tRNAscan-SE were from various eubacteria (Sprinzl IDs: DA1543, DE2180, DG1351,
DG1482, DS1250 and RG1380). Several of these undetected tRNAs appear to be
irregular in source or function. DE2180 is derived from DNA from the cyanelle
(a photosynthetic organelle) of the unicellular eukaryote
Cyanophora paradoxa
and is thus misclassified as eubacterial in the database. DG1482 and RG1380
both contain substitutions of four highly conserved bases within the T[Psi]C loop, an indication that the tRNAs are probably used in synthesis of the
peptidoglycan instead of protein translation (
29
). All seven of these atypical tRNAs were detected using covariance model
analysis. The tRNA covariance model search does miss two tRNAs within the 1144-member Sprinzl database subset, both selenocysteine tRNAs (Sprinzl ID
DZ1430 and DZ7742) that pass below the 20.0 bit cutoff at 0.60 and 14.19 bits,
respectively. EufindtRNA, designed to search eukaryotic sequences exclusively,
shows improved sensitivity for eukaryotic tRNAs (98.6%) over tRNAscan 1.3
(95.0%), but is still slightly less sensitive than tRNAscan-SE (100%). Over the three phylogenetic domains, tRNA covariance model
analysis appears to be the most sensitive detection method, yet tRNAscan-SE trails by as little as one third of one percentage point.
Searching the GenBank subset sequences which contain less reliable tRNA
annotation, tRNAscan-SE detects 98.5% of the 1462 tRNAs verified by either covariance model
analysis or visual inspection, whereas tRNAscan 1.3 has a 93.4% detection rate
(Table
2
). All prediction discrepancies were visually inspected. Of the 18 tRNAs that
covariance model analysis detected but were missed by all three other methods,
all had scores over 36 bits, and were annotated in the GenBank entries. The two
tRNAs detected by tRNAscan-SE but missed by covariance model analysis were a selenocysteine tRNA
(CTTRSEL; same as previously noted Sprinzl DZ1430 tRNA), and a long tRNA from
Haloferax volcanii
(HALTGW) whose 104 bp intron caused the tRNA to exceed the maximum total length
limit for normal tRNA covariance model analysis (150 bp). Of the nine sequences
annotated as tRNAs but missed by all four detection methods, four have large
group I or group II introns of 241 bp or larger (ANATGL, SSU10482, PHU29955,
SYOTRNLUAA), and five appear to have either sequencing errors or modified bases
which appear in the GenBank annotation but not in the sequence (corresponding
tRNAs within the Sprinzl database were identified correctly by all four
detection methods). Because of sequence discrepancies between the GenBank
sequences and corresponding Sprinzl entries, these five GenBank tRNAs were not
included in the 1462-member test set.
Genome analysis
Another measure of sensitivity was derived from searching complete or partial
genomic sequence data from eubacterial, archaebacterial, yeast and
C.elegans
sequencing projects (Table
3
). For
Mycoplasma genitalium
, 33 tRNAs were noted in the published (
30
) and online gene identifications (http://www.tigr.org/tdb/mdb/mgdb/mgdb.html),
whereas 36 tRNAs were detected by three tRNA detection methods (tRNAscan 1.3, tRNAscan-SE, covariance model analysis). The three tRNAs not appearing in the
literature are for Arg (anticodon: CCT, bounds: 306615-306686, upper strand), Leu (anticodon: CAA, bounds: 448783-448861, upper strand), and Leu (anticodon: GAG, bounds: 446265-446181, reverse strand). For the completed
H.influenzae
genome, 56 tRNAs are noted in the literature (
31
) and online gene identifications (http://www.tigr.org/tdb/mdb/hidb/hidb.html).
tRNAscan-SE and covariance model analysis both identify the tRNAs noted in the
literature, plus two potentially novel tRNAs not noted in the literature:
SelCys (anticodon: TCA, bounds: 753291-753201, reverse strand) and Leu (anticodon: GAG, tRNA bounds: 1576453-1576372, intron bounds: 1576419-1576408, reverse strand). The first is a selenocysteine
tRNA and the other appears to be either a pseudogene or a true tRNA containing
a short intron. The selenocysteine tRNA identification is not unexpected; BLAST
searches identify two enzymes in the selenocysteine insertion pathway, as well
formate dehydrogenase containing a `UGA' selenocysteine-insertion codon. The evidence for the other potentially novel tRNA is less
certain. The short 12 bp `intron' would presumably require protein splicing to
generate a functional tRNA, a feature that would be novel among eubacterial
tRNAs. However, the covariance model score of 36.88 bits for the tRNA is well
above the minimum cutoff of 20 bits, indicating that the sequence is likely to
have evolutionary homology with tRNA. It is possible that it is a pseudogene.
tRNAscan 1.3 identifies 55 of the 56 tRNAs noted in the literature (Gly-B, by TIGR nomenclature, is not detected), and does not detect either of
the novel tRNAs detected by tRNAscan-SE and covariance model analysis.
The detection rates for the Sprinzl tRNA database are broken down by
phylogenetic domain. The Sprinzl subset tested contains only non-organellar, non-viral tRNAs which were not used in training of the tRNA covariance
model. For the Sprinzl database subset, numbers in parentheses indicate
percentage of correct tRNA identifications relative to total in the literature.
The GenBank subset sequences were selected by retrieving non-organellar, non-viral, full-length tRNA sequences with `tRNA' indicated in the feature
field of the entry. Since GenBank tRNA annotation is less reliable, the numbers
in parentheses for this row are the percentage of correct tRNA identifications
relative to all tRNAs verified by either covariance model analysis or visual
inspection.
a
EufindtRNA is based on the Pavesi search algorithm (13) which was designed
specifically to find only cytoplasmic eukaryotic tRNAs.
.
tRNAs identified in genomic databases by various search methods
Sequence source
Size
Literature
tRNAscan 1.3
EufindtRNA
a
tRNA CMtRNAscan-SE
(Kbp)
tRNAs
Total
(%)
Total
(%)
Total
(%)
Total
(%)
M.genitalium
580
33
36
(100)
19
(52.8)
36
(100)
36
(100)
+1 FP
H.influenzae
1830
56
55
(98.2)
42
(73.7)
58
(103.6)
58
(103.6)
+2 FP
M.jannaschii
1730
37
36
(97.3)
20
(54.0)
37
(100)
37
(100)
+1 FP
S.pombe
(through 9/96)
4176
-
45
(93.7)
46
(95.8)
48
48
(100)
+4 FP
+1 FP
S.cerevisiae
12 057
273
270
(98.5)
274
(100)
274
274
(100)
+4 FP
+10 FP
+1 pseudo
+1 pseudo
+1 pseudo
C.elegans
(through 11/13/96)
58 402
-
389
(96.5)
400
(99.2)
403
403
(100)
16 FP
+29 FP
+355 FP
+11 id pseudo
+19 pseudo
+23 pseudo
+8 pseudo
P.anserina
mitochondrion
100
27
18
(66.7)
11
(40.7)
27
(100)
22
(81.5)
`Literature' column represents the published number of tRNAs found within
genomes. `Total' columns indicate total number of tRNAs found in searches for
each program. Numbers in parentheses in (%) columns indicate percentage of
tRNAs detected relative to literature (
H.influenzae, M.jannaschii, P.anserina)
, or when published tRNA annotation is incomplete or uncertain (
M.genitalium
,
S.pombe, S.cerevisiae, C.elegans
), detection percentages are relative to total tRNAs found by tRNA covariance
model analysis and supported by manual inspection. `FP', false positives
determined by covariance model analysis and manual inspection (these do not
include pseudogenes that have strong similarity to known tRNAs). `pseudo', tRNA
identifications which appear to be pseudogenes containing 5' truncations of 3-16 bp, large insertions or deletions elsewhere, or other
characteristics of tRNA-derived repetitive elements. `id pseudo', tRNAs automatically identified
by tRNAscan-SE as likely pseudogenes which have qualities similar to manually detected
pseudogenes described above.
a
EufindtRNA is based on an algorithm (13) which was designed specifically to find
only cytoplasmic eukaryotic tRNAs.
The genomic sequence of the archaebacterium
M.jannaschii
was also analyzed. Both tRNAscan-SE and covariance model analysis identified all 37 tRNAs as given in the
literature (
32
). tRNAscan 1.3 identified 36 of the 37 tRNAs, missing the single selenocysteine
tRNA in the set. We also scanned the recently completed genomic sequence of the
budding yeast
Saccharomyces cerevisiae
(12 Mbp). The covariance model search took 14 days to complete, and produced
275 tRNAs. Based either on inspection for ability to form correct tRNA
secondary structure, or exact identity with previously characterized yeast
tRNAs, we believe 274 predicted tRNAs are true tRNAs, and one is a pseudogene
with a 7 bp 5' truncation. One of these 274 tRNAs was missing from the yeast genome
project web site annotation (http://speedy.mips.biochem.mpg.de/mips/yeast), but
this is probably an oversight since a tRNA of identical sequence is correctly
annotated elsewhere in the genome [tRNA_i_S (GCT)LR2]. tRNAscan-SE took 19 min and detected the same 275 tRNAs found by covariance model
analysis. EufindtRNA found the same 275 tRNAs in just over 1 min. tRNAscan 1.3
took ~10 h to complete, and missed four (two pairs identical in sequence) of the
274 true tRNAs found by the other three methods. Four Mbp of available genomic
sequence from
S.pombe
(fission yeast) was also analyzed. tRNAscan-SE and covariance model analysis both predict 48 tRNAs. tRNAscan 1.3
identifies 45 of the 48 predicted by covariance model analysis (two out of
three missed were identical in sequence), whereas EufindtRNA identifies 46 of
the 48 total tRNAs.
Finally, we scanned the largest set of genomic sequence currently available,
58.4 Mbp from the
C.elegans
genome project. Since only a handful of the tRNAs detected have been previously
published in the literature, we again relied on covariance model detection of
tRNAs as our best measure for `true' tRNAs. Conflicts in tRNA predictions
between tRNAscan 1.3, tRNAscan-SE and covariance model analysis were all examined manually for highly
conserved primary sequence motifs and proper secondary structure. As most tRNA
species are multicopy in eukaryotes, BLAST similarity searches were used to
help discern `false positives' from pseudogenes. We define false positives as
predicted tRNAs which do not appear to be evolutionarily derived from true
tRNAs. These false positives are assessed by failure to form recognizable tRNA
secondary structure and the lack of related tRNAs elsewhere in the genome.
Pseudogenes, on the other hand, usually have at least partial tRNA secondary
structure, plus clear deletions or insertions relative to at least one related,
intact tRNA elsewhere in the genome. tRNA-derived mobile elements also have recognizable primary sequence similarity
to tRNAs, although most have poor tRNA secondary structure similarity. Of the
403 complete tRNAs detected by covariance model analysis, tRNAscan-SE detected all 403 tRNAs (100%), whereas tRNAscan 1.3 detected 389
(96.5%) and EufindtRNA found 400 (99.2%).
Taken together, the data analyzed from the
M.genitalium, H.influenzae, M.jannaschii, S.cerevisiae, S.pombe
and
C.elegans
genomes, 100% of the 856 tRNAs detected by covariance model analysis were found
by tRNAscan-SE. tRNAscan 1.3 detected 831, missing 25 tRNAs identified by covariance
models, a 97.1% detection rate. EufindtRNA detects 93.5% of the 856 tRNA set,
but if only eukaryotic genomes are considered, the program finds 720 of 725
(99.3%).
.
False positive rates for actual and simulated genomes
Size (Mbp)
tRNAscan 1.3
a
EufindtRNA
tRNA CMtRNAscan-SE
FP
FP/Mbp
FP
FP/Mbp
FP
FP/Mbp
FP
FP/Mbp
S.cerevisiae
Actual FP (completed genome)
12.0
4
0.33
10
0.83
0
<0.08
0
< 0.08
C.elegans
Actual FP (portion completed)
58.4
29
0.50
355
6.08
0
<0.03
0
< 0.03
Simulated FP (total genome)
100
42.5
0.42
26
0.26
0
<0.01
0
< 0.001
Human
Actual FP (portion completed)
5.32
3
0.56
5
0.94
0
<0.19
0
< 0.19
Simulated FP (total genome)
3000
1118
0.37
684
0.23
ND
-
0
< 0.00007
a
Searches performed with tRNAscan 1.4, but all false positives verified with
unaltered tRNAscan 1.3.
`Actual FP' rows contain false positives detected in actual genomic sequence.
`Simulated FP' rows contain the false positives found in whole-genome scale random sequence simulations (10 trials for
C.elegans
, five for human). For tRNA covariance model searches (tRNA CM), only one random
C.elegans
and no human genome simulations were performed due to extreme CPU demands (ND,
not done).
.
Analysis time in hours required for various complete genomes and tRNA search
algorithms
Complete
Size
tRNAscan 1.3
EufindtRNA
tRNA CM
tRNAscan-SE
genome
(Mbp)
(CPU hours)
(CPU hours)
(CPU hours)
(CPU hours)
P.anserina
mitochondrion
0.1
0.14
<0.001
2.8
0.019
H.influenzae
1.8
2.54
<0.001
51
0.069
S.cerevisiae
12
16.7
0.02
333
0.33
C.elegans
100
139
0.15
2780
1.8
Human
3000
>4170
7.1
83 300
36.6
Actual genome scan times are given for tRNAscan-SE and EufindtRNA (genome simulation times used for human). Estimated scan
times are given for tRNAscan 1.3 (400 bp/s) and tRNA covariance model analysis
(tRNA CM; 20 bp/s).
Selectivity
While the `sensitivity' of an algorithm is measured by the proportion of true
positives identified in reference sequences, a method's `selectivity' is
measured by its ability to avoid misidentifying unrelated sequences as true
tRNAs. Increased sensitivity is usually gained at the expense of an increased
false positive rate. A rate of one false positive per five to ten million bases
of sequence has, in the past, been acceptable since the total amount of
uncharacterized or non-protein coding sequence in the databases has been relatively small.
However, with the advent of whole-genome sequencing projects on the megabase scale, this false positive rate
is of much greater concern.
Assessing the ability of an algorithm to discriminate between true and false
positives using biological sequence data can be difficult. At false positive
rates of less than one per million bases, there is not enough well annotated
sequence in the public databases to give a reliable indication of an
algorithm's true performance. Even for the data that is available, it is
uncertain whether or not an accurate prediction has been made in the absence of
biochemical experimental evidence. An alternative strategy is to generate random nucleotide sequence which is known to have no biologically-derived genes. An unlimited amount of random sequence can be generated
based on a general or species-specific genomic nucleotide frequency. Each identification of a tRNA gene
in this random sequence can then be confidently counted as a false positive.
False positives due to biologically-derived repetitive elements or pseudogenes are not taken into account in
these synthetic test sequences, and must be addressed separately.
We generated two types of random sequence sets to simulate the size and GC
content of the
C.elegans
and human genomes (100 million and 3 billion bases of random sequence,
respectively, as described in the Methods). The number of false positives found
with each algorithm appear in Table
4
along with false positive rates from actual genomic sequence (discussed below).
Analysis of the simulated genomes gave consistent false positive rates between
the various trials, at ~0.40 false positives per million bases for tRNAscan 1.3, a little more than
half that for EufindtRNA, and zero for both tRNAscan-SE and covariance model analysis. In 10 independent
C.elegans
genome simulations, an average of 42.5 tRNAs were identified by tRNAscan 1.4.
The sequences for the false positive tRNAs were saved and analyzed with the
original tRNAscan 1.3 program to confirm that false positives were due to the
tRNAscan 1.3 algorithm, not the modifications introduced in tRNAscan 1.4.
EufindtRNA misidentified an average of 26 false positives per simulated
C.elegans
genome. Both tRNAscan-SE and the tRNA covariance model searches found zero positives for every
trial (only one genome simulation was searched with the tRNA covariance model
due to the extreme CPU demands). As seen in Table
5
, minor differences among analysis times for the various methods for microbial
genomes become substantial when analyzing larger eukaryotic genomes. Analysis
of the single
C.elegans
genome simulation with covariance models required almost four CPU months.
For the five human genome simulations, tRNAscan 1.4 produced an average of 1118
false positives per genome (had tRNAscan 1.3 been used, it would have taken
almost half a CPU year per trial). EufindtRNA searched the simulated genomes in
just over 7 h per trial, giving an average of 684 falsely predicted tRNAs for
each. Had we searched the entire 3 billion nucleotide human genome simulation
with tRNA covariance model analysis, it would have taken over nine CPU years
for each trial (Table
5
). Based on the histogram of covariance model scores against 500 million bases
of simulated human sequence data (not shown), we estimate that the tRNA
covariance model search of the simulated human genome would have produced zero
false positives. tRNAscan-SE required an average of 1.5 days to scan each of the three billion
nucleotide test sets, and produced no false positives in any of the five trials
(the exact same sequences were used as in the trials described above for
tRNAscan 1.4 and EufindtRNA).
A concern not addressed by the random sequence genome simulations is the `false
positive' rate caused by certain classes of SINEs that are suspected to be
derived from tRNA genes (
4
,
5
). These elements have similarity to known tRNA genes and contain well conserved
RNA polymerase III internal A and B box promoters. To assess tRNAscan-SE's ability to identify and exclude these types of pseudo-tRNAs, the repeat element database
Repbase
maintained by Jerzy Jurka (ftp://ncbi.nlm.nih.gov/repository/repbase) was
scanned. Of the reference sequences searched, tRNAscan-SE did not produce any false positive tRNA identifications. Covariance
model analysis, however, did misidentify 12 of 775 rodent B2 SINE sequences and
two ALU-like sequences (bovine ALU-like repetitive element and rat ALU type III-like repetitive element), all with scores between 20 and 28
bits. Rat identifier (ID or R.dre.1) sequences, also known to have high
similarity to alanine, proline and other tRNAs, were searched within GenBank
and dbEST (database of expressed sequence tags,
33
). tRNAscan-SE misidentified four rat ID element sequences total, one from GenBank
(RATRSIDH) and three from dbEST (R46943, R46943 and R82886). The extreme
sensitivity of covariance model analysis is also unable to distinguish between
these SINEs and true tRNAs, giving bit scores between 24.5 and 33.1 bits.
tRNAscan 1.3 requires strong adherence to secondary structure rules, thus does
not call any of these pseudogenes as tRNAs. The rest of
Repbase
, including consensus and database collections of ALU, L1, THE, MIR, MIR2, THR
and B1 repetitive elements, were also searched with tRNAscan-SE, giving no other false positives.
The selectivity of tRNAscan has already affected genome sequence annotation
detrimentally. In 58.4 Mbp of
C.elegans
genomic sequence, tRNAscan 1.3 produced 29 tRNAs which were judged to be false
positives (0.50 FP/Mbp) based on searching with the tRNA covariance model,
visual inspection of secondary structure and lack of primary sequence
similarity to any other tRNAs within the genome. Since both the Washington
University Genome Sequencing Center (St Louis) and the Sanger Center
(Cambridge, UK) used tRNAscan 1.3 in semi-automated sequence annotation until very recently, 16 of these 29 false
positives are annotated as tRNAs in finished, submitted GenBank entries. This
false positive rate is very close to that seen in the random
C.elegans
genome simulation (0.42 FP/Mbp), giving additional confidence to the estimates
based on simulated sequence data.
tRNAscan-SE produced no obvious false positives in the
C.elegans
genomic sequence, but did identify eight tRNAs that were judged to be possible
pseudogenes by manual inspection (Table
3
). Eleven other tRNAs were automatically identified as pseudogenes via primary
or secondary structure scores that fell below minimum values described in the
methods. All 19 pseudogenes had strong similarity to other tRNAs within the
genome, and contained unusual features such as 3-16 bp truncations of the 5'-end of the gene, or other large insertions or deletions
within the sequence. One could consider detection of these possible pseudogenes
a desirable feature of tRNAscan-SE's sensitivity. Further studies of these unusual tRNAs may help better
elucidate aspects of genome dynamics, genetic element mobility and evolution.
Selenocysteine tRNA detection
There are not enough selenocysteine tRNA sequences to properly evaluate tRNAscan-SE's selenocysteine detection accuracy. Three selenocysteine tRNAs (one
each from
H.influenzae
,
M.jannaschii
and
C.elegans
) were detected in recent genome sequence data. The
H.influenzae
tRNA, previously unrecognized in the literature, was detected by the
prokaryotic selenocysteine-specific routines and covariance model. The tRNA from the distantly
related
M.jannaschii
, however, was detected by the standard EufindtRNA algorithm and general tRNA
covariance model. The failure of the specialized routines may have been due in
part to the fact that this is the first and only archaebacterial selenocysteine
tRNA available to date. For the remaining non-archaeal selenocysteine tRNAs, use of the specialized models boosts
covariance model scores from the 20-40 bit range to 45-72 bits. Since accurate tRNA secondary structure prediction relies
on correct alignment of the tRNA sequence to the covariance model, use of
selenocysteine-specific models for these tRNAs improves the accuracy of structure
predictions. A search of the non-redundant database (nrdb) maintained at NCBI revealed no new
selenocysteine tRNAs from species for which there was no previously noted
sequence.
Intron detection
tRNAscan-SE correctly predicted the introns for the 13 species of intron-containing tRNAs in the
S.cerevisiae
genome (
34
). tRNAscan 1.3 often gives multiple intron predictions for each tRNA, making
correct placement uncertain. EufindtRNA does not attempt to predict intron
boundaries at all (
13
).
Detection of tRNAs containing long introns, usually group I or group II, is
problematic. The default maximum tRNA length for tRNAscan-SE is 192 bp, but this can be increased (option -L <max length>) to allow searches with no practical limit on tRNA
length. In the first phase of tRNAscan-SE, EufindtRNA searches for A and B boxes of the specified maximum
distance apart, and passes only the 5' and 3' tRNA ends to covariance model analysis for confirmation (removing
the bulk of long intervening sequences). Using this option, tRNAscan-SE was able to detect three of the four long tRNAs initially missed by all
four methods in the GenBank tRNA subset search (the fourth tRNA was
undetectable with EufindtRNA even with the intron removed before analysis).
Group I or II introns in tRNAs tend to occur in positions other than the
canonical position of protein-spliced introns, so tRNAscan-SE mispredicts the intron bounds and anticodon sequence for these
cases. 5' and 3' tRNA bounds were correct for all three unusual tRNAs.
Performance on mitochondrial tRNAs
Although tRNAscan-SE was designed with non-organellar tRNA detection in mind, we also tested it on a complete
mitochondrial genome, that of
Podospora anserina
(GenBank ID PANMTPACGA). tRNAscan-SE detected 22 of the 27 annotated tRNAs (81.5%), tRNAscan 1.3 detected 18
of 27 (66.7%) and covariance model analysis detected all 27 tRNAs (Table
3
). Since organellar genomes are usually small, the computational demand of
covariance model analysis alone (without the use of fast first-pass scanners) is not prohibitive. For this reason, tRNAscan-SE can be run in covariance model analysis-only mode (-C option) for maximum sensitivity, bypassing dependence
on tRNAscan 1.4 and EufindtRNA. This mode gives the same results as would be
obtained by running the covariance model search program alone, but in addition,
produces annotated tRNA output identical in format to that found in the default
tRNAscan-SE search mode.
DISCUSSION
Speed, sensitivity and selectivity
The most sensitive and selective tRNA detection method that we are aware of
utilizes probabilistic RNA covariance models (
22
), which are based on stochastic context-free grammar techniques. However, searching with covariance models has two
drawbacks. First, it is extremely CPU intensive, requiring days to weeks of
processor time to scan megabase-size genomic data from higher eukaryotes. Second, the general nature of
the approach hampers output of tRNA-specific feature information such as anticodon, isotype and intron
position. Our goal in the development of tRNAscan-SE was to produce a practical (i.e. fast) application of stochastic
context-free grammar-based RNA analysis methods with sensitivity and selectivity as close
as possible to using native covariance model searches. tRNAscan-SE achieves this goal.
tRNAscan-SE increases tRNA covariance model search speed by 1000-3000-fold while offering nearly equal sensitivity and slightly
improved selectivity. Selenocysteine tRNA detection features are built into
tRNAscan-SE, including modifications to EufindtRNA and the use of selenocysteine
tRNA covariance models. With these additions, tRNAscan-SE correctly identifies both of the selenocysteine tRNAs in the Sprinzl
database not detected by normal covariance model analysis. The GenBank version
of one of these two selenocysteine tRNA sequences, CTTRSEL from
C.thermoaceticum
, was also detected within the GenBank tRNA subset (the other selenocysteine
tRNA was not in the GenBank subset).
tRNAscan-SE also extends the maximum length of tRNAs detectable to almost any
length. In covariance model analysis, search time increases as the square of
the maximum tRNA length, so the search window has typically been limited to 150
bp. In tRNAscan-SE, the first-pass scanners define the approximate bounds of a tRNA, and for tRNAs
with very long introns, intervening sequences can be cut out based on the first-pass analysis. This allows detection of rare, abnormally long tRNAs
without greatly increasing the overall average search time. In the GenBank
subset, tRNAscan-SE detected four tRNAs (HALTGW plus three detected with the -L option) whose introns, ranging from 104 to 850 bp, exceeded the
normal length limit for covariance model detection.
tRNA false positives and pseudogenes
Of the 5591 total false positives identified by tRNAscan 1.4 in 15 gigabases of
simulated human sequence (Table
4
), in only six instances did it agree with EufindtRNA (relaxed parameters) in
falsely identifying a sequence as a tRNA. The majority of false positives found
by tRNAscan 1.4 seem to have tRNA-like secondary structure but lack similarity to conserved tRNA primary
sequence. EufindtRNA, on the other hand, identifies correctly spaced primary
sequence promoter elements, yet tends to err because it does not check for
proper tRNA secondary structure.
These observations hold up on examination of false positives from actual genomic
sequence from
C.elegans
. Most of the 29 false positives identified by tRNAscan 1.3 were discarded by
covariance model analysis because of the lack of primary sequence similarity to
the general tRNA model. EufindtRNA, on the other hand, more commonly identifies
pseudogene tRNA fragments, SINE-like repetitive elements or other tRNA-like sequences containing A and B boxes (Table
3
). Pseudogenes are recognizable since part of the sequence is very similar to
other intact tRNAs, in spite of truncations or large insertions elsewhere in
the pseudogene. However, tRNA secondary structure in pseudogenes and SINE-like elements tends to be lost more quickly than primary sequence promoter
elements. This may not be surprising in light of the observation that portions
of tRNA sequences are thought to help provide mobility for some tRNA-derived repetitive elements (
35
). Since EufindtRNA (relaxed parameters) only looks for canonical promoter
regions, it is prone to finding these instances of pseudogenes and repetitive
elements with tRNA promoters in the absence of structural tRNA features.
To some extent, covariance model analysis is also apt to identify truncated
tRNAs and other tRNA-derived sequence elements. The minimum cutoff score of 20 bits has been
set to include outlying tRNAs with low overall homology to the general tRNA
model. However, if a part of a high-scoring tRNA is truncated, the score may be much lower, but still exceed
the 20 bit threshold. The most extreme example of this occurs with a tRNA in
the
C.elegans
cosmid W03A3. The tRNA has 100% identity with tRNAs on at least four other
cosmids, except for a truncation of the first 16 bases that removes the 5' side of the aminoacyl acceptor stem and the first half of the A box
promoter sequence (part of the D-loop). tRNAscan 1.3 did not detect this pseudogene because of the lost
base pairings in the D-loop and aminoacyl stems, whereas EufindtRNA could not locate the A box
promoter sequence. Covariance model analysis similarly identified three other
pseudogenes that neither tRNAscan 1.3 nor EufindtRNA found: one appears to have
a 13 bp truncation relative to tRNAs in two other cosmids; one has a peculiar
21 bp insertion in the middle of the A box promoter sequence that makes three
near-perfect repeats of the 7mer `GTCGCGA'; and one cosmid has a pseudo tRNA
containing a 55 bp insert in the anticodon loop that does not appear to be a
true intron. Since none of these were identified by either tRNAscan 1.3 or
EufindtRNA, tRNAscan-SE necessarily does not detect them.
tRNAscan-SE does, however, detect 19 other tRNA-like sequences that are identified by EufindtRNA and `confirmed' by
covariance model analysis (scores greater than 20 bits). These may or may not
be pseudogenes. Nine of these involve 5' truncations of 3-15 nucleotides relative to other tRNAs in the nematode. It is
impossible to determine by computational analysis alone if these are functional
tRNAs or inactive pseudogenes. In either case, it is important to be aware of
these possible tRNA pseudogenes for possible further experimental and/or
computational study. Elucidating a common transpositional mechanism for
preferential loss of the 5'-end of these tRNAs is a question of interest.
Conclusion
tRNAscan-SE has been designed with the demands of human genome analysis in mind,
but can be used for any DNA sequence. We estimate that tRNAscan-SE will detect ~99.5% of the true tRNAs in the human genome, give zero false
positives (except for tRNA-derived SINEs and tRNA pseudogenes), and take ~36 h.
tRNAscan-SE demonstrates that general RNA structural profiles, covariance models,
can be used as the basis for very sensitive RNA similarity searching. The
primary limitation is speed. Although the strategy of using fast first-pass tRNA scanners in combination with second-stage covariance model analysis is effective here, this is not an
attractive general strategy for searching for other RNA gene family members.
Except for group I introns (
36
), there are no fast, specialized algorithms for detection of other RNA gene
families, and much effort is required for creating these highly specialized new
programs. Further work will focus on algorithmic development of covariance
model search methods that will reduce both time and memory requirements,
allowing faster searches for larger RNA genes without the need for first-pass screens.
REFERENCES
1 Hatlen,L. and Attardi,G. (1971) J. Mol. Biol., 56, 535-553.MEDLINE Abstract
2 Oba,T., Andachi,Y., Muto,A. and Osawa,S. (1991) Proc. Nat. Acad. Sci.USA, 88, 921-925.MEDLINE Abstract
3 Kanna,A., Ohama,T., Abe,R. and Osawa,S. (1993) J. Mol. Biol., 230, 51-56.
4 Daniels,G.R. and Deininger,P.L. (1984) Nature, 317, 819-822.
5 Deininger,P.L. (1989) In Berg,D.E. and Howe,M.M. (eds), Mobile DNA. ASM, Washington, pp. 619-636.
6 Dandekar,T. and Hentze,M.W. (1995) Trends Genet., 11, 45-50.MEDLINE Abstract
23 Grate,L., Herbster,M., Hughey,R., Haussler,D., Mian,I.S. and Noller,H. (1994) Proceedings, Second International Conference on Intelligent Systems for Molecular Biology, 2, 138-146.