| Nucleic Acids Research | Pages |
Combining diverse evidence for gene recognition in completely sequenced bacterial genomes
Introduction
Materials And Methods
Data
Outline of the algorithm
Similarity search and the set of seed ORFs
Coding potential and the complete set of candidate ORFs
RBS weight matrix and assignment of gene starts
Choice of the minimal allowed ORF length and handling of short ORFs
Implementation and availability
Results And Discussion
Acknowledgements
References
ABSTRACT
INTRODUCTION
The principal goal of large-scale genome sequencing is to obtain new insights into physiological and biochemical processes in living organisms. An essential step in this process is gene identification with subsequent computer-based annotation of the corresponding gene products. Although bacterial genomic sequences are devoid of introns, gene recognition in bacteria is far from simple. It is easy to extract all possible open reading frames (ORFs) from a given DNA sequence; it is much less trivial to decide which of them correspond to genes that are actually expressed and code for proteins. The following features are important indicators of protein coding regions in DNA: (i) sufficient ORF length, since long ORFs rarely occur by chance; (ii) specific patterns of codon usage that are different from triplet frequencies in non-coding regions (`coding potential'); (iii) the presence of ribosome binding sites (RBS) in the (-20)...(-1) region upstream of the start codon that help to direct ribosomes to the correct translation start positions (1), where a part of the RBS is formed by the purine-rich Shine-Dalgarno (SD) sequence, which is complementary to the 3′ end of the 16S rRNA (2); (iv) similarity to known, especially experimentally characterized, gene products.
Correspondingly, the approaches to gene recognition are traditionally divided into two broad categories (3,4). Intrinsic, or ab initio, methods utilize statistical, linguistic or pattern recognition algorithms to find genes in DNA through detection of specific nucleic acid motifs or global statistical patterns, whereas extrinsic methods take into account information about other known proteins.
There exist numerous algorithms for ab initio recognition of protein-coding regions and functional sites (reviewed in 5,6). The most popular gene prediction program for prokaryotes, GeneMark (6), utilizes non-homogeneous Markov models to find DNA regions that code for proteins or are complementary to them. Non-coding regions are described by homogeneous Markov models. A Bayesian decision rule is applied to deduce the coding capacity of sliding windows. GeneMark has been used in several genome sequencing projects (e.g. 7-9). Recently this algorithm was extended to take into account information about candidate ribosomal binding sites (10). The recently developed GLIMMER (11) has been reported to provide very high gene prediction accuracy in Haemophilus influenzae and Helicobacter pylori genomes. GLIMMER relies on interpolated Markov models to take into account DNA oligomers of varying length, dependent on the local composition of the sequence. Another program, EcoParse (12), utilizes hidden Markov models to find the maximum likelihood parse of a DNA sequence into coding and non-coding regions without the use of sliding windows. A program for gene recognition relying solely on ORF length and RBS was described by Hatzigeorgiou and Fickett (13).
Extrinsic analysis involves similarity searches with candidate gene products against protein sequence databanks. The most popular program of this class, BLASTX (14), performs six-frame translation of the query DNA and compares the resulting amino acid sequences to known proteins. Search results are represented in an integrated report, with hits from different reading frames combined to produce one statistically meaningful similarity score. Robinson et al. (15) used BLASTX to detect 450 new bacterial genes missed in original publications, including several genes previously known only in eukaryotes. Another DNA-protein search program, DPS (16), is the only currently available software tool that allows us to compare a complete genome sequence (3-5 Mbp and more) with the total protein sequence databank in one pass.
Neither extrinsic nor intrinsic methods taken separately can ensure successful prediction. Practical experience in prokaryotic genome analysis as well as the recent trends in gene recognition in higher organisms (17,18) show that it is necessary to incorporate all available evidence in order to achieve reliable results. In practice, putative coding regions predicted by intrinsic methods are verified by similarity searches. Finding a related protein serves as decisive supporting evidence. Pearson et al. (19) studied the ORFs predicted with GeneMark in H.influenzae, Methanococcus jannaschii and Mycoplasma genitalium. In many cases they were able to correct the length of genes based on comparative analysis with known proteins. Additionally they found many short genes not identified by GeneMark. The overall conclusion of this work is that a sizeable number of genes annotated within the framework of large-scale sequencing projects are fully or partially wrong.
Experience coming from many computational genome analysis efforts (20-22) shows that 60-80% of genes in newly sequenced organisms have known counterparts in other species. In many cases the similarity is only marginal, partial or to gene products without known function. However, in at least 30% of the cases, reliable global alignments with well characterized proteins can be obtained. We were thus tempted to invert the usual procedure in which genes predicted by ab initio statistical methods are accepted or rejected based on subsequent similarity searches. In this work we use DNA regions significantly related to known proteins to extract codon usage statistics and other intrinsic recognition parameters that are further applied to unexplored parts of a genome. The leading idea of this work is that extrinsic evidence should be given higher priority than intrinsic information.
We also pay specific attention to the assignment of gene starts. This is important since the 5′ ends of genes are often not conserved, whereas they carry important functional and structural information. In particular, a signal peptide may provide information about protein localization (23). The N-terminus can contain information about the life span of a protein (24). The estimated strength of the RBS can be an indicator of the efficacy of translation initiation (25). Thus correct determination of the gene start can be as important as identification of the gene itself.
MATERIALS AND METHODS
Data
Complete nucleotide sequences of the Bacillus subtilis (9) and Escherichia coli (8) genomes were downloaded from the SubtiList WWW Server at the Pasteur Institute (http://www.pasteur.fr/Bio/SubtiList.html) and the E.coli genome project resource at the University of Wisconsin (http://ecoliftp.genetics.wisc.edu/), respectively. In addition, we obtained full sets of the protein sequences encoded in these two genomes (4099 for B.subtilis and 4277 for E.coli) as assigned by the genome authors based on the application of various computational techniques as well as manual analysis.
Independently, B.subtilis and E.coli sequences were extracted from the PIR-International protein sequence database using the Sequence Retrieval System (SRS; 26). Special care was taken to select only sequences determined by individual researchers in different laboratories not associated with large-scale sequencing projects. Sequences submitted after 1995, plasmid sequences, fragments and proteins described in PIR as hypothetical, as well as the PIR entries containing the names of the main researchers involved in the B.subtilis and E.coli genome projects were discarded. These two sets were compared with the full sets of gene products from the two genomes using the BLAST2 software (W.Gish, unpublished; 27,14). Only the PIR sequences at least 98% identical to their counterparts in complete genomes and having the same N-terminal sequence were retained. This selection procedure resulted in 219 E.coli and 346 B.subtilis proteins.
Throughout this work, the full sets of gene products from complete genomes as determined by the authors will be referred to as SUBGEN (for B.subtilis) and ECOGEN (for E.coli), and the sequence sets extracted from the PIR database as SUBPIR and ECOPIR, respectively.
For similarity searches we created a non-redundant protein databank by merging the PIR-International (28), SWISS-PROT, SWISS-NEW, TREMBL and TREMBLNEW (29) sequence collections using the NRDB2 software developed by W.Gish (unpublished). Sequences from all species for which genome sequencing projects have been completed were excluded. The resulting databank currently contains 208 660 protein sequences.
Outline of the algorithm
Our algorithm is based on the assumption that information about coding regions derived from similarity searches is in principle more reliable than statistical data. We use the term `seed ORF' to describe the minimal, most reliable possible ORF that can be inferred. In the case of similarity searches, a seed ORF is obtained by extending the reliably aligned region in the upstream direction until the first start codon occurs and in the downstream direction until a stop codon is encountered. These similarity-derived seed ORFs are used to calculate coding potential parameters. For ORFs predicted ab initio a seed ORF results from extending a DNA region of a given minimal length (e.g., 300 nt) possessing sufficiently high coding potential (see below) in the same fashion. At the next step of analysis the algorithm tries to extend the seed ORFs by including additional upstream DNA fragments encompassing the next available start codon provided that the DNA region between the old and new candidate starts satisfies conditions imposed on coding potential. The sample of ORFs with a single possible start codon is used to derive the RBS recognition matrix. Finally, in ORFs with multiple candidate starts, the leftmost start codon having sufficiently strong RBS is selected.
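The construction of a similarity-derived seed ORF described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ORPHEUS code: the function name `seed_orf`, the 0-based end-exclusive coordinates, the forward-strand-only handling and the requirement that the aligned region starts on a codon boundary are all hypothetical choices made for the example.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def seed_orf(seq, aln_start, aln_end):
    """Extend an in-frame aligned region to a seed ORF on the forward strand.

    aln_start/aln_end are 0-based, end-exclusive, and aln_start is assumed
    to sit on a codon boundary of the coding frame (illustrative convention).
    Returns (start, end) or None if no start or stop codon is found.
    """
    # Upstream: step back codon by codon until the first start codon.
    start = None
    pos = aln_start
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break                      # ran into the previous in-frame stop
        if codon in START_CODONS:
            start = pos
            break
        pos -= 3
    # Downstream: step forward until the first in-frame stop codon.
    end = None
    pos = aln_end
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            end = pos + 3
            break
        pos += 3
    if start is None or end is None:
        return None
    return start, end
```

For the toy sequence `TTATGAAACCCGGGTTTTAACC` with the aligned region covering `CCCGGG` (positions 8 to 14), the function returns the ORF `ATGAAACCCGGGTTTTAA`.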
Similarity search and the set of seed ORFs
We used the DPS program (16) to compare complete genomic sequences with the complete non-redundant protein sequence databank. DPS performs mapping of all protein sequences from the database onto the query genomic sequence. The DPS output contains full information about a DNA-protein match, including the start and end positions, reading frame, similarity score and alignment of the high-scoring DNA segment with the corresponding protein sequence fragment represented in three-letter code. The alignment may be split into several high-scoring fragments, in which case reading frames, coordinates and similarity scores are given for each such fragment, and the aggregate similarity for the entire alignment is indicated separately. We took into account only DPS hits with sufficiently high aggregate scores (typically >750) involving only one reading frame; cases involving more than one reading frame are subject to a separate procedure aimed at detecting frame-shifts.
Coding potential and the complete set of candidate ORFs
Seed ORF sets produced by the similarity search were utilized to calculate the codon usage tables and the average and standard deviation of the coding potential. To do that, the significant DNA-protein alignments were extended until the first stop codon downstream and the first start codon upstream occurred. The obtained DNA sequences are the most reliable representatives of the coding parts of the genome that can be extracted automatically.
Let F(abc) be the genomic frequency of the codon abc. Statistical weight of abc is defined as W(abc) = log F(abc). Primary coding potential of a DNA segment of length n codons is its log-likelihood:
$$Q(a_1b_1c_1 \ldots a_nb_nc_n) = \sum_{i=1}^{n} W(a_ib_ic_i)$$
To account for DNA fragments of different length, we will use the normalized potential measured in the standard deviation units:
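$$R = \frac{\sum_{i=1}^{n} W(a_ib_ic_i) - nE}{D\sqrt{n}}$$
$$E = \sum_{abc} F(abc)\,W(abc)$$
$$D^2 = \sum_{abc} F(abc)\,W(abc)^2 - E^2$$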
Finally, to avoid the influence of the local base composition and gene shadows, and to set the strand and reading frame, we define the coding quality of a DNA fragment as
[Equation image missing: definition of the coding quality Ω.]
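As a minimal sketch of the codon statistics defined in this section, the Python fragment below derives codon frequencies F and weights W = log F from a set of seed ORFs and evaluates the log-likelihood of a DNA segment. The normalization shown, (Q - nE) / (D sqrt(n)) with E and D the mean and standard deviation of the codon weight, is an assumption about what "standard deviation units" means here; the pseudocount for unseen codons and all function names are likewise illustrative. The coding quality Ω is not covered.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def codons(seq):
    """Split a sequence into complete in-frame codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def codon_model(seed_orfs, pseudo=1.0):
    """Return F, W, E, D estimated from seed ORFs (A, C, G, T only).

    The pseudocount is an assumption that keeps log F finite for
    codons never observed in the seed set.
    """
    counts = Counter({c: pseudo for c in ALL_CODONS})
    for orf in seed_orfs:
        counts.update(codons(orf))
    total = sum(counts[c] for c in ALL_CODONS)
    F = {c: counts[c] / total for c in ALL_CODONS}
    W = {c: math.log(F[c]) for c in ALL_CODONS}
    E = sum(F[c] * W[c] for c in ALL_CODONS)            # mean codon weight
    D = math.sqrt(sum(F[c] * W[c] ** 2 for c in ALL_CODONS) - E ** 2)
    return F, W, E, D

def primary_potential(seq, W):
    """Log-likelihood of the segment: sum of codon weights."""
    return sum(W[c] for c in codons(seq))

def normalized_potential(seq, W, E, D):
    """Primary potential in standard deviation units (assumed form)."""
    n = len(codons(seq))
    return (primary_potential(seq, W) - n * E) / (D * math.sqrt(n))
```

With this normalization, segments of different length become comparable: a segment built from frequent codons scores above one of equal length built from rare codons.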
Upon derivation of the statistical parameters above, the DPS output was screened again to extract all similarity-based seed ORFs. The parts of the genome not covered by the similarity-based seed ORFs were subsequently analyzed for the presence of other protein-coding seed ORFs. A seed ORF was accepted if its length exceeded a given threshold (100 codons) and its coding quality Ω was sufficiently high.
All seed ORFs were then extended in the 5′ direction as far as possible. Short overlaps between genes (up to 6 nt) were allowed. Each extension piece of DNA started with ATG, GTG or TTG. The extension was accepted if the coding quality Ω of the DNA segment of length 99 nt starting with the new candidate start codon was acceptable; otherwise the extension of a given ORF was interrupted. The window length of 99 was chosen to ensure sufficient statistical significance of the calculated coding potential (30).
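The upstream extension rule can be illustrated with the following Python sketch. It is a simplified reading of the paragraph above, not the published implementation: the scoring function is passed in as a hypothetical `quality` callable, the default threshold of -1.0 is borrowed from the caption of Figure 3, only the forward strand is handled, and `min_pos` stands in for the limit imposed by an upstream neighbor (its end minus the allowed 6 nt overlap).

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def extend_upstream(seq, start, quality, threshold=-1.0, window=99, min_pos=0):
    """Move an ORF start upstream while each new candidate start passes.

    quality: callable scoring a DNA window (hypothetical stand-in for the
    coding quality). Returns the most upstream accepted start position.
    """
    best = start
    pos = start - 3
    while pos >= min_pos:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break                          # the open frame ends here
        if codon in START_CODONS:
            if quality(seq[pos:pos + window]) >= threshold:
                best = pos                 # extension accepted
            else:
                break                      # extension interrupted
        pos -= 3
    return best
```

In a toy run with a 6 nt window and a quality function that penalizes windows containing `CCC`, the start of `ATGCCCGTGAAAATGAAATAA` moves from position 12 to the GTG at position 6 and stops there, because the window beginning at the ATG at position 0 fails the test.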
This procedure resulted in the complete set of `open-start' candidate ORFs. The start codons for this set were assigned at the final step.
RBS weight matrix and assignment of gene starts
Candidate ORFs with only one possible start codon and not having neighbors closer than 30 bases upstream were selected. Regions (-20)...(+3) of these ORFs were aligned at start codons. These sequences were used to derive the RBS weight matrix.
Let L be the expected length of the SD box (L = 6). Denote by F(b,j) positional nucleotide frequencies in the initial alignment [j = (-20)...(-1); b = T,C,A,G]. Positional information content is:
[Equation image missing: positional information content of the alignment columns.]
Initially the RBS signal was assumed to reside in positions having the maximum total information content:
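[Equation image missing: selection of the L consecutive positions with the maximum total information content.]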
Then the position of the SD box in each individual sequence was determined using the following two-stage procedure.
We start with some definitions. Denote by N(b,k) positional nucleotide counts in the SD profile at a given iteration (k = 1...L). Positional nucleotide weights are:
[Equation images missing: positional nucleotide weights derived from the counts N(b,k).]
The first re-alignment stage involves the iteration until convergence of the following two steps: (i) find in each sequence the segment of length L with the highest score; (ii) re-calculate the nucleotide weight matrix.
A distinctive feature of our algorithm is that at each optimization step only the top scoring fraction (usually top 80%) of sequences are used to produce the current weight matrix.
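The first re-alignment stage, including the restriction to the top-scoring fraction of sequences, might look roughly like the Python sketch below. The weight formula used here, log[(N(b,k) + 0.5) / (Nmax(k) + 0.5)], is an assumption chosen so that the consensus base in each column gets weight 0; the additive segment score, the convergence test and all function names are likewise illustrative rather than taken from the STARTER program.

```python
import math

BASES = "ACGT"

def build_weights(windows, pseudo=0.5):
    """Log-ratio weights relative to the most frequent base per column."""
    weights = []
    for k in range(len(windows[0])):
        counts = {b: pseudo for b in BASES}
        for w in windows:
            counts[w[k]] += 1
        top = max(counts.values())
        weights.append({b: math.log(counts[b] / top) for b in BASES})
    return weights

def best_window(seq, weights):
    """Return (score, position) of the best segment; leftmost wins ties."""
    L = len(weights)
    best = None
    for i in range(len(seq) - L + 1):
        score = sum(weights[k][seq[i + k]] for k in range(L))
        if best is None or score > best[0]:
            best = (score, i)
    return best

def refine_sd(seqs, init_pos, L=6, keep=0.8, max_iter=50):
    """Alternate (i) best-segment search and (ii) matrix re-estimation.

    Only the top `keep` fraction of sequences, ranked by their best
    segment score, contributes to the next weight matrix.
    """
    positions = [init_pos] * len(seqs)
    used = list(range(len(seqs)))
    weights = None
    for _ in range(max_iter):
        windows = [seqs[i][positions[i]:positions[i] + L] for i in used]
        weights = build_weights(windows)
        hits = [best_window(s, weights) for s in seqs]
        new_positions = [pos for _, pos in hits]
        ranked = sorted(range(len(seqs)), key=lambda i: hits[i][0], reverse=True)
        new_used = sorted(ranked[:max(1, int(keep * len(seqs)))])
        if new_positions == positions and new_used == used:
            break                          # nothing changed: converged
        positions, used = new_positions, new_used
    return weights, positions
```

Dropping the weakest sequences before each re-estimation keeps upstream regions without a recognizable SD box from diluting the matrix.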
At the second stage the preferences for the distance between the SD box and the start codon are taken into account. Let M be a possible position of the SD box within the RBS region, and let this position occur N(M) times. Denote by Nmax the count of the most frequent position. The positional weights are calculated using the standard formula:
[Equation image missing: positional weights derived from N(M) and Nmax.]
Now the strength of the SD signal is defined as:
[Equation image missing: strength of the SD signal.]
Thus the RBS profile is the nucleotide weight matrix and the vector of position weights. The two step iterative procedure is used again until convergence.
The final step of the genome annotation is the assignment of start codons to `open start' candidate ORFs. If a candidate ORF contains start codons with a sufficiently strong RBS, the 5′-proximal of these starts is accepted. Otherwise the initial start generated at the previous stage is used, thus taking into account the possibility of translation re-initiation from an upstream gene.
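The start assignment rule reduces to a small selection function, sketched here in Python. The function name, the tuple layout of the candidates and the strand convention are hypothetical, and the RBS strength threshold is left as a caller-supplied parameter because no numerical cut-off is given in the text.

```python
def choose_start(candidates, fallback, min_strength, strand="+"):
    """Pick the 5'-proximal start codon that has a strong enough RBS.

    candidates: list of (position, rbs_strength) pairs for one ORF.
    fallback: the start produced by the extension stage, used when no
    candidate reaches min_strength.
    On the forward strand the 5'-proximal start has the smallest
    coordinate; on the complementary strand it has the largest.
    """
    strong = [pos for pos, strength in candidates if strength >= min_strength]
    if not strong:
        return fallback
    return min(strong) if strand == "+" else max(strong)
```

For example, with candidates at 100, 130 and 160 having strengths -3.0, -0.4 and -0.1 and a threshold of -1.0, the forward-strand choice is 130: the start at 100 is rejected for its weak RBS, and 130 lies upstream of 160.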
Choice of the minimal allowed ORF length and handling of short ORFs
First versions of our program worked with a fixed minimal ORF length, typically 100 codons. The conflicts between overlapping ORFs were resolved based on the strength of the coding potential, as described above. The disadvantage of this approach was that quite often short ORFs defeated much longer competing ORFs, and the genome regions vacated by the latter would be returned to the pool of unoccupied space, giving rise to additional abundant short ORFs. This problem became especially severe when the minimal allowed ORF length was set to values under 100 codons, which led to a large number of predicted short ORFs at the expense of the longer ones.
To resolve this problem, we modified the final stage of the gene prediction process as follows. The minimal allowed ORF length is first set to a very high value (2000 nt), after which all genes are predicted as described above. Then the minimal ORF length is reduced step-wise and the gene prediction process is repeated, while the genes predicted at the previous steps remain unchanged. Thus, longer ORFs get higher priority, and the next pool of ORFs is derived from the genome regions that are unoccupied after completion of the previous step. This allows us to avoid the explosion of short ORFs while preserving high overall prediction accuracy.
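The effect of the step-wise length reduction can be demonstrated on a fixed pool of candidate ORFs, as in the Python sketch below. This is a simplification: the program described here re-runs the whole prediction on the unoccupied regions at each step, whereas the sketch only filters a precomputed list. The tuple layout, the score-based ordering inside each layer and the threshold schedule are illustrative assumptions; the 6 nt overlap tolerance follows the text.

```python
def layered_selection(candidates, thresholds, max_overlap=6):
    """Accept ORFs layer by layer, longest minimal length first.

    candidates: list of (start, end, score), end-exclusive, in nt.
    thresholds: decreasing minimal ORF lengths, e.g. [2000, 1000, 300, 105].
    ORFs accepted in an earlier layer are never displaced later.
    """
    accepted = []
    for min_len in thresholds:
        pool = [c for c in candidates
                if c[1] - c[0] >= min_len and c not in accepted]
        # Inside one layer, conflicts go to the higher coding potential.
        pool.sort(key=lambda c: c[2], reverse=True)
        for c in pool:
            overlaps = [min(c[1], a[1]) - max(c[0], a[0]) for a in accepted]
            if all(o <= max_overlap for o in overlaps):
                accepted.append(c)
    return sorted(accepted)
```

With a 2400 nt ORF of modest score and a high-scoring 200 nt ORF nested inside it, a single low threshold lets the short ORF win, while the layered schedule keeps the long one.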
Table 1. RBS weight matrix for B.subtilis and E.coli (columns 1-6: nucleotide position in the window)
| Nucleotide | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Bacillus subtilis | | | | | | |
| A | 0.000 | -0.909 | -0.845 | 0.000 | -0.909 | -0.511 |
| C | -0.856 | -1.000 | -0.943 | -0.923 | -0.999 | -0.748 |
| G | -0.804 | 0.000 | 0.000 | -0.709 | 0.000 | 0.000 |
| T | -0.765 | -0.897 | -0.980 | -0.868 | -0.962 | -0.725 |
| Consensus | A | G | G | A | G | G |
| Escherichia coli | | | | | | |
| A | 0.000 | -0.035 | 0.000 | -0.995 | -0.936 | 0.000 |
| C | -0.299 | 0.000 | -0.804 | -0.984 | -0.987 | -0.814 |
| G | -0.386 | -0.115 | -0.903 | 0.000 | 0.000 | -0.710 |
| T | -0.027 | -0.426 | -0.752 | -0.937 | -1.009 | -0.749 |
| Consensus | A/T | C/A | A | G | G | A |
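To show how a matrix such as the one in Table 1 can be applied, the following Python sketch scores every 6 nt window of an upstream region with the B.subtilis weights and reports the best one. Two assumptions are made: the score of a window is taken to be the plain sum of the positional weights (the position-preference vector is ignored), and the weight of T at position 6 is entered as -0.725, since a positive value would contradict the G consensus at that position. The names `BSU_MATRIX`, `hexamer_score` and `best_sd` are illustrative.

```python
# B.subtilis weights from Table 1; index 0..5 = window positions 1..6.
# T at position 6 is entered as -0.725 (sign assumed, see lead-in).
BSU_MATRIX = {
    "A": [0.000, -0.909, -0.845, 0.000, -0.909, -0.511],
    "C": [-0.856, -1.000, -0.943, -0.923, -0.999, -0.748],
    "G": [-0.804, 0.000, 0.000, -0.709, 0.000, 0.000],
    "T": [-0.765, -0.897, -0.980, -0.868, -0.962, -0.725],
}

def hexamer_score(hexamer, matrix):
    """Sum of positional weights (assumed additive scoring)."""
    return sum(matrix[base][k] for k, base in enumerate(hexamer))

def best_sd(upstream, matrix, L=6):
    """Return (position, score) of the best window; leftmost wins ties."""
    best = None
    for i in range(len(upstream) - L + 1):
        score = hexamer_score(upstream[i:i + L], matrix)
        if best is None or score > best[1]:
            best = (i, score)
    return best
```

The consensus hexamer AGGAGG scores exactly 0, the maximum possible value, and every deviation from it lowers the score.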
Implementation and availability
The program to calculate weight matrices based on a multiple alignment of putative RBS regions is called STARTER and is written in the C programming language. All other computational steps described in this paper are implemented as a Perl 5 script called ORPHEUS. Both programs as well as all data mentioned (protein sequence sets, weight matrices, DPS search results, etc.) are freely available to academic users; see http://pedant.mips.biochem.mpg.de/frishman/orpheus_home.html
Figure 1. Alignment of the B.subtilis regions upstream from the 5′ ends of the ORFs with one possible start codon and acceptable coding potential. Sequences are numbered 1-386. Each sequence includes positions -22...-1 upstream of the start codon and the start codon (shown in italic). Locations of the regions with the highest RBS score are shown by boxes, and the corresponding RBS scores are indicated in the last column. The location of the preferred SD position (in this case -13; see Fig. 2) is shaded. Note that the procedure to find RBS uses the top scoring 80% of sequences at each iteration step. Sequences with the worst 20% of scores (in this example 10 and 383) are ignored.
RESULTS AND DISCUSSION
910 E.coli and 529 B.subtilis similarity-based seed ORFs were extracted from the DPS search results. The average length of these ORFs was 444 and 507 amino acids, respectively, greater than the average length of E.coli (339) and B.subtilis (326) genes (excluding the genes shorter than 100 amino acids). The reason for this difference is that a very stringent DPS similarity score threshold was chosen, giving preference to long, reliable alignments.
RBS weight matrices (Table 1) were derived from alignments of 385 B.subtilis and 644 E.coli 5′ upstream gene regions with single candidate starts (Fig.
Figure 2. Automatically derived positional preference of the SD box in B.subtilis (a) and E.coli (b). Higher values (closer to 0) correspond to more preferred start positions of the window of length 6 nt.
The start selection procedure is exemplified in Figure
Figure 3. Start codon selection procedure. The seed ORF of the comA gene (32) with multiple possible starts situated on the complementary strand is extended in the upstream direction, and the values of the coding potential downstream of each start codon (Ω) and RBS signal strength upstream (Δ) recorded. Start position 3 252 523 is selected since it is preceded by a very strong RBS (bold line). The SD sequence indicated in (32) is underlined. Start codons are shown in upper case. Note that the values of Ω are higher than the conservative threshold -1.0 in all cases.
Our program identified 4379 genes longer than 35 codons in the B.subtilis genome and 4595 genes longer than 35 codons in the E.coli genome. As seen in Table 2 and Figure
As seen in Figure
The main distinctive feature of our algorithm is that we start the analysis with the coding regions and candidate RBS that can be expected to be highly reliable. They serve as a learning set used to derive statistical parameters used for further, more detailed analysis. The use of top 80% highest scoring candidate RBS to derive the weight matrix proved to be highly effective and improved discrimination power of the weight matrix.
Unlike GeneMark and EcoParse, our algorithm does not rely on the statistics of the non-coding regions. This is motivated by the fact that only coding regions can be defined unambiguously, especially at the initial steps of the analysis. Similarly, we do not use the energy of the base-pairing of the SD and the 16S rRNA. This makes the program applicable at early stages of genome analysis when the rRNA genes may not be sequenced yet. Also, in some bacteria the RBS does not conform to the standard base-pairing model (33). Finally, we do not use complicated multiple alignment techniques for derivation of the RBS profile: it turned out that the relatively strong RBS signal can be detected by a relatively simple iterative procedure (cf. 34).
Table 2. Comparison of the gene prediction results with the sets of sequences from the PIR-International and the genome sequencing projects
| Dataset | % correctly identified genes (true positives), L > 100 | % correctly identified genes (true positives), L > 35 | % correct starts for correctly identified genes, L > 100 | % correct starts for correctly identified genes, L > 35 | % correctly predicted genes with correct starts using `leftmost ATG' procedure, L > 100 | % correctly predicted genes with correct starts using `leftmost ATG' procedure, L > 35 |
| --- | --- | --- | --- | --- | --- | --- |
| SUBPIR | 93.3 | - | 96.3 | - | 83.0 | - |
| ECOPIR | 96.3 | - | 83.9 | - | 86.9 | - |
| SUBGEN | 98.9 | 88.9 | 92.9 | 94.2 | 75.7 | 82.8 |
| ECOGEN | 99.1 | 87.1 | 75.7 | 76.7 | 78.0 | 77.7 |
Figure 4. Distribution of the true positive, false positive and false negative lengths in B.subtilis (a) and E.coli (b) ORFs. Only the ORF length range 0-200 codons is shown. Predictions for longer ORFs are practically perfect.
Most genes in the current databanks, and specifically the genes determined in the framework of major sequencing projects, are not corroborated experimentally. This makes it very difficult to assess performance of any particular algorithm or perform large-scale benchmarking (cf. the detailed discussion in 13). Interestingly, the gene start prediction accuracy both in B.subtilis and E.coli was a few percentage points higher for the SUBPIR and ECOPIR subsets than for the full genomes, whereas for the percentage of correctly predicted genes the situation was the opposite. The differences between the prediction results for the PIR and GenBank data sets can be explained by the details of the gene analysis in the original publications. However, analysis of the disagreements between different annotations should be undertaken in order to resolve this problem.
The slightly lower percentage of identified genes in B.subtilis as compared to E.coli can be explained by the relatively uniform codon usage in this species (35; see, however, 36). On the other hand, the much better assignment of gene starts in B.subtilis reflects the general tendency towards stronger RBS in some Gram-positive bacteria (13).
The second surprising finding is the large number of non-ATG start codons. Indeed, in the candidate genes of B.subtilis having only one possible start codon, this codon is TTG in 21% of genes and GTG in 16% of genes (cf. 13 and 9% respectively, in 9). Since it is unlikely that the set of similarity-seed ORFs is somehow biased in the use of start codons, we feel that the former values are likely to be correct.
The future direction of this work is to incorporate additional evidence for better prediction of genes and their starting positions. It would be very desirable to take into account the influence of the mRNA secondary structure on the choice of start codons (e.g. 37) and to mask the genome regions coding for stable RNAs such as rRNAs and tRNAs (38) in order to decrease the number of false positives. Protein features can also be important for gene recognition. At present the gene recognition programs serve mainly as an initial step of genome analysis, to be followed by protein functional and structural analysis (22). An obvious possibility is the use of signal peptide predictions for the choice of the start codon. However, more sophisticated uses of protein analysis are possible. This probably should be done by hierarchical analysis systems with various feedback connections.
ACKNOWLEDGEMENTS
We are grateful to M.Galperin, A.Grigoriev, J.Hani, E.Koonin, P.Pevzner and M.Roytberg for useful discussions and to A.Hatzigeorgiou and J.W.Fickett for communicating their results prior to publication. A.M. and M.G. are partially supported by grants from the Russian Fund of Basic Research, the Russian State Program `Human Genome' and from the USA Department of Energy (DE-FG-94ER61919). The support of the Bundesministerium für Forschung und Technologie (FKZ 0311670) is gratefully acknowledged.
REFERENCES
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Last modification: 4 Jun 1998
Copyright©Oxford University Press, 1998.
Appl. Envir. Microbiol.,
May 1, 2004;
70(5):
3055 - 3063.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
A. M. Cerdeno-Tarraga, A. Efstratiou, L. G. Dover, M. T. G. Holden, M. Pallen, S. D. Bentley, G. S. Besra, C. Churcher, K. D. James, A. De Zoysa, et al.
The complete genome sequence and analysis of Corynebacterium diphtheriae NCTC13129
Nucleic Acids Res.,
November 15, 2003;
31(22):
6516 - 6523.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
F. O. Glockner, M. Kube, M. Bauer, H. Teeling, T. Lombardot, W. Ludwig, D. Gade, A. Beck, K. Borzym, K. Heitmann, et al.
Complete genome sequence of the marine planctomycete Pirellula sp. strain 1
PNAS,
July 8, 2003;
100(14):
8298 - 8303.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
F.-B. Guo, H.-Y. Ou, and C.-T. Zhang
ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes
Nucleic Acids Res.,
March 15, 2003;
31(6):
1780 - 1789.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
D. Frishman, M. Mokrejs, D. Kosykh, G. Kastenmuller, G. Kolesov, I. Zubrzycki, C. Gruber, B. Geier, A. Kaps, K. Albermann, et al.
The PEDANT genome database
Nucleic Acids Res.,
January 1, 2003;
31(1):
207 - 211.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
T. Shibuya and I. Rigoutsos
Dictionary-driven prokaryotic gene finding
Nucleic Acids Res.,
June 15, 2002;
30(12):
2710 - 2725.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
Q. Bao, Y. Tian, W. Li, Z. Xu, Z. Xuan, S. Hu, W. Dong, J. Yang, Y. Chen, Y. Xue, et al.
A Complete Sequence of the T. tengcongensis Genome
Genome Res.,
May 1, 2002;
12(5):
689 - 700.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
H. W. Mewes, D. Frishman, U. Guldener, G. Mannhaupt, K. Mayer, M. Mokrejs, B. Morgenstern, M. Munsterkotter, S. Rudd, and B. Weil
MIPS: a database for genomes and protein sequences
Nucleic Acids Res.,
January 1, 2002;
30(1):
31 - 34.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
J. Besemer, A. Lomsadze, and M. Borodovsky
GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions
Nucleic Acids Res.,
June 15, 2001;
29(12):
2607 - 2618.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
M. B. Prentice, K. D. James, J. Parkhill, S. G. Baker, K. Stevens, M. N. Simmonds, K. L. Mungall, C. Churcher, P. C. F. Oyston, R. W. Titball, et al.
Yersinia pestis pFra Shows Biovar-Specific Differences and Recent Common Ancestry with a Salmonella enterica Serovar Typhi Plasmid
J. Bacteriol.,
April 15, 2001;
183(8):
2586 - 2594.
[Abstract]
[Full Text]
![]()
![]()
![]()

![]()
![]()
![]()
B. J. May, Q. Zhang, L. L. Li, M. L. Paustian, T. S. Whittam, and V. Kapur
Complete genomic sequence of Pasteurella multocida,Pm70
PNAS,
March 13, 2001;
98(6):
3460 - 3465.
[Abstract]
[Full Text]
[PDF]
![]()