Nucleic Acids Research Pages 544-548


Microbial gene identification using interpolated Markov models
Introduction
Interpolated Markov Models
   Markov chains
   Interpolated models
Algorithm And System Design
   Setting IMM parameters
   The GLIMMER system
Methods And Results
   Comparison on H.influenzae
   Gene finding accuracy on H.pylori
Conclusion
Acknowledgements
References


Steven L. Salzberg1,2,*, Arthur L. Delcher3, Simon Kasif4 and Owen White1

1The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA, 2Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA, 3Department of Computer Science, Loyola College in Maryland, Baltimore, MD 21210, USA and 4Department of Electrical Engineering and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA

Received September 10, 1997; Revised and Accepted November 11, 1997

ABSTRACT

This paper describes a new system, GLIMMER, for finding genes in microbial genomes. In a series of tests on Haemophilus influenzae, Helicobacter pylori and other complete microbial genomes, this system has proven to be very accurate at locating virtually all the genes in these sequences, outperforming previous methods. A conservative estimate based on experiments on H.pylori and H.influenzae is that the system finds >97% of all genes. GLIMMER uses interpolated Markov models (IMMs) as a framework for capturing dependencies between nearby nucleotides in a DNA sequence. An IMM-based method makes predictions based on a variable context; i.e., a variable-length oligomer in a DNA sequence. The context used by GLIMMER changes depending on the local composition of the sequence. As a result, GLIMMER is more flexible and more powerful than fixed-order Markov methods, which have previously been the primary content-based technique for finding genes in microbial DNA.

INTRODUCTION

The number of new microbial genomes has dramatically increased since the first genome, Haemophilus influenzae, was sequenced in 1995 (1). Ten whole genomes have been completed, and at least 30 others are expected to be completed in the next two years. This abundance of data demands new and highly accurate computational analysis tools in order to explore these genomes and maximize the scientific knowledge gained from them. One of the first steps in the analysis of a microbial genome is the identification of all its genes. Because these genomes tend to be gene-rich, typically containing 90% coding sequence, the gene discovery problem takes on a different character than it does in eukaryotic genomes, especially higher eukaryotes whose genomes may have <10% coding sequence. In particular, the most difficult problem is determining which of two or more overlapping open reading frames (orfs) represent true genes. Other difficult problems include identifying the start of translation and finding regulatory signals such as promoters and terminators.

The most reliable way to identify a gene in a new genome is to find a close homolog from another organism. This can be done today very effectively using programs such as BLAST (3) and FASTA (4) to search all the entries in GenBank. However, many of the genes in new genomes still have no significant homology to known genes (1). For these genes, we must rely on computational methods of scoring the coding region to identify the genes. The best-known program for this task is GeneMark (5), which uses a Markov chain model to score coding regions. GeneMark has been highly effective and was used in the H.influenzae and more recent genome projects. We have developed a new system, GLIMMER, that uses a technique called interpolated Markov models (IMMs) to find coding regions in microbial sequences. IMMs are in principle more powerful than Markov chains, and the computational experiments described below demonstrate that they produce more accurate results when used to find genes in bacterial DNA.

Markov models are a well-known tool for analyzing biological sequence data, and the predominant model for microbial sequence analysis is a fixed-order Markov chain (5,6). A fixed-order Markov model predicts each base of a DNA sequence using a fixed number of preceding bases in the sequence. For example, a 5th-order model, which is the basis of GeneMark, uses the five previous bases to predict the next base. However, learning such models accurately can be difficult when there is insufficient training data to accurately estimate the probability of each base occurring after every possible combination of five preceding bases. In general, a kth-order Markov model for DNA sequences requires $4^{k+1}$ probabilities to be estimated from the training data (e.g., 4096 probabilities for a 5th-order model). In order to estimate these probabilities, many occurrences of all possible kmers must be present in the data.

An IMM overcomes this problem by combining probabilities from contexts of varying lengths to make predictions, and by only using those contexts (oligomers) for which sufficient data are available. In a typical microbial genome some 5mers will occur too infrequently to give reliable estimates of the probability of the next base, while some 8mers may occur frequently enough to give very reliable estimates. In principle, using longer oligomers is always preferable to using shorter ones, but only if sufficient data is available to produce good probability estimates. An IMM uses a linear combination of probabilities obtained from several lengths of oligomers to make predictions, giving high weights to oligomers that occur frequently and low weights to those that do not. Thus an IMM uses a longer context to make a prediction whenever possible, taking advantage of the greater accuracy produced by higher-order Markov models. Where the statistics on longer oligomers are insufficient to produce good estimates, an IMM can fall back on shorter oligomers to make its predictions.

Using IMMs we have developed a new system, called GLIMMER, to identify coding regions in microbial DNA. GLIMMER uses a novel approach, based on frequency of occurrence and predictive value, to determine the relative weights of oligomers that vary in length from 1 to 8. After first creating IMMs for each of the six possible reading frames, GLIMMER then uses them to score entire orfs. When two high-scoring orfs overlap, the overlap region is scored separately to determine which orf is more likely to be a gene. We have tested GLIMMER using the H.influenzae, Helicobacter pylori and Escherichia coli genomes and found that it is very accurate in identifying genes, as we explain in Methods and Results. The system has recently been used to find the genes in two newly completed genomes: Borrelia burgdorferi, the bacterium that causes Lyme disease (14), and Treponema pallidum, the bacterium that causes syphilis (Fraser et al., manuscript in preparation). Annotation for these and other completed genomes will be available on the GLIMMER web site.

INTERPOLATED MARKOV MODELS

Markov chains

Our probabilistic model of DNA sequences represents a sequence as a process that may be described as a sequence of random variables $X_1, X_2, \ldots$, where $X_i$ corresponds to position $i$ in the sequence. Each random variable $X_i$ takes a value from the set of bases {a, c, g, t}. The value that a variable $X_i$ takes will depend on the local context; that is, the bases immediately adjacent to the base at position $i$. We sometimes refer to {a, c, g, t} as the set of possible states that a variable can take. In other words, variable $X_i$ is in state a if $X_i = a$. As an illustration, consider the simple example of a Markov model in Figure 1. This 1-state model can be used to model any length DNA sequence. In each position, the probability of a is 0.2. Thus the sequence aaaaa would have a probability of $(0.2)^5 = 0.00032$. In this way we can score any sequence by computing the probability that it was generated by the model.
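The scoring rule for the 1-state model can be sketched in a few lines of Python. This is only an illustration: the value p(a) = 0.2 comes from the text, while the other three base probabilities are made up here so that the four values sum to 1, and the function names are hypothetical.

```python
import math

# Only p(a) = 0.2 is taken from the text; the other values are assumptions.
BASE_PROBS = {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2}


def sequence_probability(seq, probs=BASE_PROBS):
    """Probability that the 1-state model emits `seq` (product over positions)."""
    p = 1.0
    for base in seq:
        p *= probs[base]
    return p


def sequence_log_probability(seq, probs=BASE_PROBS):
    """Same score in log space, which avoids underflow on long sequences."""
    return sum(math.log(probs[base]) for base in seq)


print(sequence_probability("aaaaa"))  # the (0.2)^5 example, up to rounding
```

For sequences of realistic length the log form is the practical choice, since a product of thousands of probabilities underflows in floating point.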


Figure 1. Sample 1-state Markov model for simple sequence modeling.

A first-order Markov chain is a sequence of random variables where the probability that $X_i$ takes a particular value only depends on the preceding variable $X_{i-1}$. A kth-order Markov chain is a natural generalization of this definition where the probability distribution of $X_i$ depends only on the k preceding bases. Note that for DNA sequences a first-order Markov chain is specified completely by a matrix of 16 probabilities: p(a|a), p(a|c), ..., p(t|t). There are two essential computational issues that must be considered in building and using these probabilistic models: (i) the learning problem, which involves learning a good model for coding regions in microbial DNA and (ii) the evaluation problem, which involves assigning a score to a new DNA sequence that represents the likelihood that the sequence is coding. GLIMMER's solutions to both these computational issues are described in the Interpolated models section below.

To use a Markov chain model to find genes in microbial DNA, we need to build at least six submodels, one for each of the possible reading frames (three forward and three reverse). We can also build a seventh, separate model for non-coding regions, though this is not strictly necessary. Each model makes different predictions for the bases in the three codon positions. Even with a 0th-order model, the frequency of g in codon position 1 will be different from its frequency in another frame, so even this very weak model has some ability to identify the right reading frame for a gene.
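The remark about the 0th-order model can be made concrete with a small sketch. It assumes a training set of in-frame gene sequences over the alphabet acgt, uses a pseudocount to avoid zero probabilities (an assumption of this sketch, not something the text specifies), and covers only the three forward frames. All names are hypothetical.

```python
from collections import Counter
import math

BASES = "acgt"


def train_codon_position_model(genes, pseudocount=1.0):
    """Return freqs[pos][base] for codon positions pos = 0, 1, 2."""
    counts = [Counter() for _ in range(3)]
    for gene in genes:
        for i, base in enumerate(gene):
            counts[i % 3][base] += 1
    freqs = []
    for c in counts:
        total = sum(c[b] for b in BASES) + 4 * pseudocount
        freqs.append({b: (c[b] + pseudocount) / total for b in BASES})
    return freqs


def frame_log_scores(seq, freqs):
    """Log-score of `seq` assuming codons start at offset 0, 1 or 2."""
    return [
        sum(math.log(freqs[(i - offset) % 3][base]) for i, base in enumerate(seq))
        for offset in range(3)
    ]


freqs = train_codon_position_model(["atg" * 50])
scores = frame_log_scores("g" + "atg" * 10, freqs)
best_offset = scores.index(max(scores))
```

On this toy input the highest score is obtained for offset 1, the frame in which the test sequence was actually constructed.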

In a 1st-order model, the output of a state depends on the state immediately previous; i.e., a base is dependent on the previous base. Thus instead of four probabilities in each state, we compute sixteen: p(a|a), p(a|c), ..., p(t|t). In order to score a new sequence, the model considers two bases at a time, the current base and the previous one. Likewise, in a 2nd-order model, the output of a state depends on the two previous bases. So to predict a base in the third codon position with our 2nd-order model, we look at the first and second codon positions. To predict a base in the first codon position, the 2nd-order model looks at the second and third codon positions in the previous codon.
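As a rough illustration of the sixteen-probability table, the following sketch estimates a 1st-order chain by counting adjacent base pairs and then scores a sequence two bases at a time. For brevity it ignores codon position (a single homogeneous chain), and the pseudocount is an assumption added here to keep every probability non-zero.

```python
from collections import Counter
import math

BASES = "acgt"


def train_first_order(sequences, pseudocount=1.0):
    """Estimate p(next | prev) for all 16 ordered base pairs."""
    pair_counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[prev, nxt] += 1
    table = {}
    for prev in BASES:
        total = sum(pair_counts[prev, b] for b in BASES) + 4 * pseudocount
        for nxt in BASES:
            table[prev, nxt] = (pair_counts[prev, nxt] + pseudocount) / total
    return table


def log_score(seq, table):
    """Log-probability of seq[1:] given seq[0] under the 1st-order chain."""
    return sum(math.log(table[prev, nxt]) for prev, nxt in zip(seq, seq[1:]))


table = train_first_order(["atatatatat"])
```

A sequence that resembles the training data, such as "atat", receives a higher log-score under this table than one that does not, such as "aaaa".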

Using the Markov models for each of the six possible frames plus a model of non-coding DNA, we can straightforwardly produce a simple algorithm for finding genes. Simply score every orf using all seven models, and choose the model with the highest score. The scores can be normalized so they represent the probability that a sequence is coding. If the model corresponding to the true coding region in the correct frame scores the highest, then the orf can be labeled as a gene. This simple algorithm ignores the difficult problem of how to handle overlapping genes, which we address in the Algorithm and System Design section, which contains the details of GLIMMER. (To be effective, an algorithm must do much more than this intentionally simple description. For example, all scores could be nearly equal, or the highest score could still be quite low, so the algorithm needs to have a threshold score below which no region is classified as coding.)
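The simple algorithm, including the normalization and the threshold mentioned in parentheses, might look like the sketch below. It assumes that each model has already produced a log-likelihood for the orf, that all models have equal prior weight, and that a cutoff of 0.9 on the normalized score is acceptable; the model names and the cutoff are illustrative choices, not values from GLIMMER.

```python
import math


def classify_orf(log_likelihoods, min_posterior=0.9):
    """Pick the best-scoring model for one orf.

    log_likelihoods maps a model name (six frame models plus 'noncoding')
    to log P(orf | model). Returns (model_name, normalized_score), with
    model_name set to None when the orf should not be called coding.
    """
    top = max(log_likelihoods.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = {m: math.exp(v - top) for m, v in log_likelihoods.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    posterior = weights[best] / total
    if best == "noncoding" or posterior < min_posterior:
        return None, posterior
    return best, posterior


clear_case = {"+1": -100.0, "+2": -120.0, "+3": -125.0,
              "-1": -130.0, "-2": -128.0, "-3": -131.0, "noncoding": -115.0}
flat_case = dict.fromkeys(clear_case, -100.0)
```

With `clear_case` the frame "+1" wins with a normalized score close to 1, whereas `flat_case`, where all seven scores are equal, is rejected because no model exceeds the cutoff.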

Interpolated models

In general, we would always like to use the highest-order Markov model possible. The higher-order model should always do at least as well as, and frequently better than, lower-order models. This can be explained by a simple example.

Suppose that the base in the third codon position depends only on the second codon position. Then we might observe in a given genome that $P(a_3 \mid g_2) = 0.22$; i.e., the probability of observing adenine in the third codon position given that guanine occurs in the second is 0.22. This is a first-order dependency. Suppose that the prior probability of adenine $P(a_3)$ is 0.30. Clearly we will perform better by using the first-order statistic, since adenine occurs less frequently in the third position following guanine than it does otherwise. Now consider using both the first and second codon positions to predict $a_3$. Given our assumption that only the second position matters, we should find that $P(a_3 \mid g_2) = P(a_3 \mid g_2, x_1)$, where $x_1$ indicates any base in the first codon position. Thus the 2nd-order model will perform exactly the same as the 1st-order model. If it turns out that the third codon position depends on both the first and second positions, then the 2nd-order model will perform better.

The problem that arises in practice is that, as we move to higher order models, the number of probabilities that we must estimate from the data increases exponentially. For DNA sequence data, we need to learn $4^{k+1}$ probabilities in a kth-order Markov model. Our six submodels actually need $6 \times 4^{k+1}$ probabilities. So a 5th-order model needs 24 576 probabilities. In a microbial genome such as H.influenzae with 1.8 million bases, we will observe each of the 4096 possible 6mers often enough to get accurate estimates for a 5th-order model, although for rare hexamers we may not have enough data. For a 6th-order model, which requires probabilities for all 7mers, there are a substantial number of 7mers that do not occur sufficiently often, and for 7th and 8th-order models the problem is worse. However, even for 8th-order models, there are some oligomers that occur often enough to be extremely useful predictors. We would like a Markov model that uses these higher-order statistics whenever sufficient data is available. This is one of the key advantages of using an IMM. [Note that there exist other techniques to incorporate variable length predictive models (7,8). We experimented with these alternatives before converging on the approach described here.]
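The growth in the number of parameters is easy to tabulate. The sketch below prints, for each order k from 0 to 8, the number of probabilities for one model and for six submodels, together with a deliberately crude average number of occurrences per (k+1)-mer, obtained by assuming that all oligomers of that length were equally common in a 1.8 million base genome. That uniformity assumption is ours and is only meant to indicate scale.

```python
def markov_parameter_count(order, submodels=1):
    """Number of conditional probabilities in `submodels` kth-order DNA models."""
    return submodels * 4 ** (order + 1)


GENOME_SIZE = 1_800_000  # approximate H.influenzae size quoted in the text

for k in range(9):
    # Crude average, assuming every (k+1)-mer were equally common.
    average_occurrences = GENOME_SIZE / 4 ** (k + 1)
    print(k,
          markov_parameter_count(k),
          markov_parameter_count(k, submodels=6),
          round(average_occurrences, 1))
```

For k = 5 this gives 4096 and 24 576 probabilities and roughly 439 occurrences per 6mer on average, while for k = 8 the average falls below 7 occurrences per 9mer, which is why only some long oligomers are frequent enough to be useful.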

To be more precise, an IMM uses a combination of all the probabilities based on 0, 1, 2, ..., k previous bases, where k is a parameter given to the algorithm. In GLIMMER, we use k = 8. Thus for oligomers that occur frequently, the IMM can use an 8th-order model, while it might use a 5th or even lower-order model for rare oligomers. In order to 'smooth' its predictions, an IMM uses predictions from the lower-order models, where much more data is available, to adjust the predictions made from higher-order models.

During training, GLIMMER computes the probability of each base a, c, g, t, following all kmers for $0 \le k \le 8$. Then, for each kmer it computes a weight to use in combining the predictions of different order models. Details of the algorithm for computing these weights are given in the Algorithm and System Design section. Once the weights are computed, GLIMMER evaluates new sequences by computing the probability that the model $M$ generated the sequence $S$, $P(S \mid M)$. This probability is computed as

$$P(S \mid M) = \prod_{x=1}^{n} \mathrm{IMM}_8(S_x)$$

where $S_x$ is the oligomer ending at position $x$, and $n$ is the length of the sequence. $\mathrm{IMM}_8(S_x)$, the 8th-order interpolated Markov model score, is computed as

$$\mathrm{IMM}_k(S_x) = \lambda_k(S_{x-1}) \cdot P_k(S_x) + [1 - \lambda_k(S_{x-1})] \cdot \mathrm{IMM}_{k-1}(S_x) \qquad (1)$$

where $\lambda_k(S_{x-1})$ is the numeric weight associated with the kmer ending at position $x - 1$ in the sequence $S$ and $P_k(S_x)$ is the estimate obtained from the training data of the probability of the base located at $x$ in the kth-order model. Thus, the 8th-order IMM score of an oligomer is a linear combination of the predictions made by the 8th, 7th and lesser-order models all the way down to the 0th-order model, which is just the simple prior probabilities of a, c, g, t. The above equation is the solution to the evaluation problem mentioned in the introduction.
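The recursive definition of the IMM score translates almost directly into code. The sketch below is not GLIMMER's implementation: it assumes that the probability estimates and weights are stored in plain dictionaries, treats a missing weight as 0, falls back to a shorter context near the start of the sequence where fewer than k preceding bases exist, and ignores reading frame. It combines the per-position values as a sum of logarithms, which corresponds to multiplying the probabilities.

```python
import math


def imm_probability(seq, x, k, prob, lam):
    """Interpolated probability of base seq[x] given up to k preceding bases.

    prob[(context, base)] -> sample estimate P_k, with len(context) == k
    lam[context]          -> weight in [0, 1]; missing contexts count as 0
    """
    base = seq[x]
    if k == 0:
        return prob[("", base)]          # 0th order: plain base frequency
    lower = imm_probability(seq, x, k - 1, prob, lam)
    if x < k:                            # not enough preceding bases
        return lower
    context = seq[x - k:x]
    weight = lam.get(context, 0.0)
    if weight == 0.0:
        return lower
    return weight * prob[(context, base)] + (1.0 - weight) * lower


def imm_log_score(seq, k, prob, lam):
    """Sum of log interpolated probabilities over all positions of seq."""
    return sum(math.log(imm_probability(seq, x, k, prob, lam))
               for x in range(len(seq)))


# Toy tables: uniform 0th-order model plus one 1st-order context, "g".
prob = {("", b): 0.25 for b in "acgt"}
prob.update({("g", "a"): 0.1, ("g", "c"): 0.3, ("g", "g"): 0.3, ("g", "t"): 0.3})
lam = {"g": 0.5}
```

With these toy tables, the interpolated probability of a following g is 0.5 · 0.1 + 0.5 · 0.25 = 0.175, and asking for a 2nd-order score gives the same value because the context cg has no weight and the model falls back to the shorter context.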

From this definition, it is clear that an IMM is in principle always preferable to a fixed-order Markov model. For example, by giving zero weights to all oligomers except 5mers, an IMM will perform identically to a 5th-order Markov model. However, if there are any 6mers that occur frequently enough in the training data to be useful, and if these 6mers predict a different distribution of bases than the corresponding 5mers, then the IMM will outperform the 5th-order model. Not only longer but also shorter oligomers will help improve performance: even if a 5th-order model is better than a 4th-order model, there may be some rare 5mers for which insufficient data are available. A 5th-order model has no choice but to use the unreliable predictions from these rare 5mers, but an IMM can fall back on the much more reliable predictions made by the 4mers in such cases. The experiments described below indicate that both of these phenomena occur and both serve to give IMMs an advantage over fixed-order Markov models.

It is worth remarking that GLIMMER builds a non-homogeneous Markov model; i.e., different models are created for each of the three codon positions. This type of '3-periodic' Markov chain was introduced in GeneMark (5) to account for patterns that depend on the reading frame.

ALGORITHM AND SYSTEM DESIGN

Setting IMM parameters

In this section we describe how GLIMMER computes the values of the [lambda] parameters for the kth-order IMM described in the preceding section. In addition, we explain the solution to the learning problem mentioned in the introduction. First, a set of known coding sequences must be assembled into a training set. To be certain these are truly coding is somewhat problematic for a new genome. The solution we have adopted is to use only very long orfs and sequences with homology to known genes from other organisms. These can easily be identified a priori without knowing anything else about the genome being analyzed.

From the training set of genes, the frequencies of occurrence of all possible substring patterns of length 1 to k + 1 are tabulated in each of the six reading frames. (The last base in the substring defines the reading frame.) For simplicity, let us consider just a single reading frame and use $f(S)$ to denote the number of occurrences of string (sequence) $S = s_1 s_2 \ldots s_n$. (This same procedure is repeated for each of the six reading frames.) From these frequencies we get initial estimates of the probability of base $s_x$ occurring given the context string $s_{x-i}, s_{x-i+1}, \ldots, s_{x-1}$, denoted by $S_{x,i}$ (i.e., the $i$ bases just previous to position $x$). We compute the probability of base $s_x$ given the $i$ previous bases as

$$P_i(S_x) = P(s_x \mid S_{x,i}) = \frac{f(S_{x,i}, s_x)}{\sum_{b \in \{a, c, g, t\}} f(S_{x,i}, b)}$$

The value of $\lambda_i(S_x)$ that we associate with $P_i(S_x)$ can be regarded as a measure of our confidence in the accuracy of this value as an estimate of the true probability. GLIMMER uses two criteria to determine $\lambda_i(S_x)$. The first of these is simply frequency of occurrence. If the number of occurrences of context string $S_{x,i}$ in the training data exceeds a specific threshold value, then $\lambda_i(S_x)$ is set to 1.0. Thus, when there are sufficiently many sample occurrences of a context string in the training data, then those sample probabilities are used. The current default value for this threshold in GLIMMER is 400, which gives ~95% confidence that the sample probabilities are within ±0.05 of the true probabilities from which the sample was taken. (Other thresholds were tested experimentally, but none provided any noticeable improvement.)
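The tabulation of context frequencies and the first criterion can be sketched as follows, for a single reading frame and sequences over acgt only. The data layout and function names are assumptions made for illustration. The last function checks the ±0.05 figure using the usual normal approximation for a proportion at its worst case p = 0.5, which is our reading of how the ~95% statement can be obtained.

```python
from collections import defaultdict
import math

BASES = "acgt"
THRESHOLD = 400  # GLIMMER's default, as stated in the text


def count_contexts(sequences, max_order):
    """counts[context][base]: occurrences of `base` right after `context`,
    for every context length from 0 to max_order."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0))
    for seq in sequences:
        for x, base in enumerate(seq):
            for i in range(min(max_order, x) + 1):
                counts[seq[x - i:x]][base] += 1
    return counts


def sample_probability(counts, context, base):
    """Observed frequency of `base` after `context` (0.0 if never seen)."""
    total = sum(counts[context].values())
    return counts[context][base] / total if total else 0.0


def has_full_weight(counts, context, threshold=THRESHOLD):
    """First criterion only: weight 1.0 when the context is frequent enough."""
    return sum(counts[context].values()) > threshold


def worst_case_margin(n, z=1.96):
    """Half-width of a ~95% normal-approximation interval at p = 0.5."""
    return z * math.sqrt(0.25 / n)


counts = count_contexts(["acgacgacg"], max_order=2)
```

For n = 400 the margin is 1.96 · 0.025 = 0.049, in line with the ±0.05 quoted above.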

When there are insufficiently many sample occurrences of a context string to estimate the probability of the next base with confidence, we employ an additional criterion to assign a $\lambda$ value. For a given context string $S_{x,i}$ of length $i$, we compare the observed frequencies of the following base, $f(S_{x,i}, a)$, $f(S_{x,i}, c)$, $f(S_{x,i}, g)$ and $f(S_{x,i}, t)$, with the previously calculated IMM probabilities using the next shorter context, $\mathrm{IMM}_{i-1}(S_{x,i-1}, a)$, $\mathrm{IMM}_{i-1}(S_{x,i-1}, c)$, $\mathrm{IMM}_{i-1}(S_{x,i-1}, g)$ and $\mathrm{IMM}_{i-1}(S_{x,i-1}, t)$. Using a $\chi^2$ test, we determine how likely it is that the four observed frequencies are consistent with the IMM values from the next shorter context. When the frequencies differ significantly from the IMM values, we prefer to use them as better predictors of the next base, i.e., give them a higher $\lambda$ value. Conversely, when the frequencies are consistent with the IMM values, they offer little predictive value and hence we give them a lower $\lambda$ value. Specifically, we calculate the $\chi^2$ confidence $c$ that the frequencies are not consistent with the IMM probabilities and set $\lambda_i(S_x)$ according to both $c$ and the number of occurrences of the context string.
Thus, we assign higher $\lambda$ values based on a combination of predictive value, determined by $\chi^2$ significance, and accuracy, determined by frequency of occurrence. This $\lambda$ value now defines the probabilities $\mathrm{IMM}_i(S_{x,i}, b)$ for $b \in \{a, c, g, t\}$ according to equation 1. [Other methods of assigning $\lambda$ values for IMMs have been developed (9,10). We experimented with these methods in addition to the one described above, and comparative results will be given in a follow-up paper. Roberts (11), cited in (12), also describes a method for building nonuniform Markov models.]
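The confidence value used in the second criterion can be computed with a standard Pearson goodness-of-fit test. The sketch below assumes four categories and therefore three degrees of freedom, for which the cumulative distribution has a closed form, and it requires all expected probabilities to be positive. It computes only the confidence that the observed counts differ from the shorter-context prediction; how that confidence is turned into a weight is not shown.

```python
import math


def chi2_cdf_3dof(x):
    """CDF of the chi-square distribution with 3 degrees of freedom."""
    if x <= 0:
        return 0.0
    return (math.erf(math.sqrt(x / 2))
            - math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


def chi2_confidence(observed, expected_probs):
    """Confidence that counts of a, c, g, t after a context are NOT
    consistent with the probabilities from the next shorter context.

    observed:       four counts following the longer context
    expected_probs: four positive probabilities from the shorter context
    """
    total = sum(observed)
    statistic = sum((o - total * p) ** 2 / (total * p)
                    for o, p in zip(observed, expected_probs))
    return chi2_cdf_3dof(statistic)
```

Counts that match the shorter-context prediction exactly, such as `[25, 25, 25, 25]` against a uniform prediction, give a confidence of 0, while strongly skewed counts such as `[70, 10, 10, 10]` give a confidence very close to 1.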

The GLIMMER system

The GLIMMER system consists of two programs. The first of these, called build-imm, takes an input set of sequences and builds and outputs the IMM for them as described above. These sequences can be complete genes or just partial orfs. The second program, called glimmer, then uses this IMM to identify putative genes in an entire genome. Glimmer does not use sliding windows to score regions. Instead, it first identifies all orfs longer than some specified threshold value, and scores each one in all six reading frames. Those that score higher than a designated threshold in the correct reading frame are then selected for further processing. These selected orfs are then examined for overlaps. If two orfs in different reading frames overlap (by more than some designated minimum length), the overlapping region alone is scored separately. The overlap region's six reading frame scores are then compared with those of the two overlapping orfs to see which frame scores highest. In general, when a longer orf overlaps a shorter orf and the overlap region scores highest in the reading frame of the longer orf, then the shorter orf is eliminated as a gene candidate. The final output of the program is a list of putative gene coordinates in the genome, together with notations for each one that may have had a suspicious overlap with another gene candidate. These 'suspect' gene candidates (usually a very small percentage of the total) can then be examined manually to determine if they are in fact genes. Samples of GLIMMER outputs for the H.pylori genome are available on the GLIMMER web site at http://www.cs.jhu.edu/labs/compbio/glimmer.html, which also contains results for E.coli and H.influenzae. The GLIMMER system, including all source code, is freely available from this site.
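The first step of the glimmer program, collecting all orfs above a length threshold in the six reading frames, could be sketched as below. This is not the GLIMMER source: the choice of atg, gtg and ttg as start codons, the rule that an orf begins at the first start codon after the previous stop, and the coordinate convention are all assumptions of this illustration.

```python
STOPS = {"taa", "tag", "tga"}
STARTS = {"atg", "gtg", "ttg"}  # assumed set of start codons
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(genome, min_length=90):
    """Yield (strand, frame, start, end) for every orf of at least
    `min_length` bases, stop codon included.

    Coordinates are 0-based and end-exclusive on the strand being scanned;
    for strand '-' they refer to the reverse-complemented sequence.
    """
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            orf_start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if orf_start is None and codon in STARTS:
                    orf_start = pos          # first start after the last stop
                elif codon in STOPS:
                    if orf_start is not None and pos + 3 - orf_start >= min_length:
                        yield strand, frame, orf_start, pos + 3
                    orf_start = None


toy_genome = "cc" + "atg" + "gct" * 10 + "taa" + "g"
toy_orfs = list(find_orfs(toy_genome, min_length=30))
```

On the toy sequence the only orf found is on the forward strand in frame 2, running from the atg at position 2 through the taa stop codon. Scoring the orfs and resolving overlaps would then operate on this list.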

METHODS AND RESULTS

To evaluate the effectiveness of our IMM, we compared it to a conventional fixed-order model on data from the H.influenzae genome. As a second confirming test, we ran it on the recently sequenced H.pylori genome and did a careful comparison of the genes found by GLIMMER to those annotated in the public databases and to the genes found by the GeneMark system.

Comparison on H.influenzae

Haemophilus influenzae has many putative genes whose existence has not been confirmed biologically. For this experiment, we wanted to train GLIMMER using only genes that had a very high likelihood of being real; therefore, we chose for training a set of orfs that satisfy both of these criteria: (i) the orf is >500 bases long, which provides the basis for a statistical argument that the gene is highly likely to be a coding region, since orfs of this length almost never occur in non-coding DNA; (ii) the orf does not overlap any other orf longer than 500 bp. Using these criteria, we were able to collect 1168 orfs from the current version of H.influenzae (GenBank accession L42023), which contains 1717 annotated genes. Thirty-two of these did not match CDS entries, but we included them anyway. This gives us a completely automatic training procedure for GLIMMER, requiring no human intervention.
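The two selection criteria can be expressed as a short filter over orf coordinates. The sketch assumes that each orf is given as a (start, end) pair in genome coordinates, 0-based and end-exclusive, regardless of strand; the quadratic overlap check is chosen for clarity rather than speed.

```python
def select_training_orfs(orfs, min_length=500):
    """Keep orfs longer than `min_length` that overlap no other such orf."""
    long_orfs = sorted((s, e) for s, e in orfs if e - s > min_length)
    selected = []
    for i, (s, e) in enumerate(long_orfs):
        overlaps_other = any(
            s < e2 and s2 < e
            for j, (s2, e2) in enumerate(long_orfs)
            if j != i
        )
        if not overlaps_other:
            selected.append((s, e))
    return selected


candidates = [(0, 600), (550, 1200), (2000, 2700), (2650, 2800), (3000, 3100)]
training = select_training_orfs(candidates)
```

In this example the first two orfs are long but overlap each other, and the last two are too short, so only (2000, 2700) is retained; its overlap with the short orf (2650, 2800) does not disqualify it.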

This experiment compared GLIMMER's IMM to a conventional fixed-length Markov model on the H.influenzae genome data. We followed identical training protocols for both the IMM and a fixed-length 5th-order Markov model. [This 5th-order Markov model is the same model as that used by GeneMark (6). Because we did not have access to the GeneMark source code, we could not retrain that system on our data, so we implemented our own model based on published descriptions of GeneMark.] All post-processing to resolve overlaps was also identical for both methods. Thus the only difference was the model itself: in one case an interpolated Markov model, and in the other case a 5th-order Markov model. Note that we also implemented 4th and 6th-order Markov models, but the 5th-order model performed better than these. The results are shown in Table 1.

Table 1. Comparison of the IMM model used in GLIMMER to a 5th-order Markov model

| Model | Genes found | Genes missed | Additional genes |
|---|---|---|---|
| GLIMMER IMM | 1680 (97.8%) | 37 | 209 |
| 5th-Order Markov | 1574 (91.7%) | 143 | 104 |

The first column indicates how many of the 1717 annotated genes in H.influenzae were found by each algorithm. The 'additional genes' column shows how many extra genes, not included in the 1717 annotated entries, were called genes by each method.

Of the 37 genes missed by GLIMMER's IMM, only one was found by the 5th-order model. In contrast, the IMM found 107 genes that the 5th-order model missed. For this run, a pre-set threshold prevented both systems from finding genes shorter than 100 bp, and six of the 37 genes missed by GLIMMER were below this threshold. Of the remaining 31 genes, only one was longer than 500 bp. Finally, note that this was a completely 'self-trained' experiment in which database matches were not used for training; augmenting the training set with these additional genes will almost certainly improve performance further. Of the 209 additional genes called by the system, some can be eliminated from consideration by comparison with functional RNA sequences. The remainder may or may not be expressed genes, and further biological evidence is required to resolve these genes.
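The percentages in Table 1 and the figure of 107 genes follow from the counts given in the text, as the short check below shows. The variable names are ours; all numbers are taken from the table and the paragraph above.

```python
ANNOTATED = 1717  # annotated H.influenzae genes


def sensitivity(found, total=ANNOTATED):
    """Fraction of annotated genes that a method found."""
    return found / total


imm_missed, markov_missed = 37, 143
missed_by_imm_found_by_markov = 1

# Genes that neither method found.
missed_by_both = imm_missed - missed_by_imm_found_by_markov
# Genes the 5th-order model missed but the IMM found.
found_only_by_imm = markov_missed - missed_by_both

print(round(100 * sensitivity(ANNOTATED - imm_missed), 1),
      round(100 * sensitivity(ANNOTATED - markov_missed), 1),
      found_only_by_imm)
```

The two methods therefore share 36 missed genes, and the remaining 107 genes missed by the 5th-order model are exactly those that the IMM recovered.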

Gene finding accuracy on H.pylori

Finally, in a test designed to run the system as it will be used on new, complete genomes, we ran GLIMMER on the complete, recently sequenced genome of H.pylori (13), the bacterium that causes stomach ulcers. A training set of brute force orfs that were >500 nt was collected from the complete genome of H.pylori. (This training set was collected from the genome without reference to any annotation, exactly as it would be for a brand new sequence.) The resulting IMM model was then compared to the annotated set of genes identified for this organism. The 1590 genes annotated for Helicobacter were identified by integrating the following sets of information: (i) evaluating brute force orfs for protein-level sequence similarity matches to the public archives, (ii) predicting coding regions using the GeneMark system and (iii) collecting 'intergenic' orfs that were found between the genes with database matches and the genes called by GeneMark. We consider the H.pylori sequence annotation to have been intensively evaluated by the research community, and as yet, no unidentified genes have been reported since the H.pylori publication.

The annotated genes were compared to the results of the GLIMMER algorithm, and 1548 of the 1590 genes were found to have been correctly identified. An additional 314 potential orfs were found by the system in the H.pylori genome. Some of these additional genes can be eliminated by discarding those that conflict with ribosomal and transfer RNAs, but the remainder cannot be ruled out as authentic genes without further biological evidence. The set of 42 unidentified genes, representing a potential false negative rate of 2.6%, was examined further. Nineteen of these genes from the H.pylori annotation were under 100 nt in length, and possibly below the length for meaningful detection by compositional methods. Orfs that have matches to proteins in the current public archives serve as the most reliable and independent verification that an orf is an authentic gene; of these orfs, only seven were present in the 42 genes that GLIMMER did not identify. This suggests a minimal false negative rate of 0.44% for GLIMMER.

Note that for this experiment, GLIMMER used a minimum gene length of 90 bp; this length can be changed with a simple command line parameter. With a minimum gene length of 180 bp (60 amino acids), for example, GLIMMER calls 286 fewer genes in H.pylori.

Finally, we conducted a limited comparison to the GeneMark system (6). To keep the comparison simple, we only considered the 974 genes from H.pylori that had database matches to other organisms; these can safely be considered to be 'true' genes. GLIMMER was trained exclusively on orfs longer than 500 bp, with overlapping orfs simply discarded. Thus GLIMMER was completely self-trained for this test, with no human intervention. (This fully automatic training requires only a few minutes of computation time.) For the first comparison, we used the output of GeneMark as generated by the H.pylori project (13); the GeneMark version used in that study was from early 1997.

From the set of 974 genes, GLIMMER found 21 genes that GeneMark missed, while GeneMark found one gene that GLIMMER missed. Overall GLIMMER missed eight genes while GeneMark missed 28. The two systems agreed on 945 of the 974 genes. We then ran a second comparison, this time using GeneMarkHMM, the newest release of GeneMark. (For this experiment, GeneMarkHMM was trained using all orfs longer than 700 bp, and the genes were divided during training into 'typical' and 'atypical' classes.) GeneMarkHMM missed 23 of the genes from the list of database hits. GLIMMER found 15 of the genes that GeneMarkHMM missed, while GeneMarkHMM did not find any genes that GLIMMER missed. The two systems agreed on 951/974 (97.6%) of the genes.
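The agreement figures in both comparisons can be derived from the reported counts of missed genes. The helper below is a small bookkeeping sketch with hypothetical names; 'agreement' is taken to mean genes found by both systems, which is the reading that reproduces the numbers in the text.

```python
def compare(total, missed_a, missed_b, a_found_b_missed):
    """Split `total` genes into the four outcomes for methods A and B."""
    missed_by_both = missed_b - a_found_b_missed
    b_found_a_missed = missed_a - missed_by_both
    found_by_both = total - (a_found_b_missed + b_found_a_missed + missed_by_both)
    return {
        "found_by_both": found_by_both,
        "only_a": a_found_b_missed,
        "only_b": b_found_a_missed,
        "missed_by_both": missed_by_both,
    }


# A = GLIMMER, B = GeneMark (first comparison) or GeneMarkHMM (second).
first = compare(974, missed_a=8, missed_b=28, a_found_b_missed=21)
second = compare(974, missed_a=8, missed_b=23, a_found_b_missed=15)
```

The first comparison yields 945 genes found by both systems and one gene found only by GeneMark; the second yields 951 genes found by both and none found only by GeneMarkHMM.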

Note that the experiments described here all used a fully automatic training protocol, in which long orfs were identified by a program and then fed directly into GLIMMER. The system will perform even better if additional genes are included in the training set, and we expect that genome projects will include database matches to other organisms as part of training. Another simple method for improving performance is also available: the first set of genes identified by the system can be used as a new (larger) training set, and the system can be re-run repeatedly until it converges. This iterative algorithm will also be available as an option in the GLIMMER system.
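The iterative retraining idea can be outlined generically. In the sketch below, `train` and `predict` are hypothetical callables standing in for model building and gene calling; they are not GLIMMER functions, and the convergence test (the predicted gene set no longer changes) and the round limit are assumptions of this illustration.

```python
def iterate_training(initial_training_set, train, predict, max_rounds=10):
    """Retrain on the predicted genes until the prediction stops changing.

    train(genes)   -> model built from a set of genes
    predict(model) -> iterable of genes called by that model
    Returns (model, training_set) from the last completed round.
    """
    training = frozenset(initial_training_set)
    model = train(training)
    for _ in range(max_rounds):
        predicted = frozenset(predict(model))
        if predicted == training:
            break                    # converged: same genes as last round
        training = predicted
        model = train(training)
    return model, training


# Toy stand-ins: the "model" is just the training set size, and each
# round "finds" two more genes until a ceiling of five is reached.
toy_model, toy_genes = iterate_training(
    {0},
    train=lambda genes: len(genes),
    predict=lambda size: range(min(size + 2, 5)),
)
```

With the toy stand-ins the loop grows the gene set from one to three to five genes and then stops, because the next prediction is identical to the current training set.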

CONCLUSION

Evaluating the accuracy of a microbial gene finder is difficult, because the genes annotated in GenBank do not always have biological evidence to back up their existence. As the annotation becomes more stable, more accurate estimates of accuracy will be possible. At the same time, better gene finders should result because the available training data will improve. Although GLIMMER's sensitivity is nearing 100% already, there are several important areas of future improvements. One is to improve its specificity by reducing the number of false positives (after first confirming that the unannotated genes found by the system are in fact false). The number of false positives can already be reduced substantially, at the cost of slightly reducing sensitivity, by increasing the minimum orf length that GLIMMER will consider as a gene. Another is to incorporate separate pattern analysis algorithms that will allow the system to find promoters, enhancers, terminators and other signals that occur in intergenic regions. Accurate location of these signals is an important problem in its own right, and a system that integrates the content scoring approach of GLIMMER with a good signal identification algorithm should produce better results than either approach could independently.

ACKNOWLEDGEMENTS

Thanks to Mark Borodovsky and Alexander Lukashin for kindly sharing the results of GeneMarkHMM on the H.pylori genome. S.L.S. is supported by the National Human Genome Research Institute at NIH under Grant No. K01-HG00022-1. S.L.S. and A.L.D. are supported by the National Science Foundation under Grant No. IRI-9530462. S.K. is supported by NSF IRI-9529227. O.W. is supported by the Department of Energy Grant No. DE-FC02-95ER61962.A003.

REFERENCES

1. Fleischmann,R.D., Adams,M., White,O., Clayton,R., Kirkness,R., Kerlavage,A., Bult,C., Tomb,J.-F., Dougherty,B., Merrick,J. et al. (1995) Science, 269, 496-512.

2. Strauss,E.J. and Falkow,S. (1997) Science, 276, 707-712.

3. Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990) J. Mol. Biol., 215, 403-410.

4. Pearson,W.R. (1995) Protein Sci., 4, 1145-1160.

5. Borodovsky,M. and McIninch,J.D. (1993) Comp. Chem., 17, 123-133.

6. Borodovsky,M., McIninch,J., Koonin,E., Rudd,K., Medigue,C. and Danchin,A. (1995) Nucleic Acids Res., 23, 3554-3562.

7. Ron,D., Singer,Y. and Tishby,N. (1996) Machine Learning, 25, 117-149.

8. Rissanen,J. (1983) IEEE Transactions on Information Theory, 29, 656-664.

9. Ristad,E. and Thomas,R. (1997) Nonuniform Markov models. In International Conference on Acoustics, Speech and Signal Processing, Munich, Germany.

10. Jelinek,F. and Mercer,R.L. (1980) In Gelsema,E.S. and Kanal,L.N. (Eds.), Pattern Recognition in Practice. Elsevier, North Holland, NY, USA. pp. 381-397.

11. Roberts,M.G. (1982) Local Order Estimating Markovian Analysis for Noiseless Source Coding and Authorship Identification. Ph.D. thesis, Stanford University, Stanford, CA.

12. Williams,R.N. (1991) Adaptive Data Compression. Kluwer Academic Publishers, Boston, MA.

13. Tomb,J.-F., White,O., Kerlavage,A.R., Clayton,R., Sutton,G., Fleischmann,R., Ketchum,K., Klenk,H., Gill,S., Dougherty,B. et al. (1997) Nature, 388, 539-547.

14. Fraser,C.M., Casjens,S., Huang,W., Sutton,G., Clayton,R., Lathigra,R., White,O., Ketchum,K., Dodson,R., Hickey,E. et al. (1997) Nature, 390, 680-686.


*To whom correspondence should be addressed. Tel: +1 301 315 2537; Fax: +1 301 838 0208; Email: salzberg@tigr.org, salzberg@cs.jhu.edu


Copyright© Oxford University Press, 1998.
