ABSTRACT
The number of completely sequenced bacterial genomes has been growing fast. Computer methods for finding genes are available, yet there is still a need for more accurate algorithms. The GeneMark.hmm algorithm presented here was designed to improve gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into a naturally derived hidden Markov model framework, with gene boundaries modeled as transitions between hidden states. We also used a specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets, including 10 complete bacterial genomes. The new algorithm proved significantly more accurate than GeneMark in exact gene prediction. Interestingly, high gene finding accuracy was observed even when Markov models of order zero, one and two were used. We present an analysis of false positive and false negative predictions, with the caution that these categories are not precisely defined if the public database annotation is used as a control.
In `post-genomic' molecular biology, the computer has become the major tool for interpreting DNA and protein sequence information. By the end of 1997, 10 complete bacterial genomes were available from the GenBank database: Haemophilus influenzae (1), Mycoplasma genitalium (2), Methanococcus jannaschii (3), Mycoplasma pneumoniae (4), Synechocystis PCC6803 (5), Escherichia coli (6), Helicobacter pylori (7), Methanobacterium thermoautotrophicum (8), Bacillus subtilis (9) and Archaeoglobus fulgidus (10). The majority of genes in these genomes were annotated using theoretical (computer-derived) rather than experimental evidence. With many more genomes to come in the near future, methods of highly accurate DNA sequence interpretation, particularly gene finding, become increasingly important. Here we present a new method, GeneMark.hmm, for gene finding in bacterial genomes. The previously developed GeneMark program (11), which has been used in practice (1-6,9-10), identifies a gene mainly as the ORF (open reading frame) in which the gene resides. However, the 5' boundary of the gene (the translation initiation codon associated with the protein N-terminus) might not be precisely predicted. The uncertainty in the initiation codon position is of the order of the size of the GeneMark sliding window, i.e. ~100 nucleotides (nt). As a palliative, GeneMark indicates several possible start codons and scores them (http://intron.biology.gatech.edu/GeneMark). However, exact prediction of the N-terminus is important for further functional analysis of a putative protein and, eventually, for correct annotation of thousands of genes in growing databases. Our goal was therefore to develop an algorithm with a high accuracy of exact gene prediction.
Gene annotation in bacterial DNA defines a functional role for each nucleotide in the sequence. For a DNA sequence designated as S = {b1, b2, ..., bL}, where bi stands for the nucleotide symbol T, C, A or G and L is the sequence length, the functional role of each nucleotide can be indicated by a `functional' sequence A = {a1, a2, ..., aL}. Here each ai takes the integer value `0' if nucleotide bi is part of a non-coding region, the value `1' if bi is part of a gene residing in the direct DNA strand, and the value `2' if bi is involved in encoding a protein in the complementary DNA strand. The aim of gene finding is to determine the `true' functional sequence A for the anonymous DNA sequence S. Statistical patterns of nucleotide ordering specific for DNA sequences that carry (or do not carry) the genetic code have been used in gene finding algorithms since the 1980s (see ref. 12 for review). In GeneMark, for instance, these patterns were quantified and converted into parameters of Markov chain models (11). A general pattern recognition algorithm should be able to compute the probability that a particular functional sequence A underlies a given sequence S, P(A|S) = P(a1, a2, ..., aL | b1, b2, ..., bL). The core GeneMark.hmm procedure computes the P(A|S) value and, eventually, defines the functional sequence A* having the largest value P(A*|S) among all possible A. The functional sequence A*, the output of the algorithm, describes the most likely annotation of the DNA sequence S.
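The labeling scheme can be made concrete with a short sketch. The helper name `functional_sequence`, the input format and the gene coordinates are hypothetical and serve only as an illustration; coordinates are assumed to be 1-based and inclusive.

```python
def functional_sequence(length, genes):
    """Build the functional sequence A for a DNA sequence of the given length.

    genes: iterable of (start, end, strand) tuples, 1-based inclusive,
    with strand '+' (direct) or '-' (complementary). Hypothetical format.
    Labels: 0 = non-coding, 1 = gene in the direct strand,
    2 = gene in the complementary strand.
    """
    labels = [0] * length
    for start, end, strand in genes:
        if not 1 <= start <= end <= length:
            raise ValueError("gene coordinates out of range")
        code = 1 if strand == "+" else 2
        for pos in range(start - 1, end):
            if labels[pos] != 0:
                # One label per nucleotide cannot describe overlapping genes.
                raise ValueError("overlapping genes cannot be encoded")
            labels[pos] = code
    return labels


a = functional_sequence(12, [(2, 4, "+"), (8, 10, "-")])
print("".join(map(str, a)))  # 011100022200
```

Because each nucleotide carries a single label, this representation cannot describe overlapping genes, which is why the sketch rejects them.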
The problem of computing and maximizing P(A|S) is considered in terms of hidden Markov models (HMM), a technique that has been successfully applied in speech recognition (see ref. 13 for review). Applications of HMM theory to DNA and protein sequence analysis have also been described by several groups (14-21). The ECOPARSE algorithm developed by Krogh et al. (17) was the first HMM-based gene-finding algorithm intended specifically for the E.coli genome. The performance of GeneMark and GeneMark.hmm has been compared with that of ECOPARSE (see below).
The HMM framework of GeneMark.hmm, i.e. the logic of transitions between hidden Markov states, follows the logic of the genetic structure of the bacterial genome (Fig. 1). The Markov models of coding and non-coding regions were incorporated into the HMM framework to generate stretches of DNA sequence with coding or non-coding statistical patterns. This type of HMM architecture is known as `HMM with duration' (13). The sequence of hidden states associated with a given DNA sequence S carries information on the positions where the coding function switches to non-coding and vice versa. Thus, the previously introduced functional sequence A becomes equivalent to the sequence of hidden states, called the HMM trajectory. Since the nucleotide sequence S is given, every possible sequence A can be assessed by the value of P(A|S), the conditional probability of A given S. This evaluation made use of the whole set of statistical models (see Materials and Methods). The core GeneMark.hmm procedure is the Viterbi algorithm (13), which finds the sequence A*. However, this core procedure did not take into account the possibility of gene overlaps, since the observed overlaps, though frequent, were not extensive enough to provide sufficient data for deriving statistical models of overlapping genes in several possible orientations. To further improve the prediction of the translation start position, a model of the ribosome binding site (RBS) was derived. This model was used to refine the translation initiation codon prediction at the post-processing step.
We used DNA sequences of the complete genomes of H.influenzae (GenBank accession no. L42023), M.genitalium (L43967), M.jannaschii (L77117), M.pneumoniae (U00089), Synechocystis PCC6803 (synecho), E.coli (U00096), H.pylori (AE000511), M.thermoautotrophicum (AE000666), B.subtilis (AL009126) and Archaeoglobus fulgidus (AE000782). The data on annotated E.coli RBS were provided by W. Hayes (22). The data on experimentally verified N-terminal protein sequences were kindly provided by A. Link (23). The Markov model parameters were obtained from the GeneMark library (http://exon.biology.gatech.edu/~genmark/matrices/).
The architecture of the hidden Markov model used in the GeneMark.hmm algorithm is shown in Figure 1. To deal simultaneously with the direct and reverse DNA strands, as was done in the initial GeneMark algorithm (11), nine hidden states were defined. These states correspond to the functional units of bacterial genomes, namely: (i) a Typical gene in the direct strand, (ii) a Typical gene in the reverse strand, (iii) an Atypical gene in the direct strand, (iv) an Atypical gene in the reverse strand, (v) a non-coding (intergenic) region, (vi/vii) start/stop codons in the direct strand, and (viii/ix) start/stop codons in the reverse strand. It should be mentioned that this HMM does not account for gene overlap (see below). The models of Typical and Atypical genes were derived from sets of protein-coding DNA sequences obtained by clustering the whole set of genes from the genome of a given species (22). The names `Typical' and `Atypical' were chosen for the following reason. For the E.coli genome it was shown that the majority of genes belong to the cluster of Typical genes, while many genes believed to have been horizontally transferred into the E.coli genome fall into the cluster of Atypical genes. Note that comprehensive accounts of the evolutionary classification of E.coli genes have been presented earlier (24,25).
An important feature of the proposed HMM architecture is that any hidden state, coding or non-coding, is allowed to generate a stretch of the observed nucleotide sequence whose length equals the duration of the hidden state (13). Such an explicit state duration HMM was used previously in the algorithms Genie and GENSCAN (18,20). The crucial point, however, is that an observed DNA sequence S = {b1, b2, ..., bL} is thought to be generated by an HMM such as the one depicted in Figure 1, in parallel with the HMM transitions from one hidden state to another. The hidden state trajectory A, one of a variety of allowed paths, can be concisely represented as a sequence of M hidden states ai having durations di: A = {(a1d1)(a2d2) ... (aMdM)}, d1 + d2 + ... + dM = L. For a given sequence of observed states (nucleotides) S = {b1, b2, ..., bL}, the optimal trajectory of hidden (functional) states A* is defined as the trajectory (functional sequence) A with the maximal value of the conditional probability P(A|S). Therefore, a computer optimization procedure is supposed to find the maximum likelihood sequence A* that, according to its physical meaning, defines the predicted locations of protein coding regions in the nucleotide sequence S.
The problem formulated above is equivalent to a problem of finding the trajectory A* = {(a1*d1*)(a2*d2*) ... (aM*dM*)} that has the largest probability of occurring simultaneously with the sequence S in comparison with all other possible trajectories:
To describe the optimization algorithm we introduce the quantity (11):
Equations 3-6 present the Viterbi algorithm, which finds the most likely trajectory A* for the given (observed) nucleotide sequence S. This algorithm is an extension of the Viterbi algorithm described by Rabiner (13) to the case of an HMM with variable duration of hidden states. The equations for the straightforward initialization and backtracking procedures are not shown.
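A generic explicit-duration Viterbi recursion of the kind described here can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function `duration_viterbi`, its argument layout, the two-state toy model (`N` for non-coding, `C` for coding), the zero-order emission tables and the uniform duration distribution are all hypothetical. For every prefix length and every state, the recursion keeps the best log-probability of a trajectory whose last block is in that state, maximizing over the block duration and the preceding state.

```python
import math

NEG_INF = float("-inf")


def duration_viterbi(seq, states, log_init, log_trans, log_dur, log_emit, max_dur):
    """Return (path, log_prob): the most likely list of (state, duration) blocks.

    log_init[a]      log initial probability of state a
    log_trans[p][a]  log probability of the transition p -> a (absent = forbidden)
    log_dur(a, d)    log probability that state a lasts d symbols
    log_emit(a, s)   log probability that state a generates the substring s
    """
    n = len(seq)
    # best[l][a]: best log-probability of seq[:l] when a block of state a ends at l
    best = [dict.fromkeys(states, NEG_INF) for _ in range(n + 1)]
    back = [dict.fromkeys(states) for _ in range(n + 1)]
    for l in range(1, n + 1):
        for a in states:
            for d in range(1, min(max_dur, l) + 1):
                block = log_dur(a, d) + log_emit(a, seq[l - d:l])
                if d == l:  # first block of the trajectory
                    score, prev = log_init.get(a, NEG_INF), None
                else:       # best predecessor block ending at position l - d
                    score, prev = max(
                        (best[l - d][p] + log_trans.get(p, {}).get(a, NEG_INF), p)
                        for p in states
                    )
                score += block
                if score > best[l][a]:
                    best[l][a], back[l][a] = score, (d, prev)
    state = max(states, key=lambda a: best[n][a])
    log_prob = best[n][state]
    if log_prob == NEG_INF:
        raise ValueError("no admissible trajectory for this sequence")
    # Backtracking: follow the stored (duration, previous state) pointers.
    path, l = [], n
    while l > 0:
        d, prev = back[l][state]
        path.append((state, d))
        l, state = l - d, prev
    path.reverse()
    return path, log_prob


# Toy model: AT-rich non-coding state 'N', GC-rich coding state 'C'.
EMIT = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}
MAX_DUR = 8
path, log_prob = duration_viterbi(
    "ATATGCGCGCATAT",
    states=["N", "C"],
    log_init={"N": math.log(0.5), "C": math.log(0.5)},
    log_trans={"N": {"C": 0.0}, "C": {"N": 0.0}},
    log_dur=lambda a, d: -math.log(MAX_DUR),   # uniform over 1..MAX_DUR
    log_emit=lambda a, s: sum(math.log(EMIT[a][b]) for b in s),
    max_dur=MAX_DUR,
)
print(path)  # [('N', 4), ('C', 6), ('N', 4)]
```

Working in log space avoids numerical underflow on long sequences. The cost grows with the sequence length, the number of states squared and the maximal duration, so a practical implementation would restrict the admissible block boundaries, for example to start and stop codon positions.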
The mechanism described above, in which a nucleotide sequence S is generated by a variable duration HMM, can naturally use the Markov models of coding and non-coding DNA sequences. These models had already been defined in the GeneMark algorithm (11). Therefore, the time-consuming and cumbersome procedure of HMM training was largely avoided. For instance, given a hidden state `1' corresponding to a coding region, the probability P1(b1, b2, ..., bd) of observing a particular DNA sequence {b1, b2, ..., bd} as part of a coding region was calculated using the three-periodic inhomogeneous Markov chain model (11). For the non-coding state `0', the probability P0(b1, b2, ..., bd) of observing the sequence {b1, b2, ..., bd} as part of a non-coding region was calculated using the homogeneous Markov model (11). The probability pa(d) that a state a has duration d was defined by an analytical approximation of the frequency distribution of the lengths of coding (non-coding) regions in the E.coli genome (Fig. 2). As seen in Figure 1, the only allowed transitions between hidden states were `non-coding' -> `direct start' -> `direct coding' -> `direct stop' -> `non-coding', as well as `non-coding' -> `reverse stop' -> `reverse coding' -> `reverse start' -> `non-coding'. Therefore, just a few additional parameters had to be specified: the probabilities of possible start codons and the initial and transition probabilities for hidden states. Initial probabilities for the four coding states and the one non-coding state were set to 0.2. Initial probabilities for start/stop states were set to zero. The probabilities of the start codons were defined in agreement with the E.coli genome statistics: P(ATG) = 0.905, P(GTG) = 0.090, P(TTG) = 0.005. The probability of transition from a non-coding state to a Typical (Atypical) coding state was set to 0.85 (0.15). These values are the estimates of the frequencies of `native' (`foreign') genes in the E.coli genome suggested by Medigue et al. (24) and Lawrence (25).
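The difference between the three-periodic and the homogeneous Markov models can be illustrated with a small sketch. It assumes first-order chains for brevity, and all parameter values (`uniform`, `g_rich`) are invented for the example; they are not taken from the GeneMark library. The sequence is assumed to start at the first codon position.

```python
import math

NUCLEOTIDES = "ACGT"


def log_prob_periodic(seq, init, trans):
    """Log-probability of seq under a first-order Markov chain whose transition
    matrix cycles with period len(trans): three matrices give a three-periodic
    (coding) model, a single matrix gives a homogeneous (non-coding) model."""
    if not seq:
        return 0.0
    log_p = math.log(init[seq[0]])
    for i in range(1, len(seq)):
        matrix = trans[i % len(trans)]  # matrix for the codon position of b_i
        log_p += math.log(matrix[seq[i - 1]][seq[i]])
    return log_p


def constant_rows(row):
    """Transition matrix in which every previous nucleotide has the same row."""
    return {prev: dict(row) for prev in NUCLEOTIDES}


# Hypothetical parameters: G is favoured at the third codon position only.
uniform = {b: 0.25 for b in NUCLEOTIDES}
g_rich = {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2}
coding = [constant_rows(uniform), constant_rows(uniform), constant_rows(g_rich)]
noncoding = [constant_rows(uniform)]

seq = "ATGGCG"
log_odds = (log_prob_periodic(seq, uniform, coding)
            - log_prob_periodic(seq, uniform, noncoding))
print(round(log_odds, 3))  # 0.94
```

A positive log-odds value means the stretch is better explained by the coding model. Within the HMM such block probabilities play the role of P1 and P0.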
As follows from the described HMM architecture (Fig. 1), the optimal sequence A* found by the Viterbi algorithm has predicted genes separated from one another by an intergenic region at least 1 nt long. Therefore, an actual overlap of two genes will prevent finding the exact location of at least one of them. Initially, we considered an overlap of bacterial genes to be an unlikely event. However, when a larger body of complete genomic sequences became available, we found that at least short overlaps are quite common in bacterial genomes (see below). Obviously, the Viterbi algorithm tends to predict genes involved in overlaps shorter than they really are. Therefore, we used a post-processing procedure, searching for a ribosome binding site (RBS), to refine the initial Viterbi predictions. For a predicted gene, an RBS was searched for in the interval from -19 to -4 nt relative to each alternative start codon located between the position of the start codon suggested by the Viterbi algorithm and the position of the start codon producing the longest open reading frame (ORF) for the predicted gene. The initially predicted translation initiation position was redefined if the score of one of the RBS candidates associated with an admitted alternative start exceeded a certain threshold (see below). Otherwise, the position suggested by the Viterbi algorithm was accepted.
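The candidate enumeration in this post-processing step can be sketched as follows, for the direct strand only. The function names, the 0-based indexing and the convention that position -1 is the nucleotide immediately upstream of the start codon are assumptions made for the illustration.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def alternative_starts(seq, viterbi_start):
    """0-based positions of candidate start codons: the Viterbi start itself
    plus every in-frame start codon upstream of it, scanning in the 5'
    direction until an in-frame stop codon (or the sequence start) is met.
    The last candidate found is the start of the longest ORF."""
    starts = [viterbi_start]
    pos = viterbi_start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            starts.append(pos)
        pos -= 3
    return starts


def rbs_window(seq, start):
    """Upstream region from -19 to -4 relative to the start codon (16 nt),
    clipped at the beginning of the sequence."""
    return seq[max(0, start - 19):max(0, start - 3)]


seq = "TAA" + "ATG" + "AAA" + "GTG" + "CCC" + "TAA"
print(alternative_starts(seq, 9))  # [9, 3]
```

Each candidate returned by `alternative_starts` would then have its `rbs_window` scored against the RBS model, and the Viterbi start would be replaced only if a candidate's RBS score exceeded the threshold.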
The probabilistic model for the RBS was derived as follows. First, the E.coli records in GenBank with annotated RBSs were analyzed, and 325 genes with known RBSs were selected from the complete E.coli genome (6). Second, from each of these 325 sequences, the 16 nt sequence preceding the annotated start (from -4 to -19) was collected. Third, these 325 short sequences were subjected to the multiple alignment procedure performed by the simulated annealing algorithm (26). Specifically, we have chosen a fixed size window, w, and searched for the best alignment by maximizing a matching score
Here nb(k) is the number of symbols b (b = T, C, A, G) in position (column) k of the window alignment. In each step of the iterative simulated annealing procedure, one of the 325 sequences, chosen at random, was shifted to the right or to the left relative to the fixed window by a randomly chosen number of positions (with no gaps, deletions or insertions). The matching score R* for the resulting alignment was calculated (equation 7). If R* was larger than R, the new alignment was unconditionally accepted and used as the starting point for the next iterative step. Otherwise, the new alignment was accepted with probability exp[-(R - R*)/T], where the parameter T can be interpreted as the `temperature' in the annealing procedure. We used the standard exponential cooling schedule T_{n+1} = c T_n, where c = 0.999999. The window size was chosen to be w = 5.
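The acceptance rule and the cooling schedule can be sketched in a few lines. The matching score itself (equation 7) is taken as given and is not reproduced; the function names and the initial temperature `t0` are assumptions, since the text does not state a starting temperature.

```python
import math
import random


def accept(r_old, r_new, temperature, rng=random):
    """Metropolis-style acceptance for score maximization: a better alignment
    is always accepted, a worse one with probability exp(-(R - R*) / T)."""
    if r_new > r_old:
        return True
    return rng.random() < math.exp(-(r_old - r_new) / temperature)


def temperature_at(step, t0, c=0.999999):
    """Exponential cooling schedule T_{n+1} = c * T_n, starting from t0."""
    return t0 * c ** step


print(accept(1.0, 2.0, temperature=0.5))   # True: the score improved
print(accept(2.0, 1.0, temperature=1e-9))  # False: effectively frozen
```

With c this close to 1 the temperature decays slowly, so a very large number of iterations is needed before worse alignments stop being accepted.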
Table 1.
The final alignment of the 325 sequences revealed the RBS sequence pattern in the form of a matrix of positional nucleotide frequencies (Table 1). The matrix defines a strong consensus sequence, AGGAG, which is complementary to a pentamer located near the 3'-end of the E.coli 16S rRNA. This observation is in good agreement with the generally accepted mechanism of ribosome-mRNA binding. Note that a similar result was obtained previously (27). To evaluate a putative RBS we calculated its probabilistic score as the product of the corresponding elements of the matrix given in Table 1. The threshold value for the RBS score was chosen as 0.00025. It can be shown that the logarithm of this score is proportional to the ribosome binding energy (with appropriate sign) under the assumption of independent formation of ribonucleotide pairs.
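As a worked example, the score of a candidate site can be computed directly from the Table 1 frequencies. The matrix values and the threshold are those given in the paper; the function names and the way a 16 nt upstream region is scanned (all 5 nt windows, best score kept) are assumptions made for the sketch.

```python
# Positional nucleotide frequencies from Table 1 (positions 1..5).
RBS_MATRIX = {
    "T": (0.161, 0.050, 0.012, 0.071, 0.115),
    "C": (0.077, 0.037, 0.012, 0.025, 0.046),
    "A": (0.681, 0.105, 0.015, 0.861, 0.164),
    "G": (0.077, 0.808, 0.960, 0.043, 0.659),
}
RBS_THRESHOLD = 0.00025


def rbs_score(site):
    """Product of the matrix elements selected by a 5 nt candidate site."""
    if len(site) != 5:
        raise ValueError("an RBS candidate must be 5 nt long")
    score = 1.0
    for k, base in enumerate(site):
        score *= RBS_MATRIX[base][k]
    return score


def best_rbs(upstream):
    """Best (score, offset) over all 5 nt windows of the upstream region."""
    return max((rbs_score(upstream[i:i + 5]), i)
               for i in range(len(upstream) - 4))


print(round(rbs_score("AGGAG"), 4))        # 0.2997
print(rbs_score("TTTTT") > RBS_THRESHOLD)  # False
print(best_rbs("TTTTAGGAGTTTTTTT")[1])     # 4
```

The consensus AGGAG scores about 0.30, three orders of magnitude above the threshold, whereas a poly-T window scores below 1e-6.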
The GeneMark.hmm predictions were obtained for nine other bacterial genomes. In these computations we used species-specific Markov models of coding and non-coding regions. All other parameters of the GeneMark.hmm algorithm stayed the same as defined for the E.coli genome. It is worth mentioning that for the Gram-positive bacterium B.subtilis we slightly modified the RBS prediction procedure. In species such as B.subtilis, which lack the ribosomal protein S1 involved in initiation of the ribosome-mRNA complex, the elevated strength of ribosome binding sites is thought to be a compensatory mechanism that facilitates ribosome binding. For B.subtilis, the alignment procedure described above produced a highly biased frequency pattern with a strong RBS consensus. To obtain reasonable agreement between the predicted and the annotated initiation codons of B.subtilis genes, we had to admit to the competition not only the alternative start codons located upstream of the Viterbi prediction of the translation start, but also those located downstream, up to a distance of 66 nt. We think that this rule could be applicable to all other genomes, but at present there is a tendency in the genome annotation process to prefer longer ORFs to shorter ones unless there is convincing evidence in favor of the shorter one. Statistically, this tendency is well justified, since actual genes are expected to occupy the longest ORF in about 75% of cases. This figure can be obtained as follows. Consider the set of four codons, ATG, TAA, TAG and TGA, and an intergenic region situated upstream of the true initiation codon of a gene X. Read codons in the 5' direction, in the same reading frame as the initiation codon, until the first codon from the above set is met. If this codon is ATG, then gene X does not occupy the longest ORF. Otherwise gene X does occupy the longest ORF, which happens in 75% of cases, assuming that the four codons specified above occur with equal frequencies and that ATG is the only possible initiation codon. In B.subtilis the presence of a strong RBS provided a good reason to override the `longest ORF' annotation rule, and shorter ORFs were annotated more frequently in B.subtilis than in other bacterial genomes.
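The counting argument can be written out explicitly. The symbol $c$, denoting the first codon from the set {ATG, TAA, TAG, TGA} met when reading upstream in frame, is introduced here for illustration only; the four codons are assumed equally likely, as stated in the text.

```latex
\[
P(\text{gene X occupies the longest ORF})
  = P\bigl(c \in \{\mathrm{TAA}, \mathrm{TAG}, \mathrm{TGA}\}\bigr)
  = \frac{3}{4} = 0.75,
\qquad
P(c = \mathrm{ATG}) = \frac{1}{4} = 0.25 .
\]
```

Admitting GTG and TTG as possible initiation codons, or unequal codon frequencies, would change this estimate.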
The performance of the GeneMark.hmm program was tested using several control sets including 10 complete bacterial genomes. Our focus was on the E.coli genome. The complete genomic sequence of E.coli consists of 4 639 221 nt with 4288 genes annotated (6). When the GeneMark.hmm program was applied to the E.coli genomic sequence, as many as 4440 genes were identified. Each predicted gene was also characterized as Typical or Atypical (22) depending on the type of the underlying coding (hidden) state. Twenty percent of the predicted genes were identified as Atypical ones. The gene finding accuracy was evaluated using four control sets of genes annotated in the E.coli genome (Table 2). Control set #1 contained all annotated E.coli genes. Set #2 was compiled from non-overlapping E.coli genes. The E.coli genes whose RBS were annotated in GenBank constituted set #3. The genes coding for proteins with experimentally verified N-termini (23) were included in set #4.
Table 2.
The evaluation results (Table 2) show that the Viterbi algorithm alone (VA) was able to predict exactly 58% of the E.coli genes in set #1. Gene overlap does seem to be an important factor, since the percentage of exact gene predictions jumped to 71% when the overlapping genes were eliminated (set #2). It is worth mentioning that both the 58% and the 71% figures may not be consistent estimates of the algorithm's real performance, since the majority of annotated translation initiation codons in control sets #1 and #2 were not verified experimentally. In control sets #3 and #4, the Viterbi algorithm exactly predicted 78 and 76.5% of the genes, respectively. These two close figures give a more realistic estimate of the predictive power of the Viterbi algorithm for genes with no overlaps.
The percentage of the E.coli genes predicted either exactly or with misplaced translation starts was 95, 98, 98 and 99.5% for sets #1, #2, #3 and #4, respectively. These figures did not change when the RBS prediction was combined with the Viterbi prediction at the post-processing step (PP in Table 2). However, for many genes that the Viterbi algorithm had initially predicted only partially, the correct position of the translation start was found. The fraction of exact predictions increased from 58 to 75% for set #1, from 71 to 80% for set #2, from 78 to 89% for set #3 and from 76.5 to 87.5% for set #4. One may conclude that the RBS correction increases the percentage of exactly predicted genes by ~10 percentage points under non-overlap conditions. Also, it appears from the results of program testing on set #1 that gene overlaps were responsible for ~10% of non-exact predictions.
A gene annotated in GenBank was counted as `missing' in predictions if neither its 5' nor 3' boundary was precisely found by the algorithm (even if there was some overlap between annotated and predicted genes). The GeneMark.hmm algorithm missed 213 out of the 4288 annotated E.coli genes (set #1 in Table 2). Some of these genes, 113 out of 213, had a length exceeding 300 nt. In fact, the majority of these 113 genes overlapped with genes located in the opposite strand (the `stop near stop' overlap). This fact, along with the observation that the percentage of missing genes in sets #2, #3 and #4 is lower than in test set #1, explains why these relatively long genes were missing. If an overlap occurs, the stop codons of the two genes fall into the region of overlap, and, consequently, at least one stop codon is overlooked by the algorithm. This means that a local `mishap' such as just the four nucleotide overlap between two genes (i.e. TTAA, TTAG, CTAA, CTAG) makes the Viterbi algorithm lose the whole gene. Note that many overlapping genes are not likely to be missed by the GeneMark program. Its `voting' mechanism accounts for detection of the coding potential within a number of windows covering a given ORF, thus suppressing the fluctuations that might affect just a few windows.
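The evaluation categories used here can be expressed as a short comparison routine. The tuple format and the function name `classify` are hypothetical; following the definition above, a gene counts as missing only if neither of its boundaries is matched, so any single-boundary match is labeled `partial` (in Table 2 this category is reported as `only 3'-end prediction`).

```python
from collections import Counter


def classify(annotated, predicted):
    """Count annotated genes found exactly, partially, or not at all.

    Genes are (five_prime, three_prime, strand) tuples; hypothetical format.
    """
    exact = set(predicted)
    five_ends = {(five, strand) for five, _, strand in predicted}
    three_ends = {(three, strand) for _, three, strand in predicted}
    labels = []
    for five, three, strand in annotated:
        if (five, three, strand) in exact:
            labels.append("exact")
        elif (three, strand) in three_ends or (five, strand) in five_ends:
            labels.append("partial")
        else:
            labels.append("missing")
    return Counter(labels)


annotated = [(100, 400, "+"), (500, 800, "+"), (1200, 900, "-")]
predicted = [(100, 400, "+"), (530, 800, "+"), (2000, 2300, "+")]
counts = classify(annotated, predicted)
print(counts["exact"], counts["partial"], counts["missing"])  # 1 1 1
```

In this toy example the third prediction matches no annotated boundary, so it would be counted among the `wrong` (or possibly `new`) predictions.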
Among 4440 genes predicted by the GeneMark.hmm program in the E.coli genome, there were 363 genes with neither the 5'-end nor the 3'-end matched to any annotated gene. Some of these predictions, 231 out of 363, were located in the regions annotated as non-coding and these 231 predictions might be classified as `wrong' or `new'. Thirteen of these predictions had a length larger than 300 nt. The protein products of these putative genes were searched for similarity against the non-redundant protein sequence database using the gapped BLAST (28). Four putative proteins were found to have significant similarity with hypothetical proteins previously identified in other species (Table 3). This analysis indicates once again that genome annotations in public databases are not perfect. Some real genes still may go unnoticed while some already annotated may not be functional. At any rate, `false positive' gene predictions need much further analysis before they are sorted out as wrong ones. Therefore, the exact fractions of wrong predictions as well as the fractions of predicted new genes remain to be determined.
Table 3.
We have compared the performance of the GeneMark.hmm program with the GeneMark program (11) and with the ECOPARSE program (17). The ECOPARSE algorithm differs from GeneMark and GeneMark.hmm, particularly, in analyzing DNA strands in turn, one after another, while GeneMark and GeneMark.hmm deal with both strands simultaneously. The test set for this comparison included five E.coli DNA contigs of 30 000 nt length each (the maximum possible length for the ECOPARSE e-mail server input sequence as of June, 1997). The predictions for each DNA contig were obtained by each of the three algorithms (including post-processing cycles) and compared with the GenBank annotation (6). The results (Table 4) indicate that the GeneMark.hmm program was more accurate in exact predictions: 71 versus 62% by GeneMark and 53% by ECOPARSE. It is worth mentioning that the current versions of GeneMark and ECOPARSE use RBS models as well. The GeneMark.hmm program also had the least number of missing genes and the highest percentage of annotated genes found exactly or partially (Table 4). Particularly, the genes thrL, yacG, cspE and ydiE missed by GeneMark were detected by GeneMark.hmm.
Table 4.
The GeneMark.hmm performance may depend on the choice of the algorithm parameters. The robustness of the algorithm was tested with regard to the values of the Markov models' transition probabilities. The GeneMark.hmm predictions for E.coli were recalculated using the transition probability matrices obtained by training on an alternative set of E.coli genes (22). The prediction versus annotation comparisons were close to those shown in Table 2. For example, the number of set #1 genes exactly predicted (with post-processing) was equal to 3088 compared to 3233 shown in Table 2. A 20% variation of other algorithm parameters had changed the overall performance even less noticeably (data not shown).
The GeneMark.hmm predictions obtained for nine other bacterial genomes were compared with the GenBank annotations; the results are shown in Table 5. On average, the program found the exact locations of 78.1% of annotated genes. For 94.6% of annotated genes the reading frame was predicted correctly, although for some of them the predicted initiation codon did not coincide with the annotated one. The average percentage of missing genes was 5.4%. For a particular genome, the frequency of missed genes was strongly correlated with the frequency of gene overlaps. The largest frequencies of overlap were observed in A.fulgidus (61% of all annotated genes had overlaps), M.genitalium (59%) and M.pneumoniae (51%), while the smallest were found in B.subtilis (24%), H.influenzae (27%) and M.jannaschii (29%). The average percentage of false positive predictions, 10%, is relatively high, but how many of these predictions are actually correct remains to be established by further analysis. We did not use any filters for false positives. Even a restriction on the minimum length of a predicted gene was not applied, since the genomic sequence may still contain small pieces of frameshifted genes. In fact, of the 382 gene predictions that had no annotated counterparts in the A.fulgidus genome, 42 had already been confirmed as real genes, and their protein products had been included in the protein sequence database prior to our study. Using gapped BLAST, significant similarities of predicted protein products to known proteins from species other than A.fulgidus were found for 18 more predictions. In total, 291 of the GeneMark.hmm `false positive' predictions for the 10 species had already been confirmed to some extent by other researchers and were included in protein databases. Another 71 predictions, as the current study shows, have good additional evidence (from gapped BLAST) of being real genes. Many of the remaining 2068 predictions could be genes encoding so-called `pioneer proteins' (29).
Table 5.
The results presented in Table 2 were obtained by GeneMark.hmm employing second order Markov models of coding and non-coding regions. The graphs in Figure 3a show the percentage of exact predictions as a function of the model order. Surprisingly, even the zero order models yield high enough accuracy. The reason is that GeneMark.hmm accumulates a detectable signal along the rather long bacterial gene even if the relatively weak zero order model is used. This does not happen with the GeneMark algorithm, where the length of the analyzed DNA sequence is restricted by the short window and, as a consequence, the higher order models are known to be more accurate in coding potential detection (29). The latter is consistent, however, with the observation (Fig. 3b) that the number of missing genes, presumably short genes, decreases as the model order increases. Note that the slight accuracy improvement observed for higher order models was achieved at the price of a non-linear increase in computer memory requirements. For the analysis of eukaryotic DNA, where coding regions (exons) are on average much shorter than bacterial coding regions, this is a well justified price.
Contrary to the common opinion that gene overlaps are likely to occur only in phage and virus genomes, where requirements for tight gene packing are `vitally' important, the complete bacterial genomes demonstrate quite a few gene overlaps. The overlap regions are of special interest because of their double genetic code load. The distributions of the lengths of gene overlaps observed in the E.coli genome are shown in Figure 4. These length distributions are different for overlapping genes residing in the same strand (Fig. 4a) and for genes residing in opposite strands (Fig. 4b). Overlaps in the same strand are more common, with the trivial overlaps of length 1 (TGA/ATG) or 4 (ATGA) constituting the majority (406 out of 695 same strand overlaps). An overlap length larger than 48 nt was observed in 45 cases. As expected, there were no observed overlaps in the same strand with a length equal to a multiple of three.
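Why a same-strand overlap cannot have a length that is a multiple of three follows from simple arithmetic, sketched below with hypothetical gene coordinates (1-based, inclusive). Because each gene length is a multiple of three, the overlap length is divisible by three exactly when the two genes share a reading frame, and in that case the stop codon of the upstream gene would lie in frame inside the downstream gene and terminate it.

```python
def overlap_length(gene_a, gene_b):
    """Number of nucleotides shared by two genes given as (start, end)."""
    (s1, e1), (s2, e2) = gene_a, gene_b
    return max(0, min(e1, e2) - max(s1, s2) + 1)


def same_frame(gene_a, gene_b):
    """True if the two same-strand genes are read in the same frame."""
    return (gene_b[0] - gene_a[0]) % 3 == 0


upstream = (1, 303)        # stop codon at 301-303
atga_case = (300, 602)     # ATGA: start codon at 300-302
tgatg_case = (303, 605)    # TGATG: start codon at 303-305

print(overlap_length(upstream, atga_case))   # 4
print(overlap_length(upstream, tgatg_case))  # 1
print(same_frame(upstream, atga_case))       # False
```

Both trivial overlaps place the downstream gene in a different frame from the upstream one, consistent with the observed absence of overlap lengths divisible by three.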
The results presented above demonstrate that GeneMark.hmm provides an improved tool for exact prediction of bacterial genes. One drawback is the tendency to underpredict genes with overlaps. Nevertheless, it is worth mentioning that GeneMark.hmm and GeneMark have complementary properties in the sense that the genes missed by GeneMark.hmm may be recovered by GeneMark and the partial gene predictions made by GeneMark may be corrected by GeneMark.hmm. A combination of the two programs could be, therefore, an even better tool for gene prediction. Note, though, that we do not mean such a combination that would decrease the number of false negative predictions at the mere price of an increase of the number of false positive ones. By selecting those GeneMark predictions that are clear patches to the GeneMark.hmm prediction list we indeed avoided an increase in the number of false positives. The evaluation of the combined program for the 10 genomes has shown that the fraction of missing genes significantly decreased (Table 5). As is seen, one of the largest figures of missing genes, 4.4%, was observed for H.pylori. It is worth mentioning that of 956 genes of H.pylori that have verified protein database matches, the combined program missed only seven genes. The combined GeneMark.hmm and GeneMark program with about a 1 min run time for a sequence of 100 kb, is available through Internet: http://genemark.biology.gatech.edu/GeneMark
We thank William Hayes for excellent software programming help and, particularly, for setting up the WWW facilities, James McIninch for valuable assistance, Jim Fickett for useful discussions, David Dusenbery for making valuable remarks on the previous draft of the manuscript. Anders Krogh kindly helped with using the ECOPARSE e-mail server, Andy Link provided the data on proteins with verified N-terminals, Steven Salzberg helped in the H.pylori gene prediction comparisons. All this help is greatly appreciated. The project was supported in part by the National Institutes of Health.
Nucleic Acids Research
Introduction
Materials And Methods
Materials
Model of prokaryotic sequence structure
Viterbi algorithm for variable duration HMM
Parameters of the model
Post-processing: finding RBS
Algorithm modifications for genomes other than E.coli
Results And Discussion
Gene prediction accuracy
`Missing' genes (false negatives)
`Wrong' gene predictions (false positives)
Comparison with the earlier programs
Robustness of the algorithm
Other bacterial genomes
Higher order models and models of Typical and Atypical genes
Gene overlaps
GeneMark.hmm and GeneMark combination
Acknowledgements
References

[Equation (1) is missing from the source.]
where $m$ is the number of hidden states visited during generation of the first $l$ nucleotides, $q_{a_{m-1} a_m}$ is the probability of transition from hidden state $a_{m-1}$ to state $a_m$, $p_{a_m}(d_m)$ is the probability of duration $d_m$ for state $a_m$, and $P_{a_m}(b_{l-d_m+1} \ldots b_l)$ is the probability of observing (generating) the nucleotide sequence $b_{l-d_m+1}, \ldots, b_l$ given the state $a_m$. By induction ($m \ge 2$) we have

[Equations (2) to (7) are missing from the source.]

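
The quantities defined above (transition probability, duration probability and segment emission probability) are the ingredients of a Viterbi-type recursion for a hidden Markov model with explicit state durations. As a rough illustration, the Python sketch below implements the textbook form of that recursion in log space on a toy two-state model. It is not the GeneMark.hmm implementation: the function name `viterbi_variable_duration`, the two states, the uniform duration distribution and all probabilities are assumptions chosen to keep the example small.

```python
import math

NEG_INF = float("-inf")


def viterbi_variable_duration(seq, states, log_init, log_trans, log_dur, log_emit, max_dur):
    """Return the most probable parse of seq as a list of (state, start, end) segments.

    best[l][s] is the best log probability of generating seq[:l] with the
    last segment in state s ending exactly at position l.
    """
    n = len(seq)
    best = [{s: NEG_INF for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for l in range(1, n + 1):
        for s in states:
            for d in range(1, min(l, max_dur) + 1):
                # Duration and emission terms for the segment seq[l-d:l] in state s.
                local = log_dur(s, d) + log_emit(s, seq[l - d:l])
                if d == l:
                    # First segment: use the initial state probability.
                    candidates = [(log_init[s], None)]
                else:
                    # Later segment: best predecessor ending at l-d, plus transition.
                    candidates = [(best[l - d][p] + log_trans[p].get(s, NEG_INF), p)
                                  for p in states]
                for score, prev in candidates:
                    if score + local > best[l][s]:
                        best[l][s] = score + local
                        back[l][s] = (d, prev)
    state = max(states, key=lambda s: best[n][s])
    if best[n][state] == NEG_INF:
        raise ValueError("no parse with non-zero probability")
    # Trace back from the end of the sequence to recover the segments.
    segments, l = [], n
    while l > 0:
        d, prev = back[l][state]
        segments.append((state, l - d, l))
        l, state = l - d, prev
    return segments[::-1]


# Toy two-state model; every number below is made up for illustration.
STATES = ["N", "C"]  # N: AT-rich "non-coding", C: GC-rich "coding"
EMIT = {"N": {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05},
        "C": {"A": 0.05, "T": 0.05, "G": 0.45, "C": 0.45}}
MAX_DUR = 6
log_init = {"N": math.log(0.5), "C": math.log(0.5)}
log_trans = {"N": {"C": 0.0}, "C": {"N": 0.0}}  # log(1): states must alternate


def log_dur(state, d):
    return math.log(1.0 / MAX_DUR)  # uniform over 1..MAX_DUR


def log_emit(state, segment):
    return sum(math.log(EMIT[state][ch]) for ch in segment)


segments = viterbi_variable_duration("ATATATGCGCGCATATAT", STATES, log_init,
                                     log_trans, log_dur, log_emit, MAX_DUR)
print(segments)
```

In this formulation segment boundaries coincide with transitions between hidden states, so the predicted segment ends play the role of gene boundaries. A realistic model would replace the toy emission function with Markov chain models of coding and non-coding sequence and the uniform durations with length distributions estimated from data.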
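| Nucleotide | Position 1 | Position 2 | Position 3 | Position 4 | Position 5 |
|---|---|---|---|---|---|
| T | 0.161 | 0.050 | 0.012 | 0.071 | 0.115 |
| C | 0.077 | 0.037 | 0.012 | 0.025 | 0.046 |
| A | 0.681 | 0.105 | 0.015 | 0.861 | 0.164 |
| G | 0.077 | 0.808 | 0.960 | 0.043 | 0.659 |
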
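
A position-specific probability matrix such as the one above can be used to score candidate 5-nucleotide sites. The Python sketch below is one simple way to do so, multiplying the per-position probabilities and sliding the window along an upstream region. How GeneMark.hmm actually applies the matrix (window placement relative to the start codon, comparison with a background model) is not described in this section, so the scoring rule and the names `rbs_probability` and `best_rbs_site` are assumptions.

```python
# Position-specific nucleotide probabilities copied from the table (positions 1-5).
RBS_MATRIX = {
    "T": [0.161, 0.050, 0.012, 0.071, 0.115],
    "C": [0.077, 0.037, 0.012, 0.025, 0.046],
    "A": [0.681, 0.105, 0.015, 0.861, 0.164],
    "G": [0.077, 0.808, 0.960, 0.043, 0.659],
}
SITE_LENGTH = 5


def rbs_probability(site: str) -> float:
    """Product of the per-position probabilities for a 5-nt candidate site."""
    if len(site) != SITE_LENGTH:
        raise ValueError("site must be exactly 5 nucleotides long")
    prob = 1.0
    for position, nucleotide in enumerate(site):
        prob *= RBS_MATRIX[nucleotide][position]
    return prob


def best_rbs_site(upstream: str):
    """Slide a 5-nt window over the region; return (offset, site, probability)."""
    windows = [(i, upstream[i:i + SITE_LENGTH])
               for i in range(len(upstream) - SITE_LENGTH + 1)]
    offset, site = max(windows, key=lambda w: rbs_probability(w[1]))
    return offset, site, rbs_probability(site)


# The most probable nucleotide at each position gives the consensus of the matrix.
consensus = "".join(max("TCAG", key=lambda n: RBS_MATRIX[n][i])
                    for i in range(SITE_LENGTH))
print(consensus, rbs_probability(consensus))
print(best_rbs_site("TTTAGGAGTTT"))
```

For long upstream regions or many candidates, summing log probabilities instead of multiplying raw values avoids numerical underflow and makes scores easier to compare with a background model.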
| Set # | Number of genes | Prediction method | Exact prediction | Only 3'-end prediction | Missing genes |
|---|---|---|---|---|---|
| 1 | 4288 | VA | 2483 (58%) | 1592 (37%) | 213 (5%) |
| 1 | 4288 | PP | 3233 (75%) | 842 (20%) | 213 (5%) |
| 2 | 2821 | VA | 2017 (71%) | 750 (27%) | 54 (2%) |
| 2 | 2821 | PP | 2268 (80%) | 499 (18%) | 54 (2%) |
| 3 | 325 | VA | 255 (78%) | 64 (20%) | 6 (2%) |
| 3 | 325 | PP | 289 (89%) | 30 (9%) | 6 (2%) |
| 4 | 204 | VA | 156 (76.5%) | 47 (23%) | 1 (0.5%) |
| 4 | 204 | PP | 177 (87.5%) | 26 (12%) | 1 (0.5%) |

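
The three outcome columns used above can be made concrete with a small Python sketch. It assumes that a prediction is matched to an annotated gene by strand and 3'-end, that a match is "exact" when the 5'-end also agrees, and that a gene is "missing" when no prediction shares its 3'-end. These matching rules and the function name `classify_annotated_genes` are assumptions for illustration; the coordinates are made up.

```python
from collections import Counter


def classify_annotated_genes(annotated, predicted):
    """Count annotated genes predicted exactly, only at the 3'-end, or not at all.

    Both arguments are lists of (five_prime_end, three_prime_end, strand) tuples.
    """
    # Index predictions by (strand, 3'-end); the value is the set of predicted 5'-ends.
    starts_by_three_prime = {}
    for five, three, strand in predicted:
        starts_by_three_prime.setdefault((strand, three), set()).add(five)

    counts = Counter(exact=0, only_3prime=0, missing=0)
    for five, three, strand in annotated:
        starts = starts_by_three_prime.get((strand, three))
        if starts is None:
            counts["missing"] += 1        # no prediction ends at this 3'-end
        elif five in starts:
            counts["exact"] += 1          # both gene boundaries agree
        else:
            counts["only_3prime"] += 1    # same 3'-end, different start codon
    return counts


# Toy data with made-up coordinates.
annotated = [(100, 400, "direct"), (900, 500, "comp."), (1000, 1300, "direct")]
predicted = [(100, 400, "direct"), (870, 500, "comp."), (2000, 2300, "direct")]
counts = classify_annotated_genes(annotated, predicted)
total = sum(counts.values())
for category, count in counts.items():
    print(f"{category}: {count} ({100 * count / total:.0f}%)")
```

Under this scheme the three counts always add up to the number of annotated genes, which is the case for every row of the table (for example 2483 + 1592 + 213 = 4288).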
| Gene # | Strand | 5'-end | 3'-end | Score | E-value | Subject |
|---|---|---|---|---|---|---|
| 1 | comp. | 238736 | 238257 | 270 | 4e-72 | gi¦1552787; hypothetical protein |
| 2 | comp. | 279586 | 279248 | 229 | 4e-60 | pir¦¦I41306; hypothetical protein (argF-lacZ region) |
| 3 | direct | 1286288 | 1286854 | 122 | 1e-27 | gi¦1787481; 35 pct identical <3 gaps> to 54 residues of approx. 1040 aa protein BGAL_KLEPH |
| 4 | direct | 2201992 | 2202309 | 217 | 2e-56 | sp¦P33347¦YEHK; hypothetical 12.6 kDa protein |

| Number of genes | Prediction method | Exact prediction | Only 3'-end prediction | Missing genes |
|---|---|---|---|---|
| 148 | GeneMark.hmm | 105 (71%) | 28 (19%) | 15 (10%) |
| 148 | GeneMark | 92 (62%) | 37 (25%) | 19 (13%) |
| 148 | ECOPARSE | 79 (53%) | 33 (23%) | 36 (24%) |

| Genome | Genes annotated | Genes predicted | Exact prediction (%) | Missing genes (%) | Wrong genes (%) |
|---|---|---|---|---|---|
| A.fulgidus | 2407 | 2530 | 73.1 | 10.8 (2.0) | 15.1 |
| B.subtilis | 4101 | 4384 | 77.5 | 3.6 (2.8) | 9.8 |
| E.coli | 4288 | 4440 | 75.4 | 5.0 (2.7) | 8.2 |
| H.influenzae | 1718 | 1840 | 86.7 | 3.8 (3.2) | 10.2 |
| H.pylori | 1566 | 1612 | 79.7 | 6.0 (4.4) | 8.7 |
| M.genitalium | 467 | 509 | 78.4 | 9.9 (1.7) | 17.3 |
| M.jannaschii | 1680 | 1841 | 72.7 | 4.6 (0.8) | 12.9 |
| M.pneumoniae | 678 | 734 | 70.1 | 7.8 (4.1) | 13.6 |
| M.thermoauthotrophicum | 1869 | 1944 | 70.9 | 5.0 (3.5) | 8.6 |
| Synechocystis | 3169 | 3360 | 89.6 | 4.0 (1.5) | 9.4 |
| Averaged | 21 943 | 23 194 | 78.1 | 5.4 (2.7) | 10.4 |

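
The bottom row of the table can be checked with a few lines of Python. The sketch below sums the gene counts and computes gene-count-weighted means of the percentage columns (values in parentheses are left out). The choice of weights (annotated genes for exact and missing predictions, predicted genes for wrong predictions) is an inference from the numbers, not something stated in the text.

```python
# (genome, genes annotated, genes predicted, exact %, missing %, wrong %)
ROWS = [
    ("A.fulgidus", 2407, 2530, 73.1, 10.8, 15.1),
    ("B.subtilis", 4101, 4384, 77.5, 3.6, 9.8),
    ("E.coli", 4288, 4440, 75.4, 5.0, 8.2),
    ("H.influenzae", 1718, 1840, 86.7, 3.8, 10.2),
    ("H.pylori", 1566, 1612, 79.7, 6.0, 8.7),
    ("M.genitalium", 467, 509, 78.4, 9.9, 17.3),
    ("M.jannaschii", 1680, 1841, 72.7, 4.6, 12.9),
    ("M.pneumoniae", 678, 734, 70.1, 7.8, 13.6),
    ("M.thermoauthotrophicum", 1869, 1944, 70.9, 5.0, 8.6),
    ("Synechocystis", 3169, 3360, 89.6, 4.0, 9.4),
]


def weighted_mean(values, weights):
    """Mean of values in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


annotated = [row[1] for row in ROWS]
predicted = [row[2] for row in ROWS]
total_annotated = sum(annotated)
total_predicted = sum(predicted)

# Assumed weighting: per annotated gene for exact/missing, per predicted gene for wrong.
exact_avg = weighted_mean([row[3] for row in ROWS], annotated)
missing_avg = weighted_mean([row[4] for row in ROWS], annotated)
wrong_avg = weighted_mean([row[5] for row in ROWS], predicted)

print(total_annotated, total_predicted)
print(round(exact_avg, 1), round(missing_avg, 1), round(wrong_avg, 1))
```

With this weighting the recomputed values round to the figures in the "Averaged" row, which suggests that the row reports totals for the two gene-count columns and pooled percentages for the rest.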
REFERENCES
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Comments and feedback: www-admin{at}oup.co.uk
Last modification: 6 Feb 1998
Copyright© Oxford University Press, 1998.