Published online 28 November 2005
Gene identification in novel eukaryotic genomes by self-training algorithm
1School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA; 2Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0535, USA
*To whom correspondence should be addressed. Tel: +1 404 894 8432; Fax: +1 404 894 0519; Email: mark{at}amber.biology.gatech.edu
Received August 5, 2005. Revised October 12, 2005. Accepted October 12, 2005.
Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, the genomic organization of novel eukaryotic genomes is diverse, and ab initio gene finding tools tuned for previously studied species are rarely suitable for effective gene hunting in the DNA sequences of a new genome. Gene identification methods based on mapping cDNA and expressed sequence tags (ESTs) to genomic DNA, or on alignments to closely related genomes, rely on the existence of abundant cDNA and EST data and/or the availability of reference genomes. Conventional statistical ab initio methods require large training sets of validated genes for estimating gene model parameters. In practice, neither type of data may be available in sufficient amounts until rather late stages of sequencing a novel genome. Nevertheless, we have shown that gene finding in eukaryotic genomes can be carried out in parallel with the estimation of statistical models directly from anonymous genomic DNA. The suggested method of combining gene prediction with model parameter estimation follows the path of iterative Viterbi training. Rounds of labeling the genomic sequence into coding and non-coding regions are followed by rounds of model parameter estimation. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably to or better than conventional methods in which supervised model training precedes the gene prediction step. Several novel genomes have been analyzed and biologically interesting findings are discussed.
Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.
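The alternation between sequence labeling and parameter estimation described in the abstract can be illustrated with a deliberately small sketch. It is not the authors' gene model: it assumes a toy two-state hidden Markov model (non-coding and coding) with single-nucleotide emissions, and every name in it (`viterbi`, `reestimate`, `viterbi_training`, `min_stay`) is hypothetical. The `min_stay` floor on self-transition probabilities is only a stand-in for the idea of restricting the parameter range; the actual restrictions used by the method are not specified in this abstract.

```python
import math

STATES = (0, 1)      # 0 = non-coding, 1 = coding (toy model, an assumption)
ALPHABET = "ACGT"


def viterbi(seq, start, trans, emit):
    """Return the most probable state path for seq under the current model."""
    n = len(seq)
    score = [[-math.inf, -math.inf] for _ in range(n)]
    back = [[0, 0] for _ in range(n)]
    for s in STATES:
        score[0][s] = math.log(start[s]) + math.log(emit[s][seq[0]])
    for i in range(1, n):
        for s in STATES:
            # Best predecessor state for landing in state s at position i.
            prev = max(STATES, key=lambda p: score[i - 1][p] + math.log(trans[p][s]))
            score[i][s] = (score[i - 1][prev] + math.log(trans[prev][s])
                           + math.log(emit[s][seq[i]]))
            back[i][s] = prev
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[n - 1][s])]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


def reestimate(seq, path, pseudo=1.0, min_stay=0.8):
    """Re-estimate parameters from the labeled sequence (counts plus pseudocounts)."""
    emit_counts = [{c: pseudo for c in ALPHABET} for _ in STATES]
    trans_counts = [[pseudo, pseudo] for _ in STATES]
    for i, (c, s) in enumerate(zip(seq, path)):
        emit_counts[s][c] += 1
        if i > 0:
            trans_counts[path[i - 1]][s] += 1
    emit = [{c: k / sum(row.values()) for c, k in row.items()} for row in emit_counts]
    trans = []
    for s in STATES:
        stay = trans_counts[s][s] / sum(trans_counts[s])
        # Hypothetical restriction: keep the self-transition above a floor so
        # early, noisy labelings cannot fragment the sequence.
        stay = max(stay, min_stay)
        row = [0.0, 0.0]
        row[s], row[1 - s] = stay, 1.0 - stay
        trans.append(row)
    return trans, emit


def viterbi_training(seq, trans, emit, max_rounds=20):
    """Alternate labeling and re-estimation until the labeling stops changing."""
    start = [0.5, 0.5]
    path = None
    for _ in range(max_rounds):
        new_path = viterbi(seq, start, trans, emit)
        if new_path == path:
            break
        path = new_path
        trans, emit = reestimate(seq, path)
    return path, trans, emit


# Weakly informative starting point (an assumption): coding is slightly GC-rich.
INIT_TRANS = [[0.9, 0.1], [0.1, 0.9]]
INIT_EMIT = [
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},   # non-coding
    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},   # coding
]
```

Calling `viterbi_training(sequence, INIT_TRANS, INIT_EMIT)` on a DNA string returns the final labeling together with the re-estimated transition and emission tables. A realistic eukaryotic gene finder would need many more states (exons, introns, splice sites, intergenic regions) and explicit length distributions, none of which this sketch attempts.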
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.