Nucleic Acids Research, 2002, Vol. 30, No. 12 2710-2725
© 2002 Oxford University Press
Dictionary-driven prokaryotic gene finding
Exploratory Technology, IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan and ¹Bioinformatics and Pattern Discovery Group, Computational Biology Center, IBM Thomas J. Watson Research Center, PO Box 218, Yorktown Heights, NY 10598, USA
Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large-scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm's implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine simultaneously achieves very high sensitivity and specificity, along with a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451 602 proteins and fragments, we demonstrate our method's generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail.
* To whom correspondence should be addressed. Tel: +1 914 945 1384; Fax: +1 914 945 4104; Email: rigoutso@us.ibm.com
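The general idea described in the abstract, selecting gene candidates among the ORFs of a DNA strand according to the protein patterns they contain, can be illustrated with a minimal Python sketch. This is not the BDGF algorithm. The function names (`find_orfs`, `count_pattern_hits`, `predict_genes`), the pattern syntax (`.` as a single-residue wildcard, interpreted as a regular expression), the `min_hits` threshold, and the restriction to ATG starts on the three forward frames are all assumptions made for illustration only.

```python
import re
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(dna, min_codons=3):
    """Yield (start, end, protein) for ATG..stop ORFs in the three forward frames.

    Assumption: only ATG is treated as a start codon, and only the first
    ATG after the previous stop in a frame is used.
    """
    dna = dna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                protein = translate(dna[start:i])
                if len(protein) >= min_codons:
                    yield start, i + 3, protein
                start = None


def count_pattern_hits(protein, patterns):
    """Count how many distinct patterns occur at least once in the protein."""
    return sum(1 for p in patterns if re.search(p, protein))


def predict_genes(dna, patterns, min_hits=2):
    """Keep ORFs whose translation matches at least min_hits patterns."""
    return [
        (start, end, protein)
        for start, end, protein in find_orfs(dna)
        if count_pattern_hits(protein, patterns) >= min_hits
    ]
```

Calling `predict_genes(dna, patterns)` with a DNA string and a list of pattern strings returns the ORFs that pass the hypothetical hit threshold. A real pattern collection, the reverse strand, alternative start codons, and any weighting of patterns are outside the scope of this sketch.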