Nucleic Acids Research, 2002, Vol. 30, No. 12 2710-2725
© 2002 Oxford University Press

Dictionary-driven prokaryotic gene finding

Tetsuo Shibuya and Isidore Rigoutsos1,*

Exploratory Technology, IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan and 1 Bioinformatics and Pattern Discovery Group, Computational Biology Center, IBM Thomas J. Watson Research Center, PO Box 218, Yorktown Heights, NY 10598, USA

Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm’s implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine exhibits performance that is characterized by simultaneous very high values of sensitivity and specificity, and a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451 602 proteins and fragments, we demonstrate our method’s generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail.
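The two-stage idea described above (enumerate ORFs in a DNA strand, then keep those supported by patterns from a dictionary) can be illustrated with a small sketch. This is not the BDGF implementation: the function names, the regular-expression toy patterns, the coverage score, and the `min_coverage` threshold are all assumptions made for illustration, and only the forward strand is scanned. The actual Bio-Dictionary patterns and the way BDGF weighs them are not specified in this abstract.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Yield (start, end, protein) for ATG..stop ORFs in the three forward frames.

    `end` is exclusive and includes the stop codon. Only the first ATG after
    the previous stop is used as the start (a simplifying assumption).
    """
    dna = dna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    protein = "".join(
                        CODON_TABLE.get(dna[j:j + 3], "X")  # "X" for ambiguous codons
                        for j in range(start, i, 3)
                    )
                    yield start, i + 3, protein
                start = None


def pattern_coverage(protein, patterns):
    """Fraction of residues covered by at least one pattern occurrence.

    Patterns are hypothetical stand-ins written as regular expressions,
    with "." as a wildcard position.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pat in patterns:
        # A lookahead lets overlapping occurrences of the same pattern be found.
        for m in re.finditer(f"(?=({pat}))", protein):
            for k in range(m.start(1), m.end(1)):
                covered[k] = True
    return sum(covered) / len(protein)


def predict_genes(dna, patterns, min_coverage=0.5):
    """Keep ORFs whose translation is sufficiently covered by patterns."""
    return [
        (s, e, p)
        for s, e, p in find_orfs(dna)
        if pattern_coverage(p, patterns) >= min_coverage
    ]


toy_dna = "CCATGAAAGGTTCTGGTAAATAGCC"
print(predict_genes(toy_dna, ["G.G"]))
```

In this toy input the only ORF starts at offset 2 and is retained because the single pattern `G.G` covers half of its translated residues. A realistic version would also scan the reverse complement and consider alternative start codons.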

* To whom correspondence should be addressed. Tel: +1 914 945 1384; Fax: +1 914 945 4104; Email: rigoutso@us.ibm.com




