Nucleic Acids Research Advance Access originally published online on December 20, 2007
Nucleic Acids Research 2008 36(Database issue):D517-D518; doi:10.1093/nar/gkm886

Nucleic Acids Research, 2008, Vol. 36, Database issue D517-D518
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

AlterORF: a database of alternate open reading frames

Inti Pedroso1, Gustavo Rivera1, Felipe Lazo2, Max Chacón2, Francisco Ossandón1, Felipe A. Veloso1 and David S. Holmes1,*

1Center for Bioinformatics and Genome Biology, Life Science Foundation, MIFAB and Andrés Bello University, Santiago, Chile and 2Department of Informatics, University of Santiago, Santiago, Chile

*To whom correspondence should be addressed. Tel: +56 2 239 8969; Fax: +56 2 237 2259; Email: dsholmes2000{at}yahoo.com

Received August 17, 2007. Revised October 2, 2007. Accepted October 2, 2007.


    ABSTRACT
 
AlterORF is a searchable database that contains information regarding alternate open reading frames (ORFs) for over 1.5 million genes in 481 prokaryotic genomes. The objective of the database is to provide a platform for improving genome annotation and to serve as an aid for the identification of prokaryotic genes that potentially encode proteins in more than one reading frame. The AlterORF database can be accessed through a web interface at www.alterorf.cl.


    INTRODUCTION
 
A DNA sequence contains six potential open reading frames (ORFs), three on one strand and three on the reverse strand. However, typically only one of the six is actually expressed because it is associated with appropriate genetic signals that specify the DNA strand and the reading frame to be transcribed and translated. Exceptions occur in which more than one open reading frame is translated into a protein, as has long been observed in the case of viral genes, where it was suggested that this property permitted a high packing density of information (1). However, analysis of the coding potential of 481 prokaryotic genomes revealed a surprisingly high frequency of alternate ORFs of annotated genes, especially in high G + C genomes, where almost every annotated ORF exhibits an alternative ORF that could potentially encode a protein of 100 amino acids or more (2).
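The six reading frames described above can be enumerated with a few lines of code. The following is a minimal Python sketch for illustration only; the function names and the frame labels (`+1` to `+3` for the given strand, `-1` to `-3` for the reverse complement) are assumptions and are not taken from the AlterORF code base.

```python
# Illustrative sketch; names and frame labels are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Split a DNA sequence into complete codons for each of six frames.

    Frames +1..+3 read the given strand at offsets 0..2; frames -1..-3
    read the reverse complement at offsets 0..2 (assumed convention).
    """
    frames = {}
    for sign, strand in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Stop at len - 2 so that only complete codons are kept.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    for frame, codons in six_frames("ATGAAATAG").items():
        print(frame, codons)
```

Other tools number the reverse-strand frames differently, so the labels here should be read as one possible convention rather than the one used by the database.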

The frequency of alternate open reading frames in high G + C genomes gives rise to the possibility that this property could be exploited to evolve novel genetic information, and it is important to be able to detect this potential. However, this high frequency also causes serious problems for gene annotation, because the incorrect ORF may inadvertently be annotated as the coding sequence. This potential for error is especially problematic when automatic gene prediction programs are used to annotate genomes, but errors can also slip by human annotators. The problem is exacerbated if an alternative ORF is mis-annotated and the error is propagated in subsequent genome annotations.

AlterORF provides a searchable database of all possible alternative ORFs in sequenced prokaryotic genomes that are potentially capable of encoding proteins of 100 amino acids or more. The objectives are 2-fold: to improve genome annotation by indicating possible errors in ORF identification and, perhaps more important in the long term, to predict instances of genes that potentially could give rise to more than one protein.


    DATABASE CONSTRUCTION
 
Annotated protein coding genes were extracted from completely sequenced prokaryotic genomes in the Genome Database of NCBI. All alternative ORFs potentially encoding 100 amino acids or more were extracted from each gene sequence using Perl scripts and the BioPerl Application Programming Interface (API) (3). Using the standard genetic code, the in silico translated amino acid sequence of each alternative ORF was searched with BLAST for similarity in completely sequenced prokaryotic genomes (4) and for conserved domains and motifs using CDD (5), PFAM (6), COG (7), KOG (8), SMART (9) and UniProt (10). Hierarchical clustering with the software hcluster_sg, developed as part of the TreeFam project (11), was used to build sequence families from the alternate ORFs. BLAST e-values were normalized from 0 to 100 (with 100 corresponding to e-value 0.0). The resulting information was stored in a relational database built with Microsoft SQL Server 2005.
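The extraction step can be sketched as follows. This is not the authors' Perl/BioPerl implementation; it is a minimal Python illustration that assumes an alternate ORF is a stop-free run of at least 100 codons in any of the five frames other than the annotated one (labelled `+1` here), and that the gene sequence is given 5' to 3' in the annotated frame. The function name, the frame labels and the stop-to-stop definition are all assumptions.

```python
# Illustrative sketch; not the AlterORF pipeline. Names are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stops of the standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def alternate_orfs(gene, min_codons=100):
    """Yield (frame, start, end) for stop-free runs of >= min_codons codons.

    The annotated frame (+1) is skipped. start and end are 0-based,
    end-exclusive offsets on the strand being read (the reverse
    complement for the -1..-3 frames).
    """
    gene = gene.upper()
    strands = {"+": gene, "-": gene.translate(COMPLEMENT)[::-1]}
    for sign, seq in strands.items():
        for offset in range(3):
            frame = f"{sign}{offset + 1}"
            if frame == "+1":
                continue  # the annotated frame is not an alternate ORF
            run_start = offset
            for i in range(offset, len(seq) - 2, 3):
                if seq[i:i + 3] in STOP_CODONS:
                    if (i - run_start) // 3 >= min_codons:
                        yield frame, run_start, i
                    run_start = i + 3
            # Trailing run that reaches the end of the gene without a stop.
            end = offset + ((len(seq) - offset) // 3) * 3
            if (end - run_start) // 3 >= min_codons:
                yield frame, run_start, end


if __name__ == "__main__":
    toy_gene = "AGCTGCTGCTGCTTAAAA"  # toy input, far shorter than a real gene
    for hit in alternate_orfs(toy_gene, min_codons=4):
        print(hit)
```

A stricter definition that requires a start codon, or a different genetic code, would change which runs are reported; the 100-codon threshold is the only parameter taken from the text.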


    DATABASE CONTENTS
 
Release 1.0 (September 2007) contains approximately 1.5 million annotated genes from 481 organisms and about 3 million alternate ORFs. Of these, 942 856 (33%) occur in frame –1, 621 306 (21%) in frame –2, 322 284 (11%) in frame –3, 350 805 (12%) in frame +2 and 675 525 (23%) in frame +3. The following are provided for each alternate ORF sequence: (i) conserved domains and motifs from CDD (5), PFAM (6), COG (7), KOG (8), SMART (9) and UniProt (10) and (ii) BLAST results against annotated sequences in completely sequenced prokaryotic genomes and against alternate ORFs identified in AlterORF. The cross-genera conservation of some alternate ORFs suggests that they might represent new protein families or domains, and hierarchical clustering (11) was used to build sequence families from conserved alternate ORFs.
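The frame distribution can be tabulated directly from the five counts quoted above. The short Python sketch below is only a worked arithmetic example; the variable and function names are hypothetical. It computes each share against the sum of the five listed counts, so the results may differ slightly from the rounded percentages in the text if those were derived against a different total.

```python
# Alternate-ORF counts per frame, copied from the Release 1.0 figures.
FRAME_COUNTS = {
    "-1": 942_856,
    "-2": 621_306,
    "-3": 322_284,
    "+2": 350_805,
    "+3": 675_525,
}


def frame_shares(counts):
    """Return the total count and each frame's share of it in percent."""
    total = sum(counts.values())
    return total, {frame: 100.0 * n / total for frame, n in counts.items()}


total, shares = frame_shares(FRAME_COUNTS)

if __name__ == "__main__":
    print(f"total alternate ORFs: {total:,}")
    for frame, pct in shares.items():
        print(f"frame {frame}: {pct:.1f}%")
```

Frame +1 does not appear in the table, presumably because it is the annotated frame itself.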


    WEB INTERFACE AND SERVICES
 
The AlterORF database can be accessed through a web interface at www.alterorf.cl. The database can be searched by protein ID (derived from NCBI), by organism and by sequence using a sequence search service. In addition, an option is provided to analyze complete genome sequences not present in the database.

Searching by protein ID: a protein ID can be used to recover the original annotated gene as it appears in the source database (e.g. GenBank), and also any alternate ORF(s) associated with that gene. If alternate ORFs are detected, tables providing information regarding domains, motifs and protein family are displayed with links to further information.

Searching by organism: the user can select an organism from a pulldown menu or index to obtain a pre-analyzed list of annotated protein coding genes with alternate ORFs.

Searching by protein sequence: a search using a protein sequence can be carried out against all sequences stored in AlterORF using WU-BLAST (blast.wustl.edu/).

Downloading data: all data in the AlterORF database can be freely downloaded by ftp. Additional information on the use of AlterORF can be found in the FAQs and Tutorial sections.


    ACKNOWLEDGEMENTS
 
The database is supported in part by a Microsoft Sponsored Research Award and Fondecyt 1050063. Funding to pay the Open Access publication charges for this article was provided by the above sponsors and by Andrés Bello University.

Conflict of interest statement. None declared.


    REFERENCES

  1. Fiddes JC, Seeburg PH, DeNoto FM, Hallewell RA, Baxter JD, Goodman HM. Evolution of the three overlapping gene systems in G4 and phi X174. J. Mol. Biol. (1979) 133:19–43.

  2. Valdes J, Veloso F, Jedlicki E, Holmes D. Large-scale, multi-genome analysis of alternate open reading frames in bacteria and archaea. OMICS (2005) 9:91–105.

  3. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. (2002) 12:1611–1618.

  4. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zheng Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.

  5. Marchler-Bauer A, Anderson JB, Derbyshire MK, DeWeese-Scott C, Gonzales NR, Gwad M, Hao L, He S, Hurwitz DI, et al. CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res. (2007) 35:D237–D240.

  6. Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, et al. Pfam: clans, web tools and services. Nucleic Acids Res. (2006) 34:D247–D251.

  7. Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science (1997) 278:631–637.

  8. Tatusov RL, Fedorova ND, Jackson JJ, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics (2003) 4:41.

  9. Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P. SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. (2006) 34:D257–D260.

  10. UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Res. (2007) 35:D193–D197.

  11. Li H, Coghlan A, Ruan J, Coin LJ, Hériché JK, Osmotherly L, Li R, Liu T, Zhang Z, et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. (2006) 34:D572–D580.

