
Nucleic Acids Research 2005 33(Database Issue):D71-D74; doi:10.1093/nar/gki064

Nucleic Acids Research, 2005, Vol. 33, Database issue D71-D74
© 2005, the authors
Nucleic Acids Research, Vol. 33, Database issue © Oxford University Press 2005; all rights reserved

The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes

Y. Lee*, J. Tsai, S. Sunkara, S. Karamycheva, G. Pertea, R. Sultana, V. Antonescu, A. Chan, F. Cheung and J. Quackenbush

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA

* To whom correspondence should be addressed. Tel: +1 301 795 7831; Fax: +1 301 838 0208; Email: dlee{at}tigr.org

Received September 14, 2004; Revised and Accepted October 5, 2004


    ABSTRACT
Although the list of completed genome sequencing projects has expanded rapidly, sequencing and analysis of expressed sequence tags (ESTs) remain a primary tool for discovery of novel genes in many eukaryotes and a key element in genome annotation. The TIGR Gene Indices (http://www.tigr.org/tdb/tgi) are a collection of 77 species-specific databases that use a highly refined protocol to analyze gene and EST sequences in an attempt to identify and characterize expressed transcripts and to present them on the Web in a user-friendly, consistent fashion. A Gene Index database is constructed for each selected organism by first clustering, then assembling EST and annotated cDNA and gene sequences from GenBank. This process produces a set of unique, high-fidelity virtual transcripts, or tentative consensus (TC) sequences. The TC sequences can be used to provide putative genes with functional annotation, to link the transcripts to genetic and physical maps, to provide links to orthologous and paralogous genes, and as a resource for comparative and functional genomic analysis.


    INTRODUCTION
The TIGR Gene Index databases (TGI) (http://www.tigr.org/tdb/tgi) are constructed using all publicly available expressed sequence tag (EST) and known gene sequence data stored in GenBank for each target species. Sequences are first cleaned to identify and remove contaminating sequences, including vector, adaptor, mitochondrial, ribosomal and chimeric sequences. The cleaned sequences are then searched pairwise against each other and grouped into clusters based on shared sequence similarity. The clusters are assembled at high stringency to produce tentative consensus (TC) sequences. The virtual transcripts represented in the TCs are annotated using a variety of tools for open reading frame (ORF) prediction, single nucleotide polymorphism (SNP) prediction, long oligo prediction for microarrays, putative annotation using a controlled vocabulary, Gene Ontology (GO) and Enzyme Commission (EC) number assignments, and mapping onto complete or draft genomes or available genetic maps. The TCs are used to construct a variety of other databases, including the Eukaryotic Gene Orthologs (EGO) database and RESOURCERER, a database that annotates and cross-references microarray resources for plants and animals.

At present, 77 species are represented in the Gene Index databases, including 29 animals, 25 plants, 8 fungi and 15 protists; this includes most species for which public EST projects have released more than 50 000 ESTs. Current release information for each species-specific database is summarized in Table 1. Individual databases are updated and released three times yearly, on February 1, June 1 and October 1, if the number of available ESTs for that species has increased by either 25 000 or >10%, whichever is less.
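The release rule above amounts to a simple threshold check. The following is a minimal sketch in Python, not TIGR's actual release code. The function name is hypothetical, and the handling of the boundary cases (an increase of exactly 25 000 qualifies, an increase of exactly 10% does not) is an assumption based on the wording of the text.

```python
def qualifies_for_release(previous_est_count: int, current_est_count: int) -> bool:
    """Return True if EST growth since the last release meets the update rule.

    Assumed reading of the rule: a new release is built when the increase is
    at least 25 000 ESTs or more than 10% of the previous count, i.e. when it
    reaches whichever of the two thresholds is smaller.
    """
    increase = current_est_count - previous_est_count
    return increase >= 25_000 or increase > 0.10 * previous_est_count


# A small index: 11 000 new ESTs on top of 100 000 exceeds 10%.
print(qualifies_for_release(100_000, 111_000))
# A large index: 20 000 new ESTs on top of 1 000 000 meets neither threshold.
print(qualifies_for_release(1_000_000, 1_020_000))
```

For small indices the percentage threshold is the one that matters, while for indices with more than 250 000 ESTs the absolute threshold of 25 000 is the smaller of the two.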


Table 1. Summary of the current release of TIGR Gene Indices (TGI)

 

    RECENT DEVELOPMENTS
Construction of the Gene Indices
The process used to assemble each Gene Index is similar to that described previously (1–3), although some modifications have been made to improve the efficiency and accuracy of the process. mgBLAST, a modified version of the Megablast (4) program, is now used for the pairwise sequence comparisons that define the sequence clusters on which assembly is based. For large clusters containing hundreds or thousands of sequences (e.g. highly expressed genes such as actin), sequence representation is reduced prior to assembly using a variety of multilayer approaches, including transitive clustering, containment clustering and seeded clustering with known genes. Following clustering, the Paracel Transcript Assembler (PTA), a modified version of the CAP3 assembly program (5), is used to assemble each TC. An open-source set of software tools that embodies this process, TGICL, is available (http://www.tigr.org/tdb/tgi/software) with other open-source utilities for users interested in performing a similar analysis on their own datasets (6).
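The idea of transitive clustering from pairwise hits can be illustrated with a union-find structure. This is a minimal sketch in Python and not the TGICL implementation: the function name is hypothetical, and the input is assumed to be a list of sequence pairs that have already passed whatever similarity criteria the pairwise search applies (those criteria are not given in this article).

```python
from collections import defaultdict


def transitive_clusters(sequence_ids, pairwise_hits):
    """Group sequences so that any two linked by a chain of hits share a cluster."""
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(item):
        # Follow parent links to the cluster representative, halving the path.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    # Merge the clusters of every pair of similar sequences.
    for first, second in pairwise_hits:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    clusters = defaultdict(set)
    for seq_id in sequence_ids:
        clusters[find(seq_id)].add(seq_id)
    # Return clusters as sorted lists, ordered by their first member.
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])


ids = ["EST1", "EST2", "EST3", "EST4", "EST5", "GENE1"]
hits = [("EST1", "EST2"), ("EST2", "GENE1"), ("EST3", "EST4")]
print(transitive_clusters(ids, hits))
# → [['EST1', 'EST2', 'GENE1'], ['EST3', 'EST4'], ['EST5']]
```

EST1 and GENE1 end up in the same cluster even though they have no direct hit, because both are linked to EST2. Each resulting cluster would then be passed to the assembler separately.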

New features of the TC report
The central elements of the TGI databases are the TC sequences and the TC reports that are presented through the project website. Each TC report presents a summary of the assembly and annotation process, including the consensus TC sequence in FASTA format with a history from previous builds in the header, a map showing component EST and gene sequences, and a table providing links to the primary sequences, putative annotation, an expression summary based on the number of ESTs from various libraries, genomic locations and links to tentative orthologs in EGO. Since the last presentation of the TGI databases in Nucleic Acids Research, several new features have been added to the TC report. Putative polyadenylation signals are identified and shaded in the consensus sequence, and putative poly(A/T) trimming sites are shown in the sequence map for each of the component ESTs. Potential ORFs are predicted for each TC using a variety of software tools including the NCBI ORF Finder, ESTScan (7) and FrameFinder; predicted ORFs can be searched against a variety of databases using WU-BLAST. Assembly of the TCs can result in incorrect orientations for the consensus, and an attempt is now made to determine the proper orientation using the annotated direction of component gene and EST sequences as well as BLAST search results. Putative SNP sites are found by analyzing the multiple sequence alignments that are produced in the assembly stage; putative SNPs are reported only if a variant is found in multiple sequences from independent libraries. Unique 70mers are predicted for each TC using OligoPicker (8). GO terms and metabolic pathways in KEGG are provided for each TC based on protein database searches. Where possible, TCs are aligned with draft genomes and displayed using TGIviewer, gbrowse, EnsEMBL and the UCSC genome viewers.
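The SNP reporting criterion described above (a variant must appear in multiple sequences from independent libraries) can be sketched as a filter over the columns of a multiple alignment. This is an illustrative Python sketch, not the TGI pipeline: the function name, the input layout (each read padded to the consensus length with `-`) and the default thresholds of two reads and two libraries are assumptions, since the article does not state exact cutoffs.

```python
from collections import defaultdict


def putative_snps(consensus, aligned_reads, min_reads=2, min_libraries=2):
    """Report (position, consensus_base, variant_base) for supported variants.

    aligned_reads is a list of (library_id, aligned_sequence) tuples in which
    every aligned_sequence has the same length as the consensus and uses '-'
    where the read does not cover a column.
    """
    snps = []
    for position, consensus_base in enumerate(consensus):
        # Collect, per variant base, the libraries of the reads that carry it.
        support = defaultdict(list)
        for library_id, sequence in aligned_reads:
            base = sequence[position]
            if base in "ACGT" and base != consensus_base:
                support[base].append(library_id)
        for variant_base, libraries in sorted(support.items()):
            if len(libraries) >= min_reads and len(set(libraries)) >= min_libraries:
                snps.append((position, consensus_base, variant_base))
    return snps


consensus = "ACGTACGT"
reads = [
    ("libA", "ACGTACGT"),
    ("libA", "ACTTACGT"),
    ("libB", "ACTTACGA"),
    ("libB", "ACGTACGT"),
    ("libC", "--TTACGT"),
]
print(putative_snps(consensus, reads))
# → [(2, 'G', 'T')]
```

The G/T variant at position 2 (0-based) is supported by reads from three libraries and is reported. The T/A difference at the last position occurs in a single read and is treated as a likely sequencing error.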

New databases and tools
The EGO database (http://www.tigr.org/tdb/tgi/ego) (9), previously known as TIGR Orthologous Gene Alignments (TOGA), uses pairwise sequence similarity searches and a transitive, reciprocal closure process to identify Tentative Ortholog Groups (TOGs) in eukaryotes (9). EGO has expanded its representation to include all 77 species represented in the TGI, and TOGs have been cross-referenced to the Online Mendelian Inheritance in Man (OMIM) (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM) database of human disease genes.
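One way to picture a reciprocal, transitive grouping is to keep only reciprocal best hits between species and then merge sequences connected by a chain of such hits. The Python sketch below shows that general idea only. It is an assumption about the flavor of the approach, not the published EGO/TOGA procedure (see reference 9 for the actual criteria), and the function name, input layout and identifiers are hypothetical.

```python
from collections import defaultdict


def tentative_groups(best_hits, species_of):
    """Group sequences linked by chains of reciprocal best hits.

    best_hits maps (query_id, target_species) to the best-matching sequence
    in that species; species_of maps each query_id to its own species.
    """
    # Step 1: keep a link only when each sequence is the other's best hit.
    graph = defaultdict(set)
    for (query, _target_species), hit in best_hits.items():
        if best_hits.get((hit, species_of[query])) == query:
            graph[query].add(hit)
            graph[hit].add(query)

    # Step 2: transitive closure, taken here as connected components.
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


species_of = {"hs_TC1": "human", "mm_TC7": "mouse", "mm_TC9": "mouse", "rn_TC3": "rat"}
best_hits = {
    ("hs_TC1", "mouse"): "mm_TC7",
    ("mm_TC7", "human"): "hs_TC1",
    ("mm_TC7", "rat"): "rn_TC3",
    ("rn_TC3", "mouse"): "mm_TC7",
    ("mm_TC9", "human"): "hs_TC1",  # not reciprocated, so no link is made
}
print(tentative_groups(best_hits, species_of))
# → [['hs_TC1', 'mm_TC7', 'rn_TC3']]
```

In this toy example the human and rat sequences are grouped through their shared reciprocal partner in mouse, while mm_TC9 is left out because its best human hit prefers a different mouse sequence.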

RESOURCERER (10) provides annotation based on the TIGR Gene Indices for widely available microarray resources in human, mouse, rat, zebrafish and Xenopus, including widely used clone sets and Affymetrix GeneChips™ as well as a variety of other sequence-based resources such as RefSeq. RESOURCERER provides a wide range of annotation and integration with genomic and other resources, including gene name assignments, GO term and EC number assignments, chromosomal localization, integration with genetic and quantitative trait locus (QTL) maps, ortholog identification, lists of relevant abstracts in PubMed and promoter region identification. Owing to its integration with the TGI and EGO, RESOURCERER also provides links between microarray platforms both within and between species. Users can also submit a list of GenBank accessions corresponding to their microarray databases for annotation and functional analysis. A plant-specific version, Plant RESOURCERER, was released in September 2004 with microarray resources from Arabidopsis, potato, tomato, maize and rice.

Genomic maps align TCs to available complete or draft genomes, including human, mouse, rat, zebrafish, fly, worm, Fugu, mosquito, Arabidopsis, yeast, fission yeast and rice. These alignments can be viewed using either TGIviewer or gbrowse, or through a number of distributed annotation system (DAS) viewers (11), including one developed at TIGR. Each Gene Index also includes graphical metabolic pathway maps linked to TCs associated with specific pathways through GO term and EC number annotation. Comparisons between TCs are also used to identify putative alternative splice forms based on shared blocks of sequence similarity.

Using the TIGR Gene Indices
There are many ways in which users can access the TIGR Gene Index databases. Nucleotide or protein sequences can be searched using WU-BLAST against individual TGI databases, EGO or pre-selected classes of species, such as animals or plants. The TGI can be searched using unique identifiers (GB and TC Accessions, EST identifiers and ET numbers from the TIGR PREEGAD database), gene product names, functional classifications based on GO terms, metabolic pathways, library-related expression analysis, map position within various sequenced genomes, TOGs in the EGO database and alternative splice forms. Complete annotations for all of the ESTs and TCs in each TGI database are now also provided through the EST Annotator and TC Annotator features which provide comprehensive lists of sequences within each species-specific database.

All of the TIGR Gene Indices are available for download through the main page for each species. Each download consists of six files: a FASTA file for all unique sequences, the TC list, the component ESTs in each TC, GO analysis, predicted oligos and a README file.

Software
Many of the software tools used to create the TGI are available with source code to the research community through the TGI software tools website (http://www.tigr.org/tdb/tgi/software). The TGI Clustering tool (TGICL) (6) is a software system for fast clustering and assembly of large EST datasets. TGICL starts with a large multi-FASTA file (and an optional quality value file) and outputs the assemblies produced by CAP3 (5). Both clustering and assembly phases can be parallelized by distributing the searches and the assembly jobs across multiple CPUs, as TGICL can take advantage of either SMP or PVM (Parallel Virtual Machine) clusters. Other available software includes clview, for viewing sequence assemblies in .ace format; SeqClean, which is used to remove contaminating sequences from EST and gene sequences; and cdbfasta/cdbyank, which index FASTA-formatted files and can be used to rapidly extract sequences from them.
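The indexing idea behind tools such as cdbfasta/cdbyank, recording where each record starts so it can be retrieved without rescanning the file, can be sketched in a few lines of Python. This is an illustration of the general technique only. It does not reproduce the index format or command-line interface of those tools, and the function names are hypothetical.

```python
import io


def build_index(handle):
    """Map each FASTA record ID to the byte offset of its header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                # The record ID is the first whitespace-delimited token.
                index[fields[0].decode()] = offset
    return index


def fetch(handle, index, record_id):
    """Seek directly to a record and return (header, sequence)."""
    handle.seek(index[record_id])
    header = handle.readline().decode().rstrip()
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the next record
        chunks.append(line.decode().strip())
    return header, "".join(chunks)


# An in-memory stand-in for a multi-FASTA file opened in binary mode.
fasta = io.BytesIO(b">TC1 example\nACGT\nTTGA\n>TC2\nGGCC\n")
index = build_index(fasta)
print(index)
# → {'TC1': 0, 'TC2': 23}
print(fetch(fasta, index, "TC2"))
# → ('>TC2', 'GGCC')
```

The same two functions work on a real file opened with `open(path, "rb")`. For very large files the index would normally be written to disk rather than rebuilt on each run.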


    ACKNOWLEDGEMENTS
 
The authors wish to thank the TIGR IT group for their database and computer system support. This work was supported by the US Department of Energy, grant DE-FG02-99ER62852, and the US National Science Foundation, grant DBI-9983070.


    Notes
 
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact journals.permissions{at}oupjournals.org.


    REFERENCES

  1. Quackenbush,J., Liang,F., Holt,I., Pertea,G. and Upton,J. (2000) The TIGR gene indices: reconstruction and representation of expressed gene sequences. Nucleic Acids Res., 28, 141–145.

  2. Quackenbush,J., Cho,J., Lee,D., Liang,F., Holt,I., Karamycheva,S., Parvizi,B., Pertea,G., Sultana,R. and White,J. (2001) The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res., 29, 159–164.

  3. Liang,F., Holt,I., Pertea,G., Karamycheva,S., Salzberg,S.L. and Quackenbush,J. (2000) An optimized protocol for analysis of EST sequences. Nucleic Acids Res., 28, 3657–3665.

  4. Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.

  5. Huang,X. and Madan,A. (1999) CAP3: a DNA sequence assembly program. Genome Res., 9, 868–877.

  6. Pertea,G., Huang,X., Liang,F., Antonescu,V., Sultana,R., Karamycheva,S., Lee,Y., White,J., Cheung,F., Parvizi,B., Tsai,J. and Quackenbush,J. (2002) TIGR Gene Indices Clustering Tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics, 19, 651–652.

  7. Iseli,C., Jongeneel,C.V. and Bucher,P. (1999) ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 138–148.

  8. Wang,X. and Seed,B. (2003) Selection of oligonucleotide probes for protein coding sequences. Bioinformatics, 19, 796–802.

  9. Lee,Y., Sultana,R., Pertea,G., Cho,J., Karamycheva,S., Tsai,J., Parvizi,B., Cheung,F., Antonescu,V., White,J., Holt,I., Liang,F. and Quackenbush,J. (2002) Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res., 12, 493–502.

  10. Tsai,J., Sultana,R., Lee,Y., Pertea,G., Karamycheva,S., Antonescu,V., Cho,J., Parvizi,B., Cheung,F. and Quackenbush,J. (2001) RESOURCERER: a database for annotating and linking microarray resources within and across species. Genome Biol., 2, software0002.1–software0002.4.

  11. Dowell,R.D., Jokerst,R.M., Day,A., Eddy,S.R. and Stein,L. (2001) The distributed annotation system. BMC Bioinformatics, 2, 7.




