Nucleic Acids Research, 2003, Vol. 31, No. 1 371-373
© 2003 Oxford University Press
The TIGRFAMs database of protein families
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA
*To whom correspondence should be addressed. Email: haft@tigr.org
Received September 13, 2002; Revised and Accepted November 13, 2002
ABSTRACT
TIGRFAMs is a collection of manually curated protein families consisting of hidden Markov models (HMMs), multiple sequence alignments, commentary, Gene Ontology (GO) assignments, literature references and pointers to related TIGRFAMs, Pfam and InterPro models. These models are designed to support both automated and manually curated annotation of genomes. TIGRFAMs contains models of full-length proteins and shorter regions at the levels of superfamilies, subfamilies and equivalogs, where equivalogs are sets of homologous proteins conserved with respect to function since their last common ancestor. The scope of each model is set by raising or lowering cutoff scores and choosing members of the seed alignment to group proteins sharing specific function (equivalog) or more general properties. The overall goal is to provide information with maximum utility for the annotation process. TIGRFAMs is thus complementary to Pfam, whose models typically achieve broad coverage across distant homologs but end at the boundaries of conserved structural domains. The database currently contains over 1600 protein families. TIGRFAMs is available for searching or downloading at www.tigr.org/TIGRFAMs.
INTRODUCTION
TIGRFAMs is a manually curated database of protein families described by hidden Markov models (HMMs) and attached information. It is available by FTP and through the World Wide Web. The salient feature of TIGRFAMs is the tuning of the breadth of each protein family to serve the needs of genome annotation. This is achieved through judicious selection of both cutoff scores and members of the seed alignment for each model. Factors examined during model construction include sequence similarity, evidence of function taken directly from the scientific literature, phylogenetics inferred from carefully constructed sequence alignments, species-specific metabolic context and neighboring genes.
Figure 1 illustrates the tailoring of model range to support annotation. Four non-overlapping models were built from a larger family of aromatic amino acid hydroxylases. A neighbor-joining tree is shown rooted between bacterial, monomeric forms and tetrameric eukaryotic forms. The tetrameric forms, although quite closely related to each other, are separated on the basis of both function and phylogenetics into three families, each representing a distinct biochemical activity.
We have previously (1) defined the term equivalog to describe the relationship of proteins conserved in function since their last common ancestor. This term stands in contrast to ortholog, the proper term for proteins related purely by speciation since their last common ancestor (2). Orthologs by definition cannot have undergone horizontal gene transfer events, although such events are ubiquitous (3,4). Orthologs are not necessarily conserved in function. Although the term ortholog is used commonly in the literature to imply conserved function, this ambiguous and imprecise usage may lead easily to misinterpretation. We suggest that separating the terms ortholog and equivalog will help clarify discussions of protein sequence homology.
More than half the models in TIGRFAMs are of type equivalog (as are the four TIGRFAMs in Fig. 1). Each TIGRFAMs equivalog model confers a strong prediction of the specific protein function named by the model to any protein that scores above its trusted cutoff. For example, model TIGR00936 is adenosylhomocysteinase, EC 3.3.1.1, with gene symbol ahcY in bacteria. The trusted cutoff score of 600 bits and the noise cutoff of -150 set thresholds for automated and manual annotation. The trusted cutoff sets the bar above which recognition by the HMM may trigger automated assignment of protein name, EC number, gene symbol, GO-IDs (5), use in metabolic reconstruction or phylogenetic profiling, etc. The noise cutoff filters out sequences that clearly belong to different families. Between noise and trusted is a gray zone requiring manual inspection. For this family, the gray zone is populated with a dubious second adenosylhomocysteinase from Archaeoglobus fulgidus and fragmentary sequences from a number of sources.
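The three-way decision described above can be sketched in a few lines of Python. This is an illustration only, not the annotation pipeline used at TIGR: the function name `classify_hit`, the hit labels and the example scores are hypothetical, and only the two cutoffs (600 and -150 bits for TIGR00936) are taken from the text. Treating a score exactly equal to a cutoff as belonging to the higher zone is also an assumption.

```python
def classify_hit(score, trusted_cutoff, noise_cutoff):
    """Place an HMM bit score into one of three annotation zones.

    Assumption: a score equal to a cutoff counts as meeting it.
    """
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff must not exceed trusted cutoff")
    if score >= trusted_cutoff:
        return "trusted"  # eligible for automated name/EC/GO assignment
    if score < noise_cutoff:
        return "noise"    # treated as belonging to a different family
    return "gray"         # between the cutoffs: needs manual inspection


# Cutoffs quoted in the text for TIGR00936; the scores are made up.
TRUSTED, NOISE = 600.0, -150.0
example_scores = {"hit_a": 812.4, "hit_b": 35.0, "hit_c": -210.7}

zones = {name: classify_hit(s, TRUSTED, NOISE)
         for name, s in example_scores.items()}
print(zones)
```

Keeping the two cutoffs as per-model parameters rather than global constants mirrors the point made in the text: each model carries its own thresholds, chosen to match the intended scope of the family.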
Over 350 TIGRFAMs models are of type subfamily. Superfamily is a homology type representing the complete set of proteins having homology over essentially their whole length. Members may vary greatly in function. A subfamily represents a distinct clade within a superfamily. We assign the term subfamily to classify any family that is not necessarily conserved with respect to function but is also not necessarily a complete superfamily.
The breadth of each subfamily model in TIGRFAMs is tuned, where possible, to support annotation. Within a large family of homologous transporters, for example, one phylogenetic clade may contain various heavy metal cation transporters. By reflecting the nature of the clade on the whole rather than the name of one sample member, a TIGRFAMs subfamily HMM can provide an informative naming suggestion for any new member of the subfamily encountered during genome sequencing and annotation.
A subfamily model may perform two equally important functions in annotation. First, it can represent what is shared among a group of proteins that vary somewhat in function and thus extend the reach of annotation with moderately specific information. Second, it can mark families of proteins in which the danger of overinterpreting homologs as equivalogs can lead to misannotation. Comments within the model can explain the pitfalls of overinterpreting particular protein matches, of following certain legacy annotations, or of overgeneralizing from cursory analysis of pairwise matches.
A listing and description of the various types of models in TIGRFAMs, including equivalog, subfamily and domain, can be found at http://www.tigr.org/TIGRFAMs/Explanations.shtml.
TIGRFAMs is designed to be complementary to Pfam, an invaluable resource for finding homology domains with high sensitivity. Both HMM databases use the same freely available HMMER package (6) of HMM software. A review comparing Pfam, TIGRFAMs and SMART has recently been published (7). A graphic illustration of one contrast between TIGRFAMs and Pfam is seen in Figure 2. Six separate domain HMMs from Pfam describe the architecture of the rat pyruvate carboxylase, but none directly answers the questions "What should this protein be called?" and "What does this protein do?" Each has a broad scope, describing regions shared by proteins with various functions. In contrast, a single equivalog model provides annotation for the protein on the whole.
The current release of TIGRFAMs, version 2.1, contains 1622 families, of which 837 are classified as equivalogs. An additional 198 are proposed equivalog families whose function is not yet known (hypothetical equivalog). Coverage in newly sequenced bacterial genomes is estimated at 20% for all TIGRFAMs models and 10% for equivalog models, including roughly half of enzymes annotated with complete EC numbers.
TIGRFAMs strives for broad coverage of microbial proteins, but the starting point for model construction is often interest-driven. Areas of special focus have included transporters and DNA repair proteins from both prokaryotes and eukaryotes, bacterial housekeeping proteins and enzymes. More recent work has emphasized plant and parasite paralogous families and proteins characteristic of prokaryotic transmissible genetic elements such as CRISPR (8), temperate phage and conjugative transposons.
Validation of TIGRFAMs-based annotation has been attempted in two ways. First, HMM search results versus complete genomes are stored in relational database tables and subjected to quality control queries. The same region of the same protein should not belong to two different equivalog families. Equivalogs should appear once per genome for most genomes, excepting known cases of multiple isozymes. Second, TIGRFAMs models are tested by use. Models have been used for some time to make preliminary protein name assignments during microbial annotation at the Institute for Genomic Research (TIGR). Subsequent manual review of these annotations has provided steady feedback that has led to the improvement of many models.
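The two quality control rules stated above can be sketched as follows. The text describes them as queries against relational database tables; this Python version is a stand-in under stated assumptions, not the actual TIGR schema or queries. The `Hit` record, its field names, both function names and the placeholder identifiers (`MODEL_A`, `g1`, `p1` and so on) are all hypothetical, and coordinates are assumed to be inclusive residue positions.

```python
from collections import defaultdict, namedtuple

# Hypothetical record for one above-trusted-cutoff equivalog hit.
Hit = namedtuple("Hit", "genome protein model start end")


def overlapping_equivalog_hits(hits):
    """Find pairs of hits where two different models cover the same
    region of the same protein (rule 1)."""
    by_protein = defaultdict(list)
    for h in hits:
        by_protein[(h.genome, h.protein)].append(h)
    conflicts = []
    for group in by_protein.values():
        group.sort(key=lambda h: (h.start, h.end))
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if b.start > a.end:
                    break  # sorted by start, so no later hit overlaps a
                if a.model != b.model:
                    conflicts.append((a, b))
    return conflicts


def multi_copy_models(hits, allowed=frozenset()):
    """Report models hitting more than one protein in a genome (rule 2),
    skipping models listed in `allowed` (known multiple isozymes)."""
    proteins = defaultdict(set)
    for h in hits:
        proteins[(h.genome, h.model)].add(h.protein)
    return {key: len(found) for key, found in proteins.items()
            if len(found) > 1 and key[1] not in allowed}


# Made-up hits with placeholder identifiers.
hits = [
    Hit("g1", "p1", "MODEL_A", 1, 300),
    Hit("g1", "p1", "MODEL_B", 250, 400),  # overlaps MODEL_A on p1
    Hit("g1", "p2", "MODEL_A", 5, 310),    # second MODEL_A copy in g1
    Hit("g1", "p3", "MODEL_C", 1, 200),
]
print(overlapping_equivalog_hits(hits))
print(multi_copy_models(hits))
```

Flagged cases are candidates for review rather than errors in themselves: an overlap may point to a cutoff set too low in one model, while a second copy in a genome may be a genuine isozyme that belongs on the `allowed` list.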
We try to maintain existing models as we develop new ones. We would like to invite input from users of the database, including both suggestions to improve existing models and contributions of curated alignments. Contact information is available through the web page at http://www.tigr.org/TIGRFAMs/.
ACKNOWLEDGEMENT
TIGRFAMs is supported by Department of Energy grant number DEFC0201ER63203.
REFERENCES
1. Haft,D.H., Loftus,B.J., Richardson,D.L., Yang,F., Eisen,J.A., Paulsen,I.T. and White,O. (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res., 29, 41–43.
2. Fitch,W.M. (1970) Distinguishing homologous from analogous proteins. Syst. Zool., 19, 99–113.
3. Nelson,K.E., Clayton,R.A., Gill,S.R., Gwinn,M.L., Dodson,R.J., Haft,D.H., Hickey,E.K., Peterson,J.D., Nelson,W.C., Ketchum,K.A. et al. (1999) Evidence for lateral gene transfer between archaea and bacteria from genome sequence of Thermotoga maritima. Nature, 399, 323–329.
4. Hayashi,T., Makino,K., Ohnishi,M., Kurokawa,K., Ishii,K., Yokoyama,K., Han,C.G., Ohtsubo,E., Nakayama,K., Murata,T., Tanaka,M., Tobe,T., Iida,T., Takami,H., Honda,T., Sasakawa,C., Ogasawara,N., Yasunaga,T., Kuhara,S., Shiba,T., Hattori,M. and Shinagawa,H. (2001) Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res., 28, 11–22.
5. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S. and Eppig,J.T. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet., 25, 25–29.
6. Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.
7. Bateman,A. and Haft,D.H. (2002) HMM-based databases in InterPro. Brief. Bioinform., 3, 236–245.
8. Jansen,R., Embden,J.D., Gaastra,W. and Schouls,L.M. (2002) Identification of genes that are associated with DNA repeats in prokaryotes. Mol. Microbiol., 43, 1565–1575.