Nucleic Acids Research, 2002, Vol. 30, No. 7 1575-1584
© 2002 Oxford University Press
An efficient algorithm for large-scale detection of protein families
Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK and 1Centrum voor Wiskunde en Informatica, Kruislaan 413, NL-1098 SJ Amsterdam, The Netherlands
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
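To illustrate the Markov cluster step on which the method relies, the following Python sketch alternates expansion (matrix squaring) with inflation (element-wise powering followed by column rescaling) on a small matrix of pairwise similarity scores. It is a minimal sketch under stated assumptions, not the TRIBE-MCL implementation. The function names, the inflation value of 2.0, the self-loop weighting, the convergence tolerance, the toy scores and the reading of families as connected components of the converged matrix are all illustrative choices that do not come from the article.

```python
def column_normalize(m):
    """Rescale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] > 0 else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix, i.e. take two random-walk steps."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise entries to the power r, then rescale columns."""
    return column_normalize([[x ** r for x in row] for row in m])


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9, eps=1e-6):
    """Cluster a symmetric, non-negative similarity matrix (illustrative)."""
    n = len(similarity)
    m = [list(row) for row in similarity]
    for i in range(n):
        # Assumed self-loop weight: the strongest edge of the node (at least 1).
        m[i][i] = max(max(m[i]), 1.0)
    m = column_normalize(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Read clusters as connected components of the surviving entries.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


# Toy input (hypothetical scores): two tightly linked triples of proteins
# joined by one weak hit between protein 2 and protein 3.
demo = [
    [0, 100, 100, 0, 0, 0],
    [100, 0, 100, 0, 0, 0],
    [100, 100, 0, 1, 0, 0],
    [0, 0, 1, 0, 100, 100],
    [0, 0, 0, 100, 0, 100],
    [0, 0, 0, 100, 100, 0],
]
families = mcl(demo)
print(families)
```

In this sketch the inflation parameter controls granularity: larger values sharpen each column faster and tend to yield smaller, tighter groups. The dense list-of-lists matrix is only practical for toy inputs; a sparse representation would be needed at database scale.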
* To whom correspondence should be addressed. Tel: +44 1223 494452; Fax: +44 1223 494468; Email: anton{at}ebi.ac.uk