Nucleic Acids Research, 2001, Vol. 29, No. 1 55-57
© 2001 Oxford University Press
A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3
Structural Genomics Group, EMBL-EBI, Cambridge CB10 1SD, UK and 1Structural and Genetic Information, CNRS UMR 1889, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
Received October 6, 2000; Accepted October 17, 2000.
| ABSTRACT |
|---|
The Dali Domain Dictionary (http://www.ebi.ac.uk/dali/domain) is a numerical taxonomy of all known structures in the Protein Data Bank (PDB). The taxonomy is derived fully automatically from measurements of structural, functional and sequence similarities. Here, we report the extension of the classification to match the traditional four hierarchical levels corresponding to: (i) supersecondary structural motifs (attractors in fold space), (ii) the topology of globular domains (fold types), (iii) remote homologues (functional families) and (iv) homologues with sequence identity above 25% (sequence families). The computational definitions of attractors and functional families are new. In September 2000, the Dali classification contained 10 531 PDB entries comprising 17 101 chains, which were partitioned into five attractor regions, 1375 fold types, 2582 functional families and 3724 domain sequence families. Sequence families were further associated with 99 582 unique homologous sequences in the HSSP database, which increases the number of effectively known structures several-fold. The resulting database contains the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores and a comprehensive library of explicit multiple alignments of distantly related protein families.
| INTRODUCTION |
|---|
Improved methods of protein engineering, crystallography and NMR spectroscopy have led to a surge of new protein structures deposited in the Protein Data Bank (PDB), and a number of derived databases that organise this data into hierarchical classification schemes or in terms of structural neighbourhoods have appeared on the World Wide Web (1–4). We maintain the Dali Domain Dictionary and FSSP database with continuous weekly updates. Because many structural similarities are between substructures (domains), i.e. parts of structures, protein chains are decomposed into domains using the criteria of recurrence and compactness (5). Each domain is assigned a Domain Classification number D.C.l.m.n.p representing fold space attractor region (l), globular folding topology (m), functional family (n) and sequence family (p). The discrete classification presents views that are free of redundancy and simplify navigation in protein space. The structural classification is explicitly linked to sequence families with associated functional annotation, resulting in a rich network of biologically interesting relationships that can be browsed online. In particular, structure-based alignments increase our understanding of the more distant evolutionary relationships (Fig. 1).
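As an illustration of the four-level identifier, the following minimal sketch parses a label of the form D.C.l.m.n.p and reports the deepest level at which two domains agree. The concrete label strings (for example `D.C.1.12.3.7`), the dot separator as a literal file format, and the names `DomainClass`, `parse_domain_class` and `deepest_shared_level` are assumptions made for this example; they are not taken from the Dali distribution.

```python
from typing import NamedTuple, Optional

# Level names in hierarchical order: l, m, n, p.
LEVELS = ("attractor", "fold type", "functional family", "sequence family")

class DomainClass(NamedTuple):
    attractor: str          # l: fold space attractor region
    fold_type: str          # m: globular folding topology
    functional_family: str  # n: functional family
    sequence_family: str    # p: sequence family

def parse_domain_class(label: str) -> DomainClass:
    """Split a hypothetical 'D.C.l.m.n.p' label into its four levels."""
    parts = label.split(".")
    if len(parts) != 6 or parts[0] != "D" or parts[1] != "C":
        raise ValueError(f"not a D.C.l.m.n.p label: {label!r}")
    return DomainClass(*parts[2:])

def deepest_shared_level(a: DomainClass, b: DomainClass) -> Optional[str]:
    """Return the name of the deepest level shared by a and b, or None."""
    shared = None
    for name, x, y in zip(LEVELS, a, b):
        if x != y:
            break  # the hierarchy is nested, so stop at the first mismatch
        shared = name
    return shared
```

Two domains that agree in l and m but differ in n would, under this sketch, be reported as sharing a fold type but not a functional family.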
| A MAP OF FOLD SPACE |
|---|
The central concept underlying the classification is a map of fold space. This map is based on exhaustive neighbouring of all protein structures in the PDB. The all-against-all structure comparison is carried out using the Dali program. As a result of the exhaustive comparisons, each structure in the PDB is positioned in an abstract, high-dimensional space according to its structural similarity score to all other structures. The graph of structural similarities (between domains) is partitioned into clusters at four different levels of granularity. Coarse-grained overviews yield few clusters with many members that share broad architectural similarities, while fine-grained clustering yields many clusters within which structural similarities between members can extend to atomic detail due to functional constraints, for example, in binding sites.
Continuing the practice from the FSSP database, fold types are defined by agglomerative clustering so that the members of a fold type have average pairwise Z-scores above 2. The threshold has been chosen empirically to group together structures with topological similarity. Dali Domain Dictionary version 3 introduces two new levels to the fold classification, one above and one below the fold type abstraction.
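The fold-type definition can be sketched as average-linkage agglomerative clustering that stops once no pair of clusters has an average pairwise Z-score above 2. This is a simplified illustration, not the Dali implementation: the function name `fold_types`, the dictionary layout of the Z-score table, and the convention that missing pairs count as Z = 0 are all assumptions.

```python
from itertools import combinations

def fold_types(names, z, threshold=2.0):
    """Average-linkage clustering of domains.

    names: iterable of domain identifiers.
    z: dict mapping frozenset({a, b}) to a pairwise Z-score
       (assumption: pairs absent from the dict score 0).
    Clusters are merged while the best average inter-cluster
    Z-score stays above `threshold`.
    """
    clusters = [frozenset([n]) for n in names]

    def average_z(c1, c2):
        total = sum(z.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average Z-score.
        (i, j), best = max(
            (((i, j), average_z(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best <= threshold:
            break  # no remaining pair is similar enough to merge
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

This naive version rescans all cluster pairs at every step, which is adequate for a toy example but not for an all-against-all comparison of the PDB.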
The top level of the fold classification corresponds to secondary structure composition and supersecondary structural motifs. We have previously identified five attractor regions in fold space (1). We partition fold space so that each domain is assigned to one of attractors I–V, which are represented by archetype structures, using a shortest-path criterion. Structures that are disconnected from other structures are assigned to class X. Domains that are not clearly closer to one attractor than another are assigned to the mixed class Y. Currently, class Y comprises about one-sixth of the representative domain set. In the future, some of these may be assigned to emerging new attractors.
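One way to picture the shortest-path criterion is the sketch below, which runs Dijkstra's algorithm from each archetype over a graph of structural neighbours and then labels every domain. Several details are assumptions made for illustration only: the edge weights are taken to be non-negative distances derived somehow from similarity scores, "not clearly closer" is modelled as a fixed `margin` between the two nearest attractors, and the function names are hypothetical.

```python
import heapq

def add_edge(graph, u, v, dist):
    """Store an undirected edge with a non-negative structural distance."""
    graph.setdefault(u, {})[v] = dist
    graph.setdefault(v, {})[u] = dist

def shortest_paths(graph, source):
    """Dijkstra: shortest path length from source to every reachable node."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            if d + w < best.get(v, float("inf")):
                best[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return best

def assign_attractors(graph, archetypes, margin=1.0):
    """Label each node with its nearest attractor, 'X' or 'Y'.

    archetypes: dict mapping an attractor label to its archetype node.
    margin: hypothetical minimum gap between the two nearest attractors.
    """
    paths = {label: shortest_paths(graph, node)
             for label, node in archetypes.items()}
    assignment = {}
    for domain in graph:
        ranked = sorted((dist[domain], label)
                        for label, dist in paths.items() if domain in dist)
        if not ranked:
            assignment[domain] = "X"  # disconnected from every archetype
        elif len(ranked) > 1 and ranked[1][0] - ranked[0][0] < margin:
            assignment[domain] = "Y"  # mixed: no clearly nearest attractor
        else:
            assignment[domain] = ranked[0][1]
    return assignment
```

In this toy model a domain lying midway between two archetypes falls into class Y, while a domain with no neighbours at all falls into class X.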
| AN EVOLUTIONARY CLASSIFICATION |
|---|
The other new level of the classification infers plausible evolutionary relationships from strong structural similarities that are accompanied by functional or sequence similarities. Conceptually, this functional family level is equivalent to the superfamily level of scop (2). The computational discrimination between physically convergent (analogous) and evolutionarily related, divergent (homologous) proteins has received much attention recently (6–8). Structural similarity alone is insufficient to draw a line between the two classes. For example, lysozymes exhibit extreme structural divergence in regions supporting the active site, while coiled coils and beta-barrels are simple, geometrically constrained topologies which are believed to have emerged several times in protein evolution.
To address the evolutionary classification problem, we have chosen to analyse functional and sequence-motif attributes on top of structural similarity in a numerical taxonomy. The more functional features two proteins have in common, the more likely it is that they do so due to a common descent rather than by chance. Currently, our feature set includes common sequence neighbours (overlap of PSI-BLAST families), analysis of 3D clusters of identically conserved residues, enzyme classification (E.C. numbers) and keyword analysis of biological function. A neural network assigns weights to these qualitatively different features. The neural network was trained against the superfamily-to-fold transition in a manual fold classification (2). To unify families, we exploit the empirical observation that Dali's intramolecular distance comparison measure gives higher scores to pairs of homologues than to analogues. In practice, we require that functional families are nested within fold families in the fold dendrogram: functional families are branches of the fold dendrogram where all pairs have a high average neural network prediction for being homologous.
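The nesting requirement can be illustrated with a small sketch that walks a dendrogram from the root and cuts out the largest branches whose members have a high average pairwise homology prediction. The tree encoding (nested tuples with string leaves), the score table, the default of 0 for unscored pairs, the threshold of 0.8 and the function names are all hypothetical; the source does not state the actual cutoff or data structures.

```python
from itertools import combinations

def leaves(node):
    """Collect the leaf names under a node (leaves are plain strings)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def mean_pair_score(members, score):
    """Average predicted homology over all pairs of members."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0  # a single domain is trivially its own family
    # Assumption: pairs without a prediction count as 0.
    return sum(score.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def functional_families(node, score, threshold=0.8):
    """Return maximal dendrogram branches with mean pair score >= threshold."""
    members = leaves(node)
    if isinstance(node, str) or mean_pair_score(members, score) >= threshold:
        return [sorted(members)]
    families = []
    for child in node:
        families.extend(functional_families(child, score, threshold))
    return families
```

Because the search only ever cuts along existing branches, every family it returns is nested inside the fold-level tree by construction.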
The threshold for unification was chosen empirically and is conservative. Five hundred and four functional families unify two or more sequence families. Unified families have functional residues or sequence motifs that map to common sites in the 3D context of a fold. The strongest evidence is usually obtained for unifying enzyme catalytic domains. In some cases the expert system fails to capture enough evidence for unification of domains which are believed to be homologous, such as within the varied set of helix–turn–helix motif-containing DNA-binding domains where several functional families are defined at the same fold type level.
| A LIBRARY OF STRUCTURE-BASED MULTIPLE ALIGNMENTS OF REMOTE HOMOLOGUES |
|---|
The Dali Domain Classification can be browsed interactively at http://www2.ebi.ac.uk/dali. The server is implemented on top of a MySQL database. The classification may be entered from the top of the hierarchy, or the user may make a query about a protein identifier or a node in the classification hierarchy. Multiple structural alignments including attributes of the proteins are generated on the fly for any user selection of structural neighbours. Precomputed alignments are available for each functional family.
The T-Coffee program (9) is used to generate genuine consensus alignments of multiple structures from the library of pairwise Dali alignments. A reliability score is computed to indicate well defined regions (the structural core) and regions where structural equivalences are ambiguous. Technically, T-Coffee improves alignment quality in a few known cases of functional families where active site residues were inconsistently aligned in some of the pairwise Dali comparisons. Scientifically, the definition of functional families and reliable multiple structure alignments for each opens the door to sensitive sequence database searches using position-specific profiles, and to benchmarking the alignment accuracy of threading predictions.
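The idea of a per-column reliability score can be illustrated with a deliberately simple consistency measure: for each column of a multiple alignment, count what fraction of its residue pairs also appear as equivalences in the library of pairwise alignments. This is not T-Coffee's actual scoring scheme; the data layout and the name `column_support` are assumptions chosen for the example.

```python
from itertools import combinations

def column_support(columns, library):
    """Fraction of residue pairs in each column backed by the pairwise library.

    columns: list of dicts mapping a structure name to a residue index
             (structures with a gap in that column are simply absent).
    library: set of frozenset({(struct_a, i), (struct_b, j)}) equivalences
             collected from pairwise structural alignments.
    """
    scores = []
    for column in columns:
        residues = sorted(column.items())
        pairs = list(combinations(residues, 2))
        if not pairs:
            scores.append(0.0)  # fewer than two residues: nothing to support
            continue
        hits = sum(frozenset(pair) in library for pair in pairs)
        scores.append(hits / len(pairs))
    return scores
```

Columns scoring near 1 would correspond to the well-defined structural core, and columns scoring low to regions where the pairwise alignments disagree.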
| ACKNOWLEDGEMENT |
|---|
S.D. and J.P. were supported by EU contract BIO4-CT96-0166.
| FOOTNOTES |
|---|
* To whom correspondence should be addressed. Tel: +44 1223 494454; Fax: +44 1223 494470; Email: holm{at}ebi.ac.uk
| REFERENCES |
|---|
1 Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–603.
2 Hubbard,T.J., Ailey,B., Brenner,S.E., Murzin,A.G. and Chothia,C. (1999) SCOP: a Structural Classification of Proteins database. Nucleic Acids Res., 27, 254–256.
3 Orengo,C.A., Pearl,F.M., Bray,J.E., Todd,A.E., Martin,A.C., Lo Conte,L. and Thornton,J.M. (1999) The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res., 27, 275–279. Updated article in this issue: Nucleic Acids Res. (2001), 29, 223–227.
4 Marchler-Bauer,A., Addess,K.J., Chappey,C., Geer,L., Madej,T., Matsuo,Y., Wang,Y. and Bryant,S.H. (1999) MMDB: Entrez's 3D structure database. Nucleic Acids Res., 27, 240–243.
5 Holm,L. and Sander,C. (1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96.
6 Russell,R.B., Saqi,M.A., Bates,P.A., Sayle,R.A. and Sternberg,M.J. (1998) Recognition of analogous and homologous protein folds - assessment of prediction success and associated alignment accuracy using empirical substitution matrices. Protein Eng., 11, 1–9.
7 Kawabata,T. and Nishikawa,K. (2000) Protein structure comparison using the Markov transition model of evolution. Proteins, 41, 108–122.
8 Wood,T.C. and Pearson,W.R. (1999) Evolution of protein sequences and structures. J. Mol. Biol., 291, 977–995.
9 Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217.
10 Bewley,M.C., Jeffrey,P.D., Patchett,M.L., Kanyo,Z.F. and Baker,E.N. (1999) Crystal structures of Bacillus caldovelox arginase in complex with substrate and inhibitors reveal new insights into activation, inhibition and catalysis in the arginase superfamily. Structure, 7, 435–438.
11 Finnin,M.S., Donigian,J.R., Cohen,A., Richon,V.M., Rifkind,R.A., Marks,P.A., Breslow,R. and Pavletich,N.P. (1999) Structure of a histone deacetylase homologue bound to the TSA and SAHA inhibitors. Nature, 401, 188–193.
12 Kraulis,P. (1991) MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallogr., 24, 946–950.