Nucleic Acids Research
The CATH Database provides insights into protein structure/function relationships
ABSTRACT
INTRODUCTION
The CATH classification of protein domain structures was established in 1993 (1) as a hierarchical clustering of protein domain structures into evolutionary families and structural groupings, depending on sequence and structure similarity. There are four major levels, corresponding to protein class, architecture, topology or fold and homologous family (Fig.
Figure 1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.
CATH consists of both phylogenetic and phenetic descriptors for protein domain relationships. At the lowest level in the hierarchy, proteins are grouped into evolutionary families (Homologous families) if they have either significant sequence similarity (≥35% identity) or high structural similarity and some sequence similarity (≥20% identity). Structural similarity is assessed using an automatic method (SSAP) (3,4), which scores 100 for identical proteins and generally returns scores above 80 for homologous proteins. More distantly related folds generally give scores above 70 (Topology or fold level), though in the absence of any sequence or functional similarity this may simply represent examples of convergent evolution, reinforcing the hypothesis that there exists a limited number of folds in nature (5,6). The Architecture level in CATH groups proteins whose folds have similar 3D arrangements of secondary structures (e.g. barrel, sandwich or propeller), regardless of their connectivity, whilst the top level, Class, simply reflects the proportion of α-helix or β-strand secondary structures. Three major classes are recognised, mainly-α, mainly-β and α-β, since analysis revealed considerable overlap between the α+β and alternating α/β classes originally described by Levitt and Chothia (7).
Before classification, multidomain proteins are first separated into their constituent folds using a consensus method which seeks agreement between three independent algorithms (8). Whilst the protocol for updating CATH is largely automatic (9), several stages require manual validation, in particular establishing domain boundaries in proteins for which no consensus could be reached and checking the relationships of very distant homologues and proteins having borderline fold similarity.
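The similarity thresholds quoted above can be read as a simple decision rule. The following is a minimal sketch only, assuming the cut-offs are applied as hard thresholds (the text says SSAP scores "generally" fall above 80 or 70, so the real procedure is softer and includes manual validation). The names `PairComparison` and `classify_pair` are hypothetical and are not part of any CATH software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairComparison:
    """Result of comparing two domains (hypothetical container)."""
    sequence_identity: float  # percent identity, 0-100
    ssap_score: float         # SSAP structural score, 0-100


def classify_pair(pair: PairComparison) -> str:
    """Apply the thresholds from the text as hard cut-offs (an assumption)."""
    # Significant sequence similarity alone is enough for homology.
    if pair.sequence_identity >= 35:
        return "homologous family"
    # High structural similarity plus some sequence similarity.
    if pair.ssap_score >= 80 and pair.sequence_identity >= 20:
        return "homologous family"
    # Similar fold without evidence of common ancestry.
    if pair.ssap_score >= 70:
        return "same fold"
    return "no detected relationship"


if __name__ == "__main__":
    print(classify_pair(PairComparison(sequence_identity=22, ssap_score=84)))
    print(classify_pair(PairComparison(sequence_identity=8, ssap_score=73)))
```

A pair that reaches only the "same fold" outcome may be either a very distant homologue or an analogue, which is why the manual checks described in the text remain necessary.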
Although there are plans to assign the more regular architectures automatically, all architecture groupings are currently assigned manually.
Figure 2. Snapshot of a web page showing data available in the CATH dictionary of homologous superfamilies, for the subtilisin family (CATH id: 3.40.50.200). Tables display the PDB codes for non-identical relatives in the family, together with EC identifier codes and information about the enzyme reactions. The multiple structural alignment shown has been coloured according to secondary structure assignments (red for helix, blue for strands).
A homologous family Dictionary is now available within CATH, which contains functional data, where available, for each protein within a homologous family. This includes EC identifiers, SWISS-PROT keywords and information from the Enzyme database or the literature (Fig. Thornton, submitted to Protein Engng.). The topology of each domain is illustrated by schematic TOPS diagrams (http://www3.ebi.ac.uk/tops ; 10). We have also recently set up a Web Server (11), which enables the user to scan the CATH database with a newly determined protein structure and identify possible fold similarities or evolutionary relationships. There are also plans to incorporate sequence searches (using BLAST or PSI-BLAST) (12) to identify a probable fold for a new sequence.
The latest release of CATH (version 1.4, April 1998) contains 9342 protein chains from the PDB (13), which divide into 13 359 domain folds. Currently 32 different architectures are recognised. Since the last release, three new architectures have been described, including the five-bladed α-β propeller. Grouping proteins on the basis of sequence, structure and functional similarity gives 827 evolutionary homologous families (H-level), whilst recognising more distant structural similarity with no accompanying sequence or function similarity gives rise to 593 different fold groups (T-level).
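A CATH identifier such as the one quoted for the subtilisin family (3.40.50.200) encodes the four levels of the hierarchy in order: Class, Architecture, Topology, Homologous family. The sketch below shows one way such an identifier could be split into its levels. The function names are hypothetical, and the mapping of class numbers to class names is an assumption based on the conventional CATH numbering, not something stated in this article.

```python
from typing import NamedTuple

# Assumed numbering of the Class level (conventional in CATH, not given in the text).
CLASS_NAMES = {
    1: "mainly-alpha",
    2: "mainly-beta",
    3: "alpha-beta",
    4: "few secondary structures",
}


class CathId(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_family: int


def parse_cath_id(text: str) -> CathId:
    """Split a dotted four-part identifier, e.g. '3.40.50.200'."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def level_prefix(cath_id: CathId, level: str) -> str:
    """Return the identifier truncated at level 'C', 'A', 'T' or 'H'."""
    depth = "CATH".index(level) + 1
    return ".".join(str(n) for n in cath_id[:depth])


if __name__ == "__main__":
    subtilisin = parse_cath_id("3.40.50.200")
    print(CLASS_NAMES[subtilisin.class_])   # class of the domain
    print(level_prefix(subtilisin, "T"))    # fold group shared with other families
```

Truncating at the T-level gives the fold group, so two homologous families with the same three-part prefix share a fold without necessarily sharing an ancestor.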
The population of the different levels in the CATH hierarchy is illustrated by the CATH wheel shown in Figure
Figure 3. CATH wheel plot showing the population of homologous families in different fold groups, architectures and classes. The wheel is coloured according to protein class (red, mainly-α; green, mainly-β; yellow, α-β; blue, few secondary structures). The size of the outer wheel represents the number of homologous families in CATH, whilst each band in the outer wheel corresponds to a single fold family. The size of each 'fold band' therefore reflects the number of homologous families having that fold. It can be seen that most fold families contain a single homologous family. The superfold families are shown as paler bands, containing many homologous families. The inner wheel shows the population of homologous families in the different architectures.
As the sequence databases grow rapidly, the need to interpret these sequences and assign functions to specific genes becomes increasingly important. Many techniques exist for matching protein sequences and thereby inheriting functional information. However, for very distant homologues there is often no detectable sequence similarity, despite conservation of 3D structure and function. For these cases, evolutionary relationships and thereby functions can only be assigned by comparing the structures. Therefore, a number of structural genomics initiatives are being proposed (14) which aim to identify all the folds in nature, with the ultimate goal of being able to predict the function of a new protein from its known or probable structure. The important questions to ask are: how many more folds do we need to determine before we have the complete set, and how confident can we be in assigning function between proteins having similar structures?
In the current genomes, on average only 30-46% of sequences can be assigned to a structural family by recognising sequence similarity to a protein of known structure (15,16). With only ~600 unique structures currently in the PDB, compared with ~20 000 sequence families, it is clear that we still need to determine many more structures if we are to understand biology at the molecular level. However, analysis of recently deposited structural data is very revealing. Figure
Figure 4. Pie charts showing the proportion of 2159 recently deposited structures which match structures in CATH. (a) Proportion of new structures matching by sequence alignment (21) or structure alignment (SSAP) (3). (b) Proportion of new non-homologous structures (<30% sequence identity to any previous CATH entry) which match previous CATH entries by structure. Those which have more than 20% sequence identity, measured after structural alignment, or functional similarity, are assigned as homologues. The remaining structures are analogues, having no clear evolutionary relationship.
Of the remaining 443 structures (Fig.
IMPLICATIONS FOR STRUCTURAL GENOMICS
RELATIONSHIP BETWEEN PROTEIN STRUCTURE AND FUNCTION
We now need to consider at what levels of structural similarity or evolutionary 'distance' it is reasonable to inherit functional information within a protein family. Data on the CATH evolutionary families and structural groupings are stored in a Postgres relational database (11), with links to a ligand database containing information about protein-ligand interactions (2). This allows us to analyse the relationship between 3D structure and function, using stored data on EC identifiers, SWISS-PROT keywords and protein-ligand interactions (11).
Considering the degree of functional similarity observed in structures with similar folds, the vast majority (>96%) of fold groups in the PDB derive from a single homologous family, with similar or closely related functions within the family. However, for the very common folds (superfolds, see above) which derive from three or more apparently unrelated homologous families, the proteins can perform quite unrelated functions even though they have the same fold. We have described these as analogous folds, which may or may not have a common ancestor.
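The distinction drawn here between ordinary folds and superfolds amounts to counting homologous families per fold group. The sketch below illustrates that count, assuming H-level identifiers in the dotted four-part form and taking "three or more homologous families" from the text as the cut-off. The function names are hypothetical, and apart from the two identifiers quoted in this article (3.40.50.200 and 2.40.130.10) the example identifiers are illustrative inputs rather than a statement about the contents of CATH.

```python
from collections import defaultdict


def families_per_fold(h_level_ids):
    """Count distinct homologous families under each T-level (fold) prefix."""
    folds = defaultdict(set)
    for h_id in h_level_ids:
        fold = h_id.rsplit(".", 1)[0]  # drop the H-level number
        folds[fold].add(h_id)
    return {fold: len(members) for fold, members in folds.items()}


def superfolds(h_level_ids, min_families=3):
    """Folds populated by at least `min_families` homologous families."""
    counts = families_per_fold(h_level_ids)
    return sorted(fold for fold, n in counts.items() if n >= min_families)


if __name__ == "__main__":
    # Illustrative identifiers only.
    example = [
        "3.40.50.200", "3.40.50.300", "3.40.50.720",
        "2.40.130.10",
        "3.20.20.70", "3.20.20.80",
    ]
    print(families_per_fold(example))
    print(superfolds(example))
```

Keeping the threshold as a parameter makes it easy to see how the set of superfolds changes if a stricter definition is preferred.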
At the homologous superfamily level in CATH, a more detailed analysis of enzyme functions showed that the majority of homologous enzyme families in CATH (>90%) contained proteins for which the first three EC identifiers were the same. Considering those families where homologues have significant sequence identity (≥20%) after structural alignment, 95% were found to have a single EC identifier, whilst for families where proteins have more than 30% sequence similarity, we observed that 98% had a single EC code.
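The kind of check described here, whether all members of a family agree on the first three EC identifiers, can be sketched as follows. This is an illustration under stated assumptions, not the analysis pipeline used for CATH: the function names are hypothetical, the family labels are placeholders, and the EC numbers are toy inputs chosen only to exercise the logic. Partial EC numbers (for example with '-' placeholders) are not handled.

```python
def ec_prefix(ec_number, depth=3):
    """First `depth` fields of a dotted EC number, e.g. '3.4.21.62' -> ('3','4','21')."""
    return tuple(ec_number.split(".")[:depth])


def shares_ec_prefix(ec_numbers, depth=3):
    """True if every EC number in the family agrees on the first `depth` fields."""
    prefixes = {ec_prefix(ec, depth) for ec in ec_numbers}
    return len(prefixes) == 1


def fraction_conserved(families, depth=3):
    """Fraction of annotated families whose members share an EC prefix."""
    annotated = [ecs for ecs in families.values() if ecs]
    if not annotated:
        raise ValueError("no family has EC annotations")
    conserved = sum(shares_ec_prefix(ecs, depth) for ecs in annotated)
    return conserved / len(annotated)


if __name__ == "__main__":
    # Toy data: placeholder family labels with example EC numbers.
    toy_families = {
        "family_A": ["3.4.21.62", "3.4.21.64"],
        "family_B": ["1.1.1.1", "1.1.1.27"],
        "family_C": ["3.4.21.4", "3.2.1.17"],
    }
    print(fraction_conserved(toy_families, depth=3))
    print(fraction_conserved(toy_families, depth=4))
```

Setting `depth=4` corresponds to requiring a single EC code for the whole family, the stricter criterion used for the 95% and 98% figures in the text.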
Although assigning function on the basis of homology is common practice, it is clear that some caution should be exercised, particularly where there is little or no sequence similarity. There are also some clear examples where homologues with significant sequence similarity perform different functions. The role of 'gene recruitment' is especially clear in the eye lens proteins, which function as enzymes in other cellular environments, but which are used as structural proteins in this context (17). The extent of such 'gene recruitment' and context-sensitive function is really not known at this time. For enzymes, it is clear that catalytic function can change and evolve, usually to act on a different but related substrate. Similarly, within the lipocalin family (CATH id #: 2.40.130.10), several proteins are found with very similar structures, which bind different fatty acids in the same region at the base of the β-barrel (e.g. retinol, bilin, biotin).
Nearly half of the homologous families where two or more different EC numbers were observed belong to the superfolds. This suggests that if a new protein is assigned to a superfold family, more caution should be used when inheriting functional information, as there appears to be greater tolerance to changes in sequence, and ultimately function, for these families. However, it is interesting to note that many of these were TIM barrel or Rossmann folds. These are superfolds in which the substrate or ligand commonly binds in the same place. This is in the base of the β-barrel for the TIMs and at the crossover of the polypeptide chain for the doubly wound Rossmann structures.
ASSIGNMENT OF FUNCTION THROUGH STRUCTURE
One of the reasons for determining structures is to derive more information to facilitate the assignment of function. From our analysis of proteins in CATH, we suggest that structural data can help to assign function in several ways:
(i) The structural data allow recognition of more distant homologues compared with sequence data; in our analysis, 83% of structures with novel sequences could be assigned as homologues in this way (note that such assignment of function is again subject to the caveats imposed by 'gene recruitment' discussed above).
(ii) The structural data allow detailed inspection of the functional site, to suggest if and how the function may have evolved. For example, if an enzyme has evolved to act on a different substrate, the binding site may reveal, or at least suggest, possible changes in the substrate.
(iii) For the superfolds, similarity of structure does not necessarily mean similarity of function. However, the active site/binding sites are often conserved, e.g. in the TIM barrel or Rossmann fold structures, the ligand always binds at the same end of the barrel or sheet.
(iv) Some methods have already been developed, and will increasingly be the focus of attention over the next few years, which aim to predict function ab initio from structure. For example, enzymes can often be identified by the presence of a major cleft, which also locates the active site (18). Similarly critical surface patches, which are used for molecular recognition in binding other proteins or ligands, may be identified using knowledge-based approaches (19,20).
In summary, extrapolating the data from Figure
REFERENCES
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Comments and feedback: www-admin{at}oup.co.uk
Last modification: 9 Dec 1998
Copyright©Oxford University Press, 1998.