| Nucleic Acids Research | Pages |
SCOP: a Structural Classification of Proteins database
Introduction
The Classification
Family
Superfamily
Common fold
Class
Organisation And Facilities Of SCOP
Other Uses Of SCOP
Evaluating the effectiveness of sequence alignment methods
Statistics of protein structural data
Conclusions
Acknowledgements
References
INTRODUCTION
At present (October 1998) the Brookhaven Protein Databank (PDB; 1) contains 7723 entries, and the number is increasing by about 200 a month. These proteins have structural similarities with other proteins and, in many cases, share a common evolutionary origin. To facilitate access to this information, we have constructed the Structural Classification of Proteins (SCOP) database (2). It includes not only all proteins in the current version of the PDB, but also many proteins for which there are published descriptions but whose co-ordinates are not yet available.
The classification of proteins in SCOP has been constructed by visual inspection and comparison of structures. Given the current limitations of purely automatic procedures, we believe this approach produces the most accurate and useful results. The unit of classification is usually the protein domain. Small proteins, and most of those of medium size, have a single domain and are, therefore, treated as a whole. The domains in large proteins are usually classified individually.
THE CLASSIFICATION
The classification is arranged on hierarchical levels: family, superfamily, common fold and class.
Family
Proteins are clustered together into families on the basis of one of two criteria that imply a common evolutionary origin: first, all proteins that have residue identities of 30% or greater; second, proteins with lower sequence identities but whose functions and structures are very similar, for example, globins with sequence identities of 15%.
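The arithmetic behind the first criterion can be sketched as follows. This is only an illustration: the text does not say how identity is counted, so the sketch assumes two pre-aligned sequences and counts matches over columns where neither has a gap. The function names and the gap character are hypothetical. The second criterion rests on judgement of structure and function and is not something this code attempts.

```python
FAMILY_IDENTITY_THRESHOLD = 30.0  # per cent, from the first family criterion


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where neither sequence has a gap.

    Assumption: both inputs come from the same pairwise alignment and
    therefore have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns containing a gap
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def meets_first_family_criterion(aligned_a: str, aligned_b: str) -> bool:
    """True when the pair reaches the 30% residue-identity threshold."""
    return percent_identity(aligned_a, aligned_b) >= FAMILY_IDENTITY_THRESHOLD
```

A pair that falls below the threshold is not thereby excluded from a family; it simply has to be assessed under the second criterion.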
Superfamily
Families whose proteins have low sequence identities, but whose structures and, in many cases, functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies; for example, the variable and constant domains of immunoglobulins.
Common fold
Superfamilies and families are defined as having a common fold if their proteins have the same major secondary structures in the same arrangement and with the same topological connections (for recent reviews see refs 5 and 6). The structural similarities of proteins in the same fold category probably arise from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.
Class
The different folds have been grouped into classes. Most of the folds are assigned to one of the five structural classes: (i) all-α, those whose structure is essentially formed by α-helices; (ii) all-β, those whose structure is essentially formed by β-sheets; (iii) α/β, those with α-helices and β-strands; (iv) α+β, those in which α-helices and β-strands are largely segregated; and (v) multi-domain, those with domains of different fold and for which no homologues are known at present.
Other classes have been assigned for peptides, small proteins, theoretical models, nucleic acids and carbohydrates. These hierarchical levels are illustrated in Figure 1.
Figure 1. Region of SCOP hierarchy. All the major levels, including class, fold, superfamily and family, are shown. Also shown are individual proteins and the lowest level, either the PDB coordinate identifier or a literature reference. Copyright © 1994 Steven E. Brenner; reproduced with permission.
There are now a number of other databases which classify protein structures, such as CATH (7,8), FSSP (9,10), Entrez (11) and DDBASE (12); however, the distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is, so far, unique to SCOP. Because functional similarity is implied by an evolutionary relationship but not necessarily by a physical relationship, we believe that this classification level is of considerable value, for example, as a way of reliably linking very distant sequence families.
ORGANISATION AND FACILITIES OF SCOP
The SCOP database is available as a set of tightly coupled hypertext pages on the world wide web (WWW) via URL: http://scop.mrc-lmb.cam.ac.uk/scop/ .
The interface to SCOP has been designed to facilitate both detailed searching of particular families and browsing of the whole database. To this end, there are a variety of different techniques for navigation as detailed below.
Browsing through the SCOP hierarchy. SCOP is organised as a tree structure. Entering at the top of the hierarchy, the user can navigate through the levels of Class, Fold, Superfamily, Family and Species to the leaves of the tree, which are structural domains of individual PDB entries. An alternative hierarchy of Folds, Superfamilies and Families by the date of solution of the first representative structure is also provided.
From an amino acid sequence. The sequence similarity search facility allows any sequence of interest to be searched against databases of protein sequences classified in SCOP (see below) using the algorithms BLAST (13), FASTA or SSEARCH (14). SCOP can then be entered from the list of PDB chains found to be similar and the similarity can be displayed visually.
From a keyword. The keyword search facility returns a list of SCOP pages containing the word entered or combinations of words separated by a series of boolean operators.
From a PDB identifier. The PDB entry viewer links PDB entries to various graphical views, external databases and SCOP itself.
By history. Pages are provided that order folds, superfamilies and families by date of entry into PDB or publication. This is both for interest and to make it easier to keep up to date with the appearance of new folds or significant new members of existing folds.
In addition to the information on structural and evolutionary relationships contained within SCOP, each entry (for which co-ordinates are available) has links to images of the structure, interactive molecular viewers, the atomic co-ordinates, data on functional conformational changes, sequence data and homologues and MEDLINE abstracts.
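The tree organisation used for browsing can be pictured with a small data structure. The sketch below is not SCOP's implementation; it assumes a simple in-memory tree with the five levels named above and domain identifiers stored at the Species nodes. The class and function names, and any labels passed in, are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Sequence

# Levels traversed when browsing from the top of the hierarchy.
LEVELS = ("Class", "Fold", "Superfamily", "Family", "Species")


@dataclass
class Node:
    name: str
    level: str
    children: Dict[str, "Node"] = field(default_factory=dict)
    domains: List[str] = field(default_factory=list)  # leaves: domain identifiers


def add_domain(root: Node, path: Sequence[str], domain_id: str) -> None:
    """Insert a domain under a full Class..Species path, creating nodes as needed."""
    if len(path) != len(LEVELS):
        raise ValueError("path must name one node per level")
    node = root
    for level, name in zip(LEVELS, path):
        if name not in node.children:
            node.children[name] = Node(name, level)
        node = node.children[name]
    node.domains.append(domain_id)


def browse(root: Node, path: Sequence[str]) -> Node:
    """Follow a (possibly partial) path down from the root."""
    node = root
    for name in path:
        node = node.children[name]
    return node


def domains_under(node: Node) -> Iterator[str]:
    """Yield every domain identifier in the subtree rooted at node."""
    yield from node.domains
    for child in node.children.values():
        yield from domains_under(child)
```

Calling `browse` with a two-element path stops at a Fold node, and `domains_under` then lists every domain classified beneath that fold.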
To facilitate rapid and effective access to SCOP, a number of mirrors have been established; a full current list can be found via the above URL. The facilities provided by the various sites are always the same, so nothing is lost by accessing the nearest mirror. The implementation does differ: for example, sequence similarity searching is currently always carried out at the main scop.mrc-lmb.cam.ac.uk site. This is transparent to the user, who is always returned a search results page marked up with links to pages on the mirror from which they started.
OTHER USES OF SCOP
Evaluating the effectiveness of sequence alignment methods
Sequence database searching plays a role in virtually every branch of molecular biology and is crucial for interpreting the sequences issuing forth from genome projects. Despite this, the overall and relative capabilities of different search procedures have until recently been largely unknown. This is because it is difficult to verify algorithms on sample data as this requires large data sets of proteins whose evolutionary relationships are known unambiguously and independently of the methods being evaluated (nearly all known homologs have been identified by sequence analysis, the method to be tested). Also, it is generally very difficult to know, in the absence of structural data, whether two proteins that lack clear sequence similarity are unrelated. This has meant that, although previous evaluations have helped improve sequence comparison, they have suffered from insufficient, imperfectly characterised, or artificial test data (15).
As part of the maintenance of SCOP, new structures are automatically processed. One of the initial steps is to cluster the sequences of protein chains of known structures at different levels of sequence similarity. This has resulted in a series of non-redundant sequence databases, referred to as PDB40, PDB90 and PDB95 (the number refers to percentage sequence identity as modified by the HSSP equation; 16). The chains chosen as representatives are those with the best structural 'quality', defined from an equation combining resolution, R-factor and PROCHECK values (17). The final SCOP classification is used to annotate the headers of these FASTA-format files and to split them into domains. The result is a set of domain sequence databases, PDB40D, PDB90D, etc., where the full set of true and false pairwise relationships between the sequences can be inferred from the scopcode in the headers. These databases are used within SCOP for the sequence search facility (see above); however, they are also ideally suited as test data for the calibration of sequence searching algorithms. They have been used to calibrate the commonly used pairwise algorithms BLAST (13), WU-BLAST2 (18), FASTA and SSEARCH (14) (see ref. 15), as well as methods making use of multiple sequences such as Hidden Markov Models (19,20) and the recently developed iterative version of BLAST2 (21), referred to as psi-BLAST (22,23). The databases used for these studies are now freely available via the SCOP URL and can easily be filtered using the scopcode to extract subsets of sequences, e.g. to create a database with a single representative sequence for each fold.
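How pairwise relationships and filtered subsets might be derived from the headers can be sketched as follows. The header layout is an assumption, not the documented format: each FASTA header is taken to be `>domain_id scopcode`, with the scopcode a dotted string whose first four fields are class, fold, superfamily and family. Treating pairs in the same superfamily as "true" relationships follows the reading that family and superfamily imply common evolutionary origin whereas fold does not. All function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Code = Tuple[str, ...]
LEVELS = ("class", "fold", "superfamily", "family")


def read_headers(fasta_text: str) -> Dict[str, Code]:
    """Map domain identifier -> scopcode fields (assumed '>id code' headers)."""
    records: Dict[str, Code] = {}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence lines are not needed for labelling
        fields = line[1:].split()
        if len(fields) < 2:
            continue  # header without a scopcode
        records[fields[0]] = tuple(fields[1].split("."))
    return records


def deepest_shared_level(a: Code, b: Code) -> str:
    """Name of the lowest level at which two codes still agree, or 'none'."""
    shared = "none"
    for level, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        shared = level
    return shared


def label_pairs(records: Dict[str, Code]) -> Dict[Tuple[str, str], bool]:
    """True for pairs sharing a superfamily (assumed evolutionary relationship)."""
    labels: Dict[Tuple[str, str], bool] = {}
    for (id_a, code_a), (id_b, code_b) in combinations(sorted(records.items()), 2):
        level = deepest_shared_level(code_a, code_b)
        labels[(id_a, id_b)] = level in ("superfamily", "family")
    return labels


def one_per_fold(records: Dict[str, Code]) -> List[str]:
    """Keep the first domain encountered for each (class, fold) pair."""
    seen = set()
    chosen: List[str] = []
    for domain_id, code in records.items():
        fold = code[:2]
        if fold not in seen:
            seen.add(fold)
            chosen.append(domain_id)
    return chosen
```

The labels produced by `label_pairs` are the kind of reference set against which the hits of a search method could be scored, and `one_per_fold` illustrates the filtering by scopcode mentioned above.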
Statistics of protein structural data
With structural data conveniently organised into domains, it is straightforward to investigate the population statistics of the protein structures we currently know. A recent survey of the classification in SCOP (24) clearly shows that, even after the high degree of redundancy in PDB has been taken into account, the frequency of occurrence of certain folds is much greater than would be expected by chance, as has been pointed out previously (25). The raw data needed to explore the classification in this way are provided as a flat file available from the SCOP URL.
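A first look at fold populations could be taken along the following lines. The flat-file layout is assumed, not documented here: one whitespace-separated record per line, with a domain identifier followed by a dotted code whose first three fields are class, fold and superfamily. Counting distinct superfamilies per fold is one simple way to reduce the effect of redundant entries; it is not necessarily the method used in the survey cited above. Function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Fold = Tuple[str, str]  # (class field, fold field)


def superfamilies_per_fold(lines: Iterable[str]) -> Dict[Fold, int]:
    """Count distinct superfamilies observed for each fold."""
    members: Dict[Fold, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # no code on this line
        code = fields[1].split(".")
        if len(code) < 3:
            continue  # code too short to name a superfamily
        members[(code[0], code[1])].add(code[2])
    return {fold: len(sfs) for fold, sfs in members.items()}


def ranked(counts: Dict[Fold, int]) -> List[Tuple[Fold, int]]:
    """Folds ordered from most to least populated, ties broken by code."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Passing an open file handle to `superfamilies_per_fold` works as well as a list of strings, since the function only iterates over lines.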
CONCLUSIONS
We have found that the easy access to data and images provided by SCOP makes it a powerful general-purpose interface to the PDB. The specific lower levels should be helpful for comparing individual structures with their evolutionary and structurally related counterparts. On a more general level, the highest levels of classification provide an excellent overview of the diversity of protein structures now known and would be appropriate both for researchers and students. Having created the classification, we have found that it has many other uses, some of which have been listed here. We hope that other researchers will find yet more uses for the raw data files that are now provided with each release.
ACKNOWLEDGEMENTS
TJPH is grateful to the MRC/DTI/ZENECA LINK programme and AGM is grateful to the MRC for financial support.
REFERENCES
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Last modification: 9 Dec 1998
Copyright©Oxford University Press, 1998.