Nucleic Acids Research, 2002, Vol. 30, No. 1 268-272
© 2002 Oxford University Press

SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments

Julian Gough* and Cyrus Chothia

MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK

Received September 28, 2001; Revised and Accepted October 30, 2001.


    ABSTRACT
 
The SUPERFAMILY database contains a library of hidden Markov models representing all proteins of known structure. The database is based on the SCOP ‘superfamily’ level of protein domain classification, which groups together the most distantly related proteins that have a common evolutionary ancestor. There is a public server at http://supfam.org which provides three services: sequence searching, multiple alignments to sequences of known structure, and structural assignments to all complete genomes. Given an amino acid or nucleotide query sequence, the server returns the domain architecture and SCOP classification. The server produces alignments of the query sequences with sequences of known structure, and includes multiple alignments of genome and PDB sequences. The structural assignments are carried out on all complete genomes (currently 59), covering approximately half of the soluble protein domains. The assignments, their superfamily breakdown and statistics on them are available from the server. The database is currently used by this group and others for genome annotation, structural genomics, gene prediction and domain-based genomic studies.


    INTRODUCTION
 
The SUPERFAMILY database is based on the SCOP (1) classification of protein domains. SCOP is a structural, domain-based hierarchical classification with several levels, including the ‘superfamily’ level. Proteins grouped together at the superfamily level are defined as having structural, functional and sequence evidence for a common evolutionary ancestor. As the name suggests, SUPERFAMILY operates at this level because it is the level that groups the most distantly related protein domains. The level below is the ‘family’ level, which groups more closely related domains, and the level above is the ‘fold’ level, which groups domains with similar topology that are not necessarily related.

The database uses hidden Markov models (HMMs): profiles built from multiple sequence alignments that represent a protein family (or superfamily) and can be used to search sequence databases for homologues. The SAM-T99 HMM software (2) is one of the best methods for the detection of remote protein homologues. The SAM software was used to build a library of models (3) representing all proteins of known structure, which forms the core of the SUPERFAMILY database. Expert curation and tuning add further value to these models, which are designed to detect and classify SCOP domains at the superfamily level.
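To illustrate what a profile built from a multiple sequence alignment is, the sketch below derives per-column log-odds scores from a toy ungapped alignment and scores sequences against them. This is a deliberately simplified stand-in, not the SAM-T99 method: a real profile HMM also has insert and delete states and uses more careful priors. The function names, the uniform background and the pseudocount value are all assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background (an assumption)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, sequence):
    """Sum of column scores for an ungapped sequence of the profile's length."""
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


# Toy alignment, invented for illustration only.
toy_alignment = ["ACD", "ACE", "ACD"]
toy_profile = build_profile(toy_alignment)
print(score(toy_profile, "ACD") > score(toy_profile, "WWW"))
```

A sequence resembling the aligned family scores higher than an unrelated one, which is the property that lets a model library be searched for homologues.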

Existing databases such as Pfam (4), SMART (5) and others also use HMMs to represent protein domains. There are also unifying databases that include several of these methods, e.g. InterPro (6) and CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml). These differ from SUPERFAMILY in two main ways: they span all proteins, whereas SUPERFAMILY covers only those with a known structural representative; and they group domains into families based on sequence similarity alone, leading to a level of classification closer to the family than to the superfamily level. Structural assignments have also been carried out using PSI-BLAST (7) based on the CATH (8) database, but these are much less extensive (http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D).


    DATABASE CONTENTS
 
The database may be accessed directly via a public server at http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY or via a link from each domain entry in SCOP at http://scop.mrc-lmb.cam.ac.uk/scop. There are also links from some genome databases, for example, Ensembl at http://www.ensembl.org. The underlying machinery of the database consists of a library of HMMs, a relational database and some programs. All of these are also available for download.

Structural assignments to sequences
The HMM library representing all proteins of known structure may be used to assign structural domains to sequences of unknown structure. An amino acid or nucleotide sequence may be queried against the library, and the domain architecture and SCOP classification are then returned (Fig. 1). The procedure has been optimised for remote homology detection while retaining an estimated error rate of <1%. Three-dimensional models can be generated, and these have been used to compare the method to other automatic structure prediction servers at LiveBench (http://bioinfo.pl). The server’s specificity is one of the best, especially for hard targets.
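The step from individual model matches to a domain architecture can be sketched as follows. This is only an illustration under stated assumptions, not the server's actual procedure: the `Hit` record, the E-value cut-off of 0.01 and the rule that accepted domains may not overlap at all are hypothetical choices made for the example, and the hit data are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    superfamily: str  # label of the matching superfamily model (illustrative)
    start: int        # first residue covered, 1-based inclusive
    end: int          # last residue covered, 1-based inclusive
    evalue: float     # significance of the match; lower is better


def domain_architecture(hits, max_evalue=0.01):
    """Return significant, non-overlapping hits ordered along the sequence."""
    chosen = []
    # Consider the most confident hits first so they win any overlap.
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            break  # every remaining hit is less significant still
        overlaps = any(hit.start <= c.end and c.start <= hit.end for c in chosen)
        if not overlaps:
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.start)


# Invented hits for a hypothetical multi-domain query.
example_hits = [
    Hit("superfamily B", 330, 430, 1e-20),
    Hit("superfamily C", 600, 880, 1e-70),
    Hit("superfamily D", 340, 420, 0.5),    # not significant
    Hit("superfamily A", 30, 200, 1e-40),
    Hit("superfamily E", 350, 450, 1e-5),   # overlaps a stronger hit
]
architecture = domain_architecture(example_hits)
print([h.superfamily for h in architecture])
```

Each accepted hit covers a different region of the query and carries its own superfamily label and score, which is the shape of result described for Figure 1.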



 
Figure 1. An example of the result of a sequence query. The protein (spP2931EPA2_HUMAN) is a multi-domain protein with five structural domains predicted with confidence and, shown in grey, three non-significant predictions. Each domain covers a different region of the query sequence and may be classified in a different SCOP superfamily with a different score. The ‘Align’ button links to a sequence alignment, and the ‘Assign.’ button links to all genome assignments for the given superfamily.

 
Nucleotide searches are carried out by translating sequences into the six reading frames and splitting across stop codons. Thus, the resulting structural assignments do not require any prior gene prediction and can be used to locate possible genes from raw DNA (Fig. 2). This does not provide gene prediction on its own, but is useful if no gene prediction is available and may suggest possible coding regions which gene prediction algorithms may not have identified.
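The translation step described above can be sketched in a few lines. This is a minimal illustration assuming the standard genetic code and a DNA alphabet of A, C, G and T; the function names and the `min_length` filter are hypothetical, and the subsequent search of each segment against the model library is not shown.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons give X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_segments(dna, min_length=1):
    """Translate all six reading frames and split each one at stop codons."""
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    segments = []
    for sign, strand in strands.items():
        for offset in range(3):
            protein = translate(strand[offset:])
            for piece in protein.split("*"):
                if len(piece) >= min_length:
                    segments.append((f"{sign}{offset + 1}", piece))
    return segments


example_segments = six_frame_segments("ATGGCCTAAGGG", min_length=2)
print(example_segments)
```

Because every stop-free segment in every frame is kept, no prior gene prediction is needed; a fuller version would also record each segment's coordinates so that hits can be mapped back onto the DNA.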



 
Figure 2. A section of a result of a nucleotide search of human contig AB019437.00001 clearly showing a DNA/RNA polymerase domain consisting of exons 3–7. The alignment shows how the exons combine in order to make up a complete domain.

 
Multiple sequence alignments
The models are used to generate multiple sequence alignments to sequences of known structure. A sequence with structural domains assigned (as above) can be aligned to a known sequence of the structural domain in question. On the public server there is a link to the alignment from the result page from a sequence query (Fig. 1). The server contains all PDB sequences and all complete genome sequences, which can be added to obtain a multiple alignment; users can also upload their own sequences for addition to the multiple alignment.

In the absence of a sequence query, the multiple alignments can be reached via links from SCOP or a keyword search on the server.

Genome assignments
The SUPERFAMILY procedure has been used to carry out structural assignments to all complete genomes (Tables 1 and 2). The assignments cover ~35% of eukaryote and 45% of prokaryote sequence, which is estimated as half of the soluble protein domains. This coverage is expected to increase as structural genomics projects solve more novel structures, giving a more complete structural picture of the genomes.
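One way to compute a sequence coverage figure of this kind is to count the residues that fall inside at least one assigned domain. The sketch below assumes that definition; the text does not state exactly how the quoted percentages were calculated, and the function name, input layout and example data are hypothetical.

```python
from collections import defaultdict


def residue_coverage(lengths, assignments):
    """Fraction of all residues that fall inside at least one assigned domain.

    lengths: mapping of protein id to sequence length.
    assignments: iterable of (protein id, start, end), 1-based inclusive.
    """
    by_protein = defaultdict(list)
    for protein_id, start, end in assignments:
        by_protein[protein_id].append((start, end))

    covered = 0
    for intervals in by_protein.values():
        current_start, current_end = None, None
        for start, end in sorted(intervals):
            if current_end is None or start > current_end:
                # Disjoint from the running interval: bank it and start anew.
                if current_end is not None:
                    covered += current_end - current_start + 1
                current_start, current_end = start, end
            else:
                # Overlapping assignments are merged so residues count once.
                current_end = max(current_end, end)
        if current_end is not None:
            covered += current_end - current_start + 1
    return covered / sum(lengths.values())


# Invented example: two proteins of 100 residues each.
example_lengths = {"p1": 100, "p2": 100}
example_assignments = [("p1", 1, 40), ("p1", 31, 50), ("p2", 11, 20)]
print(residue_coverage(example_lengths, example_assignments))
```

Here 60 of 200 residues are covered. Proteins without any assignment still contribute their length to the denominator.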


 
Table 1. The genome assignments for 56 genomes using the model library and assignment procedure
 

 
Table 2. The assignments for 11 miscellaneous sequence sets including, amongst other things, five alternative human gene sets and some incomplete genomes
 
The SCOP classification of the structural domains in genomes provides a framework for comparing superfamilies within and across genomes. The public server provides statistics, and the breakdown of the genomes into superfamilies of different sizes. Within each superfamily of a given genome the individual genes may be displayed, with links to their domain architecture and sequence alignments.
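A breakdown of this kind can be derived from a flat list of assignments, as in the sketch below. The (gene, superfamily) pair layout, the function names and the example data are assumptions made for illustration, not the schema of the SUPERFAMILY relational database.

```python
from collections import Counter, defaultdict


def superfamily_breakdown(assignments):
    """Count assigned domains per superfamily, largest superfamily first."""
    counts = Counter(superfamily for _gene, superfamily in assignments)
    return counts.most_common()


def genes_by_superfamily(assignments):
    """Map each superfamily to the sorted list of genes that contain it."""
    genes = defaultdict(set)
    for gene, superfamily in assignments:
        genes[superfamily].add(gene)
    return {sf: sorted(members) for sf, members in genes.items()}


def shared_superfamilies(assignments_a, assignments_b):
    """Superfamilies that occur in both of two genomes."""
    in_a = {sf for _gene, sf in assignments_a}
    in_b = {sf for _gene, sf in assignments_b}
    return in_a & in_b


# Invented assignments for two hypothetical genomes.
genome_1 = [("g1", "A"), ("g1", "B"), ("g2", "A"), ("g3", "A"), ("g3", "C"), ("g4", "B")]
genome_2 = [("h1", "A"), ("h2", "D")]
print(superfamily_breakdown(genome_1))
print(shared_superfamilies(genome_1, genome_2))
```

Because the superfamily labels come from one classification, the same counts can be compared within a genome or across genomes without any sequence comparison between them.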

A growing number of genome assignments are served via a distributed annotation system (DAS) server at http://supfam.org:8080/das allowing people to view the annotation from different sources in a single browser. To use this service a DAS client is required which can be obtained from http://www.biodas.org.


    APPLICATIONS
 
The most straightforward application is a simple sequence search, of which the public server currently (pre-publication) receives over 1000 per month. Many larger sets of sequences have been run as special requests for specific studies; the database is used on several structural genomics projects’ targets (e.g. SPiNE at http://spine.mbb.yale.edu/spine).

Although the assignments to nucleotide sequence do not provide complete gene predictions, they can be used as information contributing to a gene prediction. Current work is generating the data for the human genome for this purpose.

The genome assignments provide annotation of the genes, much of which is novel. This information is not just accessed by users of the database but is also used by several genome projects (including all completed large eukaryotes) to aid their annotation efforts, or verbatim as annotation in its own right.

The SUPERFAMILY data provides a framework which already forms the basis of several ongoing genomic studies (9,10). The data is also used by the HIGH database (http://genomesapiens.org) of immunoglobulin genes in human.


    ACKNOWLEDGEMENTS
 
The authors thank Thomas Down for help setting up the DAS server, Matthew Bashton for contributing to web design, and the SCOP authors for helpful discussions.


    FOOTNOTES
 
* To whom correspondence should be addressed. Tel: +44 1223 402479; Fax: +44 1223 213556; Email: jgough@mrc-lmb.cam.ac.uk


    REFERENCES
 

    1 Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

    2 Karplus,K., Barrett,C. and Hughey,R. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846–856.

    3 Gough,J., Karplus,K., Hughey,R. and Chothia,C. (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol., 313, 903–919.

    4 Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and Sonnhammer,E.L. (2000) The Pfam protein families database. Nucleic Acids Res., 28, 263–266. Updated article in this issue: Nucleic Acids Res. (2002), 30, 276–280.

    5 Schultz,J., Copley,R.R., Doerks,T., Ponting,C.P. and Bork,P. (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res., 28, 231–234. Updated article in this issue: Nucleic Acids Res. (2002), 30, 242–244.

    6 Apweiler,R., Attwood,T.K., Bairoch,A., Bateman,A., Birney,E., Biswas,M., Bucher,P., Cerutti,L., Corpet,F., Croning,M.D.R. et al. (2000) InterPro—an integrated documentation resource for protein families, domains and functional sites. Bioinformatics, 16, 1145–1150.

    7 Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    8 Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH—a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.

    9 Apic,G., Gough,J. and Teichmann,S.A. (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteins. J. Mol. Biol., 310, 311–325.

    10 Teichmann,S.A., Rison,S.C.G., Thornton,J.M., Riley,M., Gough,J. and Chothia,C. (2001) The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli. J. Mol. Biol., 311, 693–708.




