
© 1997 Oxford University Press 226-230


The HSSP database of protein structure-sequence alignments

Reinhard Schneider*, Antoine de Daruvar and Chris Sander

Protein Design Group, European Molecular Biology Laboratory, D-69012 Heidelberg, Germany

Received October 7, 1996; Accepted October 8, 1996

ABSTRACT

HSSP is a derived database merging structural (3-D) and sequence (1-D) information. For each protein of known 3-D structure from the Protein Data Bank (PDB), the database has a multiple sequence alignment of all available homologues and a sequence profile characteristic of the family. The list of homologues is the result of a database search in SwissProt using a position-weighted dynamic programming method for sequence profile alignment (MaxHom). The database is updated frequently. The listed homologues are very likely to have the same 3-D structure as the PDB protein to which they have been aligned. As a result, the database is not only a database of aligned sequence families, but also a database of implied secondary and tertiary structures covering 29% of all SwissProt-stored sequences.

INTRODUCTION

HSSP (homology-derived structures of proteins) is a derived database merging information from 3-D structures and 1-D sequences of proteins. The added value in the database stems from the evolutionary observation that protein sequences can vary considerably while maintaining the same overall 3-D structure. One can therefore group sequence-similar proteins into families of structural homologues. If the 3-D structure of only one family member is known, then by implication one can derive the basic 3-D structure, or fold, of all family members.

To exploit this principle, we align, for each protein of known 3-D structure in the Protein Data Bank (1), all its likely sequence homologues. As a result, HSSP is not only a database of aligned sequence families, but also a database of implied secondary and tertiary structures. Likely secondary structures can be carried over directly from the PDB protein to each homologue. Tertiary structure models can be built by fitting the sequence of the homologue, as aligned, into the 3-D template of the protein of known structure (sequence inserts, however, are very difficult to model in 3-D).
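The direct carry-over of secondary structure along an alignment can be sketched in a few lines of Python. This is an illustration only, not the procedure used to build HSSP: the function name and the input layout are assumptions made for the example. The input is taken to hold one aligned character per residue of the PDB protein, with '.' marking a deletion in the homologue and a blank marking a position outside the alignment.

```python
def transfer_secondary_structure(template_ss, aligned_homologue):
    """Copy the secondary structure of the PDB protein onto a homologue.

    template_ss       -- one secondary structure character per residue of
                         the PDB protein (e.g. 'H' for helix, 'E' for strand)
    aligned_homologue -- one character per residue of the PDB protein:
                         the aligned residue of the homologue, '.' for a
                         deletion, ' ' for a position outside the alignment
    (Both names and this layout are hypothetical.)
    """
    if len(template_ss) != len(aligned_homologue):
        raise ValueError("inputs must cover the same template positions")

    implied = []
    for ss, residue in zip(template_ss, aligned_homologue):
        if residue in ". ":
            # No homologue residue at this template position.
            continue
        # Lower-case letters only flag an adjacent insertion; the residue
        # itself is aligned, so it inherits the template state.
        implied.append((residue.upper(), ss))
    return implied


if __name__ == "__main__":
    print(transfer_secondary_structure("HHHHEE", "AY.kdL"))
```

Residues inside an insertion have no counterpart in the PDB protein and therefore receive no implied state in this sketch, which mirrors the remark that sequence inserts are the hard part of model building.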

Relative to the experimentally derived structural information in PDB, HSSP increases the number of effectively known protein structures several-fold. The database is useful for analyzing residue conservation in structural context, for defining structurally meaningful sequence patterns and, in general, for studying protein evolution, folding and design.

CONTENT AND FORMAT OF THE DATABANK

For each protein in PDB, with identifier xxxx (e.g. 1PPT, 5PCY), there is an ASCII (text) file xxxx.HSSP which contains: (i) the primary sequence of the protein of known structure, along with the derived secondary structure and solvent accessibility calculated from the coordinates using DSSP (2); (ii) the aligned sequences of a few, tens or hundreds of proteins from the SWISS-PROT database (3) deemed structurally homologous to this protein; (iii) at each position in the multiple sequence alignment, the sequence variability, using two different measures; and (iv) the number of sequences that span this position (occupancy). Alignments were produced using a modified Smith-Waterman dynamic programming algorithm, allowing gaps, and likely homologues were selected by applying a well-tested threshold for structural homology. Some details of the methods are given elsewhere (4).

For example, the dataset 1PPT.HSSP (Fig. 1) contains 30 aligned sequences of pancreatic hormones, neuropeptides Y and peptides YY from different species. Residue Y27 (Tyr) is in an alpha-helix (H), has a solvent accessibility of 56 Å² and has a variability of 0, i.e. it is strictly conserved as Tyr in all sequences. The alignments could be used to build explicit 3-D models of each of the homologous sequences. Such models would be quite accurate in the core regions (helices and strands), but less accurate in loop regions. If the 3-D structure of one of the aligned sequences is known experimentally, a pointer to that structure in PDB is given in the column STRID (structure identifier).

As there is considerable redundancy in the Protein Data Bank, i.e. several datasets in PDB represent the same structural family, the sequence families in HSSP overlap. For example, there are separate files for hemoglobin and myoglobin, which have about 30-35% identical residues, so that proteins homologous to both hemoglobin and myoglobin appear in both files. Sequence-identical chains in the PDB entry are removed so that the xxxx.hssp files only contain sequence-unique chains.
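Removing sequence-identical chains from a PDB entry amounts to a simple de-duplication. The sketch below assumes that chains are supplied as (chain identifier, sequence) pairs and that "sequence-identical" means exact string equality. Which copy HSSP retains is not stated in the text, so keeping the first occurrence is an arbitrary choice made for the example.

```python
def unique_chains(chains):
    """Keep one chain per distinct sequence, preserving input order.

    chains -- iterable of (chain_id, sequence) pairs (hypothetical layout)
    """
    seen = set()
    kept = []
    for chain_id, sequence in chains:
        if sequence in seen:
            # An identical sequence was already kept; skip this copy.
            continue
        seen.add(sequence)
        kept.append((chain_id, sequence))
    return kept


if __name__ == "__main__":
    # Toy sequences, not taken from any PDB entry.
    entry = [("A", "MKVLA"), ("B", "MKVLA"), ("C", "GLSDG")]
    print(unique_chains(entry))
```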


Figure 1. Description of HSSP files. One HSSP file contains a structural protein family: one test protein of known structure and all its structurally homologous [as judged by our homology threshold (4)] relatives from the database of known sequences. The file is divided into four blocks: HEADERS, PROTEINS, ALIGNMENTS and SEQUENCE PROFILE. The HEADERS block is mandatory. The other three blocks are present only if at least one homologous alignment is found; each of the additional blocks begins with the string '##'. File organization is line-oriented. Lines have a maximum length of 132 bytes. Some of the line types are self-explanatory.

(a) HEADERS block: the first four bytes in the file, 'HSSP', can be used for file type detection. The first line also has the version number of the HSSP software (program MaxHom). The PDBID (Protein Data Bank identifier) line identifies the test protein of known structure (e.g. 1PPT); the SEQBASE line specifies the source of the aligned sequences (e.g. EMBL/Swiss-Prot or PIR/NBRF). The PARAMETER line specifies alignment parameters used in the alignment program. The THRESHOLD line refers to the homology threshold curve used. The remaining lines give information about the test protein as copied from PDB (name, source, author) and as derived (length of the sequence SEQLENGTH, number of distinct chains NCHAIN, and the number of aligned sequences NALIGN).

(b) PROTEINS block: pair alignment data for each of the proteins deemed structurally homologous to the test protein, where the term pair alignment refers to the alignment of the test protein with a single homologous protein. ID, EMBL/Swiss-Prot identifier of the aligned (homologous) protein; STRID, if the 3-D structure of this protein is known, then STRID (structure ID) is the Protein Data Bank identifier as taken from the database reference line or DR-line (latest date) of the EMBL/Swiss-Prot entry; %IDE, percentage of residue identity of the alignment; IFIR/ILAS, first and last residue position of the alignment in the test protein; JFIR/JLAS, first and last residue position of the alignment in the aligned protein; LALI, length of the alignment excluding insertions and deletions; NGAP, number of insertions and deletions in the alignment; LGAP, total length of all insertions and deletions; LSEQ2, length of the entire sequence of the aligned protein; ACCNUM, Swiss-Prot accession number; PROTEIN, one-line description of aligned protein.

(c) ALIGNMENTS block: residue-by-residue details of the family alignment. From left to right in one line: sequence and structure information for one position in the test protein taken from the corresponding DSSP file (2); sequence variability for this position, followed by the aligned sequences in the same order as in the PROTEINS block; equivalent (aligned) residue in each of the homologous database proteins. The sequences of the test protein and the aligned database proteins run vertically. SeqNo, sequential residue number of test protein as in DSSP file; PDBNo, residue number/name as in PDB file; AA, amino acid type in one-letter code; STRUCTURE, secondary structure summary, hydrogen bonding patterns for turns and helices, geometrical bend, chirality, one-character name of β-ladder and of β-sheet; BP1, BP2, β-bridge partners; ACC, solvated residue surface area in Å² (number of contacting water molecules × 10); NOCC, number of aligned sequences spanning this position (including the test sequence); VAR, sequence variability (see text) as derived from the NALIGN alignments; 1, ruler to identify alignments by their number in the PROTEINS block. Note that lower-case characters in the sequence of the test protein (AA column) indicate cysteines in SS-bridges. Insertions and deletions in either sequence are indicated by special characters in the sequence of the aligned protein: dots (...) indicate a deletion in the aligned sequence; lower-case characters bracket an insertion point in the aligned sequence, e.g. AkeV means AK[insertion]EV. There are residues of up to 70 database proteins in one line. If the number of alignments (NALIGN) is >70, the alignments block is repeated (1-70, 71-140, etc.) until the total number of alignments is reached.

(d) SEQUENCE PROFILE block: relative frequency for each of the 20 amino acid residues in a given sequence position, from counting the residue at that position in each of the aligned sequences including the test sequence. A value of 100 means that at this position only one type of amino acid is found. Asx and Glx are counted in their acid/amide form in proportion to their database frequencies (Asx to Asp: 0.521, Asx to Asn: 0.439, Glx to Glu: 0.623, Glx to Gln: 0.410 as in EMBL/Swiss-Prot release 12, November 1989). For each line, corresponding to a particular sequence position: NOCC, number of aligned sequences spanning this position (including the test sequence); NDEL, number of sequences with a deletion in the test protein at this position; NINS, number of sequences with an insertion in the test protein at this position; ENTROPY, entropy measure of sequence variability at this position; RELENT, relative entropy, i.e. entropy normalized to the range 0-100; WEIGHT, conservation weight, ~1.0; lower for less conserved positions; higher for more conserved positions.
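The block layout described in the legend of Fig. 1 suggests a straightforward way to split a file into its parts. The Python sketch below relies only on the two conventions stated there: the file starts with the four bytes 'HSSP', and every block after the header begins with a line starting with '##'. The function name, the returned data layout and the toy input are assumptions made for illustration; the toy text is not a real HSSP file.

```python
def split_hssp_blocks(text):
    """Split the text of an HSSP file into (title, lines) blocks.

    The first block is labelled 'HEADER'; each later block takes its title
    from the text following the leading '##'. A repeated ALIGNMENTS block
    (more than 70 alignments) simply appears as several entries.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("HSSP"):
        raise ValueError("not an HSSP file: first four bytes are not 'HSSP'")

    blocks = [("HEADER", [])]
    for line in lines:
        if line.startswith("##"):
            # A '##' line opens a new block; keep its title, not the marker.
            blocks.append((line[2:].strip(), []))
        else:
            blocks[-1][1].append(line)
    return blocks


if __name__ == "__main__":
    # Made-up miniature input that only imitates the block structure.
    toy = (
        "HSSP toy header line\n"
        "PDBID 1PPT\n"
        "## PROTEINS\n"
        "row 1\n"
        "## ALIGNMENTS 1 - 2\n"
        "row A\n"
        "row B\n"
        "## SEQUENCE PROFILE\n"
        "row P\n"
    )
    for title, body in split_hssp_blocks(toy):
        print(title, len(body))
```

A file containing only a header, which is what the legend describes for the case where no homologous alignment is found, comes back as a single 'HEADER' entry.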

DISTRIBUTION

CD-ROM

A subset of the HSSP database, one file for each protein in a representative set of proteins, is distributed on CD-ROM by the EMBL Data Library. In this representative set of proteins selected from PDB, sequence similarity between any two proteins does not exceed 25% identical residues (over a length of 80 or more residues). For detailed information on how the representative set was generated see ref. (5) and the documentation distributed with the database. For enquiries regarding the distribution of HSSP on this medium contact: EMBL Data Library, European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, UK. Tel: +44 (0)1223 494401; Fax: +44 (0)1223 494468; Network: datalib@embl-ebi.ac.uk; WWW: http://www.embl-ebi.ac.uk

Anonymous FTP

If you have access to the Internet, you can obtain HSSP by anonymous ftp (File Transfer Protocol) from ftp.embl-heidelberg.de or ftp.ebi.ac.uk in the directory /pub/databases/hssp, or by pointing a WWW browser to ftp://ftp.embl-heidelberg.de/pub/databases/ or to ftp://ftp.embl-ebi.ac.uk/pub/databases/

World Wide Web

The HSSP database and HSSP-related information and data are accessible via the generic URL:

http://www.sander.embl-heidelberg.de/

The program (MaxHom) that generates the alignments is currently not available for distribution. Requests for alignments based on structures not in the Protein Data Bank may be sent to R. Schneider by email. Results will be mailed back, capacity permitting. Priority will be given to new 3-D structures.

Conditions

Academic redistribution of single files or of the entire database is permitted, provided that dataset integrity is strictly maintained. No inclusion in other databases or datasets, academic or other, without explicit permission of the authors. All commercial rights reserved. Not to be used for classified research. Users are asked to refer to this paper and ref. (4) in reporting results based on use of the database.

CONTENT AND SIZE OF THE CURRENT RELEASE

The content and size of the HSSP database are of course tightly coupled to the development of the databases of protein 3-D structures (PDB) and sequences (e.g. SWISS-PROT). An overview of the increase in size is given in Table 1. Interestingly, >15 000 of 52 205 known sequences (SwissProt release 33) are homologues of known structures and therefore have an implied known 3-D structure.


Table 1

The complete set of data files currently requires ~450 Mb of disk storage; the selected subset (480 datasets), ~50 Mb. Updates of the database are done on a regular basis.

LIMITATIONS

Accuracy of reported alignments

In general, the alignments in HSSP are based almost entirely on sequence information and therefore may deviate from alignments based on comparison of known 3-D structures in local detail, especially in terms of placement of gaps. In these cases, the sequence alignment may correctly represent conservation in the evolutionary chain of events connecting the two sequences while structural alignment may reflect a local structural rearrangement as a result of mutations in sequence positions spatially near the conserved residues. Alignments, whether based on sequences or structures, are often uncertain in loop regions.

Definition of variability

In using variability scores, the user should be aware that low-occupancy positions (few alignments span that position) have ill-determined variability values; in the limit of zero occupancy, the variability is undefined and set to zero. For some purposes, the user may choose to use only positions with occupancy larger than, say, five proteins.
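The effect of occupancy on a variability score can be made concrete with a small Python sketch. The exact variability and entropy definitions used in HSSP are given in ref. (4) and are not reproduced here; the example assumes a plain Shannon entropy (natural logarithm) over the residues present in an alignment column. The function names, the column layout (one string per position, with '.' or a blank where a sequence does not span it) and the default cut-off are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def column_stats(column):
    """Return (occupancy, entropy) for one alignment column.

    Only the 20 standard one-letter codes are counted; gaps and blanks
    are ignored. At zero occupancy the entropy is set to zero, following
    the convention described in the text.
    """
    residues = [c for c in column if c in AMINO_ACIDS]
    nocc = len(residues)
    if nocc == 0:
        return 0, 0.0
    counts = Counter(residues)
    # Shannon entropy in nats; an assumption, not the HSSP formula.
    entropy = sum(-(n / nocc) * math.log(n / nocc) for n in counts.values())
    return nocc, entropy


def well_occupied(columns, min_occupancy=6):
    """Indices of columns spanned by more than five sequences."""
    return [i for i, col in enumerate(columns)
            if column_stats(col)[0] >= min_occupancy]


if __name__ == "__main__":
    # Toy columns: fully conserved, sparsely occupied, and mixed.
    columns = ["YYYYYYY", "AG.....", "AAAGGG "]
    for col in columns:
        print(repr(col), column_stats(col))
    print(well_occupied(columns))
```

In the toy data, the second and third columns have the same entropy, but the second rests on only two sequences, which is the situation the occupancy cut-off is meant to exclude.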

RELATED DATA BANKS AND INFORMATION SERVICES

The following databases and information services are also available from the Protein Design Group at EMBL, with network access provided by the same mechanisms as for HSSP (FTP and WWW access, see above).

DSSP, a database of secondary structure, solvent accessibility and other information derived from 3-D structures in the Protein Data Bank (2). http://www.sander.embl-heidelberg.de/dssp/

personal email: sander@embl-ebi.ac.uk

FSSP, a database of protein structure families with similar folding motifs, based on 3-D alignments of protein structures. http://www.sander.embl-heidelberg.de/fssp/

personal email: holm@embl-ebi.ac.uk

PDBselect, a representative subset of sequence-unique proteins of known 3-D structure selected from the Protein Data Bank (5). http://www.sander.embl-heidelberg.de/pdbselect/

personal email: hobohm@embl-heidelberg.de

PredictProtein, an electronic mail server that provides a predicted secondary structure and solvent accessibility profile for any protein sequence with homologues in SwissProt. Rated at 72% sustained three-state accuracy (6,7).

http://www.embl-heidelberg.de/predictprotein/

personal email: predict-help@embl-heidelberg.de

PropSearch, a service that performs searches in sequence databases using amino acid composition and other non-sequential properties of a protein sequence as input (8).

http://www.sander.embl-heidelberg.de/propsearch/

personal email: hobohm@embl-heidelberg.de

GeneQuiz, results of automated protein sequence analysis for completely sequenced genomes [e.g. Haemophilus influenzae (9), yeast].

http://www.sander.embl-heidelberg.de/genequiz/

personal email: genequiz@embl-heidelberg.de

GPCRDB, an information system for G-protein coupled receptors.

http://www.sander.embl-heidelberg.de/7tm/

personal email: vriend@embl-heidelberg.de

Dali, an electronic mail server that performs a 3-D similarity search in the Protein Data Bank, given the atomic coordinates of a 3-D protein model as input (10).

http://www.sander.embl-heidelberg.de/dali/

personal email: holm@embl-heidelberg.de

Special software is available to construct 3-D models by homology based on the information in HSSP files, such as WHATIF by Gert Vriend (11) or MaxSprout/Torso by Liisa Holm and Chris Sander (12,13).

Report any problems with the HSSP database to the authors by electronic mail: schneider@embl-heidelberg.de or sander@embl-heidelberg.de

REFERENCES

1 Bernstein F.C., Koetzle T.F., Williams G.J.B., Meyer E.F., Brice M.D., Rodgers J.R., Kennard O., Shimanouchi T., Tasumi M. (1977) J. Mol. Biol. 112, 535-542.

2 Kabsch W., Sander C. (1983) Biopolymers 22, 2577-2637.

3 Bairoch A., Boeckmann B. (1992) Nucleic Acids Res. 20, 2019-2022.

4 Sander C., Schneider R. (1991) Proteins 9, 56-68.

5 Hobohm U., Scharf M., Schneider R., Sander C. (1992) Protein Sci. 1, 409-417.

6 Rost B., Schneider R., Sander C. (1993) Trends Biochem. Sci. 18, 120-123.

7 Rost B., Schneider R., Sander C. (1994) Comput. Appl. Biosci. 10, 53-60.

8 Hobohm U., Sander C. (1995) J. Mol. Biol. 251, 390-399.

9 Casari G., Andrade M.A., Bork P., Boyle J., Daruvar A., Ouzounis C., Schneider R., Tamames J., Valencia A., Sander C. (1995) Nature 376, 647-648.

10 Holm L., Sander C. (1993) J. Mol. Biol. 233, 123-138.

11 Vriend G. (1990) J. Mol. Graphics 8, 52-56.

12 Holm L., Sander C. (1991) J. Mol. Biol. 218, 183-194.

13 Holm L., Sander C. (1992) Proteins 14, 213-223.



* To whom correspondence should be addressed. Tel: +49 6221 387305; Fax: +49 6221 387517; Email: schneider@embl-heidelberg.de

