ABSTRACT
HSSP is a derived database merging structural (3-D) and sequence (1-D) information. For each protein of known 3-D structure from the Protein Data Bank (PDB), the database has
a multiple sequence alignment of all available homologues and a sequence
profile characteristic of the family. The list of homologues is the result of a
database search in SwissProt using a position-weighted dynamic programming method for sequence profile alignment
(MaxHom). The database is updated frequently. The listed homologues are very
likely to have the same 3-D structure as the PDB protein to which they have been aligned. As a
result, the database is not only a database of aligned sequence families, but
also a database of implied secondary and tertiary structures covering 29% of
all SwissProt-stored sequences.
HSSP (
To exploit this principle, we align, for each protein of known 3-D structure in the Protein Data Bank (1),
all its likely sequence homologues. As a result, HSSP is not only a database
of aligned sequence families, but also a database of implied secondary and
tertiary structures. Likely secondary structures can be carried over directly
from the PDB protein to each homologue. Tertiary structure models can be built
by fitting the sequence of the homologue, as aligned, into the 3-D template of the protein of known structure (sequence inserts, however,
are very difficult to model in 3-D).
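The transfer of secondary structure from the PDB protein to an aligned homologue can be illustrated with a minimal sketch. It assumes a pairwise alignment given as two equal-length strings with `-` for gaps and one secondary structure symbol per residue of the known structure; the function name, the gap symbol and the `?` placeholder for inserted residues are illustrative choices, not part of the HSSP format.

```python
def transfer_secondary_structure(template_aln, homologue_aln, template_ss, unknown="?"):
    """Copy per-residue secondary structure from a template to an aligned homologue.

    template_aln, homologue_aln: equal-length aligned strings, '-' marks a gap.
    template_ss: one state per template residue (gaps excluded).
    Returns one state per homologue residue (gaps excluded).
    """
    if len(template_aln) != len(homologue_aln):
        raise ValueError("aligned sequences must have equal length")
    if len(template_ss) != sum(c != "-" for c in template_aln):
        raise ValueError("template_ss must cover every template residue")

    result = []
    t_index = 0  # position in the ungapped template sequence
    for t_char, h_char in zip(template_aln, homologue_aln):
        if t_char != "-" and h_char != "-":
            result.append(template_ss[t_index])  # aligned pair: copy the state
        elif t_char == "-" and h_char != "-":
            result.append(unknown)  # insertion in the homologue: nothing to copy
        # a gap in the homologue emits nothing
        if t_char != "-":
            t_index += 1
    return "".join(result)


# Hypothetical toy alignment: one deletion and one insertion in the homologue.
print(transfer_secondary_structure("ACD-EFG", "AC-WEYG", "HHHHEE"))
```

Inserted residues receive no state in this sketch, which mirrors the remark above that sequence inserts are the hard part of model building.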
Relative to the experimentally derived structural information in PDB, HSSP
increases the number of effectively known protein structures several-fold. The database is useful for analyzing residue conservation in structural context, for defining structurally meaningful sequence patterns and, in general, for studying protein
evolution, folding and design.
For each protein in PDB, with identifier xxxx (e.g. 1PPT, 5PCY), there is an
ASCII (text) file xxxx.HSSP which contains: (i) the primary sequence of the
protein of known structure, along with the derived secondary structure and
solvent accessibility calculated from the coordinates using DSSP (2),
(ii) the aligned sequences, ranging from a few to hundreds, from the
SWISS-PROT database (3) deemed structurally homologous to this protein,
(iii) at each position in the multiple sequence alignment, the sequence
variability, using two different measures, and (iv) the number of sequences
that span this position (occupancy).
Alignments were produced using a modified Smith-Waterman dynamic programming
algorithm, allowing gaps, and likely homologues were selected by applying a
well-tested threshold for structural homology. Some details of the methods are
given elsewhere (4).
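For readers unfamiliar with the underlying technique, the following is a minimal sketch of the textbook Smith-Waterman local alignment with a linear gap penalty. It is not MaxHom: the profile-based, position-weighted modifications are not reproduced, and the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores are floored at zero so alignments can start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences (hypothetical): the second carries one extra residue.
print(smith_waterman("AAAACCCC", "AAAAGCCCC"))
```

A production method would use a residue substitution matrix and separate gap-open and gap-extension penalties; the simple scheme here only shows the recurrence and the traceback.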
For example, the dataset 1PPT.HSSP (Fig. 1) contains 30 aligned sequences of
pancreatic hormones, neuropeptides Y and peptides YY from different species.
Residue Y27 (Tyr) is in an alpha-helix (H), has a solvent accessibility of
56 Å² and has a variability of 0, i.e. it is strictly conserved as Tyr in all
sequences. The alignments could be used to build explicit 3-D models of each of
the homologous sequences. Such models would be quite accurate in the core
regions (helices and strands), but less accurate in loop regions. If the 3-D
structure of one of the aligned sequences is known experimentally, a pointer to
that structure in PDB is given in the column STRID (structure identifier).
As there is considerable redundancy in the Protein Data Bank, i.e. several
datasets in PDB represent the same structural family, the sequence families in
HSSP overlap. For example, there are separate files for hemoglobin and
myoglobin, which have about 30-35% identical residues, so that proteins homologous to both hemoglobin
and myoglobin appear in both files. Sequence-identical chains in the PDB entry are removed so that the xxxx.hssp files
only contain sequence-unique chains.
A subset of the HSSP database, one file for each protein in a representative set
of proteins, is distributed on CD-ROM by the EMBL Data Library. In this representative set of proteins
selected from PDB, sequence similarity between any two proteins does not exceed
25% identical residues (over a length of 80 or more residues). For detailed
information on how the representative set was generated see ref. (5) and the
documentation distributed with the database. For enquiries regarding
the distribution of HSSP on this medium contact: EMBL Data Library, European
Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, UK. Tel:
+44 (0)1223 494401; Fax: +44 (0)1223 494468; Network: datalib@embl-ebi.ac.uk; WWW: http://www.embl-ebi.ac.uk
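The selection criterion for the representative set described above (no pair of proteins with more than 25% identical residues over 80 or more aligned residues) can be sketched as a simple greedy filter. This is an illustration under stated assumptions, not the procedure of ref. (5): the pairwise identities are assumed to be precomputed, the input order is assumed to encode priority, and all names are hypothetical.

```python
def too_similar(identity_pct, aligned_length, max_identity=25.0, min_length=80):
    """True if a pair exceeds 25% identity over an alignment of 80 or more residues."""
    return aligned_length >= min_length and identity_pct > max_identity


def select_representatives(proteins, pairwise):
    """Greedy selection: keep a protein only if it clashes with none already kept.

    proteins: identifiers in priority order (assumed, e.g. preferred structures first).
    pairwise: dict mapping frozenset({id_a, id_b}) -> (identity_pct, aligned_length);
              pairs missing from the dict are treated as having no detectable similarity.
    """
    kept = []
    for candidate in proteins:
        clash = False
        for chosen in kept:
            pair = pairwise.get(frozenset((candidate, chosen)))
            if pair is not None and too_similar(*pair):
                clash = True
                break
        if not clash:
            kept.append(candidate)
    return kept


# Hypothetical identifiers and similarity values, for illustration only.
pairs = {
    frozenset(("A", "B")): (40.0, 120),  # similar over a long alignment: B is dropped
    frozenset(("A", "C")): (40.0, 50),   # alignment shorter than 80 residues: allowed
    frozenset(("C", "D")): (25.0, 200),  # exactly 25% does not exceed the limit
}
print(select_representatives(["A", "B", "C", "D"], pairs))
```

A greedy filter of this kind depends on the input order, so a different priority ordering can yield a different, equally valid representative set.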
If you have access to the Internet you can obtain HSSP by anonymous ftp (File
Transfer Protocol) from ftp.embl-heidelberg.de or ftp.ebi.ac.uk in directory /pub/databases/hssp, or by pointing
a WWW browser to ftp://ftp.embl-heidelberg.de/pub/databases/ or to ftp://ftp.embl-ebi.ac.uk/pub/databases/
The HSSP database and HSSP-related information and data are accessible via the generic URL:
http://www.sander.embl-heidelberg.de/
The program (MaxHom) that generates the alignments is currently not available
for distribution. Requests for alignments based on structures not in the Protein
Data Bank may be sent to R. Schneider by email. Results will be mailed back,
capacity permitting. Priority will be given to new 3-D structures.
Academic redistribution of single files or of the entire database is permitted,
provided that dataset integrity is strictly maintained. No inclusion in other
databases or datasets, academic or other, without explicit permission of the
authors. All commercial rights reserved. Not to be used for classified
research. Users are asked to refer to this paper and ref. (4) in reporting
results based on use of the database.
The content and size of the HSSP database are of course tightly coupled to the
development of the databases of protein 3-D structures (PDB) and sequences (e.g. SWISS-PROT). An overview of the increase in size is given in Table 1.
Interestingly, >15 000 of 52 205 known sequences (SwissProt release 33) are
homologues of known structures and therefore have an implied known 3-D structure.
Table 1
The complete set of data files currently requires ~450 Mb of disk storage; the selected subset (480 datasets), ~50 Mb. Updates of the database are done on a regular basis.
In general, the alignments in HSSP are based almost entirely on sequence
information and therefore may deviate from alignments based on comparison of
known 3-D structures in local detail, especially in terms of placement of gaps. In
these cases, the sequence alignment may correctly represent conservation in the
evolutionary chain of events connecting the two sequences while structural
alignment may reflect a local structural rearrangement as a result of mutations
in sequence positions spatially near the conserved residues. Alignments,
whether based on sequences or structures, are often uncertain in loop regions.
In using variability scores, the user should be aware that low occupancy
positions (few alignments span that position) have ill-determined variability values; in the limit of zero occupancy the variability is undefined and set
to zero. For some purposes, the user may choose to use only positions with
occupancy larger than, say, five proteins.
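The occupancy caveat can be made concrete with a short sketch that computes, for each alignment column, the occupancy and an entropy-style variability, and then keeps only well-occupied positions. Shannon entropy is used here purely as an illustrative variability measure; this section does not define HSSP's two measures, and the alignment representation (equal-length strings with `-` where a sequence does not span a position) is an assumption.

```python
import math
from collections import Counter


def column_stats(aligned_sequences, gap="-"):
    """Return a list of (occupancy, entropy_bits), one tuple per alignment column."""
    if not aligned_sequences:
        return []
    length = len(aligned_sequences[0])
    if any(len(s) != length for s in aligned_sequences):
        raise ValueError("all aligned sequences must have equal length")

    stats = []
    for col in range(length):
        residues = [s[col] for s in aligned_sequences if s[col] != gap]
        occupancy = len(residues)
        if occupancy == 0:
            entropy = 0.0  # undefined at zero occupancy; set to zero as described above
        else:
            fractions = [n / occupancy for n in Counter(residues).values()]
            entropy = sum(-p * math.log2(p) for p in fractions)
        stats.append((occupancy, entropy))
    return stats


def well_occupied_positions(stats, min_occupancy=5):
    """Indices of columns whose occupancy is strictly larger than min_occupancy."""
    return [i for i, (occ, _) in enumerate(stats) if occ > min_occupancy]


# Hypothetical toy alignment: column 0 is strictly conserved, column 2 is sparse.
alignment = ["YAK", "YC-", "YA-", "YG-", "YA-", "YT-", "Y--"]
stats = column_stats(alignment)
print(stats)
print(well_occupied_positions(stats))
```

In the toy alignment the last column has zero entropy only because a single sequence spans it, which is exactly the kind of ill-determined value the occupancy filter is meant to exclude.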
The following databases and information services are also available from the
Protein Design Group at EMBL, with network access provided by the same
mechanisms as for HSSP (FTP and WWW access, see above).
DSSP, a database of secondary structure, solvent accessibility and other information
derived from 3-D structures in the Protein Data Bank (2).
http://www.sander.embl-heidelberg.de/dssp/
personal email: sander@embl-ebi.ac.uk
FSSP, a database of protein structure families with similar folding motifs, based on
3-D alignments of protein structures.
http://www.sander.embl-heidelberg.de/fssp/
personal email: holm@embl-ebi.ac.uk
PDBselect, a representative subset of sequence-unique proteins of known 3-D structure selected from the Protein Data Bank (5).
http://www.sander.embl-heidelberg.de/pdbselect/
personal email: hobohm@embl-heidelberg.de
PredictProtein, an electronic mail server that provides a predicted secondary structure and
solvent accessibility profile for any protein sequence with homologues in
SwissProt. Rated at 72% sustained three-state accuracy (6, 7).
http://www.embl-heidelberg.de/predictprotein/
personal email: predict-help@embl-heidelberg.de
PropSearch, which performs searches in sequence databases using amino acid composition and other
non-sequential properties of a protein sequence as input (8).
http://www.sander.embl-heidelberg.de/propsearch/
personal email: hobohm@embl-heidelberg.de
GeneQuiz, results of automated protein sequence analysis for completely sequenced
genomes [e.g. Haemophilus influenzae (9), yeast].
http://www.sander.embl-heidelberg.de/genequiz/
personal email: genequiz@embl-heidelberg.de
GPCRDB, an information system for G-protein coupled receptors.
http://www.sander.embl-heidelberg.de/7tm/
personal email: vriend@embl-heidelberg.de
Dali, an electronic mail server that performs a 3-D similarity search in the Protein Data Bank, given the atomic
coordinates of a 3-D protein model as input (10).
http://www.sander.embl-heidelberg.de/dali/
personal email: holm@embl-heidelberg.de
Special software is available to construct 3-D models by homology based on the information in HSSP files, such as
WHATIF by Gert Vriend (11) or MaxSprout/Torso by Liisa Holm and Chris Sander (12, 13).
Report any problems with the HSSP database to the authors by electronic mail:
schneider@embl-heidelberg.de or sander@embl-heidelberg.de

REFERENCES
