ABSTRACT
The FSSP database presents a continuously updated classification of 3-D protein folds based on an all-against-all comparison of structures currently in the Protein Data
Bank (PDB) [Bernstein et al. (1977) J. Mol. Biol., 112, 535-542]. The database currently contains an extended structural family for
each of 600 representative protein chains which have <25% mutual sequence identity. The results of the exhaustive pairwise structure
comparisons are reported in the form of a fold tree generated by hierarchical
clustering and as a series of structurally representative sets of folds at
varying levels of uniqueness. For each query structure from the representative
set, there is a database entry containing structure-structure alignments with its structural neighbours in the representative
set and its sequence homologs in the PDB. All alignments are based purely on
the 3-D co-ordinates of the proteins and are derived by an automatic structure
comparison program (Dali). The FSSP database is accessible electronically on
the World Wide Web and by anonymous ftp.
Most newly determined protein sequences can be classified into families by
sequence homology. However, protein families are known to retain the shape of
the fold even when sequences have diverged below the limit of detection of
significant similarities at the sequence level. These similarities can be
detected by structural comparisons that merge protein families of known 3-D structure into structural classes, the members of which may or may not
be evolutionarily related (1-4). The FSSP database contains a fold classification based on exhaustive
structural alignments of known structures. The database provides a rich source
of information for the study of both divergent and convergent aspects of the
evolution of protein folds and defines useful test sets and a standard of truth
for assessing the correctness of sequence-sequence or sequence-structure alignments.
The major new developments since last year (5) are continuous updates of the database and easy access to the data using
browsers on the World Wide Web (WWW).
The basic structural entities currently used in the FSSP database are protein
chains, which are identified by the Protein Data Bank (PDB) entry code plus
chain identifier. All protein chains in the PDB entries that are longer than 30 residues
are listed alphabetically in PROTEIN INDEX, which gives a pointer to the
representative structure of the protein family and short summary information
about the strength of similarity to the representative. The sequence-representative set is derived using algorithm #1 of ref. 6 so that all pairwise sequence identities within this set are <25%. For example, PROTEIN INDEX (Fig. 1) tells you that the protease inhibitor domain of Alzheimer's amyloid beta-protein precursor is deposited in the PDB as entry 1AAP which has two
chains, A and B. Both the A and B chains are 45% sequence identical to the
representative structure of the family, which is bovine pancreatic trypsin
inhibitor (PDB entry 9PTI). As expected from the high sequence identity, the
folds of both 1AAP chains and that of 9PTI are virtually identical (1.0-1.1 Å root-mean-square deviation of CA positions).
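The idea of a sequence-representative set can be illustrated with a simple greedy filter. The sketch below is not claimed to be algorithm #1 of ref. 6: the function name `select_representatives`, the `identity` callback and the toy identity table are hypothetical, and chains are simply visited in the order given.

```python
from typing import Callable, Iterable, List


def select_representatives(
    chains: Iterable[str],
    identity: Callable[[str, str], float],
    max_identity: float = 25.0,
) -> List[str]:
    """Greedily keep chains whose identity to every kept chain is below the cutoff."""
    kept: List[str] = []
    for chain in chains:
        if all(identity(chain, rep) < max_identity for rep in kept):
            kept.append(chain)
    return kept


# Toy percent-identity table. The 45% values echo the 1AAP/9PTI example in
# the text; the A-versus-B value is made up for illustration.
TOY_IDENTITY = {
    frozenset(("9PTI", "1AAPA")): 45.0,
    frozenset(("9PTI", "1AAPB")): 45.0,
    frozenset(("1AAPA", "1AAPB")): 100.0,
}


def toy_identity(a: str, b: str) -> float:
    """Look up a pair in the toy table; unlisted pairs count as unrelated (0%)."""
    return TOY_IDENTITY.get(frozenset((a, b)), 0.0)


representatives = select_representatives(
    ["9PTI", "1AAPA", "1AAPB", "1PPT"], toy_identity
)
print(representatives)
```

With this toy input, both 1AAP chains are absorbed into the family represented by 9PTI, while the unrelated chain is kept as a representative of its own. The result of a greedy filter depends on the visiting order, which is one reason this should be read as an illustration only.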
For each protein chain in the representative set, with PDB identifier Nxxx
(e.g. 1PPT, 5PCY) and chain identifier Y (omitted if blank), there is an ASCII
(text) file Nxxx.FSSP or NxxxY.FSSP which lists from a few to several tens of proteins
structurally similar to the search structure, alongside the secondary structure
and solvent accessibility extracted from the 3-D coordinates of the search structure (8). The structural neighbours that are reported include any sequence homologs to
the query structure that have a structure in the PDB and all structurally
similar chains from the representative set (Z >= 2). Details about the Dali method used to derive the database are given in refs 9 and 10.
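The file naming convention described above can be written down as a small helper. This is a minimal sketch: the function name `fssp_filename` is hypothetical, the letter case of the name is passed through unchanged (an assumption), and the chain-bearing entry code `1XYZ` in the usage lines is a made-up placeholder.

```python
def fssp_filename(pdb_code: str, chain_id: str = "") -> str:
    """Build 'Nxxx.FSSP' or 'NxxxY.FSSP' from a PDB code and optional chain."""
    if len(pdb_code) != 4:
        raise ValueError("expected a four-character PDB entry code")
    # A blank chain identifier is omitted from the file name.
    return f"{pdb_code}{chain_id.strip()}.FSSP"


print(fssp_filename("1PPT"))       # entry without a chain identifier
print(fssp_filename("1XYZ", "A"))  # placeholder entry with chain A
```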
An FSSP file is divided into five formatted blocks and a free text footer which
explains the format. (i) The header block identifies the query structure,
database and structural alignment method used and gives the number of
structural neighbours. (ii) The summary block gives a one-line summary for each neighbour, including the statistical significance of
the similarity (Z-score), positional root-mean-square deviation of superimposed CA coordinates, total number
of equivalent residues and the percentage of sequence identity over
structurally equivalent positions. (iii) The alignments block is a multiple
structural alignment, printed vertically and showing the sequence and secondary
structure of matched residues. (iv) The equivalences block is a machine
readable listing that gives the residue numbers of the structurally equivalent
segments. (v) The matrices block gives the rotation-translation matrices that, when applied to the 3-D coordinates in the respective PDB entries, yield the least-squares superimposition of the matched protein onto the query
structure. See below for automatic parsing of FSSP entries.
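As an illustration of how a rotation-translation matrix from block (v) is used, the sketch below applies x' = Rx + t to a list of coordinates. The function name `apply_transform` and the split into a 3x3 rotation plus a translation vector are assumptions made for this example; the exact layout of the matrices in an FSSP file is explained in the file footer and is not reproduced here. The numbers are toy values, not taken from any database entry.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def apply_transform(
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    coords: Sequence[Point],
) -> List[Point]:
    """Return R x + t for each coordinate triple x."""
    moved = []
    for x in coords:
        moved.append(
            tuple(
                sum(rotation[i][j] * x[j] for j in range(3)) + translation[i]
                for i in range(3)
            )
        )
    return moved


# Toy example: rotate by 90 degrees about the z axis, then shift along x.
rotation = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
translation = [10.0, 0.0, 0.0]
moved = apply_transform(rotation, translation, [(1.0, 0.0, 0.0), (0.0, 2.0, 3.0)])
print(moved)
```

Applying the transformation to every atom of the matched protein, rather than to CA atoms only, gives superimposed coordinates suitable for display.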
The FSSP database is accessible over the WWW at the URL http://www.embl-heidelberg.de/dali/fssp/.
The most convenient starting point for a walk in fold space is to click the
`alignment' link in the FOLDTREE table. FSSP entries are parsed on the fly to
display structural neighbours of individual proteins in the form of structure
alignments laid out horizontally, multiple structure alignments (known
structures) combined with multiple sequence alignments [sequences homologous to
a known structure: HSSP database (11)] or superimposed coordinates [retrieved from PDB (12)] for viewing with molecular graphics programs such as Rasmol (13). There are further hypertext links to functional annotations and literature
references via SRS (14). For example, a study of the p21ras
family could start from the FOLDTREE table, which immediately shows transducin
alpha, the ADP-ribosylation factor 1 and elongation factor G as the closest structural
neighbours. From the structural alignment of these remote homologs one can
identify the conserved sequence motifs GxxxxGKS and NKxD (15). These patterns are conserved in all members of the protein families as seen
by extending the structure alignment with the results from a sequence database
search (11). The number of sequence relatives displayed can be reduced from several
hundred to a few tens using a cutoff of 50% identity between any pair that is
displayed (Fig. 3). Clicking on the sequence identifier (e.g. rash_rat) pops up the Swissprot (16) annotation for this sequence.
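The two motifs named above can be located in a sequence with regular expressions, reading `x` as "any residue". The sketch below is an illustration only: the helper names `motif_to_regex` and `find_motifs` are hypothetical, and the sequence is a short toy string, not a database entry.

```python
import re


def motif_to_regex(motif):
    """Translate a motif such as 'GxxxxGKS' ('x' = any residue) to a regex."""
    return re.compile("".join("." if c == "x" else re.escape(c) for c in motif))


def find_motifs(sequence, motifs=("GxxxxGKS", "NKxD")):
    """Map each motif to a list of (0-based start, matched segment) pairs."""
    hits = {}
    for motif in motifs:
        pattern = motif_to_regex(motif)
        hits[motif] = [(m.start(), m.group()) for m in pattern.finditer(sequence)]
    return hits


toy_sequence = "AAGAGGVGKSAANKCDAA"  # made-up sequence containing both motifs
hits = find_motifs(toy_sequence)
print(hits)
```

Note that `finditer` reports non-overlapping matches only, which is sufficient for motifs of this kind.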
The FSSP data sets can be obtained by anonymous ftp from ftp.embl-heidelberg.de in the directory:
/pub/databases/protein_extras/fssp.
Academic redistribution of single files or of the entire database is permitted.
No inclusion in other databases or database services, academic or other,
without explicit permission of the authors. All rights reserved. Not to be used
for classified research. Users are asked to cite ref. 9 and this paper when reporting results obtained using the database.
The size of the FSSP database is tightly coupled to that of the PDB from which
it is derived. The FSSP database is updated with each release of new structures
by the PDB. The size of the sequence-representative set of chains was 600 in August 1995, an 80% increase from
June 1994. The complete set of result files requires ~60 Mb of disk storage.
The current database contains at most one alignment per pair of full length
proteins. The alignments are constrained to be sequential, as this is
biologically meaningful, although the Dali method itself does not impose this constraint. Different chains
in one PDB entry are compared separately; chains with <30 residues or unknown sequence are excluded.
The structure comparison program Dali (9) defines the extent of the common structural core by maximizing the agreement
of intramolecular CA-CA distances. The scoring function was deliberately designed to allow
inter-domain conformational flexibility; hence, positional root mean square
deviations for the corresponding rigid-body superimpositions are often higher than for comparison methods that
put an absolute upper limit on intermolecular positional deviations. This, however, is only an apparent
disadvantage.
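The difference between comparing intramolecular distances and comparing superimposed positions can be made concrete with a toy calculation. The sketch below is not the Dali scoring function; it is a deliberately simplified measure (mean absolute difference of CA-CA distances over equivalent residue pairs) with a hypothetical function name and made-up coordinates. It shows that such a measure needs no superimposition: rotating and translating one structure leaves it unchanged. It relies on `math.dist`, which requires Python 3.8 or later.

```python
import math
from itertools import combinations


def mean_distance_difference(ca_a, ca_b):
    """Mean |d_A(i,j) - d_B(i,j)| over all pairs of equivalent CA atoms."""
    if len(ca_a) != len(ca_b) or len(ca_a) < 2:
        raise ValueError("need two equally long lists of at least two CA positions")
    diffs = [
        abs(math.dist(ca_a[i], ca_a[j]) - math.dist(ca_b[i], ca_b[j]))
        for i, j in combinations(range(len(ca_a)), 2)
    ]
    return sum(diffs) / len(diffs)


# Toy coordinates: B is A rotated by 90 degrees about z and shifted,
# so every intramolecular distance is preserved.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]
chain_b = [(-y + 5.0, x - 2.0, z + 1.0) for x, y, z in chain_a]
score = mean_distance_difference(chain_a, chain_b)
print(score)
```

Under a hinge motion between two domains, only the inter-domain distance pairs would change in such a measure, whereas a single rigid-body superimposition would show large positional deviations for one of the domains.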
Requests for alignments of newly solved crystallographic or solution NMR
structures (Cα co-ordinates required) may be sent to the Dali e-mail server with Internet address:
dali@embl-heidelberg.de
More information on the Dali server (10) is available on the WWW at:
URL http://www.embl-heidelberg.de/dali/dali.html.
Kindly report any problems to the authors by e-mail.

REFERENCES
