The SBASE protein domain library, Release 4.0: a collection of annotated protein
sequence segments
János Murvai1, Andrei Gabrielian2, Péter Fábián1,2, Zsolt Hátsági1,2, Kirill Degtyarenko2,+, Hedvig Hegyi1 and Sándor Pongor1,2,*
1 ABC Institute for Biochemistry and Protein Research, 2100 Gödöllö, Hungary and
2 International Centre for Genetic Engineering and Biotechnology, Area Science
Park, 34012 Trieste, Italy
Received September 25, 1995;
Revised and Accepted October 12, 1995
ABSTRACT
SBASE 4.0 is the fourth release of SBASE, a collection of annotated protein
domain sequences that represent various structural, functional, ligand binding
and topogenic segments of proteins. SBASE was designed to facilitate the
detection of functional homologies and can be searched with standard database
search tools, such as FASTA and BLAST3. The present release contains 61 137
entries provided with standardized names and cross-referenced to all major protein, nucleic acid and sequence pattern
collections. The entries are clustered into 13 155 groups in order to
facilitate detection of distant similarities. SBASE 4.0 is freely available by
anonymous ftp file transfer from ftp.icgeb.trieste.it. Individual records can
be retrieved with the gopher server at icgeb.trieste.it and with a World Wide
Web server at http://www.icgeb.trieste.it. Automated searching of SBASE with
BLAST can be carried out with the electronic mail server
sbase@icgeb.trieste.it, which now also provides a graphic representation of the
homologies. A related mail server, domain@hubi.abc.hu, assigns SBASE domain
homologies on the basis of SWISS-PROT searches.
INTRODUCTION
SBASE is a collection of protein domain sequences designed to facilitate the
detection of distant similarities typically found between modules of
multidomain proteins (1,2). A multidomain protein can share a biologically
significant sequence pattern with a number of different, functionally related
proteins or protein domains, even though the sequence alignments may not be
highly significant in the mathematical sense. SBASE can be considered as a
conversion of the protein sequence database into a format that facilitates
detection of such functional and structural similarities (3,4).
The current Release 4.0 of SBASE contains over 60 000 annotated protein sequence
segments consistently named by structure, function, biased composition, binding
specificity and/or similarity to other proteins. The format of the database is
such that it can be searched with standard programs, like FASTA (5) or BLAST3
(6), and the information given allows the prediction of function and the direct
detection of potential domain homologies.
The main developments with respect to the previous release can be summarized as
follows.
(i) Release 4.0 contains 61 137 sequence entries, 48% more than release 3.0
(Table 1).
(ii) The entries were clustered on the basis of BLAST similarity scores, as
previously described (4). The list of all clusters having at least two members
is deposited in a separate database, SBASE-CLUSTERS, which is now available
through anonymous ftp, as well as through the World Wide Web (WWW) server. A
total of 13 155 clusters were found. The definition of clusters is as
previously described (4).
(iii) An estimated 90% of the records are now provided with standard names. In
addition to the domain types used in the previous releases (e.g. structural,
functional, cellular topology and biased composition domains), standardized
names were given to repeat units that have no known function, using the name of
the parent protein or parent protein family followed by the word `REPEAT'.
Short descriptions and literature reviews have been prepared on some of the
domain types that are not described in other collections. These are now
available through the WWW server.
(iv) The graphical interface of Sonnhammer and Durbin (7), capable of displaying
domain homologies along the query sequence, was added to the WWW/email server
sbase@icgeb.trieste.it.
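The clustering step in item (ii) can be illustrated with a minimal sketch. The actual cluster definition is the one given in reference 4 and is not reproduced here; the version below simply assumes single-linkage grouping over pairs whose similarity score reaches a cut-off. The function name `cluster_entries`, the entry identifiers, the scores and the threshold are all hypothetical.

```python
from collections import defaultdict


def cluster_entries(entries, scored_pairs, threshold):
    """Single-linkage grouping: entries joined by any pair scoring >= threshold."""
    parent = {e: e for e in entries}

    def find(x):
        # Walk up to the root of x's group, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for e in entries:
        groups[find(e)].append(e)

    # Keep only clusters with at least two members, as stated for SBASE-CLUSTERS.
    return sorted(sorted(g) for g in groups.values() if len(g) >= 2)


# Hypothetical identifiers and similarity scores, for illustration only.
entries = ["D1", "D2", "D3", "D4", "D5"]
pairs = [("D1", "D2", 120), ("D2", "D3", 95), ("D4", "D5", 40)]
clusters = cluster_entries(entries, pairs, threshold=80)
print(clusters)
```

In this toy input D1, D2 and D3 end up in one cluster, while D4 and D5 remain singletons because their only pair scores below the assumed cut-off, so they would not appear in a list restricted to clusters of two or more members.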
DESCRIPTION OF THE DATA
Definition of protein domains
Domains included in SBASE are protein sequence segments with known structure
and/or function (for details see 3,4). The main entry classes are summarized in
Table 2. As a rule, domain boundaries were taken as indicated by the publishing
authors.
SBASE data originate from three main sources: (i) from the SWISS-PROT protein
sequence databank (8); (ii) from the Protein Sequence Database of the Protein
Identification Resource (PIR) (9); (iii) from the literature. The sequences are
either translated from nucleotide sequence databases (10,11) or directly keyed
in at the protein level. From a total of 61 137 records in SBASE 4.0, 47 765
(78.1%), 9538 (15.6%) and 3834 (6.3%) are of eukaryotic, prokaryotic and viral
origin, respectively. Domain sizes vary in length between five and 1000 amino
acids. The boundaries of the domains are either as previously defined in the
original publications or determined by homology to domains with known
boundaries.
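As a quick consistency check on the composition figures quoted above, the following short sketch recomputes the percentages from the record counts. Only the counts come from the text; the variable names are illustrative.

```python
# Record counts by origin, as quoted for SBASE 4.0.
total = 61137
counts = {"eukaryotic": 47765, "prokaryotic": 9538, "viral": 3834}

# The three classes should account for every record.
assert sum(counts.values()) == total

# Percentage share of each class, rounded to one decimal place.
shares = {origin: round(100 * n / total, 1) for origin, n in counts.items()}
print(shares)
```

The three counts add up to the stated total, and the rounded shares agree with the 78.1%, 15.6% and 6.3% given in the text.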
Redundancy of sequences in SBASE 4.0 is kept at a minimal level. In some cases
the domain definitions overlap, so the same sequence (e.g. EGF-REPEAT) can be present both as an independent entry and as part of another
entry (e.g. EXTRACELLULAR domain of a receptor). For the same reason entries
can belong to separate clusters in SBASE-CLUSTERS.
Cross-references
SBASE 4.0 has cross-references to several protein and nucleic acid data banks,
as well as to the PRINTS (12), PRODOM (13), BLOCKS (14) and PROSITE (15)
databases (Table 3). In each record the DR lines contain the cross-reference
data.
The format of SBASE 4.0 follows that of the EMBL and SWISS-PROT databases and
can be directly formatted under the GCG program package (16). A sample record is
shown in Figure 1. The field types used are listed in Table 4.
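Because the records follow the EMBL/SWISS-PROT line-code convention and carry cross-references on DR lines, the cross-references can be pulled out with a few lines of code. The sketch below assumes the SWISS-PROT-style layout `DR   DATABASE; field; field.`; the exact SBASE field layout is the one shown in Figure 1 and Table 4, and the sample record, its identifiers and the function name `parse_dr_lines` are hypothetical.

```python
def parse_dr_lines(record_text):
    """Return (database, [fields]) tuples for every DR line in a flat-file record."""
    refs = []
    for line in record_text.splitlines():
        # Line codes are two letters followed by blanks, as in EMBL/SWISS-PROT.
        if line.startswith("DR "):
            body = line[2:].strip().rstrip(".")
            fields = [f.strip() for f in body.split(";") if f.strip()]
            if fields:
                refs.append((fields[0], fields[1:]))
    return refs


# Hypothetical record fragment, for illustration only (not a real SBASE entry).
sample = """\
ID   EXAMPLE_DOMAIN
DR   SWISS-PROT; P00000; EXAMPLE_HUMAN.
DR   PROSITE; PS00000; EXAMPLE_PATTERN.
//
"""

refs = parse_dr_lines(sample)
for database, fields in refs:
    print(database, fields)
```

Keeping the database name separate from the remaining fields makes it straightforward to group cross-references by target collection.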
Citation
Users of SBASE and of the email servers are asked to cite this article in their
publications, for example in the following form `The sequence homologies were
analysed by searching the SBASE Protein Domain Sequence Library release 4.0 via
automated email server'.
DISTRIBUTION AND ACCESS
Distribution
SBASE 4.0 (21 September 1995) is distributed by anonymous ftp file transfer from
ftp.icgeb.trieste.it. The complete database is 50 Mb, in compressed form 6.3
Mb.
Retrieval of records by gopher server
Individual entries are available through the gopher server of ICGEB at
icgeb.trieste.it. Entries can be retrieved by SBASE identifiers, standard
names, description of the parent protein, organism and authors' names.
BLAST search by email server
SBASE 4.0 can be searched by the BLAST program using the email server
sbase@icgeb.trieste.it. An example of a search request sent by email is
presented in Figure 2A. The results of the search appear as best potential
domain homologies ranked according to BLAST score (Fig. 2C). A related email
server, domain@hubi.abc.hu, was created in order to assign SBASE domain
homologies on the basis of BLAST searches performed on the SWISS-PROT database
(17). Users can obtain all the necessary information by sending an email to
sbase@icgeb.trieste.it or to domain@hubi.abc.hu with the word HELP in the body
of the message.
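A help request of the kind described above can be composed programmatically. The sketch below only builds the message with Python's standard `email` package and does not send it; the sender address and the function name `build_help_request` are hypothetical, and the format of an actual search request is the one shown in Figure 2A, which is not reproduced here.

```python
from email.message import EmailMessage


def build_help_request(server_address, sender_address):
    """Compose a message whose body is the single word HELP."""
    msg = EmailMessage()
    msg["From"] = sender_address
    msg["To"] = server_address
    # The text only requires the word HELP in the body of the message.
    msg.set_content("HELP")
    return msg


# The sender address is a placeholder; sending would additionally need an SMTP relay.
msg = build_help_request("sbase@icgeb.trieste.it", "user@example.org")
print(msg)
```

The same function can be pointed at domain@hubi.abc.hu by changing the first argument.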
Access by WWW server
All the above services can also be accessed on-line using the WWW server at
http://www.icgeb.trieste.it. At present cross-references to SBASE-CLUSTERS,
EMBL, MEDLINE, MIM, PRINTS8.0, PRODOM28, PROSITE12 and SWISS-PROT29 (see
underlined items in Fig. 1) can be directly accessed through the WWW server.
ACKNOWLEDGEMENTS
SBASE was established in 1990 and is maintained collaboratively by the
International Centre for Genetic Engineering and Biotechnology, Trieste, Italy,
and the ABC Institute for Biochemistry and Protein Research, Gödöllö, Hungary.
REFERENCES
1 Barker,W.C., Hunt,L.T. and George,D.G. (1988) Protein Sequence Data Anal., 1, 363-373.
2 Baron,M., Norman,D.G. and Campbell,I.D. (1991) Trends Biochem. Sci., 16, 13-17.
3 Pongor,S., Skerl,V., Cserzö,M., Hátsági,Z., Simon,G. and Bevilacqua,V. (1993) Protein Engng, 6, 391-395.
22 Rudd,K.E., Bouffard,G. and Miller,G. (1992) In Davies,K.E. and Tilghmann,S.M. (eds), Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, pp. 1-38.
*To whom correspondence should be addressed at: International Centre for Genetic
Engineering and Biotechnology, Area Science Park, 34012 Trieste, Italy
+Present address: Department of Biochemistry and Molecular Biology, University of
Leeds, Leeds LS2 9JT, UK