The SWISS-PROT protein sequence data bank and its supplement TrEMBL
Amos Bairoch* and Rolf Apweiler1
Department of Medical Biochemistry, University of Geneva, 1 rue Michel Servet,
1211 Geneva 4, Switzerland and 1The EMBL Outstation - The European
Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge
CB10 1SD, UK
Received October 17, 1996;
Accepted October 21, 1996
ABSTRACT
SWISS-PROT is a curated protein sequence database which strives to provide a
high level of annotation (such as the description of the function of a
protein, the structure of its domains, post-translational modifications,
variants, etc.), a minimal level of redundancy and a high level of integration
with other databases. Recent developments of the database include: an increase
in the number and scope of model organisms; cross-references to two additional
databases; a variety of new documentation files; and the creation of TrEMBL, a
computer-annotated supplement to SWISS-PROT. This supplement consists of
entries in SWISS-PROT-like format derived from the translation of all coding
sequences (CDS) in the EMBL nucleotide sequence database, except the CDS
already included in SWISS-PROT.
INTRODUCTION
SWISS-PROT (1) is an annotated protein sequence database established in 1986
and maintained collaboratively, since 1987, by the Department of Medical
Biochemistry of the University of Geneva and the EMBL Data Library [now the
EMBL Outstation-The European Bioinformatics Institute (EBI) (2)]. The
SWISS-PROT protein sequence data bank consists of sequence entries. Sequence
entries are composed of different line types, each with its own format. For
standardization purposes the format of SWISS-PROT (3) follows as closely as
possible that of the EMBL Nucleotide Sequence Database. A sample SWISS-PROT
entry is shown in Figure 1.
RECENT DEVELOPMENTS
Model organisms
We have selected a number of organisms that are the target of genome sequencing
and/or mapping projects and for which we intend to: (i) Be as complete as
possible. All sequences available at a given time should be immediately
included in SWISS-PROT. This also includes sequence corrections and updates.
(ii) Provide a higher level of annotation. (iii) Provide cross-references to
specialized database(s) that contain, among other data, some genetic
information about the genes that code for these proteins. (iv) Provide
specific indices or documents.
Collectively these organisms represent about a third of the total number of
sequence entries in SWISS-PROT. In the last few months we included in
SWISS-PROT fully annotated versions of the protein sequence entries encoded on
the complete genome of M.genitalium as well as entries originating from the
full sequence of yeast chromosomes VII, X, XIII and XIV. We plan to finish
annotating all of the remaining yeast sequences (mainly from chromosomes IV,
XII, XV and XVI) in early 1997.
Documentation files
SWISS-PROT is distributed with a large number of documentation files. Some of
these files have been available for a long time (the user manual, release
notes, the various indices for authors, citations, keywords, etc.), but many
have been created recently and we are continually adding new files. Table 2
lists all the documents that are currently available or that will be added in
the next few months.
New cross-references
We have recently added cross-references that link SWISS-PROT to the following
databases: (i) the Harefield Hospital 2D gel protein databases (4) prepared
under the supervision of Mike Dunn; and (ii) the Maize genome 2D
Electrophoresis database (MAIZE-2DPAGE).
Currently, SWISS-PROT is linked to 26 different databases and has consolidated
its role as the major focal point of biomolecular database interconnectivity.
In release 34, there was an average of 3.3 cross-references for each sequence
entry.
File name     Description
              List of on-line experts for PROSITE and SWISS-PROT
acindex.txt   Accession number index
autindex.txt  Author index
citindex.txt  Citation index
keyindex.txt  Keyword index
speindex.txt  Species index
7tmrlist.txt  List of 7-transmembrane G-linked receptors entries
aatrnasy.txt  List of aminoacyl-tRNA synthetases
allergen.txt  Nomenclature and index of allergen sequences
calbican.txt  Index of C.albicans entries and their corresponding gene designations*
cdlist.txt    CD nomenclature for surface proteins of human leucocytes
celegans.txt  Index of C.elegans entries and corresponding gene designations and WormPep cross-references
dicty.txt     Index of D.discoideum entries and corresponding gene designations and DictyDB cross-references
ec2dtosp.txt  Index of E.coli Gene-protein database entries referenced in SWISS-PROT
ecoli.txt     Index of E.coli K12 chromosomal entries and corresponding EcoGene cross-references
embltosp.txt  Index of EMBL Database entries referenced in SWISS-PROT
extradom.txt  Nomenclature of extracellular domains
glycosid.txt  Index of glycosyl hydrolases classified by families on the basis of sequence similarities
haeinflu.txt  Index of H.influenzae RD chromosomal entries
hoxlist.txt   Vertebrate homeobox proteins: nomenclature and index
humchr20.txt  Index of protein sequence entries encoded on human chromosome 20*
humchr21.txt  Index of protein sequence entries encoded on human chromosome 21
humchr22.txt  Index of protein sequence entries encoded on human chromosome 22
humchrx.txt   Index of protein sequence entries encoded on human chromosome X*
humchry.txt   Index of protein sequence entries encoded on human chromosome Y
mimtosp.txt   Index of MIM entries referenced in SWISS-PROT
nomlist.txt   List of nomenclature related references for proteins
pdbtosp.txt   Index of Brookhaven PDB entries referenced in SWISS-PROT
peptidas.txt  Classification of peptidase families and index of peptidase entries
plastid.txt   List of chloroplast and cyanelle encoded proteins
pombe.txt     Index of S.pombe entries in SWISS-PROT and corresponding gene designations
restric.txt   List of restriction enzyme and methylase entries
ribosomp.txt  Index of ribosomal proteins classified by families on the basis of sequence similarities*
salty.txt     Index of S.typhimurium LT2 chromosomal entries and corresponding StyGene cross-references
subtilis.txt  Index of B.subtilis 168 chromosomal entries and corresponding SubtiList cross-references
yeast.txt     Index of S.cerevisiae entries and corresponding gene designations
yeast1.txt    Yeast Chromosome I entries
yeast2.txt    Yeast Chromosome II entries
yeast3.txt    Yeast Chromosome III entries
yeast5.txt    Yeast Chromosome V entries
yeast6.txt    Yeast Chromosome VI entries
yeast7.txt    Yeast Chromosome VII entries*
yeast8.txt    Yeast Chromosome VIII entries
yeast9.txt    Yeast Chromosome IX entries
yeast10.txt   Yeast Chromosome X entries*
yeast11.txt   Yeast Chromosome XI entries
yeast14.txt   Yeast Chromosome XIV entries*
*Documents that have been created since last year.
TrEMBL, a computer annotated supplement to SWISS-PROT
Introduction
Ongoing genome sequencing and mapping projects have dramatically increased the
number of protein sequences to be incorporated into SWISS-PROT. Since we do not
want to dilute the quality standards of SWISS-PROT by incorporating sequences
without proper sequence analysis and annotation, we cannot speed up the
incorporation of new incoming data indefinitely. However, as we also want to
make sequences available as fast as possible, we introduced TrEMBL
(TRanslation of EMBL nucleotide sequence database) with SWISS-PROT release 34
as a supplement to SWISS-PROT. TrEMBL consists of computer-annotated entries in
SWISS-PROT-like format derived from the translation of all coding sequences
(CDS) in the EMBL nucleotide sequence database, except for CDS already
included in SWISS-PROT.
The production of TrEMBL has emphasized the importance of linking not only to a
whole EMBL nucleotide sequence entry but also to individual CDS features within
that entry. This linking has now been achieved by using the 'PID', the Protein
IDentification number found in the '/db_xref' qualifier attached to every CDS
in the EMBL nucleotide sequence database. The DR lines of SWISS-PROT and TrEMBL
entries that point to an EMBL database entry now cite the EMBL AC number as
primary identifier and the PID as secondary identifier. Wherever the CDS
labelled with a given 'PID' is already integrated into SWISS-PROT, a '/db_xref'
qualifier citing the corresponding SWISS-PROT entry is added to that CDS in the
EMBL nucleotide sequence database. In the remaining cases the '/db_xref'
qualifier points to the corresponding TrEMBL entry. This approach enables us to
point precisely from a given SWISS-PROT entry to one of potentially many CDS in
the corresponding EMBL entry and vice versa.
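The primary/secondary identifier scheme can be illustrated with a minimal Python sketch that splits a DR line into its fields. The line layout assumed here (the line code DR followed by semicolon-separated fields and a closing period) and the sample AC number and PID are assumptions made for illustration; they are not taken from a real entry.

```python
from typing import NamedTuple


class CrossRef(NamedTuple):
    database: str
    primary_id: str    # for EMBL: the AC number of the nucleotide entry
    secondary_id: str  # for EMBL: the PID of one CDS within that entry


def parse_dr_line(line: str) -> CrossRef:
    """Split a DR line into database name, primary and secondary identifier.

    Assumes the layout 'DR   DATABASE; PRIMARY; SECONDARY.' (an assumption,
    see the hypothetical sample line below).
    """
    if not line.startswith("DR"):
        raise ValueError("not a DR line: " + line)
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 3:
        raise ValueError("expected at least three fields: " + line)
    return CrossRef(fields[0], fields[1], fields[2])


# Hypothetical DR line; the AC number and PID are invented for this sketch.
xref = parse_dr_line("DR   EMBL; X00000; g0000000.")
print(xref.database, xref.primary_id, xref.secondary_id)
```

With both identifiers available, a program can address a single CDS inside a nucleotide entry rather than the entry as a whole.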
This change will allow the development of software tools that automatically
retrieve the part of a nucleotide sequence entry that codes for a specific
protein. This will be especially useful in the context of the World-Wide Web,
as it will render obsolete the current situation where, for example, one needs
to retrieve the complete sequence of a yeast chromosome when one wants only the
nucleotide sequence coding for a specific protein encoded on that chromosome.
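A tool of the kind described above might work roughly as in the following Python sketch, which returns only the coding region labelled with a given PID. The CdsFeature record, its field names and the sample data are hypothetical, and the sketch assumes a CDS that occupies a single contiguous interval (no joined segments).

```python
from dataclasses import dataclass
from typing import Sequence

# Translation table used to complement a nucleotide string.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


@dataclass(frozen=True)
class CdsFeature:
    pid: str                  # hypothetical PID labelling this CDS
    start: int                # 1-based, inclusive
    end: int                  # 1-based, inclusive
    complement: bool = False  # True if the CDS lies on the reverse strand


def extract_cds(sequence: str, features: Sequence[CdsFeature], pid: str) -> str:
    """Return the nucleotide sequence of the CDS labelled with the given PID."""
    for feature in features:
        if feature.pid == pid:
            segment = sequence[feature.start - 1:feature.end]
            if feature.complement:
                # Reverse complement for a CDS on the opposite strand.
                segment = segment.translate(_COMPLEMENT)[::-1]
            return segment
    raise KeyError(pid)


# Invented nucleotide entry carrying two CDS features.
entry_sequence = "AAATGGCCTAAGG"
entry_features = [
    CdsFeature(pid="g1", start=3, end=11),
    CdsFeature(pid="g2", start=1, end=4, complement=True),
]
print(extract_cds(entry_sequence, entry_features, "g1"))
```

The caller supplies a PID and receives the coding region alone, however long the enclosing nucleotide entry is.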
Current status
The translation of all CDS in the EMBL Nucleotide Sequence Database release 48
(September 1996) resulted in the creation of 199 000 TrEMBL preentries. Around
80 000 of these preentries were already present as sequence reports in
SWISS-PROT and were excluded from TrEMBL. The remaining ~119 000 sequence
entries were then automatically merged whenever possible to reduce redundancy
in TrEMBL. This step led to ~110 000 TrEMBL entries. We have split TrEMBL into
two main sections: SP-TrEMBL and REM-TrEMBL.
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (~85 000) which should be
incorporated into SWISS-PROT. SWISS-PROT accession numbers have been assigned
to these entries. SP-TrEMBL is partially redundant with SWISS-PROT, since
~40 000 of these entries are only additional sequence reports of proteins
already in SWISS-PROT. We will try to merge these sequence reports as fast as
possible with the existing SWISS-PROT entries for these proteins, so as to make
SWISS-PROT and TrEMBL completely non-redundant.
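The two reduction steps described above (excluding CDS already in SWISS-PROT, then merging redundant preentries) can be sketched in Python as follows. The merging criterion is not specified in the text; this sketch assumes, purely for illustration, that preentries with an identical sequence from the same organism are merged. All names and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

PreEntry = Tuple[str, str, str]  # (PID, organism, amino acid sequence)


def build_trembl(pre_entries: Iterable[PreEntry],
                 pids_in_swissprot: Set[str]) -> Dict[Tuple[str, str], List[str]]:
    """Drop preentries already in SWISS-PROT, then merge the remainder.

    Assumed merge rule: same organism and identical sequence.
    Returns a mapping (organism, sequence) -> list of merged PIDs.
    """
    merged: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for pid, organism, sequence in pre_entries:
        if pid in pids_in_swissprot:
            continue  # already a sequence report in SWISS-PROT
        merged[(organism, sequence)].append(pid)
    return dict(merged)


# Invented toy data: four preentries, one of which is already in SWISS-PROT.
toy_pre_entries = [
    ("g1", "Homo sapiens", "MKT"),
    ("g2", "Homo sapiens", "MKT"),
    ("g3", "Homo sapiens", "MAA"),
    ("g4", "Mus musculus", "MKT"),
]
toy_entries = build_trembl(toy_pre_entries, pids_in_swissprot={"g3"})
print(len(toy_entries))
```

In the toy data, four preentries shrink to two entries, mirroring on a small scale the reduction from 199 000 preentries to ~110 000 TrEMBL entries.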
For SP-TrEMBL to act as a computer-annotated supplement to SWISS-PROT, new
procedures have been introduced whereby valuable annotation is added
automatically. First, all TrEMBL entries are scanned for all PROSITE (5)
patterns compatible with their taxonomic range. The results are added to the
annotator's section of the TrEMBL entry, which is not visible to the public.
Some of the patterns are known to be very reliable (i.e. they have no known
false positives). These are used to enhance the information content of the DE,
CC, DR and KW fields by adding information about the potential function of the
protein, metabolic pathways, active sites, cofactors, binding sites, domains,
subcellular location and other annotation to the entry whenever appropriate. We
also make use of the ENZYME database (6), using the EC number as a reference
point, to generate standardized description lines for enzyme entries and to
allow information such as catalytic activity, cofactors and relevant keywords
to be taken from ENZYME and added automatically to SP-TrEMBL entries.
Furthermore we make use of specialized genomic databases like FlyBase (7) to
parse information like the correct gene nomenclature and cross-references to
these databases into TrEMBL entries.
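The pattern scan mentioned above can be sketched in Python by translating the PROSITE pattern syntax into a regular expression. This is a simplified illustration and not the procedure actually used for TrEMBL: it covers the common syntax elements only (x, [..], {..}, repetition counts and the < and > anchors) and ignores rarer cases such as an anchor inside square brackets. The N-glycosylation pattern serves here only as a familiar syntax example, and the test sequence is invented.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a regex."""
    regex = pattern.rstrip(".").replace("-", "")
    # {ABC} means "any residue except A, B or C".
    regex = regex.replace("{", "[^").replace("}", "]")
    # x(2,4) means "between two and four arbitrary residues".
    regex = regex.replace("(", "{").replace(")", "}")
    regex = regex.replace("x", ".")
    # < and > anchor the pattern to the N- and C-terminus.
    regex = regex.replace("<", "^").replace(">", "$")
    return regex


def scan(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (1-based position, matched segment) for every match.

    A lookahead is used so that overlapping matches are also reported.
    """
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Invented sequence scanned with the N-glycosylation pattern N-{P}-[ST]-{P}.
hits = scan("MNGTANPSLNATQ", "N-{P}-[ST]-{P}")
print(hits)
```

In an annotation pipeline, only hits from patterns without known false positives would be propagated into the visible fields of an entry, as described above.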
REM-TrEMBL (REMaining TrEMBL) contains the entries (~20 000) that we do not
want to include in SWISS-PROT. This section is organized into five subsections:
(i) Most REM-TrEMBL entries are immunoglobulins and T-cell receptors. We
stopped entering immunoglobulins and T-cell receptors into SWISS-PROT, because
we only want to keep the germ line gene derived translations of these proteins
in SWISS-PROT and not all known somatically recombined variants of these
proteins. At the moment there are >12 000 immunoglobulins and T-cell receptors
in TrEMBL. We would like to create a specialized database dealing with these
sequences as a further supplement to SWISS-PROT and keep only a representative
cross-section of these proteins in SWISS-PROT.
(ii) Another category of data which will not be included in SWISS-PROT is
synthetic sequences. Again, we do not want to leave these entries in TrEMBL.
Ideally one should build a specialized database for artificial sequences as a
further supplement to SWISS-PROT.
(iii) Fragments with fewer than eight amino acids.
(iv) Coding sequences captured from patent applications. A thorough survey of
these entries has shown that, apart from a small minority (which have already
been integrated in SWISS-PROT), most of these sequences either contain
erroneous data or concern artificially generated sequences outside the scope of
SWISS-PROT.
(v) The last subsection consists of CDS translations where we have strong
evidence to believe that these CDS are not coding for real proteins.
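The five subsections above amount to a routing rule, sketched here in Python. The Entry record and its boolean flags are hypothetical stand-ins for whatever evidence is actually available for an entry, and the order in which the checks are applied is an assumption; only the threshold of eight amino acids comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

MIN_FRAGMENT_LENGTH = 8  # fragments shorter than this go to REM-TrEMBL


@dataclass
class Entry:
    sequence: str
    is_immunoglobulin_or_tcr: bool = False
    is_synthetic: bool = False
    is_fragment: bool = False
    from_patent: bool = False
    doubtful_cds: bool = False


def rem_subsection(entry: Entry) -> Optional[str]:
    """Return the REM-TrEMBL subsection of an entry, or None for SP-TrEMBL."""
    if entry.is_immunoglobulin_or_tcr:
        return "immunoglobulins and T-cell receptors"
    if entry.is_synthetic:
        return "synthetic sequences"
    if entry.is_fragment and len(entry.sequence) < MIN_FRAGMENT_LENGTH:
        return "small fragments"
    if entry.from_patent:
        return "patent application sequences"
    if entry.doubtful_cds:
        return "CDS not coding for real proteins"
    return None


print(rem_subsection(Entry("MKTAY", is_fragment=True)))
```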
PRACTICAL INFORMATION
Content of the current release
Currently (October 1996) SWISS-PROT contains ~60 000 sequence entries, comprising 21 million amino acids abstracted
from ~50 000 references. The data file (sequences and annotations) requires 120
Mb of disk storage space. The documentation and index files require ~40 Mb of disk space. No restrictions are placed on use or redistribution of
the data.
How to obtain SWISS-PROT
SWISS-PROT is distributed on CD-ROM by the EMBL Outstation - The European
Bioinformatics Institute (EBI) (2). The CD-ROM contains both SWISS-PROT and the
EMBL Nucleotide Sequence Database as well as other data collections and some
database query and retrieval software for MS-DOS and Apple Macintosh computers.
For all enquiries regarding the subscription and distribution of SWISS-PROT one
should contact The EMBL Outstation - The European Bioinformatics Institute,
Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
Telephone: (+44 1223) 494 400; Telefax: (+44 1223) 494 468; Email:
datalib@ebi.ac.uk.
Individual sequence entries can be obtained from the EBI network fileserver.
Detailed instructions on how to make the best use of this service, and in
particular on how to obtain protein sequences, can be obtained by sending the
following message to the network address netserv@ebi.ac.uk:
HELP
HELP PROT
If you have access to a computer system linked to the Internet you can obtain
SWISS-PROT using FTP (File Transfer Protocol), from the following file servers:
EBI anonymous FTP server
Internet address: ftp.ebi.ac.uk (or 192.54.41.33)
NCBI Repository (National Library of Medicine, NIH, Washington, DC, USA)
Internet address: ncbi.nlm.nih.gov (or 130.14.20.1)
ExPASy (Expert Protein Analysis System) server, University of Geneva,
Switzerland
Internet address: expasy.hcuge.ch (or 129.195.254.61)
National Institute of Genetics (Japan) FTP server
Internet address: ftp2.ddbj.nig.ac.jp (or 133.39.3.6)
How to submit data to SWISS-PROT
To submit data to SWISS-PROT and for all enquiries regarding the submission of
data to SWISS-PROT one should contact SWISS-PROT, The EMBL Outstation - The
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK.
Telephone: (+44 1223) 494 462; Telefax: (+44 1223) 494 468; Email:
datasubs@ebi.ac.uk (for submission), junker@ebi.ac.uk (for enquiries).
Interactive access to SWISS-PROT
The most efficient and user-friendly way to browse SWISS-PROT interactively is
to use the World-Wide Web (WWW) molecular biology server ExPASy (8) as well as
the one developed by the EBI. WWW is a global information retrieval system
merging the power of world-wide networks, hypertext and multimedia. Through
hypertext links, it gives access to documents and information available on
thousands of servers around the world. Using a WWW browser [such as Mosaic(TM),
Netscape Navigator(TM) or Microsoft Internet Explorer(TM)], one has access to
all the hypertext documents stored on the ExPASy and EBI servers (as well as
many other WWW servers).
The ExPASy server was made available to the public in September 1993. In October
1996 a cumulative total of 8 million connections was attained. It may be
accessed through its Uniform Resource Locator (URL - the addressing system
defined in WWW), which is: http://expasy.hcuge.ch/. The EBI server is
accessible under: http://www.ebi.ac.uk/
Release frequency
The present distribution frequency is four releases per year. Weekly updates are
also available by anonymous FTP. Three files are updated every week:
new_seq.dat  Contains all the new entries since the last full release.
upd_seq.dat  Contains the entries for which the sequence data has been updated
             since the last release.
upd_ann.dat  Contains the entries for which one or more annotation fields have
             been updated since the last release.
These files are available on the EBI, NCBI and ExPASy servers, whose Internet
addresses are listed above.
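A local copy of the last full release could be brought up to date from the three weekly files roughly as in the following Python sketch. It assumes that every entry in the flat files starts with an ID line whose second field is the entry name and ends with a '//' line, and that an entry appears in at most one of the three update files; the entry names and contents below are invented.

```python
from typing import Dict, List


def read_entries(text: str) -> Dict[str, str]:
    """Split flat-file text into entries keyed by entry name.

    Assumes each entry starts with an 'ID' line and ends with a '//' line.
    """
    entries: Dict[str, str] = {}
    current: List[str] = []
    for line in text.splitlines():
        if not current and not line.strip():
            continue  # skip blank lines between entries
        current.append(line)
        if line.strip() == "//":
            name = current[0].split()[1]  # second field of the ID line
            entries[name] = "\n".join(current)
            current = []
    return entries


def apply_weekly_updates(release: Dict[str, str],
                         new_seq: Dict[str, str],
                         upd_seq: Dict[str, str],
                         upd_ann: Dict[str, str]) -> Dict[str, str]:
    """Add new entries and replace updated ones in a copy of the release."""
    merged = dict(release)
    for update in (new_seq, upd_seq, upd_ann):
        merged.update(update)
    return merged


# Invented miniature files standing in for the release and the weekly updates.
release = read_entries("ID   AAAA_HUMAN\nSQ   old\n//\nID   BBBB_HUMAN\nSQ   b\n//\n")
new_seq = read_entries("ID   CCCC_YEAST\nSQ   c\n//\n")
upd_seq = read_entries("ID   AAAA_HUMAN\nSQ   new\n//\n")
upd_ann = read_entries("")
current_view = apply_weekly_updates(release, new_seq, upd_seq, upd_ann)
print(sorted(current_view))
```

Because the weekly files are cumulative since the last full release, the same procedure can be repeated each week starting from the unchanged release data.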
REFERENCES
1 Bairoch,A. and Apweiler,R. (1996) Nucleic Acids Res. 24, 21-25.