The SWISS-PROT protein sequence data bank and its new supplement TREMBL
Amos Bairoch and Rolf Apweiler^1
Department of Medical Biochemistry, University of Geneva, 1 rue Michel Servet, 1211 Geneva 4, Switzerland and ^1The EMBL Outstation-The European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, UK
Received October 3, 1995;
Revised and Accepted October 13, 1995
ABSTRACT
SWISS-PROT is a curated protein sequence database which strives to provide a
high level of annotation (such as the description of the function of a protein,
its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy
and a high level of integration with other databases. Recent developments of
the database include: an increase in the number and scope of model organisms;
cross-references to seven additional databases; a variety of new documentation
files; and the creation of TREMBL, an unannotated supplement to SWISS-PROT. This supplement consists of entries in SWISS-PROT-like format derived from the translation of all coding
sequences (CDS) in the EMBL nucleotide sequence database, except CDS already
included in SWISS-PROT.
INTRODUCTION
SWISS-PROT (1) is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now the EMBL Outstation-The European Bioinformatics Institute; 2). The SWISS-PROT protein sequence data bank consists of sequence entries. Sequence entries are composed of different line types, each with its own format. For standardization purposes the format of SWISS-PROT (3) follows as closely as possible that of the EMBL nucleotide sequence database. A sample SWISS-PROT entry is shown in Figure 1.
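The line-type organization described above lends itself to simple parsing. The following is a minimal sketch, not the software used to maintain the database. It assumes the conventions of the flat-file format: each line starts with a two-character line code followed by blanks, sequence data lines start with blanks, and an entry ends with a `//` line. The sample entry, its name and its accession number are invented for illustration.

```python
from collections import defaultdict


def parse_entries(text):
    """Split flat-file text into entries, grouping each entry's lines by line code."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            # The terminator line closes the current entry.
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
        elif line.strip():
            code = line[:2]   # two-character line code ("  " for sequence data)
            data = line[5:]   # content after the code and the blanks that follow it
            current[code].append(data.rstrip())
    return entries


# Invented sample entry, for illustration only.
SAMPLE = """\
ID   EXAMPLE_HUMAN   STANDARD;      PRT;    12 AA.
AC   P00000;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""

entries = parse_entries(SAMPLE)
entry = entries[0]
sequence = "".join(entry["  "]).replace(" ", "")
print(entry["ID"][0].split()[0], len(sequence))
```

Grouping by line code keeps the sketch independent of the internal format of each line type, which would need its own parser.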
RECENT DEVELOPMENTS
Model organisms
We have selected a number of organisms that are the target of genome sequencing
and/or mapping projects and for which we intend to: (i) be as complete as
possible (all sequences available at a given time should be immediately
included in SWISS-PROT, including sequence corrections and updates); (ii) provide a higher
level of annotation; (iii) cross-reference to specialized databases that contain, among other data, some
genetic information about the genes that code for these proteins; (iv) provide
specific indices or documents.
The organisms currently selected are: Arabidopsis thaliana (mouse-ear cress); Bacillus subtilis; Caenorhabditis elegans (worm); Dictyostelium discoideum (slime mold); Drosophila melanogaster (fruit fly); Escherichia coli; Haemophilus influenzae; Homo sapiens (human); Saccharomyces cerevisiae (budding yeast); Salmonella typhimurium; Schizosaccharomyces pombe (fission yeast); Sulfolobus solfataricus. Details of the database entries for these organisms are given in Table 1.
Collectively these organisms represent 30% of the total number of sequence
entries in SWISS-PROT.
In the last few months we have included in SWISS-PROT fully annotated versions of the protein sequence entries encoded on the complete genome of Haemophilus influenzae, as well as entries originating from the full sequence of yeast chromosomes I, II, III, V, VI, VIII, IX and XI.
Documentation files
SWISS-PROT is distributed with a large number of documentation files. Some of
these files have been available for a long time (the user manual, release
notes, the various indices for authors, citations, keywords, etc.), but many
have been created recently and we are continuously adding new files. Table 2 lists all the documents that are currently available or that will be added in the next few months.
New cross-references
We have recently added cross-references that link SWISS-PROT to the following databases:
(i) the LISTA database of yeast (Saccharomyces cerevisiae) genes coding for proteins, prepared under the supervision of Patrick Linder at the University of Geneva (4);
(ii) the Saccharomyces Genome Database (SGD or SacchDB), prepared under the supervision of Mike Cherry at Stanford University;
(iii) the Yeast Electrophoresis Protein Database (YEPD), prepared under the supervision of Jim Garrels from the Quest Protein Database Center of the Cold Spring Harbor Laboratory (5);
(iv) the StyGene section of the StySeq/StyMap integrated Salmonella typhimurium LT2 database, prepared by Ken Rudd at the National Center for Biotechnology Information (NCBI);
(v) the SubtiList relational database for the Bacillus subtilis 168 genome, prepared under the supervision of Ivan Moszer at the Pasteur Institute (6);
(vi) the database of Homology-derived Secondary Structure of Proteins (HSSP), prepared under the supervision of Chris Sander at the EMBL (7);
(vii) the transcription factor database (Transfac), developed by Edgar Wingender and Rainer Knueppel from the Gesellschaft fuer Biotechnologische Forschung mbH in Braunschweig (8).
Currently, SWISS-PROT is linked to 24 different databases and has consolidated its role as
the major focal point of biomolecular database interconnectivity. In release 32
there were an average of 3.5 cross-references for each sequence entry.
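A figure such as the average number of cross-references per entry can be recomputed from the flat file by counting DR (database cross-reference) lines. The sketch below is illustrative only. It assumes that each DR line names the target database before the first semicolon and that each entry ends with a `//` line. The sample entries and identifiers are invented.

```python
from collections import Counter


def cross_reference_stats(text):
    """Return (cross-references per database, mean cross-references per entry)."""
    per_database = Counter()
    n_entries = 0
    for line in text.splitlines():
        if line.startswith("//"):
            n_entries += 1                       # one terminator per entry
        elif line.startswith("DR   "):
            database = line[5:].split(";")[0].strip()
            per_database[database] += 1
    total = sum(per_database.values())
    mean = total / n_entries if n_entries else 0.0
    return per_database, mean


# Invented sample entries, for illustration only.
SAMPLE = """\
ID   EXAMPLE1_HUMAN
DR   EMBL; X00001; G000001; -.
DR   PROSITE; PS00000; EXAMPLE_PATTERN; 1.
DR   PDB; 0XXX; 01-JAN-95.
//
ID   EXAMPLE2_YEAST
DR   EMBL; X00002; G000002; -.
//
"""

per_database, mean = cross_reference_stats(SAMPLE)
print(dict(per_database), mean)
```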
TREMBL, an unannotated supplement to SWISS-PROT
Ongoing genome sequencing and mapping projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as fast as possible, we will introduce with SWISS-PROT release 33 an unannotated supplement to SWISS-PROT. This supplement consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except CDS already included in SWISS-PROT.
We name this supplement TREMBL (TRanslation from EMBL), since the translation
tools used to create translations of the CDS are based on the program 'TREMBL'
written by Thure Etzold at the EMBL in Heidelberg.
Translation of all CDS in the EMBL nucleotide sequence database release 44
resulted in the creation of 145 000 TREMBL pre-entries. Around 65 000 of these pre-entries were already present as sequence reports in SWISS-PROT and were excluded from TREMBL. The remaining ~80 000 sequence entries have been automatically merged
whenever possible, to reduce redundancy in TREMBL. This step led to ~70 000 TREMBL entries, which supplement SWISS-PROT.
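The two reduction steps described above (excluding pre-entries already in SWISS-PROT, then merging the remainder) can be sketched as follows. This is a toy illustration, not the production procedure. The actual merging criteria are not specified in the text, so the sketch substitutes a simple stand-in rule: identical sequences from the same organism are merged. All identifiers and sequences are invented.

```python
def build_supplement(pre_entries, swissprot_ids):
    """pre_entries: list of (identifier, organism, sequence) tuples.

    Returns (remaining, merged): the pre-entries left after exclusion, and a
    dict mapping (organism, sequence) to the identifiers merged under it.
    """
    # Step 1: exclude translations already present in SWISS-PROT.
    remaining = [e for e in pre_entries if e[0] not in swissprot_ids]

    # Step 2: merge redundant reports. ASSUMPTION: "redundant" is taken to mean
    # the same sequence from the same organism; the real rule may differ.
    merged = {}
    for identifier, organism, sequence in remaining:
        merged.setdefault((organism, sequence), []).append(identifier)
    return remaining, merged


# Invented pre-entries, for illustration only.
pre_entries = [
    ("G1", "Homo sapiens", "MKT"),
    ("G2", "Homo sapiens", "MKT"),
    ("G3", "Homo sapiens", "MAA"),
    ("G4", "Saccharomyces cerevisiae", "MKT"),
    ("G5", "Homo sapiens", "MKT"),
]
swissprot_ids = {"G1", "G3"}

remaining, merged = build_supplement(pre_entries, swissprot_ids)
print(len(pre_entries), len(remaining), len(merged))
```

The counts reported in the text follow the same shape: 145 000 pre-entries minus about 65 000 already in SWISS-PROT leaves about 80 000, which merging reduces to about 70 000.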
File name      Description
               Submission of sequence data to the SWISS-PROT data bank*
shortdes.txt   Short description of entries in SWISS-PROT
jourlist.txt   List of abbreviations for journals cited
keywlist.txt   List of keywords in use
speclist.txt   List of organism identification codes
experts.txt    List of on-line experts for PROSITE and SWISS-PROT
acindex.txt    Accession number index
autindex.txt   Author index
citindex.txt   Citation index
keyindex.txt   Keyword index
speindex.txt   Species index
7tmrlist.txt   List of 7-transmembrane G-linked receptor entries
aatrnasy.txt   List of aminoacyl-tRNA synthetases*
allergen.txt   Nomenclature and index of allergen sequences*
cdlist.txt     CD nomenclature for surface proteins of human leucocytes
celegans.txt   Index of Caenorhabditis elegans entries and corresponding gene designations and WormPep cross-references
dicty.txt      Index of Dictyostelium discoideum entries and corresponding gene designations and DictyDB cross-references
ec2dtosp.txt   Index of Escherichia coli gene-protein database entries referenced in SWISS-PROT
ecoli.txt      Index of Escherichia coli K12 chromosomal entries and corresponding EcoGene cross-references
embltosp.txt   Index of EMBL database entries referenced in SWISS-PROT
extradom.txt   Nomenclature of extracellular domains*
glycosyl.txt   Index of glycosyl hydrolases classified by families on the basis of sequence similarities
haeinflu.txt   Index of Haemophilus influenzae RD chromosomal entries*
hoxlist.txt    Vertebrate homeobox proteins: nomenclature and index
humchr21.txt   Index of protein sequence entries encoded on human chromosome 21*
humchr22.txt   Index of protein sequence entries encoded on human chromosome 22*
humchry.txt    Index of protein sequence entries encoded on human chromosome Y*
mimtosp.txt    Index of MIM entries referenced in SWISS-PROT
nomlist.txt    List of nomenclature-related references for proteins
pdbtosp.txt    Index of Brookhaven PDB entries referenced in SWISS-PROT
peptidas.txt   Classification of peptidase families and index of peptidase entries*
plastid.txt    List of chloroplast- and cyanelle-encoded proteins
pombe.txt      Index of Schizosaccharomyces pombe entries in SWISS-PROT and corresponding gene designations*
restric.txt    List of restriction enzymes and methylases entries
ribosomp.txt   Index of ribosomal proteins classified by families on the basis of sequence similarities
salty.txt      Index of Salmonella typhimurium LT2 chromosomal entries and corresponding StyGene cross-references*
subtilis.txt   Index of Bacillus subtilis 168 chromosomal entries and corresponding SubtiList cross-references*
yeast.txt      Index of Saccharomyces cerevisiae entries and corresponding gene designations
yeast1.txt     Yeast chromosome I entries*
yeast2.txt     Yeast chromosome II entries*
yeast3.txt     Yeast chromosome III entries
yeast5.txt     Yeast chromosome V entries*
yeast6.txt     Yeast chromosome VI entries*
yeast8.txt     Yeast chromosome VIII entries*
yeast9.txt     Yeast chromosome IX entries*
yeast11.txt    Yeast chromosome XI entries
Documents created since last year are flagged with an asterisk.
We have split TREMBL into two main sections, SP-TREMBL and REM-TREMBL. SP-TREMBL (SWISS-PROT TREMBL) contains entries (~55 000) which should be incorporated into SWISS-PROT. SWISS-PROT accession numbers have been assigned to these entries. SP-TREMBL is partially redundant against SWISS-PROT, since ~30 000 of these SP-TREMBL entries are only additional sequence reports of proteins already in SWISS-PROT. We will try to merge these sequence reports as fast as possible with the already existing SWISS-PROT entries for these proteins, so as to make SWISS-PROT and TREMBL completely non-redundant. REM-TREMBL (REMaining TREMBL) contains those entries (~15 000) that we do not wish to include in SWISS-PROT. This section is organized into four subsections.
(i) Most REM-TREMBL entries are immunoglobulins and T-cell receptors. We have stopped entering immunoglobulins and T-cell receptors into SWISS-PROT, because we want to keep only germ line gene-derived translations of these proteins in SWISS-PROT and not all known somatic recombinant variations of these proteins. At the moment there are >10 000 immunoglobulins and T-cell receptors in TREMBL. We would like to create a specialized database dealing with these sequences as a further supplement to SWISS-PROT and keep only a representative cross-section of these proteins in SWISS-PROT.
(ii) Another category of data which will not be included in SWISS-PROT is synthetic sequences. Again, we do not want to leave these entries in TREMBL. Ideally one should build a specialized database for artificial sequences as a further supplement to SWISS-PROT.
(iii) A third subsection consists of fragments with fewer than seven amino acids.
(iv) The last subsection consists of CDS translations for which we have strong evidence that the CDS do not code for real proteins.
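The partition into SP-TREMBL and the four REM-TREMBL subsections can be pictured as a simple decision function. The sketch below is only an illustration. The boolean flags (`is_immunoglobulin_or_tcr`, `is_synthetic`, `doubtful_cds`) are hypothetical, since the text does not say how these properties are detected, and the order of the tests is likewise an assumption. Only the seven-residue threshold comes from the text.

```python
MIN_LENGTH = 7  # fragments with fewer than seven amino acids go to REM-TREMBL


def classify(entry):
    """Return the TREMBL section for an entry given as a dict.

    The flag names are hypothetical placeholders for whatever evidence is
    actually used; only the length threshold is taken from the text.
    """
    if entry.get("is_immunoglobulin_or_tcr"):
        return "REM-TREMBL (i): immunoglobulins and T-cell receptors"
    if entry.get("is_synthetic"):
        return "REM-TREMBL (ii): synthetic sequences"
    if len(entry["sequence"]) < MIN_LENGTH:
        return "REM-TREMBL (iii): short fragments"
    if entry.get("doubtful_cds"):
        return "REM-TREMBL (iv): CDS unlikely to code for a real protein"
    return "SP-TREMBL"


# Invented entries, for illustration only.
examples = [
    {"sequence": "MKTAYIAKQRQISFVK"},
    {"sequence": "MKTAYIAKQRQISFVK", "is_immunoglobulin_or_tcr": True},
    {"sequence": "MKTAYIAKQRQISFVK", "is_synthetic": True},
    {"sequence": "MKTAYI"},
    {"sequence": "MKTAYIAKQRQISFVK", "doubtful_cds": True},
]
sections = [classify(e) for e in examples]
for section in sections:
    print(section)
```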
The creation of TREMBL as a supplement to SWISS-PROT was not only for the purpose of producing a more complete and up-to-date protein sequence collection. We also used this task to achieve a deeper integration of the EMBL nucleotide sequence database with SWISS-PROT + TREMBL.
We used the PID, the Protein IDentification number found in the /db_xref
qualifier tagged to every CDS in the EMBL nucleotide sequence database, as the
ID of the TREMBL entries created from these CDS. In all 65 000 cases where an
EMBL nucleotide sequence database CDS was already present as a sequence report
in SWISS-PROT the SWISS-PROT DR lines of the corresponding SWISS-PROT entries have been updated by citing the EMBL AC number as
primary identifier and the PID as secondary identifier. In all cases where a
PID is already integrated into SWISS-PROT a /db_xref qualifier citing the corresponding SWISS-PROT entry is added to the EMBL nucleotide sequence database CDS
labelled with this PID.
This approach enables us to point precisely from a given SWISS-PROT entry to one of potentially many CDS in the corresponding EMBL entry,
and vice versa. This change will allow the development of software tools that
automatically retrieve that part of a nucleotide sequence entry that codes for
a specific protein. This will be especially useful in the context of the World
Wide Web, as it will render obsolete the current situation where, for example,
one needs to retrieve the complete sequence of a yeast chromosome when one
wants the nucleotide sequence coding for a specific protein encoded on that
chromosome.
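A tool of the kind envisaged above could work roughly as follows. This is a minimal sketch under several assumptions. The cross-reference is modelled as an already parsed (EMBL accession number, PID) pair, and the nucleotide entry as a dictionary listing its CDS features. Only contiguous, forward-strand CDS with 1-based inclusive coordinates are handled. The data structures, accession numbers, PIDs and sequence are all invented.

```python
def cds_for_protein(xref, embl_entries):
    """Return the nucleotide sequence of the CDS named by a cross-reference.

    xref: (EMBL accession number, PID) pair, as cited from a protein entry.
    embl_entries: dict mapping accession number to a parsed nucleotide entry.
    """
    accession, pid = xref
    entry = embl_entries[accession]
    for cds in entry["cds"]:
        if cds["pid"] == pid:
            start, end = cds["location"]          # 1-based, inclusive
            return entry["sequence"][start - 1:end]
    raise KeyError(f"no CDS with PID {pid} in entry {accession}")


# Invented nucleotide entry with two CDS, for illustration only.
embl_entries = {
    "X00001": {
        "sequence": "ccatgaaataagggatgccctgatt",
        "cds": [
            {"pid": "G000001", "location": (3, 11)},
            {"pid": "G000002", "location": (15, 23)},
        ],
    }
}

print(cds_for_protein(("X00001", "G000002"), embl_entries))
```

Because the PID selects one CDS among several in the same nucleotide entry, only the coding region of interest is returned rather than the whole entry.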
PRACTICAL INFORMATION
Content of the current release
Release 32.0 of SWISS-PROT (October 1995) contains 48 440 sequence entries, comprising 17 000 000 amino acids abstracted from ~43 000 references. The data file (sequences and annotations) requires 90 Mb of disk storage space. The documentation and index files require ~30 Mb of disk space. No restrictions are placed on use or redistribution of the data.
How to obtain SWISS-PROT
SWISS-PROT is distributed on CD-ROM by the EMBL Outstation-The European Bioinformatics Institute (EBI) (2). The CD-ROM contains both SWISS-PROT and the EMBL nucleotide sequence database, as well as other data collections and some database query and retrieval software for MS-DOS and Apple Macintosh computers. For all enquiries regarding subscription to and distribution of SWISS-PROT one should contact The EMBL Outstation-The European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, UK (tel. +44 1223 494 400; fax +44 1223 494 468; email datalib@ebi.ac.uk).
Individual sequence entries can be obtained from the EBI file server. Detailed instructions on how to make the best use of this service and, in particular, on how to obtain protein sequences can be obtained by sending the following lines in a message to the network address netserv@ebi.ac.uk:
HELP
HELP PROT
If you have access to a computer system linked to the Internet you can obtain
SWISS-PROT using ftp (File Transfer Protocol) from the following file servers:
EBI anonymous ftp server (ftp.ebi.ac.uk or 192.54.41.33);
NCBI Repository, National Library of Medicine, NIH, Washington, DC
(ncbi.nlm.nih.gov or 130.14.20.1);
ExPASy (Expert Protein Analysis System) server, University of Geneva,
Switzerland (expasy.hcuge.ch or 129.195.254.61);
National Institute of Genetics (Japan) ftp server (ftp.nig.ac.jp or
133.39.16.66).
How to submit data to SWISS-PROT
To submit data to SWISS-PROT and for all enquiries regarding submission to SWISS-PROT one should contact SWISS-PROT, The EMBL Outstation-The European Bioinformatics Institute, Hinxton Hall,
Hinxton, Cambridge CB10 1RQ, UK [tel. +44 1223 494 462; fax +44 1223 494 468;
email datasubs@ebi.ac.uk (for submissions), junker@ebi.ac.uk (for enquiries)].
Interactive access to SWISS-PROT
The most efficient and user-friendly way to browse interactively in SWISS-PROT is to use the World Wide Web (WWW) molecular biology server ExPASy (9) or the WWW server developed by the EBI. WWW is a global information retrieval system merging the power of worldwide networks, hypertext and multimedia. Through hypertext links it gives access to documents and information available on thousands of servers around the world. To access a WWW server one needs a WWW browser. Popular browsers available for most computer platforms include Mosaic™, developed at the National Center for Supercomputing Applications (NCSA) of the University of Illinois at Champaign (obtainable by anonymous ftp from ftp.ncsa.uiuc.edu), and Netscape Navigator™, from Netscape Communications Corp. (available from ftp.netscape.com).
Using a WWW browser one has access to all the hypertext documents stored on the
ExPASy and EBI servers (as well as many other WWW servers).
The ExPASy server was made available to the public in September 1993. In August
1995 a cumulative total of 2 000 000 connections was attained. It may be
accessed through its Uniform Resource Locator (URL, the addressing system
defined in WWW) which is http://expasy.hcuge.ch/. The EBI server is accessible
under http://www.ebi.ac.uk/.
Release frequency
The present distribution frequency is four releases per year, although weekly
updates are also available. These updates are available by anonymous ftp. Three
files are updated every week: new_seq.dat, containing all the new entries since
the last full release; upd_seq.dat, containing the entries for which the
sequence data has been updated since the last release; upd_ann.dat, containing
the entries for which one or more annotation fields have been updated since the
last release. These files are available on the EBI, NCBI and ExPASy servers,
whose Internet addresses are listed above.
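One way to bring a local copy of the last full release up to date with the three weekly files is sketched below. The sketch makes two assumptions that the text does not state: that each update file contains complete replacement entries, and that entries can be keyed by the entry name on the ID line. The order in which the update files are applied is also an assumption. The sample entries are invented.

```python
def split_entries(text):
    """Map entry name -> full entry text ('ID' line opens, '//' line closes)."""
    entries, block = {}, []
    for line in text.splitlines():
        if not line.strip():
            continue
        block.append(line)
        if line.startswith("//"):
            id_line = next(l for l in block if l.startswith("ID   "))
            entries[id_line.split()[1]] = "\n".join(block)
            block = []
    return entries


def apply_weekly_updates(release, new_seq, upd_seq, upd_ann):
    """Overlay new_seq.dat, upd_seq.dat and upd_ann.dat on the last full release."""
    current = split_entries(release)
    for update in (new_seq, upd_seq, upd_ann):
        # ASSUMPTION: an updated entry replaces the released one wholesale.
        current.update(split_entries(update))
    return current


# Invented file contents, for illustration only.
RELEASE = """\
ID   A_HUMAN
DE   OLD DESCRIPTION A.
     AAAA
//
ID   B_HUMAN
DE   OLD DESCRIPTION B.
     BBBB
//
"""
NEW_SEQ = "ID   C_HUMAN\nDE   DESCRIPTION C.\n     CCCC\n//\n"
UPD_SEQ = "ID   A_HUMAN\nDE   OLD DESCRIPTION A.\n     AAAAGG\n//\n"
UPD_ANN = "ID   B_HUMAN\nDE   NEW DESCRIPTION B.\n     BBBB\n//\n"

current = apply_weekly_updates(RELEASE, NEW_SEQ, UPD_SEQ, UPD_ANN)
print(sorted(current))
```

Since each weekly file is cumulative since the last full release, the overlay always starts from the release itself rather than from the previous week's result.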