The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998
Amos Bairoch* and Rolf Apweiler1
Department of Medical Biochemistry, University of Geneva, 1 rue Michel Servet, 1211 Geneva 4, Switzerland and 1The EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Received October 22, 1997; Accepted October 24, 1997
ABSTRACT
SWISS-PROT (http://www.expasy.ch/) is a curated protein sequence database which strives to provide a high level of annotations (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of model organisms; cross-references to two additional databases; a variety of new documentation files; and improvements to TrEMBL, a computer-annotated supplement to SWISS-PROT. TrEMBL consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except the CDS already included in SWISS-PROT.
SWISS-PROT (1) is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library [now the EMBL Outstation, The European Bioinformatics Institute (EBI) (2)]. The SWISS-PROT protein sequence data bank consists of sequence entries. Sequence entries are composed of different line types, each with their own format. For standardisation purposes the format of SWISS-PROT (3) follows as closely as possible that of the EMBL Nucleotide Sequence Database. A sample SWISS-PROT entry is shown in Figure 1.
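As an illustration of the line-oriented layout described above, the following Python sketch groups the lines of one entry by line type. It assumes the usual flat-file conventions (a two-character line code followed by three blanks, sequence data lines that start with blanks, and a '//' terminator line). The sample entry, including its identifier and accession number, is invented for the example and is not a real SWISS-PROT record; the function names are likewise hypothetical.

```python
from collections import defaultdict

# Invented sample entry (placeholder values, not a real database record).
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;      PRT;    12 AA.
AC   X00000;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
SQ   SEQUENCE   12 AA;
     MKNGTLLNPS AN
//
"""


def group_by_line_type(entry_text):
    """Group the lines of one entry by their two-character line code."""
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):   # terminator line: end of the entry
            break
        if not line.strip():        # skip blank lines
            continue
        code = line[:2]
        if code == "  ":
            code = "sequence"       # sequence data lines carry no line code
        grouped[code].append(line[5:].rstrip())
    return dict(grouped)


def sequence_of(grouped):
    """Join the sequence data lines and drop the blanks between blocks."""
    return "".join("".join(grouped.get("sequence", [])).split())


entry = group_by_line_type(SAMPLE_ENTRY)
print(sorted(entry))        # the line types found in the entry
print(sequence_of(entry))   # the amino acid sequence as one string
```

Keeping the line code as the dictionary key means that each line type can then be handed to its own format-specific parser.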
We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:
Be as complete as possible. All sequences available at a given time should be immediately included in SWISS-PROT. This also includes sequence corrections and updates.
Provide a higher level of annotations.
Provide cross-references to specialised database(s) that contain, among other data, some genetic information about the genes that code for these proteins.
Table 1 lists, for each of the model organisms, the name of the specialised database to which cross-references are available, the name of the SWISS-PROT index file and the number of sequences in SWISS-PROT.
Collectively these organisms represent ~40% of the total number of sequence entries in SWISS-PROT. We are currently attempting to finish the integration into SWISS-PROT of all the putative proteins from E.coli, B.subtilis and yeast.
SWISS-PROT is distributed with a large number of documentation files. Some of these files have been available for a long time (the user manual, release notes, the various indices for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files. Table 2 lists all the documents that are currently available.
We have recently added cross-references that link SWISS-PROT to the Mouse Genome Database (MGD) (4), to the TIGR Microbial genome database (5) and to the Plant Gene Nomenclature Database (MENDEL) (6).
Currently, SWISS-PROT is linked to 30 different databases and has consolidated its role as the major focal point of biomolecular database interconnectivity. In release 35, there is an average of 3.3 cross-references for each sequence entry.
Ongoing genome sequencing and mapping projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make sequences available as fast as possible, in early 1997 we introduced TrEMBL (TRanslation of EMBL nucleotide sequence database), a supplement to SWISS-PROT. TrEMBL consists of computer-annotated entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for CDS already included in SWISS-PROT.
The production of TrEMBL has emphasised the importance of linking not only to a whole EMBL nucleotide sequence entry but also to the CDS feature level within that entry. This linking has now been achieved by using the 'PID', the Protein IDentification number found in the '/db_xref' qualifier attached to every CDS in the EMBL nucleotide sequence database. The DR lines of SWISS-PROT and TrEMBL entries that point to an EMBL database entry now cite the EMBL accession number as primary identifier and the PID as secondary identifier. In all cases where a 'PID' is already integrated into SWISS-PROT, a '/db_xref' qualifier citing the corresponding SWISS-PROT entry is added to the EMBL nucleotide sequence database CDS labelled with this 'PID'. In the remaining cases, a '/db_xref' qualifier points to the corresponding TrEMBL entry. This approach enables us to point precisely from a given SWISS-PROT entry to one of potentially many CDS in the corresponding EMBL entry and vice versa.
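The linking scheme described above can be sketched in a few lines of Python. The sketch assumes that a DR line has the layout 'DR', three blanks, then semicolon-separated fields terminated by a period, with the database name first, the primary identifier second and the secondary identifier third. The identifiers in the sample lines are invented, and the function names are hypothetical rather than part of any SWISS-PROT software.

```python
def parse_dr_line(line):
    """Split a DR (database cross-reference) line into its fields."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    if body.endswith("."):
        body = body[:-1]             # drop the terminating period
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        raise ValueError("DR line needs a database name and an identifier")
    return {
        "database": fields[0],
        "primary": fields[1],
        "secondary": fields[2] if len(fields) > 2 else None,
    }


def embl_links(dr_lines):
    """Return (accession number, PID) pairs for the EMBL cross-references."""
    links = []
    for line in dr_lines:
        xref = parse_dr_line(line)
        if xref["database"] == "EMBL":
            links.append((xref["primary"], xref["secondary"]))
    return links


# Invented identifiers, for illustration only.
sample = [
    "DR   EMBL; X00000; G0000001; -.",
    "DR   EMBL; X00000; G0000002; -.",
    "DR   PROSITE; PS00000; EXAMPLE_PATTERN; 1.",
]
print(embl_links(sample))
```

In the sample, the two EMBL lines share one accession number but carry different PIDs, which is the situation the PID is meant to resolve: several CDS within a single EMBL entry.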
In October 1997, TrEMBL release 5 was produced. Release 5 was based on the translation of all 277 000 CDS in the EMBL Nucleotide Sequence Database release 52. Around 100 000 of these CDS were already present as sequence reports in SWISS-PROT and were thus excluded from TrEMBL. The remaining ~177 000 sequence entries have been automatically merged whenever possible to reduce redundancy in TrEMBL. This step led to ~150 000 TrEMBL entries.
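The release 5 figures quoted above can be checked with a few lines of arithmetic. The numbers are the approximate values from the text; the variable names are ours, and the derived percentage is only as precise as those rounded inputs.

```python
total_cds = 277_000              # CDS translated from EMBL release 52
already_in_swissprot = 100_000   # approximate; excluded from TrEMBL
trembl_entries = 150_000         # approximate; entries left after merging

remaining = total_cds - already_in_swissprot    # entries before merging
merged_away = remaining - trembl_entries        # entries removed by merging
merged_fraction = merged_away / remaining       # share of entries merged away

print(remaining, merged_away, round(100 * merged_fraction))
```

On these rounded figures, automatic merging removes roughly one entry in seven.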
We have split TrEMBL into two main sections, SP-TrEMBL and REM-TrEMBL.
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (~120 000 in release 5) which should be incorporated into SWISS-PROT. SWISS-PROT accession numbers have been assigned to these entries. SP-TrEMBL is partially redundant against SWISS-PROT, since ~40 000 of these entries are only additional sequence reports of proteins already in SWISS-PROT. We merge these sequence reports as fast as possible with the already existing SWISS-PROT entries for these proteins, so as to make SWISS-PROT and TrEMBL completely non-redundant.
For SP-TrEMBL to act as a computer-annotated supplement to SWISS-PROT, new procedures have been introduced whereby valuable annotation is added automatically. First, all TrEMBL entries are scanned for all PROSITE (7) patterns compatible with their taxonomic range. The results are added to the annotator's section of the TrEMBL entry, which is not visible to the public. Some of these patterns are known to be very reliable (i.e. they have no known false positives). These reliable patterns are used to enhance the information content of the DE, CC, DR and KW fields by adding information about the potential function of the protein, metabolic pathways, active sites, cofactors, binding sites, domains, subcellular location and other annotation to the entry whenever appropriate. We also make use of the ENZYME database (8), using the EC number as a reference point, to generate standardised description lines for enzyme entries and to allow information such as catalytic activity, cofactors and relevant keywords to be taken from ENZYME and added automatically to SP-TrEMBL entries.
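To make the pattern-scanning step more concrete, here is a minimal Python sketch that translates a pattern written in PROSITE syntax into a regular expression and reports where it matches a sequence. It is not the procedure used to produce TrEMBL, which is not described in detail here. The function names and the test sequence are invented. The example pattern, N-{P}-[ST]-{P}, is the PROSITE N-glycosylation site pattern, chosen only because its syntax is compact; it is not one of the highly reliable patterns mentioned above.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":                  # x: any amino acid
            core = "."
        elif core.startswith("{"):       # {ABC}: any amino acid except A, B, C
            core = "[^" + core[1:-1] + "]"
        # [ABC] and single residues are already valid regex fragments.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(prefix + core + suffix)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (0-based start, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(%s))" % prosite_to_regex(pattern))
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(scan("MKNGTLLNPSANASG", "N-{P}-[ST]-{P}."))
```

The lookahead in scan() is a design choice that lets overlapping occurrences be reported; a plain search would skip any match that starts inside the previous one.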
Furthermore, we make use of specialised genomic databases like MGD (4) and FlyBase (9) to parse information like the correct gene nomenclature and cross-references to these databases into TrEMBL entries.
REM-TrEMBL (REMaining TrEMBL) contains the entries (~30 000 in release 5) that we do not want to include in SWISS-PROT. This section is organised in five subsections:
Most REM-TrEMBL entries are immunoglobulins and T-cell receptors. We stopped entering immunoglobulins and T-cell receptors into SWISS-PROT, because we only want to keep the germ line gene-derived translations of these proteins in SWISS-PROT and not all known somatically recombined variants of these proteins. At the moment there are more than 15 000 immunoglobulins and T-cell receptors in TrEMBL. We would like to create a specialised database dealing with these sequences as a further supplement to SWISS-PROT and keep only a representative cross-section of these proteins in SWISS-PROT.
Another category of data which will not be included in SWISS-PROT are synthetic sequences. Again, we do not want to leave these entries in TrEMBL. Ideally one should build a specialised database for artificial sequences as a further supplement to SWISS-PROT.
Fragments with fewer than eight amino acids.
Coding sequences captured from patent applications. A thorough survey of these entries has shown that, apart from a small minority (which have already been integrated in SWISS-PROT), most of these sequences either contain erroneous data or concern artificially generated sequences outside the scope of SWISS-PROT.
The last subsection consists of CDS translations where we have strong evidence to believe that these CDS are not coding for real proteins.
Currently (October 1997), SWISS-PROT contains ~68 500 sequence entries, comprising 24.8 million amino acids abstracted from ~56 000 references. The data file (sequences and annotations) requires 135 Mb of disk storage space. The documentation and index files require ~45 Mb of disk space. No restrictions are placed on use or redistribution of the data.
The most efficient and user-friendly way to browse interactively in SWISS-PROT or TrEMBL is to use the World-Wide Web (WWW) molecular biology server ExPASy (10,11) or the one developed by the EBI. The ExPASy Web server was made available to the public in September 1993. In October 1997, a cumulative total of 17 million connections was attained. It may be accessed through its Uniform Resource Locator (URL, the addressing system defined in WWW), which is: http://www.expasy.ch/
On both the ExPASy and the EBI Web servers, you can use the Sequence Retrieval System (SRS) (12) software package to query and retrieve sequence entries.
SWISS-PROT + TrEMBL is distributed on CD-ROM by the EMBL Outstation, the European Bioinformatics Institute (EBI) (2). The CD-ROMs contain SWISS-PROT + TrEMBL, the EMBL Nucleotide Sequence Database as well as other data collections and some database query and retrieval software for MS-DOS and Apple Macintosh computers. For all enquiries regarding the subscription and distribution of SWISS-PROT + TrEMBL one should contact: The EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. Tel: +44 1223 494 400; Fax: +44 1223 494 468; Email: datalib@ebi.ac.uk
If you have access to a computer system linked to the Internet you can obtain SWISS-PROT using FTP (File Transfer Protocol), from the following file servers:
ExPASy (Expert Protein Analysis System) server, University of Geneva, Switzerland
To submit new sequence data to SWISS-PROT and for all enquiries regarding submissions to SWISS-PROT one should contact: SWISS-PROT, The EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. Tel: +44 1223 494 462; Fax: +44 1223 494 468; Email: datasubs@ebi.ac.uk (for submission); junker@ebi.ac.uk (for enquiries).
The present distribution frequency is four releases per year. Weekly updates are also available; these updates are available by anonymous FTP. For SWISS-PROT, three files are updated every week:
new_seq.dat
Contains all the new entries since the last full release.
upd_seq.dat
Contains the entries for which the sequence data has been updated since the last release.
upd_ann.dat
Contains the entries for which one or more annotation fields have been updated since the last release.
These files are available on the EBI, ExPASy and NCBI servers, whose Internet addresses are listed above.
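One way a site might fold the three weekly files into a local copy of the last full release is sketched below in Python. This is an assumption about local use, not a documented procedure: entries are keyed on the first accession number of the AC line, new entries are added, and updated entries replace the older versions. The order in which the three files are applied, the function names and the tiny sample entries are all invented for the example.

```python
def primary_accession(entry_lines):
    """Return the first accession number listed on the entry's AC line."""
    for line in entry_lines:
        if line.startswith("AC   "):
            return line[5:].split(";")[0].strip()
    raise ValueError("entry has no AC line")


def read_entries(text):
    """Return {primary accession number: entry text} for flat-file text."""
    entries, current = {}, []
    for line in text.splitlines():
        current.append(line)
        if line.startswith("//"):        # terminator line closes the entry
            entries[primary_accession(current)] = "\n".join(current) + "\n"
            current = []
    return entries


def apply_weekly_updates(release, new_seq, upd_seq, upd_ann):
    """Merge the weekly files into the entries of the last full release."""
    merged = read_entries(release)
    for update in (new_seq, upd_seq, upd_ann):
        merged.update(read_entries(update))   # add new, replace updated
    return merged


# Invented miniature files, for illustration only.
release = "ID   A_TEST\nAC   X00001;\nDE   OLD DESCRIPTION.\n//\n"
new_seq = "ID   B_TEST\nAC   X00002;\nDE   NEW ENTRY.\n//\n"
upd_seq = ""
upd_ann = "ID   A_TEST\nAC   X00001;\nDE   REVISED DESCRIPTION.\n//\n"

current = apply_weekly_updates(release, new_seq, upd_seq, upd_ann)
print(sorted(current))
```

Because each weekly file is described as cumulative since the last full release, the sketch always starts from the release itself rather than from a previously patched copy.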
Every week we produce a complete non-redundant protein sequence collection by providing three compressed files (these are in the directory /databases/sp_tr_nrdb on the ExPASy FTP server and in /pub/databases/sp_tr_nrdb on the EBI server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of users:
(i) Managers of similarity search services. They can now provide what is currently the most comprehensive and non-redundant data set of protein sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT and TrEMBL at their own schedule without having to wait for full releases of SWISS-PROT or of TrEMBL.