The EMBL Nucleotide Sequence Database
Guenter Stoesser*, Mary Ann Moseley, Joanne Sleep, Michael McGowran, Maria Garcia-Pastor and Peter Sterk
EMBL Outstation - The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Received October 1, 1997; Revised and Accepted October 10, 1997
ABSTRACT
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl.html) constitutes Europe's primary nucleotide sequence resource. DNA and RNA sequences are directly submitted from researchers and genome sequencing groups and collected from the scientific literature and patent applications (Fig. 1). In collaboration with DDBJ and GenBank the database is produced, maintained and distributed at the European Bioinformatics Institute. Database releases are produced quarterly and are distributed on CD-ROM. EBI's network services allow access to the most up-to-date data collection via Internet and World Wide Web interface, providing database searching and sequence similarity facilities plus access to a large number of additional databases.
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl.html) is a central activity of the European Bioinformatics Institute (EBI) (http://www.ebi.ac.uk), an EMBL outstation located at the Wellcome Trust Genome Campus in Hinxton, near Cambridge, UK. Database services provided by the EBI (1) are a continuation and extension of the former EMBL Data Library (2) in Heidelberg, Germany. In addition to producing the Nucleotide Sequence Database, the EBI maintains and distributes the SWISS-PROT Protein Sequence Database (3) in collaboration with Amos Bairoch of the University of Geneva, TREMBL (a SWISS-PROT supplement consisting of translations of EMBL database coding sequences), the Radiation Hybrid Database Rhdb (4) and many other specialist molecular biology databases, many of which are produced in collaboration with the EBI.
The EMBL Database collects, organizes and distributes nucleotide sequence data and related biological information. Since 1982 this work has been done in collaboration with GenBank (NCBI, Bethesda, USA) and the DNA Data Bank of Japan (DDBJ, Mishima). Each of the three international collaborating databases, DDBJ, EMBL and GenBank, collects a portion of the total sequence data reported world-wide. All new and updated database entries are exchanged between the members of the International Nucleotide Sequence Collaboration on a daily basis. EMBL Database releases are produced quarterly and are distributed on CD-ROM. The most up-to-date data collection is available via Internet and World Wide Web interface.
The explosive growth of the database continues. Currently at over 1200 million base pairs, the database almost doubles in size each year, with a new sequence being deposited on average once a minute. Sequencing projects like the Human Genome Project and a growing number of other genome sequencing groups produce large quantities of new sequence data. The recent increase in data volume is a direct consequence of ongoing collaborations between major sequencing projects and the EMBL Database.
EMBL Database entries are grouped into divisions. The grouping is based mainly on taxonomy, with a few exceptions such as the new HTG (High Throughput Genome Sequences) and GSS (Genome Survey Sequences) divisions, for which grouping is based on the specific nature of the underlying data. Divisions thus provide subsets of the database which reflect the areas of interest of many of our users. The EMBL Database currently consists of 17 divisions, with each entry belonging to exactly one division. In each entry the corresponding division is indicated using the three-letter codes as shown below:
The DDBJ, EMBL and GenBank nucleic acid sequence data banks have from their inception used tables of sites and features to describe the roles and locations of higher order sequence domains and elements within the genome of an organism. The feature table follows the unified DDBJ/EMBL/GenBank Feature Table Definition which is available from the EBI network servers (http://www.ebi.ac.uk/ebi_docs/embl_db/ft/feature_table.html).
Where appropriate, entries are cross-referenced to other databases like SWISS-PROT, TREMBL (a computer-annotated supplement to SWISS-PROT), Eukaryotic Promoter database (5), TransFac (6), Flybase (7), etc. The feature table qualifier /db_xref represents cross-references to external databases.
In the feature table example (Fig. 2), a cross-reference from a nucleotide coding region (CDS) feature to the corresponding protein in the protein database `SPTREMBL' indicates that this feature corresponds to the entry in the SP-TREMBL database with the given identifier.
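The cross-reference convention can be illustrated with a short Python sketch. It assumes the qualifier is written as /db_xref="DATABASE:IDENTIFIER" on a feature table line; the sample lines and the identifier P12345 are invented for illustration and are not taken from Fig. 2.

```python
import re

# Assumed qualifier layout: /db_xref="DATABASE:IDENTIFIER"
DB_XREF = re.compile(r'/db_xref="([^:"]+):([^"]+)"')


def extract_db_xrefs(feature_lines):
    """Return (database, identifier) pairs found in feature table lines."""
    xrefs = []
    for line in feature_lines:
        for database, identifier in DB_XREF.findall(line):
            xrefs.append((database, identifier))
    return xrefs


# Hypothetical feature table fragment (not a real database entry).
sample = [
    'FT   CDS             14..1495',
    'FT                   /db_xref="SPTREMBL:P12345"',
    'FT                   /product="hypothetical protein"',
]

print(extract_db_xrefs(sample))
```

Splitting on the first colon keeps the database name and the identifier separate, so a client could route each pair to the matching external resource.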
The main sources contributing to the overall growth of the EMBL Database are direct submissions from authors, genome project submissions and patent data received from the European Patent Office.
Most journals require submission of sequence information to a database prior to publication. When the submission is received, database staff will provide the author with a unique database accession number to identify the sequence. The database accession number is included in the manuscript, preferably in a footnote on the first page of the article, or as required by individual journal procedures. This mandatory submission policy, the regular publishing of database accession numbers in journal articles, and the early distribution of `Table of contents' listings by some of the major journals ensure the availability and distribution of new sequence data in a timely fashion. Additionally, the EMBL Database continues to scan major European molecular biology journals in order to update bibliographic references in existing database entries.
The EMBL Nucleotide Sequence Database provides a number of different mechanisms for the direct submission of sequence data. Direct submission of sequence is the most reliable means of ensuring that entries accurately and completely reflect the underlying data. Due to the now standard practice among researchers of submitting their data directly to one of the collaborating databases, there has been an unprecedented reduction in the delay between determination of a sequence and the appearance of that sequence in the database compared to earlier years.
Data confidentiality and release dates. Sequences submitted to the database can either be released to the public immediately or be withheld until an author-specified date. Data are never withheld after publication. In general, unless otherwise directed by the author, submitted sequences are available to the research community several months before these sequences appear in a journal publication.
A new generation of sophisticated sequence submission tools is now available from the EBI, allowing authors to submit sequence data to the EMBL Sequence Database in a simple and user-friendly way, either via WWW forms (Webin) or via a multi-platform (Mac/PC/Unix) stand-alone software tool (Sequin).
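The release rules above can be written down as a small decision function. This is only a sketch of the stated policy, not the database's actual release software; the function name and parameters are hypothetical.

```python
from datetime import date


def is_public(today, hold_until=None, published=False):
    """Decide whether an entry is visible to the public on a given day.

    hold_until: author-specified release date, or None for immediate release.
    published:  True once the sequence has appeared in a publication.
    """
    if published:
        # Data are never withheld after publication.
        return True
    if hold_until is None:
        # No hold was requested, so the entry is released immediately.
        return True
    return today >= hold_until
```

Publication takes precedence over the hold date, which mirrors the rule that data are never withheld once the article has appeared.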
Webin (Fig. 3 ) is the new WWW sequence submission tool for submitting nucleotide sequence data and associated biological information to the EMBL Nucleotide Sequence Database at the European Bioinformatics Institute (EBI). To access Webin at the EBI please use the following URL: http://www.ebi.ac.uk/submission/webin.html
Sequin is the latest multi-platform (Mac/PC/Unix) stand-alone software tool developed by the NCBI for submitting entries to the EMBL, GenBank or DDBJ sequence databases. The Sequin program, along with detailed downloading and installation instructions and general information, is available from the EMBL Database via WWW browser and anonymous FTP.
The E-mail data submission form solicits all the information needed to create a database entry. This form is available from the following sources: (i) by electronic mail via the EBI fileserver (netserv@ebi.ac.uk); (ii) from the EBI WWW-server (http://www.ebi.ac.uk/subs/allsubs.html); (iii) with all releases of the EMBL Nucleotide Sequence Database.
Additionally, a paper submission form is available from the EMBL Database.
We encourage authors planning to submit a large number of similar sequences (e.g., >25) on a single occasion to contact the database before submitting the data. Database staff will then assist in making the submission as convenient as possible, saving the author the time and effort required to complete numerous individual submissions. When contacting database staff, authors should indicate the number of sequences they plan to submit and provide a completed submission form for one of the sequences. Based on this information, database staff will create a series of templates and send them to the author, who then needs to supply only the information unique to each sequence. Once returned, these templates are processed en masse by database curators.
Once a database entry has been created from a submission, a copy is sent to the submitter for their reference and for comments or corrections. However, it often happens that the entry is correct when it is created but, with the passage of time, becomes out of date: the authors may make corrections to the sequence itself, or may discover new features of the sequence. Since such findings are often not published, the only way to keep entries correct and up-to-date is if the authors communicate their new findings to the database. Update information relating to EMBL Database entries can be communicated to the database via WWW, Email or by post or fax:
FTP: ftp.ebi.ac.uk in the file: pub/databases/embl/release/update.doc
Email: upon request from update@ebi.ac.uk
In a single month the EMBL Database receives ~1500 update messages, of which ~900 are citation updates. Users are welcome to report any errors they find in the database, but should be aware that only the original submitting group can authorize sequence updates and major annotation changes.
Citation updates. Most submissions represent data that have not yet been accepted for publication, and therefore a full journal citation for the data is not available when the entry is created. We therefore urge researchers to inform us when and where data they have submitted to us are published, and to include relevant accession numbers in such publications. We rely mostly on the co-operation of authors and users to keep the bibliographic references up-to-date and to release confidential data to the public in due time.
Nowadays, at least in sheer quantity, large-scale sequencing projects are the major sources of new sequence data. Automatic procedures for the direct submission and incorporation of such data have been developed to accommodate new projects easily. For groups producing large volumes of nucleotide sequence data over an extended period, submission accounts can be established with the EBI. A submission protocol is agreed upon, and database entries produced at the research site can be deposited and updated directly by the originating group via FTP or Email. A number of genome projects and research groups have established submission accounts in the past few years, and the procedure has proved flexible and efficient both for the research groups and for database staff. Each submission account is `curated' by EBI biologists, who check that new entries follow database annotation conventions and are consistent with other entries from the same project. The curator also serves as an informed liaison between the sequencing group and the database. Groups wishing to establish a submission account with the EBI should contact database staff. A selection of groups who already submit data using this method is given below:
Human genome project, Sanger Centre, UK
Caenorhabditis elegans, Sanger Centre, UK
Schizosaccharomyces pombe, Sanger Centre, UK
Brugia malayi, Sanger Centre, UK
Mycobacterium tuberculosis, Sanger Centre, UK
Mycobacterium leprae, Sanger Centre, UK
Fugu rubripes GSS, HGMP-RC, UK
Arabidopsis thaliana ESSA project, MIPS, Germany
Human EST, Padova, Italy
Bacillus subtilis, Institut Pasteur, Paris, France
HIV, Amsterdam, Netherlands
European Drosophila Mapping Consortium, Cambridge, UK
Mycoplasma capricolum, NCHGR, USA
An effort is being made to monitor the progress of a number of large genome sequencing projects. A genome monitoring table (Genome MOT) showing the total amount of finished and unfinished genomic DNA sequence deposited per year into the DDBJ/EMBL/GenBank databases for a number of organisms is updated on a weekly basis and can be found at URL: http://www.ebi.ac.uk/~sterk/genome-MOT/
A graphical representation of the Genome MOT from 24th September 1997 is shown in Figure 4.
Patent Backfile project. The EMBL Database, in an ongoing collaboration with the European Patent Office (EPO), has been processing a `backfile' of European patent documents in order to extract the sequence data and incorporate them into the public sequence databases. NCBI in the USA and DDBJ in Japan are dealing with the American and Japanese patent literature, respectively. The EPO/EMBL Backfile project covered patent documents from 1960 to 1993 and has been completed, with >25 000 protein and nucleotide sequences recovered.
Patent Frontfile project. After completion of the Backfile project, the EMBL Database is now collaborating with the EPO on the inclusion of Frontfile data consisting of patent applications required in electronic form by the EPO since September 1993. Software and automatic procedures have been created and are now in use to process and integrate sequence data from recent EPO patent applications. EPO's policy is to release data to the public (and to EMBL) 18 months after the patent application date, independent of whether a patent has been granted or not. Immediately after release by the EPO the latest patent sequence data are integrated into the EMBL database and made available to the public.
A new division for High Throughput Genome Sequences (HTG) has been created. This new division is used for genomic sequences which are produced by high-throughput sequencing projects. The records are primarily nematode and human and consist of long sequences. The annotation for many of these records is generated through computer analyses. Entries in this division all contain keywords to indicate the status of the sequencing (e.g., HTGS_PHASE1).
A single accession number is normally assigned to one clone, and as sequencing progresses and the entry passes from one phase to another, it will retain the same accession number. It is important to note that these data are unfinished and do not necessarily represent the correct sequence. Work on the sequence is in progress and the release of this data is based on the understanding that the sequence may change as work continues.
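As a minimal sketch of how a client might read the sequencing status from an HTG entry, the following Python function looks for a status keyword among an entry's keywords. Only HTGS_PHASE1 is named in the text; the assumption that other phases follow the same HTGS_PHASE<number> pattern, and the function name itself, are illustrative.

```python
import re

# Assumed pattern: keywords of the form HTGS_PHASE<number>, e.g. HTGS_PHASE1.
PHASE_KEYWORD = re.compile(r"HTGS_PHASE(\d+)")


def htg_phase(keywords):
    """Return the sequencing phase number from an entry's keywords, or None."""
    for keyword in keywords:
        match = PHASE_KEYWORD.fullmatch(keyword.strip())
        if match:
            return int(match.group(1))
    return None


print(htg_phase(["HTG", "HTGS_PHASE1"]))
```

Because the accession number stays the same while an entry moves between phases, a caller would key any stored phase on the accession number and simply overwrite it when a newer copy of the entry arrives.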
A new division for Genome Survey Sequences (GSS) has been added to the database. This division is of similar nature to the EST division, except that its sequences are genomic rather than cDNA (mRNA). The GSS division will contain (but not be limited to) the following types of data: random `single pass read' genome survey sequences, single pass reads from cosmid/BAC/YAC ends, exon trapped genomic sequences and Alu PCR sequences.
Until recently, accession numbers used by the nucleotide sequence databases consisted of one prefix letter followed by 5 digits (1+5 format). The volume of data from various projects has accelerated the need to extend the accession number space. A new accession number format (2+6 format) has been created and is already in use by the collaborative databases: 2 (prefix letters) + 6 (digits), e.g., AC123456.
Note: existing 6-character accession numbers (1+5 format e.g., X12345) will remain as they are, and will never be transformed to an 8-character form.
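Software that handles accession numbers needs to accept both formats side by side. A minimal Python sketch, assuming prefix letters are upper-case (the function name is hypothetical):

```python
import re

OLD_FORMAT = re.compile(r"[A-Z][0-9]{5}")      # 1+5 format, e.g. X12345
NEW_FORMAT = re.compile(r"[A-Z]{2}[0-9]{6}")   # 2+6 format, e.g. AC123456


def accession_format(accession):
    """Return '1+5', '2+6', or None if the string matches neither format."""
    if OLD_FORMAT.fullmatch(accession):
        return "1+5"
    if NEW_FORMAT.fullmatch(accession):
        return "2+6"
    return None


for example in ("X12345", "AC123456", "X1234"):
    print(example, accession_format(example))
```

Since existing 1+5 numbers are never converted, the two patterns are checked independently rather than normalising one form into the other.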
Since Release 49 (Dec 1996), entries released from EMBL have incorporated the unified taxonomy. This sequence-based taxonomy was created and is maintained by the NCBI with assistance from external advisors and in collaboration with EMBL and DDBJ. The aim of project `Taxon' is to centralise the classification of all organisms appearing in the Nucleotide Sequence Database. Unclassified organisms are added to the taxonomy tree by staff at the NCBI with the help of a team of experts in each field. The taxonomy tree reflects current phylogenetic knowledge and where possible is based on published authorities. The advantage of the unified taxonomy is that users obtaining a specific entry from any one of the collaborating databases will retrieve the same classification of the sequenced organism (as seen in the OS and OC lines in the flatfile). The NCBI taxonomy browser can be accessed at the following URL: http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html
A new line type NI to contain an identifier for each nucleic acid sequence has been introduced. While the sequence remains the same, so will the value of this identifier. When a sequence change occurs, however minor, a new NI value will be assigned whilst the accession number on the AC line may remain unchanged.
A cross-reference from a CDS feature to the database `PID' (protein identifier) contains an identifier for the translated product of that coding sequence. This identifier will remain the same, despite changes to the sequence, as long as the translation remains the same. It can therefore be used by external databases (such as SWISS-PROT) as an identifier onto which cross-references can be built.
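The behaviour of the NI and PID identifiers can be modelled with a small Python sketch: an identifier is kept while the tracked content is unchanged and replaced as soon as it differs. The class, the identifier formats (`n1`, `p1`) and the example sequences are hypothetical and say nothing about how the database actually generates these values.

```python
import itertools


class StableIdentifiers:
    """Hand out a new identifier only when the tracked content changes."""

    def __init__(self, prefix):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._current = {}  # accession -> (content, identifier)

    def identifier_for(self, accession, content):
        previous = self._current.get(accession)
        if previous is not None and previous[0] == content:
            return previous[1]  # content unchanged: keep the identifier
        identifier = f"{self._prefix}{next(self._counter)}"
        self._current[accession] = (content, identifier)
        return identifier


ni = StableIdentifiers("n")   # tracks the nucleotide sequence
pid = StableIdentifiers("p")  # tracks the translated product

# A silent change (GCT -> GCC, both alanine) alters the sequence
# but leaves the translation "MA" untouched.
ni_before = ni.identifier_for("X12345", "ATGGCTTAA")
pid_before = pid.identifier_for("X12345", "MA")
ni_after = ni.identifier_for("X12345", "ATGGCCTAA")
pid_after = pid.identifier_for("X12345", "MA")

print(ni_before != ni_after, pid_before == pid_after)
```

In this example the accession number X12345 never changes, the nucleotide identifier is replaced, and the protein identifier survives, which is what makes it a usable anchor for cross-references from protein databases.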
Over recent months a database redesign has been completed. Migration to an improved database schema and to the Unix operating system has resulted in improved performance, enabling the database to cope with the growing volume of data, especially from sequencing projects.
Constructed sequences are a way of representing long sequences, such as complete genomes. The new CON division will contain entries that include information about how each constructed sequence is built from its components. The new schema fully supports sequences of this type.
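The idea of a constructed sequence can be sketched in Python as an ordered list of component spans that are stitched together on demand. The representation of a component as (accession, start, end), the 1-based inclusive coordinates and the accession numbers used here are assumptions made for illustration; the text does not specify how CON entries encode this information.

```python
from typing import Dict, List, Tuple

# A component is (accession, start, end) with 1-based, inclusive coordinates.
Component = Tuple[str, int, int]


def build_constructed_sequence(components: List[Component],
                               sequences: Dict[str, str]) -> str:
    """Concatenate the listed spans of the component sequences in order."""
    parts = []
    for accession, start, end in components:
        source = sequences[accession]
        if not 1 <= start <= end <= len(source):
            raise ValueError(f"span {start}..{end} is outside {accession}")
        parts.append(source[start - 1:end])
    return "".join(parts)


# Hypothetical component entries.
sequences = {"AB000001": "AAAACCCC", "AB000002": "GGGGTTTT"}
layout = [("AB000001", 1, 4), ("AB000002", 5, 8)]
print(build_constructed_sequence(layout, sequences))
```

Storing only the layout keeps the constructed entry small and means that a correction to a component is picked up the next time the long sequence is assembled.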
Annotations to database entries not submitted by the original sequencing group are referred to as third party annotations. Although guidelines still need to be set by the international collaboration, the new schema has been designed to support annotations of this type.
Graphical annotation tools in SQL*Forms 4.5 have been created, enabling further automation and enhanced accuracy checks during the annotation process.
Relational management of SWISS-PROT and TREMBL will offer improved version tracking, update mechanisms and synchronization between EMBL nucleotide and protein databases.
The preferred form for citation of the EMBL Nucleotide Sequence Database is: Stoesser G., Moseley M.A., Sleep J., McGowran M., Garcia-Pastor M. and Sterk P. (1998) The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 26, 8-15.
DATA ACCESS
The EBI is dedicated to developing network services which take full advantage of the rapid progress in computer network technologies. The EBI databases and software archives are accessible via the World Wide Web, electronic mail fileserver and FTP providing the most advanced network access to a broad range of molecular biology information resources. New and updated entries from the sequence databases are added daily to the network servers, making it possible to retrieve entries and perform sequence similarity searches on the very latest nucleotide data using FastA, Blast, Blitz and Bioccelerator.
Table 1. Sites maintaining daily updated copies of the EMBL Nucleotide Sequence Database distributed by the EBI
EBI network fileserver. The EBI network fileserver enables access via Email to the full collection of databases, public domain software, and documentation maintained by EBI. Items are retrieved from the server by sending a command in an Email message to the fileserver address. Detailed instructions on using the fileserver and a current list of contents can be obtained by sending a message to the Internet address netserv@ebi.ac.uk with the word HELP in the body of the message. A full set of instructions will be returned automatically.
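A help request of this kind can be composed with Python's standard `email` package. The sketch below only builds the message; the sender address is a placeholder, and sending it would additionally require an SMTP server, which is not shown.

```python
from email.message import EmailMessage


def build_help_request(sender):
    """Build an Email to the EBI fileserver with HELP as the message body."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = "netserv@ebi.ac.uk"
    message.set_content("HELP")
    return message


request = build_help_request("user@example.org")  # placeholder sender
print(request)
```

The command goes in the body rather than the subject line, matching the instructions above.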
EBI FTP server. The EBI anonymous file transfer protocol (FTP) server enables navigation through the directories of the EBI molecular biology archives and retrieval of files. Most directories have `README' files to help with orientation. Users should connect to the anonymous FTP server at the address ftp.ebi.ac.uk using the username `anonymous', and their Email address as the password.
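The anonymous login described above maps directly onto Python's standard `ftplib` module. The following is a sketch rather than a supported client: the default directory is an assumption based on the path quoted earlier for the update form, and the `ftp_factory` parameter exists only so the function can be exercised without a network connection.

```python
from ftplib import FTP


def list_ebi_directory(email, path="pub/databases/embl", ftp_factory=FTP):
    """Log in anonymously to ftp.ebi.ac.uk and list the files in a directory."""
    ftp = ftp_factory("ftp.ebi.ac.uk")
    try:
        # Username 'anonymous', with the user's Email address as password.
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(path)
        return ftp.nlst()
    finally:
        ftp.quit()
```

A call such as `list_ebi_directory("user@example.org")` would return the directory listing, in which the `README` files mentioned above are a natural starting point.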
A collection of >50 additional specialist molecular biology databases is also available. These databases are produced by other groups in Europe and worldwide, many in collaboration with the European Bioinformatics Institute. One example is the ImmunoGenetics database IMGT (8), a database containing nucleotide sequence information of genes important in the function of the immune system.
Complementing these extensive data resources is a collection of freely available molecular biology software for Windows, Macintosh, VMS and UNIX, accessible from the EBI WWW pages and network servers (ftp://ftp.ebi.ac.uk/pub/software). Also available is the Bio-Catalog, a list of software of general interest in molecular biology and genetics. First developed at CEPH/Genethon, it is now maintained and distributed by the EBI. Additionally, documents such as subscription and submission forms, and the DDBJ/EMBL/GenBank Feature Table Definition, can also be retrieved.
CD-ROM releases are distributed quarterly. The main contents are the nucleotide and protein sequence databases. Software for data query and retrieval is also provided. The program EMBL-Search for Macintosh and Windows allows data access by entry name, accession number, keyword, citation, author name, taxonomic classification, database cross-reference, free text and date. EMBL-Search also accesses the Prosite and Enzyme databases, and enables navigation between related entries via the cross-references built into the databases. It uses binary indices whose structure is documented and therefore available for other software systems. The sequence databases are also provided on a separate CD-ROM in FastA format for use with software such as FastA on Macintosh and PC systems.
The EBI provides a number of services that allow external users to compare their own sequences against the most current data in the EMBL Nucleotide Sequence Database and SWISS-PROT. BLITZ is based on the MPsrch program of Collins and Sturrock (Edinburgh University), which uses the well-known Smith and Waterman (9) algorithm for sensitive searches of the protein sequence databases. It is implemented on a MasPar (massively parallel) computer at the EBI. Mail-FastA is based on Pearson and Lipman's FastA program (10). It performs sensitive comparisons of nucleotide or amino acid sequences against the database. Blast performs a fast heuristic search for regions of local similarity against the database, rather than a full Smith and Waterman alignment, and uses Karlin and Altschul `sum statistics' (11) to evaluate the significance of multiple regions of similarity.
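For readers unfamiliar with the Smith and Waterman algorithm, the following Python sketch computes the best local alignment score between two sequences by dynamic programming. It is a textbook version with a linear gap penalty and illustrative scoring values; it is not the MPsrch implementation, whose scoring scheme and parallelisation are not described here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of the table is filled, so the cost grows with the product of the two sequence lengths; this is the reason such searches benefit from massively parallel hardware, and why faster heuristic programs are offered alongside them.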
These search tools are available either interactively through the URLs listed below or through Email:
The EBI provides a query/retrieval system using the Sequence Retrieval System SRS (12). Specific query forms are accessible at the URL: http://srs.ebi.ac.uk:5000/
The European Molecular Biology Network (http://www.embnet.org) was initiated in 1988 to link European laboratories using biocomputing and bioinformatics in molecular biology research as well as to increase the availability and usefulness of the molecular biology databases within Europe. Remote copies of the nucleotide and protein sequence databases, updated daily, as well as other molecular biology resources, are held at nationally mandated nodes. As bioinformatics grows, EMBnet plays an important role in support, training, research and development for the European bioinformatics research community. Table 1 gives a full listing of sites maintaining daily updated copies of the EMBL Database.