ABSTRACT
The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA
sequences submitted directly by researchers and genome sequencing groups and
collected from the scientific literature and patent applications. In
collaboration with DDBJ and GenBank, the database is produced, maintained and
distributed at the European Bioinformatics Institute (EBI) and constitutes
Europe's primary nucleotide sequence resource. Database releases are produced
quarterly and are distributed on CD-ROM. EBI's network services allow access to
the most up-to-date data collection via the Internet and a World Wide Web
interface, providing database searching and sequence similarity facilities plus
access to a large number of additional databases.
The EMBL Nucleotide Sequence Database is a central activity of the European
Bioinformatics Institute (EBI), an EMBL outstation located at the Wellcome
Trust Genome Campus at Hinxton, near Cambridge, UK. Database services provided
by the EBI (1) are a continuation and extension of the former EMBL Data Library
(2) in Heidelberg, Germany. In addition to producing the nucleotide sequence
database, the EBI maintains and distributes the SWISS-PROT Protein Sequence
Database (3) in collaboration with Amos Bairoch of the University of Geneva,
TREMBL (a SWISS-PROT supplement consisting of translations from EMBL database
coding sequences; ftp://ebi.ac.uk/pub/databases/trembl/), the Radiation Hybrid
Database (Rhdb) and many other specialist molecular biology databases
(ftp://ftp.ebi.ac.uk/pub/databases/), many of which are produced in
collaboration with the EBI.
The EMBL Data Library was established in 1980 to collect, organize and
distribute a database of nucleotide sequence data and related information.
Since 1982 this work has been done in collaboration with GenBank (4) (NCBI,
Bethesda, USA) and the DNA Data Bank of Japan (DDBJ, Mishima). Each of the
three international collaborating databases, DDBJ, EMBL and GenBank, collects a
portion of the total sequence data reported world-wide. Procedures are in place
to ensure that all new and updated database entries are exchanged between the
collaborating databases on a daily basis (Fig. 1). The explosive growth of the
database continues: sequencing technology can now read DNA almost like a bar
code (Fig. 2). EMBL Release 48 (September 1996) contains 931 582 sequence
entries comprising 609 302 252 nucleotides, and it is expected that by the end
of 1996 the database will include over 1 000 000 entries. The recent increase
in sequence data is mainly a consequence of large sequencing projects such as
the Human Genome Project. These projects yield an enormous amount of DNA
sequence data, now available in public databanks.
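As a rough illustration of what the Release 48 figures imply, the short Python sketch below derives the mean entry length from the two totals quoted above. The function name is introduced here purely for illustration; only the two counts come from the text.

```python
# Totals quoted in the text for EMBL Release 48 (September 1996).
ENTRIES = 931_582
NUCLEOTIDES = 609_302_252


def mean_entry_length(nucleotides: int, entries: int) -> float:
    """Return the average number of nucleotides per database entry."""
    return nucleotides / entries


if __name__ == "__main__":
    # Prints the mean entry length rounded to the nearest nucleotide.
    print(f"{mean_entry_length(NUCLEOTIDES, ENTRIES):.0f} nucleotides per entry")
```

The result, roughly 654 nucleotides per entry, is an average over all entries and says nothing about how entry lengths are distributed.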
The EBI provides a number of different mechanisms for the direct submission of
data (Table 0). Direct submission of sequence data to the nucleotide sequence
databases is
the primary means of data acquisition, and the most reliable means of ensuring
that entries accurately and completely reflect the underlying data. Due to the
now standard practice among researchers of submitting their data directly to
one of the collaborating databases, there has been an unprecedented reduction
in the delay between determination of a sequence and the appearance of that
sequence in the database compared with earlier years.
Sequences submitted to the database can be released either immediately after
processing or upon publication, as specified by the submitter. In general,
unless otherwise directed by the author, submitted sequences are available to
the research community several months before they appear in a journal
publication.
Many individual submissions are now received through EBI's WWW data submission
tool (URL: http://www.ebi.ac.uk/subs/emblsubs.html). A series of forms guides
submitters through the submission process. Any World Wide Web browser that
supports forms (e.g. Netscape, MacWeb, Lynx or Mosaic) can be used. The
web-based tool has become the preferred submission medium.
Authorin remains a popular submission mechanism, although the number of
Authorin submissions has decreased since the WWW submission tool became
available. Authors prepare their data interactively using MS-DOS or Macintosh
computers. One of the main advantages of the Authorin program is that the
resulting submission can be automatically processed by the database curation
staff. The Authorin program can be obtained electronically from the EBI FTP
server (see Appendix 1).
This procedure is suitable for those groups submitting a large number of similar
sequences only once. Authors planning to submit in this way should contact
database staff prior to submission in order to discuss the most appropriate
mechanism.
The Direct Submission Form can also be used for nucleotide sequence submissions.
It can be obtained from the EBI network server or by contacting the EBI
directly, and a copy is also published periodically in relevant journals.
For groups producing large volumes of nucleotide sequence data over an extended
period, submission accounts can be established with the EBI. A submission
protocol is agreed upon and database entries produced at the research site can
be deposited and updated directly by the originating group via FTP or
electronic mail. A number of new genome projects and research groups have
established submission accounts in the past few years, and the procedure has
proved flexible and efficient both for the research groups and for database
staff. Each submission account is 'curated' by EBI biologists, who check that
new entries follow database annotation conventions and are consistent with
other entries from the same project. The curator also serves as an informed
liaison between the sequencing group and the database. A list of groups who
already submit data using this method, or are expected to begin doing so in the
near future, is given below.
- EST project, Genexpress, France
- EST project, Genexpress, Germany
- Human Genome Mapping Project, HGMP-RC, UK
- HIV, Amsterdam, Netherlands
- MHC, Tübingen, Germany
- Human EST, Padova, Italy
- European Drosophila Mapping Consortium, Cambridge, UK
- Fugu GSS, MRC/HGMP-RC, UK
- Human Genome Mapping Project, Sanger Centre, UK
- Caenorhabditis elegans, Sanger Centre, UK
- Mycoplasma capricolum, NCHGR, USA
- Schizosaccharomyces pombe, Sanger Centre, UK
- Mycobacterium tuberculosis, Sanger Centre, UK
- Ciona intestinalis, Sanger Centre, UK
- Brugia malayi, Sanger Centre, UK
The capture of data reported in the patent literature since 1960 has continued
under contract from the European Patent Office (EPO). The number of entries
produced through these activities has turned out to be substantially higher
than initially expected, with more than 25 000 protein and nucleotide sequences
recovered to date. It should be noted that only a portion of the patent entries
are suitable for inclusion in the EMBL Nucleotide Sequence Database; the
remaining data are made available in a separate file. Having finished including
the 'backfile' EPO sequences, the EBI and EPO have begun collaborating on new
means of ensuring that patent sequences appear in the public databases in a
timely fashion. Since September 1993, the EPO has required that protein and
nucleotide sequences appearing in patent applications be submitted to the EPO
in electronic form, which will greatly speed the incorporation of these
sequences into the database as they become publicly available. The
collaboration between the EPO and the EBI now focuses on integrating patent
data received at the EPO in electronic form since 1994 (Patent Front file).
Mandatory sequence submission requirements on the part of many journals, the
regular practice of publishing database accession numbers in papers, and the
early distribution of 'Table of Contents' listings by some of the most
important journals have greatly enhanced the effectiveness of EBI journal
scanning activities over the past years. The EBI continues to scan all major
European molecular biology journals, but the activity is directed more toward
updating bibliographic references in existing (submitted) entries than toward
capturing new sequences. There is still, unfortunately, a small percentage of
published sequence data which have not been submitted to any of the three
collaborating databases. When these sequences are identified, the authors are
contacted and asked to submit their data. The database regularly makes use of
entries produced by the NCBI journal scanning operations, both for updating
bibliographic references in existing entries, and for including the NCBI
entries in the database when no submission exists.
A set of CD-ROMs is distributed quarterly in the international ISO 9660 standard
format. The main contents are the nucleotide and protein sequence databases.
Software for data query and retrieval is also provided on the CD-ROM (12). The
program EMBL-Search for Macintosh and Windows (13) allows data access by entry
name, accession number, keyword, citation, author name, taxonomic
classification, database cross-reference, free text and date. EMBL-Search also
accesses the Prosite and Enzyme databases, and enables navigation between
related entries via the cross-references built into the databases. It uses
binary indices whose structure is documented and therefore available for other
software systems. The sequence databases are also provided on a separate CD-ROM
in FastA format for use with software such as FastA on Macintosh and PC
systems.
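For readers unfamiliar with FastA format, the following Python sketch shows how such a file might be read. It assumes the common convention of a header line beginning with ">" followed by one or more sequence lines; the function name and the two example records are hypothetical and are not taken from the CD-ROM.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FastA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; keep the header text without the ">".
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            records[header] += line
    return records


# Hypothetical example data, invented for illustration only.
EXAMPLE = """>HYPO1 hypothetical example entry
ACGTACGT
ACGT
>HYPO2 second hypothetical entry
TTGACA
"""

if __name__ == "__main__":
    for name, sequence in parse_fasta(EXAMPLE.splitlines()).items():
        print(name, len(sequence))
```

Holding every sequence in memory is adequate for a small file; a full database release would more sensibly be processed one record at a time.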
Table 1. Databases distributed by the EBI
Table 2
The EBI is dedicated to developing network services which take full advantage of the rapid progress in computer network technologies. The EBI databases and software archives are currently accessible via the World Wide Web, an electronic mail fileserver and FTP. New and updated entries from the sequence databases are added daily to the network servers, making it possible to retrieve entries and perform sequence similarity searches on the very latest nucleotide data. A collection of more than 50 additional specialist molecular biology databases (ftp://ftp.ebi.ac.uk/pub/databases/) is also available (Table 1). These databases are produced by other groups in Europe and world-wide, many in collaboration with the European Bioinformatics Institute. One example is the ImmunoGenetics database (IMGT) (8), which contains nucleotide sequence information on genes important in the function of the immune system.
Complementing these extensive data resources is a collection of freely available molecular biology software for MS-DOS, Macintosh, VMS and UNIX, accessible from the EBI WWW pages and network servers (ftp://ftp.ebi.ac.uk/pub/software). Also available is the Bio-Catalog, a list of software of general interest in molecular biology and genetics. First developed at CEPH/Genethon (10), it is now maintained and distributed by the EBI (11). Documents such as subscription and submission forms, and the DDBJ/EMBL/GenBank Feature Table Definition, can also be retrieved.
The EBI Network fileserver enables access via electronic mail (e-mail) to the
full collection of databases, public domain software and documentation
maintained by EBI. Items are retrieved from the server by sending a command in
an e-mail message to the fileserver address. Detailed instructions on using the
fileserver, and a current list of contents, can be obtained by sending a
message to the Internet address netserv@ebi.ac.uk with the word HELP in the
body of the message. A full set of instructions will be returned automatically.
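The request described above can be sketched in Python with the standard `email` package. The example only builds the message and does not send it; the function name is hypothetical, and the text specifies nothing beyond the fileserver address and the word HELP in the body.

```python
from email.message import EmailMessage

# Fileserver address as given in the text.
FILESERVER_ADDRESS = "netserv@ebi.ac.uk"


def build_help_request(sender: str) -> EmailMessage:
    """Build (but do not send) a HELP request for the e-mail fileserver."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = FILESERVER_ADDRESS
    # The text only requires the word HELP in the message body.
    msg.set_content("HELP\n")
    return msg


if __name__ == "__main__":
    # "user@example.org" is a placeholder sender address.
    print(build_help_request("user@example.org"))
```

Delivery is left out deliberately, since it depends on the sender's own mail setup.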
Table 3
The EBI anonymous file transfer protocol (FTP) server enables navigation through
the directories of the EBI molecular biology archives and retrieval of files.
Most directories have 'README' files to help with orientation. Users should
connect to the anonymous FTP server at the address ftp.ebi.ac.uk using the
username 'anonymous' and their e-mail address as the password.
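An anonymous FTP session of this kind might look as follows using Python's standard `ftplib` module. This is a sketch under stated assumptions, not a tested client for the EBI server: the helper names are hypothetical, the directory `/pub/databases` is taken from the archive URL quoted elsewhere in this article, and the server layout may have changed.

```python
from ftplib import FTP
from typing import List

# Host name as given in the text.
EBI_FTP_HOST = "ftp.ebi.ac.uk"


def connect_anonymously(email_address: str, host: str = EBI_FTP_HOST) -> FTP:
    """Open a connection and log in as 'anonymous' with an e-mail password."""
    ftp = FTP(host)
    ftp.login(user="anonymous", passwd=email_address)
    return ftp


def list_directory(ftp, path: str) -> List[str]:
    """Change to `path` on a logged-in connection and return its file names."""
    ftp.cwd(path)
    return ftp.nlst()


if __name__ == "__main__":
    # Requires network access; "user@example.org" is a placeholder address.
    connection = connect_anonymously("user@example.org")
    try:
        print(list_directory(connection, "/pub/databases"))
    finally:
        connection.quit()
```

Separating the login from the listing allows the listing helper to be exercised against a stand-in object without any network access.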
The EBI provides a number of services that allow external users to compare their
own sequences against the most current data in the EMBL Nucleotide Sequence
Database and SWISS-PROT. BLITZ is based on the MPsrch program of Collins and
Sturrock (Edinburgh University), which uses the well-known Smith and Waterman
(14) algorithm for sensitive searches of the protein sequence databases. It is
implemented on a MasPar massively parallel computer at the EBI. Detailed
instructions can be obtained by sending an e-mail message to the address
blitz@ebi.ac.uk with the word HELP in the body of the message. Mail-FastA is
based on Pearson's FastA program (15). It performs sensitive comparisons of
nucleotide or amino acid sequences against the database. Further information
can be obtained by sending an e-mail message to the address fast@ebi.ac.uk with
the word HELP in the body of the message.
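To give a feel for the Smith and Waterman approach mentioned above, the Python sketch below computes a local alignment score by dynamic programming. It is a textbook simplification and not the MPsrch implementation: it assumes a flat match/mismatch score and a linear gap penalty, whereas protein searches normally use a substitution matrix. The parameter values are arbitrary illustrative choices.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    best = 0
    # prev holds the previous row of the scoring matrix; row 0 is all zeros.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero, so an
            # alignment can start afresh at any position.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Only two rows of the matrix are kept because the sketch returns the score alone; recovering the alignment itself would require the full matrix and a traceback step.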
The EBI provides a query/retrieval system using SRS (64), the Sequence Retrieval
System (Fig. 4). Specific query forms are accessible at the URL: http://www.ebi.ac.uk/srs/srsc
The European Molecular Biology Network (EMBnet) was initiated in 1988 to link
European laboratories using biocomputing and bioinformatics in molecular
biology research, and to increase the availability and usefulness of the
molecular biology databases within Europe. Remote copies of the nucleotide and
protein sequence databases, updated daily, as well as other molecular biology
resources, are held at nationally mandated nodes. As bioinformatics grows,
EMBnet plays an important role in support, training, research and development
for the European bioinformatics research community. Table 3 gives a full
listing of sites maintaining daily updated copies of the EMBL Nucleotide
Sequence Database.
The preferred form for citation of the EMBL Nucleotide Sequence Database is:
The EMBL Nucleotide Sequence Database, Stoesser,G., Sterk,P., Tuli,M.A., Stoehr,P.J. and Cameron,G.N. Nucleic Acids Res. 1997, Vol. 25, No. 1, pages 7-13.





Network:
General enquiries: datalib@ebi.ac.uk
EBI WWW home page: http://www.ebi.ac.uk
Data submissions (E-mail): datasubs@ebi.ac.uk
Data submissions (WWW): http://www.ebi.ac.uk/subs
Corrections to nucleotide entries (E-mail): update@ebi.ac.uk
Corrections to nucleotide entries (WWW): http://www.ebi.ac.uk/ebi_docs/update.html
EBI network fileserver: netserv@ebi.ac.uk
FastA sequence search server: fast@ebi.ac.uk
MPsrch protein sequence search server: blitz@ebi.ac.uk
FTP server (anonymous): ftp.ebi.ac.uk
Software: ftp.ebi.ac.uk/pub/software
Postal address:
EMBL Outstation - the EBI,
Wellcome Trust Genome Campus,
Hinxton, Cambridge CB10 1SD, UK.
Telephone: +44 (1223) 494444
Telefax: +44 (1223) 494468
REFERENCES
