ABSTRACT
The European Bioinformatics Institute (EBI) maintains and distributes the EMBL
Nucleotide Sequence database, Europe's primary nucleotide sequence data
resource. The EBI also maintains and distributes the SWISS-PROT Protein Sequence database, in collaboration with Amos Bairoch of the
University of Geneva. Over fifty additional specialist molecular biology
databases, as well as software and documentation of interest to molecular
biologists, are available. The EBI network services include database searching
and sequence similarity searching facilities.
The European Bioinformatics Institute (EBI) is an EMBL Outstation, located at
Hinxton Hall, near Cambridge, UK. Since September 1994, all activities
previously based at the EMBL Data Library (1) in Heidelberg, Germany, have been
located at the EBI. The database services of the EBI (2) are the continuation
and extension of the EMBL Data Library. A central
activity of the EBI is the development and
distribution of the EMBL Nucleotide Sequence database. The EBI also maintains
and distributes the SWISS-PROT Protein Sequence database (3) in collaboration
with Amos Bairoch of the University of Geneva. Over fifty
additional specialist molecular biology databases, some produced in
collaboration with the EBI, are also distributed through EBI releases and
network services.
The main activity of the group is the development, maintenance and distribution
of a comprehensive database of nucleotide sequences. The EMBL nucleotide
sequence database, produced in collaboration with GenBank (4) (NCBI, Bethesda,
USA) and the DNA Data Bank of Japan (DDBJ, Mishima), is Europe's
primary nucleotide sequence data resource. Each of these three groups collects a
portion of the total sequence data reported world-wide. All new and updated database entries are exchanged between the
groups on a daily basis. The rate of growth of the database continues to
accelerate. As an example, release 44 (September 1995), with more than 360
million bases from 506 192 entries, represents an annual increase of ~2.5-fold in the number of entries and 1.7-fold in the number of bases.
Important sources of data have been secured through collaborations with genomic
sequencing projects and other groups, such as phylogenetic research groups, who
produce large quantities of new nucleotide sequence data. The ongoing
collaboration with the European Patent Office has resulted in the capture of
nucleotide and protein sequences which were published in patent documents
between 1960 and 1993 and previously not publicly available in electronic form.
The complete database is distributed in quarterly releases on compact disc (CD-ROM). The database including daily additions of all new and updated
entries is available via the EBI network services (see below) and from nodes of
the European Molecular Biology Network (EMBnet, see below).
The nucleotide sequence database entries are distributed in the EMBL flat-file format, which is supported by most sequence analysis software
packages. A typical entry contains a sequence, a brief description for
cataloging purposes, the taxonomic description of the source organism,
bibliographic information, and the feature table, containing locations of
coding regions and other biologically significant sites. The feature table
follows the unified DDBJ/EMBL/GenBank Feature Table Definition (a copy of which
can be retrieved from the EBI network server). Where appropriate, entries may
also be cross-referenced to SWISS-PROT, the Eukaryotic Promoter database (5),
TransFac (6) or FlyBase (7). The content of database entries reflects the
diversity of sequence determination methods. For instance, expressed sequence
tag (EST) entries, 'single pass' sequences derived from random clones, often
have very little biological 'annotation', compared to the typical entry reported by a researcher
who has carefully investigated a single gene. An entry produced by a genomic
sequencing group, on the other hand, may be extensively annotated, but the
features of the sequence may have been determined by similarity and thus be
more or less putative. The EBI devotes considerable resources to ensuring that
the biological information attached to nucleotide sequences is as complete as
possible. Every effort is made to maintain consistency while preserving the
(varied) richness of these data from their various sources.
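The entry layout described above can be illustrated with a minimal sketch. It assumes the usual flat-file conventions (a two-letter line code in the first two columns, data from the sixth column, sequence lines after the `SQ` line, and `//` closing the entry). The sample entry, its identifiers and the function name `parse_embl_entry` are made up for illustration and are not taken from the database or from EBI software.

```python
# Toy entry, invented for illustration only (not a real database record).
SAMPLE = """\
ID   TOY00001   standard; DNA; UNC; 12 BP.
XX
AC   X00000;
XX
DE   Toy entry for illustration only
XX
OS   Hypothetical organism
XX
FT   CDS             1..12
XX
SQ   Sequence 12 BP;
     acgtacgtac gt                                                       12
//
"""


def parse_embl_entry(text):
    """Group the lines of one flat-file entry by line code.

    Returns (fields, sequence): fields maps each two-letter line code to
    the list of its data strings; sequence is the concatenated residues.
    """
    fields = {}
    residues = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # entry terminator
            break
        if in_sequence:
            # Keep letters only: drops the spacing and the running base count.
            residues.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code = line[:2]
        if code == "SQ":                   # sequence lines follow this header
            in_sequence = True
        elif code.strip() and code != "XX":  # XX lines are spacers
            fields.setdefault(code, []).append(line[5:].strip())
    return fields, "".join(residues)


fields, sequence = parse_embl_entry(SAMPLE)
```

Here `fields["DE"]` holds the description, `fields["OS"]` the source organism and `fields["FT"]` the feature table lines, which mirrors the parts of a typical entry listed above. A production parser would also need to handle multi-line features, qualifiers and reference blocks.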
The SWISS-PROT protein sequence database is maintained collaboratively by the EMBL
Data Library and Amos Bairoch of the University of Geneva. It is distributed in
the same file format as the Nucleotide Sequence database, with which it is
fully cross-referenced. SWISS-PROT entries come from several sources: they are
translated from DNA sequences in the EMBL database, adapted from the Protein
Identification Resource collection (8) (PIR, Washington, DC), extracted from the
literature, or submitted directly by researchers. Its strengths are the quality
and consistency of its annotation, its non-redundancy, and its cross-references
to other databases, especially the EMBL nucleotide sequence database,
PROSITE (9) and PDB (10). SWISS-PROT is distributed on CD-ROM every 3 months, and new entries can be retrieved between
releases via the EBI network servers (see below).
The Radiation Hybrid database (RHdb) is a new development at the EBI (11). This
database is an archive of raw data (i.e. PCR results on radiation hybrid
panels) with links to other related databases. All cross-references known to the authors or the database maintainers are included.
Users can also query the relational database directly (on the World
Wide Web), either by using a set of pre-compiled queries or by writing their own ad hoc queries. The database is distributed in a file format similar to that of
the EMBL database, with which it is fully cross-referenced. It is distributed on CD-ROM twice a year and can also be retrieved between CD-ROM releases via the EBI network servers (see below).
The ImMunoGeneTics database (IMGT) (12) contains nucleotide sequence
information on genes important in the function
of the immune system. It collects and annotates sequences belonging to the
immunoglobulin superfamily which are involved in immune recognition. IMGT works
in close collaboration with the EMBL database and with three other laboratories
in Europe [LIGM (FR), ICRF (UK), Univ. of Köln (DE)]. It is distributed on CD-ROM twice a year and can also be retrieved between CD-ROM releases via the EBI network servers (see below).
The Bio-Catalog is a list of software of general interest in molecular biology and
genetics. First developed at CEPH/Généthon (13), it is now maintained and
distributed by the EBI (14). In addition to this database, the EBI maintains a
repository of biology-related software on its network servers. This software is
also distributed once a year on CD-ROM.
The EBI is a major distributor of molecular biological databases produced by
other groups in Europe and world-wide. More than 50 databases are available via the EBI network servers
(see Fig. 1 for the World Wide Web access to the EBI databases) and 30 of them
are included on CD-ROM (see Table 1). The EBI also mirrors dbEST, a database of
Expressed Sequence Tags developed at the NCBI (15), offering query and retrieval
access through the World Wide Web.
Today, approximately 95% of all nucleotide sequence data are directly submitted
to one of the collaborating databases (EMBL, GenBank and DDBJ). This has
reduced the delay between determination of a sequence and appearance of that
sequence in the database compared to earlier years. The entries created by each
group are exchanged on a daily basis. The remaining 5% are still extracted from
the literature (especially patent documents), which is a time-consuming and error-prone task.
The EBI provides a number of different mechanisms for the direct submission of
data (see Table 2). Direct submission of sequence data to the nucleotide
sequence databases is
the primary means of data acquisition, and the most reliable means of ensuring
that entries accurately and completely reflect the underlying data. Submitted
sequences can be released either immediately after processing or upon
publication, depending on the wishes of the submitter. In general, unless
otherwise directed by the author, submitted sequences are available to the
research community before the sequence appears in a journal. One of the direct
submission mechanisms is the Authorin program, which allows authors to
prepare their data interactively on MS-DOS or Macintosh computers. One of the main advantages of the Authorin
program is that the resulting submission can be processed automatically
by the database annotation staff. The Authorin program can be obtained on
diskettes from NCBI (GenBank/NCBI, NIH, Bldg 38A, Bethesda, MD 20894 USA;
email: authorin@ncbi.nlm.nih.gov) or electronically from the EBI network
server. The Direct Submission Form can also be used for nucleotide sequence
submissions. It can be obtained from the EBI network server or by contacting
the EBI directly, and a copy is also published periodically in relevant
journals (59, 60). This submission form can be sent to the EBI either by post or
by electronic mail. A new submission system has been developed at the EBI using the World
Wide Web (WWW). There are many benefits to submitting sequences in this way. In
particular, the EBI continually maintains and updates this system, ensuring the
requested information is up to date.
Table 2
Summary of submission mechanisms for the EMBL database
For groups producing large volumes of nucleotide sequence data over an extended
period, submission accounts can be established with the EBI. A submission
protocol is agreed upon and database entries produced at the research site can
be deposited and updated directly by the originating group via FTP. A number of
new genome projects and research groups have established submission accounts in
the past few years, and the procedure has demonstrated itself to be flexible
and efficient both for the research groups and for database staff. Each
submission account is `curated' by EBI biologists, who check to ensure that new
entries follow database annotation conventions and are consistent with other
entries from the same project. The curator also serves as an informed liaison
between the sequencing group and the database. A list of groups who already
submit data using this method or are expected to begin doing so in the near
future is given below.
- European Drosophila Mapping Consortium
- French Arabidopsis cDNA project GDR
- Genexpress Généthon (FR)
- Généthon (FR)
- Genexpress Munich (DE)
- HIV project Amsterdam (NL)
- MHC project Tuebingen
- Mycoplasma capricolum NCHGR
- Sanger Centre (UK), C.elegans nematode project
- Sanger Centre (UK) Human genome project
- Sanger Centre (UK) S.pombe
- Sanger Centre (UK) Yeast Chromosome IV
- Sanger Centre (UK) Yeast Chromosome IX
- Sanger Centre (UK) Yeast Chromosome XIII
- Sanger Centre (UK) Yeast Chromosome XVI
- UK Human Genome Mapping Project
- Radiation Hybrid Mapping Consortium
The capture of data reported in the patent literature since 1960 has continued
under contract from the European Patent Office (EPO). All the `backfile' documents have now been processed, with >25 000 protein and nucleotide sequences captured (with first priority outside
the USA and Japan). It should be noted that only a portion of the patent
entries are suitable for inclusion in the EMBL nucleotide sequence database;
the others are made available in a separate file. The EBI and EPO are
collaborating on new means of ensuring that patent sequences appear in the
public databases with less delay in the future. Since September 1993, the EPO
has required that protein and nucleotide sequences appearing in patent
applications be submitted in electronic form, which greatly facilitates the
speedy incorporation of these sequences into the database as they become
publicly available.
Mandatory sequence submission requirements on the part of many journals, the
regular practice of publishing database accession numbers in papers, as well as
early distribution of `Table of Contents' listings by some of the most
important journals, have greatly enhanced the effectiveness of the EBI journal
scanning activities over the past years. The EBI continues to scan all major
European molecular biology journals, but the activity is directed more toward
updating bibliographic references in existing (submitted) entries than toward
capturing new sequences. There is still, unfortunately, a certain small
percentage of published sequence data which has not been submitted to any of
the three collaborating databases. When these sequences are identified, the
authors are contacted and asked to submit their data. The database regularly
makes use of entries produced by the NCBI journal scanning operations, both for
updating bibliographic references in existing entries, and for including the
NCBI entries in the database when no submission exists.
The data library no longer produces magnetic tape distributions; these have been
replaced by CD-ROM, which is inexpensive and can be used with a wide range of
computer systems. The databases are distributed quarterly as a set of compact
discs written in the international ISO 9660 standard format. Since release 44,
there have been separate CD-ROM distributions for the EMBL and SWISS-PROT
databases. The collaborative databases are distributed on a separate CD-ROM
twice a year (see Table 1 for the list of databases included).
Software for data query and retrieval is also provided on the CD-ROM (61). The
programs EMBL-Search for Macintosh and SRS for DOS (62) allow data access by
entry name, accession number, keyword, citation, author
name, taxonomic classification, database cross-reference, free text, and date. EMBL-Search also provides access to the PROSITE and Enzyme databases, and
enables navigation between related entries via the cross-references built into these databases. It uses binary indices whose
structure is documented and therefore available to other software systems. The
SRS software is a DOS port, produced by the EBI, of the sequence
retrieval software used on the EBI network services. The sequence databases are
also provided in NBRF format for use with software such as FASTA on Macintosh
or MS-DOS systems.
In addition to archiving sequence and genome data, the EBI provides an ever-expanding number of free network services to external users. The EBI
databases and software archives are currently accessible via electronic mail
fileserver, FTP, gopher and World Wide Web (WWW). New and updated entries from
all three collaborating nucleotide sequence databases are added daily to the
network servers, making it possible to retrieve entries and perform sequence
similarity searches on the very latest nucleotide data. The complete collection
of additional specialist molecular biology databases is also available.
Complementing these extensive data resources is a collection of molecular
biology software for MS-DOS, Macintosh, VMS and UNIX. Documents such as subscription and
submission forms, and the DDBJ/EMBL/GenBank Features Table Definition, can also
be retrieved.
The EBI network fileserver (63) enables access via electronic mail (e-mail) to the full collection of databases, public domain software and
documentation maintained by EBI. Items are retrieved from the server by sending
a command in an e-mail message to the fileserver address. Detailed instructions on using the
fileserver, and a current list of contents, can be obtained by sending a
message to the e-mail file server address listed under 'Network' at the end of this article.
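The e-mail retrieval mechanism can be sketched as follows, using only the Python standard library to compose (not send) a request. The server address is the e-mail file server entry from the contact list at the end of this article. The command text `HELP`, the sender address and the function name `build_fileserver_request` are assumptions made for illustration; the actual command syntax is described in the fileserver's own instructions.

```python
from email.message import EmailMessage


def build_fileserver_request(command, sender, server="Netserv@ebi.ac.uk"):
    """Compose a plain-text message whose body carries one server command.

    The command string is passed through unchanged; this sketch does not
    know or validate the fileserver's real command set.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = server
    msg.set_content(command + "\n")   # the command travels in the message body
    return msg


# "HELP" is a placeholder command and the sender is a dummy address.
request = build_fileserver_request("HELP", "user@example.org")
```

The resulting message could be handed to any mail transport; sending is deliberately left out because it depends on the local mail configuration.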
Anonymous FTP is the main route for retrieving databases or software from the
EBI's archive. The EBI anonymous file transfer protocol (FTP) server enables
the anonymous user to navigate through the directory hierarchy. Most
directories have 'README' files to help with orientation. Users should connect
to the anonymous FTP server at the address listed under 'Network' at the end of this article.
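A minimal sketch of such an anonymous FTP session, using Python's standard `ftplib`, is given here. The host name in the usage comment is the anonymous FTP entry from the contact list at the end of this article, and the service as described dates from 1995, so its availability and directory layout today are not guaranteed. The function names and the default directory are hypothetical.

```python
import io
from ftplib import FTP


def fetch_text(ftp, path):
    """Download one file over an already-open FTP connection as text."""
    buffer = io.BytesIO()
    ftp.retrbinary(f"RETR {path}", buffer.write)
    return buffer.getvalue().decode("ascii", errors="replace")


def browse(host, directory="/", readme="README"):
    """Log in anonymously, list a directory and read its README if present."""
    with FTP(host) as ftp:
        ftp.login()                      # no arguments means anonymous login
        ftp.cwd(directory)
        names = ftp.nlst()               # plain list of names in the directory
        text = fetch_text(ftp, readme) if readme in names else None
    return names, text


# Usage (performs a network connection, so it is not run here):
# names, readme_text = browse("ftp.ebi.ac.uk")
```

Separating `fetch_text` from `browse` keeps the download step testable without a live server, since any object offering a compatible `retrbinary` method can stand in for the connection.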
The EBI Gopher server simplifies the use of network services by hiding
complexity behind a simple graphical user interface. The files are arranged in
a hierarchy of directories, as on the FTP server, but have more detailed
titles. In addition to access to the EBI molecular biology archives, links are
provided to other information resources in Europe and world-wide. Gopher clients can access the server at the address listed under 'Network' at the end of this article.
The EBI WWW server provides the most advanced network access to a broad range of
molecular biology information resources. In addition to the EBI molecular
biology archives, sequence similarity search and database query/retrieval
services are offered. Users can also directly submit their data using the
direct submission entry page. Connect to the EBI WWW server using the URL listed under 'Network' at the end of this article.
The EBI provides a query/retrieval system using SRS, the Sequence Retrieval
System (64). This system allows entries to be retrieved based on a number of keywords.
Specific query forms are accessible at the URL:
Sequence search facilities
The EBI provides a number of services that allow users to compare their own
sequences against the most current data in the EMBL nucleotide
sequence database and SWISS-PROT. BLITZ is based on the MPsrch program of Collins and Sturrock
(Edinburgh University), which uses the Smith and Waterman (65) algorithm for
sensitive searches of the protein and nucleotide sequence
databases. It is implemented on a MasPar, a massively parallel computer. Detailed instructions can be obtained by sending an e-mail message to the BLITZ address listed under 'Network' at the end of this article.
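For readers unfamiliar with the Smith and Waterman approach, a minimal sketch of local alignment scoring by dynamic programming is given here. It is not the MPsrch implementation. It uses a simple linear gap penalty rather than a more general gap model, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Each cell holds the best score of any local alignment ending at that
    pair of positions; scores are floored at zero so that an alignment
    can start anywhere in either sequence.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                               # start a new alignment here
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


score = smith_waterman_score("XXACGTXX", "YYACGTYY")
```

The running time grows with the product of the two sequence lengths, which is why searching a whole database this way benefits from parallel hardware of the kind mentioned above.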
The European Molecular Biology Network (EMBnet) was initiated in 1988 to link
European laboratories using biocomputing and bioinformatics in molecular
biology research as well as to increase the availability and usefulness of the
molecular biology databases within Europe. Remote copies of the nucleotide and
protein sequence databases, updated daily, as well as other molecular biology
resources, are held at nationally mandated nodes. As bioinformatics grows, the
EMBnet plays an increasingly important role in support, training, research and
development for the European bioinformatics research community. Table 3 gives a
full listing of sites maintaining daily updated copies of the EMBL
nucleotide sequence database.
Table 3
Sites maintaining daily updated copies of EMBL Nucleotide Sequence Database
(Sept 1995)
Network:
Datalib@ebi.ac.uk (for general enquiries)
Datasubs@ebi.ac.uk (for data submissions to the EMBL and SWISS-PROT databases)
Update@ebi.ac.uk (for corrections to nucleotide entries)
RHdb@ebi.ac.uk (for data submission to RHdb)
Netserv@ebi.ac.uk (e-mail file server)
NetHelp@ebi.ac.uk (for network server enquiries)
ftp.ebi.ac.uk (anonymous FTP server)
gopher.ebi.ac.uk (Gopher server)
http://www.ebi.ac.uk (World Wide Web)
blitz@ebi.ac.uk (MPsrch protein sequence search server)
fasta@ebi.ac.uk (FastA sequence search server)
Postal address:
EMBL Outstation - the EBI, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, UK.
Telephone: +44 (1223) 494400
Telefax: +44 (1223) 494468


REFERENCES
