ABSTRACT
We have compiled the DNA sequence data for Escherichia coli available from the
GenBank and EMBL data libraries and independently from the literature. We
provide the most definitive version of the ECD Escherichia coli database now
exclusively via the World Wide Web system:
http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html . Our database
comprises an assembled set of contiguous sequences. Each of these contigs
compiles all available sequence information, including that derived from a
variety of older sequences. The organisation of the database allows precise
physical location of each individual gene or regulatory region, even taking
into consideration discrepancies in nomenclature. The WWW program allows
branching into the original EMBL and SWISS-PROT data files. A number of links
to other WWW servers are provided. FASTA and BLAST searches may be performed
online. Besides the WWW format, a flat file version may be obtained via ftp.
The ftp version may also be obtained from the EMBL data library as part of the
CD-ROM issue of the EMBL sequence database, which is released and updated every
3 months. After deletion of all detected overlaps, a total of 3 588 706
individual bp had been determined up to the end of September 1996. This
corresponds to 77.09% of the entire E.coli chromosome, which consists of
~4655 kb. About 479 kb (10.3%) are additionally available from Kyoto (Japan).
Another 94 kb (2%) are available, but mapping has not been confirmed. Thus the
total may have reached 89.4%.
Within this database issue we have been able to publish a compilation of DNA
sequences of Escherichia coli in 7 consecutive years since 1989, and colleagues
from all over the world have provided additions and corrections (1-7). The
final target of providing the complete sequence of E.coli K12 may be reached
this year, mainly because at least four groups have devoted their research to
systematic sequencing of the E.coli chromosome (8-18). According to our data a
total of 3 588 706 bp had been deposited by the end of September 1996. Almost
all nucleotides are published more than once. The total corresponds to 138%.
Our database may serve to encourage our colleagues either to send us their
unpublished, mostly flanking material or to determine the sometimes small gaps
towards known neighbouring sequences to complete the sequence.
The final aim of our E.coli database (ECD) is to provide an electronic entry
into the entire knowledge of the model organism Escherichia coli K12. We use
the DNA sequence as the basis for all other information. Since there are
already a number of specialist databases on different aspects of the E.coli
cell, we prefer to provide a platform for these different data, rather than to
build an entirely new system. We allow an unchanged incorporation of data from
other databases and prefer to act only as a distributor. To make this point as
clear as possible, our World Wide Web system is called ECDC for
For previous and supporting efforts please see our previous papers in this
series and the references quoted therein (6,19). Since the acquisition of new
sequence, physical mapping and other data is so rapid, most publications become
outdated quickly. Thus electronic data collection has an advantage. However,
collecting all the acquired data in one database is difficult for individual
laboratories. Thus ECDC and the nucleotide collection therein try to provide a
service for all other databases dealing with Escherichia coli. The World Wide
Web system seems to be an ideal tool to connect different databases, which are
maintained directly at the original laboratories. Any comments and corrections
are welcome at the addresses given below.
Although we have often been asked to collect data from pathogenic strains, we
have restricted ourselves to E.coli K12. Instead, we provide a number of links
to other WWW servers, which may lead into other databases collecting these
data.
The general scope of this collection is to allow a compilation of all
uncoordinated sequence information to finally end up with a complete
Escherichia coli nucleotide sequence database, including all sequenced mutants.
The longest contig runs from min 68.9 to min 31.1 and covers 2 895 410 bp.
However, seven other sequences enlarge this number to 3 258 399 bp with small
gaps. Thus 70.0% of the E.coli genome is covered contiguously. As expected, the
number of contigs has decreased and has now reached 133. Since the group in
Japan has not officially released their data, they are available through a WWW
link only. The respective area covers 1164 kb and is located between min 6.1
and min 31.1. The area is completely sequenced, but not annotated. Of this,
648 678 bp are already included in ECD; an additional 479 322 bp (10.29%) may
be obtained directly from Kyoto (Japan) via URL http://genome4.aist-nara.ac.jp .
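The coverage figures quoted above follow from simple division by the chromosome
length. The short sketch below recomputes two of them; it assumes a chromosome
of exactly 4 655 000 bp (the text only gives ~4655 kb), and the function name
is ours, not part of ECD.

```python
# Approximate chromosome size (~4655 kb) as quoted in the text; the exact
# value of 4 655 000 bp is an assumption made for this illustration.
GENOME_BP = 4_655_000


def coverage_percent(sequenced_bp: int, genome_bp: int = GENOME_BP) -> float:
    """Return sequenced_bp as a percentage of the chromosome length."""
    return 100.0 * sequenced_bp / genome_bp


total = coverage_percent(3_588_706)       # all bp after removal of overlaps
contiguous = coverage_percent(3_258_399)  # longest contig plus seven neighbours

print(f"{total:.2f}% {contiguous:.1f}%")  # prints "77.09% 70.0%"
```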
We introduced the recently updated genetic map data (20) and used them to
locate both sequenced and unsequenced areas to approximately a tenth of a
minute. Fine sorting was by a hundredth of a minute if the sequences overlap. A
hundredth of a minute corresponds to 465.5 bp, which seems to be a sufficient
resolution. If the sequences were mapped in either of the compilations using
the Kohara map (21-29), we preferred to use their assignment, including the
respective orientation according to the original Kohara map (23-25,29). Contigs
are only assembled if either sequence extends over the respective restriction
site. Differences from EcoSeq7 (20) may be explained by this. This procedure
revealed a fairly good correspondence between genetic and physical map data. It
may be noted that at least three contigs derived from systematic sequencing
projects cover larger deletions compared with the original Kohara map (29).
These deletions may have been derived from insertion element guided
recombination (30-32).
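The relation between map minutes and base pairs used here can be sketched as
follows. The sketch assumes a 100 min circular map and a chromosome of exactly
4 655 000 bp; the function names are illustrative only. Under these
assumptions a hundredth of a minute is about 465.5 bp, and the arc from
min 68.9 across the origin to min 31.1 gives the length quoted for the longest
contig.

```python
GENOME_BP = 4_655_000  # assumed exact size; the text gives ~4655 kb
MAP_MINUTES = 100.0    # length of the circular genetic map in minutes

BP_PER_MINUTE = GENOME_BP / MAP_MINUTES  # 46 550 bp per minute


def minutes_to_bp(minutes: float) -> float:
    """Convert a distance on the genetic map (minutes) into base pairs."""
    return minutes * BP_PER_MINUTE


def arc_length_minutes(start_min: float, end_min: float) -> float:
    """Clockwise distance from start to end on the circular map.

    The modulo handles arcs that cross the origin (min 100 = min 0).
    """
    return (end_min - start_min) % MAP_MINUTES


# Longest contig: min 68.9 to min 31.1, crossing the origin.
longest = arc_length_minutes(68.9, 31.1)  # about 62.2 min
print(round(minutes_to_bp(longest)))      # prints 2895410
```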
Since a number of smaller contigs could not be localized within the chromosome
with the necessary confidence until now, we refer to these DNA sequences with
map positions greater than 100. In the most recent database we provide 72
unmapped sequences with a total of 94 039 bp. It is important to collect these
entries, since during the course of systematic sequencing it became very clear
that we are dealing with a number of different K12 substrains. A good example
is the new outer membrane-associated protease OmpP, which could not be mapped
on any of the Kohara phages (33). Using artificial map positions allows
inclusion of all additional sequences of this type into FASTA or BLAST
searches.
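A minimal sketch of this convention: entries whose map position exceeds 100 are
treated as unmapped but stay in the same collection, so they remain available
to any search over all entries. The entry names and positions below are
invented for illustration and do not come from ECD.

```python
MAP_MINUTES = 100.0


def is_mapped(position_min: float) -> bool:
    """True for a real map position; values above 100 mark unmapped entries."""
    return 0.0 <= position_min <= MAP_MINUTES


# Hypothetical (name, map position) pairs, not actual ECD entries.
entries = [
    ("contigA", 68.9),
    ("unmapped1", 100.1),
    ("contigB", 6.1),
    ("unmapped2", 100.2),
]

# Mapped entries are ordered along the chromosome; unmapped ones are kept.
mapped = sorted((e for e in entries if is_mapped(e[1])), key=lambda e: e[1])
unmapped = [e for e in entries if not is_mapped(e[1])]

print([name for name, _ in mapped])    # ['contigB', 'contigA']
print([name for name, _ in unmapped])  # ['unmapped1', 'unmapped2']
```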
The gene symbols are preferentially according to the recent genetic map (20) or
are taken from recent publications. However, since there are a number of biases
and an increasing number of alternative gene symbols, we have designed our
administration program accordingly. Each gene can be found under its historic
or systematic name, as well as under the rational name. Each unannotated open
reading frame is named according to the respective publication, but also
according to the system promoted by K.Rudd, if already present in EcoSeq7 (20)
or clearly annotated in the respective EMBL entry. Thus, although the given
entry name may sometimes differ from the EMBL or GenBank entries, an automatic
retrieval for alternative names is possible with our ECD system. For an example
see Figure 2 of our previous update (6). Searches may also be performed using
accession numbers or keywords in the near future.
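Retrieval under alternative names amounts to an index from every known name to
the entry it denotes. The following sketch shows one way such an index could be
built; it is not the ECD administration program, and all accession numbers and
gene names in it are placeholders invented for the example.

```python
from collections import defaultdict

# Hypothetical records: (ECD accession, preferred symbol, alternative names).
RECORDS = [
    ("EG0001", "abcA", ["oldA", "orf123"]),
    ("EO0002", "yzzZ", ["o245"]),
]


def build_alias_index(records):
    """Map every known name (preferred or alternative) to its accessions."""
    index = defaultdict(set)
    for accession, symbol, aliases in records:
        for name in [symbol, *aliases]:
            index[name].add(accession)
    return dict(index)  # plain dict, so unknown names raise or return None


alias_index = build_alias_index(RECORDS)

# The historic name and the preferred symbol lead to the same entry.
print(alias_index["oldA"] == alias_index["abcA"])  # True
print(alias_index.get("unknown"))                  # None
```

A set of accessions per name is used because the same symbol may have been
given to more than one sequence in the literature.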
In a number of cases open reading frames are not directly found in the original
sequences. Since they are part of both the literature (34-36) and the recent
genetic map (20), we have incorporated them into ECD. In most cases they are
annotated in SWISS-PROT, but not in the nucleotide database files. Thus it is
often not clear which nucleotide changes have to be introduced. For technical
reasons users have to work out the changes themselves. Also for technical
reasons, open reading frames derived from two individual nucleotide files have
to be assembled manually.
Figure 2 of our previous update (6) may also be understood as an example of the
respective file architecture. In principle, we use the same structure as the
EMBL data library. Each gene can be retrieved as an individual file and
possesses an individual ECD accession number. Thus our database can be used
directly for cross-references within the World Wide Web system using just this
number.
Individual files are provided not only for structural genes (ECD system number
EGxxxx) but also for specific functional sites (EFxxxx), promoters (EPxxxx),
terminator or hairpin structures (EHxxxx), tRNAs (ETxxxx), ribosomal RNAs
(ERxxxx) and unannotated open reading frames (EOxxxx). System numbers of the
last type are to be replaced gradually by an EGxxxx number as soon as the open
reading frames are assigned a known function. Together with a short description
line and a line on metabolic function (if known), the keywords derived from
different databases are included. A list of cross-references is given in the
style of the EMBL data library. The feature table (FT) contains all information
collected from various databases as well as the calculated map position. Thus
references to the 2D-protein gel index (37), to the list of EC-numbers or
metabolic pathway index (38-40), to the New Haven E.coli Genetic Stock Center
(41) or to the Brookhaven database may be found in the features section. The
respective links are provided as directly as possible with hypertext and allow
a direct entry into the respective databases. The given nucleotide sequence is
the most current sequence, excluding any regulatory or flanking sequences. The
feature table gives a detailed description of the source of this sequence.
Corrections introduced, if necessary, are described individually. Compiling the
sequence data in this way allows monitoring of every correction performed by
the respective author automatically.
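The prefix of an ECD system number thus identifies the type of entry. A small
sketch of how a client program might classify such numbers is given below. The
prefix table is taken from the paragraph above; the assumption that "xxxx"
stands for a run of digits, and the function name, are ours.

```python
import re

# Entry types by prefix, as listed in the text.
PREFIX_MEANING = {
    "EG": "structural gene",
    "EF": "functional site",
    "EP": "promoter",
    "EH": "terminator or hairpin structure",
    "ET": "tRNA",
    "ER": "ribosomal RNA",
    "EO": "unannotated open reading frame",
}

# Assumption: the "xxxx" part consists of digits only; its length is not checked.
_ACCESSION = re.compile(r"^(E[GFPHTRO])(\d+)$")


def classify_accession(accession: str) -> str:
    """Return the entry type encoded in an ECD system number."""
    match = _ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"not an ECD system number: {accession!r}")
    return PREFIX_MEANING[match.group(1)]


print(classify_accession("EG0001"))  # structural gene
print(classify_accession("EO0123"))  # unannotated open reading frame
```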
ECD provides a major advantage in connecting all E.coli EMBL entries to contigs
of maximum extent, and breaking them down into individual files for proteins,
insertion elements, catalytic or transfer RNAs. Thus both handling and homology
searches are quick and easy. A search for promoter, terminator or other
regulatory structures is possible, as long as these features are described in
the respective data files. Future issues of ECD may contain additional
information, e.g. keywords, added manually by us. Please always refer to the
most recent electronic release.
Each contig is compared with the PRO and UNC files of the EMBL database in
order to look for as yet undetected overlapping sequences. We are able to
calculate the exact position of each individual EMBL file within our contigs.
This allows a highly detailed map of multiple sequence entries. Figure 3 of the
previous update (6) gives an example of such a contig, which is derived from
nine EMBL files and contains 17.2% sequences determined twice or more. Data
collection is, however, from five EMBL files only.
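Once the position of every EMBL file within a contig is known, the share of the
contig determined twice or more can be obtained by a sweep over the interval
boundaries. The sketch below illustrates the idea with invented coordinates; it
is not the program used for ECD, and the half-open (start, end) convention is
an assumption of the example.

```python
def coverage_depth_summary(intervals):
    """Return (covered_bp, multiply_covered_bp) for half-open intervals.

    covered_bp counts positions covered by at least one entry,
    multiply_covered_bp those covered by two or more entries.
    """
    events = []
    for start, end in intervals:
        events.append((start, +1))  # an entry begins
        events.append((end, -1))    # an entry ends
    events.sort()

    covered = multiple = depth = 0
    previous = None
    for position, delta in events:
        if previous is not None and depth > 0:
            covered += position - previous
            if depth >= 2:
                multiple += position - previous
        depth += delta
        previous = position
    return covered, multiple


# Hypothetical positions of three overlapping entries within one contig.
entries = [(0, 1000), (800, 2000), (1900, 2500)]
covered, multiple = coverage_depth_summary(entries)
print(covered, multiple, f"{100 * multiple / covered:.1f}%")  # 2500 300 12.0%
```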
The full set of information is provided in electronic form, which also includes
some structural information and other functional data, restriction map data,
corrections or sequenced mutations. In addition to the individual data files,
we are able to provide a genetic map both in electronic form as part of the
application program and in printed form (19). An example of the interactively
usable genetic map is given in Figure 1. Special symbols are used to illustrate
the orientation of individual genes and the presence of promoter and terminator
sequences. Gene symbols and EMBL accession numbers are provided as hypertext.
Gene symbols provide the individual feature table together with the nucleotide
sequence of the gene. Hypertext links within the feature table of the ECD entry
as well as within the EMBL file allow a convenient connection to other
databases, e.g. to MEDLINE abstracts.
The most convenient way to use the ECD Escherichia coli database is via the
ECDC database collection on the World Wide Web (WWW) system. Apart from a
simple and fairly approximate statistical analysis of user identification (see
below), we do not read or collect any data submitted for a search within ECD.
Users of our ECD database or our ECDC database collection should use the URL:
http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html . Users are kindly
asked to cite this paper within scientific publications and/or grant
applications.
This compilation is available in its full form quarterly as a set of flat files
from the EMBL data library (42) and is automatically distributed with each
release of the EMBL data library. In addition, this compilation is available on
the CD-ROM version of the EMBL data library. We have discontinued distribution
on disk or CD-ROM directly from Gießen.
Computer programs freely available within the WWW system allow a fairly
detailed statistical analysis of user identification and frequency. From June
to September 1996 a total of 85 973 different requests were made. This
corresponds to 717 requests per day. An average of 30 different individuals use
ECDC each day. Users of ECDC reside in >45 different countries. For more
information use either of these email addresses:
kroeger{at}embl-heidelberg.de or wahl{at}fmi.ch .
We thank Kenn Rudd (Bethesda) for his unpublished listing of EcoSeq7, and the
staff at EBI (Cambridge) for the constant flow of recent database additions.
This work has been supported by the Deutsche Forschungsgemeinschaft
(Kr 591/7-1).
REFERENCES
