ABSTRACT
We have compiled the DNA sequence data for
Escherichia coli
available from the GenBank and EMBL data libraries and independently from the
literature. Unlike the previous updates of our
E.coli
databases, we provide the most recent version preferentially via the World Wide
Web System (use URL: http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html). Our database includes an
assembled set of contiguous sequences. Each of these contigs compiles all
available sequence information, including that derived from a variety of older
sequences. The organisation of the database allows one to find the exact
physical location of each individual gene or regulatory region, even where
nomenclature is inconsistent. The WWW program provides access to the original
EMBL and SWISSPROT datafiles. A FASTA and BLAST search may be performed online.
Besides the WWW format, a flat file version may be obtained via ftp. The
complete compilation, including a full set of genetic map data and the
E.coli
protein index, can be obtained in machine readable form from the EMBL data
library as a part of the CD-ROM issue of the EMBL sequence database, released and updated every three
months. After deletion of all detected overlaps, a total of 3 333 878 individual
bp had been determined by the end of September 1995. This corresponds to a total of
70.63% of the entire
E.coli
chromosome consisting of about 4720 kbp. About 94 kbp (2%) are available
additionally, but have not yet been definitely mapped.
In this database issue we have been able to publish a compilation of DNA
sequences of
Escherichia coli
in six consecutive years since 1989, and we have asked our colleagues from all over the
world for additions and corrections (
1
-
6
). The final goal of providing the complete sequence of
E.coli
K12 may be reached by 1997, mainly because at least two groups have devoted
their research to systematic sequencing of the
E.coli
chromosome (
7
-
16
). According to our data a total of 3 333 878 bp had been sequenced by September
1995. Almost one half of these nucleotides have been published more than once. We
hope that our database will encourage our colleagues either to send us their
unpublished, mostly flanking, material or to sequence the sometimes very small
gaps to known neighbouring sequences, so that the complete sequence can finally
be obtained.
The final aim of our
E.coli
database (ECD) is to provide an electronic entry point to all knowledge
about the model organism
Escherichia coli
K12. We use the DNA sequence as the backbone for all other information.
Since there are already a number of specialized databases on different aspects of
the
E.coli
cell, we prefer merely to provide a platform for these different data rather
than to build an entirely new system. We incorporate data from other databases
unchanged and prefer to act only as a distributor. To
make this point as clear as possible, we call our World Wide Web system ECDC
for
E.coli
database collection (
17
).
For previous and supporting efforts please see our earlier papers in this series
and the references quoted therein (
17
,
14
). Since new sequence, physical mapping and other data are acquired so
rapidly, most of the respective publications become outdated very quickly. Thus
electronic data collection seems to be the only alternative. However,
collecting all the acquired data in one database is very difficult for
individual laboratories. ECDC and the nucleotide collection therein therefore try
to provide a shell for all other databases dealing with
Escherichia coli
. The World Wide Web system seems to be an ideal tool to connect different
databases, which are maintained directly at the original laboratories. Comments
and corrections are very welcome at the addresses given below.
The general aim of this collection is to compile all
uncoordinated sequence information so as to arrive finally at a complete
Escherichia coli
nucleotide sequence database, including all sequenced mutants. The longest
contig runs from min 68.9 to min 4.0 and covers 1 633 962 bp. Seven
other sequences, separated from it only by very small gaps, extend this to 1 996 951 bp. Thus 42.3% of the
E.coli
genome is covered almost contiguously. This is slightly more than the length of the
recently published complete genome sequence of
Haemophilus influenzae
Rd (
18
).
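As a worked check of the coverage figures quoted above, the following minimal Python sketch recomputes the percentages from the contig lengths. It assumes the chromosome size of about 4720 kbp stated in the abstract; the constant and function names are ours and purely illustrative.

```python
# Assumed chromosome size: "about 4720 kbp", as stated in the abstract.
CHROMOSOME_BP = 4_720_000


def coverage_percent(sequenced_bp, chromosome_bp=CHROMOSOME_BP):
    """Return the share of the chromosome covered, in percent."""
    return 100.0 * sequenced_bp / chromosome_bp


# Longest single contig (min 68.9 to min 4.0).
longest_contig = coverage_percent(1_633_962)

# Longest contig plus the seven neighbouring sequences.
extended_region = coverage_percent(1_996_951)

print(f"longest contig:  {longest_contig:.1f}%")
print(f"extended region: {extended_region:.1f}%")
```

The second value reproduces the 42.3% given in the text.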
We incorporated B. Bachmann's genetic map data (
19
) in full and used them to locate both sequenced and unsequenced regions to roughly a tenth
of a minute. Fine positioning was to a hundredth of a minute where the sequences
overlap. A hundredth of a minute corresponds to 472 bp, which seems to be a
sufficient resolution. If the sequences were mapped in any of the
compilations using the Kohara map (
20
-
28
), we preferred to use their assignment, including the respective orientation,
according to the original Kohara map (
2
-
24
,
28
). Contigs are assembled only if either sequence extends across the respective
restriction site. Differences from EcoSeq6 (
20
) may be explained by this. This procedure revealed fairly good correspondence
between genetic and physical map data. It should be noted that at least three
contigs derived from systematic sequencing projects span larger deletions
relative to the original Kohara map (
28
). These deletions may have arisen from insertion element-mediated
recombination (
29
-
31
). This may also explain the larger differences of up to 3 min in
the region between min 40 and 80. Data given in Table
1
may be used to recalculate the genetic map position for genes not yet
sequenced.
Table 1
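The recalculation mentioned above can be sketched in Python as follows. The anchor values are those of Table 1; linear interpolation between neighbouring anchors is our assumption and not a procedure stated in the text, and all function names are illustrative. The conversion factor of 47 200 bp per minute follows from the statement that a hundredth of a minute corresponds to 472 bp.

```python
from bisect import bisect_right

# Anchor points from Table 1 (both scales in minutes).
BACHMANN = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0,
            55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0, 100.0]
PHYSICAL = [0.0, 4.90, 10.00, 14.55, 20.55, 25.33, 30.30, 35.65, 40.90,
            46.20, 52.35, 57.75, 62.90, 67.90, 72.70, 76.35, 80.00, 85.30,
            90.10, 94.90, 100.00]

BP_PER_MIN = 47_200  # 0.01 min corresponds to 472 bp


def _interpolate(x, xs, ys):
    """Piecewise-linear interpolation (assumed, not prescribed by the text)."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("map position must lie between 0 and 100 min")
    # Index of the right-hand anchor of the interval containing x.
    i = min(bisect_right(xs, x), len(xs) - 1)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def bachmann_to_physical(minute):
    """Convert a Bachmann genetic map position to the physical map scale."""
    return _interpolate(minute, BACHMANN, PHYSICAL)


def physical_to_bachmann(minute):
    """Convert a physical map position back to the Bachmann scale."""
    return _interpolate(minute, PHYSICAL, BACHMANN)


def minutes_to_bp(minutes):
    """Convert a map distance in minutes to base pairs."""
    return minutes * BP_PER_MIN


print(bachmann_to_physical(57.5))
print(round(minutes_to_bp(0.01)))
```

Because the physical map values in Table 1 increase monotonically, the same interpolation can be used in both directions.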
Since a number of smaller contigs could not yet be localized within the chromosome
with the necessary confidence, we assign these DNA sequences artificial
map positions greater than 100. The most recent database contains 72
unmapped sequences with a total of 94 039 bp. It is important to
collect these entries, since in the course of systematic sequencing it
became very clear that we are dealing with a number of different K12 substrains. A
very good example is the new outer membrane-associated protease OmpP, which could not be mapped on any of the Kohara
phages (
32
). Using artificial map positions allows all additional sequences of
this type to be included in FASTA or BLAST searches.
Gene symbols preferentially follow the Bachmann list (
19
) or are taken from a recent publication. However, since there are a number of
biases and an increasing number of alternative gene symbols, we have changed
our administration program accordingly. Each gene can be found under its
historic or systematic name, as well as under the rational name. Each
unannotated open reading frame is named according to the respective
publication, but also according to the system advocated by K. Rudd, if already
present in EcoSeq6 (
20
), or clearly annotated in the respective EMBL entry. Thus, although the given
entry name may sometimes differ from that in the EMBL or GenBank entries, automatic
retrieval by alternative names is possible with our ECD system. For an example
see Figure 2 of our previous update (
6
). In the near future, searches may also be performed using accession numbers or
keywords.
Figure 2 of our previous update (
6
) also serves as an example of the respective file architecture. In
principle, we use the same structure as the EMBL data library. Each gene can be
retrieved as an individual file and possesses an individual ECD accession
number. Thus our database can be used directly for cross references using just
this number, e.g. in the World Wide Web system.
Individual files are provided not only for structural genes (ECD system number
EGxxxx) but also for specific functional sites (EFxxxx), promoters (EPxxxx),
terminator or hairpin structures (EHxxxx), tRNAs (ETxxxx), ribosomal RNAs
(ERxxxx) and unannotated open reading frames (EOxxxx). System numbers of the last
type are to be replaced gradually by an EGxxxx number, as soon as
open reading frames are assigned to a known function. Together with a short
description line and a line on metabolic function (if known), the keywords
derived from different databases are included. A list of cross references is
given in the style of the EMBL data library. The feature table (FT) contains all
information collected from various databases as well as the calculated map
position. Thus references to the 2D-protein gel index (
33
), to the list of EC-numbers or metabolic pathway index (
34
), to the New Haven
E.coli
Genetic Stock Center (
35
) or to the Brookhaven database may be found in the features section. The
respective links are provided, or will be provided, as hypertext and allow
direct entry into the respective databases. The given nucleotide sequence is
the most current sequence, excluding any regulatory or flanking sequences. The
feature table gives a detailed description of the source of this sequence. Any corrections
introduced are described individually. Compiling the sequence
data in this way allows every correction made by the
respective author to be monitored automatically.
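The system-number prefixes listed above lend themselves to a simple lookup. The following Python sketch classifies an ECD system number by its prefix. It assumes that "xxxx" stands for exactly four digits, which the text implies but does not state; the function name and the example number are hypothetical.

```python
import re

# Prefixes and entry types as listed in the text.
ECD_ENTRY_TYPES = {
    "EG": "structural gene",
    "EF": "specific functional site",
    "EP": "promoter",
    "EH": "terminator or hairpin structure",
    "ET": "tRNA",
    "ER": "ribosomal RNA",
    "EO": "unannotated open reading frame",
}

# Assumption: "xxxx" denotes exactly four digits.
_ECD_NUMBER = re.compile(r"^(E[GFPHTRO])(\d{4})$")


def classify_ecd_number(system_number):
    """Return the entry type for an ECD system number, or None if malformed."""
    match = _ECD_NUMBER.match(system_number.strip().upper())
    if match is None:
        return None
    return ECD_ENTRY_TYPES[match.group(1)]


# "EG1234" is a made-up number used for illustration only.
print(classify_ecd_number("EG1234"))
print(classify_ecd_number("XX0001"))
```

Since EOxxxx numbers are replaced by EGxxxx numbers once a function is assigned, a cross-reference table built on such numbers would need to track these reassignments.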
ECD provides a major advantage in connecting all
E.coli
EMBL entries into contigs of maximum extent, and breaking them down again into
individual files for proteins, insertion elements, catalytic or transfer RNAs.
Thus both handling and homology searches are quick and easy. A search for
promoter, terminator or other regulatory structures is possible, as long as
these features are described in the respective data files. Future issues of ECD
may contain additional information, e.g. keywords, manually added by us. Please
always refer to the most recent electronic release.
Each contig is compared with the PRO and UNC files of the EMBL database in order
to look for as yet undetected overlapping sequences. We are able to calculate the
exact position of each individual EMBL file within our contigs. This yields a
highly detailed map of multiple sequence entries. Figure 3 of the previous
update (
6
) gives an example of such a contig, which is derived from nine EMBL files and
contains 17.2% sequence determined twice or more. The data are, however, taken
from only five EMBL files.
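Once the position of each EMBL file within a contig is known, the share of the contig that was determined twice or more can be computed with a simple sweep over the entry boundaries. The Python sketch below is an illustration under our own assumptions (1-based, inclusive coordinates; invented example entries) and is not the program used for ECD.

```python
def multiply_covered_fraction(entries, contig_length, min_depth=2):
    """Fraction of contig positions covered by at least `min_depth` entries.

    `entries` is a list of (start, end) positions within the contig,
    1-based and inclusive (an assumed convention).
    """
    events = []
    for start, end in entries:
        events.append((start, 1))      # coverage rises at the first base
        events.append((end + 1, -1))   # and falls after the last base
    events.sort()

    depth = 0
    covered = 0
    previous = None
    for position, change in events:
        # Count the stretch since the previous boundary if it was deep enough.
        if previous is not None and depth >= min_depth:
            covered += position - previous
        depth += change
        previous = position
    return covered / contig_length


# Hypothetical contig of 200 bp assembled from three overlapping entries.
example_entries = [(1, 100), (81, 150), (141, 200)]
print(multiply_covered_fraction(example_entries, 200))
```

In this invented example, positions 81 to 100 and 141 to 150 are covered twice, so 30 of 200 positions fall into the category "determined twice or more".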
The full set of information is provided in electronic form, which also includes
some structural information and other functional data, restriction map data,
corrections or sequenced mutations. In addition to the individual data files,
we are able to provide a genetic map both in electronic form as part of the
application program, as well as in printed form (
17
). An example of the interactively usable genetic map is given in Figure
1
. Special symbols are used to illustrate the orientation of individual genes and
the presence of promoter and terminator sequences. Gene symbols and EMBL
accession numbers are provided as hypertext. Selecting a gene symbol displays the
individual feature table together with the nucleotide sequence of the gene.
Hypertext within the feature table of the ECD entry as well as within the EMBL
file provides a convenient connection to other databases, e.g. to MEDLINE
abstracts.
The most convenient way to use the ECD
Escherichia coli
database is via the ECDC database collection on the World Wide Web (WWW)
system. Use URL: http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html.
This compilation is available in its full form quarterly as a set of flat files
from the EMBL data library (
36
) and is automatically distributed with each release. In addition, this
compilation is available on the CD-ROM version of the EMBL data library. However, the current version or
specific sets of data are also available on disk or CD-ROM on request directly from Gießen. This particular CD-ROM version includes an application program for quick
database searches, direct access to the collected sequences and, uniquely, a
comparison of the physical map as originally published by Kohara
et al
. (
28
) with the most current physical map derived from DNA sequence data. This
compilation allows a fairly exact calculation of the size of gaps between two
contigs.
For more information, use either of these email addresses: KROEGER@EMBL-HEIDELBERG.DE or WAHL@FMI.CH.
Users of our ECD database or our ECDC database collection are asked to cite this
paper.
We would like to thank Kenn Rudd (Bethesda) for his unpublished listing, and the
staff at EBI (Cambridge) for the constant flow of recent database additions.
Special thanks go to Peter and Catherine Rice (Cambridge) for help in updates
and searches. This work is supported by the Deutsche Forschungsgemeinschaft (Kr
591/7-1).
Bachmann map   Physical map   Bachmann map   Physical map
(min)          (min)          (min)          (min)
  0.0            0.0           55.0           57.75
  5.0            4.90          60.0           62.90
 10.0           10.00          65.0           67.90
 15.0           14.55          70.0           72.70
 20.0           20.55          75.0           76.35
 25.0           25.33          80.0           80.00
 30.0           30.30          85.0           85.30
 35.0           35.65          90.0           90.10
 40.0           40.90          95.0           94.90
 45.0           46.20         100.0          100.00
 50.0           52.35
REFERENCES
