© 1997 Oxford University Press 39-42

Compilation of DNA sequences of Escherichia coli K12: description of the interactive databases ECD and ECDC (update 1996)

Manfred Kröger* and Ralf Wahl

Institut für Mikrobiologie und Molekularbiologie, Fachbereich Biologie, Justus-Liebig-Universität Gießen, Frankfurter Straße 107, D-35392 Gießen, Germany

Received October 24, 1996; Accepted October 29, 1996

ABSTRACT

We have compiled the DNA sequence data for Escherichia coli available from the GenBank and EMBL data libraries and independently from the literature. We provide the most definitive version of the ECD Escherichia coli database now exclusively via the World Wide Web: http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html . Our database comprises an assembled set of contiguous sequences. Each of these contigs compiles all available sequence information, including that derived from a variety of older sequences. The organisation of the database allows precise physical location of each individual gene or regulatory region, even taking into consideration discrepancies in nomenclature. The WWW program allows users to branch into the original EMBL and SWISS-PROT data files. A number of links to other WWW servers are provided. FASTA and BLAST searches may be performed online. Besides the WWW format, a flat file version may be obtained via ftp. The ftp version may also be obtained from the EMBL data library as part of the CD-ROM issue of the EMBL sequence database, which is released and updated every 3 months. After deletion of all detected overlaps, a total of 3 588 706 individual bp had been determined up to the end of September 1996. This corresponds to 77.09% of the entire E.coli chromosome, which consists of ~4655 kb. About 479 kb (10.3%) are additionally available from Kyoto (Japan). Another 94 kb (2%) are available, but their mapping has not been confirmed. Thus the total may have reached 89.4%.

INTRODUCTION

Within this database issue we have been able to publish a compilation of DNA sequences of Escherichia coli in 7 contiguous years since 1989, and colleagues from all over the world have provided additions and corrections ( 1 - 7 ). The final target of providing the complete sequence of E.coli K12 may be reached this year, mainly because at least four groups have devoted their research to systematic sequencing of the E.coli chromosome ( 8 - 18 ). According to our data a total of 3 588 706 bp had been deposited by the end of September 1996. Almost all nucleotides are published more than once. The total corresponds to 138%. Our database may encourage our colleagues either to send us their unpublished, mostly flanking material or to determine the sometimes small gaps towards known neighbouring sequences in order to complete the sequence.

AIM OF COMPILATION

The final aim of our E.coli database (ECD) is to provide an electronic entry into the entire knowledge of the model organism Escherichia coli K12. We use the DNA sequence as the basis for all other information. Since there are already a number of specialist databases on different aspects of the E.coli cell, we prefer to provide a platform for these different data rather than to build an entirely new system. We incorporate data from other databases unchanged and prefer to act only as a distributor. To make this point as clear as possible, our World Wide Web system is called ECDC, for E. coli database collection ( 19 ).

For previous and supporting efforts please see our previous papers in this series and the references quoted therein ( 6 , 19 ). Since the acquisition of new sequence, physical mapping and other data is so rapid, most publications become outdated quickly. Thus electronic data collection has an advantage. However, collecting all the acquired data in one database is difficult for individual laboratories. Thus ECDC and the nucleotide collection therein try to provide a service for all other databases dealing with Escherichia coli . The World Wide Web system seems to be an ideal tool to connect different databases that are maintained directly at the original laboratories. Any comments and corrections are welcome at the addresses given below.

Although we have often been asked to collect data from pathogenic strains, we have restricted ourselves to E.coli K12. Instead, we provide a number of links to other WWW servers, which may lead into other databases collecting these data.

PERFORMED COMPILATION

The general scope of this collection is to allow a compilation of all uncoordinated sequence information to finally end up with a complete Escherichia coli nucleotide sequence database, including all sequenced mutants. The longest contig runs from min 68.9 to min 31.1 and covers 2 895 410 bp. However, seven other sequences enlarge this number to 3 258 399 bp with small gaps. Thus 70.0% of the E.coli genome is covered contiguously. As expected, the number of contigs has decreased and has now reached 133. Since the group in Japan has not officially released their data, these are available through a WWW link only. The respective area covers 1164 kb and is located between min 6.1 and min 31.1. The area is completely sequenced, but not annotated. Of this, 648 678 bp are already included in ECD; an additional 479 322 bp (10.29%) may be obtained directly from Kyoto (Japan) via URL http://genome4.aist-nara.ac.jp .
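The contig length and coverage figures quoted above can be checked with a few lines of arithmetic. The following Python sketch assumes a chromosome of 4655 kb spread linearly over a 100-minute circular map, as the text implies; the function names are illustrative and not part of ECD.

```python
# Sketch only: genome length and the 100-minute map are taken from the text;
# the function names are illustrative assumptions.
GENOME_BP = 4_655_000   # ~4655 kb, as stated in the abstract
MAP_MINUTES = 100       # the genetic map runs from min 0 to min 100


def span_bp(start_min, end_min):
    """Length in bp of a region running from start_min to end_min,
    wrapping through min 100/0 when end_min is smaller than start_min."""
    span_min = (end_min - start_min) % MAP_MINUTES
    return round(span_min * GENOME_BP / MAP_MINUTES)


def percent_of_genome(bp):
    """Share of the chromosome, in percent, covered by `bp` base pairs."""
    return 100 * bp / GENOME_BP


print(span_bp(68.9, 31.1))                      # 2895410
print(round(percent_of_genome(3_258_399), 1))   # 70.0
```

Under these assumptions the span from min 68.9 through the origin to min 31.1 reproduces the 2 895 410 bp given for the longest contig.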

We introduced the recently updated genetic map data ( 20 ) and used them to locate both sequenced and unsequenced areas to approximately a tenth of a minute. Where sequences overlap, fine sorting was done to a hundredth of a minute. A hundredth of a minute corresponds to 465.5 bp, which seems to be a sufficient resolution. If the sequences were mapped in any of the compilations using the Kohara map ( 21 - 29 ), we preferred to use their assignment, including the respective orientation according to the original Kohara map ( 23 - 25 , 29 ). Contigs are only assembled if either sequence extends over the respective restriction site. This may explain differences from EcoSeq7 ( 20 ). This procedure revealed a fairly good correspondence between genetic and physical map data. It may be noted that at least three contigs derived from systematic sequencing projects cover larger deletions compared with the original Kohara map ( 29 ). These deletions may have been derived from insertion element guided recombination ( 30 - 32 ).
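The stated resolution follows directly from the genome size. As a minimal Python sketch, assuming a strictly linear relation between base-pair offset and map minute (a simplification; real genetic and physical maps do not correspond exactly) and using illustrative names:

```python
# Sketch under a linearity assumption; names are illustrative, not ECD code.
GENOME_BP = 4_655_000                            # ~4655 kb, from the text
BP_PER_HUNDREDTH_MIN = GENOME_BP / (100 * 100)   # 100 min x 100 hundredths


def bp_to_map_position(bp_offset):
    """Map position in minutes, rounded to a hundredth of a minute,
    for a base-pair offset counted from min 0."""
    return round(bp_offset / GENOME_BP * 100, 2)


print(BP_PER_HUNDREDTH_MIN)            # 465.5
print(bp_to_map_position(2_327_500))   # 50.0
```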

Since a number of smaller contigs could not yet be localized within the chromosome with the necessary confidence, we assign these DNA sequences map positions greater than 100. In the most recent database we provide 72 unmapped sequences with a total of 94 039 bp. It is important to collect these entries, since during the course of systematic sequencing it became very clear that we are dealing with a number of different K12 substrains. A good example is the new outer membrane-associated protease OmpP, which could not be mapped on any of the Kohara phages ( 33 ). Using artificial map positions allows inclusion of all additional sequences of this type into FASTA or BLAST searches.
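The convention of artificial map positions makes it simple to separate mapped from unmapped entries while keeping both in one collection. A minimal Python sketch follows; the `Entry` record, its field names and the sample values are hypothetical and do not reflect the actual ECD file layout.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str        # hypothetical entry label
    map_min: float   # values above 100 are artificial placeholders
    length_bp: int


def split_by_mapping(entries):
    """Return (mapped, unmapped) lists; positions greater than 100
    mark sequences that could not be placed on the chromosome."""
    mapped = [e for e in entries if e.map_min <= 100]
    unmapped = [e for e in entries if e.map_min > 100]
    return mapped, unmapped


# Invented sample data, for illustration only.
sample = [Entry("contigA", 12.3, 5000),
          Entry("unplaced1", 100.1, 800),
          Entry("unplaced2", 100.2, 1200)]
mapped, unmapped = split_by_mapping(sample)
print(len(unmapped), sum(e.length_bp for e in unmapped))   # 2 2000
```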

The gene symbols preferentially follow the recent genetic map ( 20 ) or are taken from recent publications. However, since there are a number of biases and an increasing number of alternative gene symbols, we have designed our administration program accordingly. Each gene can be found under its historic or systematic name, as well as under the rational name. Each unannotated open reading frame is named according to the respective publication, but also according to the system promoted by K.Rudd, if already present in EcoSeq7 ( 20 ) or clearly annotated in the respective EMBL entry. Thus, although the given entry name may sometimes differ from the EMBL or GenBank entries, an automatic retrieval of alternative names is possible with our ECD system. For an example see Figure 2 of our previous update ( 6 ). In the near future, searches may also be performed using accession numbers or keywords.
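Retrieval under alternative names can be pictured as an index from every known symbol to the primary entry or entries it denotes. The Python sketch below is an assumption about how such a lookup might work, not a description of the ECD administration program; all gene symbols in it are invented placeholders.

```python
def build_alias_index(records):
    """Map every primary or alternative symbol to the set of primary
    entry names that carry it. `records` maps primary name -> aliases."""
    index = {}
    for primary, aliases in records.items():
        for name in (primary, *aliases):
            index.setdefault(name, set()).add(primary)
    return index


def lookup(index, name):
    """Return the sorted primary entries known under `name`."""
    return sorted(index.get(name, ()))


# Invented symbols, for illustration only.
records = {"abcA": ["yzzA", "orf12"], "abcB": ["orf12"]}
index = build_alias_index(records)
print(lookup(index, "yzzA"))    # ['abcA']
print(lookup(index, "orf12"))   # ['abcA', 'abcB']
```

Returning a set of entries rather than a single name lets an ambiguous symbol resolve to every gene that has carried it.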

In a number of cases open reading frames are not directly found in the original sequences. Since they are part of both the literature ( 34 - 36 ) and the recent genetic map ( 20 ), we have incorporated them into ECD. In most cases they are annotated in SWISS-PROT, but not in the nucleotide database files. Thus it is often not clear which nucleotide changes have to be introduced. For technical reasons users have to work out the changes for themselves. Also for technical reasons, open reading frames derived from two individual nucleotide files have to be assembled manually.

Figure 2 of our previous update ( 6 ) also serves as an example of the respective file architecture. In principle, we use the same structure as the EMBL data library. Each gene can be retrieved as an individual file and possesses an individual ECD accession number. Thus our database can be used directly for cross-references within the World Wide Web system using just this number.

Individual files are provided not only for structural genes (ECD system number EGxxxx) but also for specific functional sites (EFxxxx), promoters (EPxxxx), terminator or hairpin structures (EHxxxx), tRNAs (ETxxxx), ribosomal RNAs (ERxxxx) and unannotated open reading frames (EOxxxx). System numbers of the last type are to be replaced gradually by an EGxxxx number, as soon as open reading frames are assigned to a known function. Together with a short description line and a line on metabolic function (if known), the keywords derived from different databases are included. A list of cross references is given in the style of the EMBL data library. The feature table (FT) contains all information collected from various databases as well as the calculated map position. Thus references to the 2D-protein gel index ( 37 ), to the list of EC-numbers or metabolic pathway index ( 38 - 40 ), to the New Haven E.coli Genetic Stock Center ( 41 ) or to the Brookhaven database may be found in the features section. The respective links are provided as directly as possible with hypertext and allow direct entry into the respective databases. The given nucleotide sequence is the most current sequence, excluding any regulatory or flanking sequences. The feature table gives a detailed description of the source of this sequence. Corrections introduced, if necessary, are described individually. Compiling the sequence data in this way allows automatic monitoring of every correction performed by the respective author.
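The two-letter prefixes listed above determine the type of each entry. As a small Python sketch, assuming that the `xxxx` part of a system number stands for exactly four digits (the text does not specify this) and using a hypothetical function name:

```python
import re

# Prefix meanings are taken from the text.
ECD_PREFIXES = {
    "EG": "structural gene",
    "EF": "functional site",
    "EP": "promoter",
    "EH": "terminator or hairpin structure",
    "ET": "tRNA",
    "ER": "ribosomal RNA",
    "EO": "unannotated open reading frame",
}

# Assumption: "xxxx" means exactly four decimal digits.
_SYSTEM_NUMBER = re.compile(r"^(E[A-Z])(\d{4})$")


def classify_system_number(number):
    """Return the entry type for an ECD system number, or None if the
    string does not match the assumed format or a known prefix."""
    match = _SYSTEM_NUMBER.match(number)
    if match is None:
        return None
    return ECD_PREFIXES.get(match.group(1))


print(classify_system_number("EG0001"))   # structural gene
print(classify_system_number("EO0042"))   # unannotated open reading frame
```

The example numbers are made up and do not refer to actual ECD entries.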

ECD provides a major advantage in connecting all E.coli EMBL entries to contigs of maximum extent, and breaking them down into individual files for proteins, insertion elements, catalytic or transfer RNAs. Thus both handling and homology searches are quick and easy. A search for promoter, terminator or other regulatory structures is possible, as long as these features are described in the respective data files. Future issues of ECD may contain additional information, e.g. keywords, added manually by us. Please always refer to the most recent electronic release.

Each contig is compared with the PRO and UNC files of the EMBL database in order to look for as yet undetected overlapping sequences. We are able to calculate the exact position of each individual EMBL file within our contigs. This allows a highly detailed map of multiple sequence entries. Figure 3 of the previous update ( 6 ) gives an example of such a contig, which is derived from nine EMBL files and in which 17.2% of the sequence has been determined twice or more. The data themselves, however, are taken from only five EMBL files.
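Once the position of every EMBL file within a contig is known, the share of the contig determined twice or more can be computed with a sweep over the interval boundaries. The Python sketch below illustrates this idea under the assumption of half-open `(start, end)` coordinates; it is not the authors' program, and the coordinates used are invented.

```python
def multiply_covered_fraction(contig_len, entries):
    """Fraction of a contig covered by two or more entries.
    `entries` holds half-open (start, end) positions within the contig."""
    events = []
    for start, end in entries:
        events.append((start, 1))    # an entry begins
        events.append((end, -1))     # an entry ends
    events.sort()                    # ends sort before starts at a tie

    depth = 0       # number of entries covering the current stretch
    previous = 0    # position of the previous event
    multiple = 0    # bases seen with depth >= 2
    for position, delta in events:
        if depth >= 2:
            multiple += position - previous
        depth += delta
        previous = position
    return multiple / contig_len


# Invented example: two files overlapping by 200 bp in a 1000 bp contig.
print(multiply_covered_fraction(1000, [(0, 600), (400, 1000)]))   # 0.2
```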

The full set of information is provided in electronic form, which also includes some structural information and other functional data, restriction map data, corrections or sequenced mutations. In addition to the individual data files, we are able to provide a genetic map both in electronic form as part of the application program and in printed form ( 19 ). An example of the interactively usable genetic map is given in Figure 1 . Special symbols are used to illustrate the orientation of individual genes and the presence of promoter and terminator sequences. Gene symbols and EMBL accession numbers are provided as hypertext. Gene symbols lead to the individual feature table together with the nucleotide sequence of the gene. Hypertext links within the feature table of the ECD entry as well as within the EMBL file allow a convenient connection to other databases, e.g. to MEDLINE abstracts.


Figure 1 . Organization of the ECDC interactively usable genetic map. All bars are drawn to scale. Names of fully assigned genes are accompanied by small arrows indicating the direction of transcription. Genes not yet sequenced are located according to the information given in the Bachmann linkage map. Functional sequences are marked within the uppermost line by `W' for terminators and, depending on the orientation, by either `<' or `>' for promoters. For a more detailed description see (19).

DATA DISTRIBUTION IN MACHINE READABLE FORM AND INTERNATIONAL ACCEPTANCE

The most convenient way to use the ECD Escherichia coli database is via the ECDC database collection on the World Wide Web (WWW) system. Apart from a simple and fairly approximate statistical analysis of user identification (see below), we do not read or collect any data submitted for a search within ECD. Users of our ECD database or our ECDC database collection should use the URL: http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html . They are politely asked to cite this paper within scientific publications and/or grant applications.

This compilation is available in its full form quarterly as a set of flat files from the EMBL data library ( 42 ) and is automatically distributed with each release of the EMBL data library. In addition, this compilation is available on the CD-ROM version of the EMBL data library. We have discontinued distribution on disk or CD-ROM directly from Gießen.

Computer programs freely available within the WWW system allow a fairly detailed statistical analysis of user identification and frequency. From June to September 1996 a total of 85 973 different requests were made. This corresponds to 717 requests per day. An average of 30 different individuals use ECDC each day. Users of ECDC reside in >45 different countries. For more information use either of these email addresses: kroeger@embl-heidelberg.de or wahl@fmi.ch .

ACKNOWLEDGEMENT

We thank Kenn Rudd (Bethesda) for his unpublished listing of EcoSeq7, and the staff at EBI (Cambridge) for the constant flow of recent database additions. This work has been supported by the Deutsche Forschungsgemeinschaft (Kr 591/7-1).

REFERENCES

1 Kröger, M. (1989) Nucleic Acids Res. 17 (Suppl.), r283-309.

2 Kröger, M., Wahl, R. and Rice, P. (1990) Nucleic Acids Res. 18, 2549-2587.

3 Kröger, M., Wahl, R. and Rice, P. (1991) Nucleic Acids Res. 19, 2023-2043.

4 Kröger, M., Wahl, R., Schachtel, G. and Rice, P. (1992) Nucleic Acids Res. 20, 2119-2144.

5 Kröger, M., Wahl, R. and Rice, P. (1993) Nucleic Acids Res. 21, 2973-3000.

6 Wahl, R., Rice, P., Rice, C.M. and Kröger, M. (1994) Nucleic Acids Res. 22, 3450-3455.

7 Kröger, M. and Wahl, R. (1996) Nucleic Acids Res. 24, 29-31.

8 Daniels, D.L., Plunkett III, G., Burland, V. and Blattner, F. (1992) Science 257, 771-778.

9 Burland, V., Plunkett III, G., Daniels, D.L. and Blattner, F. (1993) Genomics 16, 551-561.

10 Plunkett III, G., Burland, V., Daniels, D.L. and Blattner, F. (1993) Nucleic Acids Res. 21, 3391-3398.

11 Blattner, F., Burland, V., Plunkett III, G., Sofia, H.J. and Daniels, D.L. (1993) Nucleic Acids Res. 21, 5408-5417.

12 Sofia, H.J., Burland, V., Daniels, D.L., Plunkett III, G. and Blattner, F. (1994) Nucleic Acids Res. 22, 2576-2586.

13 Burland, V., Plunkett III, G., Sofia, H.J., Daniels, D.L. and Blattner, F. (1995) Nucleic Acids Res. 23, 2105-2119.

14 Yura, T., Mori, H., Nagai, H., Nagata, T., Ishihama, A., Fujita, N., Isono, K., Mizobuchi, K. and Nakata, A. (1992) Nucleic Acids Res. 20, 3305-3308.

15 Fujita, N., Mori, H., Yura, T. and Ishihama, A. (1994) Nucleic Acids Res. 22, 1637-1639.

16 Richterich, P., Laskey, N., Gryan, G., Jaehn, L., Mintz, L., Robison, K. and Church, G.M. (1993) EMBL/GenBank AccNr. U00007 and U00008

17 Schramm, S. et al. (1996) EMBL/GenBank AccNr. U70214

18 Takemoto, K. et al. (1996) EMBL/GenBank AccNr. D83536

19 Wahl, R. and Kröger, M. (1995) Microbiol. Res. 150, 7-61.

20 Berlyn, M.B., Brooks Low, K. and Rudd, K.E. (1996) in Escherichia coli and Salmonella - Cellular and Molecular Biology, 2.Ed. ; F.C.Neidhardt (ed. in chief), pp. 1715-1902, ASM Press, Washington, D.C.

21 Rudd, K.E. (1992) In `A Short Course in Bacterial Genetics: A Laboratory Manual and Handbook for Escherichia coli and Related Bacteria', J.H.Miller (ed.), pp. 2.3-2.43 Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.

22 Rudd, K.E., Miller, W., Ostell, J. and Benson, D.A. (1990) Nucleic Acids Res. 18, 313-321.

23 Médigue, C., Bouché, J.P., Hénaut, A. and Danchin, A. (1990) Mol. Microbiol. 4, 169-187.

24 Médigue, C., Hénaut, A. and Danchin, A. (1990) Mol. Microbiol. 4, 1443-1454.

25 Médigue, C., Viari, A., Hénaut, A. and Danchin, A. (1991) Mol. Microbiol. 5, 2629-2640.

26 Watanabe, H. and Kunisawa, T. (1990) Protein Seq. Data Anal. 3, 149-156.

27 Kunisawa, T., Nakamura, M., Watanabe, H., Otsuka, J., Tsugita, A., Yeh, L.-S.L., George, D.G. and Barker, W.C. (1990) Protein Seq. Data Anal. 3, 157-162.

28 Médigue, C., Viari, A., Hénaut, A. and Danchin, A. (1993) Microbiol. Rev. 57, 623-654.

29 Kohara, Y., Akiyama, K. and Isono, K. (1987) Cell 50, 495-508.

30 Birkenbihl, R.P. and Vielmetter, W. (1989) Nucleic Acids Res. 17, 5057-5069.

31 Umeda, M. and Ohtsubo, E. (1989) J. Mol. Biol. 208, 601-614.

32 Birkenbihl, R.P. and Vielmetter, W. (1989) Mol. Gen. Genet. 220, 147-153.

33 Kaufmann, A., Stierhof, Y.-D. and Henning, U. (1994) J. Bacteriol. 176, 359-367.

34 Koonin, E.V., Tatusov, R.L. and Rudd, K.E. (1996) In Escherichia coli and Salmonella-Cellular and Molecular Biology, 2.Ed.; F.C.Neidhardt (ed. in chief), pp. 2203-2217, ASM Press, Washington, D.C.

35 Borodovsky, M., Koonin, E.V. and Rudd, K.E. (1994) Nucleic Acids Res. 22, 4756-4767.

36 Borodovsky, M., Koonin, E.V. and Rudd, K.E. (1994) Trends Biochem. Sci. 19, 309-313.

37 VanBogelen, R.A., Abshire, K.Z., Pertsemlidis, A., Clark, R.L. and Neidhardt, F.C. (1996) In Escherichia coli and Salmonella-Cellular and Molecular Biology, 2.Ed.; F.C.Neidhardt (ed. in chief), pp. 2067-2117, ASM Press, Washington, D.C.

38 Karp, P.D., Riley, M., Paley, S.M. and Pellegrini-Toole, A. (1996) Nucleic Acids Res. 24, 32-39.

39 Riley, M. and Space, D.B. (1996) Nucleic Acids Res. 24, 40.

40 Riley, M. and Labedan, B. (1996) In Escherichia coli and Salmonella- Cellular and Molecular Biology, 2.Ed.; F.C.Neidhardt (ed. in chief), pp. 2118-2202, ASM Press, Washington, D.C.

41 Berlyn, M.B. and Letovsky, S. (1992) Nucleic Acids Res. 20, 6143-6151.

42 Rodriguez-Tomé, P., Stoehr, P.J., Cameron, G.N. and Flores, T.P. (1996) Nucleic Acids Res. 24, 6-12.


* To whom correspondence should be addressed. Tel: +49 641 99 35530; Fax: +49 641 99 35549; Email: kroeger@bio.uni-giessen.de