
© 1996 Oxford University Press 29-32

Compilation of DNA sequences of Escherichia coli K12 (ECD and ECDC; update 1995)

Manfred Kröger* and Ralf Wahl

Institut für Mikrobiologie und Molekularbiologie, Fachbereich Biologie, Justus-Liebig-Universität Gießen, Frankfurter Straße 107, D-35392 Gießen, Germany

Received October 3, 1995; Revised and Accepted October 4, 1995

ABSTRACT

We have compiled the DNA sequence data for Escherichia coli available from the GenBank and EMBL data libraries and, independently, from the literature. Unlike the previous updates of our E.coli databases, we now provide the most recent version preferentially via the World Wide Web (use URL: http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html). Our database includes an assembled set of contiguous sequences (contigs). Each contig compiles all available sequence information, including that derived from a variety of older sequences. The organisation of the database allows one to find the exact physical location of each individual gene or regulatory region, even where the nomenclature is discrepant. The WWW program gives access to the original EMBL and SWISSPROT data files. FASTA and BLAST searches may be performed online. Besides the WWW format, a flat file version may be obtained via ftp. The complete compilation, including a full set of genetic map data and the E.coli protein index, can be obtained in machine readable form from the EMBL data library as part of the CD-ROM issue of the EMBL sequence database, which is released and updated every three months. After deletion of all detected overlaps, a total of 3 333 878 individual bp had been determined by the end of September 1995. This corresponds to 71.71% of the entire E.coli chromosome, which consists of about 4720 kbp. About 94 kbp (2%) are available in addition, but have not yet been definitely mapped.

INTRODUCTION

Within this database issue we have published a compilation of DNA sequences of Escherichia coli in six consecutive years since 1989, each time asking our colleagues from all over the world for additions and corrections (1-6). The final target, the complete sequence of E.coli K12, may be reached by 1997, mainly because at least two groups have devoted their research to systematic sequencing of the E.coli chromosome (7-16). According to our data, a total of 3 333 878 bp had been sequenced by September 1995. Almost one half of these nucleotides have been published more than once. Our database may encourage our colleagues either to send us their unpublished, mostly flanking, material or to determine the sometimes very small gaps towards known neighbouring sequences, so that the complete sequence is finally obtained.

AIM OF COMPILATION

The final aim of our E.coli database (ECD) is to provide an electronic entry point to the entire knowledge about the model organism Escherichia coli K12. We use the DNA sequence as the backbone for all other information. Since there are already a number of special databases on different aspects of the E.coli cell, we prefer merely to provide a platform for these data rather than to build up an entirely new system. We allow unchanged incorporation of data from other databases and prefer to act only as a distributor. To make this point as clear as possible, we call our World Wide Web system ECDC, for E.coli database collection (17).

For previous and supporting efforts please see our earlier papers in this series and the references quoted therein (17, 14). Since new sequence, physical mapping and other data are acquired so rapidly, most of the respective publications are outdated very quickly. Thus electronic data collection seems to be the only alternative. However, collecting all the acquired data in one database is very difficult for individual laboratories. ECDC and the nucleotide collection therein therefore try to provide a shell for all other databases dealing with Escherichia coli. The World Wide Web seems to be an ideal tool for connecting different databases that are maintained directly at the original laboratories. Comments and corrections are highly welcome at the addresses given below.

PERFORMED COMPILATION

The general scope of this collection is to compile all uncoordinated sequence information and finally arrive at a complete Escherichia coli nucleotide sequence database, including all sequenced mutants. The longest contig runs from min 68.9 to min 4.0 and covers 1 633 962 bp. Seven other sequences, separated from it only by very small gaps, enlarge this number to 1 996 951 bp. Thus 42.3% of the E.coli genome is covered almost contiguously. This length is slightly greater than that of the recently published complete sequence of Haemophilus influenzae Rd (18).
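The percentage quoted above can be reproduced with simple arithmetic. The following is only a sketch; it assumes the chromosome size of about 4720 kbp given in the abstract, and the function name is chosen here for illustration.

```python
# Chromosome size assumed from the abstract ("about 4720 kbp").
GENOME_BP = 4_720_000


def coverage_percent(sequenced_bp: int, genome_bp: int = GENOME_BP) -> float:
    """Return the share of the chromosome covered by a sequence, in percent."""
    return 100.0 * sequenced_bp / genome_bp


# Longest contig alone, and together with the seven neighbouring sequences.
longest_contig = coverage_percent(1_633_962)
with_neighbours = coverage_percent(1_996_951)
print(f"{longest_contig:.1f}% {with_neighbours:.1f}%")
```

With these inputs the second value rounds to the 42.3% stated in the text.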

We incorporated B. Bachmann's genetic map data (19) completely and used them to locate both sequenced and unsequenced areas to roughly a tenth of a minute. Where sequences overlap, fine ordering was done to a hundredth of a minute. A hundredth of a minute corresponds to 472 bp, which seems to be a sufficient resolution. If the sequences had been mapped in any of the compilations using the Kohara map (20-28), we preferred their assignment, including the respective orientation according to the original Kohara map (2-24, 28). Contigs are assembled only if either sequence extends over the respective restriction site. This may explain differences from EcoSeq6 (20). The procedure revealed a fairly good correspondence between genetic and physical map data. Note that at least three contigs derived from systematic sequencing projects cover larger deletions compared with the original Kohara map (28). These deletions may have arisen from insertion element-guided recombination (29-31). This may also explain the larger differences of up to 3 min in the area between min 40 and 80. The data given in Table 1 may be used to recalculate the genetic map position of genes not yet sequenced.

Table 1. Concordance of physical and genetic map data

Bachmann map   Physical map   Bachmann map   Physical map
(min)          (min)          (min)          (min)
0.0            0.0            55.0           57.75
5.0            4.90           60.0           62.90
10.0           10.00          65.0           67.90
15.0           14.55          70.0           72.70
20.0           20.55          75.0           76.35
25.0           25.33          80.0           80.00
30.0           30.30          85.0           85.30
35.0           35.65          90.0           90.10
40.0           40.90          95.0           94.90
45.0           46.20          100.0          100.00
50.0           52.35
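One way to use Table 1 for recalculating a map position is sketched below. The text does not state how values between the tabulated points should be obtained; linear interpolation between neighbouring rows is an assumption made here, as are the function names. The conversion to base pairs uses the chromosome size of about 4720 kbp, which gives the 472 bp per hundredth of a minute mentioned above.

```python
from bisect import bisect_right

# (Bachmann genetic map min, physical map min) pairs copied from Table 1.
TABLE_1 = [
    (0.0, 0.0), (5.0, 4.90), (10.0, 10.00), (15.0, 14.55), (20.0, 20.55),
    (25.0, 25.33), (30.0, 30.30), (35.0, 35.65), (40.0, 40.90), (45.0, 46.20),
    (50.0, 52.35), (55.0, 57.75), (60.0, 62.90), (65.0, 67.90), (70.0, 72.70),
    (75.0, 76.35), (80.0, 80.00), (85.0, 85.30), (90.0, 90.10), (95.0, 94.90),
    (100.0, 100.00),
]

# 4720 kbp spread over 100 map minutes.
BP_PER_MINUTE = 4_720_000 / 100


def genetic_to_physical(genetic_min: float) -> float:
    """Interpolate linearly (an assumption) between neighbouring Table 1 rows."""
    if not 0.0 <= genetic_min <= 100.0:
        raise ValueError("map position must lie between 0 and 100 min")
    genetic_points = [g for g, _ in TABLE_1]
    # Index of the upper bracketing row, clamped so that 100.0 uses the last pair.
    upper = min(bisect_right(genetic_points, genetic_min), len(TABLE_1) - 1)
    (g0, p0), (g1, p1) = TABLE_1[upper - 1], TABLE_1[upper]
    return p0 + (p1 - p0) * (genetic_min - g0) / (g1 - g0)


def minutes_to_bp(minutes: float) -> float:
    """Convert a distance in map minutes into base pairs."""
    return minutes * BP_PER_MINUTE


if __name__ == "__main__":
    print(genetic_to_physical(52.5))   # halfway between the 50.0 and 55.0 rows
    print(round(minutes_to_bp(0.01)))  # size of a hundredth of a minute in bp
```

Tabulated positions are returned unchanged, so the sketch agrees with Table 1 wherever the table gives a value.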

Since a number of smaller contigs cannot yet be localized within the chromosome with the necessary confidence, we assign these DNA sequences map positions greater than 100. In the most recent database we provide 72 unmapped sequences with a total of 94 039 bp. It seems very important to collect these entries, since in the course of systematic sequencing it has become very clear that we are dealing with a number of different K12 substrains. A very good example is the new outer membrane-associated protease OmpP, which could not be mapped on any of the Kohara phages (32). Using artificial map positions allows all additional sequences of this type to be included in FASTA or BLAST searches.
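The convention of map positions greater than 100 makes unmapped sequences easy to separate from mapped ones. A minimal sketch follows; the record layout, the field names and the example entries are hypothetical and are not taken from the ECD files.

```python
from dataclasses import dataclass

# The genetic map runs from 0 to 100 min; larger values mark unmapped sequences.
MAP_LENGTH_MIN = 100.0


@dataclass
class Contig:
    name: str            # hypothetical identifier, for illustration only
    map_position: float  # map position in minutes
    length_bp: int


def is_unmapped(contig: Contig) -> bool:
    """A contig counts as unmapped if its artificial position exceeds 100 min."""
    return contig.map_position > MAP_LENGTH_MIN


def summarize_unmapped(contigs: list[Contig]) -> tuple[int, int]:
    """Return (number of unmapped contigs, their total length in bp)."""
    unmapped = [c for c in contigs if is_unmapped(c)]
    return len(unmapped), sum(c.length_bp for c in unmapped)


# Invented example records, not real ECD entries.
example = [
    Contig("contig_a", 68.9, 1200),
    Contig("contig_b", 100.5, 800),
    Contig("contig_c", 101.2, 950),
]
print(summarize_unmapped(example))
```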

The gene symbols preferentially follow the Bachmann list (19) or are taken from a recent publication. However, since there are a number of inconsistencies and an increasing number of alternative gene symbols, we have changed our administration program accordingly. Each gene can be found under its historic or systematic name, as well as under the rational name. Each unannotated open reading frame is named according to the respective publication, and also according to the system proposed by K. Rudd, if it is already present in EcoSeq6 (20) or clearly annotated in the respective EMBL entry. Thus, although the given entry name may sometimes differ from the EMBL or GenBank entries, automatic retrieval by alternative names is possible with our ECD system. For an example see Figure 2 of our previous update (6). Searches using accession numbers or keywords may also become possible in the near future.

Figure 2 of our previous update (6) may also serve as an example of the file architecture. In principle, we use the same structure as the EMBL data library. Each gene can be retrieved as an individual file and has an individual ECD accession number. Thus our database can be used directly for cross references using just this number, e.g. in the World Wide Web system.

Individual files are provided not only for structural genes (ECD system number EGxxxx) but also for specific functional sites (EFxxxx), promoters (EPxxxx), terminator or hairpin structures (EHxxxx), tRNAs (ETxxxx), ribosomal RNAs (ERxxxx) and unannotated open reading frames (EOxxxx). System numbers of the last type are to be replaced gradually by an EGxxxx number as soon as the open reading frames are assigned to a known function. Together with a short description line and a line on metabolic function (if known), the keywords derived from different databases are included. A list of cross references is given in the style of the EMBL data library. The feature table (FT) contains all information collected from various databases as well as the calculated map position. Thus references to the 2D-protein gel index (33), to the list of EC-numbers or metabolic pathway index (34), to the New Haven E.coli Genetic Stock Center (35) or to the Brookhaven database may be found in the features section. The respective links are provided, or will be provided, as hypertext and allow direct entry into the respective databases. The given nucleotide sequence is the most current sequence, excluding any regulatory or flanking sequences. The feature table gives a detailed description of the source of this sequence. Corrections, where they were necessary, are described individually. Compiling the sequence data in this way allows every correction made by the respective author to be monitored automatically.
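The entry classes listed above can be recognised from the two-letter prefix of the ECD system number. The sketch below is illustrative only: it assumes that "xxxx" stands for exactly four digits, and the function name and the example numbers are hypothetical.

```python
import re

# Prefixes and their meaning as listed in the text.
ENTRY_CLASSES = {
    "EG": "structural gene",
    "EF": "specific functional site",
    "EP": "promoter",
    "EH": "terminator or hairpin structure",
    "ET": "tRNA",
    "ER": "ribosomal RNA",
    "EO": "unannotated open reading frame",
}

# Assumption: a system number is a known prefix followed by exactly four digits.
_SYSTEM_NUMBER = re.compile(r"(E[GFPHTRO])(\d{4})")


def classify_system_number(system_number: str) -> str:
    """Return the entry class for an ECD system number such as 'EG0001'."""
    match = _SYSTEM_NUMBER.fullmatch(system_number.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised ECD system number: {system_number!r}")
    return ENTRY_CLASSES[match.group(1)]


# Hypothetical numbers, used only to show the call.
print(classify_system_number("EG0001"))
print(classify_system_number("eo1234"))
```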

ECD provides a major advantage in connecting all E.coli EMBL entries to contigs of maximum extent, and breaking them down again into individual files for proteins, insertion elements, and catalytic or transfer RNAs. Thus both handling and homology searches are quick and easy. A search for promoter, terminator or other regulatory structures is possible, as long as these features are described in the respective data files. Future issues of ECD may contain additional information, e.g. keywords, added manually by us. Please always refer to the most recent electronic release.

Each contig is compared with the PRO and UNC files of the EMBL database in order to look for as yet undetected overlapping sequences. We are able to calculate the exact position of each individual EMBL file within our contigs. This allows a highly detailed map of multiple sequence entries. Figure 3 of the previous update (6) gives an example of such a contig, which is derived from nine EMBL files and in which 17.2% of the sequence has been determined twice or more. The sequence data themselves are, however, taken from only five of these EMBL files.
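Once the position of every EMBL file within a contig is known, the share of the contig determined twice or more can be derived from those positions. The following is a minimal sketch and not the ECD implementation; the function name, the 1-based inclusive coordinates and the example intervals are assumptions made for illustration.

```python
def multiply_covered_fraction(contig_length: int,
                              entries: list[tuple[int, int]]) -> float:
    """Return the fraction of a contig covered by at least two entries.

    `entries` holds (start, end) positions of the individual files on the
    contig, 1-based and inclusive.
    """
    # Sweep line: +1 where an entry starts, -1 just past where it ends.
    events = []
    for start, end in entries:
        events.append((start, 1))
        events.append((end + 1, -1))
    events.sort()

    depth = 0          # number of entries covering the current stretch
    covered_twice = 0  # bp seen with depth >= 2 so far
    previous = None
    for position, delta in events:
        if previous is not None and depth >= 2:
            covered_twice += position - previous
        depth += delta
        previous = position
    return covered_twice / contig_length


# Invented layout: two files overlapping by 200 bp on a 1000 bp contig.
print(multiply_covered_fraction(1000, [(1, 600), (401, 1000)]))
```

Positions covered three or more times are counted once, which matches the wording "determined twice or more".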

The full set of information is provided in electronic form, which also includes some structural information and other functional data, restriction map data, corrections and sequenced mutations. In addition to the individual data files, we are able to provide a genetic map both in electronic form, as part of the application program, and in printed form (17). An example of the interactively usable genetic map is given in Figure 1. Special symbols are used to illustrate the orientation of individual genes and the presence of promoter and terminator sequences. Gene symbols and EMBL accession numbers are provided as hypertext. Selecting a gene symbol provides the individual feature table together with the nucleotide sequence of the gene. Hypertext within the feature table of the ECD entry, as well as within the EMBL file, allows a most convenient connection to other databases, e.g. to MEDLINE abstracts.


Figure 1. Organization of the ECDC interactively usable genetic map. All bars are drawn to scale. Names of fully assigned genes are accompanied by small arrows indicating the direction of transcription. Genes not sequenced until now are located according to the information given in the Bachmann linkage map. Functional sequences are marked in the uppermost line by `Ω' for terminators and, depending on the orientation, by either `<' or `>' for promoters. For a more detailed description see (17).

DATA DISTRIBUTION IN MACHINE READABLE FORM

The most convenient way to use the ECD Escherichia coli database is via the ECDC database collection on the World Wide Web (WWW) system. Use URL: http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html.

This compilation is available in its full form quarterly as a set of flat files from the EMBL data library (36) and is automatically distributed with each release. In addition, this compilation is available on the CD-ROM version of the EMBL data library. The current version, or specific sets of data, is also available on disk or CD-ROM on request directly from Gießen. This particular CD-ROM version includes an application program for quick database searches, direct access to the collected sequences and, uniquely, a comparison of the physical map as originally published by Kohara et al. (28) with the most current physical map derived from DNA sequence data. This comparison allows a fairly exact calculation of the size of the gaps between two contigs.

For more information use either of these email addresses: KROEGER@EMBL-HEIDELBERG.DE or WAHL@FMI.CH.

Users of our ECD database or our ECDC database collection are asked to cite this paper.

ACKNOWLEDGEMENTS

We would like to thank Kenn Rudd (Bethesda) for his unpublished listing, and the staff at EBI (Cambridge) for the constant flow of recent database additions. Special thanks go to Peter and Catherine Rice (Cambridge) for help in updates and searches. This work is supported by the Deutsche Forschungsgemeinschaft (Kr 591/7-1).

REFERENCES

1 M.Kröger (1989) Nucleic Acids Res. 17 (Suppl.), r283-309.

2 M.Kröger, R.Wahl and P.Rice (1990) Nucleic Acids Res. 18, 2549-2587.

3 M.Kröger, R.Wahl and P.Rice (1991) Nucleic Acids Res. 19, 2023-2043.

4 M.Kröger, R.Wahl, G.Schachtel and P.Rice (1992) Nucleic Acids Res. 20, 2119-2144.

5 M.Kröger, R.Wahl and P.Rice (1993) Nucleic Acids Res. 21, 2973-3000.

6 R.Wahl, P.Rice, C.M.Rice and M.Kröger (1994) Nucleic Acids Res. 22, 3450-3455.

7 D.L.Daniels, G.Plunkett III, V.Burland and F.Blattner (1992) Science 257, 771-778.

8 V.Burland, G.Plunkett III, D.L.Daniels, and F.Blattner (1993) Genomics, 16, 551-561.

9 G.Plunkett III, V.Burland, D.L.Daniels, and F.Blattner (1993) Nucleic Acids Res. 21, 3391-3398.

10 F.Blattner, V.Burland, G.Plunkett III, H.J.Sofia, and D.L.Daniels (1993) Nucleic Acids Res. 21, 5408-5417.

11 H.J.Sofia, V.Burland, D.L.Daniels, G.Plunkett III, and F.Blattner (1994) Nucleic Acids Res. 22, 2576-2586.

12 V.Burland, G.Plunkett III, H.J.Sofia, D.L.Daniels, and F.Blattner (1995) Nucleic Acids Res. 23, 2105-2119.

13 T.Yura, H.Mori, H.Nagai, T.Nagata, A.Ishihama, N.Fujita, K.Isono, K.Mizobuchi and A.Nakata (1992) Nucleic Acids Res. 20, 3305-3308.

14 N.Fujita, H.Mori, T.Yura and A.Ishihama (1994) Nucleic Acids Res. 22, 1637-1639.

15 P.Richterich, N.Laskey, G.Gryan, L.Jaehn, L.Mintz, K.Robison, G.M.Church (1993) EMBL/GenBank AccNr. U00007.

16 P.Richterich, N.Laskey, G.Gryan, L.Jaehn, L.Mintz, K.Robison, G.M.Church (1993) EMBL/GenBank AccNr. U00008.

17 R.Wahl and M.Kröger (1995) Microbiol. Res. (Jena) 150, 7-61.

18 R.D.Fleischmann et al. (1995) Science 269, 496-512.

19 B.J.Bachmann (1990) Microbiol. Rev. 54, 130-197.

20 K.E.Rudd (1992) In J.H.Miller (ed.) A Short Course in Bacterial Genetics: A Laboratory Manual and Handbook for Escherichia coli and Related Bacteria. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, pp. 2.3-2.43.

21 K.E.Rudd, W.Miller, J.Ostell and D.A.Benson (1990) Nucleic Acids Res. 18, 313-321.

22 C.Médigue, J.P.Bouché, A.Hénaut and A.Danchin (1990) Mol. Microbiol. 4, 169-187.

23 C.Médigue, A.Hénaut and A.Danchin (1990) Mol. Microbiol. 4, 1443-1454.

24 C.Médigue, A.Viari, A.Hénaut and A.Danchin (1991) Mol. Microbiol. 5, 2629-2640.

25 H.Watanabe and T.Kunisawa (1990) Protein Seq. Data Analysis 3, 149-156.

26 T.Kunisawa, M.Nakamura, H.Watanabe, J.Otsuka, A.Tsugita, L.-S.L.Yeh, D.G.George and W.C.Barker (1990) Protein Seq. Data Analysis 3, 157-162.

27 C.Médigue, A.Viari, A.Hénaut, and A.Danchin (1993) Microbiol. Rev. 57, 623-654.

28 Y.Kohara, K.Akiyama and K.Isono (1987) Cell 50, 495-508.

29 R.P.Birkenbihl and W.Vielmetter (1989) Nucleic Acids Res. 17, 5057-5069.

30 M.Umeda and E.Ohtsubo (1989) J. Mol. Biol. 208, 601-614.

31 R.P.Birkenbihl and W.Vielmetter (1989) Mol. Gen. Genet. 220, 147-153.

32 A.Kaufmann, Y.-D.Stierhof and U.Henning (1994) J. Bacteriol. 176, 359-367.

33 R.A.VanBogelen, P.Sankar, R.L.Clark, J.A.Bogan and F.C.Neidhardt (1992) Electrophoresis 13, 1014-1054.

34 M.Riley (1993) Microbiol. Rev. 57, 862-952.

35 M.B.Berlyn and S.Letovsky (1992) Nucleic Acids Res. 20, 6143-6151.

36 D.B.Emmert, P.J.Stoehr, G.Stoesser and G.N.Cameron (1993) Nucleic Acids Res. 21, 2967-2971.



* To whom correspondence should be addressed