Nucleic Acids Research, 2005, Vol. 33, Database issue D329-D333
© 2005, the authors
Nucleic Acids Research, Vol. 33, Database issue © Oxford University Press 2005; all rights reserved
EchoBASE: an integrated post-genomic database for Escherichia coli
Department of Biology (Area 10), University of York, PO Box 373, York, YO10 5YW, UK and 1 Scientific Computing and Mathematical Modelling, GlaxoSmithKline, Gunnels Wood Road, Stevenage, SG1 2NY, UK
* To whom correspondence should be addressed. Tel: +44 1904 328678; Fax: +44 1904 328825; Email: ght2{at}york.ac.uk
Received August 13, 2004; Accepted September 21, 2004
| ABSTRACT |
|---|
EchoBASE (http://www.ecoli-york.org) is a relational database designed to contain and manipulate information from post-genomic experiments using the model bacterium Escherichia coli K-12. Its aim is to collate information from a wide range of sources to provide clues to the functions of the approximately 1500 gene products that have no confirmed cellular function. The database is built on an enhanced annotation of the updated genome sequence of strain MG1655 and the association of experimental data with the E.coli genes and their products. Experiments that can be held within EchoBASE include proteomics studies, microarray data, protein–protein interaction data, structural data and bioinformatics studies. EchoBASE also contains annotated information on orphan enzyme activities from this microbe to aid characterization of the proteins that catalyse these elusive biochemical reactions.
| INTRODUCTION |
|---|
The bacterium Escherichia coli K-12 is the most thoroughly studied free-living organism and the completion of the genome sequence of strain MG1655 in 1997 was a landmark event in the study of this model organism (1). Our understanding of this species has been boosted by the genome sequences of the pathogenic strains E.coli O157:H7 (2,3) and E.coli CFT073 (4). The genome sequence has provided the opportunity to identify all the potential components of the cell, and the annotation of E.coli K-12 strain MG1655 has recently been updated with some sequence changes and the resulting gene annotation changes. One striking fact remains 7 years after the sequence was finished: the physiological functions of nearly 35% of its gene products are unknown, while almost 20% of the genome contains genes for which we cannot make a confident prediction of function (functionally unknown or FUN genes) (5–7).
Data is continuously being generated about FUN genes using many traditional and post-genomic research techniques, including proteomics, biochemical studies, microarrays, structural and bioinformatics approaches. At present, tracing all the published data on a particular FUN gene is difficult and time-consuming, and there is a need to collate and organize such data into a manageable and efficient database system. Integration of a wide range of information about a particular FUN gene can work synergistically to help in the prediction of its function.
We describe herein the creation of EchoBASE, a new database that integrates information from post-genomic experiments into a single resource. For basic curation of gene product information, the database relies on features from other selected E.coli databases, its novelty being the linkage of curated experimental data to individual genes and the ability to manipulate data from genome-wide experiments. While we aim to predict biological functions for uncharacterized gene products, there are existing lists of biochemical activities that have been identified in E.coli but not mapped to a gene product, so-called orphan enzymes. To complete our knowledge of metabolic pathways in E.coli, it is essential to identify the genes that encode these orphan enzymes and we describe our curation and analysis of orphan enzymes in E.coli as a component of EchoBASE.
EchoBASE is the major evolution of a simple HTML catalogue of functional updates of uncharacterized gene products that was begun in 1998 (7), and is a component of the E.coli index WWW site (http://ecoli.bham.ac.uk/) (8).
| DESCRIPTION OF EchoBASE |
|---|
EchoBASE is a relational database that was designed and implemented using the MySQL database management system and Macromedia ColdFusion MX server technology. The original version was launched in January 2004 and this paper describes version 1.2, released on September 1. Although the database focuses on holding data from post-genomic experiments, it also provides a basic annotation of the E.coli K-12 genome. This annotation was based on EcoGene 16, which we considered the most accurate annotation in 2003 (9). However, in version 1.2, we have added a number of changes to the sequence as a result of additional sequence data presented in version m56 of the GenBank entry U00096, the first movement towards a single united annotation of the MG1655 genome sequence which will be held in the ASAP database (10).
Each gene record contains a functional description of the product, the location and direction of the gene on the chromosome and predictions of the properties of the gene product (Figure 1). The nucleotide and amino acid sequences are also held in the database. A simple graphical genome navigation tool has been incorporated into the gene page in version 1.2; it shows relative gene size and chromosomal position and uses a colour-coding scheme that reflects the type of feature being illustrated, e.g. protein-coding sequence, rRNA, sRNA. For additional information on gene products, EchoBASE links out to a suite of other databases that provide a variety of complementary data. For existing literature we link to EcoGene (9), for information on transcription units and metabolic pathways we link to EcoCyc (11), for data on protein modules and families we link to GenProtEC (12) and for comparative genomics tools we link to coliBASE (13).
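As an illustration only, the gene record described above could be modelled along the following lines. This is a minimal sketch and not the EchoBASE schema: the class name, field names, identifier format, colour values and link placeholders are all assumptions introduced here, and the example gene is invented.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GeneRecord:
    """Hypothetical in-memory view of one gene page (not the real schema)."""
    eb_number: str        # unique identifier; the format used here is assumed
    name: str
    description: str      # functional description of the product
    start: int            # 1-based chromosomal coordinates (assumed convention)
    end: int
    strand: str           # "+" or "-", the direction of the gene
    feature_type: str     # e.g. "CDS", "rRNA", "sRNA"
    nucleotide_seq: str = ""
    protein_seq: str = ""
    # Link-outs keyed by resource name; values are placeholders, not real IDs.
    external_links: Dict[str, str] = field(default_factory=dict)

    def length(self) -> int:
        """Length of the feature in base pairs."""
        return self.end - self.start + 1


# Colour coding by feature type; the actual colours used on the site are unknown.
FEATURE_COLOURS = {"CDS": "blue", "rRNA": "red", "sRNA": "green"}


def colour_for(record: GeneRecord) -> str:
    """Pick a display colour, falling back to grey for unlisted feature types."""
    return FEATURE_COLOURS.get(record.feature_type, "grey")


# Invented example record, for demonstration only.
example = GeneRecord(
    eb_number="EB0000",
    name="exampleA",
    description="gene product of unknown function",
    start=100,
    end=399,
    strand="+",
    feature_type="CDS",
    external_links={"EcoGene": "placeholder", "EcoCyc": "placeholder",
                    "GenProtEC": "placeholder", "coliBASE": "placeholder"},
)
```

Keeping the link-outs in a dictionary keyed by resource name mirrors the idea that each gene page points to several complementary databases without storing their content locally.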
We have added some novel in-house whole-genome annotation, the most useful being a genome-wide survey of predicted subcellular location, the EchoLOCATION feature. The subcellular location of a protein can often provide insight into its functional role in the cell, and we have combined and manually processed data from SignalP v. 2.0 (14), LipoP v. 1.0 (15), TMHMM v. 2.0 (16) and HMMTOP (17) to make a prediction for each gene product.
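The text states that the four predictors were combined and manually processed, but it does not give the rules used. The sketch below therefore shows only one plausible way such predictions might be merged automatically; every rule, category label and field name in it is an assumption made for illustration, and disagreements are deliberately routed to manual review rather than resolved.

```python
from dataclasses import dataclass


@dataclass
class Predictions:
    """Per-protein predictor outputs, reduced to simple values (assumed form)."""
    signal_peptide: bool   # e.g. a positive call from a signal peptide predictor
    lipoprotein: bool      # e.g. a positive call from a lipoprotein predictor
    tmhmm_helices: int     # number of transmembrane helices from one predictor
    hmmtop_helices: int    # number of transmembrane helices from a second one


def predict_location(p: Predictions) -> str:
    """Combine predictor outputs using hypothetical, illustrative rules."""
    if p.lipoprotein:
        return "lipoprotein"
    tm_calls = (p.tmhmm_helices > 0, p.hmmtop_helices > 0)
    if all(tm_calls):
        # Both topology predictors agree that helices are present.
        return "integral membrane"
    if any(tm_calls):
        # The two topology predictors disagree; leave it to a curator.
        return "manual review"
    if p.signal_peptide:
        return "exported"
    return "cytoplasmic"
```

For example, `predict_location(Predictions(True, False, 0, 0))` yields `"exported"` under these assumed rules, while a protein with helices from only one topology predictor is flagged for a curator.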
| CURATION OF EXPERIMENTAL INFORMATION |
|---|
The focus of EchoBASE is on the integration of post-genomic data to provide greater insight into the potential functions of the many FUN genes. Currently, data is manually curated from published experimental research and sorted into one of the following types of experiment: proteomics; microarray; transcript level; genetics; biochemistry; bioinformatics; protein–protein interactions; and structure. Each experiment type has been designed to hold the wide range of data types that are commonly determined in experiments of that kind. The bioinformatics experiment type has the most generic schema owing to the large variety of data types that could be usefully included in the database. After an experiment is sorted into an experiment type, it is handled differently depending on whether it is a single or few gene experiment, which provides data for up to 15 different gene products, or a genome-wide experiment. Data from the former type are usually manually curated into the database, as in the previous catalogue version (7), and data from the latter are parsed from data sets provided directly within the publication, in supplementary information available from the publisher's WWW site, or through direct contact with the authors.
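The routing decision described above can be sketched in a few lines. The experiment types and the 15-gene-product threshold come from the text; the function name, the route labels and the validation behaviour are assumptions added for illustration.

```python
# Experiment types listed in the text.
EXPERIMENT_TYPES = {
    "proteomics",
    "microarray",
    "transcript level",
    "genetics",
    "biochemistry",
    "bioinformatics",
    "protein-protein interactions",
    "structure",
}

# A single or few gene experiment provides data for up to 15 gene products.
SMALL_SCALE_MAX_GENE_PRODUCTS = 15


def curation_route(experiment_type: str, n_gene_products: int) -> str:
    """Return "manual" for single/few gene experiments, "parsed" otherwise.

    The return labels are hypothetical names for the two handling paths.
    """
    if experiment_type not in EXPERIMENT_TYPES:
        raise ValueError(f"unknown experiment type: {experiment_type!r}")
    if n_gene_products < 1:
        raise ValueError("an experiment must report at least one gene product")
    if n_gene_products <= SMALL_SCALE_MAX_GENE_PRODUCTS:
        return "manual"
    return "parsed"
```

Under this sketch a proteomics paper reporting 3 gene products would be entered by hand, whereas a microarray study covering thousands of genes would be parsed from its data files.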
| SEARCHING AND MANIPULATING EXPERIMENTAL DATA IN EchoBASE |
|---|
The data held within EchoBASE can be navigated either by browsing/searching different experiment types or by searching for a particular gene. Browsing different experiment types returns a summary list of all experiments of a particular type, e.g. proteomics, from which one can be selected to view our full annotation of the paper. As well as a textual description of the paper and the methods used, the data presented in the manuscript can be displayed in a number of ways. For example, for a proteomics experiment, the data can be sorted by different molecular properties, such as pI observed or relative abundance. This allows the user new ways to look at published results to enable faster sorting and extraction of data useful to them. Once an interesting piece of data has been spotted for a gene of interest, the data relating to this particular gene and experiment can then be viewed in full.
An alternative route into the database is through a particular researcher. All experimental annotations are linked to individual principal investigators and the data can be browsed to see which experiments from a particular group are contained within the database. Currently, there are over 400 research groups in the database, which has been built from a list previously compiled in the E.coli index (8).
In version 1.2, we have implemented the Complex search, which increases the power of the database by allowing complex questioning of the data set. For example, if a researcher were looking for a candidate gene for a periplasmic binding protein involved in transporting an alternative nitrogen source, they could search the data for proteins that were (i) predicted to be periplasmic binding proteins (bioinformatics), (ii) demonstrated to be located in the periplasm (proteomic) and (iii) induced during nitrogen-limiting conditions (microarray). This could be combined with searches for proteins of a certain molecular weight range that are encoded by genes located at a certain position on the genome sequence.
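The periplasmic binding protein example amounts to intersecting the gene sets returned by several independent criteria. The sketch below shows that idea using Python's built-in `sqlite3` module and an in-memory table. It is not the EchoBASE implementation (which uses MySQL and ColdFusion); the table layout, column names, finding strings and gene names are all invented for illustration.

```python
import sqlite3

# Hypothetical single-table layout: one row per curated finding.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observation (gene TEXT, experiment_type TEXT, finding TEXT)"
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?)",
    [
        ("geneA", "bioinformatics", "predicted periplasmic binding protein"),
        ("geneA", "proteomics", "located in periplasm"),
        ("geneA", "microarray", "induced under nitrogen limitation"),
        ("geneB", "bioinformatics", "predicted periplasmic binding protein"),
        ("geneB", "proteomics", "located in periplasm"),
        ("geneC", "microarray", "induced under nitrogen limitation"),
    ],
)

CRITERIA = [
    ("bioinformatics", "predicted periplasmic binding protein"),
    ("proteomics", "located in periplasm"),
    ("microarray", "induced under nitrogen limitation"),
]


def complex_search(connection, criteria):
    """Return genes that satisfy every (experiment_type, finding) criterion."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    # One SELECT per criterion, combined with INTERSECT so that only genes
    # matching all of them survive.
    one = ("SELECT gene FROM observation "
           "WHERE experiment_type = ? AND finding = ?")
    query = " INTERSECT ".join([one] * len(criteria))
    params = [value for pair in criteria for value in pair]
    return sorted(row[0] for row in connection.execute(query, params))


candidates = complex_search(conn, CRITERIA)
```

With the invented rows above, only `geneA` meets all three criteria; dropping the microarray criterion would also return `geneB`. Further filters, such as a molecular weight range or a chromosomal position, could be added as extra intersected queries in the same way.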
| ORPHAN ENZYMES |
|---|
Information on orphan enzymes in E.coli has been extracted from three sources, which were initially merged into a single comprehensive list. Data was taken from a list published by Riley and Serres (18), a list from the EcoCyc database (11) and a list used to construct an in silico metabolic genotype by Edwards and Palsson (19). The merged list had 100 different enzyme/protein names that could be considered to be orphan enzymes, and 36 of these were immediately removed as they had actually been linked to genes. To assess the remaining 64 activities, a marking scheme (out of 100) was created to estimate how strong the evidence was to support each orphan enzyme. Details of the scheme can be found in the database, but it considers factors such as whether the activity had been purified with a known molecular weight and whether the locus linked to the activity had been mapped to a region on the genome. A score was given to 57 of the activities, varying between 3 and 83. Since creating the list in July 2003, 4 orphan enzymes have been mapped to their genes and 3 of these are in the top 7 scoring orphan enzymes in our list, strongly supporting our scoring system. The fourth activity that has been mapped was, surprisingly, ranked 29th, mainly because the only evidence came from mapping the locus to a small region on the genome.
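The list-building and scoring steps can be illustrated as follows. The actual marking scheme is available only in the database, so the weights and the third evidence category below are hypothetical; only the first two evidence factors, the three-source merge and the removal of already-mapped activities are taken from the text. The enzyme names are invented.

```python
def merge_orphan_lists(*sources):
    """Merge several name lists into one set, ignoring case and padding."""
    merged = set()
    for names in sources:
        merged.update(name.strip().lower() for name in names)
    return merged


# Hypothetical weights chosen to sum to 100; not the published scheme.
HYPOTHETICAL_WEIGHTS = {
    "purified_with_known_mw": 40,   # factor named in the text
    "locus_mapped": 30,             # factor named in the text
    "other_evidence": 30,           # invented catch-all category
}


def evidence_score(evidence):
    """Sum the weights of every evidence type recorded as present."""
    return sum(
        weight
        for key, weight in HYPOTHETICAL_WEIGHTS.items()
        if evidence.get(key)
    )


# Invented example data standing in for the three source lists.
list_one = ["enzyme A", "enzyme B"]
list_two = ["Enzyme B", "enzyme C"]
list_three = ["enzyme C ", "enzyme D"]

merged = merge_orphan_lists(list_one, list_two, list_three)
already_mapped = {"enzyme d"}
remaining = merged - already_mapped

ranked = sorted(
    {
        "enzyme a": evidence_score({"purified_with_known_mw": True,
                                    "locus_mapped": True}),
        "enzyme b": evidence_score({"locus_mapped": True}),
        "enzyme c": evidence_score({}),
    }.items(),
    key=lambda item: item[1],
    reverse=True,
)
```

The same subtraction applied to the figures in the text gives 100 − 36 = 64 activities left for assessment.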
Given the likelihood that some of these activities are side reactions of known enzymes and that some probably are not present in MG1655, we estimate that there are around 25 genuine orphan enzymes to be mapped to genes in E.coli K-12 MG1655. The list is continuously updated and newly mapped orphan enzymes are highlighted within the list and a link to the appropriate gene is created (Figure 2).
| ADDITIONAL FEATURES OF EchoBASE |
|---|
One feature we consider important is data tracking within the database so that all changes can be traced. Therefore, most pages in the database contain an Annotation history field. EchoBASE will be one of the first databases to have implemented the new sequence and annotation of the E.coli genome (version m56), a significant sequence and annotation change released in 2004. All the sequence changes that alter coding regions, which number over 150, can be found associated with each gene and are also described in the Annotation page of the database, which keeps a record of all changes to genes. Sequence changes result in the addition and removal of Blattner numbers as genes are split and fused, changes in the lengths of coding sequences as gene starts and stops are changed, and amino acid changes that have occurred as a result of base substitutions within protein-encoding genes.
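One simple way to make every change traceable is to append an entry to a history list whenever a stored value is altered. The sketch below illustrates that pattern only; the class names, field names, example values and date are assumptions, and it says nothing about how the Annotation history field is actually stored in EchoBASE.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class HistoryEntry:
    """One immutable record of a change to a gene annotation."""
    changed_on: date
    field_name: str
    old_value: str
    new_value: str
    note: str


@dataclass
class TrackedGene:
    """A gene annotation whose edits are logged (hypothetical structure)."""
    name: str
    values: Dict[str, str] = field(default_factory=dict)
    history: List[HistoryEntry] = field(default_factory=list)

    def update(self, field_name: str, new_value: str, note: str,
               changed_on: Optional[date] = None) -> None:
        """Change a value and log the change; no-op if nothing differs."""
        old_value = self.values.get(field_name, "")
        if old_value == new_value:
            return
        self.values[field_name] = new_value
        self.history.append(
            HistoryEntry(changed_on or date.today(), field_name,
                         old_value, new_value, note)
        )


# Invented example: a coding sequence end coordinate is revised.
gene = TrackedGene("geneA", {"end": "1200"})
gene.update("end", "1230", "stop codon moved after sequence correction",
            changed_on=date(2000, 1, 1))  # placeholder date
gene.update("end", "1230", "repeat of the same edit is not logged")
```

Because entries are only ever appended, the list can be replayed in order to reconstruct any earlier state of the annotation.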
EchoBASE also has a series of help pages to guide the user around the database and can provide data sets for other users that relate our unique identifier (the EB number) to those of the other resources we link out to from the gene page (Figure 1). Currently, we encourage researchers to send us details of their discoveries that we curate, but eventually we would like to move to a more community-based annotation.
| ACKNOWLEDGEMENTS |
|---|
We would like to acknowledge CNAP and the University of York for technical support and hosting EchoBASE, and Louise Fairweather for collecting information for EchoLOCATION during her MRes project. We thank GlaxoSmithKline and the BBSRC for financial support.
| Notes |
|---|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact journals.permissions{at}oupjournals.org.
| REFERENCES |
|---|
1. Blattner,F.R., Plunkett,G.,III, Bloch,C.A., Perna,N.T., Burland,V., Riley,M., Collado-Vides,J., Glasner,J.D., Rode,C.K., Mayhew,G.F. et al. (1997) The complete genome sequence of Escherichia coli K-12. Science, 277, 1453–1474.
2. Perna,N.T., Plunkett,G.,III, Burland,V., Mau,B., Glasner,J.D., Rose,D.J., Mayhew,G.F., Evans,P.S., Gregor,J., Kirkpatrick,H.A. et al. (2001) Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature, 409, 529–533.
3. Hayashi,T., Makino,K., Ohnishi,M., Kurokawa,K., Ishii,K., Yokoyama,K., Han,C.G., Ohtsubo,E., Nakayama,K., Murata,T. et al. (2001) Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res., 8, 11–22.
4. Welch,R.A., Burland,V., Plunkett,G.,III, Redford,P., Roesch,P., Rasko,D., Buckles,E.L., Liou,S.R., Boutin,A., Hackett,J. et al. (2002) Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc. Natl Acad. Sci. USA, 99, 17020–17024.
5. Hinton,J.C. (1997) The Escherichia coli genome sequence: the end of an era or the start of the FUN? Mol. Microbiol., 26, 417–422.
6. Serres,M.H., Gopal,S., Nahum,L.A., Liang,P., Gaasterland,T. and Riley,M. (2001) A functional update of the Escherichia coli K-12 genome. Genome Biol., 2, RESEARCH0035.
7. Thomas,G.H. (1999) Completing the E. coli proteome: a database of gene products characterised since the completion of the genome sequence. Bioinformatics, 15, 860–861.
8. Thomas,G.H. and Bettelheim,K.A. (1998) Escherichia coli on the WWW. Lett. Appl. Microbiol., 27, 122–123.
9. Rudd,K.E. (2000) EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res., 28, 60–64.
10. Glasner,J.D., Liss,P., Plunkett,G.,III, Darling,A., Prasad,T., Rusch,M., Byrnes,A., Gilson,M., Biehl,B., Blattner,F.R. et al. (2003) ASAP, a systematic annotation package for community analysis of genomes. Nucleic Acids Res., 31, 147–151.
11. Karp,P.D., Riley,M., Saier,M., Paulsen,I.T., Collado-Vides,J., Paley,S.M., Pellegrini-Toole,A., Bonavides,C. and Gama-Castro,S. (2002) The EcoCyc Database. Nucleic Acids Res., 30, 56–58.
12. Serres,M.H., Goswami,S. and Riley,M. (2004) GenProtEC: an updated and improved analysis of functions of Escherichia coli K-12 proteins. Nucleic Acids Res., 32, D300–D302.
13. Chaudhuri,R.R., Khan,A.M. and Pallen,M.J. (2004) coliBASE: an online database for Escherichia coli, Shigella and Salmonella comparative genomics. Nucleic Acids Res., 32, D296–D299.
14. Nielsen,H., Engelbrecht,J., Brunak,S. and von Heijne,G. (1997) A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int. J. Neural Syst., 8, 581–599.
15. Juncker,A.S., Willenbrock,H., von Heijne,G., Brunak,S., Nielsen,H. and Krogh,A. (2003) Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci., 12, 1652–1662.
16. Sonnhammer,E.L., von Heijne,G. and Krogh,A. (1998) A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 6, 175–182.
17. Tusnady,G.E. and Simon,I. (2001) The HMMTOP transmembrane topology prediction server. Bioinformatics, 17, 849–850.
18. Riley,M. and Serres,M.H. (2000) Interim report on genomics of Escherichia coli. Annu. Rev. Microbiol., 54, 341–411.
19. Edwards,J.S. and Palsson,B.O. (2000) The Escherichia coli MG1655 in silico metabolic genotype: its definition, characteristics, and capabilities. Proc. Natl Acad. Sci. USA, 97, 5528–5533.