Nucleic Acids Research, 2004, Vol. 32, Database issue D300-D302
© 2004 Oxford University Press

GenProtEC: an updated and improved analysis of functions of Escherichia coli K-12 proteins

Margrethe H. Serres, Sulip Goswami and Monica Riley*

Marine Biological Laboratory, Woods Hole, MA 02540, USA

*To whom correspondence should be addressed. Tel: +1 508 289 7612; Fax: +1 508 457 4727; Email: mriley{at}mbl.edu

Received September 12, 2003; Revised and Accepted October 6, 2003


    ABSTRACT
Using more than one approach to characterizing functions of unknown proteins, we now present in GenProtEC (http://genprotec.mbl.edu/) some level of function information for 87% of Escherichia coli K-12 proteins. A new approach that has yielded new information entails assigning content of structural domains and their functions to E.coli proteins. In addition, some earlier methods have been further refined to provide more meaningful data. The process of identifying and separating multimodular or fused proteins into component modules has been improved. As a result, groups of sequence-similar (paralogous) proteins have been refined. Experimental information from recent literature on previously unknown genes has been incorporated. We now use a rich system of characterizing cell roles which accents the fact that many proteins play more than one cellular role and therefore carry more than one designation from our detailed catalog of roles, MultiFun.


    INTRODUCTION
Since GenProtEC (Genes and Proteins of Escherichia coli) was first launched in 1995, its main goal has been to provide descriptions of functions for gene products encoded by the E.coli K-12 (strain MG1655) chromosome (1). GenProtEC was recently rebuilt as a MySQL-based database, and new features have been added that supplement the approach of defining functions through sequence analysis of the encoded proteins. The database contains 4400 gene products, of which 116 are RNA molecules and 4284 are proteins, using the coding DNA sequences (CDSs) from GenBank Accession No. U00096 with updates (2) (G. Plunkett, III, personal communication). Each gene product is represented by a gene page, which can be queried through various identifiers, including gene name, Blattner number (bnumber), Swiss-Prot ID, Enzyme Commission (EC) number(s) and gene product description. If more than one gene product matches the query, e.g. multiple enzymes with the same EC number, a short list is generated from which the user can select. The gene page gives function assignments for the gene product. The basis for the function description (experimental evidence, phenotype, sequence similarity, similarity to structural domain(s) or membership in a biochemically related group) is given in addition to links to published references. Also provided are gene synonym(s) and identifiers, gene type (such as enzyme, transport protein, regulatory protein, membrane protein) and EC number(s) for enzymes. Cellular role assignment(s) according to the MultiFun classification system (3) are also given for many of the gene products.

Genome-wide information is given on other accessible pages. The information on these pages is as follows:

(i) distribution of gene product types in E.coli;

(ii) multimodular proteins containing two or more independently encoded functions;

(iii) groups of sequence similar proteins (paralogs) and their group functions;

(iv) proteins grouped by biochemical and structural similarity;

(v) the MultiFun classification system for cellular roles;

(vi) a table of the most frequently found structural domains as classified by SCOP (4) present in E.coli.

The distribution of sources of knowledge about E.coli gene products is shown in Table 1. Sequence similarity continues to provide the greatest amount of non-experimental information, with domain structures and protein families making unique and valuable contributions. Table 2 shows the distribution of the types of gene products. Enzymes, including both experimentally determined and putative assignments, make up over one-third of the chromosomally encoded gene products. Some information on functional characteristics is available for 86.9% of the proteins, while none is yet available for the remaining 13.1%. New information includes functions inferred from structural domains present in previously unknown gene products. There are presently 3395 literature references linked to 2410 of the gene products. New functions continue to be discovered for E.coli gene products by experimentation. Since our functional update published in 2001 (2), we have entered 106 new experimentally based function assignments into our database. One or more cellular role assignments have been made to 3344 of the E.coli gene products.


Table 1. Sources of knowledge about E.coli K-12 gene products
 

Table 2. Distribution of types of E.coli gene products
 

    NEW FEATURES TO IMPROVE AND EXPAND FUNCTIONAL ANNOTATION
Structural domains
Structurally based domains in E.coli proteins have been identified in the Superfamily library of the SCOP database (4,5). These domains represent elements of known function such as binding sites, catalytic sites and more. We have made use of these domain assignments to enrich the annotation for the E.coli proteins whose function is either not known (currently annotated as conserved proteins, conserved hypothetical proteins or unknown CDSs) or is predicted by sequence similarity (putative). Unknowns with no function assignments were reduced from 760 (17.3%) to 578 (13.1%). Although domain functions do not represent the entire function of the gene product, they can give important clues about the activity of the gene product. Domain function descriptions have also been added to 687 of the gene products with putative function assignments to enrich their annotation.
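The percentages quoted above are consistent with a denominator of all 4400 gene products rather than the 4284 proteins. The short check below is only a sketch under that assumption; the denominator is inferred from the Introduction and is not stated explicitly in this section.

```python
# Check the reported reduction in gene products without function assignments.
# Assumption: percentages are taken relative to all 4400 gene products.
TOTAL_GENE_PRODUCTS = 4400


def percent(count, total=TOTAL_GENE_PRODUCTS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


unknown_before = 760  # before domain-based annotation (from the text)
unknown_after = 578   # after domain-based annotation (from the text)

pct_before = percent(unknown_before)
pct_after = percent(unknown_after)
newly_annotated = unknown_before - unknown_after

print(pct_before, pct_after, newly_annotated)
```

Under this assumption the counts reproduce the quoted 17.3% and 13.1%, and the difference gives the number of previously unknown gene products that gained some domain-level annotation.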

Protein families related by biochemistry and structure
We have identified members in E.coli of some well-known biochemically related families of proteins. The members of each family are related by structurally similar binding and catalytic sites (known or imputed) and by biochemical mechanism of action. Some of the proteins in the groups are only distantly related when assessed by sequence criteria alone. Specifically, we have studied four families in detail: pyridoxal phosphate-dependent aminotransferases, thiamine-diphosphate (TPP)-dependent decarboxylases, crotonases and ATP-dependent CoA ligases.


    REVISION OF EARLIER ANALYSES
Identification of multifunctional proteins
Some proteins are multifunctional, presumably as a consequence of gene fusion, and represent two or more separately encoded functions. Gene fusion events evidently are dynamic and evolutionarily recent, as the fusion partners differ from one bacterial species to another. To identify the fusion of proteins with truly independent origins, one needs to distinguish complete protein functions from large functional domains. We have expanded the data set used to identify fused proteins from sequence alignments by including alignments with polypeptides encoded by 49 additional genomes. The sequence alignments were done by Darwin (6) with the requirement that an alignment be at least 83 amino acids long and have a PAM (point accepted mutation) distance of 200 or less. E.coli sequences that aligned to the full length of more than one single-function polypeptide (either paralogous or orthologous) were split into modules as shown in Figure 1. Currently we have identified 101 fused proteins with two independent functions and seven proteins with three independent functions. The average length of the fused proteins is 635 amino acids and the average length of their encoded modules is 300 amino acids. The module size is thus comparable to the average of 309 amino acids for the remaining (non-fused) or unimodular proteins.
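The selection logic described above can be illustrated with a small sketch. It is not the Darwin-based pipeline itself: the `Alignment` record, the 0.9 coverage cut-off used to approximate "full length", the merging of overlapping hits, and all coordinates in the toy input are assumptions made purely for illustration. Only the 83-residue and PAM 200 thresholds come from the text.

```python
from dataclasses import dataclass

MIN_ALIGNMENT_LENGTH = 83    # amino acids (from the text)
MAX_PAM_DISTANCE = 200       # PAM units (from the text)
MIN_PARTNER_COVERAGE = 0.9   # assumption: stand-in for "full length"


@dataclass
class Alignment:
    """Hypothetical record for one pairwise alignment (1-based, inclusive)."""
    query_start: int
    query_end: int
    partner_id: str
    partner_length: int
    partner_start: int
    partner_end: int
    pam: float


def passes_thresholds(a):
    """Apply the length and PAM-distance requirements."""
    length = a.query_end - a.query_start + 1
    return length >= MIN_ALIGNMENT_LENGTH and a.pam <= MAX_PAM_DISTANCE


def covers_partner(a):
    """True if the alignment spans (nearly) the whole partner polypeptide."""
    covered = a.partner_end - a.partner_start + 1
    return covered / a.partner_length >= MIN_PARTNER_COVERAGE


def candidate_modules(alignments):
    """Return non-overlapping query regions, each matched by a full-length partner."""
    hits = [a for a in alignments if passes_thresholds(a) and covers_partner(a)]
    hits.sort(key=lambda a: a.query_start)
    modules = []
    for a in hits:
        if not modules or a.query_start > modules[-1][1]:
            modules.append((a.query_start, a.query_end))
        else:
            # Overlapping hits are treated as support for the same module.
            start, end = modules[-1]
            modules[-1] = (start, max(end, a.query_end))
    return modules


# Toy input with invented coordinates for a single query protein.
alignments = [
    Alignment(5, 450, "partner_A", 460, 3, 455, 120.0),    # full-length partner
    Alignment(470, 800, "partner_B", 340, 1, 335, 150.0),  # full-length partner
    Alignment(100, 160, "partner_C", 61, 1, 61, 90.0),     # shorter than 83 residues
    Alignment(200, 400, "partner_D", 900, 300, 500, 110.0),  # partial partner only
    Alignment(480, 790, "partner_E", 320, 5, 315, 250.0),  # PAM distance above 200
]

modules = candidate_modules(alignments)
is_fused = len(modules) >= 2
print(modules, is_fused)
```

In this toy case two separate regions of the query are each matched by a different full-length partner, so the query would be flagged as a candidate fused protein with two modules. A partner that aligns over only part of its own length (partner_D) does not count, which is the distinction drawn above between complete protein functions and large functional domains.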



Figure 1. Identification of a fused gene product. Darwin-generated sequence alignments of E.coli proteins against proteins of 50 genomes were used to identify gene products with two or more separately encoded functions. The alignments with ThrA (GI:1786183) support the separation of the protein into two modules. The full-length alignment of Vibrio cholerae aspartokinase locates the N-terminally encoded function while the full-length alignment of the Sulfolobus solfataricus homoserine dehydrogenase locates the C-terminally encoded function. Alignment regions for the respective proteins are shown in brackets.

 
In GenProtEC fused genes are presented in fused form and are annotated with their complete set of functions. In addition, independent functions are listed for the component modules. For ThrA, as an example, the complete function is bifunctional: aspartokinase I (N-terminal); homoserine dehydrogenase I (C-terminal). The two modules are annotated separately as b0002_1 and b0002_2, with the two enzymes and EC numbers separately attributed. MultiFun roles are necessarily assigned at the level of the module.

Sequence-similar (paralogous) protein groups
E.coli contains many families of proteins of similar sequence, reflecting paralogous groups of genes (7–9). In order to delineate the paralogous groups by sequence similarity, one must separate the multimodular proteins into units of independent origin and function to avoid artificial connections between unrelated proteins (9). Using the revised list of modules, pairwise protein sequence similarities were detected by Darwin (6) as detailed above. The proteins were clustered by a transitive grouping process as described by Liang et al. (7). The revised grouping placed 1999 protein modules in 498 protein groups. Group size ranges from 2 to 93 members. The majority of the groups have two (55%) or three (20%) members. Only 29 groups have 10 or more members. Annotation information for unknown proteins can be derived from membership in sequence-similar groups whose common function is clear.
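Transitive grouping can be pictured as single-linkage clustering over the pairwise similarity hits: if module A is similar to B and B is similar to C, all three fall into one group even if A and C do not align directly. The sketch below uses a union-find structure to show the idea. It is an illustration only; the procedure of Liang et al. (7) may apply additional criteria, and the module names are placeholders.

```python
def transitive_groups(pairs):
    """Cluster items so that any chain of pairwise similarity ends up in one group."""
    parent = {}

    def find(x):
        # Locate the representative of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every similar pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect members under their final representative.
    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)

    # Largest groups first, members sorted for stable output.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Placeholder module identifiers; each tuple is one pairwise similarity hit.
similar_pairs = [("m1", "m2"), ("m2", "m3"), ("m4", "m5")]
groups = transitive_groups(similar_pairs)
print(groups)
```

Because unrelated modules joined in one fused protein would link otherwise separate groups through such chains, splitting fused proteins into modules before clustering matters for this kind of grouping. Modules with no qualifying hit never enter the structure, which matches a minimum group size of two.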

Cellular role (MultiFun) assignments for E.coli gene products
The E.coli gene products are characterized by their cellular role according to the MultiFun classification system (3). This is a hierarchical classification system consisting of 10 main role categories that are further subdivided. The classification system has been expanded recently to include additional metabolic pathways as well as categories for the substrates of transport proteins. MultiFun is currently also used by A Systematic Annotation Package for community analysis of genomes (ASAP) (10) and in the EcoCyc database (11). We continually add cellular role assignments to our database and have at present made over 8400 assignments, representing on average more than two assignments per gene product. Multiple cell roles are assigned to a gene product in order to characterize its activity in the cell more completely.
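A multi-role assignment can be represented as a mapping from each gene product to a list of hierarchical role codes. The sketch below is illustrative only: the dotted codes and gene names are invented placeholders, not actual MultiFun entries. The final line computes the average implied by the figures in this article, assuming the 8400 assignments are spread over the 3344 gene products reported in the Introduction as having at least one role.

```python
from collections import Counter

# Hypothetical assignments: gene product -> hierarchical role codes.
# The codes are placeholders in a dotted form, not real MultiFun entries.
assignments = {
    "gene_A": ["1.5.1", "1.5.3"],
    "gene_B": ["4.2", "1.1.2", "6.1"],
    "gene_C": ["2.3"],
}


def top_level(role_code):
    """Return the main role category (first component) of a dotted code."""
    return role_code.split(".")[0]


def main_category_counts(assignments):
    """Count gene products per main category, counting each product once per category."""
    counts = Counter()
    for roles in assignments.values():
        for category in {top_level(role) for role in roles}:
            counts[category] += 1
    return counts


counts = main_category_counts(assignments)
mean_roles = sum(len(roles) for roles in assignments.values()) / len(assignments)

# Average implied by the article's figures (assumed denominator: 3344 gene products).
reported_mean = round(8400 / 3344, 2)

print(dict(counts), mean_roles, reported_mean)
```

Rolling sub-roles up to their main category in this way lets a gene product with several roles be counted once in each main category it touches.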


    SUMMARY
GenProtEC presents information on the functions of E.coli K-12 MG1655 gene products from several points of view. E.coli proteins as single modules have been grouped by sequence similarity. Because members of a group tend to share a function, open reading frames of unknown function within a group can be assigned the general function of that group. In addition, the presence of domains of known function within E.coli proteins has been determined. Domain content permits annotation of some functional information to otherwise totally unknown sequences. The rich classification of cellular roles, MultiFun, has been applied, underlining the fact that many gene products have more than one cellular role. All the data presented in GenProtEC are made easily accessible to users through downloadable flat files in text format.

Supplementary data are available online as follows: gene page list for ThrA (http://genprotec.mbl.edu/result.php?search_field=Gene+Name&field_value=thra); multimodular E.coli proteins (http://genprotec.mbl.edu/prot_mod.php); groups of E.coli proteins (modules) related by sequence similarity (http://genprotec.mbl.edu/prot_grp.php); groups of E.coli proteins related by biochemistry and structure (http://genprotec.mbl.edu/bio_group.html).
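The gene page address given for ThrA follows an ordinary query-string pattern, so addresses for other gene names can be assembled in the same way. The sketch below only builds the address string. The helper name `gene_page_url` is hypothetical, `Gene Name` is the only search field shown in this article (any other field values would be guesses), and whether the server still responds at this address is not checked here.

```python
from urllib.parse import urlencode

# Base address taken from the ThrA example in the text.
BASE_URL = "http://genprotec.mbl.edu/result.php"


def gene_page_url(field_value, search_field="Gene Name"):
    """Build a gene page query address in the form shown for ThrA."""
    # urlencode encodes the space in "Gene Name" as "+".
    query = urlencode({"search_field": search_field, "field_value": field_value})
    return f"{BASE_URL}?{query}"


url = gene_page_url("thra")
print(url)
```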


    ACKNOWLEDGEMENTS
 
We thank Amar Singh for redesigning and programming GenProtEC, and Edward A. Adelberg for providing up-to-date references on E.coli gene products. Incorporation of SCOP domains was done in collaboration with Sarah A. Teichmann.


    REFERENCES

  1. Riley,M. and Space,D.B. (1996) Genes and proteins of Escherichia coli (GenProtEc). Nucleic Acids Res., 24, 40.

  2. Serres,M.H., Gopal,S., Nahum,L.A., Liang,P., Gaasterland,T. and Riley,M. (2001) A functional update of the Escherichia coli K-12 genome. Genome Biol., 2, RESEARCH0035.

  3. Serres,M.H. and Riley,M. (2000) MultiFun, a multifunctional classification scheme for Escherichia coli K-12 gene products. Microb. Comp. Genomics, 5, 205–222.

  4. Lo Conte,L., Brenner,S.E., Hubbard,T.J., Chothia,C. and Murzin,A.G. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.

  5. Gough,J. and Chothia,C. (2002) SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res., 30, 268–272.

  6. Gonnet,G.H., Hallett,M.T., Korostensky,C. and Bernardin,L. (2000) Darwin v. 2.0: an interpreted computer language for the biosciences. Bioinformatics, 16, 101–103.

  7. Liang,P., Labedan,B. and Riley,M. (2002) Physiological genomics of Escherichia coli protein families. Physiol. Genomics, 9, 15–26.

  8. Liang,P. and Riley,M. (2001) A comparative genomics approach for studying ancestral proteins and evolution. Adv. Appl. Microbiol., 50, 39–72.

  9. Labedan,B. and Riley,M. (1995) Gene products of Escherichia coli: sequence comparisons and common ancestries. Mol. Biol. Evol., 12, 980–987.

  10. Glasner,J.D., Liss,P., Plunkett,G.,III, Darling,A., Prasad,T., Rusch,M., Byrnes,A., Gilson,M., Biehl,B., Blattner,F.R. et al. (2003) ASAP, a systematic annotation package for community analysis of genomes. Nucleic Acids Res., 31, 147–151.

  11. Karp,P.D., Riley,M., Saier,M., Paulsen,I.T., Collado-Vides,J., Paley,S.M., Pellegrini-Toole,A., Bonavides,C. and Gama-Castro,S. (2002) The EcoCyc Database. Nucleic Acids Res., 30, 56–58.




