ABSTRACT
GenProtEC is a database of Escherichia coli genes and their gene products,
classified by type of function and physiological role and with citations to the
literature for each. Also present are data on sequence similarities among
E.coli proteins with PAM values, percent identity of amino acids, length of
alignment and percent aligned. GenProtEC can also be accessed through the World
Wide Web at URL http://mbl.edu/html/ecoli.html .
GenProtEC contains the gene name, its synonyms, the SwissProt (1) mnemonic for
proteins when one has been assigned, its synonyms, the full gene product name,
and the Enzyme Commission (EC) number for enzymatic reactions. Up to three
literature references are supplied for each entry.
Data on physiological function and sequence similarity are also given. The 2193
gene products have been classified by type as an enzyme, a regulator, RNA, part
of the membrane, a member of the transport system, a protein factor, a carrier,
or a part of the structure of the cell other than membrane. Each gene product
has been assigned to at least one and at most four of 118 hierarchically
arranged categories of physiological function. Both an early version of the
classification of genes and gene products by function (2) and a more recent
version (3) are further refined in the more up-to-date electronic database.
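The fields described above can be pictured as a single record per gene product. The following is only a minimal sketch of such a record, not the actual GenProtEC schema: the class name `GeneEntry`, the field names and the spellings of the product types are all assumptions made for illustration, and the category label in the example is a placeholder rather than one of the 118 real categories.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Product types named in the text; the identifier spellings are assumptions.
PRODUCT_TYPES = {
    "enzyme", "regulator", "RNA", "membrane",
    "transport", "factor", "carrier", "structure",
}

MAX_REFERENCES = 3  # "up to three literature references"
MAX_CATEGORIES = 4  # one to four physiological categories per gene product


@dataclass
class GeneEntry:
    """Hypothetical in-memory record for one gene product."""
    gene: str
    product: str
    product_type: str
    categories: List[str]
    gene_synonyms: List[str] = field(default_factory=list)
    swissprot: Optional[str] = None   # only when a mnemonic has been assigned
    ec_number: Optional[str] = None   # only for enzymatic reactions
    references: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Enforce the limits stated in the text.
        if self.product_type not in PRODUCT_TYPES:
            raise ValueError(f"unknown product type: {self.product_type}")
        if not 1 <= len(self.categories) <= MAX_CATEGORIES:
            raise ValueError("an entry needs one to four categories")
        if len(self.references) > MAX_REFERENCES:
            raise ValueError("at most three references per entry")


example = GeneEntry(
    gene="lacZ",
    product="beta-galactosidase",
    product_type="enzyme",
    categories=["placeholder category"],  # not a real GenProtEC category
    swissprot="BGAL_ECOLI",
    ec_number="3.2.1.23",
)
```

Keeping the limits in `__post_init__` means an entry with too many categories or references is rejected when it is created rather than when it is later displayed.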
In addition, the sequence similarity of each protein to any other E.coli
protein is given, permitting the grouping together of E.coli proteins of
similar amino acid sequence. The database contains the results of similarity
analyses (4,5) that used the AllAllDB of the Darwin suite at Zurich (6),
requiring an alignment of at least 100 amino acids and a PAM score (accepted
point mutations) (7) of <200. Altogether 1347 of the E.coli K-12 chromosomally
encoded proteins had at least one E.coli protein partner with sequence
similarity as defined above. Proteins with more than one domain of >100 amino
acids were divided and each domain was treated separately. The resulting 1430
proteins/domains formed 3685 sequence-related pairs. The pairs were linked by
chains of similarities into sequence-related groups. As of October 1996 there
are 350 sequence-related groups of E.coli proteins, ranging in size from two to
63 members, and most or all members of each group are related by function as
well as by sequence.
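Linking pairs by chains of similarity amounts to finding the connected components of a graph whose nodes are proteins or domains and whose edges are the qualifying pairs. The sketch below illustrates this with a union-find structure. It is not the procedure used to build GenProtEC, which the text does not describe in detail; the function names, the tuple layout and the protein names are all hypothetical, and only the two thresholds come from the text.

```python
from collections import defaultdict

MIN_ALIGNMENT = 100  # minimum alignment length in amino acids
MAX_PAM = 200        # a pair qualifies only if its PAM score is below this


def filter_pairs(pairs):
    """Keep (a, b) from (a, b, alignment_length, pam) tuples meeting both thresholds."""
    return [(a, b) for a, b, length, pam in pairs
            if length >= MIN_ALIGNMENT and pam < MAX_PAM]


def group_by_similarity(pairs):
    """Merge pairs that share a member into groups (connected components)."""
    parent = {}

    def find(x):
        # Follow parent links to the group representative, halving the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for name in parent:
        groups[find(name)].add(name)
    # Largest groups first, ties broken alphabetically, for stable output.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Made-up pairs: (protein, protein, alignment length, PAM score).
raw_pairs = [
    ("protA", "protB", 150, 120),
    ("protB", "protC", 210, 180),
    ("protD", "protE", 120, 90),
    ("protA", "protF", 80, 100),   # alignment too short
    ("protE", "protG", 300, 230),  # PAM score too high
]
groups = group_by_similarity(filter_pairs(raw_pairs))
print(groups)
```

In this toy input protA and protC end up in the same group although they were never paired directly, because both are paired with protB.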
One can query GenProtEC with a gene name or a synonym, with a SwissProt name or
a synonym, with a string describing the gene product, or with a key for
physiological category. Complete pick lists are available for each of these.
The search can be refined by adding more terms in the logical relationships
AND, OR, AND/OR. Information on the gene product and the function of the gene
product is returned, as well as sequence similarities among E.coli proteins.
For any protein that has at least one sequence-related partner, the name(s) of
all other E.coli proteins in the related group are returned. For any
sequence-related pair, the position and length of the alignment for each of the
two proteins are given, as well as the percent of the protein aligned, the
percent identical amino acids and the PAM score. As new E.coli protein
sequences appear in the SwissProt database (1), information on their sequence
relationships within the E.coli K-12 chromosome will be incorporated into
GenProtEC.
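The way several search terms combine can be illustrated with a small filter over in-memory records. This is a sketch under stated assumptions, not the Web interface's implementation, which the text does not describe: the record layout, the field names and the functions `matches` and `query` are hypothetical, and matching is assumed to be a case-insensitive substring test.

```python
def matches(record, field_name, term):
    """Case-insensitive substring match of term against one field (string or list)."""
    value = record.get(field_name, "")
    values = value if isinstance(value, list) else [value]
    return any(term.lower() in v.lower() for v in values)


def query(records, terms, mode="AND"):
    """Return records satisfying all (AND) or any (OR) of the (field, term) pairs."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    combine = all if mode == "AND" else any
    return [r for r in records
            if combine(matches(r, f, t) for f, t in terms)]


# A few illustrative records; the layout is an assumption.
EXAMPLE_RECORDS = [
    {"gene": "lacZ", "synonyms": [], "product": "beta-galactosidase"},
    {"gene": "lacY", "synonyms": [], "product": "lactose permease"},
    {"gene": "araA", "synonyms": [], "product": "L-arabinose isomerase"},
]

both = query(EXAMPLE_RECORDS, [("gene", "lac"), ("product", "permease")], mode="AND")
either = query(EXAMPLE_RECORDS, [("gene", "araA"), ("product", "permease")], mode="OR")
print([r["gene"] for r in both])
print([r["gene"] for r in either])
```

Adding a term under AND can only narrow the result set, while adding one under OR can only widen it, which is why AND is the natural choice for refining a search.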
The coupling of sequence similarity and similarity of function may continue to
be useful as a guide to interpretation of the physiological role of protein
sequences from other organisms whose biology is less well known than that of
E.coli (e.g. ref. 8). In using the information for E.coli to determine what
functions another organism possesses, it is important to keep in mind that
protein sequence and function are not always correlated. Among 103 pairs or
triplets of E.coli enzymes that catalyse the same biochemical reactions, 60%
are similar in amino acid sequence as one might expect (PAM value <250), but
the other 40% have little or no relationship of sequence (PAM value >250)
(3,4). Therefore the absence of one class of amino acid sequence from an
organism does not indicate whether the corresponding function is absent or
whether another protein of unrelated sequence is present that might carry out
the function in question.
The database can be queried directly on the World Wide Web through the URL
http://mbl.edu/html/ecoli.html . Feedback and corrections will be gratefully
received, and it will soon be possible to enter comments directly at the Web
site. Users are kindly asked to cite this article.
Grateful thanks to David Space and David Remsen, Information Services Division,
Marine Biological Laboratory, for invaluable programming and site design.
REFERENCES