ABSTRACT
The Encyclopedia of Genes and Metabolism (EcoCyc) is a database that combines
information about the genome and the intermediary metabolism of Escherichia
coli. It describes 2970 genes of E.coli, 547 enzymes encoded by these genes,
702 metabolic reactions that occur in E.coli and the organization of these
reactions into 107 metabolic pathways. The EcoCyc graphical user interface
allows scientists to query and explore the EcoCyc database using visualization
tools such as genomic-map browsers and automatic layouts of metabolic pathways.
EcoCyc spans the space from sequence to function to allow scientists to
investigate an unusually broad range of questions. EcoCyc can be thought of
both as an electronic review article, because of its copious references to the
primary literature, and as an in silico model of E.coli metabolism that can be
probed and analyzed through computational means.
The Encyclopedia of Escherichia coli Genes and Metabolism (EcoCyc) is a
database (DB) that combines information about the genome and the intermediary
metabolism of E.coli K-12. It describes the known genes of E.coli, the enzymes
of small-molecule metabolism that are encoded by these genes, the reactions
catalyzed by each enzyme and the organization of these reactions into metabolic
pathways. The EcoCyc graphical user interface (GUI) allows scientists to query,
explore and visualize the EcoCyc DB. EcoCyc spans the space from sequence to
function to allow scientists to investigate an unusually broad range of
questions (5).
This article describes the scope of EcoCyc, the conceptualization employed to
structure the database, the sources from which we obtained the EcoCyc data, and
the procedures used to construct the database and to verify its correctness.
The article also describes our software for retrieving and visualizing EcoCyc
data. We request that users of EcoCyc cite this article in publications related
to its use.
EcoCyc can be viewed as an electronic review article because it is a carefully
sifted collection of information drawn largely from (and containing 1400
citations to) the primary literature. EcoCyc is also designed to facilitate
complex computations on genomic and metabolic data, that is, to provide an
in silico model of E.coli that can be probed and analyzed through computational
means. Among the problems that might be addressed using EcoCyc are the
following (some of these tasks are not directly supported by the EcoCyc user
interface and would require additional programming).
Because of its links to sequence DBs such as Swiss-Prot, EcoCyc could be used
to perform function-based retrieval of DNA or protein sequences, for example to
prepare datasets for studies of protein structure-function relationships.
Scientists who study the evolution of metabolism could use EcoCyc to search out
examples of duplication and divergence of enzymes and pathways. Systematic
computational studies of pathway evolution can compare related pathways from
different organisms. EcoCyc provides a foundation for performing simulations of
metabolism, although it currently lacks the kinetics data needed by most
simulation techniques.
EcoCyc has been used to predict the metabolic complement of Haemophilus
influenzae from its genomic sequence (14). That metabolic prediction was
materialized in DB form and combined with the EcoCyc software to create an
encyclopedia of the H.influenzae genome, called HinCyc. This metabolic-analysis
technique extracts an added level of biological information from a genomic
sequence, and provides a biological validation of the gene identifications
predicted by sequence analysis.
Biotechnologists seek to design novel biochemical pathways that produce useful
chemical products (such as pharmaceuticals), or that catabolize unwanted
chemicals such as toxins. EcoCyc provides the wiring diagram of E.coli K-12,
which approximates the starting point for engineering; EcoCyc also describes
the potential engineering variations that can result from importing E.coli
enzymes into other organisms.
In the past year we have supplemented the EcoCyc data with 25 new pathways, and
we finalized the descriptions of several pathways that previously were
described only partially (these pathways contained little information about
their enzymes). We also added ~50 reactions of intermediary metabolism that play multiple metabolic roles,
depending on conditions, and thus are not assigned to any particular pathway.
Many citations were added to EcoCyc; it now contains 1400 literature citations.
Gene information was downloaded from a new version (version 7) of the EcoGene
database compiled by Rudd and Berlyn (2), and additional genes were added by
our group. We added a second gene taxonomy based on product types, as described
in the Genes section. Finally, we revised the representation of cofactors and
of coenzymes within EcoCyc.
EcoCyc is now available on both Solaris and SunOS for the Sun workstation.
The GUI was enhanced to include two new visualizations: the Overview, and the
Gene-Reaction Schematic. The Overview provides a bird's-eye view of the entire
metabolic map of E.coli, as shown in Figure 1. Users can interrogate the
Overview by moving the mouse over any reaction step or compound in the diagram,
causing EcoCyc to display the name of the compound or reaction, and the name of
its containing pathway, in a separate window. The user can also request that
EcoCyc highlight one or more entities within the Overview, such as a specified
compound, pathway, or enzyme. The Overview is also useful for comparative
analysis of metabolic pathways, such as by highlighting the subset of pathways
that are predicted to occur in H.influenzae (14). The Overview was generated
through a combination of automatic layout of clusters of pathways and manual
layout of the resulting clusters.
The EcoCyc GUI (4) provides graphical tools for visualizing and navigating
through an integrated collection of metabolic and genomic information (its
retrieval capabilities are described in Section 7). For each type of biological
object in the EcoCyc DB, the GUI provides a corresponding visualization tool.
These tools dynamically query the underlying DB to produce display windows such
as those shown in Figure 2, Figure 3 and Figure 4. Other displays are provided
for genes, enzymes and compounds. All display algorithms are parameterized to
allow the user to select the visual presentation of an object that is most
informative. For example, the algorithms that produce automatic layouts of
metabolic pathways can suppress the display of enzyme names or side-compound
names; they can also draw chemical structures for the compounds within a
pathway. More details on the display algorithms can be found in (6).
The EcoCyc data are stored within a frame knowledge representation system (FRS)
called Ocelot. FRSs use an object-oriented data model, and have several
advantages over relational DB management systems (3). FRSs organize information
within classes: collections of objects that share similar properties and
attributes. The EcoCyc schema is based on the class hierarchy shown in Figure 5
(10). All the biological entities described in EcoCyc are instances of the
classes in Figure 5. For example, each E.coli gene is represented as an
instance of the class Genes, and every known polypeptide is an instance of the
class Polypeptides.
The current size of each class is shown in Table 1. These statistics pertain to
EcoCyc version 3.3. The next released version of EcoCyc, which should be
available by the time this article is published, should be complete in that it
will contain all known enzymes and pathways of E.coli small-molecule metabolism
(more enzymes will probably be discovered once the full E.coli sequence is
known).
Each EcoCyc frame contains slots that describe attributes or properties of the
biological object that the frame represents, or that encode a relationship
between that object and other objects. For example, the slots of a polypeptide
frame encode the molecular weight of the polypeptide, the gene that encodes it,
and its cellular location.
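The frame-and-slot idea can be sketched in a few lines of Python. This is only an illustration and not the Ocelot API: the `Frame` class, the slot names, the identifiers `gene-X` and `polypeptide-X`, and the slot values are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Frame:
    """Hypothetical stand-in for an FRS frame: one object with named slots."""
    frame_id: str
    frame_class: str                     # e.g. "Genes" or "Polypeptides"
    slots: Dict[str, Any] = field(default_factory=dict)

    def get(self, slot: str, default: Any = None) -> Any:
        """Return the value stored in a slot, or a default if it is empty."""
        return self.slots.get(slot, default)


# Placeholder objects; the values are illustrative, not EcoCyc data.
gene = Frame("gene-X", "Genes", {"product": "polypeptide-X"})
polypeptide = Frame(
    "polypeptide-X",
    "Polypeptides",
    {
        "molecular-weight": 30.0,        # attribute slot (made-up value)
        "gene": "gene-X",                # relationship slot: points to a frame
        "location": "cytoplasm",         # attribute slot
    },
)

# Relationship slots hold the identifier of another frame, so the two
# objects can be traversed in either direction.
frames = {f.frame_id: f for f in (gene, polypeptide)}
encoding_gene = frames[polypeptide.get("gene")]
```

Storing relationships as slot values that name other frames is what lets a display tool move from a polypeptide to its gene and back without duplicating data.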
The current scope of metabolic information within EcoCyc is intermediary
metabolism only; EcoCyc does not cover macromolecule metabolism such as DNA
replication or repair, nor transcription, nor translation. It does, however,
describe tRNA charging. In the future, we plan to extend EcoCyc to describe
various other aspects of cell function, including the preceding.
Most information on E.coli genes in EcoCyc was obtained from the EcoGene DB
version 7 (2). EcoGene provides synonyms for gene names, physical map positions
for all sequenced genes, and the direction of transcription for each gene. We
supplemented the information in EcoGene significantly by adding descriptions of
additional E.coli genes obtained from the literature and from SwissProt. EcoCyc
contains 2970 genes, of which 2571 have assigned genomic map positions. The
E.coli genomic map can be viewed with both circular and linear map-browsing
tools that provide multiple levels of magnification within the chromosome.
EcoCyc classifies genes by using two classification systems. The first is based
on the physiological role of the gene product (e.g., all genes whose products
are involved in tryptophan biosynthesis are in a single category) (18). The
second system is coarser, and assigns each gene to one of the following 10
product types: Enzyme-Genes, Regulator-Genes, Leader-Genes, Membrane-Genes,
Transport-Genes, Structural-Genes, RNA-Genes, Phenotype-Genes, Factor-Genes and
Carrier-Genes.
The class Chemicals subsumes all chemical compounds in the E.coli cell, such as
macromolecules and smaller compounds that act as enzyme substrates, activators
and inhibitors. It also includes some of the elements of the periodic table.
This section focuses on small metabolites contained in EcoCyc, which are
instances of the subclass of Chemicals called Compounds. These compounds are
reaction substrates, and enzyme cofactors, activators and inhibitors.
EcoCyc contains 1283 compounds; two-dimensional structures are recorded for 965 of them. Among the properties
encoded for compounds are synonyms for their names, molecular weight, empirical
formula, lists of bonds and atoms that encode chemical structures, and two-dimensional display coordinates for each atom that permit drawings of
compound structures.
The compounds were obtained from a variety of sources (12), and we continue to
update the compound data within EcoCyc by adding new compounds, adding
structures for existing compounds, and correcting errors.
Comprehensive compound data have been surprisingly useful to this project.
Reaction equations in the literature use many different names to refer to the
same compound. We determine if two reactions are the same by asking if their
products and reactants are the same, making frequent use of our comprehensive
compound synonym lists.
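A minimal sketch of this comparison follows, assuming a synonym table that maps every known name to one canonical compound name. The function names and the tiny synonym table are illustrative assumptions, and the sketch ignores stoichiometric coefficients and reaction direction, which a fuller check would need to handle.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

# Hypothetical synonym table: every known name (lower case) -> canonical name.
SYNONYMS: Dict[str, str] = {
    "atp": "ATP",
    "adenosine 5'-triphosphate": "ATP",
    "adp": "ADP",
    "adenosine 5'-diphosphate": "ADP",
    "glucose": "glucose",
    "d-glucose": "glucose",
    "glucose 6-phosphate": "glucose 6-phosphate",
    "d-glucose 6-phosphate": "glucose 6-phosphate",
}

Reaction = Tuple[Iterable[str], Iterable[str]]  # (reactant names, product names)


def canonical(name: str) -> str:
    """Map a compound name to its canonical form; unknown names pass through."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)


def side(names: Iterable[str]) -> FrozenSet[str]:
    """Canonicalize one side of a reaction equation as an unordered set."""
    return frozenset(canonical(n) for n in names)


def same_reaction(a: Reaction, b: Reaction) -> bool:
    """Two reactions match if both sides agree after synonym resolution."""
    return side(a[0]) == side(b[0]) and side(a[1]) == side(b[1])


rxn_from_paper_1 = (["D-glucose", "ATP"], ["D-glucose 6-phosphate", "ADP"])
rxn_from_paper_2 = (["adenosine 5'-triphosphate", "glucose"],
                    ["ADP", "glucose 6-phosphate"])
duplicate = same_reaction(rxn_from_paper_1, rxn_from_paper_2)
```

Without the synonym table the two equations above would look like four different compounds on each side, which is why comprehensive synonym lists matter for detecting duplicate reactions.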
The initial set of biochemical reactions in EcoCyc was derived from the ENZYME
DB (1), which Bairoch's group prepared by typing in the enzyme nomenclature
system (20). We also downloaded the enzyme classification system (20) from
ENZYME. We added comments describing the metabolic role of many of these
reactions. Because the enzyme nomenclature concerns enzymes from all species,
many of the reactions in the ENZYME DB do not actually occur in E.coli. EcoCyc
reaction windows state whether or not we have evidence that a given reaction
occurs in E.coli.
We have added more reactions to EcoCyc because a number of the reactions
catalyzed by E.coli have not been classified by the enzyme committee. In
addition, some of the reactions in (20) are written with different specificity
than the corresponding E.coli enzyme. This observation indicates a weakness of
the enzyme nomenclature system: we cannot expect that a single reaction
equation will accurately reflect the substrate specificities of a family of
enzymes from several organisms.
Reaction frames contain information such as lists of reactants and products for
the reaction equation, the EC number of the reaction, and the ΔG⁰ for the
reaction in the direction it is written. Reaction objects are linked to the
pathway(s) that contain them and to the enzyme(s) that catalyze them. EcoCyc
contains 3038 total reactions organized into 269 classes defined by the enzyme
committee; 702 of the reactions are known to occur in E.coli and 136 of the
reactions have no EC number.
EcoCyc contains extensive information about E.coli enzymes and pathways that we
obtained from the biomedical literature. We performed a comprehensive
literature search for each E.coli enzyme, reaction, and pathway using Medline,
the E.coli-Salmonella book (15) and biochemistry textbooks. We also carried out
manual library searches for other pertinent papers by following citations in
journal articles and in the Science Citation Index. Our original data entry
method was to use a standard text editor to enter information derived from the
literature into a highly structured text file called a template file. Template
files organize information as frames (such as enzymes and pathways) with
labeled slots (attributes). The template files also permit us to associate
chosen literature citations with the appropriate data. We developed a computer
program that parses the template files to extract their constituent data items,
and then inserts those data items into the EcoCyc DB. The parser program
performs consistency checks on the data to correct minor typographical errors,
and verifies, for example, that the entry in a field that is supposed to
contain a gene does in fact refer to a gene in the DB. We recently discontinued
use of the template files in favor of an interactive editing and browsing tool
called the GKB Editor, which allows interactive entry of information directly
into EcoCyc (16).
In the EcoCyc schema, all enzyme objects are instances of the class Proteins,
which is partitioned into two subclasses: Protein-Complexes and Polypeptides. These two classes have a number of common
properties, such as molecular weight, pI, cellular location and a relationship
to one or more catalyzed reactions. They differ in that Protein-Complexes have slots that link them to their subunits, whereas
Polypeptides have a slot that identifies their gene. We also record whether
sequence-similarity relationships exist among a set of isozymes, and we provide
links to the SwissProt and the PDB entries for a polypeptide. Proteins are
listed as a subclass of chemicals since in some cases proteins themselves are
substrates in a reaction (such as phosphorylation reactions). The DB contains
623 polypeptides and 319 protein complexes that comprise a total of 547 enzymes
(i.e., 547 of the polypeptides and protein complexes have defined catalytic
activities).
For each enzyme, we have written comments that address topics such as reaction
mechanism, subreactions of complex reactions, interactions of subunits of
complex enzymes, formation of complexes with other proteins, breadth of
substrate specificity, mode of action of inhibitors and activators, place and
function of reactions in metabolic pathways, other reactions catalyzed by the
protein, and relationship of the protein to other proteins catalyzing the same
reaction.
We define a high-fidelity representation as a formal conceptualization (that
is, a portion of a schema) that allows a DB to accurately capture subtleties of
biology (7, 9). The design of the EcoCyc schema (class hierarchy) was motivated
by several observations. The properties of a reaction (such as its ΔG⁰ and its
substrates) are independent of the enzyme(s) that catalyze it, and an enzyme
has a number of properties (such as molecular weight and amino-acid sequence)
that are logically distinct from the reactions it catalyzes. The relationship
between enzymes and reactions is many-to-many since one enzyme can catalyze
several reactions, and one reaction can be catalyzed by more than one enzyme.
This distinction has led to an interesting and perhaps counterintuitive
observation: EC numbers are actually a property of reactions, rather than of
enzymes. That is, there is a one-to-one correspondence between reactions and EC
numbers, but not between enzymes and EC numbers. An enzyme that catalyzes two
reactions will have two EC numbers, and two enzymes that catalyze the same
reaction have the same EC number.
A further distinction is required because some properties of an enzyme are
meaningful only in the context of a particular reaction that the enzyme
catalyzes. Properties such as activators, inhibitors and cofactors pertain to
the pairing of an enzyme and a reaction because a single enzyme that catalyzes
two reactions may be sensitive to different inhibitors for each reaction, and
we wish to capture this complex relationship. We capture it through a class
called the Enzymatic-Reaction, which links an enzyme to a reaction that it catalyzes, and
essentially describes a single catalytic site within an enzyme.
The slots of the Enzymatic-Reaction class allow us to define four types of activators (competitive,
allosteric, nonallosteric and those whose mechanism is not stated in the
literature) and the analogous four types of inhibitors. By default, these
activators and inhibitors are assumed to have been observed in enzymological
studies; activators and inhibitors that are known to have physiological
relevance are listed in the slot Physiologically-Relevant. Additional slots encode the cofactors, coenzymes and prosthetic
groups of an enzyme. The EcoCyc schema also provides three different means of
encoding substrate specificity. Each approach has different advantages in terms
of succinctness, and in its ability to represent incomplete knowledge (11).
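The role of the linking class can be illustrated with a small Python sketch. The class and field names below are hypothetical simplifications of the schema described above, and the enzyme names, reaction names, EC labels and inhibitor names are placeholders rather than EcoCyc data.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Reaction:
    name: str
    ec_number: str            # the EC number belongs to the reaction


@dataclass
class Enzyme:
    name: str                 # enzyme-level properties (e.g. weight) go here


@dataclass
class EnzymaticReaction:
    """Links one enzyme to one reaction it catalyzes; holds pair-level data."""
    enzyme: Enzyme
    reaction: Reaction
    inhibitors: List[str] = field(default_factory=list)
    activators: List[str] = field(default_factory=list)
    cofactors: List[str] = field(default_factory=list)


def ec_numbers_of(enzyme: Enzyme, links: List[EnzymaticReaction]) -> Set[str]:
    """An enzyme inherits one EC number per reaction it catalyzes."""
    return {l.reaction.ec_number for l in links if l.enzyme is enzyme}


def enzymes_for(reaction: Reaction, links: List[EnzymaticReaction]) -> Set[str]:
    """All enzymes (for example isozymes) that catalyze a given reaction."""
    return {l.enzyme.name for l in links if l.reaction is reaction}


# Placeholder data: enzyme A catalyzes two reactions with different inhibitors.
rxn_1 = Reaction("rxn-1", "EC-placeholder-1")
rxn_2 = Reaction("rxn-2", "EC-placeholder-2")
enzyme_a, enzyme_b = Enzyme("enzyme-A"), Enzyme("enzyme-B")
links = [
    EnzymaticReaction(enzyme_a, rxn_1, inhibitors=["compound-X"]),
    EnzymaticReaction(enzyme_a, rxn_2, inhibitors=["compound-Y"]),
    EnzymaticReaction(enzyme_b, rxn_1),
]
```

Because inhibitors live on the link rather than on the enzyme, the sketch can record that enzyme A is inhibited by one compound when catalyzing the first reaction and by another when catalyzing the second.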
Pathway frames list the reactions that make up a pathway, and describe the
ordering of those reactions within the pathway. Information about the ordering
of reactions within a pathway is encoded using a predecessor-list
representation (7), which for each reaction in a pathway lists the reactions
that precede it in the pathway. This representation allows us to capture
complex pathway topologies, yet does not require entering information that is
redundant with respect to existing reaction objects. We developed algorithms
for deriving a full description of the pathway from the predecessor list (7).
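As an illustration of how much structure a predecessor list carries, the sketch below recovers a reaction ordering for a linear or branched pathway. It uses a standard topological sort (Kahn's algorithm) as a stand-in; it is not the algorithm of (7), it would reject circular pathways, and the function name and reaction labels are hypothetical.

```python
from collections import deque
from typing import Dict, List


def order_reactions(predecessors: Dict[str, List[str]]) -> List[str]:
    """Order reactions so that each one follows all of its predecessors.

    `predecessors` maps each reaction to the reactions that precede it.
    Raises ValueError for cyclic input (a circular pathway needs a
    different treatment than this sketch provides).
    """
    nodes = set(predecessors)
    for preds in predecessors.values():
        nodes.update(preds)

    indegree = {n: 0 for n in nodes}
    successors: Dict[str, List[str]] = {n: [] for n in nodes}
    for rxn, preds in predecessors.items():
        for p in preds:
            successors[p].append(rxn)
            indegree[rxn] += 1

    # Start from reactions with no predecessor; sort for a stable result.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order: List[str] = []
    while queue:
        current = queue.popleft()
        order.append(current)
        for nxt in sorted(successors[current]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    if len(order) != len(nodes):
        raise ValueError("predecessor list contains a cycle")
    return order


# A branched pathway: r1 -> r2, then r2 feeds both r3 and r4.
branched = {"r1": [], "r2": ["r1"], "r3": ["r2"], "r4": ["r2"]}
ordering = order_reactions(branched)
```

Note that only reaction identifiers appear in the input: substrates, products and enzymes are looked up from the existing reaction objects, which is the non-redundancy the text describes.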
If a reaction can potentially be catalyzed by more than one enzyme, but only
one enzyme is physiologically active in a particular pathway (such as the
oxidative succinate dehydrogenase and the anaerobic fumarate reductase), we can
encode this restriction in the pathway frame within a slot called Enzyme-Use by
listing each reaction and the enzyme(s) that catalyze(s) it.
The DB uses objects called superpathways to define a new pathway as an
interconnected cluster of smaller pathways. For example, a superpathway called
'complete aromatic amino-acid biosynthesis' links together the individual
pathways for biosynthesis of chorismate, tryptophan, tyrosine and
phenylalanine. Superpathways are also defined using the predecessor list (7).
EcoCyc currently contains 107 pathways and 28 superpathways.
One possible point of confusion concerns the large discrepancy between the
number of pathways found in EcoCyc, and in the E.coli subset of the EMP
database (19). The number of pathways is larger in EMP because many of its
pathways consist of a single reaction. In contrast, no EcoCyc pathways contain
only one reaction. The average length of an EcoCyc pathway is six reactions.
The EcoCyc GUI uses automatic layout algorithms to generate drawings of linear,
circular and tree-structured pathways. The GUI allows the user to navigate from a pathway to
a superpathway that contains it, or vice versa. Pathway drawings can
incorporate varying amounts of detail as specified by the user. Minimal detail
shows only the names of compounds at branch points and on the exterior of the
pathway; full detail shows all compound names, enzyme names and compound
structures.
The EcoCyc data are subjected to several different validation checks to ensure
their correctness. The DB contains many consistency constraints that are
automatically evaluated with respect to new entries, such as determining
whether the object listed as the product of a gene is in fact an instance of
the class Polypeptides, or that the molecular weight of a protein is a positive
number.
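The two constraints mentioned above could be expressed roughly as follows. This is a hedged sketch only: the dictionary-based frame layout, the slot names and the `check_frame` function are assumptions made for illustration and do not reflect how the EcoCyc DB actually stores or evaluates its constraints.

```python
from typing import Any, Dict, List

Frame = Dict[str, Any]


def check_frame(frame: Frame, db: Dict[str, Frame]) -> List[str]:
    """Return a list of constraint violations for one new entry."""
    errors: List[str] = []

    # Constraint 1: the product of a gene must be an instance of Polypeptides.
    if frame["class"] == "Genes" and "product" in frame:
        product = db.get(frame["product"])
        if product is None or product["class"] != "Polypeptides":
            errors.append(
                f"{frame['id']}: product {frame['product']!r} is not a polypeptide"
            )

    # Constraint 2: the molecular weight of a protein must be positive.
    if frame["class"] in ("Polypeptides", "Protein-Complexes"):
        weight = frame.get("molecular-weight")
        if weight is not None and weight <= 0:
            errors.append(f"{frame['id']}: molecular weight must be positive")

    return errors


# Placeholder entries; identifiers and values are illustrative only.
db: Dict[str, Frame] = {
    "polypeptide-X": {"id": "polypeptide-X", "class": "Polypeptides",
                      "molecular-weight": 30.0},
    "compound-Y": {"id": "compound-Y", "class": "Compounds"},
}
good_gene = {"id": "gene-1", "class": "Genes", "product": "polypeptide-X"}
bad_gene = {"id": "gene-2", "class": "Genes", "product": "compound-Y"}
bad_protein = {"id": "polypeptide-Z", "class": "Polypeptides",
               "molecular-weight": -5.0}
```

Running such checks at entry time means an error is reported against the single new frame that caused it, rather than surfacing later as an inconsistent display.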
We also employ a reaction mass-balancing program to search for DB errors. The program evaluates all
reactions for which every substrate of the reaction is a known compound within
the EcoCyc DB, and the empirical formula of the compound is known. The program
sums all atoms for each type of element for the products of the reaction, and
for the reactants, and verifies that all atoms are conserved. (In fact, we
allow hydrogen atoms to be unconserved because of inconsistencies in ionization
states across different compounds in the DB.) This program has identified a
number of errors in EcoCyc, including a dozen typographical errors in the
reactions obtained from the ENZYME DB, and errors in our compound structures.
This program further illustrates the utility of including chemical compounds in
a metabolic DB.
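The balancing check can be sketched as below, assuming each compound's empirical formula is available as a simple string such as C6H12O6 (no parentheses or charges). The function names are hypothetical, and the example uses the well-known phosphorylation of glucose by ATP with neutral formulas. Hydrogen is ignored by default, mirroring the allowance described above.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Side = List[Tuple[int, str]]   # (stoichiometric coefficient, empirical formula)


def atom_counts(formula: str) -> Counter:
    """Count atoms in a flat empirical formula such as 'C6H12O6'."""
    counts: Counter = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_counts(side: Side) -> Counter:
    """Sum atoms over all compounds on one side of the equation."""
    total: Counter = Counter()
    for coefficient, formula in side:
        for element, n in atom_counts(formula).items():
            total[element] += coefficient * n
    return total


def unbalanced_elements(reactants: Side, products: Side,
                        ignore: Iterable[str] = ("H",)) -> Dict[str, Tuple[int, int]]:
    """Return {element: (reactant count, product count)} for each mismatch."""
    left, right = side_counts(reactants), side_counts(products)
    skipped = set(ignore)
    return {
        el: (left[el], right[el])
        for el in set(left) | set(right)
        if el not in skipped and left[el] != right[el]
    }


# glucose + ATP -> glucose 6-phosphate + ADP
reactants = [(1, "C6H12O6"), (1, "C10H16N5O13P3")]
products = [(1, "C6H13O9P"), (1, "C10H15N5O10P2")]
balanced = unbalanced_elements(reactants, products)

# The same reaction with a typo in the ADP formula (one phosphorus missing).
typo_products = [(1, "C6H13O9P"), (1, "C10H15N5O10P")]
flagged = unbalanced_elements(reactants, typo_products)
```

A non-empty result points directly at the element whose count differs, which is usually enough to locate a typographical error in either the reaction equation or a compound formula.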
Finally, we review each entry before its release. We have recently begun to
enlist expert scientists to review each pathway.
EcoCyc provides the user with two classes of DB retrieval operations: direct
retrieval through menus of predefined queries, and indirect retrieval through
hypertext navigation. For example, imagine that a user seeks information on the
hisA gene, such as its map position and information about the enzyme it
encodes. EcoCyc allows the user to call up an information window for that gene
directly by querying the gene name.
The indirect approach consists of hypertext navigation among the information
windows for related objects. Such navigation allows the user to find the hisA
gene by traversing many paths through the DB. The user could issue a direct
query to display the biosynthetic pathway for histidine, and then click on the
name of the enzyme at the last step in the pathway. The resulting information
window for that enzyme will show the name of the gene (hisA) coding for the
enzyme. Clicking on the gene name will display the information window for hisA.
Alternatively, the user could query the compound histidine by name. The
resulting window lists all reactions involving histidine; the user can click on
a reaction to navigate to its window, which lists all enzymes that catalyze the
reaction, plus all genes encoding those enzymes (including hisA).
Users invoke queries using menus and dialog windows, rather than through a query
language (although we have partially implemented a declarative query language
for EcoCyc). A distinctive aspect of EcoCyc is its extensive set of taxonomies.
For example, EcoCyc includes two taxonomies of genes developed by Riley (18), a
taxonomy of metabolic pathways, a taxonomy of chemical compounds, and the
taxonomy of reactions developed by the enzyme committee (20). A user can query
the gene taxonomy by first selecting a gene class from a menu of all classes
(such as the class of genes coding for membrane proteins); next, the user
chooses one or more of the genes in that class from a second menu. The full set
of queries supported by EcoCyc is as follows.
Gene queries
- Get gene by name, Get gene by substring. Examples: Find hisA; find all genes
whose name includes 'his'
- Get gene by class
Enzyme queries
- Get enzyme by name, Get enzyme by substring
- Get enzyme by pathway. Example: Select from a menu of the enzymes in
glycolysis
Reaction queries
- Get reaction by pathway
- Get reaction by EC number. Example: Find 1.2.3.4
- Get reaction by class. Example: Select from a menu of all reactions in the EC
class 1.2.3
Pathway queries
- Get pathway by name, Get pathway by substring
- Get pathway by class. Example: Select from a menu of all pathways for amino
acid biosynthesis
Compound queries
- Get compound by name, Get compound by substring
- Get compound by class
- Get compound by substructure. Example: Find all compounds containing the
substructure C-C-OH [substructures are specified using the SMILES language (21)]
Map queries
- Create linear map display
- Create circular map display
- Zoom in on map; position is specified via mouse click, gene name, or
numerical map position
- Add or remove genes from a partial map
Overview queries
- Highlight compounds or reaction steps using virtually all of the
preceding types of queries
When a query returns multiple answers, the user can examine each answer in turn.
The user can also employ a history list to return to a previous window.
EcoCyc is implemented in Common Lisp with a graphical-interface toolkit called the Common Lisp Interface Manager (CLIM). CLIM
and Common Lisp are both highly portable, facilitating the delivery of EcoCyc
on a variety of platforms. EcoCyc now runs on the Sun workstation under Common
Lisp and CLIM products from Franz Inc.
EcoCyc builds on several software components. Metabolic pathway displays make
use of the Grasper-CL graphing tool, developed at SRI (13). Grasper-CL provides
facilities for manipulation and display of graphs consisting of nodes and
edges, and provides a library of automatic layout algorithms. To store and
manage the EcoCyc data, we use an FRS called Ocelot developed by our group at
SRI; HyperTHEO is the predecessor of Ocelot and is described in (8). The World
Wide Web (WWW) server capabilities are based on a software tool called CWEST
(17).
EcoCyc is available via the Internet in three forms:
(i) A program for the Sun workstation (SunOS or Solaris) bundles together the
EcoCyc GUI and the EcoCyc DB.
(ii) The EcoCyc DB alone is available as a set of flat files.
(iii) The EcoCyc GUI is accessible online through the WWW.
The EcoCyc WWW pages describe all three types of access to EcoCyc; they also
provide links to the EcoCyc User's Guide, to detailed documentation of the
EcoCyc schema, and to all publications produced by the EcoCyc project. The URL
for the EcoCyc home page is http://www.ai.sri.com/ecocyc/ecocyc.html
This work was supported by grant 1-R01-RR07861-01 from the National Center for Research Resources, and by
grant R29-LM-05413-01A1 from the National Library of Medicine. The contents of
this article are solely the responsibility of the authors and do not
necessarily represent the official views of the National Institutes of Health.

Table 1. Current size of each EcoCyc class

Class           Current
Reactions           702
Polypeptides        623
Pathways            107
Genes              2970
Compounds          1283
REFERENCES

