Download Evaluation of computational metabolic

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Pathogenomics wikipedia , lookup

Genomics wikipedia , lookup

Designer baby wikipedia , lookup

Wnt signaling pathway wikipedia , lookup

Oncogenomics wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

History of genetic engineering wikipedia , lookup

Genomic library wikipedia , lookup

Public health genomics wikipedia , lookup

Nicotinamide adenine dinucleotide wikipedia , lookup

Minimal genome wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Genome evolution wikipedia , lookup

Genome editing wikipedia , lookup

Mir-92 microRNA precursor family wikipedia , lookup

Metabolic network modelling wikipedia , lookup

Transcript
Vol. 18 no. 5 2002
Pages 715–724
BIOINFORMATICS
Evaluation of computational metabolic-pathway
predictions for Helicobacter pylori
Suzanne M. Paley and Peter D. Karp ∗
Bioinformatics Research Group, SRI International, EK207, 333 Ravenswood Ave,
Menlo Park, CA 94025, USA
Received on August 26, 2001; revised on December 10, 2001; accepted on December 17, 2001
ABSTRACT
Motivations: We seek to determine the accuracy of computational methods for predicting metabolic pathways in
sequenced genomes, and to understand the contributions
of both the prediction algorithms, and the reference pathway databases used by those algorithms, to the prediction
accuracy.
Results: The comparisons we performed were as follows. (1) We compared two predictions of the pathway
complements of Helicobacter pylori that were computed
by an early version of our pathway-prediction algorithm:
prediction A used the EcoCyc E. coli pathway DB as the
reference database (DB) for prediction, and prediction B
used the MetaCyc pathway DB (a superset of EcoCyc)
as the reference pathway DB. The MetaCyc-based prediction contained 75% more pathway predictions, but we
believe a significant number of those predictions were
false positives. (2) We compared two predictions of the
pathway complement of H. pylori that used MetaCyc
as the reference pathway DB, but that used different
algorithms: the original PathoLogic algorithm, and an
enhanced version of the algorithm designed to eliminate false-positive pathway predictions. The improved
algorithm predicted 30% fewer metabolic pathways than
the original algorithm; all of the eliminated pathways are
believed to be false-positive predictions. (3) We compared
the 98 pathways predicted by the enhanced algorithm
with the results of a manual analysis of the pathways of
H. pylori. Results: 40 of the computationally predicted
pathways were consistent with the manual analysis,
13 pathways are considered false-positive predictions,
and four pathways had partially overlapping topologies.
Twenty-six predicted pathways were not mentioned
in the manual analysis; we believe these are correct
predictions by PathoLogic that were not found by the
manual analysis. Five pathways from the manual analysis
were not found computationally. Agreement between the
computational and manual predictions was good overall,
with the computational analysis inferring many pathways
∗ To whom correspondence should be addressed.
c Oxford University Press 2002
that the manual analysis did not identify. Ultimately the
manual analysis is also partially speculative, and therefore
is not an absolute measure of correctness. The algorithm
is designed to err on the side of more false positives to
bring more potential pathways to the user’s attention. The
resulting H. pylori pathway DB is freely available at http:
//ecocyc.org:1555/HPY/organism-summary?object=HPY.
Availability: The Pathway Tools software is freely available to academic users, and is available to commercial
users for a fee. Contact [email protected] for information
on obtaining the software.
Contact: [email protected], [email protected]
1 INTRODUCTION
As more complete genomes are sequenced, the task of
analysing the resulting data looms ever larger. Automated
analysis techniques can reduce the bottlenecks caused
by human curators, and are vital for keeping up with
the flood of sequence data. But how effective are they?
For example, little is known about the accuracy of
predicting the function of a protein via sequence-similarity
methods, whether these methods are used in a manual
or an automated fashion. Studies of the strengths and
weaknesses of any particular algorithm are imperative, to
identify the situations to which it can be usefully applied,
and to estimate a level of confidence in the results it
produces.
The PathoLogic program predicts the metabolic pathways of an organism from its annotated genome. PathoLogic takes as input an annotated genome sequence for an
organism and produces a new Pathway/Genome Database
(PGDB) containing those pathways that it has computationally inferred to be present in the organism. A PGDB
describes the genes, proteins, and metabolic reactions and
pathways of an organism. The second input required by
PathoLogic is a reference pathway database (DB), such as
EcoCyc or Metacyc. EcoCyc is a PGDB for E. coli (Karp
et al., 2002b). MetaCyc is a metabolic pathway DB containing pathways from over 130 different organisms (Karp
et al., 2002a), and contains all pathways in EcoCyc.
715
S.M.Paley and P.D.Karp
This paper evaluates the results of two recent improvements in our databases and algorithms, and also compares
the use of our most sophisticated pathway-prediction program on the genome of H. pylori with a manual prediction
of the pathways of H. pylori from its genome. More precisely, the comparisons we perform are as follows:
1. We compare two predictions of the pathway complements of H. pylori using our original PathoLogic
algorithm: The first prediction, which we call
HpyCyc–1, used the EcoCyc pathway DB as the
reference DB. The second prediction, which we call
HpyCyc–2A, used the MetaCyc DB as the reference
pathway DB.
2. We compare two predictions of the pathway complement of H. pylori, using MetaCyc as the reference pathway DB, but using different algorithms.
The first prediction used the original PathoLogic algorithm, and is the same as prediction HpyCyc–2A.
The second prediction used an enhanced version of
the algorithm—we call this prediction HpyCyc–2B.
This enhanced version of the algorithm was developed as a result of our experiences in evaluation 1.
3. We compare prediction HpyCyc–2B with a manual
prediction of the pathway complement of H. pylori
by Marais et al. (1999).
2 THE PATHOLOGIC ALGORITHM
We begin with a description of the initial PathoLogic
algorithm (Karp et al., 1999), which is essential to
understanding how the pathway prediction process works.
PathoLogic accepts two inputs: a fully annotated
genome sequence, in Genbank file format, and a reference
pathway DB (either EcoCyc or MetaCyc). Only the
annotation information from the Genbank file is used by
the program—PathoLogic does no sequence analysis.
Specifically, the program uses the gene product names in
the Genbank /product qualifier, and the EC numbers in
the /ec number qualifier (when present).
The process of pathway prediction involves evaluating
the evidence for the presence of each pathway in the
reference DB for the organism being analysed. A pathway
consists of a sequence of enzyme-catalysed reactions. The
more enzymes within a pathway that are encoded by
the genome, the more evidence we say there is for the
presence of that pathway.
2.1 Link enzymes to reactions
In order to determine which enzymes encoded by the
genome are associated with which pathways, PathoLogic
performs a matching process between the enzyme names
and Enzyme Commission (EC) numbers listed in the
annotated genome, and the reaction steps listed in each
pathway in its reference pathway DB. The matching pro716
cess is based on the functions assigned to individual genes
by the genome center that annotated the genome. We call
this matching process ‘linking enzymes to reactions’.
The motivations behind this process are as follows.
(1) Reactions are the bridges between enzymes and
pathways, in the sense that our pathway ontology (Karp
and Paley, 1994) directly links a reaction to the pathways
that contain it; once we know what reaction an enzyme
catalyzes, it is trivial to retrieve the pathways containing
that reaction. (2) Enzyme names are degenerate in the
sense that the biomedical literature, and databases such
as Swiss–Prot and Genbank, refer to a given enzymatic
activity using many different names for the same activity,
and occasionally, using the same name for different enzyme activities. (3) Enzyme names have low information
content because by themselves they do not specify the
precise biochemical activity of the enzyme. For example,
the name ‘2-isopropylmalate synthase’ tells us what product this enzyme can produce, but does not identify the
reactants or the other products in this reaction. Until we
know the full set of reactants and products for the enzyme
(the reaction equation), we have only an approximate
picture of its molecular function. In the course of our
pathway prediction work we encounter many enzyme
names in Genbank entries for which we cannot determine
their reaction equations, even after exhaustive searches in
Swiss–Prot (from which many enzyme names in Genbank
entries for complete genomes are derived, based on
sequence-similarity searches) and in Medline.
More precisely, linking enzymes to reactions means creating a connection between the DB object that represents
an enzyme, and a DB object that describes the reaction that
enzyme catalyzes by listing its reactants and products.
The algorithm LinkEnzymesToReactions(names,ecnum)
accepts as its inputs one or more alternative gene-product
names for a single enzymatic activity, and an EC number,
from a single Genbank coding region. It returns up to two
reactions as its outputs that correspond to that enzyme
activity. The algorithm is as follows:
1. Match by EC Number: If ecnum is non-null, retrieve
from the MetaCyc DB the reaction identified by that
EC number, and store that reaction in R1 .
2. Build a hash table H whose entries associate enzyme names and reactions. The enzyme names in H
are gathered from the Enzyme DB (Bairoch, 1996),
EcoCyc, MetaCyc, and an additional mapping table built up over time by our group. Each name in
H is converted to a canonical form by a function
Canon that converts the name to lowercase and removes spaces, dashes, and other punctuation characters. The purpose of Canon is to prevent variability
based on case and punctuation from interfering with
the matching process.
Evaluation of computational pathway predictions
3. Match by Enzyme Names: For each gene product
name E in names (in some cases, synonyms for the
enzyme name are provided in a Genbank entry):
(a) Set E to the canonical form of E using the
function Canon.
(b) Look up E in H ; if found, store in R2 the
reaction associated with E in H , and GOTO 4.
(c) For each E 1 in a list of computed variant forms
of E, look up E 1 in H ; if found, store the
associated reaction in R2 .
4. IF R1 ∧ R2 ∧ (R1 = R2 ) THEN report to the user
that the EC number and enzyme name for the current
coding region are inconsistent.
5. IF R1 THEN create a connection within the DB
between the current enzyme and R1 .
6. IF R2 THEN create a connection within the DB
between the current enzyme and R2 .
In computing variant forms of E the program attempts
to remove various extraneous text that is all too frequently
found in Genbank /product fields (Karp, 2001), such as
(1) Prefix and suffix words added to the enzyme name such
as ‘putative,’ ‘probable,’ ‘alpha chain,’ ‘large subunit,’
etc. (2) Parenthesized gene names that follow the product
name in some Genbank entries. This type of guesswork
would not be required if gene product names were encoded
using a controlled vocabulary such as the Gene Ontology
(Ashburner et al., 2000), or if EC numbers were used more
frequently in Genbank entries—most genome annotations
omit EC numbers.
We have found that 10–20% of the enzymes in a given
genome are not identified by the preceding techniques
because the names are not found in the hash table H.
These enzymes are linked to their reactions manually by
the PathoLogic user based on literature research.
The only other potential information in the Genbank
record that could aid in linking enzymes to reactions are
gene names, and the sequence itself. Because we believe
gene names are an ambiguous way to identify protein
function, the program does not make use of gene names.
Alternatively, we could have used sequence analysis to
match protein sequences in the target organism with their
counterparts of known function in EcoCyc (MetaCyc
contains no sequence information). However, the field of
fully automated functional assignment based on sequence
analysis is a complex problem that is beyond the scope of
this project, and we believe that the human annotators at
most genome centers do a more reliable job of predicting
function from sequence than does any existing automated
algorithm.
2.2 Infer pathways
Once the matching phase is complete, PathoLogic has
inferred a set of reactions expected to occur in the target
organism. Many of these reactions are components of
pathways in the reference database. The remaining task
is to determine which of those pathways are likely to be
present in the target organism. Evidence for most or all of
the reactions in a pathway constitutes strong evidence that
the pathway is present in the organism. If there is evidence
for some reactions in a pathway but not others, there are
three possible interpretations:
• The pathway is not present in the target organism
(either the reactions for which there is evidence are
being used for some other purpose, these genes are not
expressed, or their functions were predicted incorrectly
from sequence).
• A variant form of the pathway is present in the target
organism that uses some but not all of the steps from
the pathway, as described in the reference database.
• The pathway is present in the target organism, but
the genes for the missing reaction steps either have
not been found by the name-matcher or have not yet
been identified in the genome. An enzyme E that
catalyzes reaction R might not have been identified in
the genome because the sequence of E has diverged
so far from the sequences of other known enzymes
that catalyze R that the homology cannot be detected,
or because E is not evolutionarily related to other
sequenced enzymes that catalyze R.
It is not easy to determine which of these three
interpretations is the correct one. The initial PathoLogic
algorithm errs on the liberal side of predicting more
pathways by assuming that a pathway can be present in
the target organism if there is evidence for at least one
of its reactions (with one exception: MetaCyc contains a
number of sets of specifically marked variant pathways.
If there is more evidence for one member of a variant set
than for the others, then only that variant is assumed to
be present in the target organism). Although a great many
pathways are therefore considered potentially present, the
measure of the likelihood of a pathway actually being
present in the organism is based on the proportion of
reactions in the pathway for which there is evidence,
and the presence of ‘unique’ reactions; i.e. evidence for
a reaction that occurs only in one pathway is stronger
evidence for that pathway than evidence for a reaction that
occurs in several other pathways as well.
PathoLogic is intended to provide a first-pass pathway
analysis—it is not expected to replace manual curation;
rather, it provides a solid stepping-off point to make
manual analysis as quick and straightforward as possible.
717
S.M.Paley and P.D.Karp
3
RESULTS: COMPARISON OF HPYCYC–1
WITH HPYCYC–2A
The first PGDB for H. pylori (HpyCyc–1) was generated
in 1997 with an early version of the PathoLogic software
and using EcoCyc as the reference database. The annotated H. pylori sequence dataset was an early version
that existed before the version deposited in Genbank by
The Institute for Genomic Research (TIGR). The geneto-reaction mapping for HpyCyc–1 was not performed
by PathoLogic, but was provided directly by personnel
at TIGR in personal correspondence with Dr. Karp prior
to the publication of the complete genome sequence, and
was supplemented to a limited extent by additional SRI
analyses. The resulting PGDB was subject to a small
amount of manual correction and enhancement.
The MetaCyc database, created in 1998, contains 130
pathways from EcoCyc, and another 230 metabolic pathways from over 130 different organisms. The pathways
in MetaCyc were gathered from a number of literature
and online sources. When EcoCyc is used as a reference
database for PathoLogic, the only pathways that can be
inferred are those that are shared between the target organism and E. coli. We expected that, by using MetaCyc
as the reference database, PathoLogic would be able to infer additional pathways that are not found in E. coli and,
to some extent, would be able to distinguish between variants of a pathway that are found in different organisms.
For this study we created a second PGDB (HpyCyc–2A)
for H. pylori, using the version of PathoLogic described
in the Methods Section. This PGDB was generated
from the complete annotated Genbank sequence entry
for H. pylori (AE000511 (Tomb et al., 1997)), with
MetaCyc used as the reference database. Most of the
PathoLogic processing is automated, but a number of the
enzyme names in the Genbank record were not recognized
by the PathoLogic enzyme-name matcher. We therefore
performed manual matching of enzyme names to EC
numbers for approximately 10% of the metabolic enzymes
in H. pylori. Table 1 summarizes the relevant contents of
the resulting PGDBs.
With MetaCyc as the reference database, 135 pathways
were predicted (i.e. evidence was found for at least one
reaction in all these pathways), as opposed to 77 pathways
when EcoCyc was used as the reference database. This
confirmed our hypothesis that MetaCyc would infer in
more pathways than EcoCyc.
Closer examination, however, showed that many of
the predicted pathways were false-positive pathways—
pathways that were highly unlikely to be present, based
on other biological knowledge. For example, in many
pathways there was evidence for only a single reaction
that was also present in many other pathways; therefore,
that reaction could not be considered strongly indicative
of any particular pathway. Nonetheless, at least 10 of the
718
new pathways represent significant additions. Some of
these new pathways happened to be E. coli pathways that
had been added to both EcoCyc and MetaCyc after 1997,
when HpyCyc–1 was created. Predicted pathways whose
origin in MetaCyc was from organisms other than E. coli
included an arginase degradation pathway, a UDP-glucose
conversion pathway, a glutamine biosynthesis pathway,
and an aerobic glycerol fermentation pathway. A small
number of variant pathways were inferred by PathoLogic
in place of the original E. coli pathways. An example is
a polyamine biosynthesis pathway from Bacillus subtilis
that formed a subset of the original E. coli pathway.
There are only a few cases where a pathway P was
inferred in HpyCyc–1, but neither P nor any of its variants
were inferred in HpyCyc–2A. These pathways were due to
genes that our name-matching algorithm failed to match to
any reaction in the pathway, but that were matched to such
reactions in HpyCyc–1 (such matches were either supplied
by TIGR or were results of SRI’s analysis). Perusal of
the corresponding entries in the Genbank file showed that
either no enzyme name was present in these cases or only
one enzyme name, whereas the enzyme was presumed to
be multifunctional in the original PGDB. For example,
an aspartate biosynthesis pathway consisting of a single
reaction, EC number 2.6.1.1, was inferred in HpyCyc–1,
but not in HpyCyc–2A. The original PGDB linked this
reaction to HP0624, based on sequence similarity to the
aspB gene in B. subtilis, but the enzyme name ‘aspartate
aminotransferase’ does not appear in the Genbank entry.
Rather, the product for this gene is described as ‘solutebinding signature and mitochondrial signature protein
(aspB),’ which would indicate that this gene product does
not carry out the aspartate aminotransferase reaction. It is
not clear whether the sequence annotation is incomplete
or whether the gene was incorrectly matched to a reaction
in HpyCyc–1.
4
RESULTS: COMPARISON OF HPYCYC–2A
TO HPYCYC–2B
Because a significant number of probable false-positive
pathways existed in the MetaCyc-based prediction
(HpyCyc–2A), we explored heuristics for identifying
these false-positive predictions. It was impossible to
eliminate all undesirable pathways while still keeping all
those that we believed should be kept. For example, a
significant fraction of reactions from the Calvin Cycle are
present in H. pylori, although none of those reactions are
unique to the Calvin Cycle. But since H. pylori is not a
photosynthetic organism, that pathway is highly unlikely
to be present in the organism in any form. We created a
new version of PathoLogic called PathoLogic–2, containing an algorithm that identifies false-positive pathways.
The algorithm singles out for deletion the pathways that
meet the following heuristic criteria:
Evaluation of computational pathway predictions
Table 1. A summary of pathway predictions for H. pylori, using EcoCyc and MetaCyc as the reference DBs to produce HpyCyc–1 and HpyCyc–2A,
respectively. Note that the difference in the number of gene and reaction matches between the two PGDBs is not due to any inherent properties of the
reference databases, but rather to the different methods used to generate the matches
Number of enzymes linked to reactions
Number of unique reactions matched to enzymes
Number of pathways predicted initially
• The pathway contained evidence for no unique
enzyme, AND
• Either
– The pathway consisted of more than two reactions
but contained evidence for only a single reaction,
OR
– The set of reactions for which there was evidence
was a proper subset of the set of reactions for which
there was evidence in some other pathway, AND
one of the following was true:
∗ The pathway was classified as a biosynthetic
pathway and was missing one or more steps
from the end of the pathway. The rationale
for this criterion is that if the final steps are
missing, then the pathway is not producing
its intended target, so the other pathway(s)
containing the common reactions are more
likely to reflect the actual role of those reactions
in the organism.
∗ The pathway was classified as a degradation
pathway and was missing one or more steps
from its beginning. The rationale for this
criterion is that if the initial steps are missing,
then the pathway is not degrading its intended
substrate, so the other pathway(s) containing
the common reactions are more likely to
reflect the actual role of those reactions in the
organism.
∗ The pathway was classified as an energy
metabolism pathway and was missing at least
half its reactions. The rationale for this criterion is that there is a great deal of overlap of
reactions among the energy metabolism, with
many reactions appearing in many pathways.
Thus, many false positives are likely to be
inferred from among these pathways. An
arbitrary but reasonable cutoff for excluding
some of these false positives is 50%.
We ran the enhanced PathoLogic–2 program to
create a new MetaCyc-based pathway prediction
HpyCyc–1
HpyCyc–2A
314
240
77
298
264
135
for H. pylori, called HpyCyc–2B. HpyCyc–2B is
available through the WWW at http://ecocyc.org:
1555/HPY/organism-summary?object=HPY. HpyCyc–
2B contains 98 pathways, which is almost 30% fewer
pathways than does HpyCyc–2A, and the enhanced
algorithm removes only pathways that we believe to be
false-positive predictions. For example, an arabinose
catabolism pathway was pruned because the only reaction
in the pathway for which there was any evidence was
the conversion of ribulose-5-phosphate to xylulose-5phosphate, which also appears in the pentose phosphate
pathway and elsewhere. A 2-dehydro-D-gluconate pyruvate catabolism pathway was pruned because this was a
degradative pathway missing its initial two reactions, and
the six reactions for which there was evidence were all
part of the glycolysis/Entner–Doudoroff super-pathway.
5
RESULTS: COMPARISON OF HPYCYC–2B
WITH MANUAL PATHWAY ANALYSIS OF H.
PYLORI
We compared the HpyCyc–2B prediction created by
PathoLogic–2 with a recent review article that analyses the
principal metabolic pathways of H. pylori by combining
experimental biochemical analyses of H. pylori with data
derived from the whole genome sequence (Marais et al.,
1999). Thus, we can critique our computational methods
by seeing how well their predictions compare to a manual
genome analysis, as well as against experimental data.
This critique is necessarily incomplete, because the review
article selects only a subset of metabolism to analyse and
does not provide details in all cases, and PathoLogic’s
predictions are confined to the enzymes of small molecule
metabolism (for example, PathoLogic does not consider
transport processes, but the review does). Note that
PathoLogic does not make use of the same biochemical
analyses that Marais et al. did, putting the program at
somewhat of a disadvantage from the start.
In general, PathoLogic found most of the pathways that
the article claims are present, but it often did not find
matches in the genome for as many of the reaction steps
as did Marais et al. There were a few pathways that
PathoLogic did not find at all. There are several reasons
why PathoLogic missed reaction steps. In a number of
719
S.M.Paley and P.D.Karp
cases, only a single reaction was ascribed to a gene
known to be multifunctional (since in general only a
single function is listed for each gene in the Genbank
file). In other cases, the Genbank entry for a gene did not
contain the information necessary for PathoLogic’s namematching routine to link the gene to any reaction. In still
other cases, the usual enzyme for a particular step was not
present, but another enzyme (with a different name) could
be used to perform the same task. Marais et al. were able
to identify such cases, but PathoLogic could not.
Although PathoLogic inferred many pathways that were
not mentioned by the article, it did not infer any pathways
that the article specifically said were missing. Pathways
that the article claimed were incomplete were typically
inferred by PathoLogic, but without evidence for the steps
that Marais et al. did not find.
a branched noncyclic pathway. This reductive citric acid
pathway was entered into HpyCyc–1 manually, and was
subsequently copied into MetaCyc. We would therefore
expect that this variant pathway would be inferred to be
present in the new PGDB, rather than the TCA Cycle. This
is in fact what happened. Some of the genes involved in
this pathway have not yet been identified, but PathoLogic
was able to match up the genes for all the other steps.
5.1.2 Pyruvate metabolism. The review describes a
three-step pathway from pyruvate to acetate. MetaCyc
does not contain this pathway, but includes the conversion
of pyruvate to acetate as one branch of a larger propionate
fermentation pathway that was inferred. PathoLogic found
evidence for only two of the three relevant steps. The
gene they identify for the other step has no product name
listed in the Genbank entry (the entry does contain a note
identifying the gene as a putative acetate kinase). PathoLogic did identify genes for the other transformations of
pyruvate that are mentioned in the article.
5.1.4 Amino acid metabolism. The review states that
H. pylori requires arginine, histidine, isoleucine, leucine,
methionine, phenylalanine, and valine for growth, but
can synthesize all other amino acids. In accord with
the observations of the review, PathoLogic does not
infer any pathways for arginine, histidine, or phenylalanine biosynthesis, and infers only partial pathways for
isoleucine, leucine, methionine, and valine. PathoLogic
infers complete biosynthetic pathways for threonine,
tryptophan, serine, glycine, glutamine, glutamate, lysine,
cysteine, and chorismate. It does not infer any pathway for
aspartate or asparagine biosynthesis (the failure to assign
the aspB gene to the aspartate transaminase reaction
is discussed in the section comparing HpyCyc–1 with
HpyCyc–2A). PathoLogic inferred only partial pathways
for tyrosine, alanine and proline biosynthesis. In the case
of tyrosine biosynthesis, the review states that the tyrB
gene is missing, but that another enzyme can perform its
function (we did not link that enzyme to this reaction).
Also, the tyrA gene should catalyze two reactions in
this pathway, but we linked it only to the one function
mentioned in the Genbank file. These results point out
the difficulty of distinguishing between partially inferred
pathways in which steps are actually missing from the
organism and. those in which the steps are present but
have not been identified by our algorithm.
The review cites experimental and genomic evidence
for the catabolism of alanine, arginine, aspartate, asparagine, glutamate, glutamine, and serine. It did not
mention evidence for catabolism of any other amino
acids: presumably none was found. PathoLogic inferred
complete pathways for catabolism of alanine, glutamate
and serine, and partial pathways for catabolism of arginine
and asparagine. It did not find any pathways for aspartate
or glutamine catabolism (although there are reactions that
interconvert glutamine and glutamate, and asparagine and
aspartate, specific pathways are probably not necessary).
It also found a partial proline degradation pathway and
single reactions within pathways for the degradation of
methionine, threonine, and leucine. Pathways for the
degradation of glycine, isoleucine, and valine were originally inferred, but were later classified as false-positive
pathways, in agreement with the review.
5.1.3 The TCA cycle. In place of the normal Krebs or
Tricarboxylic Acid (TCA) Cycle, H. pylori appears to use
5.1.5 Lipid metabolism. PathoLogic found most of the
fatty acid degradation pathways, and all but one reaction
5.1
Detailed comparison of HpyCyc–2B with a
review by Marais et al.
This section provides a detailed comparison of HpyCyc–
2B with each relevant section of the review article.
5.1.1 Glucose metabolism. According to the review,
glucose appears to be the only carbohydrate used by
the organism. Although PathoLogic’s initial analysis did
show several other carbohydrate catabolism pathways,
almost all fall into the category of false-positive pathways
and were removed. The only remaining carbohydrate
catabolism pathways are glycerol catabolism and a pathway that combines galactose, galactoside, and glucose
catabolism.
The Entner–Doudoroff pathway is known to be present
in H. pylori, as are both oxidative and nonoxidative
parts of the pentose phosphate pathway (though with one
gene not yet identified); glycolysis and gluconeogenesis
are very likely present, although not all the genes have
been identified. The PathoLogic predictions bear this out,
with all five pathways inferred, but with the indicated
steps missing from glycolysis and the pentose phosphate
pathway.
720
Evaluation of computational pathway predictions
from the fatty acid synthesis pathways (initiation and elongation). According to the review, phospholipid degradation has been observed experimentally, but the corresponding genes have not been found. PathoLogic did not infer a phospholipid degradation pathway. The phospholipid
biosynthesis pathway presented in the review forms a subset of the corresponding pathway that was inferred from
MetaCyc; our software inferred all but one of the steps
that were identified in the review.
5.1.6 Lipopolysaccharide biosynthesis. PathoLogic
found evidence for almost every step in lipid-A and
(KDO)-lipid-A biosynthesis. It did not infer any steps
from O-antigen biosynthesis (from E. coli), although the
review mentions two enzymes that are possibly present.
PathoLogic inferred the enterobacterial common antigen
synthesis pathway on the basis of a single-step catalyzed
by fucosyltransferase, though there are different kinds
of fucosyltransferases and it appears that in this case
PathoLogic may have linked the polypeptides to the
wrong reaction.
5.1.7 Nucleotide biosynthesis. PathoLogic inferred
complete pathways for the synthesis of PRPP and UMP.
According to the review, although H. pylori is known
experimentally to be able to synthesize IMP from PRPP,
evidence for only two of the ten relevant enzymes
has been found in the genome. Similarly, our software
identified only two steps along the path to IMP in
the purine biosynthesis pathway. The remainder of that
pathway, which synthesizes GMP and AMP from IMP,
is complete, as indicated by the review. Phosphorylation
to GTP and ADP in our PGDB is part of another, larger
nucleotide metabolism pathway: all relevant steps are
inferred by PathoLogic. The review states that guanylate
kinase activity has not been observed experimentally
and speculates that the corresponding gene may not be
expressed normally, and that the syntheses of GTP and
ITP may follow an alternative unknown pathway instead.
If true, this would be an example of genomic data
leading PathoLogic to make incorrect inferences, since
PathoLogic has no way of determining which genes are
not expressed.
The review states that H. pylori has limited pyrimidine
salvage capability, but active purine salvage pathways.
Uracil metabolism has been observed experimentally,
but the gene has not been identified. That step is part
of our larger pyrimidine ribonucleotide/ribonucleoside
metabolism pathway, and is accordingly not inferred.
The metabolism of adenine, guanine and hypoxanthine,
part of our larger nucleotide metabolism pathway, are all
completely inferred.
The synthesis of deoxyribonucleotides and other interconversions among them were all inferred (some as part
of a larger deoxypyrimidine nucleotide/side metabolism
pathway), to the extent that they are identified in the review article.
5.1.8 Respiratory chains. In MetaCyc, there is no
complete respiratory chain pathway. Rather, three ‘pathways’ collections of reactions involving electron donors
(for anaerobic and aerobic respiration) and electron acceptors (for anaerobic respiration). The electron acceptors
pathway and the aerobic respiration electron donors
pathway were inferred. The anaerobic electron donors
pathway was inferred in HpyCyc–2A, but was subsequently classified as a false-positive pathway, because the
only reactions PathoLogic identified for it were shared
by both electron donor pathways, and the name matcher
failed to find the match from the quinone-reactive Ni/Fe
hydrogenase protein to the hydrogenase reaction.
The ubiquinone biosynthesis and polyisoprenoid
biosynthesis pathways are both inferred from the presence of the three genes referenced in the review article.
The menaquinone biosynthesis pathway is not inferred.
Although the review article states that menaquinones are
experimentally known to be present in H. pylori, it does
not cite any genomic evidence.
5.1.9 Catalase and superoxide dismutase. The pathway for the removal of superoxide radicals, which consists
of superoxide dismutase and catalase was completely
inferred by PathoLogic, in agreement with the review
article.
5.1.10 Nitrogen sources. According to the review,
H. pylori uses urea, ammonia, and amino acids as nitrogen sources. Our results are entirely consistent with
that analysis. Amino acid metabolism was addressed in
a previous section. Ammonia is incorporated via a glutamine biosynthesis pathway that matches the one shown
in the review (though the version in the review looks
somewhat different). The urease genes were matched to
the appropriate reaction, which is the only step inferred in
a ureide degradation pathway.
6 DISCUSSION
On the whole, the HpyCyc–2B results match well with
the analysis in the review article. Since our enzyme-toreaction matching procedure is fairly conservative, we
did not expect it to make many incorrect matches (which
would cause pathways to be inferred that were known to
be missing), but due to incomplete annotation and the lack
of uniformity in product names in the Genbank file, we
anticipated that the program would fail to link some genes
to their reactions. The Genbank file for H. pylori contains
few or no EC numbers, so the only way our program can
link enzymes to reactions is by matching enzyme names.
The approximately 10% of the enzyme names that had to
be matched manually may sound like a large group, but
721
S.M.Paley and P.D.Karp
this task is not as onerous as it sounds. Since PathoLogic
extracts from the total list of unmatched genes only those
that are probable enzymes, it is relatively simple to run
down that list and determine those that are likely to match
reactions. Rather, we think it is a significant achievement
that so many genes can be matched correctly to reactions
with no manual intervention.
Table 2 divides the list of pathways that PathoLogic
inferred (or failed to infer) into several categories:
• Consistent: Pathways that were inferred or partially
inferred in a manner consistent with the description of
genomic evidence in the review article. We included
in this category MetaCyc pathways predicted to be
present but for which some smaller branch of the
pathway is probably absent.
• Partially consistent: Pathways that are partially consistent with the description in the review article but are
missing some steps that should have been found (i.e.
the review article says the genomic evidence for those
steps is present).
• Poor topological match: Pathways for which some
steps were correctly inferred but whose topology does
not match pathways described in the review.
• Not reported: Pathways inferred by PathoLogic but not
addressed in the review article (for example, the review article did not discuss many cofactor biosynthesis pathways that are undoubtedly present in the organism). We believe these are correct PathoLogic predictions for pathways whose presence were overlooked by
Marais et al.
• False positive: Pathways inferred by PathoLogic that
are probably not present in the organism.
• Missing: Pathways described by the review that were
not inferred by PathoLogic (or, in one case, inferred
based on incorrect information).
Note that the distinctions between some of these categories are not always clearcut, in part because the definition of what forms a pathway is not standardized, and in
part because the review does not completely specify many
of the pathways. Also, it is difficult to judge whether a
pathway that is not mentioned in the review is in fact missing from the organism.
By switching from using EcoCyc to using MetaCyc as
the reference database, we hoped to expand the range
of pathways that we could predict. We also hoped that
with more pathway variants defined, we would be better
able to select the correct sequence of reactions from
among several similar alternatives, and that therefore
holes in pathways could be more reliably attributed to
722
missing gene assignments (or nonfunctional pathways
in the organism). The PathoLogic–2 predictions have
certainly expanded to cover aspects of metabolism not
seen in E. coli (though not to a large extent, probably
because E. coli and H. pylori are fairly closely related).
It is unclear whether the frequent pathway holes in the
predicted pathways are due to enzymes that will later be
identified in the genome, or are due to pathway variants
that do not exist in MetaCyc.
Although we anticipated that PathoLogic would infer a
number of pathways (generally with low evidence scores)
that were not actually present in the organism, we did
not expect quite as many of these as were observed
in HpyCyc–2A. Nor did we anticipate how much of a
distraction these false-positive pathways would be to a
user trying to make sense of the resulting PGDB as a
whole. The PathoLogic–2 algorithm for weeding out such
pathways and producing HpyCyc–2B, while conservative
(because we did not want to risk discarding important
pathways), will add significantly to the comprehensibility
of the PGDBs that PathoLogic generates.
6.1
Comparison with other pathway prediction
algorithms
We were interested to see whether other available
metabolic reconstruction systems such as KEGG (Goto et
al., 1997; Bono et al., 1998) and WIT (Selkov et al., 1997)
were able to avoid the problems we encountered in attempting to predict pathways that should be present while
avoiding false-positive pathway prediction. Although we
did not perform an exhaustive analysis of the contents
of other pathway databases, a sample of comparisons
suggests the other projects make significant numbers of
false-positive predictions. Comparisons are hampered by
two factors. First is that neither of these projects have
published precise descriptions of their pathway-prediction
algorithms.
The second is that although the KEGG developers
state that they use EC numbers to match enzymes to
pathways (Bono et al., 1998), this description is probably
incomplete because many KEGG pathway maps contain
enzymes that do not have EC numbers, and KEGG
highlights those enzymatic steps as occurring in certain
species. In addition, the KEGG system does not actually
make pathway predictions. The KEGG system simply
highlights, on a WWW page describing a pathway map,
which set of reactions in that map are present in a selected
organism. KEGG does not actually make a decision
as to whether the pathway is present in the organism
or not: every possible pathway is available for every
organism. The user must decide whether a pathway is
present and, as we have seen, that decision is not an
easy one because false-positive evidence often exists for
many pathways. For example, the KEGG pathway maps
Evaluation of computational pathway predictions
Table 2. The list of pathways in HpyCyc–2B categorized by how they compare with the review article. See the text for a description of the different categories
Consistent
(38 pathways)
Partially consistent
(12 pathways)
Poor topological match
(4 pathways)
Not reported
(26 pathways)
False-Positive
(13 pathways)
Missing
(five pathways)
glycolysis
gluconeogenesis
Entner–Doudoroff pathway
glyoxylate cycle
tryptophan biosynthesis
glycine biosynthesis
glutamate biosynthesis
cysteine biosynthesis
homoserine biosynthesis
valine biosynthesis
leucine biosynthesis
glutamate degradation
L-serine degradation
PRPP biosynthesis
pyrimidine biosynthesis
purine biosynthesis
nucleotide metabolism
polyisoprenoid biosynthesis
ubiquinone biosynthesis
pentose phosphate pathway, oxidative branch
pentose phosphate pathway, non-oxidative branch
reductive citric acid pathway
threonine biosynthesis
serine biosynthesis
glutamine biosynthesis
lysine and diaminopimelate biosynthesis
chorismate biosynthesis
methionine biosynthesis from homoserine
isoleucine biosynthesis
L-alanine degradation
arginase degradation pathway
fatty acid elongation (saturated)
fatty acid elongation (unsaturated)
deoxypyrimidine nucleotide/side metabolism
deoxyribonucleotide metabolism
aerobic respiration electron donors
anaerobic respiration electron acceptors
removal of superoxide radicals
alanine biosynthesis
tyrosine biosynthesis
asparagine degradation
phospholipid biosynthesis
KDO biosynthesis
propionate fermentation
proline biosynthesis
arginine proline degradation
fatty acid oxidation pathway
fatty acid biosynthesis (initial steps)
lipid-A-precursor biosynthesis
purine and pyrimidine metabolism from
Mycoplasma pneumoniae
ureide degradation
pyrimidine ribonucleotide/ribonucleoside metabolism
purine salvage from Halobacterium salinarium
galactose, galactoside and glucose catabolism
glycerol metabolism
UDP-glucose conversion
ppGpp metabolism
phosphatidic acid synthesis
acetate utilization
formylTHF biosynthesis
anaerobic respiration
colanic acid biosynthesis
cyanate catabolism
biotin biosynthesis
polyamine biosynthesis
folic acid biosynthesis
thiamine biosynthesis
aerobic glycerol fermentation
UDP-N-acetylglucosamine biosynthesis
thioredoxin pathway
(deoxy)ribose phosphate metabolism
methyl-donor molecule biosynthesis
mannose and GDP-mannose metabolism
glyceraldehyde 3-phosphate catabolism
biosynthesis of proto- and siroheme
pyridine nucleotide biosynthesis
pantothenate and coenzyme A biosynthesis
riboflavin, FMN and FAD biosynthesis
pyridoxal 5’-phosphate biosynthesis
peptidoglycan biosynthesis
galactose metabolism
Calvin cycle
ectoine synthesis
sucrose biosynthesis
methionine degradation
threonine catabolism
pentose phosphate pathway from Mycoplasma pneumoniae
catechol ortho-cleavage
protocatechuate ortho-cleavage
proline degradation
ethylene biosynthesis from methionine
leucine degradation (two pathways)
aspartate biosynthesis
asparagine biosynthesis
menaquinone biosynthesis
anaerobic respiration electron donors
enterobacterial common antigen biosynthesis
show evidence for two large photosynthetic pathways
in H. pylori (PathoLogic–2 predicts no photosynthetic
pathways), despite the low probability of such a pathway
being present in the organism.
In the case of WIT, it is not even clear if the pathway
prediction process is automated or manual. WIT does
723
S.M.Paley and P.D.Karp
seem to be much more selective in its pathway predictions.
It did not predict any obviously incorrect photosynthetic
pathways for H. pylori. However, it also failed to predict
such pathways as glycolysis or the Entner–Doudoroff
pathway, both of which are almost certainly present
(Marais et al., 1999).
7 CONCLUSIONS
This study validates the usefulness of PathoLogic as a
tool for metabolic analysis of an organism’s annotated
genome. If the reference pathway DB used by PathoLogic
from the EcoCyc DB is expanded to the larger MetaCyc
DB, the number of true positive pathways predicted in
H. pylori increases. In addition, the number of falsepositive predictions increases. False-positive predictions
can be decreased by means of the scoring heuristics
in the PathoLogic–2 algorithm, which give preference
to pathways containing enzymes that are not used in
other pathways in the organism. The study also shows
that computational pathway analysis by PathoLogic–2
produces results that have accuracy comparable to that of
analyses by expert biochemists, but that exceed the expert
analyses in comprehensiveness.
Furthermore, the generation of HpyCyc requires only a
few hours of manual effort and computer processing. The
result of the computational analysis is not just a set of predictions, but also a sophisticated software environment for
exploring and building on those predictions. Some manual
polishing is required in order to make the generated PGDB
truly reflective of the organism it describes, but the PGDB
generated by PathoLogic provides a valuable and accurate
foundation to support further metabolic research.
ACKNOWLEDGMENTS
The first EcoCyc-based pathway analysis of H. pylori was
joint work with Jean-Francois Tomb and Karen Nelson.
This work was supported by grant 1-R01-RR07861-01
from the National Center for Research Resources.
REFERENCES
Ashburner,M., Ball,C., Blake,J., Botstein,D., Butler,H., Cherry,J.,
Davis,A., Dolinski,K., Dwight,S., Eppig,J., Harris,M., Hill,D.,
724
Issel-Tarver,L., Kasarskis,A., Lewis,S., Matese,J., Richardson,J.,
Ringwald,M., Rubin,G. and Sherlock,G. (2000) Gene Ontology:
Tool for the unification of biology. Nature Genet., 25, 25–29.
Bairoch,A. (1996) The ENZYME databank in 1995. Nucleic Acids
Res., 24, 221–222.
Bono,H., Ogata,H., Goto,S. and Kanehisa,M. (1998) Reconstruction
of amino acid biosynthesis pathways from the complete genome.
Genome Res., 8, 203–210.
Goto,S., Bono,H., Ogata,H., Fujibuchi,W., Nishioka,T., Sato,K.
and Kanehisa,M. (1997) Organizing and computing metabolic
pathway data in terms of binary relations. Pacific Symposium
on Biocomputing ’97. pp. 175–186.
Karp,P. (2001) Many Genbank entries for complete microbial
genomes violate the Genbank standard. Comparative and Functional Genomics, 2, 25–27.
Karp,P., Krummenacker,M., Paley,S. and Wagg,J. (1999) Integrated
pathway/genome databases and their role in drug discovery.
Trends Biotechnol., 17, 275–281.
Karp,P. and Paley,S. (1994) Representations of metabolic knowledge: Pathways. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. Altman,R.,
Brutlag,D., Karp,P., Lathrop,R. and Searls,D. (eds), AAAI Press,
Menlo Park, CA, pp. 203–211.
Karp,P., Riley,M., Paley,S. and Pellegrini-Toole,A. (2002a) The
MetaCyc database. Nucleic Acids Res., 30, 56–58.
Karp,P., Riley,M., Saier,M., Paulsen,I., Paley,S. and PellegriniToole,A. (2002b) The EcoCyc database. Nucleic Acids Res., 30,
59–61.
Marais,A., Mendz,G., Hazell,S. and Megraud,F. (1999) Metabolism
and genetics of Helicobacter pylori: the genome era. Microbiol.
Mol. Biol. Rev., 63, 642–674.
Selkov,E., Galimova,M., Goryanin,I., Gretchkin,Y., Ivanova,N., Komarov,Y., Maltsev,N., Mikhailova,N., Nenashev,V., Overbeek,R.,
Panyushkina,E., Pronevitch,L. and Jr,E.S. (1997) The metabolic
pathway collection: an update. Nucleic Acids Res., 25, 37–38.
Tomb,J.-F., White,O., Kerlavage,A.R., Clayton,R.A., Sutton,G.G., Fleischmann,R.D., Ketchum,K.A., Klenk,H.P.,
Gill,S., Dougherty,B.A., Nelson,K., Quackenbush,J., Zhou,L.,
Kirkness,E.F., Peterson,S., Loftus,B., Richardson,D., Dodson,R., Khalak,H.G., Glodek,A., McKenney,K., Fitzgerald,L.M., Lee,N., Adams,M.D., Hickey,E.K., Berg,D.E.,
Gocayne,J.D., Utterback,T.R., Peterson,J.D., Kelley,J.M., Cotton,M.D., Weidman,J.M., Fujii,C., Bowman,C., Watthey,L.,
Wallin,E., Hayes,W.S., Borodovsky,M., Karp,P.D., Smith,H.O.,
Fraser,C.M. and Venter,J.C. (1997) The complete genome
sequence of the gastric pathogen Helicobacter pylori. Nature,
388, 539–547.