From molecular activities and processes to biological function
27/12/00 20:12
To appear in Briefings in Bioinformatics
Jacques van Helden (1,2), Avi Naim (1), Christian Lemer (2),
Renato Mancuso (1), Matthew Eldridge (1,&) and
Shoshana J. Wodak (1,2,*)
(1) European Bioinformatics Institute (EBI). Genome
Campus - Hinxton Cambridge CB10 1SD - UK. Email:
{jvanheld, naima, shosh}@ebi.ac.uk
(2) Service de Conformation des Macromolécules
Biologiques. Université Libre de Bruxelles. 50 av. F.D.
Roosevelt. B-1050 Bruxelles. Belgium. Email:
{jvanheld,shosh}@ucmb.ulb.ac.be
(&) Present address: Synomics Ltd, Compass House,
Vision Park, Chivers Way, Histon, Cambridge CB4
9AD, Email: Eldridge@synomics.com
* Corresponding Author: Shoshana J. Wodak, email:
[email protected];[email protected]
Running title: From molecular activities and
processes to biological function
Keywords: Database, metabolism, regulation,
networks, pathways, process, function
This paper describes how biological function
can be represented in terms of molecular activities
and processes. It presents several key features of
a data model that is based on a conceptual
description of the network of interactions between
molecular entities within the cell and between
cells. This model is implemented in the aMAZE
database that presently deals with information on
metabolic pathways, gene regulation, sub- or
supra-cellular locations, and transport. We show
that this model constitutes a useful generalisation
of data representations currently implemented in
metabolic pathway databases, and that it can
furthermore include multiple schemes for
categorising and classifying molecular entities,
activities, processes and localisations. In
particular, we highlight the flexibility offered by our
system in representing multiple molecular
activities and their control, in viewing biological
function at different levels of resolution and in
updating this view as our knowledge evolves. The
aMAZE project Web site is at
http://www.ebi.ac.uk/research/pfbp/.
Introduction
“Fonction: action, rôle caractéristique d’un élément,
d’un organe, dans un ensemble (souvent opposé à
structure)”
(Robert, 1982)
With the sequences of over one hundred complete
genomes of species ranging from bacteria to human
becoming available, a major challenge that we face is
the interpretation of this information in terms of the
biological function of the genes and proteins encoded
in these sequences.
This is a monumental task, first of all, because
biological function is itself a very complex notion. A
witness to this complexity is the fact that different
meanings are given to function by different authors.
These meanings often depend on the level (molecular,
cellular, organism) at which the author’s viewpoint and
research interests reside. Some attempted general
definitions of gene function, like that by Rison et al.
(2000), which states that function is “a capability that a
physical gene product (or gene product group) carries
as a potential” are not really operational.
More rigorous and meaningful definitions, such as
those given by Karp (2000), incorporate the fact that
biological function needs to be defined at different
levels simultaneously; e.g. not only at the molecular
level, but also in a larger context. This context
represents the myriad of ways in which the individual
genes and gene products interact, physically or
otherwise, to give rise to the living cell, and can
furthermore be extended to whole organisms where
millions of cells interact.
The success in deciphering biological function will
thus depend on our ability to represent, in a formal
and rigorous way, information on individual molecular
entities and on their interactions at the molecular,
cellular and multicellular levels. Currently, the main
resources for information on biological function are
sequence databases such as SWISS-PROT (Bairoch
& Apweiler, 2000), GenBank (Benson et al., 2000), or
the EMBL data library (Baker et al., 2000). Although
these databases are of undeniable utility for
consultation by humans, they represent information on
function as largely unstructured textual descriptions,
which are not readily amenable to computational
manipulation for validation, comparison and more
sophisticated analyses of function (see discussion by
Karp, 2000; van Helden et al., 2000).
In response to these limitations, several systems
have been devised for representing and categorising
biological function in a more rigorous way, which also
reflects the various levels of the molecular and cellular
organisation (for review, see (van Helden et al.,
2000)). Roughly speaking, these systems can be
divided into 3 categories. One category consists of
comprehensive catalogues, often referred to as ‘bio-ontologies’,
whose main purpose is to ensure uniform
annotation and to facilitate sharing of information
between different databases. These catalogues
typically contain a vocabulary of terms and some
specifications of their meaning, as well as hierarchical
classifications of the terms defined in the controlled
vocabulary (Ashburner et al., 2000; Mewes et al.,
1997; Riley, 1993; Riley, 1998), for review see (Rison
et al., 2000).
The second category comprises systems that
represent information on molecular interactions. These
databases are often restricted to a specific type of
interaction: enzymatic catalysis in BRENDA
(Schomburg et al., 1990-1995), protein-protein
interactions in DIP (Xenarios et al., 2000), protein-DNA
interactions in Transfac (Wingender et al., 2000) and
RegulonDB (Salgado et al., 1999; Salgado et al.,
2000). CSNdb (Takai-Igarashi et al., 1998) combines
different types of interactions involved in signal
transduction pathways, but its structure does not allow
storing information on pathways as such. Instead, all
possible pathways are obtained as the result of a path
finding algorithm, and displayed automatically.
The third category comprises systems that, in
addition to categorising molecules and activities,
represent information on molecular interactions
together with their organisation into processes or
pathways. Such systems have been implemented in
databases dealing mainly with metabolic pathways,
such as EcoCyc/MetaCyc (Karp et al., 2000), KEGG
(Kanehisa & Goto, 2000) and WIT/EMP (Overbeek et
al., 2000). The database BIND (Bader & Hogue, 2000),
which primarily represents molecular interactions, also
handles information on signal transduction pathways.
Borrowing from the analogy with language (Bray,
1997), the bio-ontologies can be thought of as
defining biological function at the semantic level. But
they provide no syntax or grammar rules that define
how words can be associated to form sentences. Such
rules are however required to ‘speak’ the language of
biological function, and all the power of this language
will be necessary if we want computers to help us
unravel the complexity of living cells.
Recently, we reported a general data model, based
on a conceptual description of the network of
interactions between molecular entities, which has the
ambition of embodying such rules (van Helden et al.,
2000). This data model is implemented in the aMAZE
database, which can deal with information on
metabolic pathways, gene regulation, sub-cellular
locations and transport. aMAZE stands for ‘a maze’:
the labyrinth of interactions between the myriad of
molecular entities in the cell.
In this paper we provide a more detailed
description of several key aspects of our model. We
show that it represents a useful generalisation and an
extension of data models underlying metabolic
databases such as those in EcoCyc/MetaCyc (Karp et
al., 2000), and that it goes further than the bio-ontologies
towards meeting the requirements of
representing biological function in the computer. In
particular we highlight the flexibility offered by a
system such as ours in representing multiple molecular
activities and their control, in viewing biological
function at different levels of resolution and in updating
this view as our knowledge progresses.
A data model for representing
networks of interacting molecules
A schematic representation of the class hierarchy
currently implemented in the aMAZE database is
shown in Figure 1. High up in this hierarchy are four
major classes: BiochemicalEntity, Interaction,
Process/Pathway and Compartment (we denote class
names in italics and leading uppercase letters). In what
follows we describe how this subdivision is used to
represent the different levels of molecular and cellular
function in a rigorous yet flexible manner.
Separating apples and oranges
An essential feature of the aMAZE data model is
the distinction between two fundamental classes,
BiochemicalEntity and Interaction.
BiochemicalEntity represents structural units.
These can be complete molecules (such as
metabolites or proteins), part of molecules (such as
genes or regulatory elements in non-coding regions),
or supra molecular assemblies (e.g. a protein complex,
or the ribosome). The attributes of an object in the
BiochemicalEntity class are mostly restricted to its
structural properties: the sequence of a polypeptide,
the position of a gene on the chromosome, the
SMILES formula of a compound and so on.
The Interaction class contains objects describing
molecular activities. It is subdivided into a set of
subclasses specifying particular types of activities such
as Reaction, Expression, TranscriptionalRegulation,
and Assembly/Disassembly. The originality of our data
model is that interactions are stored as fully
instantiated objects in different subclasses of the class
Interaction (Figure 1). Each interaction is characterised
by its particular list of inputs and outputs. Thus, a
Reaction has as attributes a set of substrates (the
inputs) and products (the outputs). In an analogous
manner, Assembly/Disassembly has polypeptides and
compounds (e.g. a prosthetic group) as inputs and a
multi-molecular complex as output, whereas
Expression has a gene as input and a polypeptide as
output. All these interactions have in common the fact
that both their inputs and outputs are sets of entities,
and are regrouped under a common class
Transformation.
This contrasts with another subclass of
interactions, which have as outputs other interactions
rather than entities. A good example is the class
ReactionCatalysis, which represents the action of a
protein (enzyme) in accelerating a chemical reaction.
This is illustrated in Figure 2 using as example the
uridylylation reaction of the protein PII, catalysed by
the enzyme uridylyltransferase. Figure 2 also shows
that this catalysis is itself activated by alpha-ketoglutarate.
This Activation is also an activity exerted
by a BiochemicalEntity (alpha-ketoglutarate) on an
Interaction (the Catalysis). Interactions having other
interactions as output are denoted as Control
interactions in our data model (van Helden et al.,
2000). An attribute of a Control is the sign (+ or -),
indicating whether its effect is to enhance or reduce
the target interaction.
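The distinction between entities, transformations and controls can be
illustrated with a small sketch. This is not the aMAZE schema itself
(which the paper describes as implemented in Java): the Python class
and attribute names below (inputs, outputs, source, target, sign) are
assumptions chosen to mirror the text, and only the uridylylation
example of Figure 2 is encoded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BiochemicalEntity:
    """Structural unit: compound, gene, polypeptide, complex, ..."""
    name: str


@dataclass
class Interaction:
    """Molecular activity, stored as an object in its own right."""
    name: str


@dataclass
class Transformation(Interaction):
    """Interaction whose inputs and outputs are both sets of entities."""
    inputs: List[BiochemicalEntity] = field(default_factory=list)
    outputs: List[BiochemicalEntity] = field(default_factory=list)


@dataclass
class Control(Interaction):
    """Interaction exerted by an entity on another interaction."""
    source: Optional[BiochemicalEntity] = None
    target: Optional[Interaction] = None
    sign: str = "+"  # "+" enhances the target, "-" reduces it


# Entities involved in the uridylylation of PII (names taken from the text).
pii = BiochemicalEntity("PII")
utp = BiochemicalEntity("UTP")
pii_ump = BiochemicalEntity("uridylylated PII")
ppi = BiochemicalEntity("PPi")
utase = BiochemicalEntity("uridylyltransferase")
akg = BiochemicalEntity("alpha-ketoglutarate")

# A reaction is a transformation: entities in, entities out.
uridylylation = Transformation("uridylylation of PII", [pii, utp], [pii_ump, ppi])

# A catalysis is a control whose target is the reaction.
catalysis = Control("catalysis of uridylylation", utase, uridylylation, "+")

# An activation is a control whose target is itself a control.
activation = Control("activation of catalysis", akg, catalysis, "+")
```

Because a Control is itself an Interaction, controls can be stacked to
any depth without changing the classes.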
Our data model thus establishes a clear distinction
between structure and activity, thereby addressing
concerns about the necessity of separating apples
from oranges when describing the function of genes
and gene products (Rison et al., 2000). The
advantages of this model have been discussed in
detail elsewhere (van Helden et al., 2000). One of its
main attractive features is that it defines the activities
of a particular structural entity (compound, gene or
protein) within a context, rather than within the entity
object itself. This allows for a flexible description of
multiple activities in which the same molecule can be
involved.
One might argue that multiple activities could
equally well be represented as a multivalue attribute of
the entity object itself. However, the instantiation of full
objects for each interaction allows the description of
controls that could not be represented by multivalue
attributes.
This is illustrated by the example of
Uridylyltransferase (Figure 2). This bifunctional
enzyme catalyses both the uridylylation and the de-uridylylation
of the protein PII. These reactions are not
the inverse of each other, as during uridylylation, UTP
is consumed and pyrophosphate (PPi) is produced,
whereas de-uridylylation consumes orthophosphate
(Pi) and produces UDP. Combining the two reactions
would thus result in a futile cycle, and regulation is
hence crucial to avoid wasting energy. This regulation
is mediated by alpha-ketoglutarate, which activates
only uridylylation, and not de-uridylylation.
In a classical data model, one would describe the
activation by alpha-ketoglutarate as acting directly on
the protein uridylyltransferase. But this would not allow
specifying that only one of the 2 catalyses performed
by this protein is selectively activated. This can
however be readily achieved with our representation,
since we instantiate two separate Catalysis objects, so
that the Activation can selectively operate only on one
of them (Figure 2).
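A minimal sketch of this point, assuming simplified Catalysis and
Activation records (the field names and the helper function are
hypothetical, not taken from the aMAZE implementation): the enzyme
appears in two Catalysis objects, and the Activation refers to only
one of them. The effector is the one named in the description of
Figure 2.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Catalysis:
    enzyme: str
    reaction: str


@dataclass(frozen=True)
class Activation:
    effector: str
    target: Catalysis  # the control points at one catalysis, not at the enzyme


UTASE = "uridylyltransferase"

# One enzyme, two separately instantiated catalysis objects.
cat_uridylylation = Catalysis(UTASE, "uridylylation of PII")
cat_deuridylylation = Catalysis(UTASE, "de-uridylylation of PII")

# The activation selects only one of the two catalyses.
activations = [Activation("alpha-ketoglutarate", cat_uridylylation)]


def activated_catalyses(effector: str, controls: List[Activation]) -> List[Catalysis]:
    """Return the catalyses that a given effector activates."""
    return [c.target for c in controls if c.effector == effector]


targets = activated_catalyses("alpha-ketoglutarate", activations)
```

Had the activation been attached to the enzyme, both catalyses would
have been affected, since they share the same enzyme value.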
Apples and apple trees
A third important class of the aMAZE data model is
Process/Pathway. Process/Pathway is a collection of
interconnected pathway elements. Graph theory would
describe it as a graph (network), whose nodes are
pathway elements, connected by pathway arcs.
Process/Pathway elements can be of different types.
In the simplest case these elements consist of
biochemical entities and interactions (Figure 3A).
Thus, the BiochemicalEntity and Interaction classes
described above derive from the common
PathwayBuildingBlock super-class (Figure 1).
In addition, pathways can be generalised to allow
higher level representations, for example to represent
how different biochemical pathways are interconnected
(pathways of pathways). Figure 3B illustrates such a
super-pathway for the biosynthesis of amino acids
derived from aspartate. This yields a synthetic view of
the interconnections between the methionine, lysine,
threonine and isoleucine pathways. In this super-pathway,
some PathwayBuildingBlock objects are other
pathways, and thus the Process/Pathway class is itself
a subclass of PathwayBuildingBlock, as indicated in
Figure 1.
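The nesting of pathways can be sketched as a composite structure.
The sketch below is an assumption-laden illustration, not the aMAZE
classes: pathway arcs are omitted, the elements attribute and the
leaf_elements method are hypothetical, and the steps marked
"placeholder" stand in for reactions that the text does not list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PathwayBuildingBlock:
    name: str


@dataclass
class Interaction(PathwayBuildingBlock):
    pass


@dataclass
class Pathway(PathwayBuildingBlock):
    # Elements may be interactions (or entities) or other pathways.
    elements: List[PathwayBuildingBlock] = field(default_factory=list)

    def leaf_elements(self) -> List[PathwayBuildingBlock]:
        """Expand nested pathways down to their non-pathway elements."""
        leaves: List[PathwayBuildingBlock] = []
        for element in self.elements:
            if isinstance(element, Pathway):
                leaves.extend(element.leaf_elements())
            else:
                leaves.append(element)
        return leaves


threonine = Pathway("threonine biosynthesis", [
    Interaction("aspartate phosphorylation"),
    Interaction("threonine-specific steps (placeholder)"),
])
lysine = Pathway("lysine biosynthesis", [
    Interaction("lysine-specific steps (placeholder)"),
])

# A pathway of pathways: the building blocks are themselves pathways.
super_pathway = Pathway("amino acids derived from aspartate", [threonine, lysine])

high_level_view = [e.name for e in super_pathway.elements]
detailed_view = [e.name for e in super_pathway.leaf_elements()]
```

The same object thus supports a coarse view (two sub-pathways) and a
detailed view (their individual steps).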
This also illustrates the flexibility of our description,
as classes can be readily made more general or more
specific in order to represent information on different
types of biological processes, which may not
necessarily be at the molecular level.
It should be realised that the information stored in
the Process/Pathway class refers to specific portions
(subgraphs) of the global network of interactions,
whose boundaries are defined according to some
criteria. So far these portions correspond mainly to
metabolic processes as defined by researchers on the
basis of biochemical criteria or experimental
observations. Thus in general, a metabolic process or
‘pathway’ has its boundaries defined by the essential
compounds that are its inputs and outputs.
Alternatively, a process may be characterised by the
co-regulation at the transcription level of a group of
genes, and have its boundaries defined by the
feedback control of the end product on the first step in
the process. Importantly, current definitions are based
on observations made on a limited number of model
organisms, and it is already becoming clear that they
will display significant variations in other organisms.
How the global network of interactions is carved up
into processes that make biological sense may
therefore depend on the organism, tissue or cellular
compartment. The data structure of aMAZE readily
allows for that. In addition, by enabling ‘views’ of the
network at different hierarchical levels, and developing
query tools for flexibly cropping across it, new ways of
carving up the network could be tested in the future.
Orchards and orange groves
An important class high up in our hierarchy
describes compartments (Figure 1), which can be
defined at different levels: sub-cellular compartment,
cell type, tissue, organ, and so on. Organisms can
themselves be considered as a special type of
compartment. Indeed, it is not uncommon to observe a
metabolic pathway shared between different
organisms (symbionts). In our data model, the class
Organism is thus derived from Compartment, exactly
in the same way as SubcellularCompartment,
CellType, Tissue, and Organ. Any BiochemicalEntity,
Interaction, or Process/Pathway can be linked to a
given combination of objects in the Compartment class
in order to describe where it occurs (e.g. plasma
membrane of T-cells in Homo sapiens). This
information is stored in the separate class
LocationSpecification (not shown in Figure 1 for the
sake of clarity), which contains one slot for each level of
Compartment.
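As an illustration, a location specification with one slot per
compartment level might look as follows. The slot names and the
describe method are assumptions made for this sketch; only the class
names and the T-cell example come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Compartment:
    name: str


# Each level is derived from Compartment, including Organism.
class SubcellularCompartment(Compartment):
    pass


class CellType(Compartment):
    pass


class Tissue(Compartment):
    pass


class Organ(Compartment):
    pass


class Organism(Compartment):
    pass


@dataclass(frozen=True)
class LocationSpecification:
    """One optional slot per compartment level."""
    subcellular_compartment: Optional[SubcellularCompartment] = None
    cell_type: Optional[CellType] = None
    tissue: Optional[Tissue] = None
    organ: Optional[Organ] = None
    organism: Optional[Organism] = None

    def describe(self) -> str:
        slots = [self.subcellular_compartment, self.cell_type,
                 self.tissue, self.organ, self.organism]
        return " / ".join(c.name for c in slots if c is not None)


location = LocationSpecification(
    subcellular_compartment=SubcellularCompartment("plasma membrane"),
    cell_type=CellType("T-cell"),
    organism=Organism("Homo sapiens"),
)
description = location.describe()
```

Slots left empty simply mean that the level is unspecified for the
entity, interaction or pathway being located.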
Hierarchies of classes and objects
In the aMAZE data model, as is usual in object-oriented
data models, two main types of relationships
are represented. One is the inheritance relationship
(“is_a”) encoded in the class hierarchy, whereby an heir
class inherits the properties of its parent class. The
other is the reference relationship (“refers_to”),
operating between objects, whereby a given object
refers to another object in the database. This latter
type of relationship is encoded in the objects as
attributes.
A very wide use is made in our model of such
attributes to represent input/output relationships, which
link objects of the Interaction class to objects of other
classes. This makes it possible, for example, to handle multiple
inputs and outputs of interactions. Examples of such
relationships can be found in the pathway graph of
Figure 3A. Translation has a messenger RNA as input
and a polypeptide as output. Catalysis has a protein
(enzyme) as input and a Reaction as output, and
Reaction has several compounds as inputs and as
outputs, and so on.
Reference relationships can also be used to
describe hierarchic relations between objects. One
example is the ContainedCompartments attribute of
the Compartment class, which makes it possible to represent a
topological hierarchy, for example the fact that the
nucleosome is contained in the nucleus, itself
contained in a (Eukaryotic) cell. In this case the
referred objects are not necessarily of the same class
as the object that refers to them. Hierarchical
classifications between objects, for example, the
organism taxonomy, can also be represented using the
reference relationships, but here all the objects belong
to the same class. Such classifications are
represented in our model using the Subtypes attribute,
as detailed below.
Classification schemes
Hierarchical classifications via the Subtypes attribute
are implemented in many classes of our model.
Examples are illustrated in Figure 4, using the Unified
Modelling Language (UML) conventions.
The most familiar example for the biologist is
probably the systematic classification of organisms. In
our model, this classification is described by recursive
relations between objects of the Systematic Group
class (Figure 4a): a systematic group can include other
systematic groups (primate is a subtype of mammal;
Homo sapiens is a subtype of primate).
Such classifications apply not only to organisms,
but also to other levels of organisation such as
compartments (e.g. subcellular compartments in
Figure 4a). Note however, that compartments can also
be classified according to their containment hierarchy,
using the ContainedCompartment attribute (Figure 4a).
The relation represented is however quite distinct from
that of the Subtypes attribute, since it indicates that a
compartment is physically contained within another.
ContainedCompartment and Subtypes provide
complementary information, which can be used to
address distinct queries.
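The difference between the two kinds of reference can be sketched
with a single object type carrying two independent attributes. The
class ClassifiedObject, its attribute names and the descendants
helper are hypothetical; the example objects are those mentioned in
the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClassifiedObject:
    name: str
    subtypes: List["ClassifiedObject"] = field(default_factory=list)   # "kind of"
    contained: List["ClassifiedObject"] = field(default_factory=list)  # "located in"


def descendants(obj: ClassifiedObject, attribute: str) -> List[str]:
    """Follow one reference attribute recursively, depth first."""
    found: List[str] = []
    for child in getattr(obj, attribute):
        found.append(child.name)
        found.extend(descendants(child, attribute))
    return found


# Taxonomy: primate is a subtype of mammal, Homo sapiens of primate.
homo_sapiens = ClassifiedObject("Homo sapiens")
primate = ClassifiedObject("primate", subtypes=[homo_sapiens])
mammal = ClassifiedObject("mammal", subtypes=[primate])

# Containment: the cell contains the nucleus, which contains the nucleosome.
nucleosome = ClassifiedObject("nucleosome")
nucleus = ClassifiedObject("nucleus", contained=[nucleosome])
cell = ClassifiedObject("eukaryotic cell", contained=[nucleus])

taxonomy = descendants(mammal, "subtypes")
containment = descendants(cell, "contained")
```

A query on subtypes answers "which objects are kinds of X", whereas
a query on containment answers "which objects lie inside X"; the two
traversals never mix.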
Compounds are also classified hierarchically. In
this case a generic compound can refer to other
generic compounds, or to a specific compound,
allowing the definition of a complete taxonomy of
compounds, as illustrated in Figure 4b. The same
schema is used to represent the well-known EC
classification (Bairoch, 1994; Bairoch, 2000) (Figure
4c).
Hierarchic classifications of Pathways and
processes are also supported by the capability to use
a Pathway as a PathwayBuildingBlock to represent a
pathway of pathways (Figure 4d).
Discussion
How is Function Described in our model?
Translated into English, the definition of the French
dictionary Petit Robert (Robert, 1982) reads:
“characteristic action (role) of an element (organ)
within an ensemble (often opposed to structure)”. This
definition has the merit of highlighting that the function
of an object is always defined in reference to a given
context (an ensemble which contains this object). It
makes indeed no sense to talk about the function of an
object per se. In molecular biology, the same holds
true. The function of a molecule can only be defined by
the link between the activities exerted by this molecule
and the process in which these activities play a role.
For example, the activity of aspartate kinase III
(Escherichia coli) is to catalyse aspartate
phosphorylation, whereas its function, or rather one of
its functions, is to catalyse the first step of the
threonine biosynthesis pathway. Since this pathway is
itself an element of the super-pathway of amino acid
biosynthesis, the function of aspartate kinase III might
as well be described as “enzyme catalysing a reaction
involved in amino acid biosynthesis”.
Our definition comes close to that provided by Karp
(2000), who distinguishes between two separate notions
of function: local function and integrated function.
The first concerns the individual activity
exerted by a biological entity, whereas the second
concerns the role played by this entity in the context of
a larger system. However, while we agree on this
subdivision, we prefer to replace the term local
function by activity, since function is by definition
context dependent whereas activity need not be.
Our activity/function definition makes it very easy to
cope with multiple functions, as well as to distinguish
between two distinct mechanisms of multi-functionality.
The first concerns proteins that combine several
distinct activities, sometimes carried out
simultaneously and resulting from the existence of
several domains within the same polypeptide. The
second mechanism concerns proteins that exert a
single activity, but take part in different processes,
thereby giving rise to different functions. Examples of
both mechanisms can be found in (Jeffery, 1999).
In some sense, the recent Gene Ontology
(Ashburner et al., 2000) addresses these various
issues by combining 3 distinct classification schemes:
biological process, molecular function, and cellular
component. Each gene is classified according to these
3 schemes, and in addition, a gene can be assigned to
multiple classes within each scheme. However, since
these 3 schemes are stored independently, it is not
easy to deduce from them which molecular function is
involved in which process, and in which
compartment/location a given process takes place.
Representing the various facets of multi-functionality is
hence not straightforward with this system, and would
require adding extra dimensions to the current
description.
This contrasts with our model where each process
is defined as a collection of interactions and entities,
which are themselves linked to the different locations
where they occur. Thus, using our representation,
information on the function, or functions, of a gene
product is not obtained by looking up catalogues, but is
computed on the fly via queries to the aMAZE
database, which analyse the network of interactions in
which the gene product takes part. In particular, this
involves collecting from the database all the activities
in which the gene product is involved, and the
processes in which these activities play a role,
corresponding precisely to the definition of function
given by the Petit Robert.
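The idea of computing function on the fly can be sketched as a query
that pairs each activity of a gene product with every process that
contains it, directly or through nested processes. This is a toy
in-memory version under stated assumptions: the Activity and Process
records and the functions_of helper are hypothetical and do not
reflect the actual aMAZE query interface.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Activity:
    name: str
    actor: str  # the gene product exerting the activity


@dataclass
class Process:
    name: str
    elements: List[Union["Activity", "Process"]] = field(default_factory=list)


def contains(process: Process, activity: Activity) -> bool:
    """True if the activity occurs in the process or in a nested process."""
    for element in process.elements:
        if element is activity:
            return True
        if isinstance(element, Process) and contains(element, activity):
            return True
    return False


def functions_of(actor: str, activities: List[Activity],
                 processes: List[Process]) -> List[Tuple[str, str]]:
    """Collect (activity, process) pairs for one gene product."""
    pairs = []
    for activity in activities:
        if activity.actor != actor:
            continue
        for process in processes:
            if contains(process, activity):
                pairs.append((activity.name, process.name))
    return pairs


catalysis = Activity("catalysis of aspartate phosphorylation", "aspartate kinase III")
threonine = Process("threonine biosynthesis", [catalysis])
amino_acids = Process("amino acid biosynthesis", [threonine])

functions = functions_of("aspartate kinase III", [catalysis], [threonine, amino_acids])
```

The same activity yields one function per enclosing process, which
matches the two descriptions of aspartate kinase III given earlier.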
The aMAZE data model is not an ontology
As already mentioned above, the so-called bio-ontologies
(which have little in common with the
philosophical definition of ontology) arose in response to
the difficulty of extracting information from
human-annotated text in sequence databases. The first step in
addressing the issue was to define controlled
vocabularies, in order to deal with synonyms.
Taxonomical classifications were introduced
subsequently to describe inclusion relationships
between the terms of the controlled vocabulary.
The scope of most ‘bio-ontologies’ thus does not
extend to the description of molecular networks.
Processes can be described in these ontologies, but
this description is restricted to inclusion relationships.
Even though this description can be fine-grained, and
finally result in a list of all molecules involved in a given
process, there is no way to indicate the order in which
these molecules are interconnected to form a pathway
or network.
In stark contrast to this, the primary function of our
model is to describe cellular processes in terms of
molecular interactions, which is why the most detailed
branches of our class tree describe the different types
of interactions.
Nevertheless, our data model allows classifications
from existing bio-ontologies to be incorporated. But rather
than representing them in a rigid framework, these
classifications are stored as object hierarchies using
the Subtypes attribute, and can hence be readily
modified. In this regard, our data model offers the
possibility of storing different classification schemes in
the same database, leaving to the user the choice of
the most appropriate scheme for the particular query to
be addressed. Providing such flexibility is essential,
since having a single scheme as a unifying concept in
biology is probably not realistic, especially not in the
current state of our incomplete knowledge of the field.
Rison et al. (2000) recently characterised a series
of functional classifications in terms of breadth, depth
and resolution. This analysis cannot be used to
evaluate data models such as ours, since we restrict
our description to class hierarchies, and therefore the
levels of depth and resolution of our data model will
appear as very restricted. In fact, since classification
schemes are represented as attributes of objects
instantiated in the database, our data structure can
store classification schemes of any breadth, depth and
resolution. Since this may also be the case for some of
the other data models reviewed by Rison et al., the
conclusions reached by these authors on those
models may need to be re-examined.
Major differences with the
representations in other biological
databases
Our data organisation contrasts sharply with those
found in most sequence-oriented databases, which are
built around specific types of biochemical entities such
as genes (GenBank), or proteins (SWISS-PROT), and
store information on their activities in a few general
description fields, essentially in textual form.
Other more specialised databases, particularly
EcoCyc/MetaCyc (Karp et al., 2000), use more
elaborate data structures to represent function (Karp,
2000). They also make a clear distinction between
entities and interactions. However, except for chemical
reactions and catalysed reactions, all other types of
interactions are represented as relationships between
the objects, rather than objects in their own right, as in
our model. This is due to the fact that the majority of these
databases and their underlying data models deal
primarily with information on metabolic pathways. Our
model covers a larger variety of activities
(transcriptional regulation, attenuation, macromolecule
processing, transport, …) and processes (metabolic
regulation, signal transduction) and deals with spatial
locations and compartments, as shown above.
The very detailed ASN.1-based data specification
of BIND (Bader & Hogue, 2000) shares some features
with our model by representing entities, interactions
and pathways as distinct objects. But their model
focuses entirely on binary interactions, and it is unclear
to what extent, if at all, it represents hierarchical
relations between object classes, and whether it
enables the representation of object taxonomies, as
the aMAZE model does.
Concluding remarks
In this paper we described a data model which
represents information on molecular activities and
cellular processes in a rigorous manner, and showed
how this model differs from current schemes for
representing and categorising function. In particular,
this model makes a clear distinction between
molecular entities and activities, and extends this
distinction to a very wide spectrum of both. Furthermore,
it provides ready means for describing the physical
and functional interactions between genes and gene
products, for dealing with multiple functions and spatial
locations, for describing a large variety of processes
and for concurrently including different classification
schemes.
Our data model is clearly still evolving as our
limited experience grows and the types of data we
handle expand. The aMAZE database, which
implements this model, currently handles information
on metabolic pathways, gene regulation and transport.
Work on its extension to other types of pathways, in
particular T-cell signal transduction cascades, is in
progress. Also being worked on are modifications of
our schema, which will allow us to represent
interactions between sub-cellular compartments, cells,
tissues and whole organisms.
Availability
Additional information can be found on the Web at
the following address:
http://www.ebi.ac.uk/research/pfbp/. A version of the
aMAZE database is presently implemented in Java,
with a lower level storage in ORACLE. Access to it
over the Internet will be provided in the near future.
Acknowledgements
We thank Georges Cohen, Kirill Degtyarenko and
Gaurab Mukherjee for valuable help in annotation of
metabolic and regulatory pathways, and acknowledge
Jean Richelle, Thure Etzold, Dietmar Schomburg, and
Bruno André for useful discussions. The work
described here has been sponsored by a consortium
of industries, comprising Astra-Zeneca, Aventis,
Monsanto, Organon, and Roche. Scientists from these
companies are thanked for valuable input. We
furthermore acknowledge support from the European
Commission, Grant Nº QLRI-CT-1999-01333.
References
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D.,
Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K.,
Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P.,
Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.
C., Richardson, J. E., Ringwald, M., Rubin, G. M. &
Sherlock, G. (2000). Gene ontology: tool for the
unification of biology. The Gene Ontology
Consortium. Nat Genet 25(1), 25-9.
Bader, G. D. & Hogue, C. W. (2000). BIND--a data
specification for storing and describing biomolecular
interactions, molecular complexes and pathways.
Bioinformatics 16(5), 465-77.
Bairoch, A. (1993). The ENZYME data bank. Nucleic
Acids Res 21(13), 3155-6.
Bairoch, A. (1994). The ENZYME data bank. Nucleic
Acids Res 22(17), 3626-7.
Bairoch, A. (2000). The ENZYME database in 2000.
Nucleic Acids Res 28(1), 304-305.
Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT
protein sequence database and its supplement
TrEMBL in 2000. Nucleic Acids Res 28(1), 45-48.
Baker, W., van den Broek, A., Camon, E., Hingamp,
P., Sterk, P., Stoesser, G. & Tuli, M. A. (2000). The
EMBL Nucleotide Sequence Database. Nucleic
Acids Res 28(1), 19-23.
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J.,
Ostell, J., Rapp, B. A. & Wheeler, D. L. (2000).
GenBank. Nucleic Acids Res 28(1), 15-18.
Bray, D. (1997). Reductionism for biochemists: how to
survive the protein jungle. Trends Biochem Sci
22(9), 325-6.
Jeffery, C. J. (1999). Moonlighting proteins. Trends
Biochem Sci 24(1), 8-11.
Kanehisa, M. & Goto, S. (2000). KEGG: Kyoto
Encyclopedia of Genes and Genomes. Nucleic
Acids Res 28(1), 27-30.
Karp, P. D. (2000). An ontology for biological function
based on molecular interactions. Bioinformatics
16(3), 269-85.
Karp, P. D., Riley, M., Saier, M., Paulsen, I. T., Paley,
S. M. & Pellegrini-Toole, A. (2000). The EcoCyc and
MetaCyc databases. Nucleic Acids Res 28(1), 56-59.
Mewes, H. W., Albermann, K., Bahr, M., Frishman, D.,
Gleissner, A., Hani, J., Heumann, K., Kleine, K.,
Maierl, A., Oliver, S. G., Pfeiffer, F. & Zollner, A.
(1997). Overview of the yeast genome [published
erratum appears in Nature 1997 Jun
12;387(6634):737]. Nature 387(6632 Suppl), 7-65.
Overbeek, R., Larsen, N., Pusch, G. D., D'Souza, M.,
Selkov, E., Jr, Kyrpides, N., Fonstein, M., Maltsev, N. &
Selkov, E. (2000). WIT: integrated system for
high-throughput genome sequence analysis and
metabolic reconstruction. Nucleic Acids Res 28(1),
123-125.
Riley, M. (1993). Functions of the gene products of
Escherichia coli. Microbiol Rev 57(4), 862-952.
Riley, M. (1998). Systems for categorizing functions of
gene products. Curr Opin Struct Biol 8(3), 388-92.
Rison, S. C. G., Hodgman, T. C. & Thornton, J. M.
(2000). Comparison of functional annotation
schemes for genomes. Funct Integr Genomics 1,
56-69.
Robert, P. (1982). Le Petit Robert - dictionnaire
alphabétique et analogique de la langue française.
1982 edit (Rey-Debove, J. & Rey, A., Eds), 1. 2 vols,
Le Robert, Paris.
Salgado, H., Santos, A., Garza-Ramos, U., van
Helden, J., Diaz, E. & Collado-Vides, J. (1999).
RegulonDB (version 2.0): a database on
transcriptional regulation in Escherichia coli. Nucleic
Acids Res 27(1), 59-60.
Salgado, H., Santos-Zavaleta, A., Gama-Castro, S.,
Millán-Zárate, D., Blattner, F. R. & Collado-Vides, J.
(2000). RegulonDB (version 3.0): transcriptional
regulation and operon organization in Escherichia
coli K-12. Nucleic Acids Res 28(1), 65-67.
Schomburg, D., Salzmann, D. & Stephan, D.
(1990-1995). Enzyme handbook. 13 vols, Springer.
Takai-Igarashi, T., Nadaoka, Y. & Kaminuma, T.
(1998). A database for cell signaling networks. J
Comput Biol 5(4), 747-54.
van Helden, J., Naim, A., Mancuso, R., Eldridge, M.,
Wernisch, L., Gilbert, D. & Wodak, S. J. (2000).
Representing and analysing molecular and cellular
function using the computer [In Process Citation].
Biol Chem 381(9-10), 921-35.
Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich,
I., Matys, V., Meinhardt, T., Prüß, M., Reuter, I. &
Schacherer, F. (2000). TRANSFAC: an integrated
system for gene expression regulation. Nucleic
Acids Res 28(1), 316-319.
Xenarios, I., Rice, D. W., Salwinski, L., Baron, M. K.,
Marcotte, E. M. & Eisenberg, D. (2000). DIP: the
Database of Interacting Proteins. Nucleic Acids Res
28(1), 289-291.
Figures
[Figure 1 diagram. Class names shown: Database Object; BiochemicalEntity; Compound; Specific Compound;
Generic Compound; Proteinaceous; Protein domain; Polypeptide; Protein complex; DNA; Gene; Operon;
Regulatory Site; RNA; PathwayBuildingBlock; Interaction; Transformation; Reaction; Assembly/Disassembly;
Transport Event; Macromolecule Processing; Conformation Change; Expression; Transcription; RNA Processing;
Translation; Control; Control of Transformation; Reaction Catalysis; Transport Catalysis; Transcriptional
Regulation; Translational Regulation; Attenuation; Control of Control; Indirect Interaction; Process/Pathway;
Metabolic Pathway Step; Metabolic Pathway; Pathway of Pathways; Compartment; Subcellular Compartment;
Cell Type; Tissue; Organ; Systematic Group; Reference; Evidence; Publication; Lab Info; External DB link.]
Figure 1: The current class hierarchy of the aMAZE database.
[Figure 2 diagram. Labels shown: glnB; expression; PII; PII (UMP)n; Uridylyltransferase; alpha-ketoglutarate;
activation; catalysis (twice); reaction (twice); 2.7.7.59; n UTP; n PPi; n UDP; n Pi.]
Figure 2: Schematic representation of the action and control of a multi-functional enzyme using the aMAZE data
model. The example shown is that of uridylyltransferase, an enzyme which catalyses both the uridylylation and
de-uridylylation of the protein PII (see text for details). In this representation, objects of the Interaction class,
namely reaction, catalysis, expression and activation, are boxed. Objects of the BiochemicalEntity class are
represented by their names, printed in bold. Input/output relationships are represented by arrows.
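The representation described in this caption can be illustrated with a minimal sketch. The Python below is not the aMAZE implementation (which is stated to be in Java with ORACLE storage); the names `Entity`, `Interaction`, the `kind` field and the helper `catalysed_reactions` are hypothetical, and small-molecule substrates and products are left out for brevity. It assumes that a catalysis takes the enzyme as input and a reaction as output, so that a multi-functional enzyme is simply an entity that is the input of several catalysis objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A biochemical entity, identified here by name only (illustrative)."""
    name: str


@dataclass
class Interaction:
    """An interaction with explicit inputs and outputs (illustrative).

    Assumption: outputs may be entities (reaction, expression) or other
    interactions (a catalysis points at the reaction it controls).
    """
    kind: str
    inputs: list
    outputs: list
    label: str = ""


glnb = Entity("glnB")
pii = Entity("PII")
pii_ump = Entity("PII (UMP)n")
utase = Entity("uridylyltransferase")

uridylylation = Interaction("reaction", [pii], [pii_ump], "uridylylation of PII")
deuridylylation = Interaction("reaction", [pii_ump], [pii], "de-uridylylation of PII")

interactions = [
    Interaction("expression", [glnb], [pii]),
    uridylylation,
    deuridylylation,
    Interaction("catalysis", [utase], [uridylylation]),
    Interaction("catalysis", [utase], [deuridylylation]),
]


def catalysed_reactions(entity, interactions):
    """Return the labels of all reactions for which `entity` is the catalyst."""
    return [
        target.label
        for interaction in interactions
        if interaction.kind == "catalysis" and entity in interaction.inputs
        for target in interaction.outputs
    ]


print(catalysed_reactions(utase, interactions))
```

Under these assumptions, querying the enzyme returns both of its activities, whereas a single "function" attribute attached to the protein could hold only one.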
[Figure 3 diagram, panel (a). Frames: MetabolicPathwayStep; MetabolicPathway. Compounds: L-Aspartate; ATP; ADP;
L-Aspartyl-4-P; NADPH; NADP+; Pi; L-Aspartic semialdehyde; L-Homoserine; L-Homoserine phosphate; H2O;
L-Threonine. Proteins and genes: Aspartate kinase I homoserine dehydrogenase I; Aspartate semialdehyde
dehydrogenase; Cystathionine-gamma-synthase; Cystathionine-beta-lyase; asd; thrABC operon; thrABC mRNA.
EC numbers: 2.7.2.4; 1.2.1.11; 1.1.1.3; 2.7.1.39; 4.4.1.8. Boxed interactions: catalysis; inhibition; expression;
transcription; translation; Attenuation.]
[Figure 3 diagram, panel (b). Frame: PathwayOfPathways. Pathway nodes: Common fork; Homoserine biosynthesis;
Methionine biosynthesis; Lysine biosynthesis; Threonine biosynthesis; Isoleucine biosynthesis. Compounds:
aspartate; L-aspartic semialdehyde; L-Homoserine; L-Cysteine; L-Lysine; L-Threonine; L-Methionine;
L-Isoleucine. Interactions between nodes are labelled inhibition.]
Figure 3: Graphical representations of pathways using the aMAZE data model.
(a) The metabolic regulation pathways of threonine biosynthesis in Escherichia coli. As in Figure 2, objects of the
Interaction class, namely reaction, catalysis, expression, inhibition and attenuation, are boxed. Objects of the
BiochemicalEntity class are represented by their names, printed in bold. Input/output relationships are
represented by arrows. The diagram distinguishes each MetabolicPathwayStep (light dotted box frames) from the
entire MetabolicPathway (heavy dotted box frame).
(b) Example of a pathway of pathways. The example represents the network formed by the biosynthetic pathways
for amino acids having aspartate as precursor. In this diagram, some nodes represent whole pathways;
IndirectInteractions between pathways are labelled as inhibition (negative) or activation (positive, not shown).
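The idea that a node of one pathway can itself be a whole pathway can be sketched as a recursive structure. This is an illustrative Python sketch, not the aMAZE schema: the class `Pathway`, its `kind` and `nodes` fields and the helper `members_of_kind` are hypothetical. The pathway names are those of the pathway hierarchy in Figure 4d; attaching the aspartate phosphorylation step to threonine biosynthesis is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pathway:
    """A pathway whose nodes may themselves be pathways (illustrative)."""
    name: str
    kind: str  # "pathway of pathways", "metabolic pathway" or "metabolic pathway step"
    nodes: List["Pathway"] = field(default_factory=list)


def members_of_kind(pathway, kind):
    """Collect, depth first, the names of all nested nodes of a given kind."""
    found = []
    for node in pathway.nodes:
        if node.kind == kind:
            found.append(node.name)
        found.extend(members_of_kind(node, kind))
    return found


step = Pathway("Aspartate phosphorylation step", "metabolic pathway step")
threonine = Pathway("Threonine biosynthesis", "metabolic pathway", [step])
methionine = Pathway("Methionine biosynthesis", "metabolic pathway")
lysine = Pathway("Lysine biosynthesis", "metabolic pathway")
isoleucine = Pathway("Isoleucine biosynthesis", "metabolic pathway")

asp_family = Pathway(
    "Aspartate-derived amino acid biosyntheses",
    "pathway of pathways",
    [methionine, lysine, isoleucine, threonine],
)
biosyntheses = Pathway("Amino acid biosyntheses", "pathway of pathways", [asp_family])
metabolism = Pathway("Amino acid metabolism", "pathway of pathways", [biosyntheses])

print(members_of_kind(metabolism, "metabolic pathway"))
```

Because every level uses the same node type, the same traversal works whether the query starts from a single metabolic pathway or from the top of the hierarchy.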
[Figure 4 diagram, panel (a). Class labels: Compartment; ContainedCompartments; SubcellularCompartment; SubTypes;
SystematicGroup; SubTypes. Labels under "Subtypes relationships": Membrane; Cell membrane; Endoplasmic
reticulum membrane; Mammal; Rodent; Primate; Homo sapiens. Labels under "ContainedCompartments
relationships": Plasma membrane; Eukaryote (Systematic group); Nucleated cell (Cell Type); Cytoplasm;
Nucleus (Subcellular Compartment); Nucleus; Nucleosome.]
[Figure 4 diagram, panel (b). Class labels: Compound; GenericCompound; SubTypes; SpecificCompound. Labels under
"Subtype relationships": Sugar; Hexose; Fructose; L-Fructose; Glucose; L-Glucose; D-Glucose.]
[Figure 4 diagram, panel (c). Class labels: Catalysable; EC; SubTypes; Reaction. Labels under "Subtypes
relationships", from top to bottom:
1. Oxidoreductases
1.1 Acting on the CH-OH group of donors
1.1.1 With NAD+ or NADP+ as acceptor
1.1.1.1 Alcohol dehydrogenase; Aldehyde reductase
Alcohol + NAD+ <=> Aldehyde + NADH]
[Figure 4 diagram, panel (d). Class labels: PathwayBuildingBlock; Pathway; PathwayNodes; Interaction. Labels under
"PathwayNodes relationships", from top to bottom:
Amino acid metabolism (pathway of pathways)
Amino acid biosyntheses (pathway of pathways)
Aspartate-derived amino acid biosyntheses (pathway of pathways)
Methionine biosynthesis (metabolic pathway)
Lysine biosynthesis (metabolic pathway)
Isoleucine biosynthesis (metabolic pathway)
Threonine biosynthesis (metabolic pathway)
Aspartate phosphorylation step (metabolic pathway step)
aspartate phosphorylation reaction (interaction)]
Figure 4: Representation of classification schemes in the aMAZE data model. The data model supports
classification schemes at multiple levels using subtype and ContainedCompartment relationships. Some
examples are illustrated here.
(a) Classification of organisms and of sub-cellular compartments.
(b) Hierarchical classification of Compounds.
(c) Classification of enzymatic reactions, using the EC classification (Bairoch, 1993).
(d) Hierarchy of pathways.
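A multi-level classification built from subtype relationships can be queried by walking up the chain of parents. The Python sketch below is illustrative only: the `PARENT` mapping and the helpers `ancestors` and `is_subtype` are hypothetical names, and the parent assigned to each compound is one plausible reading of the labels in panel (b) (Sugar, Hexose, Fructose, Glucose and their L and D forms), not a statement of how aMAZE stores them.

```python
# Child -> parent subtype links (assumed reading of the panel (b) labels).
PARENT = {
    "Hexose": "Sugar",
    "Fructose": "Hexose",
    "Glucose": "Hexose",
    "L-Fructose": "Fructose",
    "L-Glucose": "Glucose",
    "D-Glucose": "Glucose",
}


def ancestors(name, parent=PARENT):
    """Return the chain of increasingly generic types above `name`."""
    chain = []
    while name in parent:
        name = parent[name]
        chain.append(name)
    return chain


def is_subtype(name, candidate, parent=PARENT):
    """True if `candidate` appears anywhere above `name` in the hierarchy."""
    return candidate in ancestors(name, parent)


print(ancestors("D-Glucose"))
print(is_subtype("D-Glucose", "Hexose"))
```

The same lookup would apply to the EC hierarchy of panel (c), where a reaction annotated at the 1.1.1.1 level is also found by a query on 1.1.1 or on 1.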