From molecular activities and processes to biological function

27/12/00 20:12

To appear in Briefings in Bioinformatics

Jacques van Helden, Avi Naim, Christian Lemer, Renato Mancuso, Matthew Eldridge and Shoshana J. Wodak

1 European Bioinformatics Institute (EBI), Genome Campus, Hinxton, Cambridge CB10 1SD, UK. Email: {jvanheld, naima, shosh}@ebi.ac.uk
2 Service de Conformation des Macromolécules Biologiques, Université Libre de Bruxelles, 50 av. F.D. Roosevelt, B-1050 Bruxelles, Belgium. Email: {jvanheld,shosh}@ucmb.ulb.ac.be
& Present address (Matthew Eldridge): Synomics Ltd, Compass House, Vision Park, Chivers Way, Histon, Cambridge CB4 9AD. Email: Eldridge@synomics.com
* Corresponding author: Shoshana J. Wodak, email: [email protected];[email protected]

Running title: From molecular activities and processes to biological function

Keywords: Database, metabolism, regulation, networks, pathways, process, function

This paper describes how biological function can be represented in terms of molecular activities and processes. It presents several key features of a data model that is based on a conceptual description of the network of interactions between molecular entities within the cell and between cells. This model is implemented in the aMAZE database that presently deals with information on metabolic pathways, gene regulation, sub- or supra-cellular locations, and transport. We show that this model constitutes a useful generalisation of data representations currently implemented in metabolic pathway databases, and that it can furthermore include multiple schemes for categorising and classifying molecular entities, activities, processes and localisations.
In particular, we highlight the flexibility offered by our system in representing multiple molecular activities and their control, in viewing biological function at different levels of resolution and in updating this view as our knowledge evolves. The aMAZE project Web site is at http://www.ebi.ac.uk/research/pfbp/.

Introduction

“Fonction: action, rôle caractéristique d’un élément, d’un organe, dans un ensemble (souvent opposé à structure)” (Robert, 1982)

With the sequences of over one hundred complete genomes of species ranging from bacteria to human becoming available, a major challenge that we face is the interpretation of this information in terms of the biological function of the genes and proteins encoded in these sequences. This is a monumental task, first of all because biological function is itself a very complex notion. This complexity is reflected in the different meanings that different authors give to function. These meanings often depend on the level (molecular, cellular, organism) at which the author’s viewpoint and research interests reside. Some attempted general definitions of gene function, like that by Rison et al. (2000), which states that function is “a capability that a physical gene product (or gene product group) carries as a potential”, are not really operational. More rigorous and meaningful definitions, such as those given by Karp (2000), incorporate the fact that biological function needs to be defined at different levels simultaneously, i.e. not only at the molecular level, but also in a larger context. This context represents the myriad of ways in which the individual genes and gene products interact, physically or otherwise, to give rise to the living cell, and can furthermore be extended to whole organisms where millions of cells interact.
The success in deciphering biological function will thus depend on our ability to represent, in a formal and rigorous way, information on individual molecular entities and on their interactions at the molecular, cellular and multicellular levels.

Currently, the main resources for information on biological function are sequence databases such as SWISS-PROT (Bairoch & Apweiler, 2000), GenBank (Benson et al., 2000), or the EMBL data library (Baker et al., 2000). Although these databases are of undeniable utility for consultation by humans, they represent information on function as largely unstructured textual descriptions, which are not readily amenable to computational manipulation for validation, comparison and more sophisticated analyses of function (see discussion by Karp, 2000; van Helden et al., 2000). To address these limitations, several systems have been devised for representing and categorising biological function in a more rigorous way, which also reflects the various levels of the molecular and cellular organisation (for review, see van Helden et al., 2000).

Roughly speaking, these systems can be divided into 3 categories. One category consists of comprehensive catalogues, often referred to as ‘bio-ontologies’, whose main purpose is to ensure uniform annotation and to facilitate sharing of information between different databases. These catalogues typically contain a vocabulary of terms and some specifications of their meaning, as well as hierarchical classifications of the terms defined in the controlled vocabulary (Ashburner et al., 2000; Mewes et al., 1997; Riley, 1993; Riley, 1998); for review see Rison et al. (2000). The second category comprises systems that represent information on molecular interactions.
These databases are often restricted to a specific type of interaction: enzymatic catalysis in BRENDA (Schomburg et al., 1990-1995), protein-protein interactions in DIP (Xenarios et al., 2000), protein-DNA interactions in Transfac (Wingender et al., 2000) and RegulonDB (Salgado et al., 1999; Salgado et al., 2000). CSNdb (Takai-Igarashi et al., 1998) combines different types of interactions involved in signal transduction pathways, but its structure does not allow storing information on pathways as such. Instead, all possible pathways are obtained as the result of a path finding algorithm, and displayed automatically. The third category comprises systems that, in addition to categorising molecules and activities, represent information on molecular interactions together with their organisation into processes or pathways. Such systems have been implemented in databases dealing mainly with metabolic pathways, such as EcoCyc/MetaCyc (Karp et al., 2000), KEGG (Kanehisa & Goto, 2000) and WIT/EMP (Overbeek et al., 2000). The database BIND (Bader & Hogue, 2000), which primarily represents molecular interactions, also handles information on signal transduction pathways.

Borrowing from the analogy with language (Bray, 1997), the bio-ontologies can be thought of as defining biological function at the semantic level. But they provide no syntax or grammar rules that define how words can be associated to form sentences. Such rules are however required to ‘speak’ the language of biological function, and all the power of this language will be necessary if we want computers to help us unravel the complexity of living cells.

Recently, we reported a general data model, based on a conceptual description of the network of interactions between molecular entities, which has the ambition of embodying such rules (van Helden et al., 2000).
This data model is implemented in the aMAZE database, which can deal with information on metabolic pathways, gene regulation, sub-cellular locations and transport. aMAZE stands for ‘a maze’: the labyrinth of interactions between the myriad of molecular entities in the cell.

In this paper we provide a more detailed description of several key aspects of our model. We show that it represents a useful generalisation and an extension of data models underlying metabolic databases such as those in EcoCyc/MetaCyc (Karp et al., 2000), and that it goes further than the bio-ontologies towards meeting the requirements of representing biological function in the computer. In particular we highlight the flexibility offered by a system such as ours in representing multiple molecular activities and their control, in viewing biological function at different levels of resolution and in updating this view as our knowledge progresses.

A data model for representing networks of interacting molecules

A schematic representation of the class hierarchy currently implemented in the aMAZE database is shown in Figure 1. High up in this hierarchy are four major classes: BiochemicalEntity, Interaction, Process/Pathway and Compartment (we denote class names in italics and leading uppercase letters). In what follows we describe how this subdivision is used to represent the different levels of molecular and cellular function in a rigorous yet flexible manner.

Separating apples and oranges

An essential feature of the aMAZE data model is the distinction between two fundamental classes, BiochemicalEntity and Interaction. BiochemicalEntity represents structural units. These can be complete molecules (such as metabolites or proteins), parts of molecules (such as genes or regulatory elements in non-coding regions), or supramolecular assemblies (e.g. a protein complex, or the ribosome).
The attributes of an object in the BiochemicalEntity class are mostly restricted to its structural properties: the sequence of a polypeptide, the position of a gene on the chromosome, the SMILES formula of a compound and so on.

The Interaction class contains objects describing molecular activities. It is subdivided into a set of subclasses specifying particular types of activities such as Reaction, Expression, TranscriptionalRegulation, and Assembly/Disassembly. The originality of our data model is that interactions are stored as fully instantiated objects in different subclasses of the class Interaction (Figure 1). Each interaction is characterised by its particular list of inputs and outputs. Thus, a Reaction has as attributes a set of substrates (the inputs) and products (the outputs). In an analogous manner, Assembly/Disassembly has polypeptides and compounds (e.g. a prosthetic group) as inputs and a multi-molecular complex as output, whereas Expression has a gene as input and a polypeptide as output. All these interactions have in common the fact that both their inputs and outputs are sets of entities, and they are grouped under a common class Transformation.

This contrasts with another subclass of interactions, which have as outputs other interactions rather than entities. A good example is the class ReactionCatalysis, which represents the action of a protein (enzyme) in accelerating a chemical reaction. This is illustrated in Figure 2, using as an example the uridylylation reaction of the protein PII, catalysed by the enzyme uridylyltransferase. Figure 2 also shows that this catalysis is itself activated by alpha-ketoglutarate. This Activation is also an activity exerted by a BiochemicalEntity (alpha-ketoglutarate) on an Interaction (the Catalysis). Interactions having other interactions as output are denoted as Control interactions in our data model (van Helden et al., 2000).
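The separation between entities, transformations and controls can be made concrete with a small sketch. The Python below is only an illustration of the idea and not the aMAZE implementation; the attribute names `inputs`, `outputs`, `controller` and `target`, the helper `controls_of`, and the entity name "PII-UMP" are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class BiochemicalEntity:
    """A structural unit (compound, gene, polypeptide, complex, ...)."""
    name: str


@dataclass
class Interaction:
    """Base class for molecular activities, stored as full objects."""
    name: str


@dataclass
class Transformation(Interaction):
    """Interaction whose inputs and outputs are both sets of entities."""
    inputs: List[BiochemicalEntity] = field(default_factory=list)
    outputs: List[BiochemicalEntity] = field(default_factory=list)


class Reaction(Transformation):
    """Substrates are the inputs, products are the outputs."""


@dataclass
class Control(Interaction):
    """Interaction whose output is another interaction, not an entity."""
    controller: BiochemicalEntity  # hypothetical attribute name
    target: Interaction            # hypothetical attribute name


class Catalysis(Control):
    """An enzyme acting on a Reaction."""


class Activation(Control):
    """An entity enhancing another interaction, here a Catalysis."""


def controls_of(interaction: Interaction, controls: List[Control]) -> List[Control]:
    """Return the controls whose target is exactly this interaction object."""
    return [c for c in controls if c.target is interaction]


# Entities of the Figure 2 example ("PII-UMP" is an illustrative label).
pii, pii_ump = BiochemicalEntity("PII"), BiochemicalEntity("PII-UMP")
utp, ppi = BiochemicalEntity("UTP"), BiochemicalEntity("PPi")
utase = BiochemicalEntity("uridylyltransferase")
akg = BiochemicalEntity("alpha-ketoglutarate")

uridylylation = Reaction("uridylylation of PII", [pii, utp], [pii_ump, ppi])
catalysis = Catalysis("catalysis of uridylylation", utase, uridylylation)
activation = Activation("activation of catalysis", akg, catalysis)

print([c.name for c in controls_of(catalysis, [catalysis, activation])])
```

Because the catalysis is an object in its own right, the activation can point at it directly instead of pointing at the enzyme.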
An attribute of a Control is the sign (+ or -), indicating whether its effect is to enhance or reduce the target interaction.

Our data model thus establishes a clear distinction between structure and activity, thereby addressing concerns about the necessity of separating apples from oranges when describing the function of genes and gene products (Rison et al., 2000). The advantages of this model have been discussed in detail elsewhere (van Helden et al., 2000). One of its most attractive features is that it defines the activities of a particular structural entity (compound, gene or protein) within a context, rather than within the entity object itself. This allows for a flexible description of multiple activities in which the same molecule can be involved.

One might argue that multiple activities could equally well be represented as a multivalue attribute of the entity object itself. However, the instantiation of full objects for each interaction allows the description of controls that could not be represented by multivalue attributes. This is illustrated by the example of uridylyltransferase (Figure 2). This bifunctional enzyme catalyses both the uridylylation and the de-uridylylation of the protein PII. These reactions are not the inverse of each other: during uridylylation, UTP is consumed and pyrophosphate (PPi) is produced, whereas de-uridylylation consumes orthophosphate (Pi) and produces UDP. Combining the two reactions would thus result in a futile cycle, and regulation is hence crucial to avoid wasting energy. This regulation is mediated by alpha-ketoglutarate, which activates only uridylylation, and not de-uridylylation. In a classical data model, one would describe the activation by alpha-ketoglutarate as acting directly on the protein uridylyltransferase. But this would not allow specifying that only one of the 2 catalyses performed by this protein is selectively activated.
This can however be readily achieved with our representation, since we instantiate two separate Catalysis objects, so that the Activation can selectively operate on only one of them (Figure 2).

Apples and apple trees

A third important class of the aMAZE data model is Process/Pathway. A Process/Pathway is a collection of interconnected pathway elements. Graph theory would describe it as a graph (network), whose nodes are pathway elements, connected by pathway arcs.

Process/Pathway elements can be of different types. In the simplest case these elements consist of biochemical entities and interactions (Figure 3A). Thus, the BiochemicalEntity and Interaction classes described above derive from the common PathwayBuildingBlock super-class (Figure 1). In addition, pathways can be generalised to allow higher level representations, for example to represent how different biochemical pathways are interconnected (pathways of pathways). Figure 3B illustrates such a super-pathway for the biosynthesis of amino acids derived from aspartate. This yields a synthetic view of the interconnections between the methionine, lysine, threonine and isoleucine pathways. In this super-pathway, some PathwayBuildingBlock objects are other pathways, and thus the Process/Pathway class is itself a subclass of PathwayBuildingBlock, as indicated in Figure 1. This also illustrates the flexibility of our description, as classes can be readily made more general or more specific in order to represent information on different types of biological processes, which may not necessarily be at the molecular level.

It should be realised that the information stored in the Process/Pathway class refers to specific portions (subgraphs) of the global network of interactions, whose boundaries are defined according to some criteria. So far these portions correspond mainly to metabolic processes as defined by researchers on the basis of biochemical criteria or experimental observations.
Thus in general, a metabolic process or ‘pathway’ has its boundaries defined by the essential compounds that are its inputs and outputs. Alternatively, a process may be characterised by the co-regulation at the transcription level of a group of genes, and have its boundaries defined by the feedback control of the end product on the first step in the process. Importantly, current definitions are based on observations made on a limited number of model organisms, and it is already becoming clear that they will display significant variations in other organisms. How the global network of interactions is carved up into processes that make biological sense may therefore depend on the organism, tissue or cellular compartment. The data structure of aMAZE readily allows for that. In addition, by enabling ‘views’ of the network at different hierarchical levels, and developing query tools for flexibly cropping across it, new ways for carving up the network could be tested in the future.

Orchards and orange groves

An important class high up in our hierarchy describes compartments (Figure 1), which can be defined at different levels: sub-cellular compartment, cell type, tissue, organ, and so on. Organisms can themselves be considered as a special type of compartment. Indeed, it is not uncommon to observe a metabolic pathway shared between different organisms (symbionts). In our data model, the class Organism is thus derived from Compartment, in exactly the same way as SubcellularCompartment, CellType, Tissue, and Organ.

Any BiochemicalEntity, Interaction, or Process/Pathway can be linked to a given combination of objects in the Compartment class in order to describe where it occurs (e.g. plasma membrane of T-cells in Homo sapiens). This information is stored in the separate class LocationSpecification (not shown in Figure 1 for the sake of clarity), which contains one slot for each level of Compartment.
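As a rough illustration of the "one slot per compartment level" idea, the following Python sketch models LocationSpecification as a record with one optional field per Compartment subclass. It is a sketch under stated assumptions, not the aMAZE schema: the slot names and the `describe` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Compartment:
    """Base class for anything in which an entity or process can occur."""
    name: str


class SubcellularCompartment(Compartment):
    pass


class CellType(Compartment):
    pass


class Tissue(Compartment):
    pass


class Organ(Compartment):
    pass


class Organism(Compartment):
    """Organisms are treated as a special type of compartment."""


@dataclass(frozen=True)
class LocationSpecification:
    """One slot per compartment level; unused levels stay None."""
    subcellular_compartment: Optional[SubcellularCompartment] = None
    cell_type: Optional[CellType] = None
    tissue: Optional[Tissue] = None
    organ: Optional[Organ] = None
    organism: Optional[Organism] = None

    def describe(self) -> str:
        """Join the filled slots, from the finest level to the coarsest."""
        slots = (self.subcellular_compartment, self.cell_type,
                 self.tissue, self.organ, self.organism)
        return " / ".join(c.name for c in slots if c is not None)


# The example from the text: plasma membrane of T-cells in Homo sapiens.
location = LocationSpecification(
    subcellular_compartment=SubcellularCompartment("plasma membrane"),
    cell_type=CellType("T-cell"),
    organism=Organism("Homo sapiens"),
)
print(location.describe())
```

Leaving a slot empty expresses that the location is not restricted at that level, which is one possible reading of the combination described above.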
Hierarchies of classes and objects

In the aMAZE data model, as is usual in object-oriented data models, two main types of relationships are represented. One is the inheritance relationship (“is_a”) encoded in the class hierarchy, whereby an heir class inherits the properties of its parent class. The other is the reference relationship (“refers_to”), operating between objects, whereby a given object refers to another object in the database. This latter type of relationship is encoded in the objects as attributes.

Our model makes very wide use of such attributes to represent input/output relationships, which link objects of the Interaction class to objects of other classes. This makes it possible, for example, to handle multiple inputs and outputs of interactions. Examples of such relationships can be found in the pathway graph of Figure 3A. Translation has a messenger RNA as input and a polypeptide as output. Catalysis has a protein (enzyme) as input and a Reaction as output, Reaction has several compounds as inputs and as outputs, and so on.

Reference relationships can also be used to describe hierarchic relations between objects. One example is the ContainedCompartments attribute of the Compartment class, which makes it possible to represent a topological hierarchy, for example the fact that the nucleosome is contained in the nucleus, itself contained in a (eukaryotic) cell. In this case the referred objects are not necessarily of the same class as the object that refers to them. Hierarchical classifications between objects, for example the organism taxonomy, can also be represented using the reference relationships, but here all the objects belong to the same class. Such classifications are represented in our model using the Subtypes attribute, as detailed below.

Classification schemes

Hierarchical classifications via the Subtypes attribute are implemented in many classes of our model.
Examples are illustrated in Figure 4, using the Unified Modelling Language (UML) conventions. The most familiar example for the biologist is probably the systematic classification of organisms. In our model, this classification is described by recursive relations between objects of the Systematic Group class (Figure 4a): a systematic group can include other systematic groups (primate is a subtype of mammal; Homo sapiens is a subtype of primate). Such classifications apply not only to organisms, but also to other levels of organisation such as compartments (e.g. subcellular compartments in Figure 4a). Note however that compartments can also be classified according to their containment hierarchy, using the ContainedCompartment attribute (Figure 4a). The relation represented is however quite distinct from that in the Subtypes attribute, since it indicates that a compartment is physically contained within another. ContainedCompartment and SubType provide complementary information, which can be used to address distinct queries.

Compounds are also classified hierarchically. In this case a generic compound can refer to other generic compounds, or to a specific compound, allowing the definition of a complete taxonomy of compounds, as illustrated in Figure 4b. The same schema is used to represent the well-known EC classification (Bairoch, 1994; Bairoch, 2000) (Figure 4c). Hierarchic classifications of pathways and processes are also supported by the capability to use a Pathway as a PathwayBuildingBlock to represent a pathway of pathways (Figure 4d).

Discussion

How is Function Described in our model?

Translated into English, the definition of the French dictionary Petit Robert (Robert, 1982) reads: “characteristic action (role) of an element (organ) within an ensemble (often opposed to structure)”.
This definition has the merit of highlighting that the function of an object is always defined in reference to a given context (an ensemble which contains this object). It makes no sense to talk about the function of an object per se. In molecular biology, the same holds true. The function of a molecule can only be defined by the link between the activities exerted by this molecule and the process in which these activities play a role. For example, the activity of aspartate kinase III (Escherichia coli) is to catalyse aspartate phosphorylation, whereas its function, or rather one of its functions, is to catalyse the first step of the threonine biosynthesis pathway. Since this pathway is itself an element of the super-pathway of amino acid biosynthesis, the function of aspartate kinase III might as well be described as “enzyme catalysing a reaction involved in amino acid biosynthesis”.

Our definition comes close to that provided by Karp (2000), who distinguishes between 2 separate notions of function: the local function and the integrated function. The first concerns the individual activity exerted by a biological entity, whereas the second concerns the role played by this entity in the context of a larger system. However, while we agree with this subdivision, we prefer to replace the term local function by activity, since function is by definition context dependent while activity may not be.

Our activity/function definition makes it very easy to cope with multiple functions, as well as to distinguish between two distinct mechanisms of multi-functionality. The first concerns proteins that combine several distinct activities, sometimes carried out simultaneously and resulting from the existence of several domains within the same polypeptide. The second mechanism concerns proteins that exert a single activity, but take part in different processes, thereby giving rise to different functions. Examples of both mechanisms can be found in Jeffery (1999).
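The link between activity and function described above can be sketched as a small query that pairs each activity of an entity with every process containing it, directly or through a super-pathway. This Python sketch is illustrative only and is not the aMAZE query interface; the classes `Activity` and `Pathway`, the fields `actor` and `elements`, and both functions are hypothetical names, and the recursion assumes that pathways never contain themselves.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Activity:
    """An interaction exerted by an entity (e.g. a catalysis)."""
    name: str
    actor: str  # name of the entity exerting the activity


@dataclass
class Pathway:
    """A process: a collection of activities and/or other pathways."""
    name: str
    elements: List[object] = field(default_factory=list)


def containing_pathways(element: object, pathways: List[Pathway]) -> List[str]:
    """Names of pathways holding `element` directly or via nested pathways.

    Assumes the pathway-of-pathways structure has no cycles.
    """
    found: List[str] = []
    for pathway in pathways:
        if any(e is element for e in pathway.elements):
            if pathway.name not in found:
                found.append(pathway.name)
            # A pathway can itself be a building block of a super-pathway.
            for name in containing_pathways(pathway, pathways):
                if name not in found:
                    found.append(name)
    return found


def functions_of(entity: str, activities: List[Activity],
                 pathways: List[Pathway]) -> List[Tuple[str, str]]:
    """Pair each activity of `entity` with each process it plays a role in."""
    return [(a.name, p) for a in activities if a.actor == entity
            for p in containing_pathways(a, pathways)]


# The aspartate kinase III example from the text.
catalysis = Activity("catalysis of aspartate phosphorylation",
                     actor="aspartate kinase III")
threonine = Pathway("threonine biosynthesis", [catalysis])
amino_acids = Pathway("amino acid biosynthesis", [threonine])

for activity, process in functions_of("aspartate kinase III", [catalysis],
                                      [threonine, amino_acids]):
    print(f"{activity} -> {process}")
```

A protein with a single activity that appears in several pathways would yield several pairs with the same activity, which corresponds to the second mechanism of multi-functionality mentioned above.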
In some sense, the recent Gene Ontology (Ashburner et al., 2000) addresses these various issues by combining 3 distinct classification schemes: biological process, molecular function, and cellular component. Each gene is classified according to these 3 schemes, and in addition, a gene can be assigned to multiple classes within each scheme. However, since these 3 schemes are stored independently, it is not easy to deduce from them which molecular function is involved in which process, and in which compartment/location a given process takes place. Representing the various facets of multi-functionality is hence not straightforward with this system, and would require adding extra dimensions to the current description.

This contrasts with our model, where each process is defined as a collection of interactions and entities, which are themselves linked to the different locations where they occur. Thus, using our representation, information on the function, or functions, of a gene product is not obtained by looking up catalogues, but is computed on the fly via queries to the aMAZE database, which analyse the network of interactions in which the gene product takes part. In particular, this involves collecting from the database all the activities in which the gene product is involved, and the processes in which these activities play a role, corresponding precisely to the definition of function given by the Petit Robert.

The aMAZE data model is not an ontology

As already mentioned above, the so-called bio-ontologies (which do not have much in common with the philosophical definition of ontology) arose in response to the difficulty of extracting information from human-annotated text in sequence databases. The first step in addressing the issue was to define controlled vocabularies, in order to deal with synonyms.
Taxonomical classifications were introduced subsequently to describe inclusion relationships between the terms of the controlled vocabulary. The scope of most ‘bio-ontologies’ thus does not extend to the description of molecular networks. Processes can be described in these ontologies, but this description is restricted to inclusion relationships. Even though this description can be fine-grained, and finally result in a list of all molecules involved in a given process, there is no way to indicate the order in which these molecules are interconnected to form a pathway or network. In stark contrast to this, the primary function of our model is to describe cellular processes in terms of molecular interactions, which is why the most detailed branches of our class tree describe the different types of interactions.

Nevertheless, our data model makes it possible to incorporate classifications from existing bio-ontologies. But rather than representing them in a rigid framework, these classifications are stored as object hierarchies using the Subtypes attribute, and can hence be readily modified. In this regard, our data model offers the possibility of storing different classification schemes in the same database, leaving to the user the choice of the most appropriate scheme for the particular query to be addressed. Providing such flexibility is essential, since having a single scheme as a unifying concept in biology is probably not realistic, especially given our currently incomplete knowledge of the field.

Rison et al. (2000) recently characterised a series of functional classifications in terms of breadth, depth and resolution. This analysis cannot be used to evaluate data models such as ours, since we restrict our description to class hierarchies, and therefore the levels of depth and resolution of our data model will appear as very restricted.
In fact, since classification schemes are represented as attributes of objects instantiated in the database, our data structure can store classification schemes of any breadth, depth and resolution. Since this may also be the case for some of the other data models reviewed by Rison and colleagues, the conclusions reached by these authors on those models may need to be re-examined.

Major differences with the representations in other biological databases

Our data organisation contrasts sharply with those found in most sequence-oriented databases, which are built around specific types of biochemical entities such as genes (GenBank) or proteins (SWISS-PROT), and store information on their activities in a few general description fields, essentially in textual form.

Other more specialised databases, particularly EcoCyc/MetaCyc (Karp et al., 2000), use more elaborate data structures to represent function (Karp, 2000). They also make a clear distinction between entities and interactions. However, except for chemical reactions and catalysed reactions, all other types of interactions are represented as relationships between the objects, rather than as objects in their own right, as they are in our model. This is due to the fact that the majority of these databases and their underlying data models deal primarily with information on metabolic pathways. Our model covers a larger variety of activities (transcriptional regulation, attenuation, macromolecule processing, transport, …) and processes (metabolic regulation, signal transduction) and deals with spatial locations and compartments, as shown above.

The very detailed ASN.1-based data specification of BIND (Bader & Hogue, 2000) shares some features with our model by representing entities, interactions and pathways as distinct objects.
But their model focuses entirely on binary interactions, and it is unclear to what extent, if at all, it represents hierarchical relations between object classes, and whether it enables the representation of object taxonomies, as the aMAZE model does.

Concluding remarks

In this paper we described a data model which represents information on molecular activities and cellular processes in a rigorous manner, and showed how this model differs from current schemes for representing and categorising function. In particular, this model makes a clear distinction between molecular entities and activities, and extends this distinction to a very wide spectrum of both. Furthermore, it provides ready means for describing the physical and functional interactions between genes and gene products, for dealing with multiple functions and spatial locations, for describing a large variety of processes and for concurrently including different classification schemes.

Our data model is clearly still evolving as our limited experience grows and the types of data we handle expand. The aMAZE database, which implements this model, currently handles information on metabolic pathways, gene regulation and transport. Work on its extension to other types of pathways, in particular T-cell signal transduction cascades, is in progress. Also being worked on are modifications of our schema, which will allow us to represent interactions between sub-cellular compartments, cells, tissues and whole organisms.

Availability

Additional information can be found on the Web at the following address: http://www.ebi.ac.uk/research/pfbp/. A version of the aMAZE database is presently implemented in Java, with a lower level storage in ORACLE. Access to it over the Internet will be provided in the near future.
Acknowledgements

We thank Georges Cohen, Kirill Degtyarenko and Gaurab Mukherjee for valuable help in the annotation of metabolic and regulatory pathways, and acknowledge Jean Richelle, Thure Etzold, Dietmar Schomburg and Bruno André for useful discussions. The work described here has been sponsored by a consortium of industries comprising Astra-Zeneca, Aventis, Monsanto, Organon and Roche. Scientists from these companies are thanked for valuable input. We furthermore acknowledge support from the European Commission, Grant Nº QLRI-CT-1999-01333.

References

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. & Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1), 25-29.

Bader, G. D. & Hogue, C. W. (2000). BIND: a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics 16(5), 465-477.

Bairoch, A. (1993). The ENZYME data bank. Nucleic Acids Res 21(13), 3155-3156.

Bairoch, A. (1994). The ENZYME data bank. Nucleic Acids Res 22(17), 3626-3627.

Bairoch, A. (2000). The ENZYME database in 2000. Nucleic Acids Res 28(1), 304-305.

Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 28(1), 45-48.

Baker, W., van den Broek, A., Camon, E., Hingamp, P., Sterk, P., Stoesser, G. & Tuli, M. A. (2000). The EMBL Nucleotide Sequence Database. Nucleic Acids Res 28(1), 19-23.

Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A. & Wheeler, D. L. (2000). GenBank. Nucleic Acids Res 28(1), 15-18.

Bray, D. (1997). Reductionism for biochemists: how to survive the protein jungle. Trends Biochem Sci 22(9), 325-326.

Jeffery, C. J. (1999). Moonlighting proteins. Trends Biochem Sci 24(1), 8-11.

Kanehisa, M. & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28(1), 27-30.

Karp, P. D. (2000). An ontology for biological function based on molecular interactions. Bioinformatics 16(3), 269-285.

Karp, P. D., Riley, M., Saier, M., Paulsen, I. T., Paley, S. M. & Pellegrini-Toole, A. (2000). The EcoCyc and MetaCyc databases. Nucleic Acids Res 28(1), 56-59.

Mewes, H. W., Albermann, K., Bahr, M., Frishman, D., Gleissner, A., Hani, J., Heumann, K., Kleine, K., Maierl, A., Oliver, S. G., Pfeiffer, F. & Zollner, A. (1997). Overview of the yeast genome [published erratum appears in Nature 1997 Jun 12;387(6634):737]. Nature 387(6632 Suppl), 7-65.

Overbeek, R., Larsen, N., Pusch, G. D., D'Souza, M., Selkov, E., Jr, Kyrpides, N., Fonstein, M., Maltsev, N. & Selkov, E. (2000). WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 28(1), 123-125.

Riley, M. (1993). Functions of the gene products of Escherichia coli. Microbiol Rev 57(4), 862-952.

Riley, M. (1998). Systems for categorizing functions of gene products. Curr Opin Struct Biol 8(3), 388-392.

Rison, S. C. G., Hodgman, T. C. & Thornton, J. M. (2000). Comparison of functional annotation schemes for genomes. Funct Integr Genomics 1, 56-69.

Robert, P. (1982). Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982 edit (J. Rey-Debove, A. R. a., Ed.), 1. 2 vols, Le Robert, Paris.

Salgado, H., Santos, A., Garza-Ramos, U., van Helden, J., Diaz, E. & Collado-Vides, J. (1999). RegulonDB (version 2.0): a database on transcriptional regulation in Escherichia coli. Nucleic Acids Res 27(1), 59-60.

Salgado, H., Santos-Zavaleta, A., Gama-Castro, S., Millán-Zárate, D., Blattner, F. R. & Collado-Vides, J. (2000). RegulonDB (version 3.0): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res 28(1), 65-67.

Schomburg, D., Salzmann, D. & Stephan, D. (1990-1995). Enzyme handbook. 13 vols, Springer.

Takai-Igarashi, T., Nadaoka, Y. & Kaminuma, T. (1998). A database for cell signaling networks. J Comput Biol 5(4), 747-754.

van Helden, J., Naim, A., Mancuso, R., Eldridge, M., Wernisch, L., Gilbert, D. & Wodak, S. J. (2000). Representing and analysing molecular and cellular function using the computer. Biol Chem 381(9-10), 921-935.

Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Prüß, M., Reuter, I. & Schacherer, F. (2000). TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res 28(1), 316-319.

Xenarios, I., Rice, D. W., Salwinski, L., Baron, M. K., Marcotte, E. M. & Eisenberg, D. (2000). DIP: the Database of Interacting Proteins. Nucleic Acids Res 28(1), 289-291.

Figures

[Figure 1 diagram not reproduced. Classes shown: Database Object, Biochemical Entity, Compound, Specific Compound, Generic Compound, Proteinaceous, Protein domain, Polypeptide, Protein complex, DNA, Gene, Operon, Regulatory Site, RNA, PathwayBuildingBlock, Interaction, Transformation, Reaction, Assembly/Disassembly, Transport Event, Macromolecule Processing, Conformation Change, Expression, Transcription, RNA Processing, Translation, Control, Control of Transformation, Reaction Catalysis, Transport Catalysis, Transcriptional Regulation, Translational Regulation, Attenuation, Control of Control, Indirect Interaction, Process/Pathway, Metabolic Pathway Step, Metabolic Pathway, Pathway of Pathways, Compartment, Subcellular Compartment, Cell Type, Tissue, Organ, Systematic Group, Reference, Evidence, Publication, Lab Info, External DB link.]

Figure 1: The current class hierarchy of the aMAZE database.
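The separation between molecular entities and activities on which this class hierarchy rests can be illustrated with a minimal sketch. It is written in Python for brevity and is not the aMAZE implementation (which the text describes as Java over ORACLE). The class names BiochemicalEntity and Interaction come from the paper; the field names, the helper `interactions_of`, the entity name "PII-UMP", and the convention that a catalysis takes the enzyme as input and the reaction as output are assumptions made for illustration. Co-substrates and co-products are omitted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BiochemicalEntity:
    """What a molecule is (assumed minimal form: a name only)."""
    name: str


@dataclass
class Interaction:
    """What a molecule does: links inputs to outputs.

    Inputs and outputs may be entities or other interactions, so that a
    control (e.g. a catalysis) can point at the reaction it acts upon.
    """
    kind: str                                   # "reaction", "catalysis", ...
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def interactions_of(entity, network):
    """Hypothetical helper: interactions that take `entity` as an input."""
    return [i for i in network if entity in i.inputs]


# One enzyme, two catalytic activities (multi-functional enzyme).
utase = BiochemicalEntity("Uridylyltransferase")
pii = BiochemicalEntity("PII")
pii_ump = BiochemicalEntity("PII-UMP")          # assumed label for uridylylated PII

uridylylation = Interaction("reaction", inputs=[pii], outputs=[pii_ump])
deuridylylation = Interaction("reaction", inputs=[pii_ump], outputs=[pii])
network = [
    uridylylation,
    deuridylylation,
    Interaction("catalysis", inputs=[utase], outputs=[uridylylation]),
    Interaction("catalysis", inputs=[utase], outputs=[deuridylylation]),
]

catalysed = [c.outputs[0] for c in interactions_of(utase, network)
             if c.kind == "catalysis"]
print(len(catalysed))  # number of reactions catalysed by the single enzyme
```

Because the enzyme is stored once and each activity is a separate Interaction object, adding a further function to the same polypeptide only requires one more interaction, not a second copy of the entity.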
[Figure 2 diagram not reproduced. Labels shown: alpha-ketoglutarate, activation, Uridylyltransferase, catalysis (twice), reaction 2.7.7.59, reaction, PII, PII (UMP)n, n UTP, n PPi, n UDP, n Pi, glnB, expression.]

Figure 2: Schematic representation of the action and control of a multi-functional enzyme using the aMAZE data model. The example shown is that of uridylyltransferase, an enzyme which catalyses both the uridylylation and de-uridylylation of the protein PII (see text for details). In this representation, objects of the Interaction class, namely reaction, catalysis, expression and activation, are boxed. Objects of the BiochemicalEntity class are represented by their names, printed in bold. Input/output relationships are represented by arrows.

[Figure 3a diagram not reproduced. Labels shown: compounds L-Aspartate, L-Aspartyl-4-P, L-Aspartic semialdehyde, L-Homoserine, L-Homoserine phosphate, L-Threonine; co-substrates and co-products ATP, ADP, NADPH, NADP+, Pi, H2O; reactions 2.7.2.4, 1.2.1.11, 1.1.1.3, 2.7.1.39, 4.4.1.8; polypeptides Aspartate kinase I homoserine dehydrogenase I, Aspartate semialdehyde deshydrogenase, Cystathionine-gamma-synthase, Cystathionine-beta-lyase; asd, thrABC operon, thrABC mRNA; interactions catalysis, inhibition, expression, transcription, translation, Attenuation; frames MetabolicPathwayStep and MetabolicPathway.]

[Figure 3b diagram not reproduced. Labels shown: aspartate, L-aspartic semialdehyde, L-Homoserine, L-Cysteine, L-Lysine, L-Threonine, L-Methionine, L-Isoleucine; pathways Common fork, Homoserine biosynthesis, Methionine biosynthesis, Lysine biosynthesis, Threonine biosynthesis, Isoleucine biosynthesis; inhibition arrows between them; frame PathwayOfPathways.]

Figure 3: Graphical representations of pathways using the aMAZE data model. (a) The metabolic regulation pathways of threonine biosynthesis in Escherichia coli. As in Figure 2, objects of the Interaction class, namely reaction, catalysis, expression, inhibition and attenuation, are boxed. Objects of the BiochemicalEntity class are represented by their names, printed in bold. Input/output relationships are represented by arrows. In this diagram we contrast a MetabolicPathwayStep (light dotted box frames) with the entire MetabolicPathway (heavy dotted box frame). (b) Example of a pathway of pathways. The example represents the network formed by the biosynthetic pathways for amino acids having aspartate as precursor. In this diagram, some nodes represent whole pathways; IndirectInteractions between pathways are labelled as inhibition (negative) or activation (positive, not shown).

[Figure 4 diagrams not reproduced. Labels shown, in the order extracted:
(a) Classes: Compartment, ContainedCompartments, SubcellularCompartment, SubTypes, SystematicGroup. SubTypes relationships: Membrane, Cell membrane, Endoplasmic reticulum membrane, Plasma membrane; Mammal, Rodent, Primate, Homo sapiens. ContainedCompartments relationships: Eukaryote (Systematic group), Nucleated cell (Cell Type), Cytoplasm, Nucleus (Subcellular Compartment), Nucleus, Nucleosome.
(b) Classes: Compound, GenericCompound, SpecificCompound, SubTypes. SubTypes relationships: Sugar, Hexose, Fructose, L-Fructose, Glucose, L-Glucose, D-Glucose.
(c) Classes: Catalysable, EC, Reaction, SubTypes. SubTypes relationships: 1. Oxidoreductases; 1.1 Acting on the CH-OH group of donors; 1.1.1 With NAD+ or NADP+ as acceptor; 1.1.1.1 Alcohol dehydrogenase; Aldehyde reductase; Alcohol + NAD+ <=> Aldehyde + NADH.
(d) Classes: PathwayBuildingBlock, Pathway, Interaction, PathwayNodes. PathwayNodes relationships: Amino acid metabolism (pathway of pathways); Amino acid biosyntheses (pathway of pathways); Aspartate-derived amino acid biosyntheses (pathway of pathways); Methionine biosynthesis, Lysine biosynthesis, Isoleucine biosynthesis, Threonine biosynthesis (metabolic pathways); Aspartate phosphorylation step (metabolic pathway step); aspartate phosphorylation reaction (interaction).]

Figure 4: Representation of classification schemes in the aMAZE data model. The data model supports classification schemes at multiple levels using SubTypes and ContainedCompartments relationships. Some examples are illustrated here. (a) Organism classification, and classification of sub-cellular compartments. (b) Hierarchic classification of Compounds. (c) Classification of enzymatic reactions, using the EC classification (Bairoch, 1993). (d) Hierarchy of pathways.
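The multi-level classification described in the Figure 4 caption can be sketched as a tree in which each object lists its more specific subtypes. The following Python fragment is an illustration only, not the aMAZE schema: the class `Node`, its `subtypes` attribute and the functions `descendants` and `path_to` are hypothetical names, and the nesting of the sugar example (Sugar, then Hexose, then Fructose and Glucose with their L- and D- forms) is assumed from the order of the labels in panel (b).

```python
class Node:
    """A classified object; `subtypes` holds its more specific children."""

    def __init__(self, name):
        self.name = name
        self.subtypes = []

    def add(self, *children):
        """Attach subtypes and return self so that calls can be nested."""
        self.subtypes.extend(children)
        return self


def descendants(node):
    """Yield the names of all direct and indirect subtypes, depth-first."""
    for child in node.subtypes:
        yield child.name
        yield from descendants(child)


def path_to(root, target, trail=()):
    """Return the chain of names from `root` down to `target`, or None."""
    trail = trail + (root.name,)
    if root.name == target:
        return list(trail)
    for child in root.subtypes:
        found = path_to(child, target, trail)
        if found:
            return found
    return None


# Assumed nesting of the compound classification of panel (b).
fructose = Node("Fructose").add(Node("L-Fructose"))
glucose = Node("Glucose").add(Node("L-Glucose"), Node("D-Glucose"))
sugar = Node("Sugar").add(Node("Hexose").add(fructose, glucose))

print(path_to(sugar, "D-Glucose"))
# ['Sugar', 'Hexose', 'Glucose', 'D-Glucose']
```

The same structure would serve for the EC numbers of panel (c) or, with a second relationship alongside `subtypes`, for the ContainedCompartments links of panel (a); a generic query such as "all hexoses" then reduces to a call to `descendants`.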