Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Concurrency control wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Functional Database Model wikipedia , lookup
Relational model wikipedia , lookup
Clusterpoint wikipedia , lookup
Object-relational impedance mismatch wikipedia , lookup
Ulrike Wittig works as a research associate in the Scienti®c Database and Visualisation group of the European Media Laboratory GmbH (EML) Heidelberg, Germany. Her research interests include modelling and visualisation of biochemical pathways, information extraction from biological data sources and data integration from biological databases. Ann De Beuckelaer is a member of the Scienti®c Database and Visualisation Group of the European Media Laboratory GmbH (EML) in Heidelberg looking at the electronic representation of biochemical pathways. Her interest is the development of data models allowing the complete biochemical system to be represented for organisms of any kind. Keywords: database, pathway, metabolism, enzyme, genome, bioinformatics Analysis and comparison of metabolic pathway databases Ulrike Wittig and Ann De Beuckelaer Date received: 10th January 2001 Abstract Enormous amounts of data result from genome sequencing projects and new experimental methods. Within this tremendous amount of genomic data 30±40 per cent of the genes being identi®ed in an organism remain unknown in terms of their biological function. As a consequence of this lack of information the overall schema of all the biological functions occurring in a speci®c organism cannot be properly represented. To understand the functional properties of the genomic data more experimental data must be collected. A pathway database is an effort to handle the current knowledge of biochemical pathways and in addition can be used for interpretation of sequence data. Some of the existing pathway databases can be interpreted as detailed functional annotations of genomes because they are tightly integrated with genomic information. However, experimental data are often lacking in these databases. This paper summarises a list of pathway databases and some of their corresponding biological databases, and also focuses on information about the content and the structure of these databases, the organisation of the data and the reliability of stored information from a biological point of view. Moreover, information about the representation of the pathway data and tools to work with the data are given. Advantages and disadvantages of the analysed databases are pointed out, and an overview to biological scientists on how to use these pathway databases is given. INTRODUCTION Ulrike Wittig, European Media Laboratory, Villa Bosch, Schloss-Wolfsbrunnenweg 33, 69118 Heidelberg, Germany Tel: 49 6221 533117 Fax: 49 6221 533298 E-mail: [email protected] 126 Initially, biological databases were created to store information gained during a particular project. As a consequence, all the data are stored in large and widely distributed databases, with heterogeneous data formats and access mechanisms. In most of the review articles published on biological databases1±3 a critical analysis of the data is missing. However, to extract and use data from heterogeneous data sources it is essential for a user to have information about the origin of the data, updates, redundancies and reliability. This paper primarily focuses on pathway databases and critically analyses their structure, the content of information being stored and especially the reliability of the data. Pathway databases are closely related to enzyme databases and this means that some of the enzyme databases were also checked according to their biological content. The fast-growing amount of data, creation of new databases and changes in existing databases on the Internet means that the analysis is inevitably incomplete. PATHWAY DATABASES Databases developed in the last decades can be classi®ed into different categories, including genome databases, protein databases, enzyme databases, pathway databases, literature databases and some very specialised databases. This classi®cation of databases is often based on their biological content. Although the content of the databases is mostly restricted to speci®c biochemical compounds or functions, a lot of overlap occurs. For example protein sequence information is listed in genome databases and functional data of corresponding enzymes can be found in protein and pathway databases. While analysing databases containing information about biochemical reactions a distinction must be made between & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 Analysis and comparison of metabolic pathway databases what is there? enzyme databases and the more complex pathway databases. Enzyme databases mainly contain information about enzymes and their properties. On the other hand pathway databases re¯ect information about reactions and pathways in general, data related to organismspeci®c information about genes, their related gene products, protein functions, expression data, data about enzymatic activities, kinetic data, etc. In most cases the above-mentioned enzyme databases are closely connected with pathway databases (Figure 1). Pathway databases can also be subclassi®ed into databases containing metabolic and/or signalling pathways. This review will mainly concentrate on metabolic pathway databases because it is dif®cult to compare the two types of content. Currently, the most commonly used pathway databases include WIT, KEGG, EcoCyc, ExPASy±Biochemical Pathways, PathDB and UM-BBD (Table 1). These pathway databases were selected and analysed particularly with regard to their biological/biochemical contents, data structures and data representation. One major advantage of pathway databases over other biological databases is the possibility of providing several types of information in the context of the graphical representation of pathways. For example pathway databases are able to represent the high complexity of all the biochemical reactions within a single cell or a complete organism. There are different ways to show a graphical view of a pathway. Graphical representations of the pathways are drawn manually and are non-dynamic (KEGG, WIT) or are produced automatically on demand and are user-dependent (PathDB, EcoCyc). For a user-dependent interaction, ¯exible representations will be more and more important because they enable different levels of information to be represented. To avoid information overload, but on the other hand to be able to differentiate between general overviews and detailed representations of reactions or pathways containing structural formulas of chemical compounds, users should be able to zoom into their speci®c level of interest. A challenge of pathways representation is to combine a ¯exible drawing of pathways, with the content behind all the elements within that given pathway, without loosing clarity. Therefore the following pathway databases were checked for their graphical representation, including issues about the availability of automatic drawings and the possibility of userdependent changes and interactions. Furthermore, tools available to work with the data were analysed. WIT (WHAT IS THERE?) Content of the database Figure 1: The connectivity of the main database types The WIT system connects data about genes and genomes, enzymes, reactions and pathways.4,5 The Enzyme and Metabolic Pathways (EMP) database is currently embedded within WIT. The EMP data are derived from about 17,000 literature sources and are organised in a number of tables. Organism-speci®c information was extracted manually from the literature, including protocols of experiments (puri®cation steps, storage conditions, etc.) and reaction mechanisms & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 127 Wittig and De Beuckelaer Table 1: Pathway databases and relevant databases for data retrieval in pathway databases Classi®cation Content Example databases Pathway databases Pathway databases Metabolic and/or signalling pathways, reaction participants, relations between genes, enzymes and reactions KEGG http://www.genome.ad.jp/kegg/kegg.html WIT http://wit.mcs.anl.gov/WIT2/ EcoCyc/MetaCyc http://ecocyc.pangeasystems.com/ecocyc/ ExPASy±Biochemical Pathways http://expasy.proteome.org.au/cgi-bin/search-biochemindex CSNDB http://geo.nihs.go.jp/csndb/ PathDB http://www.ncgr.org/software/pathdb/ UM-BBD http://www.labmed.umn.edu/umbbd/index.html SPAD http://www.grt.kyushu-u.ac.jp/spad Relevant databases often used for data retrieval in pathway databases Genome databases Nucleotide sequences, GenBank functional annotations EMBL Database DataBank of Japan TIGR MGDB FlyBase Protein and protein- Protein sequences and their SWISS-PROT/TrEMBL related databases characteristics, functional PROSITE annotations SCOP PDB Pedant Transfac PIR Enzyme databases Enzymatic properties of ExPASy±ENZYME proteins BRENDA EMP Chemical databases Chemical properties of Beilstein compounds ChemFinder Literature database Abstracts of scienti®c literature Medline/PubMed (de®ned as a network of elementary reactions). Access to the data WIT offers a general functional overview as one possible starting point represented as a classi®cation table of pathways. Unfortunately this classi®cation is not consistent with other pathway databases. For instance WIT has many different types of glycolysis dependent on the initial substrate and/or ®nal product. To ®nd a speci®c pathway some detailed information about the pathway of interest is necessary. Since pathway names contain a mixture of abbreviations and systematic names6 they are often very long and incomprehensible. Querying the EMP database for enzymes by name or EC number can be improved by de®ning compartments, cell types, tissues or organisms. Pathways can 128 http://www.ncbi.nlm.nih.gov/ http://www.ebi.ac.uk/embl/index.html http://www.ddbj.nig.ac.jp/ http://www.tigr.org/tdb/ http://mbgd.genome.ad.jp/ http://¯y.ebi.ac.uk:7081 http://www.ebi.ac.uk/swissprot/index.html http://www.expasy.ch/prosite/ http://scop.mrc-lmb.cam.ac.uk/scop/ http://www.rcsb.org/pdb/ http://pedant.mips.biochem.mpg.de/ http://transfac.gbf-braunschweig.de/TRANSFAC/index.html http://www-nbrf.georgetown.edu/pirwww http://www.expasy.ch/enzyme/ http://www.brenda.uni-koeln.de/ http://wit.mcs.anl.gov/EMP http://www.beilstein.com/ http://www.chem®nder.com http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?dbPubMed be searched by selecting compartments for initial substrate, intermediate and ®nal products, etc. Furthermore, organisms, cell types or tissues can also be selected in the general query system. Since the query results contain internal accession numbers for enzymes and abbreviated names for pathways they are dif®cult to interpret. Data representation The `View Annotation' window gives names, EC number and functional description of an enzyme. Graphical representations contain comments and information on the compartments where the reaction occurs and links to information about the enzyme. The descriptive page of enzymes is linked to KEGG, EMP and Medline. However, the enzyme information derived from KEGG contains some slight differences from the original KEGG project, eg links to KEGG & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 Analysis and comparison of metabolic pathway databases Kyoto Encyclopedia of Genes and Genomes maps, 3D-structure information from PDB (Protein Data Bank) and links to the BRENDA enzyme database are removed, but accession numbers to the PIR (Protein Information Resource) database are given. The link to the EMP database allows the retrieval of literature-related and organism-speci®c enzyme information.7 The `Diagram Data' window contains detailed information on the overall balance of the reaction, every compound involved in the reaction and the location or compartment of the cell where the reaction occurs. These data are used for the graphical representation of the pathways that are represented in the `Diagram Picture' view. The `Assertions Table' contains information about existing enzymes of a given pathway in complete sequenced organisms. One major disadvantage of the WIT system is that there is no graphical representation of all pathways in a general overview. Reliability/inconsistency WIT uses a very comprehensive schema for classifying biochemical pathways. Unfortunately this classi®cation system is not comparable with classi®cations used in other databases (Figure 2). Going into the detailed list of pathways the user is confronted with very long pathway names that contain abbreviations of start and/or ®nal compound of the pathway. For example the catabolism of glucose (from glucose to pyruvate) is named `GLCPYR.CAT' but names such as `GLCPYRADPATPNAD.CAT' also occur, where it is very dif®cult to understand the meaning. Additional tools Homology searches for genes or enzymes and a pathway search option are implemented. The result of these queries contains only enzymes and the number of matches for each pathway name but no graphical representation. KEGG Content of the database KEGG (Kyoto Encyclopedia of Genes and Genomes) contains all known metabolic pathways and a limited number of regulatory pathways and transport mechanisms.8,9 The KEGG system consists of three main databases which are tightly connected: LIGAND with information about compounds, enzymes and reactions stored in ¯at ®les,10,11 PATHWAY, which contains the graphical representations of the pathways and lists of enzymes and reactions within the pathways, and GENES, which contains organism-related genome and gene information and lists of genes within an organism and pathway. Furthermore KEGG provides many links to other databases that are integrated within the DBGET integrated database retrieval system.12 KEGG's gene and genome data are based on genome databases containing all publicly available nucleotide sequences and their functional assignments. Biochemical data of pathways, reactions, etc., for instance compound and enzyme names, reactions or comments, are based on IUPAC (International Union of Pure and Applied Chemistry) and IUBMB (International Union of Biochemistry and Molecular Biology) recommendations.13 Chemical structures of compounds are stored in two different formats: a GIF image format and 2D coordinates stored in an MDL-MOL ®le (®le format for molecular structures from MDL Information Systems Inc.), which can be used to launch a proper ISIS/Draw application. Access to the data KEGG's main menu offers several different ways to enter the pathway systems: one can select an organism from a list of complete or partially sequenced organisms, or select a pathway from the classi®cation of pathways (subdivided into metabolic and regulatory pathways), or search for a speci®c compound or enzyme & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 129 Wittig and De Beuckelaer in the LIGAND database or for a gene in the GENE database. The query interfaces do not allow complex queries; for example, a query for all known enzymes that use a de®ned substrate. The general overview of all pathways using a colour-code according to the classi®cation of pathways enables the initial entry of the program. This allows access to speci®c pathway classes, eg carbohydrate metabolism. A window opens which represents links to all speci®c pathways within carbohydrate metabolism. Data representation Pathways in KEGG are classi®ed according to the chemical structures of their main compounds, eg carbohydrates, lipids, amino acids. All speci®c pathway maps and overviews are manually drawn pictures where pathway maps consist of links to speci®c information about compounds, enzymes and genes. Pathway maps contain all known reactions catalysed by proteins/enzymes derived from gene products. Non-enzymatic reactions are not always included in that system. However, in some cases nonenzymatic reactions are shown within a pathway map but do not have links to detailed information of this speci®c reaction. Reactions within the pathway maps do not represent side compounds, eg ATP (adenosine triphosphate) or NADH (reduced nicotinamide adenine dinucleotide). Pathway maps are also linked to some other related pathways connected by their contributing compounds. This allows the user to get an overview about connections to other pathways, eg one main ®nal product of glycolysis ± pyruvate ± is used in the different pathways of amino acid metabolism. Unfortunately not all possible connections are given. As well as the reference pathway the user can select pathway information of a speci®c organism. Within the reference pathway all the enzymes that are known to be expressed in the speci®c organism (found by homology search) are 130 highlighted in green. Furthermore other selections for speci®c information are also available,10 eg highlighting of all enzymes with identi®ed 3D structure, enzymes with sequences in the SWISS-PROT, GenBank or PIR databases, and enzymes that have a link to genetic diseases in the OMIM (Online Mendelian Inheritance in Man) database. Genome maps are represented as graphics that give information about gene positions and their relationship with the pathways. Gene catalogues contain hierarchical texts that include all known genes for each organism, listed according to the pathway classi®cation. Information about orthologous genes are stored in tables containing the information of a conserved, functional unit in a pathway, a comparative list of genes for the functional unit in different organisms, and the positional information of genes clustered in each genome. Molecule catalogues are represented by hierarchical texts containing macromolecules (including proteins and RNAs) and small chemical compounds. These data are based on classi®cations of enzymes and chemical compounds. Enzymes are classi®ed based on IUPAC/ IUBMB recommendations and additionally according to PIR superfamilies, SCOP (Structural Classi®cation of Proteins) 3D-folds or PROSITE (database of protein families and domains) motifs. The pathway query result shows all pathways containing a given enzyme or compound but offers no graphical representation. Organismspeci®c information is also not available within such results. Searching for a pathway by selecting a speci®c organism does not provide information about the enzymes available but only links to the gene information related to that organism. A comparison of pathways of two or more organisms cannot be implemented. Enzyme pages contain the following information: names, EC classi®cation according to IUBMB, reactions and their participants (substrates, products, cofactors, inhibitors, etc.), pathways, & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 Analysis and comparison of metabolic pathway databases related genes in organisms and links to diseases (OMIM), motifs (PROSITE) and 3D structures (PDB). Compounds are represented with names, chemical and structural formula, pathways in which the compound occurs and enzymes that catalyse a reaction containing the compound as participant. Reliability/inconsistency KEGG does not contain all possible reactions catalysed by an enzyme. When comparing reaction equations with the lists of substrates and products, inconsistencies sometimes arise, eg the substrate list of pyruvate kinase (EC 2.7.1.40) contains more compounds than those listed in the reaction equations. Often reactions catalysed by an enzyme, eg alcohol dehydrogenase (EC 1.1.1.1), with a wide substrate speci®city are given as general reaction equations, which means loss of detailed information concerning the correlating products of each substrate. Moreover, enzymes are represented with their substrates and products, but reaction equations and pathway information are missing, eg peptidases. Within compounds some errors and inconsistencies arise. For example sometimes one compound ID is given for two different compounds (eg C00023 for both Fe2 and Fe3 ) or different compound IDs are given for the same compound (C02038 and C02156 for glycyl-peptide or C04230 and C01174 for 1-acyl-glycero-3-phosphocholine among others). Additional tools EcoCyc/MetaCyc KEGG offers some tools for querying within different maps and also methods to compute with KEGG data. Tools for searching enzymes (EC numbers ± names not possible), compounds (compound ID ± names not possible) or genes (gene name or accession number) are available. Within genome maps queries for gene positions or homologous gene clusters are available as well as comparisons of two genome maps for exhaustive search of homologous gene clusters. Tools for colouring and thereby highlighting the genes in the pathway or genome map make the system user-friendly and enable a quick overview on the result of the query. One unique feature of the KEGG system is the creation of a pathway between two given compounds.10 As a starting point, two compound IDs, the parameter level and hierarchy of relaxation have to be de®ned. Unfortunately there is no help function or documentation about the parameters. Based on binary relations two algorithms (Dijkstra/Floyd) are used to ®nd the shortest way between two compounds. In such cases where a reaction has two or more substrates the implementation is not able to distinguish between different substrates and their correlating products or compartments and transport mechanisms. Pathways are created based on one main substrate without paying attention to side substrates or products, compound/enzyme locations and cofactors. Results of the pathway creation contain only the compound ID and EC number. For users who are not familiar with the compound IDs used in KEGG it is very time-consuming to understand the resulting pathway since there is no graphical representation of the search results. This tool does not apply organism-speci®c searches and creates pathways containing all possible reactions regardless of whether the speci®c enzyme has been identi®ed in the speci®c organism or not. Highlighting these identi®ed enzymes and proven pathways among them would be very helpful. Automatic reconstruction of organismspeci®c metabolic pathways can be accomplished by matching the genes corresponding to the enzymes in the gene catalogues with the enzymes on the reference pathway diagrams. ECOCYC/METACYC Content of the database EcoCyc is a metabolic-pathway database that describes the genome and the & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 131 Wittig and De Beuckelaer biochemistry of Escherichia coli based on information from the EcoGene database,14 SWISS-PROT and the scienti®c literature.15,16 The database consists of all sequences and functional annotations of E. coli genes. Pathways of E. coli and its reactions and enzymes are annotated with references to the literature. In addition MetaCyc describes pathways, reactions and enzymes of a variety of organisms, with a microbial focus. But MetaCyc does not contain organism-speci®c genome or protein information such as genomic maps or sequences. MetaCyc uses the same database schema and visualisation software as EcoCyc.17 Access to the data The EcoCyc query interface provides search options for E. coli genes, proteins, reactions, compounds and pathways by names, sub-strings, classi®cation hierarchy, EC number or chemical structure. In contrast to EcoCyc the MetaCyc query interface only offers searching for pathways, reactions and compounds. Proteins or genes can be selected for querying but no matches will be found by searching, for example, EC numbers. The EcoCyc user is also able to start the system from an E. coli metabolic overview in addition to the query page. MetaCyc offers no general pathway overview. The EcoCyc overview is a bird's-eye view of the complete E. coli biochemical pathways.3 The glycolysis and TCA (tricarboxylic acid) cycles in the middle separate predominately catabolic pathways on the right from pathways of anabolism and intermediary metabolism on the left. Signalling pathways run along the bottom of the ®gure. Individual reactions at the right of the ®gure represent reactions not currently assigned to any pathway. However, these single reactions without a classi®cation are not very helpful in the overview. The picture uses different icons (circles, squares, triangles, etc.) for different compound classes. If a mouse pointer is assigned to a circle in the diagram, the name of the corresponding 132 metabolite and the name of the related pathway are displayed at the bottom of the www browser window. This operation will work only for JavaScriptenabled browsers. If such a system is not available, the user needs biochemical knowledge to recognise which graph can represent the pathway. In that case the metabolic overview will not help to get detailed information. By clicking on a compound the pathway containing the selected compound will be displayed. Unfortunately there is no way to get compound information directly from the overview window. The metabolic overview is linked directly to representations of speci®c pathways or disconnected single reactions but it is not possible to detect where the pathway is located within the complex overview. As in many other databases, compartmentalisation and consequentially transport mechanisms are missing. Data representation Pathways and reactions are drawn automatically using graph-drawing algorithms, and size may be adjusted. The graphical representation of pathways offers different levels of information. The lowest level contains only the major reactions of a given pathway. Information about enzymes and genes, etc. is not available. In the more detailed level all enzymes with their corresponding EC numbers and corresponding genes are given. Some pathways show chemical structures of the reactants on that level, but this feature is not consistent. Some pathways also include green arrows that indicate regulatory mechanisms (dotted lines) or connected pathways (solid lines). Not all available connections between pathways and reactions shown in the metabolic overview picture are shown within the pathway representation (eg TCA cycle). Within the graphical representation of pathways, inhibitors and activators of reactions are represented, but detailed compound information is not accessible. Names of the compounds are available in & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 Analysis and comparison of metabolic pathway databases PathDB the bottom line of the JavaScript-enabled www browser window. Enzyme information consists of names, enzyme classi®cation, catalysed reactions and pathways in which the enzyme occurs. More data about the enzyme are given by connecting to the ExPASyENZYME database. Within EcoCyc a graphical representation of the genereaction schema shows the relation between E. coli genes coding for enzymes or subunits. It also includes reactions that can be catalysed by more than one enzyme. A linkage of the enzyme to the related gene or vice versa is implemented. The E. coli genes are characterised by their names, sequences, classi®cation, positions within the genome and information about the corresponding enzymes. Links to the E. coli Stock Center Database (CGSC)18 and automatic queries in GenBank are installed. Compound properties are represented with names, chemical formulas, molecular weights, CAS (Chemical Abstracts Service) number and structural formulas, but these formulas are not based on uniform drawings. In addition a list of reactions in which the compound occurs is given. Reliability/inconsistency The classi®cation of pathways refers to the classi®cation of E. coli gene products,19,20 but is not used consistently. The biosynthesis of amino acid families includes, for example, the following instances: `ileval biosynthesis', `leuvalalaile biosynthesis' and `leuvalile biosynthesis'. In such a case a differentiation between these pathways does not make sense. Compounds are based on a de®ned classi®cation, but this system of compound classi®cation has some discrepancies. For example it has the type `unclassi®ed compounds' but within this category compounds with known classi®cation (eg phospholipids, steroids) also occur. There is a category `animal compounds' which includes hormones, etc., but these compounds do not participate in reactions of E. coli and are therefore not included within speci®c pathways. On the other hand, the differences between animal compounds, unclassi®ed and non-metabolic compounds are unclear. Additional tools Unfortunately the query options are not complex enough to search for reactions or pathways by de®ning substrates, products, etc. One important feature of the EcoCyc system is the possibility of importing E. coli expression data and colour-coded highlighting of reactions in the general pathway overview according to the relative or absolute amount of the gene expressions. PATHDB Content of the database PathDB is a general metabolic pathway project of NCGR (National Center for Genome Resources) representing compounds, metabolic proteins (enzymes or transport proteins), metabolic steps (reactions) and pathways.21 Information about locations within cells or organisms are also given if available. PathDB also includes descriptions of the kinetic, thermodynamic and physicochemical properties of pathway components based on the scienti®c literature. The integration of these data into PathDB offers a good basis for the connection with simulation tools. Access to the data A query interface allows the user to ®nd pathways, enzymes, reactions or compounds in the database by searching given attributes (all possible roles of compounds, kinetic types, modi®ers, taxonomy, citations, etc.). A Java application can be downloaded to run complex queries. This interface makes it possible to navigate between the relations in various entities. Data representation The graphical representation of pathways is based on graph-drawing algorithms, & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 133 Wittig and De Beuckelaer ExPASy-biochemical pathways including, hierarchical and circular views using prede®ned positions for the compounds. Zooming and scaling tools are installed and the user can select speci®c information. For instance enzymes can be displayed either by their names, their EC numbers or both. In addition the user can choose whether primary and/or secondary substrates and products are visible. Depending on the amount of selected information the representation can be very complex and confusing because of overlapping terms and arrows. Compounds are characterised by their names, chemical and structural formulas, molecular masses and other physicochemical data. The information about chemical structures is stored as a SMILES (Daylight Chemical Information Systems, Inc.) string that can be used to generate the 2D structure. Metabolic proteins are characterised by their names, modi®ers and physicochemical properties and links are given to databases containing sequence or structure information. A reaction is de®ned by its substrates, products and the stoichiometric equation. They are displayed as metabolic steps that also include transport mechanisms. Pathways are built by sequences of metabolic steps including their direction, catalysts, modi®ers and location. Modi®ers can also be subdivided into activators and inhibitors. Reliability/inconsistency Unfortunately cross-references within the pathway representation to other connected pathways are missing. Besides common and alternative names of a compound, PathDB also contains a speci®c compound label used for the graphical representation. However, most of the compound labels are not commonly used abbreviations of compound names. Additional tools One feature of PathDB is a tool for a user-dependent pathway generation between two compounds, so the database is not restricted to prede®ned pathways. 134 EXPASY±BIOCHEMICAL PATHWAYS Content of the database ExPASy (Expert Protein Analysis System) of the Swiss Institute of Bioinformatics consists of two databases related to biochemical pathway information ± ENZYME and Biochemical Pathways.22 ExPASy±Biochemical Pathways is the electronic version of the Boehringer Mannheim Biochemical Pathways Chart.23 An electronic index allows the localisation of any metabolite or enzyme on the chart. Access to the data Querying for compounds, enzymes or pathways by names or substrings results in a list of related positions (squares) on the graphical map. If a reaction is not complete within that resulting square the reaction is cut and the corresponding information is lost. One of the four adjacent squares can be selected to expand the view. Most enzyme names on the chart act as links to the ExPASy±ENZYME database. Data representation Unfortunately, ExPASY does not implement a general overview of all pathways on the complete map. ENZYME, like Biochemical Pathways, does not include organism-speci®c pathway information except for the usage of three colours to distinguish pathways related to higher plants (green), animals (blue) or microorganisms (red). Black arrows indicate general biochemical pathways. The information about the colour-coded representation of reactions and compounds and other remarks related to the pathway maps (eg explanations of abbreviations) is only available on the printed versions of the Boehringer Mannheim wall chart.23,24 Since there is no documentation within the electronic version the user has to have one of the printed versions for data interpretation in the electronic version. Enzymes on the electronic map are & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 Analysis and comparison of metabolic pathway databases connected to the enzyme properties stored in ExPASy±ENZYME containing names, reaction properties and links to other enzyme or pathway databases. Compounds on the map are not linked to any source of chemical database. Reliability/inconsistency Within the graphical pathway representation it is dif®cult to distinguish the pathways according to their classi®cation. Since only two squares of the chart can be selected at the same time it is not possible to view complete pathways, eg glycolysis. On the electronic as well as on the printed version the pathway representation includes redundancies because some enzymes or compounds appear more than once on the complete map. Links between enzymes or compounds are not included, which implies no connectivity between reactions or pathways comprising the same enzymes or compounds. This fact disables an overview of the complexity and connectivity of pathways. Additional tools The ExPASy system does not offer any tools to navigate on the pathway map or to search for reactions or pathways by de®ning substrates, etc. University of Minnesota Biocatalysis/ Biodegradation Database UM-BBD Content of the database UM-BBD (University of Minnesota Biocatalysis/Biodegradation Database) contains information on reactions related to biodegradation pathways of xenobiotic compounds in microorganisms.25,26 It provides information on pathways, single reactions, chemical compounds including intermediate states of the reactions, organisms, enzymes and genes. The system is connected to other resources such as GenBank, ChemFinder, Medline, etc. Access to the data A search function allows queries for speci®c compounds or enzymes by de®ning names or substrings, chemical formula, CAS number or EC number respectively. The search page also has links to lists of pathways, compounds, enzymes, reactions and organisms that can be selected. UM-BBD gives no general pathway overview. Most of the pathways and reactions are disconnected. Data representation When appropriate citations in the literature exist UM-BBD generates new pathway, reaction and compound pages based on the literature source. The representation of a pathway contains the names of compounds and reactions and gives links to the more detailed information of the compounds on their corresponding pages. Starting compounds are used to de®ne the name of a pathway. Organisms are listed if they are known to initiate this pathway or carry out later steps. If both genus and species names are known, the organism name links to a search of a microbial database. The pathway also contains possible cofactors. If compounds occur that are parts of an intermediary, metabolism pathway links to the KEGG database are given. The compound that links to KEGG is shown as a red circle in the KEGG pathway map, whereas KEGG does not link to the UMBBD on its pathway pages. A graphical representation of a reaction contains the reacting atoms and bonds highlighted in a bold red font. If literature information about the reaction mechanism is available a link to a graphic showing the reaction mechanism appears. There are two different types of pathway representations: a text format representation and a static drawing, which is more restrictive. Enzymes of a reaction are linked to GenBank, when one or more hits have been found. Furthermore enzyme pages in the KEGG-LIGAND, in the ExPASyENZYME and/or in the BRENDA enzyme database are connected when a four-digit EC number exists. If an incomplete EC number occurs, ExPASy± ENZYME gives a list of all possible enzymes within that enzyme subclass. The & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 135 Wittig and De Beuckelaer enzyme pages contain a list of reactions that are catalysed by the enzyme and links to each of the static UM-BBD pathway maps in which one or more reactions catalysed by this enzyme appear. The compound information includes names and synonyms, chemical formula, molecular mass, a SMILES format, CAS number and links to ChemFinder. The compound page also contains a list of all reactions in the database which use or produce this compound and links to the UM-BBD pathway maps in which the compound appears. Reliability/inconsistency One important feature is the plausibility of a reaction or pathway. Pathways with less evidence or unknown intermediates are indicated in several ways and mentioned in the pathway introduction. If compounds are plausible, but not certain, or are intermediates, they are put in brackets. Multiple arrows indicate unknown intermediates. Within the introduction of a pathway or reaction page the signi®cance of the pathway is given. The names and af®liations of the people who contributed the pathway to the UM-BBD are important. This feature is unique for this database system and allows the user to contact the person in question directly. Additional tools A dynamically generated pathway map starting from a de®ned reaction can be generated and is shown only in text format representation. The created pathway maps include all possible reactions from all organisms found in the UM-BBD and link to each of the static pathway maps in which the reactions appear. Comparison of databases COMPARISON OF THE DATABASES A comparison of the analysed databases is shown in Table 2. The simple molecule `Ethanol' was selected as a criterion to compare the database entries. The result of this comparison points 136 out the following problems. WIT, KEGG and ExPASy offer no speci®c queries for compounds which, for example, results in lists of enzymes or genes containing the substring `ethanol'. Except for PathDB all selected queries use the keyword as substring and result in a list of entries. From these lists `ethanol' can be selected directly, except for WIT and ExPASy, where the compound pages can only be accessed by navigation, for example through the enzyme pages. Unfortunately in the case of `ethanol' ExPASy has no implemented link to a corresponding compound page. In general compound information is available but not consistently for all compounds. The databases have different preferences for contents within their compound pages. Some of the databases represent structural formulas or SMILES strings. The databases also contain different lists of names, which are written with either capital or lower case letters. Also within the same databases the types of letters are used inconsistently. Only KEGG and EcoCyc refer to pathways containing this compound but the pathways are assigned to different types of classi®cations. The representation of reactions containing the selected compound is very inconsistent between the databases. WIT has no reference to reactions, KEGG refers to numbers of enzymes that use the compound as reactant (R), cofactor (C), inhibitor (I), etc. and other databases refer to reaction equations. One example of inconsistent reactions within a database is shown in EcoCyc. It contains reactions that use `ethanol' as substrate and the resulting product is `an aldehyde'. Ethanol is a speci®c compound with molecular mass, structural formula, etc. An aldehyde represents a class of compounds that contain the same functional group. Using ethanol as the substrate only one possible aldehyde (acetaldehyde) can be generated. Note that, regarding the references on the compound pages, UM-BBD gives information about the author and date of the entry. One piece of missing & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 137 C2 H6 O ÿ 64-17-5 Chemical formula Structural formula SMILES string CAS number ÿ CCO ÿ Ethanol, ethyl alcohol, methylcarbinol ± ± ± ± ± C2 H6 O ÿ 64-17-5 Absolute alcohol, alcohol, etOH, eth, ethyl alcohol, grain alcohol No access to compound Select `ethanol' from query results page (only position on the pathway map), access to compound pages through enzyme pages where the compound is used in the reaction Not available for ethanol ! C2 H6 O ÿ OCC 64-17-5 Alcohol, ethanol, ethyl alcohol, ethyl hydroxide, grain alcohol, methylcarbinol, potato alcohol, spirit, synasol, wine spirit, cologne spirit Select `ethanol' from query results 1 compound 23 compounds Select `ethanol' from query results Query compound by name PathDB Query compound by name EcoCyc General keywords search (no speci®cation) 7 enzymes, 8 compounds ExPASy Query the LIGAND database 23 enzymes, 51 compounds KEGG Compound page info: Names Ethanol, ethyl alcohol, methylcarbinol Search for compound `ethanol' Query ®eld and Search all data entries speci®cation (no compound speci®c query) Number of hits 35 genes, 46 pathways, 23 enzymes, 21 topics in functional overview Access to compound No query for compounds page available, access to compound pages through navigation of a pathway (eg ethanol oxidation) and diagram picture WIT continued overleaf Anhydrol, alcohol, methylcarbinol, denatured alcohol, ethyl hydrate, ethyl hydroxide, ethanol, algrain, cologne spirit, fermentation alcohol, grain alcohol, jaysol, jaysol s, molasses alcohol, potato alcohol, spirit, spirits of wine, tecsol, alcohol dehydrated, ethanol 200 proof, cologne spirits (alcohol), sd alcohol 23-hydrogen, Synasol C2 H6 O CCO 64-17-5 Select `ethanol' from query results Search for compound whose name contains the keyword 19 compounds UM-BBD Table 2: Comparison of the pathway databases (WIT, KEGG, ExPASy, EcoCyc, PathDB, UM-BBD) in regard to the search options and query results. `Ethanol' is used as keyword to access compound information 138 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 ÿ ÿ ÿ NSC number: 85228 DaylightTM Molecular mass Pathway Enzyme/reaction Reference/link WIT Table 2: (Continued) ± ÿ Glycolysis/gluconeogenesis 1.11.1.6 (R) 2.3.1.152 (R) 3.1.1.67 (R) 3.5.1.75 (R) KEGG ± ± ± ± ExPASy 46.069 Fermentation Acetaldehyde NADH ethanol NAD NAD ethanol NADH an aldehyde or ketone NADP ethanol NADPH an aldehyde NADP ethanol NADPH an aldehyde H2O urethan NH3 CO2 ethanol H2O a long chain-acyl ethyl ester ethanol a fatty acid ± EcoCyc ± 46.08 ÿ Ethanol acetaldehyde Ethanol aldehyde Urethane ethanol NH3 PathDB ChemFinder Natl. Toxicology Program Page Author Date 46.07 ÿ Ethyl acetate acetate ethanol Ethanol acetaldehyde UM-BBD Analysis and comparison of metabolic pathway databases information on these compound pages in most of the databases is a reference to the classi®cation of compounds used, similar to the classi®cation of enzymes on most enzyme pages. errors and inconsistencies ERRORS AND INCONSISTENCIES As can be seen within the abovementioned examples there are always problems, errors and inconsistencies within one single database and between different databases. Until now it has not been possible to automate the detection of errors within databases. For example errors such as wrong structural formulas of compounds or different compound names of the same compound which also cause two different compound entries in the database can only be detected manually by checking each single entry. Typically manual databases will always contain typing errors. In most cases such errors do not change the meaning of the content. But certainly there are some cases where a typing error changes the property of a protein and enzyme. The enzyme L-ribulose-5-phosphate 4epimerase (EC 5.1.3.4) catalyses the reaction between L-ribulose-5-phosphate and D-xylulose-5-phosphate, according to IUBMB recommendations.13 Databases such as BRENDA, KEGG, PathDB and WIT/EMP use this equation with the above-mentioned substrate and product. In contrast, the protein database SWISSPROT and ExPASy±ENZYME associate the same enzyme to a reaction between Lribulose-5-phosphate and D-xylose-5phosphate. Obviously the product name D-xy(lu)lose-5-phosphate contains a typing error that changes the catalytic property of the catalysing enzyme. Since several other databases extract their data from this source, the wrong relation between the compound and the enzyme, including all the detailed information on the reaction, is transferred and spread throughout other database systems. Copying or linking data causes the transfer of errors from one database to another. Such errors are relatively easy to detect for an expert user. In contrast typing errors within gene or protein sequences are not usually so obvious. Other problems that occur within databases are errors from the original source or experimental problems. These errors include sequencing errors, errors from homology searches, structure prediction, etc. and incorrect, incomplete or wrong transfers from literature information into the database. The last case occurs by incorrect or incomplete interpretation of experimental results or by drawing wrong conclusions from the literature sources that are incorporated manually. The risk here is the decrease in reliability in the statements and contents of such systems. One way to avoid such problems is to add the source of data and the reference, which gives the database user the chance to see the original results from a homology search, etc. The comparisons of predicted data between databases are very error-prone because of the use of different tools for prediction. This especially affects data about 3D structures, motifs, molecular mass or isoelectric points of protein sequences. Both database user and database developer must remember that information is always contextdependent.27 Data cannot be generalised because of the dependence on environmental conditions. An important example is the recent expression data. In this case the environmental conditions and methods used are essential for the interpretation of the data and are often missing in the database. In the worst case these data will never be trusted by expert users and the database therefore virtually useless. Inconsistencies can be found by comparing database entries from different databases. Typically problems of inconsistencies arise from names and abbreviations used in different databases. There is no consistent nomenclature for genes and proteins unlike the enzyme nomenclature recommended by the IUBMB.13 Databases refer to different & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 139 Wittig and De Beuckelaer names, synonyms and abbreviations, depending on the source of the data including different types of spelling. In addition to the names no consistent nomenclature for functional descriptions of genes and proteins exists. This inconsistency is also transferred to inconsistent classi®cations of pathways (Figure 2). One example for the classi®cation of functions came up with the classi®cation of E. coli gene products established by Monica Riley.19,20 A de®ned nomenclature of functions similar to the enzyme classi®cation is certainly a great advantage.28 That should also imply a standard de®nition of keywords within comment lines ± understandable and usable for different types of users. In this context a biochemical classi®cation of chemical compounds has to be installed. There are different ways to classify chemicals: either by the chemical properties recommended by the IUPAC or by functional properties. The last option is being employed by most of the databases described above and causes a large degree of variability. For systems representing biochemical pathways a ¯exible classi®cation is needed which ideally combines chemical and functional characteristics of the compound but still has to be consistent throughout the system. Figure 2: Comparison of the classi®cation of Glycolysis (Embden± Meyerhoff±Parnas pathway ± EMP) used in KEGG, EcoCyc and WIT 140 A demonstrative representation of the connectivity of the main database types is given in Figure 1. Genome and chemical databases can be seen as the basis for all other databases. Data from literature databases enter other types of database. Pathway databases are located above other databases containing parts of data from the databases below. Consequently pathway databases might also contain the errors made in the other databases. This indicates the need for pathway database developers to be especially aware of errors and inconsistencies. Expert knowledge is needed to avoid or to detect errors and inconsistencies within databases: a time-consuming and expensive aspect. SUMMARY In general all analysed databases are far from being complete. In terms of userfriendly access and the amount of data incorporated in the database the KEGG system clearly deserves the highest praise. The more or less consistent nomenclature of the compounds based on international agreements such as the IUPAC and IUBMB, as well as the easy-to-understand classi®cation of compounds and reactions, makes this system an easy tool for ®rst access to pathway databases. However, for users familiar with the biochemical pathways of E. coli and with a JavaScriptenabled www browser, the EcoCyc system offers the most information on that speci®c organism since it also incorporates direct links to the literature and experimental data being used to set up the system. For users interested in the creation of new biochemical pathways between different compounds the KEGG, PathDB and the UM-BBD systems are the most useful ones. Concerning graphical representation of biochemical pathways, those users already familiar with the Boehringer Mannheim chart will easily ®nd their way through the ExPASy±Biochemical Pathways, whereas absolute beginners will de®nitely prefer the KEGG graphics. One disadvantage of all these pathway & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 Analysis and comparison of metabolic pathway databases systems is the absence of help functions or windows describing, for example, the concepts and/or restrictions that one has to de®ne when querying the data. In most of the databases meta-information, for example about the source or origin of the data (eg experiments, homology search, literature), is missing.29 The user is not provided with information as to whether the data are experimentally proven or not. Since most of the data are predicted by homology search, information about algorithms and their parameters being used for the homology search or the degree of homology would be very useful for data interpretation. As regards problems of errors, sequence databases were already analysed some years ago.29 Recommendations were given on how to solve problems of errors and missing meta-information. These suggestions made for sequence databases should now be applied to pathway databases. A comparison of data of pathway databases is complicated because of different classi®cations of compounds, genes, proteins, pathways and gene/ protein functions. Only the classi®cation of enzymes recommended by the IUBMB is used as a standard. One suggestion would be to have more standards for nomenclature. In database development a standardisation of data structures and ®le formats would make further applications more powerful. In conclusion, pathway database users should be aware of many possible errors within the databases and the large number of inconsistencies between the databases. The inconsistencies would cause problems for example during the extraction of data from different data sources. Developers of pathway databases should focus more on curation of the data. One suggestion would also be to try to standardise their database entries, for example, in the case of classi®cation. (Project BioRegio 0312212). The authors thank Andreas Kohlbecker and Isabel Rojas for discussions and reviewing the manuscript. Acknowledgements This work was supported by the Klaus Tschira Foundation gGmbH (KTF) and the BMBF 14. EcoGene database (2000), URL: http:// bmb.med.miami.edu/EcoGene/index.html References 1. Frishman, D., Heumann, K., Lesk, A. and Mewes, H. (1998), `Comprehensive, comprehensible, distributed and intelligent databases: Current status', Bioinformatics, Vol. 14 (7), pp. 551±561. 2. Karp, P. D. (1998), `Metabolic databases', Trends Biochem. Sci., Vol. 23 (3), pp. 114±116. 3. Karp, P. D., Krummenacker, M., Paley, S. and Wagg, J. (1999), `Integrated pathway/genome databases and their role in drug discovery', Trends Biotechnol., Vol. 17 (7), pp. 275±281. 4. Overbeek, R., Larsen, N., Smith, W. et al. (1997), `Representation of function: The next step', Gene, Vol. 191, pp. GC1±GC9. 5. Overbeek, R., Larsen, N., Pusch, G. D. et al. (2000), `WIT: Integrated system for highthroughput genome sequence analysis and metabolic reconstruction', Nucleic Acids Res., Vol. 28 (1), pp. 123±125. 6. Selkov, E. Jr, Grechkin, Y., Mikhailova, N. and Selkov, E. (1998), `MPW: The Metabolic Pathways Database', Nucleic Acids Res., Vol. 26 (1), pp. 43±45. 7. Selkov, E., Basmanova, S., Gaasterland, T. et al. (1996), `The metabolic pathway collection from EMP: The enzymes and metabolic pathways database', Nucleic Acids Res., Vol. 24 (1), pp. 26±28. 8. Kanehisa, M. and Goto, S. (2000), `KEGG: Kyoto Encyclopedia of Genes and Genomes', Nucleic Acids Res., Vol. 28 (1), pp. 27±30. 9. Ogata, H., Goto, S., Sato, K. et al. (1999), `KEGG: Kyoto Encyclopedia of Genes and Genomes', Nucleic Acids Res., Vol. 27 (1), pp. 29±34. 10. Goto, S., Nishioka, T. and Kanehisa, M. (1998), `LIGAND: Chemical database for enzyme reactions', Bioinformatics, Vol. 14 (7), pp. 591±599. 11. Goto, S., Nishioka, T. and Kanehisa, M. (2000), `LIGAND: Chemical database for enzyme reactions', Nucleic Acids Res., Vol. 28 (1), pp. 380±382. 12. DBGET/LinkDB ± Integrated database retrieval system (2000), URL: http://www.genome.ad.jp/dbget/ 13. IUBMB ± International Union of Biochemistry and Molecular Biology (1992), `Enzyme Nomenclature: Recommendations of the Nomenclature Committee of the IUBMB', Academic Press, New York. 15. Karp, P. D., Riley, M., Paley, S. M. et al. & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 141 Wittig and De Beuckelaer (1997), `EcoCyc: Encyclopedia of Escherichia coli genes and metabolism', Nucleic Acids Res., Vol. 25 (1), pp. 43±50. 16. Karp, P. D. and Riley, M. (1999), `EcoCyc: The resource and the lessons learned' in Letovsky, S. Ed., `Bioinformatics Databases and Systems', Kluwer Academic, Boston, MA, pp. 47±62. 17. Karp, P. D., Riley, M., Saier, M. et al. (2000), `EcoCyc and MetaCyc databases', Nucleic Acids Res., Vol. 28 (1), pp. 56±59. 24. Michal, G. (1999), `Biochemical Pathways', Spektrum Akad. Verlag, Heidelberg. 25. Ellis, L. B. M., Hershberger, C. D. and Wackett, L. P. (2000), `The University of Minnesota Biocatalysis/Biodegradation Database: Microorganisms, genomics and prediction', Nucleic Acids Res., Vol. 28 (1), pp. 377± 379. 18. CGSC ± E. coli Stock Center Database (2000), URL: http://cgsc.biology.yale.edu/ cgsc.html 26. Ellis, L. B. M., Speedie, S. M. and McLeish, R. (1998), `Representing metabolic pathway information: an object-oriented approach', Bioinformatics, Vol. 14 (9), pp. 803±806. 19. GenProtEC ± E. coli genome and proteome database (2000), URL: http://genprotec.mbl.edu:80/cat1.htf 27. Bork, P. (2000), `Powers and pitfalls in sequence analysis: The 70% hurdle', Genome Res., Vol. 10, pp. 398±400. 20. Riley, M. (1993), `Functions of the gene products of Escherichia coli', Microbiol. Rev., Vol. 57 (4), pp 862±952. 28. Karp, P. D. (2000), `An ontology for biological function based on molecular interactions', Bioinformatics, Vol. 16 (3), pp. 269± 285. 21. PathDB (2000), URL: http://www.ncgr.org/ software/pathdb/ 22. ExPASy ± Biochemical Pathways (2000), URL: http://expasy.ch/cgi-bin/searchbiochem-index 142 23. Michal, G. (1993), `Biochemical Pathways', Boehringer Mannheim GmbH ± Biochemica. 29. Karp, P. D. (1998), `What we do not know about sequence analysis and sequence databases', Bioinformatics, Vol. 14 (9), pp. 753±754. & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001