Biological data and conceptual modelling methods

by C. Maria (Marijke) Keet
School of Computing, Napier University, 10 Colinton Road, Edinburgh EH10 5DT, Scotland

Abstract

The article highlights characteristics of biological data and their effect on conceptual modelling. Regarding biological data and its semantics, there is no legacy to rely and build upon, there is an abundance of non-discrete data, there are uncertainties about relevant parameters, and there is a general lack of standardization in nomenclatures and concepts. General features of ER, OO and ORM are discussed, emphasising differences in graphical representation, understandability from the customer’s perspective and inclusiveness of types and attributes in the model. A second example, taxonomy, addresses (Extended-) OO, the ad hoc solution POOM and the possibilities of FCA to formalize biological data and its concepts. The more abstract conceptual modelling techniques ORM and FCA may be more promising in capturing the biological semantics as inclusively and formally as possible, in order to build up an extensive repository and aid standardization, which in turn will improve the quality of developed software.

1. Introduction

In recent years, growth in the availability of biological data has been exponential, and it is expected to continue at the same, if not a faster, pace. It is a natural step to organise these vast amounts of data by making use of developments in the field of computing; the combination of biology and computing gave rise to the discipline of bioinformatics. Viewed from the IT angle, it covers computational chemistry, neural networks, evolutionary computing, and software and database development. However, for IT specialists to design software that meets the requirements of biologists, an understanding of the peculiarities of biological data is a necessity, because such data differs from the human-generated concepts of, for example, financial or logistics systems. This will be addressed in the next section.
Subsequently, several conceptual modelling techniques are discussed, and certain features highlighted, aided by examples of the analysis phase for the development of a bacteriocin database (conducted by this author) and of modelling taxonomic classifications.

2. Some characteristics of biological data

What makes biological data so different from the more “standard” types of data that it merits special attention? Aside from aspects specific to the Universe of Discourse (UoD), there are five general characteristics.

First, there is no legacy to rely upon. For example, compare the common entity type Person: in a company or club conceptual model the Person is either M (male) or F (female) but not ‘mostly M, depending on some factors’, whereas a molecule, e.g. a bacteriocin, can be coded ‘mostly’ on plasmids and transposons, though ‘rarely’ on chromosomal DNA; moreover, a transposon can insert itself into a plasmid: should one classify the gene location as transposon or plasmid, or both? There are not hundreds of implemented databases for which data analysts pondered the same question and decided to represent it one way or the other, whereas this is the case with, say, financial databases capturing business processes.

Second, the production of a metabolite (a molecule produced by an organism) or, e.g., the strength of inhibition by an antibiotic used to kill the bacteria causing an infection can be ‘stronger’ in some environments and weaker under other circumstances. This poses two questions, which would need to be analysed and modelled somehow. The first: how much weaker or stronger, and how does one represent gradations, i.e. non-discrete data, in relationships? There is no equivalent in, say, hockey club membership: either you are a member, or you are not. The second question relates to the “some environments”: what environment, what are the determining factors and, more importantly, what is their effect on “occasional relationships”?
It would require a model capturing “if parameter x is above threshold a, parameter y is ‘somewhat warm’ and there is a ‘low level’ of z” and so forth, then there is a relationship; only to note that the exact parameters (and their possible values) that determine the existence of a relationship are often not fully known or understood even by the domain experts themselves. How can a computer scientist represent the semantics correctly and comprehensively? How ought one to represent environmental conditionality, heterogeneous information and fluctuating data quality? This is a serious design consideration, especially prevalent when attempting to meet the requirements of biological science researchers, primarily because this kind of data cannot easily be generalised. Compare, for example, a company address: one knows all of its components (attributes), and they have been modelled numerous times before. Biological data is the opposite: in addition to the aforementioned uncertainties, functionality can be ‘confirmed’ as well as ‘postulated’, i.e. there is a requirement to document a plethora of conjectures by researchers; how can one anticipate attributes and entity types if researchers do not precisely know the parameters? These ‘informed guesstimates’ may not only prove valid in hindsight; what at present suffices as an attribute may become so important, with its own related parameters, that it has to be “upgraded” to an entity type.

Third, another difference is the lack of data in one subject area versus the abundance of data in another. For example, one may store extensive knowledge of all the intricacies of one bacteriocin, nisin (the most researched bacteriocin), while hardly any information exists on e.g. reuterin, thereby leaving 95% of the attribute values empty, which wastes the resources of the table.
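The second and third characteristics can be made concrete with a small sketch. It is only an illustration under stated assumptions: the class names, the four-column schema and all values are hypothetical placeholders, not data or structures from the bacteriocin database. It shows a relationship that carries a graded strength, an evidence status and only the conditions that are actually known, plus a measure of how much of a fixed table row a sparsely documented record would fill.

```python
from dataclasses import dataclass, field
from enum import Enum


class Evidence(Enum):
    CONFIRMED = "confirmed"
    POSTULATED = "postulated"


@dataclass
class Inhibition:
    """A graded, condition-dependent relationship (hypothetical structure)."""
    bacteriocin: str
    target: str
    strength: float          # 0.0 = no effect, 1.0 = full inhibition (not a yes/no flag)
    evidence: Evidence
    conditions: dict = field(default_factory=dict)  # only the parameters that are known


@dataclass
class Bacteriocin:
    name: str
    attributes: dict = field(default_factory=dict)  # filled on demand, no empty columns


def fill_ratio(record, schema):
    """Fraction of a fixed, relational-style schema that this record would fill."""
    return sum(1 for column in schema if column in record.attributes) / len(schema)


# Hypothetical schema and placeholder values, for illustration only.
SCHEMA = ["gene_location", "molecular_mass", "ph_optimum", "heat_stability"]
well_studied = Bacteriocin("well-studied example", {c: "known" for c in SCHEMA})
little_studied = Bacteriocin("little-studied example", {"gene_location": "known"})

observation = Inhibition(
    bacteriocin="well-studied example",
    target="some target organism",
    strength=0.7,
    evidence=Evidence.POSTULATED,
    conditions={"temperature": "somewhat warm"},  # other parameters are simply unknown
)

print(fill_ratio(well_studied, SCHEMA))
print(fill_ratio(little_studied, SCHEMA))
```

The sketch does not resolve the modelling questions raised in the text; it only shows that gradation, conditionality and evidence status each need an explicit place in the model, and that a record with few known attributes fills little of a fixed row.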
However, note that the latter would not occur to such an extent if one were to implement an object-oriented database as opposed to a relational database (Thierry-Mieg et al., 1999), because instances are only created on demand (I will return to this matter in the next section).

Fourth, there are definitional problems and a general lack of standardization in nomenclature in biological data (Wittig and De Beuckelaer, 2001; Frishman et al., 1998; Macauley et al., 1998; Laser et al., 1998, among many others): “anarchy” according to Drysdale (2001), although the FlyBase (footnote 1) she describes adds to this problem because its developers devised their own keyword system. The MBGD elevates this to a feature: the user can create his/her own classification table (Uchiyama, 2003). There are a few coordinated attempts to unify data formats via Abstract Syntax Notation One (ASN.1) (Frishman et al., 1998; Bader et al., 2001), the NEXUS file format (Maddison et al., 1997) and the establishment of the Gene Ontology Consortium (footnote 2). The latter approach may be criticised for ‘dumping’ semantic and conceptual disagreements of research groups into the lap of ontologists; there is an apparent lack of cooperation with its implementers and, more importantly, ontology efforts use divergent approaches. They range from, e.g., a function-based vocabulary (GOC) to a descriptive-hierarchical one (in taxonomy [PrometheusDB]): the former devises a vocabulary with, for example, an ‘energy generating device’ (covering organelles like mitochondria), whereas descriptive ontologies drill down from ‘flower’ to ‘petal’ and so forth, alas in some cases introducing new incompatibilities, the very aspect they try to solve.

Footnote 1: Biological databases mentioned in this article are listed at the end of this page after the references.
Footnote 2: More information on the Gene Ontology Consortium is available online via http://www.geneontology.org/ and in GOC (2001); for an example of its use with pathway databases, see Krishnamurthy et al. (2003). There are longer established nomenclature attempts in naming enzymes and coordinated bacterial nomenclature (the latter subject to excessive re-classifications resulting from molecular biology, analogous to the “New Drude” in plant taxonomy (Graham et al., 2002)).

The fifth, and last, general aspect is related to the previous one: the definitional problems and lack of standardisation are not just due to the complexity of biological data; there are also disagreements between (sub-)disciplines, between research groups within a discipline, and even within research groups.

Taking a brief look at some of the extant molecular biology databases, there are longer established DNA, protein sequence and genome mapping databases (Uberbacher) and relatively more recent developments covering metabolic pathways, protein interactions (e.g. Xenarios and Eisenberg, 2001), and gene expression and function databases, which will likely expand to encompass the emerging epigenetic data; these are relatively more challenging due to the increasing levels of interaction and relationships between the objects/entity types. Of another kind are phylogenetic databases, which involve additional neural network-type query and search tools, and protein structure databases, primarily focussed on multimedia and representational factors of the data (e.g. Wittig and De Beuckelaer, 2001). These databases can be further categorised into data type specific (like GenBank and Swiss-Prot), species specific (FlyBase) or subject matter specific (REBASE); they at least partially require horizontal and/or vertical linking of data, which not only raises social issues of cooperation but also poses “hard scientific questions” to be answered (Macauley et al., 1998; Frishman et al., 1998). Macauley et al. (1998) define ‘horizontal’ linking of data as covering sequence, structure, mapping, position and phenotype, and ‘vertical’ as linking related elements of the same type that pertain to other genes in the same or other organisms. However, one could also interpret horizontal as linking the same components (e.g. DNA with DNA and so forth) and vertical as DNA-RNA-protein etc., akin to a (complicated) “biological OSI model”. On top of the aforementioned divisions, there are so-called primary source databases (TIGR) as well as “boutique collections” that meet specific requests of smaller research communities. The latter have a tendency not just to link, but to copy the few sections of relevance from a primary source database into the communal database. The ‘advantage’ of copying data is that you can fit the data format to whatever you prefer for your own database, but of course that does not aid data(base) integration. Consult Shoop et al. (2001) for a comprehensive discussion of this matter and related integration problems of biological databases.

Last, note that for each UoD there are additional data type specific problems to resolve on top of the general aspects of biological data discussed here; for example, classification systems in plant taxonomy (Raguenaud et al. (2002) and Priss (2003), among others) or the loosely defined groups of microorganisms (Keet, 2003b).

3. Modelling

With the characteristics of biological data in mind, I will discuss some aspects of Entity-Relationship (ER), Object-Oriented (OO) and Object Role Modelling (ORM) before addressing two practical examples, ER versus ORM and (Extended-) OO versus formal concept analysis (FCA), and generalising from these modelling exercises.

With ER modelling, decisions have to be made at an early stage on what will be an entity type and what will be its attribute(s).
However, as mentioned above, one cannot know beforehand which factor is going to be (or appear to be) important in biological data, or is, will be or might be subject to modification; nevertheless ER ‘fixes’ the diagram, which, once implemented, is difficult and laborious, if not impossible, to change. This can be partially addressed by resorting to ORM (refer to Halpin (2001) for an explanation) to reveal intricacies and postpone design details. Further, a limitation of ER is that it does not allow relationships of any arity, whereas ORM does. ORM can include attribute restrictions more clearly, and the use of sample data accompanying the model aids understanding by domain experts. Halpin (2001), North (1999) and Ter Hofstede and Proper (1998) elaborate further on this aspect. Modelling in ORM still leaves open the choice of designing and implementing the model in either a relational or an object database (the interested reader may like to read the example of ORM to ER mapping in Halpin (2001: 343-346); ORM to UML mapping is addressed on pp. 396-397).

The second aspect concerns the pros and cons of ER versus Object-Oriented (OO) data modelling. Thierry-Mieg et al. (1999) claim that “[r]elational systems are best when the schema is simple, the data is regular and successive queries are independent. Object systems are best when the schema is complex, the data irregular and the queries correlated” and that with OO it is easier to “search the neighbourhood”. Although this is not substantiated by experimental comparative research on biological databases, Uchiyama (2003) discusses “similarity relationships” for the MBGD, Thierry-Mieg et al. (1999) address “progressively explor[ing] the surrounding area” in relation to ACeDB, and Raguenaud (2002) also addresses “localised” searches. These types of localised searches are relevant in biological databases when one wants to explore, for example, sections of the evolutionary tree or structurally or functionally related enzymes.
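What “searching the neighbourhood” amounts to can be sketched as a bounded breadth-first traversal over linked objects. This is a minimal illustration, not how ACeDB or any of the cited systems implement it; the function name, the adjacency-dictionary representation and the object labels are all assumptions made for the example.

```python
from collections import deque


def neighbourhood(graph, start, max_steps):
    """Return every object reachable from `start` within `max_steps` links,
    mapped to its distance from `start` (breadth-first traversal)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the requested radius
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Hypothetical 'related to' links between enzymes; the labels are placeholders.
related = {
    "enzyme A": ["enzyme B", "enzyme C"],
    "enzyme B": ["enzyme D"],
    "enzyme C": [],
    "enzyme D": ["enzyme E"],
}

print(neighbourhood(related, "enzyme A", 1))
print(neighbourhood(related, "enzyme A", 2))
```

In an object database each step follows a stored reference, whereas in a relational implementation each step typically corresponds to a further join or query, which is one way to read the “correlated queries” remark quoted above.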
Another factor in the suitability of either ER or OO is the primary requirement for the intended use: the most commonly used methodology in molecular biology is gene comparison, which both ER and OO can facilitate. However, recent metabolic pathway databases try to capture far more complex information than simple gene sequences because of the type of interactions between the molecules (chemical reactions): the objects forming the data are nodes of networks linked by edges representing the chemical reactions (Frishman et al., 1998; Wittig and De Beuckelaer, 2001; Krishnamurthy et al., 2003). Raguenaud et al. (2002) consider taxonomic data too complicated to be adequately represented by the simple structures of relational models (see also §3.2 below). The non-suitability of ER is refuted by others (e.g. Markowitz et al., 1999) (footnote 3). A limited comparison between ER and OO (using UML) on a theoretical level has been carried out by Bornberg-Bauer and Paton (2002), discussing what is possible in biological data modelling, but not what should be possible in order to meet the database requirements of biologists. Is one or the other merely the ‘lesser of two evils’? As an aside in this context, it seems that the requirements set by the various sub-disciplines of biology are not compatible with one another and/or that further standardisation in definitions and data formats would be required before the next step towards designing consistent and compatible databases can be taken.

Footnote 3: Another approach, the object-relational one, is not discussed further here. BIND (Bader et al., 2001) and the Arabidopsis thaliana database (Frishman et al., 1998) make use of this modelling approach.

Noteworthy is that, of the published conceptual models for biological databases, most remain within the realms of ER and OO. Juristo and Moreno (2000) argue that these modelling methods sit in between computational model and conceptual model, and categorise ORM, conceptual graph theory (CG) and formal concept analysis (FCA) as being on a more abstract level, hence ‘true’ conceptual modelling techniques. Could it be that ER and OO cannot fully capture the intricacies of biological data, but ORM/CG/FCA can?

3.1 ER with/versus ORM

The illustrative examples in this section are taken from conceptual models of the bacteriocin database (Keet, 2003a), developed during the author’s final year project (FYP); the supervisor was unfamiliar with the subject matter of the database (microbiology and food science) and the customer was not cognizant of databases, let alone drawings of conceptual models. In other words: in theory, this author could have modelled whatever she liked, and either ignored, or at least postponed, any potential difficulties to the implementation phase (and subsequently swiftly moved on to another project). This may sound unprofessional, but an example may suffice.

The customer’s main requirements for the bacteriocin database were to have an easily accessible, structured and searchable repository for bacteriocin-related data extracted from the vast number of journal articles she had gathered over the years. Bacteriocins are compounds (peptides) similar to antibiotics that inhibit the growth of other, often closely related, bacteria; unlike antibiotics, though, they are functionally non-therapeutic, so there is potential to use bacteriocins as a natural ingredient in food produce for food safety and preservation.

A preliminary ER model was generated, with the microorganisms, bacteriocins and plasmids (containing the genes coding for bacteriocins) as shown in the diagram in Figure 1. A complete diagram, including Figure 1 and several other entity types (18 in total), was inspected by both the customer and the supervisor of the research project… and accepted.
However, the prime function of bacteriocins is inhibiting and killing other microorganisms, but none of this is modelled! Due to a sense of disquiet about the semantics between the three main entity types microorganisms, bacteriocins and plasmids, this author resorted to ORM to try to uncover what was “missing”, i.e. attempting to make the implicit explicit.

Figure 1. Section of the preliminary ER diagram.

Figure 2. Overview of the three main entity types in the conceptual model and their relationships.

Where ER emphasises the relations between entity types and pushes their attributes to the background, ORM requires one to state explicitly not only the relationship(s) between entity types, but also how the entity types relate to their attribute(s). A very first simple exercise to model these three entity types revealed a rather serious lacuna in the model, the absence of inhibition of microorganisms (Figure 2). Re-analysing the ER model revealed further specific details, as included in Figure 3 and Figure 4 (some attributes are omitted from the figures). With the aid of sample data and the verbalizer feature in VisioModeler, the changes the author made could be communicated to the customer in a more fruitful manner. For example, the original ‘MicroOrganism containsA Plasmid’ has changed: in theory, a microorganism can have more than one plasmid, and a plasmid can occur in more than one microorganism. Further, the assumption that there is no gene coding for a bacteriocin on chromosomal DNA was abandoned. The occurrence of a bacteriocin gene on chromosomal DNA is unlikely, but not impossible, and the exception had to be catered for. This also meant that the entity type Plasmid had to be renamed to the generic name GeneticDeterminant. With one thing leading to another, it appeared that certain mobile DNA fragments, transposons, could carry bacteriocin genes as well and, as mentioned, can insert themselves into plasmids.
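The refinements described above can be sketched in a few lines. This is an illustration only and not the model of Figures 3 and 4: apart from the entity type name GeneticDeterminant, every class, field and function name is an assumption, and the strain and determinant labels are placeholders. The sketch shows a generic determinant with a type, a transposon inserted into a plasmid, and a many-to-many fact between microorganisms and determinants.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeterminantKind(Enum):
    PLASMID = "plasmid"
    TRANSPOSON = "transposon"
    CHROMOSOMAL_DNA = "chromosomal DNA"


@dataclass(frozen=True)
class GeneticDeterminant:
    name: str
    kind: DeterminantKind
    inserted_in: Optional["GeneticDeterminant"] = None  # e.g. a transposon inside a plasmid

    def location(self):
        """Where the gene physically sits: the host determinant if inserted, else itself."""
        return self.inserted_in if self.inserted_in is not None else self


# Placeholder instances, for illustration only.
plasmid = GeneticDeterminant("plasmid P", DeterminantKind.PLASMID)
transposon = GeneticDeterminant("transposon T", DeterminantKind.TRANSPOSON, inserted_in=plasmid)

# Many-to-many fact: a microorganism can carry several determinants,
# and one determinant can occur in several microorganisms.
carries = {
    ("strain 1", plasmid),
    ("strain 2", plasmid),
    ("strain 1", transposon),
}


def determinants_of(organism):
    return {det for org, det in carries if org == organism}


def organisms_with(determinant):
    return {org for org, det in carries if det == determinant}


print(transposon.kind.value, "located on", transposon.location().name)
print(sorted(organisms_with(plasmid)))
```

Keeping the determinant's own type separate from its location is one way to avoid having to choose between ‘transposon’ and ‘plasmid’ for the same gene.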
Subsequently, it needed to be recorded what the actual location of the gene was, as well as its type. Other details were clarified and confirmed as well (refer to Keet (2003b) for a description and discussion of the semantics of the subject matter).

Interestingly, after this interaction the customer preferred an “uncluttered” ER diagram to the complete ORM model, because “now I know what to think when I see these boxes”, i.e. imagining what is, or can or may be, captured in the conceptual model but hidden from the visual representation. One could argue that the problems encountered are due to a lack of ER-modelling expertise on the part of the author. However, UoD knowledge is the more likely factor: had the author not been familiar with the subject matter, (one of) the preliminary model(s) would have been used to design a logical model and implement the database (how can one know something is ‘missing’ if one is unfamiliar with the subject matter?), only to realise at the testing stage that the biological semantics were not accurately modelled. On the other hand, knowledge of the subject matter might have contributed to obscuring certain details of the biological data, with the author filling the ‘gaps’ in the visual representation by thinking them there (ER does not oblige me to include them), as well as putting a slightly different emphasis on data related to genetics and biochemistry (footnote 4). In this project, both ER and ORM were used: the former because it was a requirement of the research project, the latter to make up for the shortcomings of the former, after which it was set aside in favour of the ‘simplified’ ER model. However, this need could easily be met with a feature in e.g. VisioModeler to select, say, “hide attributes” and “swap fact types for a line” in a menu option ‘alternate views of the same model in order to (un)clutter it’.
Ideally, an iterative process between different conceptual data modelling tools ought not to be necessary: a single conceptual modelling technique should be sufficiently expressive to capture ‘everything’ (or at least biological semantics). ORM is closer to this ideal than ER.

3.2 OO and FCA

Whereas the previous example highlights graphical representation, understandability and inclusiveness of types and attributes in modelling, here follows a brief discussion of different representations of taxonomic data (which from an outsider’s view would be exceedingly suitable for hierarchical modelling) and of limitations of the conceptual modelling facilities built into the modelling techniques. However, there is not one hierarchy but three principal ones (classification, name and rank), each with varying definitions of their actual instances as used by taxonomists. Within the ranking hierarchy, data can acquire roles or change behaviour according to context; further, the intended conceptual model should support recursive behaviour and composite entity types (Raguenaud et al., 2002). ER does not allow for such complex data types unless one implements them in the application layer, which is not the intention when devising a conceptual model. Raguenaud (2002) created his own version of conceptual modelling, based on the Extended OO model and called POOM (Prometheus Object Oriented Model), to allow for taxonomic complexities. On the other hand, Priss (2003) modelled overlapping hierarchies, especially the taxonomic ranking (variety, species, genus, and so forth), and devised a mathematical formalization via Formal Concept Analysis (FCA; refer to http://www.upriss.org.uk/fca/fca.html and Ganter and Wille (1999) for details).
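For readers unfamiliar with FCA, its core idea can be sketched briefly: a formal context relates objects to attributes, and a formal concept is a pair (extent, intent) in which the extent is exactly the set of objects sharing all attributes of the intent, and vice versa. The context below is a made-up toy loosely inspired by the culinary/biological fruit theme; it is not Priss’ data nor the input used for Figure 5, and the brute-force enumeration is only suitable for tiny contexts.

```python
from itertools import combinations

# Toy formal context: object -> set of attributes (illustrative assumption).
CONTEXT = {
    "apple": {"bio fruit", "culi fruit"},
    "tomato": {"bio fruit", "culi vegetable"},
    "rhubarb": {"culi fruit"},
    "carrot": {"culi vegetable"},
}
ALL_ATTRIBUTES = frozenset().union(*CONTEXT.values())


def common_attributes(objects):
    """Derivation operator: attributes shared by all given objects."""
    shared = set(ALL_ATTRIBUTES)
    for obj in objects:
        shared &= CONTEXT[obj]
    return frozenset(shared)


def objects_having(attributes):
    """Derivation operator: objects that have all given attributes."""
    return frozenset(obj for obj, attrs in CONTEXT.items() if set(attributes) <= attrs)


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every attribute subset."""
    found = set()
    attributes = sorted(ALL_ATTRIBUTES)
    for size in range(len(attributes) + 1):
        for subset in combinations(attributes, size):
            extent = objects_having(subset)
            found.add((extent, common_attributes(extent)))
    return found


for extent, intent in sorted(formal_concepts(), key=lambda c: (-len(c[0]), sorted(c[0]))):
    print(sorted(extent), "<->", sorted(intent))
```

Ordering the concepts by inclusion of their extents gives the concept lattice; overlapping hierarchies appear naturally because an object such as the tomato belongs to more than one branch.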
In principle, FCA facilitates reuse of software instead of having to write ad hoc solutions like POOM, and it emphasizes the use of logic to make the implicit explicit. Although it provides a convincing integration of taxonomies (Figure 5 shows a merger of two, which the author created with JaLaBA, based on Priss’ example), the prime issue is the assumption that one can capture biological semantics in formalizations. Can one formalize everything mathematically? Without digressing into the philosophical question of whether, at some point in the future, understanding of biology will have advanced to such an extent that humans can capture all aspects of the life sciences in mathematical formulae, or whether this is impossible: at the time of writing there is, from the viewpoint of a computer scientist, a considerable lack of structure, an abundance of uncertainties and apparent inconsistencies in biological data, and disagreement on biological concepts, which would make conceptual modelling with FCA an extremely difficult undertaking.

Footnote 4: Note that the customer is a food microbiologist and the author a general microbiologist (by first study), which sounds similar, but is not exactly the same discipline.

4. Concluding remarks

Modelling biological data faces different problems compared to the more standard business processes. There is no legacy to rely and build upon, there is an abundance of non-discrete data, there are uncertainties about relevant parameters, and there is a general lack of standardization in nomenclatures and concepts. General features of ER, OO and ORM were discussed before addressing a modelling example with the bacteriocin database, emphasising differences in graphical representation, understandability from the customer’s perspective and inclusiveness of types and attributes in the model. A second example, related to taxonomy, highlighted features (and missing features) of (Extended-) OO, the ad hoc solution POOM and the (im)possibilities of FCA to formalize biological data and its concepts.
Whereas none of the models meets everybody’s requirements while at the same time being capable of conceptually representing ‘everything in biology’, the more abstract conceptual modelling techniques ORM and FCA may be more promising in capturing the biological semantics as inclusively and formally as possible; they could potentially yield reusable models, or sections thereof, with which to build up an extensive repository and aid standardization, which in turn will improve the quality of developed software.

Figure 3. Refinements of the conceptual model resulting from ORM exercises.

Figure 4. The ER-diagram notation of Figure 3.

Figure 5. Merged taxonomic hierarchies, as generated in JaLaBA. culi = culinary; bio = biological fruit. (http://juffer.xs4all.nl/cgi-bin/jalaba/JaLaBA.pl?action=output&xinvoer=fca.txt)

References

Bader, G.D., Donaldson, I., Wolting, C., Ouellette, B.F.F., Pawson, T. and Hogue, C.W.V., (2001), ‘BIND – The Biomolecular Interaction Network Database’. Nucleic Acids Research, 29(1), 242-245.
Bornberg-Bauer, E. and Paton, N.W., (2002), ‘Conceptual data modelling for bioinformatics’. Briefings in Bioinformatics, 3(2), 166-180.
Drysdale, R., (2001), ‘Phenotypic data in FlyBase’. Briefings in Bioinformatics, 2(1), 68-80.
Frishman, D., Heurmann, K., Lesk, A. and Mewes, H.-W., (1998), ‘Comprehensive, comprehensible, distributed and intelligent databases: current status’. Bioinformatics, 14(7), 551-561.
Ganter, B. and Wille, R., (1999), Formal Concept Analysis – Mathematical foundations. Berlin-Heidelberg: Springer-Verlag. 284p.
Gene Ontology Consortium. http://www.geneontology.org/. Date accessed: 12-6-2003.
Gene Ontology Consortium, (2001), ‘Creating the Gene Ontology Resource: design and implementation’. Genome Research, 11(8), 1425-1433.
Graham, M., Watson, M.F. and Kennedy, J.B., (2002), ‘Novel visualisation techniques for working with multiple, overlapping classification hierarchies’. Taxon, 51, 351-358.
Halpin, T., (2001), Information Modeling and Relational Databases. San Francisco: Morgan Kaufmann Publishers. 761p.
Juristo, N. and Moreno, A.M., (2000), ‘Introductory paper: Reflections on Conceptual Modelling’. Data & Knowledge Engineering, 33(2), 103-117.
Keet, C.M., (2003a), ‘The use of bacteria and bacteriocins in the food industry – modelled and documented in a relational database’. BSc Final Year Project, Department of Technology and Department of Computing, Open University, UK. 149p.
Keet, C.M., (2003b), ‘Conceptual Modelling for Applied Bioscience: The Bacteriocin Database’. CSPS: Computational intelligence/0310001. 25p. Available online: http://www.compscipreprints.com/comp/Preprint/mkeet/20031008/1
Krishnamurthy, L., Nadeau, J., Ozsoyoglu, G., Ozsoyoglu, M., Schaeffer, G., Tasan, M. and Xu, W., (2003), ‘Pathways database system: an integrated system for biological pathways’. Bioinformatics, 19(8), 930-937.
Laser, U., Lehrach, H. and Roest Crollius, H., (1998), ‘Issues in developing integrated genomic databases and application to the human X chromosome’. Bioinformatics, 14(7), 583-590.
Macauley, J., Wang, H. and Goodman, N., (1998), ‘A model system for studying the integration of molecular biology databases’. Bioinformatics, 14(7), 575-582.
Maddison, D.R., Swofford, D.L. and Maddison, W.P., (1997), ‘NEXUS: an extensible file format for systematic information’. Systematic Biology, 46(4), 590-621.
Markowitz, V.M., Chen, I.A., Kosky, A.S. and Szeto, E., (1999), ‘OPM: Object-Protocol Model Data Management tools ‘97’. In: Bioinformatics – databases and systems. Letovsky, S.L. (ed.). Massachusetts: Kluwer Academic Publishers. pp. 187-199.
North, K., (1999), ‘Modeling, data semantics and natural language’. New Architect, 7 [Electronic]. http://www.webtechniques.com/archives/1999/07/data/. Date accessed: 27-4-2003.
Priss, U., (2003), ‘Formalizing Botanical Taxonomies’. Proceedings of the 11th International Conference on Conceptual Structures, 2003. 14p. Online preprint: http://www.upriss.org.uk/papers/iccs03.pdf
Raguenaud, C., (2002), Managing complex taxonomic data in an object-oriented database. PhD Thesis, Napier University, Edinburgh. 196p. Available online: http://www.soc.napier.ac.uk/publication/op/getpublication/publicationid/1845313
Raguenaud, C., Pullan, M.R., Watson, M.F., Kennedy, J.B., Newman, M.F. and Barclay, P.J., (2002), ‘Implementation of the Prometheus Taxonomic Model: a comparison of database models and query languages and an introduction to the Prometheus Object-Oriented Model’. Taxon, 51, 131-142. Available online: http://www.soc.napier.ac.uk/publication/op/getpublication/publicationid/278988
Shoop, E., Silverstein, K.A.T., Johnson, J.E. and Retzel, E.F., (2001), ‘MetaFam: a unified classification of protein families. II. Schema and query capabilities’. Bioinformatics, 17(3), 262-271.
Ter Hofstede, A.H.M. and Proper, H.A., (1998), ‘How to formalize it? Formalization principles for information systems development methods’. Information and Software Technology, 40(10), 519-540.
Thierry-Mieg, J., Thierry-Mieg, D. and Stein, L., (1999), ‘ACeDB: The ACe Database Manager’. In: Bioinformatics – databases and systems. Letovsky, S.L. (ed.). Massachusetts: Kluwer Academic Publishers. pp. 265-278.
Uberbacher, E., Computing the Genome. http://www.ornl.gov/ORNLReview/v30n34/genome.htm. Date accessed: 24-8-2002.
Uchiyama, I., (2003), ‘MBGD: microbial genome database for comparative analysis’. Nucleic Acids Research, 31(1), 58-62.
Wittig, U. and De Beuckelaer, A., (2001), ‘Analysis and comparison of metabolic pathway databases’. Briefings in Bioinformatics, 2(2), 126-142.
Xenarios, I. and Eisenberg, D., (2001), ‘Protein interactions databases’. Current Opinion in Biotechnology, 12, 334-339.

ACeDB – A C. elegans DataBase (genome project): http://www.acedb.org
BIND – Biomolecular Interaction Network Database: http://www.bind.ca/
FlyBase – Drosophila genome: http://flybase.bio.indiana.edu/
GenBank: http://www.psc.edu/general/software/packages/genbank/genbank.html
MBGD – MicroBial Genome Database: http://mbgd.genome.ad.jp/
PrometheusDB: www.prometheusdb.org
REBASE – Restriction Enzyme database: http://rebase.neb.com/rebase/rebase.html
SRS – Sequence Retrieval System: http://srsmips.gsf.de / http://srs.ebi.ac.uk/
Swiss-Prot – Protein knowledgebase: http://www.ebi.ac.uk/swissprot/index.html
TIGR – The Institute of Genomic Research: http://www.tigr.org

About the author: Marijke Keet received her MSc in Microbiology from Wageningen Agricultural University in the Netherlands, expects to graduate with a 1st class honours BSc IT & Computing from the Open University UK, and currently pursues a PhD at the School of Computing at Napier University in Edinburgh, Scotland. She has also worked as an engineer in various jobs in the IT industry and will receive a 1st class MA in Peace & Development Studies from the University of Limerick, Ireland.