Biological data and conceptual modelling methods
by C. Maria (Marijke) Keet
School of Computing, Napier University, 10 Colinton Road, Edinburgh EH10 5DT, Scotland
Abstract
The article highlights characteristics of biological data, and its effect on conceptual
modelling. Regarding biological data and its semantics, there is no legacy to rely and build
upon, there is an abundance of non-discrete data, uncertainties on relevant parameters and a
general lack of standardization in nomenclatures and concepts.
General features of ER, OO and ORM are discussed, emphasising differences in graphical
representation, understandability from the customer’s perspective and inclusiveness of types
and attributes in the model. A second example, taxonomy, addresses (Extended-) OO, ad hoc
solution POOM and the possibilities of FCA to formalize biological data and its concepts.
The more abstract conceptual modelling techniques ORM and FCA may be more promising
in capturing the biological semantics as inclusively and formally as possible, in order to build up
an extensive repository and aid standardization, which in turn will improve the quality of
developed software.
1. Introduction
In recent years, growth in availability of biological
data has been exponential, and it is expected to
continue at the same, if not faster, pace. It is a
natural step to organise these vast amounts of data
by making use of developments in the field of
computing, where the combination of biology and
computing gave rise to the discipline of
bioinformatics. Viewed from the IT angle, it covers
computational chemistry, neural networks,
evolutionary computing and software and database
development. However, for IT specialists to design
software that meets the requirements of biologists,
an understanding of the peculiarities of biological
data is a necessity; such data differ from the
human-generated concepts of, for example, financial
or logistics systems. This will be addressed in the
next section.
Subsequently, several conceptual modelling
techniques are discussed, and certain features
highlighted, aided by examples of the analysis
phase for the development of a bacteriocin database
(conducted by this author) and of modelling
taxonomic classifications.
2. Some characteristics of biological data
What makes biological data different from the
more “standard” type of data that it merits special
attention? Aside from Universe of Discourse
(UoD) specific aspects, there are five general
characteristics.
First, there is no legacy to rely upon. For example,
compare the common entity type Person: in a
company or club conceptual model the Person is
either M (male) or F (female) but not ‘mostly M,
depending on some factors’, whereas a molecule,
e.g. a bacteriocin, can be coded ‘mostly’ on
plasmids and transposons, though ‘rarely’ on
chromosomal DNA, plus a transposon can insert
itself into a plasmid: should one classify the gene
location as transposon or plasmid, or both? There
are not hundreds of implemented databases whose
data analysts have pondered the same question
and settled on representing it one way or the
other, whereas this is the case with, say, financial
databases capturing business processes.
Second, production of a metabolite (a molecule
produced by an organism) or e.g. strength of
inhibition by an antibiotic to kill the bacteria
causing an infection can have ‘stronger’ effects in
some environments and weaker under other
circumstances. This poses two questions, which
would need to be analysed and modelled somehow:
how much weaker or stronger, how to represent
gradations, non-discrete data, in relationships?
There is no such equivalent in, say, hockey club
membership: either you are a member, or you are
not. The second question relates to the “some
environments”. What environment, what are the
determining factors and, more importantly, what is
their effect on “occasional relationships”? It would
require a model capturing “if parameter x is above
threshold a, parameter y ‘somewhat warm’ and a
‘low level’ of z” and so forth, then there is a
relationship – only to note that the exact parameters
(and their possible values) involved to determine
the existence of a relationship are often not fully
known or understood even by the domain experts
themselves. How can a computer scientist represent
the semantics correctly and comprehensively? How
ought one to represent environmental
conditionality, heterogeneous information and
fluctuating data quality? This is a serious design
consideration, especially prevalent when attempting
to meet the requirements of biological science
researchers, primarily because this kind of data
cannot easily be generalised. Compare, for
example, the address of a company: one knows its
components (attributes), all of them, and they have
been modelled numerous times before. Not so
with biological data: in addition to the aforementioned
uncertainties, functionality can be ‘confirmed’ as
well as ‘postulated’, i.e. there is a requirement to
document a plethora of conjectures by researchers;
how can one anticipate attributes and entity types if
researchers do not precisely know the parameters?
These ‘informed guesstimates’ may not only
prove valid in hindsight, but what at present
suffices as an attribute may become of such
importance, with its own related parameters, that
it has to be “upgraded” to become an entity type.
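As an illustration only, the following Python sketch shows one possible way to record a graded, conditional and possibly conjectural relationship of the kind described above. All class names, parameter names and values are hypothetical and do not come from the bacteriocin database; the three-valued outcome (applies, does not apply, undetermined) is an assumption made here to reflect that parameters may be unknown.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Evidence(Enum):
    CONFIRMED = "confirmed"
    POSTULATED = "postulated"


@dataclass
class Condition:
    """Allowed range for one environmental parameter; None means 'no known bound'."""
    parameter: str
    low: Optional[float] = None
    high: Optional[float] = None

    def holds(self, environment: dict) -> Optional[bool]:
        """True or False when decidable, None when the parameter was not measured."""
        value = environment.get(self.parameter)
        if value is None:
            return None
        if self.low is not None and value < self.low:
            return False
        if self.high is not None and value > self.high:
            return False
        return True


@dataclass
class ConditionalRelationship:
    subject: str
    verb: str
    target: str
    strength: float          # graded rather than yes/no: 0.0 (none) to 1.0 (strong)
    evidence: Evidence       # confirmed finding or researcher's conjecture
    conditions: list = field(default_factory=list)

    def applies(self, environment: dict) -> Optional[bool]:
        """False if any condition fails, None if any is undetermined, else True."""
        results = [c.holds(environment) for c in self.conditions]
        if any(r is False for r in results):
            return False
        if any(r is None for r in results):
            return None
        return True


# Hypothetical example: illustrative thresholds, not measured values.
inhibition = ConditionalRelationship(
    subject="bacteriocin A",
    verb="inhibits",
    target="microorganism B",
    strength=0.7,
    evidence=Evidence.POSTULATED,
    conditions=[
        Condition("pH", high=6.0),
        Condition("temperature_c", low=20.0, high=37.0),
    ],
)

print(inhibition.applies({"pH": 5.5, "temperature_c": 30.0}))
print(inhibition.applies({"pH": 7.0, "temperature_c": 30.0}))
print(inhibition.applies({"pH": 5.5}))  # temperature unknown
```

Treating the relationship itself as an object with its own attributes is one way to leave room for a later “upgrade” of an attribute into an entity type; whether this suits a given UoD remains a modelling decision.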
Third, another difference is the lack versus the
abundance of data in a certain subject area. For
example, one may store extensive knowledge of all
intricacies of one bacteriocin, nisin (the most
researched bacteriocin), while hardly any
information exists on e.g. reuterin, thereby leaving
95% of the attribute values empty, a waste of the
table’s resources. However, note that the latter
would not occur to such an extent if one were to
implement an object-oriented database as opposed
to a relational database (Thierry-Mieg et al., 1999),
because instances are only created on demand (I
will return to this matter in the next section).
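The contrast between a fixed-width table and storage on demand can be sketched roughly as follows. This is a simplified illustration in Python, not a description of how any particular relational or object database is implemented; the column names and the placeholder value "recorded" are hypothetical.

```python
# Hypothetical column names; the values are placeholders, not real measurements.
COLUMNS = ["producer", "molecular_mass", "gene_location",
           "mode_of_action", "heat_stability"]

# Object-style storage: only what is actually known is stored per compound.
known_facts = {
    "well_studied_bacteriocin": {col: "recorded" for col in COLUMNS},
    "poorly_studied_bacteriocin": {"producer": "recorded"},
}


def to_wide_row(facts: dict, columns: list) -> dict:
    """Relational-style row: every column is present, unknown values become None (NULL)."""
    return {col: facts.get(col) for col in columns}


def empty_fraction(row: dict) -> float:
    """Share of the columns in a row that hold no value."""
    return sum(value is None for value in row.values()) / len(row)


wide_table = {name: to_wide_row(facts, COLUMNS)
              for name, facts in known_facts.items()}

for name, row in wide_table.items():
    print(f"{name}: {len(known_facts[name])} stored values, "
          f"{empty_fraction(row):.0%} of the wide row empty")
```

With only five columns the effect is small; the point is that the share of empty cells grows with every attribute that is known for the well-researched compound only.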
Fourth, there are definitional problems and a
general lack of standardization in nomenclature in
biological data (Wittig and De Beuckelaer (2001);
Frishman et al. (1998); Macauley et al. (1998);
Laser et al. (1998), among many
others): “anarchy” according to Drysdale (2001),
although the FlyBase1 she describes adds to this
problem because it devised its own keyword
system. The MBGD elevates this to a feature: the
user can create his/her own classification table
(Uchiyama, 2003). There are a few coordinated
attempts to unify data formats via Abstract Syntax
Notation One (ASN.1) (Frishman et al. (1998) and
Bader et al. (2001)), the NEXUS file format
(Maddison et al., 1997) and the establishment of
the Gene Ontology
Consortium2. The latter approach may be criticised
for ‘dumping’ semantic and conceptual
disagreements of research groups into the lap of
ontologists; there is an apparent lack of cooperation
with its implementers and, more importantly,
ontology efforts use divergent approaches. These
range from a function-based vocabulary (GOC) to
a descriptive-hierarchical one (in taxonomy
[PrometheusDB]): the former devises a vocabulary
with, for example, an ‘energy generating device’
(covering organelles like mitochondria), whereas
descriptive ontologies drill down from ‘flower’ to
‘petal’ and so forth, alas in some cases introducing
new incompatibilities, the very problem they try to
solve.
1
Biological databases mentioned in this article are
listed at the end of this page after the references.
2
More information on the Gene Ontology
Consortium is available online via
http://www.geneontology.org/ and in GOC (2001);
for an example of its use with pathway databases,
see Krishnamurthy et al. (2003). There are longer
established nomenclature attempts in naming
enzymes and coordinated bacterial nomenclature
(the latter subject to excessive re-classifications
resulting from molecular biology, analogous to the
“New Drude” in plant taxonomy (Graham et al.,
2002)).
The fifth, and last general aspect, is related to the
previous one: the definitional problems and lack of
standardisation are not just due to the complexity
of biological data, but also to disagreements
between (sub-)disciplines and even within
disciplines amongst research groups as well as
within research groups. Taking a brief look at some
of the extant molecular biology databases, there are
longer established DNA, protein sequence and
genome mapping databases (Uberbacher) and
relatively more recent developments covering
metabolic pathways, protein interactions (e.g.
Xenarios and Eisenberg, 2001), gene expression
and function databases that likely will expand to
encompass the emerging epigenetic data, which are
relatively more challenging due to the increasing
levels of interaction and relationships between the
objects/entity types. Of another kind are
phylogenetic databases, which involve additional
neural network-type query and search tools, and
protein structure databases, primarily focussed on
multimedia and representational factors of the data
(e.g. Wittig and De Beuckelaer (2001)). These
databases can be further categorised into data type
specific (like GenBank and Swiss-Prot), species
specific (FlyBase) or subject matter specific
(REBASE), at least partially requiring horizontal
and/or vertical linking of data, which not only
raises social issues of cooperation, but also poses
“hard scientific questions” to be answered
(Macauley et al., 1998; Frishman et al., 1998).
Macauley et al. (1998) define ‘horizontal’ linking
of data as sequence, structure, mapping, position
and phenotype and ‘vertical’ as linking related
elements of the same type that pertain to other
genes in the same or other organisms. However,
one could also interpret horizontal as the same
components (e.g. DNA with DNA and so forth) and
vertical as DNA-RNA-protein etc., akin to a
(complicated) “biological OSI model”. On top of
the aforementioned divisions, there are so-called
primary source databases (TIGR) as well as
“boutique collections” to meet specific requests of
smaller research communities.
The latter have a tendency not just to link, but to
copy the few sections of relevance from a primary
source database into the communal database. The
‘advantage’ of copying data is that you can suit the
data format into whatever way you prefer for your
own database, but of course that does not aid
data(base) integration. Consult Shoop et al. (2001)
for a comprehensive discussion on this matter and
related integration problems of biological
databases.
Last, note that for each UoD there are additional
data type specific problems to resolve on top of
these discussed general aspects of biological data;
for example classification systems in plant
taxonomy (Raguenaud et al. (2002) and Priss
(2003), among others) or the loosely defined
groups of microorganisms (Keet, 2003b).
3. Modelling
With the characteristics of biological data in mind,
I will discuss some aspects on Entity-Relationship
(ER), Object-Oriented (OO) and Object Role
Modelling (ORM) before addressing two practical
examples of ER versus ORM and (Extended-) OO
versus formal concept analysis (FCA) and
generalise from these conducted modelling
exercises.
With ER modelling, decisions have to be made in
an early stage on what will be an entity type and
what its attribute(s). However, as mentioned above,
one cannot know beforehand which factor is going
to be (or appear to be) important in biological data,
or is/will be/might be subject to modification;
nevertheless ER ‘fixes’ the diagram, which, once
implemented, is difficult and laborious, if not
impossible, to change. This can be partially
addressed by resorting to ORM (refer to Halpin
(2001) for an explanation) to reveal intricacies and
postpone design details. Further, a limitation of ER
is that it does not allow relationships of arbitrary
arity, whereas ORM does. ORM can include attribute
restrictions more clearly, and the use of sample
data accompanied with the model aids
understanding by domain experts. Halpin (2001),
North (1999) and Ter Hofstede and Proper (1998)
elaborate further on this aspect. Modelling in ORM
still provides the opportunity to design and
implement it in either a relational or an object
database (the interested reader may like to read an
example of ORM to ER mapping in Halpin
(2001:343-346); ORM to UML mapping is
addressed on pp. 396-397 by Halpin).
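To give an impression of fact-oriented modelling with sample data, the sketch below represents a ternary fact type, verbalises a sample population and checks a uniqueness constraint over two of its roles. It is a loose Python approximation written for this discussion, not ORM notation and not the behaviour of any ORM tool; the fact type, role names and sample facts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactType:
    reading: str          # sentence template with one placeholder per role
    roles: tuple          # role names; the arity is len(roles)
    unique_over: tuple    # roles whose combination may occur only once

    def verbalize(self, fact: tuple) -> str:
        """Render one sample fact as a sentence a domain expert can check."""
        return self.reading.format(*fact)

    def violations(self, population: list) -> list:
        """Facts that repeat a role combination that should be unique."""
        positions = [self.roles.index(role) for role in self.unique_over]
        seen, duplicates = set(), []
        for fact in population:
            key = tuple(fact[i] for i in positions)
            if key in seen:
                duplicates.append(fact)
            seen.add(key)
        return duplicates


# Hypothetical ternary fact type and sample population.
inhibits = FactType(
    reading="Bacteriocin {0} inhibits MicroOrganism {1} with strength {2}",
    roles=("bacteriocin", "microorganism", "strength"),
    unique_over=("bacteriocin", "microorganism"),
)

sample_population = [
    ("B1", "M1", "strong"),
    ("B1", "M2", "weak"),
    ("B1", "M1", "weak"),   # contradicts the first fact under the constraint
]

for fact in sample_population:
    print(inhibits.verbalize(fact))
print("violations:", inhibits.violations(sample_population))
```

Reading the verbalised sentences back to a domain expert is the step that tends to expose wrong constraints: whether the third sample fact is an error or a sign that strength depends on further conditions is a question only the expert can answer.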
The second aspect concerns the pros and cons of
ER versus Object-Oriented (OO) data modelling.
Thierry-Mieg et al. (1999) claim that “[r]elational
systems are best when the schema is simple, the
data is regular and successive queries are
independent. Object systems are best when the
schema is complex, the data irregular and the
queries correlated” and with OO it is easier to
“search the neighbourhood”. Although this is not
substantiated by experimental comparative research
on biological databases, Uchiyama (2003) discusses
“similarity relationships” in the MBGD, Thierry-Mieg
et al. (1999) address “progressively explor[ing] the
surrounding area” in relation to the ACeDB and
Raguenaud (2002) also addresses “localised”
searches. These types of localised searches are of
relevance in biological databases when one would
want to explore for example sections of the
evolutionary tree or structurally or functionally
related enzymes.
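A localised search of this kind can be pictured as a bounded breadth-first traversal from a starting object. The Python sketch below is an assumption about what “searching the neighbourhood” might look like on a small graph; it is not taken from ACeDB, MBGD or Prometheus, and the node names E1 to E5 are hypothetical related enzymes.

```python
from collections import deque


def neighbourhood(graph: dict, start: str, max_hops: int) -> dict:
    """Breadth-first search returning {node: hop distance} for nodes within max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the requested radius
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Hypothetical "is related to" links between enzymes (undirected, stored both ways).
related = {
    "E1": ["E2", "E3"],
    "E2": ["E1", "E4"],
    "E3": ["E1"],
    "E4": ["E2", "E5"],
    "E5": ["E4"],
}

print(neighbourhood(related, "E1", max_hops=2))
```

In an object system the links are followed directly from object to object, whereas a relational implementation would typically need one join, or one further query, per hop.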
Another factor on suitability of either ER or OO is
the primary requirement for its intended use: the
most commonly used methodology in molecular
biology is gene comparison, which both ER and
OO can facilitate. However, recent developments
of metabolic pathway databases try to capture far
more complex information than simple gene
sequences because of the type of interactions
between the molecules (chemical reactions), where
the objects forming the data are nodes of networks
linked by edges representing the chemical reactions
(Frishman et al., 1998; Wittig and De Beuckelaer,
2001 and Krishnamurthy et al., 2003). Raguenaud et
al. (2002) consider taxonomic data as too
complicated to be adequately represented by the
simple structures of relational models (see also
further below in §3.2). The non-suitability of ER is
refuted by others (e.g. Markowitz et al., 1999)3.
A limited comparison between ER and OO (using
UML) on a theoretical level has been carried out by
Bornberg-Bauer and Paton (2002) discussing what
is possible in biological data modelling, but not
what should be in order to meet database
requirements of biologists. Is one or the other
merely the ‘lesser of two evils’? Although, in this
context as an aside, it seems that requirements set
by the various sub-disciplines of biology are not
compatible with one another and/or that further
standardisation in definitions and data formats
would be required before the next step towards
designing consistent and compatible databases can
be taken.
Noteworthy is that of the published conceptual
models for biological databases, most remain
within the realms of ER and OO. Juristo and
Moreno (2000) argue that these modelling methods
sit in between computational model and conceptual
model, and categorise ORM, conceptual graph
theory (CG) and formal concept analysis (FCA) as
being on a more abstract level, hence ‘true’
conceptual modelling techniques. Could it be that
ER and OO cannot fully capture the intricacies of
biological data, but ORM/CG/FCA can?
3
Another, the object-relational approach, is not
further discussed here. BIND (Bader et al., 2001)
and the Arabidopsis thaliana database (Frishman et
al., 1998) make use of this modelling approach.
3.1 ER with/versus ORM
The illustrative examples in this section are taken
from conceptual models of the bacteriocin database
(Keet, 2003a), developed during the final year
project (FYP) of the author, whose supervisor was
unfamiliar with the subject matter of the database
(microbiology and food science) and whose
customer was not cognizant of databases, let alone
drawings of conceptual models. In other words: in
theory, this author could have modelled whatever
she liked, and either ignored, or at least postponed,
any potential difficulties to the implementation
phase (and subsequently swiftly moved on to
another project). This may sound unprofessional,
but an example may suffice. The customer’s main
requirements for the bacteriocin database were to
have an easily accessible, structured and searchable
repository for bacteriocin-related data extracted
from the vast amount of journal articles she
gathered over the years. Bacteriocins are
compounds (peptides) similar to antibiotics and
inhibit growth of other, often closely related,
bacteria; though unlike antibiotics, they are
functionally non-therapeutic, so there is potential to
use bacteriocins as a natural ingredient in food
produce for food safety and preservation. A
preliminary ER-model was generated, with the
microorganisms, bacteriocins and plasmids
(containing the genes coding for bacteriocins) as
shown in the diagram in Figure 1. A complete
diagram, including Figure 1 and several other
entity types (18 in total) was inspected by both the
customer and supervisor of the research project…
and accepted.
However, the prime function of bacteriocins is
inhibiting and killing other microorganisms, but
none of this is modelled! Due to a sense of disquiet
of the semantics between the three main entity
types microorganisms, bacteriocins and plasmids,
this author resorted to ORM to try to uncover what
was “missing”, i.e. attempting to make the implicit
explicit.
Figure 1. Section of the preliminary ER diagram.
Figure 2. Overview of the three main entity types in the conceptual model and their relationships
Where ER emphasises the relations between entity
types, and pushes its attributes to the background,
ORM requires one to explicitly state the
relationship(s) not only between entity types, but
also how the entity types relate to their attribute(s).
A very first simple exercise to model these three
entity types revealed the rather serious lacuna, the
absence of inhibition of microorganisms, in the
model (Figure 2).
Re-analysing the ER model revealed further
specific details, as included in Figure 3 and Figure
4 (some attributes are omitted from the figures).
With the aid of sample data and the verbalizer
feature in VisioModeler, the changes the author
made could be communicated to the customer in
a more fruitful manner. For example, the original
‘MicroOrganism containsA Plasmid’, has changed:
in theory, a microorganism can have more than one
plasmid, and a plasmid can occur in more than one
microorganism. Further, the assumption that there
is no gene coding for a bacteriocin on
chromosomal DNA was abandoned. The
occurrence of a bacteriocin gene residing on
chromosomal DNA is unlikely, but not impossible
and the exception had to be catered for. This also
meant that the entity type Plasmid had to be
renamed into the generic name
GeneticDeterminant. With one thing leading to
another, it appeared that certain mobile DNA
fragments, transposons, could carry bacteriocin
genes as well, which, as mentioned, can insert
themselves into plasmids. Subsequently, it needed
to be recorded what the actual location of the gene
was, as well as its type. Other details were clarified
and confirmed as well (refer to Keet (2003b) for a
description and discussion on the semantics of the
subject matter). Interestingly, after this interaction
with the customer, she preferred an “uncluttered”
ER-diagram above the complete ORM model,
because “now I know what to think when I see these
boxes”, i.e. imagining what is, or can be / may be,
captured in the conceptual model, but hidden from
the visual representation. One could argue that the
problems encountered are due to a lack of
ER-modelling expertise on the part of the author.
However, a more likely factor is knowledge of the
UoD: had the author not been familiar with the
subject matter, (one of) the preliminary model(s)
would have been used to design a logical model
and implement the database (how can one know
something is ‘missing’ if one is unfamiliar with the
subject matter?), only to realise at the testing stage
that the biological semantics were not accurately
modelled. On the other hand, knowledge of the
subject matter might have contributed to
obfuscating certain details of the biological data,
the author filling the ‘gaps’ in the visual
representation by thinking them there (ER does not
oblige me to include it), as well as putting a slightly
different emphasis on data related to genetics and
biochemistry4.
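The refinements described in this section (many-to-many links between microorganisms and genetic determinants, a generic GeneticDeterminant with a type and an actual location, and an explicit inhibition relationship) can be approximated in Python as follows. This is a simplified sketch and not the schema of the bacteriocin database: apart from GeneticDeterminant, every identifier, and all sample data, is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeterminantType(Enum):
    PLASMID = "plasmid"
    TRANSPOSON = "transposon"
    CHROMOSOME = "chromosomal DNA"


@dataclass(frozen=True)
class GeneticDeterminant:
    name: str
    kind: DeterminantType
    # Actual location, e.g. a transposon that has inserted itself into a plasmid.
    inserted_in: Optional["GeneticDeterminant"] = None

    def location_chain(self) -> list:
        """Names from the gene's immediate carrier outwards."""
        chain, current = [], self
        while current is not None:
            chain.append(current.name)
            current = current.inserted_in
        return chain


def producers(bacteriocin, determinants, carries, encodes) -> set:
    """Microorganisms carrying a determinant, or its host determinant, that encodes the bacteriocin."""
    result = set()
    for determinant in determinants:
        if (determinant.name, bacteriocin) in encodes:
            carriers = set(determinant.location_chain())
            result |= {organism for (organism, name) in carries if name in carriers}
    return result


# Hypothetical sample data.
p1 = GeneticDeterminant("P1", DeterminantType.PLASMID)
t1 = GeneticDeterminant("T1", DeterminantType.TRANSPOSON, inserted_in=p1)

carries = {("M1", "P1"), ("M2", "P1"), ("M1", "P2")}   # many-to-many
encodes = {("T1", "B1")}                               # determinant encodes bacteriocin
inhibits = {("B1", "M3")}                              # the relationship missing at first

print(t1.location_chain())
print(sorted(producers("B1", [p1, t1], carries, encodes)))
print(sorted(target for (b, target) in inhibits if b == "B1"))
```

Keeping type and location as separate properties answers the earlier question of whether a gene on a transposon inside a plasmid should be classified as transposon or plasmid: both facts are recorded and neither overrides the other.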
In this project, both ER and ORM were used, the
former as it was a requirement of the research
project, the latter to make up for the shortcomings
of the former, only to be set aside subsequently in
favour of the ‘simplified’ ER-model. However, this
need could easily be met with a feature in e.g.
VisioModeler offering alternate views of the same
model in order to (un)clutter it, with menu options
such as “hide attributes” and “swap fact types for
a line”.
Ideally, an iterative process between different
conceptual data modelling tools ought not to be
necessary: a single conceptual modelling technique
should be sufficiently expressive to be able to
capture ‘everything’ (or at least biological
semantics). ORM is closer to this ideal than ER.
3.2 OO and FCA
Whereas the previous example highlights graphical
representation, understandability and inclusiveness
of types and attributes of modelling, here follows a
brief discussion on different representations of
taxonomic data (which from an outsider’s view
would be exceedingly suitable for hierarchical
modelling) and limitations of conceptual modelling
facilities built into the modelling techniques.
However, there is not one hierarchy, but three
principal ones (classification, name and rank), each
with varying definitions of their actual instances as
used by taxonomists. Within the ranking hierarchy,
data has the ability to acquire roles or change
behaviour according to context; further, the
intended conceptual model should support
recursive behaviour and composite entity types
(Raguenaud et al., 2002). ER does not allow for
such complex data types, except when one would
implement this in the application layer, which is
not the intention when devising a conceptual
model. Raguenaud (2002) created his own version
of conceptual modelling, based on the Extended
OO model, called POOM (Prometheus Object
Oriented Model), to allow for taxonomic
complexities. On the other hand, Priss (2003)
modelled overlapping hierarchies, especially the
taxonomic ranking (variety, species, genus, and so
forth), and devised a mathematical formalization
via Formal Concept Analysis (FCA, refer to
http://www.upriss.org.uk/fca/fca.html and Ganter
and Wille (1999) for details). In principle, FCA
facilitates reuse of software instead of having to
write ad hoc solutions, like POOM, and it
emphasizes the use of logic to make the implicit
explicit. Albeit providing a convincing integration
of taxonomies (Figure 5 shows a merger of two,
which the author created with JaLaBA, based on
Priss’ example), the prime aspect is the assumption
that one can capture biological semantics in
formalizations. Can one formalize everything
mathematically? Without digressing into the
philosophical matter of whether at some point in the
future understanding of biology will have advanced
to such an extent that humans may be able to
capture all aspects of the life sciences in
mathematical formulae, or whether this would be
impossible: at the time of writing, there is, from the
viewpoint of a computer scientist, a considerable
lack of structure, an abundance of uncertainties and
apparent inconsistencies in biological data, and
disagreements on biological concepts, that would
make conceptual modelling with FCA an extremely
difficult undertaking.
4
Note that the customer is a food microbiologist
and the author a general microbiologist (by first
study), which sounds similar, but is not exactly the
same discipline.
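For readers unfamiliar with FCA, the core idea can be sketched in a few lines of Python: given a formal context (objects and the attributes they have), a formal concept is a pair of an extent and an intent that determine each other. The brute-force enumeration below is a naive illustration, exponential in the number of objects, and is neither the algorithm used by JaLaBA nor Priss’ formalization; the small context of culinary versus biological fruit is made up here and is not the data behind Figure 5.

```python
from itertools import combinations


def intent(objects, context, all_attributes) -> frozenset:
    """Attributes shared by every given object (all attributes for no objects)."""
    shared = set(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)


def extent(attributes, context) -> frozenset:
    """Objects that have every given attribute."""
    return frozenset(obj for obj, attrs in context.items() if attributes <= attrs)


def formal_concepts(context) -> set:
    """All (extent, intent) pairs, found by closing every subset of objects."""
    all_attributes = set().union(*context.values())
    objects = list(context)
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            shared = intent(subset, context, all_attributes)
            concepts.add((extent(shared, context), shared))
    return concepts


# Hypothetical formal context: object -> set of attributes.
context = {
    "apple": {"bio_fruit", "culi_fruit"},
    "tomato": {"bio_fruit", "culi_vegetable"},
    "rhubarb": {"culi_fruit"},
    "carrot": {"culi_vegetable"},
}

concepts = formal_concepts(context)
for ext, inte in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(ext), "<->", sorted(inte))
```

The concept with extent {apple, tomato} and intent {bio_fruit} cuts across the two culinary groupings, which is the kind of overlap between hierarchies that a merged lattice makes explicit. The sketch presupposes crisp, agreed attributes, which is precisely the assumption questioned above.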
4. Concluding remarks
Modelling biological data faces different problems
compared to the more standard business processes.
There is no legacy to rely and build upon, there is
an abundance of non-discrete data, uncertainties on
relevant parameters and a general lack of
standardization in nomenclatures and concepts.
General features of ER, OO and ORM were
discussed before addressing a modelling example
with the bacteriocin database, emphasising
differences in graphical representation,
understandability from the customer’s perspective
and inclusiveness of types and attributes in the
model. A second example related to taxonomy
highlighted features (and the lack thereof) of
(Extended-) OO, the ad hoc solution POOM and the
(im)possibilities of FCA to formalize biological
data and its concepts.
Whereas none of the models meets everybody’s
requirements while at the same time being capable
of conceptually representing ‘everything in
biology’, the more abstract conceptual modelling
techniques ORM and FCA may be more promising
in capturing the biological semantics as inclusively
and formally as possible and could potentially yield
reusable models, or sections thereof, in order to
build up an extensive repository and aid
standardization, which in turn will improve the
quality of developed software.
Figure 3. Refinements of the conceptual model resulting from ORM exercises.
Figure 4. The ER-diagram notation of Figure 3.
Figure 5. Merged taxonomic hierarchies, as generated in JaLaBA. culi = culinary; bio = biological
fruit. (http://juffer.xs4all.nl/cgi-bin/jalaba/JaLaBA.pl?action=output&xinvoer=fca.txt)
References
Bader, G.D., Donaldson, I., Wolting, C., Ouellette,
B.F.F., Pawson, T. and Hogue, C.W.V., (2001),
‘BIND — The Biomolecular Interaction Network
Database’. Nucleic Acids Research, 29(1), 242-245.
Bornberg-Bauer, E. and Paton, N.W., (2002),
‘Conceptual data modelling for bioinformatics’,
Briefings in Bioinformatics, 3(2), 166–180.
Drysdale, R., (2001), ‘Phenotypic data in FlyBase’.
Briefings in Bioinformatics, 2(1), 68-80.
Ganter, B. and Wille, R., (1999), Formal Concept
Analysis – Mathematical foundations. Berlin-Heidelberg: Springer-Verlag. 284p.
Gene Ontology Consortium.
http://www.geneontology.org/. Date accessed:
12-6-2003.
Gene Ontology Consortium, (2001), ‘Creating the
Gene Ontology Resource: design and
implementation’. Genome Research, 11(8),
1425-1433.
Graham, M., Watson, M.F. and Kennedy, J.B.,
(2002), ‘Novel visualisation techniques for
working with multiple, overlapping classification
hierarchies’. Taxon, 51, 351-358.
Frishman, D., Heumann, K., Lesk, A. and Mewes,
H.-W., (1998), ‘Comprehensive, comprehensible,
distributed and intelligent databases: current
status’. Bioinformatics, 14(7), 551-561.
Halpin, T., (2001), Information Modeling and
Relational Databases. San Francisco: Morgan
Kaufmann Publishers. 761p.
Juristo, N. and Moreno, A.M., (2000),
‘Introductory paper: Reflections on Conceptual
Modelling’. Data & Knowledge Engineering,
33(2), 103-117.
Keet, C.M., (2003a), ‘The use of bacteria and
bacteriocins in the food industry – modelled and
documented in a relational database’. BSc Final
Year Project, Department of Technology and
Department of Computing, Open University,
UK. 149p.
Keet, C.M., (2003b), ‘Conceptual Modelling for
Applied Bioscience: The Bacteriocin Database’.
CSPS: Computational intelligence/0310001. 25p.
Available online:
http://www.compscipreprints.com/comp/Preprint
/mkeet/20031008/1
Krishnamurthy, L., Nadeau, J., Ozsoyoglu, G.,
Ozsoyoglu, M., Schaeffer, G., Tasan, M. and Xu,
W. (2003), ‘Pathways database system: an
integrated system for biological pathways’.
Bioinformatics, 19(8), 930-937.
Laser, U., Lehrach, H. and Roest Crollius, H.,
(1998), ‘Issues in developing integrated genomic
databases and application to the human X
chromosome’. Bioinformatics, 14(7), 583-590.
Macauley, J., Wang, H. and Goodman, N., (1998),
‘A model system for studying the integration of
molecular biology databases’. Bioinformatics,
14(7), 575-582.
Maddison, D.R., Swofford, D.L and Maddison,
W.P., (1997), ‘NEXUS: an extensible file format
for systematic information’. Systematic Biology,
46(4), 590-621.
Markowitz, V.M., Chen, I.A., Kosky, A.S. and
Szeto, E., (1999), ‘OPM: Object-Protocol Model
Data Management tools ‘97’. In: Bioinformatics
– databases and systems. Letovsky, S.L. (ed.).
Massachusetts: Kluwer Academic Publishers. pp
187-199.
North, K., (1999), ‘Modeling, data semantics and
natural language’. New Architect, 7 [Electronic].
http://www.webtechniques.com/archives/1999/0
7/data/. Date accessed: 27-4-2003.
Priss, U., (2003), ‘Formalizing Botanical
Taxonomies’. Proceedings of the 11th
International Conference on Conceptual
Structures, 2003. 14p. Online preprint:
http://www.upriss.org.uk/papers/iccs03.pdf
Raguenaud, C., (2002), Managing complex
taxonomic data in an object-oriented database.
PhD Thesis, Napier University, Edinburgh. 196p.
Available online:
http://www.soc.napier.ac.uk/publication/op/getpu
blication/publicationid/1845313
Raguenaud, C., Pullan, M.R., Watson, M.F.,
Kennedy, J.B., Newman, M.F. and Barclay, P.J.,
(2002), ‘Implementation of the Prometheus
Taxonomic Model: a comparison of database
models and query languages and an introduction
to the Prometheus Object-Oriented Model’.
Taxon, 51, 131-142. Available online:
http://www.soc.napier.ac.uk/publication/op/getpu
blication/publicationid/278988
Shoop, E., Silverstein, K.A.T., Johnson, J.E. and
Retzel, E.F., (2001), ‘MetaFam: a unified
classification of protein families. II. Schema and
query capabilities’. Bioinformatics, 17(3), 262-271.
Ter Hofstede, A.H.M. and Proper, H.A., (1998),
‘How to formalize it? Formalization principles
for information systems development methods’.
Information and Software Technology, 40(10),
519-540.
Thierry-Mieg, J., Thierry-Mieg, D. and Stein, L.,
(1999), ‘ACeDB: The ACe Database Manager’.
In: Bioinformatics – databases and systems.
Letovsky, S.L. (ed.). Massachusetts: Kluwer
Academic Publishers. 265-278.
Uberbacher, E., Computing the Genome.
http://www.ornl.gov/ORNLReview/v30n34/genome.htm. Date Accessed: 24-8-2002.
Uchiyama, I., (2003), ‘MBGD: microbial genome
database for comparative analysis’. Nucleic
Acids Research, 31(1), 58-62.
Wittig, U. and De Beuckelaer, A., (2001),
‘Analysis and comparison of metabolic pathway
databases’. Briefings in Bioinformatics, 2(2),
126-142.
Xenarios, I. and Eisenberg, D., (2001), ‘Protein
interactions databases’. Current Opinion in
Biotechnology, 12, 334-339.
ACeDB – A C. elegans DataBase (genome
project): http://www.acedb.org
BIND – Biomolecular Interaction Network
Database: http://www.bind.ca/
FlyBase – Drosophila genome:
http://flybase.bio.indiana.edu/
GenBank:
http://www.psc.edu/general/software/packages/g
enbank/genbank.html
MBGD – MicroBial Genome Database:
http://mbgd.genome.ad.jp/
PrometheusDB: www.prometheusdb.org
REBASE – Restriction Enzyme database:
http://rebase.neb.com/rebase/rebase.html
SRS – Sequence Retrieval System: http://srsmips.gsf.de / http://srs.ebi.ac.uk/
Swiss-Prot – Protein knowledgebase:
http://www.ebi.ac.uk/swissprot/index.html
TIGR – The Institute of Genomic Research:
http://www.tigr.org
About the author:
Marijke Keet received her MSc in Microbiology
from Wageningen Agricultural University in the
Netherlands, expects to graduate with a 1st class
honours BSc IT & Computing from the Open
University UK, and currently pursues a PhD at the
School of Computing at Napier University in
Edinburgh, Scotland. She has also worked as an
engineer in various jobs in the IT industry and will
receive a 1st class MA in Peace & Development
Studies from the University of Limerick, Ireland.