
ONTOLOGY DESIGN PATTERNS FOR THE FORMALISATION OF BIOLOGICAL ONTOLOGIES

A report submitted to the University of Manchester for the degree of Master of Philosophy in the Faculty of Engineering and Physical Sciences

2005

By Mikel Egaña Aranguren
Department of Computer Science

Contents

Abstract
1 Introduction
  1.1 Context
  1.2 Research hypothesis
  1.3 Report outline
2 Background
  2.1 Ontologies
    2.1.1 Introduction to ontologies
    2.1.2 Web Ontology Language
      2.1.2.1 Introduction to Web Ontology Language
      2.1.2.2 Summary of Web Ontology Language technical properties
  2.2 Bio-ontologies
    2.2.1 Open Biomedical Ontologies
      2.2.1.1 The Gene Ontology
      2.2.1.2 Other OBO ontologies
    2.2.2 Other biomedical ontologies outside OBO
3 Formalising knowledge in bio-ontologies: rationale and previous work
  3.1 The need for formalised bio-ontologies: advantages of OWL DL and problems of OBO
    3.1.1 Integration of different ontologies
    3.1.2 Automatic maintenance of multiple inheritance
    3.1.3 Not clear semantics: lack of expressivity and formality
  3.2 Gene Ontology Next Generation (GONG) and Biological Ontology Next Generation (BONG)
4 Formalising knowledge in bio-ontologies: Ontology Design Patterns
  4.1 Introduction to Ontology Design Patterns
  4.2 Documenting Ontology Design Patterns
    4.2.1 Description template of Software Design Patterns
    4.2.2 Description template of Ontology Design Patterns
  4.3 Ontology Design Patterns explored so far
    4.3.1 Extensional ODPs
      4.3.1.1 N-ary Relationships
      4.3.1.2 Exception
    4.3.2 Good practice ODPs
      4.3.2.1 Normalisation
      4.3.2.2 Value Partition
      4.3.2.3 Upper Level Ontology
    4.3.3 Modelling ODPs
      4.3.3.1 List
      4.3.3.2 Adapted SEP triples
5 Conclusion
  5.1 Research hypothesis revisited and extended: research aims, objectives and questions
  5.2 Contributions
    5.2.1 GONG and BONG
    5.2.2 Integration of ODPs in BONG
    5.2.3 ODPs catalog
      5.2.3.1 Properties of the biological knowledge domain
      5.2.3.2 Ontological constructs for ODPs
    5.2.4 Documenting ODPs
    5.2.5 Improved bio-ontologies
  5.3 Evaluation
  5.4 Research plan
Bibliography

List of Figures

1.1 Simplified small example ontology
1.2 Ontology Design Pattern applied to the simplified small example ontology of Figure 1.1
3.1 Position of the term polarisome in GO
3.2 Functional and chemical classification in metabolism for the term acetylcholine biosynthesis
4.1 Simple mapping of OWL to UML
5.1 Research plan

Abstract

Bioinformatics manages the information that has been gathered in databases since the advent of the technological revolution in molecular biology. Successful research is based on interpretations of that information that can be accessed and managed computationally and efficiently, which is a difficult task considering that there are very many Bioinformatics resources and that those resources are not integrated efficiently. One attempt to solve that problem is to use ontologies. Ontologies are computational formalisations of the knowledge about a given domain, allowing computers to manage the information at a semantic level. The most successful ontologies applied in Bioinformatics are the ones in the Open Biomedical Ontologies (OBO) project. Most of the OBO ontologies are very simple and intuitive but lack formality and expressivity. The Web Ontology Language (OWL) is an official proposal for implementing ontologies in the semantic web. Its OWL DL variant is grounded in a sound formalism (Description Logics) and is very expressive. In OWL DL ontologies, implicit knowledge can be made explicit and the consistency of the ontology can be checked automatically.
There are techniques that aid in the building of OWL DL ontologies, such as the application of Ontology Design Patterns (ODPs), which are formalised abstractions of modelling solutions that can be applied to different ontologies. The hypothesis of this research is that, by providing biologists with tools and methods like ODPs, the migration from the OBO language to OWL DL and the creation of OWL DL ontologies can be done with ease. This will produce more maintainable and expressive ontologies, against which more complex queries can be made and in which biological knowledge is represented with higher fidelity. This research aims to explore the application, documentation and usage of ODPs in the context of previous successful attempts to formalise GO, such as the Gene Ontology Next Generation project. This document gives an overview of the research field, explores the preliminary work on ODPs and provides the research plan for the following two years.

Chapter 1
Introduction

This chapter gives an overview of the problem that this research will try to solve and of how it will be solved. The problem and its context are explained in section 1.1. The proposed solution is given in section 1.2 and, finally, the structure of the whole report is explained in section 1.3.

1.1 Context

Bioinformatics is the discipline that deals with the information and knowledge created around the technical revolution that has been happening in molecular biology in the last 25 years. As a consequence of that revolution, large amounts of complex information are being created and stored. This information is very heterogeneous, including data that range from plain DNA sequences to complex protein 3D structures. All those data are annotated, including information about origin, reliability, structural interpretation, possible drug targets, etc. The first Bioinformatics resources that stored those data were built on flat files, and later on relational databases.
These storage methods are still in use today and, as a result, massive hyperlinking is the main means of integrating resources. Around 700 databases are on offer to the biologist [Gal05], and the number of resources is even larger when analysis tools are taken into account. Despite the growing amount of available information [Tho03], the knowledge (useful, describable and computationally manageable information) related to it is not growing at the same pace, so the biologist is usually left alone in a sea of unmanageable and thus less useful data [DBD+04, SK05, MTES05] because of the large number of available resources and different formats.

One way of tackling that problem is to make computers able to logically manage the semantics of that information: resources can be better integrated, many tedious and error-prone processes can be automated, and new knowledge regarding the field can be unleashed in an automatic and more efficient way [BMM05]. The biggest example of this strategy is the semantic web¹ [BLHL01]. The semantic web is a means to build a World Wide Web where the semantic content is accessible to computers, not just to human users. Ontologies are one of the main components of the semantic web. Ontologies are models that represent knowledge about a domain in a computable way. In the semantic web, ontologies are the mechanism for providing a vocabulary that describes data held in a common data model. The vocabulary and the semantics provided by the ontology facilitate machine processing.

Ontologies are usually collections of classes, each class being a group of individuals, where the classes are linked by different logical relationships, creating a structure like the one shown in Figure 1.1. The semantic meaning of a class is given by its position in the structure: the lines are logical relationships that connect that class with others; a class can be a subclass of another class, a part of another class, or be linked to it by other logical relationships.
One of the main aims of an ontology is to create a shared understanding. All the people using an ontology to annotate and query resources use the same terms in the same manner. This shared understanding can be extended to computers. Whilst not having the same understanding as a human, the computer can make inferences about the symbols themselves. By enabling a computer to do more sophisticated processing, it is possible to gain more added value from the process of annotating data with terms from an ontology.

Ontologies can be created in different Knowledge Representation (KR) languages. These languages differ in their expressivity: the more expressive a language is, the more complex the knowledge represented by the ontology can be. Expressivity comes at a cost: the more expressive a KR language is, the less tractable it is by computational methods. A KR language can reach a point in expressivity where the logic that it maps to is said to be undecidable; as a consequence, the computational tools are not guaranteed to give a result when operating with the ontology.

One of the most widely used KR languages is the Web Ontology Language² (OWL). OWL is divided into three variants, depending on the expressivity: OWL Lite, OWL DL and OWL Full. OWL DL, which is the focus of this research, maps to Description Logics (DL). DL languages are a decidable fragment of First Order Logic (FOL). The Description Logics that relate to OWL DL are very expressive and formal, allowing reasoning: a program called a reasoner can compute the structure of the ontology and check the logical consistency of that ontology, amongst other things.

¹ http://www.w3.org/2001/sw/
² http://www.w3.org/2004/OWL/

Figure 1.1: Simplified small example ontology. This invented trivial ontology represents knowledge about a hypothetical society where all the women are mothers and lawyers, whereas all the men are fathers but are unemployed. There are different relationships, each of them with different logical characteristics: Is A, Part Of and Has Parent. For example, Is A means that a class is a more specific subclass of another class (mother is a type of person), whereas Part Of means that a class is a constitutive element of some other class (this society is built upon infrastructures and people, but an infrastructure is not a type of society). Has Parent only relates people to their parents, so a child will have two Has Parent relationships. These relationships link classes, creating the semantic meaning of a class from the position it has in the structure: child is defined by being a type of person and having two parents, a lawyer mother and an unemployed father. The string of characters that forms each class name is completely meaningless except for human users: a child is a child because of its position in the structure (anything that has a lawyer mother and an unemployed father and is a person), not because of the string child. Therefore the class child represents the group of individuals that fulfil those conditions.

The biggest example of ontology usage in Bioinformatics is the Gene Ontology³ (GO), which is part of the Open Biomedical Ontologies⁴ (OBO) project. GO has three independent subontologies, describing the molecular function, cellular component and biological process of gene products held in other biological databases. The structure of GO is simple: the terms⁵ are related by two types of relationships, IsA and PartOf, creating a tree-like structure. GO allows for the annotation of gene products with GO terms and acts as a semantic integrating system: given a gene product, a query can be made against GO to obtain the terms that are annotated to that gene product.
The other terms related via IsA or PartOf relationships to those terms can be accessed, including their annotations, giving an overall picture of the semantic relationships of the gene product with other biological entities and processes.

GO has a wide and active community of curators and it is highly appreciated by molecular biologists. This is partly due to its intuitive structure and simple relationships. However, this lack of formality has drawbacks: the ontology is very difficult to maintain and it offers an over-simplified representation of current biological knowledge. Migrating GO to OWL DL can help to solve those problems. OWL DL allows more expressive modelling, closer to real biological knowledge. Because expressive models capture biological knowledge with higher fidelity, more complex queries can be made against them, so more complex knowledge, closer to real biological knowledge, can be retrieved from Bioinformatics resources. Reasoning, which comes from the formality of OWL DL, plays an important role in the proposed system: reasoning can make implicit knowledge explicit (it can compute the whole ontology structure from the assertions in which that structure is implicit) or it can be used to query the systems in more sophisticated ways. Reasoning can also be used to maintain big ontologies computationally with minimal human intervention.

The biologists are the ones who can do the migration from OBO to OWL DL because they are the knowledge domain experts and they can exploit the full potential of the OWL DL expressivity. However, the biologists must be presented with easy and simple tools that help them in the task. The Gene Ontology Next Generation⁶ project (GONG) has already demonstrated how the migration of GO to OWL DL can be done in an easy manner. GONG's ready-to-use methodology provides biologists with simple semantic scissors to dissect GO: regular expressions.

³ http://www.geneontology.org/
⁴ http://obo.sourceforge.net/
⁵ GO curators use the word term to refer to classes in the ontology. When referring to GO and other OBO ontologies in this document, the word term will be used instead of class.
⁶ http://gong.man.ac.uk/

Another technique is the application of Ontology Design Patterns. The notion of Ontology Design Patterns comes from object-oriented programming, where they are known simply as Design Patterns [GHJV95]. Design Patterns can be briefly defined as abstractions of solutions to modelling problems: when designing systems, the same problems appear again and again, and a Design Pattern is a formalised way of solving one of those problems. Instead of trying to solve the problem from scratch, the programmer can simply use or adapt a Design Pattern, which is a solution proven to be efficient many times before, making development a faster and more reliable process. The same concept can be applied to ontology engineering: an Ontology Design Pattern that solves a given problem can be applied every time that the problem appears in different ontologies. See Figure 1.2 for an example of an Ontology Design Pattern, called Value Partition.

The benefit of using Ontology Design Patterns is that they help to produce ontologies that are better structured and that capture biological knowledge with higher fidelity. They provide biologists with an abstraction of the underlying semantics that is easy to use when creating ontologies in OWL DL, very much like a semantic Swiss Army knife. Ontology Design Patterns can be formally defined and plugged into the GONG workflow that has been developed as part of this work, to be used as an off-the-shelf semantic modelling tool that helps biologists to express complex biological knowledge in OWL DL ontologies.

Figure 1.2: Ontology Design Pattern applied to the simplified small example ontology of Figure 1.1. This Ontology Design Pattern is called Value Partition and it is used to model classes that can only have certain attributes.
In the example's hypothetical simplified society the occupation can only be Teacher, Lawyer or Unemployed, so a new class is defined as the union of the three. This new term is not a physical part of the society but something that describes certain elements of the society, so it is a modifier. Using this Ontology Design Pattern the ontology has become more expressive: there is a new condition for being considered a mother or a father, and the elements of the society and their attributes have been decoupled, producing a cleaner ontology in which other attributes can easily be built using other Value Partitions and added.

1.2 Research hypothesis

There is a vast amount of knowledge captured in bio-ontologies that are semantically weak. There is a representation language (OWL DL), with associated reasoning tools, that could exploit more richly structured ontologies to expand the capabilities of bio-ontologies. The principal research question of this work is how to allow a biologist to migrate from the former to the latter with ease. The hypothesis of this research is as follows:

A usable migration methodology and tool from OBO to OWL DL, incorporating Ontology Design Patterns, will enable biologists to produce richer ontologies with greater analysis capabilities. This richer representation of biological knowledge will capture the domain with higher fidelity and facilitate analysis of data via more detailed, precise queries.

1.3 Report outline

Chapter 2 explains the basic concepts that relate to this research. Chapter 3 reviews the work done in the field, analysing the need for formalisation of bio-ontologies and giving a detailed explanation of the GONG project. Chapter 4 analyses Ontology Design Patterns, including a proposed documentation and classification scheme and some examples of Ontology Design Patterns applied to bio-ontologies.
Finally, chapter 5 summarises the expected development of the research: the future improvements regarding Ontology Design Patterns, the application of Ontology Design Patterns to real problems, the criteria for evaluating the results and the research plan for the following two years.

Chapter 2
Background

This chapter explores the necessary background information about bio-ontologies. It starts by explaining ontologies and the Web Ontology Language in section 2.1. Section 2.2 provides an analysis of current bio-ontologies.

2.1 Ontologies

2.1.1 Introduction to ontologies

The term ontology has been borrowed by computer science from philosophy, where it can refer both to the branch of metaphysics concerned with the nature and relations of beings and to a particular theory about the nature of being or the kinds of existents [McG01]. In computer science it describes a more concrete construct: a model that semantically captures the knowledge about a domain. The classical definition of an ontology is a specification of a conceptualisation [Gru93]. A more technical definition considers an ontology an engineering artifact to describe a certain reality, plus a set of explicit assumptions regarding the intended meaning of the vocabulary words [Gua98]. In biology, ontologies allow scientists to specify, to any degree of resolution, how data, terminology (i.e. controlled vocabularies), concepts and ideas all relate to each other [NMW04]. The term ontology is overloaded and is a subject of controversy: different things, ranging from thesauri to knowledge bases, are considered to be ontologies.

Ontologies play a pivotal role in the semantic web as vehicles for knowledge representation. They are used for different functions, such as web agents [Hen01] and web services [GHS04], GRID technology [SRG03], e-commerce [Kwo03], data mining and text mining [LZ04, KOTT03] and computer security [LT04], amongst others.
It is likely that, as the tools and languages used to build ontologies become more sophisticated and robust [Mus05], new uses of ontologies will arise that are unpredictable from today's point of view, as happened with technologies such as HTML and HTTP or databases.

2.1.2 Web Ontology Language

2.1.2.1 Introduction to Web Ontology Language

During the initial development of semantic web technologies there has been an evolution from data exchange standards like XML¹ (eXtensible Markup Language) to exchange languages with more semantics like RDF² (Resource Description Framework). OWL (Web Ontology Language) [AvH04] is the next layer in semantic expressivity above RDF [WGA05]. OWL is a W3C³ official proposal⁴ for a semantic exchange language in the semantic web.

The origin of OWL can be traced back to two different languages: DAML (DARPA Agent Markup Language) and OIL (Ontology Inference Layer). DAML was a project in the US funded by DARPA (Defense Advanced Research Projects Agency) that included the markup language and some tools. OIL was primarily based in Europe, funded by the European Union's Information Society Technologies Program. Whereas DAML was less formal and based on the notion of frames, OIL was based on more formal DLs. The efforts converged in DAML+OIL, incorporating the best of both, which would become OWL [HPSvH03].

OWL was designed to fulfil the following aims [ZM03]:

• OWL ontologies should be suitable for sharing; they should be public, so that different systems on the web can refer to the same ontology.
• OWL ontologies should be able to evolve, and a given resource should be able to point to the version of the ontology which is being used.
• OWL should allow ontologies to interoperate with each other when the same concepts are represented in different ontologies, allowing a web of ontologies.
• It should be possible to detect inconsistencies between different ontologies that are contradictory.
• OWL aims to strike a balance between expressivity and computational tractability, which makes reasoning possible. The more expressive a language is, the less computationally tractable it becomes.
• OWL should be easy to use and intuitive.
• OWL should be compatible with other standards like XML or UML (Unified Modelling Language).
• OWL should be compatible with internationalisation (use in different languages).

¹ http://www.w3.org/XML/
² http://www.w3.org/RDF/
³ The W3C (http://www.w3c.org) is a consortium for open web standards. It is led by Tim Berners-Lee, the creator of HTTP and HTML and of the idea of the semantic web.
⁴ http://www.w3c.org/2004/OWL/

There are different ontology editors that can manage OWL. The one used as a platform for this research is Protégé⁵, which provides a flexible plugin architecture and plenty of different functionalities.⁶

2.1.2.2 Summary of Web Ontology Language technical properties

OWL ontologies fall into three different species:

OWL-Lite is the simplest type: only simple class hierarchies and simple restrictions are allowed. OWL-Lite maps to the DL SHIF(D).

OWL-DL maps to the DL⁷ SHOIN(D). Automated reasoning can be applied to OWL DL. It is more expressive than OWL-Lite, especially with regard to class constructors. This is the type of OWL that will be the basis of this research.

OWL-Full is the most expressive OWL type: decidability is not guaranteed, so complete automated reasoning is not possible. OWL-Full is the union of RDF(S)⁸ and OWL DL.

⁵ http://protege.stanford.edu
⁶ http://www.co-ode.org/downloads/
⁷ http://dl.kr.org
⁸ http://www.w3.org/TR/rdf-schema/

In OWL ontologies there are three main elements [Hor04]:

1.- Individuals: the actual objects of the knowledge domain. They are analogous to instances in frame-based systems or object-oriented programming.

2.- Properties: binary relations on individuals. Properties are interpreted as sets of pairs of individuals.
Properties can be of different types:

• Object properties link individuals to individuals.
• Datatype properties link individuals to data values (integers, for example).
• Annotation properties are used to add extra information such as comments from the ontology maintainer, authors, cross references, etc.

Object properties link individuals from a certain domain to individuals of a certain range, and they can have inverse properties (the inverse of a property that links individual A to individual B links individual B to individual A). Object properties can have the following characteristics:

• Functional: a given individual can be related through a functional property to at most one individual.
• Inverse functional: in an inverse functional property the inverse property is functional. Thus, a given individual can be the value of the property for at most one individual.
• Transitive: a transitive property states that if A is related to B and B is related to C, then A is related to C by the same relationship.
• Symmetric: in a symmetric property, if A is related to B then B is related to A. Thus, the property is the inverse of itself.

3.- Classes: classes are interpreted as sets that contain individuals. The conditions for class membership of the individuals are stated precisely using restrictions. Restrictions are anonymous classes of individuals that have certain relationships to other individuals of the filler class. There are different kinds of restrictions:

• Existential restrictions (∃) state that there is at least one relationship along the restricted property to an individual of the filler class.
• Universal restrictions (∀) state that every relationship along the restricted property, if there is any, goes to an individual of the filler class and not to individuals of other classes. An individual with no relationships along the property also satisfies the restriction.
• Cardinality restrictions state the minimum (≥), maximum (≤) or exact (=) number of relationships along the restricted property.
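The set-based reading of properties and restrictions can be illustrated with a small sketch. It is not how an OWL reasoner works: it evaluates the restrictions over one fixed, finite set of pairs, and all the names (`successors`, `existential`, `universal`, `min_cardinality`, `is_functional`, the individuals and the `has_parent` data) are invented for illustration. Under the usual reading, a universal restriction holds for an individual when all of its relationships along the property, possibly none, go to the filler class.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (subject individual, object individual)


def successors(prop: Set[Pair], individual: str) -> Set[str]:
    """Individuals reached from `individual` along the property."""
    return {obj for (subj, obj) in prop if subj == individual}


def existential(domain: Set[str], prop: Set[Pair], filler: Set[str]) -> Set[str]:
    """Individuals with at least one relationship to a member of the filler."""
    return {x for x in domain if successors(prop, x) & filler}


def universal(domain: Set[str], prop: Set[Pair], filler: Set[str]) -> Set[str]:
    """Individuals whose relationships (if any) all go to the filler."""
    return {x for x in domain if successors(prop, x) <= filler}


def min_cardinality(domain: Set[str], prop: Set[Pair], n: int) -> Set[str]:
    """Individuals with at least n relationships along the property."""
    return {x for x in domain if len(successors(prop, x)) >= n}


def is_functional(prop: Set[Pair]) -> bool:
    """True if no individual is related to more than one individual."""
    return all(len(successors(prop, subj)) <= 1 for (subj, _) in prop)


# Hypothetical toy data, loosely following the society of Figure 1.1.
domain = {"anna", "ben", "carl", "dora"}
has_parent = {("carl", "anna"), ("carl", "ben"), ("dora", "anna")}
lawyers = {"anna"}

some_lawyer_parent = existential(domain, has_parent, lawyers)
only_lawyer_parents = universal(domain, has_parent, lawyers)
two_or_more_parents = min_cardinality(domain, has_parent, 2)

print(sorted(some_lawyer_parent))
print(sorted(only_lawyer_parents))
print(sorted(two_or_more_parents))
print(is_functional(has_parent))
```

In this toy data the individuals without any recorded parent satisfy the universal restriction trivially, which is one reason why universal restrictions are often combined with existential ones when modelling.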
The conditions (restrictions) for class membership can be necessary (⊑) or necessary and sufficient (≡). Necessary conditions assert what is needed for an individual to be a member of a class, but they are not enough to define that membership: for example, a necessary condition for being considered a member of the class human is to be a biped, but not all bipeds are humans. Necessary and sufficient conditions define class membership: following the given example, having a very developed neocortex is enough to infer membership of the class human; anything that has a very developed neocortex is human. Classes that only have necessary conditions are called primitive classes. Classes that have necessary and sufficient conditions are called defined classes. Both types of conditions can be combined when building a class.

Classes can also be built by combining other classes with logical operators like union (⊔), intersection (⊓) and complement (¬). Logical operators can also be included in restrictions, building complex expressions. Individuals can belong to more than one class. It can be explicitly stated that two classes are disjoint (an individual cannot belong to both classes).

OWL works with an open world assumption: unless the contrary is explicitly stated, the fact that something has not been found does not mean that it is false. Databases, for example, work with a closed world assumption: if an item has just one value for a given attribute, unless another value is found it will be assumed that the item has only that value. In an OWL ontology, for example, an individual can belong to two different classes unless the contrary is explicitly stated by the ontologist by making the classes disjoint.
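The difference between the two assumptions can be sketched with a toy membership query. This is only an illustration under simplifying assumptions: the function names, the individual `anna` and the class names are hypothetical, and the open world version considers nothing but asserted memberships and disjointness statements, whereas a real reasoner takes all the axioms of the ontology into account.

```python
from typing import FrozenSet, Optional, Set, Tuple

Fact = Tuple[str, str]  # (individual, class name)


def closed_world(asserted: Set[Fact], fact: Fact) -> bool:
    """Database-style answer: whatever is not stored is taken to be false."""
    return fact in asserted


def open_world(asserted: Set[Fact],
               disjoint: Set[FrozenSet[str]],
               fact: Fact) -> Optional[bool]:
    """True if asserted, False only if a disjointness statement rules the
    membership out, and None (unknown) otherwise."""
    individual, cls = fact
    if fact in asserted:
        return True
    for other_individual, other_cls in asserted:
        same_individual = other_individual == individual
        if same_individual and frozenset({cls, other_cls}) in disjoint:
            return False
    return None


# Hypothetical assertions: anna is a Lawyer; Lawyer and Unemployed are disjoint.
asserted = {("anna", "Lawyer")}
disjoint = {frozenset({"Lawyer", "Unemployed"})}

print(closed_world(asserted, ("anna", "Mother")))
print(open_world(asserted, disjoint, ("anna", "Mother")))
print(open_world(asserted, disjoint, ("anna", "Unemployed")))
```

The closed world query denies that anna is a Mother simply because nothing says so, while the open world query leaves the question open and only denies membership of Unemployed, which the disjointness statement excludes.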
2.2 Bio-ontologies

The transition from data to semantics that is happening in the semantic web is also happening in Bioinformatics: the discipline does not only deal with data gathering, since computing tools also interpret the data and deal with the knowledge related to those data [NMW04], following the path of the transition from XML to RDF and to OWL. Bioinformatics is a suitable discipline for that transition because it is a knowledge based discipline [SGB00] and plenty of biologists are willing to annotate that knowledge. The new semantic level will bring [NMW04]:

• Integration of heterogeneous data.
• Use of logic to unleash new hypotheses.
• More expressive models of nature.
• Formal annotation of discoveries, so that sharing them becomes more efficient.

Ontologies are used to reach that semantic level as they are not just controlled vocabularies: ontologies relate concepts through expressive relationships [SK02]. The aim of ontologies in biology is to express the complex knowledge related to biology in a way that is computationally tractable. Ontologies are widely used in the area of Bioinformatics [SWLG04] and they can be classified with respect to the function they fulfil:

Task oriented ontologies are designed for concrete tasks such as data mining and text mining [KSK02, CY03, MBH+05], web services [OGA+05] or resource integration. In resource integration, ontologies are used to integrate databases at different levels [Jac04]: to tackle the semantic heterogeneity of database entries (e.g. Gene Ontology, see section 2.2.1.1) or of database schemas (e.g. TAMBIS [SGP+03] or SEMEDA [KPL03]).

Domain oriented ontologies capture the knowledge of a concrete domain. The ontology, apart from being queried, can be used as the centre for other technologies. Plenty of OBO ontologies (see section 2.2.1) and other examples like PhosphaBase [WMS+05] fall into this category.
Generic ontologies are high level ontologies with general concepts that are used to integrate different ontologies. They are also known as Upper Level Ontologies (see section 4.3.2.3). 2.2.1 Open Biomedical Ontologies The Open Biomedical Ontologies organisation9 (OBO) offers a platform for biomedical ontologies that satisfy the following criteria: • OBO ontologies must be open (no restriction in use). • OBO ontologies must be implemented in standard ways (languages like OWL). • OBO ontologies must be orthogonal to each other (independent). 9 http://obo.sourceforge.net/ 18 • OBO ontologies must have a unique identifier prefix. • Definitions of the concepts of the OBO ontologies must be given. 2.2.1.1 The Gene Ontology The Gene Ontology10 (GO) [Lew05, Con00, BSG+ 04] provides an ontology that describes attributes of the gene products of an abstract and non pathological eucaryotic cell. GO offers a way of dealing with the semantic heterogeneity of gene product annotations in different databases: the annotations on different databases point to the same GO term. The Gene Ontology is responsibility of the GO consortium, a joint project formed by different organism databases11 that was started by FlyBase [Con99], Mouse Genome Informatics (MGI) [Bla00] and the Saccharomyces Genome Database (SGD) [KSR+ 04]. The main component of GO are the terms and the relationships that connect those terms (see below). Each term has an unique identifier apart from the term name, like GO:0005488 for the term binding.12 GO is divided in three independent ontologies: molecular function, biological process and cellular component. Molecular function describes basic and concrete molecular roles of gene products (e.g. thioredoxindisulfide reductase activity GO:0004791). Each biological process is made of different molecular functions and it describes a higher level role (e.g. development GO:0007275). The cellular component ontology represents the structure of eucaryotic cells (e.g. 
organelle GO:0043226).

Footnote 10: http://www.geneontology.org/
Footnote 11: http://www.geneontology.org/GO.consortiumlist.html
Footnote 12: In this document OBO terms and identifiers are put together the first time the term is introduced (for example binding GO:0005488). In further uses of the term through the document the identifier is left out for clarity.

The whole ontology is implemented as a Directed Acyclic Graph (DAG): multiple parent-child relationships are allowed in the structure, but cycles (a term being, directly or indirectly, a child of itself) are prohibited. The top of the hierarchy is populated by general terms and as we move deeper (more terms in the path) the terms become more specialised. The terms at the end of a path are called leaves and the terms in the path itself are called nodes. There are two types of relationships in GO:

IsA: also known as the subsumption relationship; one term subsumes the other. It can be described as a term being a subclass of a bigger class: autosome GO:0030849 is a subclass of chromosome GO:0005694, therefore autosome GO:0030849 IsA chromosome GO:0005694. Officially, the IsA relationship does not mean Instance Of (footnote 13) [WA03]. The IsA relationship is transitive.

PartOf: this relationship means that a child is a structural component (in the cellular component ontology) or a sub-process (in the biological process ontology) of its parent [Con01]. This relationship is also transitive.

An important assumption behind GO is the true path rule: starting from a leaf, all the relationships that go up in the tree along its path must be biologically true.

Another important aspect of GO organisation is the use of the word sensu: it is used when a term can have different meanings [LM04]. For example, the term cell wall GO:0005618 can be used to refer to bacteria, fungi and plants. In biology the same word is used to refer to the three types of cell wall, but the cell walls have different properties.
Therefore the word sensu is added to the term, meaning in the sense of or as described in: the term cell wall GO:0005618 has three children, namely cell wall (sensu Bacteria) GO:0009274, cell wall (sensu Fungi) GO:0009277 and cell wall (sensu Magnoliophyta) GO:0009505.

GO can be explored using various tools, the most common one being the AmiGO web interface (http://www.godatabase.org). DAG-Edit, a standalone program written in Java, is another popular tool for editing and exploring ontologies represented as DAGs; it can be downloaded from http://sourceforge.net/projects/geneontology. A list of all the GO related tools, some of which are mentioned further on in this document, is available at http://www.geneontology.org/GO.tools.html. GO ontologies can be obtained in different formats, including OBO format, flat files, XML, MySQL tables, etc.

Apart from the ontologies, other resources are available. Slims are high level slimmed down ontologies for analysing gene group annotations [Con04]. Annotations of other databases to GO are available in a list (http://www.geneontology.org/GO.current.annotations.shtml). The databases that include GO annotations are: SGD (Saccharomyces cerevisiae), FlyBase (Drosophila melanogaster), TAIR (Arabidopsis thaliana), WormBase (Caenorhabditis elegans), RGD (Rattus norvegicus), Gramene (Oryza sativa), ZFIN (Danio rerio), DictyBase (Dictyostelium discoideum), TIGR, Sanger GeneDB, GenBank and UniProt. Every GO annotation needs an evidence code (footnote 17) that states where the evidence for the annotation came from (e.g. inferred from direct assay, inferred from electronic annotation, etc.). Mappings of GO to other external systems (e.g. Enzyme Commission numbers, SWISS-PROT keywords) are also available (footnote 18); recently GO has been mapped to the UMLS [LM04].

Footnote 13: As the GO users guide says, clogs are a subclass or is-a of shoes, while the shoes I have on my feet now are an instance of shoes.
The growth and success of GO has been spectacular in recent years because of its openness, community involvement, intuitive structure and other reasons pointed out in [BSG+04]. It is a very dynamic project: full-time curators incorporate the large number of change requests from the community, supervised by the staff of each organism's database. Plenty of new resources include GO annotations (http://www.geneontology.org/GO.annotation.html). As a consequence, its functionality has been augmented to include, amongst others:

• The Gene Ontology Annotation project (GOA): assigns GO codes to other database annotations [CBM+04, CBB+03].
• The Gene Ontology Annotation Tool (GOAT): closely related to GONG, this project aims to create a tool that helps to create consistent annotations when using GO terms [BMWS03].
• Automated [HGL03, RCSA02, XWL+02, GLH04, KSDS03, CLT+05] or integrated [JSH+03] gene annotation.
• Use of semantic similarity for sequence searching [Zeh03, LSBG03].
• Categorisation of gene groups [JMFH04, ZSKS04, ZFW+03, BS04, ASDUD04]; given a large set of genes, a node or nodes of GO are used to summarise their function [JM04].
• Categorisation of gene expression [DSD+03, KBBD04, VEF+04, LHK04, RWBB04, JSA+04, YWCS05, BWG+04, CGGG+05, MBR+04] and statistical genomics [Car03]. For an up-to-date detailed overview of the tools for the analysis of gene expression based on GO see [KD05].
• Prediction of protein function by coupling machine learning with GO [LHMK03, KFD+03, DTSC04], prediction of the subcellular location of a given protein [CC04] or prediction of functional modules in bacterial genomes [WSM+05].

Footnote 17: http://www.geneontology.org/GO.evidence.html
Footnote 18: http://www.geneontology.org/GO.indices.html

2.2.1.2 Other OBO ontologies

There is a growing number of bio-ontologies in OBO. One of the most important ones is the Cell Type ontology [BRA05], which covers procaryotic cells, cells of animals, plants or fungi, and either in vitro or in vivo cells.
Another OBO ontology that should be mentioned is MGED (Microarray Gene Expression Data) [SK05], which describes the data generated by microarrays. It is one of the few OBO ontologies implemented in OWL.

2.2.2 Other biomedical ontologies outside OBO

As OBO is a relatively recent development, there are other biomedical ontologies that were developed before OBO was established or that were simply developed outside of OBO:

• OpenGalen (http://www.opengalen.org/) is an ontology used for medical information management.
• BioPAX (http://www.biopax.org/) describes biological pathways and is implemented in OWL.
• Ecocyc (http://ecocyc.org/) is one of the oldest bio-ontologies and describes the whole metabolism of Escherichia coli.

Chapter 3

Formalising knowledge in bio-ontologies: rationale and previous work

This chapter gives the reasons why bio-ontologies need to be formalised and reviews previous work on the problem. The reasons and a literature review are given in section 3.1. In section 3.2 the Gene Ontology Next Generation (GONG) project and the related Biological Ontology Next Generation (BONG) Protégé plugin (developed by the author during the first year of work) are presented. Part of the information gathered herein was collected during the author's visit to the EBI (European Bioinformatics Institute, footnote 1), funded by the Semantic Mining Network of Excellence.

3.1 The need for formalised bio-ontologies: advantages of OWL DL and problems of OBO

This research aims to analyse as many bio-ontologies as possible. However, GO, the most widely used bio-ontology, is analysed almost exclusively. More bio-ontologies will be included in further developments; nonetheless, the analyses, conclusions and Ontology Design Patterns regarding GO can be extrapolated to other bio-ontologies.
Footnote 1: http://www.ebi.ac.uk/

There is a clear trend in current bio-ontologies towards a more expressive and formal Knowledge Representation language: there are new biomedical ontologies implemented in OWL DL [FSP+04] or being transformed into OWL DL [GZB05]. In the case of OBO ontologies there are some ontologies in OWL (MGED (footnote 2) and NCI thesaurus (footnote 3)) and other ontologies like the Sequence Ontology (footnote 4) that present subrelationships and three relationship attributes (Is cyclic, Is transitive and Is symmetric). DAG-Edit also allows the use of the properties InverseOf and DisjointFrom. The tools [LHP03] and reasoners for OWL DL such as RACER [VR03] or FACT [IR98] are becoming more efficient and robust, making OWL DL available to more users in the biomedical domain [SH05].

Despite the mentioned trend towards OWL, GO still presents a very simple and intuitive structure to biologists: only IsA and PartOf relationships are allowed. The rest of the expressivity needed for modelling the attributes of gene products is reached by a mixture of curational guidelines, content embedded in the terms and other more or less explicit work-arounds like overloading the PartOf relationship.

Amongst the reasons for GO remaining in its current format is the reluctance of biologists to adopt any new technology that, regardless of its quality, represents a big change. This has happened in the GO consortium [Ire], where, for example, the developers have been discussing for more than a year whether the relationship regulates should be included in the ontology. It appears to be a straightforward decision from the ontology engineering point of view, but biologists simply reject anything that is new and is not absolutely certain to work. This attitude is grounded in the fact that bio-ontologists must provide ontologies that work [GW04] and that have to be continuously up to date with the databases, so other considerations are relegated to a secondary level [SK05].
Another related problem is the need to offer biologists simple interfaces to any new, complex and expressive language like OWL DL [Har].

It has already been pointed out in the literature that the priority in the GO working group [SWSK03] and among other bio-ontology developers [SK05] is to add as much knowledge as fast as possible to the ontology, leaving the consideration of formal principles at a secondary level. Thus, GO has become a victim of its own success: its simple structure has made it the preferred ontology for many biological databases, but its simplicity and lack of formality make it very hard to maintain manually. Apart from being difficult to maintain manually, GO has plenty of inconsistencies in the way it represents the domain knowledge and it is not very expressive, which makes it an opaque resource for other systems that try to interact with it in a computational and more sophisticated way.

Footnote 2: http://obo.sourceforge.net/cgi-bin/detail.cgi?mged
Footnote 3: http://obo.sourceforge.net/cgi-bin/detail.cgi?ncithesaurus
Footnote 4: http://obo.sourceforge.net/cgi-bin/detail.cgi?sequence

3.1.1 Integration of different ontologies

GO includes other ontologies in itself: GO terms are generally syntactically formed by combining certain constant words [OCAM+04, SK04a] and a big proportion of GO terms is made up by including terms of other ontologies. For example, all the GO terms having some kind of cell in the term name include terms from the Cell Type ontology (CL):

GO: fat cell differentiation (GO:0045444)
CL: fat cell (CL:0000136)

Since September 2005 there has been an ongoing effort to synchronise both ontologies; there are cells appearing in GO that do not appear in CL, or that appear with a different name, and vice versa. The strategy followed to achieve the aim was to syntactically parse GO in search of CL terms, using the BONG plugin (see section 3.2) and OBOL [Mun05].
This strategy is rather ad-hoc and does not tackle the problem of really integrating different ontologies: semantic integration achieved by syntactic parsing is neither a scalable nor a sound solution. OWL DL offers a technology grounded in the mentality of the semantic web: OWL DL ontologies can import other OWL ontologies, either locally or via HTTP, very much like importing programming libraries. Thus the reuse of ontologies is done at a semantic level the whole time and without having to develop parsing tools.

For OBO ontologies to be efficiently reused, clear upper-level semantics must be stated first. One attempt towards that aim is the use of a set of established relationships with well defined semantics in the OBO relationship types ontology (http://obo.sourceforge.net/cgi-bin/detail.cgi?relationship) [SCK+05]. Another attempt is the establishment of a well defined Upper Level Ontology (see section 4.3.2.3).

3.1.2 Automatic maintenance of multiple inheritance

GO has around 18,000 terms, and it is impossible to maintain an exhaustive structure with the curational methods used now: for any new term, the curators try to find manually every relationship that should be included, with the aid of term definitions. The GO curators themselves recognised the utility of a tool that could find the needed relationships automatically [Ire, Har]. If the ontology is implemented in OWL DL this can be done using the reasoner, especially if the ontology is properly normalised (see section 4.3.2.1).

3.1.3 Not clear semantics: lack of expressivity and formality

As a consequence of not using a formal and expressive language, plenty of semantics are reduced to the level of syntactics and stated as editorial guidelines outside the ontology or as parts of the term names or definitions, if stated at all.
As a result, computational tools are unable to access the ontology meaningfully and therefore plenty of automated tasks cannot be accomplished [BB05], including maintenance [SK04a], consistency checking and new knowledge discovery [YKNA03, Ait05].

Inconsistencies in the use of sensu

The word sensu is added to a term to express as described in taxon. For example cell wall biosynthesis (sensu Fungi) GO:0009272 means cell wall biosynthesis understood in the way it has been described in the Fungi. This means that genes of other taxa apart from Fungi can be annotated to cell wall biosynthesis (sensu Fungi). But in practice sensu is also understood as appearing in taxon, so, for example, to respect the true path rule the hierarchy of taxa narrows down as the GO hierarchy approaches its leaves, mixing both meanings of sensu. Other problems with sensu have been pointed out in [SK04a] and [SKK04] (with respect to the IsA relationship).

Proliferation of terms

As the GO language is not expressive enough, plenty of semantics must be stated by adding new terms or by adding syntactic elements to already existing terms, and as a consequence there is an unnecessary proliferation of terms. For example there are plenty of terms with the token during within them, to refer to processes that act within another process, e.g. cellular morphogenesis during differentiation GO:0000904 and ethanol biosynthesis during fermentation GO:0043458 [Ire].

Different levels of granularity

The importance of levels of granularity in biomedical ontologies has already been pointed out [AJT05, GCB04], as have the problems in GO derived from the mixture of different levels of granularity [KSN04]. There are two main problems: the different levels of organisation or granularity are not explicitly stated in GO, and they are mixed. For example the highest level of organisation in GO is the cell level, but terms that refer to the metacellular level can be found in GO (e.g. organ development GO:0048513).
Overload of the IsA relationship

The official definition of the IsA relationship in GO states that if A IsA B, every instance of A is an instance of B. But IsA is also used to denote KindOf, leading to confusing situations as pointed out in [AWB04], [SWSK03] and [SKK04].

Overload of the PartOf relationship

There has been considerable research regarding mereology and biomedical ontologies [RR00, AWB04], as the partonomy relationship can have different semantics or types of PartOf [Ode94]. In GO PartOf is used as a wildcard relationship when other relationships with different semantics should be used, as addressed before by [SDSH05, SWSK03], in the context of all the OBO ontologies by [SCK+05] and as a general phenomenon in ontology engineering by [CSF03]. As PartOf holds other relationships within it, for example location, any term that is PartOf two different GO terms will have the same location as the ancestors of those terms along the PartOf relationship, and that can create conflicts depending on the terms [Har]. For example polarisome GO:0000133 is part of cell cortex GO:0005938 and part of site of polarized growth GO:0030427, so it must be deduced that polarisome is located in both. That is only partially true, because when it is located in the site of polarized growth it encloses only a small portion of the whole cell cortex (see Figure 3.1). This is trivially fixed in a curator guideline that reads as follows (footnote 6):

The part-of relationship used in GO is usually (...) necessarily is-part, [meaning] that wherever the child exists, it is as part of the parent.
To give a biological example, replication fork is part of chromosome, so whenever replication fork occurs, it is as part-of chromosome, but chromosome does not necessarily have part replication fork.

Whenever polarisome occurs, it is as part of cell cortex, but cell cortex does not necessarily have part polarisome, so it could be that cell cortex has polarisome as a part only in the site of polarized growth, allowing human users to make proper assumptions regarding location. In any case there is no semantic statement regarding location in the model, hence the claim that the editorial guideline quoted is a trivial solution: the semantic definition of PartOf remains the same, encompassing all other relationships. If the model is queried for the location of polarisome, it will be (wrongly) deduced to be located both in the whole of the cell cortex and in the site of polarized growth.

Footnote 6: http://geneontology.org/GO.usage.shtml

The problem can be fixed by entering a more specific term with the location as a child of polarisome, for example polarisome on site of polarized growth (a technique often used in GO). However this solution is bad for two reasons: it leads to an unnecessary proliferation of terms and there is still no semantic way of modelling location [SH04]. It would avoid the problem but not solve it (see section 4.3.3.2).

3.2 Gene Ontology Next Generation (GONG) and Biological Ontology Next Generation (BONG)

The Gene Ontology Next Generation project (GONG, footnote 7) offers a simple workflow to migrate parts of GO into OWL DL and improve them [WSGA03]. The process relies on dissecting GO terms with regular expressions defined by the user and extracting new semantic content that can be combined with other ontologies. Once combined with other ontologies and translated into OWL DL, the ontology can be sent to a reasoner and the reasoner will point out new relationships that should be added back to GO.
GONG demonstrates the advantages of automatic maintenance of ontologies: a human curator cannot create and maintain all the necessary subsumption relationships in an ontology with more than 18,000 classes, but a reasoner will, provided that a correct set of semantics is given. GONG relies on how syntactically conserved the GO terms are to dissect a chosen subtree of GO into different semantic axes. For example the term acetylcholine biosynthesis GO:0008292 belongs to two tangled subtrees: a chemical subtree leading to acetylcholine and a functional subtree leading to biosynthesis (see Figure 3.2). Dissecting the term allows new semantics to be defined in an automatic way: acetylcholine biosynthesis can be redefined by adding a restriction in OWL DL that can be read as biosynthesis that acts on acetylcholine, maintaining the original GO relationships of the term (see Figure 3.2). If the resulting ontology is combined with a chemical ontology, the reasoner will have enough semantics to infer new relationships. For example, as acetylcholine is a subclass of neurotransmitter in the chemical ontology, acetylcholine biosynthesis would be inferred to be a subclass of neurotransmitter biosynthesis GO:0042136. This process is triggered when the term is captured by the respective regular expression ((.+?) (biosynthesis) in this case). Any term that matches that regular expression will hold the new semantics in the resulting OWL DL ontology; each regular expression has its own new semantics defined.

Footnote 7: http://gong.man.ac.uk/

Figure 3.1: Position of the term polarisome in GO. The term polarisome is part of both cell cortex and site of polarized growth. As a consequence of overloading PartOf, there is a conflict regarding the location of polarisome.
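The dissection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GONG implementation: the toy chemical hierarchy, the function names and the anchoring of the regular expression with ^ and $ are hypothetical additions, and the lookup at the end merely imitates what a Description Logic reasoner would infer by classification. Only the regular expression groups and the example terms come from the text.

```python
import re

# Regular expression from the text; the ^ and $ anchors are an assumption
# added here so that the whole term name must match.
BIOSYNTHESIS = re.compile(r"^(.+?) (biosynthesis)$")

# Hypothetical fragment of a chemical ontology (child -> parent).
CHEMICAL_IS_A = {"acetylcholine": "neurotransmitter"}


def dissect(term):
    """Return (function, chemical) for a matching term name, else None."""
    match = BIOSYNTHESIS.match(term)
    if match is None:
        return None
    chemical, function = match.groups()
    return function, chemical


def chemical_ancestors(chemical):
    """Walk the toy chemical hierarchy upwards, nearest ancestor first."""
    ancestors = []
    while chemical in CHEMICAL_IS_A:
        chemical = CHEMICAL_IS_A[chemical]
        ancestors.append(chemical)
    return ancestors


def inferred_parents(term, known_terms):
    """Suggest IsA parents: the same function acting on a more general chemical."""
    dissected = dissect(term)
    if dissected is None:
        return []
    function, chemical = dissected
    candidates = [f"{ancestor} {function}" for ancestor in chemical_ancestors(chemical)]
    # Only suggest parents that already exist as terms.
    return [candidate for candidate in candidates if candidate in known_terms]


go_terms = {"acetylcholine biosynthesis", "neurotransmitter biosynthesis"}
print(dissect("acetylcholine biosynthesis"))
print(inferred_parents("acetylcholine biosynthesis", go_terms))
```

In the real workflow the second step is not a dictionary lookup: the restriction (biosynthesis that acts on acetylcholine) is written in OWL DL and the subsumption is found by the reasoner.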
Around 10 percent of the new relationships suggested by the reasoner in the last GONG execution were accepted by the GO curators, showing the performance and utility of the workflow (see the author's MSc dissertation at http://gong.man.ac.uk/publications/).

Figure 3.2: Functional and chemical classification in metabolism for the term acetylcholine biosynthesis. The functional classification in the case of metabolism includes three elements: catabolism, metabolism and biosynthesis. Catabolism is included in the diagram for clarity, although it is not present in the GO subtree of the example. The chemical classification (simplified in the diagram) is more complex, depending on the term.

Biological Ontology Next Generation (BONG, http://gong.man.ac.uk/downloads/) is a Protégé plugin that gives biologists the chance to use the GONG workflow with any GO subtree and any OBO ontology. The BONG plugin can be used as an OBO to OWL DL converter, as a GO (MySQL) to OWL DL converter, as both, or as a GONG workflow. If it is used as a GONG workflow, the other two steps (OBO to OWL DL and GO -MySQL- to OWL DL) must be executed first and a GONG ontology must be loaded into Protégé. The GONG ontology describes the GONG workflow and the plugin reads it to perform the workflow.

The GONG ontology is the core of the plugin, as it describes the regular expressions that will dissect the GO terms and the semantics related to those regular expressions. It is necessary for performing the GONG workflow, but it is not necessary for the previous steps (converting OBO to OWL DL and GO -MySQL- to OWL DL). Users can define their own ontologies and send them to [email protected], to put them in the central repository (footnote 10), so that other users can use the new GONG workflows without having to create new GONG ontologies. An example of a GONG ontology, called gong_cell_diff_cell_type.owl, is provided, bundled with the plugin.
It dissects and improves the GO subtree cell differentiation using the OBO Cell Type ontology.

Footnote 10: http://gong.man.ac.uk

The most important sections of the ontology are:

• gong:Group: the subclasses of this class are used as convenience classes for filling the restrictions described in the Regexp classes (see below). Each class represents a group in the regular expressions. There must be as many classes as matching groups the regular expressions will have. For example, the regexp (.+?) (development) has two groups.

• gong:Map: this class describes the mapping between GO sub-terms (portions of terms) and OBO terms. For example, the GO term brown adipocyte differentiation has brown adipocyte as a portion of the term. brown adipocyte maps to the term brown_fat_cell of the OBO Cell Type ontology (once it has been transformed into OWL DL, as is the case here; otherwise the OBO term would be brown fat cell).

• gong:Regexp: in this class the regular expressions are expressed, in the form of gong:Regexp_n, where n is a number, starting with 1. The most specific regular expressions have the lowest numbers; the plugin will try to match the most specific ones first, and if there is no match, it will try the next one. Thus, (negative regulation of) (.+?) (development) should have a smaller number than (.+?) (development), as (.+?) (development) will catch all the terms that were caught by the other regular expression: once a term is caught, it is not checked against more regular expressions. The only annotation property that must be filled in the case of regular expression classes is gong:regexp_string_value, and it is used to describe the actual regular expression (e.g. (.+?) (development)). The OWL DL conditions that the term should have are defined using the equivalent class of the regular expression class. Two kinds of conditions can be defined: superclasses and restrictions. The superclasses should point to an already existing class, usually in the accessory ontologies (see below).
In the restrictions, the filler is usually either an already existing class (again, probably in the accessory ontology) or a group in the matching regular expression, that is, a portion of the matched term. For example, if the term adipocyte differentiation is matched by the regular expression (.+?) (differentiation) and the regular expression class restriction condition says gong:acts_on someValuesFrom Group_10, the resulting OWL DL class will have a restriction like gong:acts_on someValuesFrom fat_cell (the OBO Cell Type term, mapped and translated to OWL DL). In the matched term, the matched group 1 (the first group) is adipocyte, which corresponds to fat_cell.

• gong:Accesory_Ontology: under this class is any accessory ontology that will be used to semantically complement the GONG workflow. There is no format requirement, but it should match the fillers or superclasses described in the Regexp classes.

To use the plugin as a GONG workflow, the other two steps must be executed first. The GONG ontology must be included by importing it: it can be imported either from a URL (if the system is permanently connected: for instance the example ontology can be imported from the project website, footnote 11) or from a local file. After the ontology has been imported and the GONG workflow has been executed, the ontology should be classified with the reasoner. Some of the new subsumption relationships should make sense as new GO IsA or PartOf relationships, so in this way several hundred new legitimate relationships can be created automatically with minimum effort. If the user finds a suitable GONG ontology in the repository, the effort invested is small and the user can maintain GO subtrees of around a thousand terms automatically, easily taking advantage of the usefulness of an OWL DL approach.
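The matching order and the group substitution described above can be sketched as follows. This is a simplified illustration, not the BONG plugin code: the Python data structures, the function name, the ^ and $ anchors and the term negative regulation of adipocyte differentiation are hypothetical, and the mapping table contains only the two examples given in the text.

```python
import re

# Regexp classes in priority order: the lowest number is the most specific
# and is tried first. Each entry holds the class name, the pattern and the
# index of the matching group used as filler of the restriction.
REGEXPS = [
    ("gong:Regexp_1",
     re.compile(r"^(negative regulation of) (.+?) (differentiation)$"), 2),
    ("gong:Regexp_2",
     re.compile(r"^(.+?) (differentiation)$"), 1),
]

# Analogue of gong:Map: GO sub-term -> OBO Cell Type class name in OWL DL.
SUBTERM_MAP = {
    "adipocyte": "fat_cell",
    "brown adipocyte": "brown_fat_cell",
}


def build_restriction(term):
    """Return (regexp class, restriction) for the first pattern that matches."""
    for name, pattern, filler_group in REGEXPS:
        match = pattern.match(term)
        if match is None:
            continue
        subterm = match.group(filler_group)
        # The filler is None when the sub-term has no mapping in SUBTERM_MAP.
        filler = SUBTERM_MAP.get(subterm)
        # Once a term is caught it is not checked against further regexps.
        return name, ("gong:acts_on", "someValuesFrom", filler)
    return None


print(build_restriction("adipocyte differentiation"))
print(build_restriction("negative regulation of adipocyte differentiation"))
```

If the two entries of REGEXPS were swapped, the general pattern would catch the negative regulation term first and its first group would be negative regulation of adipocyte, which has no mapping; this is why the most specific expressions must carry the lowest numbers.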
Footnote 11: http://gong.man.ac.uk/gong_cell_diff_cell_type.owl

Chapter 4

Formalising knowledge in bio-ontologies: Ontology Design Patterns

In this chapter the concept of Ontology Design Patterns is explained, including their application and documentation. Section 4.1 explains the concept of Design Patterns and then explores the concept of Ontology Design Patterns. Section 4.2 explains the method that will be followed to document Ontology Design Patterns. The Ontology Design Patterns that have been explored and that will constitute the basis of this research are presented in section 4.3.

4.1 Introduction to Ontology Design Patterns

The concept of Software Design Patterns (SDPs) comes from Object Oriented Programming [GHJV95]. There are modelling problems that arise again and again when designing different programs. Each of the problems is common to different systems and hence the modelling solution for each problem can be described in a generic manner, suitable for different implementations; the solution is called a Design Pattern (Software Design Pattern -SDP-). Thus SDPs are very general methods of solving modelling issues that have been proven to be efficient many times and have therefore become established in an abstract form. There are anti-patterns as well: potential pitfalls that should be avoided when designing and developing a program.

Ontology Design Patterns (ODPs) are the application of the same concept to the creation of ontologies. Thus ODPs are abstract modelling solutions to known problems in ontology engineering. Some ODPs can be found on the web site of the Semantic Web Best Practices and Deployment Working Group (footnote 1). ODPs improve ontological modelling in different ways:

• ODPs are abstractions. ODPs provide biologists with an easy way of dealing with the complexity of OWL DL, making ontology creation a faster and more reliable process. Biologists working on the creation of bio-ontologies prefer the complexity of the language they are using to be as hidden as possible [Har].
• ODPs can be made computationally explicit. ODPs allow for automatic building of sectors of an ontology that are complex, making ontology building easier for non-experts. The user can be guided step by step in the application of the ODP. For example the Protégé wizards plugin (footnote 2) gives the user the possibility of automatically creating some ODPs like Value Partitions, RDF lists and N-ary Relationships.

• ODPs provide a neat way of producing more modular and robust ontologies. By using ODPs the entities and the structure of the ontology can be explicitly separated [CTP04].

• The use of ODPs improves communication between ontology developers. The developers can easily recognise the different features of the ontology produced by the ODP, as the ODP represents a well known and easy to understand abstraction.

• ODPs produce more expressive ontologies. ODPs allow for a more fine-grained modelling of the knowledge domain.

• By using ODPs the potential of reasoning can be exploited in more efficient ways. The expressivity needed for efficient and productive reasoning is reached more easily using ODPs.

Research on ODPs is very recent [Dev02], and therefore there is no established strict definition of ODPs, apart from the one given here, adapted from Object Oriented programming. In this research two ODPs are classified as ODPs when another possible classification would be best practices: Normalisation (section 4.3.2.1) and Upper Level Ontology (section 4.3.2.3). They are included as ODPs because it would be rather arbitrary not to do so: they are ontological structures, like other ODPs.
The only difference is the breadth of their aim: Normalisation is a way of building better ontologies in its own right (rather than being a means of improving concrete parts of the ontology as other ODPs) and Upper Level Ontology provides a way of integrating ontologies (rather than only improving the modelling in a concrete ontology as other ODPs).

Footnote 1: http://www.w3.org/2001/sw/BestPractices/
Footnote 2: http://www.co-ode.org/downloads/wizard/co-ode-index.php

4.2 Documenting Ontology Design Patterns

There is a difference between SDPs and ODPs when modelling them: in SDPs there is a metalanguage to describe the SDP (for example UML, footnote 3) whereas in ODPs there is not. The SDPs are described with UML in a generic manner, and then the instances of the SDP are applied in the programming language of choice. The description of the SDP (in UML) is different from the implemented instance (in the chosen programming language). There is no metalanguage for describing ODPs, and as a consequence they are described using instances: the model, rather than being a generic structure as in SDPs, is an instance that implicitly describes the generic structure. Another difference is that whereas SDP models express some kind of timing (messages are sent between objects, there are phases within the SDP, etc.) the ODP models are completely static.

4.2.1 Description template of Software Design Patterns

There is no community-accepted guideline for documenting ODPs and no explicit attempts have been made to solve the problem. In Object Oriented programming there is a commonly used format for representing SDPs that usually includes the following information sections for each SDP [GHJV95]:

Name and classification: each SDP has a unique name and the SDP is usually classified by the problem it solves. It can be classified as: fundamental Design Pattern, creational Design Pattern, structural Design Pattern, behavioural Design Pattern, concurrency Design Pattern.
Intent: the reason for using the SDP, the problem the SDP solves.

Also known as: another name that the SDP could have.

Motivation: a possible context of the problem where the SDP can be used.

Applicability: in which situation the SDP is usable.

Structure: graphical representation of the SDP, usually in UML.

Participants: the elements (objects, classes, packages, interfaces) that make up the SDP.

Collaborations: the interactions between the elements of the SDP.

Consequences: the results, consequences and trade-offs of applying the SDP.

Implementation: how the SDP can be built in a real situation to solve the problem.

Sample code: the source code of a program that implements the SDP.

Known uses: real implementations of the SDP.

Related patterns: SDPs with a similar or related function.

Footnote 3: http://www.uml.org/

4.2.2 Description template of Ontology Design Patterns

A scheme similar to the one used for SDPs in section 4.2.1 will be used to describe the ODPs in this research, as there is no prior guideline. Most of the sections can be reused when describing ODPs without major problems, but some sections need a deeper analysis and some sections are added:

Name and classification: the classification of ODPs used for this research is based on their general usage rather than on the problem they are intended to solve:

• Extensional ODPs: ODPs that extend the limits of OWL DL. OWL DL has limitations as a Knowledge Representation (KR) language. Some ODPs can be used to overcome those limitations and present a suitable representation of the knowledge domain that is to be captured.

• Good practice ODPs: ODPs that are used to ensure good modelling practice. These ODPs are used to produce more modular, efficient and maintainable ontologies, tackling already known pitfalls of ontology engineering.

• Modelling ODPs: ODPs that are used to model a concrete part of the knowledge domain.
They can be defined as signature ODPs or idioms: each knowledge domain has its peculiarities and these ODPs are used to model those peculiarities. For example, biological knowledge differs from other domains in that the development of things is very important, there is symmetry, there are different levels of complexity interacting with each other, there are emergent properties, etc. (see section 5.2.3.1).

The first two types are common to all ontologies. The ODPs of the third type are more specific to the knowledge domain (in this case biology), but they can also be used in other domains.

Intent: similar to Object Oriented programming SDPs.
Also known as: similar to Object Oriented programming SDPs.
Motivation: similar to Object Oriented programming SDPs.
Applicability: similar to Object Oriented programming SDPs.
Structure: in Object Oriented programming UML is used for this task. There is no analogue of UML for OWL DL: no graphical representation captures all the semantics of an ODP. There are different approaches to the problem:
• The OWLViz Protégé plugin (footnote 4). It is very useful for simple class-subclass hierarchies.
• GrOWL (Graphical OWL, footnote 5). It is not widely used.
• Diagrams like the ones used on the W3C Best Practices web site (footnote 6). Although they are very simple, they mix property characteristics with relationships in the graph.
• OWL-ed UML diagrams. UML can be used to express OWL DL, as pointed out in [BVEL04]. UML has the advantage of being a graphical paradigm that is already widely used: there are plenty of tools and large communities already familiar with the format, so, for example, biologists who already have notions of Object Oriented programming may find ODPs expressed in UML easier to understand. There is extensive literature on the relation between OWL and UML [HEC+04, FSS03].
• UML diagrams. UML can be used to describe very general ODPs, without considering the semantics of the target KR language, as shown in [GCB04].
The choice for this research is to use OWLViz for the general subsumption structure of the ODP, and OWL-ed UML diagrams for the most important details of the ODP, following the OWL to UML mapping described in [BVEL04] and summarised in Figure 4.1.

Footnote 4: http://www.co-ode.org/downloads/owlviz/co-ode-index.php
Footnote 5: http://www.uvm.edu/~skrivov/growl/
Footnote 6: http://www.w3.org/TR/swbp-n-aryRelations/

Figure 4.1: Simple mapping of OWL to UML. The OWL expressions are described in the left column and the respective UML diagrams in the right column. Not all the possible OWL expressions are included.

Participants: the list of classes or class groups used in the ODP. This section will be called Elements instead of Participants.
Collaboration: the relationships linking the classes. Only the most important ones are described. This section will be called Relationships instead of Collaboration.
Consequences: similar to Object Oriented programming SDPs.
Implementation: similar to Object Oriented programming SDPs.
Sample code: the information of this section is presented in three different ways:
• An OWL DL ontology with the whole ODP, available via URL.
• The most important parts of the ODP described using Description Logics notation.
• The most important parts of the ODP described using the Manchester abstract OWL syntax (footnote 7).
Known uses: similar to Object Oriented programming SDPs.
Related ODPs: similar to Object Oriented programming SDPs.
References: this section is added to list publications or web pages where the ODP originated and where it can be found, for example to be imported and included in an ontology or to be properly referenced.
Additional information: this section is added for complementary information that does not fit in any of the previous sections. For example, any information regarding the origin or history of the ODP can be added here.
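The template above can be pictured as a simple record type. The following is a minimal sketch in Python, not part of the report's tooling: the class name ODPDescription, the attribute names and the render method are assumptions made for illustration. The default value "none" follows the convention this report uses for sections without information, and the three allowed classifications are the ones listed under Name and classification.

```python
from dataclasses import dataclass

# The three ODP types of section 4.2.2.
CLASSIFICATIONS = ("Extensional", "Good practice", "Modelling")

# (attribute name, printed label) for every section after "Name and classification".
SECTIONS = [
    ("intent", "Intent"),
    ("also_known_as", "Also known as"),
    ("motivation", "Motivation"),
    ("applicability", "Applicability"),
    ("structure", "Structure"),
    ("elements", "Elements"),
    ("relationships", "Relationships"),
    ("consequences", "Consequences"),
    ("implementation", "Implementation"),
    ("sample_code", "Sample code"),
    ("known_uses", "Known uses"),
    ("related_odps", "Related ODPs"),
    ("references", "References"),
    ("additional_information", "Additional information"),
]


@dataclass
class ODPDescription:
    """One catalogue entry; names here are illustrative assumptions."""
    name: str
    classification: str
    intent: str = "none"
    also_known_as: str = "none"
    motivation: str = "none"
    applicability: str = "none"
    structure: str = "none"
    elements: str = "none"
    relationships: str = "none"
    consequences: str = "none"
    implementation: str = "none"
    sample_code: str = "none"
    known_uses: str = "none"
    related_odps: str = "none"
    references: str = "none"
    additional_information: str = "none"

    def __post_init__(self):
        # Reject anything outside the three ODP types.
        if self.classification not in CLASSIFICATIONS:
            raise ValueError(f"unknown classification: {self.classification!r}")

    def render(self) -> str:
        """Print every section in template order, so entries stay comparable."""
        lines = [f"Name and classification: {self.name}, {self.classification}."]
        for attr, label in SECTIONS:
            lines.append(f"{label}: {getattr(self, attr)}")
        return "\n".join(lines)


entry = ODPDescription(
    name="N-ary Relationships",
    classification="Extensional",
    also_known_as="Relationships of higher arity",
)
print(entry.render())
```

Keeping every section in every entry, even when empty, is what makes entries directly comparable across the catalogue.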
4.3 Ontology Design Patterns explored so far

The aim of this section is not to exhaustively explore and assess all the ODPs that can be applied to bio-ontologies. The aim is to give some examples of the ODPs that will be explored during the research and of how they will be described. Nonetheless, some of the ODPs already explored promise to solve important problems in the creation and maintenance of biological ontologies. Some of the ODPs are still very experimental; their potential implementation problems and trade-offs have not been completely explored. All the sections mentioned in the description template of section 4.2.2 will be kept for consistency between ODPs; if there is no information for a given section the word none will be used. In Structure and Sample code, if any of the choices (OWLViz, UML, DL notation or Manchester abstract syntax) is not suitable it will simply not be included; for example, it is redundant to use UML to model the subsumption hierarchy in the case of Upper Level Ontology, so the UML graph is left out of the documentation.

Footnote 7: http://www.co-ode.org/resources/reference/manchester_syntax/

4.3.1 Extensional ODPs

4.3.1.1 N-ary Relationships

Name and classification: N-ary Relationships, Extensional.
Intent: to model complex phenomena that have relationships linking more than two elements.
Also known as: Relationships of higher arity.
Motivation: the biomedical domain is full of situations where a relationship should hold between more than two elements, but OWL DL only allows properties that link two individuals at a time. There can also be situations where a relationship and some properties of that relationship must be modelled; that cannot be done directly in OWL DL. For example, a diagnosis has a result, a probability, and the person who has been diagnosed. A catalytic reaction has a substrate, some products and catalytic constants, and it is catalysed by an enzyme.
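As a loose illustration of the reification idea behind this ODP, the sketch below (plain Python, not OWL) turns the diagnosis example into one object that is linked to each participant by a binary property. The class name Diagnosis, the property names hasPatient, hasResult and hasProbability, and the sample values are assumptions made for this example only; they are not taken from any ontology discussed in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    """Reified relationship: one object stands for the whole diagnosis."""
    patient: str
    result: str
    probability: float


def as_binary_links(name, diagnosis):
    """Decompose the reified relationship into (subject, property, value) links.

    Every link is binary, which is all that OWL DL properties can express;
    the shared subject ties the three links back into one relationship.
    """
    return [
        (name, "hasPatient", diagnosis.patient),          # hypothetical property
        (name, "hasResult", diagnosis.result),            # hypothetical property
        (name, "hasProbability", diagnosis.probability),  # hypothetical property
    ]


links = as_binary_links("diagnosis_1", Diagnosis("patient_1", "result_1", 0.8))
for link in links:
    print(link)
```

The same decomposition applies to the catalytic reaction: the reaction becomes the shared subject, and substrate, products, constants and enzyme each get their own binary link.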
Applicability: any ontology whose KR language cannot link more than two individuals in the same relationship. A GO example can be found in the term Golgi to plasma membrane CFTR protein transport GO:0043000: there is a transport phenomenon that relates three elements at the same time: the start (Golgi), the end (plasma membrane) and the transportee (CFTR protein). The transport relation cannot be modelled in OWL DL as a single property pointing to the three elements, so this ODP must be applied.
Elements: the original elements of the N-ary Relationship are kept as classes and a new class is reified to model the N-ary Relationship, in this case a class called CFTRGolgiToPlasmaTransport.
Relationships: a relationship from the reified class to each element is created: transports from, transports to and transports.
Structure: details of the reified class definition: [figure]
Consequences: the N-ary Relationship of the knowledge domain is explicitly stated in the ontology.
Implementation: the only important step is to identify the new class (the reified class) that will hold the N-ary Relationship.
Sample code:
• The whole ODP as an OWL DL ontology is available at: http://gong.man.ac.uk/owl/CFTR.owl
• DL notation of the reified class definition:
  CFTRGolgiToPlasmaTransport ⊑ ∃ transports to Plasma membrane
  CFTRGolgiToPlasmaTransport ⊑ ∃ transports CFTRProtein
  CFTRGolgiToPlasmaTransport ⊑ ∃ transports from Golgi
• Manchester abstract OWL syntax notation of the reified class definition:
  class CFTRGolgiToPlasmaTransport partial
    transports to SOME Plasma membrane AND
    transports SOME CFTRProtein AND
    transports from SOME Golgi
Known uses: none.
Related ODPs: none.
References:
• http://www.w3.org/TR/swbp-n-aryRelations
• http://gong.man.ac.uk/ontologydesignpatterns/
• http://www.co-ode.org/resources/tutorials/bio/
• See [SAW+05].
Additional information: none.

4.3.1.2 Exception

Name and classification: Exception, Extensional.
Intent: to model exceptions, that is, classes that break canonical classifications.
Also known as: none.
Motivation: many areas of knowledge work with defaults or canonical knowledge: biological classifications, for example, state the canonical norm and then classify the exceptions under the norm, even if the classification is inconsistent from the logical point of view. A clear example can be found in the classification of cells [ABL+89]: in canonical biology eukaryotic cells are considered to be cells with a nucleus. Mammalian red blood cells are considered by any biologist to be eukaryotic cells, but they lack a nucleus. Thus they are a subclass of eukaryotic cells, but they break the condition for belonging to that class (having a nucleus).
Applicability: any ontology that has to deal with knowledge based on canonical norms and exceptions and is built in a KR language that does not handle exceptions directly. OWL DL, like other DL based languages [HdCD+05, RWRR01], does not allow exceptions. In a cell classification ontology the class MammalianRedBloodCell (with the restriction hasNucleus = 0) would be a subclass of EukaryoticCell (with the restriction hasNucleus = 1), resulting in an inconsistent ontology. There can be exceptions to the exception at the next level: avian red blood cells do possess a nucleus, and thus they are considered normal eukaryotic cells (they are an exception to the norm that all red blood cells lack a nucleus). So the problem can arise at different levels.
Elements: the most important elements are the newly created Typical (TypicalEukaryoticCell, TypicalRedBloodCell) and Atypical (AtypicalEukaryoticCell, AtypicalRedBloodCell) classes. The rest of the classes are kept.
Relationships: the most important property is the discriminating property, in this case hasNucleus.
Structure:
• Subsumption hierarchy before reasoning (darker ovals are defined classes): [figure]
• Subsumption hierarchy after reasoning: [figure]
• Details of the Typical/Atypical structure: [figure]
Consequences: if the ODP is used at many different levels of the ontology it can produce ontologies that are too complex and unmanageable. This type of structure can be very counterintuitive for biologists.
Implementation:
• Starting from the example ontology described in Applicability, two disjoint classes are created for typical and atypical elements.
• The discriminating condition (hasNucleus) is only stated in the typical subclass.
• A covering axiom is added to the main class (i.e. EukaryoticCell) to state that all instances must belong to one or the other subclass (i.e. TypicalEukaryoticCell or AtypicalEukaryoticCell). A covering axiom is made by creating an equivalent class (a necessary and sufficient condition) that is the union of the subclasses (in this case TypicalEukaryoticCell and AtypicalEukaryoticCell).
• The reasoner will infer the whole structure.
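The effect of the steps above can be illustrated with a toy check in Python. This is a deliberately simplified sketch and not a DL reasoner: it only looks for classes that inherit two different exact values of hasNucleus along asserted subclass links. The function names are assumptions made for this example, and MammalianRedBloodCell is placed under AtypicalEukaryoticCell by hand here, whereas in the ODP that placement is left to the reasoner.

```python
def stated_cardinalities(cls, parents, has_nucleus):
    """Collect every hasNucleus value stated on cls or on any of its ancestors."""
    found, stack, seen = set(), [cls], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in has_nucleus:
            found.add(has_nucleus[current])
        stack.extend(parents.get(current, ()))
    return found


def conflicting_classes(parents, has_nucleus):
    """Classes that inherit two different exact values for hasNucleus."""
    classes = set(parents) | set(has_nucleus)
    return sorted(c for c in classes
                  if len(stated_cardinalities(c, parents, has_nucleus)) > 1)


# Naive modelling: the condition sits on EukaryoticCell itself.
naive_parents = {"MammalianRedBloodCell": ["EukaryoticCell"]}
naive_nucleus = {"EukaryoticCell": 1, "MammalianRedBloodCell": 0}

# Exception ODP: the condition is stated only on the typical subclass.
odp_parents = {
    "TypicalEukaryoticCell": ["EukaryoticCell"],
    "AtypicalEukaryoticCell": ["EukaryoticCell"],
    "MammalianRedBloodCell": ["AtypicalEukaryoticCell"],  # asserted by hand here
}
odp_nucleus = {"TypicalEukaryoticCell": 1, "MammalianRedBloodCell": 0}

print(conflicting_classes(naive_parents, naive_nucleus))
print(conflicting_classes(odp_parents, odp_nucleus))
```

In the naive version MammalianRedBloodCell inherits both values and is reported; once the condition is moved to the typical subclass, no class inherits conflicting values.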
Sample code:
• The whole ODP as an OWL DL ontology is available at: http://gong.man.ac.uk/owl/eukariotic.owl
• DL notation of the Typical/Atypical structure:
  AtypicalRedBloodCell ≡ (= 1 hasNucleus) ⊓ RedBloodCell
  RedBloodCell ⊑ EukaryoticCell
  RedBloodCell ⊑ TypicalRedBloodCell ⊔ AtypicalRedBloodCell
  TypicalRedBloodCell ⊑ RedBloodCell
  AvianRedBloodCell ⊑ = 1 hasNucleus
  AvianRedBloodCell ⊑ RedBloodCell
  MammalianRedBloodCell ⊑ = 0 hasNucleus
  MammalianRedBloodCell ⊑ RedBloodCell
• Manchester abstract OWL syntax notation of the Typical/Atypical structure:
  class AtypicalRedBloodCell complete RedBloodCell AND hasNucleus EXACTLY 1
  class RedBloodCell partial EukaryoticCell AND (TypicalRedBloodCell OR AtypicalRedBloodCell)
  class RedBloodCell partial RedBloodCell
  class TypicalRedBloodCell partial RedBloodCell
  class AvianRedBloodCell partial RedBloodCell AND hasNucleus EXACTLY 1
  class MammalianRedBloodCell partial RedBloodCell AND hasNucleus EXACTLY 0
Known uses: none.
Related ODPs: none.
References:
• http://gong.man.ac.uk/ontologydesignpatterns/
• http://www.co-ode.org/resources/tutorials/bio/
• See [SAW+05].
Additional information: in the case of GO, the ODP could be applied to virion GO:0019012, which is not a cellular component GO:0005575 even though it is classified as such.

4.3.2 Good practice ODPs

4.3.2.1 Normalisation

Name and classification: Normalisation, Good Practice.
Intent: to build modular and reusable ontologies where the majority of subsumption relationships are maintained by the reasoner, rather than hard-coded by the ontology maintainer.
Also known as: Untangling, Modularisation.
Motivation: there are ontologies where a given class can have many superclasses, building a structure that is called a polyhierarchy.
If all those subsumption relationships are directly stated by the ontology maintainer two main problems arise:
• The ontology becomes very difficult to maintain: whenever a subsumption must be deleted (because a class has changed) or created (because a new class has been created) it has to be done by hand; in a polyhierarchy the process becomes very inefficient and error-prone.
• The semantics are stated implicitly, not explicitly: any other ontologist or reasoner only knows that a class is a subclass of its superclasses, without knowing why.
The application example for this ODP is adapted from the Cell Type Ontology. In the example, the subsumption relationships that already exist in the Cell Type Ontology are inferred by the reasoner. The term neutrophil CL:0000096 is used as an example class to show how a class can relate to different modules.
Applicability: any OWL DL ontology that consists of a polyhierarchy in which some semantic axes can be identified: each of those axes will be a module.
Elements: the original classes of the ontology are divided into different axes.
Relationships: the conditions for each subsumption relationship are encoded as properties that relate the different modules.
Structure: the basis of the ODP is that each primitive class should have only one primitive parent, and primitive sibling classes should be disjoint, creating the modules.
• Subsumption hierarchy of the normalised ontology before reasoning: [figure]
• Subsumption hierarchy of the normalised ontology after reasoning (the polyhierarchy is built by the reasoner): [figure]
• Details of the class neutrophil: [figure]
Consequences: the ontology gets untangled and becomes a collection of neat modules. The rest of the semantics are given by restrictions pointing to the modules.
Implementation: the implementation is done in the following steps:
• Identify the modules: group the classes.
• Create the modules, keeping only one parent for any given primitive class and making primitive siblings disjoint.
• Redefine the classes (or define the newly added classes) according to the conditions for belonging to each module.
Sample code:
• The whole ODP as an OWL DL ontology is available at: http://gong.man.ac.uk/owl/CellType.owl
• DL notation of the class neutrophil and directly related classes:
  neutrophil ⊑ ∃ has function circulation
  neutrophil ⊑ ∃ has function cell motility
  neutrophil ⊑ ∃ has function stuff accumulation
  neutrophil ⊑ ∃ has function defense
  neutrophil ⊑ animal cell
  circulation ⊑ biological function
  animal cell ⊑ eukaryotic cell
  animal cell ⊑ ¬ plant cell
  cell ⊑ biological structure
  eukaryotic cell ⊑ cell
  circulating cell ⊑ cell
  circulating cell ≡ ∃ has function circulation
• Manchester abstract OWL syntax notation of class neutrophil and directly related classes:
  class neutrophil partial animal cell AND has function SOME [circulation, cell motility, stuff accumulation, defense]
  class circulation partial biological function
  class animal cell partial eukaryotic cell AND NOT plant cell
  class cell partial biological structure
  class eukaryotic cell partial cell
  class circulating cell partial cell
  class circulating cell complete has function SOME circulation
Known uses: openGALEN (footnote 8).
Related ODPs: Value Partition, Upper Level Ontology.
References:
• See [RWRR01, Rec03, SK04b, Hor04].
• http://www.w3.org/TR/owl-guide
• http://www.co-ode.org/resources/tutorials/bio/
• http://gong.man.ac.uk/ontologydesignpatterns/
Additional information: Protégé has two wizards (footnote 9) that facilitate the creation of this ODP:
• The Value Partition wizard allows for the creation of Value Partitions: the conditions for class membership can be restrictions that point to the Value Partition.
• Restriction matrix: it allows for quickly creating existential restrictions in several classes at the same time.

Footnote 8: http://www.opengalen.org

4.3.2.2 Value Partition

Name and classification: Value Partition, Good Practice.
Intent: to model attributes of objects that can only take certain already known values.
Also known as: Enumeration, if it is built using individuals instead of classes.
Motivation: reality is full of attributes of elements. For example, a person can be described as short, medium or tall, and the attribute height can only take those values. Height is said to be covered or exhausted by those values; the possible heights are only those three. Biology is full of such situations: metabolism can only be anabolism or catabolism, membrane transport can only be uniport, symport or antiport, regulation is always positive or negative, and so forth. The example evaluated here is the remodelling of the GO term regulation of cell killing GO:0031341 with its two subclasses, positive regulation of cell killing GO:0031343 and negative regulation of cell killing GO:0031342.
Applicability: any KR language that allows for covering axioms and any knowledge domain with attributes that can only take certain values.
Elements: the main elements are the classes that make up the Value Partition itself: a class for the attribute and subclasses for the values. In this case, RegulationType, and positive and negative, respectively.
Relationships: the most important relationship is the one that links each element of the knowledge domain with the values of the Value Partition. In this case, is regulation of type.
Structure:
• Subsumption hierarchy of the Value Partition and the classes that are defined using the Value Partition: [figure]
• Details of the Value Partition and the class positive regulation of cell killing: [figure]

Footnote 9: http://www.co-ode.org/downloads/wizard/index.php

Consequences: the attributes and the elements that are described or modified by the attributes get untangled: whenever a new element enters the domain (e.g. another regulation phenomenon) it is only a matter of adding a restriction pointing to the pertinent Value Partition class.
The values that can be given to a certain attribute are constrained, enforcing better modelling.
Implementation: the implementation is done in the following steps:
• Identify the attributes every element must be described with.
• For each attribute, create a class under Modifier (or the pertinent upper level distinction used in the ontology).
• In each attribute class create a subclass for every value.
• Create a covering axiom defining the attribute class.
• Create the restrictions pointing to the values of the Value Partition.
Sample code:
• The whole ODP as an OWL DL ontology is available at: http://gong.man.ac.uk/owl/regulation.owl
• DL notation of the Value Partition and the class positive regulation of cell killing:
  positive ⊑ RegulationType
  RegulationType ≡ positive ⊔ negative
  positive regulation of cell killing ⊑ ∃ is regulation of type positive
• Manchester abstract OWL syntax notation of the Value Partition and the class positive regulation of cell killing:
  class RegulationType complete positive OR negative
  class positive regulation of cell killing partial is regulation of type SOME positive
Known uses: none.
Related ODPs: Value Partition is related to Normalisation and Upper Level Ontology. In Normalisation, Value Partitions can be used as fillers for the restrictions that build the normalised modules. As Value Partitions are not elements of the knowledge domain in their own right, they are usually put under the class modifiers (or its analogue) in an Upper Level Ontology.
References:
• http://www.w3.org/TR/swbp-specified-values
• http://www.co-ode.org/resources/tutorials/bio/
• http://gong.man.ac.uk/ontologydesignpatterns/
Additional information: the Value Partition wizard (footnote 10) in Protégé allows for quick and easy creation of several Value Partitions.
The Value Partition built with classes offers an advantage over the Enumeration (a Value Partition built with individuals): new subpartitions can be built for each of the value classes (e.g. very tall).

Footnote 10: http://www.co-ode.org/downloads/wizard/index.php

4.3.2.3 Upper Level Ontology

Name and classification: Upper Level Ontology, Good Practice.
Intent: to create an ontology into which different ontologies can be integrated.
Also known as: foundational ontology.
Motivation: different ontologies of a given domain share very general types of concepts, like substance, modifier, etc. These types of concepts are grounded in philosophical criteria, like endurants and perdurants. The different domain ontologies can thus be integrated in one Upper Level Ontology, each ontology having different relationships pointing to the concepts of the Upper Level Ontology. The Upper Level Ontology used here as an example is the Ontology of Biomedical Reality (OBR) [RKM+05].
Applicability: any KR language that supports subsumption relationships and disjoints.
Elements: all the classes are important (see Structure).
Relationships: only subsumption relationships are used.
Structure: subsumption hierarchy of OBR: [figure]
Consequences: by committing to a given Upper Level Ontology when building a domain ontology, the ontologist makes the integration of the ontology with other ontologies a much easier process. However, the ontology is then committed to a particular view of the domain, and therefore the use and adoption of Upper Level Ontologies is very controversial.
Implementation: the different hierarchies of primitive classes must be asserted using disjoints.
Sample code: the whole ODP as an OWL DL ontology is available at: http://gong.man.ac.uk/owl/OBR.owl
Known uses: openGALEN (footnote 11).

Footnote 11: http://www.opengalen.org

Related ODPs: Normalisation, Value Partition.
References:
• See [RKM+05].
• http://www.co-ode.org/resources/tutorials/bio/
• http://gong.man.ac.uk/ontologydesignpatterns/
Additional information: there is extensive literature on Upper Level Ontologies, and there are different Upper Level Ontologies with different properties [BGG+02, GSG04, RR04]. A related attempt to unify different ontologies is the use of formalised foundational relationships [SCK+05, SR04].

4.3.3 Modelling ODPs

4.3.3.1 List

Name and classification: List, Modelling.
Intent: to model ordered groups of elements.
Also known as: Linked List.
Motivation: an ordered group of elements is a very intuitive modelling structure, yet the semantics of such a construct in OWL DL are complex. Biology is full of structures where the order of the elements is vital, either in time (e.g. phases of processes) or in space (e.g. parts of genes). If that order is altered (e.g. a change in the order of introns and exons in a gene) there can be serious damage to biological systems. In this case the ODP will be used to build a gene starting from some elements of the Sequence Ontology [KSC+05]: promoter SO:0000167, terminator SO:0000141, intron SO:0000188 and exon SO:0000147. For the sake of clarity a minimalist gene is built, with a very simple structure.
Applicability: any KR language that allows the use of subproperties, functional properties, transitive properties, intersections and unions.
Elements: the most important elements are the different classes that can be used to build the List (promoter, terminator, intron and exon) and the class that is modelled using the List (in this case gene).
Relationships: the relationships needed are contents (functional), rest (transitive) and next (functional and a subproperty of rest).
Structure: details of the gene class (a list formed in the following order: Promoter, Exon, Intron, Exon, Terminator): [figure]
Consequences: if very long and complex lists are used there can be a decrease in reasoning performance.
Implementation: there is a Protégé wizard for creating lists.
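The nested definition that such a List requires can also be produced mechanically from an ordered sequence of element names. The sketch below is an illustration in Python and makes no claim about how the Protégé wizard works: the function name list_expression and its parameters are assumptions, and the output only mimics the AND / SOME keyword style used in this report, using the contents and next properties and the GeneStructure and EmptyList classes of the gene example.

```python
def list_expression(items, list_class="GeneStructure", empty="EmptyList"):
    """Build the nested contents/next restriction for an ordered list of classes.

    The expression is assembled from the inside out: the innermost cell holds
    the empty-list marker, and each earlier item wraps everything after it
    in a `next SOME (...)` restriction.
    """
    expression = f"{list_class} AND (contents SOME {empty})"
    for item in reversed(items):
        expression = (
            f"{list_class} AND (contents SOME {item}) "
            f"AND (next SOME ({expression}))"
        )
    return expression


# The minimalist gene of this section: Promoter, Exon, Intron, Exon, Terminator.
gene = list_expression(["Promoter", "Exon", "Intron", "Exon", "Terminator"])
print("class Gene partial " + gene)
```

Swapping two names in the input list yields a different nested expression, which is how the order of the elements ends up encoded in the class definition.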
Sample code:
• The whole ODP as an OWL DL ontology is available at: http://gong.man.ac.uk/owl/Genes.owl
• DL notation of the Gene and EmptyList classes:
  Gene ⊑ GeneStructure
  Gene ⊑ ∃ contents Promoter
  Gene ⊑ ∃ next (GeneStructure ⊓ (∃ contents Exon) ⊓ (∃ next (GeneStructure ⊓ (∃ contents Intron) ⊓ (∃ next (GeneStructure ⊓ (∃ contents Exon) ⊓ (∃ next (GeneStructure ⊓ (∃ contents Terminator) ⊓ (∃ next (GeneStructure ⊓ (∃ contents EmptyList))))))))))
  EmptyList ≡ (≤ 0 next) ⊓ (≤ 0 contents) ⊓ GeneStructure
• Manchester abstract OWL syntax notation of the Gene and EmptyList classes:
  class Gene partial GeneStructure AND contents SOME Promoter AND next SOME
    (GeneStructure AND (contents SOME Exon) AND (next SOME
    (GeneStructure AND (contents SOME Intron) AND (next SOME
    (GeneStructure AND (contents SOME Exon) AND (next SOME
    (GeneStructure AND (contents SOME Terminator) AND (next SOME
    (GeneStructure AND (contents SOME EmptyList))))))))))
  class EmptyList complete GeneStructure AND next MAX 0 AND contents MAX 0
Known uses: experimental modelling of protein Fingerprints [DMS].
Related ODPs: none.
References:
• http://www.co-ode.org/resources/tutorials/bio/
• http://gong.man.ac.uk/ontologydesignpatterns/
Additional information: the Linked List is one of the oldest and most widely used data structures in computer science (footnote 12); many programming languages offer primitives similar to it. The Circularly Linked List is a List whose end links back to its own beginning, creating a circle. The application of the Circularly Linked List in OWL DL has not been investigated yet.

Footnote 12: http://en.wikipedia.org/wiki/Linked_list

Apart from being an efficient way of modelling ordered elements, Lists offer the possibility of creating a powerful classifying system: many kinds of Lists can be defined (e.g. definitions of the type "any List containing elements A and B, not followed by C and then followed by two D-s")
and they will be put in the correct position in the hierarchy of already defined lists. Using that procedure, for example, different protein fingerprints (lists of regular expressions) or different kinds of genes can be defined. The models can be queried, for example, with a given gene defined by a certain ordered combination of introns, exons, promoter and terminator, to see in which position of the hierarchy it is classified and to which genes it relates (footnote 13). For example, a query of the type "any gene with two successive exons" would be written in DL notation as follows:
  AnyGeneSuccesiveExons ≡ GeneStructure ⊓ (((∃ contents Exon) ⊓ (∃ next (GeneStructure ⊓ (∃ contents Exon)))) ⊔ (∃ rest (GeneStructure ⊓ (∃ contents Exon) ⊓ (∃ next (GeneStructure ⊓ (∃ contents Exon))))))

4.3.3.2 Adapted SEP triples

Name and classification: Adapted SEP triples, Modelling.
Intent: propagation of properties along the partonomy relation.
Also known as: Propagator.
Motivation: in the biomedical domain the propagation of properties along the partonomy relation is very important. For example, there are cases where a fault of the part should be assumed to be a fault of the whole (an appendix perforation is an intestine perforation) and other cases where it should not (appendicitis is not enteritis). The problem of propagating properties along partonomy relates directly to the problem of overloading part of in GO: for example location, a property that should propagate (or not) with part of, is always implicitly present wherever there is a part of relation. As explained in section 3.1.3, polarisome is part of cell cortex and part of site of polarized growth, inheriting both locations and creating a conflict: polarisome is not located in the whole of the cell cortex, it is only located in the part of the cell cortex at the site of polarised growth. This ODP gives an example of how to solve that problem, using a technique originally described in [SR05].
Footnote 13: In OWL, defined classes can be seen as queries made against the ontology; once classified, the subclasses of a defined class are the answers to the query.

Applicability: any KR language with transitive properties and a knowledge domain that needs propagation along transitive properties. OWL DL does not have an explicit idiom for that requirement, like the propagates via construct of GRAIL [RR04]. However, the same effect can be achieved using another structure.
Elements: the elements of the partonomy hierarchy are kept, and in this case two new elements are added to represent concrete locations in the cell (cellular location pole of growth and cellular location periphery).
Relationships: the partOf relationship is kept (defined as transitive), and in this case a new property, cellularLocationOf, is added to link locations with cellular components.
Structure: detailed outline of all the classes of the ODP and their relationships: [figure]
Consequences: the location property cellularLocationOf is propagated along partOf in a selective way, allowing for a precise and unambiguous definition of the polarisome location.
Implementation: the most important step is to define the class cellular location pole of growth as the location of site of polarized growth or any of its parts, so the location is propagated to the parts (but it is not propagated in the case of cell cortex).
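The selective propagation described above can be mimicked outside OWL with a small Python sketch. It is only an analogy of what the reasoner does with the transitive partOf property: the function names all_wholes and location_covers are assumptions made for this example, and the entry marked as hypothetical is an invented sub-part added solely to exercise transitivity.

```python
def all_wholes(cls, part_of):
    """Every class reachable from cls by following partOf transitively."""
    found, stack = set(), list(part_of.get(cls, ()))
    while stack:
        whole = stack.pop()
        if whole not in found:
            found.add(whole)
            stack.extend(part_of.get(whole, ()))
    return found


def location_covers(anchor, part_of, include_parts):
    """Components a location applies to: the anchor and, optionally, its parts."""
    covered = {anchor}
    if include_parts:
        covered |= {c for c in part_of if anchor in all_wholes(c, part_of)}
    return covered


part_of = {
    "polarisome": ["cell cortex", "site of polarized growth"],
    # Hypothetical sub-part, added only to exercise transitivity.
    "polarisome subunit (hypothetical)": ["polarisome"],
}

# The pole of growth is the location of the site of polarized growth or its parts.
pole_of_growth = location_covers("site of polarized growth", part_of, include_parts=True)
# The periphery is the location of the cell cortex only; it does not propagate.
periphery = location_covers("cell cortex", part_of, include_parts=False)

print(sorted(pole_of_growth))
print(sorted(periphery))
```

The polarisome ends up covered by the pole of growth location but not by the periphery location, even though it is part of both wholes.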
Sample code:
• The whole ODP as an OWL DL ontology is available at: http://gong.man.ac.uk/owl/Polarisome.owl
• DL notation of the whole ODP:
  cellular location pole of growth ⊑ ∃ cellularLocationOf (site of polarized growth ⊔ (∃ partOf site of polarized growth))
  polarisome ⊑ ∃ partOf cell cortex
  polarisome ⊑ ∃ partOf site of polarized growth
  cellular location periphery ⊑ ∃ cellularLocationOf cell cortex
• Manchester abstract OWL syntax notation of the whole ODP:
  class cellular location pole of growth partial cellularLocationOf SOME (site of polarized growth OR (partOf SOME site of polarized growth))
  class polarisome partial partOf SOME [cell cortex, site of polarized growth]
  class cellular location periphery partial cellularLocationOf SOME cell cortex
Known uses: none.
Related ODPs: none.
References:
• See [SR05].
• http://www.w3.org/2001/sw/BestPractices/OEP/SimplePartWhole/
• http://gong.man.ac.uk/ontologydesignpatterns/
Additional information: the ODP can be checked by creating the following two classes:
  PolarisomeLocation ⊑ ∃ cellularLocationOf polarisome
  SiteOfPolarisedGrowthLocation ≡ ∃ cellularLocationOf (site of polarized growth ⊔ (∃ partOf site of polarized growth))
After reasoning, PolarisomeLocation should be a subclass of SiteOfPolarisedGrowthLocation.
There have been different proposals in the literature for modelling transitive propagation in the biomedical domain. The approach chosen for this ODP [SR05, Rec02] relies on the transitive properties that OWL DL provides. Another approach is the one described in [SRH98, SH05], which relies on simulating transitivity by creating SEP triples (Structure - Entity - Part) for each class of the partonomy hierarchy, allowing for selective inheritance of properties. This ODP can also be applied to the problem of sensu in GO described in section 3.1.3.
The property sensu can be decoupled into two properties: described in (the official definition of sensu) and appearing in (pointing to the taxon where the entity appears). This can be applied to the partonomy hierarchies of GO: appearing in should propagate along part of (a part should appear in the same taxon as the whole, or in a subtaxon of it) and described in should not propagate, as the description taxon of the part has no relationship with the description taxon of the whole.

Chapter 5

Conclusion

This chapter explains the future developments and contributions of the research. Section 5.1 explains the research hypothesis in more detail, relating it to future developments. Section 5.2 explores the contributions expected from the research. Section 5.3 describes the criteria that will be used to evaluate the results of the research, and finally section 5.4 gives an overview of how the work will be organised over the following two years.

5.1 Research hypothesis revisited and extended: research aims, objectives and questions

The most widespread paradigm for ontology creation and maintenance in Bioinformatics is OBO (Open Biomedical Ontologies). OBO ontologies give a low fidelity representation of biological complexity because the language they are implemented in is very simple. OBO ontologies do not rely on any formalism and thus they are not amenable to automated treatments such as reasoning and advanced querying. The hypothesis of this research is that migrating the current (OBO) biological ontologies to a more expressive and formal paradigm like OWL DL will allow for a higher fidelity representation of biological knowledge. This representation will provide more sophisticated querying, more efficient interaction and easier maintenance of the ontologies. Once that semantic expressivity is reached, the basis for new resources is in place.
The aim of this research is to develop modelling techniques (mainly Ontology Design Patterns, ODPs), tools and user interfaces that will help biologists in that modelling and migration, especially when confronted with more expressive and formal languages like OWL DL. The main objectives of the research are therefore to develop:
• A precise, formal and understandable description of ODPs.
• A user-friendly framework for creating and migrating biological ontologies into OWL DL, including the application of ODPs.
• Examples of the application of ODPs in real bio-ontologies.

On this basis, some research questions can be summarised:
• What are the properties of OWL DL that make it a better Knowledge Representation technology than the existing OBO paradigm, from the point of view of a biologist?
• Which user interfaces can be built to help biologists deal with the migration from OBO ontologies to OWL DL, or with the creation of OWL DL ontologies?
• What is the formal definition of an ODP? How can an ODP be documented and easily explained? How can an ODP be implemented?
• Which particularities of biological knowledge make it suitable for the application of ODPs? How can a target for the application of an ODP be spotted in biological knowledge?

5.2 Contributions

This section describes the contributions to the field that this research is expected to yield. Some of them are already implemented; the majority will be implemented in the following two years.

5.2.1 GONG and BONG

It has already been shown in the Gene Ontology Next Generation¹ (GONG) workflow that, by offering simple migration tools to biologists, new semantics can be added to pre-existing OBO ontologies and interesting results can be obtained.
This is shown by the fact that many new relationships proposed in the last execution of the GONG workflow were accepted by the GO curators.² The Biological Ontology Next Generation³ (BONG) Protégé plugin allows the biologist to define a GONG workflow in a simple ontology, to be executed by the BONG plugin. This means that a biologist can obtain a more sophisticated knowledge representation in OWL DL, with all its advantages, just by defining a simple GONG ontology (a simple ontology that describes the GONG workflow and is read by the BONG plugin in order to execute it). The BONG plugin is useful not just as an incarnation of the GONG workflow: it is a general OBO to OWL converter, and it can even be used simply to dissect OBO ontologies and find relations to other ontologies with regular expressions. This is demonstrated by the work carried out by the author at the EBI, in collaboration with Chris Mungall, to create GO relationships that include Cell Type ontology terms.

In future research the BONG plugin will be improved by adding, for example, easier creation of regular expressions, automatic generation of GONG ontologies and better retrieval of results (the plugin should point only to new relationships to be added to GO, filtering out the non-informative results from the reasoner). A repository of GONG ontologies will be created on the GONG web site, so that biologists can go to the repository, pick up an already defined workflow and execute it, saving time.

¹ http://gong.man.ac.uk/

5.2.2 Integration of ODPs in BONG

The BONG plugin offers an appropriate platform for giving biologists the possibility of applying ODPs to existing bio-ontologies from OBO. The ODPs can be asserted directly when defining the semantics of the workflow in the GONG ontology.
There is also the possibility of implementing ODPs as part of Protégé itself, as is already the case for some ODPs, which are available in wizard form.⁴

5.2.3 ODPs catalog

A catalog of ODPs will be created during the research, containing the ODPs already explored and further ODPs as they come up, and made available online in the GONG project web page.⁵

² See author’s MSc dissertation in http://gong.man.ac.uk/publications/
³ http://gong.man.ac.uk/downloads/
⁴ http://www.co-ode.org/downloads/wizard/index.php
⁵ http://gong.man.ac.uk/ontologydesignpatterns/

5.2.3.1 Properties of the biological knowledge domain

There are certain properties of the biological domain that can be exploited in order to discover new ODPs, or that can represent a challenge when developing ODPs:
• Time plays an important role in biology, in processes like development and evolution.
• In biology the origin of biological entities is a complex concept, for example in the case of development, where a given structure can be transformed into many different structures and different structures can converge into one structure.
• Biology is full of complex dynamics, like physiological or metabolic regulation, population genetics, etc.
• Symmetry: in processes like catalytic activity there are always a forward and a reverse reaction [Shr03], in metabolism catabolism is always accompanied by biosynthesis, etc. There is also structural symmetry, for example radial, pentaradial or bilateral symmetry in anatomy.
• In biology, contrary to the medical domain, there is a high diversity of structural organisations. For example, the anatomy of arthropods is completely different from that of vertebrates, and even more different from the structure of plants or fungi. Each group of organisms presents important differences and idioms in the way biologists describe it structurally.
• Information order: there are structures where the order of the information is very important, like the parts of a gene (at sequence level or at other levels like exon/intron), the order of amino acids in a protein, the order of events in many processes (neuron activation, gene transcription), etc.
• Complex interactions in metabolism, at the molecular level (for example the macromolecular complex DNA polymerase III) and at other levels. There are also interactions between elements of different levels.
• Taxonomical classifications and nomenclature, which relate in complex ways to evolution via cladistics (paraphyletic and polyphyletic taxa, the difficulty of defining what a species is, etc.).
• Biology is an experimental science, where evidence tracking, methodology concepts and quantitative data are very important.
• Biological reality is highly fuzzy, non-deterministic and full of uncertainty [BB05].
• In biology different levels of organisation or granularity coexist and interact with each other [AJT05, KSN04].

5.2.3.2 Ontological constructs for ODPs

There are ontological constructs that should be explored in order to build biological ODPs, like QCRs (qualified cardinality restrictions), GCIs (general concept inclusions), and more. On the other hand, rules in the form of the Semantic Web Rule Language⁶ represent an extension in expressivity for OWL DL, giving the possibility of asserting, for example, relationships between properties [hPSBT05]. Rules have already been used in modelling biomedical knowledge [GBGD05]. However, the use of rules can make reasoning over the resulting ontologies undecidable. This problem can be avoided by using DL-safe rules [MSS05] and reasoners such as KAON2.⁷ Rules are still being explored, but they represent a promising area, as they can be used to implement more expressive ODPs.
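As an illustration of what a rule relating two properties could look like, the following is a minimal sketch in Python, not the implementation of KAON2 or of any SWRL engine. It applies, by naive forward chaining, the hypothetical rule partOf(p, w) ∧ appearingIn(w, t) → appearingIn(p, t), which mirrors the propagation of appearing in along part of suggested earlier for the sensu problem. In the spirit of DL-safe rules, the rule is only applied to explicitly named individuals. The individual and taxon names (polarisome_1, cell_cortex_1, cell_1, Fungi, Saccharomyces) and the function name are assumptions made for the example.

```python
from typing import Set, Tuple

# A fact is a triple (property, subject, object) over named individuals.
Fact = Tuple[str, str, str]


def propagate_appearing_in(facts: Set[Fact]) -> Set[Fact]:
    """Apply the hypothetical rule
        partOf(p, w) AND appearingIn(w, t) -> appearingIn(p, t)
    repeatedly until no new fact is produced (a fixpoint).
    Only individuals that occur in the given facts are considered."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        part_of = [(s, o) for (prop, s, o) in inferred if prop == "partOf"]
        appearing = [(s, o) for (prop, s, o) in inferred if prop == "appearingIn"]
        for part, whole in part_of:
            for entity, taxon in appearing:
                if entity == whole:
                    new_fact = ("appearingIn", part, taxon)
                    if new_fact not in inferred:
                        inferred.add(new_fact)
                        changed = True
    return inferred


# Hypothetical individuals, used only for illustration.
facts = {
    ("partOf", "polarisome_1", "cell_cortex_1"),
    ("partOf", "cell_cortex_1", "cell_1"),
    ("appearingIn", "cell_1", "Fungi"),
    ("describedIn", "cell_1", "Saccharomyces"),
}

result = propagate_appearing_in(facts)
for fact in sorted(result - facts):
    print("inferred:", fact)
```

In this sketch appearingIn reaches every part in the chain, while describedIn is deliberately left out of the rule and therefore stays attached only to the whole. A real rule engine would work on the ontology itself rather than on a set of tuples.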
5.2.4 Documenting ODPs

The documenting scheme already described is a contribution in its own right, but it remains to be seen how its potential audience (biologists applying ODPs to biological ontologies) uses it. Nonetheless, there are already important points to be considered:

Other classifications of ODPs

Another possible classification of ODPs is based on whether the ODP relies on reasoning or not (the latter being ODPs where reasoning is not strictly necessary, although advisable for maintenance and query building):
• ODPs based on reasoning: Normalisation, Exception.
• ODPs not based on reasoning: N-ary relationships, List, Adapted SEP triples, Value Partition, Upper Level Ontology.
Other classifications should come up during the research.

⁶ http://www.w3.org/Submission/2004/SUBM-SWRL-20040521/
⁷ http://kaon2.semanticweb.org/

Compositionality of ODPs

Software Design Patterns (SDPs) are sometimes built by combining different SDPs. Among the ODPs reviewed, Normalisation can be considered a composed ODP, as it is built using Value Partitions. Other types of compositionality in ODPs will be explored during the research.

Ontology metamodels

In the current proposal ODPs are described as instances. The ODPs should instead be expressed using a metamodel: a model capable of describing the ODP in an abstract way. In other words, a metamodel capable of expressing an ontology at an abstract level is needed. The model would fulfil the same function as UML in Object Oriented programming, where the different implementations in each programming language are incarnations of the abstract model in UML. This would have the following advantages:
• A formal way of assessing the correctness of each ODP implementation.
• Clearer descriptions of ODPs, amenable to more efficient sharing of the ODPs among bio-ontologists. If the explanations are not based on instances, all the possible confusions that arise from the particularities of the example are avoided.
Nonetheless, concrete examples should still be used to explain ODPs, but not as the main expression of the ODP.

5.2.5 Improved bio-ontologies

The Arabidopsis thaliana life cycle ontology in OWL DL is going to be built in the Plant Systems Biology division of Ghent University.⁸ The author will collaborate in the process as a Marie Curie visitor. The whole ontology has to be built from scratch, so many ODPs can be tested on it without the constraints of an already existing ontology, as is the case with the Gene Ontology and other OBO ontologies.

The author is regularly involved in preparing materials for the OWL DL tutorials given to biologists at Manchester University, where the ODPs and their application can be (and have already been) explained to the attendees, who are mainly biologists. After the tutorials they can try to apply the ODPs in their respective knowledge domains. The tutorials are a good testing activity for ODPs because the attendees reveal the strong points and weaknesses of the ODPs explained, and it can be assessed whether they understand them or not, and why.

⁸ http://www.psb.ugent.be/

5.3 Evaluation

The outcome of this research will be the ontologies mentioned in the previous section. Evaluating the quality of an ontology is still a new research area and a matter of controversy. There are different proposed methods for ontology evaluation, differing in the area they focus on:
• Methods based on how ontologies perform in task-oriented environments [HSG+ 05].
• Methods based on structural validity [HSG+ 05].
• Methods based on sound philosophical principles, like Ontoclean [GW02].
None of the mentioned methodologies completely fits the aim of this research, because biological ontologies are not usually task-oriented and because structural validity does not imply suitability of the knowledge representation. The following criteria will therefore be used to evaluate the ontologies created:
• Functionality of the ontology: through the expressiveness of OWL DL and the application of reasoning, new functionalities can be explored in the developed ontologies.
• Expressiveness of the ontology: how well the ontology maps to the domain of knowledge. This can be assessed by creating queries against the ontology that reflect the needs of the domain experts.
• Acceptance of the new ontologies by the community of the domain of knowledge. There is a prior example: the results of the GONG workflow were accepted and incorporated into the Gene Ontology in 2004.⁹
• Logical correctness and reusability [RWRR01]: how modular the ontology is and how it interacts with other ontologies. Ontoclean can be used as part of this criterion.

5.4 Research plan

The research plan for the following two years is described in the two charts of Figure 5.1.

⁹ See author’s MSc dissertation in http://gong.man.ac.uk/publications/

Figure 5.1: Research plan. The research plan is divided in months for each year. The tasks are depicted in the left column.

Bibliography

[ABL+ 89] B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts, and J.D. Watson. Molecular Biology of the Cell. Garland, New York, 1989.
[Ait05] Stuart Aitken. Formalizing concepts of species, sex and developmental stage in anatomical ontologies. Bioinformatics, 21(11):2773–2779, 2005.
[AJT05] Rector A.L., Rogers J.E., and Bittner T. Granularity Scale and Collectivity: When Size Does and Doesn’t Matter. Journal of Biomedical Informatics (in press), 2005.
[ASDUD04] Fátima Al-Shahrour, Ramón Díaz-Uriarte, and Joaquín Dopazo. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes.
Bioinformatics, 20(4):578–580, 2004.
[AvH04] Grigoris Antoniou and Frank van Harmelen. Handbook on ontologies (International Handbooks on Information Systems), chapter 4. Springer, 2004.
[AWB04] J.S. Aitken, B.L. Webber, and J.B.L. Bard. Part-of Relations in Anatomy Ontologies: a Proposal for RDFS and OWL Formalisations. In Proc. PSB, pages 166–177, 2004.
[BB05] Richard Baldock and Albert Burger. Anatomical ontologies: names and places in biology. Genome Biology, (6):108, 2005.
[BGG+ 02] Stefano Borgo, Aldo Gangemi, Nicola Guarino, Claudio Masolo, and Alessandro Oltramari. Ontology RoadMap. Wonder Web deliverable 15, 2002.
[Bla00] J.A. Blake. The mouse genome database (MGD): expanding genetic and genomic resources for the laboratory mouse. Nucleic Acids Research, 28:108–111, 2000.
[BLHL01] Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web. Scientific American, May 2001.
[BMM05] O. Bodenreider, J.A. Mitchell, and A.T. McCray. Biomedical Ontologies. In PSB, 2005.
[BMWS03] M. Bada, R. McEntire, C. Wroe, and R. Stevens. GOAT: The Gene Ontology Annotation Tool. In Proceedings of the 2003 UK e-Science All Hands Meeting, pages 514–519, Nottingham, UK, 2003.
[BRA05] Jonathan Bard, Seung Y Rhee, and Michael Ashburner. An Ontology for Cell Types. Genome Biology, 6:R21, 2005.
[BS04] Tim Beißbarth and Terence P. Speed. GOstat: find statistically overrepresented gene ontologies within a group of genes. Bioinformatics, 20(9):1464–1465, 2004.
[BSG+ 04] Michael Bada, Robert Stevens, Carole Goble, Yolanda Gil, Michael Ashburner, Judith A. Blake, J. Michael Cherry, Midori Harris, and Suzanna Lewis. A short study on the success of the Gene Ontology. Journal of Web Semantics, 1:235–240, 2004.
[BVEL04] Sara Brockmans, Raphael Volz, Andreas Eberhart, and Peter Löffler. Visual Modelling of OWL DL Ontologies using UML. In Proc. ISWC, pages 198–213, 2004.
[BWG+ 04] Elisabeth L. Boyle, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J.
Michael Cherry, and Gavin Sherlock. GO::TermFinder – Open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, 20(18):3710–3715, 2004.
[Car03] Vincent J. Carey. Ontology concepts and tools for statistical genomics. Journal of Multivariate Analysis, 90:213–228, 2003.
[CBB+ 03] Evelyn Camon, Daniel Barrell, Catherine Brooksbank, Michele Magrane, and Rolf Apweiler. The gene ontology annotation (GOA) project - application of GO in SWISS-PROT, TrEMBL, and InterPro. Comparative and Functional Genomics, 4:71–74, 2003.
[CBM+ 04] Evelyn Camon, Daniel Barrell, Michele Magrane, Rolf Apweiler, Vivian Lee, Emily Dimmer, John Maslen, David Binns, Nicola Harte, and Rodrigo Lopez. The gene ontology annotation (GOA) database: sharing knowledge in uniprot with gene ontology. Nucleic Acids Research, 32:D262–D266, 2004.
[CC04] Kuo-Chen Chou and Yu-Dong Cai. Prediction of protein subcellular locations by GO-FunD-PseAA predictor. Biochemical and Biophysical Research Communications, 320:1236–1239, 2004.
[CGGG+ 05] Ana Conesa, Stefan Götz, Juan Miguel García-Gómez, Javier Terol, Manuel Talón, and Montserrat Robles. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21(18):3674–3676, 2005.
[CLT+ 05] F. Chalmel, A. Lardenois, J.D. Thompson, J. Muller, J.A. Sahel, and T. Léveillard. GOAnno: GO annotation based on multiple alignment. Bioinformatics, 21(9):1095–2096, 2005.
[Con99] The FlyBase Consortium. The FlyBase database of the drosophila genome projects and community literature. Nucleic Acids Research, 27:85–88, 1999.
[Con00] The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genetics, 23(May):25–29, 2000.
[Con01] The Gene Ontology Consortium. Creating the Gene Ontology Resource: Design and Implementation. Genome Research, 11:1425–1433, 2001.
[Con04] The Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 32:D258–D261, 2004.
[CSF03] Werner Ceusters, Barry Smith, and Jim Flanagan. Ontology and Medical Terminology: Why Description Logics Are Not Enough. In Towards an Electronic Patient Record, 2003.
[CTP04] Peter Clark, John Thompson, and Bruce Porter. Handbook on ontologies (International Handbooks on Information Systems), chapter 32. Springer, 2004.
[CY03] Jung-Hsien Chiang and Hsu-Chun Yu. MeKE: discovering the functions of gene products from biomedical literature via sentence alignment. Bioinformatics, 19:1417–1422, 2003.
[DBD+ 04] E. Demir, O. Babur, U. Dogrusoz, A. Gursoy, A. Ayaz, G. Gulesir, G. Nisanci, and R. Cetin-Atalay. An Ontology for Collaborative Construction and Analysis of Cellular Pathways. Bioinformatics, 20:349–356, 2004.
[Dev02] Vladan Devedzic. Understanding Ontological Engineering. Communications of the Association for Computing Machinery, 45(4):136–144, 2002.
[DMS] Nick Drummond, Georgina Moulton, and Robert Stevens. Personal communication.
[DSD+ 03] Scott W Doniger, Nathan Salomonis, Kam D Dahlquist, Karen Vranizan, Steven C Lawlor, and Bruce R Conklin. MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biology, 4, 2003.
[DTSC04] Minghua Deng, Zhidong Tu, Fengzhu Sun, and Ting Chen. Mapping Gene Ontology to proteins based on protein-protein interaction data. Bioinformatics, 20:895–902, 2004.
[FSP+ 04] Keith Flanagan, Robert Stevens, Matthew Pocock, Pete Lee, and Anil Wipat. Ontology for genome comparison and genomic rearrangements. Comparative and Functional Genomics, 5:537–544, 2004.
[FSS03] K. Falkovych, M. Sabou, and H. Stuckenschmidt. Knowledge Transformation for the Semantic Web. IOS Press, 2003.
[Gal05] Michael Y. Galperin. The Molecular Biology Database Collection: 2005 update. Nucleic Acids Research, 33(Database issue):D5–D24, 2005.
[GBGD05] C.
Golbreich, O. Bierlaire, B. Gibaud, and O. Dameron. What Reasoning Support for Ontology and Rules? the Brain Anatomy Case Study. In 8th International Protégé Conference, July 2005.
[GCB04] Aldo Gangemi, Carola Catenacci, and Massimo Battaglia. Inflammation Ontology Design Pattern: an exercise in building a core Biomedical Ontology with Descriptions and Situations. Stud. Health Technol. Inform., 102:64–80, 2004.
[GHJV95] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Professional Computing Series. Addison-Wesley, 1995.
[GHS04] Nicholas Gibbins, Stephen Harris, and Nigel Shadbolt. Agent-based semantic web services. Journal of Web Semantics, 1(1):141–154, 2004.
[GLH04] Detlef Groth, Hans Lehrach, and Steffen Hennig. GOblet: a platform for Gene Ontology annotation of anonymous sequence data. Nucleic Acids Research, 32:W313–W317, 2004.
[Gru93] T.R. Gruber. A translation approach to portable ontologies. Knowledge Acquisition, 5:199–220, 1993.
[GSG04] Pierre Grenon, Barry Smith, and Louis Goldberg. Ontologies in Medicine, chapter Biodynamic Ontology: Applying BFO in the Biomedical Domain. IOS Press, 2004.
[Gua98] N. Guarino. Formal Ontology and Information Systems. In Formal Ontology and Information Systems. IOS Press, 1998.
[GW02] Nicola Guarino and Christopher Welty. Evaluating Ontological Decisions with Ontoclean. Communications of the ACM, 45(2):61–65, 2002.
[GW04] CA Goble and CJ Wroe. The Montagues and the Capulets. Comparative and Functional Genomics, 2:623–632, 2004.
[GZB05] Christine Golbreich, Songmao Zhang, and Olivier Bodenreider. The Foundational Model of Anatomy in OWL: experience and perspectives. In Proc. AMIA symp, 2005.
[Har] Midori Harris. Personal communication.
[HdCD+ 05] Frank W. Hartel, Sherri de Coronado, Robert Dionne, Gilberto Fragoso, and Jennifer Golbeck. Modeling a Description Logic Vocabulary for Cancer Research.
Journal of Biomedical Informatics, (38):114–129, 2005.
[HEC+ 04] L. Hart, P. Emery, B. Colomb, K. Raymond, S. Taraporewalla, D. Chang, Y. Ye, E. Kendall, and M. Dutra. OWL Full and UML 2.0 Compared. http://www.omg.org/docs/ontology/04-03-01.pdf, 2004.
[Hen01] James Hendler. Agents and the semantic web. IEEE Intelligent Systems Journal, 16:30–37, 2001.
[HGL03] Steffen Hennig, Detlef Groth, and Hans Lehrach. Automated Gene Ontology annotation for anonymous sequence data. Nucleic Acids Research, 31:3712–3715, 2003.
[Hor04] Matthew Horridge. A practical guide to building OWL ontologies with the Protégé-OWL plugin. http://www.co-ode.org/resources/tutorials/ProtegeOWLTutorial.pdf, 2004.
[hPSBT05] Ian Horrocks, Peter F. Patel-Schneider, Sean Bechhofer, and Dmitry Tsarkov. OWL Rules: a Proposal and Prototype Implementation. Journal of Web Semantics, (3):23–40, 2005.
[HPSvH03] Ian Horrocks, Peter F. Patel-Schneider, and Frank van Harmelen. From SHIQ and RDF to OWL: the making of a web ontology language. Web Semantics: Science, Services and Agents on the World Wide Web, 1:7–26, 2003.
[HSG+ 05] Jens Hartman, Peter Spyns, Alain Gibon, Diana Maynard, Roberta Cuel, Mari Carmen Suárez-Figueroa, and York Sure. Methods for Ontology Evaluation. Knowledge Web deliverable 1.2.3/v1.3, 2005.
[IR98] Horrocks IR. The FACT system. In Proceedings of the international conference TABLEAUX, pages 307–312. Springer, 1998.
[Ire] Amelia Ireland. Personal communication.
[Jac04] Jacob Köhler. Integration of Life Science Databases. Biosilico, 2(2):61–69, 2004.
[JM04] Cliff Joslyn and Susan Mniszewski. Combinatorial Approaches to Bio-Ontology Management with Large Partially Ordered Sets. In SIAM Workshop on Combinatorial Scientific Computing (CSC 04), February 2004.
[JMFH04] Cliff A. Joslyn, Susan M. Mniszewski, Andy Fulmer, and Gary Heaton. The Gene Ontology Categorizer. Bioinformatics, 20:i169–i177, 2004.
[JSA+ 04] Cheng J, Sun S, Tracy A, Hubbell E, Morris J, Valmeekam V, Kimbrough A, Cline MS, Liu G, Shigeta R, Kulp D, and Siani-Rose MA. NetAffx Gene Ontology Mining Tool: a visual approach for microarray data analysis. Bioinformatics, 20:979–981, 2004.
[JSH+ 03] Glynn Dennis Jr, Brad T Sherman, Douglas A Hosack, Jun Yang, Wei Gao, H Clifford Lane, and Richard A Lempicki. DAVID: database for annotations, visualisation, and integrated discovery. Genome Biology, 4, 2003.
[KBBD04] Purvesh Khatri, Pratik Bhavsar, Gagandeep Bawa, and Sorin Draghici. Onto-tools: an ensemble of web accessible, ontology-based tools for the functional design and interpretation of high-throughput gene expression experiments. Nucleic Acids Research, 32:W449–W456, 2004.
[KD05] Purvesh Khatri and Sorin Draghici. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21(18):3587–3595, 2005.
[KFD+ 03] Oliver D. King, Rebecca E. Foulger, Selina S. Dwight, James V. White, and Frederick P. Roth. Predicting gene function from patterns of annotation. Genome Research, 13:896–904, 2003.
[KOTT03] J.-D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19:i180–i182, 2003.
[KPL03] Jacob Köhler, Stephan Philippi, and Matthias Lange. SEMEDA: ontology based semantic integration of biological databases. Bioinformatics, 19:2420–2427, 2003.
[KSC+ 05] Eilbeck K., Lewis S.E., Mungall C.J., Yandell M., Stein L., Durbin R., and Ashburner M. The Sequence Ontology: A tool for the unification of genome annotations. Genome Biology, (6):R44, 2005.
[KSDS03] Salim Khan, Gang Situ, Keith Decker, and Carl J. Schmidt. GeneFigure: Automated Gene Ontology annotation. Bioinformatics, 19:2484–2485, 2003.
[KSK02] Satoshi Kamegai, Kenji Satou, and Akihiko Konagaya. Toward ontology-based knowledge extraction from biomedical literature. Genome Informatics, 13:576–577, 2002.
[KSN04] Anand Kumar, Barry Smith, and Daniel D. Novotny. Biomedical Informatics and Granularity. Comparative and Functional Genomics, 5:501–508, 2004.
[KSR+ 04] Christie KR, Weng S, Balakrishnan R, Costanzo MC, Dolinski K, Dwight SS, Engel SR, Feierbach B, Fisk DG, Hirschman JE, Hong EL, Issel-Tarver L, Nash R, Sethuraman A, Starr B, Thusfeld CL, Andrada R, Binkley G, Dong Q, Lane C, Schroeder M, Botstein D, and Cherry JM. Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Research, 32:D311–D314, 2004.
[Kwo03] Oh Byung Kwon. I know what you need to buy: context-aware multimedia-based recommendation system. Expert Systems with Applications, 25:387–400, 2003.
[Lew05] Suzanna Lewis. Gene Ontology: looking backwards and forwards. Genome Biology, 6:103, 2005.
[LHK04] Sung Geun Lee, Jung Uk Hur, and Yang Seok Kim. A graph-theoretic modeling on GO space for biological interpretation of gene clusters. Bioinformatics, 20:381–388, 2004.
[LHMK03] Astrid Lægreid, Torgeir R. Hvidsten, Herman Midelfart, and Jan Komorowski. Predicting Gene Ontology Biological Process from temporal gene expression patterns. Genome Research, 13:965–979, 2003.
[LHP03] Patrick Lambrix, Manal Habbouche, and Marta Pérez. Evaluation of Ontology Development tools for bioinformatics. Bioinformatics, 19(12):1564–1571, 2003.
[LM04] Jane Lomax and Alexa T. McCray. Mapping the Gene Ontology into the Unified Medical Language System. Comparative and Functional Genomics, 5:354–361, 2004.
[LSBG03] P.W. Lord, R.D. Stevens, A. Brass, and C.A. Goble. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, 19:1275–1283, 2003.
[LT04] Shun-Chieh Lin and Shian-Shyong Tseng. Constructing detection knowledge for DDoS intrusion tolerance. Expert Systems with Applications, 27:379–390, 2004.
[LZ04] Yuefeng Li and Ning Zhong. Web mining model and its applications for information gathering. Knowledge-Based Systems, 2004.
[MBH+ 05] David Milward, Marcus Bjäreland, William Hayes, Michelle Maxwell, Lisa Örbeg, Nick Tilford, James Thomas, Roger Hale, Sylvia Knight, and Julie Barnes. Ontology-based interactive information extraction from scientific abstracts. Comparative and Functional Genomics, 6:67–71, 2005.
[MBR+ 04] David Martin, Christine Brun, Elisabeth Remy, Pierre Mouren, Denis Thieffry, and Bernard Jacq. GOToolBox: functional analysis of gene datasets based on Gene Ontology. Bioinformatics, 5:R101, 2004.
[McG01] Deborah L. McGuinness. The Semantic Web: Why, What and How, chapter Ontologies come of age. MIT Press, 2001.
[MSS05] Boris Motik, Ulrike Sattler, and Rudi Studer. Query Answering for OWL-DL with Rules. Journal of Web Semantics, (3):41–60, 2005.
[MTES05] J. P. Massar, Michael Travers, Jeff Elhar, and Jeff Shrager. BioLingua: a programmable knowledge environment for biologists. Bioinformatics, 21(2):199–207, 2005.
[Mun05] Chris J. Mungall. OBOL: Integrating Language and Meaning in Bio-Ontologies. Comparative and Functional Genomics, (5):509–520, 2005.
[Mus05] Mark Musen. From Cottage Industry to the Industrial Age: New Infrastructure for Ontology Authoring and Dissemination. In Protégé international conference, 2005.
[NMW04] Eric K. Neumann, Eric Miller, and John Wilbanks. What the Semantic Web could do for Life Sciences. Biosilico, 2(6):228–236, 2004.
[OCAM+ 04] P.V. Ogren, K.B. Cohen, G.K. Acquaah-Mensah, J. Eberlein, and L. Hunter. The Compositional Structure of Gene Ontology terms. In Pac Symp Biocomput., pages 214–25, 2004.
[Ode94] James J. Odell. Six Different Kinds of Composition. Journal Of Object-Oriented Programming, 5(8), 1994.
[OGA+ 05] Tom Oinn, Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole Goble, Antoon Goderis, Duncan Hull, Darren Marvin, Peter Li, Phillip Lord, Matthew R.
Pocock, Martin Senger, Robert Stevens, Anil Wipat, and Chris Wroe. Taverna: Lessons in creating a workflow environment for the life sciences. Concurrency and Computation: Practice and Experience Grid Workflow, 2005. Accepted for Publication.
[RCSA02] Soumya Raychaudhuri, Jeffrey T. Chang, Patrick D. Sutphin, and Russ B. Altman. Associating genes with Gene Ontology codes using a maximum entropy analysis of biomedical literature. Genome Research, 12:203–214, 2002.
[Rec02] Alan Rector. Analysis of propagation along transitive roles: Formalisation of the GALEN experience with medical ontologies. In DL, 2002.
[Rec03] Alan L. Rector. Modularisation of Domain Ontologies Implemented in Description Logics and related formalisms including OWL. In K-CAP, pages 121–128, 2003.
[RKM+ 05] Cornelius Rosse, Anand Kumar, Jose LV Mejino, Daniel L Cooks, Landom T Detwiler, and Barry Smith. A strategy for improving and integrating biomedical ontologies. In Annual symposium of American Medical Informatics Association (AMIA), 2005.
[RR00] Jeremy Rogers and Alan Rector. GALEN’s Model of Parts and Wholes: Experience and Comparisons. In Proc. AMIA symp, pages 714–718, 2000.
[RR04] Alan L Rector and Jeremy Rogers. Patterns, Properties and Minimizing Commitment: Reconstruction of the GALEN Upper Ontology in OWL. In EKAW, 2004.
[RWBB04] Peter N. Robinson, Andreas Wollstein, Ulrike Böhme, and Brad Beattie. Ontologizing gene-expression microarray data: characterizing clusters with Gene Ontology. Bioinformatics, 20:979–981, 2004.
[RWRR01] Alan L. Rector, Chris Wroe, Jeremy Rogers, and Angus Roberts. Untangling Taxonomies and Relationships: personal and Practical Problems in Loosely Coupled Development of Large Ontologies. In K-CAP, pages 139–146, 2001.
[SAW+ 05] Robert Stevens, Mikel Egaña Aranguren, Katty Wolstencroft, Ulrike Sattler, Nick Drummond, and Matthew Horridge. Managing OWL’s Limitations in Modelling Biomedical Knowledge.
Submitted to International Journal of Human Computer Studies – special issue on the limits of ontologies, 2005.
[SCK+ 05] Barry Smith, Werner Ceusters, Bert Klagges, Jacob Köhler, Anand Kumar, Jane Lomax, Chris Mungall, Fabian Neuhaus, Alan L Rector, and Cornelius Rosse. Relations in biomedical ontologies. Genome Biology, (6):R46, 2005.
[SDSH05] Stefan Schulz, Philipp Daumke, Barry Smith, and Udo Hahn. How to distinguish parthood from location in bioontologies. In Annual symposium of American Medical Informatics Association (AMIA), 2005.
[SGB00] R. Stevens, C.A. Goble, and S. Bechhofer. Ontology-based Knowledge Representation for Bioinformatics. Briefings in Bioinformatics, 1(4):398–416, 2000.
[SGP+ 03] Robert Stevens, Carole Goble, Norman W. Paton, Sean Bechhofer, Gary Ng, Patricia Baker, and Andy Brass. Complex Query Formulation Over Diverse Information Sources in TAMBIS. In Zoe Lacroix and Terence Critchlow, editors, Bioinformatics: Managing Scientific Data. Morgan Kaufmann, May 2003.
[SH04] Stefan Schulz and Udo Hahn. Towards a Computational Paradigm for Biomedical Structure. In Proc. KR-MED, pages 63–71, 2004.
[SH05] Stefan Schulz and Udo Hahn. Part-whole representation and reasoning in formal biomedical ontologies. Artificial Intelligence in Medicine, 34:179–200, 2005.
[Shr03] Jeff Shrager. The fiction of function. Bioinformatics, 19(15):1934–1936, 2003.
[SK02] Steffen Schulze-Kremer. Ontologies for molecular biology and bioinformatics. In Silico Biology, 2(17), 2002.
[SK04a] Barry Smith and Anand Kumar. Controlled vocabularies in bioinformatics: a case study in the gene ontology. Biosilico, 2(6):246–252, 2004.
[SK04b] Heiner Stuckenschmidt and Michel Klein. Ontologies Refinement - Towards Structure-Based Partitioning of Large Ontologies. Wonder Web deliverable 22, 2004.
[SK05] Larisa N. Soldatova and Ross D. King. Are the Current Ontologies in Biology Good Ontologies? Nature Biotechnology, 23(9):1095–1098, 2005.
[SKK04] Barry Smith, Jacob Köhler, and Anand Kumar. On the application of Formal Principles to Life Science Data: a Case Study in the Gene Ontology. In DILS, pages 74–94, 2004.

[SR04] Barry Smith and Cornelius Rosse. The Role of Foundational Relations in the Alignment of Biomedical Ontologies. In MEDINFO. IOS press, 2004.

[SR05] Julian Seidenberg and Alan Rector. Transitive propagation in OWL. Work not published, 2005.

[SRG03] Robert D. Stevens, Alan J. Robinson, and Carole A. Goble. MyGrid: personalised bioinformatics on the information grid. Bioinformatics, 19:i302–i304, 2003.

[SRH98] Stefan Schulz, Martin Romacker, and Udo Hahn. Part-Whole Reasoning in Medical Ontologies Revisited - Introducing SEP triplets into Classification-based Description Logics. In Proceedings of the 1998 AMIA Annual Fall Symposium. A Paradigm Shift in Health Care Information Systems: Clinical Infrastructures for the 21st Century, pages 830–834. Hanley and Belfus, 1998.

[SWLG04] Robert Stevens, Chris Wroe, Phillip Lord, and Carole Goble. Handbook on ontologies (International Handbooks on Information Systems), chapter 10. Springer, 2004.

[SWSK03] Barry Smith, Jennifer Williams, and Steffen Schulze-Kremer. The Ontology of the Gene Ontology. In Annual symposium of American Medical Informatics Association (AMIA), 2003.

[Tho03] Jeffrey Thomas. Finding an Oasis in the Desert of Bioinformatics. Biosilico, 1(2):56–58, 2003.

[VEF+04] Stefano Volinia, Rita Evangelisti, Francesca Francioso, Diego Arcelli, Massimo Carella, and Paolo Gasparini. GOAL: automated Gene Ontology analysis of expression profiles. Nucleic Acids Research, 32:W492–W499, 2004.

[VR03] Volker Haarslev and Ralf Möller. RACER: a core inference engine for the semantic web. In ISWC, pages 27–36, 2003.

[WA03] Jennifer Williams and William Andersen. Bringing Ontology to the Gene Ontology. Comparative and Functional Genomics, 4:90–93, 2003.

[WGA05] Xiaoshu Wang, Robert Gorlitsky, and Jonas S Almeida. From XML to RDF: how semantic web technologies will change the design of ’omic’ standards. Nature Biotechnology, 23(9):1099–1103, 2005.

[WMS+05] K. Wolstencroft, R. McEntire, R. Stevens, L. Tabernero, and A. Brass. Constructing Ontology-Driven Protein Family Databases. Bioinformatics, 21(8):1685–92, 2005.

[WSGA03] C.J. Wroe, R.D. Stevens, C.A. Goble, and M. Ashburner. A Methodology to Migrate the Gene Ontology to a Description Logic Environment Using DAML+OIL. In 8th Pacific Symposium on Biocomputing (PSB), pages 624–636, 2003.

[WSM+05] Hongwei Wu, Zhengchang Su, Fenglou Mao, Victor Olman, and Ying Xu. Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic Acids Research, 33(9):2822–2837, 2005.

[XWL+02] Hanqing Xie, Alon Wasserman, Zurit Levine, Amit Novik, Vladimir Grebinskiy, Avi Shoshan, and Liat Mintz. Large-scale protein annotation through Gene Ontology. Genome Research, 12:785–794, 2002.

[YKNA03] Iwei Yeh, Peter D. Karp, Natasha F. Noy, and Russ B. Altman. Knowledge acquisition, consistency checking and concurrency control for Gene Ontology (GO). Bioinformatics, 19(2):241–248, 2003.

[YWCS05] A. Young, N. Whitehouse, J. Cho, and C. Shaw. OntologyTraverser: an R package for GO analysis. Bioinformatics, 21(2):275–276, 2005.

[Zeh03] Günther Zehetner. OntoBlast function: from sequence similarities directly to potential functional annotations by ontology terms. Nucleic Acids Research, 31:3799–3803, 2003.

[ZFW+03] Barry R Zeeberg, Weimin Feng, Geoffrey Wang, May D Wang, Anthony T Fojo, Margot Sunshine, Sudarshan Narasimhan, David W Kane, William C Reinhold, Samir Lababidi, Kimberly J Bussey, Joseph Riss, J Carl Barrett, and John N Weinstein. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biology, 4, 2003.

[ZM03] Zuo Zhihong and Zhou Mingtian. Web Ontology Language OWL and its Description Logic Foundation. In Proceedings of the Fourth International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 157–160, 2003.

[ZSKS04] Bing Zhang, Denise Schmoyer, Stefan Kirov, and Jay Snoddy. GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics, 5(16), 2004.
