Bio-ontologies for Annotation and Service Discovery
Chris Wroe (+ material from Carole Goble, Alan Rector, Jeremy Rogers, Ian Horrocks)
University of Manchester, UK

Overview
An example-driven tour of the why, what and how of ontologies in the life sciences.
Cover the key features of an ontology: vocabulary, definitions, hierarchies, grammar and reasoning.
Cover the key targets of ontology use: biological knowledge, service descriptions, (database schema).

Ontology – the discipline
Semantics: the meaning of meaning.
A philosophical discipline: the branch of philosophy that deals with the nature and the organisation of reality.
The Science of Being (Aristotle, Metaphysics, IV, 1).
What is being? What are the features common to all beings?

In science… ontology the thing
A resource to aid the precise communication and integration of information.
It binds a community to communicate information in some domain of interest in a consistent manner.

Gene Ontology – a community effort
Model organism databases need to be integrated.
This is not possible if they all use a different vocabulary.
The Gene Ontology Consortium got together to form “a dynamic controlled vocabulary that can be applied to all eukaryotes”.

Gene Ontology – keeping it simple
Provides three separate vocabularies to describe:
The function a gene product is capable of.
The process a gene product takes part in.
The location at which the gene product has been found.

Annotation
GO annotations: gene detail page in MGD for the vitamin D receptor gene, Vdr.
Feature 1: Ontologies provide a shared controlled vocabulary of concepts.

Gene ontology – definitions
A diverse community, so explicit definitions are important.
60% of GO concepts have a textual definition, e.g.
apoptotic nuclear changes (GO:0030262): Changes affecting the nucleus and its contents during apoptosis; includes condensation and fragmentation of nuclear DNA and of the nucleus itself.
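A controlled vocabulary with agreed definitions can be pictured as a lookup table that annotations must point into. The following is a minimal sketch only, not how GO or MGD store their data: the `Term` record, the `annotate` helper and the gene product name are hypothetical, and the vocabulary holds just the one term quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One concept in a controlled vocabulary (hypothetical record layout)."""
    term_id: str      # stable identifier, e.g. "GO:0030262"
    name: str         # human-readable label
    definition: str   # agreed textual definition ("" if none yet)


# Only the term quoted in the talk; a real vocabulary holds thousands.
VOCABULARY = {
    "GO:0030262": Term(
        "GO:0030262",
        "apoptotic nuclear changes",
        "Changes affecting the nucleus and its contents during apoptosis; "
        "includes condensation and fragmentation of nuclear DNA and of the "
        "nucleus itself.",
    ),
}


def annotate(annotations, gene_product, term_id):
    """Record an annotation, rejecting anything outside the vocabulary."""
    if term_id not in VOCABULARY:
        raise ValueError(f"{term_id!r} is not a term in the controlled vocabulary")
    annotations.setdefault(gene_product, set()).add(term_id)
    return annotations


def fraction_defined(vocabulary):
    """Share of terms that carry a textual definition."""
    defined = sum(1 for term in vocabulary.values() if term.definition)
    return defined / len(vocabulary)
```

Free-text labels such as "nuclear changes" are rejected, which is the point of a shared vocabulary: two databases can only be integrated if both annotate with the same identifiers.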
Feature 2: Ontologies provide an agreed definition for each concept to ensure each concept is used in the same way.

Gene ontology – organisation
An alphabetical list of 11000 terms is not enough.
Hierarchies allow similar terms to be grouped together.
Example terms: biological process, death, cell death, tissue death, necrosis, histolysis.

Gene ontology – hierarchy use
The GO hierarchy is used for:
Navigation of concepts by users
Indexing of information in databases
Aggregating information

Taxonomy remark 1
The world is not a tree, it’s a lattice.
Diagram labels: animal; vermin, rodent, wild, domestic, pet, working; dog, mouse, cat, cow.

Taxonomy remark 2
What does the taxonomy mean?
Concept A is a parent of concept B iff every instance of B is also an instance of A (superset/subset).

ICONCLASS
Door: Closing the Door, Monumental Door, Metalwork of a Door, Door-Knocker, Threshold, Door-keeper.
The entries relate to Door in different ways: kind of a door, action associated with a door, something attached to a door.

Classification trickiness
The Celestial Emporium of Benevolent Knowledge, Borges
"On those remote pages it is written that animals are divided into:
a. those that belong to the Emperor
b. embalmed ones
c. those that are trained
d. suckling pigs
e. mermaids
f. fabulous ones
g. stray dogs
h. those that are included in this classification
i. those that tremble as if they were mad
j. innumerable ones
k. those drawn with a very fine camel's hair brush
l. others
m. those that have just broken a flower vase
n. those that resemble flies from a distance"

Classification is task and culture specific
Dyirbal classification of objects in the universe:
Bayi: men, kangaroos, possums, bats, most snakes, most fishes, some birds, most insects, the moon, storms, rainbows, boomerangs, some spears, etc.
Balan: women, anything connected with water or fire, bandicoots, dogs, platypus, echidna, some snakes, some fishes, most birds, fireflies, scorpions, crickets, the stars, shields, some spears, some trees, etc.
Balam: all edible fruit and the plants that bear them, tubers, ferns, honey, cigarettes, wine, cake.
Bala: parts of the body, meat, bees, wind, yamsticks, some spears, most trees, grass, mud, stones, noises, language, etc.

Gene ontology – directed acyclic graphs
Each concept is explicitly grouped by either is-a or part-of relationships.
Functions are often grouped by type.
Cellular components are often grouped by part.
Each concept can have multiple parents.
A concept's position is represented by a directed acyclic graph.
Hierarchies are handcrafted so as to suit the ‘culture’ of biologists.
Feature 3: Ontologies organise concepts in multiple ways for multiple uses. The principle of grouping should be explicit.

Taking it further
GO concepts are often phrases: insulin control element activator complex, insulin processing, insulin receptor, insulin receptor complex, insulin receptor ligand, insulin receptor signalling pathway, insulin secretion, insulin-activated sodium/amino acid transporter.
The components of a phrase are hidden to computer applications.
Explicit conceptualisation
Semantic similarity searching
Automated maintenance of hierarchies

What we need is…
A formal grammar with which to compose phrases.
Software which can interpret phrases and produce sound and complete hierarchies.

The exploding bicycle
Codes per scheme:
ICD-9 (E826): 8
READ-2 (T30..): 81
READ-3: 87
ICD-10 (V10-19): 587
Example ICD-10 codes:
V31.22 Occupant of three-wheeled motor vehicle injured in collision with pedal cycle, person on outside of vehicle, nontraffic accident, while working for income
W65.40 Drowning and submersion while in bath-tub, street and highway, while engaged in sports activity
X35.44 Victim of volcanic eruption, street and highway, while resting, sleeping, eating or engaging in other vital activities

Defusing the exploding bicycle: 500 codes in pieces
10 things to hit… Pedestrian / cycle / motorbike / car / HGV / train / unpowered vehicle / a tree / other
5 roles for the injured… Driving / passenger / cyclist / getting in / other
5 activities when injured… resting / at work / sporting / at leisure / other
2 contexts… In traffic / not in traffic
(10 × 5 × 5 × 2 = 500 combinations)
V12.24 Pedal cyclist injured in collision with two- or three-wheeled motor vehicle, unspecified pedal cyclist, nontraffic accident, while resting, sleeping, eating or engaging in other vital activities

Coordination: Conceptual Lego
Building blocks: gene, hand, protein, cell, extremity, expression, body, lung, chronic, inflammation, acute, infection, bacterial, abnormal, deletion, normal, ischaemic, polymorphism.

Conceptual Lego
“SNPolymorphism of CFTRGene causing Defect in MembraneTransport of ChlorideIon causing Increase in Viscosity of Mucus in CysticFibrosis…”
“Hand which is anatomically normal”

DAML+OIL
Specifically designed to build phrases in a compositional manner.
Becoming a standard ontology interchange language.
Adopted by W3C and will soon become the Web Ontology Language (OWL).

Reasoning support
Consistency: check if knowledge is meaningful.
Subsumption: structure knowledge, compute taxonomy.
Equivalence: check if two classes denote the same set of instances.
Instantiation: check if individual i is an instance of class C.
Retrieval: retrieve the set of individuals that instantiate C.
These problems are all reducible to consistency (satisfiability).

Gene Ontology Next Generation
Early aim
Proof of concept showing DAML+OIL & description logic can practically
help in at least one aspect of GO maintenance.
In cooperation with Mike Ashburner and the GO editorial team.
Further aims
Prototype an evolutionary environment in which the benefits can be replicated on a larger scale.

Preliminary task
Providing an exhaustive is-a taxonomy.
GO is an is-a poly-hierarchy.
It becomes increasingly laborious to make sure that all concepts are linked to all possible is-a parents.

Metabolism terms: [chemical] biosynthesis, e.g. heparin biosynthesis
Axis 1: Chemicals
biosynthesis (GO:0009058)
[i] carbohydrate biosynthesis (GO:0016051)
[i] aminoglycan biosynthesis (GO:0006023)
[i] glycosaminoglycan biosynthesis (GO:0006024)
[i] heparin biosynthesis (GO:0030210)
Axis 2: Process
[i] heparin metabolism (GO:0030202)
[i] heparin biosynthesis (GO:0030210)

Is this important?
A complete taxonomy is not necessary for browsing by a biologist (and may actually get in the way),
BUT… it improves the fidelity of DB record retrieval.
Asking for records annotated with ‘glycosaminoglycan biosynthesis’ or more specific will lead to an additional result:
O94923 SPTr ISS - D-glucuronyl C5-epimerase (Fragment)

How can we support the task?
Step 0. Translate to DAML+OIL syntax (provided by OilEd).
Provide DAML+OIL based definitions of GO concepts – initially in the metabolism area.

DAML+OIL definitions for metabolism concepts
heparin biosynthesis
  class heparin biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass heparin
  (acts_on is unique)
  Paraphrase: biosynthesis which acts solely on heparin
glycosaminoglycan biosynthesis
  class glycosaminoglycan biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass glycosaminoglycan
Feature 4: Ontologies provide a formal, computer-interpretable concept definition.

A chemical ontology
Initially used MeSH to create a DAML+OIL ontology from a subset of the chemical taxonomy (using UMLS tools/API).
Provides the following information:
carbohydrates
[i] polysaccharides
[i] glycosaminoglycans
[i] heparin

Reason over the combination
Combine GO definitions with the chemical ontology using the OilEd API.
Send to the FaCT DL reasoner…

Paraphrased reasoning process
The two definitions differ only in what acts_on points to: heparin in one, glycosaminoglycan in the other.
The chemical ontology states that heparin is-a glycosaminoglycan.

Inferring a new is-a link
Because heparin is-a glycosaminoglycan, biosynthesis acting on heparin is-a biosynthesis acting on glycosaminoglycan: heparin biosynthesis is-a glycosaminoglycan biosynthesis.
Feature 5: Ontologies can become a dynamic service with reasoning support.

Output
The OilEd API reports additional inferred is-a relationships, e.g. heparin biosynthesis has the new is-a parent glycosaminoglycan biosynthesis.
A sanitised version is sent to the GO editorial team for comment.
The editor (Jane Lomax) makes changes to GO if appropriate and sends back queries.

Results
Carbohydrate metabolism: 22 additional is-a links, 17 of which are now in GO.
Amino acid metabolism: a further 17 additional is-a links now in GO.
Currently preparing results for metabolism as a whole.

Where next with GONG?
Moving from proof of concept requires dedicated software tools to support the process:
Authoring/curation of DAML+OIL definitions
Tracking GO as it evolves
Tracking suggested changes and responses to changes

myGrid & high level ontologies
myGrid: personalised extensible environments for data-intensive in silico experiments in biology.
Higher level services: workflow, databases, knowledge management, provenance…
Bioinformatics services are published as Web services (and soon Grid Services).
http://www.ebi.ac.uk/collab/mygrid/service0/axis/index.html

Ontologies for Service Discovery
Find an appropriate type of service, e.g. sequence alignment.
Find appropriate instances of that service, e.g. BLAST (an algorithm for sequence alignment), as delivered by NCBI.
Assist in forming an appropriate assembly of discovered services.
Find, select and execute instances of services while the workflow is being enacted.
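The two discovery steps (find the type of service, then find instances of it) can be sketched as a lookup over an is-a hierarchy of service types. This is an illustration under stated assumptions, not the myGrid implementation: the `SERVICE_TYPE_IS_A` links, the `ServiceInstance` record and the registry contents are hypothetical, built only from names mentioned in the talk.

```python
from dataclasses import dataclass

# Hypothetical is-a links between service types (child -> parents).
SERVICE_TYPE_IS_A = {
    "BLAST": {"sequence alignment"},
    "SWISS-PROT": {"protein sequence database"},
}


@dataclass(frozen=True)
class ServiceInstance:
    service_type: str  # the type of service this instance delivers
    provider: str      # who delivers it


# Hypothetical registry contents, using names mentioned in the talk.
REGISTRY = [
    ServiceInstance("BLAST", "NCBI"),
    ServiceInstance("BLAST", "EBI"),
    ServiceInstance("SWISS-PROT", "EBI"),
]


def is_kind_of(service_type, wanted, is_a):
    """True if service_type equals wanted or reaches it via is-a links."""
    stack, seen = [service_type], set()
    while stack:
        current = stack.pop()
        if current == wanted:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, ()))
    return False


def find_instances(wanted, registry, is_a):
    """Return registered instances whose type is the wanted type or a subtype."""
    return [inst for inst in registry
            if is_kind_of(inst.service_type, wanted, is_a)]
```

Asking for "sequence alignment" returns both BLAST instances even though neither is registered under that name; the hierarchy, rather than string matching, does the work.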
Knowledge in the head of the expert bioinformatician.

An in silico experiment as a workflow
Workflow diagram labels: RASMOL, Protein name, Fetch, Fetch, View, WF, Similar, Structure, sequences, modelling.

Four-tiered service descriptions
Domain “semantic”
1. Class of service: a protein sequence alignment, a protein sequence database.
2. Specific example of an abstract service: BLAST, SWISS-PROT.
Business “operational”
3. Instance service description of a specific service: BLAST, SWISS-PROT as offered by the EBI.
4. Invoked instance service description: BLAST as offered by the EBI on a particular date, with particular parameters, when a service was actually enacted.

Service description phrases
Build up a phrase describing classes of service functionality.
Building blocks for the phrase come from a suite of ontologies.
The template for the description is based on DAML-S, specialised for bioinformatics.
Use reasoning to maintain a classification of services.

Suite
Ontologies in the suite: Upper level ontology, Task ontology, Informatics ontology, Web service ontology, Molecular biology ontology, Bioinformatics ontology, Publishing ontology, Organisation ontology.
Relationships between them:
Specialises: all concepts are subclassed from those in the more general ontology.
Contributes concepts to form definitions.
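Using reasoning to maintain a classification of services means the hierarchy is computed from the descriptions rather than drawn by hand. The sketch below uses a simplified structural subsumption check over flat property/value descriptions; it is not the tableau algorithm of a description logic reasoner such as FaCT, and the concept names, property names and service names are hypothetical.

```python
# Hypothetical is-a links between building-block concepts (child -> parents).
CONCEPT_IS_A = {
    "pairwise_aligning": {"aligning"},
    "protein_sequence": {"sequence"},
}

# Hypothetical service descriptions: property -> required kind of value.
DESCRIPTIONS = {
    "alignment_service": {"performs_task": "aligning"},
    "protein_alignment_service": {
        "performs_task": "aligning",
        "requires_input": "protein_sequence",
    },
    "pairwise_protein_alignment_service": {
        "performs_task": "pairwise_aligning",
        "requires_input": "protein_sequence",
    },
}


def ancestors(concept, is_a):
    """Return the concept plus everything reachable via is-a links."""
    seen, stack = {concept}, [concept]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def subsumes(general, specific, is_a):
    """True if every restriction in `general` is met by `specific`."""
    return all(
        prop in specific and value in ancestors(specific[prop], is_a)
        for prop, value in general.items()
    )


def classify(descriptions, is_a):
    """Map each description to its most specific strict subsumers.

    Assumes no two descriptions are equivalent to each other.
    """
    parents = {}
    for name, desc in descriptions.items():
        above = {
            other for other, other_desc in descriptions.items()
            if other != name
            and subsumes(other_desc, desc, is_a)
            and not subsumes(desc, other_desc, is_a)
        }
        # Drop any subsumer that sits above another subsumer.
        parents[name] = {
            s for s in above
            if not any(t != s and subsumes(descriptions[s], descriptions[t], is_a)
                       for t in above)
        }
    return parents
```

Calling `classify(DESCRIPTIONS, CONCEPT_IS_A)` places the pairwise service under the protein alignment service, which in turn sits under the general alignment service, without any of those links being asserted by hand.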
Description properties: parameters (input, output, precondition, effect), performs_task, uses-resource, is_function_of.

class-def defined BLAST-n_service_operation
  subclass-of atomic_service_operation
  has_Class performs_task (aligning has_Class has_feature local has_Class has_feature pairwise)
  has_Class produces_result (report has_Class is_report_of sequence_alignment)
  has_Class uses_resource (database has_Class contains (data has_Class encodes (sequence has_Class is_sequence_of nucleic_acid_molecule)))
  has_Class requires_input (data has_Class encodes (sequence has_Class is_sequence_of nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)

class-def defined pairwise_sequence_alignment_service
  subclass-of atomic_service_operation
  has_Class performs_task (aligning has_Class has_feature local has_Class has_feature pairwise)
  has_Class produces_result (report has_Class is_report_of sequence_alignment)
  has_Class uses_resource (database has_Class contains (data has_Class encodes (sequence has_Class is_sequence_of nucleic_acid_molecule)))
  has_Class requires_input (data has_Class encodes (sequence has_Class is_sequence_of nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)

Description driven classification
myGrid version 0 architecture (diagram labels): Portal, Repository Client, Personal Repository, Workflow Client, Workflow Repository, Workflow enactment, Bioinformatics services, Client framework, Ontology Client, (Meta Data) Ontology Server, (Meta Data) Service Type Directory, Service instance directory, DAML+OIL Reasoner (FaCT), Matcher and Ranker.
1. The user selects values from a drop-down list to create a property-based description of the required service. Values are constrained to provide only sensible alternatives.
2. Once the user has entered a partial description, they submit it for matching. The results are displayed below.
3. The user adds the operation to the growing workflow.
4. The workflow specification is complete and ready to match against those in the workflow repository.

Ontology grounds out
Link the ontology to WSDL and UDDI.
WSDL: types (XML Schema), messages, portType, operation, binding, service.
UDDI: businessEntity, businessService, bindingTemplate, tModel.

Other uses of ontology
Labelling data items in databases
Semantic typing for controlling inputs and outputs
Use by distributed query processing

Ontology/registry issues
How best to integrate with existing registry technology such as UDDI?
How do ontological descriptions of data relate to type systems?
How big should the phrases become within the ontology?
Who builds these descriptions?

Summary
Different ontologies can have a different selection of features tailored to requirements.
They form a wide spectrum of resources.
Powerful technology is available.
Harness it for end users.

And finally… predates computers
Linnaeus, 18th century
Nomenclature/classification of species
Language independent (Latin)
Promoted sharing and integration of knowledge about related species
A community effort – botanists / zoologists
Farr, 19th century
Nomenclature of disease for consistent cause-of-death reporting
Allowed aggregation/integration of data to discover new knowledge about the aetiology of cholera
A community effort – surgeons

Links
All myGrid tools & ontology available from: http://www.mygrid.org.uk
GONG site: http://gong.man.ac.uk
Building ontologies site: http://oiled.man.ac.uk/building

Acknowledgements
Manchester metadata team: Carole Goble, Robert Stevens, Sean Bechhofer, Phil Lord, Alan Rector, Jeremy Rogers, Chris Garwood
myGrid team
GO Consortium, esp. Mike Ashburner, Midori Harris, Jane Lomax

Sharing info, sharing meaning
Metadata: data describing the content and meaning of resources and services.
But everyone must speak the same language…
Terminologies: shared and common vocabularies, for search engines, agents, curators, authors and users.
But everyone must mean the same thing…
Ontologies: shared and common understanding of a domain, essential for search, exchange and discovery.

Origin and History
Humans require words (or at least symbols) to communicate efficiently. The mapping of words to things is only indirectly possible. We do it by creating concepts that refer to things.
The relation between symbols and things has been described in the form of the meaning triangle [Ogden, Richards, 1923], shown with the concept and the symbol “Jaguar”.

So what is an ontology? [Deborah McGuinness, Stanford]
A spectrum from lightweight to expressive: Catalog/ID; Terms/glossary; Thesauri; Informal is-a; Formal is-a; Formal instance; Frames (properties); Value restrictions; Disjointness, inverse, part-of; General logical constraints.
Examples placed along the spectrum: Gene Ontology, Mouse Anatomy, Arom, TAMBIS, EcoCyc, PharmGKB.

Human and machine communication [Maedche et al., 2002]
Human Agent 1 and Human Agent 2 exchange symbols, e.g. via natural language.
Machine Agent 1 and Machine Agent 2 exchange symbols, e.g. via protocols.
The human agents (HA1, HA2, with internal models) and the machine agents (MA1, MA2, with formal models) each commit to an ontology description with formal semantics.
The ontology relates the symbol “JAGUAR” to a concept and to things in a specific domain, e.g. animals (the meaning triangle).

Important life science ontologies
SWISS-PROT Keywords: the SWISS-PROT keyword list now has definitions (in natural language) associated with each keyword.
Edinburgh Anatomies: whole or partial anatomy ontologies for adult and developmental stages for several model organisms.
Ingenuity: the Ingenuity company has a large knowledge base of experimental findings in biology. Currently, their ontology is not viewable.
MGED: the MGED ontology working group aims to develop ontologies for describing gene expression experiments and data.
Semiotes Regulatory Networks Model
PharmGKB: Pharmacogenetics Knowledge Base.
TAMBIS ontology (TaO): an ontology of bioinformatics and molecular biology.
RiboWeb: an ontology describing ribosomal components, associated data and computations for processing those data.
EcoCyc: an ontology describing the genes, gene product function, metabolism and regulation within E. coli.
Molecular Biology Ontology (MBO): a general, reference ontology for molecular biology.
Gene Ontology (GO): an ontology describing the function, the process and cellular location of gene products from eukaryotes.
Mouse Genome Informatics GO browser
Mouse Anatomical Dictionary
ImMunoGeneTics (IMGT) Ontology
STAR/mmCIF: macromolecule structure ontology.
Signal Transduction Knowledge Environment (STKE)
GENAROM: ontology of gene product interactions.
GeneX: ontologies for comparing gene expression across species.
EpoDB Controlled Vocabulary: function, cell and tissue type, developmental stage and experimental type.
CBIL Controlled Vocabulary: terms for human anatomy.
Japan Bio-Ontology Committee: including the Signal Transduction Ontology.