What do developers need to know about ontologies?
Barry Smith
http://ontology.buffalo.edu/smith

HL7 Watch (blog)
Microsoft HealthVault: Allergic Episode is_a Health Record Item
Health Record Item =def. A single piece of data in a health record that is accessible through the HealthVault service

Problem of ensuring sensible cooperation in a massively interdisciplinary community
concept, type, instance, model, representation, data

What do these mean?
‘conceptual data model’
‘semantic knowledge model’
‘reference information model’

You’re interested in which genes control heart muscle development
17,536 results

Microarray data shows changed expression of thousands of genes. How will you spot the patterns?
[Figure: clustered microarray heat map (Pearson tree), attacked vs. control over time. Cluster labels: defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes, puparial adhesion, molting cycle, hemocyanin, amino acid catabolism, lipid metabolism, peptidase activity, protein catabolism.]

How will you find the data you need?
Lab / pathology data
EHR data
Clinical trial data
Family history data
Medical image data
Microarray data
Model organism data
Flow cytometry
Mass spec
Genotype / SNP data

How will you compare the data? How will you integrate the data?
− Human
− Mouse
− Rat
− Fish
− Yeast
− E. coli

The GO Idea
[Diagram: the databases GlyProt, MouseEcotope, DiabetInGene and GluChem shown around the term ‘sphingolipid transporter activity’, and again around the term ‘Holliday junction helicase complex’]
:. annotation using common ontologies yields integration of databases
• For this to work, ontologies cannot be allowed to proliferate uncontrollably
• Rather, we need as far as possible non-overlapping ontology modules (OBO Foundry)
• How should we build these modules in such a way as to ensure glue-ability of annotations?
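The claim that annotation with common ontology terms yields database integration can be illustrated with a minimal Python sketch. The database names and the two terms are taken from the slide; the record identifiers, the assignment of terms to records, and the function names are assumptions made up for illustration only.

```python
from collections import defaultdict

# Hypothetical annotation records: (database, record id, ontology term).
# Database and term names come from the slide; record ids and the pairing
# of records with terms are invented for this sketch.
ANNOTATIONS = [
    ("GlyProt", "gp-001", "sphingolipid transporter activity"),
    ("MouseEcotope", "me-017", "sphingolipid transporter activity"),
    ("DiabetInGene", "dg-042", "Holliday junction helicase complex"),
    ("GluChem", "gc-007", "Holliday junction helicase complex"),
    ("GluChem", "gc-009", "sphingolipid transporter activity"),
]


def index_by_term(annotations):
    """Group records from all databases under the ontology term they share."""
    index = defaultdict(list)
    for database, record_id, term in annotations:
        index[term].append((database, record_id))
    return dict(index)


def databases_for(index, term):
    """Return the sorted names of databases with records annotated by term."""
    return sorted({database for database, _ in index.get(term, [])})


INDEX = index_by_term(ANNOTATIONS)
print(databases_for(INDEX, "sphingolipid transporter activity"))
```

The join only works because every database uses the same term for the same type; if each group coined its own label, a mapping step would be needed before the records could be combined.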
Glue-ability / integration
• rests on the existence of a common benchmark called ‘reality’
• the ontologies we want to glue together are representations of what exists in the world
• not of what exists in the heads of different groups of people

two kinds of annotations
names of types
names of instances

First basic distinction
type vs. instance
(science text vs. diary)
(human being vs. Tom Cruise)

For ontologies it is generalizations that are important = ontologies are about types, kinds, universals

[Diagram: Ontology, types, instances]

Ontology = A Representation of types

An ontology is a representation of types
We learn about types in reality from looking at the results of scientific experiments in the form of scientific theories
experiments relate to what is particular
science describes what is general

Inventory vs. Catalog
Two kinds of representational artifact
Very roughly:
Databases represent instances
Ontologies represent types

Catalog vs. inventory
[Table: spreadsheet columns A, B, C]
515287   DC3300 Dust Collector Fan
521683   Gilmer Belt
521682   Motor Drive Belt

Catalog of types/Types

[Diagram: a hierarchy of types (object, organism, animal, mammal, cat, siamese, frog) above their instances]

Ontologies are here
or here

ontologies represent general structures in reality (leg)

Ontologies do not represent concepts in people’s heads
They represent types in reality
which provide the benchmark for integration

Entity =def anything which exists, including things and processes, functions and qualities, beliefs and actions, documents and software (Levels 1, 2 and 3)

what are the kinds of entity?

First basic distinction
type vs. instance
(science text vs. diary)
(human being vs. Tom Cruise)

[Diagram: Ontology, types, instances]

Ontology = A Representation of types

Domain =def a portion of reality that forms the subject-matter of a single science or technology or mode of study or administrative practice ...; e.g. proteomics, HIV, epidemiology

Representation =def an image, idea, map, picture, name or description ... of some entity or entities.

Ontologies are representational artifacts comparable to science texts and subject to the same sorts of constraints (including need for update)

Representational units =def terms, icons, alphanumeric identifiers ... which refer, or are intended to refer, to entities and which are minimal (atoms)

Composite representation =def representation (1) built out of representational units which (2) form a structure that mirrors, or is intended to mirror, the entities in some domain

Analogue representations
no representational units, no ‘atoms’

The Periodic Table

Class =def a maximal collection of particulars determined by a general term (‘cell’, ‘electron’, but also ‘restaurant in Palo Alto’, ‘Italian’)
the class A = the collection of all particulars x for which ‘x is A’ is true

types vs. their extensions
types
{a,b,c,...} collections of particulars

Extension =def The extension of a type A is the class: instance of the type A
(it is the class of A’s instances)
(the class of all entities to which the term ‘A’ applies)

Problem
The same general term can be used to refer both to types and to collections of particulars. Consider:
HIV is an infectious retrovirus
HIV is spreading very rapidly through Asia

types vs. classes
types
{c,d,e,...} classes

types vs. classes
types ~ defined classes

types vs. classes
types
e.g. populations, ...
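The distinction between a type and its extension (the class of its instances) can be made concrete with a small Python sketch. The type names follow the slide's example hierarchy; placing ‘human being’ under ‘mammal’, the instance names other than Tom Cruise, and the function names are assumptions for illustration.

```python
# Types and their is_a parents (a fragment of the slide's example hierarchy;
# "human being" is_a "mammal" is added here for the Tom Cruise example).
IS_A = {
    "siamese": "cat",
    "cat": "mammal",
    "human being": "mammal",
    "mammal": "animal",
    "frog": "animal",
    "animal": "organism",
    "organism": "object",
}

# Instances and the most specific type each one instantiates.
# "Felix" and "Hopper" are hypothetical particulars.
INSTANCE_OF = {
    "Tom Cruise": "human being",
    "Felix": "siamese",
    "Hopper": "frog",
}


def supertypes(type_name):
    """Return type_name followed by all of its ancestors, most specific first."""
    chain = [type_name]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def extension(type_name):
    """Return the class of all instances of type_name (its extension)."""
    return {
        instance
        for instance, direct_type in INSTANCE_OF.items()
        if type_name in supertypes(direct_type)
    }


print(sorted(extension("mammal")))
```

Here the type is a node in `IS_A`, while its extension is a plain set of particulars computed from the data: the ontology holds the former, a database holds the latter.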
Defined class =def a class defined by a general term which does not designate a type
e.g. the class of all diabetic patients in Leipzig on 4 June 1952

OWL is a good representation of defined classes
• sibling of Finnish spy
• member of Abba aged > 50 years
• pizza with > 4 different toppings

Terminology =def. a representational artifact whose representational units are natural language terms (with IDs, synonyms, comments, etc.) which are intended to designate types together with defined classes, with no particular attention to composite representations

types, classes, concepts
types, defined classes, ‘concepts’ ?

types < defined classes < ‘concepts’
‘concepts’ which do not correspond to defined classes:
‘Surgical or other procedure not carried out because of patient's decision’
‘Congenital absent nipple’
because they do not correspond to anything

Gene Ontology: The Very Top
cellular component
molecular function
biological process

Gene Ontology: The Very Top
continuant
  cellular component
  molecular function
occurrent
  biological process

BFO: The Very Top
continuant
  independent continuant
    cellular component
  dependent continuant
    molecular function
occurrent
  biological processes

Basic Formal Ontology
continuant
  independent continuant
    organism
  dependent continuant
occurrent

Basic Formal Ontology
continuant
  independent continuant
    anatomical structure
  dependent continuant
occurrent

Continuants
• continue to exist through time, preserving their identity while undergoing different sorts of changes
• independent continuants – objects, things, ...
• dependent continuants – qualities, attributes, shapes, potentialities ...

Qualities
temperature, blood pressure, mass ...
are continuants
they exist through time while undergoing changes

Qualities
temperature / blood pressure / mass ...
are dimensions of variation within the structure of the entity; a quality is something which can change while its bearer remains one and the same

[Figure: a chart representing how John’s temperature changes]

John’s temperature, the temperature he has throughout his entire life, cycles through different determinate temperatures from one time to the next.
John’s temperature is a physiological variable which, in thus changing, exerts an influence on other physiological variables through time.

BFO: The Very Top
continuant
  independent continuant
  dependent continuant
    quality
      temperature
occurrent

Blinding Flash of the Obvious
types: independent continuant > organism; dependent continuant > quality > temperature
instances: John; John’s temperature

Blinding Flash of the Obvious
types: organism; temperature
instances: John; John’s temperature
John’s temperature inheres_in John

types: temperature
John’s temperature (instance)
instantiates 37ºC at t1
instantiates 37.1ºC at t2
instantiates 37.2ºC at t3
instantiates 37.3ºC at t4
instantiates 37.4ºC at t5
instantiates 37.5ºC at t6

types: human
John (instance)
instantiates embryo at t1
instantiates fetus at t2
instantiates neonate at t3
instantiates infant at t4
instantiates child at t5
instantiates adult at t6

• the lower level of types does not ‘carry identity’ in OntoClean terms
• these types are threshold divisions (hence we do not have sharp boundaries, and we have a certain degree of choice, e.g. in how many subtypes to distinguish, though not in their ordering)

types: independent continuant > organism; dependent continuant > quality > temperature
instances: John; John’s temperature

types: independent continuant > organism; dependent continuant > quality > temperature; occurrent > process > course of temperature changes
instances: John; John’s temperature; John’s temperature history

types: independent continuant > organism; dependent continuant > quality > temperature; occurrent > process > life of an organism
instances: John; John’s temperature; John’s life

BFO/GO: The Very Top
continuant
  independent continuant
    cellular component
  dependent continuant
    molecular function
occurrent
  biological processes

BFO: The Very Top
continuant
  independent continuant
  dependent continuant
    quality
    function
    role
    disposition
occurrent

Function
- of liver: to store glycogen
- of birth canal: to enable transport
- of eye: to see
- of mitochondrion: to produce ATP
not optional; reflection of physical makeup of bearer; can malfunction

Role
optional: exists because the bearer is in some special natural, social, or institutional set of circumstances in which the bearer does not have to be

Role
- bearers can have more than one role: person as student / as staff member
- roles often form systems of mutual dependence:
  husband / wife
  first in queue / last in queue
  doctor / patient
  host / pathogen

Role
of some chemical compound: to serve as analyte in an experiment
of a dose of penicillin in this human child: to treat a disease
of this bacterium in a primary host: to cause infection

Qualities are categorical features of reality – you just have them
Functions, roles and dispositions are potential features of reality: they are realizable dependent continuants, realized in certain associated processes
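The contrast between qualities (categorical) and functions, roles and dispositions (realizable in processes) can be sketched as two small Python data classes. This is only an illustrative data model, not BFO's formal axiomatization; the class names, field names and the process label are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Quality:
    """A categorical dependent continuant: its bearer simply has it."""
    label: str
    inheres_in: str  # the independent continuant that bears the quality


@dataclass
class Realizable:
    """A function, role or disposition: a potential realized in processes."""
    label: str
    inheres_in: str
    kind: str  # "function", "role" or "disposition"
    realized_in: list = field(default_factory=list)

    def realize(self, process: str) -> None:
        """Record a process in which this realizable is realized."""
        self.realized_in.append(process)

    @property
    def is_realized(self) -> bool:
        return bool(self.realized_in)


temperature = Quality("John's temperature", inheres_in="John")
storing = Realizable("to store glycogen", inheres_in="liver", kind="function")
treating = Realizable("to treat a disease", inheres_in="dose of penicillin", kind="role")

# Hypothetical process label: the role exists before and apart from it.
treating.realize("administration of the dose to this human child")
print(storing.is_realized, treating.is_realized)
```

A realizable can exist without ever being realized (an untaken dose still has its role), which is why `realized_in` starts empty, whereas a quality has no such slot at all.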
types: independent continuant > portion of chemical compound; dependent continuant > role > drug role; occurrent > process > process of drug administration
instances: this portion of aspirin; role of this portion of aspirin; John’s taking this portion of aspirin

types: independent continuant > portion of chemical compound; dependent continuant > role > drug role; occurrent > process > process of drug administration
instances: this portion of aspirin; role of this portion of aspirin; John’s taking this portion of aspirin
role of this portion of aspirin inheres_in this portion of aspirin
role of this portion of aspirin realized_in John’s taking this portion of aspirin

The Open Biomedical Ontologies (OBO) Foundry
Columns give RELATION TO TIME (CONTINUANT: INDEPENDENT and DEPENDENT; OCCURRENT); rows give GRANULARITY.
ORGAN AND ORGANISM
  Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
  Independent continuant: Cell (CL); Cellular Component (FMA, GO)
  Dependent continuant: Cellular Function (GO)
MOLECULE
  Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  Dependent continuant: Molecular Function (GO)
  Occurrent: Molecular Process (GO)

The Road to Convergence
All ontologies for each given domain (anatomy, chemistry…)
• should be part of a single suite of interoperable ontologies
• should use a common top-level core
• for subdomains with many variants, should follow the strategy of canonical ontologies with extensions
• should require acceptance of common, tested guidelines by all subscribing ontology developers

initial OBO Foundry coverage, ontologies automatically semantically coupled
Columns give RELATION TO TIME (CONTINUANT: INDEPENDENT and DEPENDENT; OCCURRENT); rows give GRANULARITY.
ORGAN AND ORGANISM
  Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  Occurrent: Organism-Level Process (GO)
CELL AND CELLULAR COMPONENT
  Independent continuant: Cell (CL); Cellular Component (FMA, GO)
  Dependent continuant: Cellular Function (GO)
  Occurrent: Cellular Process (GO)
MOLECULE
  Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  Dependent continuant: Molecular Function (GO)
  Occurrent: Molecular Process (GO)

Disposition (Internally-Grounded Realizable Entity)
disposition =def. a realizable entity which is such that, if it ceases to exist, then its bearer is physically changed, and whose realization occurs when this bearer is in some special physical circumstances, in virtue of the bearer’s physical make-up

Function
• A Disposition (Internally-Grounded Realizable Entity) that is designed or selected for

OGMS
• Ontology for General Medical Science
http://code.google.com/p/ogms

Physical Disorder
– independent continuant
  fiat object part

Big Picture

A disease is a disposition rooted in a physical disorder in the organism and realized in pathological processes.
[Diagram: etiological process produces disorder; disorder bears disposition; disposition realized_in pathological process; pathological process produces abnormal bodily features; abnormal bodily features recognized_as signs & symptoms; signs & symptoms used_in interpretive process; interpretive process produces diagnosis]

Elucidation of Primitive Terms
• ‘bodily feature’ - an abbreviation for a physical component, a bodily quality, or a bodily process.
• disposition - an attribute describing the propensity to initiate certain specific sorts of processes when certain conditions are satisfied.
• clinically abnormal - some bodily feature that
  – (1) is not part of the life plan for an organism of the relevant type (unlike aging or pregnancy),
  – (2) is causally linked to an elevated risk either of pain or other feelings of illness, or of death or dysfunction, and
  – (3) is such that the elevated risk exceeds a certain threshold level.*
*Compare: baldness

Definitions - Foundational Terms
• Disorder =def.
  – A causally linked combination of physical components that is clinically abnormal.
• Pathological Process =def.
  – A bodily process that is a manifestation of a disorder and is clinically abnormal.
• Disease =def.
  – A disposition (i) to undergo pathological processes that (ii) exists in an organism because of one or more disorders in that organism.

Dispositions and Predispositions
• All diseases are dispositions; not all dispositions are diseases.
• A predisposition is a disposition.
• Predisposition to Disease of Type X =def.
  – A disposition in an organism that constitutes an increased risk of the organism’s subsequently developing the disease X.
• HNPCC is caused by a
  – disorder (mutation) in a DNA mismatch repair gene that
  – disposes to the acquisition of additional mutations from defective DNA repair processes, and thus is a
  – predisposition to the development of colon cancer.

Cirrhosis - environmental exposure
• Etiological process - phenobarbital-induced hepatic cell death – produces
• Disorder - necrotic liver – bears
• Disposition (disease) - cirrhosis – realized_in
• Pathological process - abnormal tissue repair with cell proliferation and fibrosis that exceed a certain threshold; hypoxia-induced cell death – produces
• Abnormal bodily features – recognized_as
• Symptoms - fatigue, anorexia
• Signs - jaundice, splenomegaly
Symptoms & Signs used_in
Interpretive process produces
Hypothesis - rule out cirrhosis suggests
Laboratory tests produces
Test results - elevated liver enzymes in serum used_in
Interpretive process produces
Result - diagnosis that patient X has a disorder that bears the disease cirrhosis

Influenza - infectious
• Etiological process - infection of airway epithelial cells with influenza virus – produces
• Disorder - viable cells with influenza virus – bears
• Disposition (disease) - flu – realized_in
• Pathological process - acute inflammation – produces
• Abnormal bodily features – recognized_as
• Symptoms - weakness, dizziness
• Signs - fever
Symptoms & Signs used_in
Interpretive process produces
Hypothesis - rule out influenza suggests
Laboratory tests produces
Test results - elevated serum antibody titers used_in
Interpretive process produces
Result - diagnosis that patient X has a disorder that bears the disease flu
But the disorder also induces normal physiological processes (immune response) that can result in the elimination of the disorder (transient disease course).
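The disease model above is essentially a chain of typed relations, which can be written down as subject-relation-object triples and traversed. The following Python sketch encodes the influenza example in that way; it simplifies the diagnostic part of the slide (hypothesis, laboratory tests and test results are collapsed into a single interpretive process), and the function name `trace` is an assumption.

```python
# The influenza example as (subject, relation, object) triples.
# Labels follow the slide; the diagnostic steps are simplified.
INFLUENZA = [
    ("infection of airway epithelial cells with influenza virus", "produces",
     "viable cells with influenza virus"),           # etiological process -> disorder
    ("viable cells with influenza virus", "bears", "flu"),  # disorder -> disposition
    ("flu", "realized_in", "acute inflammation"),    # disposition -> pathological process
    ("acute inflammation", "produces", "abnormal bodily features"),
    ("abnormal bodily features", "recognized_as", "signs and symptoms"),
    ("signs and symptoms", "used_in", "interpretive process"),
    ("interpretive process", "produces", "diagnosis"),
]


def trace(triples, start):
    """Follow the chain of relations from start until no further step exists."""
    outgoing = {subject: (relation, obj) for subject, relation, obj in triples}
    path, node, seen = [start], start, {start}
    while node in outgoing:
        relation, node = outgoing[node]
        if node in seen:  # guard against accidental cycles in the data
            break
        seen.add(node)
        path += [relation, node]
    return path


PATH = trace(INFLUENZA, INFLUENZA[0][0])
print(" ".join(PATH[1::2]))  # the relations, in order
```

Keeping the disease (‘flu’) as a separate node between the disorder and the pathological process mirrors the definition given above: the disease is the disposition, not the disorder that bears it and not the process that realizes it.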
Huntington’s Disease - genetic
• Etiological process - inheritance of >39 CAG repeats in the HTT gene – produces
• Disorder - chromosome 4 with abnormal mHTT – bears
• Disposition (disease) - Huntington’s disease – realized_in
• Pathological process - accumulation of mHTT protein fragments, abnormal transcription regulation, neuronal cell death in striatum – produces
• Abnormal bodily features – recognized_as
• Symptoms - anxiety, depression
• Signs - difficulties in speaking and swallowing
Symptoms & Signs used_in
Interpretive process produces
Hypothesis - rule out Huntington’s suggests
Laboratory tests produces
Test results - molecular detection of the HTT gene with >39 CAG repeats used_in
Interpretive process produces
Result - diagnosis that patient X has a disorder that bears the disease Huntington’s disease

HNPCC - genetic pre-disposition
• Etiological process - inheritance of a mutant mismatch repair gene – produces
• Disorder - chromosome 3 with abnormal hMLH1 – bears
• Disposition (disease) - Lynch syndrome – realized_in
• Pathological process - abnormal repair of DNA mismatches – produces
• Disorder - mutations in proto-oncogenes and tumor suppressor genes with microsatellite repeats (e.g. TGF-beta R2) – bears
• Disposition (disease) - non-polyposis colon cancer – realized_in
• Symptoms (including pain)

The OBO Foundry Initiative

A good solution to the data integration problem must be:
• modular
• incremental
• bottom-up
• evidence-based
• revisable
and must incorporate a strategy for motivating potential developers and users

GO is amazingly successful, but covers only three sorts of biological entities:
– cellular components
– molecular functions
– biological processes
and does not provide representations of disease-related phenomena

The Open Biomedical Ontologies (OBO) Foundry
Columns give RELATION TO TIME (CONTINUANT: INDEPENDENT and DEPENDENT; OCCURRENT); rows give GRANULARITY.
ORGAN AND ORGANISM
  Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
  Independent continuant: Cell (CL); Cellular Component (FMA, GO)
  Dependent continuant: Cellular Function (GO)
MOLECULE
  Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  Dependent continuant: Molecular Function (GO)
  Occurrent: Molecular Process (GO)

OBO Foundry provides
• tested guidelines enabling new groups to develop the ontologies they need in ways which counteract forking and dispersion of effort
• an incremental bottom-up approach to evidence-based terminology practices in medicine that is rooted in basic biology
• automatic web-based linkage between medical terminologies and biological knowledge resources
• traffic laws and traffic police

The strategy: establish common rules governing best practices for creating ontologies in coordinated fashion, with an evidence-based pathway to incremental improvement

The methodology of cross-products
compound terms in ontologies are to be defined as cross-products of simpler terms:
E.g. elevated blood glucose is a cross-product of PATO: increased concentration with FMA: blood and ChEBI: glucose.
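The cross-product idea can be sketched in Python as a compound term that stores no meaning of its own beyond references to simpler terms in other ontologies. The ontology prefixes and labels are those of the slide's example; the class and method names are assumptions, and real cross-product definitions would also state the relations that link the components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A simple term owned by exactly one ontology."""
    ontology: str
    label: str


@dataclass(frozen=True)
class CrossProduct:
    """A compound term defined purely by reference to simpler terms."""
    label: str
    components: tuple  # tuple of Term

    def ontologies(self):
        """Return the sorted names of the ontologies the definition draws on."""
        return sorted({term.ontology for term in self.components})


ELEVATED_BLOOD_GLUCOSE = CrossProduct(
    label="elevated blood glucose",
    components=(
        Term("PATO", "increased concentration"),
        Term("FMA", "blood"),
        Term("ChEBI", "glucose"),
    ),
)

print(ELEVATED_BLOOD_GLUCOSE.ontologies())
```

Because each component lives in one discipline-specific ontology, a correction to the definition of ‘glucose’ in ChEBI carries over to every compound term that refers to it, with no mapping to maintain.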
This amounts to factoring ontologies out into discipline-specific modules (orthogonality).

The methodology of cross-products
enforcing use of common relations in linking terms drawn from Foundry ontologies serves
• to ensure that the ontologies are maintained and revised in tandem
• logically defined relations serve to bind terms in different ontologies together to create a network

CRITERIA
openness
common formal language
collaborative development
evidence-based maintenance
identifiers
versioning
textual and formal definitions

Orthogonality = modularity
• one ontology for each domain
• no need for mappings (which are in any case too expensive, too fragile, too difficult to keep up-to-date as mapped ontologies change)
• everyone knows where to look to find out how to annotate each kind of data

Ontologies and research groups using BFO and RO
– OBO Foundry (60 biomedical ontologies, including GO, OBI, Protein Ontology, Cell Ontology, IDO, …)
– National Cancer Institute (BiomedGT)
– NIF (NIH Neuroscience Information Framework)
– Cleveland Clinic Semantic Database
– Siemens
– AstraZeneca
– EU (ACGT Cancer Ontology, RAPS, …)

Because the ontologies in the Foundry are built as orthogonal modules which form an incrementally evolving network
• scientists are motivated to commit to developing ontologies because they will need in their own work ontologies that fit into this network
• users are motivated by the assurance that the ontologies they turn to are maintained by experts

More benefits of orthogonality
• helps those new to ontology to find what they need
• to find models of good practice
• ensures mutual consistency of ontologies (trivially)
• and thereby ensures additivity of annotations

More benefits of orthogonality
• it rules out the sorts of simplification and partiality which may be acceptable under more pluralistic regimes
• thereby brings an obligation on the part of ontology developers to commit to scientific accuracy and domain-completeness

More criteria of a successful standard
1. intelligibility to users (consistent use of terms like ‘term’, ‘class’, ‘entity’, ‘object’ …)
2. track record of lessons learned (GO has 10 years of hard user testing)
3. lots of existing users (ontologies are like telephone networks)

COMMON ARCHITECTURE
The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the Basic Formal Ontology (BFO) including the Relation Ontology (RO)
http://ifomis.org/bfo
http://www.obofoundry.org/ro/

OBO Foundry Modular Organization
Levels shown: top level, mid-level, domain level
Ontologies shown:
Basic Formal Ontology (BFO)
Ontology for Biomedical Investigations (OBI)
Information Artifact Ontology (IAO)
Spatial Ontology (BSPO)
Anatomy Ontology (FMA*, CARO)
Cell Ontology (CL)
Cellular Component Ontology (FMA*, GO*)
Environment Ontology (EnvO)
Subcellular Anatomy Ontology (SAO)
Sequence Ontology (SO*)
Protein Ontology (PRO*)
Infectious Disease Ontology (IDO*)
Phenotypic Quality Ontology (PaTO)
Biological Process Ontology (GO*)
Molecular Function (GO*)

BFO:continuant
continuant
  independent continuant
    portion of material
    object
    fiat object part
    object aggregate
    object boundary
    site
  dependent continuant
    generically dependent continuant
      information artifact
    specifically dependent continuant
      quality
      realizable entity
        function
        role
        disposition
  spatial region
    0D-region
    1D-region
    2D-region
    3D-region

BFO:occurrent
occurrent
  processual entity
    process
    fiat process part
    process aggregate
    process boundary
    processual context
  spatiotemporal region
    scattered spatiotemporal region
    connected spatiotemporal region
      spatiotemporal instant
      spatiotemporal interval
  temporal region
    scattered temporal region
    connected temporal region
      temporal instant
      temporal interval

Example: The Cell Ontology
[Diagram: OBO Foundry Modular Organization, repeated from above]
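The BFO:continuant branch listed on the slide can be held as a nested mapping and queried for the is_a path from the root to any term. The Python sketch below does this; the nesting is a reading of the flattened slide text and should be checked against the BFO release in use, and the function name `path_to` is an assumption.

```python
# The BFO:continuant branch as a nested dict: each key is a type and its
# value holds that type's subtypes. The nesting is reconstructed from the
# slide text and is an assumption, not an authoritative copy of BFO.
CONTINUANT = {
    "continuant": {
        "independent continuant": {
            "portion of material": {},
            "object": {},
            "fiat object part": {},
            "object aggregate": {},
            "object boundary": {},
            "site": {},
        },
        "dependent continuant": {
            "generically dependent continuant": {"information artifact": {}},
            "specifically dependent continuant": {
                "quality": {},
                "realizable entity": {"function": {}, "role": {}, "disposition": {}},
            },
        },
        "spatial region": {
            "0D-region": {},
            "1D-region": {},
            "2D-region": {},
            "3D-region": {},
        },
    }
}


def path_to(term, tree=CONTINUANT, prefix=()):
    """Return the list of types from the root down to term, or None if absent."""
    for name, subtypes in tree.items():
        here = prefix + (name,)
        if name == term:
            return list(here)
        found = path_to(term, subtypes, here)
        if found:
            return found
    return None


print(" > ".join(path_to("role")))
```

A domain ontology that places its own root terms under one of these nodes inherits the whole path above it, which is what makes annotations from different Foundry ontologies comparable at the top level.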