Ontologically Modeling Sample Variables in Gene Expression Data
James Malone
[email protected]
EBI, Cambridge, UK

Overview
• Application Background
• Motivation for ontologies: questions we want to answer
• Methodology
• Ontology and application
• Future work/things we’d like to do

Gene Expression: Archive to Atlas
[Diagram labels: ArrayExpress; AE/GEO acquire; Curation; >250,000 assays; Re-annotate & summarize; >10,000 experiments; Atlas]

Gene Expression Sample Variable Annotations

Annotations                  Archive    Atlas
Species                      330        9
Samples                      238,000    34,650
Annotations on samples       860,700    101,830
Unique sample annotations    37,500     6,600
Assays (Hybridizations)      246,000    30,000
Annotations on assays        569,700    67,000
Unique assay annotations     25,000     4,000

Use Cases
• Query support (e.g., query for 'cancer' and also get ‘leukemia')
• Data visualisation, e.g., presenting an ontology tree to the user of what is in the database
• Data integration by ontology terms, e.g., we assume that 'kidney' in independent studies roughly means the same, so we can count how many kidney samples we have in the database
• Intelligent template generation for different experiment types in submission or data presentation
• Summary level data
• Nonsense detection, e.g., telling us that something marked as cancer cannot be marked as healthy

Questions we want to answer
• Annotations on the data are diverse in nature
• We need to support complex queries which contain semantic information
• E.g., which genes are under-expressed in brain cancer samples in human or mouse?
• If we annotate with 'adenocarcinoma', do we get this data?

Primary Question: Where to place our semantics?
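The query-support use case (a query for 'cancer' should also return samples annotated with 'leukemia' or 'adenocarcinoma') can be sketched in a few lines of Python. This is a minimal illustration only, not the Atlas implementation: the term names come from the slides, but the is-a hierarchy, the sample ids and the function names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical child -> parent (is-a) links. The terms appear in the slides,
# but this particular hierarchy is an assumption made for illustration.
IS_A = {
    "leukemia": "cancer",
    "adenocarcinoma": "cancer",
    "cervical adenocarcinoma": "adenocarcinoma",
    "cancer": "disease",
}

# Hypothetical sample annotations (sample id -> annotated term).
SAMPLES = {
    "sample_1": "leukemia",
    "sample_2": "cervical adenocarcinoma",
    "sample_3": "cancer",
    "sample_4": "kidney",
}


def descendants(term, is_a):
    """Return the term plus every term that is transitively a subclass of it."""
    children = defaultdict(set)
    for child, parent in is_a.items():
        children[parent].add(child)
    found, stack = {term}, [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def query_samples(term, samples, is_a):
    """Return ids of samples annotated with the term or any of its subclasses."""
    wanted = descendants(term, is_a)
    return sorted(sid for sid, annotation in samples.items() if annotation in wanted)
```

Under these assumptions, `query_samples("cancer", SAMPLES, IS_A)` picks up the leukemia and cervical adenocarcinoma samples as well as the one annotated directly with 'cancer'. The hierarchy lives outside the sample table, which is the point of asking where the semantics should be placed.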
[Diagram labels: Atlas/AE; cancer; adenocarcinoma]

Decoupling knowledge from data
[Diagram label: Atlas/AE]

Methodology: Reference vs Application Ontology
• There is debate in the community about the difference; here is our thesis
• A reference ontology describes a knowledge space: an explicitly delineated part of a domain
[Diagram labels: Biomedicine; Human Anatomy; Cell type; GO Process]

Methodology: Reference vs Application Ontology
• An application ontology describes an application or data space: an explicitly delineated part of a domain
• It should consume reference ontologies to meet application needs
[Diagram labels: Biomedicine; Human Anatomy; Cell type; GO Process]

Building the Experimental Factor Ontology
• We consume parts of reference ontologies from the domain
• We construct new classes and relations to answer our use cases
• The aim is reuse of existing resources, shared frameworks and mapping of equivalencies where they exist
[Diagram labels: EFO; Ontology for Biomedical Investigations; Relation Ontology; Disease Ontology; Chemical Entities of Biological Interest (ChEBI); Anatomy Reference Ontology; Various Species Anatomy Ontologies]
[Slide footer date: 5/22/2017]

Identify Upper Level Structure
• We have taken a BFO-lite approach, hiding labels from users for application purposes and sometimes using a different definition
[Diagram labels: information content entity (IAO); site (BFO); processual entity (BFO); material entity (BFO); specifically dependent continuant (BFO)]
Specifically dependent continuant: A continuant [snap:Continuant] that inheres in or is borne by other entities. Every instance of A requires some specific instance of B which must always be the same.
Material property: A property or characteristic of some other entity. For example, the mouse has the colour white.

Adding New Classes @ www.ebi.ac.uk/efo/tools
• We wish to maximise our interoperability
• Submitters and other groups use many ontologies
• Trade-off: being open to their data and preferences vs imposing a more ordered view on semantics
• Our goal: where orthogonality exists, we aim to import only that class. Where it does not, we perform ‘mappings’ in our EFO classes via annotation property references (in a similar way to xrefs)
• E.g., for ChEBI classes, import the ChEBI URI; for ‘cancer’, create an EFO class and add multiple mappings

Creating Class Mappings
• For overlapping ontologies, we aim to create a ‘mapping class’
• Use semi-automated text mining with the “double-metaphone” algorithm
• Perform matching of our values in the database to ontology class labels and definitions
• Also perform mappings from EFO to other ontologies, so that EFO: cancer = NCI: cancer, DO: cancer et al.
• Sanity check the mappings before adding them to the ontology

Keeping Up To Date with External Classes
• A tool automatically updates metadata every release (monthly)
• It uses BioPortal web services to access the latest class URI/ID, definition and synonyms

Detecting Change in External Ontologies
• Bubastis tool for detecting axiomatic changes between two ontologies (in our case, two versions of the same ontology)
• To do: detect annotation property changes
• We also detect missing annotation properties with the Watchman tool (not released yet), presently used mainly for labels

Creating Relations and Equivalent Classes
[Diagram labels: species (human); cell line (HeLa); cell type (epithelial); organism part (cervix); disease (cervical adenocarcinoma)]

Structure for queries

Gene Expression Atlas
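The idea behind "Detecting Change in External Ontologies" (comparing two versions of the same ontology, class by class) can be sketched as a set difference. This is a minimal sketch and not how Bubastis is implemented: it assumes each version has already been reduced to a mapping from class id to a set of axiom strings, and the class ids, axiom strings and the `diff_versions` name are all hypothetical.

```python
def diff_versions(old, new):
    """Compare two ontology versions given as {class_id: set of axiom strings}.

    Returns the classes added, the classes removed, and, for classes present
    in both versions, the axioms each one gained or lost.
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = {}
    for class_id in sorted(set(old) & set(new)):
        gained = new[class_id] - old[class_id]
        lost = old[class_id] - new[class_id]
        if gained or lost:
            changed[class_id] = {"gained": sorted(gained), "lost": sorted(lost)}
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical class ids and axioms, for illustration only.
OLD = {
    "EX:0001": {"SubClassOf EX:0000"},
    "EX:0002": {"SubClassOf EX:0001"},
    "EX:0003": {"SubClassOf EX:0000"},
}
NEW = {
    "EX:0001": {"SubClassOf EX:0000"},
    "EX:0002": {"SubClassOf EX:0004"},
    "EX:0004": {"SubClassOf EX:0001"},
}

REPORT = diff_versions(OLD, NEW)
```

Here `REPORT` lists `EX:0004` as added, `EX:0003` as removed, and `EX:0002` as changed because its parent moved. A report of this shape is what a curator would review before a monthly release; annotation property changes, noted in the slides as still to do, are not covered.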
Linking data to the ontology
[Diagram labels: Assay Table; Sample Table; Ontology Term Table; Database formulated query; OWL Model; Query]

Gene Expression Atlas @ www.ebi.ac.uk/gxa
[Screenshot: query for cell adhesion genes in all ‘organism parts’; ‘View on EFO’]

ArrayExpress Archive @ www.ebi.ac.uk/arrayexpress

Future Work: Linked Data
• Linking data by dereferenceable URI for human and machine
http://www.ebi.ac.uk/gxa/Experiment12345
[Slide footer: Developing an Ontology from the Application Up]

Future Work: RDF Triple Store @ www.ebi.ac.uk/efo/semanticweb/atlas
• Q: Is a SPARQL query against an RDF triple store quicker than a SPARQL query translated into SQL?
[Diagram labels: OWL Ontology; RDFizer; SQL Translation Layer; Atlas Data; RDF Triple Store; SPARQL]

Future Work: Data Integration
• Consuming reference ontologies and mapping to multiple ontologies where overlap exists offers us maximum interoperability
• The advantage of triple stores is not yet immediate
• Impetus required: “should we champion this technology?”
[Diagram labels: Query; Atlas; Amino Acid Ontology; Swiss Prot; RDF triples]

Summary
• We have created a sustainable approach to consuming multiple reference ontologies
• Tooling solutions expedite the process
• We consider EFO to be a ‘view’ of such ontologies for our application needs
• The primary aim of this work is to enable novel research with the experimental data we have
• Specifically, we can answer new questions, integrate across our data resources, and visualise and summarise the data
• Our belief is that describing such data should be the driving force behind ontology development
• Future work will look at linked data and RDF triple stores

Acknowledgements
• Ontology creation: James Malone, Tomasz Adamusiak, Ele Holloway, Helen Parkinson, Jie Zheng (U Penn)
• Ontology Mapping tools and text mining evaluation: Tim Rayner, Holly Zheng, Margus Lukk
• GUI Development: Misha Kapushesky, Pasha Kurnosov, Anna Zhukova, Nikolay Kolesinkov
• External Review and anatomy: Jonathan Bard, Jie Zheng
• ArrayExpress Production Staff
• EBI Rebholz Group (Whatizit text mining tool)
• Many source ontologies for terms and definitions, esp. Disease Ontology, Cell Type Ontology, FMA, NCIT, OBI
• Funders: EC (Gen2Phen, FELICS, MUGEN, EMERALD, ENGAGE, SLING), EMBL, NIH
• Eric Neumann, Joanne Luciano and Alan Ruttenberg
• W3C & HCLS Group: Eric Prud'hommeaux and Scott Marshall
• OBI developers