Life Science Knowledge Collider
Vassil Momtchev (Ontotext)
Sept, 2008

Presentation Outline
• Life Sciences Domain Integration Problems
• Pathway and Interaction Knowledge Base
• Linked Life Data
• LifeSKIM Application to Showcase the Platform

ESTC Sept, 2008

Andy Law’s First Law
“The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.”

The Problem
• The data is supported by different organizations
• The information is highly distributed and redundant
• There are tons of flat file formats with special semantics
• The knowledge is locked in vast data silos
• There are many isolated communities that cannot reach cross-domain understanding

Andy Law’s Second Law
“The second step in developing a new genetic analysis algorithm is to decide how to make the output data file format incompatible with all pre-existing analysis data file input formats.”

Take Your Best Guess

PIKB Overview
• Stands for Pathway and Interaction Knowledge Base (PIKB)
• Interactions in the cell unveil the molecular mechanisms
  – Which molecular function or biological process is affected after the administration of a given drug?
  – How are chemical compounds involved in a specific biological process or disease?
• The work is developed in the context of LarKC and refined with AstraZeneca researchers
• The use case “Semantic Integration for Early Clinical and Drug Development” will be assessed with clinical data from AstraZeneca

LarKC Project
• Giving up 100% correctness:
  – trading quality for size
  – often completeness is not needed
  – sometimes even soundness is not needed
• “Web Scale and Style Reasoning”
[Figure: logic, Semantic Web and IR placed on axes of precision (soundness) and recall (completeness)]

PIKB Objectives
• Easily integrate pathway and interaction data from different sources
• Allow straightforward updates of the information
• Provide scientists with computational support to conceptualize the breadth and depth of relationships between data
• Scale up to billions of statements

PIKB Data Sources
Type of data source | Database name
Gene and gene annotations | Entrez-Gene
Protein sequences | Uniprot
Protein cross references | iProClass
Gene and gene product annotations | GeneOntology
Organisms | NCBI Taxonomy
Molecular interaction and pathways | BioGRID, NCI, Reactome, BioCarta, KEGG, BioCyc

Sometimes we need to ask far more questions efficiently:
• Give all terms more specific than “cell signaling” (e.g., synaptic transmission, transmission of nerve impulse)
• List all subcategories of primates
• Give me all proteins that interact in the nucleus and are annotated with repressor
• Give me all human genes located on chromosome X
• Give me all human proteins associated with the endoplasmic reticulum
• Give me all interactions of a cell division protein that have at least one participant that is a kinase
• List all protein identifiers encoded by gene IL2
• List all articles where protein Interleukin-2 is mentioned
• List all references to a protein encoded by a gene that is annotated with a specific GeneOntology term and located on chromosome X; filter the results for Mammalia organisms

Possible Solutions
• Classical data integration with:
  – data warehouses
  – federation middleware frameworks
  – database middleware technology
• Not really...
  – Mapping works efficiently only on a small scale
  – Different design paradigms can be a real challenge
  – Direct mapping usually does not work
  – No standard way to integrate textual information

Our Approach
• Convert all data sources to an RDF representation (if not already distributed as RDF)
• Collide the data into a scalable semantic repository
• Apply light-weight reasoning to specify formal interpretations of the data (e.g., remove redundancy)
• Derive new implicit knowledge

Try to Visualise It
[Figure: RDF graph of example resources. Nodes: urn:biogrid:Interaction, urn:intact:Interaction, urn:uniprot:Protein, urn:uniprot:FBgn0068575, urn:biogrid:FBgn0068575, urn:uniprot:FBgn00134235, urn:biogrid:FBgn00134235, urn:uniprot:Q709356, urn:uniprot:P104172, urn:biogrid:15904, urn:pubmed:15904, urn:intact:1007. Edge labels: rdf:type, rdf:seeAlso, sameAs, hasParticipant, interactsWith.]
• Resolve the syntactic differences in the identifiers
• Use relationships to derive new implicit knowledge
These are only example resource names.

Database | Dataset | Schema | Description
Uniprot | Curated entries | Original by the provider | Protein sequences and annotations
Entrez-Gene | Complete | Custom RDF schema | Genes and annotation
iProClass | Complete | Custom RDF schema | Protein cross-references
Gene Ontology | Complete | Schema by the provider | Gene and gene product annotation thesaurus
BioGRID | Complete | BioPAX 2.0 (custom generated) | Protein interactions extracted from the literature
NCI - Pathway Interaction Database | Complete | BioPAX 2.0 (original by the provider) | Human pathway interaction database
The Cancer Cell Map | Complete | BioPAX 2.0 (original by the provider) | Cancer pathways database
Reactome | Complete | BioPAX 2.0 (original by the provider) | Human pathways and interactions
BioCarta | Complete | BioPAX 2.0 (original by the provider) | Pathway database
KEGG | Complete | BioPAX 1.0 (original by the provider) | Molecular interaction
BioCyc | Complete | BioPAX 1.0 (original by the provider) | Pathway database
NCBI Taxonomy | Complete | Custom RDF schema | Organisms

Linked Life Data Overview
• Platform to automate the process:
  – Infrastructure to store the data and run inference
  – Transform the structured data sources to RDF
  – Provide a web interface to access the data
• Currently operates over the OWLIM semantic repository
• LinkedLifeData - PIKB statistics:
  – Number of statements: 1,159,857,602
  – Number of explicit statements: 403,361,589
  – Number of entities: 128,948,564
• Publicly available at: http://www.linkedlifedata.com

LifeSKIM Application
• A platform offering software infrastructure for:
  – automatic semantic annotation of text
  – ontology population
• Store the extracted facts and reason on top of them
• Semantic indexing and retrieval of content
• Query and navigation involving structured knowledge
• Based on Information Extraction (i.e., text-mining) technology

How Does LifeSKIM Search Better?
• LifeSKIM can match a query:
  Documents about interleukin 6 (interferon, beta 2) where it is connected to apoptosis of neutrophils.
• With a document containing:
  … the same effect was not observed for IFNB2, IL-8 and TNFalpha …
  … is induced neutrophil programmed cell death by apoptosis …

How Does LifeSKIM Search Better?
Classical IR could not match:
• interleukin 6 with HGF; HSF; BSF2; IL-6; IFNB2
  Interleukin 6 is an entity in Entrez-Gene with GeneID: 3569, and HGF; HSF; BSF2; IL-6; IFNB2 are aliases for the same gene entity.
• apoptosis of neutrophils with neutrophil apoptosis; programmed cell death of neutrophils by apoptosis; programmed cell death, neutrophils; neutrophil programmed cell death by apoptosis
  The GeneOntology thesaurus includes the above terms as part of the apoptosis of neutrophils term.

Semantic Annotation Example

Thanks
AstraZeneca
• Bosse Andersson
• Elisabet Söderhielm
• Kaushal Desai
Ontotext
• Deyan Peychev
• Georgi Georgiev
• OWLIM team
• KIM team

The development of PIKB and Linked Life Data is partially funded by FP7 215535 LarKC.
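
Illustration: alias-expanded matching
The LifeSKIM search slides argue that a query for interleukin 6 and apoptosis of neutrophils should retrieve a document that only mentions IFNB2 and a GeneOntology synonym. The following is a minimal Python sketch of that idea, not LifeSKIM's implementation (which, per the slides, is based on information extraction). The names THESAURUS, surface_forms, mentions and matches_query are hypothetical, the thesaurus is hand-built from the aliases quoted in the slides, and plain substring matching is assumed for brevity.

```python
# Hypothetical, hand-built thesaurus: the entries are copied from the slides.
# A real system would load them from Entrez-Gene and the GeneOntology thesaurus.
THESAURUS = {
    "interleukin 6": ["HGF", "HSF", "BSF2", "IL-6", "IFNB2"],
    "apoptosis of neutrophils": [
        "neutrophil apoptosis",
        "programmed cell death of neutrophils by apoptosis",
        "programmed cell death, neutrophils",
        "neutrophil programmed cell death by apoptosis",
    ],
}


def surface_forms(concept):
    """Return the concept name plus all of its known aliases, lower-cased."""
    return [concept.lower()] + [alias.lower() for alias in THESAURUS.get(concept, [])]


def mentions(concept, text):
    """True if the text contains the concept name or any alias (substring match)."""
    lowered = text.lower()
    return any(form in lowered for form in surface_forms(concept))


def matches_query(concepts, text):
    """True if every query concept is mentioned in the text under some name."""
    return all(mentions(concept, text) for concept in concepts)


# Document fragments quoted in the slides, joined into one string.
DOCUMENT = (
    "the same effect was not observed for IFNB2, IL-8 and TNFalpha ... "
    "is induced neutrophil programmed cell death by apoptosis"
)
QUERY = ["interleukin 6", "apoptosis of neutrophils"]

# Plain keyword search: the literal query terms never occur in the document.
keyword_hit = all(term in DOCUMENT.lower() for term in QUERY)
# Alias-expanded search: IFNB2 and the GeneOntology synonym both match.
expanded_hit = matches_query(QUERY, DOCUMENT)
print(keyword_hit, expanded_hit)
```

Substring matching is deliberately crude: a short alias such as HSF could match inside an unrelated longer token, which is one reason an entity-aware annotation step is preferable to string lookup.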