Life Science Knowledge Collider
Vassil Momtchev
(Ontotext)
Sept, 2008
Presentation Outline
• Life Sciences Domain Integration Problems
• Pathway and Interaction Knowledge Base
• Linked Life Data
• LifeSKIM Application to Showcase the Platform
ESTC
Sept, 2008
Andy Law’s First Law
“The first step in developing a new genetic analysis algorithm is
to decide how to make the input data file format different
from all pre-existing analysis data file formats.”
The problem!
• The data is supported by different organizations
• The information is highly distributed and redundant
• There are tons of flat file formats with special semantics
• The knowledge is locked in vast data silos
• There are many isolated communities that cannot reach cross-domain understanding
Andy Law’s Second Law
“The second step in developing a new genetic analysis algorithm
is to decide how to make the output data file format
incompatible with all pre-existing analysis data file input
formats.”
Take Your Best Guess
PIKB Overview
• Stands for Pathway and Interaction Knowledge Base (PIKB)
• Interactions in the cell unveil the molecular mechanisms
– Which molecular function or biological process is affected after the administration of a given drug?
– What is the involvement of chemical compounds in a specific biological process or disease?
• The work is developed in the context of LARKC and refined with AstraZeneca researchers
• The use case “Semantic Integration for Early Clinical and Drug Development” will be assessed with AstraZeneca clinical data
LARKC Project
• Giving up 100% correctness:
– trading quality for size
– often completeness is not needed
– sometimes even soundness is not needed
• “Web Scale and Style Reasoning”
[Figure: logic, Semantic Web and IR positioned by precision (soundness) and recall (completeness)]
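The trade-off between soundness and completeness can be made concrete with the standard information-retrieval definitions of precision and recall. The sketch below is only an illustration: the function name and the toy answer sets are assumptions and are not taken from LARKC.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned answers.

    precision = correct answers returned / all answers returned  (soundness)
    recall    = correct answers returned / all correct answers   (completeness)
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Convention assumed here: an empty denominator yields 1.0.
    precision = len(hits) / len(returned) if returned else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


# Toy data (hypothetical): four correct answers exist; an approximate
# reasoner returns three answers, one of which is wrong.
correct = {"a", "b", "c", "d"}
approximate = {"a", "b", "x"}
p, r = precision_recall(approximate, correct)
```

Here the approximate reasoner is neither fully sound (one wrong answer) nor fully complete (two correct answers missed), which is the kind of result the slide proposes to accept in exchange for scale.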
PIKB Objectives
• Easily integrate pathway and interaction data from different
sources
• Allow straightforward updates of the information
• Provide scientists with computational support to conceptualize the breadth and depth of relationships between data
• Scale up to billions of statements
PIKB Data Sources
Type of data sources | Database name
Gene and gene annotations | Entrez-Gene
Protein sequences | Uniprot
Protein cross references | iProClass
Gene and gene product annotations | GeneOntology
Organisms | NCBI Taxonomy
Molecular interaction and pathways | BioGRID, NCI, Reactome, BioCarta, KEGG, BioCyc
Sometimes we need to ask far more questions efficiently:
• Give all terms more specific than “cell signaling” (e.g., synaptic transmission, transmission of nerve impulse)
• List all primates sub categories?
• Give me all human genes which are located in X chromosome?
• Give me all human proteins associated with endoplasmic reticulum?
• Give me all interactions of cell division protein kinase?
• List me all cross references to a protein Interleukin-2?
• List all protein identifiers encoded by gene IL2?
• List all articles where protein Interleukin-2 is mentioned?
• Give me all proteins which interact in nucleus and are annotated with repressor and have at least one participant that is encoded by gene annotated with specific GeneOntology term and is located in chromosome X? Filter the results for Mammalia organisms!
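A question such as "give all terms more specific than cell signaling" amounts to a transitive walk down a term hierarchy. The sketch below shows that walk on a toy hierarchy; the edges are a hypothetical arrangement for illustration and do not reproduce the real GeneOntology structure, and the function name is an assumption.

```python
from collections import defaultdict

# Toy (child, parent) edges; a hypothetical arrangement, NOT the real ontology.
EDGES = [
    ("cell-cell signaling", "cell signaling"),
    ("synaptic transmission", "cell-cell signaling"),
    ("transmission of nerve impulse", "cell-cell signaling"),
    ("cell signaling", "biological process"),
]


def more_specific_terms(term, edges):
    """Return every term reachable from `term` by following child links."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)

    found, stack = set(), [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:  # guards against cycles and repeats
                found.add(child)
                stack.append(child)
    return found


result = more_specific_terms("cell signaling", EDGES)
```

The result contains both direct and indirect descendants, which is what a flat-file lookup on the parent term alone would miss.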
Possible Solutions
• Classical data-integration with:
– data warehouses
– federation middleware frameworks
– database middleware technology
• Not really...
– Mapping works efficiently on a small scale
– Different design paradigm can be a real challenge
– Direct mapping usually does not work
– No standard way to integrate textual information
Our Approach
• Convert all data sources to RDF representation (if not already distributed as RDF)
• Collide the data into a scalable semantic repository
• Apply light-weight reasoning to specify formal interpretations of the data (e.g., remove redundancy)
• Derive new implicit knowledge
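The first step, converting a flat file to RDF, can be sketched as follows. This is a minimal illustration under stated assumptions: the column names, the `http://example.org/` namespace and the predicate names are hypothetical, and the sketch does not reflect the custom RDF schemas actually used in PIKB.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace, for illustration only

# Toy tab-delimited input; the column layout is an assumption.
FLAT_FILE = "gene_id\tsymbol\ttaxon\n3569\tIL6\t9606\n"


def rows_to_ntriples(text):
    """Turn each row of a tab-delimited file into N-Triples statements."""
    triples = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        subject = f"<{BASE}gene/{row['gene_id']}>"
        symbol = row["symbol"]
        taxon = row["taxon"]
        # A real converter must also escape quotes and backslashes in literals.
        triples.append(f'{subject} <{BASE}symbol> "{symbol}" .')
        triples.append(f"{subject} <{BASE}taxon> <{BASE}taxon/{taxon}> .")
    return triples


statements = rows_to_ntriples(FLAT_FILE)
```

Once every source is expressed as statements of this shape, loading them into one repository is a matter of concatenation; the integration work moves to linking identifiers and reasoning over the result.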
Try to Visualise it
[Figure: RDF graph of example resources from BioGRID, UniProt, IntAct and PubMed (e.g., urn:biogrid:15904, urn:uniprot:Q709356, urn:intact:1007, urn:pubmed:15904), typed as urn:biogrid:Interaction, urn:intact:Interaction or urn:uniprot:Protein and linked by rdf:type, sameAs, rdf:seeAlso, hasParticipant and interactsWith edges]
• Resolve the syntactic differences in the identifiers
• Use relationships to derive new implicit knowledge
These are only example resource names
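The two ideas in this diagram, resolving identifier differences through sameAs and deriving interactsWith from shared participation, can be sketched in a few lines. The resource names reuse the slide's example identifiers, but the specific edges between them are assumptions, and the rule shown is an illustrative simplification rather than the rule set used in the platform.

```python
from itertools import combinations

# Example statements; the edges are assumed for illustration.
TRIPLES = [
    ("urn:biogrid:FBgn0068575", "sameAs", "urn:uniprot:FBgn0068575"),
    ("urn:biogrid:FBgn00134235", "sameAs", "urn:uniprot:FBgn00134235"),
    ("urn:biogrid:15904", "hasParticipant", "urn:biogrid:FBgn0068575"),
    ("urn:biogrid:15904", "hasParticipant", "urn:uniprot:FBgn00134235"),
]


def canonical_map(triples):
    """Build a union-find over sameAs links; return a lookup function."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "sameAs":
            a, b = sorted((find(s), find(o)))
            parent[b] = a  # smallest identifier becomes the representative
    return find


def derive_interactions(triples):
    """Participants of the same interaction are taken to interact."""
    find = canonical_map(triples)
    participants = {}
    for s, p, o in triples:
        if p == "hasParticipant":
            participants.setdefault(find(s), set()).add(find(o))
    derived = set()
    for members in participants.values():
        # interactsWith is treated as symmetric; emit each pair once.
        for a, b in combinations(sorted(members), 2):
            derived.add((a, "interactsWith", b))
    return derived


derived = derive_interactions(TRIPLES)
```

Although one participant is named with a BioGRID identifier and the other with a UniProt identifier, both are mapped to one representative per entity before the interaction is derived.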
Database | Dataset | Schema | Description
Uniprot | Curated entries | Original by the provider | Protein sequences and annotations
Entrez-Gene | Complete | Custom RDF schema | Genes and annotation
iProClass | Complete | Custom RDF schema | Protein cross-references
Gene Ontology | Complete | Schema by the provider | Gene and gene product annotation thesaurus
BioGRID | Complete | BioPAX 2.0 (custom generated) | Protein interactions extracted from the literature
NCI - Pathway Interaction Database | Complete | BioPAX 2.0 (original by the provider) | Human pathway interaction database
The Cancer Cell Map | Complete | BioPAX 2.0 (original by the provider) | Cancer pathways database
Reactome | Complete | BioPAX 2.0 (original by the provider) | Human pathways and interactions
BioCarta | Complete | BioPAX 2.0 (original by the provider) | Pathway database
KEGG | Complete | BioPAX 1.0 (original by the provider) | Molecular Interaction
BioCyc | Complete | BioPAX 1.0 (original by the provider) | Pathway database
NCBI Taxonomy | Complete | Custom RDF schema | Organisms
Linked Life Data Overview
• Platform to automate the process:
– Infrastructure to store the data and perform inference
– Transform the structured data sources to RDF
– Provide web interface to access the data
• Currently operates over OWLIM semantic repository
• LinkedLifeData - PIKB statistics:
– Number of statements: 1,159,857,602
– Number of explicit statements: 403,361,589
– Number of entities: 128,948,564
• Publicly available at: http://www.linkedlifedata.com
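The statistics above allow a quick back-of-the-envelope check of how much the repository grows through inference. The calculation assumes that every statement not counted as explicit was produced by reasoning; the slide does not state this directly.

```python
TOTAL_STATEMENTS = 1_159_857_602
EXPLICIT_STATEMENTS = 403_361_589

# Assumption: total = explicit + inferred (implicit) statements.
implicit = TOTAL_STATEMENTS - EXPLICIT_STATEMENTS
expansion = TOTAL_STATEMENTS / EXPLICIT_STATEMENTS

print(f"{implicit:,} implicit statements, {expansion:.2f}x expansion")
```

Under that assumption, roughly two statements are inferred for every one loaded.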
LifeSKIM Application
• A platform offering software infrastructure for:
– automatic semantic annotation of text
– ontology population
• Store the extracted facts and reason on top of them
• Semantic indexing and retrieval of content
• Query and navigation involving structured knowledge
• Based on Information Extraction (i.e. text-mining) technology
How Does LifeSKIM Search Better?
• LifeSKIM can match a query
Documents about interleukin 6 (interferon, beta 2) where it is connected to apoptosis of neutrophils.
• With a document containing
… the same effect was not observed for IFNB2, IL-8 and TNFalpha … is induced neutrophil programmed cell death by apoptosis …
How Does LifeSKIM Search Better?
The classical IR could not match:
• interleukin 6 with HGF; HSF; BSF2; IL-6; IFNB2
Interleukin 6 is an entity in Entrez-Gene with GeneID: 3569, and HGF; HSF; BSF2; IL-6; IFNB2 are aliases for the same gene entity.
• apoptosis of neutrophils with neutrophil apoptosis; programmed cell death of neutrophils by apoptosis; programmed cell death, neutrophils; neutrophil programmed cell death by apoptosis
The GeneOntology thesaurus lists the above terms as synonyms of the apoptosis of neutrophils term.
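The difference from plain keyword matching can be illustrated with a small alias-expansion sketch. The aliases are the ones listed on the slide; the dictionary layout, the function names and the regex-based matching are assumptions made for illustration and do not describe LifeSKIM's actual annotation pipeline.

```python
import re

# Aliases as listed on the slide; the data structure is hypothetical.
ALIASES = {
    "interleukin 6": ["HGF", "HSF", "BSF2", "IL-6", "IFNB2"],
    "apoptosis of neutrophils": [
        "neutrophil apoptosis",
        "programmed cell death of neutrophils by apoptosis",
        "programmed cell death, neutrophils",
        "neutrophil programmed cell death by apoptosis",
    ],
}


def mentions(concept, text):
    """True if the concept or any of its aliases occurs as a whole term."""
    for name in [concept] + ALIASES.get(concept, []):
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


def matches(query_concepts, text):
    """A document matches when every queried concept is mentioned."""
    return all(mentions(concept, text) for concept in query_concepts)


DOC = (
    "the same effect was not observed for IFNB2, IL-8 and TNFalpha ... "
    "induced neutrophil programmed cell death by apoptosis"
)
QUERY = ["interleukin 6", "apoptosis of neutrophils"]

semantic_hit = matches(QUERY, DOC)
keyword_hit = all(term in DOC.lower() for term in QUERY)
```

With alias expansion the document is found through IFNB2 and the GeneOntology synonym, while the literal keyword test fails on both concepts.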
Semantic Annotation Example
Thanks
AstraZeneca
• Bosse Andersson
• Elisabet Söderhielm
• Kaushal Desai
Ontotext
• Deyan Peychev
• Georgi Georgiev
• OWLIM team
• KIM team
The development of PIKB and Linked Life Data is partially funded by FP7 215535 LarKC