Lilly Singapore Centre for Drug Discovery
Scaling the walls of discovery:
using semantic metadata for
integrative problem solving
Greg Tucker-Kellogg, Ph.D.
Chief Technology Officer
Senior Director, Systems Biology
Lilly Singapore Centre for Drug Discovery
LSCDD
Outline
The Challenge of Translational Discovery
in Pharmaceutical Research
Integration of Metadata using Semantic
Web Technologies
•Why focus on metadata?
•How it helps
Examples
Lilly Singapore Centre for Drug Discovery
Oncology and diabetes research
towards tailored therapy to improve
patient outcome
[Diagram: LSCDD at the centre of three groups: Drug Discovery (drug candidates), Systems Biology (biomarkers), and Computational Sciences (tools). Labels: Experimental, Wet lab biology, Integrative, Computational.]
Pharmaceutical R&D spends more to
get less
Lost in translation
The limits of my language mean the limits of my world (Ludwig Wittgenstein)
Translate
我的语言限制的范围是我的 (Ludwig Wittgenstein)
Translate
I limit the scope of the language of my I
Translational research in cancer:
Connecting the dots of genetic aberrations
Targets
Pathways
Disease
Patients
Tailored Therapeutics
Improve individual patient outcomes and health outcome
predictability through tailoring drug, dose, timing of
treatment, and relevant information
The “Web” of heterogeneous data
Cell/Assay
Technologies
Integrating Scientific Data Sets
Uncontrollable diversity
Most of the valuable data is
from outside our walls
Much of it is poorly
structured
Ranging from large
(1TB/day) to boutique
Scientist’s View of Integrated Information
[Figure: layers of integrated information, color-coded as Strategic, Platforms, and Foundational]
Target-based chemotype profiling
Pathway-based chemotype profiling
Functional chemogenetics
Chemical biology
Cross-domain integration
Domain-level integration
RNAi reagents: Qiagen siRNA, BROAD shRNA, cDNA
High-content bioassays
Biochemical data
Acumen assays
Cellomics assays
Omics
• Protein: IHC, Luminex
• DNA: CGH, SNP, Mutation
• RNA: miRNA, mRNA
• Epigenetics: Methylation, ChIP-Chip
Plate Reader
Interrogators
Reporters
Mapping and annotation backbone
Manual Data Integration
A repeated, tedious process:
• Pull data from internal and public data sets
• Normalize terms and values
• Write and run analysis scripts
• Compile into a single Excel file, detached from the data
source (no drill-down)
Often this process can consume days with no guaranteed
resolution
Integration Approaches Considered
•Data Warehouse
• Difficult to maintain and integrate new data sets
• Difficult to evolve as data changes
• Schemas tightly coupled to applications
•Federated queries
• Query performance issues
• Where to place the index?
• Problematic to maintain
• Translating user search syntax to all sources requires deep knowledge of
data layer
•Semantic Integration
• Relatively unproven in enterprise systems but adaptive to change
• Relationships between data can be more fully characterized
Standard Semantic Integration Model
•All data is mapped to domain
ontology in both directions
•If a single system is down,
results are incomplete.
•Performance is limited by the
slowest system in the network
•Massive mapping effort
•Multiple implementations of
this approach, including:
• Biological and Chemical Integrated
Information System (BACIIS)
• Boeing
[Diagram: Query Generator, Query Planning, Query Submission, Semantic Normalization, Data Set Integration, and Results Presentation components, with Domain Ontology & Mappings, over four Sources]
Can we do better for our purposes?
•Avoid a complex architecture and extended development
effort
•Realize benefits in the near-term
•Preprocess metadata to improve efficiency.
•Characterize the types of questions that the ontology should
answer
•Identify stable semantic technologies; do not employ
parsers.
•Allow semantic and relational databases to work together
What we need
Data Management and Availability
• Capturing and filtering the global and growing avalanche of
internal and external scientific data
Data Fusion
• Systems to link, combine and navigate massive and
heterogeneous data sets
Information Analysis and Mining
• Algorithms and tools to help scientists seek correlations and
find connections between pre-clinical and clinical knowledge
to generate and test translational hypotheses.
Data Architecture
[Diagram: three tiers]
Analysis and Mining: Query, Visualization, Algorithms, Workflow
Centralized Integration Layer: Genomics mapping, Functional Ontology (Proteome/GO), Experiment Context Information, Annotation Services, Vocabulary
Experimental Common Metadata Repository (Genomics mapping + Gene functional info): 30 million triples, 34 platforms
Domain/Platform Specific Data: Experiment Mapping & Context Annotation; Expression (Affy, Agilent, Illumina), aCGH, Methylation, Screening, SNP, Mutation, Tissue Microarray, ChIP-Chip, miRNA; Readout, Derived Results, Analysis Results
LSCDD Data integration process in use
[Diagram: Query; Visualization; Experimental Metadata Repository (Genomics mapping + Gene Function); Annotation Services; data sources: Affy Expression, Agilent Expression, Illumina Expression, aCGH, Screening, RNAi Database, Mutation, SNP, TMA, Analysis Results]
LSCDD Semantic Integration Approach
• Use semantic technology on an appropriate problem
• Create Ontology focused on solving LSCDD integration needs
• Scientists and IT Analysts work together to iteratively create
tailored vocabulary
• Define competency questions to validate the ontology
• Encourage ontology to evolve, a different animal than RDBMS
schemas
• Create bridges to public and internal ontologies to realize the full
capabilities of the vocabulary
• Involve users to verify RDBMS-to-ontology mapping to increase
confidence in the solution.
• SPARQL is hard. Design an intuitive query model or question templates
for users to navigate the repository.
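One way to picture question templates that hide SPARQL from users is the Python sketch below. It is only an illustration: the namespace URI and the template name are hypothetical, and the `Experiment` class and `hasType` property are borrowed from the labels used elsewhere in these slides rather than from the actual LSCDD application.

```python
from string import Template

# Hypothetical namespace; the real LSCDD ontology URI is not given in the slides.
PREFIX = "PREFIX lscdd: <http://example.org/lscdd#>"

# Each question template pairs a plain-language prompt with a SPARQL pattern.
TEMPLATES = {
    "experiments_by_type": {
        "prompt": "Find all experiments of type $exp_type",
        "sparql": Template(
            "SELECT ?experiment WHERE {\n"
            "  ?experiment a lscdd:Experiment ;\n"
            "              lscdd:hasType $exp_type .\n"
            "}"
        ),
    },
}


def escape_literal(value: str) -> str:
    """Quote a user-supplied string as a SPARQL string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_query(template_name: str, **params: str) -> str:
    """Fill a question template with escaped user input."""
    template = TEMPLATES[template_name]["sparql"]
    literals = {key: escape_literal(val) for key, val in params.items()}
    return PREFIX + "\n" + template.substitute(literals)


query = build_query("experiments_by_type", exp_type="SiRNA Screen")
print(query)
```

The user only picks a template and types a value; the query text, prefixes, and escaping stay inside the application.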
LSCDD Semantic Integration Approach
(Cont)
• Used Agile philosophy throughout: application development, ontology
development and mapping effort
• Drive adoption by engaging users to understand their challenges and
refine the solution.
• Technologies
• Protégé Ontology Editor
• Oracle Semantic Technologies 11g
• D2R Map (Database to RDF Mapping Language)
• C# development in Visual Studio 2005
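The relational-to-RDF mapping step can be pictured with a small Python sketch. This is not D2R Map syntax (D2R Map is a declarative mapping language); the rows, the base URI, and the column-to-property mapping below are hypothetical and only show the idea of turning rows into triples.

```python
BASE = "http://example.org/lscdd#"  # hypothetical namespace

# Hypothetical rows as they might come from a relational experiment table.
rows = [
    {"id": "abc123", "contact": "Bill Smith", "type": "SiRNA Screen"},
    {"id": "def456", "contact": "Jane Smith", "type": "SiRNA Screen"},
]

# Assumed column-to-property mapping; property names follow the slides' style.
COLUMN_TO_PROPERTY = {"contact": "hasContact", "type": "hasType"}


def row_to_triples(row):
    """Turn one relational row into (subject, predicate, object) triples."""
    subject = f"{BASE}Experiment/{row['id']}"
    triples = [(subject, "rdf:type", f"{BASE}Experiment")]
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = row.get(column)
        if value is not None:  # skip NULL columns rather than emit empty literals
            triples.append((subject, f"{BASE}{prop}", value))
    return triples


triples = [t for row in rows for t in row_to_triples(row)]
for s, p, o in triples:
    print(s, p, o)
```

Keeping the mapping in a table such as `COLUMN_TO_PROPERTY` means users can review it without reading code, which fits the goal of having them verify the RDBMS-to-ontology mapping.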
Metadata RDF Repository
• Aggregates experiment metadata from a diverse set of LSCDD relational
databases into an Oracle Semantic Technologies repository for LSCDD
scientific investigation.
• Scientists at LSCDD now have a single source of experiment
information described with a common vocabulary.
• Current data sources include:
•Expression Data: Affymetrix, Illumina, Agilent
•aCGH Data
•RNAi Screening Data
•Reagent Data
•Gene Ontology (GO)
•Medical Subject Headings (MeSH)
•Many others
• Currently ~30 million triples
LSCDD Metadata Ontology
[Ontology diagram]
Classes: Project, Study, Experiment, Assay, DiseaseState, Protocol, Plate, Well, Chip, Chip Type, Sample, Treatment, Compound, Reagent, Protein Reagent, DNA Reagent, RNA Reagent, Model, CellLine, Tissue, Gene, GeneList, Probe, ClinicalData, ViralBatch, Software, Hardware, GO, MESH
Relations: hasProject, hasStudy, hasAssay, hasDiseaseState, hasProtocol, hasPlate, hasChip, hasChipType, hasSample, hasTreatment, hasCompound, hasReagent, hasModel, hasCellline, hasTissue, hasSource, hasSourceTissue, hasGene, hasGOId, hasMESHId, IsPartOf, subclass
Metadata Repository Application
• Both browse and query views are provided for repository access.
• The Query View allows the user to search the repository by setting
constraints on attributes of the entities in the ontology.
• Links to external data sets such as Gene Ontology and MeSH have been
defined, so queries may span multiple ontologies.
• The Results View displays details about each of the matches found and
allows the user to navigate across entities.
• The application is created as a plugin to the Lilly Science Grid and can
leverage Integrated Genomics Portal for Cancer Research (IGPCR)
plugins to provide details about Genes in hit lists.
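The idea of searching by constraints on entity attributes, including attributes reached through a link to an external vocabulary, can be sketched in Python over a toy in-memory triple list. The identifiers and the handful of triples are illustrative only, and the property names `type` and `label` are assumptions; the real application runs against the Oracle repository.

```python
# Toy repository of (subject, predicate, object) triples; illustrative data only.
TRIPLES = [
    ("gene:HDAC1", "type", "Gene"),
    ("gene:HDAC1", "hasName", "HDAC1"),
    ("gene:HDAC1", "hasGOId", "go:0004407"),
    ("go:0004407", "label", "histone deacetylase activity"),
    ("gene:TP53", "type", "Gene"),
    ("gene:TP53", "hasName", "TP53"),
    ("gene:TP53", "hasGOId", "go:0003677"),
    ("go:0003677", "label", "DNA binding"),
]


def objects(subject, predicate):
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def _matches(subject, path, text):
    """Follow the property path from subject and test the values reached."""
    frontier = [subject]
    for predicate in path:
        frontier = [o for node in frontier for o in objects(node, predicate)]
    return any(text.lower() in value.lower() for value in frontier)


def find(entity_type, constraints):
    """Return subjects of entity_type that satisfy every constraint.

    constraints maps a property path (a tuple of predicates) to a substring
    that some value reached along that path must contain.
    """
    subjects = [s for s, p, o in TRIPLES if p == "type" and o == entity_type]
    return [
        s for s in subjects
        if all(_matches(s, path, text) for path, text in constraints.items())
    ]


# A constraint that crosses from the Gene entity into the GO vocabulary.
hits = find("Gene", {("hasGOId", "label"): "deacetylase"})
print(hits)
```

A two-step path such as `("hasGOId", "label")` is what lets a single query span the local ontology and an external one.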
Metadata Repository Application
Find all deacetylases involved in Colorectal Neoplasms
[Screenshot callouts: add a filter on a Gene Ontology attribute; add a filter on a MeSH attribute; run the query; the Results View shows a list of Genes; navigate across data links]
Experiment Data Annotation
While raw experiment results are not suitable for editing,
metadata such as experiment descriptions and relations
become more valuable when users augment and refine them.
Experiment
hasId: abc123
hasContact: Bill Smith
hasType: SiRNA Screen
hasDescription: ____
H460 screen: run 789
…
Experiment
hasConflictingResults
hasId: def456
hasContact: Jane Smith
hasType: SiRNA Screen
hasDescription: H460 screen
…
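A minimal Python sketch of this kind of annotation follows. The field values come from the example above; the class layout, the audit trail, and the direction of the `hasConflictingResults` link are assumptions made for illustration, not a description of the LSCDD implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Experiment:
    exp_id: str
    contact: str
    exp_type: str
    description: str = ""
    # Relations to other experiments, e.g. ("hasConflictingResults", "def456").
    relations: list = field(default_factory=list)
    # Assumed audit trail: (timestamp, user, property, old value, new value).
    history: list = field(default_factory=list)

    def annotate(self, user, field_name, new_value):
        """Edit a metadata field and record who changed what."""
        old = getattr(self, field_name)
        setattr(self, field_name, new_value)
        self.history.append(
            (datetime.now(timezone.utc), user, field_name, old, new_value)
        )

    def relate(self, user, predicate, other):
        """Add a relation to another experiment and record who added it."""
        self.relations.append((predicate, other.exp_id))
        self.history.append(
            (datetime.now(timezone.utc), user, predicate, None, other.exp_id)
        )


a = Experiment("abc123", "Bill Smith", "SiRNA Screen")
b = Experiment("def456", "Jane Smith", "SiRNA Screen", "H460 screen")

a.annotate("Bill Smith", "description", "H460 screen: run 789")
a.relate("Bill Smith", "hasConflictingResults", b)
print(a.description, a.relations)
```

Only metadata is touched here; the raw results stay read-only, and the history list keeps each refinement traceable.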
IGPCR: Integrated Genomics Portal for
Cancer Research
An Integrated view for analysis results
Helps oncology researchers with:
•Drug target identification and prioritization
•Biomarker discovery
•Combination therapy
Backup
Answering scientific questions
• Get me all the interactions for methylases that are involved in colorectal cancer. And for all these genes, get the expression and aCGH values for all colon cancer samples.
• What is the status of the target of my interest across multiple tumor types?
• Are there any reagents available to conduct functional validation?
• What are the right model systems to study the perturbation of my gene of interest?
Cancer drug discovery
Integration of high throughput datasets
[Diagram: public and private high-throughput data sets: Tumor Samples, Cell lines, Mutations, CGH / SKY, Expression, Tissue Microarrays, Chemosensitivity, Patient Survival, RNAi]
Going Forward
• Integration with additional external sources: NCBI, KEGG, Proteome, PubMed
• Integration with National Cancer Institute Metathesaurus
• Continued integration with new data types generated internally or from collaborators
• Definition and support of additional ontologies
[Diagram labels: Internal Data, Public Data, Lilly Data, Labs, Collaborators, Web Resources, SnoMed, Stanford Tissue Microarray, PubMed, NCI Metathesaurus, Analysis Pipelines, Visualizers, Integrated Augmented Query Results]
Acknowledgements
LSCDD, Singapore
IT
• Kevin Gao, Rakhi Bhat, Srinivasulu Kota and Maurice Manning
Systems Biology
• Amit Aggarwal and Mahesh Kumar Guzuva Desikan
ICS
• Pat Hartman
HiSoft Technology – Dalian, China
• Bill Yan, Young Gong, Harold Yin, Steven Cao and Jason Wang
Lilly, Indianapolis USA
• Susie Stephens, Jacob Koehler
Backup Slides
Putting it all together…
Objects
Measure
Map 1
Map 2
Compounds
Fingerprint
MTS
Literature
Genes
Expression
Binding
Coding
SNPs
Linkage D
Clinical DB
Images
Signature
Silos Need to Be Broken Down
[Figure: the R&D pipeline drawn as ten separate silos. Stage labels: Exploratory, Target to Hit, Hit to Lead, Lead Optimization, Pre-Clinical Development, Phase I, Phase 2, Phase 3, Registration, Global Launch, under the headings Project, Program, and Product. Milestone labels: PgS, CS, FHD, FED, PD/RD, FS, FA, FL, GL. Each silo repeats the same stack: Data, Transform, Analyze & Mine, Model & Understand, Generate/Test Hypothesis.]
BACIIS System Architecture
Web Interface
• Input user queries and present the query results
Mediator
• Query Generator Module: translate semantics-based user queries into domain-recognized terms through the ontology
• Query Planning and Execution Module
   • Query Planner: decompose the user query into subqueries, define the subquery dependencies, and find the query paths
   • Mapping Engine: map each subquery onto specific data source(s)
   • Execution Engine: receive data-source-specific subqueries and invoke the corresponding wrappers to fetch the data from the remote data sources
• Result Presentation Module: receive the individual result sets from the wrappers, integrate them into HTML format, and send the result pages to the web interface
BACIIS Knowledge Base
• Bio-Chemical Ontology
• Data Source Schema
Wrapper (one per Web Database; three shown)
• Fetch HTML/XML pages from the remote data source and extract the result data
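The mediator flow described on this slide (decompose, map to sources, invoke wrappers, merge) can be sketched in Python as follows. This is a toy illustration, not BACIIS code: the wrapper functions, subquery kinds, and canned values are all hypothetical stand-ins for components that would fetch and parse pages from remote sources.

```python
# Hypothetical wrappers returning canned values instead of fetching remote pages.
def gene_source(subquery):
    data = {"deacetylase": ["HDAC1", "HDAC2"]}
    return data.get(subquery["term"], [])


def expression_source(subquery):
    data = {"HDAC1": 7.2, "HDAC2": 5.9}  # made-up numbers for the sketch
    return {g: data[g] for g in subquery["genes"] if g in data}


# Mapping step: which wrapper serves which kind of subquery.
SOURCE_FOR = {
    "genes_by_function": gene_source,
    "expression_by_gene": expression_source,
}


def plan(user_query):
    """Decompose the query into ordered subqueries; the second depends on the first."""
    return [
        {"kind": "genes_by_function", "term": user_query["function"]},
        {"kind": "expression_by_gene", "genes_from": 0},
    ]


def execute(subqueries):
    """Run subqueries in order, feeding earlier results into dependent ones."""
    results = []
    for sub in subqueries:
        if "genes_from" in sub:  # resolve the dependency on an earlier result
            sub = {**sub, "genes": results[sub["genes_from"]]}
        wrapper = SOURCE_FOR[sub["kind"]]
        results.append(wrapper(sub))
    return results


def answer(user_query):
    """Merge the individual result sets into one integrated answer."""
    genes, expression = execute(plan(user_query))
    return [{"gene": g, "expression": expression.get(g)} for g in genes]


result = answer({"function": "deacetylase"})
print(result)
```

Because every answer passes through all the wrappers in the plan, a slow or unavailable source delays or truncates the whole result, which is the weakness of this model noted earlier in the deck.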
Hybrid Architecture
User Interface
Knowledge-Space
Navigation
List
Management
Presentation
Services
Analytic
Services
Metadata
Repositories
Navigation Service Layer
Data Set Integration Services
Semantic Layer
Query Preparation Service
Semantic Normalization Service
Adaptive Layer
Query Submission Service
Streams Management Service
Metadata Services Layer
Request Brokers
Analysis Entities
Presentation Entities
Persistence Entities
Personalization Entities
Navigational Entities
Federation Entities
Physical Access Layer
Data Access Service Layer
Sources (five shown)
Goals
•Make knowledge emerge from repositories
•Make data more valuable by adding context
•Leverage intellectual assets
•Decision support
•Enhance productivity
•Reduce IT integration efforts