Lilly Singapore Centre for Drug Discovery
Scaling the walls of discovery: using semantic metadata for integrative problem solving
Greg Tucker-Kellogg, Ph.D.
Chief Technology Officer
Senior Director, Systems Biology
Lilly Singapore Centre for Drug Discovery
LSCDD

Outline
The Challenge of Translational Discovery in Pharmaceutical Research
Integration of Metadata using Semantic Web Technologies
• Why focus on metadata?
• How it helps
Examples

Lilly Singapore Centre for Drug Discovery
Oncology and diabetes research towards tailored therapy to improve patient outcome
Diagram labels: Drug Discovery (drug candidates); Experimental; Wet lab biology; Systems Biology (biomarkers); Integrative Computational Sciences (tools); Computational

Pharmaceutical R&D spends more to get less

Lost in translation
The limits of my language mean the limits of my world (Ludwig Wittgenstein)
Translate
我的语言限制 的范围是我的 (Ludwig Wittgenstein)
Translate

Translational research in cancer: Connecting the dots of genetic aberrations
Diagram labels: Targets; Pathways; Disease; Patients
Tailored Therapeutics: Improve individual patient outcomes and health outcome predictability through tailoring drug, dose, timing of treatment, and relevant information

The “Web” of heterogeneous data
Diagram labels: Cell/Assay Technologies

Integrating Scientific Data Sets
Uncontrollable diversity
Most of the valuable data is from outside our walls
Much of it is poorly structured
Ranging from large (1 TB/day) to boutique

Scientist’s View of Integrated Information
Diagram labels: Target-based chemotype profiling; Pathway-based chemotype profiling; Functional chemogenetics; Chemical biology; RNAi reagents (Qiagen siRNA, BROAD shRNA, cDNA); High-content bioassays; Biochemical data; Acumen assays; Cellomics assays; Omics: Protein (IHC, Luminex), DNA (CGH, SNP, Mutation), RNA (miRNA, mRNA), Epigenetics (Methylation, ChIP-Chip); Plate Reader; Interrogators; Reporters; Mapping and annotation backbone
Color code: Strategic; Cross-domain integration; Domain-level integration; Platforms; Foundational

Manual Data Integration
A repeated, tedious process:
• Pull data from internal and public data sets
• Normalize terms and values
• Write and run analysis scripts
• Compile into a single Excel file, detached from the data source (no drill-down)
Often this process can consume days with no guaranteed resolution

Integration Approaches Considered
• Data Warehouse
  • Difficult to maintain and integrate new data sets
  • Difficult to evolve as data changes
  • Schemas tightly coupled to applications
• Federated queries
  • Query performance issues
  • Where to place the index?
  • Problematic to maintain
  • Translating user search syntax to all sources requires deep knowledge of the data layer
• Semantic Integration
  • Relatively unproven in enterprise systems but adaptive to change
  • Relationships between data can be more fully characterized

Standard Semantic Integration Model
• All data is mapped to a domain ontology in both directions
• If a single system is down, results are incomplete
• Performance is limited by the slowest system in the network
• Massive mapping effort
• Multiple implementations of this approach, including:
  • Biological and Chemical Integrated Information System (BACIIS)
  • Boeing
Diagram labels: Query Generator; Results Presentation; Query Planning; Data Set Integration; Query Submission; Domain Ontology & Mappings; Semantic Normalization; Source (x4)

Can we do better for our purposes?
• Avoid a complex architecture and extended development effort
• Realize benefits in the near term
• Preprocess metadata to improve efficiency
• Characterize the type of questions that the ontology should answer
• Identify stable semantic technologies; do not employ parsers
• Allow semantic and relational databases to work together

What we need
Data Management and Availability
• Capturing and filtering the global and growing avalanche of internal and external scientific data
Data Fusion
• Systems to link, combine and navigate massive and heterogeneous data sets
Information Analysis and Mining
• Algorithms and tools to help scientists seek correlations and find connections between pre-clinical and clinical knowledge to generate and test translational hypotheses

Data Architecture
Diagram labels, Analysis and Mining layer: Query; Visualization; Algorithms; Workflow
Diagram labels, Centralized Integration Layer: Common Vocabulary; Ontology; Experimental Metadata Repository; Annotation Services; (Genomics mapping + Gene functional info); Proteome / GO; 30 million triples; 34 platforms; Experiment Mapping & Context Annotation; Genomics mapping; Functional Information
Diagram labels, Domain/Platform Specific Data: Expression (Affy, Agilent, Illumina); aCGH; Methylation; SNP; Mutation; Tissue Microarray; ChIP-Chip; miRNA; Screening; Readout; Derived Results; Analysis Results

LSCDD Data integration process in use
Diagram labels: Query; Visualization; Experimental Metadata Repository (Genomics mapping + Gene Function); Annotation Services; Affy Expression; Agilent Expression; Illumina Expression; aCGH; Screening; RNAi Database; Mutation; SNP; TMA; Analysis Results

LSCDD Semantic Integration Approach
• Use semantic technology on an appropriate problem
• Create an ontology focused on solving LSCDD integration needs
• Scientists and IT analysts work together to iteratively create a tailored vocabulary
• Define competency questions to validate the ontology
• Encourage the ontology to evolve; it is a different animal from an RDBMS schema
• Create bridges to public and internal ontologies to realize the full capabilities of the vocabulary
• Involve users in verifying the RDBMS-to-ontology mapping to increase confidence in the solution
• SPARQL is hard. Design an intuitive query model or question templates for users to navigate the repository.

LSCDD Semantic Integration Approach (Cont)
• Used Agile philosophy throughout: application development, ontology development and mapping effort
• Drive adoption by engaging users to understand their challenges and refine the solution
• Technologies
  • Protégé Ontology Editor
  • Oracle Semantic Technologies 11g
  • D2R Map (Database to RDF Mapping Language)
  • C# development in Visual Studio 2005

Metadata RDF Repository
• Aggregates experiment metadata from a diverse set of LSCDD relational databases into an Oracle Semantic Technologies repository for LSCDD scientific investigation.
• Scientists at LSCDD now have a single source of experiment information described with a common vocabulary.
• Current data sources (currently ~30 million triples) include:
  • Expression Data: Affymetrix, Illumina, Agilent
  • aCGH Data
  • RNAi Screening Data
  • Reagent Data
  • Gene Ontology (GO)
  • Medical Subject Headings (MeSH)
  • Many others

LSCDD Metadata Ontology
Diagram classes: Project; Study; Experiment; Assay; DiseaseState; Plate; Plate Well; Chip; Chip Type; Protocol; Compound; Reagent (subclasses: Protein Reagent, DNA Reagent, RNA Reagent); Software; Hardware; Model; Sample; Treatment; CellLine; Tissue; Gene; Probe; ClinicalData; ViralBatch; GeneList; GO; MESH
Diagram properties: hasProject; hasStudy; hasPlate; hasDiseaseState; hasAssay; hasChip; hasProtocol; hasModel; hasTissue; hasCellline; hasChipType; hasGene; hasCompound; hasReagent; hasSample; hasSource; hasTreatment; IsPartOf; hasSourceTissue; hasMESHId; hasGOId

Metadata Repository Application
• Both browse and query views are provided for repository access.
• The Query View allows the user to search the repository by setting constraints on attributes of the entities in the ontology.
• Links to external data sets such as Gene Ontology and MeSH have been defined, so queries may span multiple ontologies.
• The Results View displays details about each of the matches found and allows the user to navigate across entities.
• The application is created as a plugin to the Lilly Science Grid and can leverage Integrated Genomics Portal for Cancer Research (IGPCR) plugins to provide details about genes in hit lists.

Metadata Repository Application
Find all deacetylases involved in Colorectal Neoplasms
- Add filter to Gene Ontology Label attribute
- Add filter to MeSH Description Name attribute
- Run Query…
- Results View shows list of Genes
- Navigate across data links

Experiment Data Annotation
While raw experiment results are not suitable for editing, metadata such as experiment descriptions and relations become more valuable when users augment and refine them.
Experiment
  hasId: abc123
  hasContact: Bill Smith
  hasType: SiRNA Screen
  hasDescription: ____ H460 screen: run 789 …
  hasConflictingResults: Experiment def456
Experiment
  hasId: def456
  hasContact: Jane Smith
  hasType: SiRNA Screen
  hasDescription: H460 screen …

IGPCR: Integrated Genomics Portal for Cancer Research
An integrated view for analysis results
Helps oncology researchers with:
• Drug target identification and prioritization
• Biomarker discovery
• Combination therapy

Backup

Answering scientific questions
• Get me all the interactions for methylases that are involved in colorectal cancer. And for all these genes, get the expression and aCGH values for all colon cancer samples.
• What is the status of the target of my interest across multiple tumor types?
• Are there any reagents available to conduct functional validation?
• What are the right model systems to study the perturbation of my gene of interest?
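The question-template idea behind the "Find all deacetylases involved in Colorectal Neoplasms" query can be illustrated with a minimal sketch. It is not the LSCDD implementation (the slides name Oracle Semantic Technologies and C#); it is a plain-Python, in-memory stand-in that assumes metadata is held as subject/predicate/object triples. The property names hasGOId and hasMESHId come from the ontology slide; the attribute names label and descriptionName, the gene identifiers, and all of the data are hypothetical placeholders.

```python
from typing import Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy triples; the identifiers carry no biological meaning.
TRIPLES: List[Triple] = [
    ("gene:G1", "hasGOId", "go:T1"),
    ("gene:G2", "hasGOId", "go:T1"),
    ("gene:G3", "hasGOId", "go:T2"),
    ("gene:G1", "hasMESHId", "mesh:D1"),
    ("gene:G3", "hasMESHId", "mesh:D1"),
    ("gene:G2", "hasMESHId", "mesh:D2"),
    ("go:T1", "label", "deacetylase activity"),
    ("go:T2", "label", "GTPase activity"),
    ("mesh:D1", "descriptionName", "Colorectal Neoplasms"),
    ("mesh:D2", "descriptionName", "Lung Neoplasms"),
]


def match(triples: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for t in triples:
        if ((s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)):
            yield t


def subjects_with(triples: List[Triple], link: str, attr: str, text: str) -> Set[str]:
    """Subjects linked via `link` to a term whose `attr` value contains `text`."""
    needle = text.lower()
    terms = {s for s, _, value in match(triples, p=attr) if needle in value.lower()}
    return {s for s, _, term in match(triples, p=link) if term in terms}


def genes_for(triples: List[Triple], go_label: str, mesh_name: str) -> List[str]:
    """Question template: genes with a GO label filter AND a MeSH name filter."""
    by_function = subjects_with(triples, "hasGOId", "label", go_label)
    by_disease = subjects_with(triples, "hasMESHId", "descriptionName", mesh_name)
    return sorted(by_function & by_disease)


if __name__ == "__main__":
    print(genes_for(TRIPLES, "deacetylase", "Colorectal Neoplasms"))
```

Each filter in the template maps to one pattern match over the triples, and the two result sets are intersected, which mirrors adding one filter on a Gene Ontology attribute and one on a MeSH attribute before running the query. A production repository would express the same join in SPARQL against the triple store rather than scanning a Python list.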

Cancer drug discovery

Integration of high throughput datasets
Diagram labels: Tumor Samples; Cell lines; Mutations; Public / Private; CGH / SKY; Expression; Tissue Microarrays; Chemosensitivity; Patient Survival; RNAi

Going Forward
• Integration with additional external sources: NCBI, KEGG, Proteome, PubMed
• Integration with National Cancer Institute Metathesaurus
• Continued integration with new data types generated internally or from collaborators
• Definition and support of additional ontologies
Diagram labels: Web Resources; Lilly Data; SnoMed; Stanford Tissue Microarray; Collaborators; PubMed; NCI Metathesaurus; Labs; Integrated Augmented Query Results; Internal Data; Public Data; Analysis Pipelines; Visualizers

Acknowledgements
LSCDD, Singapore
IT
• Kevin Gao, Rakhi Bhat, Srinivasulu Kota and Maurice Manning
Systems Biology
• Amit Aggarwal and Mahesh Kumar Guzuva Desikan
ICS
• Pat Hartman
HiSoft Technology – Dalian, China
• Bill Yan, Young Gong, Harold Yin, Steven Cao and Jason Wang
Lilly, Indianapolis USA
• Susie Stephens, Jacob Koehler

Backup Slides

Putting it all together…
Diagram labels: Objects; Measure; Map 1; Map 2; Compounds; Fingerprint; MTS; Literature; Genes; Expression; Binding; Coding SNPs; Linkage D; Clinical DB; Images; Signature

Silos Need to Be Broken Down
Diagram labels (pipeline stages and milestones): Project; Exploratory; Target To Hit; Hit To Lead; Target; Program; Hit; Lead To PgS; Lead; Launch; Lead Optimization; Pre-Clinical Development; PgS; CS; Product; Phase I; Phase 2; Phase 3; Registration; Global Launch (GL); FHD; FED; PD/RD; FS; FA; FL
Activity stack repeated under every stage: Generate/Test Hypothesis; Model & Understand; Analyze & Mine; Transform; Data

BACIIS System Architecture
Web Interface: Input user queries and present the query results
Mediator
• Query Generator Module: Translate semantic-based user queries into domain-recognized terms through the ontology
• Query Planning and Execution Module
  • Query Planner: Decompose the user query into subqueries, define the subquery dependencies, and find the query paths
  • Mapping Engine: Map each subquery onto specific data source(s)
  • Execution Engine: Receive data-source-specific subqueries and invoke the corresponding wrappers to fetch the data from the remote data source
• Result Presentation Module: Receive and integrate the individual result sets from wrappers into HTML format and send result pages to the web interface
BACIIS Knowledge Base: Bio-Chemical Ontology; Data Source Schema
Wrapper (one per Web Database): Fetch HTML/XML pages from the remote data source and extract result data

Hybrid Architecture
Diagram labels: User Interface; Knowledge-Space Navigation; List Management; Presentation Services; Analytic Services; Metadata Repositories; Navigation Service Layer; Data Set Integration Services; Semantic Layer; Query Preparation Service; Semantic Normalization Service; Adaptive Layer; Query Submission Service; Streams Management Service; Metadata Services Layer; Request Brokers; Analysis Entities; Presentation Entities; Persistence Entities; Personalization Entities; Navigational Entities; Federation Entities; Physical Access Layer; Data Access Service Layer; Source (x5)

Goals
• Make knowledge emerge from repositories
• Make data more valuable by adding context
• Leverage intellectual assets
• Decision support
• Enhance productivity
• Reduce IT integration efforts