Ariadne Genomics technology: Extraction from the literature and network analysis
Dr. Anton Yuryev
Ariadne Genomics Inc.
©2006 Ariadne Genomics. All Rights Reserved.

Pathway Studio product line
• Pathway Studio desktop
• Pathway Studio workgroup
• Pathway Studio enterprise
Main functionality:
1) Data mining and pathway building
2) Analysis of high-throughput data
3) Text-mining, fact extraction and database building

Ariadne Corporate Offering
Software solution for knowledge management and pathway analysis of high-throughput data
[Diagram labels: MedScan (1000 abstracts/min); Text-mining; Proprietary data; Public interaction data; Knowledge Databases; ResNet Biological Association Networks; Pathway Building; Pathway collection; Analysis of High-Throughput data]

Ariadne Database Construction
• Automatic fact extraction by MedScan from organism-specific subset of PubMed and full-text journals
• Import of Ariadne proprietary curated data
  – Curated physical interaction
  – 712 signaling line pathways
• Import of publicly available curated interaction data: Entrez Gene, BIND, HPRD, KEGG, Gene Ontology
• Import of publicly available high-throughput interaction data (Y2H, mass spectrometry, etc.)
• Import of user proprietary data:
  – Proprietary or publicly available experimental data in PSI, BioPAX or tab-delimited formats
  – Data mined by the MedScan tool from literature sources not included with the database
• User manual curation

Additional Commercial datasets
• >130 KEGG metabolic pathways
• >70 STKE pathways (AAAS)
• >10,000 ERGO pathways for 587 organisms (Integrated Genomics)
• >100,000 protein interactions from Hynet (Prolexys)
• >600 disease pathways from PathArt (Jubilant)

Pathway Studio Enterprise distinctions
• Web client for instant pathway publishing
  – Connection between multiple geographical sites
• 3-tier architecture with Java API to connect third-party applications and algorithms
• MedScan Enterprise license:
  – open MedScan dictionaries and pattern rules files for customization
  – distribution of MedScan data across the entire company
• GSEA, NEA and network clustering algorithms for analysis of high-throughput data

Pathway Studio Enterprise Architecture
[Diagram labels: Read-only users via web browser; Data editors via web browser; Bioinformaticians via Pathway Studio; Third-party tools, in-house applications, API; Application server; Database; SQL interface, bulk data management]

“Everyone is an Expert” decentralized deployment schema
Hundreds or thousands of users, some with read-only and some with editor or publisher roles, access one central database via Pathway Studio and/or a Web browser to analyze experiments, browse the pathway collection, do literature mining, and share data and analysis results.

“Bioinformatics service group” centralized deployment schema
A bioinformatics group serves scientists across the entire company by analyzing their experimental data and mining the literature. Analysis results are published via the Web browser interface for end users.
End users: view-only access to pathways and analysis networks annotated with experimental data, via web browser and links to PathwayExpert Web Services
1) Experimental data
2) Search requests
Bioinformatics group:
1) Analysis of experimental data
2) Text-mining and Pathway Building

“Disease area” decentralized clusters deployment schema
Disease area groups have bioinformaticians, biologists and chemists working as a team with a focus on one disease
• Cardiovascular group
• Digestive disorders group
• Cancer group
• CNS group

Plan of the talk
1) Text-mining, fact extraction and database building
   - Stay current with the literature
   - Build focused literature networks
   - Build focus databases
2) Data mining and pathway building
   - Understand molecular mechanisms of disease and processes
   - Maintain pathway collection
   - Build focus databases
3) Analysis of high-throughput data
   - Functional ontology analysis
   - Network analysis

Introduction to MedScan technology

How does MedScan extract facts from text?
• Sentence in PubMed: “Axin binds beta-catenin and inhibits GSK-3beta.”
• Identify proteins in dictionary (shown in red on the slide): Axin, beta-catenin, GSK-3beta
• Identify interaction type (shown in black on the slide): binds, inhibits
Syntactic Layer: Noun Phrase, Verb Phrase, Noun Phrase
Semantic Layer: Protein, Protein, Relations, Protein
• Extracted Facts:
  Axin - beta-catenin; relation: Binding
  Axin -> GSK-3beta; relation: Regulation, effect: Negative

Filtering by number of references controls the network confidence in Pathway Studio
Binding (references: 77); Owner: public; Entities: E2F1-RB1
Example supporting sentence: “This stabilization of the pRB-E2F-1 complex by AAV expression in adenoviral-infected cells should lead to a decrease in E2F-1-mediated expression of cell cycle-specific genes.”
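
The extraction steps shown for the Axin sentence above (dictionary lookup of proteins, then matching interaction verbs between them) can be illustrated with a toy sketch. This is not MedScan's implementation: the three-entry dictionary, the `VERBS` table and the `extract_facts` function are hypothetical, and a regular expression stands in for the real syntactic and semantic layers.

```python
import re

# Hypothetical mini-dictionary; a real protein dictionary is far larger.
PROTEINS = ["Axin", "beta-catenin", "GSK-3beta"]

# Hypothetical verb table: verb -> (relation type, effect, directed?).
VERBS = {
    "binds": ("Binding", None, False),
    "inhibits": ("Regulation", "Negative", True),
    "activates": ("Regulation", "Positive", True),
}


def extract_facts(sentence):
    """Extract relations from '<protein> verb <protein> [and verb <protein>]'."""
    protein_re = "|".join(re.escape(p) for p in PROTEINS)
    verb_re = "|".join(VERBS)
    # The sentence subject is assumed to be the leading protein name.
    subject = re.match(rf"\s*({protein_re})\b", sentence)
    if subject is None:
        return []
    facts = []
    # Every 'verb protein' pair is attached to the shared subject.
    for verb, obj in re.findall(rf"\b({verb_re})\s+({protein_re})", sentence):
        relation, effect, directed = VERBS[verb]
        facts.append({
            "source": subject.group(1),
            "target": obj,
            "relation": relation,
            "effect": effect,
            "directed": directed,
        })
    return facts


if __name__ == "__main__":
    for fact in extract_facts("Axin binds beta-catenin and inhibits GSK-3beta."):
        print(fact)
```

On the example sentence this yields an undirected Binding relation between Axin and beta-catenin and a directed negative Regulation from Axin to GSK-3beta, matching the extracted facts listed on the slide.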

MedScan Architecture
[Diagram labels, grouped]
• Modules: Entity recognizer, Pattern matcher, Semantic processor
• Functions: Entity detection, Relationship extraction
• Customizable by user: Dictionaries, Patterns, Rules
• Cartridges: Mammals, Yeast, Drosophila, C. elegans, Plants, Toxicology
• Output: RNEF XML
Future:
• New modules: ConceptScan
• New cartridges: Immunology, Clinical

Describing MedScan
• Manually curated: dictionaries and grammar rules
• Fast: 14 million PubMed abstracts in 2 days on a modern PC
• Comprehensive: fact recovery rate > 90% (90% = 70% sentence recovery rate + 20% literature redundancy)
• Removes redundancy: 7,647,282 non-distinct relations => 1,000,000 distinct relations
• Accurate: false positive rate of 10%
• Customizable: dictionaries and patterns

MedScan Applications
[Diagram labels: PubMed, Google, Open access, MedScan]
• Indexing the scientific literature: entity-based index, semantic index
• Extracting interactions to create databases for systems biology
• Automatic reader’s digest: document summary

Pathway Building in Pathway Studio
• Manual
• Automatic using Graph navigation tools
• Using text-mining with MedScan

Viewing and editing pathways in Pathway Studio
• Viewing entities in the List Pane
• Entity and relation tables
• Show all references
• Pathway Reference summary
• Export protein list
• Display styles: By type, By effect, By reference count
• UI options:
  – magnifier
  – fit text to entities
  – simple and full graph view
  – fit to window
  – rotate
  – move
  – zoom by rectangle
  – advanced graph scaling
• Resizing nodes in pathway pane
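
The redundancy removal described for MedScan (many non-distinct mentions collapsing into fewer distinct relations), together with the idea of using the number of references as a confidence filter, can be sketched as follows. The functions `merge_relations` and `filter_by_references`, the tuple layout, and the reference identifiers in the example are all assumptions made for illustration; the slides do not describe the actual data structures.

```python
from collections import defaultdict


def merge_relations(mentions):
    """Collapse repeated mentions into distinct relations with reference counts.

    Each mention is (source, target, relation, reference_id).
    """
    references = defaultdict(set)
    for source, target, relation, reference_id in mentions:
        # Binding is undirected, so A-B and B-A are treated as one relation.
        if relation == "Binding":
            source, target = sorted((source, target))
        references[(source, target, relation)].add(reference_id)
    return {key: len(ids) for key, ids in references.items()}


def filter_by_references(relations, min_references):
    """Keep only relations supported by at least min_references references."""
    return {key: n for key, n in relations.items() if n >= min_references}


if __name__ == "__main__":
    # Made-up reference identifiers, for illustration only.
    mentions = [
        ("E2F1", "RB1", "Binding", "ref-1"),
        ("RB1", "E2F1", "Binding", "ref-2"),
        ("E2F1", "RB1", "Binding", "ref-2"),  # same fact, same reference
        ("Axin", "GSK-3beta", "Regulation", "ref-3"),
    ]
    distinct = merge_relations(mentions)
    print(distinct)
    print(filter_by_references(distinct, min_references=2))
```

Raising `min_references` trades coverage for confidence: relations seen in only one reference drop out first.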

Pathway Building by text-mining
Non-melanoma skin cancer: >1,000,000 cases (<2,000 deaths) in the USA

MedScan Reader: PubMed search
Keep searching and adding relations. At the end, send extracted relations to Pathway Studio.

MedScan Reader: Import top 100 hits from Google Scholar search
Downloads the found articles and processes them with MedScan

MedScan Reader: Import top 30 hits from Google search
Downloads the found web pages and processes them with MedScan

Full-text article found on HighWire Press with “non-melanoma skin cancer” text search

MedScan customization by focused literature source: “Non-melanoma skin cancer” literature network, the result of targeted text-mining by MedScan Reader
Every entity in this network was mentioned in the context of non-melanoma skin cancer:
- Find hubs
- Compare with patient data

MedScan customization by focused literature source: Protein network for non-melanoma skin cancer
Compare this pathway with your experimental patient data

Automatic Pathway Building using Graph navigation
Build pathway tool

Mining regulatory relations in database
Basic principle: regulatory interactions are mediated by the physical interaction network
– Regulomes
– Biological processes pathways
– Disease networks
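
The basic principle above (a regulatory effect is carried by a chain of physical interactions) suggests one simple way to connect a regulator to its targets: look for shortest paths through the physical interaction network. The sketch below is a generic breadth-first search and is not the Build pathway tool's actual algorithm, which the slides do not describe. The node names (`R`, `A`, `T1`, and so on) are placeholders, and `build_graph`, `shortest_path` and `connect` are hypothetical helpers.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def connect(graph, regulator, targets):
    """Collect proteins lying on shortest physical paths to each target."""
    nodes = set()
    for target in targets:
        path = shortest_path(graph, regulator, target)
        if path:
            nodes.update(path)
    return nodes


if __name__ == "__main__":
    # Placeholder physical interactions, not real biology.
    edges = [("R", "A"), ("A", "T1"), ("R", "B"), ("B", "C"), ("C", "T2"), ("X", "Y")]
    graph = build_graph(edges)
    print(shortest_path(graph, "R", "T2"))
    print(sorted(connect(graph, "R", ["T1", "T2"])))
```

Targets with no physical route to the regulator are simply left out, which is one way to see where the interaction network is incomplete.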

Regulome pathways: algorithm input

Regulome pathways: Connecting IL10 targets with physical interaction relations

Building pathways by data mining
Converting the regulatory network to a protein physical interaction network for Cell Processes, Diseases, Regulomes

Disease networks
2300 diseases, 230 cancers in the ResNet 5.0 database
Converting the regulatory network to a protein physical interaction network for diseases
Example: Endothelial cells cancer

Endothelial cells cancer network

Applied information retrieval and multidisciplinary research: new mechanistic hypotheses in Complex Regional Pain Syndrome
Kristina M Hettne, Marissa de Mos, Anke GJ de Bruijn, Marc Weeber, Scott Boyer, Erik M van Mulligen, Montserrat Cases, Jordi Mestres, and Johan van der Lei. J Biomed Discov Collab. 2007; 2: 2.
Resulting network of CRPS concepts

High-throughput data analysis in Pathway Studio
• Identification of responsive genes
• Functional ontology analysis
• Network analysis

Supports analysis of all types of experimental data
• Gene expression
• Metabolomics
• Proteomics
• SNP and CNV analysis
• Methylation arrays
• Phosphorylation arrays
Support for all microarray platforms:
• Affymetrix
• Agilent
• Illumina
• Nimblegen
• Superarray
• Custom design chips

Analysis of gene expression microarray data: STEP 1: Identification of responsive genes
• Expression data import (tab, xls, cel)
• Selection of responsive genes
  – Find differentially expressed genes (significance analysis via t-test)
  – Gene clustering via correlation networks
  – Find responsive genes in third-party software for statistical analysis of microarray data and import them as a protein list (Tools->Import protein list)

Calculation of differentially expressed genes in Pathway Studio (significance analysis using paired and unpaired t-tests)

Gene clustering in Pathway Studio using Correlation network

Analysis of gene expression microarray data: STEP 2: Pathway Analysis of responsive genes
• Network analysis
  – Identification of differentially expressed (DE) protein complexes and physical networks
  – Identification of major regulators and targets in expression network
    • Via network querying (Build pathway tool)
    • Via Network enrichment analysis (in PS Enterprise only)
• Functional analysis
  – Comparison of responsive genes with ontologies and pathway collection
    • Via Fisher exact test
    • Via Gene Set Enrichment analysis (GSEA in PS Enterprise only)
    • Gene ontology analysis (via Fisher’s test or GSEA)
    • Comparative gene ontology analysis
    • Via network querying (Build pathway tool)

Functional analysis: comparative GO groups analysis
Comparing cell responses in GO group space

Building protein network from interesting GO groups and identification of its major expression regulator
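
The functional analysis step above compares a list of responsive genes with an ontology group or pathway using the Fisher exact test. A minimal one-sided version of that test, written with the standard library only, might look like the sketch below. The function name `fisher_enrichment_p` and the example counts are hypothetical; how Pathway Studio defines the gene universe or corrects for multiple testing is not stated in the slides.

```python
from math import comb


def fisher_enrichment_p(n_universe, n_group, n_responsive, n_overlap):
    """One-sided Fisher exact p-value for over-representation.

    Probability of seeing at least n_overlap group members among
    n_responsive genes drawn from a universe of n_universe genes,
    of which n_group belong to the ontology group (hypergeometric tail).
    """
    total = comb(n_universe, n_responsive)
    upper = min(n_group, n_responsive)
    tail = sum(
        comb(n_group, k) * comb(n_universe - n_group, n_responsive - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 20 measured genes, 5 in the group,
    # 4 responsive genes, 3 of which fall in the group.
    p = fisher_enrichment_p(n_universe=20, n_group=5, n_responsive=4, n_overlap=3)
    print(f"p = {p:.4f}")
```

With many ontology groups tested at once, the resulting p-values would normally need a multiple-testing correction before being interpreted.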

Identification of drug-responsive genes

Evaluation of drug efficacy and side effects

GSEA: Gene Set Enrichment analysis in PS Enterprise

Visualizing expression data on GSEA pathway

High-throughput data analysis in Pathway Studio
• Functional ontology analysis
• Network analysis

Data model in ResNet database
Formalized representation of biological regulatory and interaction network
Relation types: Expression, PromoterBinding, DirectRegulation, ProtModification, Binding, MolSynthesis, MolTransport, Regulation, ...more
Uses:
• Interpretation of gene expression data
• Interpretation of proteomics data
• Interpretation of metabolomics data, biomarker prediction and validation

Network analysis: identification of major regulators and targets among DE genes via Build pathway

Network analysis: Identification of major regulators
Network enrichment analysis finds regulators with the most differentially expressed targets
[Figure labels: Better, Worse]

Network Enrichment analysis in PS Enterprise

Visualizing expression data on NEA pathway
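
The idea behind network enrichment analysis as stated above, finding regulators with the most differentially expressed targets, can be sketched as a simple ranking. The scoring here (count of DE targets, then fraction of targets that are DE) is an assumption for illustration; the slides do not give the statistic used by the NEA implementation in PS Enterprise. `rank_regulators` and the regulator and gene names are hypothetical.

```python
def rank_regulators(targets_by_regulator, de_genes):
    """Rank regulators by how many of their known targets are DE.

    Returns (regulator, n_de_targets, fraction_de, de_targets) tuples,
    best first. Ties on the count are broken by the DE fraction.
    """
    de_set = set(de_genes)
    rows = []
    for regulator, targets in targets_by_regulator.items():
        target_set = set(targets)
        hits = sorted(target_set & de_set)
        fraction = len(hits) / len(target_set) if target_set else 0.0
        rows.append((regulator, len(hits), fraction, hits))
    rows.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return rows


if __name__ == "__main__":
    # Placeholder regulators and targets, not taken from ResNet.
    network = {
        "REG1": ["g1", "g2", "g3", "g4"],
        "REG2": ["g2", "g5"],
        "REG3": ["g6", "g7", "g8"],
    }
    de_genes = ["g1", "g2", "g3", "g5"]
    for row in rank_regulators(network, de_genes):
        print(row)
```

A raw count favors regulators with many known targets, which is why a real analysis would usually attach a significance test to each regulator rather than rank on counts alone.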

Network Enrichment Analysis: Example for metabolomics
Identification of metabolism regulators
Finds regulators with the most differential levels of metabolite targets
[Figure labels: Better, Worse]

Network analysis: finding DE protein complexes using Build dense expressed networks in PS Enterprise

>200 publications using AGI software and ResNet database
• Analysis of gene expression microarray data (139)
• Pathway Analysis (97)
• Disease mechanism (84)
• Publication by Ariadne Authors (18)
• Human genetics (7)
• Text processing (6)
• Reviews (7)
• Databases (3)
• Drug discovery (21)
• Toxicogenomics (4)

Most common workflow for microarray analysis in Pathway Studio for disease
• Identify genes differentially expressed in disease (DE genes)
• Identify genes known to be associated with the disease according to the literature, using Pathway Studio
• Identify DE genes that are linked to known disease genes, using Pathway Studio
• Report novel disease genes

Transcriptional network governing the angiogenic switch in human pancreatic cancer.
Abdollahi et al. PNAS July 31, 2007 104(31): 12890–12895

High-throughput data analysis in Pathway Studio
Extras

Biomarker prioritization using expression data in Pathway Studio: biomarkers for inflammatory bowel disease
1) A DE downstream target is better than a DE regulator
2) Secreted biomarkers are better than intracellular ones
Dissection of the Inflammatory Bowel Disease Transcriptome Using Genome-Wide cDNA Microarrays, PLoS, August 23, 2005

Expression variation of individual biomarkers in patients is unlikely to affect in-degree hubs
Sources of patient variations:
- genetic
- dietary & life-style
- stress-related

Using ChIP-on-chip data to find major regulators in the expression experiment

Example of proprietary algorithm that can be integrated into Pathway Studio using API
Algorithm finds inconsistencies between expression data and ResNet
• 100% consistency between expression and ResNet data; TP53 label assigned by algorithm: ACTIVATED
• Inconsistency between expression and ResNet data; TP53 label assigned by algorithm: ACTIVATED
Explanations for inconsistency:
1) Incorrect expression data – 50% of all cases
2) Incorrect ResNet data – 10% of all cases
3) Posttranscriptional regulation of TP53
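
The last slide describes an algorithm that labels a regulator such as TP53 and flags inconsistencies between expression data and ResNet. That algorithm is proprietary and not specified in the talk, so the sketch below is only a majority-vote illustration of the general idea: each target's observed change is compared with the direction expected from the regulator's known effect on it. The function `label_regulator`, the ACTIVATED/INHIBITED tie rule and the target names are assumptions.

```python
def label_regulator(target_effects, observed_changes):
    """Label a regulator and list targets that disagree with the label.

    target_effects: target -> +1 (regulator activates) or -1 (represses).
    observed_changes: target -> +1 (up), -1 (down) or 0 (unchanged).
    Returns (label, inconsistent_targets).
    """
    supports_activated, supports_inhibited = [], []
    for target, effect in target_effects.items():
        change = observed_changes.get(target, 0)
        if change == 0:
            continue  # unmeasured or unchanged targets carry no vote
        if effect * change > 0:
            supports_activated.append(target)
        else:
            supports_inhibited.append(target)
    # Assumption: ties are resolved in favor of ACTIVATED.
    if len(supports_activated) >= len(supports_inhibited):
        return "ACTIVATED", sorted(supports_inhibited)
    return "INHIBITED", sorted(supports_activated)


if __name__ == "__main__":
    # Hypothetical targets of a regulator; signs are illustrative only.
    effects = {"target_a": +1, "target_b": +1, "target_c": -1}
    consistent = {"target_a": +1, "target_b": +1, "target_c": -1}
    mixed = {"target_a": +1, "target_b": -1, "target_c": -1}
    print(label_regulator(effects, consistent))
    print(label_regulator(effects, mixed))
```

Targets returned in the second element are the ones worth checking against the three explanations listed on the slide: a measurement error, an incorrect database relation, or regulation that is not visible at the transcript level.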