Ariadne Genomics technology:
Extraction from the literature and
network analysis
Dr. Anton Yuryev
Ariadne Genomics Inc.
©2006 Ariadne Genomics. All Rights Reserved.
Pathway Studio product line
• Pathway Studio desktop
• Pathway Studio workgroup
• Pathway Studio enterprise
Main functionality:
1) Data mining and pathway building
2) Analysis of high-throughput data
3) Text-mining, fact extraction and database
building
Ariadne Corporate Offering
Software solution for Knowledge management and pathway
analysis of the high-throughput data
Diagram labels:
• MedScan text-mining (1000 abstracts/min)
• Proprietary data
• Public interaction data
• Knowledge databases: ResNet, Biological Association Networks
• Pathway building
• Pathway collection
• Analysis of high-throughput data
Ariadne Database Construction
• Automatic fact extraction by MedScan from organism-specific subset of
PubMed and full-text journals
• Import of Ariadne proprietary curated data
– Curated physical interaction
– 712 signaling line pathways
• Import of publicly available curated interaction data: Entrez Gene, BIND,
HPRD, KEGG, Gene Ontology
• Import of publicly available high-throughput interaction data (Y2H, mass spec, etc.)
• Import of user proprietary data:
– Proprietary or publicly available experimental data in PSI, BioPAX or tab-delimited formats
– Data mined by MedScan tool from literature sources not included with database
• User manual curation
Additional Commercial datasets
• > 130 KEGG metabolic pathways
• >70 STKE pathways (AAAS)
• >10,000 ERGO pathways for 587 organisms
(Integrated genomics)
• >100,000 protein interactions from Hynet
(Prolexys)
• >600 disease pathways PathArt (Jubilant)
Pathway Studio Enterprise distinctions
• Web-client for instant pathway publishing
– Connection between multiple geographical sites
• 3-tier architecture with Java API to connect third party
applications and algorithms
• MedScan Enterprise license:
– open MedScan dictionaries and pattern rules files for
customization
– distribution of MedScan data across entire company
• GSEA, NEA and network clustering algorithms for
analysis of high-throughput data
Pathway Studio Enterprise Architecture
Diagram labels:
• Read-only users via web browser
• Data editors via web browser
• Bioinformaticians via Pathway Studio
• Third party tools, in-house applications, API
• SQL interface, bulk data management
• Application server
• Database
“Everyone is an Expert” decentralized deployment schema
Hundreds or thousands of users, some with read-only and some
with editor or publisher roles, access one central database
via Pathway Studio and/or a Web browser to analyze
experiments, browse the pathway collection, do literature mining,
and share data and analysis results.
“Bioinformatics service group” centralized deployment schema
A bioinformatics group serves scientists across the entire company by
analyzing their experimental data and mining the literature. Analysis
results are published via the Web browser interface for end users.
End users
View-only access to pathways and analysis networks annotated with
experimental data via web browser and links to PathwayExpert Web Services
1) Experimental data
2) Search requests
Bioinformatics group
1) Analysis of experimental data
2) Text-mining and Pathway Building
“Disease area” decentralized clusters deployment schema
Disease area groups have bioinformaticians, biologists and chemists working as a
team with a focus on one disease
Cardiovascular group
Digestive disorders group
Cancer group
CNS group
Plan of the talk
1) Text-mining, fact extraction and database building
- Stay current with the literature
- Build focused literature networks
- Build focus databases
2) Data mining and pathway building
- Understand molecular mechanisms of disease and processes
- Maintain pathway collection
- Build focus databases
3) Analysis of high-throughput data
- Functional ontology analysis
- Network analysis
Introduction to MedScan technology
How does MedScan extract facts from text?
• Sentence in PubMed:
“Axin binds beta-catenin and inhibits GSK-3beta.”
• Identify Proteins in Dictionary (in red):
“Axin binds beta-catenin and inhibits GSK-3beta.”
• Identify Interaction Type (in black):
“Axin binds beta-catenin and inhibits GSK-3beta.”
Syntactic Layer: Noun Phrase, Verb Phrase, Noun Phrase
Semantic Layer: Protein, Protein, Relations, Protein
• Extracted Facts:
Axin - beta-catenin (relation: Binding)
Axin -> GSK-3beta (relation: Regulation, effect: Negative)
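The dictionary-plus-pattern idea on this slide can be sketched as a toy example. The protein dictionary, the verb table, and the `extract_facts` function are illustrative assumptions, not MedScan's actual dictionaries, grammar rules, or API; real sentences would need full syntactic parsing rather than a single regular expression.

```python
import re

# Hypothetical protein dictionary and verb-to-relation table (illustrative only).
PROTEINS = ["beta-catenin", "GSK-3beta", "Axin"]
VERBS = {
    "binds": ("Binding", None),
    "inhibits": ("Regulation", "Negative"),
    "activates": ("Regulation", "Positive"),
}

def extract_facts(sentence):
    """Extract (subject, relation, effect, object) tuples from a simple
    'Subject verb Object and verb Object' sentence."""
    protein_re = "|".join(re.escape(p) for p in PROTEINS)
    verb_re = "|".join(VERBS)
    # The subject is the first dictionary protein found in the sentence.
    subject = re.search(protein_re, sentence)
    if subject is None:
        return []
    facts = []
    # Each 'verb protein' pair after the subject yields one fact.
    rest = sentence[subject.end():]
    for m in re.finditer(rf"\b({verb_re})\s+({protein_re})", rest):
        relation, effect = VERBS[m.group(1)]
        facts.append((subject.group(0), relation, effect, m.group(2)))
    return facts

facts = extract_facts("Axin binds beta-catenin and inhibits GSK-3beta.")
for fact in facts:
    print(fact)
```

The two tuples correspond to the two extracted facts listed on the slide: a Binding relation and a Regulation relation with a Negative effect.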
Filtering by Number of references controls the
network confidence in Pathway Studio
Binding (references: 77)
Owner: public, Entities E2F1-RB1
This stabilization of the pRB-E2F-1 complex by AAV
expression in adenoviral-infected cells should lead to a
decrease in E2F-1-mediated expression of cell cycle-specific genes.
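Filtering by reference count can be sketched as follows. The `Relation` class and `filter_by_references` function are hypothetical names for illustration, not the Pathway Studio data model; only the E2F1-RB1 count of 77 comes from the slide, and the other edges are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    rel_type: str
    references: int  # number of supporting literature references

def filter_by_references(relations, min_refs):
    """Keep only relations supported by at least `min_refs` references."""
    return [r for r in relations if r.references >= min_refs]

network = [
    Relation("E2F1", "RB1", "Binding", 77),        # count taken from the slide
    Relation("GeneA", "GeneB", "Regulation", 1),   # made-up low-confidence edge
    Relation("GeneB", "GeneC", "Binding", 3),      # made-up edge
]
confident = filter_by_references(network, min_refs=3)
for r in confident:
    print(r.source, r.target, r.references)
```

Raising `min_refs` trades network coverage for confidence, which is the control the slide title describes.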
MedScan Architecture
Diagram labels:
• Customizable by user
• Modules: Entity recognizer, Pattern matcher, Semantic processor
• Dictionaries, Patterns, Rules
• Entity detection, Relationship extraction
• Cartridges: Mammals, Yeast, Drosophila, C-elegans, Plants, Toxicology
• RNEF XML
Future:
• New modules: ConceptScan
• New cartridges: Immunology, Clinical
Describing MedScan
• Manually curated: dictionaries and grammar rules
• Fast: 14 million PubMed abstracts in 2 days on a modern PC
• Comprehensive: facts recovery rate > 90%
90% = 70% sentence recovery rate + 20% literature redundancy
• Removes redundancy: 7,647,282 non-distinct relations => 1,000,000 distinct relations
• Accurate: false positive rate – 10%
• Customizable: dictionaries and patterns
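The redundancy-removal step can be pictured as collapsing repeated mentions of the same relation into one distinct relation with a mention count. This is a minimal sketch under that assumption; `collapse_relations` is a hypothetical helper, and the last line simply divides the slide's two totals, reading the arrow as "about 1,000,000 distinct relations".

```python
from collections import Counter

def collapse_relations(mentions):
    """Collapse repeated (source, relation, target) mentions into distinct
    relations, keeping the number of supporting mentions for each."""
    return Counter(mentions)

mentions = [
    ("Axin", "Binding", "beta-catenin"),
    ("Axin", "Binding", "beta-catenin"),
    ("Axin", "Regulation", "GSK-3beta"),
]
distinct = collapse_relations(mentions)
print(len(mentions), "mentions ->", len(distinct), "distinct relations")

# Average mentions per distinct relation implied by the slide's totals.
redundancy = 7_647_282 / 1_000_000
print(round(redundancy, 1))
```

The per-relation counts are the same kind of quantity as the reference counts used for confidence filtering elsewhere in the talk.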
MedScan Applications
Sources: PubMed, Google, Open access
MedScan applications:
• Indexing the scientific literature: entity-based index, semantic index
• Extracting interactions to create databases for systems biology
• Automatic reader’s digest: document summary
Pathway Building in Pathway Studio
• Manual
• Automatic using Graph navigation tools
• Using text-mining with MedScan
Viewing and editing pathways in Pathway Studio
• Viewing entities in the List Pane
• Entity and relation tables
• Show all references
• Pathway Reference summary
• Export protein list
• Display styles: By type, By effect, By reference count
• UI options:
– magnifier
– fit text to entities
– simple and full graph view
– fit to window
– rotate
– move
– zoom by rectangle
– advanced graph scaling
• resizing nodes in pathway pane
Pathway Building by text-mining
Non-melanoma skin cancer
>1,000,000 cases (<2,000 deaths) in the USA
MedScan Reader: PubMed search
Keep searching and adding relations
At the end, send extracted relations to Pathway Studio
MedScan Reader: Import top 100 Hits from Google Scholar search:
downloads found articles and processes them with MedScan
MedScan Reader: Import top 30 Hits from Google search:
downloads found web-pages and processes them with MedScan
Full-text article found on HighWire Press with a “non-melanoma skin
cancer” text search
MedScan customization by focused literature source:
“Nonmelanoma skin cancer” literature network – result of
targeted text-mining by MedScan Reader
Every entity in this network was mentioned in the context of
nonmelanoma skin cancer:
- Find hubs
- Compare with patient data
MedScan customization by focused literature source:
Protein network for non-melanoma skin cancer
Compare this pathway with your experimental patient data
Automatic Pathway Building
using Graph navigation
Build pathway tool
Mining regulatory relations in database
Basic principle
Regulatory interactions are mediated by the physical interaction network
– Regulomes
– Biological processes pathways
– Disease networks
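One way to picture this principle is to explain a regulatory relation by a chain of physical interactions between the regulator and its target. The sketch below uses a plain breadth-first search over a made-up interaction graph; it is an assumption about how such a chain could be found, not the Build pathway tool's actual algorithm, and none of the node names come from ResNet.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search for a shortest path in an undirected graph
    given as {node: set_of_neighbours}. Returns None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the parent links to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

# Made-up physical interactions (e.g. binding, modification).
physical = {
    "Receptor": {"KinaseA"},
    "KinaseA": {"Receptor", "TF1"},
    "TF1": {"KinaseA", "TargetGene"},
    "TargetGene": {"TF1"},
}
# A regulatory relation "Receptor -> TargetGene" explained by a physical chain.
path = shortest_path(physical, "Receptor", "TargetGene")
print(" - ".join(path))
```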
Regulome pathways: algorithm input
Regulome pathways:
Connecting IL10 targets
with physical interaction
relations
Building pathways by Data mining
converting regulatory network to protein physical interaction network for Cell Processes, Diseases, Regulomes
Disease networks
2300 diseases, 230 cancers in ResNet 5.0 database
converting regulatory network to protein physical interaction network for Diseases
Endothelial cells cancer
Endothelial cells cancer network
Applied information retrieval and multidisciplinary research: new mechanistic
hypotheses in Complex Regional Pain Syndrome
J Biomed Discov Collab. 2007; 2: 2.
Kristina M Hettne, Marissa de Mos, Anke GJ de Bruijn, Marc Weeber, Scott Boyer, Erik M van Mulligen, Montserrat Cases,
Jordi Mestres, and Johan van der Lei
Resulting network of CRPS concepts
High-throughput data analysis in Pathway
Studio
• Identification of responsive genes
• Functional ontology analysis
• Network analysis
Supports analysis of all types of experiment data
• Gene expression
• Metabolomics
• Proteomics
• SNP and CNV analysis
• Methylation arrays
• Phosphorylation arrays
Support for all microarray platforms:
• Affymetrix
• Agilent
• Illumina
• Nimblegen
• Superarray
• Custom design chips
Analysis of gene expression microarray data:
STEP 1: Identification of responsive genes
• Expression data import (tab, xls, cel)
• Selection of responsive genes
– Find differentially expressed genes (significance
analysis via t-test)
– Gene clustering via correlation networks
– Find responsive genes in third-party software
for statistical analysis of microarray data and import
them as a protein list (Tools->Import protein list)
Calculation of differentially expressed genes in Pathway Studio
(significance analysis using paired and unpaired t-tests)
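The two test statistics named on this slide can be sketched with the standard library alone. The unpaired version below is Welch's form (unequal variances), which is an assumption since the slide does not say which unpaired variant Pathway Studio uses; the expression values are made up. Turning the statistics into p-values needs a t-distribution CDF, for example via SciPy's `scipy.stats.ttest_ind` and `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Unpaired (Welch) t statistic for two independent groups."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def paired_t(a, b):
    """Paired t statistic for matched samples of equal length.
    Assumes the paired differences are not all identical."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / math.sqrt(variance(diffs) / len(diffs))

# Made-up log-expression values for one gene.
treated = [8.0, 9.0, 10.0]
control = [5.0, 6.0, 7.0]
t_unpaired = welch_t(treated, control)

# Made-up before/after measurements on the same three samples.
t_paired = paired_t([8.0, 9.0, 10.0], [5.0, 7.0, 6.0])
print(t_unpaired, t_paired)
```

Genes would then be ranked or thresholded by the resulting significance to form the list of responsive genes.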
Gene clustering in Pathway Studio using Correlation network
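A correlation network can be sketched by linking genes whose expression profiles are strongly correlated and reading clusters off as connected components. The threshold, the Pearson measure, and the profiles below are assumptions for illustration; the slide does not specify how Pathway Studio builds or cuts the network.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_clusters(profiles, threshold):
    """Link genes whose profiles correlate at or above `threshold`,
    then return connected components of that network as clusters."""
    neighbours = {g: set() for g in profiles}
    for g1, g2 in combinations(profiles, 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            neighbours[g1].add(g2)
            neighbours[g2].add(g1)
    clusters, seen = [], set()
    for gene in profiles:
        if gene in seen:
            continue
        # Depth-first walk collects everything reachable from this gene.
        stack, component = [gene], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Made-up expression profiles across four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.5, 4.0, 2.0],
}
clusters = correlation_clusters(profiles, threshold=0.9)
print(clusters)
```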
Analysis of gene expression microarray data:
STEP 2: Pathway Analysis of responsive genes
• Network analysis
– Identification of differentially expressed (DE) protein complexes and physical networks
– Identification of major regulators and targets in expression network
• Via network querying (Build pathway tool)
• Via Network enrichment analysis (in PS Enterprise only)
• Functional analysis
– Comparison of responsive genes with ontologies and pathway collection
• Via Fisher exact test
• Via Gene Set Enrichment analysis (GSEA in PS Enterprise only)
• Gene ontology analysis (via Fisher’s test or GSEA)
• Comparative gene ontology analysis
• Via network querying (Build pathway tool)
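The Fisher exact test used for functional analysis can be sketched as a one-sided hypergeometric tail: how likely is it that a random gene list of the same size overlaps an ontology group at least this much? The function name and the numbers below are hypothetical, and the one-sided form is an assumption about which variant is meant.

```python
from math import comb

def fisher_enrichment_p(overlap, list_size, group_size, universe):
    """One-sided Fisher exact (hypergeometric) p-value: probability of seeing
    at least `overlap` group members in a random gene list of `list_size`
    drawn from `universe` genes, `group_size` of which belong to the group."""
    total = comb(universe, list_size)
    upper = min(list_size, group_size)
    tail = sum(
        comb(group_size, k) * comb(universe - group_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Made-up numbers: 3 of 4 responsive genes fall in a 5-gene group, out of 20 genes.
p = fisher_enrichment_p(overlap=3, list_size=4, group_size=5, universe=20)
print(p)
```

In practice the test is repeated for every group in the ontology or pathway collection, so the resulting p-values would normally be corrected for multiple testing.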
Functional analysis: comparative GO groups analysis
comparing cell responses in GO group space
Building protein network from interesting GO groups and
identification of its major expression regulator
Identification of drug-responsive genes
Evaluation of drug efficacy and side-effects
GSEA: Gene Set Enrichment analysis in PS Enterprise
Visualizing expression data on GSEA pathway
High-throughput data analysis in
Pathway Studio
• Functional ontology analysis
• Network analysis
Data model in ResNet database
Formalized representation of biological regulatory and interaction network
Relation types: Expression, PromoterBinding, DirectRegulation, ProtModification, Binding, MolSynthesis, MolTransport, Regulation, …MORE…
Uses shown in the diagram:
• Interpretation of Gene Expression data
• Interpretation of Proteomics data
• Interpretation of Metabolomics data, Biomarkers prediction and validation
Network analysis: identification of major regulators
and targets among DE genes via Build pathway
Network analysis: Identification of major regulators
Network enrichment analysis
Finds regulators with most differentially expressed targets
Better
Worse
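The idea of finding regulators with the most differentially expressed targets can be sketched as a simple ranking. This is a deliberately simplified stand-in: the slide does not describe the scoring used by Network enrichment analysis in PS Enterprise, which presumably applies a statistical test rather than raw counts. All names and target sets below are made up.

```python
def rank_regulators(targets_of, de_genes):
    """Rank regulators by how many of their known targets are differentially
    expressed; ties are broken by the fraction of targets that are DE.
    Assumes every regulator has at least one target."""
    scores = []
    for regulator, targets in targets_of.items():
        hits = len(targets & de_genes)
        scores.append((regulator, hits, hits / len(targets)))
    return sorted(scores, key=lambda s: (s[1], s[2]), reverse=True)

# Made-up regulator -> target sets.
targets_of = {
    "TF_A": {"g1", "g2", "g3", "g4"},
    "TF_B": {"g2", "g5"},
    "TF_C": {"g6", "g7", "g8"},
}
de_genes = {"g1", "g2", "g3", "g5"}
ranking = rank_regulators(targets_of, de_genes)
for regulator, hits, fraction in ranking:
    print(regulator, hits, fraction)
```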
Network Enrichment analysis in PS Enterprise
Visualizing expression data on NEA pathway
Network Enrichment Analysis: Example for metabolomics
Identification of metabolism regulators
Finds regulators with most differential levels of metabolite targets
Better
Worse
Network analysis: finding DE protein complexes using Build
dense expressed networks in PS Enterprise
>200 publications using AGI software and ResNet database
• Analysis of gene expression microarray data (139)
• Pathway Analysis (97)
• Disease mechanism (84)
• Publication by Ariadne Authors (18)
• Human genetics (7)
• Text processing (6)
• Reviews (7)
• Databases (3)
• Drug discovery (21)
• Toxicogenomics (4)
Most common workflow for microarray analysis in Pathway
Studio for disease
• Identify genes differentially expressed in
disease (DE genes)
• Identify genes known to be associated with the disease
according to the literature using Pathway Studio
• Identify DE genes that are linked to known
disease genes using Pathway Studio
• Report novel disease genes
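The four workflow steps above can be sketched with set operations. The function name, the gene names, and the interaction list are hypothetical; in the workflow described here, the known disease genes and the links would come from Pathway Studio rather than from hand-written lists.

```python
def novel_disease_candidates(de_genes, known_disease_genes, interactions):
    """Return DE genes that are not yet known disease genes but are linked
    to at least one known disease gene by an interaction."""
    linked = set()
    for a, b in interactions:
        if a in known_disease_genes:
            linked.add(b)
        if b in known_disease_genes:
            linked.add(a)
    return sorted((de_genes & linked) - known_disease_genes)

# All gene names and links below are made up for illustration.
de_genes = {"g1", "g2", "g3", "g4"}
known = {"g2", "d1"}
interactions = [("g1", "d1"), ("g3", "g2"), ("g4", "g9")]
candidates = novel_disease_candidates(de_genes, known, interactions)
print(candidates)
```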
Transcriptional network governing the angiogenic switch in human
pancreatic cancer. Abdollahi et al
PNAS July 31, 2007 104(31): 12890–12895
High-throughput data analysis in
Pathway Studio
Extras
Biomarker prioritization using expression data in Pathway Studio:
biomarkers for inflammatory bowel disease
1) A DE downstream target is better than a DE regulator
2) Secreted biomarkers are better than intracellular ones
Source: Dissection of the Inflammatory Bowel Disease Transcriptome
Using Genome-Wide cDNA Microarrays, PLoS, August 23, 2005
EXPRESSION VARIATION OF INDIVIDUAL BIOMARKERS
IN PATIENTS IS UNLIKELY TO AFFECT IN-DEGREE
HUBS
Sources of patient variations:
- genetic
- dietary & life-style
- stress-related
Using ChIP-on-chip data to find major regulators in the
expression experiment
Example of a proprietary algorithm that can be integrated into Pathway Studio
using the API
The algorithm finds inconsistencies between expression data and ResNet
• 100% consistency between expression and ResNet data; TP53 label assigned by algorithm: ACTIVATED
• Inconsistency between expression and ResNet data; TP53 label assigned by algorithm: ACTIVATED
Explanations for inconsistency:
1) Incorrect expression data – 50% of all cases
2) Incorrect ResNet data – 10% of all cases
3) Posttranscriptional regulation of TP53
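The slide does not describe how the algorithm works, so the following is only a guess at one way a sign-consistency check could be written: if a regulator is activated, its positively regulated targets should go up and its negatively regulated targets down. The function names, the INHIBITED and UNKNOWN labels, and all data are assumptions for illustration.

```python
def consistency(regulator_state, target_effects, expression_change):
    """Fraction of measured targets whose expression change agrees with the
    assumed regulator state (+1 activated, -1 inhibited).
    `target_effects` maps target -> +1 (positive) or -1 (negative regulation);
    `expression_change` maps gene -> +1 (up) or -1 (down)."""
    measured = [t for t in target_effects if t in expression_change]
    if not measured:
        return None
    agree = sum(
        1 for t in measured
        if regulator_state * target_effects[t] == expression_change[t]
    )
    return agree / len(measured)

def label_regulator(target_effects, expression_change):
    """Pick the regulator state that explains more of the observed changes."""
    up = consistency(+1, target_effects, expression_change)
    down = consistency(-1, target_effects, expression_change)
    if up is None:
        return "UNKNOWN", None
    return ("ACTIVATED", up) if up >= down else ("INHIBITED", down)

# Made-up effects and expression changes; target names are placeholders.
effects = {"t1": +1, "t2": +1, "t3": -1, "t4": +1}
changes = {"t1": +1, "t2": +1, "t3": -1, "t4": -1}
label, score = label_regulator(effects, changes)
print(label, score)
```

A score of 1.0 would correspond to the fully consistent case, and anything lower flags targets worth checking against the explanations listed above.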