Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 4
Pathway and Network Analysis
Lincoln Stein
Classes of Gene Set Analysis
DAVID
GSEA
Reactome FI network
PARADIGM
Khatri et al. PLOS Comp Bio. 8:1 2012
Module 4
bioinformatics.ca
Limitations of Gene Set Enrichment Analysis
• Many possible gene sets – diseases, molecular
function, biological process, cellular compartment,
pathways...
• Gene sets are heavily overlapping; need to sort
through lists of enriched gene sets!
• “Bags of genes” obscure regulatory relationships
among them.
Pathway Databases
• Advantages:
– Usually curated.
– Biochemical view of biological processes.
– Cause and effect captured.
– Human-interpretable visualizations.
• Disadvantages:
– Sparse coverage of genome.
– Different databases disagree on boundaries of pathways.
KEGG
Reactome
• Hand-curated pathways in human.
• Rigorous curation standards – every reaction traceable to
primary literature.
• Automatically-projected pathways to non-human species.
• 22 species; 1112 human pathways; 5078 proteins.
• Features:
– Google-map style reaction diagrams with overlays;
– Find pathways containing your gene list;
– Calculate gene overrepresentation in pathways;
– Find corresponding pathways in other species.
• Open access.
Pathway Commons
Pathway Colorization
• Main feature offered by all pathway databases.
• Upload a gene list
• Database calculates an enrichment score on each
pathway and displays ranked list.
• Browse into pathways of interest; download
colorized pictures.
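The enrichment score behind such a ranked list can be sketched with a one-sided hypergeometric test. This is a minimal illustration, not the scoring method of any particular database; the pathway memberships, the gene list, and the universe size of 20,000 genes are hypothetical.

```python
from math import comb

def hypergeom_pvalue(n_universe, n_pathway, n_list, n_overlap):
    """P(overlap >= n_overlap) when n_list genes are drawn at random from
    n_universe genes, of which n_pathway belong to the pathway."""
    max_overlap = min(n_pathway, n_list)
    total = comb(n_universe, n_list)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_list - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

def rank_pathways(gene_list, pathways, n_universe):
    """Return (pathway, overlap size, p-value) tuples, most enriched first."""
    genes = set(gene_list)
    results = []
    for name, members in pathways.items():
        overlap = genes & members
        p = hypergeom_pvalue(n_universe, len(members), len(genes), len(overlap))
        results.append((name, len(overlap), p))
    return sorted(results, key=lambda r: r[2])

# Hypothetical toy pathways and uploaded gene list.
pathways = {
    "Cell cycle": {"CDK1", "CCNB1", "AURKB", "PLK1", "BUB1"},
    "DNA repair": {"BRCA1", "BRCA2", "RAD51", "ATM", "CHEK2"},
}
gene_list = ["CDK1", "CCNB1", "AURKB", "PLK1", "TP53"]

ranked = rank_pathways(gene_list, pathways, n_universe=20000)
for name, overlap, p in ranked:
    print(f"{name}\toverlap={overlap}\tp={p:.3g}")
```

In practice the p-values would also need correction for testing many pathways at once.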
Example from Reactome
Networks
• Pathways capture only the “well understood” portion
of biology.
• Networks cover less well understood relationships:
– Genetic interactions
– Physical interaction
– Coexpression
– GO term sharing
– Adjacency in pathways
Biological Networks are Scale Free
Properties:
1. The degree (number of connections) of nodes follows a power law, P(k) ~ k^(-γ): a node of higher degree is progressively less likely to occur than a node of lower degree.
2. The local clustering coefficient (tendency of a node's neighbors to interconnect) is independent of the degree of the node.
Nature Reviews Genetics 5, 101-113 (February 2004) | doi:10.1038/nrg1272
Biological Networks are Scale Free
Implications:
1. A small number of genes have a large number of connections (chokepoints).
2. A large number of genes have a small number of connections (leaves).
3. Genes cluster (functional groups).
4. The cluster sizes are also scale-free (many small clusters, few large clusters).
Nature Reviews Genetics 5, 101-113 (February 2004) | doi:10.1038/nrg1272
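The "few hubs, many leaves" pattern can be reproduced with a small simulation. The sketch below grows a network by preferential attachment, one standard generative model that yields a power-law degree distribution. The model choice, network size, and seed are assumptions made for illustration and do not come from the slides.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, m_links, seed=0):
    """Grow a graph in which each new node links to m_links existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    core = range(m_links + 1)
    # Start from a small fully connected core.
    edges = [(i, j) for i in core for j in core if i < j]
    # Each node appears in 'stubs' once per edge end, so a uniform draw
    # from this list is a degree-proportional draw over nodes.
    stubs = [node for edge in edges for node in edge]
    for new in range(m_links + 1, n_nodes):
        targets = set()
        while len(targets) < m_links:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

edges = preferential_attachment(n_nodes=2000, m_links=2)
degree = Counter(node for edge in edges for node in edge)
degrees = sorted(degree.values())

median_degree = degrees[len(degrees) // 2]
max_degree = degrees[-1]
n_leaves = sum(1 for d in degrees if d <= 3)

print(f"median degree: {median_degree}")
print(f"largest hub degree: {max_degree}")
print(f"nodes with degree <= 3: {n_leaves} of {len(degrees)}")
```

Most nodes keep only the few links they were created with, while a handful of early nodes accumulate many connections.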
Network Databases
• Can be built automatically or via curation.
• Popular sources of curated networks:
– BioGRID – Curated interactions from literature; 529,000
genes, 167,000 interactions.
– IntAct – Curated interactions from literature; 60,000 genes,
203,000 interactions.
– MINT – Curated interactions from literature; 31,000 genes,
83,000 interactions.
Uncurated Interaction Sources
• Text mining approaches
– Computationally extract gene relationships from text, such
as PubMed abstracts.
– Much faster than hand curation.
– Not perfect:
• Problems recognizing gene names. Is hedgehog a gene or a
species?
• Natural language processing is difficult.
– Popular resources:
• iHOP
• PubGene
Uncurated Interaction Sources
• Experimental techniques
– Yeast two-hybrid (Y2H) protein interactions.
– Protein complex pulldowns/mass spec.
– Genetic screens, such as synthetic lethals,
enhancer/suppressor screens.
– NOT perfect
• Y2H interactions have taken proteins out of natural context;
physical interaction != biological interaction.
• Protein complex pulldowns plagued by “sticky” proteins such as
actin.
• Genetic screens highly sensitive to genetic background (“network
effects”).
Integrative Approaches
• Combine multiple sources of evidence to increase
accuracy.
• Simple example:
– “Party hubs” are Y2H interactions that have been filtered
for those partners that share the same temporal-spatial
location.
• Complex example:
– Combine multiple sources of curated and uncurated
evidence.
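The simple example above amounts to a filter over interaction pairs. A minimal sketch follows, assuming each protein carries a set of (compartment, stage) annotations; the protein names and annotations are hypothetical.

```python
def filter_colocalized(interactions, annotations):
    """Keep only interactions whose partners share at least one
    (compartment, stage) annotation; report the shared annotations."""
    kept = []
    for a, b in interactions:
        shared = annotations.get(a, set()) & annotations.get(b, set())
        if shared:
            kept.append((a, b, sorted(shared)))
    return kept

# Hypothetical Y2H pairs and temporal-spatial annotations.
y2h_pairs = [("ProtA", "ProtB"), ("ProtA", "ProtC"), ("ProtB", "ProtD")]
annotations = {
    "ProtA": {("nucleus", "S phase")},
    "ProtB": {("nucleus", "S phase"), ("cytoplasm", "G1")},
    "ProtC": {("membrane", "G1")},
    "ProtD": {("cytoplasm", "G1")},
}

filtered = filter_colocalized(y2h_pairs, annotations)
for a, b, shared in filtered:
    print(a, b, shared)
```

Pairs with missing annotations are dropped here, which is a conservative choice; a real pipeline might treat missing data differently.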
Example: Reactome FI Network
Curated Human Data – Version 35.
5078 proteins
4166 reactions
3870 complexes
1112 pathways
Only ~25% of genome!
Goal: add a “corona” of
uncurated interaction data
around scaffold of curated
pathway data.
Expanding Reactome’s Coverage
Diagram labels:
– Curated Pathways
– Uncurated Information
– human PPI
– PPI inferred from fly, worm & yeast
– PPI from text mining
– GeneWays
– Gene co-expression
– CellMap
– TRED
– GO annotation on biological processes
– Protein domain-domain interactions
– Naïve Bayes Classifier
– Annotated Functional Interactions
– Predicted Functional Interactions
Wu et al. (2010) Genome Biology
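How a naïve Bayes classifier combines independent evidence types can be sketched as follows. This is only a toy illustration of the technique named in the diagram, not the published model: the feature names, the training pairs, and the add-one smoothing are all assumptions.

```python
from math import exp, log

# Hypothetical binary evidence types for a protein pair.
FEATURES = ["human_ppi", "ortholog_ppi", "coexpression", "shared_go_bp"]

def evidence(*bits):
    """Build a feature dict from 0/1 flags given in FEATURES order."""
    return dict(zip(FEATURES, bits))

def train(examples):
    """examples: list of (feature_dict, label); label is True for a known
    functional interaction. Uses add-one smoothing on each probability."""
    pos = [f for f, label in examples if label]
    neg = [f for f, label in examples if not label]

    def p_on(rows, name):
        return (sum(r[name] for r in rows) + 1) / (len(rows) + 2)

    model = {name: (p_on(pos, name), p_on(neg, name)) for name in FEATURES}
    prior_log_odds = log(len(pos) / len(neg))
    return prior_log_odds, model

def posterior(features, prior_log_odds, model):
    """Probability that a pair is a functional interaction, assuming the
    evidence types are conditionally independent (the 'naive' assumption)."""
    log_odds = prior_log_odds
    for name, (p_pos, p_neg) in model.items():
        if features[name]:
            log_odds += log(p_pos / p_neg)
        else:
            log_odds += log((1 - p_pos) / (1 - p_neg))
    return 1 / (1 + exp(-log_odds))

# Hypothetical training pairs: positives would come from curated pathways,
# negatives from random protein pairs.
training = [
    (evidence(1, 1, 1, 1), True), (evidence(1, 0, 1, 1), True),
    (evidence(1, 1, 0, 1), True), (evidence(0, 1, 1, 1), True),
    (evidence(0, 0, 0, 0), False), (evidence(0, 0, 1, 0), False),
    (evidence(0, 0, 0, 1), False), (evidence(1, 0, 0, 0), False),
]
prior_log_odds, model = train(training)

strong = posterior(evidence(1, 1, 1, 1), prior_log_odds, model)
weak = posterior(evidence(0, 0, 0, 0), prior_log_odds, model)
print(f"all evidence present: {strong:.3f}")
print(f"no evidence present:  {weak:.3f}")
```

A pair would be called a predicted FI when its score passes a chosen cutoff; a strict cutoff trades a low false-positive rate for a high false-negative rate.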
Integrated Functional Interaction (FI) Network
• 10,956 proteins (9,542 genes).
• 209,988 FIs.
• ~50% coverage of genome.
• False (+) rate < 1%
• False (-) rate ~80%
(5% of network shown here)
Active Network Extraction
Curated Pathway DBs + Uncurated Interaction Evidence
Machine Learning
Reactome Functional Interaction Network (~11,000 proteins; 200,000 interactions)
Extract and Cluster Altered Genes
Disease “modules”
(10-30)
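The extraction step can be sketched as: keep only the network edges that join two altered genes, then group what remains. The slides do not state the clustering method, so the sketch below uses connected components as a simple stand-in. The edges and the altered gene list are hypothetical.

```python
from collections import defaultdict

def altered_modules(fi_edges, altered_genes):
    """Restrict the network to edges between altered genes and return the
    connected components of that subnetwork, largest first."""
    altered = set(altered_genes)
    neighbours = defaultdict(set)
    for a, b in fi_edges:
        if a in altered and b in altered:
            neighbours[a].add(b)
            neighbours[b].add(a)

    modules, seen = [], set()
    for gene in sorted(neighbours):
        if gene in seen:
            continue
        # Depth-first walk to collect every gene reachable from this one.
        stack, module = [gene], set()
        while stack:
            g = stack.pop()
            if g in module:
                continue
            module.add(g)
            stack.extend(neighbours[g] - module)
        seen |= module
        modules.append(module)
    return sorted(modules, key=len, reverse=True)

# Hypothetical functional interactions and altered genes.
fi_edges = [
    ("KRAS", "BRAF"), ("BRAF", "MAP2K1"), ("KRAS", "EGFR"),
    ("TP53", "MDM2"), ("SMAD4", "TGFBR2"), ("CDH1", "CTNNB1"),
]
altered_genes = ["KRAS", "BRAF", "MAP2K1", "TP53", "SMAD4", "TGFBR2", "CDH1"]

modules = altered_modules(fi_edges, altered_genes)
for i, module in enumerate(modules):
    print(f"Module {i}: {sorted(module)}")
```

Altered genes with no altered neighbour (TP53 and CDH1 in this toy input) fall out of the result, since they form no module.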
Clustering of TCGA Breast Cancer Mutations
Clusters shown in figure:
– Cadherin signaling
– Signaling by Tyrosine Kinase receptors
– NOTCH and Wnt signaling
– Focal adhesion
– ECM-Receptor interaction
– Neuroactive ligand-receptor interaction
– Mucin cluster
– Cell adhesion molecules
– Ubiquitin-mediated proteolysis
– Metabolism of proteins
– Signaling by Rho GTPases
– DNA repair
– Cell cycle
– Axon guidance
– Calcium signaling
– M phase
– G2/M Transition
Pancreatic Mutation Modules
Module 0: MAPK, Hedgehog, TGFβ signaling
Module 1: Heterotrimeric G-protein signaling
Module 2: B-cell receptor, ERBB, FGFR, EGFR signaling
Module 3: Translation
Module 4: ECM, focal adhesion, integrin signaling
Module 5: Wnt & Cadherin signaling
Module 6: Ca2+ signaling
Module 7: Axon guidance
Module 8: MHC class II antigen presentation
Module 9: Axon guidance
Module 10: Muscle contraction
256 Pancreatic Cancer Mutations
Figure axes: Patient Samples, Genes
Modules After Hierarchical Clustering
Figure axes: Patient Samples, Modules
Network-Based Clustering Algorithms
• Reactome FI network (Wu & Stein, Genome Biol. 2012
13(12):R112)
– Expression or SNV analysis
– Online analysis via Cytoscape Plugin (lab)
• HotNet (Vandin et al. J Comput Biol. 2011 Mar;18(3):507-22).
– Expression or SNV analysis
– Local installation with Python & MATLAB
– Cytoscape visualization
• WGCNA (Langfelder et al. 2008 BMC Bioinformatics 9: 559.)
– Expression analysis
– Local installation as R package.
Classification of Tumors via Molecular Phenotype
Figure labels: Genomics, Transcriptomics, Proteomics; Test; Classify
Risk Stratification
Figure labels:
– Don’t Treat
– Treat
– TEST
– Low risk – reduce treatment
– High risk – treat aggressively
– 10-20% progress
– No Relapse
– Relapse
Challenges in Biomarker Discovery
• Overtraining
– 22,000 genes; any given cancer may show alterations in 1000s of
them; patient cohorts are in 100s.
– Can find a set of gene alterations that nicely predicts survival in a
single cohort by chance.
– Field is littered with biomarkers that didn’t replicate in independent
cohorts.
• Disease Heterogeneity
– If there are many subtypes of disease then need even larger cohorts.
• Tumor Heterogeneity
– A single primary tumor may carry high-risk and low-risk subclones
simultaneously.
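The overtraining problem is easy to reproduce with purely random data. In the sketch below no gene has any real relationship to outcome, yet the best of many genes looks predictive in a small discovery cohort and falls back to chance in an independent one. Cohort sizes, the 2,000-gene count (reduced from 22,000 for speed), and the seed are arbitrary assumptions.

```python
import random

def simulate_cohort(n_patients, n_genes, rng):
    """Random alterations and random outcomes: by construction,
    no gene truly predicts outcome."""
    outcome = [rng.random() < 0.5 for _ in range(n_patients)]
    altered = [
        [rng.random() < 0.5 for _ in range(n_patients)]
        for _ in range(n_genes)
    ]
    return altered, outcome

def accuracy(gene_row, outcome):
    """Fraction of patients whose outcome matches the gene's altered status."""
    hits = sum(a == o for a, o in zip(gene_row, outcome))
    return hits / len(outcome)

rng = random.Random(42)
n_genes = 2000
discovery, discovery_outcome = simulate_cohort(100, n_genes, rng)
validation, validation_outcome = simulate_cohort(1000, n_genes, rng)

# Pick the gene that "predicts" outcome best in the discovery cohort.
best_gene = max(
    range(n_genes),
    key=lambda g: accuracy(discovery[g], discovery_outcome),
)
discovery_acc = accuracy(discovery[best_gene], discovery_outcome)
validation_acc = accuracy(validation[best_gene], validation_outcome)

print(f"best gene in discovery cohort:  {discovery_acc:.2f}")
print(f"same gene in validation cohort: {validation_acc:.2f}")
```

Searching over combinations of genes rather than single genes makes the effect stronger, because the number of candidate signatures grows much faster than the cohort size.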
Using Network Architecture to Accelerate Biomarker Selection
Guanming Wu
• Disease Module Map
• Expression analysis of tumours from multiple patients
• Principal component analysis on modules
• Correlate principal components with clinical parameters
Genome Biol. 2012 Dec 10;13(12):R112
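The last two steps of the workflow can be sketched on synthetic data: summarize each module by its first principal component across patients, then correlate that score with a clinical parameter. The sketch is a toy under stated assumptions and not the published pipeline. The two modules, the hidden factor, and the "risk" variable are all simulated and hypothetical, and the component is found by power iteration to stay within the standard library.

```python
import random
from math import sqrt

def first_pc_scores(rows, n_iter=200):
    """Per-patient scores on the first principal component of a
    patients x genes matrix, approximated by power iteration."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    x = [[v - m for v, m in zip(row, means)] for row in rows]
    w = [1.0] * len(means)  # initial loading vector
    for _ in range(n_iter):
        scores = [sum(v * wj for v, wj in zip(row, w)) for row in x]
        w = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(len(w))]
        norm = sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return [sum(v * wj for v, wj in zip(row, w)) for row in x]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical cohort: one hidden factor drives both the "driven" module's
# expression and the clinical parameter; the "noise" module is unrelated.
rng = random.Random(1)
n_patients = 60
latent = [rng.gauss(0, 1) for _ in range(n_patients)]
risk = [z + rng.gauss(0, 0.5) for z in latent]
modules = {
    "driven": [[z + rng.gauss(0, 0.5) for _ in range(5)] for z in latent],
    "noise": [[rng.gauss(0, 1) for _ in range(5)] for _ in latent],
}

correlation = {
    name: pearson(first_pc_scores(rows), risk)
    for name, rows in modules.items()
}
for name, r in correlation.items():
    # The sign of a principal component is arbitrary, so report |r|.
    print(f"{name}: |r| = {abs(r):.2f}")
```

Working with a few dozen module scores instead of thousands of genes cuts the number of hypotheses tested, which is what makes this approach less prone to overtraining.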
Breast Cancer Expression Biomarker: Samples Used
• Built the network using the NEJM data set: van de Vijver et al.
2002
– 295 Samples, ~12,000 genes
– Event: death
• Validated with GSE4922: Ivshina et al. Cancer Res.
2006
– 249 Samples, ~13,000 genes
– Event: recurrence or death
PC Analysis Identifies Module 2 as Explaining Much of
Variation in Survival
Same Signature Predicts Survival in
Independent Data Set
And Three More Data Sets as Well…
Module 2: Kinetochore + Aurora B Signaling
Integration of Multiple Data Sets
• Experimental samples can be interrogated many ways:
– RNA expression
– Genome/exome sequencing
– Copy number changes/loss of heterozygosity
– shRNA knockdown screens
• Integrate multiple functional data types using
network/pathway relationships?
PARADIGM
Vaske, Benz et al. Bioinformatics 26:i237 2010
Factor graph: directed graph
connecting genes; each gene is
activated, inactivated, or
unchanged in a single patient.
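The three-state idea can be illustrated with a very small factor graph for a single gene: DNA copy number influences mRNA level, which influences activity, and each measured data type is a noisy readout of its hidden variable. This is a toy sketch only. The factor weights (eps values) are hypothetical, and inference is done by brute-force enumeration of all 27 joint states, which is feasible only at this size and is not a description of how PARADIGM itself performs inference.

```python
from itertools import product

STATES = (-1, 0, 1)  # inactivated, unchanged, activated

def follow(parent, child, eps=0.1):
    """Factor favouring child == parent (eps is a hypothetical weight
    for each of the two disagreeing states)."""
    return 1 - 2 * eps if parent == child else eps

def observe(hidden, observed, eps=0.15):
    """Factor tying a hidden state to a noisy measurement of it."""
    return 1 - 2 * eps if hidden == observed else eps

def activity_posterior(obs_copy=None, obs_expr=None):
    """Posterior over the gene's activity state given optional copy-number
    and expression observations, by summing over all joint states."""
    weights = {s: 0.0 for s in STATES}
    for copy, mrna, activity in product(STATES, repeat=3):
        w = follow(copy, mrna) * follow(mrna, activity)
        if obs_copy is not None:
            w *= observe(copy, obs_copy)
        if obs_expr is not None:
            w *= observe(mrna, obs_expr)
        weights[activity] += w
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

no_data = activity_posterior()
agree = activity_posterior(obs_copy=1, obs_expr=1)
conflict = activity_posterior(obs_copy=1, obs_expr=-1)

print("no data:          ", {s: round(p, 3) for s, p in no_data.items()})
print("gain + high expr: ", {s: round(p, 3) for s, p in agree.items()})
print("gain + low expr:  ", {s: round(p, 3) for s, p in conflict.items()})
```

When the two data types agree the activity call is confident; when they conflict the posterior is spread out, which is how combining data types tempers a call based on one measurement alone.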
PARADIGM: The Bad News
• Distributed in source code form only
– Requires several third-party math/graph libraries (all open
source).
– Tedious to compile!
• Scant documentation.
• No repositories of formatted pathway data.
• No examples of converting experimental data into input
files.
• Good news: we are working on a Reactome-based web service
implementation.
Take Home Messages
• Pathway/network analysis can provide context to altered
gene lists.
• Pathway/network analysis differs greatly in complexity,
power, and usability:
– SIMPLE: Pathway diagram colorization
– MODERATE: Reactome FI network extraction
– COMPLEX: PARADIGM
• This type of analysis is work-in-progress, but promises
ability to integrate data across many dimensions.
URLs
KEGG – www.genome.jp/kegg
Biocarta – www.biocarta.com
WikiPathways – www.wikipathways.org
Reactome – www.reactome.org
NCI/PID – pid.nci.nih.gov
Ingenuity – www.ingenuity.com
Pathway Commons – www.pathwaycommons.org/pc/
PARADIGM – http://sbenz.github.com/Paradigm/
BioGRID – www.thebiogrid.org
IntAct – www.ebi.ac.uk/intact
MINT – mint.bio.uniroma2.it
iHOP – www.ihop-net.org/UniPub/iHOP
PubGene – www.pubgene.org