A Factor Graph Model for Minimal Gene Set Enrichment Analysis
Diana Uskat
Computational Biology - Gene Center Munich
Motivation
Problem Outline:
• Single gene analysis of microarray experiments entails a large multiple testing problem
• Even after appropriate multiple testing correction, the result is usually a long list of differentially expressed genes
• Interpretation is difficult by hand
Possible improvement: Gene set enrichment analysis
1. Group genes into different biologically meaningful categories (Gene Ontology, KEGG Pathways, Transcription factor targets)
2. Use a statistical method for finding those categories which are enriched for differentially expressed genes
[Figure: Cutout of Gene Ontology. Graph from Ontologizer by S. Bauer, J. Gagneur, P. N. Robinson (NAR 2010)]
24.03.2010
Diana Uskat - Gene Center Munich
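Step 2 of the outline leaves the statistical method open. One common choice for testing a single category is a one-sided hypergeometric (Fisher-type) test, and the sketch below assumes that choice purely for illustration. The function name and the toy numbers are hypothetical and do not come from the talk.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_genes, n_category, n_de, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn at random
    from n_genes, of which n_category belong to the category."""
    total = comb(n_genes, n_de)
    upper = min(n_category, n_de)
    tail = sum(
        comb(n_category, k) * comb(n_genes - n_category, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy numbers (assumed): 10 genes, a category with 3 members,
# 3 differentially expressed genes, all 3 inside the category.
p_value = hypergeom_enrichment_pvalue(10, 3, 3, 3)
```

Running such a test once per category is what creates the second multiple testing problem discussed on the next slide.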
Motivation
Established Methods:
• GSEA (Subramanian, Tamayo)
• topGO (Alexa)
• Globaltest (Goeman, Mansmann)
• GOstats (Falcon, Gentleman)
Drawbacks:
• There are often thousands of overlapping categories, and genes can belong to multiple categories → difficult new multiple testing problem
• Group testing often returns a large number of significant categories → identification of biologically relevant categories is difficult
[Figure: Cutout of Gene Ontology. Graph from Ontologizer by S. Bauer, J. Gagneur, P. N. Robinson (NAR 2010)]
Minimal Gene Set Enrichment
Idea (Bauer, Gagneur et al., Nucleic Acids Research 2010)
• Search for a sparse explanation, i.e. a minimal number of categories
that explain the data (sufficiently well)
• Use a simplistic probabilistic graphical model relating categories and
genes, and do Bayesian inference on the marginal posterior for each
category
[Figure: two panels, "Correct explanation" and "Correct minimal explanation". Each shows the categories T1, T2, T3 above the genes E1, E2, E3; an edge means membership, e.g. "gene E3 is element of category T3". Coloured means "on".]
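The difference between a correct explanation and a correct minimal explanation can be made concrete with a brute-force search. The sketch below is only an illustration: the membership table and the set of "on" genes are hypothetical (the edges of the figure are not recoverable here), and it treats the gene states as noise-free, unlike the probabilistic model of Bauer, Gagneur et al.

```python
from itertools import combinations

# Hypothetical membership table: category -> member genes.
categories = {
    "T1": {"E1", "E2"},
    "T2": {"E2", "E3"},
    "T3": {"E3"},
}
on_genes = {"E1", "E2", "E3"}  # assumed set of "on" genes

def explains(active, on):
    """Active categories explain the data if their members are exactly the 'on' genes."""
    covered = set().union(*(categories[t] for t in active))
    return covered == on

def minimal_explanations(on):
    """Return all explanations that use the smallest possible number of categories."""
    names = sorted(categories)
    for size in range(len(names) + 1):
        hits = [set(c) for c in combinations(names, size) if explains(c, on)]
        if hits:
            return hits
    return []
```

With this toy table, switching on all three categories is a correct explanation, while `minimal_explanations(on_genes)` keeps only the two-category solutions. Enumeration grows exponentially with the number of categories, which is one reason to prefer the inference approach described on the following slides.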
Minimal Gene Set Enrichment
The model
[Figure: three-layer network with categories T1, T2, T3, genes E1, E2, E3, and observations (data) D1, D2, D3]
A Bayesian Network factorization of the full posterior:
$$\Pr(T, E \mid D) \propto \Pr(D \mid E)\,\Pr(E \mid T)\,\Pr(T)$$
(posterior ∝ likelihood × prior)
Main trick: Use a prior favoring sparse solutions
Factor Graphs
Our method: Factor Graphs
• Graphical model (Kschischang et al., IEEE, 2001)
• Bipartite graph with factor nodes and variable nodes
• Each factor node encodes a function of its neighbouring variables
• Efficient computation of marginal distributions with the sum-product algorithm (if the factor graph is a tree...)
[Figure: network with categories T1, T2, T3, genes E1, E2, E3, and data D1, D2, D3]
Factor Graphs
• The factor nodes f_j encode Pr(D|E), given by the dataset
$$\Pr(T, E \mid D) \propto \prod_{j \in J} f_j(E_j) \cdot \prod_{j \in J} g_j\left(E_j, T_{\mathrm{next}(g_j)}\right) \cdot f_T(T)$$
[Figure: factor graph with variable nodes T1, T2, T3 and E1, E2, E3; factor nodes f1, f2, f3 link the genes to the data D1, D2, D3]
Factor Graphs
• The factor nodes g_j link each gene to its categories: E is only active if at least one parent is active
[Figure: factor graph with factor nodes g1 to g6 between the categories T1, T2, T3 and the genes E1, E2, E3, and f1, f2, f3 between the genes and the data D1, D2, D3]
Factor Graphs
• The factor node f_T encodes the prior on the categories:
$$f_T(T) = \prod_{j=1}^{N} p^{T_j} (1 - p)^{1 - T_j} \quad \text{with } 0 < p < 0.5$$
[Figure: complete factor graph with f_T attached to T1, T2, T3; g1 to g6 between the categories and the genes E1, E2, E3; f1, f2, f3 between the genes and the data D1, D2, D3]
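To see how the three kinds of factors combine, the following sketch evaluates the factorization by brute-force enumeration on a toy graph. Everything concrete in it is an assumption made for illustration: the parent table, the binary observations, the error rates `ALPHA` and `BETA` used for Pr(D|E), the value of `P`, and the reading of g as a deterministic OR (a gene is active exactly when at least one parent category is active).

```python
from itertools import product

# Hypothetical toy structure: gene -> parent categories.
parents = {"E1": ["T1"], "E2": ["T1", "T2"], "E3": ["T2", "T3"]}
cats = ["T1", "T2", "T3"]
observed = {"E1": 1, "E2": 1, "E3": 0}  # assumed binary data D
ALPHA, BETA, P = 0.1, 0.2, 0.2  # assumed false-positive rate, false-negative rate, prior p

def f_T(t):
    """Sparse prior: product over categories of p^T_j * (1 - p)^(1 - T_j)."""
    out = 1.0
    for t_j in t.values():
        out *= P ** t_j * (1 - P) ** (1 - t_j)
    return out

def f_obs(d, e):
    """Pr(D = d | E = e) under the assumed error rates."""
    if e == 1:
        return 1 - BETA if d == 1 else BETA
    return ALPHA if d == 1 else 1 - ALPHA

def unnormalized_posterior(t):
    """f_T(T) times the product of f_j(E_j), with E_j fixed by the OR of its parents."""
    score = f_T(t)
    for gene, pa in parents.items():
        e = int(any(t[c] for c in pa))
        score *= f_obs(observed[gene], e)
    return score

def category_marginals():
    """Posterior probability that each category is active, by summing over all T."""
    total = 0.0
    active_mass = {c: 0.0 for c in cats}
    for bits in product((0, 1), repeat=len(cats)):
        t = dict(zip(cats, bits))
        w = unnormalized_posterior(t)
        total += w
        for c in cats:
            if t[c]:
                active_mass[c] += w
    return {c: m / total for c, m in active_mass.items()}
```

In this toy setting T1 alone accounts for both observed genes, so its marginal comes out high while T2 and T3 stay low. That is the effect the sparse prior is meant to produce. Enumeration costs 2^N evaluations, which motivates message passing for realistic numbers of categories.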
Estimation Methods for Factor Graphs
Computation of posterior for T,E:
• Message-Passing Algorithm: Sum-Product Algorithm
• Stops with the correct result after one round if the graph has a tree structure
• No guarantees if the graph has cycles (e.g., oscillation may occur), but it works well in practice
Principle:
• Start in leaf nodes
• Message propagation:
– variable to factor node („Product“ of the messages from the variable's other factors)
– factor to variable node („Sum“ over the factor's other variables)
• Termination: Compute the marginal distribution of the variable nodes
[Figure: factor graph with f_T, categories T1, T2, T3, factor nodes g1 to g6, genes E1, E2, E3, factor nodes f1, f2, f3, and data D1, D2, D3]
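The message-passing principle can be sketched on a small tree-shaped factor graph. The example below is a minimal illustration under stated assumptions, not the implementation used in the talk: the graph (two categories, two genes), the observations `D`, the rates `ALPHA`, `BETA`, `P`, and all function names are hypothetical. Recursion with caching stands in for an explicit leaf-to-root schedule, and a brute-force sum is included as a check.

```python
from functools import lru_cache
from itertools import product
from math import prod

P, ALPHA, BETA = 0.2, 0.1, 0.2  # assumed prior and error rates
D = {"E1": 1, "E2": 0}          # assumed binary observations

def prior(t):
    return P if t else 1 - P

def obs(d):
    """Return the unary factor e -> Pr(D = d | E = e)."""
    def factor(e):
        if e:
            return 1 - BETA if d else BETA
        return ALPHA if d else 1 - ALPHA
    return factor

def g_or(e, *ts):
    """Deterministic OR: weight 1 if e equals the OR of its parents, else 0."""
    return 1.0 if e == int(any(ts)) else 0.0

# Factor name -> (scope, function). This toy graph is a tree.
FACTORS = {
    "pT1": (("T1",), prior),
    "pT2": (("T2",), prior),
    "g1": (("E1", "T1", "T2"), g_or),
    "g2": (("E2", "T2"), g_or),
    "f1": (("E1",), obs(D["E1"])),
    "f2": (("E2",), obs(D["E2"])),
}
VARIABLES = ["T1", "T2", "E1", "E2"]

def neighbours(var):
    return [name for name, (scope, _) in FACTORS.items() if var in scope]

@lru_cache(maxsize=None)
def msg_var_to_factor(var, factor, value):
    # Product of the messages from all other neighbouring factors.
    return prod(msg_factor_to_var(f, var, value)
                for f in neighbours(var) if f != factor)

@lru_cache(maxsize=None)
def msg_factor_to_var(factor, var, value):
    # Sum over the factor's other variables of factor value times incoming messages.
    scope, fn = FACTORS[factor]
    others = [v for v in scope if v != var]
    total = 0.0
    for assignment in product((0, 1), repeat=len(others)):
        values = dict(zip(others, assignment))
        values[var] = value
        weight = fn(*(values[v] for v in scope))
        weight *= prod(msg_var_to_factor(v, factor, values[v]) for v in others)
        total += weight
    return total

def marginal(var):
    """Pr(var = 1 | D) from the product of all incoming factor messages."""
    unnorm = [prod(msg_factor_to_var(f, var, x) for f in neighbours(var))
              for x in (0, 1)]
    return unnorm[1] / sum(unnorm)

def brute_force_marginal(var):
    """Reference value: sum the full product of factors over all assignments."""
    weights = [0.0, 0.0]
    for assignment in product((0, 1), repeat=len(VARIABLES)):
        values = dict(zip(VARIABLES, assignment))
        w = prod(fn(*(values[v] for v in scope)) for scope, fn in FACTORS.values())
        weights[values[var]] += w
    return weights[1] / sum(weights)
```

Because this toy graph is a tree, `marginal(v)` agrees with `brute_force_marginal(v)` for every variable. On a graph with cycles the recursion above would not terminate, and an iterative (loopy) schedule without exactness guarantees would be needed instead.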
Application: Yeast Salt Stress
• Categories: Transcription factors (with their targets) instead of GO categories
• Given:
– List of transcription factors with their corresponding genes
– List of genes (their p-values) from a yeast salt stress experiment
• Question: Which transcription factors are active during salt stress?
• Task: Find a set of transcription factors that are most likely to be active
[Figure: bipartite graph of transcription factors TF1, TF2 and genes g1 to g5; an edge means, e.g., "g2 is target of TF2"]
Results
[Figure: graph of ~2,000 genes and 118 transcription factors, obtained from re-analysis of Harbison TF binding data (Nature, 2004) by MacIsaac et al. (BMC Bioinformatics, 2006)]
Results
Transcription factors highlighted in the graph: YML081W, DAL81, STB4, HSF1, UME6, SNT2, RGT1, MET28, MSN2, GAL4, SKO1
Legend:
• Previously known transcription factors involved in salt stress (Capaldi et al., Nat. Genet. 2008; Wu and Chen, Bioinform Biol Insights 2009)
• Differentially phosphorylated transcription factors (Soufi et al., Mol. Biosyst. 2009)
Summary and Outlook
• Todo: scalability and speed
• Lists of (meaningful) gene sets are better than
lists of genes
• Search for biologically meaningful explanations requires a new minimal model (MGSE) for gene set enrichment analysis
• We use factor graphs for parameter estimation
• Wide application to GO analysis, TF-target
analysis, Pathway enrichment
Acknowledgments
Gene Center Munich:
Achim Tresch, Theresa Niederberger, Björn
Schwalb, Sebastian Dümcke
Collaborating Partners:
Gene Center Munich:
Patrick Cramer, Christian Miller, Daniel
Schulz, Dietmar Martin, Andreas Mayer
EMBL Heidelberg:
Julien Gagneur (talk Nov. 2009, working group conference of the GMDS „AG Statistische Methoden in der Bioinformatik“, Munich)