Integrative Genomics
BME 230
Probabilistic Networks
Incorporate uncertainty explicitly
Capture sparseness of wiring
Incorporate multiple kinds of data
Most models incomplete
• Molecular systems are complex & rife with
uncertainty
• Data/Models often incomplete
– Some gene structures misidentified in some
organisms by gene structure model
– Some true binding sites for a transcription factor
can’t be found with a motif model
– Not all genes in the same pathway predicted to be
coregulated by a clustering model
• Even if we have perfect learning and inference
methods, an appreciable amount of data is left
unexplained
Why infer system properties from data?
Knowledge acquisition bottleneck
• Knowledge acquisition is an expensive process
• Often we don’t have an expert
Data is cheap
• Amount of available information growing rapidly
• Learning allows us to construct models from raw data
Discovery
• Want to identify new relationships in a data-driven
way
Graphical models for joint-learning
• Combine probability theory & graph theory
• Explicitly link our assumptions about how a
system works with its observed behavior
• Incorporate notion of modularity: complex
systems often built from simpler pieces
• Ensures we find consistent models in the
probabilistic sense
• Flexible and intuitive for modeling
• Efficient algorithms exist for drawing
inferences and learning
• Many classical formulations are special cases
Michael Jordan, 1998
Motif Model Example (Barash ’03)
• Sites can have
arbitrary
dependence on
each other.
Barash ’03 Results
• Many TFs have binding sites that exhibit
dependency
Barash ’03 Results
Bayesian Networks for joint-learning
• Provide an intuitive formulation for combining
models
• Encode notion of causality which can guide model
formulation
• Formally expresses decomposition of system state
into modular, independent sub-pieces
• Makes learning in complex domains tractable
Unifying models for molecular
biology
• We need knowledge representation systems
that can maintain our current understanding of
gene networks
• E.g. DNA damage response and promotion
into S-phase
– highly linked sub-system
– experts united around sub-system
– but probably need combined model to understand
either sub-system
• Graphical models offer one solution to this
“Explaining Away”
• Causes “compete” to explain observed data
• So, if we observe the data and one of the causes,
this provides information about the other cause.
• Intuition into “V-structures”:
[Figure: sprinkler → wet grass ← rain]
Observing that the grass is wet and then finding out
the sprinklers were on decreases our belief that it
rained. So sprinkler and rain are dependent
given that their child is observed.
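The effect can be checked numerically. A minimal sketch of the sprinkler network with made-up CPT values (all numbers are illustrative, not from the slides):

```python
# "Explaining away" on the sprinkler network; all probabilities are made up.

# Priors (sprinkler and rain are independent a priori)
p_sprinkler = {True: 0.2, False: 0.8}
p_rain = {True: 0.3, False: 0.7}

# P(wet | sprinkler, rain)
p_wet = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.9,
    (False, False): 0.01,
}

def joint(s, r, w):
    pw = p_wet[(s, r)]
    return p_sprinkler[s] * p_rain[r] * (pw if w else 1 - pw)

# P(rain | wet): marginalize over sprinkler
num = sum(joint(s, True, True) for s in (True, False))
den = sum(joint(s, r, True) for s in (True, False) for r in (True, False))
p_rain_given_wet = num / den

# P(rain | wet, sprinkler on): the competing cause is now known
num2 = joint(True, True, True)
den2 = sum(joint(True, r, True) for r in (True, False))
p_rain_given_wet_sprinkler = num2 / den2

print(p_rain_given_wet, p_rain_given_wet_sprinkler)
```

With these numbers, learning that the sprinklers were on drops the belief in rain substantially, exactly the "competition" between causes described above.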
Conditional independence: “Bayes
Ball” analogy
• Diverging or parallel (serial) connections: the ball
passes through when the middle node is unobserved;
it does not pass through when it is observed.
• Converging connections: the ball does not pass
through when the child is unobserved; it passes
through when the child is observed.
Ross Shachter, 1995
Inference
• Given some set of evidence, what is the
most likely cause?
• BN allows us to ask any question that can
be posed about the probabilities of any
combination of variables
Inference
Since the BN provides the joint distribution, we can answer
questions by computing any probability over the set of
variables. For example, what’s the probability that p53 is
activated, given that ATM is off and the cell is arrested
before S-phase?
We need to marginalize (sum out) the variables we are not
interested in.
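As a sketch of this kind of query, assume a hypothetical three-node chain ATM → p53 → Arrest with made-up CPTs; the query is answered by summing the joint over the assignments consistent with the evidence and normalizing:

```python
# Hypothetical chain ATM -> p53 -> Arrest; all CPT numbers are made up.
P_ATM = {1: 0.1, 0: 0.9}                                          # P(ATM)
P_P53 = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.05, (0, 0): 0.95}    # (p53, atm)
P_ARR = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}      # (arrest, p53)

def joint(atm, p53, arrest):
    return P_ATM[atm] * P_P53[(p53, atm)] * P_ARR[(arrest, p53)]

# Query: P(p53 = 1 | ATM = 0, Arrest = 1).
# Sum the joint over everything consistent with the evidence, normalize.
num = joint(0, 1, 1)
den = sum(joint(0, p53, 1) for p53 in (0, 1))
print(num / den)  # ≈ 0.296 with these made-up numbers
```

Here only p53 needs to be summed out; with more unobserved variables the sum runs over all of their joint assignments, which is what variable elimination organizes efficiently.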
Variable Elimination
• Inference amounts to distributing sums over
products
• Message passing in the BN
• Generalization of forward-backward algorithm
Pearl, 1988
Variable Elimination Procedure
• The initial potentials are the CPTs in BN.
• Repeat until only query variable(s) remain:
– Choose another variable to eliminate.
– Multiply all potentials that contain the variable.
– If no evidence for the variable then sum the variable out
and replace original potential by the new result.
– Else, remove variable based on evidence.
• Normalize remaining potentials to get the final
distribution over the query variable.
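The procedure above can be sketched with factors stored as dictionaries; the chain A → B → C and its CPT numbers are illustrative assumptions:

```python
from itertools import product

# Minimal variable-elimination sketch over binary variables.
# A factor is (variables_tuple, {assignment_tuple: value}).

def multiply(f, g):
    fv, ft = f
    gv, gt = g
    vs = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for asg in product((0, 1), repeat=len(vs)):
        env = dict(zip(vs, asg))
        table[asg] = ft[tuple(env[v] for v in fv)] * gt[tuple(env[v] for v in gv)]
    return vs, table

def sum_out(f, var):
    fv, ft = f
    vs = tuple(v for v in fv if v != var)
    table = {}
    for asg, val in ft.items():
        key = tuple(a for v, a in zip(fv, asg) if v != var)
        table[key] = table.get(key, 0.0) + val
    return vs, table

# Initial potentials are the CPTs of a chain A -> B -> C (numbers made up).
fA = (("A",), {(1,): 0.3, (0,): 0.7})
fB = (("A", "B"), {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.9})
fC = (("B", "C"), {(1, 1): 0.9, (1, 0): 0.1, (0, 1): 0.2, (0, 0): 0.8})

# Query P(C): eliminate A, then B, by multiplying and summing out.
f = sum_out(multiply(fA, fB), "A")   # factor over B
f = sum_out(multiply(f, fC), "B")    # factor over C
print(f[1])                          # {(1,): P(C=1), (0,): P(C=0)}
```

Each elimination step is exactly "multiply all potentials that contain the variable, then sum it out"; evidence would be handled by restricting the tables to the observed value instead of summing.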
Learning Networks from Data
• Have a dataset X
• What is the best model that explains X?
• We define a score, score(G,X) that scores
networks by computing how likely X is
given the network
• Then search for the network that gives us
the best score
Bayesian Score

log P(G | X) ≈ ℓ(G : D) − (log M / 2) · dim(G) + O(1)
            = M · Σ_i [ I(X_i ; Pa_i^G) − H(X_i) ] − (log M / 2) · dim(G) + O(1)

• First term: fit of the dependencies in the empirical distribution
• Second term: complexity penalty
• As M (amount of data) grows,
– Increasing pressure to fit dependencies in distribution
– Complexity term avoids fitting noise
• Asymptotic equivalence to MDL score
• Bayesian score is consistent
– Observed data eventually overrides prior
Friedman & Koller ‘03
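A BIC-style approximation of this score can be sketched as follows; the per-family decomposition mirrors the formula above, and the toy dataset and structures are made up for illustration:

```python
import math
from collections import Counter

# BIC-style structure score sketch for a discrete BN (an assumption-laden
# simplification of the Bayesian score): max log-likelihood of each family
# minus the (log M / 2) * dim(G) complexity penalty.

def family_loglik(data, child, parents):
    """Max log-likelihood contribution of one family, via MLE counts."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    par = Counter(tuple(row[p] for p in parents) for row in data)
    return sum(n * math.log(n / par[pa]) for (pa, _), n in joint.items())

def dim(structure, arity):
    """Number of free parameters of the network."""
    d = 0
    for child, parents in structure.items():
        q = 1
        for p in parents:
            q *= arity[p]
        d += (arity[child] - 1) * q
    return d

def bic_score(data, structure, arity):
    m = len(data)
    ll = sum(family_loglik(data, c, ps) for c, ps in structure.items())
    return ll - 0.5 * math.log(m) * dim(structure, arity)

# Toy data where Y tracks X 80% of the time: the dependent structure should
# win despite its larger complexity penalty.
data = [{"X": 0, "Y": 0}] * 40 + [{"X": 1, "Y": 1}] * 40 \
     + [{"X": 0, "Y": 1}] * 10 + [{"X": 1, "Y": 0}] * 10
arity = {"X": 2, "Y": 2}
dep = {"X": (), "Y": ("X",)}
ind = {"X": (), "Y": ()}
print(bic_score(data, dep, arity), bic_score(data, ind, arity))
```

As M grows the likelihood term (which rewards real dependencies) dominates the penalty, which is the consistency behavior described above.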
Learning Parameters: Summary
• Estimation relies on sufficient statistics
– For multinomials: counts N(x_i, pa_i)
• Parameter estimation:
– MLE: θ̂_{x_i | pa_i} = N(x_i, pa_i) / N(pa_i)
– Bayesian (Dirichlet): θ̃_{x_i | pa_i} = [α(x_i, pa_i) + N(x_i, pa_i)] / [α(pa_i) + N(pa_i)]
• Both are asymptotically equivalent and consistent
• Both can be implemented in an on-line manner by
accumulating sufficient statistics
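Both estimators can be sketched as an on-line accumulator of the sufficient statistics N(x, pa) and N(pa); the α pseudocount plays the role of the Dirichlet prior (the class name and the example data are assumptions):

```python
from collections import Counter

# On-line MLE vs. Dirichlet (pseudocount) estimates for P(X | Pa),
# accumulated from sufficient statistics as the slide describes.
class CPDEstimator:
    def __init__(self, values, alpha=1.0):
        self.values = values          # possible values of X
        self.alpha = alpha            # Dirichlet pseudocount per value
        self.n_xpa = Counter()        # N(x, pa)
        self.n_pa = Counter()         # N(pa)

    def observe(self, x, pa):         # accumulate sufficient statistics
        self.n_xpa[(x, pa)] += 1
        self.n_pa[pa] += 1

    def mle(self, x, pa):
        return self.n_xpa[(x, pa)] / self.n_pa[pa]

    def bayes(self, x, pa):
        num = self.alpha + self.n_xpa[(x, pa)]
        den = self.alpha * len(self.values) + self.n_pa[pa]
        return num / den

est = CPDEstimator(values=(0, 1), alpha=1.0)
for x, pa in [(1, 0), (1, 0), (0, 0), (1, 1)]:
    est.observe(x, pa)
print(est.mle(1, 0), est.bayes(1, 0))  # 2/3 vs. 3/5
```

The two estimates converge to the same value as the counts grow, which is the asymptotic equivalence noted above.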
Incomplete Data
Data is often incomplete
• Some variables of interest are not assigned values
This phenomenon happens when we have
• Missing values:
– Some variables unobserved in some instances
• Hidden variables:
– Some variables are never observed
– We might not even know they exist
Friedman & Koller ‘03
Hidden (Latent) Variables
Why should we care about unobserved
variables?
[Figure: two networks over X1–X3 and Y1–Y3 — with a hidden variable H mediating the dependencies: 17 parameters; the equivalent model without H: 59 parameters]
Friedman & Koller ‘03
Expectation Maximization (EM)
• A general-purpose method for learning from incomplete
data
Intuition:
• If we had true counts, we could estimate parameters
• But with missing values, counts are unknown
• We “complete” counts using probabilistic inference based
on current parameter assignment
• We use completed counts as if real to re-estimate
parameters
Friedman & Koller ‘03
Expectation Maximization (EM)
Current model: P(Y=H | X=H, Z=T, θ) = 0.3 and P(Y=H | X=T, θ) = 0.4
[Figure: a data table over X, Y, Z in which some Y values are missing (“?”)]
Expected counts N(X, Y):
  X=H, Y=H: 1.3
  X=T, Y=H: 0.4
  X=H, Y=T: 1.7
  X=T, Y=T: 1.6
Friedman & Koller ‘03
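The "completed counts" idea can be sketched directly; `posterior_y` stands in for full BN inference under the current parameters, and the dataset here is a made-up four-row toy, not the slide's table:

```python
from collections import Counter

# E-step sketch: "complete" the counts N(X, Y) when some Y values are missing.
# posterior_y() stands in for BN inference under the current parameters;
# its values are hypothetical.
def posterior_y(x):                       # P(Y = 'H' | X = x, current params)
    return {"H": 0.3, "T": 0.4}[x]

data = [("H", "H"), ("T", "?"), ("H", "?"), ("T", "T")]  # (X, Y), '?' missing

expected = Counter()
for x, y in data:
    if y != "?":
        expected[(x, y)] += 1.0           # observed: full count
    else:
        p = posterior_y(x)
        expected[(x, "H")] += p           # missing: fractional counts
        expected[(x, "T")] += 1.0 - p

print(dict(expected))
```

The fractional counts sum to the number of data rows, and the M-step then treats them exactly like real counts when re-estimating parameters.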
Expectation Maximization (EM)
[Figure: EM iteration. Start from an initial network (G, θ0) over X1–X3, H, Y1–Y3. E-step: compute expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H) from the training data under the current parameters. M-step: reparameterize to obtain the updated network (G, θ1). Reiterate.]
Friedman & Koller ‘03
Expectation Maximization (EM)
Computational bottleneck:
• Computation of expected counts in E-Step
– Need to compute posterior for each unobserved
variable in each instance of training set
– All posteriors for an instance can be derived
from one pass of standard BN inference
Friedman & Koller ‘03
Why Struggle for Accurate Structure?
[Figure: true network — Earthquake → Alarm Set ← Burglary, Alarm Set → Sound]
Missing an arc:
• Cannot be compensated for by fitting parameters
• Wrong assumptions about domain structure
Adding an arc:
• Increases the number of parameters to be estimated
• Wrong assumptions about domain structure
Learning BN structure
• Treat like a search problem
• Bayesian Score allows us to measure how
well a BN fits the data
• Search for a BN that fits the data the best
• Start with an initial network B0 = <G0, Θ0>
• Define successor operations on current
network
Structural EM
Recall that with complete data we had
– Decomposition ⇒ efficient search
Idea:
• Instead of optimizing the real score…
• Find a decomposable alternative score
• Such that maximizing the new score
⇒ improvement in the real score
Structural EM
Idea:
• Use current model to help evaluate new structures
Outline:
• Perform search in (Structure, Parameters) space
• At each iteration, use current model for finding
either:
– Better scoring parameters: “parametric” EM step
or
– Better scoring structure: “structural” EM step
[Figure: Structural EM iteration. From the current network (G, θ) over X1–X3, H, Y1–Y3, the E-step computes expected counts — N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H) — plus the counts needed to evaluate candidate structures, e.g. N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H). These are used to score & parameterize candidate networks; reiterate with the best one.]
Structure Search Bottom Line
• Discrete optimization problem
• In some cases, optimization problem is easy
– Example: learning trees
• In general, NP-Hard
– Need to resort to heuristic search
– In practice, search is relatively fast (~100 vars
in ~2-5 min):
• Decomposability
• Sufficient statistics
– Adding randomness to search is critical
Heuristic Search
• Define a search space:
– search states are possible structures
– operators make small changes to structure
• Traverse space looking for high-scoring structures
• Search techniques:
– Greedy hill-climbing
– Best-first search
– Simulated annealing
– ...
Friedman & Koller ‘03
Local Search
• Start with a given network
– empty network
– best tree
– a random network
• At each iteration
– Evaluate all possible changes
– Apply change based on score
• Stop when no modification improves score
Friedman & Koller ‘03
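The local-search loop above can be sketched as greedy hill-climbing over DAGs; the score function here is a hypothetical stand-in (it rewards matching an assumed target graph), whereas a real implementation would plug in the decomposable Bayesian score:

```python
# Greedy hill-climbing sketch for structure search (add/delete/reverse edge).

def is_acyclic(nodes, edges):
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
    state = {}  # 1 = on current DFS path, 2 = done

    def dfs(n):
        if state.get(n) == 1:
            return False            # back edge -> cycle
        if state.get(n) == 2:
            return True
        state[n] = 1
        ok = all(dfs(c) for c in children[n])
        state[n] = 2
        return ok

    return all(dfs(n) for n in nodes)

def neighbors(nodes, edges):
    """All structures one local operation away."""
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            if (u, v) in edges:
                yield edges - {(u, v)}                  # delete edge
                yield (edges - {(u, v)}) | {(v, u)}     # reverse edge
            elif (v, u) not in edges:
                yield edges | {(u, v)}                  # add edge

def hill_climb(nodes, score, start=frozenset()):
    current, best = frozenset(start), score(frozenset(start))
    while True:
        cands = [frozenset(c) for c in neighbors(nodes, current)
                 if is_acyclic(nodes, c)]
        if not cands:
            return current, best
        top = max(cands, key=score)
        if score(top) <= best:
            return current, best    # no modification improves the score
        current, best = top, score(top)

# Toy usage: the stand-in score rewards matching a hypothetical target DAG.
nodes = ("A", "B", "C")
target = frozenset({("A", "B"), ("B", "C")})
score = lambda edges: -len(frozenset(edges) ^ target)
structure, s = hill_climb(nodes, score)
print(structure, s)
```

With a decomposable score, each candidate would be evaluated by re-scoring only the families its operation changed, which is what makes this loop fast in practice.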
Heuristic Search
• Typical operations: add an edge, delete an edge, reverse an edge
[Figure: a network over S, C, E, D modified by each local operation]
• To update the score after a local change, only re-score the families that changed:
Δscore = S({C, E} → D) − S({E} → D)
Friedman & Koller ‘03
Naive Approach to Structural EM
• Perform EM for each candidate graph G1, G2, …, Gn
[Figure: parametric optimization (EM) over the parameter space of each candidate, each converging to a local maximum]
• Computationally expensive:
– Parameter optimization via EM is non-trivial
– Need to perform EM for all candidate structures
– Time is spent even on poor candidates
• In practice, only a few candidates can be considered
Friedman & Koller ‘03
Bayesian Approach
• Posterior distribution over structures
• Estimate the probability of features:
– Edge X → Y
– Path X → … → Y
– …

P(f | D) = Σ_G f(G) · P(G | D)

where f(G) is the indicator function for feature f (e.g., edge X → Y) and P(G | D) is the Bayesian score for G.
Discovering Structure
[Figure: a posterior P(G | D) concentrated on a single network over E, R, B, A, C]
• Current practice: model selection
– Pick a single high-scoring model
– Use that model to infer domain structure
Discovering Structure
[Figure: a flat posterior P(G | D) spread over several distinct networks over E, R, B, A, C]
Problem:
– Small sample size ⇒ many high-scoring models
– Answer based on one model is often useless
– Want features common to many models
Application: Gene expression
Input:
• Measurement of gene expression under
different conditions
– Thousands of genes
– Hundreds of experiments
Output:
• Models of gene interaction
– Uncover pathways
Friedman & Koller ‘03
“Mating response” Substructure
[Figure: sub-network including KAR4, SST2, TEC1, NDJ1, KSS1, YLR343W, YLR334C, MFA1, STE6, FUS1, PRM1, AGA1, AGA2, TOM6, FIG1, FUS3, YEL059W]
• Automatically constructed sub-network of high-confidence edges
• Almost exact reconstruction of the yeast mating pathway
N. Friedman et al. 2000
Bayesian Network Limitation:
Models a pdf over instances
• Bayesian nets use propositional representation
• Real world has objects, related to each other
• Can we take advantage of general properties of objects
of the same class?
[Figure: BN over Intelligence, Difficulty, Grade]
Bayesian Networks: Problem
• Bayesian nets use propositional representation
• Real world has objects, related to each other
These “instances” are not independent!
[Figure: ground network with instances Intell_JDoe, Diffic_CS101, Grade_JDoe_CS101 (= A); Intell_FGump, Diffic_Geo101, Grade_FGump_Geo101; Intell_FGump, Diffic_CS101, Grade_FGump_CS101 (= C)]
Relational Schema
• Specifies types of objects in domain, attributes of each
type of object, & types of links between objects
[Figure: relational schema. Classes: Professor (attribute: Teaching-Ability), Student (attribute: Intelligence), Course (attribute: Difficulty), Registration (attributes: Grade, Satisfaction). Links: Teach, Take, In]
Modeling real-world relationships
• Bayesian nets use propositional representation
• Biological system has objects, related to each
other
from E. Segal’s PhD Thesis 2004
Probabilistic Relational Models
(PRMs)
• A skeleton σ defining classes & their
relations
• Random variables are the attributes of the
objects
Probabilistic Relational Models
• Dependencies exist at the class level
from E. Segal’s PhD Thesis 2004
Converting a PRM into a BN
• We can “unroll” (or instantiate) a PRM into its
underlying BN.
• E.g.: 2 genes & 3 experiments:
from E. Segal’s PhD Thesis 2004
A PRM has important differences from a
BN
• Dependencies are defined at the class level
– Thus can be reused for any objects of the class
• Uses the relational structure to allow
attributes of an object to depend on
attributes of related objects
• Parameters are shared across instances of
the same class! (Thus, more data for each
parameter)
Example of joint learning
Segal et al. 2003
Joint Learning: Motifs &
Modules
Segal et al. 2003
Joint Model of Segal et al. 2003
Segal et al. 2003
PRM for the joint model
Segal et al. 2003
Joint Model
• Motif model
• Regulation model
• Expression model
Segal et al. 2003
“Ground BN”
Segal et al. 2003
TFs associated to modules
Segal et al. 2003
PRM for genome-wide location data
Segal et al. 2003