Learning Bayesian Networks from Data
Nir Friedman, Hebrew U.
Daphne Koller, Stanford

Overview
Introduction
Parameter Estimation
Model Selection
Structure Discovery
Incomplete Data
Learning from Structured Data
Bayesian Networks
Compact representation of probability distributions via conditional independence

Qualitative part: directed acyclic graph (DAG)
Nodes - random variables
Edges - direct influence

Quantitative part: set of conditional probability distributions
Example: the family of Alarm, P(A | E, B)

  E   B    P(A | E, B)
  e   b    0.9    0.1
  e   ¬b   0.2    0.8
  ¬e  b    0.9    0.1
  ¬e  ¬b   0.01   0.99

Together: they define a unique distribution in a factored form
  P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)

Example: "ICU Alarm" network
Domain: Monitoring Intensive-Care Patients
37 variables
509 parameters ...instead of 2^54
[Figure: the ICU Alarm network, with nodes such as MINVOLSET, PULMEMBOLUS, PAP, HYPOVOLEMIA, LVEDVOLUME, CVP, SAO2, CATECHOL, HR, BP, and others]

Inference
Posterior probabilities: probability of any event given any evidence
Most likely explanation: scenario that explains the evidence
Rational decision making: maximize expected utility, value of information
Effect of intervention
[Figure: Earthquake and Burglary cause Alarm, which causes Call; Earthquake also causes Radio]

Why learning?
Knowledge acquisition bottleneck:
Knowledge acquisition is an expensive process
Often we don't have an expert
Data is cheap:
Amount of available information growing rapidly
Learning allows us to construct models from raw data
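As a concrete illustration of the factored form above, here is a minimal sketch (mine, not from the tutorial) that evaluates the joint for the Earthquake/Burglary/Alarm/Radio/Call example. Only the P(A | E, B) numbers come from the slide; the other CPT entries are made-up placeholders.

```python
# Factored joint P(B,E,A,R,C) = P(B) P(E) P(A|B,E) P(R|E) P(C|A).
from itertools import product

P_B = {True: 0.01, False: 0.99}                   # hypothetical P(B)
P_E = {True: 0.02, False: 0.98}                   # hypothetical P(E)
P_A = {(True, True): 0.9, (True, False): 0.2,     # P(A=1 | E, B) from the slide
       (False, True): 0.9, (False, False): 0.01}
P_R = {True: 0.8, False: 0.1}                     # hypothetical P(R=1 | E)
P_C = {True: 0.7, False: 0.05}                    # hypothetical P(C=1 | A)

def joint(b, e, a, r, c):
    p_a = P_A[(e, b)] if a else 1 - P_A[(e, b)]
    p_r = P_R[e] if r else 1 - P_R[e]
    p_c = P_C[a] if c else 1 - P_C[a]
    return P_B[b] * P_E[e] * p_a * p_r * p_c

# Sanity check: the factored joint sums to 1 over all 32 assignments.
print(sum(joint(*v) for v in product([True, False], repeat=5)))
```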
Why Learn Bayesian Networks?
Conditional independencies & graphical language capture the structure of many real-world distributions
Graph structure provides much insight into the domain
Allows "knowledge discovery"
Learned model can be used for many tasks
Supports all the features of probabilistic learning
Model selection criteria
Dealing with missing data & hidden variables

Learning Bayesian Networks
Data + prior information go into the learner, which outputs a Bayesian network (structure and CPTs) over the domain variables, e.g.

  E   B    P(A | E, B)
  e   b    0.9    0.1
  e   ¬b   0.7    0.3
  ¬e  b    0.8    0.2
  ¬e  ¬b   0.99   0.01
Known Structure, Complete Data
Network structure is specified
Inducer needs to estimate parameters
Data does not contain missing values
[Figure: complete <E, B, A> records such as <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, … and the fixed structure E, B -> A with unknown CPT entries go into the learner, which fills in the CPT P(A | E, B)]

Unknown Structure, Complete Data
Network structure is not specified
Inducer needs to select arcs & estimate parameters
Data does not contain missing values
[Figure: the same complete data goes into the learner, which now outputs both the structure E, B -> A and its CPT]
Known Structure, Incomplete Data
Network structure is specified
Data contains missing values, e.g. <Y,?,Y>, <N,Y,?>, <?,Y,Y>
Need to consider assignments to missing values
[Figure: incomplete <E, B, A> records and the fixed structure go into the learner, which estimates the CPT P(A | E, B)]

Unknown Structure, Incomplete Data
Network structure is not specified
Data contains missing values
Need to consider assignments to missing values
[Figure: incomplete <E, B, A> records go into the learner, which outputs both the structure E, B -> A and its CPT]
Overview
Introduction
Parameter Estimation
Likelihood function
Bayesian estimation
Model Selection
Structure Discovery
Incomplete Data
Learning from Structured Data

Learning Parameters
Training data has the form (one row per instance, for the network E, B -> A -> C):

  D = ( E[1]  B[1]  A[1]  C[1]
         ·     ·     ·     ·
         ·     ·     ·     ·
        E[M]  B[M]  A[M]  C[M] )
Likelihood Function
Assume i.i.d. samples; the likelihood function is
  L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)

By definition of the network (E, B -> A -> C), we get
  L(Θ : D) = ∏_m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | B[m], E[m] : Θ) P(C[m] | A[m] : Θ)

Rewriting terms, we get
  L(Θ : D) = [∏_m P(E[m] : Θ)] · [∏_m P(B[m] : Θ)] · [∏_m P(A[m] | B[m], E[m] : Θ)] · [∏_m P(C[m] | A[m] : Θ)]

General Bayesian Networks
Generalizing for any Bayesian network:
  L(Θ : D) = ∏_m P(x1[m], …, xn[m] : Θ)
           = ∏_i ∏_m P(xi[m] | Pai[m] : Θi)
           = ∏_i Li(Θi : D)
Decomposition ⇒ independent estimation problems
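To make the decomposition concrete, here is a minimal sketch (mine, not from the slides) in which the complete-data log-likelihood is computed as a sum of independent per-family terms. The data and CPT values are made up.

```python
import math

# Hypothetical complete data over binary E, B, A (1 = yes, 0 = no).
data = [dict(E=1, B=0, A=0), dict(E=1, B=0, A=1),
        dict(E=0, B=0, A=1), dict(E=0, B=1, A=1)]

# Structure E -> A <- B, with one CPT per family (values are made up).
theta_E = 0.5                                     # P(E=1)
theta_B = 0.25                                    # P(B=1)
theta_A = {(1, 0): 0.7, (1, 1): 0.9,              # P(A=1 | E, B)
           (0, 0): 0.1, (0, 1): 0.8}

def bernoulli(p, x):
    return p if x == 1 else 1.0 - p

def family_loglik(child, p_one):
    """log L_i(Θ_i : D) for one family; p_one(row) gives P(child=1 | parents)."""
    return sum(math.log(bernoulli(p_one(row), row[child])) for row in data)

# The full log-likelihood is the sum of the independent family terms.
loglik = (family_loglik("E", lambda row: theta_E)
          + family_loglik("B", lambda row: theta_B)
          + family_loglik("A", lambda row: theta_A[(row["E"], row["B"])]))
print(loglik)   # log of ∏_m P(E[m]) P(B[m]) P(A[m] | E[m], B[m])
```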
Likelihood Function: Multinomials
  L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)
The likelihood for the sequence H, T, T, H, H is
  L(θ : D) = θ · (1−θ) · (1−θ) · θ · θ
[Figure: plot of L(θ : D) as a function of θ on [0, 1]]
General case:
  L(Θ : D) = ∏_{k=1..K} θk^Nk
where Nk is the count of the kth outcome in D and θk is the probability of the kth outcome.

Bayesian Inference
Represent uncertainty about parameters using a probability distribution over parameters, data
Learning using Bayes rule:
  P(θ | x[1], …, x[M]) = P(x[1], …, x[M] | θ) P(θ) / P(x[1], …, x[M])
(posterior = likelihood × prior / probability of data)
Bayesian Inference
Represent the Bayesian distribution as a Bayes net
[Figure: θ with children X[1], X[2], …, X[M]]
The values of X are independent given θ
  P(x[m] = H | θ) = θ
  P(θ | x[1], …, x[M]) ∝ P(x[1], …, x[M] | θ) · P(θ)
Bayesian prediction is inference in this network

Example: Binomial Data
Prior: uniform for θ in [0, 1]
⇒ P(θ | D) ∝ the likelihood L(θ : D)
Observed data: (NH, NT) = (4, 1)
MLE for P(X = H) is 4/5 = 0.8
Bayesian prediction is
  P(x[M+1] = H | D) = ∫ θ · P(θ | D) dθ = 5/7 = 0.7142…
[Figure: the posterior over θ on [0, 1]]
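A small sketch (mine, not the authors') of the binomial example above: with a uniform Beta(1, 1) prior and data (NH, NT) = (4, 1), the MLE and the Bayesian predictive probability of heads differ.

```python
from math import isclose

N_H, N_T = 4, 1
M = N_H + N_T

mle = N_H / M                                        # 4/5 = 0.8

# Uniform prior = Beta(1, 1); the posterior is Beta(1 + N_H, 1 + N_T), whose
# mean is the Bayesian prediction P(x[M+1] = H | D).
alpha_H, alpha_T = 1.0, 1.0
bayes = (alpha_H + N_H) / (alpha_H + alpha_T + M)    # 5/7 ≈ 0.714

print(mle, bayes)
assert isclose(bayes, 5 / 7)
```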
Dirichlet Priors
Recall that the likelihood function is
  L(Θ : D) = ∏_{k=1..K} θk^Nk
A Dirichlet prior with hyperparameters α1, …, αK is
  P(Θ) ∝ ∏_{k=1..K} θk^(αk − 1)
⇒ the posterior has the same form, with hyperparameters α1 + N1, …, αK + NK:
  P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ ∏_{k=1..K} θk^(αk − 1) · ∏_{k=1..K} θk^Nk = ∏_{k=1..K} θk^(αk + Nk − 1)

Dirichlet Priors - Example
Dirichlet(αheads, αtails)
[Figure: P(θheads) as a function of θheads for Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5)]
Dirichlet Priors (cont.)
If P(Θ) is Dirichlet with hyperparameters α1, …, αK, then
  P(X[1] = k) = ∫ θk · P(Θ) dΘ = αk / Σ_l αl
Since the posterior is also Dirichlet, we get
  P(X[M+1] = k | D) = ∫ θk · P(Θ | D) dΘ = (αk + Nk) / Σ_l (αl + Nl)

Bayesian Nets & Bayesian Prediction
[Figure: plate notation; parameter nodes θX and θY|X with observed data X[1], Y[1], …, X[M], Y[M] and the query X[M+1], Y[M+1]]
Priors for each parameter group are independent
Data instances are independent given the unknown parameters
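A small sketch (mine) of the Dirichlet update and predictive rule above: with prior hyperparameters αk and counts Nk, the posterior is Dirichlet(α1 + N1, …, αK + NK) and P(X[M+1] = k | D) = (αk + Nk) / Σ_l (αl + Nl).

```python
def dirichlet_predictive(alpha, counts):
    """Posterior predictive probabilities for a multinomial with a Dirichlet prior."""
    posterior = [a + n for a, n in zip(alpha, counts)]
    total = sum(posterior)
    return [p / total for p in posterior]

# Example: a 3-valued variable, uniform prior α = (1, 1, 1), counts (4, 1, 0).
print(dirichlet_predictive([1, 1, 1], [4, 1, 0]))   # -> [0.625, 0.25, 0.125]
```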
Bayesian Nets & Bayesian Prediction
[Figure: the same plate model with observed data X[1], Y[1], …, X[M], Y[M]]
We can also "read" from the network:
Complete data ⇒ posteriors on parameters are independent
Can compute the posterior over parameters separately!

Learning Parameters: Summary
Estimation relies on sufficient statistics
For multinomials: counts N(xi, pai)
MLE:
  θ̂xi|pai = N(xi, pai) / N(pai)
Bayesian (Dirichlet):
  θ̃xi|pai = ( α(xi, pai) + N(xi, pai) ) / ( α(pai) + N(pai) )
Both are asymptotically equivalent and consistent
Both can be implemented in an on-line manner by accumulating sufficient statistics

Learning Parameters: Case Study
[Figure: KL divergence to the true distribution vs. number of instances, for instances sampled from the ICU Alarm network; curves for MLE and for Bayesian estimation with prior strength M' = 5, 20, 50]

Overview
Introduction
Parameter Estimation
Model Selection
Scoring function
Structure search
Structure Discovery
Incomplete Data
Learning from Structured Data
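A minimal sketch (my own) of the two estimators summarized above for one CPT entry P(xi | pai): the MLE is a ratio of counts, and the Bayesian (Dirichlet) estimate adds the hyperparameters as "imaginary counts".

```python
def mle(N_xi_pai, N_pai):
    return N_xi_pai / N_pai

def bayes(N_xi_pai, N_pai, alpha_xi_pai, alpha_pai):
    return (alpha_xi_pai + N_xi_pai) / (alpha_pai + N_pai)

# Example: the parent configuration pai was seen 10 times, 7 of them with xi.
# With a prior of strength M' = 4 spread uniformly over the 2 values of Xi:
print(mle(7, 10))            # 0.70
print(bayes(7, 10, 2, 4))    # 9/14 ≈ 0.64, pulled toward the uniform prior
```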
Why Struggle for Accurate Structure?
[Figure: the true network Earthquake, Burglary -> Alarm Set -> Sound, next to a version with a missing arc and a version with an added arc]
Missing an arc:
Cannot be compensated for by fitting parameters
Wrong assumptions about domain structure
Adding an arc:
Increases the number of parameters to be estimated
Wrong assumptions about domain structure

Score-based Learning
Define a scoring function that evaluates how well a structure matches the data, e.g. the data E, B, A = <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …
Search for a structure that maximizes the score
Likelihood Score for Structure
  l(G : D) = log L(G : D) = M Σ_i ( I(Xi ; PaiG) − H(Xi) )
where I(Xi ; PaiG) is the mutual information between Xi and its parents and H(Xi) is the entropy of Xi.
Larger dependence of Xi on Pai ⇒ higher score
Adding arcs always helps: I(X ; Y) ≤ I(X ; {Y, Z})
Max score attained by fully connected network
Overfitting: a bad idea…

Bayesian Score
Likelihood score:
  L(G : D) = P(D | G, θ̂G)   (maximum-likelihood parameters for G)
Bayesian approach: deal with uncertainty by assigning probability to all possibilities
  P(G | D) = P(D | G) P(G) / P(D)
Marginal likelihood:
  P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ   (likelihood × prior over parameters)

Marginal Likelihood: Multinomials
Fortunately, in many cases the integral has a closed form:
If P(Θ) is Dirichlet with hyperparameters α1, …, αK and D is a dataset with sufficient statistics N1, …, NK, then
  P(D) = [ Γ(Σ_l αl) / Γ(Σ_l (αl + Nl)) ] · ∏_l [ Γ(αl + Nl) / Γ(αl) ]
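A small sketch (mine, not from the tutorial) of the closed-form marginal likelihood above, computed in log space with math.lgamma for numerical stability.

```python
from math import lgamma, isclose, log

def log_marginal_likelihood(alpha, counts):
    """log P(D) for a multinomial with Dirichlet(alpha) prior and counts N_k."""
    a0, n0 = sum(alpha), sum(counts)
    result = lgamma(a0) - lgamma(a0 + n0)
    for a, n in zip(alpha, counts):
        result += lgamma(a + n) - lgamma(a)
    return result

# Coin example from earlier: uniform Beta(1, 1) prior and a sequence with
# (N_H, N_T) = (4, 1); the marginal likelihood of that sequence is 4!·1!/6! = 1/30.
assert isclose(log_marginal_likelihood([1, 1], [4, 1]), log(1 / 30))
```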
Marginal Likelihood: Bayesian Networks
Network structure determines the form of the marginal likelihood.
Data (7 instances of two binary variables):

  instance:  1  2  3  4  5  6  7
  X:         H  T  T  H  T  H  H
  Y:         H  T  H  H  T  T  H

Network 1: X and Y are independent (no edge)
⇒ two Dirichlet marginal likelihoods:
  P(X column): integral over θX
  P(Y column): integral over θY
Marginal Likelihood: Bayesian Networks
Same data, Network 2: X -> Y
⇒ three Dirichlet marginal likelihoods:
  P(X column): integral over θX
  P(Y values where X = H): integral over θY|X=H
  P(Y values where X = T): integral over θY|X=T

Marginal Likelihood for Networks
The marginal likelihood has the form:
  P(D | G) = ∏_i ∏_{paiG} [ Γ(α(paiG)) / Γ(α(paiG) + N(paiG)) ] · ∏_{xi} [ Γ(α(xi, paiG) + N(xi, paiG)) / Γ(α(xi, paiG)) ]
i.e. one Dirichlet marginal likelihood for each multinomial P(Xi | pai).
N(…) are counts from the data
α(…) are hyperparameters for each family, given G
Bayesian Score: Asymptotic Behavior
  log P(D | G) = M Σ_i ( I(Xi ; Pai) − H(Xi) ) − (log M / 2) · dim(G) + O(1)
               = l(G : D) − (log M / 2) · dim(G) + O(1)
The first term fits dependencies in the empirical distribution; the second is a complexity penalty.
As M (amount of data) grows:
Increasing pressure to fit dependencies in the distribution
Complexity term avoids fitting noise
Asymptotic equivalence to the MDL score
Bayesian score is consistent: observed data eventually overrides the prior

Structure Search as Optimization
Input:
Training data
Scoring function
Set of possible structures
Output:
A network that maximizes the score
Key computational property - decomposability:
  score(G) = Σ score( family of X in G )
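A minimal sketch (my own) of a decomposable BIC-style score, score(G) = Σ_i FamScore(Xi ; Pai): each family is scored from its own counts as (maximized log-likelihood) − (log M / 2) · (number of free parameters). The data layout and function names are my own choices.

```python
import math
from collections import Counter

def family_score(data, child, parents, arity):
    """BIC score of one family. `data` is a list of dicts; `arity[v]` = #values of v."""
    M = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_totals = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(n * math.log(n / parent_totals[pa]) for (pa, _), n in joint.items())
    n_params = (arity[child] - 1) * math.prod(arity[p] for p in parents)
    return loglik - 0.5 * math.log(M) * n_params

def bic_score(data, structure, arity):
    """Total score of a structure given as {child: tuple_of_parent_names}."""
    return sum(family_score(data, x, pa, arity) for x, pa in structure.items())

# Usage (with hypothetical binary data):
#   data = [dict(E=1, B=0, A=0), dict(E=1, B=0, A=1), ...]
#   arity = dict(E=2, B=2, A=2)
#   bic_score(data, {"E": (), "B": (), "A": ("E", "B")}, arity)
```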
Tree-Structured Networks
Trees: at most one parent per variable
Why trees?
Elegant math ⇒ we can solve the optimization problem
Sparse parameterization ⇒ avoid overfitting
[Figure: the ICU Alarm variables arranged as a tree]

Learning Trees
Let p(i) denote the parent of Xi. We can write the Bayesian score as
  Score(G : D) = Σ_i Score(Xi : Pai)
               = Σ_i ( Score(Xi : Xp(i)) − Score(Xi) ) + Σ_i Score(Xi)
The first sum is the improvement over the "empty" network; the second is the score of the "empty" network.
Score = sum of edge scores + constant
Learning Trees (cont.)
Set w(j→i) = Score(Xj → Xi) − Score(Xi)
Find the tree (or forest) with maximal weight
Standard max spanning tree algorithm — O(n² log n)
Theorem: this procedure finds the tree with max score

Beyond Trees
When we consider more complex networks, the problem is not as easy
Suppose we allow at most two parents per node
A greedy algorithm is no longer guaranteed to find the optimal network
In fact, no efficient algorithm exists
Theorem: finding the maximal scoring structure with at most k parents per node is NP-hard for k > 1
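A small sketch (mine) of the tree-learning step above: given edge weights w(j→i) = Score(Xj → Xi) − Score(Xi), treated here as symmetric for simplicity, find a maximum-weight spanning tree with Kruskal's algorithm.

```python
def max_spanning_tree(n, weights):
    """weights: dict mapping frozenset({i, j}) -> w. Returns a list of tree edges."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    tree = []
    for edge, _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        i, j = tuple(edge)
        ri, rj = find(i), find(j)
        if ri != rj:                        # adding the edge keeps the graph acyclic
            parent[ri] = rj
            tree.append(tuple(edge))
    return tree

# Hypothetical weights over 4 variables:
w = {frozenset(e): v for e, v in
     [((0, 1), 0.9), ((0, 2), 0.3), ((1, 2), 0.5), ((2, 3), 0.7), ((1, 3), 0.1)]}
print(max_spanning_tree(4, w))   # e.g. edges joining {0,1}, {2,3}, {1,2}
```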
Heuristic Search
Define a search space:
search states are possible structures
operators make small changes to structure
Traverse the space looking for high-scoring structures
Search techniques:
Greedy hill-climbing
Best-first search
Simulated annealing
...

Local Search
Start with a given network:
empty network
best tree
a random network
At each iteration:
Evaluate all possible changes
Apply change based on score
Stop when no modification improves the score

Heuristic Search: Typical Operations
Add an edge (e.g. add C → D), delete an edge (e.g. delete C → E), reverse an edge (e.g. reverse C → E)
To update the score after a local change, only re-score the families that changed, e.g.
  Δscore = S({C, E} → D) − S({E} → D)

Learning in Practice: Alarm Domain
[Figure: KL divergence to the true distribution vs. number of samples, comparing "structure known, fit parameters" with "learn both structure & parameters"]
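A compact sketch (my own, not the authors' code) of greedy hill-climbing with a decomposable score: each move (add or delete one arc y → x) only requires re-scoring the family of x. `fam_score(child, parents)` is assumed given, e.g. a closure over the BIC family score sketched earlier.

```python
def creates_cycle(parents, y, x):
    """Would adding y -> x create a directed cycle? (i.e. is x an ancestor of y?)"""
    stack, seen = [y], set()
    while stack:
        v = stack.pop()
        if v == x:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def greedy_hill_climb(variables, fam_score, max_parents=2):
    parents = {v: set() for v in variables}                 # start: empty network
    score = {v: fam_score(v, parents[v]) for v in variables}
    while True:
        best = None                                          # (delta, x, new_parents)
        for x in variables:
            for y in variables:
                if y == x:
                    continue
                if y in parents[x]:                          # candidate: delete y -> x
                    new = parents[x] - {y}
                elif len(parents[x]) < max_parents and not creates_cycle(parents, y, x):
                    new = parents[x] | {y}                   # candidate: add y -> x
                else:
                    continue
                delta = fam_score(x, new) - score[x]         # only x's family changes
                if best is None or delta > best[0]:
                    best = (delta, x, new)
        if best is None or best[0] <= 0:                     # no move improves the score
            return parents
        _, x, new = best
        parents[x], score[x] = new, fam_score(x, new)
```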
Local Search: Possible Pitfalls
Local search can get stuck in:
Local maxima: all one-edge changes reduce the score
Plateaux: some one-edge changes leave the score unchanged
Standard heuristics can escape both:
Random restarts
TABU search
Simulated annealing

Improved Search: Weight Annealing
Standard annealing process:
Take bad steps with probability ∝ exp(Δscore / t)
Probability increases with temperature
Weight annealing:
Take uphill steps relative to a perturbed score
Perturbation increases with temperature
[Figure: Score(G | D) as a function of G, with the perturbed score landscape overlaid]

Perturbing the Score
Perturb the score by reweighting instances
Each weight is sampled from a distribution with:
Mean = 1
Variance ∝ temperature
Instances are still sampled from the "original" distribution … but the perturbation changes the emphasis
Benefit: allows global moves in the search space

Weight Annealing: ICU Alarm Network
[Figure: cumulative performance of 100 runs of annealed structure search, compared with greedy hill-climbing and with the true structure with learned parameters]
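A tiny sketch (mine; the particular distribution is my own choice, not necessarily the one used in the tutorial) of the instance-reweighting idea above: each training instance gets a random weight with mean 1 and variance proportional to the temperature, and weighted counts then define the perturbed score.

```python
import random

def sample_instance_weights(num_instances, temperature):
    if temperature <= 0:
        return [1.0] * num_instances          # zero temperature: the original score
    shape = 1.0 / temperature                 # Gamma(k, θ) has mean kθ = 1, var kθ² = t
    return [random.gammavariate(shape, temperature) for _ in range(num_instances)]

print(sample_instance_weights(5, temperature=0.5))
```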
Structure Search: Summary
Discrete optimization problem
In some cases, the optimization problem is easy
Example: learning trees
In general, NP-hard
Need to resort to heuristic search
In practice, search is relatively fast (~100 vars in ~2-5 min):
Decomposability
Sufficient statistics
Adding randomness to the search is critical

Overview
Introduction
Parameter Estimation
Model Selection
Structure Discovery
Incomplete Data
Learning from Structured Data
Structure Discovery
Task: discover structural properties
Is there a direct connection between X & Y?
Does X separate two "subsystems"?
Does X causally affect Y?
Example: scientific data mining
Disease properties and symptoms
Interactions between the expression of genes

Discovering Structure
[Figure: posterior P(G | D) concentrated on a single structure over E, B, R, A, C]
Current practice: model selection
Pick a single high-scoring model
Use that model to infer domain structure

Discovering Structure
[Figure: posterior P(G | D) spread over several structures over E, B, R, A, C]
Problem:
Small sample size ⇒ many high-scoring models
Answer based on one model is often useless
Want features common to many models

Bayesian Approach
Posterior distribution over structures
Estimate the probability of features:
Edge X → Y
Path X → … → Y
…
  P(f | D) = Σ_G f(G) P(G | D)
where f(G) is the indicator function for feature f (e.g. X → Y) and P(G | D) is the Bayesian score for G.
MCMC over Networks
Cannot enumerate structures, so sample structures:
  P(f(G) | D) ≈ (1/n) Σ_{i=1..n} f(Gi)
MCMC sampling:
Define a Markov chain over BNs
Run the chain to get samples from the posterior P(G | D)
Possible pitfalls:
Huge (superexponential) number of networks
Time for the chain to converge to the posterior is unknown
Islands of high posterior, connected by low bridges

ICU Alarm BN: No Mixing
500 instances
[Figure: score of the current sample vs. MCMC iteration (up to 600,000), for chains started from the empty network and from the greedy solution]
The runs clearly do not mix
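A bare-bones sketch (mine, not the authors' sampler) of Metropolis-Hastings over structures: propose a single-arc change, accept with probability min(1, exp(Δ log-score)), and average a 0/1 feature indicator over the samples. For simplicity it assumes a symmetric proposal distribution; `propose`, `log_score`, and `feature` are placeholders to be supplied.

```python
import math
import random

def mcmc_feature_probability(init_graph, propose, log_score, feature, n_steps=10000):
    g = init_graph
    current = log_score(g)
    hits = 0
    for _ in range(n_steps):
        g_new = propose(g)                    # e.g. add / delete / reverse one arc
        new = log_score(g_new)
        if random.random() < math.exp(min(0.0, new - current)):
            g, current = g_new, new           # Metropolis acceptance
        hits += feature(g)                    # e.g. 1 if the edge X -> Y is in g
    return hits / n_steps                     # ≈ P(f | D) if the chain mixes
```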
Effects of Non-Mixing
Two MCMC runs over the same 500 instances
[Figure: scatter plots of edge probability estimates from the two runs, for chains started from the true BN and from a random structure]
Probability estimates are highly variable, nonrobust

Fixed Ordering
Suppose that:
We know an ordering of the variables, say X1 > X2 > X3 > X4 > … > Xn
(parents of Xi must be in X1, …, Xi-1)
We limit the number of parents per node to k
⇒ 2^(k·n·log n) networks
Intuition: the order decouples the choice of parents
Choice of Pa(X7) does not restrict choice of Pa(X12)
Upshot: can compute efficiently in closed form:
Likelihood P(D | ≺)
Feature probability P(f | D, ≺)
Our Approach: Sample Orderings
We can write
  P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D)
Sample orderings and approximate
  P(f | D) ≈ (1/n) Σ_{i=1..n} P(f | ≺i, D)
MCMC sampling:
Define a Markov chain over orderings
Run the chain to get samples from the posterior P(≺ | D)

Mixing with MCMC-Orderings
4 runs on ICU-Alarm with 500 instances:
fewer iterations than MCMC over networks
approximately the same amount of computation
[Figure: score of the current sample vs. MCMC iteration (up to 60,000), for chains started from random orderings and from a greedy solution]
The process appears to be mixing!

Mixing of MCMC runs
Two MCMC runs over the same instances
[Figure: scatter plots of edge probability estimates from the two runs, for 500 and for 1000 instances]
Probability estimates are very robust

Application: Gene Expression
Input: measurements of gene expression under different conditions
Thousands of genes
Hundreds of experiments
Output: models of gene interaction
Uncover pathways
Map of Feature Confidence
Yeast data [Hughes et al 2000]
600 genes
300 experiments

"Mating response" Substructure
[Figure: automatically constructed sub-network of high-confidence edges over genes such as KAR4, SST2, TEC1, KSS1, NDJ1, YLR343W, YLR334C, MFA1, STE6, FUS1, PRM1, AGA1, AGA2, TOM6, FIG1, FUS3, YEL059W]
Almost exact reconstruction of the yeast mating pathway
Overview
Introduction
Parameter Estimation
Model Selection
Structure Discovery
Incomplete Data
Parameter estimation
Structure search
Learning from Structured Data

Incomplete Data
Data is often incomplete
Some variables of interest are not assigned values
This phenomenon happens when we have:
Missing values: some variables are unobserved in some instances
Hidden variables: some variables are never observed
We might not even know they exist
Hidden (Latent) Variables
Why should we care about unobserved variables?
[Figure: a model X1, X2, X3 → H → Y1, Y2, Y3 with 17 parameters, versus the corresponding model with H removed, which needs 59 parameters]

Example
P(X) is assumed to be known
Likelihood function of θY|X=T, θY|X=H
[Figure: contour plots of the log likelihood over (θY|X=T, θY|X=H) for different numbers of missing values of X (M = 8): no missing values, 2 missing values, 3 missing values]
In general: the likelihood function has multiple modes
Incomplete Data
In the presence of incomplete data, the likelihood can have multiple maxima
[Figure: L(Θ | D) as a function of Θ, with several local maxima]
Example:
We can rename the values of a hidden variable H
If H has two values, the likelihood has two maxima
In practice, many local maxima

EM: MLE from Incomplete Data
Use the current point to construct a "nice" alternative function
The maximum of the new function scores ≥ the current point
[Figure: the likelihood surface and the lower-bound surrogate built at the current point]
Expectation Maximization (EM)
A general-purpose method for learning from incomplete data
Intuition:
If we had true counts, we could estimate parameters
But with missing values, counts are unknown
We "complete" the counts using probabilistic inference based on the current parameter assignment
We use the completed counts as if they were real to re-estimate the parameters

Expectation Maximization (EM)
Example: data over X, Y, Z with missing values

  X  Y  Z
  H  ?  T
  T  ?  ?
  H  H  ?
  H  T  T
  T  T  H

Current model: P(Y=H | X=H, Z=T, Θ) = 0.3,  P(Y=H | X=T, Θ) = 0.4
Expected counts N(X, Y):

  X  Y    #
  H  H    1.3
  T  H    0.4
  H  T    1.7
  T  T    1.6
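A small sketch (mine) reproducing the expected-count computation above: rows with a missing Y contribute fractional counts according to the current model's posterior over Y, and completed rows contribute a count of 1. The 0.5 fallback for configurations not given on the slide is my own assumption.

```python
from collections import defaultdict

data = [("H", None, "T"), ("T", None, None), ("H", "H", None),
        ("H", "T", "T"), ("T", "T", "H")]

def p_y_heads(x, z):
    """Current model's P(Y=H | X, Z); the 0.3 and 0.4 values are from the slide."""
    if x == "H" and z == "T":
        return 0.3
    if x == "T":
        return 0.4
    return 0.5          # assumption: configuration not specified on the slide

expected = defaultdict(float)
for x, y, z in data:
    if y is None:                                   # missing: use the posterior
        p = p_y_heads(x, z)
        expected[(x, "H")] += p
        expected[(x, "T")] += 1 - p
    else:                                           # observed: count of 1
        expected[(x, y)] += 1.0

print(dict(expected))
# {('H','H'): 1.3, ('H','T'): 1.7, ('T','H'): 0.4, ('T','T'): 1.6}
```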
Expectation Maximization (EM)
Reiterate:
Start from an initial network (G, Θ0)
E-step: use the current network and the training data to compute expected counts, e.g. N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)
M-step: reparameterize from the expected counts, giving an updated network (G, Θ1)
[Figure: the network X1, X2, X3 → H → Y1, Y2, Y3 cycling through E-step and M-step]

Expectation Maximization (EM)
Formal guarantees:
  L(Θ1 : D) ≥ L(Θ0 : D)
Each iteration improves the likelihood
If Θ1 = Θ0, then Θ0 is a stationary point of L(Θ : D)
Usually, this means a local maximum
Expectation Maximization (EM)
Computational bottleneck: computation of the expected counts in the E-step
Need to compute the posterior for each unobserved variable in each instance of the training set
All posteriors for an instance can be derived from one pass of standard BN inference

Summary: Parameter Learning with Incomplete Data
Incomplete data makes parameter estimation hard
The likelihood function:
Does not have a closed form
Is multimodal
Finding max likelihood parameters:
EM
Gradient ascent
Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics
Incomplete Data: Structure Scores
Recall the Bayesian score:
  P(G | D) ∝ P(G) P(D | G) = P(G) ∫ P(D | G, Θ) P(Θ | G) dΘ
With incomplete data:
Cannot evaluate the marginal likelihood in closed form
We have to resort to approximations:
Evaluate the score around the MAP parameters
Need to find the MAP parameters (e.g., by EM)

Naive Approach
Perform EM (parametric optimization to a local maximum) for each candidate graph G1, G2, G3, G4, …, Gn
Computationally expensive:
Parameter optimization via EM — non-trivial
Need to perform EM for all candidate structures
Spend time even on poor candidates
⇒ In practice, only a few candidates are considered
Structural EM
Recall that with complete data we had decomposition ⇒ efficient search
Idea:
Instead of optimizing the real score…
Find a decomposable alternative score
Such that maximizing the new score ⇒ improvement in the real score

Structural EM
Idea: use the current model to help evaluate new structures
Outline:
Perform search in (Structure, Parameters) space
At each iteration, use the current model for finding either:
Better scoring parameters: a "parametric" EM step
or
Better scoring structure: a "structural" EM step
Structural EM (cont.)
Reiterate:
[Figure: the training data and the current network (X1, X2, X3 → H → Y1, Y2, Y3) are used to compute expected counts, both for the current structure (N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)) and for candidate changes (N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H)); these are used to score & parameterize the next structure]

Example: Phylogenetic Reconstruction
Input: biological sequences
  Human   CGTTGC…
  Chimp   CCTAGG…
  Orang   CGAACG…
  …
Output: a phylogeny
Assumption: positions are independent
[Figure: a phylogenetic tree spanning roughly 10 billion years of evolution, with current-day species at the leaves; each position is an "instance" of the evolutionary process]
Phylogenetic Model
Topology: bifurcating
Observed species: 1 … N
Ancestral species: N+1 … 2N-2
Lengths t = {ti,j} for each branch (i, j)
Evolutionary model: P(A changes to T | 10 billion yrs)
[Figure: a bifurcating tree with leaves 1-7, internal nodes 8-12, and a labeled branch (8, 9)]

Phylogenetic Tree as a Bayes Net
Variables: the letter at each position for each species
Current-day species - observed
Ancestral species - hidden
BN structure: tree topology
BN parameters: branch lengths (time spans)
Main problem: learn the topology
If the ancestral species were observed ⇒ easy learning problem (learning trees)
Algorithm Outline
Start from an original tree (T0, t0)
Compute expected pairwise statistics
O(N²) pairwise statistics suffice to evaluate all trees
Weights: branch scores wi,j (pairwise weights)
Find T' = argmax_T Σ_{(i,j)∈T} wi,j   (max spanning tree)
Construct a bifurcation T1 (the new tree)
Theorem: L(T1, t1) ≥ L(T0, t0)
Repeat until convergence…
Real Life Data
Structural EM vs. the traditional approach (log-likelihood):

  Dataset                 # sequences   # positions   Traditional   Structural EM   Difference per position
  Lysozyme c              43            122           -2,916.2      -2,892.1        0.19
  Mitochondrial genomes   34            3,578         -74,227.9     -70,533.5       1.03

For the mitochondrial data, each position is roughly twice as likely under the Structural EM model.

Overview
Introduction
Parameter Estimation
Model Selection
Structure Discovery
Incomplete Data
Learning from Structured Data
Bayesian Networks: Problem
Bayesian nets use a propositional representation
The real world has objects, related to each other
[Figure: a generic Intelligence, Difficulty → Grade model]

Bayesian Networks: Problem
These "instances" are not independent!
[Figure: ground variables such as Intell_J.Doe, Diffic_CS101, Grade_JDoe_CS101 = A, Intell_FGump, Grade_FGump_CS101 = C, Diffic_Geo101, Grade_FGump_Geo101, which share parents across instances]

St. Nordaf University
[Figure: a small university world with professors (Teaching-Ability), the courses Geo101 and CS101 (Difficulty), the students Forrest Gump and Jane Doe (Intelligence), registrations (Grade, Satisfaction), and Teaches / Registered / In-course links]

Relational Schema
Specifies the types of objects in the domain, the attributes of each type, and the types of links between objects
Classes and attributes:
Professor: Teaching-Ability
Course: Difficulty
Student: Intelligence
Registration: Grade, Satisfaction
Links: Teach, Take (Registered), In-course
Possible Worlds
A world is an assignment to all attributes of all objects in the domain
[Figure: one concrete world, with teaching abilities, difficulties, intelligences, grades, and satisfactions filled in for every object]

Representing the Distribution
Many possible worlds for a given university
All possible assignments of all attributes of all objects
Infinitely many potential universities
Each associated with a very different set of worlds
Need to represent an infinite set of complex distributions
Probabilistic Relational Models
Key ideas:
Universals: probabilistic patterns hold for all objects in a class
Locality: represent direct probabilistic dependencies
Links give us the potential interactions!
[Figure: schema-level dependency model: Reg.Grade depends on Student.Intelligence and Course.Difficulty via a shared CPT θGrade|Intell,Diffic over the parent configurations (easy, weak), (easy, smart), (hard, weak), (hard, smart); Reg.Satisfaction depends on the professor's Teaching-Ability]

PRM Semantics
Instantiated PRM ⇒ BN
variables: attributes of all objects
dependencies: determined by links & PRM
[Figure: the ground Bayesian network for St. Nordaf University, with one Grade and one Satisfaction variable per registration, all sharing the CPT θGrade|Intell,Diffic]
The Web of Influence
Objects are all correlated
Need to perform inference over the entire model
For large databases, use approximate inference: loopy belief propagation
[Figure: evidence about one student's grades in CS101 and Geo101 shifts the posteriors over course difficulty (easy/hard) and over other students' intelligence (weak/smart)]

PRM Learning: Complete Data
The entire database is a single "instance"
Parameters are used many times in the instance
Introduce a prior over parameters
Update the prior with sufficient statistics, e.g.
  Count(Reg.Grade = A, Reg.Course.Diff = lo, Reg.Student.Intel = hi)
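A minimal sketch (mine) of gathering the PRM sufficient statistic shown above, Count(Reg.Grade, Reg.Course.Diff, Reg.Student.Intel), by joining hypothetical student, course, and registration tables; every registration in the database contributes to the same shared CPT.

```python
from collections import Counter

students = {"gump": "weak", "jdoe": "smart"}                   # Student.Intelligence
courses = {"CS101": "hard", "Geo101": "easy"}                  # Course.Difficulty
registrations = [("gump", "CS101", "C"), ("gump", "Geo101", "B"),
                 ("jdoe", "CS101", "A")]                       # (student, course, grade)

# Sufficient statistics: counts over (Grade, Course.Diff, Student.Intel).
counts = Counter((grade, courses[c], students[s]) for s, c, grade in registrations)
print(counts[("A", "hard", "smart")])   # the count for one CPT configuration
```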
PRM Learning: Incomplete Data
Use expected sufficient statistics
But everything is correlated:
⇒ the E-step uses (approximate) inference over the entire model
[Figure: the St. Nordaf world with unobserved intelligences, difficulties, and satisfactions marked "???"]

A Web of Data  [Craven et al.]
[Figure: WebKB example: pages for Tom Mitchell (Professor), the WebKB Project, Sean Slattery (Student), and the CMU CS Faculty, connected by Advisee-of, Works-on, Project-of, and Contains links]

Standard Approach  [Craven et al.]
[Figure: a naive Bayes model per page: Category → Word1 … WordN, with words such as "professor", "department", "extract", "information", "computer", "science", "machine", "learning"]

What's in a Link
[Figure: a relational model over From-Page and To-Page: each page has Category and Word1 … WordN, and the link itself has an Exists variable; bar chart of classification accuracy on a 0.52-0.68 scale]
Discovering Hidden Concepts
[Figure: a relational model over the Internet Movie Database (http://www.imdb.com): Actor (Gender, Type), Director (Type), Movie (Genre, Year, MPAA Rating, #Votes, Rating, Type), and Appeared (Credit-Order), where the Type attributes are hidden]

Discovering Hidden Concepts
The learned hidden "types" group the data into recognizable clusters, e.g.:
  Movies: Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson
  Movies: Terminator 2, Batman, Batman Forever, Mission: Impossible, GoldenEye, Starship Troopers, Hunt for Red October
  Actors: Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger
  Actors: Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman
  Directors: Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola
  Directors: Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher
(Internet Movie Database, http://www.imdb.com)

Web of Influence, Yet Again
[Figure: movie, actor, and director clusters influencing one another and attributes such as Rating]
Conclusion
Many distributions have combinatorial dependency structure
Utilizing this structure is good
Discovering this structure has implications:
For density estimation
For knowledge discovery
Many applications:
Medicine
Biology
Web
…
The END

Thanks to
Gal Elidan
Lise Getoor
Moises Goldszmidt
Matan Ninio
Dana Pe'er
Eran Segal
Ben Taskar

Slides will be available from:
http://www.cs.huji.ac.il/~nir/
http://robotics.stanford.edu/~koller/