Based on J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", TR-97-021, U.C. Berkeley, April 1998; G. J. McLachlan, T. Krishnan, "The EM Algorithm and Extensions", John Wiley & Sons, Inc., 1997; D. Koller, course CS-228 handouts, Stanford University, 2001; and N. Friedman & D. Koller's NIPS'99 tutorial.
Graphical Models
- Learning -
Parameter Estimation
Wolfram Burgard, Luc De Raedt, Kristian Kersting, Bernhard Nebel
Albert-Ludwigs University Freiburg, Germany
Advanced I, WS 06/07
[Title slide background: the ALARM patient-monitoring Bayesian network, with nodes such as MINVOLSET, KINKEDTUBE, PULMEMBOLUS, INTUBATION, VENTMACH, HYPOVOLEMIA, LVFAILURE, CATECHOL, HR, BP.]
Outline
• Introduction
• Reminder: Probability theory
• Basics of Bayesian Networks
• Modeling Bayesian networks
• Inference (VE, Junction tree)
• [Excursus: Markov Networks]
• Learning Bayesian networks
• Relational Models
What is Learning?
"The problem of understanding intelligence is said to be the greatest problem in science today and 'the' problem for this century – as deciphering the genetic code was for the second half of the last one ... the problem of learning represents a gateway to understanding intelligence in man and machines."
-- Tomaso Poggio and Steve Smale, 2003
• Agents are said to learn if they improve their performance over time based on experience.
Why bother with learning?
• Bottleneck of knowledge acquisition
  – Expensive, difficult
  – Usually, no expert is around
• Data is cheap!
  – Huge amounts of data are available, e.g.
    • Clinical tests
    • Web mining, e.g. log files
    • ...
Why Learn Bayesian Networks?
• Conditional independencies & graphical language capture the structure of many real-world distributions
• Graph structure provides much insight into the domain
  – Allows "knowledge discovery"
• Learned model can be used for many tasks
• Supports all the features of probabilistic learning
  – Model selection criteria
  – Dealing with missing data & hidden variables
Learning With Bayesian Networks
Data + prior information are fed into a learning algorithm, which outputs the network (here E -> A <- B) together with its conditional probability table P(A | E,B):

  E   B   | P(a | E,B)   P(¬a | E,B)
  e   b   |   .9            .1
  e   ¬b  |   .7            .3
  ¬e  b   |   .8            .2
  ¬e  ¬b  |   .99           .01
What does the data look like?
A complete data set: rows are the data cases X1, ..., XM, columns are the attributes/variables A1, ..., A6.

        A1      A2      A3      A4      A5      A6
  X1    true    true    false   true    false   false
  X2    false   true    true    true    false   false
  ...   ...     ...     ...     ...     ...     ...
  XM    true    false   false   false   true    true
What does the data look like?
An incomplete data set: in real-world data the states of some random variables are missing (marked "?"), e.g.
  – Medical diagnosis: not all patients are subjected to all tests
  – Parameter reduction, e.g. clustering, ...

        A1      A2      A3      A4      A5      A6
        true    true    ?       true    false   false
        ?       true    ?       ?       false   false
        ...     ...     ...     ...     ...     ...
        true    false   ?       false   true    ?
Hidden variable – Examples
1. Parameter reduction:
   Without the hidden variable, the CPTs of X1, X2, X3 and Y1, Y2, Y3 require 2, 2, 2 and 16, 32, 64 parameters respectively, i.e. 118 in total.
   Introducing a hidden variable H between the Xi and the Yi (X1, X2, X3 -> H -> Y1, Y2, Y3) reduces this to 2, 2, 2 for the Xi, 16 for H, and 4 for each Yi, i.e. 34 parameters in total.
Hidden variable – Examples
2. Clustering:
   A hidden Cluster variable represents the assignment to a cluster; the observed attributes X1, ..., Xn depend on it.
   • Autoclass model:
     – The hidden variable assigns class labels
     – The observed attributes are independent given the class
   • Hidden variables also appear in clustering
[Slides due to Eamonn Keogh]
Iterative clustering of 2D points: the cluster means are randomly assigned, then points are reassigned and the means updated over iterations 1, 2, 5, 25 until the clusters stabilize.

What is a natural grouping among these objects?
Clustering is subjective: the same characters can be grouped as Simpson's Family vs. School Employees, or as Females vs. Males.
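A minimal k-means-style sketch (not from the original slides; the data, cluster count, and random seed are made up) of the iterative reassignment the Keogh slides visualize: cluster means start at random points, then assignment and mean-update steps alternate until nothing changes.

```python
# Minimal k-means-style sketch of the iterative clustering shown above.
# Data and cluster count are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2)) + np.repeat([[0, 0], [4, 4]], 50, axis=0)

k = 2
means = points[rng.choice(len(points), size=k, replace=False)]  # random initial means

for iteration in range(25):
    # Assignment step: each point goes to its nearest cluster mean.
    dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each mean moves to the centroid of its assigned points.
    new_means = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_means, means):
        break
    means = new_means

print("final cluster means:\n", means)
```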
Learning With Bayesian Networks
• Fixed structure (A -> B given):
  – fully observed data: the easiest problem, solved by counting
  – partially observed data: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks
• Fixed variables, arcs to be selected (A ? B): new domains with no domain expert, data mining
  – partially observed data: encompasses the above as a difficult subproblem; "only" structural EM is known
• Hidden variables (A ? B plus hidden H): scientific discovery
Parameter Estimation
• Let D = {x1, ..., xN} be a set of data over the m random variables X1, ..., Xm; each xl is called a data case.
• i.i.d. assumption: all data cases are independently sampled from identical distributions.
Find: the parameters θ of the CPDs which match the data best.
Maximum Likelihood - Parameter Estimation
• What does "best matching" mean?
1. MAP parameters: the parameters that are most probable given the data,
   θ* = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ) / P(D).
2. The data is equally likely for all parameters: P(D) does not depend on θ.
3. All parameters are a priori equally likely: P(θ) is uniform.
=> Find: the ML parameters
   θ* = argmax_θ P(D | θ) = argmax_θ L(θ | D),
   where L(θ | D) = P(D | θ) is the likelihood of the parameters given the data, and
   LL(θ | D) = log P(D | θ) is the log-likelihood of the parameters given the data (maximizing either yields the same θ*).
Maximum Likelihood
• This is one of the most commonly used estimators in statistics
• Intuitively appealing
• Consistent: the estimate converges to the best possible value as the number of examples grows
• Asymptotically efficient: the estimate is as close to the true value as possible given a particular training set
Learning With Bayesian Networks (recap of the overview above)
Next: the easiest cell, fixed structure with fully observed data, which is solved by counting.
Known Structure, Complete Data
• The network structure is specified; the learner needs to estimate the parameters
• The data does not contain missing values

Input: data cases over E, B, A such as <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>, plus the fixed structure E -> A <- B with unknown CPT entries P(A | E,B) = ?.
Output of the learning algorithm: the filled-in CPT, e.g.

  E   B   | P(a | E,B)   P(¬a | E,B)
  e   b   |   .9            .1
  e   ¬b  |   .7            .3
  ¬e  b   |   .8            .2
  ¬e  ¬b  |   .99           .01
ML Parameter Estimation
Given the complete data set D = {x1, ..., xM} over the attributes A1, ..., A6 (as in the table shown earlier), the likelihood factorizes:

  L(θ | D) = P(D | θ)
           = ∏_{l=1..M} P(xl | θ)                                            (i.i.d.)
           = ∏_{l=1..M} ∏_{j=1..6} P(Aj = xl[j] | Pa(Aj) = xl[Pa(Aj)], θj)   (BN semantics)
           = ∏_{j=1..6} ∏_{l=1..M} P(Aj = xl[j] | Pa(Aj) = xl[Pa(Aj)], θj)

• In the j-th factor only the local parameters θj of the family of Aj are involved
• Each factor can be maximized individually!
=> Decomposability of the likelihood
Decomposability of Likelihood
If the data set is complete (no question marks), we can
• maximize each local likelihood function independently, and
• then combine the solutions to get an MLE solution.
This decomposition of the global problem into independent, local sub-problems allows efficient solutions to the MLE problem (illustrated by the sketch below).
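To make the decomposition concrete, here is a small sketch assuming a hypothetical two-variable network A -> B with made-up parameters and data: the log-likelihood evaluated over whole data cases equals the sum of the two local (per-family) log-likelihoods, which is why each CPT can be fit independently.

```python
# Sketch: the complete-data log-likelihood of a BN A -> B decomposes into
# one local term per family (A alone, and B given its parent A).
import math

# Made-up parameters: P(A=true) and P(B=true | A).
p_a = 0.6
p_b_given_a = {True: 0.7, False: 0.2}

# Made-up complete data cases (a, b).
data = [(True, True), (True, False), (False, True), (True, True), (False, False)]

def p_case(a, b):
    pa = p_a if a else 1 - p_a
    pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
    return pa * pb

# Global log-likelihood over whole cases ...
ll_global = sum(math.log(p_case(a, b)) for a, b in data)

# ... equals the sum of the two local (per-family) log-likelihoods.
ll_family_a = sum(math.log(p_a if a else 1 - p_a) for a, _ in data)
ll_family_b = sum(math.log(p_b_given_a[a] if b else 1 - p_b_given_a[a]) for a, b in data)

print(ll_global, ll_family_a + ll_family_b)  # identical up to rounding
```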
Likelihood for Multinomials
• Random variable V with values 1, ..., K
• L(θ | D) = ∏_{k=1..K} θk^Nk, where Nk is the count of state k in the data
• Constraint: θ1 + ... + θK = 1. This constraint implies that the choice of θi influences the choice of θj (i ≠ j).
Likelihood for Binomials (2 states only)
• L(θ | D) = θ1^N1 · θ2^N2 with θ1 + θ2 = 1
• Compute the partial derivative of the log-likelihood and set it to zero
=> The MLE is θ1 = N1 / (N1 + N2)
• In general, for multinomials (more than 2 states), the MLE is θk = Nk / (N1 + ... + NK)
Likelihood for Conditional Multinomials
• One multinomial θ_{v|pa} for each joint state pa of the parents of V:
  L(θ | D) = ∏_pa ∏_v θ_{v|pa}^N(v,pa)
• MLE: θ_{v|pa} = N(v, pa) / N(pa)  (illustrated by the sketch below)
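As an illustration of this formula, the following sketch (the data cases are made up, in the style of the <E,B,A> examples above) estimates the CPT P(A | E,B) from complete data by counting parent-child co-occurrences and normalizing.

```python
# Sketch: ML estimation of a conditional multinomial (a CPT) from complete data
# by counting. Data cases are (E, B, A) triples in the style of the slides.
from collections import Counter

data = [("Y", "N", "N"), ("Y", "N", "Y"), ("N", "N", "Y"),
        ("N", "Y", "Y"), ("N", "Y", "Y"), ("Y", "Y", "Y")]  # made-up cases

joint = Counter()   # N(e, b, a)
parent = Counter()  # N(e, b)
for e, b, a in data:
    joint[(e, b, a)] += 1
    parent[(e, b)] += 1

# MLE: theta_{a | e,b} = N(e, b, a) / N(e, b)
cpt = {(e, b, a): joint[(e, b, a)] / parent[(e, b)] for (e, b, a) in joint}

for (e, b, a), theta in sorted(cpt.items()):
    print(f"P(A={a} | E={e}, B={b}) = {theta:.2f}")
```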
Learning With Bayesian Networks (recap of the overview above)
Next: fixed structure with partially observed data, which requires numerical, nonlinear optimization and multiple calls to BN inference.
Known Structure, Incomplete Data
• The network structure is specified
• The data contains missing values
  – Need to consider assignments to the missing values

Input: data cases over E, B, A such as <Y,?,N>, <Y,N,?>, <N,N,Y>, <N,Y,Y>, ..., <?,Y,Y>, plus the fixed structure E -> A <- B with unknown CPT entries P(A | E,B) = ?.
Output of the learning algorithm: the filled-in CPT, e.g.

  E   B   | P(a | E,B)   P(¬a | E,B)
  e   b   |   .9            .1
  e   ¬b  |   .7            .3
  ¬e  b   |   .8            .2
  ¬e  ¬b  |   .99           .01
EM Idea
• In the case of complete data, ML parameter estimation is easy: simply counting (1 iteration).
• Incomplete data?
  1. Complete the data (imputation): most probable?, average?, ... value
  2. Count
  3. Iterate
EM Idea: complete the data
Incomplete data over A and B (structure A -> B):

  A      B
  true   true
  true   ?
  false  true
  true   false
  false  ?

Complete: with the current (here uniform) parameters, each case with a missing value is split over the possible completions, e.g. (true, ?) becomes (true, ?=true) and (true, ?=false) with weight 0.5 each, and likewise for (false, ?); the observed cases keep weight 1.0. Summing the weights gives the expected counts

  A      B      N
  true   true   1.5
  true   false  1.5
  false  true   1.5
  false  false  0.5

Maximize: set the parameters to the ML estimates for these expected counts, e.g. P(B=true | A=false) = 1.5 / 2 = 0.75.

Iterate: complete the data again with the new parameters and maximize again. The expected counts become

  A      B      N
  true   true   1.5
  true   false  1.5
  false  true   1.75
  false  false  0.25

and, after one more iteration,

  A      B      N
  true   true   1.5
  true   false  1.5
  false  true   1.875
  false  false  0.125

(these numbers are reproduced by the sketch below)
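A minimal sketch of this E-step/M-step alternation, assuming the structure A -> B and uniform starting parameters: the E-step splits each missing B according to the current P(B | A), the M-step re-estimates P(B | A) from the expected counts, and the printed tables match the 1.5 / 1.5 / 1.5 / 0.5, then 1.75 / 0.25, then 1.875 / 0.125 values above.

```python
# Sketch: EM on the toy data set above (structure A -> B, missing B values).
# Starting parameters are assumed uniform, matching the first expected-count table.
data = [(True, True), (True, None), (False, True), (True, False), (False, None)]

p_b_given_a = {True: 0.5, False: 0.5}  # current P(B=true | A)

for iteration in range(3):
    # E-step: expected counts, splitting each missing B over true/false.
    counts = {(a, b): 0.0 for a in (True, False) for b in (True, False)}
    for a, b in data:
        if b is None:
            counts[(a, True)] += p_b_given_a[a]
            counts[(a, False)] += 1 - p_b_given_a[a]
        else:
            counts[(a, b)] += 1.0
    print(f"iteration {iteration + 1}: expected counts {counts}")
    # M-step: "MLE" from the expected counts.
    for a in (True, False):
        total = counts[(a, True)] + counts[(a, False)]
        p_b_given_a[a] = counts[(a, True)] / total
```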
Complete-data likelihood
• For an incomplete data set (question marks, as in the table shown earlier), the incomplete-data likelihood P(D | θ) does not decompose.
• Assume the data were complete: for every assignment of the missing values there exists a completed data set with a complete-data likelihood, which decomposes and is easy to maximize.
• The incomplete-data likelihood is obtained by summing the complete-data likelihood over all completions of the missing values.
EM Algorithm - Abstract
• Expectation Step: compute the expected (complete-data) counts under the current parameters θk.
• Maximization Step: set θk+1 to the parameters that maximize the expected complete-data (log-)likelihood, i.e. the ML estimates for the expected counts.
EM Algorithm - Principle
Expectation Maximization (EM):
• Construct a new function Q(θ, θk), based on the "current point" θk, which "behaves well".
• Property: the maximum θk+1 of the new function has a better score than the current point, i.e. the incomplete-data likelihood P(Y | θk+1) is at least P(Y | θk).
[Figure: the incomplete-data likelihood P(Y | θ) as a function of θ, the current point θk, the auxiliary function Q(θ, θk), and its maximum θk+1.]
EM for Multinomials
• Random variable V with values 1, ..., K
• Expected likelihood: ∏_{k=1..K} θk^ENk, where ENk is the expected count of state k in the data, i.e. ENk = Σ_l P(V = k | xl, θ)
• "MLE": θk = ENk / (EN1 + ... + ENK)

EM for Conditional Multinomials
• One multinomial θ_{v|pa} for each joint state pa of the parents of V: ∏_pa ∏_v θ_{v|pa}^EN(v,pa)
• "MLE": θ_{v|pa} = EN(v, pa) / EN(pa)
Learning Parameters: incomplete data
Non-decomposable likelihood (missing values, hidden nodes).

Data (with missing values) over S, X, D, C, B:
  S  X  D  C  B
  <?  0  1  0  1>
  <1  1  ?  0  1>
  <0  0  0  ?  ?>
  <?  ?  0  ?  1>
  ...

Initial parameters define the current model.

Expectation: complete the data by inference in the current model, e.g. P(S | X=0, D=1, C=0, B=1), yielding expected counts, e.g.
  S  X  D  C  B
  1  0  1  0  1
  1  1  1  0  1
  0  0  0  0  0
  1  0  0  0  1
  ...

Maximization: update the parameters (ML, MAP) from the expected counts.

EM algorithm: iterate Expectation and Maximization until convergence.
Learning Parameters: incomplete data
1. Initialize the parameters
2. Compute pseudo counts (expected counts) for each variable, using the junction tree algorithm for the required inference
3. Set the parameters to the (completed-data) ML estimates
4. If not converged, iterate to 2 (see the sketch below)
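A compact sketch of this four-step loop, assuming a tiny network A -> B with made-up data; exact enumeration over the missing values stands in for the junction tree inference used on larger networks.

```python
# Sketch of the four-step EM loop for a tiny network A -> B.
# Exact enumeration over missing values stands in for junction-tree inference;
# data and initial parameters are made up for illustration.
data = [(True, True), (None, True), (False, None), (True, False), (None, None)]

# 1. Initialize the parameters: P(A=true) and P(B=true | A).
p_a, p_b = 0.5, {True: 0.5, False: 0.5}

def joint(a, b):
    return (p_a if a else 1 - p_a) * (p_b[a] if b else 1 - p_b[a])

for _ in range(100):
    # 2. Compute pseudo (expected) counts: posterior over the missing values of each case.
    counts = {(a, b): 0.0 for a in (True, False) for b in (True, False)}
    for a_obs, b_obs in data:
        completions = [(a, b) for a in (True, False) for b in (True, False)
                       if (a_obs is None or a == a_obs) and (b_obs is None or b == b_obs)]
        z = sum(joint(a, b) for a, b in completions)
        for a, b in completions:
            counts[(a, b)] += joint(a, b) / z
    # 3. Set the parameters to the (completed-data) ML estimates.
    n = sum(counts.values())
    new_p_a = (counts[(True, True)] + counts[(True, False)]) / n
    new_p_b = {a: counts[(a, True)] / (counts[(a, True)] + counts[(a, False)])
               for a in (True, False)}
    # 4. If not converged, iterate to 2.
    if abs(new_p_a - p_a) < 1e-6 and all(abs(new_p_b[a] - p_b[a]) < 1e-6 for a in p_b):
        break
    p_a, p_b = new_p_a, new_p_b

print("P(A=true) =", round(p_a, 3), " P(B=true | A) =", {k: round(v, 3) for k, v in p_b.items()})
```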
Monotonicity
• (Dempster, Laird, Rubin '77): the incomplete-data likelihood function is not decreased after an EM iteration.
• For (discrete) Bayesian networks: for any initial, non-uniform value the EM algorithm converges to a (local or global) maximum.
[Figure: log-likelihood on the training set (Alarm network) over EM iterations. Experiment by Bauer, Koller and Singer, UAI 1997.]
[Figure: parameter values (Alarm network) over EM iterations. Experiment by Bauer, Koller and Singer, UAI 1997.]
EM in Practice
Initial parameters:
• Random parameter setting
• "Best" guess from another source
Stopping criteria:
• Small change in the likelihood of the data
• Small change in the parameter values
Avoiding bad local maxima:
• Multiple restarts
• Early "pruning" of unpromising ones
Speed-up:
• Various methods to speed convergence
Gradient Ascent
• Main result: the gradient of the log-likelihood decomposes into terms that can be computed by BN inference,
  ∂LL(θ | D) / ∂θ_{v|pa} = Σ_l P(v, pa | xl, θ) / θ_{v|pa}
• Requires the same BN inference computations as EM
• Pros:
  – Flexible
  – Closely related to methods in neural network training
• Cons:
  – Need to project the gradient onto the space of legal parameters (see the one-parameter sketch below)
  – To get reasonable convergence we need to combine it with "smart" optimization techniques
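For intuition, a minimal sketch of the projection issue in the simplest possible case, a single binomial parameter θ = P(V = true) fit by gradient ascent on the log-likelihood (the counts and step size are made up): with one free parameter, projecting onto the space of legal parameters reduces to clipping θ back into (0, 1), and the ascent converges to the counting estimate N1 / (N1 + N2).

```python
# Sketch: gradient ascent on the log-likelihood of a single binomial parameter.
# LL(theta) = n1*log(theta) + n2*log(1-theta); the MLE is n1/(n1+n2).
n1, n2 = 7.0, 3.0        # made-up counts of the two states
theta = 0.5              # initial value
lr = 0.01                # step size

for _ in range(2000):
    grad = n1 / theta - n2 / (1 - theta)          # d LL / d theta
    theta = theta + lr * grad                     # gradient step
    theta = min(max(theta, 1e-6), 1 - 1e-6)       # "project" back into the legal range

print(theta)  # close to n1 / (n1 + n2) = 0.7
```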
Parameter Estimation: Summary
• Parameter estimation is a basic task for learning with Bayesian networks
• Due to missing values it becomes a non-linear optimization problem
  – EM, gradient ascent, ...
• EM for multinomial random variables:
  – Fully observed data: counting
  – Partially observed data: pseudo (expected) counts
• Junction tree to do the multiple inference calls