Introduction to Machine Learning (67577)
Lecture 14
Shai Shalev-Shwartz
School of CS and Engineering,
The Hebrew University of Jerusalem
Generative Models

Generative Models
The Generative Approach: try to learn the distribution D over the data.
If we know D, we can predict using the Bayes optimal classifier.
Usually, it is much harder to learn D than to simply learn a predictor.
However, in some situations, it is reasonable to adopt the generative learning approach:
  Computational reasons
  We sometimes don't have a specific task at hand
  Interpretability of the data

Outline
1 Maximum Likelihood
2 Naive Bayes
3 Linear Discriminant Analysis
4 Latent Variables and EM
5 Bayesian Reasoning

Maximum Likelihood Estimator
We assume as prior knowledge that the underlying distribution is parameterized by some θ.
Learning the distribution corresponds to finding θ.
Example: let X = {0, 1}; then the set of distributions over X is parameterized by a single number θ ∈ [0, 1], corresponding to P_{x∼Dθ}[x = 1] = Dθ({1}) = θ.
The goal is to learn θ from a sequence of i.i.d. examples S = (x1, . . . , xm) ∼ Dθ^m.

Maximum Likelihood Estimator
Likelihood: The likelihood of S, assuming the distribution is Dθ, is defined to be
  Dθ^m({S}) = ∏_{i=1}^m Dθ({xi}) = ∏_{i=1}^m P_{X∼Dθ}[X = xi]
Log-Likelihood: it is convenient to denote
  L(S; θ) = log(Dθ^m({S})) = ∑_{i=1}^m log P_{X∼Dθ}[X = xi]
Maximum Likelihood Estimator (MLE): estimate θ based on S according to
  θ̂(S) = argmax_θ L(S; θ).

Maximum Likelihood Estimator for Bernoulli variables
Suppose X = {0, 1} and Dθ is the distribution s.t. P_{x∼Dθ}[x = 1] = θ.
The log-likelihood function:
  L(S; θ) = log(θ) · |{i : xi = 1}| + log(1 − θ) · |{i : xi = 0}|
Maximizing w.r.t. θ gives the ML estimator. Taking the derivative w.r.t. θ and comparing to zero gives:
  |{i : xi = 1}| / θ̂ − |{i : xi = 0}| / (1 − θ̂) = 0  ⇒  θ̂ = |{i : xi = 1}| / m
That is, θ̂ is the fraction of ones in S (the empirical mean).
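
As a sanity check, here is a minimal Python sketch (not part of the lecture) of the Bernoulli MLE: it simply computes the empirical fraction of ones, and optionally evaluates the log-likelihood at that value.

```python
import numpy as np

def bernoulli_mle(S):
    """ML estimate of theta for i.i.d. Bernoulli samples S in {0, 1}."""
    S = np.asarray(S)
    return S.mean()  # |{i : x_i = 1}| / m

def bernoulli_log_likelihood(S, theta):
    """L(S; theta) = log(theta) * #ones + log(1 - theta) * #zeros."""
    S = np.asarray(S)
    ones, zeros = S.sum(), len(S) - S.sum()
    return ones * np.log(theta) + zeros * np.log(1 - theta)

S = np.array([1, 0, 1, 1, 0, 1])
theta_hat = bernoulli_mle(S)   # 4/6 ≈ 0.667
print(theta_hat, bernoulli_log_likelihood(S, theta_hat))
```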

Maximum Likelihood for Continuous Variables
Example: X = [0, 1] and Dθ is the uniform distribution. Then Dθ({x}) = 0 for all x, so L(S; θ) = −∞ ...
To overcome the problem, we define L using the density of the distribution:
  L(S; θ) = ∑_{i=1}^m log(P_{x∼Dθ}[x = xi])
E.g., for a Gaussian distribution, with θ = (µ, σ),
  P_{x∼Dθ}(xi) = 1 / (σ√(2π)) · exp(−(xi − µ)² / (2σ²))
and
  L(S; θ) = −(1 / (2σ²)) ∑_{i=1}^m (xi − µ)² − m log(σ√(2π)).
The MLE becomes: µ̂ = (1/m) ∑_{i=1}^m xi and σ̂ = √((1/m) ∑_{i=1}^m (xi − µ̂)²).
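
A small Python sketch (my own, not from the slides) of these closed-form Gaussian ML estimates; note that σ̂ uses the 1/m normalization from the formula above, not the unbiased 1/(m−1) variant.

```python
import numpy as np

def gaussian_mle(S):
    """Closed-form ML estimates (mu_hat, sigma_hat) for i.i.d. Gaussian samples."""
    S = np.asarray(S, dtype=float)
    mu_hat = S.mean()                                    # (1/m) * sum_i x_i
    sigma_hat = np.sqrt(((S - mu_hat) ** 2).mean())      # 1/m normalization (ML)
    return mu_hat, sigma_hat

S = np.random.default_rng(0).normal(loc=2.0, scale=0.5, size=1000)
print(gaussian_mle(S))  # roughly (2.0, 0.5)
```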

Naive Bayes
A classical demonstration of how generative assumptions and parameter estimation simplify the learning process.
Goal: learn a function h : {0, 1}^d → {0, 1}.
Bayes optimal classifier:
  hBayes(x) = argmax_{y∈{0,1}} P[Y = y | X = x].
We would need 2^d parameters to describe P[Y = y | X = x] for every x ∈ {0, 1}^d.
Naive generative assumption: the features are independent given the label:
  P[X = x | Y = y] = ∏_{i=1}^d P[Xi = xi | Y = y]

Naive Bayes
With this (rather naive) assumption and using Bayes' rule, the Bayes optimal classifier can be further simplified:
  hBayes(x) = argmax_{y∈{0,1}} P[Y = y | X = x]
            = argmax_{y∈{0,1}} P[Y = y] P[X = x | Y = y]
            = argmax_{y∈{0,1}} P[Y = y] ∏_{i=1}^d P[Xi = xi | Y = y].
Now the number of parameters to estimate is only 2d + 1.
This reduces both runtime and sample complexity.
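
A minimal Python sketch (mine, with hypothetical names such as fit_naive_bayes) of how the 2d + 1 parameters could be estimated by per-coordinate maximum likelihood and then plugged into the simplified Bayes rule. Log-probabilities are used to avoid underflow, and the small clipping constant eps is an assumption to keep the logarithms finite.

```python
import numpy as np

def fit_naive_bayes(X, y, eps=1e-9):
    """Estimate P[Y=1] and P[X_i=1 | Y=y] for binary features (2d + 1 numbers)."""
    X, y = np.asarray(X), np.asarray(y)
    p_y1 = y.mean()
    p_xi1_given_y = {c: X[y == c].mean(axis=0).clip(eps, 1 - eps) for c in (0, 1)}
    return p_y1, p_xi1_given_y

def predict_naive_bayes(x, p_y1, p_xi1_given_y):
    """argmax_y P[Y=y] * prod_i P[X_i = x_i | Y=y], computed in log space."""
    scores = {}
    for c, prior in ((0, 1 - p_y1), (1, p_y1)):
        p = p_xi1_given_y[c]
        scores[c] = np.log(prior) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(scores, key=scores.get)

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
params = fit_naive_bayes(X, y)
print(predict_naive_bayes(np.array([1, 0, 1]), *params))  # expected: 1
```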

Linear Discriminant Analysis
Another demonstration of how generative assumptions simplify the learning process.
Goal: learn h : R^d → {0, 1}.
The generative assumption: y is generated based on P[Y = 1] = P[Y = 0] = 1/2, and given y, x ∼ N(µy, Σ):
  P[X = x | Y = y] = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp(−(1/2)(x − µy)ᵀ Σ⁻¹ (x − µy))
Bayes rule:
  hBayes(x) = argmax_{y∈{0,1}} P[Y = y] P[X = x | Y = y]
This means we will predict hBayes(x) = 1 iff
  (1/2)(x − µ0)ᵀ Σ⁻¹ (x − µ0) − (1/2)(x − µ1)ᵀ Σ⁻¹ (x − µ1) > 0

Linear Discriminant Analysis
Equivalent to ⟨w, x⟩ + b > 0, where
  w = (µ1 − µ0)ᵀ Σ⁻¹  and  b = (1/2)(µ0ᵀ Σ⁻¹ µ0 − µ1ᵀ Σ⁻¹ µ1)
That is, the Bayes optimal classifier is a halfspace in this case.
But, instead of learning the halfspace directly, we'll learn µ0, µ1, Σ using maximum likelihood.
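
A short Python sketch (my own) of this recipe under the stated assumptions: estimate µ0, µ1, and a shared Σ by maximum likelihood, then classify with the resulting halfspace ⟨w, x⟩ + b > 0.

```python
import numpy as np

def fit_lda(X, y):
    """ML estimates of mu_0, mu_1 and the shared covariance Sigma."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    centered = np.where(y[:, None] == 1, X - mu1, X - mu0)
    Sigma = centered.T @ centered / len(X)
    return mu0, mu1, Sigma

def lda_halfspace(mu0, mu1, Sigma):
    """w and b of the equivalent halfspace <w, x> + b > 0."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1)
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w, b = lda_halfspace(*fit_lda(X, y))
print((X @ w + b > 0).astype(int))  # predicted labels
```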

Latent Variables
Sometimes it is convenient to express the distribution over X using latent random variables.
Mixture of Gaussians: each x ∈ R^d is generated by first selecting a random y from [k], then choosing x according to N(µy, Σy).

Mixture of Gaussians
Each x ∈ R^d is generated by first selecting a random y from [k], then choosing x according to N(µy, Σy).
The density can be written as:
  P[X = x] = ∑_{y=1}^k P[Y = y] P[X = x | Y = y]
           = ∑_{y=1}^k cy · 1 / ((2π)^{d/2} |Σy|^{1/2}) · exp(−(1/2)(x − µy)ᵀ Σy⁻¹ (x − µy)),
where cy = P[Y = y].
Note: Y is a hidden variable that we do not observe in the data. It is just used to simplify the parametric description of the distribution.
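
As an illustration (not from the lecture), a Python sketch evaluating this mixture density for given parameters, with scipy.stats.multivariate_normal supplying the Gaussian densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, c, mus, Sigmas):
    """P[X = x] = sum_y c_y * N(x; mu_y, Sigma_y)."""
    return sum(c_y * multivariate_normal(mu, Sigma).pdf(x)
               for c_y, mu, Sigma in zip(c, mus, Sigmas))

c = [0.3, 0.7]                                  # mixing weights c_y = P[Y = y]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(mixture_density(np.array([2.5, 3.0]), c, mus, Sigmas))
```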

Latent Variables
More generally,
  log(Pθ[X = x]) = log(∑_{y=1}^k Pθ[X = x, Y = y]).
Maximum Likelihood:
  argmax_θ ∑_{i=1}^m log(∑_{y=1}^k Pθ[X = xi, Y = y]).
In many cases, the summation inside the log makes the above optimization problem computationally hard.
A popular heuristic: Expectation-Maximization (EM), due to Dempster, Laird and Rubin.

Expectation-Maximization (EM)
Designed for cases in which, had we known the values of the latent variables Y, the maximum likelihood optimization problem would have been tractable.
Precisely, define the following function over m × k matrices Q and the set of parameters θ:
  F(Q, θ) = ∑_{i=1}^m ∑_{y=1}^k Qi,y log(Pθ[X = xi, Y = y])
Interpret F(Q, θ) as the expected log-likelihood of (x1, y1), . . . , (xm, ym).
Assumption: For any matrix Q ∈ [0, 1]^{m×k} such that each row of Q sums to 1, the optimization problem argmax_θ F(Q, θ) is tractable.

Expectation-Maximization (EM)
A "chicken and egg" problem: had we known Q, it would be easy to find θ; had we known θ, we could set Qi,y = Pθ[Y = y | X = xi].
Expectation step: set
  Q^{(t+1)}_{i,y} = P_{θ^{(t)}}[Y = y | X = xi].
Maximization step: set
  θ^{(t+1)} = argmax_θ F(Q^{(t+1)}, θ).

EM as an alternate maximization algorithm
EM can be viewed as alternate maximization of the objective
  G(Q, θ) = F(Q, θ) − ∑_{i=1}^m ∑_{y=1}^k Qi,y log(Qi,y).
Lemma: The EM procedure can be rewritten as:
  Q^{(t+1)} = argmax_{Q ∈ [0,1]^{m×k} : ∀i, ∑_y Qi,y = 1} G(Q, θ^{(t)})
  θ^{(t+1)} = argmax_θ G(Q^{(t+1)}, θ).
Furthermore, G(Q^{(t+1)}, θ^{(t)}) = L(S; θ^{(t)}).
Corollary: L(S; θ^{(t+1)}) ≥ L(S; θ^{(t)})

EM for Mixture of Gaussians (soft k-means)
Mixture of k Gaussians in which θ = (c, {µ1, . . . , µk}).
Expectation step:
  Q^{(t)}_{i,y} ∝ c^{(t)}_y exp(−(1/2) ‖xi − µ^{(t)}_y‖²)
Maximization step:
  µ^{(t+1)}_y ∝ ∑_{i=1}^m Q^{(t)}_{i,y} xi   and   c^{(t+1)}_y ∝ ∑_{i=1}^m Q^{(t)}_{i,y}
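
A compact Python sketch (my own reading of these updates, assuming unit covariances, as the "soft k-means" name suggests): each iteration normalizes the proportional E-step and M-step expressions above.

```python
import numpy as np

def em_soft_kmeans(X, k, iters=50, seed=0):
    """EM for a mixture of k spherical unit-variance Gaussians, theta = (c, mus)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    c = np.full(k, 1.0 / k)
    mus = X[rng.choice(m, size=k, replace=False)]   # initialize means at data points
    for _ in range(iters):
        # E-step: Q[i, y] ∝ c_y * exp(-0.5 * ||x_i - mu_y||^2), rows normalized to 1
        sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        Q = c * np.exp(-0.5 * sq)
        Q /= Q.sum(axis=1, keepdims=True)
        # M-step: mu_y ∝ sum_i Q[i, y] x_i,  c_y ∝ sum_i Q[i, y]
        w = Q.sum(axis=0)
        mus = (Q.T @ X) / w[:, None]
        c = w / m
    return c, mus, Q

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
c, mus, Q = em_soft_kmeans(X, k=2)
print(c, mus)
```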

Bayesian Reasoning
So far, we treated θ as an unknown parameter.
Bayesians treat uncertainty as randomness.
Formally, think of θ as a random variable with prior probability P[θ].
The probability of X given S is
  P[X = x | S] = ∑_θ P[X = x | θ, S] P[θ | S] = ∑_θ P[X = x | θ] P[θ | S]
Using Bayes rule we obtain a posterior distribution
  P[θ | S] = P[S | θ] P[θ] / P[S]
Therefore,
  P[X = x | S] = (1 / P[S]) ∑_θ P[X = x | θ] ∏_{i=1}^m P[X = xi | θ] P[θ].

Bayesian Reasoning
Example: suppose X = {0, 1} and the prior on θ is uniform over [0, 1].
Then:
  P[X = x | S] ∝ ∫ θ^{x + ∑_i xi} (1 − θ)^{(1−x) + ∑_i(1−xi)} dθ.
Solving (using integration by parts) we obtain
  P[X = 1 | S] = ((∑_i xi) + 1) / (m + 2).
Recall that Maximum Likelihood in this case gives P[X = 1 | θ̂] = (∑_i xi) / m.
Therefore, the uniform prior behaves like maximum likelihood, except that it adds "pseudo-examples" to the training set (one extra 1 and one extra 0).
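
A tiny Python sketch (not from the lecture) contrasting the two estimates on a concrete sample; with three ones out of four examples, the ML estimate is 3/4 while the Bayesian predictive with a uniform prior is 4/6.

```python
import numpy as np

S = np.array([1, 1, 0, 1])        # m = 4 examples, three ones
m, ones = len(S), S.sum()

p_ml = ones / m                   # maximum likelihood: sum_i x_i / m
p_bayes = (ones + 1) / (m + 2)    # uniform-prior predictive: (sum_i x_i + 1) / (m + 2)

print(p_ml, p_bayes)              # 0.75 vs 0.666...
```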

Maximum A-Posteriori
In many situations, it is difficult to find a closed-form solution to the integral in the definition of P[X = x | S].
A popular approximation is to find a single θ that maximizes P[θ | S].
This value is called the Maximum A-Posteriori (MAP) estimator.
Once this value is found, we can calculate the probability that X = x given the maximum a-posteriori estimator, without further reference to S.
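
As an illustration only (my own), a Python sketch that finds the MAP estimator by a crude grid search over θ; with the uniform prior of the previous example the posterior is proportional to the likelihood, so the MAP estimate coincides with the ML estimate ∑_i xi / m.

```python
import numpy as np

S = np.array([1, 1, 0, 1])
ones, zeros = S.sum(), len(S) - S.sum()

thetas = np.linspace(1e-6, 1 - 1e-6, 100001)     # grid over [0, 1]
log_prior = np.zeros_like(thetas)                # uniform prior: constant log-density
log_likelihood = ones * np.log(thetas) + zeros * np.log(1 - thetas)
log_posterior = log_likelihood + log_prior       # up to the constant -log P[S]

theta_map = thetas[np.argmax(log_posterior)]
print(theta_map, ones / len(S))                  # both ≈ 0.75
```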

Summary
Generative approach: model the distribution over the data.
Parametric density estimation: estimate the parameters characterizing the distribution.
Rules: Maximum Likelihood, Bayesian estimation, Maximum A-Posteriori.
Algorithms: Naive Bayes, LDA, EM.