Introduction to Machine Learning (67577), Lecture 14
Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem

Generative Models

- The generative approach: try to learn the distribution D over the data.
- If we know D, we can predict using the Bayes optimal classifier.
- Usually, it is much harder to learn D than to simply learn a predictor.
- However, in some situations it is reasonable to adopt the generative learning approach:
  - Computational reasons
  - We sometimes don't have a specific task at hand
  - Interpretability of the data
Outline
1. Maximum Likelihood
2. Naive Bayes
3. Linear Discriminant Analysis
4. Latent Variables and EM
5. Bayesian Reasoning

Maximum Likelihood Estimator
- We assume as prior knowledge that the underlying distribution is parameterized by some $\theta$.
- Learning the distribution corresponds to finding $\theta$.
- Example: let $\mathcal{X} = \{0,1\}$. The distributions over $\mathcal{X}$ are parameterized by a single number $\theta \in [0,1]$ corresponding to $\mathbb{P}_{x \sim \mathcal{D}_\theta}[x = 1] = \mathcal{D}_\theta(\{1\}) = \theta$.
- The goal is to learn $\theta$ from a sequence of i.i.d. examples $S = (x_1, \dots, x_m) \sim \mathcal{D}_\theta^m$.
- Likelihood: the likelihood of $S$, assuming the distribution is $\mathcal{D}_\theta$, is defined to be
  $\mathcal{D}_\theta^m(\{S\}) = \prod_{i=1}^m \mathcal{D}_\theta(\{x_i\}) = \prod_{i=1}^m \mathbb{P}_{X \sim \mathcal{D}_\theta}[X = x_i]$
- Log-likelihood: it is convenient to denote
  $L(S;\theta) = \log\big(\mathcal{D}_\theta^m(\{S\})\big) = \sum_{i=1}^m \log \mathbb{P}_{X \sim \mathcal{D}_\theta}[X = x_i]$
- Maximum Likelihood Estimator (MLE): estimate $\theta$ based on $S$ according to $\hat{\theta}(S) = \mathrm{argmax}_\theta \, L(S;\theta)$.

Maximum Likelihood Estimator for Bernoulli variables
- Suppose $\mathcal{X} = \{0,1\}$ and $\mathcal{D}_\theta$ is the distribution such that $\mathbb{P}_{x \sim \mathcal{D}_\theta}[x = 1] = \theta$.
- The log-likelihood function: $L(S;\theta) = \log(\theta)\cdot|\{i : x_i = 1\}| + \log(1-\theta)\cdot|\{i : x_i = 0\}|$.
- Maximizing w.r.t. $\theta$ gives the ML estimator. Taking the derivative w.r.t. $\theta$ and setting it to zero gives
  $\frac{|\{i : x_i = 1\}|}{\hat{\theta}} - \frac{|\{i : x_i = 0\}|}{1 - \hat{\theta}} = 0 \;\Rightarrow\; \hat{\theta} = \frac{|\{i : x_i = 1\}|}{m}$
- That is, $\hat{\theta}$ is the fraction of ones in $S$.
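The closed-form estimator above is easy to check numerically. A minimal sketch, assuming NumPy; the sample `S`, the function names, and the grid search are only for illustration:

```python
import numpy as np

def bernoulli_log_likelihood(S, theta):
    """L(S; theta) = log(theta) * #{i : x_i = 1} + log(1 - theta) * #{i : x_i = 0}."""
    S = np.asarray(S)
    ones = np.count_nonzero(S == 1)
    zeros = np.count_nonzero(S == 0)
    return ones * np.log(theta) + zeros * np.log(1 - theta)

def bernoulli_mle(S):
    """theta_hat = |{i : x_i = 1}| / m, i.e. the fraction of ones in S."""
    return float(np.mean(S))

S = np.array([1, 0, 1, 1, 0, 1, 1, 0])
grid = np.linspace(0.01, 0.99, 99)
theta_grid = grid[np.argmax([bernoulli_log_likelihood(S, t) for t in grid])]
print(bernoulli_mle(S), theta_grid)  # both close to 5/8
```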
Maximum Likelihood for Continuous Variables
- Example: $\mathcal{X} = [0,1]$ and $\mathcal{D}_\theta$ is the uniform distribution. Then $\mathcal{D}_\theta(\{x\}) = 0$ for all $x$, so $L(S;\theta) = -\infty$ ...
- To overcome the problem, we define $L$ using the density: $L(S;\theta) = \sum_{i=1}^m \log\big(p_\theta(x_i)\big)$, where $p_\theta$ is the density of $\mathcal{D}_\theta$.
- E.g., for a Gaussian distribution with $\theta = (\mu, \sigma)$,
  $p_\theta(x_i) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$
  and
  $L(S;\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^m (x_i-\mu)^2 - m\log\!\big(\sigma\sqrt{2\pi}\big)$.
- The MLE becomes $\hat{\mu} = \frac{1}{m}\sum_{i=1}^m x_i$ and $\hat{\sigma} = \sqrt{\frac{1}{m}\sum_{i=1}^m (x_i-\hat{\mu})^2}$.
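A small sketch of these Gaussian MLE formulas, again assuming NumPy; note that $\hat{\sigma}$ uses the $1/m$ factor from the ML derivation, not the unbiased $1/(m-1)$:

```python
import numpy as np

def gaussian_mle(S):
    """ML estimates for a univariate Gaussian: mu_hat is the sample mean, sigma_hat uses 1/m."""
    S = np.asarray(S, dtype=float)
    mu_hat = S.mean()
    sigma_hat = np.sqrt(np.mean((S - mu_hat) ** 2))
    return mu_hat, sigma_hat

def gaussian_log_likelihood(S, mu, sigma):
    """L(S; (mu, sigma)) = -sum_i (x_i - mu)^2 / (2 sigma^2) - m * log(sigma * sqrt(2 pi))."""
    S = np.asarray(S, dtype=float)
    m = len(S)
    return -np.sum((S - mu) ** 2) / (2 * sigma ** 2) - m * np.log(sigma * np.sqrt(2 * np.pi))

S = np.random.default_rng(0).normal(loc=2.0, scale=0.5, size=1000)
print(gaussian_mle(S))  # roughly (2.0, 0.5)
```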
Naive Bayes
- A classical demonstration of how generative assumptions and parameter estimation simplify the learning process.
- Goal: learn a function $h : \{0,1\}^d \to \{0,1\}$.
- Bayes optimal classifier: $h_{\text{Bayes}}(x) = \mathrm{argmax}_{y \in \{0,1\}} \, \mathbb{P}[Y = y \mid X = x]$.
- We need $2^d$ parameters to describe $\mathbb{P}[Y = y \mid X = x]$ for every $x \in \{0,1\}^d$.
- Naive generative assumption: the features are independent given the label,
  $\mathbb{P}[X = x \mid Y = y] = \prod_{i=1}^d \mathbb{P}[X_i = x_i \mid Y = y]$.
- With this (rather naive) assumption and Bayes' rule, the Bayes optimal classifier can be further simplified:
  $h_{\text{Bayes}}(x) = \mathrm{argmax}_{y \in \{0,1\}} \mathbb{P}[Y=y \mid X=x] = \mathrm{argmax}_{y \in \{0,1\}} \mathbb{P}[Y=y]\,\mathbb{P}[X=x \mid Y=y] = \mathrm{argmax}_{y \in \{0,1\}} \mathbb{P}[Y=y]\prod_{i=1}^d \mathbb{P}[X_i = x_i \mid Y = y]$.
- Now the number of parameters to estimate is $2d+1$.
- This reduces both runtime and sample complexity.
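A sketch of the resulting learner for binary features, estimating the $2d+1$ parameters by maximum likelihood and predicting in log space. The function names and the small clipping constant (used only to avoid $\log(0)$) are illustrative choices, not part of the lecture:

```python
import numpy as np

def naive_bayes_fit(X, y, eps=1e-9):
    """Estimate P[Y=1] and P[X_j=1 | Y=y] from binary X (shape m x d) and labels y: 2d+1 numbers."""
    X, y = np.asarray(X), np.asarray(y)
    p_y1 = y.mean()
    p1 = np.clip(X[y == 1].mean(axis=0), eps, 1 - eps)   # P[X_j = 1 | Y = 1]
    p0 = np.clip(X[y == 0].mean(axis=0), eps, 1 - eps)   # P[X_j = 1 | Y = 0]
    return p_y1, p0, p1

def naive_bayes_predict(x, p_y1, p0, p1):
    """h(x) = argmax_y P[Y=y] * prod_j P[X_j = x_j | Y=y], computed as a sum of logs."""
    x = np.asarray(x)
    score1 = np.log(p_y1) + np.sum(x * np.log(p1) + (1 - x) * np.log(1 - p1))
    score0 = np.log(1 - p_y1) + np.sum(x * np.log(p0) + (1 - x) * np.log(1 - p0))
    return int(score1 > score0)
```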
Linear Discriminant Analysis
- Another demonstration of how generative assumptions simplify the learning process.
- Goal: learn $h : \mathbb{R}^d \to \{0,1\}$.
- The generative assumption: $y$ is generated with $\mathbb{P}[Y=1] = \mathbb{P}[Y=0] = 1/2$, and given $y$, $x \sim N(\mu_y, \Sigma)$:
  $\mathbb{P}[X = x \mid Y = y] = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_y)^\top \Sigma^{-1}(x-\mu_y)\right)$
- Bayes' rule: $h_{\text{Bayes}}(x) = \mathrm{argmax}_{y \in \{0,1\}} \mathbb{P}[Y=y]\,\mathbb{P}[X=x \mid Y=y]$.
- This means we will predict $h_{\text{Bayes}}(x) = 1$ iff
  $\tfrac{1}{2}(x-\mu_0)^\top \Sigma^{-1}(x-\mu_0) - \tfrac{1}{2}(x-\mu_1)^\top \Sigma^{-1}(x-\mu_1) > 0$.
- This is equivalent to $\langle w, x\rangle + b > 0$ where
  $w = (\mu_1 - \mu_0)^\top \Sigma^{-1}$ and $b = \tfrac{1}{2}\big(\mu_0^\top \Sigma^{-1}\mu_0 - \mu_1^\top \Sigma^{-1}\mu_1\big)$.
- That is, the Bayes optimal classifier is a halfspace in this case.
- But instead of learning the halfspace directly, we learn $\mu_0$, $\mu_1$, $\Sigma$ using maximum likelihood.
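A sketch of this pipeline, assuming NumPy: estimate $\mu_0$, $\mu_1$ and the shared $\Sigma$ by maximum likelihood, then form the halfspace. Here $w$ is computed as $\Sigma^{-1}(\mu_1-\mu_0)$, which equals the transposed expression above since $\Sigma$ is symmetric; the function names are illustrative:

```python
import numpy as np

def lda_fit(X, y):
    """ML estimates of the class means and shared covariance, then the halfspace parameters (w, b)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)   # subtract each example's class mean
    Sigma = centered.T @ centered / len(X)
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1)
    return w, b

def lda_predict(x, w, b):
    """h(x) = 1 iff <w, x> + b > 0."""
    return int(np.dot(w, x) + b > 0)
```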
Latent Variables
- Sometimes it is convenient to express the distribution over $\mathcal{X}$ using latent random variables.
- Mixture of Gaussians: each $x \in \mathbb{R}^d$ is generated by first selecting a random $y$ from $[k]$, then choosing $x$ according to $N(\mu_y, \Sigma_y)$.
- The density can be written as
  $\mathbb{P}[X=x] = \sum_{y=1}^k \mathbb{P}[Y=y]\,\mathbb{P}[X=x \mid Y=y] = \sum_{y=1}^k c_y \frac{1}{(2\pi)^{d/2}|\Sigma_y|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_y)^\top \Sigma_y^{-1}(x-\mu_y)\right)$
- Note: $Y$ is a hidden variable that we do not observe in the data. It is just used to simplify the parametric description of the distribution.
- More generally, $\log\big(\mathbb{P}_\theta[X=x]\big) = \log \sum_{y=1}^k \mathbb{P}_\theta[X=x,\, Y=y]$.
- Maximum likelihood: $\mathrm{argmax}_\theta \sum_{i=1}^m \log \sum_{y=1}^k \mathbb{P}_\theta[X=x_i,\, Y=y]$.
- In many cases, the summation inside the log makes this optimization problem computationally hard.
- A popular heuristic: Expectation-Maximization (EM), due to Dempster, Laird and Rubin.

Expectation-Maximization (EM)
- Designed for cases in which, had we known the values of the latent variables $Y$, the maximum likelihood optimization problem would have been tractable.
- Precisely, define the following function over $m \times k$ matrices and the set of parameters $\theta$:
  $F(Q, \theta) = \sum_{i=1}^m \sum_{y=1}^k Q_{i,y} \log\big(\mathbb{P}_\theta[X=x_i,\, Y=y]\big)$
- Interpret $F(Q,\theta)$ as the expected log-likelihood of $(x_1,y_1),\dots,(x_m,y_m)$.
- Assumption: for any matrix $Q \in [0,1]^{m \times k}$ whose rows each sum to 1, the optimization problem $\mathrm{argmax}_\theta F(Q,\theta)$ is tractable.
- "Chicken and egg" problem: had we known $Q$, it would be easy to find $\theta$; had we known $\theta$, we could set $Q_{i,y} = \mathbb{P}_\theta[Y=y \mid X=x_i]$.
- Expectation step: set $Q^{(t+1)}_{i,y} = \mathbb{P}_{\theta^{(t)}}[Y=y \mid X=x_i]$.
- Maximization step: set $\theta^{(t+1)} = \mathrm{argmax}_\theta F(Q^{(t+1)}, \theta)$.

EM as an alternate maximization algorithm
- EM can be viewed as alternate maximization of the objective
  $G(Q,\theta) = F(Q,\theta) - \sum_{i=1}^m \sum_{y=1}^k Q_{i,y}\log(Q_{i,y})$.
- Lemma: the EM procedure can be rewritten as
  $Q^{(t+1)} = \mathrm{argmax}_{Q \in [0,1]^{m \times k} :\, \forall i,\ \sum_y Q_{i,y} = 1}\ G(Q, \theta^{(t)})$ and $\theta^{(t+1)} = \mathrm{argmax}_\theta\ G(Q^{(t+1)}, \theta)$.
  Furthermore, $G(Q^{(t+1)}, \theta^{(t)}) = L(S; \theta^{(t)})$.
- Corollary: $L(S; \theta^{(t+1)}) \ge L(S; \theta^{(t)})$, i.e. the log-likelihood never decreases along the EM iterations.
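As a concrete illustration of the objective that EM monotonically improves, here is a sketch that evaluates $L(S;\theta)$ for a mixture with unit-covariance components (the same simplified setting as the soft k-means variant below). The sum over $y$ sits inside the log, which is exactly what makes direct maximization hard. NumPy is assumed and the function name is illustrative:

```python
import numpy as np

def mixture_log_likelihood(X, c, mus):
    """L(S; theta) = sum_i log( sum_y c_y * N(x_i; mu_y, I) ) for unit-covariance components."""
    X, c, mus = np.asarray(X, float), np.asarray(c, float), np.asarray(mus, float)
    d = X.shape[1]
    # log N(x_i; mu_y, I) = -0.5 * ||x_i - mu_y||^2 - (d/2) * log(2*pi), for every pair (i, y)
    sq_dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)      # shape (m, k)
    log_components = -0.5 * sq_dists - 0.5 * d * np.log(2 * np.pi)
    # log of a sum over y, computed stably with log-sum-exp
    return np.logaddexp.reduce(np.log(c)[None, :] + log_components, axis=1).sum()
```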
EM for Mixture of Gaussians (soft k-means)
- Mixture of $k$ Gaussians in which $\theta = (c, \{\mu_1, \dots, \mu_k\})$.
- Expectation step: $Q^{(t)}_{i,y} \propto c^{(t)}_y \exp\!\left(-\tfrac{1}{2}\|x_i - \mu^{(t)}_y\|^2\right)$.
- Maximization step: $\mu^{(t+1)}_y \propto \sum_{i=1}^m Q^{(t)}_{i,y}\, x_i$ and $c^{(t+1)}_y \propto \sum_{i=1}^m Q^{(t)}_{i,y}$.
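A sketch of these two updates in NumPy; the initialization (k random examples as means), the fixed number of iterations, and the per-row shift that keeps the exponentials stable are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def em_soft_kmeans(X, k, n_iters=100, seed=0):
    """EM for a mixture of k unit-covariance Gaussians, theta = (c, {mu_1, ..., mu_k})."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    mus = X[rng.choice(m, size=k, replace=False)]        # assumed init: k random examples
    c = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: Q[i, y] proportional to c_y * exp(-0.5 * ||x_i - mu_y||^2), rows sum to 1
        sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        sq -= sq.min(axis=1, keepdims=True)               # per-row shift; cancels after normalization
        Q = c[None, :] * np.exp(-0.5 * sq)
        Q /= Q.sum(axis=1, keepdims=True)
        # M-step: mu_y = Q-weighted mean of the x_i, c_y = average responsibility of component y
        weights = Q.sum(axis=0)
        mus = (Q.T @ X) / weights[:, None]
        c = weights / m
    return c, mus
```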
Bayesian Reasoning
- So far, we treated $\theta$ as an unknown parameter.
- Bayesians treat uncertainty as randomness.
- Formally, think of $\theta$ as a random variable with prior probability $\mathbb{P}[\theta]$.
- The probability of $X$ given $S$ is
  $\mathbb{P}[X=x \mid S] = \sum_\theta \mathbb{P}[X=x \mid \theta, S]\,\mathbb{P}[\theta \mid S] = \sum_\theta \mathbb{P}[X=x \mid \theta]\,\mathbb{P}[\theta \mid S]$.
- Using Bayes' rule we obtain a posterior distribution
  $\mathbb{P}[\theta \mid S] = \frac{\mathbb{P}[S \mid \theta]\,\mathbb{P}[\theta]}{\mathbb{P}[S]}$.
- Therefore,
  $\mathbb{P}[X=x \mid S] = \frac{1}{\mathbb{P}[S]} \sum_\theta \mathbb{P}[X=x \mid \theta]\,\prod_{i=1}^m \mathbb{P}[X=x_i \mid \theta]\,\mathbb{P}[\theta]$.
- Example: suppose $\mathcal{X} = \{0,1\}$ and the prior on $\theta$ is uniform over $[0,1]$. Then
  $\mathbb{P}[X=x \mid S] \propto \int \theta^{\,x + \sum_i x_i}\,(1-\theta)^{\,1 - x + \sum_i (1-x_i)}\,d\theta$.
- Solving (using integration by parts) we obtain
  $\mathbb{P}[X=1 \mid S] = \frac{\left(\sum_i x_i\right) + 1}{m+2}$.
- Recall that maximum likelihood in this case gives $\mathbb{P}[X=1 \mid \hat{\theta}] = \frac{\sum_i x_i}{m}$.
- Therefore, the uniform prior behaves like maximum likelihood, except that it adds "pseudoexamples" to the training set (one extra one and one extra zero).

Maximum A-Posteriori
- In many situations, it is difficult to find a closed-form solution to the integral in the definition of $\mathbb{P}[X=x \mid S]$.
- A popular approximation is to find a single $\theta$ that maximizes $\mathbb{P}[\theta \mid S]$. This value is called the Maximum A-Posteriori (MAP) estimator.
- Once this value is found, we can calculate the probability that $X=x$ given the MAP estimator, independently of $S$.

Summary
- Generative approach: model the distribution over the data.
- Parametric density estimation: estimate the parameters characterizing the distribution.
- Rules: Maximum Likelihood, Bayesian estimation, Maximum A-Posteriori.
- Algorithms: Naive Bayes, LDA, EM.