Maximum-likelihood and Bayesian parameter estimation
Andrea Passerini ([email protected])
Machine Learning

Parameter estimation

Setting
Data are sampled from a probability distribution p(x, y). The form of the distribution p is known, but its parameters are unknown. A training set D = {(x_1, y_1), ..., (x_m, y_m)} of examples sampled i.i.d. according to p(x, y) is available.

Task
Estimate the unknown parameters of p from the training data D.

Note on i.i.d. sampling:
- independent: each example is sampled independently of the others
- identically distributed: all examples are sampled from the same distribution

Multiclass classification setting
The training set can be divided into subsets D_1, ..., D_c, one for each class (D_i = {x_1, ..., x_n} contains i.i.d. examples for target class y_i). For any new example x (not in the training set), we compute the posterior probability of the class given the example and the full training set D:

  P(y_i | x, D) = p(x | y_i, D) p(y_i | D) / p(x | D)

Note: this is the same as Bayesian decision theory (compute the posterior probability of the class given the example), except that the parameters of the distributions are unknown and a training set D is provided instead.

Multiclass classification setting: simplifications

  P(y_i | x, D) = p(x | y_i, D_i) p(y_i | D) / p(x | D)

- We assume x is independent of D_j (j ≠ i) given y_i and D_i.
- Without additional knowledge, p(y_i | D) can be computed as the fraction of examples of that class in the dataset.
- The normalizing factor p(x | D) can be computed by marginalizing p(x | y_i, D_i) p(y_i | D) over the possible classes.

Note: we must estimate class-dependent parameters θ_i for p(x | y_i, D_i).

Maximum likelihood vs Bayesian estimation

Maximum-likelihood / maximum a-posteriori estimation
- Assumes the parameters θ_i have fixed but unknown values.
- The values are computed as those maximizing the probability of the observed examples D_i (the training set for the class).
- The obtained values are used to compute probabilities for new examples: p(x | y_i, D_i) ≈ p(x | θ_i)

Bayesian estimation
- Assumes the parameters θ_i are random variables with some known prior distribution.
- Observing examples turns the prior distribution over the parameters into a posterior distribution.
- Predictions for new examples are obtained by integrating over all possible values of the parameters:

  p(x | y_i, D_i) = ∫ p(x, θ_i | y_i, D_i) dθ_i

Maximum a-posteriori (MAP) estimation

  θ_i* = argmax_{θ_i} p(θ_i | D_i, y_i) = argmax_{θ_i} p(D_i, y_i | θ_i) p(θ_i)

Assumes a prior distribution p(θ_i) for the parameters is available.

Maximum likelihood estimation (most common)

  θ_i* = argmax_{θ_i} p(D_i, y_i | θ_i)

Maximizes the likelihood of the parameters with respect to the training samples; no assumption is made about a prior distribution for the parameters.

Note: each class y_i is treated independently, so in the following we replace y_i, D_i with D for simplicity.
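Before detailing the estimators, a minimal NumPy sketch of the overall recipe in the multiclass setting above may help: estimate p(y_i | D) as class fractions, estimate the class-dependent parameters θ_i by maximum likelihood (here assuming univariate Gaussian class-conditional densities, purely for illustration), and plug everything into the posterior P(y_i | x, D). Function and variable names are illustrative and not part of the slides.

# Minimal sketch: plug-in multiclass classifier with ML-estimated Gaussian
# class-conditional densities (illustrative names; univariate features assumed).
import numpy as np

def fit_ml_gaussians(X, y):
    """Estimate per-class priors and Gaussian parameters by maximum likelihood."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),   # p(y_i|D): fraction of examples of class c
            "mean": Xc.mean(),           # ML estimate of the class mean
            "var": Xc.var(),             # ML (biased, 1/n) estimate of the variance
        }
    return params

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def posterior(x, params):
    """P(y_i|x, D): marginalize p(x|y_i, D_i) p(y_i|D) over classes to normalize."""
    joint = {c: gaussian_pdf(x, p["mean"], p["var"]) * p["prior"] for c, p in params.items()}
    evidence = sum(joint.values())       # p(x|D), the normalizing factor
    return {c: v / evidence for c, v in joint.items()}

# Toy data: two classes with different means
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1.0, 1.0, 30), rng.normal(2.0, 1.0, 20)])
y = np.array([0] * 30 + [1] * 20)
print(posterior(1.0, fit_ml_gaussians(X, y)))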
Maximum-likelihood (ML) estimation

Setting (again)
A training set D = {x_1, ..., x_n} of i.i.d. examples for the target class y is available. We assume the parameter vector θ has a fixed but unknown value, and we estimate that value by maximizing its likelihood with respect to the training data:

  θ* = argmax_θ p(D|θ) = argmax_θ ∏_{j=1}^n p(x_j|θ)

The joint probability over D decomposes into a product because the examples are i.i.d. (thus independent of each other given the distribution).

Maximizing the log-likelihood
It is usually simpler to maximize the logarithm of the likelihood (a monotonic transformation):

  θ* = argmax_θ ln p(D|θ) = argmax_θ ∑_{j=1}^n ln p(x_j|θ)

Necessary conditions for the maximum are obtained by zeroing the gradient with respect to θ:

  ∇_θ ∑_{j=1}^n ln p(x_j|θ) = 0

Points zeroing the gradient can be local or global maxima, depending on the form of the distribution.

Gaussian case: unknown µ, known Σ
The log-likelihood is:

  ∑_{j=1}^n ln p(x_j|θ) = ∑_{j=1}^n [ -1/2 (x_j - µ)^T Σ^{-1} (x_j - µ) - 1/2 ln((2π)^d |Σ|) ]

The gradient with respect to the mean is:

  ∇_µ ∑_{j=1}^n ln p(x_j|θ) = ∑_{j=1}^n Σ^{-1} (x_j - µ)

Setting the gradient to zero gives:

  ∑_{j=1}^n Σ^{-1} (x_j - µ*) = 0  ⇒  µ* = (1/n) ∑_{j=1}^n x_j

Univariate Gaussian case: unknown µ and σ²
The log-likelihood is:

  L = ∑_{j=1}^n [ -(x_j - µ)²/(2σ²) - 1/2 ln(2πσ²) ]

The gradient with respect to (µ, σ²) is:

  ∂L/∂µ  = ∑_{j=1}^n (x_j - µ)/σ²
  ∂L/∂σ² = ∑_{j=1}^n [ -1/(2σ²) + (x_j - µ)²/(2σ⁴) ]

Setting the gradient to zero gives the mean:

  µ* = (1/n) ∑_{j=1}^n x_j

and the variance:

  ∑_{j=1}^n 1/(2σ²) = ∑_{j=1}^n (x_j - µ*)²/(2σ⁴)
  n σ² = ∑_{j=1}^n (x_j - µ*)²
  σ² = (1/n) ∑_{j=1}^n (x_j - µ*)²

Multivariate Gaussian case: unknown µ and Σ
The log-likelihood is:

  ∑_{j=1}^n [ -1/2 (x_j - µ)^T Σ^{-1} (x_j - µ) - 1/2 ln((2π)^d |Σ|) ]

The maximum-likelihood estimates are:

  µ* = (1/n) ∑_{j=1}^n x_j
  Σ* = (1/n) ∑_{j=1}^n (x_j - µ*)(x_j - µ*)^T

General Gaussian case
Maximum-likelihood estimates for the Gaussian parameters are simply their empirical estimates over the samples: the Gaussian mean is the sample mean, and the Gaussian covariance matrix is the mean of the sample covariances.
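As a quick numerical check of these formulas, here is a minimal NumPy sketch (setup and names are illustrative only) that computes the maximum-likelihood estimates for a multivariate Gaussian and compares the 1/n covariance with NumPy's biased covariance estimator:

# Minimal sketch (assumed setup): the ML estimates for a multivariate Gaussian
# are the empirical mean and the 1/n (biased) covariance of the samples.
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=500)     # n x d data matrix

n = X.shape[0]
mu_ml = X.mean(axis=0)                                          # mu* = (1/n) sum_j x_j
centered = X - mu_ml
Sigma_ml = centered.T @ centered / n                            # Sigma* = (1/n) sum_j (x_j - mu*)(x_j - mu*)^T

print(mu_ml)                                                    # close to true_mu
print(Sigma_ml)                                                 # close to true_Sigma
print(np.allclose(Sigma_ml, np.cov(X, rowvar=False, ddof=0)))   # matches the biased covariance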
Bayesian estimation

Setting (again)
The parameters θ_i are random variables with some known prior distribution. Predictions for new examples are obtained by integrating over all possible values of the parameters:

  p(x|y_i, D_i) = ∫ p(x, θ_i | y_i, D_i) dθ_i

Since the probability of x given each class y_i is independent of the other classes y_j, for simplicity we can again write:

  p(x|y_i, D_i) → p(x|D) = ∫ p(x, θ|D) dθ

where D is the dataset for a certain class y and θ are the parameters of its distribution.

The integral decomposes as:

  p(x|D) = ∫ p(x, θ|D) dθ = ∫ p(x|θ) p(θ|D) dθ

p(x|θ) can be easily computed (we have both the form and the parameters of the distribution, e.g. Gaussian). We still need to estimate the parameter posterior density given the training set:

  p(θ|D) = p(D|θ) p(θ) / p(D)

The denominator p(D) is a constant independent of θ (i.e. it will not influence the final Bayesian decision). If the final probability (not only the decision) is needed, we can compute:

  p(D) = ∫ p(D|θ) p(θ) dθ

Univariate normal case: unknown µ, known σ²
Examples are drawn from:

  p(x|µ) ∼ N(µ, σ²)

The prior distribution over the Gaussian mean is itself normal:

  p(µ) ∼ N(µ_0, σ_0²)

The posterior over the Gaussian mean given the dataset is:

  p(µ|D) = p(D|µ) p(µ) / p(D) = α ∏_{j=1}^n p(x_j|µ) p(µ)

where α = 1/p(D) is independent of µ.

A-posteriori parameter density
Plugging in the two normal densities:

  p(µ|D) = α ∏_{j=1}^n [ 1/(√(2π) σ) exp(-1/2 ((x_j - µ)/σ)²) ] · 1/(√(2π) σ_0) exp(-1/2 ((µ - µ_0)/σ_0)²)
         = α' exp( -1/2 [ ∑_{j=1}^n ((µ - x_j)/σ)² + ((µ - µ_0)/σ_0)² ] )
         = α'' exp( -1/2 [ (n/σ² + 1/σ_0²) µ² - 2 ( (1/σ²) ∑_{j=1}^n x_j + µ_0/σ_0² ) µ ] )

This is the exponential of a quadratic function of µ, i.e. again a normal distribution:

  p(µ|D) = 1/(√(2π) σ_n) exp( -1/2 ((µ - µ_n)/σ_n)² )

Recovering the mean and variance
Equating the coefficients of the two quadratic forms in µ:

  (n/σ² + 1/σ_0²) µ² - 2 ( (1/σ²) ∑_{j=1}^n x_j + µ_0/σ_0² ) µ + α''' = µ²/σ_n² - 2 (µ_n/σ_n²) µ + µ_n²/σ_n²

Solving for µ_n and σ_n² we obtain:

  µ_n = ( n σ_0² / (n σ_0² + σ²) ) µ̂_n + ( σ² / (n σ_0² + σ²) ) µ_0
  σ_n² = σ_0² σ² / (n σ_0² + σ²)

where µ̂_n is the sample mean:

  µ̂_n = (1/n) ∑_{j=1}^n x_j

Interpreting the posterior
- The posterior mean µ_n is a linear combination of the prior mean (µ_0) and the sample mean (µ̂_n).
- The more training examples n are seen, the more the sample mean dominates over the prior mean (unless σ_0² = 0).
- The more training examples are seen, the more the variance decreases, making the distribution sharply peaked over its mean:

  lim_{n→∞} σ_n² = lim_{n→∞} σ_0² σ² / (n σ_0² + σ²) = 0

[Figure: mean posterior distribution p(µ|x_1, ..., x_n) for increasing sample sizes; the posterior becomes narrower and more sharply peaked around its mean as n grows.]

Conjugate priors

Definition
Given a likelihood function p(x|θ) and a prior distribution p(θ), p(θ) is a conjugate prior for p(x|θ) if the posterior distribution p(θ|x) is in the same family as the prior p(θ).

Examples:

  Likelihood            Parameters              Conjugate prior
  Binomial              p (probability)         Beta
  Multinomial           p (probability vector)  Dirichlet
  Normal                µ (mean)                Normal
  Multivariate normal   µ (mean vector)         Normal

Univariate normal case: computing the class-conditional density

  p(x|D) = ∫ p(x|µ) p(µ|D) dµ
         = ∫ 1/(√(2π) σ) exp(-1/2 ((x - µ)/σ)²) · 1/(√(2π) σ_n) exp(-1/2 ((µ - µ_n)/σ_n)²) dµ
         ∼ N(µ_n, σ² + σ_n²)

Note (proof omitted): the probability of x given the dataset for the class is a Gaussian with mean equal to the posterior mean, and variance equal to the sum of the known variance (σ²) and an additional variance (σ_n²) due to the uncertainty on the mean.

Multivariate normal case: unknown µ, known Σ
Generalization of the univariate case:

  p(x|µ) ∼ N(µ, Σ),  p(µ) ∼ N(µ_0, Σ_0)
  ⇒ p(µ|D) ∼ N(µ_n, Σ_n)
  ⇒ p(x|D) ∼ N(µ_n, Σ + Σ_n)
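The posterior update for the Gaussian mean is easy to verify numerically. A minimal sketch, assuming a known variance σ², a normal prior N(µ_0, σ_0²), and illustrative names and prior values throughout:

# Minimal sketch (illustrative names): Bayesian estimate of a Gaussian mean
# with known variance sigma^2 and a normal prior N(mu_0, sigma_0^2).
import numpy as np

def posterior_mean_params(x, sigma2, mu0, sigma02):
    """Return (mu_n, sigma_n^2) of the posterior p(mu|D)."""
    n = len(x)
    mu_hat = x.mean()                                          # sample mean
    mu_n = (n * sigma02 * mu_hat + sigma2 * mu0) / (n * sigma02 + sigma2)
    sigma_n2 = sigma02 * sigma2 / (n * sigma02 + sigma2)
    return mu_n, sigma_n2

rng = np.random.default_rng(0)
sigma2 = 1.0                                                   # known variance of p(x|mu)
x = rng.normal(3.0, np.sqrt(sigma2), size=20)

mu_n, sigma_n2 = posterior_mean_params(x, sigma2, mu0=0.0, sigma02=10.0)
# Predictive density for a new x: p(x|D) ~ N(mu_n, sigma^2 + sigma_n^2)
print(mu_n, sigma_n2, sigma2 + sigma_n2)
# As n grows, mu_n approaches the sample mean and sigma_n^2 shrinks toward 0.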
Sufficient statistics

Definition
Any function of a set of samples D is a statistic. A statistic s = φ(D) is sufficient for some parameters θ if:

  P(D|s, θ) = P(D|s)

If θ is a random variable, a sufficient statistic contains all the relevant information D has for estimating it:

  p(θ|D, s) = p(D|θ, s) p(θ|s) / p(D|s) = p(θ|s)

Use
A sufficient statistic allows us to compress a sample D into (possibly few) values. For instance, the sample mean and sample covariance are sufficient statistics for the true mean and covariance of a Gaussian distribution.

Bernoulli distribution

Setting
Boolean event: x = 1 for success, x = 0 for failure (e.g. tossing a coin). Parameter: θ = probability of success (e.g. heads).

Probability mass function:

  P(x|θ) = θ^x (1 - θ)^(1-x)

Beta conjugate prior:

  P(θ|ψ) = P(θ|α_h, α_t) = Γ(α) / (Γ(α_h) Γ(α_t)) · θ^(α_h - 1) (1 - θ)^(α_t - 1)

Maximum likelihood estimation: example
Dataset D = {H, H, T, T, T, H, H} of N realizations (e.g. head/tail coin-toss results). The likelihood function is:

  p(D|θ) = θ · θ · (1 - θ) · (1 - θ) · (1 - θ) · θ · θ = θ^h (1 - θ)^t

where h and t are the numbers of heads and tails. The maximum likelihood parameter is obtained from:

  ∂/∂θ ln p(D|θ) = 0  ⇒  ∂/∂θ [ h ln θ + t ln(1 - θ) ] = 0
  h/θ - t/(1 - θ) = 0
  h(1 - θ) = tθ
  θ = h / (h + t)

h, t are the sufficient statistics.

Bayesian estimation: example
The parameter posterior is proportional to:

  P(θ|D, ψ) ∝ P(D|θ) P(θ|ψ) ∝ θ^h (1 - θ)^t θ^(α_h - 1) (1 - θ)^(α_t - 1)

i.e. the posterior is a Beta distribution with parameters h + α_h and t + α_t:

  P(θ|D, ψ) ∝ θ^(h + α_h - 1) (1 - θ)^(t + α_t - 1)

The prediction for a new event (a success, x = 1) is:

  P(x|D) = ∫ P(x|θ) P(θ|D, ψ) dθ = ∫ θ P(θ|D, ψ) dθ = E_{P(θ|D,ψ)}[θ] = (h + α_h) / (h + t + α_h + α_t)

Interpreting priors
Our prior knowledge is encoded as a number α = α_h + α_t of imaginary experiments, in which we assume heads was observed α_h times. α is called the equivalent sample size. Letting α → 0 reduces the estimate to the classical ML (frequentist) approach.

Multinomial distribution

Setting
Categorical event with r states, x ∈ {x^1, ..., x^r} (e.g. tossing a six-faced die). One-hot encoding z(x) = [z_1(x), ..., z_r(x)] with z_k(x) = 1 if x = x^k, 0 otherwise. Parameters: θ = [θ_1, ..., θ_r], the probability of each state.

Probability mass function:

  P(x|θ) = ∏_{k=1}^r θ_k^(z_k(x))

Dirichlet conjugate prior:

  P(θ|ψ) = P(θ|α_1, ..., α_r) = Γ(α) / (∏_{k=1}^r Γ(α_k)) · ∏_{k=1}^r θ_k^(α_k - 1)

Maximum likelihood estimation: example
Dataset D of N realizations (e.g. results of tossing a die). The likelihood function is:

  p(D|θ) = ∏_{j=1}^N ∏_{k=1}^r θ_k^(z_k(x_j)) = ∏_{k=1}^r θ_k^(N_k)

where N_k is the number of realizations in state x^k. The maximum likelihood parameters are:

  θ_k = N_k / N

N_1, ..., N_r are the sufficient statistics.

Bayesian estimation: example
The parameter posterior is proportional to:

  P(θ|D, ψ) ∝ P(D|θ) P(θ|ψ) ∝ ∏_{k=1}^r θ_k^(N_k) θ_k^(α_k - 1)

i.e. the posterior is a Dirichlet distribution with parameters N_k + α_k, k = 1, ..., r:

  P(θ|D, ψ) ∝ ∏_{k=1}^r θ_k^(N_k + α_k - 1)

The prediction for a new event is:

  P(x^k|D) = ∫ θ_k P(θ|D, ψ) dθ = E_{P(θ|D,ψ)}[θ_k] = (N_k + α_k) / (N + α)
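Both the Bernoulli and the multinomial case boil down to counts plus pseudocounts. A minimal sketch (illustrative names; the prior values are arbitrary) comparing the ML estimate N_k/N with the Bayesian predictive (N_k + α_k)/(N + α):

# Minimal sketch (illustrative): ML vs Bayesian (conjugate-prior) estimates
# for Bernoulli/multinomial counts, using pseudocounts alpha_k.
import numpy as np

def ml_estimate(counts):
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()                                # theta_k = N_k / N

def bayes_predictive(counts, alphas):
    counts = np.asarray(counts, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return (counts + alphas) / (counts.sum() + alphas.sum())    # (N_k + alpha_k) / (N + alpha)

# Bernoulli example: D = {H, H, T, T, T, H, H} -> h = 4 heads, t = 3 tails
print(ml_estimate([4, 3]))                                      # h/(h+t), t/(h+t)
print(bayes_predictive([4, 3], [2, 2]))                         # pseudocounts alpha_h = alpha_t = 2

# Multinomial example: counts for a six-faced die, uniform Dirichlet prior alpha_k = 1
print(bayes_predictive([0, 2, 1, 3, 0, 4], np.ones(6)))         # zero counts get nonzero probability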