Maximum likelihood estimation
II.2 Statistical Inference:
Sampling and Estimation
A statistical model M is a set of distributions (or regression
functions), e.g., all uni-modal, smooth distributions.
M is called a parametric model if it can be completely described
by a finite number of parameters, e.g., the family of Normal
distributions with parameters μ, σ:

f_X(x; μ, σ) = 1/(√(2π)·σ) · e^(−(x−μ)² / (2σ²))   for μ ∈ ℝ, σ > 0

IR&DM, WS'11/12, October 25, 2011
Statistical Inference
Given a parametric model M and a sample X1,...,Xn,
how do we infer (learn) the parameters of M?
For multivariate models with observed variable X and
"outcome (response)" variable Y, this is called prediction
or regression; for a discrete outcome variable it is also
called classification.
r(x) = E[Y | X=x] is called the regression function.
Idea of Sampling
Distribution X (e.g., a population, objects of interest)
→ Samples X1,…,Xn drawn from X (e.g., people, objects)
Statistical inference: what can we say about X based on X1,…,Xn?

Parameters of the distribution: mean μ_X, variance σ_X², size N
Parameters of the sample: mean X̄, variance S_X², size n
Example:
Suppose we want to estimate the average salary of employees in
German companies.
• Sample 1: Suppose we look at n=200 top-paid CEOs of major banks.
• Sample 2: Suppose we look at n=100 employees across all kinds of companies.
(Sample 1 is heavily biased toward a small subpopulation; Sample 2 is far more representative.)
Basic Types of Statistical Inference
Given a set of iid. samples X1,...,Xn ~ X
from an unknown distribution X,
e.g., n single-coin-toss experiments X1,...,Xn ~ X: Bernoulli(p).
• Parameter Estimation
e.g.: what is the parameter p of X: Bernoulli(p)?
what is E[X], the cdf F_X of X, the pdf f_X of X, etc.?
• Confidence Intervals
e.g.: give me all values C=(a,b) such that P(p ∈ C) ≥ 0.95,
where a and b are derived from the samples X1,...,Xn
• Hypothesis Testing
e.g.: H0: p = 1/2   vs.   H1: p ≠ 1/2
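As a quick illustration of parameter estimation from iid. samples, the following sketch (plain Python; the true p and the sample size are made-up demo values) simulates n single-coin-toss experiments and estimates p by the sample mean:

```python
import random

random.seed(0)

# Simulate n single-coin-toss experiments X1,...,Xn ~ Bernoulli(p)
# with a true parameter p that the estimator does not get to see.
true_p = 0.5
n = 10_000
sample = [1 if random.random() < true_p else 0 for _ in range(n)]

# Point estimate of p: the sample mean (which is also the MLE, see below).
p_hat = sum(sample) / n

# For Bernoulli(p), E[X] = p, so p_hat also estimates E[X].
print(p_hat)
```

With n = 10,000 tosses the estimate lands close to the true p, as the consistency results on the next slides predict.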
Statistical Estimators
A point estimator for a parameter θ of a prob. distribution X is a
random variable θ̂_n derived from an iid. sample X1,...,Xn.
Examples:
Sample mean: X̄ := (1/n) · Σ_{i=1..n} X_i
Sample variance: S_X² := (1/(n−1)) · Σ_{i=1..n} (X_i − X̄)²

An estimator θ̂_n for parameter θ is unbiased if E[θ̂_n] = θ;
otherwise the estimator has bias E[θ̂_n] − θ.
An estimator on a sample of size n is consistent
if lim_{n→∞} P[|θ̂_n − θ| ≤ ε] = 1 for any ε > 0.
Sample mean and sample variance
are unbiased and consistent estimators of μ_X and σ_X².
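The unbiasedness of the sample mean and sample variance can be checked empirically. A small sketch (assuming a Uniform(0,1) population for the demo, so μ = 0.5 and σ² = 1/12) averages both estimators over many independent samples:

```python
import random

random.seed(1)

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    # Note the 1/(n-1) factor, which makes this estimator unbiased.
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Average the estimators over many independent samples from
# Uniform(0, 1), whose true mean is 0.5 and variance 1/12.
trials, n = 2000, 10
means, variances = [], []
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    means.append(sample_mean(xs))
    variances.append(sample_variance(xs))

avg_mean = sum(means) / trials
avg_var = sum(variances) / trials
```

Both averages come out close to the true μ and σ², illustrating that neither estimator is systematically off.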
Estimator Error
Let θ̂_n be an estimator for parameter θ over iid. samples X1,...,Xn.
The distribution of θ̂_n is called the sampling distribution.
The standard error for θ̂_n is: se(θ̂_n) := √(Var[θ̂_n])
The mean squared error (MSE) for θ̂_n is:
MSE(θ̂_n) := E[(θ̂_n − θ)²] = bias²(θ̂_n) + Var[θ̂_n]
Theorem: If bias → 0 and se → 0, then the estimator is consistent.
The estimator θ̂_n is asymptotically Normal if
(θ̂_n − θ) / se converges in distribution to the standard Normal N(0,1).
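The decomposition MSE = bias² + Var can be verified by simulation. The sketch below (illustrative setup: Uniform(0,1) data and the biased variance estimator that divides by n instead of n−1) estimates bias, variance, and MSE empirically:

```python
import random

random.seed(2)

# Empirically check MSE(theta_hat) = bias^2 + Var(theta_hat) for the
# biased variance estimator that divides by n instead of n - 1.
true_var = 1 / 12          # variance of Uniform(0, 1)
trials, n = 5000, 10
estimates = []
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    m = sum(xs) / n
    estimates.append(sum((x - m) ** 2 for x in xs) / n)  # 1/n version

mean_est = sum(estimates) / trials
bias = mean_est - true_var                                   # negative: underestimates
var_est = sum((e - mean_est) ** 2 for e in estimates) / trials
mse = sum((e - true_var) ** 2 for e in estimates) / trials   # = bias^2 + var
```

The empirical identity holds exactly up to floating-point noise, and the bias comes out negative, as expected for the 1/n estimator.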
Types of Estimation
• Nonparametric Estimation
No assumptions about the model M or the parameters θ of the
underlying distribution X
→ "Plug-in estimators" (e.g., histograms) to approximate X
• Parametric Estimation (Inference)
Requires assumptions about the model M and the parameters θ of
the underlying distribution X
Analytical or numerical methods for estimating θ:
→ Method-of-Moments estimator
→ Maximum Likelihood estimator
and Expectation Maximization (EM)
Nonparametric Estimation
The empirical distribution function F̂_n is the cdf that
puts probability mass 1/n at each data point X_i:

F̂_n(x) := (1/n) · Σ_{i=1..n} I(X_i ≤ x)
with I(X_i ≤ x) = 1 if X_i ≤ x, and 0 otherwise.

A statistical functional ("statistic") T(F) is any function of F,
e.g., mean, variance, skewness, median, quantiles, correlation.
The plug-in estimator of θ = T(F) is: θ̂_n := T(F̂_n)
→ Simply use F̂_n instead of F to calculate the statistic T of interest.
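A minimal sketch of the plug-in principle (the toy sample is chosen purely for illustration): build F̂_n from the data and read a statistic, here the median, off F̂_n instead of the unknown F:

```python
def ecdf(sample):
    """Empirical distribution function: puts mass 1/n at each X_i."""
    n = len(sample)
    return lambda x: sum(1 for xi in sample if xi <= x) / n

# Plug-in estimate of a statistical functional T(F): here the median,
# read off F_hat instead of the unknown true F.
sample = [1, 1, 2, 2, 2, 3, 3, 5, 7, 9]
F_hat = ecdf(sample)
plug_in_median = min(x for x in sample if F_hat(x) >= 0.5)
```

Here F̂_n(2) = 5/10 = 0.5, so the plug-in median is 2.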
Histograms as Density Estimators
Instead of the full empirical distribution, often compact data synopses
may be used, such as histograms, where X1,...,Xn are grouped into
m cells (buckets) c1,...,cm with bucket boundaries lb(c_i) and ub(c_i) s.t.
lb(c1) = −∞, ub(c_m) = ∞, and ub(c_i) = lb(c_{i+1}) for 1 ≤ i < m, and

freq_f(c_i) := f̂_n(x) = (1/n) · Σ_{v=1..n} I(lb(c_i) ≤ X_v < ub(c_i))
freq_F(c_i) := F̂_n(x) = (1/n) · Σ_{v=1..n} I(X_v ≤ ub(c_i))
Example (n = 20): the sample contains the value 1 twice, 2 three times,
3 five times, 4 four times, 5 three times, 6 twice, and 7 once
(X1 = 1, X2 = 1, X3 = 2, X4 = 2, X5 = 2, X6 = 3, …, X20 = 7),
giving relative frequencies 2/20, 3/20, 5/20, 4/20, 3/20, 2/20, 1/20 for x = 1,…,7.

μ̂_n = (2·1 + 3·2 + 5·3 + 4·4 + 3·5 + 2·6 + 1·7) / 20 = 3.65

[Figure: bar plot of the histogram density f_X(x) over x = 1,…,7.]
Histograms provide a (discontinuous) density estimator.
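The slide's 20-point example can be reproduced directly; the sketch below builds the unit-width histogram and recovers the estimated mean of 3.65:

```python
# Reconstruct the slide's 20-point example: value v occurs counts[v] times.
counts = {1: 2, 2: 3, 3: 5, 4: 4, 5: 3, 6: 2, 7: 1}
sample = [v for v, c in counts.items() for _ in range(c)]
n = len(sample)  # 20

# Histogram with unit-width buckets [v, v+1): relative frequencies
# freq_f(c_i) = (1/n) * #{X_v falling into bucket c_i}.
freq = {v: sample.count(v) / n for v in counts}

# Mean of the histogram density matches the slide: 3.65.
mu_hat = sum(v * f for v, f in freq.items())
```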
Parametric Inference (1):
Method of Moments
Suppose parameter θ = (θ1,…,θk) has k components.

j-th moment: α_j = α_j(θ) := E_θ[X^j] = ∫ x^j f_X(x) dx
j-th sample moment: α̂_j := (1/n) · Σ_{i=1..n} X_i^j   for 1 ≤ j ≤ k

Estimate parameter θ by the method-of-moments estimator θ̂_n s.t.
α_1(θ̂_n) = α̂_1  and  α_2(θ̂_n) = α̂_2  and … and  α_k(θ̂_n) = α̂_k
(for the first k moments).
→ Solve the equation system with k equations and k unknowns.
Method-of-moments estimators are usually consistent and
asymptotically Normal, but may be biased.
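For a Normal(μ, σ²) model, k = 2 and the system α_1 = μ, α_2 = σ² + μ² can be solved in closed form. A simulation sketch (the true parameters are arbitrary demo values):

```python
import random

random.seed(3)

# Method-of-moments for X ~ Normal(mu, sigma^2), k = 2 parameters:
#   alpha_1(theta) = mu              matched to  a1 = (1/n) sum X_i
#   alpha_2(theta) = sigma^2 + mu^2  matched to  a2 = (1/n) sum X_i^2
xs = [random.gauss(10.0, 2.0) for _ in range(50_000)]
n = len(xs)
a1 = sum(xs) / n
a2 = sum(x * x for x in xs) / n

# Solve the two-equation system for the two unknowns.
mu_mom = a1
var_mom = a2 - a1 * a1
```

Here the method-of-moments solution coincides with the MLE derived on a later slide; in general the two estimators can differ.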
Parametric Inference (2):
Maximum Likelihood Estimators (MLE)
Let X1,...,Xn be iid. with pdf f(x;θ).
Estimate the parameter θ of a postulated distribution f(x;θ) such that
the likelihood that the sample values x1,...,xn are generated by
this distribution is maximized.
→ Maximum likelihood estimation:
Maximize L(x1,...,xn; θ) ≈ P[x1,...,xn originate from f(x;θ)],
usually formulated as L_n(θ) = ∏_i f(X_i; θ),
or (alternatively) maximize l_n(θ) = log L_n(θ).
The value θ̂_n that maximizes L_n(θ) is the MLE of θ.
If analytically intractable → use numerical iteration methods.
Simple Example for
Maximum Likelihood Estimator
Given:
• Coin toss experiment (Bernoulli distribution) with
unknown parameter p for seeing heads, 1−p for tails
• Sample (data): h times heads in n coin tosses
Want: maximum likelihood estimate of p

L(h, n, p) = ∏_{i=1..n} f(X_i; p) = ∏_{i=1..n} p^{X_i} (1−p)^{1−X_i} = p^h (1−p)^{n−h}
with h = Σ_i X_i

Maximize the log-likelihood function:
log L(h, n, p) = h·log(p) + (n−h)·log(1−p)

∂ log L / ∂p = h/p − (n−h)/(1−p) = 0   ⇒   p̂ = h/n
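The closed-form result p̂ = h/n can be cross-checked by maximizing the log-likelihood numerically over a grid (h and n are made-up demo values):

```python
import math

# Log-likelihood of h heads in n tosses: h*log(p) + (n-h)*log(1-p).
def log_L(p, h, n):
    return h * math.log(p) + (n - h) * math.log(1 - p)

h, n = 7, 10
# Grid search over (0, 1) confirms the analytic maximizer p_hat = h/n.
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=lambda p: log_L(p, h, n))
```

Because the log-likelihood is concave in p, the grid maximizer sits at the grid point closest to h/n.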
MLE for Parameters
of Normal Distributions
 1 
2
L( x1 ,..., xn ,  ,  )  

 2  
n
n
e

( xi   ) 2
2 2
i 1
 ln( L )
1 n

2(
x


)

0

i

2 2 i 1
 ln( L )
 2

n
2 2
1 n
 ̂   xi
n i 1
IR&DM, WS'11/12

1
n
( xi   )2  0

2 4 i 1
n
1
ˆ 2   ( xi  ˆ ) 2
n i 1
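A quick numerical check of the closed-form Normal MLE (the true parameters are arbitrary demo values; note the 1/n factor in σ̂², which differs from the unbiased 1/(n−1) sample variance):

```python
import random

random.seed(4)

# Closed-form MLE for Normal data: mu_hat = sample mean,
# sigma2_hat = (1/n) * sum (x_i - mu_hat)^2   (1/n, not 1/(n-1)).
xs = [random.gauss(-3.0, 1.5) for _ in range(20_000)]
n = len(xs)
mu_hat = sum(xs) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
```

Both estimates land close to the true μ = −3 and σ² = 2.25, consistent with the MLE properties on the next slide.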
MLE Properties
Maximum Likelihood estimators are
consistent, asymptotically Normal, and
asymptotically optimal (i.e., efficient) in the following sense:
Consider two estimators U and T which are asymptotically Normal.
Let σ_u² and σ_t² denote the variances of the two Normal distributions
to which U and T converge in probability.
The asymptotic relative efficiency of U to T is ARE(U,T) := σ_t²/σ_u².
Theorem: For an MLE θ̂_n and any other estimator θ̃_n,
the following inequality holds: ARE(θ̃_n, θ̂_n) ≤ 1.
That is, among all estimators, the MLE has the smallest variance.
Bayesian Viewpoint of Parameter Estimation
• Assume a prior distribution g(θ) of parameter θ
• Choose a statistical model (generative model) f(x | θ)
that reflects our beliefs about RV X
• Given RVs X1,...,Xn for the observed data,
the posterior distribution is h(θ | x1,...,xn)

For X1 = x1, ..., Xn = xn the likelihood is
L(x1,...,xn, θ) = ∏_{i=1..n} f(x_i | θ) = ∏_{i=1..n} [ h(θ | x_i) · Σ_{θ'} f(x_i | θ') g(θ') / g(θ) ]
which implies
h(θ | x1,...,xn) ∝ L(x1,...,xn, θ) · g(θ)
(posterior is proportional to likelihood times prior).

MAP estimator (maximum a posteriori):
compute the θ that maximizes h(θ | x1,…,xn), given a prior for θ.
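For the Bernoulli case, a Beta(a, b) prior is conjugate, so the posterior is Beta(a+h, b+n−h) and the MAP estimate is its mode in closed form. A sketch (the prior and data values are illustrative):

```python
# MAP estimation for Bernoulli(p) with a conjugate Beta(a, b) prior
# (a, b chosen here purely for illustration). The posterior is
# Beta(a + h, b + n - h), whose mode gives the MAP estimate.
a, b = 2.0, 2.0          # prior pseudo-counts: one extra head, one extra tail
h, n = 7, 10             # observed data: 7 heads in 10 tosses

p_map = (a + h - 1) / (a + b + n - 2)   # mode of the Beta posterior
p_mle = h / n                           # the MLE ignores the prior
```

The prior pulls the MAP estimate toward 1/2 relative to the MLE; with more data the two converge.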
Analytically Non-tractable MLE for parameters
of Multivariate Normal Mixture
Consider samples from a k-mixture of m-dimensional Normal distributions
with the density (e.g., height and weight of males and females):

f(x⃗, μ⃗1,…,μ⃗k, Σ1,…,Σk, π1,…,πk)
= Σ_{j=1..k} π_j · n(x⃗, μ⃗_j, Σ_j)
= Σ_{j=1..k} π_j · (1/√((2π)^m |Σ_j|)) · e^(−(1/2)·(x⃗−μ⃗_j)ᵀ Σ_j⁻¹ (x⃗−μ⃗_j))

with expectation vectors μ⃗_j
and invertible, positive definite, symmetric
m×m covariance matrices Σ_j.

→ Maximize the log-likelihood function:
log L(x1,…,xn, θ) := log ∏_{i=1..n} P[x_i | θ] = Σ_{i=1..n} log ( Σ_{j=1..k} π_j n(x_i, μ⃗_j, Σ_j) )
Expectation-Maximization Method (EM)
Key idea:
When L(X1,...,Xn, θ) (where the X_i and θ are possibly multivariate)
is analytically intractable, then
• introduce latent (i.e., hidden, invisible, missing) random variable(s) Z
such that
• the joint distribution J(X1,...,Xn, Z, θ) of the "complete" data
is tractable (often with Z actually being multivariate: Z1,...,Zm),
• iteratively derive the expected complete-data likelihood by integrating J:
E_{Z|X,θ}[J(X1,…,Xn, Z, θ)]
and find the best θ: θ̂ = arg max_θ Σ_z J[X1,...,Xn, θ | Z=z] · P[Z=z]
EM Procedure
Initialization: choose a start estimate θ^(0)
(e.g., using the Method-of-Moments estimator).
Iterate (t = 0, 1, …) until convergence:
E step (expectation):
estimate the posterior probability of Z: P[Z | X1,…,Xn, θ^(t)],
assuming θ were known and equal to the previous estimate θ^(t),
and compute E_{Z|X,θ^(t)}[log J(X1,…,Xn, Z, θ)]
by integrating over values for Z.
M step (maximization, MLE step):
estimate θ^(t+1) by maximizing
θ^(t+1) = arg max_θ E_{Z|X,θ^(t)}[log J(X1,…,Xn, Z, θ)]
→ Convergence is guaranteed
(because the E step computes a lower bound of the true L function,
and the M step yields monotonically non-decreasing likelihood),
but may result in a local maximum of the (log-)likelihood function.
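The procedure can be sketched end-to-end for a k = 2, m = 1 Normal mixture (synthetic data and start estimates chosen for the demo; a fixed iteration count stands in for a convergence test):

```python
import math
import random

random.seed(5)

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Synthetic data from a known 2-component mixture (k = 2, m = 1).
data = ([random.gauss(0.0, 1.0) for _ in range(300)] +
        [random.gauss(5.0, 1.0) for _ in range(300)])

# Start estimates theta^(0): crude data-driven guesses.
pi = [0.5, 0.5]
mu = [min(data), max(data)]
var = [1.0, 1.0]

for _ in range(50):                       # iterate until (near) convergence
    # E step: responsibilities h_ij = P[Z_ij = 1 | x_i, theta^(t)]
    h = []
    for x in data:
        w = [pi[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)]
        s = sum(w)
        h.append([wj / s for wj in w])
    # M step: re-estimate theta^(t+1) from the responsibilities
    for j in range(2):
        nj = sum(h[i][j] for i in range(len(data)))
        mu[j] = sum(h[i][j] * data[i] for i in range(len(data))) / nj
        var[j] = sum(h[i][j] * (data[i] - mu[j]) ** 2
                     for i in range(len(data))) / nj
        pi[j] = nj / len(data)

means = sorted(mu)
```

With well-separated components the estimates converge to the true means (0 and 5); with a poor initialization, EM can instead get stuck in a local maximum, as the slide warns.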
EM Example for Multivariate
Normal Mixture
Expectation step (E step):
h_ij := P[Z_ij = 1 | x_i, θ^(t)] = π_j · P[x_i | n_j(θ^(t))] / Σ_{l=1..k} π_l · P[x_i | n_l(θ^(t))]
where Z_ij = 1 if the i-th data point x_i was generated
by the j-th component, and 0 otherwise.

Maximization step (M step):
E_{Z|X,θ}[log J(X1,…,Xn, Z, θ)] = Σ_i log Σ_j n_j(x_i | Z_ij = 1) · P[Z_ij = 1]

μ⃗_j := Σ_{i=1..n} h_ij x⃗_i / Σ_{i=1..n} h_ij
Σ_j := Σ_{i=1..n} h_ij (x⃗_i − μ⃗_j)(x⃗_i − μ⃗_j)ᵀ / Σ_{i=1..n} h_ij
π_j := Σ_{i=1..n} h_ij / Σ_{j=1..k} Σ_{i=1..n} h_ij

(these become the new parameters θ^(t+1))

See L. Wasserman, p. 121 ff. for k=2, m=1.
Confidence Intervals
An estimator T for an interval for parameter θ satisfies
P[T − a ≤ θ ≤ T + a] ≥ 1 − α.
[T−a, T+a] is the confidence interval and 1−α is the confidence level.

For the distribution of a random variable X, a value
x_γ (0 < γ < 1) with P[X ≤ x_γ] ≥ γ and P[X ≥ x_γ] ≥ 1 − γ
is called a γ-quantile; the 0.5-quantile is called the median.
For the Normal distribution N(0,1) the γ-quantile is denoted φ_γ.

→ For a given a or α, find a value z of N(0,1)
that denotes the [T−a, T+a] confidence interval
or a corresponding γ-quantile for 1−α.
Confidence Intervals for Expectations (1)
Let X1,...,Xn be a sample from a distribution with unknown
expectation μ and known variance σ².
For sufficiently large n, the sample mean X̄ is N(μ, σ²/n) distributed,
and (X̄ − μ)·√n / σ is N(0,1) distributed:

P[−z ≤ (X̄ − μ)·√n / σ ≤ z] = Φ(z) − Φ(−z) = Φ(z) − (1 − Φ(z)) = 2Φ(z) − 1
⇒ P[X̄ − z·σ/√n ≤ μ ≤ X̄ + z·σ/√n] = 2Φ(z) − 1
⇒ P[X̄ − φ_{1−α/2}·σ/√n ≤ μ ≤ X̄ + φ_{1−α/2}·σ/√n] = 1 − α

For a given confidence interval [X̄ − a, X̄ + a]:
set z := a·√n / σ, then look up Φ(z) to find 1−α.
For a given confidence level 1−α:
set z := (1 − α/2)-quantile of N(0,1), then a := z·σ/√n.
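The recipe "given 1−α, set z to the (1−α/2)-quantile and a := z·σ/√n" in code (all numbers are demo inputs; z = 1.96 is the familiar 0.975-quantile of N(0,1)):

```python
import math

# Two-sided confidence interval for mu with known sigma:
# a := z * sigma / sqrt(n), interval [mean - a, mean + a].
def normal_ci(sample_mean, sigma, n, z):
    a = z * sigma / math.sqrt(n)
    return (sample_mean - a, sample_mean + a)

# z = 1.96 is the 0.975-quantile of N(0,1), i.e., a 95% interval.
lo, hi = normal_ci(sample_mean=100.0, sigma=15.0, n=36, z=1.96)
```

Here a = 1.96 · 15/6 = 4.9, so the 95% interval is [95.1, 104.9].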
Confidence Intervals for Expectations (2)
Let X1,...,Xn be an iid. sample from a distribution X with unknown
expectation μ and unknown variance σ², and sample variance S².
For sufficiently large n, the random variable

T := (X̄ − μ)·√n / S

has a t distribution (Student distribution) with n−1 degrees of freedom:

f_{T,n}(t) = Γ((n+1)/2) / (Γ(n/2)·√(nπ)) · (1 + t²/n)^(−(n+1)/2)

with the Gamma function Γ(x) = ∫_0^∞ e^(−t) t^(x−1) dt for x > 0
(with the properties Γ(1) = 1 and Γ(x+1) = x·Γ(x)).

⇒ P[X̄ − t_{n−1,1−α/2}·S/√n ≤ μ ≤ X̄ + t_{n−1,1−α/2}·S/√n] = 1 − α
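The same interval with unknown σ, using S and a t quantile (demo data; the quantile t_{9, 0.975} = 2.262 is taken from a t table, with n−1 = 9 degrees of freedom):

```python
import math

# t-based interval for mu with unknown sigma: use the sample standard
# deviation S and the t_{n-1, 1-alpha/2} quantile instead of a Normal one.
def t_ci(xs, t_quantile):
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    a = t_quantile * s / math.sqrt(n)
    return (mean - a, mean + a)

xs = [4.8, 5.2, 5.0, 4.9, 5.3, 5.1, 4.7, 5.0, 5.2, 4.8]
# t_{9, 0.975} = 2.262 (from a t table): 95% interval for mu.
lo, hi = t_ci(xs, t_quantile=2.262)
```

For the same confidence level, the t quantile exceeds the Normal one, so the interval is slightly wider, reflecting the extra uncertainty from estimating σ.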
Summary of Section II.2
• Quality measures for statistical estimators
• Nonparametric vs. parametric estimation
• Histograms as generic (nonparametric) plug-in estimators
• Method-of-Moments estimator good initial guess but may be biased
• Maximum-Likelihood estimator & Expectation Maximization
• Confidence intervals for parameters
Normal Distribution Table
Student's t Distribution Table