Model Inference and Averaging
Prof. Liqing Zhang
Dept. Computer Science & Engineering,
Shanghai Jiaotong University
Contents
• The Bootstrap and Maximum Likelihood Methods
• Bayesian Methods
• Relationship Between the Bootstrap and Bayesian Inference
• The EM Algorithm
• MCMC for Sampling from the Posterior
• Bagging
• Model Averaging and Stacking
• Stochastic Search: Bumping
Bootstrap by Basis Expansions
• Consider a linear expansion

  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$

• The least-squares solution

  $\hat\beta = (H^T H)^{-1} H^T y$

• The covariance of β̂

  $\widehat{\mathrm{cov}}(\hat\beta) = (H^T H)^{-1}\,\hat\sigma^2, \qquad \hat\sigma^2 = \sum_i \big(y_i - \hat\mu(x_i)\big)^2 / N$
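The two displays above can be checked numerically. Below is a minimal Python sketch (the polynomial basis h_j and the simulated data are illustrative choices, not taken from the slides) that computes the least-squares coefficients, the plug-in covariance estimate, and a nonparametric bootstrap approximation to that covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): y = sin(x) + noise
N = 50
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(x) + rng.normal(scale=0.3, size=N)

def basis(x):
    # h_j(x): a simple polynomial basis, one illustrative choice
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

H = basis(x)

# Least-squares solution: beta_hat = (H^T H)^{-1} H^T y
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)

# Plug-in covariance: cov(beta_hat) = (H^T H)^{-1} * sigma_hat^2
sigma2_hat = np.sum((y - H @ beta_hat) ** 2) / N
cov_beta = np.linalg.inv(H.T @ H) * sigma2_hat

# Nonparametric bootstrap: refit on resampled (x_i, y_i) pairs
B = 1000
boot_betas = np.empty((B, H.shape[1]))
for b in range(B):
    idx = rng.integers(0, N, N)
    Hb, yb = H[idx], y[idx]
    boot_betas[b] = np.linalg.solve(Hb.T @ Hb, Hb.T @ yb)

print("plug-in cov:\n", cov_beta)
print("bootstrap cov:\n", np.cov(boot_betas, rowvar=False))
```

With enough bootstrap replications the two covariance estimates agree closely, which is the point of the comparison.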
Parametric Model
• Assume a parameterized probability density (parametric model) for the observations.
Maximum Likelihood Inference
• Suppose we are trying to measure the true value of some quantity ($x_T$).
  – We make repeated measurements of this quantity {x1, x2, ..., xn}.
  – The standard way to estimate $x_T$ from our measurements is to calculate the mean value

    $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$

    and set $x_T = \bar{x}$.
• DOES THIS PROCEDURE MAKE SENSE? The maximum likelihood method (MLM) answers this question and provides a general method for estimating parameters of interest from data.
The Maximum Likelihood Method
• Statement of the Maximum Likelihood Method
  – Assume we have made N measurements of x: {x1, x2, ..., xn}.
  – Assume we know the probability distribution function that describes x: f(x, α).
  – Assume we want to determine the parameter α.
• MLM: pick α to maximize the probability of getting the measurements (the xi's) we did!
The MLM Implementation
• The probability of measuring x1 is $f(x_1, \alpha)\,dx$
• The probability of measuring x2 is $f(x_2, \alpha)\,dx$
• The probability of measuring xn is $f(x_n, \alpha)\,dx$
• If the measurements are independent, the probability of getting the measurements we did is:

  $L = f(x_1,\alpha)\,dx \cdot f(x_2,\alpha)\,dx \cdots f(x_n,\alpha)\,dx = f(x_1,\alpha)\, f(x_2,\alpha)\cdots f(x_n,\alpha)\,[dx^n]$

• We can drop the $dx^n$ term as it is only a proportionality constant.
• L is called the likelihood function:

  $L = \prod_{i=1}^{N} f(x_i, \alpha)$
Log Maximum Likelihood Method
• We want to pick the α that maximizes L:

  $\left.\frac{\partial L}{\partial\alpha}\right|_{\alpha=\alpha^*} = 0$

  – Often easier to maximize ln L.
  – L and ln L are both maximized at the same location.
• We maximize ln L rather than L itself because ln L converts the product into a summation:

  $\ln L = \sum_{i=1}^{N} \ln f(x_i, \alpha)$
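Beyond turning the product into a sum, the log is also what makes the likelihood usable in floating-point arithmetic. The sketch below (assuming a Gaussian f(x, α) and simulated data, both illustrative) shows the raw product of densities underflowing to zero while the sum of log-densities stays finite.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=2000)     # simulated measurements

alpha = 2.0                                        # candidate parameter value
densities = norm.pdf(x, loc=alpha, scale=1.0)

L = np.prod(densities)             # raw likelihood: underflows to 0.0 here
logL = np.sum(np.log(densities))   # log-likelihood: a finite, usable number

print(L, logL)
```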
Log Maximum Likelihood Method
• The new maximization condition is:

  $\left.\frac{\partial \ln L}{\partial\alpha}\right|_{\alpha=\alpha^*} = \sum_{i=1}^{N}\left.\frac{\partial \ln f(x_i,\alpha)}{\partial\alpha}\right|_{\alpha=\alpha^*} = 0$

• α could be an array of parameters (e.g. slope and intercept) or just a single variable.
• The equations that determine α range from simple linear equations to coupled non-linear equations.
An Example: Gaussian
• Let f(x, μ) be given by a Gaussian distribution function.
• Let α = μ be the mean of the Gaussian. We want to use our data + MLM to find the mean.
• We want the best estimate of μ from our set of n measurements {x1, x2, ..., xn}.
• Let's assume that σ is the same for each measurement.
An Example: Gaussian
• Gaussian PDF:

  $f(x_i, \mu) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

• The likelihood function for this problem is:

  $L = \prod_{i=1}^{n} f(x_i,\mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}}$
An Example: Gaussian
$\ln L = \ln \prod_{i=1}^{n} f(x_i,\mu) = \ln\!\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}}\right] = n \ln\!\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$

• We want to find the μ that maximizes the log-likelihood function:

  $\frac{\partial \ln L}{\partial\mu} = \frac{\partial}{\partial\mu}\!\left[n \ln\!\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right] = 0$

  $-\frac{1}{2\sigma^2}\sum_{i=1}^{n} 2(x_i-\mu)(-1) = 0 \;\Rightarrow\; \sum_{i=1}^{n}(x_i-\mu) = 0 \;\Rightarrow\; \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
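A quick numerical check of this derivation: the sketch below (simulated Gaussian data with a known σ, all values illustrative) maximizes the log-likelihood over μ with a scalar optimizer and confirms that the result matches the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # n measurements, sigma assumed known
sigma = 2.0

def neg_log_likelihood(mu):
    # -ln L(mu) for fixed, known sigma
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

res = minimize_scalar(neg_log_likelihood)
print("MLE of mu:   ", res.x)
print("sample mean: ", x.mean())   # identical up to optimizer tolerance
```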
An Example: Gaussian
• If σi is different for each data point, then μ is the weighted average:

  $\mu = \frac{\sum_{i=1}^{n} x_i/\sigma_i^2}{\sum_{i=1}^{n} 1/\sigma_i^2}$   (Weighted Average)
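A minimal sketch of this inverse-variance weighted average (the measurement values and per-point σi below are illustrative):

```python
import numpy as np

# measurements x_i with different known uncertainties sigma_i (illustrative values)
x = np.array([10.2, 9.8, 10.5, 10.1])
sigma = np.array([0.5, 0.3, 1.0, 0.4])

# mu_hat = sum(x_i / sigma_i^2) / sum(1 / sigma_i^2)
w = 1.0 / sigma**2
mu_hat = np.sum(w * x) / np.sum(w)
print("weighted average:", mu_hat)
```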
An Example: Poisson
• Let f(x, μ) be given by a Poisson distribution.
• Let α = μ be the mean of the Poisson.
• We want the best estimate of μ from our set of n measurements {x1, x2, ..., xn}.
• Poisson PDF:

  $f(x, \mu) = \frac{e^{-\mu}\mu^{x}}{x!}$
An Example: Poisson
• The likelihood function for this problem is:

  $L = \prod_{i=1}^{n} f(x_i,\mu) = \prod_{i=1}^{n} \frac{e^{-\mu}\mu^{x_i}}{x_i!} = \frac{e^{-\mu}\mu^{x_1}}{x_1!}\cdot\frac{e^{-\mu}\mu^{x_2}}{x_2!}\cdots\frac{e^{-\mu}\mu^{x_n}}{x_n!} = \frac{e^{-n\mu}\,\mu^{\sum_{i=1}^{n} x_i}}{x_1!\,x_2!\cdots x_n!}$
An Example: Poisson
• Find the μ that maximizes the log-likelihood function:

  $\frac{d \ln L}{d\mu} = \frac{d}{d\mu}\!\left[-n\mu + \ln\mu \sum_{i=1}^{n} x_i - \ln(x_1!\,x_2!\cdots x_n!)\right] = -n + \frac{1}{\mu}\sum_{i=1}^{n} x_i = 0$

  $\Rightarrow\; \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$   (the average)
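The same numerical check works for the Poisson case: maximizing the log-likelihood over μ recovers the sample average. The counts below are simulated with an illustrative mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(3)
x = rng.poisson(lam=4.0, size=500)   # n count measurements

def neg_log_likelihood(mu):
    return -np.sum(poisson.logpmf(x, mu))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print("MLE of mu:   ", res.x)
print("sample mean: ", x.mean())
```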
General properties of MLM
• For large data samples (large n) the likelihood
function, L, approaches a Gaussian distribution.
• Maximum likelihood estimates are usually
consistent.
– For large n the estimates converge to the true value of
the parameters we wish to determine.
• Maximum likelihood estimates are usually
unbiased.
– For all sample sizes the parameter of interest is
calculated correctly.
General properties of MLM
• Maximum likelihood estimate is efficient: the
estimate has the smallest variance.
• Maximum likelihood estimate is sufficient: it
uses all the information in the observations
(the xi’s).
• The solution from MLM is unique.
• Bad news: we must know the correct
probability distribution for the problem at
hand!
Maximum Likelihood
• We maximize the likelihood function
• Log-likelihood function
Score Function
• Assess the precision of θ̂ using the likelihood function.
• Assume that L takes its maximum in the interior of the parameter space. Then
Likelihood Function
• We maximize the likelihood function
• We omit the normalization constant since it only adds a constant factor
• Think of L as a function of θ with Z fixed
• Log-likelihood function
Fisher Information
• Negative sum of second derivatives is the
information matrix
• This is called the observed information; it should be greater than 0.
• The Fisher information (expected information) is
Sampling Theory
• Basic result of sampling theory:
• The sampling distribution of the maximum-likelihood estimator approaches the following normal distribution as N → ∞, when we sample independently from the model
• This suggests approximating the sampling distribution with
Error Bound
• The corresponding error estimates are
obtained from
• The confidence points have the form
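Putting the last three slides together for the earlier Poisson example: the sketch below evaluates the observed information (the negative second derivative of ln L) at the MLE, takes the inverse square root as the standard error, and forms approximate 95% confidence points μ̂ ± z·se. All numerical choices are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.poisson(lam=4.0, size=500)

mu_hat = x.mean()                     # Poisson MLE

# Observed information: -d^2(ln L)/dmu^2 = sum(x_i) / mu^2, evaluated at mu_hat
info = np.sum(x) / mu_hat**2          # equals n / mu_hat at the MLE
se = 1.0 / np.sqrt(info)

z = norm.ppf(0.975)                   # normal quantile for a 95% interval
print(f"mu_hat = {mu_hat:.3f},  se = {se:.3f}")
print(f"95% confidence interval: ({mu_hat - z*se:.3f}, {mu_hat + z*se:.3f})")
```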
Simplified form of the Fisher information
Suppose, in addition, that the operations of integration and
differentiation can be swapped for the second derivative of f(x;θ) as
well, i.e.,
In this case, it can be shown that the Fisher information equals
The Cramér–Rao bound can then be written as
Single-parameter proof
Let X be a random variable with probability density function f(x;θ). Here T = t(X) is a statistic, which is used as an estimator for ψ(θ). If the expectation of T is denoted by ψ(θ), then, for all θ,
If V is the score, i.e.
then the expectation of V, written E(V), is zero. If we consider the covariance cov(V,T) of V and T, we have cov(V,T) = E(VT), because E(V) = 0. Expanding this expression we have
Using the chain rule
and the definition of expectation gives, after cancelling f(x;θ),
because the integration and differentiation operations
commute (second condition).
The Cauchy-Schwarz inequality shows that
Therefore
An Example
• Consider a linear expansion

  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$

• The least-squares solution

  $\hat\beta = (H^T H)^{-1} H^T y$

• The covariance of β̂

  $\widehat{\mathrm{cov}}(\hat\beta) = (H^T H)^{-1}\,\hat\sigma^2, \qquad \hat\sigma^2 = \sum_i \big(y_i - \hat\mu(x_i)\big)^2 / N$
An Example
Consider the prediction model

  $\hat\mu(x) = \sum_{j=1}^{N} \hat\beta_j h_j(x)$

The standard deviation

  $se[\hat\mu(x)] = \big[h(x)^T (H^T H)^{-1} h(x)\big]^{1/2}\,\hat\sigma$

• The confidence region

  $\hat\mu(x) \pm 1.96\, se[\hat\mu(x)]$
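A sketch of the pointwise standard error and the ±1.96 band, repeating the illustrative basis-expansion setup from the earlier code sketch (polynomial basis and simulated data are again assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(x) + rng.normal(scale=0.3, size=N)

def basis(x):
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

H = basis(x)
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)
sigma_hat = np.sqrt(np.sum((y - H @ beta_hat) ** 2) / N)

# se[mu_hat(x)] = [h(x)^T (H^T H)^{-1} h(x)]^{1/2} * sigma_hat, on a grid of x values
x_grid = np.linspace(0, 3, 100)
Hg = basis(x_grid)
HtH_inv = np.linalg.inv(H.T @ H)
mu_hat = Hg @ beta_hat
se = np.sqrt(np.einsum("ij,jk,ik->i", Hg, HtH_inv, Hg)) * sigma_hat

# Approximate 95% pointwise confidence band
lower, upper = mu_hat - 1.96 * se, mu_hat + 1.96 * se
```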
Bayesian Methods
• Given a sampling model Pr(Z | θ) and a prior Pr(θ) for the parameters, estimate the posterior probability Pr(θ | Z)
• By drawing samples from it or estimating its mean or mode
• Differences from mere counting (the frequentist approach)
– Prior: allow for uncertainties present before seeing the
data
– Posterior: allow for uncertainties present after seeing the
data
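Since the posterior is proportional to the sampling model times the prior, Pr(θ | Z) ∝ Pr(Z | θ) Pr(θ), it can be evaluated directly on a grid when θ is one-dimensional. A minimal sketch (Gaussian sampling model with known σ and a Gaussian prior, both illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
z = rng.normal(loc=1.5, scale=1.0, size=20)       # observed data Z
sigma = 1.0                                        # sampling std, assumed known

theta = np.linspace(-3, 5, 2001)                   # grid over the parameter
d_theta = theta[1] - theta[0]
log_prior = norm.logpdf(theta, loc=0.0, scale=2.0)                 # Pr(theta)
log_lik = np.array([np.sum(norm.logpdf(z, loc=t, scale=sigma))     # Pr(Z | theta)
                    for t in theta])

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())           # unnormalized posterior
post /= post.sum() * d_theta                       # normalize over the grid

post_mean = np.sum(theta * post) * d_theta
post_mode = theta[np.argmax(post)]
print("posterior mean:", post_mean, " posterior mode:", post_mode)
```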
Bayesian Methods
• The posterior distribution also affords a predictive distribution for future values Z^new
• In contrast, the max-likelihood approach would predict future data on the basis of the maximum-likelihood estimate alone, not accounting for the uncertainty in the parameters
An Example
• Consider a linear expansion

  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$

• Assume a Gaussian prior density for the parameters:

  $p(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}$
• The posterior distribution for β is also Gaussian, with mean and covariance
• The corresponding posterior values for μ(x):
Bootstrap vs Bayesian
• The bootstrap mean is an approximate
posterior average
• Simple example:
– Single observation z drawn from a normal distribution
– Assume a normal prior for θ
– Resulting posterior distribution
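The slide's posterior formulas did not survive the transcript, so the sketch below uses the standard conjugate-normal result for this setup, z ~ N(θ, 1) with prior θ ~ N(0, τ): the posterior is N(z / (1 + 1/τ), 1 / (1 + 1/τ)). The parametric bootstrap draws z* ~ N(z, 1), which matches the noninformative limit τ → ∞. Numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

z = 1.3          # the single observation, z ~ N(theta, 1)
tau = 10.0       # prior variance: theta ~ N(0, tau)

# Conjugate-normal posterior for theta given z
post_mean = z / (1.0 + 1.0 / tau)
post_var = 1.0 / (1.0 + 1.0 / tau)
print(f"posterior: N({post_mean:.3f}, {post_var:.3f})")

# Parametric bootstrap: z* ~ N(z, 1), the tau -> infinity (noninformative) limit
boot = rng.normal(loc=z, scale=1.0, size=100_000)
print("bootstrap mean/var:", boot.mean(), boot.var())
```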
Bootstrap vs Bayesian
• Three ingredients make this work
  – The choice of a noninformative prior for θ
  – The dependence of the log-likelihood on Z only through the maximum-likelihood estimate θ̂
  – The symmetry of the log-likelihood in θ and θ̂
Bootstrap vs Bayesian
• The bootstrap distribution represents an (approximate)
nonparametric, noninformative posterior distribution for our
parameter.
• But this bootstrap distribution is obtained painlessly without
having to formally specify a prior and without having to
sample from the posterior distribution.
• Hence we might think of the bootstrap distribution as a
"poor man's" Bayes posterior. By perturbing the data, the
bootstrap approximates the Bayesian effect of perturbing
the parameters, and is typically much simpler to carry out.
The EM Algorithm
• The EM algorithm for two-component Gaussian mixtures
  – Take initial guesses $\hat\pi, \hat\mu_1, \hat\sigma_1^2, \hat\mu_2, \hat\sigma_2^2$ for the parameters
  – Expectation Step: compute the responsibilities
The EM Algorithm
– Maximization Step: Compute the weighted
means and variances
– Iterate 2 and 3 until convergence
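A compact sketch of these two steps on simulated one-dimensional data. The responsibilities γi and the weighted updates follow the usual two-component Gaussian-mixture formulas; the data, starting values, and fixed iteration count are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
# Simulated two-component data (illustrative)
y = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(4.0, 0.7, 100)])

# Initial guesses for pi, mu1, sigma1^2, mu2, sigma2^2
pi_, mu1, var1, mu2, var2 = 0.5, y.min(), y.var(), y.max(), y.var()

for _ in range(200):
    # E-step: responsibilities gamma_i = P(component 2 | y_i)
    p1 = (1 - pi_) * norm.pdf(y, mu1, np.sqrt(var1))
    p2 = pi_ * norm.pdf(y, mu2, np.sqrt(var2))
    gamma = p2 / (p1 + p2)

    # M-step: weighted means, variances, and mixing proportion
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi_ = gamma.mean()

print(f"pi={pi_:.3f}  mu1={mu1:.3f} var1={var1:.3f}  mu2={mu2:.3f} var2={var2:.3f}")
```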
The EM Algorithm in General
• Baum-Welch algorithm
• Applicable to problems for which maximizing the log-likelihood is difficult, but which are simplified by enlarging the sample set with unobserved (latent) data (data augmentation).
The EM Algorithm in General
• Start with initial parameters $\hat\theta^{(0)}$
• Expectation Step: at the j-th step, compute $Q(\theta', \hat\theta^{(j)})$ as a function of the dummy argument θ'
• Maximization Step: determine the new parameters $\hat\theta^{(j+1)}$ by maximizing $Q(\theta', \hat\theta^{(j)})$ over θ'
• Iterate steps 2 and 3 until convergence
Model Averaging and Stacking
• Given predictions
• Under square-error loss, seek weights
• Such that
• Here the input x is fixed and the N observations in Z are distributed according to P
Model Averaging and Stacking
• The solution is the population linear regression of Y on the model predictions, namely
• Now the full regression has smaller error, namely
• The population linear regression is not available in practice, so we replace it with the linear regression over the training set
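A sketch of that replacement: regress y on the training-set predictions of the candidate models to get combining weights. The two models (a linear and a cubic fit) and the simulated data are illustrative; proper stacking would use cross-validated rather than in-sample predictions, which is simplified away here.

```python
import numpy as np

rng = np.random.default_rng(8)

# Training data and two candidate models' predictions on it (illustrative)
N = 100
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(x) + rng.normal(scale=0.3, size=N)

f1 = np.poly1d(np.polyfit(x, y, 1))(x)   # predictions of model 1 (linear fit)
f2 = np.poly1d(np.polyfit(x, y, 3))(x)   # predictions of model 2 (cubic fit)

# Replace the population regression of Y on (f1, ..., fM) by a
# least-squares regression of y on the stacked training predictions
F = np.column_stack([f1, f2])
w, *_ = np.linalg.lstsq(F, y, rcond=None)
print("combining weights:", w)

# Combined prediction at the training points
y_combined = F @ w
```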