Model Inference and Averaging
Prof. Liqing Zhang
Dept. Computer Science & Engineering, Shanghai Jiaotong University

Contents
• The Bootstrap and Maximum Likelihood Methods
• Bayesian Methods
• Relationship Between the Bootstrap and Bayesian Inference
• The EM Algorithm
• MCMC for Sampling from the Posterior
• Bagging
• Model Averaging and Stacking
• Stochastic Search: Bumping

Bootstrap by Basis Expansions
• Consider a linear expansion \mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)
• The least-squares solution is \hat\beta = (H^T H)^{-1} H^T y
• The covariance of \hat\beta is \widehat{Var}(\hat\beta) = (H^T H)^{-1} \hat\sigma^2, where \hat\sigma^2 = \sum_i (y_i - \hat\mu(x_i))^2 / N

Parametric Model
• Assume a parameterized probability density (parametric model) g_\theta(z) for the observations z_i.

Maximum Likelihood Inference
• Suppose we are trying to measure the true value of some quantity x_T.
  – We make repeated measurements of this quantity {x_1, x_2, ..., x_n}.
  – The standard way to estimate x_T from our measurements is to calculate the mean value \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i and set x_T = \bar{x}.
• Does this procedure make sense? The maximum likelihood method (MLM) answers this question and provides a general method for estimating parameters of interest from data.

The Maximum Likelihood Method
• Statement of the maximum likelihood method
  – Assume we have made N measurements of x: {x_1, x_2, ..., x_n}.
  – Assume we know the probability distribution function that describes x: f(x, \alpha).
  – Assume we want to determine the parameter \alpha.
• MLM: pick \alpha to maximize the probability of getting the measurements (the x_i's) we did!

The MLM Implementation
• The probability of measuring x_1 is f(x_1, \alpha) dx
• The probability of measuring x_2 is f(x_2, \alpha) dx
• The probability of measuring x_n is f(x_n, \alpha) dx
• If the measurements are independent, the probability of getting the measurements we did is
  L = f(x_1, \alpha) dx \cdot f(x_2, \alpha) dx \cdots f(x_n, \alpha) dx = f(x_1, \alpha) f(x_2, \alpha) \cdots f(x_n, \alpha) [dx^n]
• We can drop the dx^n term as it is only a proportionality constant.
• L is called the likelihood function: L = \prod_{i=1}^{N} f(x_i, \alpha)

Log Maximum Likelihood Method
• We want to pick the \alpha that maximizes L: \partial L / \partial \alpha |_{\alpha = \alpha^*} = 0
  – It is often easier to maximize ln L.
  – L and ln L are maximal at the same location.
• We maximize ln L rather than L itself because ln L converts the product into a summation:
  \ln L = \sum_{i=1}^{N} \ln f(x_i, \alpha)

Log Maximum Likelihood Method
• The new maximization condition is
  \partial \ln L / \partial \alpha |_{\alpha = \alpha^*} = \sum_{i=1}^{N} \partial \ln f(x_i, \alpha) / \partial \alpha |_{\alpha = \alpha^*} = 0
• \alpha could be an array of parameters (e.g. slope and intercept) or just a single variable.
• The equations that determine \alpha range from simple linear equations to coupled non-linear equations.

An Example: Gaussian
• Let f(x, \alpha) be given by a Gaussian distribution function.
• Let \alpha = \mu be the mean of the Gaussian. We want to use our data and the MLM to find the mean.
• We want the best estimate of \alpha from our set of n measurements {x_1, x_2, ..., x_n}.
• Let's assume that \sigma is the same for each measurement.
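To make this Gaussian example concrete, here is a minimal sketch (added illustration, not part of the original slides; it assumes NumPy and SciPy are available) that numerically maximizes the log-likelihood over \mu and checks that the maximizer coincides with the sample mean, as derived on the following slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma = 2.0                                      # known, common sigma for every measurement
x = rng.normal(loc=5.0, scale=sigma, size=200)   # n measurements x_1..x_n

def neg_log_likelihood(mu):
    # -ln L(mu) = -sum_i ln f(x_i, mu) for a Gaussian with known sigma
    return 0.5 * np.sum((x - mu) ** 2) / sigma**2 + len(x) * np.log(sigma * np.sqrt(2 * np.pi))

result = minimize_scalar(neg_log_likelihood, bounds=(x.min(), x.max()), method="bounded")
print("numerical MLE :", result.x)
print("sample mean   :", x.mean())   # the two agree up to optimizer tolerance
```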
An Example: Gaussian
• Gaussian PDF: f(x_i, \mu) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x_i - \mu)^2 / (2\sigma^2)}
• The likelihood function for this problem is
  L = \prod_{i=1}^{n} f(x_i, \mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} e^{-(x_i - \mu)^2 / (2\sigma^2)} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\sum_{i=1}^{n} (x_i - \mu)^2 / (2\sigma^2)}

An Example: Gaussian
• The log-likelihood is
  \ln L = \sum_{i=1}^{n} \ln f(x_i, \mu) = n \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}
• We want to find the \mu that maximizes the log-likelihood:
  \frac{\partial \ln L}{\partial \mu} = \frac{\partial}{\partial \mu}\left[ n \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2} \right] = 0
  \Rightarrow \sum_{i=1}^{n} 2(x_i - \mu)(-1) = 0 \Rightarrow \mu = \frac{1}{n}\sum_{i=1}^{n} x_i

An Example: Gaussian
• If \sigma_i is different for each data point, then \mu is the weighted average:
  \mu = \frac{\sum_{i=1}^{n} x_i / \sigma_i^2}{\sum_{i=1}^{n} 1 / \sigma_i^2}   (weighted average)

An Example: Poisson
• Let f(x, \alpha) be given by a Poisson distribution.
• Let \alpha = \mu be the mean of the Poisson.
• We want the best estimate of \alpha from our set of n measurements {x_1, x_2, ..., x_n}.
• Poisson PDF: f(x, \mu) = \frac{e^{-\mu} \mu^{x}}{x!}

An Example: Poisson
• The likelihood function for this problem is
  L = \prod_{i=1}^{n} f(x_i, \mu) = \frac{e^{-\mu}\mu^{x_1}}{x_1!} \cdot \frac{e^{-\mu}\mu^{x_2}}{x_2!} \cdots \frac{e^{-\mu}\mu^{x_n}}{x_n!} = \frac{e^{-n\mu} \mu^{\sum_i x_i}}{x_1! \, x_2! \cdots x_n!}

An Example: Poisson
• Find the \mu that maximizes the log-likelihood:
  \frac{d \ln L}{d\mu} = \frac{d}{d\mu}\left[ -n\mu + \ln\mu \sum_{i=1}^{n} x_i - \ln(x_1! \, x_2! \cdots x_n!) \right] = -n + \frac{1}{\mu}\sum_{i=1}^{n} x_i = 0
  \Rightarrow \mu = \frac{1}{n}\sum_{i=1}^{n} x_i   (the sample average)

General Properties of MLM
• For large data samples (large n) the likelihood function L approaches a Gaussian distribution.
• Maximum likelihood estimates are usually consistent.
  – For large n the estimates converge to the true values of the parameters we wish to determine.
• Maximum likelihood estimates are usually unbiased.
  – For all sample sizes the parameter of interest is calculated correctly.

General Properties of MLM
• The maximum likelihood estimate is efficient: the estimate has the smallest variance.
• The maximum likelihood estimate is sufficient: it uses all the information in the observations (the x_i's).
• The solution from MLM is unique.
• Bad news: we must know the correct probability distribution for the problem at hand!

Maximum Likelihood
• We maximize the likelihood function L(\theta; Z) = \prod_{i=1}^{N} g_\theta(z_i)
• Log-likelihood function: \ell(\theta; Z) = \sum_{i=1}^{N} \ln g_\theta(z_i)

Score Function
• Assess the precision of \hat\theta using the likelihood function.
• Assume that L takes its maximum in the interior of the parameter space. Then the score function
  \dot\ell(\theta; Z) = \sum_{i=1}^{N} \partial \ell(\theta; z_i) / \partial\theta
  satisfies \dot\ell(\hat\theta; Z) = 0.

Likelihood Function
• We maximize the likelihood function.
• We omit the normalization since it only adds a constant factor.
• Think of L as a function of \theta with Z fixed.
• Log-likelihood function: \ell(\theta; Z) = \sum_{i=1}^{N} \ln g_\theta(z_i)

Fisher Information
• The negative sum of second derivatives is the information matrix
  I(\theta) = -\sum_{i=1}^{N} \frac{\partial^2 \ell(\theta; z_i)}{\partial\theta \, \partial\theta^T}
• I(\hat\theta) is called the observed information; it should be positive definite.
• The Fisher information (expected information) is i(\theta) = E_\theta[I(\theta)].
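As a small illustration tying the Poisson example to the information quantities just defined (an added sketch, not part of the original slides), note that the observed information for the Poisson mean is I(\hat\mu) = \sum_i x_i / \hat\mu^2 = n / \hat\mu at \hat\mu = \bar{x}, giving an approximate standard error of \sqrt{\hat\mu / n}:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=4.0, size=500)          # n Poisson counts

n = len(x)
mu_hat = x.mean()                           # MLE: the sample average

# Observed information: I(mu) = -d^2 ln L / d mu^2 = sum(x_i) / mu^2, evaluated at mu_hat
observed_info = x.sum() / mu_hat**2         # equals n / mu_hat at the MLE
se = 1.0 / np.sqrt(observed_info)           # large-sample standard error of mu_hat

print(f"mu_hat = {mu_hat:.3f}")
print(f"approx 95% confidence interval: ({mu_hat - 1.96*se:.3f}, {mu_hat + 1.96*se:.3f})")
```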
Sampling Theory
• A basic result of sampling theory: the sampling distribution of the maximum-likelihood estimator approaches a normal distribution,
  \hat\theta \rightarrow N(\theta_0, \, i(\theta_0)^{-1}) as N \rightarrow \infty,
  when we sample independently from g_{\theta_0}(z).
• This suggests approximating the sampling distribution of \hat\theta by N(\hat\theta, i(\hat\theta)^{-1}) or N(\hat\theta, I(\hat\theta)^{-1}).

Error Bound
• The corresponding standard-error estimates of \hat\theta_j are obtained from \sqrt{i(\hat\theta)^{-1}_{jj}} and \sqrt{I(\hat\theta)^{-1}_{jj}}.
• The confidence points have the form \hat\theta_j \pm z^{(1-\alpha)} \sqrt{I(\hat\theta)^{-1}_{jj}}.

Simplified Form of the Fisher Information
• Suppose, in addition, that the operations of integration and differentiation can be swapped for the second derivative of f(x; \theta) as well, i.e.,
  \frac{\partial^2}{\partial\theta^2} \int f(x; \theta) \, dx = \int \frac{\partial^2}{\partial\theta^2} f(x; \theta) \, dx = 0.
• In this case, it can be shown that the Fisher information equals
  I(\theta) = -E\left[ \frac{\partial^2}{\partial\theta^2} \ln f(X; \theta) \right].
• The Cramér–Rao bound can then be written as
  var(\hat\theta) \ge \frac{1}{I(\theta)} = \frac{1}{-E\left[ \frac{\partial^2}{\partial\theta^2} \ln f(X; \theta) \right]}.

Single-Parameter Proof
• Let X be a random variable with probability density function f(x; \theta). Here T = t(X) is a statistic, which is used as an estimator for \psi(\theta). If the expectation of T is denoted by \psi(\theta), then E[T] = \psi(\theta) for all \theta.
• Let V be the score, i.e. V = \frac{\partial}{\partial\theta} \ln f(X; \theta); then the expectation of V, written E(V), is zero.
• If we consider the covariance cov(V, T) of V and T, we have cov(V, T) = E(VT), because E(V) = 0. Expanding this expression we have
  cov(V, T) = E\left[ T \cdot \frac{\partial}{\partial\theta} \ln f(X; \theta) \right].
• Using the chain rule and the definition of expectation gives, after cancelling f(x; \theta),
  cov(V, T) = \int t(x) \, \frac{\partial f(x; \theta)}{\partial\theta} \, dx = \frac{\partial}{\partial\theta} \int t(x) f(x; \theta) \, dx = \psi'(\theta),
  because the integration and differentiation operations commute (second condition).
• The Cauchy–Schwarz inequality shows that \sqrt{var(T)\,var(V)} \ge |cov(V, T)| = |\psi'(\theta)|. Therefore
  var(T) \ge \frac{[\psi'(\theta)]^2}{var(V)} = \frac{[\psi'(\theta)]^2}{I(\theta)}.

An Example
• Consider a linear expansion \mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)
• The least-squares solution is \hat\beta = (H^T H)^{-1} H^T y
• The covariance of \hat\beta is \widehat{Var}(\hat\beta) = (H^T H)^{-1} \hat\sigma^2, where \hat\sigma^2 = \sum_i (y_i - \hat\mu(x_i))^2 / N

An Example
• Consider the prediction model \hat\mu(x) = \sum_{j=1}^{N} \hat\beta_j h_j(x).
• Its standard error is se[\hat\mu(x)] = [h(x)^T (H^T H)^{-1} h(x)]^{1/2} \, \hat\sigma.
• The pointwise confidence region is \hat\mu(x) \pm 1.96 \, se[\hat\mu(x)].

Bayesian Methods
• Given a sampling model Pr(Z | \theta) and a prior Pr(\theta) for the parameters, estimate the posterior probability
  Pr(\theta | Z) = \frac{Pr(Z | \theta) \, Pr(\theta)}{\int Pr(Z | \theta) \, Pr(\theta) \, d\theta}
  by drawing samples from it or estimating its mean or mode.
• Differences to the frequentist approach:
  – Prior: allows for uncertainties present before seeing the data
  – Posterior: allows for uncertainties present after seeing the data

Bayesian Methods
• The posterior distribution also affords a predictive distribution for future values Z^{new}:
  Pr(z^{new} | Z) = \int Pr(z^{new} | \theta) \, Pr(\theta | Z) \, d\theta
• In contrast, the maximum-likelihood approach would predict future data on the basis of Pr(z^{new} | \hat\theta), not accounting for the uncertainty in the parameters.

An Example
• Consider a linear expansion \mu(x) = \sum_{j=1}^{N} \beta_j h_j(x).
• Place a Gaussian prior on the coefficients, with density
  p(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2} x^T \Sigma^{-1} x}.
• The posterior distribution for \beta is then also Gaussian, with mean and covariance determined by the prior and the least-squares quantities.
• The corresponding posterior values for \mu(x) follow by plugging the posterior for \beta into the basis expansion.

Bootstrap vs Bayesian
• The bootstrap mean is an approximate posterior average.
• A simple example:
  – A single observation z is drawn from a normal distribution, z \sim N(\theta, 1).
  – Assume a normal prior for \theta: \theta \sim N(0, \tau).
  – The resulting posterior distribution is \theta \,|\, z \sim N\!\left( \frac{z}{1 + 1/\tau}, \, \frac{1}{1 + 1/\tau} \right).
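The correspondence in this simple example can be checked numerically. The sketch below (an added illustration under the stated assumptions, not from the original slides) compares parametric-bootstrap draws \theta^* = z^*, z^* \sim N(z, 1), with posterior draws under a very flat prior (large \tau); their means and spreads essentially coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
z = 1.3                       # the single observation, z ~ N(theta, 1)
tau = 1e6                     # nearly noninformative normal prior theta ~ N(0, tau)
B = 100_000

# Parametric bootstrap: resample z* ~ N(z, 1); the re-estimated parameter is theta* = z*
theta_boot = rng.normal(loc=z, scale=1.0, size=B)

# Bayesian posterior: theta | z ~ N(z / (1 + 1/tau), 1 / (1 + 1/tau))
post_mean = z / (1 + 1 / tau)
post_sd = np.sqrt(1 / (1 + 1 / tau))
theta_post = rng.normal(loc=post_mean, scale=post_sd, size=B)

print("bootstrap mean/sd:", theta_boot.mean(), theta_boot.std())
print("posterior mean/sd:", theta_post.mean(), theta_post.std())   # nearly identical for large tau
```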
Bootstrap vs Bayesian
• Three ingredients make this work:
  – The choice of a noninformative prior for \theta.
  – The dependence of the log-likelihood \ell(\theta; Z) on the data Z only through the maximum-likelihood estimate \hat\theta. Thus we can write the log-likelihood as \ell(\theta; \hat\theta).
  – The symmetry of the log-likelihood in \theta and \hat\theta, i.e. \ell(\theta; \hat\theta) = \ell(\hat\theta; \theta) + const.

Bootstrap vs Bayesian
• The bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for our parameter.
• But this bootstrap distribution is obtained painlessly, without having to formally specify a prior and without having to sample from the posterior distribution.
• Hence we might think of the bootstrap distribution as a "poor man's" Bayes posterior. By perturbing the data, the bootstrap approximates the Bayesian effect of perturbing the parameters, and is typically much simpler to carry out.

The EM Algorithm
• The EM algorithm for two-component Gaussian mixtures (a code sketch appears at the end of this section):
  1. Take initial guesses \hat\pi, \hat\mu_1, \hat\sigma_1^2, \hat\mu_2, \hat\sigma_2^2 for the parameters.
  2. Expectation step: compute the responsibilities
     \hat\gamma_i = \frac{\hat\pi \, \phi_{\hat\theta_2}(y_i)}{(1 - \hat\pi) \, \phi_{\hat\theta_1}(y_i) + \hat\pi \, \phi_{\hat\theta_2}(y_i)}.
  3. Maximization step: compute the weighted means and variances, and the mixing proportion \hat\pi.
  4. Iterate steps 2 and 3 until convergence.

The EM Algorithm in General
• Known as the Baum–Welch algorithm in the context of hidden Markov models.
• Applicable to problems for which maximizing the log-likelihood is difficult, but is simplified by enlarging the sample with unobserved (latent) data (data augmentation).

The EM Algorithm in General
1. Start with initial parameters \hat\theta^{(0)}.
2. Expectation step: at the j-th step, compute Q(\theta', \hat\theta^{(j)}) = E[\ell_0(\theta'; T) \,|\, Z, \hat\theta^{(j)}] as a function of the dummy argument \theta'.
3. Maximization step: determine the new parameters \hat\theta^{(j+1)} by maximizing Q(\theta', \hat\theta^{(j)}) over \theta'.
4. Iterate steps 2 and 3 until convergence.

Model Averaging and Stacking
• Given predictions \hat f_1(x), \hat f_2(x), ..., \hat f_M(x).
• Under squared-error loss, seek weights w = (w_1, ..., w_M)
• such that \hat w = \mathrm{argmin}_w \, E_P\big[ Y - \sum_{m=1}^{M} w_m \hat f_m(x) \big]^2.
• Here the input x is fixed and the N observations in Z are distributed according to P.

Model Averaging and Stacking
• The solution is the population linear regression of Y on \hat F(x)^T \equiv [\hat f_1(x), ..., \hat f_M(x)], namely
  \hat w = E_P[\hat F(x) \hat F(x)^T]^{-1} E_P[\hat F(x) Y].
• The full regression has smaller error than any single model, namely
  E_P\big[ Y - \hat w^T \hat F(x) \big]^2 \le E_P\big[ Y - \hat f_m(x) \big]^2 for all m.
• The population linear regression is not available, so we replace it by a linear regression over the training set.
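To illustrate that last step (replacing the population regression by a regression over the training data), the sketch below is an added example with made-up data and three arbitrary candidate models; practical stacking usually regresses on cross-validated predictions to avoid favoring overfit models:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=N)

# Predictions from M = 3 candidate models fitted to the training data
F = np.column_stack([
    np.polyval(np.polyfit(x, y, 1), x),   # linear fit
    np.polyval(np.polyfit(x, y, 3), x),   # cubic fit
    np.full(N, y.mean()),                 # constant (mean) predictor
])

# Stacking weights: least-squares regression of y on the model predictions
w, *_ = np.linalg.lstsq(F, y, rcond=None)
y_stack = F @ w

print("weights      :", w)
print("MSE per model:", ((F - y[:, None]) ** 2).mean(axis=0))
print("MSE stacked  :", ((y_stack - y) ** 2).mean())
```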
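Finally, here is the promised sketch of the two-component Gaussian-mixture EM procedure from the earlier slides (a minimal illustration on synthetic data; the initial guesses and the fixed iteration count stand in for a proper convergence check):

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic data from a two-component Gaussian mixture
y = np.concatenate([rng.normal(1.0, 0.5, 150), rng.normal(4.0, 1.0, 100)])

def normal_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# 1. Initial guesses
pi, mu1, var1, mu2, var2 = 0.5, y.min(), y.var(), y.max(), y.var()

for _ in range(200):
    # 2. Expectation step: responsibility of component 2 for each observation
    p1 = (1 - pi) * normal_pdf(y, mu1, var1)
    p2 = pi * normal_pdf(y, mu2, var2)
    gamma = p2 / (p1 + p2)

    # 3. Maximization step: weighted means, variances, and mixing proportion
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi = gamma.mean()
    # 4. Iterate (a fuller implementation would stop once the log-likelihood stabilizes)

print(f"pi={pi:.3f}, mu1={mu1:.3f}, var1={var1:.3f}, mu2={mu2:.3f}, var2={var2:.3f}")
```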