The basic idea
Assume a particular model with unknown parameters. Determine how the likelihood of a given event varies with the model parameters, then choose the parameter values that maximize the likelihood of the observed event.

A general mathematical formulation
Consider a sample (X_1, ..., X_n) drawn from a probability distribution P(X \mid \theta), where \theta denotes the parameters. If the X_i are independent with probability density function P(X_i \mid \theta), then the joint probability of the whole sample is

P(X_1, \dots, X_n \mid \theta) = \prod_{i=1}^{n} P(X_i \mid \theta)

Find the parameters \theta that maximize this function.

The likelihood function for the general non-linear model
Assume that Y = f(X, \beta) + e with e \sim N(0, \Sigma). Then the likelihood function is

L(\beta, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left[ -0.5\,(Y - f(X, \beta))' \Sigma^{-1} (Y - f(X, \beta)) \right]

Note that the ML estimator of \beta is identical to the least squares estimator if \Sigma = \sigma^2 I, where I is the identity matrix.

Large sample properties of ML estimators
Consistency: as the sample size increases, the ML estimator converges to the true parameter value.
Invariance: if f(\theta) is a function of the unknown parameters of the distribution, then the ML estimator of f(\theta) is f(\hat{\theta}).
Asymptotic normality: as the sample size increases, the sampling distribution of an ML estimator converges to a normal distribution.
Variance: for large sample sizes, the variance of an ML estimator (assuming a single unknown parameter) is approximately the negative reciprocal of the second derivative of the log-likelihood function evaluated at the ML estimate:

\mathrm{Var}(\hat{\theta}) \approx -\left[ \frac{\partial^2 \log L(\theta \mid x)}{\partial \theta^2} \bigg|_{\theta = \hat{\theta}} \right]^{-1}

The information matrix (Hessian)
The matrix

I(\theta) = E\left[ -\frac{\partial^2 \log L(\theta)}{\partial \theta \, \partial \theta'} \right]

is a measure of how 'pointy' the likelihood function is. The variance of the ML estimator is given by the inverse of the information matrix:

\mathrm{Var}(\hat{\theta}_{ML}) = [I(\theta)]^{-1}

The Cramér-Rao lower bound
The Cramér-Rao lower bound is the smallest theoretical variance that can be achieved; ML attains it, so any other estimation technique can at best only equal it. If \theta^* is another (unbiased) estimator of \theta, then

\mathrm{Var}(\theta^*) \geq [I(\theta)]^{-1}

Do we need estimators other than ML estimators?

ML estimators for dynamic models
A general decomposition technique for the log-likelihood function allows us to extend standard ML procedures to dynamic models (time series models). From the basic definition of conditional probability,

\Pr(A, B) = \Pr(A \mid B)\,\Pr(B)

This may be applied directly to the likelihood function.

Prediction error decomposition
Consider the decomposition

\log L(Y_1, Y_2, \dots, Y_{T-1}, Y_T) = \log L(Y_T \mid Y_1, Y_2, \dots, Y_{T-1}) + \log L(Y_1, Y_2, \dots, Y_{T-1})

The first term is the conditional density of Y_T given all past values. We can decompose the second term in the same way, and so on, to give

\log L(Y_1, \dots, Y_T) = \sum_{i=0}^{T-2} \log L(Y_{T-i} \mid Y_1, \dots, Y_{T-i-1}) + \log L(Y_1)

that is, a series of one-step-ahead prediction errors conditional on actual lagged Y.

Numerical optimisation
In simple cases (e.g. OLS) we can calculate the maximum likelihood estimates analytically, but in many cases we cannot; we then resort to numerical optimisation of the likelihood function. This amounts to hill climbing in parameter space (a sketch follows the list below):
1. Set an arbitrary initial set of parameters.
2. Determine a direction of movement.
3. Determine a step length to move.
4. Examine some termination criteria and either stop or go back to step 2.
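As a concrete illustration of the prediction error decomposition and of numerical optimisation, the following is a minimal sketch, assuming a Gaussian AR(1) model, simulated data, and SciPy's quasi-Newton optimiser; the model, parameter names and helper function are illustrative choices, not part of the original slides. It builds the log-likelihood as a sum of one-step-ahead Gaussian densities (conditioning on the first observation) and maximises it by hill climbing.

```python
# Sketch: prediction error decomposition for a Gaussian AR(1),
# maximised numerically (hypothetical example, not from the slides).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate an AR(1): y_t = c + phi * y_{t-1} + e_t, e_t ~ N(0, sigma^2)
c_true, phi_true, sigma_true = 0.5, 0.8, 1.0
T = 500
y = np.empty(T)
y[0] = c_true / (1 - phi_true)
for t in range(1, T):
    y[t] = c_true + phi_true * y[t - 1] + sigma_true * rng.standard_normal()

def neg_log_likelihood(params, y):
    """Negative log-likelihood via the prediction error decomposition:
    log L(y_1,...,y_T) = sum_t log L(y_t | y_1,...,y_{t-1}),
    conditioning on y_1 so each term is a one-step-ahead Gaussian density."""
    c, phi, log_sigma = params
    sigma = np.exp(log_sigma)            # keep sigma positive
    pred = c + phi * y[:-1]              # one-step-ahead predictions
    err = y[1:] - pred                   # one-step-ahead prediction errors
    ll = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (err / sigma) ** 2)
    return -ll                           # minimise the negative log-likelihood

# Hill climbing: start from an arbitrary point, let a quasi-Newton method
# choose directions and step lengths, and stop on its termination criteria.
res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(y,), method="BFGS")
c_hat, phi_hat, sigma_hat = res.x[0], res.x[1], np.exp(res.x[2])
print("ML estimates (c, phi, sigma):", c_hat, phi_hat, sigma_hat)

# Approximate standard errors from the BFGS inverse-Hessian approximation of
# the negative log-likelihood, i.e. an estimate of [I(theta)]^{-1} as above.
se = np.sqrt(np.diag(res.hess_inv))
print("approx. std. errors (c, phi, log sigma):", se)
```

Because the variance is parameterised as exp(log sigma), the invariance property gives the ML estimate of sigma directly as the exponential of the estimated log sigma, and the inverse Hessian returned by the optimiser plays the role of the estimated [I(\theta)]^{-1} from the information-matrix slide.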
Gradient methods for determining the maximum of a function
These methods base the direction of movement on the first derivatives of the likelihood function with respect to the parameters. Often the step length is also determined by (an approximation to) the second derivatives, giving the Newton-type update

\theta_{i+1} = \theta_i - \left[ \frac{\partial^2 L}{\partial \theta \, \partial \theta'} \right]^{-1} \frac{\partial L}{\partial \theta}

The class of gradient methods includes Newton, quasi-Newton, steepest descent, etc.

Qualitative response models
Assume that we have a quantitative model

Y_t = X_t \beta + u_t

but we only observe certain limited information, e.g.

z_t = 1 if Y_t > 0
z_t = 0 if Y_t \leq 0

Then we can group the data into two groups and form a likelihood function of the form

L = \prod_{z_t = 0} F(-X_t \beta) \prod_{z_t = 1} \left[ 1 - F(-X_t \beta) \right]

where F is the cumulative distribution function of the error terms u_t.
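To make the qualitative response likelihood concrete, here is a minimal sketch under the assumption that F is the standard normal CDF (a probit model); the simulated data, variable names and use of SciPy's BFGS quasi-Newton routine are illustrative, not taken from the slides. It maximises the log of L = \prod_{z=0} F(-X_t\beta) \prod_{z=1} [1 - F(-X_t\beta)] with a gradient method of the kind described above.

```python
# Sketch: ML estimation of a probit-type qualitative response model
# (hypothetical example; F is taken to be the standard normal CDF).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

# Latent model Y_t = X_t beta + u_t, but we only observe z_t = 1{Y_t > 0}.
n = 1000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([0.3, 1.2])
u = rng.standard_normal(n)
z = (X @ beta_true + u > 0).astype(float)

def neg_log_likelihood(beta, X, z):
    """-log L, with L = prod_{z=0} F(-X beta) * prod_{z=1} [1 - F(-X beta)]."""
    xb = X @ beta
    p0 = norm.cdf(-xb)                   # P(z_t = 0)
    p1 = 1.0 - p0                        # P(z_t = 1)
    eps = 1e-12                          # guard against log(0)
    return -np.sum(np.where(z == 0, np.log(p0 + eps), np.log(p1 + eps)))

# Quasi-Newton (BFGS): directions from the gradient, step lengths from an
# approximation to the second derivatives, as described above.
res = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]),
               args=(X, z), method="BFGS")
print("ML estimate of beta:", res.x)
print("approx. covariance (BFGS inverse-Hessian approximation):", res.hess_inv)
```

Replacing the normal CDF with the logistic CDF in the same code would give a logit rather than a probit model; the likelihood structure is unchanged.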