Stat 550 Notes 8

Notes:
1. We have no class this coming Tuesday because it's fall break.
2. The midterm is due Wednesday by 5:00. I'll be around on Monday and Tuesday if you have any questions about it. I'll hold my usual office hours on Tuesday from 4:45-5:45 and can also meet with you by appointment.

I. Maximum Likelihood

The method of maximum likelihood is a general approach to point estimation.

Motivating Example: A purchaser of electrical components buys them in lots of size 12. Each electrical component is either acceptable or defective. Let $\theta$ denote the number of acceptable components in the box. It is expensive to test whether all the electrical components are acceptable; we would like to estimate $\theta$ by randomly choosing five components without replacement and testing whether these five components are acceptable. Let $X$ denote the number of acceptable components in the sample. Suppose $X = 3$ of the components in the sample are acceptable. How should we estimate $\theta$?

Probability model: Imagine that the components are numbered 1-12. A sample of five components thus consists of five distinct numbers from 1 to 12. All $\binom{12}{5} = 792$ samples are equally likely. The distribution of $X$ is hypergeometric:
$$P_\theta(X = x) = \frac{\binom{\theta}{x}\binom{12-\theta}{5-x}}{\binom{12}{5}}.$$

The following table shows the probability distribution of $X$ for each possible value of $\theta$ (rows: $\theta$ = number of acceptable components in the box; columns: $x$ = number of acceptable components in the sample).

theta    x=0     x=1     x=2     x=3     x=4     x=5
  0      1       0       0       0       0       0
  1      .5833   .4167   0       0       0       0
  2      .3182   .5303   .1515   0       0       0
  3      .1591   .4773   .3182   .0454   0       0
  4      .0707   .3535   .4243   .1414   .0101   0
  5      .0265   .2210   .4419   .2652   .0442   .0012
  6      .0076   .1136   .3788   .3788   .1136   .0076
  7      .0012   .0442   .2652   .4419   .2210   .0265
  8      0       .0101   .1414   .4243   .3535   .0707
  9      0       0       .0454   .3182   .4773   .1591
 10      0       0       0       .1515   .5303   .3182
 11      0       0       0       0       .4167   .5833
 12      0       0       0       0       0       1

Once we obtain the sample $X = 3$, what should we estimate $\theta$ to be?

It's not clear how to apply the method of moments. We have $E_\theta(X) = \frac{5\theta}{12}$, but solving $\frac{5\hat\theta}{12} - 3 = 0$ gives $\hat\theta = 7.2$, which is not in the parameter space.

Maximum likelihood approach: We know that it is impossible that $\theta$ = 0, 1, 2, 11, or 12. The set of possible values for $\theta$ once we observe $X = 3$ is $\theta$ = 3, 4, 5, 6, 7, 8, 9, 10. Although both $\theta = 3$ and $\theta = 7$ are possible, the occurrence of $X = 3$ would be more "likely" if $\theta = 7$ [$P_7(X = 3) = .4419$] than if $\theta = 3$ [$P_3(X = 3) = .0454$]. Among $\theta$ = 3, 4, 5, 6, 7, 8, 9, 10, the $\theta$ that makes the observed data $X = 3$ most "likely" is $\theta = 7$.

General definitions for the maximum likelihood estimator

The likelihood function is defined by $L_X(\theta) = p(X \mid \theta)$. The likelihood function is just the joint probability mass function or probability density of the data, except that we treat it as a function of the parameter $\theta$. Thus, $L_X : \Theta \to [0, \infty)$. The likelihood function is not a probability mass function or a probability density function: in general, it is not true that $L_X(\theta)$ integrates (or sums) to 1 with respect to $\theta$. In the motivating example, for $X = 3$, $\sum_{\theta=0}^{12} L_{X=3}(\theta) = 2.167$.

The maximum likelihood estimator (the MLE), denoted by $\hat\theta_{MLE}$, is the value of $\theta$ that maximizes the likelihood:
$$\hat\theta_{MLE} = \arg\max_\theta L_x(\theta).$$
For the motivating example, $\hat\theta_{MLE} = 7$.

Intuitively, the MLE is a reasonable choice for an estimator: the MLE is the parameter point for which the observed sample is most likely.

Equivalently, we can work with the log likelihood function $l_x(\theta) = \log p(x \mid \theta)$, and $\hat\theta_{MLE} = \arg\max_\theta l_x(\theta)$.
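To make the definitions concrete, here is a short computational check, a minimal sketch that is not part of the original notes: it evaluates the hypergeometric likelihood $L_{X=3}(\theta) = P_\theta(X = 3)$ of the motivating example for each $\theta$, and confirms that the maximizer is $\theta = 7$ and that the likelihood values sum to about 2.167 rather than 1.

```python
# Minimal sketch (not from the notes): the likelihood in the motivating
# example. scipy.stats.hypergeom.pmf(x, M, K, N) gives the probability of
# x successes when sampling N items without replacement from a population
# of M items containing K successes.
from scipy.stats import hypergeom

BOX_SIZE, SAMPLE_SIZE, x_obs = 12, 5, 3

# L_{X=3}(theta) = P_theta(X = 3) for theta = 0, ..., 12
likelihood = {theta: hypergeom.pmf(x_obs, BOX_SIZE, theta, SAMPLE_SIZE)
              for theta in range(BOX_SIZE + 1)}

for theta, lik in likelihood.items():
    print(f"theta = {theta:2d}: L(theta) = {lik:.4f}")

theta_mle = max(likelihood, key=likelihood.get)
print("MLE:", theta_mle)                                      # 7
print("sum over theta:", round(sum(likelihood.values()), 3))  # 2.167, not 1
```

Printing the column of likelihood values reproduces the $x = 3$ column of the table above, which is exactly the function of $\theta$ that the MLE maximizes.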
Example 2: Poisson distribution. Suppose $X_1, \ldots, X_n$ are iid Poisson($\lambda$), $\lambda \in (0, \infty)$.
$$l_x(\lambda) = \log \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{X_i}}{X_i!} = -n\lambda + \left( \sum_{i=1}^n X_i \right) \log \lambda - \log \prod_{i=1}^n X_i!$$

To maximize the log likelihood, we set the first derivative of the log likelihood equal to zero:
$$l'(\lambda) = -n + \frac{1}{\lambda} \sum_{i=1}^n X_i = 0.$$
$\bar X$ is the unique solution to this equation. To confirm that $\bar X$ in fact maximizes $l(\lambda)$, we can use the second derivative test:
$$l''(\lambda) = -\frac{1}{\lambda^2} \sum_{i=1}^n X_i,$$
so $l''(\bar X) < 0$ as long as $\sum_{i=1}^n X_i > 0$, and $\bar X$ in fact maximizes $l(\lambda)$. When $\sum_{i=1}^n X_i = 0$, it can be seen by inspection that $l_x(\lambda)$ is a strictly decreasing function of $\lambda$ and therefore there is no maximum of $l_x(\lambda)$ over the parameter space $(0, \infty)$; the MLE does not exist when $\sum_{i=1}^n X_i = 0$.

Example 3: Suppose $X_1, \ldots, X_n$ are iid Uniform$(0, \theta]$.
$$L_x(\theta) = \begin{cases} 0 & \text{if } \theta < \max_i X_i \\ 1/\theta^n & \text{if } \theta \ge \max_i X_i \end{cases}$$
Since $1/\theta^n$ is decreasing in $\theta$, the likelihood is maximized at the smallest $\theta$ consistent with the data. Thus, $\hat\theta_{MLE} = \max_i X_i$.

Recall that the method of moments estimator is $2\bar X$. In Notes 4, we showed that $\max_i X_i$ dominates $2\bar X$ for the squared error loss function (although $\max_i X_i$ is in turn dominated by $\frac{n+1}{n} \max_i X_i$).

Key valuable asymptotic features of maximum likelihood estimators:

For $X_1, \ldots, X_n$ iid $p(x \mid \theta)$, $\theta \in \Theta$, under "regularity conditions" on $p(x \mid \theta)$ [these are essentially smoothness conditions on $p(x \mid \theta)$]:
1. The MLE is consistent.
2. The MLE is asymptotically normal: for a one-dimensional parameter $\theta$, $\frac{\hat\theta_{MLE} - \theta}{SE(\hat\theta_{MLE})}$ converges in distribution to a standard normal distribution.
3. The MLE is asymptotically optimal: roughly, this means that among all well-behaved estimators, the MLE has the smallest variance for large samples.

Consistency of maximum likelihood estimates:

A basic desirable property of estimators is that they be consistent, i.e., converge to the true parameter when there is a "large" amount of data. The maximum likelihood estimator is generally, although not always, consistent. We prove a special case of consistency here.

Theorem: Consider the model $X_1, \ldots, X_n$ iid with pmf or pdf $\{p(X_i \mid \theta), \theta \in \Theta\}$. Suppose (a) the parameter space $\Theta$ is finite; (b) $\theta$ is identifiable; and (c) the $p(X_i \mid \theta)$ have common support for all $\theta$. Then the maximum likelihood estimator $\hat\theta_{MLE}$ is consistent as $n \to \infty$.

Proof: Let $\theta_0$ denote the true parameter. First, we show that for any $\theta \neq \theta_0$,
$$P_{\theta_0}\left(l_x(\theta_0) > l_x(\theta)\right) \to 1 \text{ as } n \to \infty. \quad (0.1)$$
The inequality $l_x(\theta_0) > l_x(\theta)$ is equivalent to
$$\frac{1}{n} \sum_{i=1}^n \log \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} < 0.$$
By the law of large numbers, the left side tends in probability toward
$$E_{\theta_0}\left[ \log \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} \right].$$
Since $-\log$ is strictly convex, Jensen's inequality shows that
$$E_{\theta_0}\left[ \log \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} \right] < \log E_{\theta_0}\left[ \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} \right] = \log 1 = 0,$$
where the expectation of the ratio equals 1 because $p(x \mid \theta)$ integrates (or sums) to 1 over the common support, and (0.1) follows.

For a finite parameter space, $\hat\theta_{MLE}$ is consistent if and only if $P_{\theta_0}(\hat\theta_{MLE} = \theta_0) \to 1$.

Denote the points other than $\theta_0$ in the finite parameter space by $\theta_1, \ldots, \theta_K$. Let $A_{jn}$ be the event that, for $n$ observations, $l_x(\theta_0) > l_x(\theta_j)$. The event $A_{1n} \cap \cdots \cap A_{Kn}$ is contained in the event $\{\hat\theta_{MLE} = \theta_0\}$ for $n$ observations. By (0.1), $P(A_{jn}) \to 1$ as $n \to \infty$ for $j = 1, \ldots, K$. Consequently, $P(A_{1n} \cap \cdots \cap A_{Kn}) \to 1$ as $n \to \infty$, and since $A_{1n} \cap \cdots \cap A_{Kn}$ is contained in the event $\{\hat\theta_{MLE} = \theta_0\}$, $P_{\theta_0}(\hat\theta_{MLE} = \theta_0) \to 1$ as $n \to \infty$.

For infinite parameter spaces, the MLE can be shown to be consistent under conditions (b)-(c) of the theorem plus the following two assumptions: (1) the parameter space contains an open set of which the true parameter is an interior point (i.e., the true parameter is not on the boundary of the parameter space); and (2) $p(x \mid \theta)$ is differentiable in $\theta$.

The consistency theorem assumes that the parameter space does not depend on the sample size. The MLE can be inconsistent when the number of parameters increases with the sample size, e.g., for $X_1, \ldots, X_n$ independent normals with means $\mu_i$ and common variance $\sigma^2$, the MLE of $\sigma^2$ is inconsistent.
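As a complement to the Poisson calculation and the consistency discussion above, the following sketch (my own illustration, not from the notes, with an assumed true value $\lambda = 3$ and a fixed random seed) numerically maximizes the Poisson log likelihood and checks that the maximizer matches the closed-form MLE $\bar X$, which in turn settles down near the true $\lambda$ as $n$ grows.

```python
# Minimal sketch (assumptions: lam_true = 3.0, seed 0; not from the notes).
# Checks numerically that xbar maximizes the Poisson log likelihood and
# illustrates consistency of the MLE as n grows.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
lam_true = 3.0

def loglik(lam, x):
    # l(lambda) = -n*lambda + (sum_i x_i) log(lambda) - log(prod_i x_i!);
    # the factorial term is constant in lambda, so we drop it.
    return -len(x) * lam + x.sum() * np.log(lam)

for n in (10, 100, 10_000):
    x = rng.poisson(lam_true, size=n)
    # Maximize l (i.e., minimize -l) over lambda > 0.
    res = minimize_scalar(lambda lam: -loglik(lam, x),
                          bounds=(1e-8, 100.0), method="bounded")
    print(f"n = {n:6d}: xbar = {x.mean():.4f}, numerical MLE = {res.x:.4f}")
```

If every observation happened to be 0 (possible in principle for small $n$), the log likelihood would be strictly decreasing and the bounded optimizer would simply return the lower end of the search interval, matching the observation above that the MLE does not exist in that case.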