B90.3302 C22.0015 NOTES for Wednesday 2011.MAR.23

Some loose topics. We've got to deal with the Cramér-Rao inequality. This is covered on a separate handout. The essential summary is this. Suppose that $\hat{\theta}$ is an estimate of $\theta$.

(1) If $\lim_{n \to \infty} n \, \mathrm{Var}(\hat{\theta}) = \frac{1}{I(\theta)}$, then $\hat{\theta}$ is efficient.

(2) If $\hat{\theta}$ is maximum likelihood, then $\hat{\theta}$ is efficient.

(3) If $\hat{\theta}$ is unbiased, meaning $\mathrm{E}\,\hat{\theta} = \theta$, then $\mathrm{Var}(\hat{\theta}) \ge \frac{1}{n \, I(\theta)}$.

(4) If $\hat{\theta}$ is unbiased and $\mathrm{Var}(\hat{\theta}) = \frac{1}{n \, I(\theta)}$, then $\hat{\theta}$ is MVUE (minimum variance unbiased estimate).

One final detail. We had mentioned that maximum likelihood estimates are asymptotically normal. Why does this happen? Let's show a partial proof for the case in which we have a sample $X_1, X_2, \ldots, X_n$ from a probability law $f(x \mid \theta)$ with one parameter $\theta$.

We'll have to use a Taylor series. This says that for any function $h$, $h(y) \approx h(y_0) + h'(y_0)(y - y_0)$.

Our likelihood is then $L = \prod_{i=1}^{n} f(x_i \mid \theta)$, and the log-likelihood is $\log L = \sum_{i=1}^{n} \log f(x_i \mid \theta)$.

Now obtain the derivative with respect to $\theta$:

$$\frac{\partial}{\partial \theta} \log L \;=\; \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(x_i \mid \theta)$$

Letting $\hat{\theta}$ be the maximum likelihood estimate, let's write this as a Taylor series about that $\hat{\theta}$:

$$\frac{\partial}{\partial \theta} \log L \;\approx\; \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(x_i \mid \hat{\theta}) \;+\; \left[ \sum_{i=1}^{n} \frac{\partial^2}{\partial \theta^2} \log f(x_i \mid \hat{\theta}) \right] (\theta - \hat{\theta})$$

Now divide left and right sides by $\sqrt{n}$:

$$\frac{1}{\sqrt{n}} \frac{\partial}{\partial \theta} \log L \;\approx\; \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(x_i \mid \hat{\theta}) \;+\; \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2}{\partial \theta^2} \log f(x_i \mid \hat{\theta}) \right] \sqrt{n} \, (\theta - \hat{\theta})$$

Now let's examine this expression. The first summand is $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(x_i \mid \hat{\theta})$, which is zero... because this is the equation (aside from the $\sqrt{n}$) which we solve to get $\hat{\theta}$!

Thus, we've reduced the relationship to this:

$$\frac{1}{\sqrt{n}} \frac{\partial}{\partial \theta} \log L \;\approx\; \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2}{\partial \theta^2} \log f(x_i \mid \hat{\theta}) \right] \sqrt{n} \, (\theta - \hat{\theta})$$

We can write out the left side, too:

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(x_i \mid \theta) \;\approx\; \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2}{\partial \theta^2} \log f(x_i \mid \hat{\theta}) \right] \sqrt{n} \, (\theta - \hat{\theta})$$

Based on $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(x_i \mid \theta)$, we can assert the Central Limit Theorem! After all, it's the sum of $n$ independent, identically distributed things, scaled by $\sqrt{n}$. As each summand has mean zero, this limiting distribution is $N\!\left(0, \mathrm{Var}\!\left[\frac{\partial}{\partial \theta} \log f(x_i \mid \theta)\right]\right)$, or $N(0, I(\theta))$.

Thus, we decide that the limiting distribution of

$$\left[ \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2}{\partial \theta^2} \log f(x_i \mid \hat{\theta}) \right] \sqrt{n} \, (\theta - \hat{\theta})$$

must also be $N(0, I(\theta))$. Let's rewrite this as

$$\left[ -\frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2}{\partial \theta^2} \log f(x_i \mid \hat{\theta}) \right] \sqrt{n} \, (\hat{\theta} - \theta)$$

Watch the $\sqrt{n}$'s and the minus signs.
The expression in the brackets certainly converges to $I(\theta)$; remember the calculating forms for $I(\theta)$ and also the law of large numbers. Thus our result comes down to

$$I(\theta) \, \sqrt{n} \, (\hat{\theta} - \theta) \;\sim\; N(0, I(\theta))$$

Certainly we can express this as

$$\sqrt{n} \, (\hat{\theta} - \theta) \;\sim\; N\!\left(0, \frac{1}{I(\theta)}\right)$$

This is of course the statement for the asymptotic normality of the maximum likelihood estimate. It should be noted that many approximations were made. Also, we left several mathematical nuances untouched. Nonetheless, this demonstration shows the essential features of the proof that maximum likelihood estimates are asymptotically normal with variance $\frac{1}{I(\theta)}$.

**This was not covered in class.** Let's note (but not get too excited about) the exponential family. This says that some densities can be factored as

$$f(x \mid \theta) \;=\; e^{c(\theta) T(x)} \, A(\theta) \, B(x)$$

Indeed, this reflects the use of three factors that we had done in the factorization theorem to identify sufficient statistics, except that here the link between $\theta$ and $x$ occurs as products in the exponent. The remaining factors are somewhat arbitrary. In fact, Rice puts them in the exponent as

$$f(x \mid \theta) \;=\; e^{c(\theta) T(x) + d(\theta) + S(x)}$$

In any event, this clearly identifies $T(x)$ as the sufficient statistic. Rice then goes on to show that in a sample of $n$, we have $\sum_{i=1}^{n} T(X_i)$ as the sufficient statistic.

We note the following:

(1) Many common probability laws are of the exponential family form. Rice shows several examples. For instance, the Poisson law $f(x \mid \lambda) = e^{-\lambda} \lambda^x / x! = e^{x \log \lambda - \lambda - \log x!}$ has $T(x) = x$, $c(\lambda) = \log \lambda$, $d(\lambda) = -\lambda$, and $S(x) = -\log x!$.

(2) This idea applies as well to $X$ which are not iid samples.

(3) There is a huge body of statistical theory exploiting the notions of the exponential family.

**end of commentary**

Finally, we have the Rao-Blackwell theorem. We have a handout on this. Stress that the major use of the Rao-Blackwell theorem is not that of finding estimates. The actual use is the theoretical assurance that the best procedures are based on sufficient statistics.

Now let's do the NP lemma. Neyman and Pearson published their fundamental lemma in 1933. It solved a very neat problem in a very clever mathematical way.
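Before going further, the asymptotic-normality result derived above is easy to check numerically. Here is a minimal simulation sketch (not from the notes; the model choice is an assumption made for illustration — an exponential law with rate $\theta$, for which $I(\theta) = 1/\theta^2$, so $\sqrt{n}(\hat{\theta} - \theta)$ should be approximately $N(0, \theta^2)$):

```python
import random
import statistics

# Numerical check of sqrt(n)*(theta_hat - theta) ~ N(0, 1/I(theta)).
# Model (assumed for this sketch): exponential with rate theta,
#   f(x|theta) = theta * exp(-theta * x),  with I(theta) = 1/theta^2,
# so the limiting variance 1/I(theta) equals theta^2. The MLE is 1/xbar.

random.seed(1)
theta = 2.0      # true rate
n = 500          # sample size per replication
reps = 5000      # number of simulated samples

scaled_errors = []
for _ in range(reps):
    xbar = sum(random.expovariate(theta) for _ in range(n)) / n
    theta_hat = 1.0 / xbar                       # maximum likelihood estimate
    scaled_errors.append(n ** 0.5 * (theta_hat - theta))

print("mean     :", round(statistics.mean(scaled_errors), 3))      # near 0
print("variance :", round(statistics.variance(scaled_errors), 3))  # near theta^2 = 4
```

With the seed fixed, the sample variance of the scaled errors should land close to $\theta^2 = 4$, which is exactly the $1/I(\theta)$ predicted by the derivation.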
This approach has formed the basis for statistical work in all of the empirical sciences. It created an enormous tidal wave of jargon:

Null hypothesis
Alternative hypothesis
Type I error
Type II error
Statistically significant
Not statistically significant
Level of significance
Power of test
Power function
Operating characteristic (OC) curve
P-value

A neat advantage of the Neyman-Pearson approach is that it is a well-defined, textbook-citable method.

The hypothesis-testing game has some peculiar aspects. Here are some aspects which ought to be emphasized:

$H_0$ and $H_A$ (or $H_1$) are not exchangeable. $H_0$ must contain the "=" part of the problem. $H_A$ is the interesting thing you'd like to show.

Rejecting $H_0$ in favor of $H_A$ is really exciting. Accepting $H_0$ just means that you can't reject $H_0$. This is the nonsignificance case. Rejecting $H_0$ in favor of $H_A$ is the significant case. You have significant evidence. Your results are statistically significant.

Hypothesis testing can obtain significant evidence that $\theta \ne \theta_0$. It cannot obtain significant evidence that $\theta$ really is equal to $\theta_0$.

With large quantities of data, you will always be able to reject a null hypothesis of exact equality. (That is, in a problem like $H_0: \theta = \theta_0$ versus $H_A: \theta \ne \theta_0$, a large sample size will always reject $H_0$.)

If your objective is to accept $H_0$, you can accomplish this with a small sample size. Essentially, you've run an experiment with power too small to decide in favor of $H_A$.

It is certainly interesting that the methodology has grown in odd ways over the last 65+ years. In most sciences, including social sciences, quantitative results must be subjected to statistical hypothesis tests. There is an extreme prejudice against reporting nonsignificant results.

OK... now let's look at the Neyman-Pearson problem. It says that you have data $X$ and model $f(x \mid \theta)$ and two statements $H_0: \theta = \theta_0$ and $H_A: \theta = \theta_1$. It is assumed that the model is fixed aside from choice of $\theta$. Thus...
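The point above — that a large enough sample always rejects an exact-equality null — can be illustrated with a small simulation. This sketch is not from the notes; the setup (a one-sample z-test of $H_0: \mu = 0$ with known $\sigma = 1$ and a practically negligible true mean of 0.02) is invented for illustration:

```python
import math
import random

# With a huge sample, an exact-equality null is (almost) always rejected
# when the true parameter is even slightly off the hypothesized value.
# Setup (invented for this sketch): one-sample z-test of H0: mu = 0,
# known sigma = 1, true mean 0.02. We simulate xbar directly, using
# xbar ~ N(mu, sigma/sqrt(n)), rather than drawing n observations.

random.seed(7)
true_mu, sigma, alpha = 0.02, 1.0, 0.05

def two_sided_p(n):
    xbar = random.gauss(true_mu, sigma / math.sqrt(n))
    z = math.sqrt(n) * xbar / sigma
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value: 2*(1 - Phi(|z|))

results = {}
for n in (100, 10_000, 1_000_000):
    results[n] = sum(two_sided_p(n) < alpha for _ in range(200))
    print(f"n = {n:>9}: rejected H0 in {results[n]}/200 simulations")
```

At n = 100 the rejection rate sits near the nominal 5%; by n = 1,000,000 the test rejects every time, even though the true mean differs from 0 by an amount nobody would care about.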
Covered by this problem: a sample from $N(\theta, 25)$, from Bin(50, $\theta$), or from $N(100, \sigma^2)$ with $\sigma$ unknown. Also covered would be $N(\mu, \sigma^2)$ with $H_0: \mu = 10, \sigma = 2$ versus $H_A: \mu = 12, \sigma = 4$.

NOT covered by this problem: $N(\mu, \sigma^2)$ with $H_0$ and $H_A$ discussing $\mu$ only.

Neyman and Pearson set up an embarrassment level, $\alpha$. This represents the maximum probability of rejecting the null when it is true, the Type I error. They claimed to have a "best test." This means that any competitor test cannot be better. To be a competitor (and have a chance at being better), a test had to have a Type I error probability which was at most $\alpha$. NP would then show that the competitor must do worse on Type II error. Details are on the handout. The handout uses "rejection set" whereas many use "critical region."

The NP procedure is based on the likelihood ratio $\frac{f(x \mid \theta_1)}{f(x \mid \theta_0)}$, which we might also write as $\frac{f_1(x)}{f_0(x)}$.

The technology generalizes. We can prove optimality of this sort of stuff for other types of hypotheses and for more complicated likelihoods.

The Neyman-Pearson paradigm is useful for simple versus simple situations. However, a great many of our tests are of the form $H_0: \theta = \theta_0$ versus $H_A: \theta \ne \theta_0$. We need another kind of tool.

We could of course ask at this point where our common tests come from. Suppose that we have $X_1, X_2, \ldots, X_n$ from $N(\mu, \sigma^2)$ and we want to test (say) $H_0: \mu = 80$ versus $H_1: \mu \ne 80$. Our method of doing this consists of the following logical process.

(1) Seek a statistic $T(X)$ such that we can easily see what kinds of values of $T$ suggest $H_0$ and what kind of values of $T$ suggest $H_1$.

(2) Obtain the distribution of $T(X)$ when $H_0$ is true. This is the hard part.

(3) Based on the distribution obtained in (2), calibrate a cutoff rule so that $P[\, T(X) \in \text{rejection set} \mid H_0 \text{ true} \,] = \alpha$.

For the one-sample problem above, the statistic is $T(X) = \frac{\sqrt{n}\,(\bar{X} - 80)}{s}$, and the nature of the reject set is described by $|T(X)| \ge t_{\alpha/2;\, n-1}$.

The logic of hypothesis testing (as generally covered in textbooks) goes through these steps:

(1) Prove the Neyman-Pearson lemma for the situation of testing simple $H_0$ versus simple $H_1$. This is nearly always done for a one-parameter problem, such as for the mean of a normal population with known variance.

(2) Consider one-parameter problems of the form $H_0: \theta = \theta_0$ versus $H_1: \theta > \theta_0$. For a large class of problems, we can work out the Neyman-Pearson test of $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ at some particular $\theta_1 > \theta_0$. If it turns out that the form of the test does not depend on the particular $\theta_1$, then the test is described as UMP (uniformly most powerful).

(3) As the point made in (2) does not generalize to multi-parameter problems, we develop other concepts such as "similar" tests or "invariant" tests.

(4) Eventually we come to the likelihood ratio test principle.

The likelihood ratio test idea is motivated by the problem $H_0: \theta \in \omega_0$ versus $H_A: \theta \in \omega_A$. As Neyman and Pearson showed us that likelihood ratios are a good idea, we base the test on

$$\Lambda \;=\; \frac{\max_{\theta \in \omega_0} f(x \mid \theta)}{\max_{\theta \in \omega_0 \cup \omega_A} f(x \mid \theta)}$$

Then we reject $H_0$ if $\Lambda \le k$, choosing $k$ to adjust the level of significance. The procedure comes down to these difficult steps:

(1) Do the maximization in the numerator.

(2) Do the maximization in the denominator.

(3) Obtain the distribution of $\Lambda$ under $H_0$, as this is essential to the problem of setting $k$ to fix the level of significance.

We can certainly work through examples of this technique to derive most of the common statistical tests. However, the real use comes in the non-standard cases, where we can work with an asymptotic result on $\Lambda$.

What's a non-standard case? Suppose that $X_1, X_2, \ldots, X_n$ is a sample from a $p$-variable normal distribution with mean vector $\boldsymbol{\mu}$ and variance matrix $\Sigma$. Consider the test of $H_0$: $\Sigma$ = diagonal versus $H_A$: $\Sigma$ = arbitrary positive definite matrix. This is pretty much hopeless by any other criterion.
The limiting result is that, under $H_0$, $-2 \log \Lambda \sim \chi^2$ with degrees of freedom equal to the number of parameters being debated between $H_0$ and $H_1$.

OK... let's see this in action. Let's see some examples of the likelihood ratio test in action. (This is on a handout.)

Suppose that $X_1, X_2, \ldots, X_m$ is a sample from a Poisson distribution with parameter $\lambda$, while $Y_1, Y_2, \ldots, Y_n$ is a sample from a Poisson distribution with parameter $\mu$. We wish to test $H_0: \lambda = \mu$ versus $H_A: \lambda \ne \mu$.

The likelihood is

$$L \;=\; \left[ \prod_{i=1}^{m} \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} \right] \left[ \prod_{j=1}^{n} \frac{e^{-\mu} \mu^{y_j}}{y_j!} \right]$$

In the numerator of the likelihood ratio test statistic, there is only one sample of size $m + n$ with a common parameter. The maximum likelihood estimate for the common parameter is

$$\hat{\lambda}_0 \;=\; \frac{\sum_{i=1}^{m} X_i + \sum_{j=1}^{n} Y_j}{m + n}$$

The maximized likelihood for $H_0$, for the numerator, is

$$\left[ \prod_{i=1}^{m} \frac{e^{-\hat{\lambda}_0} \hat{\lambda}_0^{x_i}}{x_i!} \right] \left[ \prod_{j=1}^{n} \frac{e^{-\hat{\lambda}_0} \hat{\lambda}_0^{y_j}}{y_j!} \right]$$

In the denominator, the parameters $\lambda$ and $\mu$ are allowed to be different. Thus, the maximum likelihood estimates are $\hat{\lambda}_A = \bar{X}$ and $\hat{\mu}_A = \bar{Y}$. The maximized likelihood for the denominator is

$$\left[ \prod_{i=1}^{m} \frac{e^{-\hat{\lambda}_A} \hat{\lambda}_A^{x_i}}{x_i!} \right] \left[ \prod_{j=1}^{n} \frac{e^{-\hat{\mu}_A} \hat{\mu}_A^{y_j}}{y_j!} \right]$$

This allows us to write the likelihood ratio as

$$\Lambda \;=\; \frac{\left[ \prod_{i=1}^{m} \frac{e^{-\hat{\lambda}_0} \hat{\lambda}_0^{x_i}}{x_i!} \right] \left[ \prod_{j=1}^{n} \frac{e^{-\hat{\lambda}_0} \hat{\lambda}_0^{y_j}}{y_j!} \right]}{\left[ \prod_{i=1}^{m} \frac{e^{-\hat{\lambda}_A} \hat{\lambda}_A^{x_i}}{x_i!} \right] \left[ \prod_{j=1}^{n} \frac{e^{-\hat{\mu}_A} \hat{\mu}_A^{y_j}}{y_j!} \right]} \;=\; \frac{e^{-(m+n)\hat{\lambda}_0} \, \hat{\lambda}_0^{\sum_i x_i + \sum_j y_j}}{e^{-m\hat{\lambda}_A - n\hat{\mu}_A} \, \hat{\lambda}_A^{\sum_i x_i} \, \hat{\mu}_A^{\sum_j y_j}}$$

(think why the exponentials cancel: $m\hat{\lambda}_A + n\hat{\mu}_A = \sum_i x_i + \sum_j y_j = (m+n)\hat{\lambda}_0$). Thus

$$\Lambda \;=\; \left( \frac{\hat{\lambda}_0}{\hat{\lambda}_A} \right)^{\sum_{i=1}^{m} x_i} \left( \frac{\hat{\lambda}_0}{\hat{\mu}_A} \right)^{\sum_{j=1}^{n} y_j}$$

The analysis of the distribution of this is hopeless. However, we can use the $-2 \log \Lambda$ rule. Here it's chi-squared with 1 degree of freedom. The alternative space is described by two parameters ($\lambda$ and $\mu$), while the null space is described by one (the single common value of $\lambda$ and $\mu$). The difference is 1.
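The derivation above translates directly into code. Here is a sketch of the two-sample Poisson likelihood ratio test (the sample sizes, the common rate of 4.0, and the simple Poisson sampler are all invented for illustration):

```python
import math
import random

# Two-sample Poisson likelihood ratio test, following the derivation above:
# Lambda = (lam0/lamA)^(sum x) * (lam0/muA)^(sum y),
# and -2 log Lambda ~ chi-squared with 1 degree of freedom under H0.

def neg2_log_lambda(x, y):
    m, n = len(x), len(y)
    lam0 = (sum(x) + sum(y)) / (m + n)   # common MLE under H0
    lamA = sum(x) / m                    # separate MLEs under HA
    muA = sum(y) / n
    log_lam = sum(x) * math.log(lam0 / lamA) + sum(y) * math.log(lam0 / muA)
    return -2.0 * log_lam

def poisson(lam):
    """Simple Poisson sampler (Knuth's multiplication method); fine for small lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

random.seed(3)

# Data generated with equal rates, so H0 is true and -2 log Lambda
# should look like a single chi-squared(1) draw.
x = [poisson(4.0) for _ in range(60)]
y = [poisson(4.0) for _ in range(80)]
stat = neg2_log_lambda(x, y)
print("-2 log Lambda =", stat)
print("reject at 5% level?", stat > 3.841)   # chi-squared(1) critical value
```

Note that the statistic is always nonnegative (the numerator maximizes over a subset of the denominator's space, so $\Lambda \le 1$), it is exactly zero when $\bar{X} = \bar{Y}$, and it blows up when the two sample means are far apart.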