Session 5
The Score Statistic
Contents

General Properties of ML Estimators
The Quadratic Approximation
The Central Limit Theorem
Properties of the Likelihood Function as a Statistic
The Score Statistic
Asymptotic Distribution of an MLE
Normal Distribution
Binomial Distribution
The Wald Test
The Chi-Squared Distribution
Binomial Example
Parameter Transformations
Practical 5
Appendix B: Consistency of ML Estimators
5. The Score Statistic
General Properties of ML Estimators
Although we have a general method for deriving estimators, each problem seems to generate its own unique solution. Can we say anything in general about the estimators? It turns out that we can say a lot, but only as n → ∞ (asymptotically), that is, in the limit. In particular, we can show that ML estimators are consistent (see Appendix B). However, the results provide useful approximations that are good for moderate sample sizes. The classic example of statistical inference about μ in the Normal distribution is the ideal, in that we can solve all problems exactly; in this case the log-likelihood is quadratic. Whether the asymptotic results provide a good approximation hinges on whether the log-likelihood is approximately quadratic in the region of its maximum. In theoretical treatments of ML it is often stated that "under regularity conditions" a particular result applies. These conditions often have an abstract mathematical definition. We will take a regular problem to mean one in which a quadratic approximation to the log-likelihood in the region of the maximum applies. The better this approximation is, the better the limiting results will be as approximations to the exact ones. The reason we need these approximations is that the mathematics very quickly becomes intractable once we consider more complex probability models, so that we cannot derive exact sampling distributions.
The Quadratic Approximation
From now on we will consider that the Likelihood has had irrelevant constants
removed and is expressed as a relative likelihood. Then the quadratic
approximation we are considering can be written as
log RL(y | θ) = log [ L(y | θ) / L(y | θ̂) ] ≈ a + bθ + cθ²
However, we know that this quadratic must satisfy some conditions. Firstly, it has its maximum at θ = θ̂; secondly, it must be zero at the maximum, since we are considering the relative likelihood; and finally, the value of c must be negative for a maximum. Thus we will rewrite the quadratic as:

−½ I (θ − θ̂)²
The reason for the particular choice of form will become clear.
Noting that the denominator in RL is not a function of θ, the derivative is

∂ log L(y | θ) / ∂θ ≈ −I(θ − θ̂)
which is approximately linear in the MLE. This simple result will be crucial in
the development of the asymptotic properties.
Note also that the second derivative is
∂² log L / ∂θ² = −I
The first derivative when thought of as a function of the data is called the
score statistic. To derive its statistical properties we need the Central Limit
Theorem.
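Before that, here is a small numerical sketch (not part of the original notes) of how well the quadratic approximation −½I(θ − θ̂)² can track an exact log relative likelihood, using the binomial case with N = 25 and y = 10 that is plotted later in these notes:

```python
import numpy as np

# Binomial example with N = 25, y = 10 (the case plotted later in these notes)
N, y = 25, 10
p_hat = y / N                           # MLE of p
I = N / (p_hat * (1 - p_hat))           # expected information evaluated at p_hat

def log_rel_lik(p):
    """Exact log relative likelihood: log L(y|p) - log L(y|p_hat)."""
    return y * np.log(p / p_hat) + (N - y) * np.log((1 - p) / (1 - p_hat))

for p in (0.2, 0.3, 0.4, 0.5, 0.6):
    exact = log_rel_lik(p)
    quad = -0.5 * I * (p - p_hat) ** 2  # quadratic approximation
    print(f"p = {p:.1f}   exact = {exact:7.3f}   quadratic = {quad:7.3f}")
```

The agreement is good near p̂ = 0.4 and deteriorates further away, which is exactly the behaviour the regularity discussion above is concerned with.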
The Central Limit Theorem (CLT)
Consider a set of independent and identically distributed (i.i.d.) variables Y1, Y2, …, Yn. The Central Limit Theorem states that the distribution of

√n (Ȳ − μ) / σ

(where Ȳ is the sample mean and μ and σ² are the common mean and variance) tends to the N(0, 1) distribution as n → ∞.
Less formally, we say that the distribution of ΣYᵢ tends to the Normal distribution.
In fact, the result applies under less strict conditions: we can relax the condition that the variables have the same distribution, so long as we can guarantee that no single variable dominates the sum.
In practice, the distribution can become close to Normality for surprisingly small values of n.
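As an illustration, here is a minimal simulation sketch (not in the original notes, in the spirit of the CLT spreadsheet used in Practical 5); the Exponential(1) distribution and the sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Standardised mean of n i.i.d. Exponential(1) variables (mu = 1, sigma = 1).
# The skewness visible at n = 1 largely disappears by n = 30.
for n in (1, 4, 30):
    means = rng.exponential(1.0, size=(100_000, n)).mean(axis=1)
    z = np.sqrt(n) * (means - 1.0) / 1.0
    lower = np.mean(z < -1.96)
    upper = np.mean(z > 1.96)
    print(f"n = {n:2d}   P(z < -1.96) = {lower:.3f}   P(z > 1.96) = {upper:.3f}   (N(0,1): 0.025 each)")
```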
Properties of the Likelihood Function as a Statistic
From the fundamental property that the likelihood, viewed as a function of the data y, is a probability distribution, we have

∫ L(y | θ) dy = 1
Differentiating this once and taking expectations we obtain

E(∂ log L / ∂θ) = 0
Differentiating again and taking expectations we obtain

E[(∂ log L / ∂θ)²] = −E(∂² log L / ∂θ²)
See Appendix A for details.
The Score Statistic
Define the score statistic S as
S = ∂ log L(y | θ) / ∂θ
Note we are considering this as a function of y and hence a statistic.
From the previous results we see that
E(S) = 0
var(S) = E(S²) = E[(∂ log L / ∂θ)²] = −E(∂² log L / ∂θ²) = I
This latter value, I, is called the expected information.
Expressing S as a sum
S = ∂ log f(y₁ | θ)/∂θ + ∂ log f(y₂ | θ)/∂θ + … + ∂ log f(yₙ | θ)/∂θ
then we see that we can use the CLT to show that the asymptotic distribution
of S is
S ~ N(0, I)
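A simulation sketch (not in the notes) checking E(S) = 0 and var(S) = I, using the binomial score y/p − (N − y)/(1 − p) that is derived in the Binomial Distribution section below; the values of N and p are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 0.3                       # illustrative true values

y = rng.binomial(N, p, size=200_000)
S = y / p - (N - y) / (1 - p)        # score evaluated at the true p
I = N / (p * (1 - p))                # expected information

print(f"mean(S) = {S.mean():7.3f}   (theory: 0)")
print(f"var(S)  = {S.var():7.1f}   (theory: I = {I:.1f})")
```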
Asymptotic Distribution of an MLE
Since for regular problems we can approximate the score function (here now thought of as a function of θ) by a linear function,

S(θ) ≈ −I(θ − θ̂)
Hence

θ̂ ≈ θ + S / I
Thus the approximate limiting distribution of the MLE is
θ̂ ~ N(θ, I⁻¹)
In fact, we can show that the quadratic approximation gets better as n → ∞, so that this result is the limiting distribution of the MLE.
We see that I⁻½ is the approximate standard error of the MLE.
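As a check on θ̂ ~ N(θ, I⁻¹), here is a simulation sketch using an Exponential(λ) model (exponential variables also appear in Practical 5, but this particular worked case is not in the notes). For this model the MLE is λ̂ = 1/ȳ and the expected information is I = n/λ², so the approximate standard error is λ/√n:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n = 2.0, 50                                   # illustrative true rate and sample size

# MLE of the exponential rate, lambda_hat = 1 / ybar, over many repeated samples
samples = rng.exponential(scale=1.0 / lam, size=(200_000, n))
lam_hat = 1.0 / samples.mean(axis=1)

se_theory = lam / np.sqrt(n)                       # I**(-1/2) with I = n / lambda**2
print(f"mean(lam_hat) = {lam_hat.mean():.3f}   (true lambda: {lam})")
print(f"sd(lam_hat)   = {lam_hat.std():.3f}   (asymptotic s.e.: {se_theory:.3f})")
```

The small upward bias in the mean is a finite-sample effect; it disappears as n grows, in line with the limiting result.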
Normal Distribution
The inference for the Normal mean is exact, but it is a useful example to clarify the ideas. Considering only the estimation of the parameter μ, the score statistic is

S = ∂ log L / ∂μ = (ȳ − μ) / (σ²/n)
We also need the expected information I:

∂² log L / ∂μ² = −n/σ²

I = −E(∂² log L / ∂μ²) = n/σ²
so the asymptotic distribution would be

S ~ N(0, n/σ²)
In this case this is exact for all n.
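A simulation sketch (not in the notes) confirming that for the Normal mean the score has exactly this distribution; μ, σ and n are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 5.0, 2.0, 10            # illustrative true values

ybar = rng.normal(mu, sigma, size=(200_000, n)).mean(axis=1)
S = (ybar - mu) / (sigma**2 / n)       # score for the Normal mean
I = n / sigma**2                       # expected information

print(f"mean(S) = {S.mean():6.3f}   (theory: 0)")
print(f"var(S)  = {S.var():6.3f}   (theory: I = {I:.2f})")
```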
Binomial Distribution
Following the same procedure for the estimation of p
log L = k + y log p + (N − y) log(1 − p)

S = ∂ log L / ∂p = y/p − (N − y)/(1 − p) = Np̂/p − N(1 − p̂)/(1 − p) = N(p̂ − p) / [p(1 − p)]
Note that this looks far from linear in p. However, proceeding
I = −E(∂² log L / ∂p²) = E[ y/p² + (N − y)/(1 − p)² ] = Np/p² + N(1 − p)/(1 − p)² = N / [p(1 − p)]
Note that here we need the value of p in both S and I. The standard
procedure is to substitute the MLE for p. This can be shown to converge in
probability to the required values. So we use
I(p̂) = N / [p̂(1 − p̂)]
The Wald Test
We can use this result directly to provide an approximate hypothesis test in
moderately sized samples by using the standard procedure for the Normal
distribution. The Wald statistic is defined from S², so it needs the distribution of the square of a Normal variable (the χ², or chi-squared, distribution). The rationale for this is that it can be generalised to multi-parameter situations.
The Chi-Squared Distribution
Consider a sequence Y1, …, Yd of d independent standard Normal random variables (i.e. from N(0,1)); then the random variable T defined by

T = Y1² + Y2² + … + Yd²
has the chi-squared distribution with d degrees of freedom.
Thus the simple Wald test with one parameter uses the chi-squared
distribution with one degree of freedom. However, we need to take into account that S does not have variance 1. To make it conform we define the Wald
statistic, W, as
W = S² / I
which asymptotically has the chi-squared distribution on one degree of
freedom.
Thus a 5% significance test would reject when W > 3.84.
One disadvantage of the Wald test is that it always corresponds to a two-sided alternative. An alternative procedure (sometimes also called a Wald test) is to consider S directly as a Normal variate, using

z = S / √I ~ N(0, 1)
This allows a one-sided test to be performed.
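A sketch of both versions of the test for the binomial case worked in the next section (the function name binomial_wald and its arguments are illustrative, not from the notes; following the notes, the unknown p in the variance is replaced by the MLE p̂):

```python
import math

def binomial_wald(y, N, p0):
    """Wald-type test of H0: p = p0 for a binomial count y out of N.

    Returns (W, z), where z = (p_hat - p0) / sqrt(p_hat (1 - p_hat) / N),
    i.e. S / sqrt(I) with the MLE substituted for p, and W = z**2.
    """
    p_hat = y / N
    se = math.sqrt(p_hat * (1 - p_hat) / N)   # I(p_hat) ** (-1/2)
    z = (p_hat - p0) / se
    return z * z, z

# Illustrative call with arbitrary data; W > 3.84 rejects at the 5% level (two-sided),
# while z can be compared with one-sided Normal critical values.
W, z = binomial_wald(y=10, N=25, p0=0.5)
print(f"W = {W:.3f} (5% critical value 3.84),  z = {z:.3f}")
```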
Binomial Example
Taking the above results together we can perform a test of the Binomial
parameter p by using the asymptotic result:
p̂ ~ N(p, p̂(1 − p̂)/N)
and hence a test statistic:
z = (p̂ − p₀) / √[p̂(1 − p̂)/N]
The sex ratio of babies is expected to be 106 boys to 100 girls, giving a null value of p₀ = 106/206 = 0.515. We wish to test this null hypothesis against the one-sided alternative that p is less than this, based on a sample with 20 sons and 40 daughters. This gives p̂ = 20/60 = 0.333 and hence
z = (0.333 − 0.515) / 0.0609 = −2.978
The one-sided 1% significance value for the Normal distribution is -2.326, so
this is significant at the 1% level.
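The same calculation in code (a minimal sketch reproducing the arithmetic above):

```python
import math

N, y = 60, 20                       # 20 sons out of 60 babies
p0 = 106 / 206                      # null value 0.515 from the expected sex ratio
p_hat = y / N                       # 0.333

se = math.sqrt(p_hat * (1 - p_hat) / N)   # approximate standard error, about 0.0609
z = (p_hat - p0) / se                     # about -2.98
print(f"p_hat = {p_hat:.3f}, se = {se:.4f}, z = {z:.3f}")
# The one-sided 1% critical value is -2.326, so H0 is rejected at the 1% level.
```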
Parameter Transformations
The effectiveness of the above test procedure depends critically on the
accuracy of the quadratic approximation to the log-likelihood. When this
approximation is poor the test can be seriously in error. However, by the
invariance principle we can transform the parameter in any way we like
without affecting the true inference. It will, though, change the shape of the likelihood function. We can attempt to take advantage of this by finding a transformation that improves the quadratic approximation and hence makes the approximate tests closer to the true inference.
A popular choice of transformation for the Binomial parameter p is the logit
function defined by:
θ = log[ p / (1 − p) ]

with inverse p = e^θ / (1 + e^θ) (the logistic function).
We can examine the effects of this transformation on the RL worksheet in the
Excel spreadsheet (or in more detail in the BinoML spreadsheet). Here are
graphs of logRL against p and against logit(p).
[Figure: log RL plotted against p]
[Figure: log RL plotted against logit(p)]
These graphs are for the case N=25 and y=10. Superimposed is the quadratic
approximation. We see that the logit transformed parameter results in a better
quadratic approximation to the log-likelihood.
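A sketch (assuming matplotlib and NumPy are available) that reproduces these two graphs together with their quadratic approximations, as an alternative to the Excel RL worksheet:

```python
import numpy as np
import matplotlib.pyplot as plt

N, y = 25, 10
p_hat = y / N
theta_hat = np.log(p_hat / (1 - p_hat))

def log_rel_lik(p):
    """Log relative likelihood for the binomial, zero at p = p_hat."""
    return y * np.log(p / p_hat) + (N - y) * np.log((1 - p) / (1 - p_hat))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: log RL against p, with quadratic -0.5 * I_p * (p - p_hat)^2
p = np.linspace(0.01, 0.99, 400)
I_p = N / (p_hat * (1 - p_hat))
ax1.plot(p, log_rel_lik(p), label="log RL")
ax1.plot(p, -0.5 * I_p * (p - p_hat) ** 2, "--", label="quadratic")
ax1.set(xlabel="p", ylabel="log RL", ylim=(-4, 0.2))
ax1.legend()

# Right panel: log RL against logit(p), with quadratic using I = N * p_hat * (1 - p_hat)
theta = np.linspace(-2, 2, 400)
p_of_theta = np.exp(theta) / (1 + np.exp(theta))
I_theta = N * p_hat * (1 - p_hat)
ax2.plot(theta, log_rel_lik(p_of_theta), label="log RL")
ax2.plot(theta, -0.5 * I_theta * (theta - theta_hat) ** 2, "--", label="quadratic")
ax2.set(xlabel="logit(p)", ylabel="log RL", ylim=(-4, 0.2))
ax2.legend()

plt.tight_layout()
plt.show()
```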
By the invariance principle we do not have to start from scratch, but we do
have to adjust the expected information using:
E(∂² ln L / ∂θ²) = (dp/dθ)² E(∂² ln L / ∂p²)

where dp/dθ = e^θ / (1 + e^θ)² = p(1 − p). Hence

I = Np(1 − p)
This gives the approximate distribution:

θ̂ ~ N(θ, 1 / [Np̂(1 − p̂)])
Hence the test for the previous example would be:
θ̂ = −0.693,  θ₀ = 0.058,  √I = √[Np̂(1 − p̂)] = 3.651

giving a test statistic

z = (−0.693 − 0.058) × 3.651 = −2.744
This is still significant at the 1% level.
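The corresponding calculation in code (a sketch reproducing the logit-scale arithmetic):

```python
import math

N, y = 60, 20
p_hat, p0 = y / N, 106 / 206

theta_hat = math.log(p_hat / (1 - p_hat))      # about -0.693
theta0 = math.log(p0 / (1 - p0))               # about  0.058
I = N * p_hat * (1 - p_hat)                    # expected information on the logit scale

z = (theta_hat - theta0) * math.sqrt(I)        # about -2.74
print(f"theta_hat = {theta_hat:.3f}, theta0 = {theta0:.3f}, z = {z:.3f}")
```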
Practical 5
1. Using the RVLS web site Sampling demonstration, “paint” a discrete
distribution and examine the effect on the sampling distribution of the
mean.
2. Use the CLT sheet to examine the behaviour of sums of random variables.
Add more columns to make sums of 4 random variables in the cases of
uniform and exponential distributions. How do these results differ for the
two cases?
3. Use the RL sheet to examine how good the quadratic approximation is for the logRL function for various values of N and y. Compare this with the graph using the logit transform, to see whether this transform always gives a better approximation.
4. Using the approximate distribution of p̂, test the hypothesis H0: p = 0.2 in the ESP experiment if we observe the number of correct responses, Y = 9.
5. Repeat the Binomial examples for data based on a sample of 58 licensed
male divers in south-east Australia who fathered 85 girls and 45 boys.
Appendix A: Expected Information
We start by noting that

∫ L(y) dy = 1

Now

∂ log L / ∂θ = (1/L) ∂L/∂θ

and

L ∂ log L / ∂θ = ∂L / ∂θ    (Equ 1)

Differentiating ∫ L(y) dy = 1:

∂/∂θ ∫ L(y) dy = 0

Interchanging differentiation and integration:

∫ ∂L/∂θ dy = ∫ L (∂ log L / ∂θ) dy = E(∂ log L / ∂θ) = 0

Differentiating (Equ 1) again:

L ∂² log L / ∂θ² + (∂L/∂θ)(∂ log L / ∂θ) = ∂²L / ∂θ²

Substituting (Equ 1) and integrating:

∫ L (∂² log L / ∂θ²) dy + ∫ L (∂ log L / ∂θ)² dy = 0

Hence

E(∂² log L / ∂θ²) = −E[(∂ log L / ∂θ)²]
Appendix B: Consistency of ML Estimators
Proof:

By definition, log L(y | θ̂) ≥ log L(y | θ).

Let the true value be θ₀ and define E₀ to be expectation at the true value. Consider θ* ≠ θ₀. Since the geometric mean is less than the arithmetic mean,

E₀[ log { L(y | θ*) / L(y | θ₀) } ] < log E₀[ L(y | θ*) / L(y | θ₀) ]

Now

E₀[ L(y | θ*) / L(y | θ₀) ] = ∫ { L(y | θ*) / L(y | θ₀) } L(y | θ₀) dy = 1

so

E₀[ log { L(y | θ*) / L(y | θ₀) } ] < 0

hence

E₀[ log L(y | θ*) ] < E₀[ log L(y | θ₀) ]

By the WLLN, as n → ∞, with probability → 1,

log L(y | θ*) < log L(y | θ₀)

but

log L(y | θ̂) ≥ log L(y | θ₀)

Thus as n → ∞, with probability → 1, L(y | θ̂) cannot take any value other than L(y | θ₀). QED