Session 5
The Score Statistic
Contents

General Properties of ML Estimators
The Quadratic Approximation
The Central Limit Theorem
Properties of the Likelihood Function as a Statistic
The Score Statistic
Asymptotic Distribution of an MLE
Normal Distribution
Binomial Distribution
The Wald Test
The Chi-Squared Distribution
Binomial Example
Parameter Transformations
Practical 5
Appendix B: Consistency of ML Estimators
5. The Score Statistic
General Properties of ML Estimators
Although we have a general method for deriving estimators, each problem seems to generate its own unique solution. Can we say anything in general about the estimators? It turns out that we can say a lot, but only as n → ∞ (asymptotically), that is, in the limit. In particular, we can show that ML estimators are consistent (see Appendix B). However, the results provide useful approximations that are good for moderate sample sizes. The classic example of statistical inference about μ in the Normal distribution is the ideal, in that we can solve all problems exactly; in this case the log-likelihood is quadratic. Whether the asymptotic results provide a good approximation hinges on whether the log-likelihood is approximately quadratic in the region of its maximum. In theoretical treatments of ML it is often stated that "under regularity conditions" a particular result applies. These conditions often have an abstract mathematical definition. We will take a regular problem to mean one in which a quadratic approximation to the log-likelihood in the region of the maximum applies. The better this approximation is, the better the limiting results will be as approximations to the exact ones. The reason we need these approximations is that the mathematics very quickly becomes intractable once we consider more complex probability models, so that we cannot derive exact sampling distributions.
The Quadratic Approximation
From now on we will consider that the Likelihood has had irrelevant constants
removed and is expressed as a relative likelihood. Then the quadratic
approximation we are considering can be written as
log RL(y | θ) = log [ L(y | θ) / L(y | θ̂) ] ≈ a + bθ + cθ²
However, we know that this quadratic must satisfy some conditions. Firstly, it has its maximum at θ = θ̂; secondly, it must be zero at the maximum, since we are considering the relative likelihood; and finally, the value of c must be negative for a maximum. Thus we will rewrite the quadratic as:

−½ I (θ − θ̂)²
The reason for the particular choice of form will become clear.
Noting that the denominator in RL is not a function of θ, the derivative is

∂ log L(y | θ) / ∂θ ≈ −I(θ − θ̂)
which is approximately linear in the MLE. This simple result will be crucial in
the development of the asymptotic properties.
Note also that the second derivative is
∂² log L / ∂θ² = −I
The first derivative when thought of as a function of the data is called the
score statistic. To derive its statistical properties we need the Central Limit
Theorem.
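Before that, here is a small numerical sketch (not part of the original notes) of how well the quadratic approximation −½I(θ − θ̂)² can track an exact log relative likelihood, using the binomial case with N = 25 and y = 10 that is plotted later in these notes:

```python
import numpy as np

# Binomial example with N = 25, y = 10 (the case plotted later in these notes)
N, y = 25, 10
p_hat = y / N                           # MLE of p
I = N / (p_hat * (1 - p_hat))           # expected information evaluated at p_hat

def log_rel_lik(p):
    """Exact log relative likelihood: log L(y|p) - log L(y|p_hat)."""
    return y * np.log(p / p_hat) + (N - y) * np.log((1 - p) / (1 - p_hat))

for p in (0.2, 0.3, 0.4, 0.5, 0.6):
    exact = log_rel_lik(p)
    quad = -0.5 * I * (p - p_hat) ** 2  # quadratic approximation
    print(f"p = {p:.1f}   exact = {exact:7.3f}   quadratic = {quad:7.3f}")
```

The agreement is good near p̂ = 0.4 and deteriorates further away, which is exactly the behaviour the regularity discussion above is concerned with.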
The Central Limit Theorem (CLT)
Consider a set of independent and identically distributed (i.i.d.) variables Y1, Y2, …, Yn. The Central Limit Theorem states that the distribution of

√n (Ȳ − μ) / σ

(where Ȳ is the sample mean and μ and σ² are the common mean and variance) tends to the N(0, 1) distribution as n → ∞.
Less formally, we say that the distribution of ΣYᵢ tends to the Normal distribution.
In fact, the result applies under less strict conditions: we can relax the condition that the variables have the same distribution, so long as we can guarantee that no single variable dominates the sum.
In practice, the distribution can become close to Normality for surprisingly small values of n.
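As an illustration, here is a minimal simulation sketch (not in the original notes, in the spirit of the CLT spreadsheet used in Practical 5); the Exponential(1) distribution and the sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Standardised mean of n i.i.d. Exponential(1) variables (mu = 1, sigma = 1).
# The skewness visible at n = 1 largely disappears by n = 30.
for n in (1, 4, 30):
    means = rng.exponential(1.0, size=(100_000, n)).mean(axis=1)
    z = np.sqrt(n) * (means - 1.0) / 1.0
    lower = np.mean(z < -1.96)
    upper = np.mean(z > 1.96)
    print(f"n = {n:2d}   P(z < -1.96) = {lower:.3f}   P(z > 1.96) = {upper:.3f}   (N(0,1): 0.025 each)")
```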
Properties of the Likelihood Function as a Statistic
From the fundamental property that the likelihood, viewed as a function of the data y, is a probability distribution, we have

∫ L(y | θ) dy = 1
Differentiating this once and taking expectations we obtain

E(∂ log L / ∂θ) = 0
Differentiating again and taking expectations we obtain

E[(∂ log L / ∂θ)²] = −E(∂² log L / ∂θ²)
See Appendix A for details.
The Score Statistic
Define the score statistic S as
S = ∂ log L(y | θ) / ∂θ
Note we are considering this as a function of y and hence a statistic.
From the previous results we see that
E(S) = 0
var(S) = E(S²) = E[(∂ log L / ∂θ)²] = −E(∂² log L / ∂θ²) = I
This latter value, I, is called the expected information.
Expressing S as a sum
S = ∂ log f(y₁ | θ)/∂θ + ∂ log f(y₂ | θ)/∂θ + … + ∂ log f(yₙ | θ)/∂θ
then we see that we can use the CLT to show that the asymptotic distribution
of S is
S ~ N(0, I)
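A simulation sketch (not in the notes) checking E(S) = 0 and var(S) = I, using the binomial score y/p − (N − y)/(1 − p) that is derived in the Binomial Distribution section below; the values of N and p are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 0.3                       # illustrative true values

y = rng.binomial(N, p, size=200_000)
S = y / p - (N - y) / (1 - p)        # score evaluated at the true p
I = N / (p * (1 - p))                # expected information

print(f"mean(S) = {S.mean():7.3f}   (theory: 0)")
print(f"var(S)  = {S.var():7.1f}   (theory: I = {I:.1f})")
```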
Asymptotic Distribution of an MLE
Since for regular problems we can approximate the score function (here now thought of as a function of θ) by a linear function,

S(θ) ≈ −I(θ − θ̂)
Hence

θ̂ ≈ θ + S / I
Thus the approximate limiting distribution of the MLE is
θ̂ ~ N(θ, I⁻¹)
In fact, we can show that the quadratic approximation gets better as n → ∞, so that this result is the limiting distribution of the MLE.
We see that I⁻½ is the approximate standard error of the MLE.
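As a check on θ̂ ~ N(θ, I⁻¹), here is a simulation sketch using an Exponential(λ) model (exponential variables also appear in Practical 5, but this particular worked case is not in the notes). For this model the MLE is λ̂ = 1/ȳ and the expected information is I = n/λ², so the approximate standard error is λ/√n:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n = 2.0, 50                                   # illustrative true rate and sample size

# MLE of the exponential rate, lambda_hat = 1 / ybar, over many repeated samples
samples = rng.exponential(scale=1.0 / lam, size=(200_000, n))
lam_hat = 1.0 / samples.mean(axis=1)

se_theory = lam / np.sqrt(n)                       # I**(-1/2) with I = n / lambda**2
print(f"mean(lam_hat) = {lam_hat.mean():.3f}   (true lambda: {lam})")
print(f"sd(lam_hat)   = {lam_hat.std():.3f}   (asymptotic s.e.: {se_theory:.3f})")
```

The small upward bias in the mean is a finite-sample effect; it disappears as n grows, in line with the limiting result.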
Normal Distribution
The inference for the Normal mean is exact, but it is a useful example to clarify the ideas. Considering only the estimation of the parameter μ, the score statistic is

S = ∂ log L / ∂μ = (ȳ − μ) / (σ²/n)
We also need the expected information I:

∂² log L / ∂μ² = −n/σ²

I = −E(∂² log L / ∂μ²) = n/σ²
so the asymptotic distribution would be

S ~ N(0, n/σ²)
In this case this is exact for all n.
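A simulation sketch (not in the notes) confirming that for the Normal mean the score has exactly this distribution; μ, σ and n are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 5.0, 2.0, 10            # illustrative true values

ybar = rng.normal(mu, sigma, size=(200_000, n)).mean(axis=1)
S = (ybar - mu) / (sigma**2 / n)       # score for the Normal mean
I = n / sigma**2                       # expected information

print(f"mean(S) = {S.mean():6.3f}   (theory: 0)")
print(f"var(S)  = {S.var():6.3f}   (theory: I = {I:.2f})")
```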
Binomial Distribution
Following the same procedure for the estimation of p
log L = k + y log p + (N − y) log(1 − p)

S = ∂ log L / ∂p = y/p − (N − y)/(1 − p) = Np̂/p − N(1 − p̂)/(1 − p) = N(p̂ − p) / [p(1 − p)]
Note that this looks far from linear in p. However, proceeding
I = −E(∂² log L / ∂p²) = E[ y/p² + (N − y)/(1 − p)² ] = Np/p² + N(1 − p)/(1 − p)² = N / [p(1 − p)]
Note that here we need the value of p in both S and I. The standard
procedure is to substitute the MLE for p. This can be shown to converge in
probability to the required values. So we use
I(p̂) = N / [p̂(1 − p̂)]
The Wald Test
We can use this result directly to provide an approximate hypothesis test in
moderately sized samples by using the standard procedure for the Normal
distribution. The Wald statistic is defined from S², so it needs the distribution of the square of a Normal variable (the χ², or chi-squared, distribution). The rationale for this is that it can be generalised to multi-parameter situations.
The Chi-Squared Distribution
Consider a sequence Y1, …, Yd of d independent standard Normal random variables (i.e. from N(0,1)); then the random variable T defined by

T = Y1² + Y2² + … + Yd²
has the chi-squared distribution with d degrees of freedom.
Thus the simple Wald test with one parameter uses the chi-squared
distribution with one degree of freedom. However, we need to take into account that S does not have variance 1. To make it conform we define the Wald
statistic, W, as
W = S² / I
which asymptotically has the chi-squared distribution on one degree of
freedom.
Thus a 5% significance test would reject when W > 3.84.
One disadvantage of the Wald test is that it always corresponds to a two-sided alternative. An alternative procedure (sometimes also called a Wald test) is to consider S directly as a Normal variate, using

z = S / √I ~ N(0, 1)
This allows a one-sided test to be performed.
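A sketch of both versions of the test for the binomial case worked in the next section (the function name binomial_wald and its arguments are illustrative, not from the notes; following the notes, the unknown p in the variance is replaced by the MLE p̂):

```python
import math

def binomial_wald(y, N, p0):
    """Wald-type test of H0: p = p0 for a binomial count y out of N.

    Returns (W, z), where z = (p_hat - p0) / sqrt(p_hat (1 - p_hat) / N),
    i.e. S / sqrt(I) with the MLE substituted for p, and W = z**2.
    """
    p_hat = y / N
    se = math.sqrt(p_hat * (1 - p_hat) / N)   # I(p_hat) ** (-1/2)
    z = (p_hat - p0) / se
    return z * z, z

# Illustrative call with arbitrary data; W > 3.84 rejects at the 5% level (two-sided),
# while z can be compared with one-sided Normal critical values.
W, z = binomial_wald(y=10, N=25, p0=0.5)
print(f"W = {W:.3f} (5% critical value 3.84),  z = {z:.3f}")
```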
Binomial Example
Taking the above results together we can perform a test of the Binomial
parameter p by using the asymptotic result:
p̂ ~ N(p, p̂(1 − p̂)/N)
and hence a test statistic:
z = (p̂ − p₀) / √[p̂(1 − p̂)/N]
The sex ratio of babies is expected to be 106 boys to 100 girls, giving a null value of p₀ = 106/206 = 0.515. We wish to test this null hypothesis against the one-sided alternative that p is less than this, based on a sample with 20 sons and 40 daughters. This gives p̂ = 20/60 = 0.333 and hence
z = (0.333 − 0.515) / 0.0609 = −2.978
The one-sided 1% significance value for the Normal distribution is -2.326, so
this is significant at the 1% level.
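The same calculation in code (a minimal sketch reproducing the arithmetic above):

```python
import math

N, y = 60, 20                       # 20 sons out of 60 babies
p0 = 106 / 206                      # null value 0.515 from the expected sex ratio
p_hat = y / N                       # 0.333

se = math.sqrt(p_hat * (1 - p_hat) / N)   # approximate standard error, about 0.0609
z = (p_hat - p0) / se                     # about -2.98
print(f"p_hat = {p_hat:.3f}, se = {se:.4f}, z = {z:.3f}")
# The one-sided 1% critical value is -2.326, so H0 is rejected at the 1% level.
```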
Parameter Transformations
The effectiveness of the above test procedure depends critically on the
accuracy of the quadratic approximation to the log-likelihood. When this
approximation is poor the test can be seriously in error. However, by the
invariance principle we can transform the parameter in any way we like
without affecting the true inference. It will, though, change the shape of the likelihood function. We can attempt to take advantage of this by finding a transformation that improves the quadratic approximation and hence makes the approximate tests closer to the true inference.
A popular choice of transformation for the Binomial parameter p is the logit
function defined by:
θ = log[ p / (1 − p) ]

with inverse p = e^θ / (1 + e^θ) (the logistic function).
We can examine the effects of this transformation on the RL worksheet in the
Excel spreadsheet (or in more detail in the BinoML spreadsheet). Here are
graphs of logRL against p and against logit(p).
[Figure: log RL plotted against p]
[Figure: log RL plotted against logit(p)]
These graphs are for the case N=25 and y=10. Superimposed is the quadratic
approximation. We see that the logit transformed parameter results in a better
quadratic approximation to the log-likelihood.
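A sketch (assuming matplotlib and NumPy are available) that reproduces these two graphs together with their quadratic approximations, as an alternative to the Excel RL worksheet:

```python
import numpy as np
import matplotlib.pyplot as plt

N, y = 25, 10
p_hat = y / N
theta_hat = np.log(p_hat / (1 - p_hat))

def log_rel_lik(p):
    """Log relative likelihood for the binomial, zero at p = p_hat."""
    return y * np.log(p / p_hat) + (N - y) * np.log((1 - p) / (1 - p_hat))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: log RL against p, with quadratic -0.5 * I_p * (p - p_hat)^2
p = np.linspace(0.01, 0.99, 400)
I_p = N / (p_hat * (1 - p_hat))
ax1.plot(p, log_rel_lik(p), label="log RL")
ax1.plot(p, -0.5 * I_p * (p - p_hat) ** 2, "--", label="quadratic")
ax1.set(xlabel="p", ylabel="log RL", ylim=(-4, 0.2))
ax1.legend()

# Right panel: log RL against logit(p), with quadratic using I = N * p_hat * (1 - p_hat)
theta = np.linspace(-2, 2, 400)
p_of_theta = np.exp(theta) / (1 + np.exp(theta))
I_theta = N * p_hat * (1 - p_hat)
ax2.plot(theta, log_rel_lik(p_of_theta), label="log RL")
ax2.plot(theta, -0.5 * I_theta * (theta - theta_hat) ** 2, "--", label="quadratic")
ax2.set(xlabel="logit(p)", ylabel="log RL", ylim=(-4, 0.2))
ax2.legend()

plt.tight_layout()
plt.show()
```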
By the invariance principle we do not have to start from scratch, but we do
have to adjust the expected information using:
E(∂² ln L / ∂θ²) = (dp/dθ)² E(∂² ln L / ∂p²)

where dp/dθ = e^θ / (1 + e^θ)² = p(1 − p). Hence

I = Np(1 − p)
This gives the approximate distribution:

θ̂ ~ N(θ, 1 / [Np̂(1 − p̂)])
Hence the test for the previous example would be:
θ̂ = −0.693,  θ₀ = 0.058,  √I = √[Np̂(1 − p̂)] = 3.651

giving a test statistic

z = (−0.693 − 0.058) × 3.651 = −2.744
This is still significant at the 1% level.
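The corresponding calculation in code (a sketch reproducing the logit-scale arithmetic):

```python
import math

N, y = 60, 20
p_hat, p0 = y / N, 106 / 206

theta_hat = math.log(p_hat / (1 - p_hat))      # about -0.693
theta0 = math.log(p0 / (1 - p0))               # about  0.058
I = N * p_hat * (1 - p_hat)                    # expected information on the logit scale

z = (theta_hat - theta0) * math.sqrt(I)        # about -2.74
print(f"theta_hat = {theta_hat:.3f}, theta0 = {theta0:.3f}, z = {z:.3f}")
```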
Practical 5
1. Using the RVLS web site Sampling demonstration, “paint” a discrete
distribution and examine the effect on the sampling distribution of the
mean.
2. Use the CLT sheet to examine the behaviour of sums of random variables.
Add more columns to make sums of 4 random variables in the cases of
uniform and exponential distributions. How do these results differ for the
two cases?
3. Use the RL sheet to examine how good the quadratic approximation is for the logRL function for various values of N and y. Compare this with the graph using the logit transform, to see whether this transform always gives a better approximation.
4. Using the approximate distribution of p̂, test the hypothesis H0: p = 0.2 in the ESP experiment if we observe the number of correct responses, Y = 9.
5. Repeat the Binomial examples for data based on a sample of 58 licensed
male divers in south-east Australia who fathered 85 girls and 45 boys.
Appendix A: Expected Information
We start by noting that

∫ L(y) dy = 1

Now

∂ log L / ∂θ = (1/L) ∂L/∂θ

and

L ∂ log L / ∂θ = ∂L / ∂θ    (Equ 1)

Differentiating ∫ L(y) dy = 1:

∂/∂θ ∫ L(y) dy = 0

Interchanging differentiation and integration:

∫ ∂L/∂θ dy = ∫ L (∂ log L / ∂θ) dy = E(∂ log L / ∂θ) = 0

Differentiating (Equ 1) again:

L ∂² log L / ∂θ² + (∂L/∂θ)(∂ log L / ∂θ) = ∂²L / ∂θ²

Substituting (Equ 1) and integrating:

∫ L (∂² log L / ∂θ²) dy + ∫ L (∂ log L / ∂θ)² dy = 0

Hence

E(∂² log L / ∂θ²) = −E[(∂ log L / ∂θ)²]
Appendix B: Consistency of ML Estimators
Proof:

By definition, log L(y | θ̂) ≥ log L(y | θ).

Let the true value be θ₀ and define E₀ to be expectation at the true value. Consider θ* ≠ θ₀. Since the geometric mean is less than the arithmetic mean,

E₀[ log { L(y | θ*) / L(y | θ₀) } ] < log E₀[ L(y | θ*) / L(y | θ₀) ]

Now

E₀[ L(y | θ*) / L(y | θ₀) ] = ∫ { L(y | θ*) / L(y | θ₀) } L(y | θ₀) dy = 1

so

E₀[ log { L(y | θ*) / L(y | θ₀) } ] < 0

hence

E₀[ log L(y | θ*) ] < E₀[ log L(y | θ₀) ]

By the WLLN, as n → ∞, with probability → 1,

log L(y | θ*) < log L(y | θ₀)

but

log L(y | θ̂) ≥ log L(y | θ₀)

Thus as n → ∞, with probability → 1, L(y | θ̂) cannot take any value other than L(y | θ₀). QED