B90.3302 C22.0015
NOTES for Wednesday 2011.MAR.23
Some loose topics. We’ve got to deal with the Cramér-Rao inequality. This is covered on a separate handout. The essential summary is this. Suppose that θ̂ is an estimate of θ.

(1) If Var[ √n θ̂ ] → 1/I(θ) as n → ∞, then θ̂ is efficient.

(2) If θ̂ is maximum likelihood, then θ̂ is efficient.

(3) If θ̂ is unbiased, meaning E θ̂ = θ, then Var θ̂ ≥ 1/( n I(θ) ).

(4) If θ̂ is unbiased and Var θ̂ = 1/( n I(θ) ), then θ̂ is MVUE (minimum variance unbiased estimate).
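As a quick numerical illustration (this sketch is mine, not part of the handout): for a Poisson(λ) sample, I(λ) = 1/λ, so the bound in (3) is λ/n, and the sample mean is an unbiased estimate that attains it, as in (4). A minimal simulation check:

```python
import numpy as np

# Minimal simulation sketch (mine, not from the handout).
# For a Poisson(lam) sample, I(lam) = 1/lam, so the Cramer-Rao bound for an
# unbiased estimate is 1/(n*I(lam)) = lam/n; the sample mean attains it.

rng = np.random.default_rng(1)
lam, n, reps = 3.0, 50, 20000

xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)   # unbiased MLE, each replication

print("simulated Var(xbar):", xbar.var())
print("Cramer-Rao bound   :", lam / n)
```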
One final detail. We had mentioned that maximum likelihood estimates are asymptotically normal. Why does this happen? Let’s show a partial proof for the case in which we have a sample X1, X2, ..., Xn from a probability law f(x | θ) with one parameter θ.

We’ll have to use a Taylor series. This says that for any function h, h(y) ≈ h(y₀) + h′(y₀)(y − y₀).

Our likelihood is then
$$L \;=\; \prod_{i=1}^{n} f(x_i \mid \theta)$$
and the log-likelihood is
$$\log L \;=\; \sum_{i=1}^{n} \log f(x_i \mid \theta).$$

Now obtain the derivative with respect to θ:
$$\frac{\partial}{\partial\theta}\log L \;=\; \sum_{i=1}^{n} \frac{\partial}{\partial\theta}\log f(x_i \mid \theta).$$

Letting θ̂ be the maximum likelihood estimate, let’s write this as a Taylor series about that θ̂:
$$\frac{\partial}{\partial\theta}\log L \;\approx\; \sum_{i=1}^{n} \frac{\partial}{\partial\theta}\log f(x_i \mid \hat\theta) \;+\; \left[\sum_{i=1}^{n} \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \hat\theta)\right](\theta - \hat\theta).$$

Now divide left and right sides by √n:
$$\frac{1}{\sqrt{n}}\,\frac{\partial}{\partial\theta}\log L \;\approx\; \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{\partial}{\partial\theta}\log f(x_i \mid \hat\theta) \;+\; \frac{1}{\sqrt{n}}\left[\sum_{i=1}^{n} \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \hat\theta)\right](\theta - \hat\theta).$$
Now let’s examine this expression. The first summand is
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{\partial}{\partial\theta}\log f(x_i \mid \hat\theta)$$
which is zero…. because this is the equation (aside from the √n) which we solve to get θ̂!

Thus, we’ve reduced the relationship to this:
$$\frac{1}{\sqrt{n}}\,\frac{\partial}{\partial\theta}\log L \;\approx\; \frac{1}{\sqrt{n}}\left[\sum_{i=1}^{n} \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \hat\theta)\right](\theta - \hat\theta).$$

We can write out the left side, too:
$$\frac{1}{\sqrt{n}}\,\frac{\partial}{\partial\theta}\log L \;=\; \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{\partial}{\partial\theta}\log f(x_i \mid \theta) \;\approx\; \frac{1}{\sqrt{n}}\left[\sum_{i=1}^{n} \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \hat\theta)\right](\theta - \hat\theta).$$
Based on
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{\partial}{\partial\theta}\log f(x_i \mid \theta),$$
we can assert the Central Limit theorem! After all, it’s the sum of n independent, identically distributed things. As each summand has mean zero, this limiting distribution is N(0, Var[ ∂/∂θ log f(xᵢ | θ) ]), or N(0, I(θ)). Thus, we decide that the limiting distribution of
$$\frac{1}{\sqrt{n}}\left[\sum_{i=1}^{n} \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \hat\theta)\right](\theta - \hat\theta)$$
must also be N(0, I(θ)). Let’s rewrite this as
$$\left[-\frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \hat\theta)\right]\sqrt{n}\,(\hat\theta - \theta).$$
Watch the √n’s and the minus signs. The expression in the brackets certainly converges to I(θ); remember the calculating forms for I(θ) and also the law of large numbers. Thus our result comes down to
$$I(\theta)\,\sqrt{n}\,(\hat\theta - \theta) \;\sim\; N\big(0,\; I(\theta)\big).$$
Certainly we can express this as
$$\sqrt{n}\,(\hat\theta - \theta) \;\sim\; N\!\left(0,\; \frac{1}{I(\theta)}\right).$$
This is of course the statement for the asymptotic normality of the maximum likelihood estimate.
It should be noted that many approximations were made. Also, we left several mathematical nuances untouched. Nonetheless, this demonstration shows the essential features of the proof that maximum likelihood estimates are asymptotically normal with variance 1/I(θ) (after the √n scaling).
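A small simulation makes the conclusion concrete. The example below is my own (an Exponential sample with rate θ, where the MLE is 1/X̄ and I(θ) = 1/θ²); it simply checks that √n(θ̂ − θ) has variance close to 1/I(θ) = θ²:

```python
import numpy as np

# Simulation sketch (my own example, not from the notes).
# For an Exponential sample with rate theta, the MLE is 1/xbar and
# I(theta) = 1/theta**2, so sqrt(n)*(theta_hat - theta) should look like
# N(0, theta**2) when n is large.

rng = np.random.default_rng(2)
theta, n, reps = 2.0, 400, 20000

samples = rng.exponential(scale=1.0 / theta, size=(reps, n))
theta_hat = 1.0 / samples.mean(axis=1)                 # MLE in each replication

z = np.sqrt(n) * (theta_hat - theta)
print("simulated Var of sqrt(n)*(theta_hat - theta):", z.var())
print("1/I(theta) = theta**2                       :", theta**2)
```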
**This was not covered in class.**
Let’s note (but not get too excited about) the exponential family. This says that some densities can be factored as
$$f(x \mid \theta) \;=\; e^{\,c(\theta)\,T(x)}\,A(\theta)\,B(x).$$
Indeed, this reflects the use of three factors that we had done in the factorization theorem to identify sufficient statistics, except that here the link between θ and x occurs as products in the exponent. The remaining factors are somewhat arbitrary. In fact, Rice puts them in the exponent as
$$f(x \mid \theta) \;=\; e^{\,c(\theta)\,T(x) \,+\, d(\theta) \,+\, S(x)}.$$
In any event, this clearly identifies T(x) as the sufficient statistic. Rice then goes on to show that in a sample of n, we have
$$\sum_{i=1}^{n} T(X_i)$$
as the sufficient statistic.
We note the following:

(1) Many common probability laws are of the exponential family form. Rice shows several examples.

(2) This idea applies as well to data X which are not iid samples.

(3) There is a huge body of statistical theory exploiting the notions of the exponential family.
**end of commentary**
Finally, we have the Rao-Blackwell theorem. We have a handout on this.
We stress that the major use of the Rao-Blackwell theorem is not that of finding estimates.
The actual use is the theoretical assurance that the best procedures are based on sufficient
statistics.
Now let’s do the NP lemma.
Neyman and Pearson did their fundamental lemma in 1933. It solved a very neat
problem in a very clever mathematical way. This approach has formed the basis for
statistical work in all of the empirical sciences. It created an enormous tidal wave of
jargon:
Null hypothesis
Alternative hypothesis
Type I error
Type II error
Statistically significant
Not statistically significant
Level of significance
Power of test
Power function
Operating characteristic (OC) curve
P-value
A neat advantage of the Neyman-Pearson approach is that it is a well-defined, textbook-citable method.
The hypothesis-testing game has some peculiar aspects. Here are some points which ought to be emphasized:
H0 and HA (or H1) are not exchangeable.

H0 must contain the “=” part of the problem.

HA is the interesting thing you’d like to show.

Rejecting H0 in favor of HA is really exciting.

Accepting H0 just means that you can’t reject H0. This is the nonsignificance case.

Rejecting H0 in favor of HA is the significant case. You have significant evidence. Your results are statistically significant.

Hypothesis testing can obtain significant evidence that θ ≠ θ0. It cannot obtain significant evidence that θ really is equal to θ0.

With large quantities of data, you will always be able to reject a null hypothesis of exact equality. (That is, in a problem like H0: θ = θ0 versus HA: θ ≠ θ0, a large sample size will always reject H0.)

If your objective is to accept H0, you can accomplish this with a small sample size. Essentially, you’ve run an experiment with power too small to decide in favor of HA.
It is certainly interesting that the methodology has grown in odd ways over the last 65+
years. In most sciences, including social sciences, quantitative results must be subjected
to statistical hypothesis tests. There is an extreme prejudice against reporting
nonsignificant results.
OK....now let’s look at the Neyman-Pearson problem. It says that you have data X and model f(x | θ) and two statements H0: θ = θ0 and HA: θ = θ1. It is assumed that the model is fixed aside from the choice of θ. Thus....

Covered by this problem: a N(μ, 25) sample, Bin(50, p), N(100, σ²). Also covered would be N(μ, σ²) with H0: μ = 10, σ = 2 versus HA: μ = 12, σ = 4.

NOT covered by this problem: N(μ, σ²) with H0 and HA discussing μ only.
Neyman and Pearson set up an embarrassment level, α. This represents the maximum allowed probability of rejecting the null when it is true, the Type I error. They claimed to have a “best test.” This means that any competitor test cannot be better. To be a competitor (and have a chance at being better), a test had to have a Type I error probability which was at most α. NP would then show that the competitor must do worse on Type II error. Details are on the handout. The handout uses “rejection set” whereas many use “critical region.”
The NP procedure is based on the likelihood ratio f1(x)/f0(x), which we might also write as f(x | θ1)/f(x | θ0).
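To see the likelihood ratio in a concrete simple-versus-simple case (a toy example of mine, not the handout’s), take a N(θ, 1) sample and test H0: θ = 0 against HA: θ = 1. The log of f1(x)/f0(x) works out to n X̄ − n/2, so rejecting for a large ratio is the same as rejecting for large X̄:

```python
import numpy as np
from scipy.stats import norm

# Toy sketch (my own example, not the handout's): Neyman-Pearson test of
# H0: theta = 0 versus HA: theta = 1 for a N(theta, 1) sample of size n.
# The test rejects for large f1(x)/f0(x); here log(f1/f0) = n*xbar - n/2,
# so rejecting for a large ratio is the same as rejecting for large xbar.

rng = np.random.default_rng(3)
n = 25
x = rng.normal(loc=0.0, scale=1.0, size=n)    # data generated under H0

log_ratio = norm.logpdf(x, loc=1.0).sum() - norm.logpdf(x, loc=0.0).sum()
print(log_ratio, n * x.mean() - n / 2)        # identical, up to rounding
```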
The technology generalizes. We can prove optimality of this sort of stuff for other types
of hypotheses and for more complicated likelihoods.
The Neyman-Pearson paradigm is useful for simple versus simple situations. However, a great many of our tests are of the form H0: θ = θ0 versus HA: θ ≠ θ0. We need another kind of tool.
We could of course ask at this point where our common tests come from. Suppose that we have X1, X2, ..., Xn from N(μ, σ²) and we want to test (say) H0: μ = 80 versus H1: μ ≠ 80. Our method of doing this consists of the following logical process.

(1) Seek a statistic T(X) such that we can easily see what kinds of values of T suggest H0 and what kinds of values of T suggest H1.

(2) Obtain the distribution of T(X) when H0 is true. This is the hard part.

(3) Based on the distribution obtained in (2), calibrate a cutoff rule so that P[ T(X) ∈ rejection set | H0 true ] = α.

For the one-sample problem above, the statistic is
$$T(\mathbf{X}) \;=\; \sqrt{n}\;\frac{\bar{X} - 80}{s}$$
and the nature of the reject set is described by | T(X) | ≥ t_{α/2; n−1}.
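A minimal computation of this test might look like the following (the data here are purely invented for illustration):

```python
import numpy as np
from scipy.stats import t

# Sketch of the test H0: mu = 80 versus H1: mu != 80 (data invented here).

rng = np.random.default_rng(4)
x = rng.normal(loc=82.0, scale=5.0, size=20)           # made-up sample
n, alpha = len(x), 0.05

T = np.sqrt(n) * (x.mean() - 80.0) / x.std(ddof=1)     # s uses the n-1 denominator
cutoff = t.ppf(1 - alpha / 2, df=n - 1)                # t_{alpha/2; n-1}

print("T =", T, " cutoff =", cutoff, " reject H0:", abs(T) >= cutoff)
```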
The logic of hypothesis testing (as generally covered in textbooks) goes through these steps:

(1) Prove the Neyman-Pearson lemma for the situation of testing simple H0 versus simple H1. These are nearly always done for one-parameter problems, such as for the mean of a normal population with known variance.

(2) Consider one-parameter problems of the form H0: θ = θ0 versus H1: θ > θ0. For a large class of problems, we can work out the Neyman-Pearson test of H0: θ = θ0 versus H1: θ = θ1 at some particular θ1 > θ0. If it turns out that the form of the test does not depend on the particular θ1, then the test is described as UMP (uniformly most powerful).

(3) As the point made in (2) does not generalize to multi-parameter problems, we develop other concepts such as “similar” tests or “invariant” tests.

(4) Eventually we come to the likelihood ratio test principle.
The likelihood ratio test idea is motivated by the problem H0: θ ∈ ω0 versus HA: θ ∈ ωA. As Neyman and Pearson showed us that likelihood ratios are a good idea, we base the test on
$$\Lambda \;=\; \frac{\displaystyle\max_{\theta \in \omega_0} f(\mathbf{x} \mid \theta)}{\displaystyle\max_{\theta \in \omega_0 \cup \omega_A} f(\mathbf{x} \mid \theta)}.$$
Then we reject H0 if Λ ≤ k, choosing k to adjust the level of significance.

The procedure comes down to these difficult steps:

(1) Do the maximization in the numerator.

(2) Do the maximization in the denominator.

(3) Obtain the distribution of Λ under H0, as this is essential to the problem of setting k to fix the level of significance.

We can certainly work through examples of this technique to derive most of the common statistical tests. However, the real use comes in the non-standard cases, where we can work with an asymptotic result on Λ.
What’s a non-standard case? Suppose that X1, X2, …, Xn is a sample from a p-variable normal distribution with mean vector μ and variance matrix Σ. Consider the test of H0: Σ is diagonal versus HA: Σ is an arbitrary positive definite matrix. This is pretty much hopeless by any other criterion.

The limiting result is that, under H0, −2 log Λ ~ χ² with degrees of freedom equal to the number of parameters being debated between H0 and H1.
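For the diagonal-versus-arbitrary Σ example above, both maximizations happen to have closed forms, so −2 log Λ can be computed directly. Counting parameters, the alternative has p(p+1)/2 covariance parameters and the null has p, so the degrees of freedom would be p(p−1)/2. Here is a rough sketch of the calculation (my own, with simulated data):

```python
import numpy as np

# Rough sketch (mine, with simulated data): -2*log(Lambda) for
# H0: Sigma diagonal versus HA: Sigma arbitrary positive definite.
# Under HA the MLE of Sigma is the divide-by-n sample covariance S;
# under H0 the MLEs are just the diagonal entries of S.

rng = np.random.default_rng(5)
n, p = 200, 4
X = rng.multivariate_normal(mean=np.zeros(p), cov=np.eye(p), size=n)  # H0 true here

S = np.cov(X, rowvar=False, bias=True)                 # MLE of Sigma (divide by n)

loglik_HA = -0.5 * n * (np.log(np.linalg.det(S)) + p)  # maximized log-likelihood, HA
loglik_H0 = -0.5 * n * (np.log(np.diag(S)).sum() + p)  # maximized log-likelihood, H0

minus2logLambda = 2 * (loglik_HA - loglik_H0)
df = p * (p - 1) // 2                                  # parameters "being debated"
print(minus2logLambda, "compare to chi-squared with", df, "df")
```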
OK…. Let’s see this in action with some examples of the likelihood ratio test. (This is on a handout.)

Suppose that X1, X2, …, Xm is a sample from a Poisson(λ) distribution, while Y1, Y2, …, Yn is a sample from a Poisson(μ) distribution. We wish to test H0: λ = μ versus HA: λ ≠ μ.
L e
The likelihood is L = M
N
m

i 1
xi
xi !
L
O
e
P
M
QM
N
n
j 1
y

O
P
P
Q
j
.
yj !
In the numerator of the likelihood ratio test statistic, there is only one sample of size m + n with a common parameter. The maximum likelihood estimate for the common parameter is
$$\hat\lambda_0 \;=\; \frac{\sum_{i=1}^{m} X_i + \sum_{j=1}^{n} Y_j}{m + n}.$$
The maximized likelihood for H0, for the numerator, is
$$\left[\prod_{i=1}^{m} \frac{e^{-\hat\lambda_0}\,\hat\lambda_0^{\,x_i}}{x_i!}\right]\left[\prod_{j=1}^{n} \frac{e^{-\hat\lambda_0}\,\hat\lambda_0^{\,y_j}}{y_j!}\right].$$
In the denominator, the parameters λ and μ are allowed to be different. Thus, the maximum likelihood estimates are $\hat\lambda_A = \bar{X}$ and $\hat\mu_A = \bar{Y}$. The maximized likelihood for the denominator is
$$\left[\prod_{i=1}^{m} \frac{e^{-\hat\lambda_A}\,\hat\lambda_A^{\,x_i}}{x_i!}\right]\left[\prod_{j=1}^{n} \frac{e^{-\hat\mu_A}\,\hat\mu_A^{\,y_j}}{y_j!}\right].$$
This allows us to write the likelihood ratio as
$$\Lambda \;=\; \frac{\left[\displaystyle\prod_{i=1}^{m} \frac{e^{-\hat\lambda_0}\,\hat\lambda_0^{\,x_i}}{x_i!}\right]\left[\displaystyle\prod_{j=1}^{n} \frac{e^{-\hat\lambda_0}\,\hat\lambda_0^{\,y_j}}{y_j!}\right]}{\left[\displaystyle\prod_{i=1}^{m} \frac{e^{-\hat\lambda_A}\,\hat\lambda_A^{\,x_i}}{x_i!}\right]\left[\displaystyle\prod_{j=1}^{n} \frac{e^{-\hat\mu_A}\,\hat\mu_A^{\,y_j}}{y_j!}\right]} \;=\; \frac{e^{-(m+n)\hat\lambda_0}\;\hat\lambda_0^{\,\sum_{i=1}^{m} x_i + \sum_{j=1}^{n} y_j}}{e^{-m\hat\lambda_A - n\hat\mu_A}\;\hat\lambda_A^{\,\sum_{i=1}^{m} x_i}\;\hat\mu_A^{\,\sum_{j=1}^{n} y_j}}$$
(think why the exponentials cancel!)
$$=\; \left(\frac{\hat\lambda_0}{\hat\lambda_A}\right)^{\sum_{i=1}^{m} x_i} \left(\frac{\hat\lambda_0}{\hat\mu_A}\right)^{\sum_{j=1}^{n} y_j}$$
The analysis of the distribution of this is hopeless. However, we can use the −2 log Λ rule. Here it’s chi-squared with 1 degree of freedom. The alternative space is described by two parameters (λ and μ), while the null space is described by one (the single common value of λ and μ). The difference is 1.
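As a sanity check (not part of the handout), here is a short computation of Λ and −2 log Λ for two small invented samples, compared against the chi-squared distribution with 1 degree of freedom:

```python
import numpy as np
from scipy.stats import chi2

# Sanity-check sketch with invented data (not from the handout): the
# two-sample Poisson LRT of H0: lambda = mu, using the closed form
#   Lambda = (lam0/lamA)**sum(x) * (lam0/muA)**sum(y)
# derived above.

x = np.array([3, 5, 2, 4, 6, 3, 4])   # sample of size m, Poisson(lambda)
y = np.array([6, 7, 5, 8, 6, 9])      # sample of size n, Poisson(mu)
m, n = len(x), len(y)

lam0 = (x.sum() + y.sum()) / (m + n)  # common MLE under H0
lamA, muA = x.mean(), y.mean()        # separate MLEs under HA

log_Lambda = x.sum() * np.log(lam0 / lamA) + y.sum() * np.log(lam0 / muA)
stat = -2 * log_Lambda
print("-2 log Lambda =", stat, " approx p-value =", chi2.sf(stat, df=1))
```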