Stat 550 Notes 8
Notes:
1. We have no class this coming Tuesday because it’s fall
break.
2. The midterm is due on Wednesday by 5. I'll be around on
Monday and Tuesday if you have any questions about it. I’ll
hold my usual office hours on Tuesday from 4:45-5:45 and can
also meet with you by appointment.
I. Maximum Likelihood
The method of maximum likelihood is a general approach to
point estimation.
Motivating Example: A purchaser of electrical components buys
them in lots of size 12. Each electrical component is either
acceptable or defective. Let $\theta$ denote the number of acceptable
components in the box. It is expensive to test whether all the
electrical components are acceptable; we would like to try to
estimate $\theta$ by randomly choosing five components without
replacement and testing whether these five components are
acceptable. Let X denote the number of acceptable components
in the sample. Suppose X = 3 of the components in the sample
are acceptable. How should we estimate $\theta$?
Probability model: Imagine that the components are numbered
1-12. A sample of five components thus consists of five distinct
numbers. All $\binom{12}{5} = 792$ samples are equally likely. The
distribution of X is hypergeometric:
$$
P(X = x) = \frac{\binom{\theta}{x}\binom{12-\theta}{5-x}}{\binom{12}{5}}
$$
The following table shows the probability distribution for X
given $\theta$ for each possible value of $\theta$.

                             X = Number of acceptable components in the sample
Number of acceptable
components in the box (θ)     x=0     x=1     x=2     x=3     x=4     x=5
  0                           1       0       0       0       0       0
  1                           .5833   .4167   0       0       0       0
  2                           .3182   .5303   .1515   0       0       0
  3                           .1591   .4773   .3182   .0454   0       0
  4                           .0707   .3535   .4243   .1414   .0101   0
  5                           .0265   .2210   .4419   .2652   .0442   .0012
  6                           .0076   .1136   .3788   .3788   .1136   .0076
  7                           .0012   .0442   .2652   .4419   .2210   .0265
  8                           0       .0101   .1414   .4243   .3535   .0707
  9                           0       0       .0454   .3182   .4773   .1591
 10                           0       0       0       .1515   .5303   .3182
 11                           0       0       0       0       .4167   .5833
 12                           0       0       0       0       0       1
Once we obtain the sample X = 3, what should we estimate $\theta$ to
be?
It's not clear how to apply the method of moments. We have
$E(X) = 5\frac{\theta}{12}$, but solving $5\frac{\hat{\theta}}{12} - 3 = 0$ gives $\hat{\theta} = 7.2$, which is
not in the parameter space.
Maximum likelihood approach: We know that it is impossible
that $\theta$ = 0, 1, 2, 11 or 12. The set of possible values for $\theta$ once
we observe X = 3 is $\theta$ = 3, 4, 5, 6, 7, 8, 9, 10. Although both $\theta = 3$ and $\theta = 7$ are
possible, the occurrence of X = 3 would be more "likely" if $\theta = 7$
[$P_{\theta=7}(X = 3) = .4419$] than if $\theta = 3$ [$P_{\theta=3}(X = 3) = .0454$].
Among $\theta$ = 3, 4, 5, 6, 7, 8, 9, 10, the $\theta$ that makes the observed
data X = 3 most "likely" is $\theta = 7$.
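To make this concrete, here is a minimal Python sketch (my addition, not part of the notes; it assumes scipy.stats.hypergeom is available) that computes $P_\theta(X = 3)$ for every $\theta$ in the parameter space, picks the maximizer, and also records the sum of these likelihoods over $\theta$, a quantity revisited below.

```python
# Minimal sketch: likelihood of theta given X = 3 acceptable components
# in a sample of 5 drawn without replacement from a lot of 12.
from scipy.stats import hypergeom

x, n_sample, N_lot = 3, 5, 12
# P_theta(X = 3) for theta = 0, 1, ..., 12
likelihood = {theta: hypergeom.pmf(x, N_lot, theta, n_sample)
              for theta in range(N_lot + 1)}

for theta, p in likelihood.items():
    print(f"theta = {theta:2d}: P(X = 3) = {p:.4f}")

mle = max(likelihood, key=likelihood.get)
print("MLE of theta:", mle)                         # 7
print("sum over theta:", sum(likelihood.values()))  # about 2.167, not 1
```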
General definitions for maximum likelihood estimator
The likelihood function is defined by $L_X(\theta) = p(X \mid \theta)$.
The likelihood function is just the joint probability mass or
probability density of the data, except that we treat it as a
function of the parameter $\theta$. Thus, $L_X : \Theta \to [0, \infty)$. The
likelihood function is not a probability mass function or a
probability density function: in general, it is not true that
$L_X(\theta)$ integrates (or sums) to 1 with respect to $\theta$. In the motivating
example, for X = 3, $\sum_{\theta} L_{X=3}(\theta) = 2.167$.
The maximum likelihood estimator (the MLE), denoted by
$\hat{\theta}_{MLE}$, is the value of $\theta$ that maximizes the likelihood:
$\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} L_x(\theta)$. For the motivating example, $\hat{\theta}_{MLE} = 7$.
Intuitively, the MLE is a reasonable choice for an estimator.
The MLE is the parameter point for which the observed sample
is most likely.
Equivalently, because log is strictly increasing, we can work with
the log likelihood function
$l_x(\theta) = \log p(x \mid \theta)$,
$\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} l_x(\theta)$.
Example 2: Poisson distribution. Suppose $X_1, \ldots, X_n$ are iid
Poisson($\theta$), $\theta \in (0, \infty)$.
$$
l_x(\theta) = \sum_{i=1}^{n} \log \frac{e^{-\theta}\theta^{X_i}}{X_i!}
            = -n\theta + \left(\sum_{i=1}^{n} X_i\right)\log\theta - \sum_{i=1}^{n}\log X_i!
$$
To maximize the log likelihood, we set the first derivative of the
log likelihood equal to zero,
$$
l'(\theta) = -n + \frac{1}{\theta}\sum_{i=1}^{n} X_i = 0.
$$
$\bar{X}$ is the unique solution to this equation. To confirm that $\bar{X}$ in
fact maximizes $l(\theta)$, we can use the second derivative test,
$$
l''(\theta) = -\frac{1}{\theta^2}\sum_{i=1}^{n} X_i .
$$
$l''(\bar{X}) < 0$ as long as $\sum_{i=1}^{n} X_i > 0$, so that $\bar{X}$ in fact maximizes
$l(\theta)$.
When $\sum_{i=1}^{n} X_i = 0$, it can be seen by inspection that $l_x(\theta)$ is a
strictly decreasing function of $\theta$ and therefore there is no
maximum of $l_x(\theta)$ over the parameter space $\theta \in (0, \infty)$; the MLE
does not exist when $\sum_{i=1}^{n} X_i = 0$.
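As a quick numerical check (a sketch I am adding, not part of the notes; it assumes NumPy and SciPy and an arbitrary simulated data set), maximizing the Poisson log likelihood numerically should recover the sample mean.

```python
# Sketch: numerically maximize the Poisson log likelihood and compare to the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.5, size=200)   # simulated data; true theta = 3.5 (arbitrary choice)

def neg_log_lik(theta):
    # -l_x(theta) = n*theta - (sum x_i)*log(theta) + sum log(x_i!)
    # the last term does not depend on theta, so it is dropped
    return len(x) * theta - x.sum() * np.log(theta)

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50), method="bounded")
print("numerical MLE:", res.x)
print("sample mean  :", x.mean())    # the two should agree closely
```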
Example 3: Suppose $X_1, \ldots, X_n$ are iid Uniform$(0, \theta]$.
$$
L_x(\theta) = \begin{cases} 0 & \text{if } \max_i X_i > \theta \\ \dfrac{1}{\theta^n} & \text{if } \max_i X_i \le \theta \end{cases}
$$
Thus, $\hat{\theta}_{MLE} = \max_i X_i$.
Recall that the method of moments estimator is $2\bar{X}$. In Notes 4,
we showed that $\max_i X_i$ dominates $2\bar{X}$ for the squared error loss
function (although $\max_i X_i$ is dominated by $\frac{n+1}{n}\max_i X_i$).
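The dominance claims from Notes 4 can be illustrated by simulation. The following sketch is my addition (the choices $\theta = 10$, n = 20, and the number of replications are arbitrary); the ordering of the Monte Carlo mean squared errors matches the claims above.

```python
# Sketch: Monte Carlo comparison of squared-error risk for Uniform(0, theta] estimators.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 10.0, 20, 100_000

x = rng.uniform(0.0, theta, size=(reps, n))   # reps samples of size n
xbar = x.mean(axis=1)
xmax = x.max(axis=1)

estimators = {
    "2 * Xbar": 2 * xbar,
    "max X_i": xmax,
    "(n+1)/n * max X_i": (n + 1) / n * xmax,
}
for name, est in estimators.items():
    mse = np.mean((est - theta) ** 2)
    print(f"{name:20s} MSE approx {mse:.4f}")
```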
Key valuable asymptotic features of maximum likelihood
estimators:
For $X_1, \ldots, X_n$ iid $p(x \mid \theta)$, $\theta \in \Theta$, under "regularity
conditions" on $p(x \mid \theta)$ [these are essentially smoothness
conditions on $p(x \mid \theta)$]:
1. The MLE is consistent.
2. The MLE is asymptotically normal:
for a one-dimensional parameter $\theta$, $\dfrac{\hat{\theta}_{MLE} - \theta}{SE(\hat{\theta}_{MLE})}$ converges in
distribution to a standard normal distribution.
3. The MLE is asymptotically optimal: roughly, this means that
among all well-behaved estimators, the MLE has the smallest
variance for large samples.
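Here is a minimal simulation sketch of point 2 (my own illustration; it assumes the Poisson model of Example 2, where $\hat{\theta}_{MLE} = \bar{X}$ and a natural standard error estimate is $\sqrt{\bar{X}/n}$, and the parameter values are arbitrary). The standardized MLE should look approximately standard normal for moderately large n.

```python
# Sketch: asymptotic normality of the Poisson MLE (theta_hat = sample mean).
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 4.0, 100, 50_000

x = rng.poisson(theta, size=(reps, n))
theta_hat = x.mean(axis=1)             # MLE for each simulated sample
se_hat = np.sqrt(theta_hat / n)        # estimated standard error
z = (theta_hat - theta) / se_hat       # standardized MLE

# If asymptotic normality holds, z behaves like a N(0, 1) draw.
print("mean of z      :", z.mean())                       # close to 0
print("sd of z        :", z.std())                        # close to 1
print("P(|z| <= 1.96) :", np.mean(np.abs(z) <= 1.96))     # close to 0.95
```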
Consistency of maximum likelihood estimates:
A basic desirable property of estimators is that they are
consistent, i.e., converge to the true parameter when there is a
"large" amount of data. The maximum likelihood estimator is
generally, although not always, consistent. We prove a special
case of consistency here.
Theorem: Consider the model $X_1, \ldots, X_n$ iid with pmf or pdf
$p(X_i \mid \theta)$, $\theta \in \Theta$.
Suppose (a) the parameter space $\Theta$ is finite; (b) $\theta$ is identifiable;
and (c) the $p(X_i \mid \theta)$ have common support for all $\theta \in \Theta$. Then
the maximum likelihood estimator $\hat{\theta}_{MLE}$ is consistent as $n \to \infty$.
Proof: Let $\theta_0$ denote the true parameter. First, we show that for
any $\theta \ne \theta_0$,
$$
P_{\theta_0}\left( l_x(\theta_0) > l_x(\theta) \right) \to 1 \text{ as } n \to \infty. \qquad (0.1)
$$
The inequality $l_x(\theta_0) > l_x(\theta)$ is equivalent to
$$
\frac{1}{n} \sum_{i=1}^{n} \log \left( \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} \right) < 0 .
$$
By the law of large numbers, the left side tends in probability
toward
$$
E_{\theta_0} \log \left( \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} \right).
$$
Since $-\log$ is strictly convex, Jensen's inequality shows that
$$
E_{\theta_0} \log \left( \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} \right) < \log E_{\theta_0} \left( \frac{p(X_i \mid \theta)}{p(X_i \mid \theta_0)} \right) = \log 1 = 0 .
$$
Here $E_{\theta_0}\left[ p(X_i \mid \theta)/p(X_i \mid \theta_0) \right] = 1$ because the densities have
common support, and the inequality is strict because, by
identifiability, the ratio is not almost surely constant. Thus (0.1)
follows.
For a finite parameter space, $\hat{\theta}_{MLE}$ is consistent if and only if
$P_{\theta_0}(\hat{\theta}_{MLE} = \theta_0) \to 1$.
Denote the points other than $\theta_0$ in the finite parameter space by
$\theta_1, \ldots, \theta_K$. Let $A_{jn}$ be the event that, for n observations,
$l_x(\theta_0) > l_x(\theta_j)$. The event $A_{1n} \cap \cdots \cap A_{Kn}$ is contained in the
event $\hat{\theta}_{MLE} = \theta_0$ for n observations. By (0.1), $P(A_{jn}) \to 1$ as
$n \to \infty$ for $j = 1, \ldots, K$. Consequently,
$P(A_{1n} \cap \cdots \cap A_{Kn}) \to 1$ as $n \to \infty$, and since $A_{1n} \cap \cdots \cap A_{Kn}$ is
contained in the event $\hat{\theta}_{MLE} = \theta_0$,
$P_{\theta_0}(\hat{\theta}_{MLE} = \theta_0) \to 1$ as $n \to \infty$.
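To illustrate the finite-parameter-space argument numerically, here is a simulation sketch (my addition; the Poisson model and the candidate parameter values are arbitrary choices). It estimates $P_{\theta_0}(l_x(\theta_0) > l_x(\theta_j) \text{ for all } j)$ for increasing n, which should approach 1.

```python
# Sketch: P_{theta0}( l_x(theta0) > l_x(theta_j) for all j ) as n grows,
# for a finite parameter space under a Poisson model.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
theta0 = 2.0
others = [0.5, 1.0, 3.0, 5.0]        # the rest of the finite parameter space
reps = 5_000

for n in (5, 20, 100):
    wins = 0
    for _ in range(reps):
        x = rng.poisson(theta0, size=n)
        ll0 = poisson.logpmf(x, theta0).sum()
        if all(ll0 > poisson.logpmf(x, t).sum() for t in others):
            wins += 1
    print(f"n = {n:3d}: estimated P(theta0 wins) = {wins / reps:.3f}")
```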
For infinite parameter spaces, the MLE can be shown to be
consistent under conditions (b)-(c) of the theorem plus the
following two assumptions: (1) the parameter space contains an
open set of which the true parameter is an interior point (i.e., the
true parameter is not on the boundary of the parameter space); (2)
$p(x \mid \theta)$ is differentiable in $\theta$.
The consistency theorem assumes that the parameter space does
not depend on the sample size. The MLE can be inconsistent
when the number of parameters increases with the sample size,
e.g., $X_1, \ldots, X_n$ independent normals with mean $\mu_i$ and variance $\sigma^2$:
the MLE of $\sigma^2$ is inconsistent (with one observation per mean, the
MLE of each $\mu_i$ is $X_i$, so the MLE of $\sigma^2$ is 0 no matter how large
n is).