Download Notes 7 - Wharton Statistics

Statistics 550 Notes 7
Reading: Section 1.5, 1.6.1
I. The Rao-Blackwell Theorem
Convex functions: A real-valued function $\phi$ defined on an open interval $I = (a, b)$ is convex if for any $a < x < y < b$ and $0 < \gamma < 1$,
$$\phi[\gamma x + (1 - \gamma) y] \le \gamma \phi(x) + (1 - \gamma) \phi(y).$$
$\phi$ is strictly convex if the inequality is strict.
If $\phi''$ exists, then $\phi$ is convex if and only if $\phi'' \ge 0$ on $I = (a, b)$.
A convex function lies above all its tangent lines.
Convexity of loss functions:
For point estimation:
• squared error loss $l(\theta, a) = (\theta - a)^2$ is strictly convex in $a$.
• absolute error loss $l(\theta, a) = |\theta - a|$ is convex but not strictly convex in $a$.
• zero-one loss function
$$l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k \end{cases}$$
is nonconvex in $a$.
Jensen’s Inequality: (Appendix B.9)
Let $X$ be a random variable. (i) If $\phi$ is convex in an open interval $I$, $P(X \in I) = 1$ and $E|X| < \infty$, then
$$\phi(E[X]) \le E[\phi(X)].$$
(ii) If $\phi$ is strictly convex, then $\phi(E[X]) < E[\phi(X)]$ unless $X$ equals a constant with probability one.
Proof of (i): Let $L(x)$ be a tangent line to $\phi(x)$ at the point $x = E[X]$. Write $L(x) = a + bx$. By the convexity of $\phi$, $\phi(x) \ge a + bx$ for all $x \in I$. Since expectations preserve inequalities,
$$E[\phi(X)] \ge E[a + bX] = a + bE[X] = L(E[X]) = \phi(E[X]),$$
as was to be shown.
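As a quick numerical illustration of Jensen's inequality, the sketch below evaluates both sides exactly for the strictly convex function phi(x) = x^2. The three-point distribution and the names `support`, `probs`, `phi` and `expectation` are illustrative assumptions, not part of the notes.

```python
from fractions import Fraction

# Hypothetical discrete distribution for X (illustrative values only).
support = [0, 1, 4]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

def phi(x):
    """A strictly convex function: phi(x) = x^2."""
    return x * x

def expectation(f):
    """E[f(X)] computed exactly with rational arithmetic."""
    return sum(p * f(x) for x, p in zip(support, probs))

mean_x = expectation(lambda x: x)   # E[X]
lhs = phi(mean_x)                   # phi(E[X])
rhs = expectation(phi)              # E[phi(X)]
jensen_gap = rhs - lhs              # for phi(x) = x^2 this gap is Var(X)
```

Because X is not constant here, the inequality is strict, in line with part (ii).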
Rao-Blackwell Theorem: Let $T(X)$ be a sufficient statistic. Let $\delta$ be a point estimate of $q(\theta)$ and assume that the loss function $l(\theta, d)$ is strictly convex in $d$. Also assume that $R(\theta, \delta) < \infty$. Let $\eta(t) = E[\delta(X) \mid T(X) = t]$. Then $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.
Proof: Fix $\theta$. Apply Jensen’s inequality with $\phi(d) = l(\theta, d)$ to the random variable $\delta(X)$, where $X$ has the conditional distribution of $X \mid T(X) = t$ for a particular choice of $t$. By Jensen’s inequality,
$$l(\theta, \eta(t)) < E[l(\theta, \delta(X)) \mid T(X) = t] \qquad (0.1)$$
unless $\delta(X) = \eta(T(X))$ with probability one, in which case (0.1) is an equality. Taking the expectation on both sides of this inequality yields $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one, in which case $R(\theta, \eta) = R(\theta, \delta)$.
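The theorem can be illustrated by brute-force enumeration in a small Bernoulli model under squared error loss. This is only a sketch: the choice of the crude estimator (the first observation), the values n = 4 and theta = 0.3, and the helper names `pmf`, `rao_blackwellize` and `risk` are assumptions made for the example.

```python
from itertools import product

def pmf(x, theta):
    """Joint pmf of iid Bernoulli(theta) observations x = (x_1, ..., x_n)."""
    s = sum(x)
    return theta ** s * (1 - theta) ** (len(x) - s)

def rao_blackwellize(delta, n, theta):
    """Return eta(t) = E[delta(X) | sum(X) = t] as a dict, by enumeration."""
    num, den = {}, {}
    for x in product([0, 1], repeat=n):
        t, p = sum(x), pmf(x, theta)
        num[t] = num.get(t, 0.0) + delta(x) * p
        den[t] = den.get(t, 0.0) + p
    return {t: num[t] / den[t] for t in den}

def risk(estimator, n, theta):
    """Squared error risk E[(estimator(X) - theta)^2], by enumeration."""
    return sum(pmf(x, theta) * (estimator(x) - theta) ** 2
               for x in product([0, 1], repeat=n))

n, theta = 4, 0.3
delta = lambda x: x[0]                   # crude unbiased estimator: first observation
eta = rao_blackwellize(delta, n, theta)  # works out to t / n, free of theta
risk_delta = risk(delta, n, theta)                # theta * (1 - theta)
risk_eta = risk(lambda x: eta[sum(x)], n, theta)  # theta * (1 - theta) / n
```

Recomputing `eta` with a different `theta` gives the same values t / n, which is the role sufficiency plays: the conditional expectation is a genuine estimator.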
Comments:
(1) Sufficiency ensures $\eta(t) = E[\delta(X) \mid T(X) = t]$ is an estimator (i.e., it depends only on $t$ and not on $\theta$).
(2) If the loss is convex rather than strictly convex, we get $\le$ in (0.1), and hence $R(\theta, \eta) \le R(\theta, \delta)$.
(3) The theorem is not true without convexity of the loss function.
Consequence of Rao-Blackwell theorem: For convex loss
functions, we can dispense with randomized estimators.
A randomized estimator randomly chooses the estimate $Y(x)$, where the distribution of $Y(x)$ is known. A randomized estimator can be obtained as an estimator $\delta^*(X, U)$ where $X$ and $U$ are independent and $U$ is uniformly distributed on $(0, 1)$. This is achieved by observing $X = x$ and then using $U$ to construct the distribution of $Y(x)$. For the data $(X, U)$, $X$ is sufficient. Thus, by the Rao-Blackwell Theorem, the nonrandomized estimator $E[\delta^*(X, U) \mid X]$ dominates $\delta^*(X, U)$ for strictly convex loss functions.
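To make the domination concrete, here is a small worked case under squared error loss. The randomized estimator $\delta^*(X, U) = \delta(X) + (U - 1/2)$ is a hypothetical example chosen for illustration (it is not from the notes), and $\delta$ is assumed to have finite risk.

```latex
\begin{align*}
E[\delta^*(X,U) \mid X] &= \delta(X) + E\left[U - \tfrac{1}{2}\right] = \delta(X), \\
R(\theta, \delta^*) &= E_\theta\left[(\delta(X) - q(\theta))^2\right]
   + 2\, E_\theta\left[\delta(X) - q(\theta)\right] E\left[U - \tfrac{1}{2}\right]
   + \operatorname{Var}(U) \\
 &= R(\theta, \delta) + \tfrac{1}{12} \;>\; R(\theta, \delta).
\end{align*}
```

The cross term factors because $X$ and $U$ are independent, and it vanishes because $E[U - 1/2] = 0$; the extra $1/12$ is the variance of a Uniform(0, 1) variable, so the added randomization only increases the risk.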
II. Minimal Sufficiency
For any model, there are many sufficient statistics.
Example 1: For $X_1, \ldots, X_n$ iid Bernoulli($\theta$), $T(X) = \sum_{i=1}^n X_i$ and $T'(X) = (X_1, \ldots, X_n)$ are both sufficient but $T$ provides a greater reduction of the data.
Definition: A statistic $T(X)$ is minimally sufficient if it is sufficient and it provides a reduction of the data that is at least as great as that of any other sufficient statistic $S(X)$, in the sense that we can find a transformation $r$ such that $T(X) = r(S(X))$.
Comments:
(1) To say that we can find a transformation $r$ such that $T(X) = r(S(X))$ means that if $S(x) = S(y)$, then $T(x)$ must equal $T(y)$.
(2) Data reduction in terms of a particular statistic can be thought of as a partition of the sample space. A statistic $T(X)$ partitions the sample space into sets $A_t = \{x : T(x) = t\}$.
If a statistic $T(X)$ is minimally sufficient, then for another sufficient statistic $S(X)$ which partitions the sample space into sets $B_s = \{x : S(x) = s\}$, every set $B_s$ must be a subset of some $A_t$. Thus, the partition associated with a minimal sufficient statistic is the coarsest possible partition for a sufficient statistic and in this sense the minimal sufficient statistic achieves the greatest possible data reduction for a sufficient statistic.
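The partition view can be sketched for the Bernoulli model of Example 1 with n = 3. The helper names `partition` and `is_function_of` are hypothetical, introduced only for this illustration.

```python
from itertools import product

def partition(statistic, sample_space):
    """Group sample points by the value of the statistic."""
    cells = {}
    for x in sample_space:
        cells.setdefault(statistic(x), set()).add(x)
    return list(cells.values())

def is_function_of(t_stat, s_stat, sample_space):
    """True if T = r(S) for some r, i.e. S(x) = S(y) implies T(x) = T(y)."""
    seen = {}
    for x in sample_space:
        if seen.setdefault(s_stat(x), t_stat(x)) != t_stat(x):
            return False
    return True

sample_space = list(product([0, 1], repeat=3))  # all outcomes for n = 3
T = sum                                         # T(x) = x_1 + x_2 + x_3
S = lambda x: x                                 # S(x) = (x_1, x_2, x_3)

cells_T = partition(T, sample_space)  # 4 cells: the coarser partition
cells_S = partition(S, sample_space)  # 8 singleton cells: the finer partition
```

Here `is_function_of(T, S, sample_space)` holds while the reverse does not, and every cell of the finer partition sits inside one cell of the coarser one.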
A useful theorem for finding a minimal sufficient statistic
is the following:
Theorem 2 (Lehmann and Scheffe, 1950): Suppose $S(X)$ is a sufficient statistic for $\theta$. Also suppose that if for two sample points $x$ and $y$, the ratio $f(x \mid \theta) / f(y \mid \theta)$ is constant as a function of $\theta$, then $S(x) = S(y)$. Then $S(X)$ is a minimal sufficient statistic for $\theta$.
Proof: Let $T(X)$ be any statistic that is sufficient for $\theta$. By the factorization theorem, there exist functions $g$ and $h$ such that $f(x \mid \theta) = g(T(x) \mid \theta) h(x)$. Let $x$ and $y$ be any two sample points with $T(x) = T(y)$. Then
$$\frac{f(x \mid \theta)}{f(y \mid \theta)} = \frac{g(T(x) \mid \theta) h(x)}{g(T(y) \mid \theta) h(y)} = \frac{h(x)}{h(y)}.$$
Since this ratio does not depend on $\theta$, the assumptions of the theorem imply that $S(x) = S(y)$. Thus, $S(X)$ is at least as coarse a partition of the sample space as $T(X)$, and consequently $S(X)$ is minimal sufficient.
Example 1 continued: Consider the ratio
$$\frac{f(x \mid \theta)}{f(y \mid \theta)} = \frac{\theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}}{\theta^{\sum_{i=1}^n y_i} (1 - \theta)^{n - \sum_{i=1}^n y_i}} = \left(\frac{\theta}{1 - \theta}\right)^{\sum_{i=1}^n x_i - \sum_{i=1}^n y_i}.$$
This ratio is constant as a function of $\theta$ if and only if $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i$. Since we have shown that $T(X) = \sum_{i=1}^n X_i$ is a sufficient statistic, it follows from the above sentence and Theorem 2 that $T(X) = \sum_{i=1}^n X_i$ is a minimal sufficient statistic.
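The condition of Theorem 2 can be checked numerically for n = 3. This is a sketch: evaluating the ratio on a finite grid of theta values is only a numerical stand-in for "constant as a function of theta", and the grid and helper names are assumptions of the example.

```python
from itertools import product

def bernoulli_likelihood(x, theta):
    """f(x | theta) for iid Bernoulli(theta) observations."""
    s = sum(x)
    return theta ** s * (1 - theta) ** (len(x) - s)

def ratio_is_constant(x, y, thetas, tol=1e-9):
    """Check whether f(x|theta) / f(y|theta) takes one value on the grid."""
    ratios = [bernoulli_likelihood(x, th) / bernoulli_likelihood(y, th)
              for th in thetas]
    return max(ratios) - min(ratios) < tol

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]            # arbitrary grid in (0, 1)
points = list(product([0, 1], repeat=3))
# The ratio is constant exactly for the pairs with equal sums.
agrees = all(ratio_is_constant(x, y, thetas) == (sum(x) == sum(y))
             for x in points for y in points)
```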
Note that a minimal sufficient statistic is not unique. Any one-to-one function of a minimal sufficient statistic is also a minimal sufficient statistic. For example, $T'(X) = \frac{1}{n} \sum_{i=1}^n X_i$ is a minimal sufficient statistic for the i.i.d. Bernoulli case.
Example 2: Suppose $X_1, \ldots, X_n$ are iid uniform on the interval $(\theta, \theta + 1)$, $-\infty < \theta < \infty$. Then the joint pdf of $X$ is
$$f(x \mid \theta) = \begin{cases} 1 & \theta < x_i < \theta + 1,\ i = 1, \ldots, n \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 1 & \max_i x_i - 1 < \theta < \min_i x_i \\ 0 & \text{otherwise.} \end{cases}$$
The statistic $T(X) = (\min_i X_i, \max_i X_i)$ is a sufficient statistic by the factorization theorem with $g(T(X), \theta) = I(\max_i X_i - 1 < \theta < \min_i X_i)$ and $h(X) = 1$.
For any two sample points $x$ and $y$, the numerator and denominator of the ratio $f(x \mid \theta) / f(y \mid \theta)$ will be positive for the same values of $\theta$ if and only if $\min_i x_i = \min_i y_i$ and $\max_i x_i = \max_i y_i$; if the minima and maxima are equal, then the ratio is constant and in fact equals 1. Thus, $T(X) = (\min_i X_i, \max_i X_i)$ is a minimal sufficient statistic by Theorem 2.
Example 2 is a case in which the dimension of the minimal sufficient statistic (2) does not match the dimension of the parameter (1). There are models in which the dimension of the minimal sufficient statistic is equal to the sample size, e.g., $X_1, \ldots, X_n$ iid Cauchy($\theta$),
$$f(x \mid \theta) = \frac{1}{\pi [1 + (x - \theta)^2]}$$
(Problem 1.5.15).
III. Ancillary Statistics
A statistic $T(X)$ is ancillary if its distribution does not depend on $\theta$.
Example 4: Suppose our model is $X_1, \ldots, X_n$ iid $N(\mu, 1)$. Then $\bar{X}$ is a sufficient statistic and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ is an ancillary statistic.
Although ancillary statistics contain no information about $\theta$ when the model is true, ancillary statistics are useful for checking the validity of the model.
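A small sketch of why the residual vector in Example 4 is ancillary: writing X_i = mu + Z_i with Z_i iid N(0, 1), the residuals X_i - X-bar equal Z_i - Z-bar, which do not involve mu. The seed, sample size and means below are arbitrary assumptions.

```python
import random

def residuals(sample):
    """Return (x_1 - xbar, ..., x_n - xbar)."""
    xbar = sum(sample) / len(sample)
    return [x - xbar for x in sample]

rng = random.Random(0)                       # fixed seed, for reproducibility
z = [rng.gauss(0.0, 1.0) for _ in range(5)]  # N(0, 1) noise

# Build N(mu, 1) samples from the same noise for two different means.
res_a = residuals([zi + 0.0 for zi in z])
res_b = residuals([zi + 10.0 for zi in z])
max_diff = max(abs(a - b) for a, b in zip(res_a, res_b))
```

Up to floating-point rounding, `max_diff` is zero: shifting the mean changes the sample but not the residuals.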
IV. Exponential Families
The binomial and normal models exhibited the interesting feature that there is a natural minimal sufficient statistic whose dimension is independent of the sample size. The exponential family models are a general class of models that exhibit this feature.
The class of exponential family models includes many of the most widely used statistical models (e.g., binomial, normal, gamma, Poisson, multinomial). Exponential family models have an underlying structure with elegant properties that we will discuss.
One-parameter exponential families: The family of distributions of a model $\{P_\theta : \theta \in \Theta\}$ is said to be a one-parameter exponential family if there exist real-valued functions $\eta(\theta), B(\theta), T(x), h(x)$ such that the pdf or pmf may be written as
$$p(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - B(\theta)\} \qquad (0.2)$$
Comments:
(1) For an exponential family, the support of the distribution (i.e., $\{x : p(x \mid \theta) > 0\}$) cannot depend on $\theta$. Thus, $X_1, \ldots, X_n$ iid Uniform$(0, \theta)$ is not an exponential family model.
(2) For an exponential family model, $T(x)$ is a sufficient statistic by the factorization theorem.
(3) $\eta, B, T$ are not unique. For example, $\eta$ can be multiplied by a nonzero constant $c$ and $T$ can be divided by the same constant $c$.
Examples of one-parameter exponential family models:
(1) Poisson family.
Let $X \sim$ Poisson($\theta$), $0 < \theta < \infty$. Then for $x \in \{0, 1, 2, \ldots\}$,
$$p(x \mid \theta) = \frac{\theta^x e^{-\theta}}{x!} = \frac{1}{x!} \exp\{x \log\theta - \theta\}.$$
This is a one-parameter exponential family with $\eta(\theta) = \log\theta$, $B(\theta) = \theta$, $T(x) = x$, $h(x) = 1/x!$.
(2) Binomial family.
Let $X \sim$ Binomial($n, \theta$), $0 < \theta < 1$. Then for $x \in \{0, 1, 2, \ldots, n\}$,
$$p(x \mid \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x} = \binom{n}{x} \exp\left[x \log\left(\frac{\theta}{1 - \theta}\right) + n \log(1 - \theta)\right].$$
This is a one-parameter exponential family with
$$\eta(\theta) = \log\left(\frac{\theta}{1 - \theta}\right), \quad B(\theta) = -n \log(1 - \theta), \quad T(x) = x, \quad h(x) = \binom{n}{x}.$$
The families of distributions obtained by taking iid samples from one-parameter exponential families are themselves one-parameter exponential families.
Specifically, suppose $X \sim P_\theta$ and $\{P_\theta : \theta \in \Theta\}$ is an exponential family. Then for $X_1, \ldots, X_m$ iid with common distribution $P_\theta$,
$$p(x_1, \ldots, x_m \mid \theta) = \left[\prod_{i=1}^m h(x_i)\right] \exp\left\{\eta(\theta) \sum_{i=1}^m T(x_i) - m B(\theta)\right\}.$$
A sufficient statistic is $\sum_{i=1}^m T(x_i)$ and it is one dimensional whatever the sample size $m$ is.
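The exponential family representations above can be checked numerically. The sketch below compares the binomial pmf with its h(x) exp{eta(theta) T(x) - B(theta)} form, and the joint Poisson pmf with the product formula for iid samples. The parameter values, the made-up counts `xs` and the function names are assumptions for illustration.

```python
import math

def binomial_pmf(x, n, theta):
    """Binomial(n, theta) pmf in its usual form."""
    return math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)

def binomial_expfam(x, n, theta):
    """Same pmf written as h(x) exp{eta(theta) T(x) - B(theta)}."""
    eta = math.log(theta / (1 - theta))
    B = -n * math.log(1 - theta)
    return math.comb(n, x) * math.exp(eta * x - B)

def poisson_joint_expfam(xs, theta):
    """Joint pmf of iid Poisson(theta) data via the product formula."""
    h = math.prod(1 / math.factorial(x) for x in xs)
    return h * math.exp(math.log(theta) * sum(xs) - len(xs) * theta)

def poisson_joint_direct(xs, theta):
    """Joint pmf as a plain product of Poisson pmfs."""
    return math.prod(theta ** x * math.exp(-theta) / math.factorial(x)
                     for x in xs)

n, theta = 6, 0.35
binomial_gap = max(abs(binomial_pmf(x, n, theta) - binomial_expfam(x, n, theta))
                   for x in range(n + 1))
xs = [2, 0, 5, 1]                        # made-up Poisson counts
poisson_gap = abs(poisson_joint_expfam(xs, 1.7) - poisson_joint_direct(xs, 1.7))
```

Both gaps are zero up to rounding; note that the joint Poisson form depends on the data only through `sum(xs)` and the factor h.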
For $X_1, \ldots, X_m$ iid Poisson($\theta$), the sufficient statistic $\sum_{i=1}^m T(x_i) = \sum_{i=1}^m x_i$ has a Poisson($m\theta$) distribution and hence has an exponential family model. It is generally true that the sufficient statistic of an exponential family model follows an exponential family.
Theorem 1.6.1: Let $\{P_\theta : \theta \in \Theta\}$ be a one-parameter exponential family of discrete distributions:
$$p(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - B(\theta)\}.$$
Then the family of the distributions of the statistic $T(X)$ is a one-parameter exponential family of discrete distributions whose pmf may be written
$$h^*(t) \exp\{\eta(\theta) t - B(\theta)\}$$
for suitable $h^*$.
Proof: By definition,
$$\begin{aligned} P_\theta[T(X) = t] &= \sum_{\{x : T(x) = t\}} p(x \mid \theta) \\ &= \sum_{\{x : T(x) = t\}} h(x) \exp[\eta(\theta) T(x) - B(\theta)] \\ &= \exp[\eta(\theta) t - B(\theta)] \Big\{ \sum_{\{x : T(x) = t\}} h(x) \Big\}. \end{aligned}$$
If we let $h^*(t) = \sum_{\{x : T(x) = t\}} h(x)$, the result follows.
A similar theorem holds for continuous exponential
families.
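Theorem 1.6.1 can be illustrated by treating a vector of n iid Bernoulli(theta) trials as a single discrete observation x, with h(x) = 1 and T(x) the number of successes. This setup and the names `h_star`, `prob_T` and `form_T` are assumptions of this sketch; in this case h*(t) simply counts the sample points with T(x) = t.

```python
import math
from itertools import product

n, theta = 4, 0.3
eta = math.log(theta / (1 - theta))  # eta(theta)
B = -n * math.log(1 - theta)         # B(theta)
h = lambda x: 1.0                    # h(x) for the Bernoulli vector
T = sum                              # T(x) = x_1 + ... + x_n

h_star = {}  # h*(t) = sum of h(x) over {x : T(x) = t}
prob_T = {}  # P[T(X) = t] by direct summation of p(x | theta)
for x in product([0, 1], repeat=n):
    t = T(x)
    h_star[t] = h_star.get(t, 0.0) + h(x)
    prob_T[t] = prob_T.get(t, 0.0) + h(x) * math.exp(eta * t - B)

# The theorem's form for the pmf of T: h*(t) exp{eta t - B}.
form_T = {t: h_star[t] * math.exp(eta * t - B) for t in h_star}
```

Here `h_star[t]` equals the binomial coefficient "n choose t", so `form_T` is the Binomial(n, theta) pmf in exponential family form.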
Canonical exponential families: A useful reparameterization of the exponential family model is to index $\eta = \eta(\theta)$ as the parameter to yield
$$p(x \mid \eta) = h(x) \exp[\eta T(x) - A(\eta)], \qquad (0.3)$$
where $A(\eta) = \log \int h(x) \exp[\eta T(x)]\, dx$ in the continuous case and the integral is replaced by a sum in the discrete case.
If $\theta \in \Theta$, then $A(\eta(\theta))$ must be finite. Let $\mathcal{E} = \{\eta : |A(\eta)| < \infty\}$. The model given by (0.3) with $\eta$ ranging over $\mathcal{E}$ is called the canonical one-parameter exponential family generated by $T$ and $h$. $\mathcal{E}$ is called the natural parameter space and $T$ is called the natural sufficient statistic. The canonical one-parameter exponential family contains the one-parameter exponential family (0.2) with parameter space $\Theta$, and $\mathcal{E}$ can be thought of as the “biggest” possible parameter space for the exponential family.
Example 1: Let $X \sim$ Poisson($\theta$), $0 < \theta < \infty$. Then for $x \in \{0, 1, 2, \ldots\}$,
$$p(x \mid \theta) = \frac{\theta^x e^{-\theta}}{x!} = \frac{1}{x!} \exp\{x \log\theta - \theta\} \qquad (0.4)$$
Letting $\eta = \log\theta$, we have
$$p(x \mid \eta) = \frac{1}{x!} \exp\{\eta x - \exp(\eta)\}, \quad x \in \{0, 1, 2, \ldots\}.$$
We have
$$A(\eta) = \log \sum_{x=0}^\infty \frac{1}{x!} e^{\eta x} = \log \sum_{x=0}^\infty \frac{(e^\eta)^x}{x!} = \log \exp(e^\eta) = e^\eta.$$
Thus, $\mathcal{E} = \{\eta : |A(\eta)| < \infty\} = \mathbb{R}$.
Note that if $1 < \theta < \infty$, then (0.4) would still be a one-parameter exponential family but it would be a strict subset of the canonical one-parameter exponential family generated by $T$ and $h$ with natural parameter space $\mathcal{E} = \{\eta : |A(\eta)| < \infty\} = \mathbb{R}$.
A useful result about exponential families is the following computational shortcut for moments of the natural sufficient statistic:
Theorem 1.6.2: If $X$ is distributed according to (0.3) and $\eta$ is an interior point of $\mathcal{E}$, then the moment-generating function of $T(X)$ exists and is given by
$$M(s) = E[\exp(sT(X))] = \exp[A(s + \eta) - A(\eta)]$$
for $s$ in some neighborhood of 0.
Moreover,
$$E_\eta[T(X)] = A'(\eta), \quad \mathrm{Var}_\eta[T(X)] = A''(\eta).$$
Proof: This is the proof for the continuous case.
$$\begin{aligned} M(s) = E(\exp(sT(X))) &= \int h(x) \exp[(s + \eta) T(x) - A(\eta)]\, dx \\ &= \{\exp[A(s + \eta) - A(\eta)]\} \int h(x) \exp[(s + \eta) T(x) - A(s + \eta)]\, dx \\ &= \exp[A(s + \eta) - A(\eta)] \end{aligned}$$
because the last factor, being the integral of a density, is one. The rest of the theorem follows from the moment generating property of $M(s)$ (see Section A.12 of Bickel and Doksum).
Comment on proof: In order for the moment generating function (MGF) properties to hold, the MGF must exist (be less than infinity) for $s$ in some neighborhood of 0. The proof that the MGF exists for $s$ in some neighborhood of 0 relies on the fact that $\mathcal{E}$ is an interval or $\mathbb{R}$, which we shall establish in Section 1.6.4.
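As a numerical sanity check of Theorem 1.6.2, the sketch below uses the Poisson family, for which A(eta) = log of the sum over x of exp(eta x)/x!. It approximates A'(eta) and A''(eta) by central finite differences and compares them with the mean and variance obtained by direct summation. The truncation at 100 terms, the step size and the value theta = 2.5 are assumptions of the example.

```python
import math

def A(eta, terms=100):
    """A(eta) = log sum_x exp(eta x) / x!, truncated after `terms` terms."""
    return math.log(sum(math.exp(eta * x) / math.factorial(x)
                        for x in range(terms)))

def poisson_moments(theta, terms=100):
    """Mean and variance of Poisson(theta) by direct (truncated) summation."""
    pmf = [theta ** x * math.exp(-theta) / math.factorial(x)
           for x in range(terms)]
    mean = sum(x * p for x, p in enumerate(pmf))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
    return mean, var

theta = 2.5
eta = math.log(theta)
step = 1e-4
# Central finite differences approximate A'(eta) and A''(eta).
A1 = (A(eta + step) - A(eta - step)) / (2 * step)
A2 = (A(eta + step) - 2 * A(eta) + A(eta - step)) / step ** 2
mean, var = poisson_moments(theta)
```

Both finite differences agree with the directly computed moments up to discretization error, and `A(eta)` itself matches exp(eta) = theta.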
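Example 1 continued: Let $X \sim$ Poisson($\theta$), $0 < \theta < \infty$. The natural sufficient statistic is $T(X) = X$ and $\eta = \log\theta$, $A(\eta) = e^\eta$. Thus, using Theorem 1.6.2,
$$E_\theta[X] = \frac{d}{d\eta} e^\eta \Big|_{\eta = \log\theta} = e^\eta \Big|_{\eta = \log\theta} = \theta,$$
$$\mathrm{Var}_\theta[X] = \frac{d^2}{d\eta^2} e^\eta \Big|_{\eta = \log\theta} = e^\eta \Big|_{\eta = \log\theta} = \theta.$$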
Example 2: Suppose $X_1, \ldots, X_n$ is a sample from a population with pdf
$$p(x \mid \theta) = \frac{x}{\theta^2} \exp\left(-\frac{x^2}{2\theta^2}\right), \quad x > 0,\ \theta > 0.$$
This is known as the Rayleigh distribution. It is used to model the density of time until failure for certain types of equipment. The data come from an exponential family:
$$\begin{aligned} p(x_1, \ldots, x_n \mid \theta) &= \prod_{i=1}^n \frac{x_i}{\theta^2} \exp\left(-\frac{x_i^2}{2\theta^2}\right) \\ &= \left(\prod_{i=1}^n x_i\right) \exp\left(-\frac{1}{2\theta^2} \sum_{i=1}^n x_i^2 - n \log\theta^2\right). \end{aligned}$$
Here
$$\eta = -\frac{1}{2\theta^2}, \quad \theta^2 = -\frac{1}{2\eta}, \quad B(\theta) = n \log\theta^2, \quad A(\eta) = -n \log(-2\eta).$$
Therefore, the natural sufficient statistic $\sum_{i=1}^n X_i^2$ has mean $A'(\eta) = -n/\eta = 2n\theta^2$ and variance $A''(\eta) = n/\eta^2 = 4n\theta^4$.
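The Rayleigh moment formulas can be checked by simulation. This is a Monte Carlo sketch, not an exact computation: it draws from the Rayleigh distribution by inverting its cdf F(x) = 1 - exp(-x^2 / (2 theta^2)), and the seed, number of replications and parameter values are arbitrary assumptions.

```python
import math
import random

def rayleigh_draw(theta, rng):
    """Inverse-cdf draw from F(x) = 1 - exp(-x^2 / (2 theta^2))."""
    u = rng.random()  # uniform on [0, 1)
    return theta * math.sqrt(-2.0 * math.log(1.0 - u))

def simulate_T(n, theta, reps, seed=0):
    """Simulate the natural sufficient statistic sum(X_i^2) `reps` times."""
    rng = random.Random(seed)
    return [sum(rayleigh_draw(theta, rng) ** 2 for _ in range(n))
            for _ in range(reps)]

n, theta = 5, 1.5
draws = simulate_T(n, theta, reps=20000)
mc_mean = sum(draws) / len(draws)
mc_var = sum((t - mc_mean) ** 2 for t in draws) / (len(draws) - 1)
theory_mean = 2 * n * theta ** 2  # A'(eta)  = 2 n theta^2
theory_var = 4 * n * theta ** 4   # A''(eta) = 4 n theta^4
```

The simulated mean and variance should land close to `theory_mean` and `theory_var`, with the gap shrinking as `reps` grows.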
Proving that a one parameter family is not an exponential family
A one parameter exponential family is a family
$$p(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - B(\theta)\}, \quad \theta \in \Theta.$$
Consider a one parameter family $\{p(x \mid \theta), \theta \in \Theta\}$. If the support of $p(x \mid \theta)$ is different for different $\theta$, then the family is not an exponential family because $p(x \mid \theta) > 0$ if and only if $h(x) > 0$.
Suppose that the support of $p(x \mid \theta)$ is the same for all $\theta \in \Theta$. We can write the pdf or pmf of the family as
$$p(x \mid \theta) = h(x) \exp\{g(x, \theta)\}.$$
In order for this to be an exponential family, we need to be able to write
$$g(x, \theta) = \eta(\theta) T(x) - B(\theta) \qquad (0.5)$$
for some functions $\eta$, $B$ and $T$.
Suppose (0.5) holds. Then for any two sample points $x_1$ and $x_2$,
$$g(x_1, \theta) - g(x_2, \theta) = \eta(\theta)[T(x_1) - T(x_2)],$$
and for any four sample points $x_1, x_2, x_3, x_4$ with $T(x_3) \ne T(x_4)$,
$$\frac{g(x_1, \theta) - g(x_2, \theta)}{g(x_3, \theta) - g(x_4, \theta)} = \frac{T(x_1) - T(x_2)}{T(x_3) - T(x_4)}$$
is constant as a function of $\theta$.
Thus, a necessary condition for a one-parameter exponential family is that for any four sample points $x_1, x_2, x_3, x_4$,
$$\frac{g(x_1, \theta) - g(x_2, \theta)}{g(x_3, \theta) - g(x_4, \theta)}$$
must be constant as a function of $\theta$ wherever the denominator is nonzero.
Proof that the Cauchy family is not an exponential family:
The Cauchy family is
$$\begin{aligned} p(x \mid \theta) &= \frac{1}{\pi (1 + (x - \theta)^2)} = \exp\left\{\log \frac{1}{\pi (1 + (x - \theta)^2)}\right\} \\ &= \exp\{-\log\pi - \log[1 + (x - \theta)^2]\}, \quad -\infty < \theta < \infty,\ -\infty < x < \infty. \end{aligned}$$
Thus, for the Cauchy family,
$$g(x, \theta) = -\log\pi - \log[1 + (x - \theta)^2].$$
For any four sample points $x_1, x_2, x_3, x_4$,
$$\frac{g(x_1, \theta) - g(x_2, \theta)}{g(x_3, \theta) - g(x_4, \theta)} = \frac{-\log[1 + (x_1 - \theta)^2] + \log[1 + (x_2 - \theta)^2]}{-\log[1 + (x_3 - \theta)^2] + \log[1 + (x_4 - \theta)^2]}.$$
This is not constant as a function of $\theta$ for general sample points, and a single choice of $x_1, x_2, x_3, x_4$ for which it varies is enough, so the Cauchy family is not an exponential family.
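The failure of the necessary condition can be seen numerically by evaluating the four-point ratio at two values of theta. The sample points (0, 1, 2, 3), the two theta values and the function names are arbitrary choices made for this sketch.

```python
import math

def g(x, theta):
    """g(x, theta) = -log(pi) - log(1 + (x - theta)^2) for the Cauchy family."""
    return -math.log(math.pi) - math.log(1.0 + (x - theta) ** 2)

def four_point_ratio(x1, x2, x3, x4, theta):
    """[g(x1, .) - g(x2, .)] / [g(x3, .) - g(x4, .)] evaluated at theta."""
    return (g(x1, theta) - g(x2, theta)) / (g(x3, theta) - g(x4, theta))

points = (0.0, 1.0, 2.0, 3.0)  # arbitrary sample points
ratio_at_0 = four_point_ratio(*points, theta=0.0)  # log 2 / log 2
ratio_at_1 = four_point_ratio(*points, theta=1.0)  # -log 2 / log 2.5
```

The two ratios differ (one is positive, the other negative), so no functions eta, B and T can satisfy (0.5) for this family.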