Statistics 550 Notes 4
Reading: Sections 1.4, 1.5
Note: For Thursday, Sept. 18th, I will hold my office hours
from 1-2 instead of my usual time 4:45-5:45.
I. Prediction (Chapter 1.4)
A common decision problem is that we want to predict a variable $Y$ based on a covariate vector $Z$.
Examples: (1) Predict whether a patient, hospitalized due to
a heart attack, will have a second heart attack. The
prediction is to be based on demographic, diet and clinical
measurements for that patient; (2) Predict the price of a
stock 6 months from now, on the basis of company
performance measures and economic data; (3) Predict the
numbers in a handwritten ZIP code, from a digitized image.
We typically have a "training" sample $(Y_1, Z_1), \ldots, (Y_n, Z_n)$ available from the joint distribution of $(Y, Z)$, and we want to predict $Y_{new}$ for a new observation from the distribution of $(Y, Z)$ for which we know $Z = Z_{new}$.

In Section 1.4, we consider how to make predictions when we know the joint distribution of $(Y, Z)$; in practice, we often have only an estimate of the joint distribution based on the training sample.
Let $g(Z)$ be a rule for predicting $Y$ based on $Z$. A criterion that is often used for judging different prediction rules is the mean squared prediction error:
$$\Delta^2(Y, g(Z)) = E[\{g(Z) - Y\}^2 \mid Z];$$
this is the average squared prediction error when $g(Z)$ is used to predict $Y$ for a particular $Z$. We want $\Delta^2(Y, g(Z))$ to be as small as possible.

Theorem 1.4.1: Let $\mu(Z) = E(Y \mid Z)$. Then $\mu(Z)$ is the best prediction rule under mean squared prediction error.

Proof: For any prediction rule $g(Z)$,
$$
\begin{aligned}
\Delta^2(Y, g(Z)) &= E[\{g(Z) - Y\}^2 \mid Z] \\
&= E[\{(g(Z) - E(Y \mid Z)) + (E(Y \mid Z) - Y)\}^2 \mid Z] \\
&= E[(g(Z) - E(Y \mid Z))^2 \mid Z] + 2E[(g(Z) - E(Y \mid Z))(E(Y \mid Z) - Y) \mid Z] + E[(E(Y \mid Z) - Y)^2 \mid Z] \\
&= E[(g(Z) - E(Y \mid Z))^2 \mid Z] + E[(E(Y \mid Z) - Y)^2 \mid Z] \\
&\geq E[(E(Y \mid Z) - Y)^2 \mid Z] = \Delta^2(Y, E(Y \mid Z)).
\end{aligned}
$$
The cross term vanishes because, conditional on $Z$, $g(Z) - E(Y \mid Z)$ is a constant and $E[E(Y \mid Z) - Y \mid Z] = 0$.
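As a quick numerical check, the following simulation sketch compares the mean squared prediction error of the conditional mean rule with two other rules; the joint distribution used here ($Z$ standard normal and $Y = 2Z +$ noise, so that $E(Y \mid Z) = 2Z$) is an arbitrary illustrative choice, not one from the text.

```python
import numpy as np

# Simulation sketch: under an illustrative joint distribution with
# E(Y|Z) = 2*Z, the conditional-mean rule attains the smallest mean
# squared prediction error.
rng = np.random.default_rng(0)
n = 200_000
Z = rng.normal(size=n)
Y = 2.0 * Z + rng.normal(size=n)

def mspe(pred):
    """Average squared prediction error E[(g(Z) - Y)^2]."""
    return np.mean((pred - Y) ** 2)

print(mspe(2.0 * Z))      # E(Y|Z): approximately Var(Y|Z) = 1
print(mspe(1.5 * Z))      # another linear rule: larger
print(mspe(np.zeros(n)))  # ignoring Z entirely: larger still
```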
II. Sufficient Statistics
Our usual setting: We observe data $X$ from a distribution $P$, where we do not know the true $P$ but only know that $P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$ (the statistical model).
The observed sample of data $X$ may be very complicated (e.g., in the handwritten zip code example from Notes 1, the data are 500 $216 \times 72$ matrices). An experimenter may
wish to summarize the information in a sample by
determining a few key features of the sample values, e.g.,
the sample mean, the sample variance or the largest
observation. These are all examples of statistics.
Recall: A statistic $Y = T(X)$ is a random variable or random vector that is a function of the data.

A statistic is sufficient if it carries all the information in the data about the parameter vector $\theta$. $T(X)$ can be a scalar or a vector. If $T(X)$ is of lower "dimension" than $X$, then we have a good summary of the data that does not discard any important information.
For example, consider a sequence of independent Bernoulli trials with unknown probability of success $\theta$. We may have the intuitive feeling that the total number of successes contains all the information about $\theta$ that there is in the sample, and that the order in which the successes occurred, for example, gives no additional information. The following definition formalizes this idea:
Definition: A statistic $Y = T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $Y = y$ does not depend on $\theta$ for any value of $y$.¹

¹ This definition is not quite precise. Difficulties arise when $P_\theta(Y = y) = 0$, so that the conditioning event has probability zero. The definition of conditional probability can then be changed at one or more values of $y$ (in fact, at any set of $y$ values which has probability zero) without affecting the distribution of $X$, which is the result of combining the distribution of $Y$ with the conditional distribution of $X$ given $Y$. In general, there can be more than one version of the conditional probability distribution $P(X \mid Y)$ which together with the distribution of $Y$ leads back to the distribution of $X$. We define a statistic as sufficient if there exists at least one version of the conditional probability distributions $P(X \mid Y)$ which is the same for all $\theta \in \Theta$. See Lehmann and Casella, Theory of Point Estimation, 2nd Edition, Chapter 1.6, pp. 34-35, for further discussion. For our purposes, we define $Y$ to be a sufficient statistic if (i) for discrete distributions of the data, for each $y$ that has positive probability for at least one $\theta \in \Theta$, the conditional probability $P_\theta(X \mid Y = y)$ does not depend on $\theta$ for all $\theta$ for which $P_\theta(Y = y) > 0$; and (ii) for continuous distributions of the data, for each $y$ that has positive density for at least one $\theta \in \Theta$, the conditional probability density $f_\theta(X \mid Y = y)$ does not depend on $\theta$ for all $\theta$ for which $f_\theta(y) > 0$.

Implication: If a statistic $Y = T(X)$ is sufficient, then once we know the value of the statistic, knowing the full data $X$ does not provide any additional information about $\theta$.
Example 1: Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. We will verify that $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$.

Consider $P(X_1 = x_1, \ldots, X_n = x_n \mid \sum_{i=1}^n X_i = y)$. For $x_1, \ldots, x_n$ with $\sum_{i=1}^n x_i = y$ (the conditional probability is zero otherwise), we have
$$
P\Big(X_1 = x_1, \ldots, X_n = x_n \,\Big|\, \sum_{i=1}^n X_i = y\Big)
= \frac{P(X_1 = x_1, \ldots, X_n = x_n, Y = y)}{P(Y = y)}
= \frac{\theta^y (1-\theta)^{n-y}}{\binom{n}{y} \theta^y (1-\theta)^{n-y}}
= \frac{1}{\binom{n}{y}}.
$$
The conditional distribution thus does not involve $\theta$ at all, and thus $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$.
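As a numerical illustration of the definition, the following sketch enumerates all outcomes for a small number of Bernoulli trials (here $n = 4$, an arbitrary choice) and checks that the conditional distribution of the data given the total number of successes is the same for two different values of $\theta$.

```python
from itertools import product

def conditional_probs(theta, n=4):
    """P(X = x | sum(X) = y) for every binary sequence x of length n."""
    seqs = list(product([0, 1], repeat=n))
    joint = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x)) for x in seqs}
    totals = {}
    for x, p in joint.items():
        totals[sum(x)] = totals.get(sum(x), 0.0) + p
    return {x: joint[x] / totals[sum(x)] for x in seqs}

a = conditional_probs(0.2)
b = conditional_probs(0.7)
print(all(abs(a[x] - b[x]) < 1e-12 for x in a))  # True: no dependence on theta
```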
Example 2: Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$. Consider the statistic $Y = \max_{1 \le i \le n} X_i$. We showed in Notes 4 that
$$
f_Y(y) = \begin{cases} \dfrac{n y^{n-1}}{\theta^n} & 0 \le y \le \theta \\ 0 & \text{elsewhere.} \end{cases}
$$
$Y$ must be less than $\theta$. For $y \le \theta$, we have
$$
f(x_1, \ldots, x_n \mid Y = y) = \frac{f(x_1, \ldots, x_n, Y = y)}{f_Y(y)} = \frac{\dfrac{1}{\theta^n}\, I\{y \le \theta\}}{\dfrac{n y^{n-1}}{\theta^n}\, I\{y \le \theta\}} = \frac{1}{n y^{n-1}},
$$
which does not depend on $\theta$.
It is often hard to verify or disprove sufficiency of a statistic directly, because we need to find the distribution of the statistic. The following theorem is often helpful.
Factorization Theorem: A statistic $T(X)$ is sufficient for $\theta$ if and only if there exist functions $g(t, \theta)$ and $h(x)$ such that
$$p(x \mid \theta) = g(T(x), \theta)\, h(x)$$
for all $x$ and all $\theta \in \Theta$ (where $p(x \mid \theta)$ denotes the probability mass function for discrete data given the parameter $\theta$ and the probability density function for continuous data).
Proof: We prove the theorem for discrete data; the proof for continuous distributions is similar. First, suppose that the probability mass function factors as given in the theorem. We have
$$P_\theta(T(X) = t) = \sum_{x': T(x') = t} p(x' \mid \theta),$$
so that, for $x$ with $T(x) = t$,
$$
P_\theta(X = x \mid T(X) = t) = \frac{P_\theta(X = x, T(X) = t)}{P_\theta(T(X) = t)} = \frac{P_\theta(X = x)}{P_\theta(T(X) = t)} = \frac{g(T(x), \theta)\, h(x)}{\sum_{x': T(x') = t} g(T(x'), \theta)\, h(x')} = \frac{h(x)}{\sum_{x': T(x') = t} h(x')}.
$$
Thus, $T(X)$ is sufficient for $\theta$ because the conditional distribution $P_\theta(X = x \mid T(X) = t)$ does not depend on $\theta$.

Conversely, suppose $T(X)$ is sufficient for $\theta$. Then the conditional distribution of $X \mid T(X)$ does not depend on $\theta$; let $P(X = x \mid T(X) = t) = k(x, t)$. Then, for $t = T(x)$,
$$p(x \mid \theta) = P_\theta(X = x, T(X) = t) = k(x, t)\, P_\theta(T(X) = t).$$
Thus, we can take $h(x) = k(x, T(x))$ and $g(t, \theta) = P_\theta(T(X) = t)$.
Example 1 Continued: Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. To show that $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$, we factor the probability mass function as follows:
$$
P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1 - x_i} = \theta^{\sum_{i=1}^n x_i} (1-\theta)^{n - \sum_{i=1}^n x_i} = \left(\frac{\theta}{1-\theta}\right)^{\sum_{i=1}^n x_i} (1-\theta)^n.
$$
The pmf is of the form $g\left(\sum_{i=1}^n x_i, \theta\right) h(x_1, \ldots, x_n)$ where $h(x_1, \ldots, x_n) = 1$.
Example 2 continued: Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$. To show that $Y = \max_{1 \le i \le n} X_i$ is sufficient, we factor the pdf as follows:
$$
f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n \frac{1}{\theta} I\{0 \le x_i \le \theta\} = \frac{1}{\theta^n}\, I\{\max_{1 \le i \le n} x_i \le \theta\}\, I\{\min_{1 \le i \le n} x_i \ge 0\}.
$$
The pdf is of the form $g(\max_{1 \le i \le n} x_i, \theta)\, h(x_1, \ldots, x_n)$ where
$$g(\max_{1 \le i \le n} x_i, \theta) = \frac{1}{\theta^n}\, I\{\max_{1 \le i \le n} x_i \le \theta\}, \qquad h(x_1, \ldots, x_n) = I\{\min_{1 \le i \le n} x_i \ge 0\}.$$
Example 3: Let $X_1, \ldots, X_n$ be iid Normal$(\mu, \sigma^2)$. The pdf factors as
$$
\begin{aligned}
f(x_1, \ldots, x_n; \mu, \sigma^2) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x_i - \mu)^2\right\} \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right\} \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i + n\mu^2\right)\right\}.
\end{aligned}
$$
The pdf is thus of the form $g\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2, \mu, \sigma^2\right) h(x_1, \ldots, x_n)$ where $h(x_1, \ldots, x_n) = 1$. Thus, $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2\right)$ is a two-dimensional sufficient statistic for $\theta = (\mu, \sigma^2)$, i.e., the distribution of $X_1, \ldots, X_n$ is independent of $(\mu, \sigma^2)$ given $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2\right)$.
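As an illustration of what the factorization implies, the following sketch constructs two different samples with the same $\left(\sum x_i, \sum x_i^2\right)$ and checks that they have the same normal likelihood at every $(\mu, \sigma)$ on a small grid; the sample values and the grid are arbitrary illustrative choices.

```python
import numpy as np

def log_lik(x, mu, sigma):
    """Normal log-likelihood of the sample x at (mu, sigma)."""
    x = np.asarray(x)
    return (-0.5 * np.sum((x - mu) ** 2) / sigma ** 2
            - x.size * np.log(sigma * np.sqrt(2 * np.pi)))

# Two different samples sharing the same sum (0) and sum of squares (2).
x = np.array([-1.0, 0.0, 1.0])
t = 0.5
b = (-t + np.sqrt(4 - 3 * t ** 2)) / 2
c = (-t - np.sqrt(4 - 3 * t ** 2)) / 2
y = np.array([t, b, c])

print(x.sum(), (x ** 2).sum(), y.sum(), (y ** 2).sum())
for mu in (-1.0, 0.0, 2.0):
    for sigma in (0.5, 1.0, 3.0):
        assert np.isclose(log_lik(x, mu, sigma), log_lik(y, mu, sigma))
print("likelihoods agree on the grid")
```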
A theorem for proving that a statistic is not sufficient:
Theorem 1: Let $T(X)$ be a statistic. If there exist some $\theta_1, \theta_2 \in \Theta$ and $x$, $y$ such that
(i) $T(x) = T(y)$;
(ii) $p(x \mid \theta_1)\, p(y \mid \theta_2) \neq p(x \mid \theta_2)\, p(y \mid \theta_1)$,
then $T(X)$ is not a sufficient statistic.
Proof: Assume that (i) and (ii) hold, and suppose that $T(X)$ is a sufficient statistic. Then by the factorization theorem, $p(x \mid \theta) = g(T(x), \theta)\, h(x)$. Thus,
$$p(x \mid \theta_1)\, p(y \mid \theta_2) = g(T(x), \theta_1) h(x)\, g(T(y), \theta_2) h(y) = g(T(x), \theta_1) h(x)\, g(T(x), \theta_2) h(y),$$
where the last equality follows from (i). Also,
$$p(x \mid \theta_2)\, p(y \mid \theta_1) = g(T(x), \theta_2) h(x)\, g(T(y), \theta_1) h(y) = g(T(x), \theta_2) h(x)\, g(T(x), \theta_1) h(y),$$
where the last equality follows from (i). Thus, $p(x \mid \theta_1)\, p(y \mid \theta_2) = p(x \mid \theta_2)\, p(y \mid \theta_1)$, which contradicts (ii). Hence the supposition that $T(X)$ is a sufficient statistic is impossible, and $T(X)$ is not a sufficient statistic when (i) and (ii) hold.
■
Example 4: Consider a series of three independent Bernoulli trials $X_1, X_2, X_3$ with probability of success $p$. Let $T = X_1 + 2X_2 + 3X_3$. Show that $T$ is not sufficient.

Let $x = (X_1 = 0, X_2 = 0, X_3 = 1)$ and $y = (X_1 = 1, X_2 = 1, X_3 = 0)$. We have $T(x) = T(y) = 3$. But
$$f(x \mid p = 1/3)\, f(y \mid p = 2/3) = \left((2/3)^2 \cdot (1/3)\right) \cdot \left((2/3)^2 \cdot (1/3)\right) = 16/729$$
$$\neq f(x \mid p = 2/3)\, f(y \mid p = 1/3) = \left((1/3)^2 \cdot (2/3)\right) \cdot \left((1/3)^2 \cdot (2/3)\right) = 4/729.$$
Thus, by Theorem 1, $T$ is not sufficient.
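Theorem 1 lends itself to a mechanical search for a witness pair: the following sketch scans all pairs of outcomes with $T(x) = T(y)$ for a violation of condition (ii), using the illustrative choices $p_1 = 1/3$ and $p_2 = 2/3$ from the example.

```python
from itertools import product
from fractions import Fraction

def pmf(x, p):
    """Probability of the Bernoulli outcome vector x with success probability p."""
    out = Fraction(1)
    for xi in x:
        out *= p if xi == 1 else 1 - p
    return out

def T(x):
    return x[0] + 2 * x[1] + 3 * x[2]

p1, p2 = Fraction(1, 3), Fraction(2, 3)

# Search all pairs with T(x) = T(y) for a violation of condition (ii) of Theorem 1.
for x, y in product(product([0, 1], repeat=3), repeat=2):
    if T(x) == T(y) and pmf(x, p1) * pmf(y, p2) != pmf(x, p2) * pmf(y, p1):
        print(x, y, pmf(x, p1) * pmf(y, p2), pmf(x, p2) * pmf(y, p1))
        break
```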
III. Implications of Sufficiency
We have said that reducing the data to a sufficient statistic does not sacrifice any information about $\theta$.
We now justify this statement in two ways:
(1) We show that for any decision procedure, we can
find a randomized decision procedure that is based
only on the sufficient statistic and that has the same
risk function.
(2) We show that any point estimator that is not a
function of the sufficient statistic can be improved
upon for a strictly convex loss function.
(1) Let $\delta(X)$ be a decision procedure and $T(X)$ be a sufficient statistic. Consider the following randomized decision procedure [call it $\delta'(T(X))$]: based on $T(X)$, randomly draw $X'$ from the distribution of $X \mid T(X)$ (which does not depend on $\theta$ and is hence known) and take action $\delta(X')$.

$X$ has the same distribution as $X'$, so that $\delta(X)$ has the same distribution as $\delta'(T(X)) = \delta(X')$. Since $\delta(X)$ and $\delta(X')$ have the same distribution, they have the same risk function.
Example 5: $X \sim N(0, \sigma^2)$. $T(X) = |X|$ is sufficient because $X \mid T(X) = t$ is equally likely to be $\pm t$ for all $\sigma^2$. Given $T = t$, construct $X'$ to be $\pm t$ with probability 0.5 each. Then $X' \sim N(0, \sigma^2)$.
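The construction in Example 5 can be checked by simulation: the sketch below rebuilds $X'$ from $|X|$ with a random sign and compares a few empirical quantiles of $X$ and $X'$; the value of $\sigma$ is an arbitrary illustrative choice.

```python
import numpy as np

# Simulation sketch: reconstruct X' from the sufficient statistic |X| by
# attaching a random sign; X' should have the same N(0, sigma^2) distribution
# as X.  sigma = 2.5 is an arbitrary illustrative value.
rng = np.random.default_rng(1)
sigma = 2.5
X = rng.normal(0.0, sigma, size=100_000)
T = np.abs(X)                                         # sufficient statistic
X_prime = T * rng.choice([-1.0, 1.0], size=T.size)    # +t or -t with prob 0.5 each

qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(X, qs))        # quantiles of the original sample
print(np.quantile(X_prime, qs))  # nearly identical quantiles of X'
```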
(2) The Rao-Blackwell Theorem.

Convex functions: A real-valued function $\phi$ defined on an open interval $I = (a, b)$ is convex if for any $a < x < y < b$ and $0 < \alpha < 1$,
$$\phi[\alpha x + (1 - \alpha) y] \le \alpha \phi(x) + (1 - \alpha) \phi(y).$$
$\phi$ is strictly convex if the inequality is strict. If $\phi''$ exists, then $\phi$ is convex if and only if $\phi'' \ge 0$ on $I = (a, b)$. A convex function lies above all its tangent lines.

Convexity of loss functions:
For point estimation:
- Squared error loss is strictly convex.
- Absolute error loss is convex but not strictly convex.
- Huber's loss function,
$$
l(\theta, a) = \begin{cases} (q(\theta) - a)^2 & \text{if } |q(\theta) - a| \le k \\ 2k\,|q(\theta) - a| - k^2 & \text{if } |q(\theta) - a| > k \end{cases}
$$
for some constant $k$, is convex but not strictly convex.
- The zero-one loss function
$$
l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k \end{cases}
$$
is nonconvex.
Jensen's Inequality (Appendix B.9): Let $X$ be a random variable. (i) If $\phi$ is convex in an open interval $I$, $P(X \in I) = 1$, and $E(X)$ is finite, then
$$\phi(E[X]) \le E[\phi(X)].$$
(ii) If $\phi$ is strictly convex, then $\phi(E[X]) < E[\phi(X)]$ unless $X$ equals a constant with probability one.

Proof of (i): Let $L(x)$ be a tangent line to $\phi(x)$ at the point $(E[X], \phi(E[X]))$. Write $L(x) = a + bx$. By the convexity of $\phi$, $\phi(x) \ge a + bx$. Since expectations preserve inequalities,
$$E[\phi(X)] \ge E[a + bX] = a + bE[X] = L(E[X]) = \phi(E[X]),$$
as was to be shown.
Rao-Blackwell Theorem: Let $T(X)$ be a sufficient statistic. Let $\delta$ be a point estimate of $q(\theta)$ and assume that the loss function $l(\theta, d)$ is strictly convex in $d$. Also assume that $R(\theta, \delta) < \infty$. Let $\eta(t) = E[\delta(X) \mid T(X) = t]$. Then $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.

Proof: Fix $\theta$. Apply Jensen's inequality with $\phi(d) = l(\theta, d)$ and let $X$ have the conditional distribution of $X \mid T(X) = t$ for a particular choice of $t$. By Jensen's inequality,
$$l(\theta, \eta(t)) \le E[\,l(\theta, \delta(X)) \mid T(X) = t\,], \qquad (0.1)$$
with strict inequality unless $\delta(X)$ is constant given $T(X) = t$. Taking the expectation over $T(X)$ on both sides of this inequality yields $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.
Comments:
(1) Sufficiency ensures that $\eta(t) = E[\delta(X) \mid T(X) = t]$ is an estimator (i.e., it depends only on $t$ and not on $\theta$).
(2) If the loss is convex rather than strictly convex, we get $\le$ in (0.1).
(3) The theorem is not true without convexity of the loss function.
Example 4 continued: Consider a series of three independent Bernoulli trials $X_1, X_2, X_3$ with probability of success $p$. We have shown that $T(X) = X_1 + X_2 + X_3$ is a sufficient statistic and that $T'(X) = X_1 + 2X_2 + 3X_3$ is not a sufficient statistic. The unbiased estimator
$$\delta(X) = \frac{X_1 + 2X_2 + 3X_3}{6}$$
is a function of the insufficient statistic $T'(X) = X_1 + 2X_2 + 3X_3$ and can thus be improved for a strictly convex loss function by using the Rao-Blackwell theorem:
$$\eta(t) = E(\delta(X) \mid T(X) = t) = E\left[\frac{X_1 + 2X_2 + 3X_3}{6} \,\Big|\, X_1 + X_2 + X_3 = t\right].$$
Note that
$$
P_p(X_1 = x \mid X_1 + X_2 + X_3 = t) = \frac{P_p(X_1 = x,\, X_2 + X_3 = t - x)}{P_p(X_1 + X_2 + X_3 = t)} = \frac{p^x (1-p)^{1-x}\, \binom{2}{t-x} p^{t-x} (1-p)^{2-(t-x)}}{\binom{3}{t} p^t (1-p)^{3-t}} = \frac{\binom{2}{t-x}}{\binom{3}{t}} = \begin{cases} \dfrac{t}{3} & \text{if } x = 1 \\[1ex] 1 - \dfrac{t}{3} & \text{if } x = 0, \end{cases}
$$
and similarly for $X_2$ and $X_3$, so that $E(X_i \mid X_1 + X_2 + X_3 = t) = t/3$ for each $i$. Thus,
$$\eta(t) = E\left[\frac{X_1 + 2X_2 + 3X_3}{6} \,\Big|\, X_1 + X_2 + X_3 = t\right] = \frac{1}{6}\cdot\frac{t}{3} + \frac{2}{6}\cdot\frac{t}{3} + \frac{3}{6}\cdot\frac{t}{3} = \frac{t}{3}.$$
For squared error loss we have
$$R(p, \eta) = \mathrm{Bias}_p(\eta)^2 + \mathrm{Var}_p(\eta) = 0 + \frac{p(1-p)}{3}$$
and
$$R(p, \delta) = \mathrm{Bias}_p(\delta)^2 + \mathrm{Var}_p(\delta) = 0 + \frac{14\, p(1-p)}{36},$$
so that $R(p, \eta) < R(p, \delta)$.
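This risk comparison can be checked by simulation; the value $p = 0.4$ below is an arbitrary illustrative choice.

```python
import numpy as np

# Simulation sketch: mean squared error of delta = (X1 + 2*X2 + 3*X3)/6
# versus its Rao-Blackwellization eta = (X1 + X2 + X3)/3.
rng = np.random.default_rng(2)
p, reps = 0.4, 500_000
X = rng.binomial(1, p, size=(reps, 3))

delta = (X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2]) / 6
eta = X.sum(axis=1) / 3

print(np.mean((delta - p) ** 2), 14 * p * (1 - p) / 36)  # risk of delta
print(np.mean((eta - p) ** 2), p * (1 - p) / 3)          # smaller risk of eta
```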
Consequence of Rao-Blackwell theorem: For convex loss functions, we can dispense with randomized estimators. A randomized estimator randomly chooses the estimate $Y(x)$, where the distribution of $Y(x)$ is known. A randomized estimator can be obtained as an estimator $\delta^*(X, U)$ where $X$ and $U$ are independent and $U$ is uniformly distributed on $(0, 1)$. This is achieved by observing $X = x$ and then using $U$ to construct the distribution of $Y(x)$. For the data $(X, U)$, $X$ is sufficient. Thus, by the Rao-Blackwell Theorem, the nonrandomized estimator $E[\delta^*(X, U) \mid X]$ dominates $\delta^*(X, U)$ for strictly convex loss functions.
IV. Minimal Sufficiency
A statistic $T(X)$ is minimally sufficient if it is sufficient and it provides a reduction of the data that is at least as great as that of any other sufficient statistic $S(X)$, in the sense that we can find a transformation $r$ such that $T(X) = r(S(X))$.
Theorem 2 (Lehmann and Scheffe, 1950): Suppose $S(X)$ is a sufficient statistic for $\theta$. Also suppose that if, for two sample points $x$ and $y$, the ratio $f(x \mid \theta)/f(y \mid \theta)$ is constant as a function of $\theta$, then $S(x) = S(y)$. Then $S(X)$ is a minimal sufficient statistic for $\theta$.

Proof: Let $T(X)$ be any statistic that is sufficient for $\theta$. By the factorization theorem, there exist functions $g$ and $h$ such that $f(x \mid \theta) = g(T(x), \theta)\, h(x)$. Let $x$ and $y$ be any two sample points with $T(x) = T(y)$. Then
$$\frac{f(x \mid \theta)}{f(y \mid \theta)} = \frac{g(T(x), \theta)\, h(x)}{g(T(y), \theta)\, h(y)} = \frac{h(x)}{h(y)}.$$
Since this ratio does not depend on $\theta$, the assumptions of the theorem imply that $S(x) = S(y)$. Thus, $S(X)$ provides at least as coarse a partition of the sample space as $T(X)$, and consequently $S(X)$ is minimal sufficient.
Example 6: Suppose $X_1, \ldots, X_n$ are iid Bernoulli($\theta$). Consider the ratio
$$\frac{f(x \mid \theta)}{f(y \mid \theta)} = \frac{\theta^{\sum_{i=1}^n x_i} (1-\theta)^{n - \sum_{i=1}^n x_i}}{\theta^{\sum_{i=1}^n y_i} (1-\theta)^{n - \sum_{i=1}^n y_i}}.$$
If this ratio is constant as a function of $\theta$, then $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i$. Since we have shown that $T(X) = \sum_{i=1}^n X_i$ is a sufficient statistic, it follows from the above sentence and Theorem 2 that $T(X) = \sum_{i=1}^n X_i$ is a minimal sufficient statistic.
Note that a minimal sufficient statistic is not unique. Any
one-to-one function of a minimal sufficient statistic is also
a minimal sufficient statistic. For example,
$T'(X) = \frac{1}{n} \sum_{i=1}^n X_i$ is a minimal sufficient statistic for the i.i.d. Bernoulli case.
Example 7: Suppose $X_1, \ldots, X_n$ are iid uniform on the interval $(\theta, \theta + 1)$, $-\infty < \theta < \infty$. Then the joint pdf of $X$ is
$$
f(x \mid \theta) = \begin{cases} 1 & \theta < x_i < \theta + 1,\ i = 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} 1 & \max_i x_i - 1 < \theta < \min_i x_i \\ 0 & \text{otherwise.} \end{cases}
$$
The statistic $T(X) = (\min_i X_i, \max_i X_i)$ is a sufficient statistic by the factorization theorem with $g(T(x), \theta) = I(\max_i x_i - 1 < \theta < \min_i x_i)$ and $h(x) = 1$.

For any two sample points $x$ and $y$, the numerator and denominator of the ratio $f(x \mid \theta)/f(y \mid \theta)$ will be positive for the same values of $\theta$ if and only if $\min_i x_i = \min_i y_i$ and $\max_i x_i = \max_i y_i$; if the minima and maxima are equal, then the ratio is constant and in fact equals 1. Thus, $T(X) = (\min_i x_i, \max_i x_i)$ is a minimal sufficient statistic by Theorem 2.
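The indicator form of the likelihood makes this easy to check numerically: samples with the same minimum and maximum have identical likelihood functions of $\theta$, while samples that differ in either do not. The sample values below are arbitrary illustrative choices.

```python
import numpy as np

def likelihood(x, theta):
    """Joint density of an iid Uniform(theta, theta+1) sample x at theta."""
    x = np.asarray(x)
    return float(x.max() - 1 < theta < x.min())

x = [0.3, 0.5, 0.9]   # min 0.3, max 0.9
y = [0.3, 0.7, 0.9]   # same min and max, different middle value
z = [0.2, 0.5, 0.9]   # different minimum

grid = np.linspace(-1.0, 1.0, 401)
print(all(likelihood(x, t) == likelihood(y, t) for t in grid))  # True
print(any(likelihood(x, t) != likelihood(z, t) for t in grid))  # True
```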
Example 7 is a case in which the dimension of the minimal sufficient statistic (2) does not match the dimension of the parameter (1). There are models in which the dimension of the minimal sufficient statistic is equal to the sample size, e.g., $X_1, \ldots, X_n$ iid Cauchy($\theta$),
$$f(x \mid \theta) = \frac{1}{\pi [1 + (x - \theta)^2]}$$
(Problem 1.5.15).