Statistics 550 Notes 5
Reading: Sections 1.4, 1.5
I. Prediction (Chapter 1.4)
A common decision problem is that we want to predict a
variable Y based on a covariate vector Z .
Examples: (1) Predict whether a patient, hospitalized due to
a heart attack, will have a second heart attack. The
prediction is to be based on demographic, diet and clinical
measurements for that patient; (2) Predict the price of a
stock 6 months from now, on the basis of company
performance measures and economic data; (3) Predict the
numbers in a handwritten ZIP code, from a digitized image.
We typically have a “training” sample $(Y_1, Z_1), \ldots, (Y_n, Z_n)$ available from the joint distribution of $(Y, Z)$, and we want to predict $Y_{\text{new}}$ for a new observation from the distribution of $(Y, Z)$ for which we know $Z = Z_{\text{new}}$.
In Section 1.4, we consider how to make predictions when
we know the joint distribution of (Y , Z ) ; in practice, we
often have only an estimate of the joint distribution based
on the training sample.
Let $g(Z)$ be a rule for predicting $Y$ based on $Z$. A criterion that is often used for judging different prediction rules is the mean squared prediction error:
$$\Delta^2(Y, g(Z)) = E[\{g(Z) - Y\}^2 \mid Z]$$
This is the average squared prediction error when $g(Z)$ is used to predict $Y$ for a particular $Z$. We want $\Delta^2(Y, g(Z))$ to be as small as possible.

Theorem 1.4.1: Let $\mu(Z) = E(Y \mid Z)$. Then $\mu(Z)$ is the best prediction rule under mean squared prediction error.

Proof: For any prediction rule $g(Z)$,
$$
\begin{aligned}
\Delta^2(Y, g(Z)) &= E[\{g(Z) - Y\}^2 \mid Z] \\
&= E[\{(g(Z) - E(Y \mid Z)) + (E(Y \mid Z) - Y)\}^2 \mid Z] \\
&= E[(g(Z) - E(Y \mid Z))^2 \mid Z] + 2E[(g(Z) - E(Y \mid Z))(E(Y \mid Z) - Y) \mid Z] + E[(E(Y \mid Z) - Y)^2 \mid Z] \\
&= E[(g(Z) - E(Y \mid Z))^2 \mid Z] + E[(E(Y \mid Z) - Y)^2 \mid Z] \\
&\geq E[(E(Y \mid Z) - Y)^2 \mid Z] = \Delta^2(Y, E(Y \mid Z)),
\end{aligned}
$$
where the cross term vanishes because, conditional on $Z$, $g(Z) - E(Y \mid Z)$ is a constant and $E[E(Y \mid Z) - Y \mid Z] = 0$.
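To illustrate Theorem 1.4.1 numerically, here is a small simulation sketch (not from the notes; the joint distribution of $(Y, Z)$ below is an arbitrary illustrative choice in which $E(Y \mid Z) = Z^2$). The conditional-mean rule should have smaller mean squared prediction error than a competing rule such as $g(Z) = Z$.

    import numpy as np

    # Illustrative joint distribution (an assumption, not from the notes):
    # Z ~ Uniform(0, 1) and Y = Z**2 + Normal(0, 0.1**2), so E(Y | Z) = Z**2.
    rng = np.random.default_rng(0)
    n = 100_000
    z = rng.uniform(0, 1, n)
    y = z**2 + rng.normal(0, 0.1, n)

    mspe_cond_mean = np.mean((z**2 - y)**2)   # rule g(Z) = E(Y | Z) = Z^2
    mspe_other     = np.mean((z - y)**2)      # a competing rule, g(Z) = Z

    print(mspe_cond_mean, mspe_other)         # the conditional mean does better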
II. Sufficient Statistics
Our usual setting: We observe data $X$ from a distribution $P$, where we do not know the true $P$ but only know that $P \in \mathcal{P} = \{P_\theta : \theta \in \Theta\}$ (the statistical model).
The observed sample of data $X$ may be very complicated (e.g., in the handwritten zip code example from Notes 1, the data is 500 $216 \times 72$ matrices). An experimenter may wish to summarize the information in a sample by determining a few key features of the sample values, e.g., the sample mean, the sample variance or the largest observation. These are all examples of statistics.
Recall: A statistic $Y = T(X)$ is a random variable or random vector that is a function of the data.

A statistic is sufficient if it carries all the information in the data about the parameter vector $\theta$. $T(X)$ can be a scalar or a vector. If $T(X)$ is of lower “dimension” than $X$, then we have a good summary of the data that does not discard any important information.

For example, consider a sequence of independent Bernoulli trials with unknown probability of success $\theta$. We may have the intuitive feeling that the total number of successes contains all the information about $\theta$ that there is in the sample, and that the order in which the successes occurred, for example, does not give any additional information. The following definition formalizes this idea:
Definition: A statistic $Y = T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $Y = y$ does not depend on $\theta$ for any value of $y$ (see footnote 1 below on conditioning events of probability zero).

Implication: If a statistic $Y = T(X)$ is sufficient, then if we already know the value of the statistic, knowing the full data $X$ does not provide any additional information about $\theta$.
Footnote 1: This definition is not quite precise. Difficulties arise when $P_\theta(Y = y) = 0$, so that the conditioning event has probability zero. The definition of conditional probability can then be changed at one or more values of $y$ (in fact, at any set of $y$ values which has probability zero) without affecting the distribution of $X$, which is the result of combining the distribution of $Y$ with the conditional distribution of $X$ given $Y$. In general, there can be more than one version of the conditional probability distribution $P(X \mid Y)$ which together with the distribution of $Y$ leads back to the distribution of $X$. We define a statistic as sufficient if there exists at least one version of the conditional probability distributions $P(X \mid Y)$ which are the same for all $\theta \in \Theta$. See Lehmann and Casella, Theory of Point Estimation, 2nd Edition, Chapter 1.6, pp. 34-35, for further discussion. For our purposes, we define $Y$ to be a sufficient statistic if (i) for discrete distributions of the data, for each $y$ that has positive probability for at least one $\theta \in \Theta$, the conditional probability $P(X \mid Y = y)$ does not depend on $\theta$ for all $\theta$ for which $P_\theta(Y = y) > 0$; and (ii) for continuous distributions of the data, for each $y$ that has positive density for at least one $\theta \in \Theta$, the conditional probability density $f(X \mid Y = y)$ does not depend on $\theta$ for all $\theta$ for which $f_\theta(y) > 0$.

Example 1: Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. We will verify that $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$. Consider
$$P\left(X_1 = x_1, \ldots, X_n = x_n \,\Big|\, \sum\nolimits_{i=1}^n X_i = y\right).$$
We have
P ( X 1  x1 ,
, X n  xn |  i 1 X i  y ) 
n
P ( X 1  x1 , , X n  xn , Y  y )
P (Y  y )
 y (1   )n  y
1


n y
n
n y

(1


)
 
 
y
 
 y
The conditional distribution thus does not involve  at all
n
Y

and thus
i1 X i is sufficient for  .
Example 2:
Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$. Consider the statistic $Y = \max_{1 \le i \le n} X_i$.

We showed in Notes 4 that
$$f_Y(y) = \begin{cases} \dfrac{n y^{n-1}}{\theta^n} & 0 \le y \le \theta \\ 0 & \text{elsewhere.} \end{cases}$$

$Y$ must be less than $\theta$. For $y \le \theta$, we have
$$
f(x_1, \ldots, x_n \mid Y = y) = \frac{f(x_1, \ldots, x_n, y)}{f_Y(y)} = \frac{\dfrac{1}{\theta^n}\, I_{\{y \le \theta\}}\, I_{\{\min_i x_i \ge 0\}}\, I_{\{\max_i x_i \le y\}}}{\dfrac{n y^{n-1}}{\theta^n}\, I_{\{y \le \theta\}}} = \frac{1}{n y^{n-1}}\, I_{\{\min_i x_i \ge 0\}}\, I_{\{\max_i x_i \le y\}},
$$
which does not depend on $\theta$ (for all $y$ for which $f_Y(y) > 0$).
It is often hard to verify or disprove sufficiency of a
statistic directly because we need to find the distribution of
the sufficient statistic. The following theorem is often
helpful.
Factorization Theorem: A statistic $T(X)$ is sufficient for $\theta$ if and only if there exist functions $g(t, \theta)$ and $h(x)$ such that
$$p(x \mid \theta) = g(T(x), \theta)\, h(x)$$
for all $x$ and all $\theta \in \Theta$ (where $p(x \mid \theta)$ denotes the probability mass function for discrete data given the parameter $\theta$ and the probability density function for continuous data).
Proof: We prove the theorem for discrete data; the proof for
continuous distributions is similar. First, suppose that the
probability mass function factors as given in the theorem.
We have
P (T ( x )  t )   p( x ' |  )
x ':T ( x ')  t
so that
P ( X  x | T ( x )  t ) 
P ( X  x , T ( x )  t )

P (T ( x )  t )
P ( X  x )

P (T ( x )  t )
g (T ( x ),  )h( x )

 g (T ( x' ), )h( x' )
x ':T ( x ')  t
h( x )
 h( x' )
x ':T ( x ')  t
Thus, T ( X ) is sufficient for  because the conditional
distribution P ( X  x | T ( x)  t ) does not depend on  .
Conversely, suppose T ( X ) is sufficient for  . Then the
conditional distribution of X | T ( X ) does not depend on  .
Let P( X  x | T ( X )  t )  k ( x, t ) . Then
p( x |  )  k ( x, t ) P (T ( x)  t ) .
Thus, we can take h( x)  k ( x, t ), g (t , )  P (T ( x)  t )
Example 1 Continued: $X_1, \ldots, X_n$ is a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. To show that $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$, we factor the probability mass function as follows:
$$
P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1 - x_i} = \theta^{\sum_{i=1}^n x_i} (1-\theta)^{n - \sum_{i=1}^n x_i} = \left( \frac{\theta}{1-\theta} \right)^{\sum_{i=1}^n x_i} (1-\theta)^n
$$
The pmf is of the form $g\left(\sum_{i=1}^n x_i, \theta\right) h(x_1, \ldots, x_n)$ where $h(x_1, \ldots, x_n) = 1$.
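A quick numerical illustration of this factorization (my own sketch, not from the notes): since $h \equiv 1$, the pmf evaluated at any two 0/1 vectors with the same total must be identical for every $\theta$.

    import numpy as np

    def bernoulli_pmf(x, theta):
        """Joint pmf of independent Bernoulli(theta) observations x (a 0/1 sequence)."""
        x = np.asarray(x)
        return float(np.prod(theta**x * (1 - theta)**(1 - x)))

    x1 = [1, 1, 0, 0, 1]   # sum = 3
    x2 = [0, 1, 1, 1, 0]   # sum = 3, but a different arrangement
    for theta in (0.2, 0.5, 0.9):
        # The two values agree for every theta: the pmf depends on the data
        # only through the sufficient statistic sum(x).
        print(theta, bernoulli_pmf(x1, theta), bernoulli_pmf(x2, theta))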
Example 2 continued: Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$. To show that $Y = \max_{1 \le i \le n} X_i$ is sufficient, we factor the pdf as follows:
$$
f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n \frac{1}{\theta}\, I_{\{0 \le x_i \le \theta\}} = \frac{1}{\theta^n}\, I_{\{\max_{1 \le i \le n} x_i \le \theta\}}\, I_{\{\min_{1 \le i \le n} x_i \ge 0\}}
$$
The pdf is of the form $g\left(\max_{1 \le i \le n} x_i, \theta\right) h(x_1, \ldots, x_n)$ where
$$g\left(\max_{1 \le i \le n} x_i, \theta\right) = \frac{1}{\theta^n}\, I_{\{\max_{1 \le i \le n} x_i \le \theta\}}, \qquad h(x_1, \ldots, x_n) = I_{\{\min_{1 \le i \le n} x_i \ge 0\}}.$$
Example 3: Let $X_1, \ldots, X_n$ be iid Normal$(\mu, \sigma^2)$. The pdf factors as
$$
\begin{aligned}
f(x_1, \ldots, x_n ; \mu, \sigma^2) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (x_i - \mu)^2 \right\} \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right\} \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \left( \sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i + n\mu^2 \right) \right\}
\end{aligned}
$$
The pdf is thus of the form $g\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2, \mu, \sigma^2\right) h(x_1, \ldots, x_n)$ where $h(x_1, \ldots, x_n) = 1$. Thus, $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2\right)$ is a two-dimensional sufficient statistic for $\theta = (\mu, \sigma^2)$, i.e., the distribution of $X_1, \ldots, X_n$ given $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2\right)$ does not depend on $(\mu, \sigma^2)$.
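The following sketch (my own illustration, not from the notes) checks this numerically: two different samples with the same $\left(\sum x_i, \sum x_i^2\right)$ have the same likelihood at every $(\mu, \sigma^2)$.

    import numpy as np

    def normal_loglik(x, mu, sigma):
        """Log-likelihood of an iid Normal(mu, sigma^2) sample."""
        x = np.asarray(x, dtype=float)
        return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                            - (x - mu)**2 / (2 * sigma**2)))

    # Two different samples with the same sufficient statistic:
    # sum = 12 and sum of squares = 62 for both.
    x1 = [1.0, 5.0, 6.0]
    x2 = [2.0, 3.0, 7.0]

    for mu, sigma in [(0.0, 1.0), (4.0, 2.0), (10.0, 5.0)]:
        # The two log-likelihoods agree for every (mu, sigma).
        print((mu, sigma), normal_loglik(x1, mu, sigma), normal_loglik(x2, mu, sigma))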
A theorem for proving that a statistic is not sufficient:
Theorem 1: Let $T(X)$ be a statistic. If there exist some $\theta_1, \theta_2 \in \Theta$ and $x, y$ such that
(i) $T(x) = T(y)$;
(ii) $p(x \mid \theta_1)\, p(y \mid \theta_2) \ne p(x \mid \theta_2)\, p(y \mid \theta_1)$,
then $T(X)$ is not a sufficient statistic.
Proof: Assume that (i) and (ii) hold. Suppose that $T(X)$ is a sufficient statistic. Then by the factorization theorem, $p(x \mid \theta) = g(T(x), \theta)\, h(x)$. Thus,
$$
p(x \mid \theta_1)\, p(y \mid \theta_2) = g(T(x), \theta_1) h(x)\, g(T(y), \theta_2) h(y) = g(T(x), \theta_1) h(x)\, g(T(x), \theta_2) h(y),
$$
where the last equality follows from (i). Also,
$$
p(x \mid \theta_2)\, p(y \mid \theta_1) = g(T(x), \theta_2) h(x)\, g(T(y), \theta_1) h(y) = g(T(x), \theta_2) h(x)\, g(T(x), \theta_1) h(y),
$$
where the last equality follows from (i). Thus, $p(x \mid \theta_1)\, p(y \mid \theta_2) = p(x \mid \theta_2)\, p(y \mid \theta_1)$. This contradicts (ii). Hence the supposition that $T(X)$ is a sufficient statistic is impossible, and $T(X)$ must not be a sufficient statistic when (i) and (ii) hold.
■
Example 4: Consider a series of three independent Bernoulli trials $X_1, X_2, X_3$ with probability of success $p$. Let $T = X_1 + 2X_2 + 3X_3$. Show that $T$ is not sufficient.

Let $x = (X_1 = 0, X_2 = 0, X_3 = 1)$ and $y = (X_1 = 1, X_2 = 1, X_3 = 0)$. We have $T(x) = T(y) = 3$. But
$$
f(x \mid p = 1/3)\, f(y \mid p = 2/3) = \left[(2/3)^2 (1/3)\right]\left[(2/3)^2 (1/3)\right] = 16/729
$$
$$
\ne f(x \mid p = 2/3)\, f(y \mid p = 1/3) = \left[(1/3)^2 (2/3)\right]\left[(1/3)^2 (2/3)\right] = 4/729.
$$
Thus, by Theorem 1, $T$ is not sufficient.
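A direct numerical check of these two products (my own sketch, using exact rational arithmetic):

    from fractions import Fraction
    from math import prod

    def pmf(x, p):
        """Joint pmf of independent Bernoulli(p) trials."""
        return prod(p if xi == 1 else 1 - p for xi in x)

    x, y = (0, 0, 1), (1, 1, 0)          # both have T = x1 + 2*x2 + 3*x3 = 3
    p1, p2 = Fraction(1, 3), Fraction(2, 3)

    lhs = pmf(x, p1) * pmf(y, p2)        # 16/729
    rhs = pmf(x, p2) * pmf(y, p1)        # 4/729
    print(lhs, rhs, lhs == rhs)          # unequal, so by Theorem 1, T is not sufficient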
III. Implications of Sufficiency
We have said that reducing the data to a sufficient statistic does not sacrifice any information about $\theta$.
We now justify this statement in two ways:
(1) We show that for any decision procedure, we can
find a randomized decision procedure that is based
only on the sufficient statistic and that has the same
risk function.
(2) We show that any point estimator that is not a
function of the sufficient statistic can be improved
upon for a strictly convex loss function.
(1) Let $\delta(X)$ be a decision procedure and $T(X)$ be a sufficient statistic. Consider the following randomized decision procedure [call it $\delta'(T(X))$]:

Based on $T(X)$, randomly draw $X'$ from the distribution of $X \mid T(X)$ (which does not depend on $\theta$ and is hence known) and take action $\delta(X')$.

$X$ has the same distribution as $X'$, so that $\delta(X)$ has the same distribution as $\delta'(T(X)) = \delta(X')$. Since $\delta(X)$ and $\delta(X')$ have the same distribution, they have the same risk function.
Example 5: $X \sim N(0, \sigma^2)$. $T(X) = |X|$ is sufficient because, given $T(X) = t$, $X$ is equally likely to be $\pm t$ for all $\sigma^2$. Given $T = t$, construct $X'$ to be $\pm t$ with probability 0.5 each. Then $X' \sim N(0, \sigma^2)$.
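A small simulation of this reconstruction (my own sketch; $\sigma = 2$ is an arbitrary illustrative value):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 2.0
    x = rng.normal(0, sigma, 100_000)

    # Keep only T(X) = |X|, then rebuild X' by attaching a random sign.
    t = np.abs(x)
    x_prime = t * rng.choice([-1.0, 1.0], size=t.size)

    # X and X' have (approximately) the same N(0, sigma^2) distribution.
    print(x.mean(), x.std(), x_prime.mean(), x_prime.std())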
(2) The Rao-Blackwell Theorem.
Convex functions: A real-valued function $\phi$ defined on an open interval $I = (a, b)$ is convex if for any $a < x < y < b$ and $0 \le \lambda \le 1$,
$$\phi[\lambda x + (1 - \lambda) y] \le \lambda\, \phi(x) + (1 - \lambda)\, \phi(y).$$
$\phi$ is strictly convex if the inequality is strict whenever $0 < \lambda < 1$.

If $\phi''$ exists, then $\phi$ is convex if and only if $\phi'' \ge 0$ on $I = (a, b)$.
A convex function lies above all its tangent lines.
Convexity of loss functions:
For point estimation:
• Squared error loss is strictly convex.
• Absolute error loss is convex but not strictly convex.
• Huber's loss function,
$$l(\theta, a) = \begin{cases} (q(\theta) - a)^2 & \text{if } |q(\theta) - a| \le k \\ 2k\,|q(\theta) - a| - k^2 & \text{if } |q(\theta) - a| > k, \end{cases}$$
for some constant $k$, is convex but not strictly convex.
• The zero-one loss function,
$$l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k, \end{cases}$$
is nonconvex.
Jensen's Inequality (Appendix B.9): Let $X$ be a random variable. (i) If $\phi$ is convex in an open interval $I$ with $P(X \in I) = 1$ and $E(X) < \infty$, then
$$\phi(E[X]) \le E[\phi(X)].$$
(ii) If $\phi$ is strictly convex, then $\phi(E[X]) < E[\phi(X)]$ unless $X$ equals a constant with probability one.

Proof of (i): Let $L(x)$ be a tangent line to $\phi(x)$ at the point $(E[X], \phi(E[X]))$. Write $L(x) = a + bx$. By the convexity of $\phi$, $\phi(x) \ge a + bx$. Since expectations preserve inequalities,
$$E[\phi(X)] \ge E[a + bX] = a + b\,E[X] = L(E[X]) = \phi(E[X]),$$
as was to be shown.
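As a quick numerical illustration of Jensen's inequality (my own sketch, not from the notes), take the strictly convex function $\phi(x) = x^2$ and a simulated non-degenerate random variable:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=100_000)   # an arbitrary non-constant X

    # Jensen: phi(E[X]) <= E[phi(X)] for convex phi; the inequality is strict
    # here because phi(x) = x**2 is strictly convex and X is not constant.
    print(np.mean(x)**2, np.mean(x**2))            # the first value is smaller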
Rao-Blackwell Theorem: Let $T(X)$ be a sufficient statistic. Let $\delta$ be a point estimate of $q(\theta)$ and assume that the loss function $l(\theta, d)$ is strictly convex in $d$. Also assume that $R(\theta, \delta) < \infty$. Let $\eta(t) = E[\delta(X) \mid T(X) = t]$. Then $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.

Proof: Fix $\theta$. Apply Jensen's inequality with $\phi(d(x)) = l(\theta, d(x))$ and let $X$ have the conditional distribution of $X \mid T(X) = t$ for a particular choice of $t$. By Jensen's inequality,
$$l(\theta, \eta(t)) \le E[l(\theta, \delta(X)) \mid T(X) = t] \qquad (0.1)$$
with the inequality being strict unless $\delta(X) = \eta(t)$ with probability one. Taking the expectation on both sides of this inequality yields $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.
Comments:
(1) Sufficiency ensures that $\eta(t) = E[\delta(X) \mid T(X) = t]$ is an estimator (i.e., it depends only on $t$ and not on $\theta$).
(2) If the loss is convex rather than strictly convex, we get $\le$ in (0.1).
(3) The theorem is not true without convexity of the loss function.
Example 4 continued: Consider a series of three independent Bernoulli trials $X_1, X_2, X_3$ with probability of success $p$. We have shown that $T(X) = X_1 + X_2 + X_3$ is a sufficient statistic and that $T'(X) = X_1 + 2X_2 + 3X_3$ is not a sufficient statistic. The unbiased estimator
$$\delta(X) = \frac{X_1 + 2X_2 + 3X_3}{6}$$
is a function of the insufficient statistic $T'(X) = X_1 + 2X_2 + 3X_3$ and can thus be improved for a strictly convex loss function by using the Rao-Blackwell theorem:
$$\eta(t) = E(\delta(X) \mid T(X) = t) = E\left[ \frac{X_1 + 2X_2 + 3X_3}{6} \,\Big|\, X_1 + X_2 + X_3 = t \right]$$
Note that
$$
P_p(X_1 = x \mid X_1 + X_2 + X_3 = t) = \frac{P_p(X_1 = x, X_2 + X_3 = t - x)}{P_p(X_1 + X_2 + X_3 = t)} = \frac{p^x (1-p)^{1-x} \binom{2}{t-x} p^{t-x} (1-p)^{2-(t-x)}}{\binom{3}{t} p^t (1-p)^{3-t}} = \frac{\binom{2}{t-x}}{\binom{3}{t}} = \begin{cases} \dfrac{t}{3} & \text{if } x = 1 \\[6pt] 1 - \dfrac{t}{3} & \text{if } x = 0, \end{cases}
$$
and by symmetry the same conditional distribution holds for $X_2$ and $X_3$. Thus,
$$
\eta(t) = E\left[ \frac{X_1 + 2X_2 + 3X_3}{6} \,\Big|\, X_1 + X_2 + X_3 = t \right] = \frac{1}{6} \cdot \frac{t}{3} + \frac{2}{6} \cdot \frac{t}{3} + \frac{3}{6} \cdot \frac{t}{3} = \frac{t}{3}.
$$
For squared error loss we have
$$
R(p, \eta) = \mathrm{Bias}_p(\eta)^2 + \mathrm{Var}_p(\eta) = 0 + \frac{p(1-p)}{3}
$$
and
$$
R(p, \delta) = \mathrm{Bias}_p(\delta)^2 + \mathrm{Var}_p(\delta) = 0 + \frac{14\, p(1-p)}{36},
$$
so that $R(p, \eta) < R(p, \delta)$.
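The improvement can also be seen by simulation (my own sketch; $p = 0.4$ is an arbitrary illustrative value):

    import numpy as np

    rng = np.random.default_rng(0)
    p, reps = 0.4, 200_000
    x = (rng.uniform(size=(reps, 3)) < p).astype(float)    # columns are X1, X2, X3

    delta = (x[:, 0] + 2 * x[:, 1] + 3 * x[:, 2]) / 6       # original unbiased estimator
    eta   = x.sum(axis=1) / 3                               # Rao-Blackwellized estimator

    # Monte Carlo risks under squared error loss:
    # roughly 14*p*(1-p)/36 = 0.0933... versus p*(1-p)/3 = 0.08.
    print(np.mean((delta - p)**2), np.mean((eta - p)**2))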
Consequence of the Rao-Blackwell theorem: For convex loss functions, we can dispense with randomized estimators.

A randomized estimator randomly chooses the estimate $Y(x)$, where the distribution of $Y(x)$ is known. A randomized estimator can be obtained as an estimator $\delta^*(X, U)$ where $X$ and $U$ are independent and $U$ is uniformly distributed on $(0, 1)$; this is achieved by observing $X = x$ and then using $U$ to construct the distribution of $Y(x)$. For the data $(X, U)$, $X$ is sufficient. Thus, by the Rao-Blackwell Theorem, the nonrandomized estimator $E[\delta^*(X, U) \mid X]$ dominates $\delta^*(X, U)$ for strictly convex loss functions.