Statistics 550 Notes 5
Reading: Sections 1.4, 1.5
I. Prediction (Chapter 1.4)
A common decision problem is that we want to predict a
variable Y based on a covariate vector Z .
Examples: (1) Predict whether a patient, hospitalized due to
a heart attack, will have a second heart attack. The
prediction is to be based on demographic, diet and clinical
measurements for that patient; (2) Predict the price of a
stock 6 months from now, on the basis of company
performance measures and economic data; (3) Predict the
numbers in a handwritten ZIP code, from a digitized image.
We typically have a “training” sample $(Y_1, Z_1), \ldots, (Y_n, Z_n)$ available from the joint distribution of $(Y, Z)$, and we want to predict $Y_{\text{new}}$ for a new observation from the distribution of $(Y, Z)$ for which we know $Z = Z_{\text{new}}$.
In Section 1.4, we consider how to make predictions when
we know the joint distribution of (Y , Z ) ; in practice, we
often have only an estimate of the joint distribution based
on the training sample.
Let $g(Z)$ be a rule for predicting $Y$ based on $Z$. A criterion that is often used for judging different prediction rules is the mean squared prediction error:
$$\Delta^2(Y, g(Z)) = E[\{g(Z) - Y\}^2 \mid Z]$$
This is the average squared prediction error when $g(Z)$ is used to predict $Y$ for a particular $Z$. We want $\Delta^2(Y, g(Z))$ to be as small as possible.

Theorem 1.4.1: Let $\mu(Z) = E(Y \mid Z)$. Then $\mu(Z)$ is the best prediction rule under mean squared prediction error.

Proof: For any prediction rule $g(Z)$,
$$
\begin{aligned}
\Delta^2(Y, g(Z)) &= E[\{g(Z) - Y\}^2 \mid Z] \\
&= E[\{(g(Z) - E(Y \mid Z)) + (E(Y \mid Z) - Y)\}^2 \mid Z] \\
&= E[(g(Z) - E(Y \mid Z))^2 \mid Z] + 2E[(g(Z) - E(Y \mid Z))(E(Y \mid Z) - Y) \mid Z] + E[(E(Y \mid Z) - Y)^2 \mid Z] \\
&= E[(g(Z) - E(Y \mid Z))^2 \mid Z] + E[(E(Y \mid Z) - Y)^2 \mid Z] \\
&\geq E[(E(Y \mid Z) - Y)^2 \mid Z] = \Delta^2(Y, E(Y \mid Z)),
\end{aligned}
$$
where the cross term vanishes because, conditional on $Z$, $g(Z) - E(Y \mid Z)$ is a constant and $E[E(Y \mid Z) - Y \mid Z] = 0$.
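To illustrate Theorem 1.4.1 numerically, here is a small simulation sketch (not from the notes; the joint distribution of $(Y, Z)$ below is an arbitrary illustrative choice in which $E(Y \mid Z) = Z^2$). The conditional-mean rule should have smaller mean squared prediction error than a competing rule such as $g(Z) = Z$.

    import numpy as np

    # Illustrative joint distribution (an assumption, not from the notes):
    # Z ~ Uniform(0, 1) and Y = Z**2 + Normal(0, 0.1**2), so E(Y | Z) = Z**2.
    rng = np.random.default_rng(0)
    n = 100_000
    z = rng.uniform(0, 1, n)
    y = z**2 + rng.normal(0, 0.1, n)

    mspe_cond_mean = np.mean((z**2 - y)**2)   # rule g(Z) = E(Y | Z) = Z^2
    mspe_other     = np.mean((z - y)**2)      # a competing rule, g(Z) = Z

    print(mspe_cond_mean, mspe_other)         # the conditional mean does better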
II. Sufficient Statistics
Our usual setting: We observe data $X$ from a distribution $P$, where we do not know the true $P$ but only know that $P \in \mathcal{P} = \{P_\theta : \theta \in \Theta\}$ (the statistical model).
The observed sample of data $X$ may be very complicated (e.g., in the handwritten zip code example from Notes 1, the data is 500 $216 \times 72$ matrices). An experimenter may wish to summarize the information in a sample by determining a few key features of the sample values, e.g., the sample mean, the sample variance or the largest observation. These are all examples of statistics.
Recall: A statistic $Y = T(X)$ is a random variable or random vector that is a function of the data.

A statistic is sufficient if it carries all the information in the data about the parameter vector $\theta$. $T(X)$ can be a scalar or a vector. If $T(X)$ is of lower “dimension” than $X$, then we have a good summary of the data that does not discard any important information.

For example, consider a sequence of independent Bernoulli trials with unknown probability of success $\theta$. We may have the intuitive feeling that the total number of successes contains all the information about $\theta$ that there is in the sample, and that the order in which the successes occurred, for example, does not give any additional information. The following definition formalizes this idea:
Definition: A statistic $Y = T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $Y = y$ does not depend on $\theta$ for any value of $y$ (see footnote 1 below on conditioning events of probability zero).

Implication: If a statistic $Y = T(X)$ is sufficient, then if we already know the value of the statistic, knowing the full data $X$ does not provide any additional information about $\theta$.
Footnote 1: This definition is not quite precise. Difficulties arise when $P_\theta(Y = y) = 0$, so that the conditioning event has probability zero. The definition of conditional probability can then be changed at one or more values of $y$ (in fact, at any set of $y$ values which has probability zero) without affecting the distribution of $X$, which is the result of combining the distribution of $Y$ with the conditional distribution of $X$ given $Y$. In general, there can be more than one version of the conditional probability distribution $P(X \mid Y)$ which together with the distribution of $Y$ leads back to the distribution of $X$. We define a statistic as sufficient if there exists at least one version of the conditional probability distributions $P(X \mid Y)$ which are the same for all $\theta \in \Theta$. See Lehmann and Casella, Theory of Point Estimation, 2nd Edition, Chapter 1.6, pp. 34-35, for further discussion. For our purposes, we define $Y$ to be a sufficient statistic if (i) for discrete distributions of the data, for each $y$ that has positive probability for at least one $\theta \in \Theta$, the conditional probability $P(X \mid Y = y)$ does not depend on $\theta$ for all $\theta$ for which $P_\theta(Y = y) > 0$; and (ii) for continuous distributions of the data, for each $y$ that has positive density for at least one $\theta \in \Theta$, the conditional probability density $f(X \mid Y = y)$ does not depend on $\theta$ for all $\theta$ for which $f_\theta(y) > 0$.

Example 1: Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. We will verify that $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$. Consider
$$P\left(X_1 = x_1, \ldots, X_n = x_n \,\Big|\, \sum\nolimits_{i=1}^n X_i = y\right).$$
We have
P ( X 1  x1 ,
, X n  xn |  i 1 X i  y ) 
n
P ( X 1  x1 , , X n  xn , Y  y )
P (Y  y )
 y (1   )n  y
1


n y
n
n y

(1


)
 
 
y
 
 y
The conditional distribution thus does not involve  at all
n
Y

and thus
i1 X i is sufficient for  .
Example 2:
Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$. Consider the statistic $Y = \max_{1 \le i \le n} X_i$.

We showed in Notes 4 that
$$f_Y(y) = \begin{cases} \dfrac{n y^{n-1}}{\theta^n} & 0 \le y \le \theta \\ 0 & \text{elsewhere.} \end{cases}$$

$Y$ must be less than $\theta$. For $y \le \theta$, we have
$$
f(x_1, \ldots, x_n \mid Y = y) = \frac{f(x_1, \ldots, x_n, y)}{f_Y(y)} = \frac{\dfrac{1}{\theta^n}\, I_{\{y \le \theta\}}\, I_{\{\min_i x_i \ge 0\}}\, I_{\{\max_i x_i \le y\}}}{\dfrac{n y^{n-1}}{\theta^n}\, I_{\{y \le \theta\}}} = \frac{1}{n y^{n-1}}\, I_{\{\min_i x_i \ge 0\}}\, I_{\{\max_i x_i \le y\}},
$$
which does not depend on $\theta$ (for all $y$ for which $f_Y(y) > 0$).
It is often hard to verify or disprove sufficiency of a
statistic directly because we need to find the distribution of
the sufficient statistic. The following theorem is often
helpful.
Factorization Theorem: A statistic $T(X)$ is sufficient for $\theta$ if and only if there exist functions $g(t, \theta)$ and $h(x)$ such that
$$p(x \mid \theta) = g(T(x), \theta)\, h(x)$$
for all $x$ and all $\theta \in \Theta$ (where $p(x \mid \theta)$ denotes the probability mass function for discrete data given the parameter $\theta$ and the probability density function for continuous data).
Proof: We prove the theorem for discrete data; the proof for
continuous distributions is similar. First, suppose that the
probability mass function factors as given in the theorem.
We have
P (T ( x )  t )   p( x ' |  )
x ':T ( x ')  t
so that
P ( X  x | T ( x )  t ) 
P ( X  x , T ( x )  t )

P (T ( x )  t )
P ( X  x )

P (T ( x )  t )
g (T ( x ),  )h( x )

 g (T ( x' ), )h( x' )
x ':T ( x ')  t
h( x )
 h( x' )
x ':T ( x ')  t
Thus, T ( X ) is sufficient for  because the conditional
distribution P ( X  x | T ( x)  t ) does not depend on  .
Conversely, suppose T ( X ) is sufficient for  . Then the
conditional distribution of X | T ( X ) does not depend on  .
Let P( X  x | T ( X )  t )  k ( x, t ) . Then
p( x |  )  k ( x, t ) P (T ( x)  t ) .
Thus, we can take h( x)  k ( x, t ), g (t , )  P (T ( x)  t )
Example 1 Continued: $X_1, \ldots, X_n$ is a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. To show that $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$, we factor the probability mass function as follows:
$$
P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1 - x_i} = \theta^{\sum_{i=1}^n x_i} (1-\theta)^{n - \sum_{i=1}^n x_i} = \left( \frac{\theta}{1-\theta} \right)^{\sum_{i=1}^n x_i} (1-\theta)^n
$$
The pmf is of the form $g\left(\sum_{i=1}^n x_i, \theta\right) h(x_1, \ldots, x_n)$ where $h(x_1, \ldots, x_n) = 1$.
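A quick numerical illustration of this factorization (my own sketch, not from the notes): since $h \equiv 1$, the pmf evaluated at any two 0/1 vectors with the same total must be identical for every $\theta$.

    import numpy as np

    def bernoulli_pmf(x, theta):
        """Joint pmf of independent Bernoulli(theta) observations x (a 0/1 sequence)."""
        x = np.asarray(x)
        return float(np.prod(theta**x * (1 - theta)**(1 - x)))

    x1 = [1, 1, 0, 0, 1]   # sum = 3
    x2 = [0, 1, 1, 1, 0]   # sum = 3, but a different arrangement
    for theta in (0.2, 0.5, 0.9):
        # The two values agree for every theta: the pmf depends on the data
        # only through the sufficient statistic sum(x).
        print(theta, bernoulli_pmf(x1, theta), bernoulli_pmf(x2, theta))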
Example 2 continued: Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$. To show that $Y = \max_{1 \le i \le n} X_i$ is sufficient, we factor the pdf as follows:
$$
f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n \frac{1}{\theta}\, I_{\{0 \le x_i \le \theta\}} = \frac{1}{\theta^n}\, I_{\{\max_{1 \le i \le n} x_i \le \theta\}}\, I_{\{\min_{1 \le i \le n} x_i \ge 0\}}
$$
The pdf is of the form $g\left(\max_{1 \le i \le n} x_i, \theta\right) h(x_1, \ldots, x_n)$ where
$$g\left(\max_{1 \le i \le n} x_i, \theta\right) = \frac{1}{\theta^n}\, I_{\{\max_{1 \le i \le n} x_i \le \theta\}}, \qquad h(x_1, \ldots, x_n) = I_{\{\min_{1 \le i \le n} x_i \ge 0\}}.$$
Example 3: Let $X_1, \ldots, X_n$ be iid Normal$(\mu, \sigma^2)$. The pdf factors as
$$
\begin{aligned}
f(x_1, \ldots, x_n ; \mu, \sigma^2) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (x_i - \mu)^2 \right\} \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right\} \\
&= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \left( \sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i + n\mu^2 \right) \right\}
\end{aligned}
$$
The pdf is thus of the form $g\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2, \mu, \sigma^2\right) h(x_1, \ldots, x_n)$ where $h(x_1, \ldots, x_n) = 1$. Thus, $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2\right)$ is a two-dimensional sufficient statistic for $\theta = (\mu, \sigma^2)$, i.e., the distribution of $X_1, \ldots, X_n$ given $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2\right)$ does not depend on $(\mu, \sigma^2)$.
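The following sketch (my own illustration, not from the notes) checks this numerically: two different samples with the same $\left(\sum x_i, \sum x_i^2\right)$ have the same likelihood at every $(\mu, \sigma^2)$.

    import numpy as np

    def normal_loglik(x, mu, sigma):
        """Log-likelihood of an iid Normal(mu, sigma^2) sample."""
        x = np.asarray(x, dtype=float)
        return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                            - (x - mu)**2 / (2 * sigma**2)))

    # Two different samples with the same sufficient statistic:
    # sum = 12 and sum of squares = 62 for both.
    x1 = [1.0, 5.0, 6.0]
    x2 = [2.0, 3.0, 7.0]

    for mu, sigma in [(0.0, 1.0), (4.0, 2.0), (10.0, 5.0)]:
        # The two log-likelihoods agree for every (mu, sigma).
        print((mu, sigma), normal_loglik(x1, mu, sigma), normal_loglik(x2, mu, sigma))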
A theorem for proving that a statistic is not sufficient:
Theorem 1: Let $T(X)$ be a statistic. If there exist some $\theta_1, \theta_2 \in \Theta$ and $x, y$ such that
(i) $T(x) = T(y)$;
(ii) $p(x \mid \theta_1)\, p(y \mid \theta_2) \ne p(x \mid \theta_2)\, p(y \mid \theta_1)$,
then $T(X)$ is not a sufficient statistic.
Proof: Assume that (i) and (ii) hold. Suppose that $T(X)$ is a sufficient statistic. Then by the factorization theorem, $p(x \mid \theta) = g(T(x), \theta)\, h(x)$. Thus,
$$
p(x \mid \theta_1)\, p(y \mid \theta_2) = g(T(x), \theta_1) h(x)\, g(T(y), \theta_2) h(y) = g(T(x), \theta_1) h(x)\, g(T(x), \theta_2) h(y),
$$
where the last equality follows from (i). Also,
$$
p(x \mid \theta_2)\, p(y \mid \theta_1) = g(T(x), \theta_2) h(x)\, g(T(y), \theta_1) h(y) = g(T(x), \theta_2) h(x)\, g(T(x), \theta_1) h(y),
$$
where the last equality follows from (i). Thus, $p(x \mid \theta_1)\, p(y \mid \theta_2) = p(x \mid \theta_2)\, p(y \mid \theta_1)$. This contradicts (ii). Hence the supposition that $T(X)$ is a sufficient statistic is impossible, and $T(X)$ must not be a sufficient statistic when (i) and (ii) hold.
■
Example 4: Consider a series of three independent Bernoulli trials $X_1, X_2, X_3$ with probability of success $p$. Let $T = X_1 + 2X_2 + 3X_3$. Show that $T$ is not sufficient.

Let $x = (X_1 = 0, X_2 = 0, X_3 = 1)$ and $y = (X_1 = 1, X_2 = 1, X_3 = 0)$. We have $T(x) = T(y) = 3$. But
$$
f(x \mid p = 1/3)\, f(y \mid p = 2/3) = \left[(2/3)^2 (1/3)\right]\left[(2/3)^2 (1/3)\right] = 16/729
$$
$$
\ne f(x \mid p = 2/3)\, f(y \mid p = 1/3) = \left[(1/3)^2 (2/3)\right]\left[(1/3)^2 (2/3)\right] = 4/729.
$$
Thus, by Theorem 1, $T$ is not sufficient.
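A direct numerical check of these two products (my own sketch, using exact rational arithmetic):

    from fractions import Fraction
    from math import prod

    def pmf(x, p):
        """Joint pmf of independent Bernoulli(p) trials."""
        return prod(p if xi == 1 else 1 - p for xi in x)

    x, y = (0, 0, 1), (1, 1, 0)          # both have T = x1 + 2*x2 + 3*x3 = 3
    p1, p2 = Fraction(1, 3), Fraction(2, 3)

    lhs = pmf(x, p1) * pmf(y, p2)        # 16/729
    rhs = pmf(x, p2) * pmf(y, p1)        # 4/729
    print(lhs, rhs, lhs == rhs)          # unequal, so by Theorem 1, T is not sufficient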
III. Implications of Sufficiency
We have said that reducing the data to a sufficient statistic does not sacrifice any information about $\theta$.
We now justify this statement in two ways:
(1) We show that for any decision procedure, we can
find a randomized decision procedure that is based
only on the sufficient statistic and that has the same
risk function.
(2) We show that any point estimator that is not a
function of the sufficient statistic can be improved
upon for a strictly convex loss function.
(1) Let $\delta(X)$ be a decision procedure and $T(X)$ be a sufficient statistic. Consider the following randomized decision procedure [call it $\delta'(T(X))$]:

Based on $T(X)$, randomly draw $X'$ from the distribution of $X \mid T(X)$ (which does not depend on $\theta$ and is hence known) and take action $\delta(X')$.

$X$ has the same distribution as $X'$, so that $\delta(X)$ has the same distribution as $\delta'(T(X)) = \delta(X')$. Since $\delta(X)$ and $\delta(X')$ have the same distribution, they have the same risk function.
Example 5: $X \sim N(0, \sigma^2)$. $T(X) = |X|$ is sufficient because, given $T(X) = t$, $X$ is equally likely to be $\pm t$ for all $\sigma^2$. Given $T = t$, construct $X'$ to be $\pm t$ with probability 0.5 each. Then $X' \sim N(0, \sigma^2)$.
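A small simulation of this reconstruction (my own sketch; $\sigma = 2$ is an arbitrary illustrative value):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 2.0
    x = rng.normal(0, sigma, 100_000)

    # Keep only T(X) = |X|, then rebuild X' by attaching a random sign.
    t = np.abs(x)
    x_prime = t * rng.choice([-1.0, 1.0], size=t.size)

    # X and X' have (approximately) the same N(0, sigma^2) distribution.
    print(x.mean(), x.std(), x_prime.mean(), x_prime.std())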
(2) The Rao-Blackwell Theorem.
Convex functions: A real-valued function $\phi$ defined on an open interval $I = (a, b)$ is convex if for any $a < x < y < b$ and $0 \le \lambda \le 1$,
$$\phi[\lambda x + (1 - \lambda) y] \le \lambda\, \phi(x) + (1 - \lambda)\, \phi(y).$$
$\phi$ is strictly convex if the inequality is strict whenever $0 < \lambda < 1$.

If $\phi''$ exists, then $\phi$ is convex if and only if $\phi'' \ge 0$ on $I = (a, b)$.
A convex function lies above all its tangent lines.
Convexity of loss functions:
For point estimation:
• Squared error loss is strictly convex.
• Absolute error loss is convex but not strictly convex.
• Huber's loss function,
$$l(\theta, a) = \begin{cases} (q(\theta) - a)^2 & \text{if } |q(\theta) - a| \le k \\ 2k\,|q(\theta) - a| - k^2 & \text{if } |q(\theta) - a| > k, \end{cases}$$
for some constant $k$, is convex but not strictly convex.
• The zero-one loss function,
$$l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k, \end{cases}$$
is nonconvex.
Jensen's Inequality (Appendix B.9): Let $X$ be a random variable. (i) If $\phi$ is convex in an open interval $I$ with $P(X \in I) = 1$ and $E(X) < \infty$, then
$$\phi(E[X]) \le E[\phi(X)].$$
(ii) If $\phi$ is strictly convex, then $\phi(E[X]) < E[\phi(X)]$ unless $X$ equals a constant with probability one.

Proof of (i): Let $L(x)$ be a tangent line to $\phi(x)$ at the point $(E[X], \phi(E[X]))$. Write $L(x) = a + bx$. By the convexity of $\phi$, $\phi(x) \ge a + bx$. Since expectations preserve inequalities,
$$E[\phi(X)] \ge E[a + bX] = a + b\,E[X] = L(E[X]) = \phi(E[X]),$$
as was to be shown.
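As a quick numerical illustration of Jensen's inequality (my own sketch, not from the notes), take the strictly convex function $\phi(x) = x^2$ and a simulated non-degenerate random variable:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=100_000)   # an arbitrary non-constant X

    # Jensen: phi(E[X]) <= E[phi(X)] for convex phi; the inequality is strict
    # here because phi(x) = x**2 is strictly convex and X is not constant.
    print(np.mean(x)**2, np.mean(x**2))            # the first value is smaller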
Rao-Blackwell Theorem: Let $T(X)$ be a sufficient statistic. Let $\delta$ be a point estimate of $q(\theta)$ and assume that the loss function $l(\theta, d)$ is strictly convex in $d$. Also assume that $R(\theta, \delta) < \infty$. Let $\eta(t) = E[\delta(X) \mid T(X) = t]$. Then $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.

Proof: Fix $\theta$. Apply Jensen's inequality with $\phi(d(x)) = l(\theta, d(x))$ and let $X$ have the conditional distribution of $X \mid T(X) = t$ for a particular choice of $t$. By Jensen's inequality,
$$l(\theta, \eta(t)) \le E[l(\theta, \delta(X)) \mid T(X) = t] \qquad (0.1)$$
with the inequality being strict unless $\delta(X) = \eta(t)$ with probability one. Taking the expectation on both sides of this inequality yields $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.
Comments:
(1) Sufficiency ensures that $\eta(t) = E[\delta(X) \mid T(X) = t]$ is an estimator (i.e., it depends only on $t$ and not on $\theta$).
(2) If the loss is convex rather than strictly convex, we get $\le$ in (0.1).
(3) The theorem is not true without convexity of the loss function.
Example 4 continued: Consider a series of three independent Bernoulli trials $X_1, X_2, X_3$ with probability of success $p$. We have shown that $T(X) = X_1 + X_2 + X_3$ is a sufficient statistic and that $T'(X) = X_1 + 2X_2 + 3X_3$ is not a sufficient statistic. The unbiased estimator
$$\delta(X) = \frac{X_1 + 2X_2 + 3X_3}{6}$$
is a function of the insufficient statistic $T'(X) = X_1 + 2X_2 + 3X_3$ and can thus be improved for a strictly convex loss function by using the Rao-Blackwell theorem:
$$\eta(t) = E(\delta(X) \mid T(X) = t) = E\left[ \frac{X_1 + 2X_2 + 3X_3}{6} \,\Big|\, X_1 + X_2 + X_3 = t \right]$$
Note that
$$
P_p(X_1 = x \mid X_1 + X_2 + X_3 = t) = \frac{P_p(X_1 = x, X_2 + X_3 = t - x)}{P_p(X_1 + X_2 + X_3 = t)} = \frac{p^x (1-p)^{1-x} \binom{2}{t-x} p^{t-x} (1-p)^{2-(t-x)}}{\binom{3}{t} p^t (1-p)^{3-t}} = \frac{\binom{2}{t-x}}{\binom{3}{t}} = \begin{cases} \dfrac{t}{3} & \text{if } x = 1 \\[6pt] 1 - \dfrac{t}{3} & \text{if } x = 0, \end{cases}
$$
and by symmetry the same conditional distribution holds for $X_2$ and $X_3$. Thus,
$$
\eta(t) = E\left[ \frac{X_1 + 2X_2 + 3X_3}{6} \,\Big|\, X_1 + X_2 + X_3 = t \right] = \frac{1}{6} \cdot \frac{t}{3} + \frac{2}{6} \cdot \frac{t}{3} + \frac{3}{6} \cdot \frac{t}{3} = \frac{t}{3}.
$$
For squared error loss we have
$$
R(p, \eta) = \mathrm{Bias}_p(\eta)^2 + \mathrm{Var}_p(\eta) = 0 + \frac{p(1-p)}{3}
$$
and
$$
R(p, \delta) = \mathrm{Bias}_p(\delta)^2 + \mathrm{Var}_p(\delta) = 0 + \frac{14\, p(1-p)}{36},
$$
so that $R(p, \eta) < R(p, \delta)$.
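The improvement can also be seen by simulation (my own sketch; $p = 0.4$ is an arbitrary illustrative value):

    import numpy as np

    rng = np.random.default_rng(0)
    p, reps = 0.4, 200_000
    x = (rng.uniform(size=(reps, 3)) < p).astype(float)    # columns are X1, X2, X3

    delta = (x[:, 0] + 2 * x[:, 1] + 3 * x[:, 2]) / 6       # original unbiased estimator
    eta   = x.sum(axis=1) / 3                               # Rao-Blackwellized estimator

    # Monte Carlo risks under squared error loss:
    # roughly 14*p*(1-p)/36 = 0.0933... versus p*(1-p)/3 = 0.08.
    print(np.mean((delta - p)**2), np.mean((eta - p)**2))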
Consequence of the Rao-Blackwell theorem: For convex loss functions, we can dispense with randomized estimators.

A randomized estimator randomly chooses the estimate $Y(x)$, where the distribution of $Y(x)$ is known. A randomized estimator can be obtained as an estimator $\delta^*(X, U)$ where $X$ and $U$ are independent and $U$ is uniformly distributed on $(0, 1)$; this is achieved by observing $X = x$ and then using $U$ to construct the distribution of $Y(x)$. For the data $(X, U)$, $X$ is sufficient. Thus, by the Rao-Blackwell Theorem, the nonrandomized estimator $E[\delta^*(X, U) \mid X]$ dominates $\delta^*(X, U)$ for strictly convex loss functions.