Download Notes 7 - Wharton Statistics

Statistics 550 Notes 7

Reading: Section 1.5, 1.6.1

I. The Rao-Blackwell Theorem

Convex functions: A real-valued function $\phi$ defined on an open interval $I = (a, b)$ is convex if for any $a < x < y < b$ and $0 < \gamma < 1$,

$$\phi[\gamma x + (1 - \gamma) y] \le \gamma \phi(x) + (1 - \gamma) \phi(y).$$

$\phi$ is strictly convex if the inequality is strict. If $\phi''$ exists, then $\phi$ is convex if and only if $\phi'' \ge 0$ on $I = (a, b)$. A convex function lies above all its tangent lines.

Convexity of loss functions: For point estimation:
- squared error loss $l(\theta, a) = (\theta - a)^2$ is strictly convex in $a$;
- absolute error loss $l(\theta, a) = |\theta - a|$ is convex but not strictly convex in $a$;
- zero-one loss, $l(\theta, a) = 0$ if $|q(\theta) - a| \le k$ and $l(\theta, a) = 1$ if $|q(\theta) - a| > k$, is nonconvex in $a$.

Jensen's Inequality (Appendix B.9): Let $X$ be a random variable.
(i) If $\phi$ is convex in an open interval $I$, $P(X \in I) = 1$ and $E(X) < \infty$, then $\phi(E[X]) \le E[\phi(X)]$.
(ii) If $\phi$ is strictly convex, then $\phi(E[X]) < E[\phi(X)]$ unless $X$ equals a constant with probability one.

Proof of (i): Let $L(x)$ be a tangent line to $\phi$ at the point $x = E[X]$, so that $L(E[X]) = \phi(E[X])$. Write $L(x) = a + bx$. By the convexity of $\phi$, $\phi(x) \ge a + bx$ for all $x$ in $I$. Since expectations preserve inequalities,

$$E[\phi(X)] \ge E[a + bX] = a + bE[X] = L(E[X]) = \phi(E[X]),$$

as was to be shown.

Rao-Blackwell Theorem: Let $T(X)$ be a sufficient statistic. Let $\delta$ be a point estimate of $q(\theta)$ and assume that the loss function $l(\theta, d)$ is strictly convex in $d$. Also assume that $R(\theta, \delta) < \infty$. Let $\eta(t) = E[\delta(X) \mid T(X) = t]$. Then $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.

Proof: Fix $\theta$. Apply Jensen's inequality with $\phi(\delta(x)) = l(\theta, \delta(x))$ and let $X$ have the conditional distribution of $X \mid T(X) = t$ for a particular choice of $t$. By Jensen's inequality,

$$l(\theta, \eta(t)) < E[\, l(\theta, \delta(X)) \mid T(X) = t \,] \tag{0.1}$$

unless $\delta(X) = \eta(T(X))$ with probability one, in which case (0.1) is an equality.
Taking the expectation on both sides of this inequality yields $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one, in which case $R(\theta, \eta) = R(\theta, \delta)$.

Comments:
(1) Sufficiency ensures that $\eta(t) = E[\delta(X) \mid T(X) = t]$ is an estimator (i.e., it depends only on $t$ and not on $\theta$).
(2) If the loss is convex rather than strictly convex, we get $\le$ rather than $<$ in the conclusion of the theorem.
(3) The theorem is not true without convexity of the loss function.

Consequence of the Rao-Blackwell theorem: For convex loss functions, we can dispense with randomized estimators. A randomized estimator randomly chooses the estimate $Y(x)$, where the distribution of $Y(x)$ is known. A randomized estimator can be obtained as an estimator $\delta^*(X, U)$ where $X$ and $U$ are independent and $U$ is uniformly distributed on $(0, 1)$. This is achieved by observing $X = x$ and then using $U$ to construct the distribution of $Y(x)$. For the data $(X, U)$, $X$ is sufficient. Thus, by the Rao-Blackwell Theorem, the nonrandomized estimator $E[\delta^*(X, U) \mid X]$ dominates $\delta^*(X, U)$ for strictly convex loss functions.

II. Minimal Sufficiency

For any model, there are many sufficient statistics.

Example 1: For $X_1, \ldots, X_n$ iid Bernoulli($\theta$), $T(X) = \sum_{i=1}^n X_i$ and $T'(X) = (X_1, \ldots, X_n)$ are both sufficient, but $T$ provides a greater reduction of the data.

Definition: A statistic $T(X)$ is minimally sufficient if it is sufficient and it provides a reduction of the data that is at least as great as that of any other sufficient statistic $S(X)$, in the sense that we can find a transformation $r$ such that $T(X) = r(S(X))$.

Comments:
(1) To say that we can find a transformation $r$ such that $T(X) = r(S(X))$ means that if $S(x) = S(y)$, then $T(x)$ must equal $T(y)$.
(2) Data reduction in terms of a particular statistic can be thought of as a partition of the sample space. A statistic $T(X)$ partitions the sample space into sets $A_t = \{x : T(x) = t\}$.
If a statistic $T(X)$ is minimally sufficient, then for another sufficient statistic $S(X)$ which partitions the sample space into sets $B_s = \{x : S(x) = s\}$, every set $B_s$ must be a subset of some $A_t$. Thus, the partition associated with a minimal sufficient statistic is the coarsest possible partition for a sufficient statistic, and in this sense the minimal sufficient statistic achieves the greatest possible data reduction for a sufficient statistic.

A useful theorem for finding a minimal sufficient statistic is the following:

Theorem 2 (Lehmann and Scheffé, 1950): Suppose $S(X)$ is a sufficient statistic for $\theta$. Also suppose that if, for two sample points $x$ and $y$, the ratio $f(x \mid \theta) / f(y \mid \theta)$ is constant as a function of $\theta$, then $S(x) = S(y)$. Then $S(X)$ is a minimal sufficient statistic for $\theta$.

Proof: Let $T(X)$ be any statistic that is sufficient for $\theta$. By the factorization theorem, there exist functions $g$ and $h$ such that $f(x \mid \theta) = g(T(x) \mid \theta) h(x)$. Let $x$ and $y$ be any two sample points with $T(x) = T(y)$. Then

$$\frac{f(x \mid \theta)}{f(y \mid \theta)} = \frac{g(T(x) \mid \theta) h(x)}{g(T(y) \mid \theta) h(y)} = \frac{h(x)}{h(y)}.$$

Since this ratio does not depend on $\theta$, the assumptions of the theorem imply that $S(x) = S(y)$. Thus, $S(X)$ is at least as coarse a partition of the sample space as $T(X)$, and consequently $S(X)$ is minimal sufficient.

Example 1 continued: Consider the ratio

$$\frac{f(x \mid \theta)}{f(y \mid \theta)} = \frac{\theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}}{\theta^{\sum_{i=1}^n y_i} (1 - \theta)^{n - \sum_{i=1}^n y_i}}.$$

This ratio is constant as a function of $\theta$ if and only if $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i$. Since we have shown that $T(X) = \sum_{i=1}^n X_i$ is a sufficient statistic, it follows from the above sentence and Theorem 2 that $T(X) = \sum_{i=1}^n X_i$ is a minimal sufficient statistic.

Note that a minimal sufficient statistic is not unique. Any one-to-one function of a minimal sufficient statistic is also a minimal sufficient statistic.
For example, $T'(X) = \frac{1}{n} \sum_{i=1}^n X_i$ is a minimal sufficient statistic for the i.i.d. Bernoulli case.

Example 2: Suppose $X_1, \ldots, X_n$ are iid uniform on the interval $(\theta, \theta + 1)$, $-\infty < \theta < \infty$. Then the joint pdf of $X$ is

$$f(x \mid \theta) = \begin{cases} 1 & \text{if } \theta < x_i < \theta + 1,\ i = 1, \ldots, n \\ 0 & \text{otherwise} \end{cases} \;=\; \begin{cases} 1 & \text{if } \max_i x_i - 1 < \theta < \min_i x_i \\ 0 & \text{otherwise.} \end{cases}$$

The statistic $T(X) = (\min_i X_i, \max_i X_i)$ is a sufficient statistic by the factorization theorem with $g(T(X), \theta) = I(\max_i X_i - 1 < \theta < \min_i X_i)$ and $h(X) = 1$.

For any two sample points $x$ and $y$, the numerator and denominator of the ratio $f(x \mid \theta) / f(y \mid \theta)$ will be positive for the same values of $\theta$ if and only if $\min_i x_i = \min_i y_i$ and $\max_i x_i = \max_i y_i$; if the minima and maxima are equal, then the ratio is constant and in fact equals 1. Thus, $T(X) = (\min_i X_i, \max_i X_i)$ is a minimal sufficient statistic by Theorem 2.

Example 2 is a case in which the dimension of the minimal sufficient statistic (2) does not match the dimension of the parameter (1). There are models in which the dimension of the minimal sufficient statistic is equal to the sample size, e.g., $X_1, \ldots, X_n$ iid Cauchy($\theta$), $f(x \mid \theta) = \frac{1}{\pi [1 + (x - \theta)^2]}$ (Problem 1.5.15).

III. Ancillary Statistics

A statistic $T(X)$ is ancillary if its distribution does not depend on $\theta$.

Example 4: Suppose our model is $X_1, \ldots, X_n$ iid $N(\mu, 1)$. Then $\bar{X}$ is a sufficient statistic and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ is an ancillary statistic.

Although ancillary statistics contain no information about $\theta$ when the model is true, ancillary statistics are useful for checking the validity of the model.

IV. Exponential Families

The binomial and normal models exhibited the interesting feature that there is a natural minimal sufficient statistic whose dimension is independent of the sample size. The exponential family models are a general class of models that exhibit this feature.
The class of exponential family models includes many of the most widely used statistical models (e.g., binomial, normal, gamma, Poisson, multinomial). Exponential family models have an underlying structure with elegant properties that we will discuss.

One-parameter exponential families: The family of distributions of a model $\{P_\theta : \theta \in \Theta\}$ is said to be a one-parameter exponential family if there exist real-valued functions $\eta(\theta), B(\theta), T(x), h(x)$ such that the pdf or pmf may be written as

$$p(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - B(\theta)\}. \tag{0.2}$$

Comments:
(1) For an exponential family, the support of the distribution (i.e., $\{x : p(x \mid \theta) > 0\}$) cannot depend on $\theta$. Thus, $X_1, \ldots, X_n$ iid Uniform$(0, \theta)$ is not an exponential family model.
(2) For an exponential family model, $T(x)$ is a sufficient statistic by the factorization theorem.
(3) $\eta, B, T$ are not unique. For example, $\eta$ can be multiplied by a constant $c$ and $T$ can be divided by the same constant $c$.

Examples of one-parameter exponential family models:

(1) Poisson family. Let $X \sim$ Poisson($\theta$), $0 < \theta < \infty$. Then for $x \in \{0, 1, 2, \ldots\}$,

$$p(x \mid \theta) = \frac{\theta^x e^{-\theta}}{x!} = \frac{1}{x!} \exp\{x \log \theta - \theta\}.$$

This is a one-parameter exponential family with $\eta(\theta) = \log \theta$, $B(\theta) = \theta$, $T(x) = x$, $h(x) = 1/x!$.

(2) Binomial family. Let $X \sim$ Binomial($n, \theta$), $0 < \theta < 1$. Then for $x \in \{0, 1, 2, \ldots, n\}$,

$$p(x \mid \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x} = \binom{n}{x} \exp\left[ x \log \frac{\theta}{1 - \theta} + n \log(1 - \theta) \right].$$

This is a one-parameter exponential family with $\eta(\theta) = \log \frac{\theta}{1 - \theta}$, $B(\theta) = -n \log(1 - \theta)$, $T(x) = x$, $h(x) = \binom{n}{x}$.

The families of distributions obtained by taking iid samples from one-parameter exponential families are themselves one-parameter exponential families.
Specifically, suppose $X \sim P_\theta$ and $\{P_\theta : \theta \in \Theta\}$ is an exponential family. Then for $X_1, \ldots, X_m$ iid with common distribution $P_\theta$,

$$p(x_1, \ldots, x_m \mid \theta) = \left[ \prod_{i=1}^m h(x_i) \right] \exp\left\{ \eta(\theta) \sum_{i=1}^m T(x_i) - m B(\theta) \right\}.$$

A sufficient statistic is $\sum_{i=1}^m T(x_i)$, and it is one-dimensional whatever the sample size $m$ is.

For $X_1, \ldots, X_m$ iid Poisson($\theta$), the sufficient statistic $\sum_{i=1}^m T(x_i) = \sum_{i=1}^m x_i$ has a Poisson($m\theta$) distribution and hence has an exponential family model. It is generally true that the sufficient statistic of an exponential family model follows an exponential family.

Theorem 1.6.1: Let $\{P_\theta : \theta \in \Theta\}$ be a one-parameter exponential family of discrete distributions:

$$p(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - B(\theta)\}.$$

Then the family of the distributions of the statistic $T(X)$ is a one-parameter exponential family of discrete distributions whose pmf may be written

$$h^*(t) \exp\{\eta(\theta) t - B(\theta)\}$$

for suitable $h^*$.

Proof: By definition,

$$P_\theta[T(X) = t] = \sum_{\{x : T(x) = t\}} p(x \mid \theta) = \sum_{\{x : T(x) = t\}} h(x) \exp[\eta(\theta) T(x) - B(\theta)] = \exp[\eta(\theta) t - B(\theta)] \left\{ \sum_{\{x : T(x) = t\}} h(x) \right\}.$$

If we let $h^*(t) = \sum_{\{x : T(x) = t\}} h(x)$, the result follows.

A similar theorem holds for continuous exponential families.

Canonical exponential families: A useful reparameterization of the exponential family model is to index $\eta = \eta(\theta)$ as the parameter to yield

$$p(x \mid \eta) = h(x) \exp[\eta T(x) - A(\eta)], \tag{0.3}$$

where $A(\eta) = \log \int h(x) \exp[\eta T(x)] \, dx$ in the continuous case, and the integral is replaced by a sum in the discrete case. If $\theta \in \Theta$, then $A(\eta)$ must be finite. Let $\mathcal{E} = \{\eta : |A(\eta)| < \infty\}$. The model given by (0.3) with $\eta$ ranging over $\mathcal{E}$ is called the canonical one-parameter exponential family generated by $T$ and $h$. $\mathcal{E}$ is called the natural parameter space and $T$ is called the natural sufficient statistic.
The canonical one-parameter exponential family contains the one-parameter exponential family (0.2) with parameter space $\Theta$, and $\mathcal{E}$ can be thought of as the “biggest” possible parameter space for the exponential family.

Example 1: Let $X \sim$ Poisson($\theta$), $0 < \theta < \infty$. Then for $x \in \{0, 1, 2, \ldots\}$,

$$p(x \mid \theta) = \frac{\theta^x e^{-\theta}}{x!} = \frac{1}{x!} \exp\{x \log \theta - \theta\}. \tag{0.4}$$

Letting $\eta = \log \theta$, we have

$$p(x \mid \eta) = \frac{1}{x!} \exp\{\eta x - \exp(\eta)\}, \quad x \in \{0, 1, 2, \ldots\}.$$

We have

$$A(\eta) = \log \sum_{x=0}^{\infty} \frac{1}{x!} e^{\eta x} = \log \sum_{x=0}^{\infty} \frac{(e^{\eta})^x}{x!} = \log \exp(e^{\eta}) = e^{\eta}.$$

Thus, $\mathcal{E} = \{\eta : |A(\eta)| < \infty\} = (-\infty, \infty)$.

Note that if the parameter space were restricted to $1 < \theta < \infty$, then (0.4) would still be a one-parameter exponential family, but it would be a strict subset of the canonical one-parameter exponential family generated by $T$ and $h$ with natural parameter space $\mathcal{E} = \{\eta : |A(\eta)| < \infty\} = (-\infty, \infty)$.

A useful result about exponential families is the following computational shortcut for moments of the natural sufficient statistic:

Theorem 1.6.2: If $X$ is distributed according to (0.3) and $\eta$ is an interior point of $\mathcal{E}$, then the moment-generating function of $T(X)$ exists and is given by

$$M(s) = E[\exp(s T(X))] = \exp[A(s + \eta) - A(\eta)]$$

for $s$ in some neighborhood of 0. Moreover, $E_\eta[T(X)] = A'(\eta)$ and $\mathrm{Var}_\eta[T(X)] = A''(\eta)$.

Proof: This is the proof for the continuous case.

$$M(s) = E(\exp(s T(X))) = \int h(x) \exp[(s + \eta) T(x) - A(\eta)] \, dx = \{\exp[A(s + \eta) - A(\eta)]\} \int h(x) \exp[(s + \eta) T(x) - A(s + \eta)] \, dx = \exp[A(s + \eta) - A(\eta)],$$

because the last factor, being the integral of a density, is one. The rest of the theorem follows from the moment generating property of $M(s)$ (see Section A.12 of Bickel and Doksum).

Comment on proof: In order for the moment generating function (MGF) properties to hold, the MGF must exist (be less than infinity) for $s$ in some neighborhood of 0. The proof that the MGF exists for $s$ in some neighborhood of 0 relies on the fact that $\mathcal{E}$ is an interval or $(-\infty, \infty)$, which we shall establish in Section 1.6.4.
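As an informal numerical check of Theorem 1.6.2 (not part of the original notes), the sketch below uses the canonical Poisson family of Example 1, where $A(\eta) = e^{\eta}$. It compares the mean and variance of $T(X) = X$, computed directly from a truncated pmf, with finite-difference approximations of $A'(\eta)$ and $A''(\eta)$. The function names, the truncation point, and the step size are illustrative assumptions.

```python
import math


def log_partition(eta):
    """A(eta) = exp(eta) for the Poisson family in canonical form."""
    return math.exp(eta)


def poisson_moments(eta, terms=200):
    """Mean and variance of T(X) = X from the pmf
    p(x | eta) = exp(eta * x - exp(eta)) / x!, truncated at `terms`
    (an assumed cutoff; the neglected tail is negligible for moderate eta)."""
    mean = 0.0
    second_moment = 0.0
    for x in range(terms):
        # lgamma(x + 1) = log(x!), which avoids overflow in x!
        p = math.exp(eta * x - math.exp(eta) - math.lgamma(x + 1))
        mean += x * p
        second_moment += x * x * p
    return mean, second_moment - mean ** 2


def numeric_derivatives(f, eta, step=1e-4):
    """Central finite-difference approximations to f'(eta) and f''(eta)."""
    first = (f(eta + step) - f(eta - step)) / (2 * step)
    second = (f(eta + step) - 2 * f(eta) + f(eta - step)) / step ** 2
    return first, second


eta = math.log(3.0)  # corresponds to theta = 3
mean, var = poisson_moments(eta)
d1, d2 = numeric_derivatives(log_partition, eta)
print(mean, d1)  # both should be close to theta = 3
print(var, d2)   # both should be close to theta = 3
```

The finite differences are only approximations, so agreement is expected up to a small numerical error rather than exactly.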
Example 1 continued: Let $X \sim$ Poisson($\theta$), $0 < \theta < \infty$. The natural sufficient statistic is $T(X) = X$, with $\eta = \log \theta$ and $A(\eta) = e^{\eta}$. Thus, using Theorem 1.6.2,

$$E_\theta[X] = \left. \frac{d}{d\eta} e^{\eta} \right|_{\eta = \log \theta} = \left. e^{\eta} \right|_{\eta = \log \theta} = \theta,$$

$$\mathrm{Var}_\theta[X] = \left. \frac{d^2}{d\eta^2} e^{\eta} \right|_{\eta = \log \theta} = \left. e^{\eta} \right|_{\eta = \log \theta} = \theta.$$

Example 2: Suppose $X_1, \ldots, X_n$ is a sample from a population with pdf

$$p(x \mid \theta) = \frac{x}{\theta^2} \exp\left( -\frac{x^2}{2\theta^2} \right), \quad x > 0,\ \theta > 0.$$

This is known as the Rayleigh distribution. It is used to model the density of time until failure for certain types of equipment. The data come from an exponential family:

$$p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n \frac{x_i}{\theta^2} \exp\left( -\frac{x_i^2}{2\theta^2} \right) = \left[ \prod_{i=1}^n x_i \right] \exp\left( -\frac{1}{2\theta^2} \sum_{i=1}^n x_i^2 - n \log \theta^2 \right).$$

Here $\eta = -\frac{1}{2\theta^2}$, $T(x) = \sum_{i=1}^n x_i^2$, $B(\theta) = n \log \theta^2$, and $A(\eta) = -n \log(-2\eta)$. Therefore, the natural sufficient statistic $\sum_{i=1}^n X_i^2$ has mean $A'(\eta) = -n/\eta = 2n\theta^2$ and variance $A''(\eta) = n/\eta^2 = 4n\theta^4$.

Proving that a one-parameter family is not an exponential family

A one-parameter exponential family is a family $p(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - B(\theta)\}$, $\theta \in \Theta$.

Consider a one-parameter family $\{p(x \mid \theta), \theta \in \Theta\}$. If the support of $p(x \mid \theta)$ is different for different $\theta$, then the family is not an exponential family, because in an exponential family $p(x \mid \theta) > 0$ if and only if $h(x) > 0$.

Suppose that the support of $p(x \mid \theta)$ is the same for all $\theta \in \Theta$. We can write the pdf or pmf of the family as

$$p(x \mid \theta) = h(x) \exp\{g(x, \theta)\}.$$

In order for this to be an exponential family, we need to be able to write

$$g(x, \theta) = \eta(\theta) T(x) - B(\theta) \tag{0.5}$$

for some functions $\eta$, $B$ and $T$.

Suppose (0.5) holds. Then for any two sample points $x_1$ and $x_2$,

$$g(x_1, \theta) - g(x_2, \theta) = \eta(\theta) [T(x_1) - T(x_2)],$$

and for any four sample points $x_1, x_2, x_3, x_4$ (with nonzero denominators),

$$\frac{g(x_1, \theta) - g(x_2, \theta)}{g(x_3, \theta) - g(x_4, \theta)} = \frac{T(x_1) - T(x_2)}{T(x_3) - T(x_4)}$$

is constant as a function of $\theta$.
Thus, a necessary condition for a one-parameter exponential family is that for any four sample points $x_1, x_2, x_3, x_4$,

$$\frac{g(x_1, \theta) - g(x_2, \theta)}{g(x_3, \theta) - g(x_4, \theta)}$$

must be constant as a function of $\theta$.

Proof that the Cauchy family is not an exponential family: The Cauchy family is

$$p(x \mid \theta) = \frac{1}{\pi (1 + (x - \theta)^2)} = \exp\left\{ \log \frac{1}{\pi (1 + (x - \theta)^2)} \right\} = \exp\{ -\log \pi - \log[1 + (x - \theta)^2] \}, \quad -\infty < \theta < \infty,\ -\infty < x < \infty.$$

Thus, for the Cauchy family, $g(x, \theta) = -\log \pi - \log[1 + (x - \theta)^2]$. For any four sample points $x_1, x_2, x_3, x_4$,

$$\frac{g(x_1, \theta) - g(x_2, \theta)}{g(x_3, \theta) - g(x_4, \theta)} = \frac{-\log[1 + (x_1 - \theta)^2] + \log[1 + (x_2 - \theta)^2]}{-\log[1 + (x_3 - \theta)^2] + \log[1 + (x_4 - \theta)^2]}.$$

This is not constant as a function of $\theta$, so the Cauchy family is not an exponential family.
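The four-point criterion above can be illustrated numerically. The following sketch, which is an illustration rather than part of the original notes, evaluates the difference ratio for the Cauchy family and, for contrast, for the Poisson family with $g(x, \theta) = x \log \theta - \theta$. The sample points and $\theta$ values are arbitrary assumptions, chosen so that no denominator is zero.

```python
import math


def cauchy_g(x, theta):
    """g(x, theta) = -log(pi) - log(1 + (x - theta)^2) for the Cauchy family."""
    return -math.log(math.pi) - math.log(1.0 + (x - theta) ** 2)


def poisson_g(x, theta):
    """g(x, theta) = x * log(theta) - theta for the Poisson family."""
    return x * math.log(theta) - theta


def difference_ratio(g, points, theta):
    """[g(x1, theta) - g(x2, theta)] / [g(x3, theta) - g(x4, theta)]."""
    x1, x2, x3, x4 = points
    return (g(x1, theta) - g(x2, theta)) / (g(x3, theta) - g(x4, theta))


# Arbitrary illustrative choices (assumptions, not from the notes).
points = (0.0, 1.0, 2.0, 4.0)
thetas = (0.5, 1.5, 3.5)

cauchy_ratios = [difference_ratio(cauchy_g, points, t) for t in thetas]
poisson_ratios = [difference_ratio(poisson_g, points, t) for t in thetas]

print("Cauchy:", cauchy_ratios)    # varies with theta
print("Poisson:", poisson_ratios)  # the same value for every theta
```

A ratio that changes with $\theta$ for even one choice of four points is enough to rule out a one-parameter exponential family. A constant ratio at a few points, as for the Poisson case here, is only consistent with the necessary condition and does not prove membership by itself.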
