Statistics 550 Notes 16
Schedule:
1. I will e-mail Homework 7 to you tonight.
2. I will return your midterms on Tuesday.
3. Homework 6 is due on Friday.
Reading: Section 3.2
I. Computation of Bayes Procedures
Review
The Bayes risk of a decision procedure $\delta$ for a prior distribution $\pi(\theta)$, denoted by $r(\delta)$, is the expected value of the loss function over the joint distribution of $(X, \theta)$ (given the prior $\pi(\theta)$), which is the expected value of the risk function over the prior distribution of $\theta$:
$$r(\delta) = E_\pi\left[ E[l(\theta, \delta(X)) \mid \theta] \right] = E_\pi[R(\theta, \delta)].$$
The decision procedure which minimizes the Bayes risk for a prior $\pi(\theta)$ is called the Bayes rule (Bayes procedure) for the prior $\pi(\theta)$. For a person with prior distribution $\pi(\theta)$, the Bayes rule is the best procedure from this person's point of view.
The posterior risk of an action $a$ is the expected loss from taking action $a$ under the posterior distribution $p(\theta \mid x)$:
$$r(a \mid x) = E_{p(\theta \mid x)}[l(\theta, a)].$$
Proposition 3.2.1: A procedure $\delta^*(x)$ which takes an action that minimizes the posterior risk for each $x$ in the sample space is a Bayes procedure.
Note: To follow the Bayes procedure for the data $x$ that we actually obtain, we just need to find the action that minimizes the posterior risk $r(a \mid x)$; i.e., we do not need to find the entire Bayes procedure.
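To make Proposition 3.2.1 concrete, here is a small numerical sketch (my illustration, not from the notes) with a two-point parameter space and 0-1 loss; all of the numbers are made up:

import numpy as np

# Hypothetical setup: theta in {0, 1} with prior (0.5, 0.5), and
# X | theta ~ Bernoulli(0.2) if theta = 0, Bernoulli(0.7) if theta = 1.
prior = np.array([0.5, 0.5])
lik = np.array([0.2, 0.7])                # p(X = 1 | theta)
post = prior * lik / np.sum(prior * lik)  # posterior p(theta | X = 1)

# loss[theta, a]: 0-1 loss for taking action a when the truth is theta
loss = np.array([[0.0, 1.0],
                 [1.0, 0.0]])

post_risk = post @ loss                   # posterior risk r(a | x) of each action
print(post_risk, np.argmin(post_risk))    # Bayes action given X = 1

Here the posterior puts probability about 0.78 on $\theta = 1$, so the posterior risk is minimized by the action $a = 1$.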
Bayes procedures for common loss functions:
Using Proposition 3.2.1, we derive the Bayes decision procedures for point estimation of $g(\theta)$ under some common loss functions.
(1) Squared Error Loss ($l(\theta, a) = (g(\theta) - a)^2$): The action (point estimate) taken by the Bayes rule is the action that minimizes the posterior expected squared loss:
$$\delta^*(x) = \arg\min_a E_{p(\theta \mid x)}\left[(a - g(\theta))^2\right].$$
By Lemma 1.4.1, $\delta^*(x)$ is the mean of the posterior distribution $p(\theta \mid x)$.
(2) Absolute Error Loss ($l(\theta, a) = |g(\theta) - a|$): The action (point estimate) taken by the Bayes rule is the action that minimizes the posterior expected absolute loss:
$$\delta^*(x) = \arg\min_a E_{p(\theta \mid x)}\left[|a - g(\theta)|\right]. \quad (1.1)$$
The minimizer of (1.1) is any median of the posterior distribution $p(\theta \mid x)$, so a Bayes rule is to use any median of the posterior distribution.
Proof that the minimizer of $E[|X - a|]$ is a median of X:
Let $X$ be a random variable and let the interval $m_0 \le m \le m_1$ be the medians of $X$, i.e., $P(X \le m) \ge \frac{1}{2}$ and $P(X \ge m) \ge \frac{1}{2}$.
For $c > m_1$ and a continuous random variable,
$$E[|X - c|] - E[|X - m|] = \int_{-\infty}^{m} (c - m)\, dP(x) + \int_{m}^{c} \left((c - x) - (x - m)\right) dP(x) + \int_{c}^{\infty} (m - c)\, dP(x)$$
$$= (c - m)\left[P(X \le m) - P(X > m)\right] + 2\int_{m < x \le c} (c - x)\, dP(x) \ge 0,$$
since $P(X \le m) \ge \frac{1}{2} \ge P(X > m)$ and $c - x \ge 0$ on the region of integration.
A similar result holds for $c < m_0$, and a similar argument holds for discrete random variables.
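A quick numerical check of this fact (again my illustration, not part of the notes): for a skewed distribution, minimizing $E[|X - a|]$ over a grid of candidate actions recovers the median rather than the mean.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200_000)  # skewed, so mean != median

# Approximate E[|X - a|] over a grid of candidate actions a
grid = np.linspace(0.5, 3.0, 501)
risk = np.array([np.mean(np.abs(x - a)) for a in grid])

print(grid[np.argmin(risk)])  # close to the median, 2*log(2) ~ 1.386
print(np.mean(x))             # the mean, ~ 2.0, is not the minimizer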
(3) Zero-One Loss:
$$l(\theta, a) = \begin{cases} 1 & \text{if } |a - g(\theta)| > c \\ 0 & \text{if } |a - g(\theta)| \le c \end{cases}$$
The Bayes rule is the midpoint of the interval of length $2c$ that maximizes the posterior probability that $g(\theta)$ belongs to the interval.
(4) Weighted Squared Error Loss ($l(\theta, a) = w(\theta)(g(\theta) - a)^2$): The Bayes rule is
$$\delta^*(x) = \frac{E_{p(\theta \mid x)}[w(\theta)\, g(\theta)]}{E_{p(\theta \mid x)}[w(\theta)]}.$$
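Given draws from the posterior, this ratio is easy to approximate by Monte Carlo. A minimal sketch (the stand-in posterior and the weight function below are made-up examples, not from the notes):

import numpy as np

rng = np.random.default_rng(0)
theta = rng.beta(3, 9, size=200_000)  # stand-in for draws from p(theta | x)
w = 1.0 / (theta * (1 - theta))       # an illustrative weight function w(theta)
g = theta                             # estimating g(theta) = theta

# Bayes rule under weighted squared error loss: E[w*g] / E[w] under the posterior
delta = np.mean(w * g) / np.mean(w)
print(delta)                          # about 0.20 in this example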
Example 1: Recall from Notes 2 (Chapter 1.2), for $X_1, \ldots, X_n$ iid Bernoulli($p$) and a Beta($r, s$) prior for $p$, the posterior distribution for $p$ is Beta$\left(r + \sum_{i=1}^n x_i,\; s + n - \sum_{i=1}^n x_i\right)$.
Thus, for squared error loss, the Bayes estimate of $p$ is the mean of Beta$\left(r + \sum_{i=1}^n x_i,\; s + n - \sum_{i=1}^n x_i\right)$, which equals
$$\frac{r + \sum_{i=1}^n x_i}{\left(r + \sum_{i=1}^n x_i\right) + \left(s + n - \sum_{i=1}^n x_i\right)} = \frac{r + \sum_{i=1}^n x_i}{r + s + n}.$$
For absolute error loss, the Bayes estimate of $p$ is the median of the Beta$\left(r + \sum_{i=1}^n x_i,\; s + n - \sum_{i=1}^n x_i\right)$ distribution, which does not have a closed form.
For $n = 10$, here are the Bayes estimators and the MLE for the Beta(1,1) = uniform prior:

Sum of X_i   MLE      Bayes, absolute error loss   Bayes, squared error loss
 0           .0000    .0611                        .0833
 1           .1000    .1480                        .1667
 2           .2000    .2358                        .2500
 3           .3000    .3238                        .3333
 4           .4000    .4119                        .4167
 5           .5000    .5000                        .5000
 6           .6000    .5881                        .5833
 7           .7000    .6762                        .6667
 8           .8000    .7642                        .7500
 9           .9000    .8520                        .8333
10          1.0000    .9389                        .9167
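The table can be reproduced numerically; here is a short sketch, assuming scipy is available:

from scipy.stats import beta

n, r, s = 10, 1, 1                 # Beta(1,1) = uniform prior
print("sum   MLE    abs.err  sq.err")
for x in range(n + 1):             # x = observed value of the sum of the X_i
    post = beta(r + x, s + n - x)  # posterior distribution of p
    mle = x / n
    ab = post.median()             # Bayes estimate under absolute error loss
    sq = post.mean()               # Bayes estimate under squared error loss
    print(f"{x:3d}  {mle:.4f}  {ab:.4f}  {sq:.4f}")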
Example 2: Suppose $X_1, \ldots, X_n$ iid $N(\theta, \sigma^2)$, $\sigma^2$ known, and our prior on $\theta$ is $N(\eta, b^2)$.
We showed in Notes 3 that the posterior distribution for $\theta$ is
$$N\left( \frac{\frac{n\bar{X}}{\sigma^2} + \frac{\eta}{b^2}}{\frac{n}{\sigma^2} + \frac{1}{b^2}},\; \frac{1}{\frac{n}{\sigma^2} + \frac{1}{b^2}} \right).$$
The mean and median of the posterior distribution are both $\frac{\frac{n\bar{X}}{\sigma^2} + \frac{\eta}{b^2}}{\frac{n}{\sigma^2} + \frac{1}{b^2}}$, so that the Bayes estimator for both squared error loss and absolute error loss is
$$\delta(X) = \frac{\frac{n\bar{X}}{\sigma^2} + \frac{\eta}{b^2}}{\frac{n}{\sigma^2} + \frac{1}{b^2}}.$$
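As a numerical sketch of this formula (the sample size, data, and prior parameters below are invented for illustration; $\eta$ is the prior mean):

import numpy as np

rng = np.random.default_rng(0)
sigma, n = 1.0, 20                  # known standard deviation, sample size
eta, b = 0.0, 2.0                   # prior mean and standard deviation
x = rng.normal(1.5, sigma, size=n)  # simulated data with true theta = 1.5
xbar = x.mean()

# Posterior mean: a precision-weighted average of xbar and the prior mean eta
post_mean = (n * xbar / sigma**2 + eta / b**2) / (n / sigma**2 + 1 / b**2)
post_var = 1.0 / (n / sigma**2 + 1 / b**2)
print(post_mean, post_var)

The posterior mean shrinks $\bar{X}$ toward the prior mean, with more shrinkage when $b$ or $n$ is small.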
Note on Bayes procedures and sufficiency: Suppose the prior distribution $\pi(\theta)$ has support on $\Theta$ and the family of distributions for the data $\{p(X \mid \theta), \theta \in \Theta\}$ has sufficient statistic $T(X)$. Then to find the Bayes procedure, we can reduce the data to $T(X)$.
This is because the posterior distribution of $\theta \mid X$ is the same as the posterior distribution of $\theta \mid T(X)$, since
$$p(\theta \mid X) \propto p(X \mid \theta)\, \pi(\theta) = p(X \mid T(X), \theta)\, p(T(X) \mid \theta)\, \pi(\theta) \propto p(T(X) \mid \theta)\, \pi(\theta),$$
where the last $\propto$ uses the sufficiency of $T(X)$ (i.e., $p(X \mid T(X), \theta)$ does not depend on $\theta$).
Computation of Bayes procedures for complex problems:
For nonconjugate priors, the posterior mean (which is the Bayes estimator under squared error loss) is typically not available in closed form.
Example: $X_1, \ldots, X_n$ iid $N(\theta, \sigma^2)$, $\sigma^2$ known, and our prior on $\theta$ is a logistic($a, b$) distribution:
$$\pi(\theta) = \frac{1}{b} \frac{\exp\{-(\theta - a)/b\}}{\left[1 + \exp\{-(\theta - a)/b\}\right]^2}$$
The logistic distribution is more heavily tailed than the normal distribution.
Since $\bar{X}$ is sufficient, we can just compute the posterior given $\bar{X}$.
The posterior pdf is
$$p(\theta \mid \bar{X}) = \frac{p(\bar{X} \mid \theta)\, \pi(\theta)}{p(\bar{X})} = \frac{\frac{1}{\sqrt{2\pi\sigma^2/n}} \exp\left\{-\frac{1}{2}\frac{(\bar{X} - \theta)^2}{\sigma^2/n}\right\} \frac{1}{b}\frac{e^{-(\theta - a)/b}}{[1 + e^{-(\theta - a)/b}]^2}}{\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2/n}} \exp\left\{-\frac{1}{2}\frac{(\bar{X} - \theta)^2}{\sigma^2/n}\right\} \frac{1}{b}\frac{e^{-(\theta - a)/b}}{[1 + e^{-(\theta - a)/b}]^2}\, d\theta}.$$
The numerator is not proportional to any commonly used density function, and the denominator cannot be evaluated in closed form.
Monte Carlo methods can be used to sample from the posterior distribution and approximate the Bayes estimator. For discussion of Monte Carlo methods for Bayesian inference, see Bayesian Data Analysis by Gelman, Carlin, Stern and Rubin; these methods are discussed in Statistics 540 (which will become Stat 542 next year), taught by Prof. Shane Jensen.
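For the normal-likelihood, logistic-prior example above, one simple Monte Carlo option is self-normalized importance sampling: sample $\theta$ from the likelihood (which, as a function of $\theta$, is a normal density centered at $\bar{X}$) and weight by the prior. A minimal sketch, with all parameter values assumed for illustration:

import numpy as np
from scipy.stats import logistic

rng = np.random.default_rng(0)
xbar, sigma, n = 1.2, 1.0, 25  # assumed data summary and known variance
a, b = 0.0, 2.0                # assumed logistic prior parameters

# Proposal: as a function of theta, the likelihood of xbar is N(xbar, sigma^2/n)
theta = rng.normal(xbar, sigma / np.sqrt(n), size=100_000)

# Self-normalized importance weights: the prior density at the sampled theta
w = logistic.pdf(theta, loc=a, scale=b)

# Posterior mean = Bayes estimate under squared error loss
post_mean = np.sum(w * theta) / np.sum(w)
print(post_mean)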
II. Improper Priors
Bayes estimators are defined with respect to a proper prior distribution $\pi(\theta)$.
Often a weighting function $\pi(\theta)$ is considered that is not a probability distribution; this is called an improper prior.
Example 3: $X \sim N(\mu, 1)$, $\pi(\mu) \equiv 1$.
We can still consider the "Bayes" risk of a decision procedure:
$$r(\delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta \quad (1.2)$$
An estimator $\delta^*(x)$ is called a generalized Bayes estimator with respect to a weighting function $\pi(\theta)$ (even if $\pi(\theta)$ is not a proper probability distribution) if it minimizes the "Bayes" risk (1.2) over all estimators.
We can write the "Bayes" risk (1.2) as
$$r(\delta) = \int \int l(\theta, \delta(X))\, p(X \mid \theta)\, \pi(\theta)\, d\theta\, dX$$
$$= \int \left[ \int l(\theta, \delta(X))\, \frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta')\, \pi(\theta')\, d\theta'}\, d\theta \right] \left( \int p(X \mid \theta')\, \pi(\theta')\, d\theta' \right) dX$$
$$= \int \left[ \int l(\theta, \delta(X))\, \frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta')\, \pi(\theta')\, d\theta'}\, d\theta \right] q(X)\, dX,$$
where $q(X) = \int p(X \mid \theta')\, \pi(\theta')\, d\theta'$.
A decision procedure $\delta(X)$ which minimizes
$$\int l(\theta, \delta(X))\, \frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta')\, \pi(\theta')\, d\theta'}\, d\theta \quad (1.3)$$
for each $X$ is a generalized Bayes estimator (this is the analogue of Proposition 3.2.1).
Sometimes
$$\frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta')\, \pi(\theta')\, d\theta'} \quad (1.4)$$
is a proper probability distribution even if $\pi(\theta)$ is not a proper probability distribution, and then we can think of (1.4) as the "posterior" density function of $\theta \mid X$ and (1.3) as the "posterior" risk.
Example 3 continued: $X \sim N(\mu, 1)$, $\pi(\mu) \equiv 1$. (1.4) equals
$$\frac{\frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(X - \mu)^2}{2}\right\}}{\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(X - \mu)^2}{2}\right\} d\mu} = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(\mu - X)^2}{2}\right\},$$
where the equality follows because the denominator is just the integral over $\mu$ of a normal density with mean $X$, which equals 1. $\frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(\mu - X)^2}{2}\right\}$ is a normal density for $\mu$ with mean $X$ and variance 1, and consequently (1.3) is just the expected loss given $X$ with respect to a $N(X, 1)$ density for $\mu$. Thus, the generalized Bayes estimator for squared error loss (and absolute error loss) is $X$.
Another useful variant of Bayes estimators is limits of Bayes estimators.
A nonrandomized estimator $\delta(x)$ is a limit of Bayes estimators if there exists a sequence of proper priors $\pi_\nu$ and Bayes estimators $\delta_\nu$ with respect to these prior distributions such that $\delta_\nu(x) \to \delta(x)$ for all $x$.
Example 3 continued: For next week’s homework, I will
ask you to show that X is a limit of Bayes estimators.
From a Bayesian point of view, estimators that are limits of
Bayes estimators are somewhat more desirable than
generalized Bayes estimators. This is because, by
construction, a limit of Bayes estimators must be close to a
proper Bayes estimator. In contrast, a generalized Bayes
estimator may not be close to any proper Bayes estimator
(an example will be given for next week’s homework).
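For a concrete illustration using Example 1 (my illustration, separate from the homework problem): under squared error loss, the Beta$(\epsilon, \epsilon)$ priors give Bayes estimators
$$\delta_\epsilon(x) = \frac{\epsilon + \sum_{i=1}^n x_i}{2\epsilon + n} \longrightarrow \frac{\sum_{i=1}^n x_i}{n} \quad \text{as } \epsilon \to 0,$$
so the MLE is a limit of Bayes estimators in the Bernoulli model.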
Admissibility of Bayes rules: In general, Bayes rules are
admissible.
Theorem: Suppose that $\Theta$ is an interval and $\delta^*$ is a Bayes rule with respect to a prior density function $\pi(\theta)$ such that $\pi(\theta) > 0$ for all $\theta \in \Theta$ and such that $R(\theta, d)$ is a continuous function of $\theta$ for all $d$. Then $\delta^*$ is admissible.
Proof: The proof is by contradiction. Suppose that $\delta^*$ is inadmissible. There is then another estimate, $\delta$, such that $R(\theta, \delta) \le R(\theta, \delta^*)$ for all $\theta$, with strict inequality for some $\theta$, say $\theta_0$. Since $R(\theta, \delta^*) - R(\theta, \delta)$ is a continuous function of $\theta$, there is an $\epsilon > 0$ and an interval $[\theta_0 - h, \theta_0 + h]$ such that
$$R(\theta, \delta^*) - R(\theta, \delta) > \epsilon \quad \text{for } \theta_0 - h \le \theta \le \theta_0 + h.$$
Then,
$$\int_\Theta \left[ R(\theta, \delta^*) - R(\theta, \delta) \right] \pi(\theta)\, d\theta \ge \int_{\theta_0 - h}^{\theta_0 + h} \left[ R(\theta, \delta^*) - R(\theta, \delta) \right] \pi(\theta)\, d\theta \ge \epsilon \int_{\theta_0 - h}^{\theta_0 + h} \pi(\theta)\, d\theta > 0.$$
But this contradicts the fact that $\delta^*$ is a Bayes rule, because a Bayes rule has the property that
$$B(\delta^*) - B(\delta) = \int_\Theta \left[ R(\theta, \delta^*) - R(\theta, \delta) \right] \pi(\theta)\, d\theta \le 0.$$
The proof is complete. ∎
The theorem can be regarded as both a positive and a negative result. It is positive in that it identifies a certain class of estimates as being admissible, in particular, any Bayes estimate. It is negative in that there are apparently very many admissible estimates, one for every prior distribution that satisfies the hypotheses of the theorem, and some of these might make little sense (like $\delta(X) = 3$ for the normal distribution above).
Complete class theorems characterize the class of all admissible estimators. Roughly, for most models, the class of all admissible estimators is the class of all Bayes estimators and limits of Bayes estimators.