Statistics 550 Notes 17
Reading: Section 3.2-3.3
I. Improper priors and Generalized Bayes estimators
Review: Bayes estimators are defined with respect to a proper prior distribution $\pi(\theta)$, where proper means that $\pi(\theta)$ is a probability distribution.

Often a weighting function $\pi(\theta)$ is considered that is not a probability distribution; this is called an improper prior.
Example 1: $X \sim \text{Binomial}(n, p)$. Consider the prior $\pi(p) = p^{-1}(1-p)^{-1}$, $0 < p < 1$, which in some sense corresponds to a Beta(0,0) distribution. However, this is not a proper distribution because
$$\int_0^1 p^{-1}(1-p)^{-1}\, dp = \infty.$$
We can still consider the “Bayes” risk of a decision procedure:
$$r(\delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta \quad (1.1)$$
An estimator $\delta^*(x)$ is called a generalized Bayes estimator with respect to a weighting function $\pi(\theta)$ (even if it is not a proper probability distribution) if it minimizes the “Bayes” risk (1.1) over all estimators.
We can write the “Bayes” risk (1.1) as
$$r(\delta) = \int \int l(\theta, \delta(X))\, p(X \mid \theta)\, \pi(\theta)\, d\theta\, dX = \int \left[ \int l(\theta, \delta(X))\, \frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta')\, \pi(\theta')\, d\theta'}\, d\theta \right] q(X)\, dX, \quad (1.2)$$
where $q(X) = \int p(X \mid \theta)\, \pi(\theta)\, d\theta$.
A decision procedure $\delta(X)$ which minimizes
$$\int l(\theta, \delta(X))\, \frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta')\, \pi(\theta')\, d\theta'}\, d\theta \quad (1.3)$$
for each $X$ is a generalized Bayes estimator (this is the analogue of Proposition 3.2.1).
Sometimes
$$s(\theta \mid X) = \frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta')\, \pi(\theta')\, d\theta'} \quad (1.4)$$
is a proper probability distribution even if $\pi(\theta)$ is not a proper probability distribution, and then we can think of $s(\theta \mid X)$ as the “posterior” density function of $\theta \mid X$ and (1.3) as the “posterior” risk.
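As a concrete illustration of this recipe, here is a minimal numerical sketch (assuming NumPy and SciPy; the particular values of $n$ and $X$ are illustrative) that minimizes the “posterior” expected loss (1.3) for the setup of Example 1 under squared error loss:

```python
import numpy as np
from scipy.integrate import quad

# Setup of Example 1: X ~ Binomial(n, p) with improper prior pi(p) = p^(-1) * (1-p)^(-1).
n, X = 10, 3

# Unnormalized "posterior" p(X | p) * pi(p); the binomial coefficient cancels in (1.3).
unnorm = lambda p: p**(X - 1) * (1 - p)**(n - X - 1)
norm_const, _ = quad(unnorm, 0, 1)   # finite here because 1 <= X <= n - 1

def posterior_loss(a):
    """Compute the "posterior" expected squared error loss (1.3) of action a."""
    val, _ = quad(lambda p: (p - a)**2 * unnorm(p), 0, 1)
    return val / norm_const

# Minimize (1.3) over a grid of candidate actions.
grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmin([posterior_loss(a) for a in grid])]
print(best, X / n)   # both are 0.3: the generalized Bayes estimate equals X/n
```

Under squared error loss the minimizer of (1.3) is the mean of $s(p \mid X)$, so the grid search simply recovers the posterior mean $X/n$.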
Example 1 continued: For $X \sim \text{Binomial}(n, p)$ and the prior $\pi(p) = p^{-1}(1-p)^{-1}$, $0 < p < 1$, consider the generalized Bayes estimator under squared error loss. (1.4) equals
$$s(p \mid X) = \frac{\binom{n}{X} p^X (1-p)^{n-X}\, p^{-1}(1-p)^{-1}}{\int_0^1 \binom{n}{X} p^X (1-p)^{n-X}\, p^{-1}(1-p)^{-1}\, dp} \quad \text{for } 0 < p < 1.$$
For $1 \le X \le n-1$, $s(p \mid X)$ is a Beta$(X, n-X)$ distribution and the generalized Bayes estimator with respect to squared error loss is the expected value of $p$ under the distribution $s(p \mid X)$, which equals $\frac{X}{X + (n-X)} = \frac{X}{n}$. For $X = 0$ and $X = n$, the “posterior” density $s(p \mid X)$ is no longer proper. However, it can be shown that the “posterior” expected loss (1.3) of $X/n$ is finite and that $X/n$ minimizes (1.3). Thus, for all $X$, $X/n$ is the generalized Bayes estimator of $p$ for $\pi(p) = p^{-1}(1-p)^{-1}$.
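The boundary cases can be probed numerically as well. The following sketch (again assuming SciPy; the truncation points are illustrative) truncates the unnormalized integral $\int (p-a)^2\, p^{X-1}(1-p)^{n-X-1}\, dp$ at $\epsilon$ for $X = 0$, suggesting that it diverges as $\epsilon \rightarrow 0$ for $a \ne 0$ but converges for $a = 0 = X/n$:

```python
from scipy.integrate import quad

n, X = 10, 0   # boundary case: s(p | X) is no longer a proper density

def truncated_loss(a, eps):
    """Unnormalized posterior expected squared error loss of action a,
    truncated at eps to expose the behavior of the integral near p = 0."""
    val, _ = quad(lambda p: (p - a)**2 * p**(X - 1) * (1 - p)**(n - X - 1),
                  eps, 1, limit=200)
    return val

for eps in [1e-2, 1e-4, 1e-6]:
    print(eps, round(truncated_loss(0.1, eps), 3), round(truncated_loss(0.0, eps), 3))
# The a = 0.1 column grows like a^2 * log(1/eps); the a = 0 = X/n column stabilizes,
# consistent with X/n minimizing the "posterior" expected loss at the boundary.
```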
Another useful variant of Bayes estimators is the limit of Bayes estimators.

A nonrandomized estimator $\delta(x)$ is a limit of Bayes estimators if there exists a sequence of proper priors $\pi_v$ and Bayes estimators $\delta_v$ with respect to these prior distributions such that $\delta_v(x) \rightarrow \delta(x)$ for all $x$.
Example 1 continued: For $X \sim \text{Binomial}(n, p)$, $\delta(X) = X/n$ is a limit of Bayes estimators. Consider a Beta$(r, s)$ prior (which is proper if $r > 0$, $s > 0$); the Bayes estimator is $\frac{X + r}{n + r + s}$. Consider the sequence of priors Beta$(1,1)$, Beta$(1/2, 1/2)$, Beta$(1/3, 1/3), \ldots$ Since
$$\lim_{\substack{r \rightarrow 0 \\ s \rightarrow 0}} \frac{X + r}{n + r + s} = \frac{X}{n},$$
we have that $\delta(X) = X/n$ is a limit of Bayes estimators.
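A quick sketch of this convergence (plain Python; the values of $n$ and $X$ are illustrative), following the prior sequence Beta$(1/v, 1/v)$ from the example:

```python
n, X = 10, 3
for v in [1, 2, 3, 10, 100, 1000]:
    r = s = 1.0 / v                    # priors Beta(1,1), Beta(1/2,1/2), Beta(1/3,1/3), ...
    bayes = (X + r) / (n + r + s)      # Bayes estimate under the Beta(r, s) prior
    print(v, bayes)                    # approaches X / n = 0.3
```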
From a Bayesian point of view, estimators that are limits of Bayes estimators are somewhat more desirable than generalized Bayes estimators (often estimators are both limits of Bayes estimators and generalized Bayes estimators, as in Example 1). This is because, by construction, a limit of Bayes estimators must be close to a proper Bayes estimator. In contrast, a generalized Bayes estimator may not be close to any proper Bayes estimator.
II. Admissibility of Bayes rules

In general, Bayes rules are admissible.

Theorem: Suppose that $\Theta$ is an interval and $\delta^*$ is a Bayes rule with respect to a prior density function $\pi(\theta)$ such that $\pi(\theta) > 0$ for all $\theta \in \Theta$, and $R(\theta, d)$ is a continuous function of $\theta$ for all $d$. Then $\delta^*$ is admissible.
Proof: The proof is by contradiction. Suppose that $\delta^*$ is inadmissible. There is then another estimator, $\delta$, such that $R(\theta, \delta^*) \ge R(\theta, \delta)$ for all $\theta$, with strict inequality for some $\theta$, say $\theta_0$. Since $R(\theta, \delta^*) - R(\theta, \delta)$ is a continuous function of $\theta$, there is an $\epsilon > 0$ and an $h > 0$ such that
$$R(\theta, \delta^*) - R(\theta, \delta) \ge \epsilon \quad \text{for } \theta_0 - h \le \theta \le \theta_0 + h.$$
Then,
$$\int_\Theta \left[ R(\theta, \delta^*) - R(\theta, \delta) \right] \pi(\theta)\, d\theta \ge \int_{\theta_0 - h}^{\theta_0 + h} \left[ R(\theta, \delta^*) - R(\theta, \delta) \right] \pi(\theta)\, d\theta \ge \epsilon \int_{\theta_0 - h}^{\theta_0 + h} \pi(\theta)\, d\theta > 0.$$
But this contradicts the fact that $\delta^*$ is a Bayes rule, because a Bayes rule has the property that
$$B(\delta^*) - B(\delta) = \int_\Theta \left[ R(\theta, \delta^*) - R(\theta, \delta) \right] \pi(\theta)\, d\theta \le 0.$$
The proof is complete. $\square$
The theorem can be regarded as both a positive and a negative result. It is positive in that it identifies a certain class of estimates as being admissible, in particular, any Bayes estimate satisfying the hypotheses of the theorem. It is negative in that there are apparently very many admissible estimates, one for every prior distribution that satisfies the hypotheses of the theorem, and some of these might make little sense (like $\delta(X) = 3$ for the normal distribution example considered previously).
Complete class theorems characterize the class of all admissible estimators. Roughly, for most models the class of all admissible estimators is the class of all Bayes estimators and limits of Bayes estimators.
III. Minimax Procedures (Section 3.3)

The minimax criterion minimizes the worst possible risk. That is, we prefer $\delta$ to $\delta'$ if and only if
$$\sup_\theta R(\theta, \delta) < \sup_\theta R(\theta, \delta').$$
A procedure $\delta^*$ is minimax (over a class of considered decision procedures) if it satisfies
$$\sup_\theta R(\theta, \delta^*) = \inf_\delta \sup_\theta R(\theta, \delta).$$
Let $\delta_\pi$ denote the Bayes estimator with respect to the prior $\pi(\theta)$ and let
$$r_\pi = E_\pi\left[ E[l(\theta, \delta_\pi(X)) \mid \theta] \right] = E_\pi[R(\theta, \delta_\pi)]$$
denote the Bayes risk of the Bayes estimator for the prior $\pi(\theta)$.

A prior distribution $\pi$ is least favorable if $r_\pi \ge r_{\pi'}$ for all prior distributions $\pi'$. This is the prior distribution which causes the statistician the greatest average loss, assuming the statistician uses the Bayes estimator.
Theorem 3.3.2 (I’ve expanded on the statement of it): Suppose that $\pi$ is a prior distribution on $\Theta$ and $\delta_\pi$ is a Bayes estimator with respect to $\pi$ such that
$$r(\delta_\pi) = \int R(\theta, \delta_\pi)\, d\pi(\theta) = \sup_\theta R(\theta, \delta_\pi). \quad (1.5)$$
Then:
(i) $\delta_\pi$ is minimax.
(ii) If $\delta_\pi$ is the unique Bayes solution with respect to $\pi$, it is the unique minimax procedure.
(iii) $\pi$ is a least favorable prior.
Proof:
(i) Let $\delta$ be any other procedure. Then,
$$\sup_\theta R(\theta, \delta) \ge \int R(\theta, \delta)\, d\pi(\theta) \ge \int R(\theta, \delta_\pi)\, d\pi(\theta) = \sup_\theta R(\theta, \delta_\pi).$$
(ii) This follows by replacing $\ge$ by $>$ in the second inequality of the proof of (i).
(iii) Let $\pi'$ be some other prior distribution for $\theta$. Then,
$$r_{\pi'} = \int R(\theta, \delta_{\pi'})\, d\pi'(\theta) \le \int R(\theta, \delta_\pi)\, d\pi'(\theta) \le \sup_\theta R(\theta, \delta_\pi) = r_\pi.$$
Corollary: If a Bayes procedure $\delta_\pi$ has constant risk, then it is minimax.
Proof: If $\delta_\pi$ has constant risk, then (1.5) is clearly satisfied.
Example 1 (Example 3.3.1, Problem 3.3.4): Suppose $X_1, \ldots, X_n$ are iid Bernoulli$(\theta)$ and we want to estimate $\theta$. Consider the squared error loss function $l(\theta, a) = (\theta - a)^2$. For squared error loss and a Beta$(r, s)$ prior, we showed in Notes 16 that the Bayes estimator is
$$\hat{\theta}_{r,s} = \frac{r + \sum_{i=1}^n x_i}{r + s + n}.$$
We now seek to choose $r$ and $s$ so that $\hat{\theta}_{r,s}$ has constant risk. The risk of $\hat{\theta}_{r,s}$ is
$$
\begin{aligned}
R(\theta, \hat{\theta}_{r,s}) &= E_\theta\left[\left(\frac{r + \sum_{i=1}^n x_i}{r + s + n} - \theta\right)^2\right] \\
&= \mathrm{Var}_\theta\left(\frac{r + \sum_{i=1}^n x_i}{r + s + n}\right) + \left(E_\theta\left[\frac{r + \sum_{i=1}^n x_i}{r + s + n}\right] - \theta\right)^2 \\
&= \frac{n\theta(1-\theta)}{(r + s + n)^2} + \left(\frac{r + n\theta}{r + s + n} - \theta\right)^2 \\
&= \frac{n\theta(1-\theta) + \left(r - \theta(r+s)\right)^2}{(r + s + n)^2}.
\end{aligned}
$$
The coefficient on $\theta^2$ in the numerator is $(r+s)^2 - n$ and the coefficient on $\theta$ in the numerator is $n - 2r(r+s)$. We choose $r$ and $s$ so that both these coefficients are zero:
$$(r+s)^2 - n = 0, \qquad n - 2r(r+s) = 0.$$
Solving these equations gives $r = s = \frac{\sqrt{n}}{2}$.
The unique minimax estimator is
$$\hat{\theta}_{\text{minimax}} = \hat{\theta}_{\frac{\sqrt{n}}{2}, \frac{\sqrt{n}}{2}} = \frac{\frac{\sqrt{n}}{2} + \sum_{i=1}^n x_i}{\sqrt{n} + n},$$
which has constant risk $\frac{1}{4(1 + \sqrt{n})^2}$, compared to risk $\frac{\theta(1-\theta)}{n}$ for the MLE $\bar{X}$.

For small $n$, the minimax estimator is better than the MLE for a large range of $\theta$. For large $n$, the minimax estimator is better than the MLE for only a small range of $\theta$ near 0.5.
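A minimal sketch of this comparison (assuming NumPy; the sample sizes are illustrative), which evaluates both risk functions on a grid of $\theta$ and reports the fraction of the grid where the minimax rule has smaller risk:

```python
import numpy as np

thetas = np.linspace(0, 1, 1001)
for n in [5, 10000]:
    minimax_risk = np.full_like(thetas, 1.0 / (4 * (1 + np.sqrt(n))**2))  # constant risk
    mle_risk = thetas * (1 - thetas) / n                                  # risk of the MLE X-bar
    frac = np.mean(minimax_risk < mle_risk)   # fraction of theta grid where minimax wins
    print(n, round(float(frac), 3))
# For n = 5 the minimax rule beats the MLE on most of the range of theta;
# for n = 10000 it wins only on a narrow interval of theta around 0.5.
```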
Minimax as limit of Bayes rules:

If the parameter space $\Theta$ is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations, we need an extension of Theorem 3.3.2.

Theorem 3.3.3: Let $\delta^*$ be a decision rule such that $\sup_\theta R(\theta, \delta^*) = r < \infty$. Let $\{\pi_k\}$ be a sequence of prior distributions and let $r_k$ be the Bayes risk of the Bayes rule with respect to the prior $\pi_k$. If $r_k \rightarrow r$ as $k \rightarrow \infty$, then $\delta^*$ is minimax.
Proof: Suppose $\delta$ is any other estimator. Then,
$$\sup_\theta R(\theta, \delta) \ge \int R(\theta, \delta)\, d\pi_k(\theta) \ge r_k,$$
and this holds for every $k$. Hence, letting $k \rightarrow \infty$, $\sup_\theta R(\theta, \delta) \ge r = \sup_\theta R(\theta, \delta^*)$, and $\delta^*$ is minimax.
Note: Unlike Theorem 3.3.2, even if the Bayes estimators for the priors $\pi_k$ are unique, the theorem does not guarantee that $\delta^*$ is the unique minimax estimator.
Example 2 (Example 3.3.3): $X_1, \ldots, X_n$ iid $N(\theta, 1)$, $-\infty < \theta < \infty$. Suppose we want to estimate $\theta$ with squared error loss. We will show that $\bar{X}$ is minimax.

First, note that $\bar{X}$ has constant risk $\frac{1}{n}$. Consider the sequence of priors $\pi_k = N(0, k)$. In Notes 16, we showed that the Bayes estimator for squared error loss with respect to the prior $\pi_k$ is
$$\hat{\theta}_k = \frac{n}{n + \frac{1}{k}} \bar{X}.$$
The risk function of $\hat{\theta}_k$ is
$$R(\theta, \hat{\theta}_k) = E_\theta\left[\left(\frac{n}{n + \frac{1}{k}} \bar{X} - \theta\right)^2\right] = \frac{1}{\left(n + \frac{1}{k}\right)^2}\left(n + \frac{\theta^2}{k^2}\right).$$
The Bayes risk of $\hat{\theta}_k$ with respect to $\pi_k$ is
$$r_k = \int_{-\infty}^{\infty} \frac{n + \frac{\theta^2}{k^2}}{\left(n + \frac{1}{k}\right)^2} \cdot \frac{1}{\sqrt{2\pi k}} \exp\left(-\frac{\theta^2}{2k}\right) d\theta = \frac{n + \frac{1}{k}}{\left(n + \frac{1}{k}\right)^2} = \frac{1}{n + \frac{1}{k}},$$
using $E_{\pi_k}[\theta^2] = k$. As $k \rightarrow \infty$, $r_k \rightarrow \frac{1}{n}$, which is the constant risk of $\bar{X}$. Thus, by Theorem 3.3.3, $\bar{X}$ is minimax.
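A minimal Monte Carlo sketch of this argument (assuming NumPy; the sample sizes and seed are illustrative), estimating the Bayes risk $r_k$ by simulating $\theta \sim N(0, k)$ and then $\bar{X} \mid \theta \sim N(\theta, 1/n)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 200_000

for k in [1, 10, 100, 1000]:
    theta = rng.normal(0.0, np.sqrt(k), size=reps)     # theta ~ N(0, k)
    xbar = rng.normal(theta, 1.0 / np.sqrt(n))         # X-bar | theta ~ N(theta, 1/n)
    est = n / (n + 1.0 / k) * xbar                     # Bayes estimator for the prior N(0, k)
    mc_risk = np.mean((est - theta)**2)                # Monte Carlo estimate of r_k
    print(k, round(mc_risk, 4), round(1.0 / (n + 1.0 / k), 4))
# The Monte Carlo column matches the closed form 1/(n + 1/k) and approaches 1/n = 0.1.
```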