Statistics 550 Notes 3
Reading: Section 1.3
Decision Theoretic Framework: Framework for evaluating
and choosing statistical inference procedures
I. Motivating Example
A cofferdam protecting a construction site was designed to
withstand flows of up to 1870 cubic feet per second (cfs).
An engineer wishes to estimate the probability that the dam
will be overtopped during the upcoming year. Over the
previous 25-year period, the annual maximum flood level
of the dam exceeded 1870 cfs 5 times. The engineer
models the data on whether the flood level has exceeded
1870 cfs as independent Bernoulli trials with the same
probability p that the flood level will exceed 1870 cfs in
each year.
Some possible estimates of $p$ based on iid Bernoulli trials $X_1, \ldots, X_n$:

(1) $\hat{p} = \frac{\sum_{i=1}^n X_i}{n}$, the sample proportion;

(2) $\hat{p} = \frac{1 + \sum_{i=1}^n X_i}{n+2}$, the posterior mean for a uniform prior on $p$;

(3) $\hat{p} = \frac{2 + \sum_{i=1}^n X_i}{n+4}$, the posterior mean for a Beta(2,2) prior on $p$ (called the Wilson estimate, recommended by Moore and McCabe, Introduction to the Practice of Statistics).
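As a quick numerical check, here is a minimal Python sketch (the function name is ours, not from the notes) that computes the three estimates for the cofferdam data ($n = 25$, 5 exceedances):

```python
# Three estimates of p for iid Bernoulli data: the sample proportion,
# the posterior mean under a uniform prior, and the Wilson estimate
# (posterior mean under a Beta(2,2) prior).

def estimates(num_successes, n):
    return {
        "sample proportion": num_successes / n,
        "uniform-prior posterior mean": (1 + num_successes) / (n + 2),
        "Wilson (Beta(2,2) posterior mean)": (2 + num_successes) / (n + 4),
    }

# Cofferdam data: 5 exceedances of 1870 cfs in 25 years.
for name, value in estimates(5, 25).items():
    print(f"{name}: {value:.4f}")
```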
How should we decide which of these estimates to use?
The answer depends in part on how errors in the estimation
of p affect us.
Example 1 of a decision problem: The firm wants the engineer to provide her best "guess" of $p$, the probability of an overflow, i.e., to estimate $p$ by $\hat{p}$. The firm wants the probability of an overflow to be at most 0.05. Based on the estimate $\hat{p}$ of $p$, the engineer's firm plans to spend an additional $f(\max(0, \hat{p} - 0.05))$ dollars to shore up the dam, where $f(0) = 0$ and $f$ is an increasing function. By spending this money, the firm will make the probability of an overflow $\max(0, p - \max(0, \hat{p} - 0.05))$. The cost of an overflow to the firm is $C$ dollars. The expected cost to the firm of using the estimate $\hat{p}$ (for a true initial probability of overflow of $p$) is
$$f(\max(0, \hat{p} - 0.05)) + C \cdot \max(0, p - \max(0, \hat{p} - 0.05)).$$
We want to choose an estimate which provides low expected cost.
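To make the tradeoff concrete, the following small Python sketch evaluates this expected cost; the shoring-cost function $f$ and the constant $C$ are hypothetical stand-ins, since the notes leave them unspecified:

```python
# Expected cost of using estimate p_hat when the true overflow
# probability is p: the shoring expenditure plus the overflow cost
# weighted by the residual overflow probability.

def expected_cost(p_hat, p, C=1_000_000, f=lambda x: 5_000_000 * x):
    # f is a hypothetical increasing shoring-cost function with f(0) = 0.
    shoring = f(max(0.0, p_hat - 0.05))
    residual_overflow_prob = max(0.0, p - max(0.0, p_hat - 0.05))
    return shoring + C * residual_overflow_prob

# Underestimating p leaves overflow risk; overestimating wastes money.
print(expected_cost(p_hat=0.10, p=0.20))  # shores too little
print(expected_cost(p_hat=0.25, p=0.20))  # shores enough, at greater expense
```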
Example 2 of a decision problem: Another decision problem besides estimating $p$ might be that the firm wants to decide whether $p \le 0.15$ or $p > 0.15$; if $p > 0.15$, the firm would like to build additional support for the dam. This is an example of a testing problem: deciding whether the parameter lies in one of two subsets that form a partition of the parameter space. The cost to the firm of making the wrong decision about whether $p \le 0.15$ or $p > 0.15$ depends on what type of error was made (deciding that $p > 0.15$ when in fact $p \le 0.15$, or deciding that $p \le 0.15$ when in fact $p > 0.15$).
The decision theoretic framework involves:
(1) clarifying the objectives of the study;
(2) identifying the different possible actions;
(3) providing assessments of risk, accuracy, and reliability of statistical procedures;
(4) providing guidance in the choice of procedures for analyzing outcomes of experiments.
II. Components of the Decision Theory Framework
(Section 1.3.1)
We observe data $X$ from a distribution $P$, where we do not know the true $P$ but only know that $P \in \mathcal{P} = \{P_\theta : \theta \in \Theta\}$ (the statistical model).

The true parameter vector $\theta$ is sometimes called the "state of nature."
Action space: The action space $\mathcal{A}$ is the set of possible actions, decisions or claims that we can contemplate making after observing the data $X$.

For Example 1, the action space is the set of possible estimates of $p$ (the probability of the dam being overtopped), $\mathcal{A} = [0,1]$.

For Example 2, the action space is {decide that $p \le 0.15$, decide that $p > 0.15$}.

Loss function: The loss function $l(\theta, a)$ is the loss incurred by taking the action $a$ when the true parameter vector is $\theta$.
The loss function is assumed to be nonnegative. We want
the loss to be small.
Relationship between the loss function and the utility function in economics: The loss function is related to the utility function in economics. If the utility of taking the action $a$ when the true state of nature is $\theta$ is $U(\theta, a)$, then we can define the loss as
$$l(\theta, a) = \max_{\theta', a'} U(\theta', a') - U(\theta, a).$$
When there is uncertainty about the outcome of interest after taking the action (as in Example 1), then we can replace the utility with the expected utility under the von Neumann-Morgenstern axioms for decision making under uncertainty (W. Nicholson, Microeconomic Theory, 6th ed., Ch. 12).
Ideally, we choose the loss function based on the economics of the decision problem, as in Example 1. However, more commonly, the loss function is chosen to qualitatively reflect what we are trying to do and to be mathematically convenient.
Commonly used loss functions for point estimation of a real-valued parameter $q(\theta)$ (denote our estimate of $q(\theta)$ by $a$):

The most commonly used loss function is quadratic (squared error) loss: $l(\theta, a) = (q(\theta) - a)^2$.

Other choices that are less computationally convenient but perhaps more realistically penalize large errors less are:

(1) absolute value loss, $l(\theta, a) = |q(\theta) - a|$;

(2) Huber's loss function,
$$l(\theta, a) = \begin{cases} (q(\theta) - a)^2 & \text{if } |q(\theta) - a| \le k \\ 2k|q(\theta) - a| - k^2 & \text{if } |q(\theta) - a| > k \end{cases}$$
for some constant $k$;

(3) the zero-one loss function,
$$l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k \end{cases}$$
for some constant $k$.
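The following minimal Python sketch implements these loss functions as functions of the estimation error $e = q(\theta) - a$ (the function names are ours):

```python
# Loss functions for point estimation, written in terms of the
# estimation error e = q(theta) - a.

def quadratic_loss(e):
    return e ** 2

def absolute_loss(e):
    return abs(e)

def huber_loss(e, k=1.0):
    # Quadratic for small errors, linear beyond |e| = k;
    # the two pieces match in value and slope at |e| = k.
    return e ** 2 if abs(e) <= k else 2 * k * abs(e) - k ** 2

def zero_one_loss(e, k=1.0):
    # Penalizes all errors larger than k equally.
    return 0.0 if abs(e) <= k else 1.0

for e in [0.5, 1.0, 3.0]:
    print(e, quadratic_loss(e), absolute_loss(e),
          huber_loss(e), zero_one_loss(e))
```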
Decision procedures: A decision procedure or decision rule specifies how we use the data to choose an action $a$. A decision procedure $\delta$ is a function $\delta(X)$ from the sample space of the experiment to the action space.

For Example 1, decision procedures include
$$\delta(X) = \frac{\sum_{i=1}^n X_i}{n} \quad \text{and} \quad \delta(X) = \frac{1 + \sum_{i=1}^n X_i}{n+2}.$$
Risk function: The loss of a decision procedure will vary over repetitions of the experiment because the data $X$ from the experiment is random. The risk function $R(\theta, \delta)$ is the expected loss from using the decision procedure $\delta$ when the true parameter vector is $\theta$:
$$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))]$$

Example: For quadratic loss in point estimation of $q(\theta)$, the risk function is the mean squared error:
$$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))] = E_\theta[(q(\theta) - \delta(X))^2]$$
This mean squared error can be decomposed as bias squared plus variance.
Proposition 3.1:
$$E_\theta[(q(\theta) - \delta(X))^2] = (q(\theta) - E_\theta[\delta(X)])^2 + E_\theta[(\delta(X) - E_\theta[\delta(X)])^2]$$

Proof: We have
$$\begin{aligned}
E_\theta[(q(\theta) - \delta(X))^2] &= E_\theta[(\{q(\theta) - E_\theta[\delta(X)]\} + \{E_\theta[\delta(X)] - \delta(X)\})^2] \\
&= (q(\theta) - E_\theta[\delta(X)])^2 + 2\{q(\theta) - E_\theta[\delta(X)]\}\, E_\theta\{E_\theta[\delta(X)] - \delta(X)\} + E_\theta[\{E_\theta[\delta(X)] - \delta(X)\}^2] \\
&= (q(\theta) - E_\theta[\delta(X)])^2 + E_\theta[\{E_\theta[\delta(X)] - \delta(X)\}^2] \\
&= \{\mathrm{Bias}_\theta[\delta(X)]\}^2 + \mathrm{Var}_\theta[\delta(X)],
\end{aligned}$$
where the cross term vanishes because $E_\theta\{E_\theta[\delta(X)] - \delta(X)\} = 0$. ∎
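A quick simulation makes the decomposition concrete. The sketch below (our illustration, not from the notes) estimates the MSE, squared bias, and variance of estimate (3), $\hat{p} = (2 + \sum_i X_i)/(n+4)$, from Bernoulli data and checks that MSE = bias² + variance:

```python
import random

# Monte Carlo check of Proposition 3.1 for the estimator
# p_hat = (2 + sum(X_i)) / (n + 4) with Bernoulli(p) data.
random.seed(0)
p, n, reps = 0.2, 25, 100_000

estimates = []
for _ in range(reps):
    s = sum(random.random() < p for _ in range(n))
    estimates.append((2 + s) / (n + 4))

mean_est = sum(estimates) / reps
mse = sum((e - p) ** 2 for e in estimates) / reps
bias_sq = (mean_est - p) ** 2
var = sum((e - mean_est) ** 2 for e in estimates) / reps

print(mse, bias_sq + var)  # agree up to floating-point error
```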
Example 3: Suppose that an iid sample $X_1, \ldots, X_n$ is drawn from the uniform distribution on $[0, \theta]$, where $\theta$ is an unknown parameter and the density of $X_i$ is
$$f_X(x; \theta) = \begin{cases} 1/\theta & 0 < x < \theta \\ 0 & \text{elsewhere} \end{cases}$$

Several point estimators:

1. $W_1 = \max_i X_i$. Note: $W_1$ is biased, $E(W_1) = \frac{n}{n+1}\theta$.

2. $W_2 = \frac{n+1}{n} \max_i X_i$. Note: Unlike $W_1$, $W_2$ is unbiased because
$$E(W_2) = \frac{n+1}{n} E(W_1) = \frac{n+1}{n} \cdot \frac{n}{n+1}\theta = \theta.$$

3. $W_3 = 2\bar{X}$. Note: $W_3$ is unbiased because
$$E_\theta[X] = \int_0^\theta x \frac{1}{\theta}\, dx = \frac{x^2}{2\theta}\Big|_0^\theta = \frac{\theta}{2},$$
so $E_\theta[W_3] = 2 E_\theta[\bar{X}] = \theta$.
Comparison of the three estimators for the uniform example using the mean squared error criterion:

1. $W_1 = \max_i X_i$.

The sampling distribution of $W_1$ is
$$P(W_1 \le w_1) = P(X_1 \le w_1, \ldots, X_n \le w_1) = \left(\frac{w_1}{\theta}\right)^n,$$
so
$$f_{W_1}(w_1) = \begin{cases} \dfrac{n w_1^{n-1}}{\theta^n} & 0 \le w_1 \le \theta \\ 0 & \text{elsewhere} \end{cases}$$
and
$$E_\theta[W_1] = \int_0^\theta w_1 f_{W_1}(w_1)\, dw_1 = \int_0^\theta w_1 \frac{n w_1^{n-1}}{\theta^n}\, dw_1 = \frac{n w_1^{n+1}}{(n+1)\theta^n}\Big|_0^\theta = \frac{n}{n+1}\theta.$$

Thus
$$\mathrm{Bias}(W_1) = E_\theta[W_1] - \theta = -\frac{1}{n+1}\theta.$$

To calculate $\mathrm{Var}(W_1)$, we calculate $E(W_1^2)$ and use the formula $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$:
$$E(W_1^2) = \int_0^\theta w_1^2 f_{W_1}(w_1)\, dw_1 = \int_0^\theta w_1^2 \frac{n w_1^{n-1}}{\theta^n}\, dw_1 = \frac{n w_1^{n+2}}{(n+2)\theta^n}\Big|_0^\theta = \frac{n}{n+2}\theta^2,$$
$$\mathrm{Var}(W_1) = \frac{n}{n+2}\theta^2 - \left(\frac{n}{n+1}\right)^2 \theta^2 = \frac{n}{(n+2)(n+1)^2}\theta^2.$$

Thus,
$$\mathrm{MSE}(W_1) = \{\mathrm{Bias}(W_1)\}^2 + \mathrm{Var}(W_1) = \frac{\theta^2}{(n+1)^2} + \frac{n\theta^2}{(n+2)(n+1)^2} = \frac{2\theta^2}{(n+1)(n+2)}.$$
2. $W_2 = \frac{n+1}{n} \max_i X_i$.

Note $W_2 = \frac{n+1}{n} W_1$. Thus,
$$E(W_2) = \frac{n+1}{n} E(W_1) = \frac{n+1}{n} \cdot \frac{n}{n+1}\theta = \theta,$$
so $\mathrm{Bias}(W_2) = 0$ and
$$\mathrm{Var}(W_2) = \mathrm{Var}\left(\frac{n+1}{n} W_1\right) = \left(\frac{n+1}{n}\right)^2 \mathrm{Var}(W_1) = \left(\frac{n+1}{n}\right)^2 \frac{n\theta^2}{(n+2)(n+1)^2} = \frac{\theta^2}{n(n+2)}.$$

Because $W_2$ is unbiased,
$$\mathrm{MSE}(W_2) = \mathrm{Var}(W_2) = \frac{\theta^2}{n(n+2)}.$$
3. $W_3 = 2\bar{X}$.

To find the mean squared error, we use the fact that if $X_1, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$ has mean $\mu$ and variance $\sigma^2/n$.

We have
$$E(X) = \int_0^\theta x \frac{1}{\theta}\, dx = \frac{x^2}{2\theta}\Big|_0^\theta = \frac{\theta}{2}, \qquad E(X^2) = \int_0^\theta x^2 \frac{1}{\theta}\, dx = \frac{x^3}{3\theta}\Big|_0^\theta = \frac{\theta^2}{3},$$
$$\mathrm{Var}(X) = \frac{\theta^2}{3} - \left(\frac{\theta}{2}\right)^2 = \frac{\theta^2}{12}.$$

Thus, $E(\bar{X}) = \theta/2$ and $\mathrm{Var}(\bar{X}) = \theta^2/(12n)$, so $E(W_3) = 2E(\bar{X}) = \theta$ and $\mathrm{Var}(W_3) = 4\,\mathrm{Var}(\bar{X}) = \frac{\theta^2}{3n}$.

$W_3$ is unbiased and has mean squared error $\frac{\theta^2}{3n}$.
The mean squared errors of the three estimators are the following:

Estimator   MSE
$W_1$       $2\theta^2/[(n+1)(n+2)]$
$W_2$       $\theta^2/[n(n+2)]$
$W_3$       $\theta^2/(3n)$

For $n = 1$, the three estimators have the same MSE. For $n > 1$,
$$\frac{\theta^2}{n(n+2)} < \frac{2\theta^2}{(n+1)(n+2)} \le \frac{\theta^2}{3n},$$
so $W_2$ is best, $W_1$ is second best, and $W_3$ is the worst ($W_1$ and $W_3$ tie when $n = 2$).
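A short Monte Carlo check (ours) confirms this ordering; with $\theta = 1$ and $n = 10$, the empirical MSEs should be close to $2/132 \approx 0.0152$, $1/120 \approx 0.0083$, and $1/30 \approx 0.0333$:

```python
import random

# Empirical MSEs of W1 = max(X), W2 = (n+1)/n * max(X), W3 = 2*mean(X)
# for X1,...,Xn iid Uniform(0, theta).
random.seed(0)
theta, n, reps = 1.0, 10, 200_000

sq_err = {"W1": 0.0, "W2": 0.0, "W3": 0.0}
for _ in range(reps):
    x = [random.uniform(0, theta) for _ in range(n)]
    m = max(x)
    w = {"W1": m, "W2": (n + 1) / n * m, "W3": 2 * sum(x) / n}
    for name, est in w.items():
        sq_err[name] += (est - theta) ** 2

for name, total in sq_err.items():
    print(name, total / reps)
```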
III. Admissibility/Inadmissibility of Decision Procedures
A decision procedure $\delta$ is inadmissible if there exists another decision procedure $\delta'$ such that $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta \in \Theta$ and $R(\theta, \delta') < R(\theta, \delta)$ for at least one $\theta \in \Theta$. The decision procedure $\delta'$ is said to dominate $\delta$; there is no justification for using $\delta$ rather than $\delta'$.

In Example 3, $W_1$ and $W_3$ are inadmissible point estimators under squared error loss for $n > 1$ (each is dominated by $W_2$).

A decision procedure $\delta$ is admissible if it is not inadmissible, i.e., if there does not exist a decision procedure $\delta'$ such that $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta \in \Theta$ and $R(\theta, \delta') < R(\theta, \delta)$ for at least one $\theta \in \Theta$.
IV. Selection of a decision procedure:

We would like to choose a decision procedure which has a "good" risk function.

Ideal: We would like to construct a decision procedure that is at least as good as all other decision procedures for all $\theta \in \Theta$, i.e., a $\delta(x)$ such that $R(\theta, \delta) \le R(\theta, \delta')$ for all $\theta \in \Theta$ and all other decision procedures $\delta'$.

This is generally impossible!

Example 4: For $X_1, \ldots, X_n$ iid $N(\mu, 1)$, $\delta(X) = 1$ is an admissible point estimator of $\mu$ under squared error loss.
Proof: Suppose $\delta(X) = 1$ is inadmissible. Then there exists a decision procedure $\delta'$ that dominates $\delta$. This implies that $R(1, \delta') \le R(1, \delta) = 0$. Hence, $0 = R(1, \delta') = E_{\mu=1}[(\delta'(X_1, \ldots, X_n) - 1)^2]$. Since $(\delta'(X_1, \ldots, X_n) - 1)^2$ is nonnegative, this implies
$$P_{\mu=1}[\delta'(X_1, \ldots, X_n) - 1 = 0] = 1.$$

Let $B$ be the event that $\delta'(X_1, \ldots, X_n) - 1 \ne 0$. We will show that $P_\mu(B) = 0$ for all $\mu \in (-\infty, \infty)$. This means that $\delta'(X_1, \ldots, X_n) = 1$ with probability 1 for all $\mu \in (-\infty, \infty)$, which means that $R(\mu, \delta) = R(\mu, \delta')$ for all $\mu \in (-\infty, \infty)$; this contradicts the assumption that $\delta'$ dominates $\delta$ and proves that $\delta(X) = 1$ is admissible.

To show that $P_\mu(B) = 0$ for all $\mu \in (-\infty, \infty)$, we use the importance sampling idea that the expectation of a random variable $X$ under a density $f$ can be evaluated as the expectation of the random variable $Xf(X)/g(X)$ under a density $g$, as long as $f$ and $g$ have the same support:
$$\begin{aligned}
P_\mu(B) &= \int \cdots \int I_B \, (2\pi)^{-n/2} \exp\left(-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2}\right) dx_1 \cdots dx_n \\
&= \int \cdots \int I_B \, \frac{(2\pi)^{-n/2} \exp\left(-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2}\right)}{(2\pi)^{-n/2} \exp\left(-\frac{\sum_{i=1}^n (x_i - 1)^2}{2}\right)} \, (2\pi)^{-n/2} \exp\left(-\frac{\sum_{i=1}^n (x_i - 1)^2}{2}\right) dx_1 \cdots dx_n \\
&= E_{\mu=1}\left[ I_B \, \frac{\exp\left(-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2}\right)}{\exp\left(-\frac{\sum_{i=1}^n (x_i - 1)^2}{2}\right)} \right] \qquad (0.1)
\end{aligned}$$

Since $P_{\mu=1}(B) = 0$, the random variable
$$I_B \, \frac{\exp\left(-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2}\right)}{\exp\left(-\frac{\sum_{i=1}^n (x_i - 1)^2}{2}\right)}$$
is zero with probability one under $\mu = 1$. Thus, by (0.1), $P_\mu(B) = 0$ for all $\mu \in (-\infty, \infty)$. ■
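The importance sampling identity used in the proof is easy to verify numerically. This small sketch (ours) evaluates $P_\mu(X > 2)$ for a single $N(\mu, 1)$ observation both directly and by reweighting draws from $N(1, 1)$, mirroring the density ratio in the display above:

```python
import math
import random

# Check E_f[h(X)] = E_g[h(X) f(X)/g(X)] with f = N(mu,1), g = N(1,1),
# h(x) = indicator(x > 2).
random.seed(0)
mu, reps = 0.0, 500_000

def normal_pdf(x, m):
    return math.exp(-(x - m) ** 2 / 2) / math.sqrt(2 * math.pi)

direct = sum(random.gauss(mu, 1) > 2 for _ in range(reps)) / reps

weighted = 0.0
for _ in range(reps):
    x = random.gauss(1, 1)  # draw from g = N(1,1)
    if x > 2:
        weighted += normal_pdf(x, mu) / normal_pdf(x, 1)
weighted /= reps

print(direct, weighted)  # both approximate P_mu(X > 2)
```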
Comparison of risk under squared error loss for $\delta_1(X) = 1$ and $\delta_2(X) = \bar{X}$:
$$R(\mu, \delta_1) = E_\mu[(1 - \mu)^2] = (1 - \mu)^2$$
$$R(\mu, \delta_2) = E_\mu[(\bar{X} - \mu)^2] = \mathrm{Var}(\bar{X}) = \frac{1}{n}$$

Although $\delta_1(X) = 1$ is admissible, it does not have good risk properties for many values of $\mu$.
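For instance, this tiny sketch (ours) tabulates both risks for $n = 25$; $\delta_1$ beats $\bar{X}$ only when $\mu$ is within $1/\sqrt{n} = 0.2$ of 1:

```python
# Risk of delta_1(X) = 1 versus delta_2(X) = X-bar under squared error
# loss, for X1,...,Xn iid N(mu, 1).
n = 25
for mu in [0.0, 0.5, 0.9, 1.0, 1.1, 1.5, 2.0]:
    risk_const = (1 - mu) ** 2   # R(mu, delta_1)
    risk_mean = 1 / n            # R(mu, delta_2)
    print(mu, risk_const, risk_mean)
```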
Approaches to choosing a decision procedure with good risk properties:

(1) Restrict the class of decision procedures and try to choose an optimal procedure within this class; e.g., for point estimation, we might only consider unbiased estimators $\delta(X)$ of $q(\theta)$, i.e., those such that $E_\theta[\delta(X)] = q(\theta)$ for all $\theta \in \Theta$.

(2) Compare risk functions by a global criterion. We shall discuss the Bayes and minimax criteria.
Example 5 (Example 1.3.5 from Bickel and Doksum): We are trying to decide whether to drill a location for oil. There are two possible states of nature, $\theta_1$ = location contains oil and $\theta_2$ = location doesn't contain oil. We are considering three actions: $a_1$ = drill for oil, $a_2$ = sell the location, and $a_3$ = sell partial rights to the location.

The following loss function is decided on:

                  a_1 (Drill)   a_2 (Sell)   a_3 (Partial rights)
θ_1 (Oil)              0            10              5
θ_2 (No oil)          12             1              6

An experiment is conducted to obtain information about $\theta$, resulting in the random variable $X$ with possible values 0, 1 and frequency function $p(x, \theta)$ given by the following table:

                  x = 0    x = 1
θ_1 (Oil)          0.3      0.7
θ_2 (No oil)       0.6      0.4

$X = 1$ represents the presence of a certain geological (rock) formation that is more likely to be present when there is oil.

The possible nonrandomized decision procedures $\delta(x)$ are:

Rule     1    2    3    4    5    6    7    8    9
x = 0   a_1  a_1  a_1  a_2  a_2  a_2  a_3  a_3  a_3
x = 1   a_1  a_2  a_3  a_1  a_2  a_3  a_1  a_2  a_3
The risk of $\delta$ at $\theta$ is
$$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))] = l(\theta, a_1) P_\theta[\delta(X) = a_1] + l(\theta, a_2) P_\theta[\delta(X) = a_2] + l(\theta, a_3) P_\theta[\delta(X) = a_3].$$

The risk functions are:

Rule          1     2     3     4     5     6     7     8     9
R(θ_1, δ)     0     7     3.5   3     10    6.5   1.5   8.5   5
R(θ_2, δ)     12    7.6   9.6   5.4   1     3     8.4   4     6

The decision rules 2, 3, 8 and 9 are inadmissible, but the decision rules 1, 4, 5, 6 and 7 are all admissible.
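These risk numbers are mechanical to reproduce. Here is a minimal Python sketch (ours, with our own variable names) that computes $R(\theta, \delta)$ for all nine rules from the loss and frequency tables:

```python
# Risk R(theta, delta) = sum over x of loss(theta, delta(x)) * p(x | theta)
# for the oil-drilling example (Example 5).
from itertools import product

loss = {"theta1": {"a1": 0, "a2": 10, "a3": 5},
        "theta2": {"a1": 12, "a2": 1, "a3": 6}}
p_x = {"theta1": {0: 0.3, 1: 0.7},
       "theta2": {0: 0.6, 1: 0.4}}

# Rule k maps x=0 and x=1 to actions; the 9 combinations match the table.
rules = list(product(["a1", "a2", "a3"], repeat=2))

risks = {}
for k, (act0, act1) in enumerate(rules, start=1):
    risks[k] = {th: loss[th][act0] * p_x[th][0] + loss[th][act1] * p_x[th][1]
                for th in ("theta1", "theta2")}
    print(k, risks[k])
```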
V. Bayes Criterion

The Bayesian point of view leads to a natural global criterion.

Suppose a person's prior distribution for $\theta$ is $\pi(\theta)$ and the model is that $X \mid \theta$ has probability density function (or probability mass function) $p(x \mid \theta)$. Then the joint (subjective) pdf (or pmf) of $(X, \theta)$ is $\pi(\theta) p(x \mid \theta)$.

The Bayes risk of a decision procedure $\delta$ for a prior distribution $\pi(\theta)$, denoted by $r_\pi(\delta)$, is the expected value of the risk over the joint distribution of $(X, \theta)$:
$$r_\pi(\delta) = E[E[l(\theta, \delta(X)) \mid \theta]] = E[R(\theta, \delta)].$$

For a person with subjective prior probability distribution $\pi(\theta)$, the decision procedure which minimizes $r_\pi(\delta)$ minimizes the person's (subjective) expected loss and is the best procedure from this person's point of view. The decision procedure which minimizes the Bayes risk for a prior $\pi(\theta)$ is called the Bayes rule for the prior $\pi(\theta)$.
Example continued: For the prior $\pi(\theta_1) = 0.2$ and $\pi(\theta_2) = 0.8$, the Bayes risks are
$$r_\pi(\delta) = 0.2\, R(\theta_1, \delta) + 0.8\, R(\theta_2, \delta):$$

Rule       1     2     3     4     5     6     7     8     9
r_π(δ)    9.6  7.48  8.38  4.92  2.8   3.7  7.02  4.9   5.8

Thus, rule 5 is the Bayes rule for this prior distribution.
The Bayes rule depends on the prior. For the prior $\pi(\theta_1) = 0.5$ and $\pi(\theta_2) = 0.5$, the Bayes risks are
$$r_\pi(\delta) = 0.5\, R(\theta_1, \delta) + 0.5\, R(\theta_2, \delta):$$

Rule       1     2     3     4     5     6     7     8     9
r_π(δ)     6    7.3  6.55   4.2   5.5  4.75  4.95  6.25   5.5

Thus, rule 4 is the Bayes rule for this prior distribution.
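Continuing the sketch above, the Bayes rule for any prior on $\{\theta_1, \theta_2\}$ can be found by weighting the risk rows (this reuses the `risks` dictionary from the previous snippet):

```python
# Bayes risk r_pi(delta) = pi(theta1) R(theta1, delta)
#                          + pi(theta2) R(theta2, delta);
# the Bayes rule minimizes it over the nine rules.
def bayes_rule(prior_theta1, risks):
    bayes_risks = {k: prior_theta1 * r["theta1"]
                      + (1 - prior_theta1) * r["theta2"]
                   for k, r in risks.items()}
    return min(bayes_risks, key=bayes_risks.get), bayes_risks

print(bayes_rule(0.2, risks)[0])  # rule 5
print(bayes_rule(0.5, risks)[0])  # rule 4
```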
A non-subjective interpretation of Bayes rules: The Bayes approach leads us to compare procedures on the basis of
$$r_\pi(\delta) = \sum_\theta R(\theta, \delta)\, \pi(\theta)$$
if $\theta$ is discrete with frequency function $\pi(\theta)$, or
$$r_\pi(\delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta$$
if $\theta$ is continuous with density $\pi(\theta)$.

Such comparisons make sense even if we do not interpret $\pi(\theta)$ as a prior density or frequency function, but only as a weight function that reflects the importance we place on doing well at the different possible values of $\theta$.
For example, in Example 5, if we felt that doing well at both $\theta_1$ and $\theta_2$ is equally important, we would set $\pi(\theta_1) = \pi(\theta_2) = 0.5$.
VI. Minimax Criterion

The minimax criterion minimizes the worst possible risk. That is, we prefer $\delta$ to $\delta'$ if and only if
$$\sup_{\theta \in \Theta} R(\theta, \delta) < \sup_{\theta \in \Theta} R(\theta, \delta').$$

A procedure $\delta^*$ is minimax (over a class of considered decision procedures) if it satisfies
$$\sup_{\theta \in \Theta} R(\theta, \delta^*) = \inf_\delta \sup_{\theta \in \Theta} R(\theta, \delta).$$

Among the nine decision rules considered for Example 5, rule 4 is the minimax rule:
Rule                        1     2     3     4     5     6     7     8     9
R(θ_1, δ)                   0     7     3.5   3     10    6.5   1.5   8.5   5
R(θ_2, δ)                   12    7.6   9.6   5.4   1     3     8.4   4     6
max{R(θ_1, δ), R(θ_2, δ)}   12    7.6   9.6   5.4   10    6.5   8.4   8.5   6
Game theory motivation for the minimax criterion: Suppose we play a two-person zero-sum game against Nature. Then the minimax decision procedure is the minimax strategy for the game.

Comments on the minimax criterion: The minimax criterion is very conservative. It aims to give maximum protection against the worst that can happen. The principle would be compelling if the statistician believed that Nature was a malevolent "opponent," but in fact Nature is just the inanimate state of the world.

Although the minimax criterion is conservative, in many cases the principle does lead to reasonable procedures.
VII. Other Global Criteria for Decision Procedures

Two compromises between the Bayes and minimax criteria that have been proposed are:

(1) Restricted risk Bayes: Suppose that $M$ is the maximum risk of the minimax decision procedure. Then, one may be willing to consider decision procedures whose maximum risk exceeds $M$, if the excess is controlled, say, if
$$R(\theta, \delta) \le M(1 + \epsilon) \text{ for all } \theta \in \Theta \qquad (0.2)$$
where $\epsilon$ is the proportional increase in risk that one is willing to tolerate. A restricted risk Bayes decision procedure for the prior $\pi$ is then obtained by minimizing the Bayes risk $r_\pi(\delta)$ among all decision procedures that satisfy (0.2).
For Example 5 and the prior $\pi(\theta_1) = 0.2$, $\pi(\theta_2) = 0.8$:

Rule        1     2     3     4     5     6     7     8     9
r_π(δ)     9.6  7.48  8.38  4.92  2.8   3.7  7.02  4.9   5.8
Max risk   12   7.6   9.6   5.4   10    6.5  8.4   8.5    6

For $\epsilon = 0.1$ (maximum risk allowed = $(1 + 0.1) \times 5.4 = 5.94$), decision rule 4 is the restricted risk Bayes procedure; for $\epsilon = 0.25$ (maximum risk allowed = $(1 + 0.25) \times 5.4 = 6.75$), decision rule 6 is the restricted risk Bayes procedure.
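The selection logic is a few lines on top of the earlier sketches (reusing `risks`, `max_risk`, and `bayes_rule` from above):

```python
# Restricted risk Bayes: minimize Bayes risk among rules whose maximum
# risk is at most M(1 + eps), where M = 5.4 is the minimax risk.
M = 5.4
_, bayes_risks = bayes_rule(0.2, risks)
for eps in (0.1, 0.25):
    feasible = {k: br for k, br in bayes_risks.items()
                if max_risk[k] <= M * (1 + eps)}
    print(eps, min(feasible, key=feasible.get))  # rule 4, then rule 6
```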
(2) Gamma minimaxity: Let $\Gamma$ be a class of prior distributions. A decision procedure $\delta^*$ is $\Gamma$-minimax (over a class of considered decision procedures) if
$$\sup_{\pi \in \Gamma} r_\pi(\delta^*) = \inf_\delta \sup_{\pi \in \Gamma} r_\pi(\delta).$$
Thus, the estimator $\delta^*$ minimizes the maximum Bayes risk over the priors in the class $\Gamma$.

Computational issues: We will study how to find Bayes and minimax point estimators in more detail in Chapter 3. The restricted risk Bayes procedure is appealing, but it is difficult to compute.
VIII. Randomized decision procedures

A randomized decision procedure $\delta$ is a decision procedure which assigns to each possible outcome of the data $X$ a random variable $Y(X)$, where the values of $Y(X)$ are actions in the action space. When $X = x$, a draw from the distribution of $Y(x)$ is taken and constitutes the action taken.
We will show in Chapter 3 that for any prior, there is always a nonrandomized decision procedure with Bayes risk at least as small as that of any randomized decision procedure (so we can ignore randomized decision procedures when looking for the Bayes rule).

Students of game theory will recognize that a randomized decision procedure may lead to a lower maximum risk than any nonrandomized decision procedure.
Example: For Example 5, a randomized decision procedure is to flip a fair coin and use decision rule 4 if the coin lands heads and decision rule 6 if the coin lands tails, i.e., $Y(x = 0) = a_2$ with probability 1, and $Y(x = 1) = a_1$ with probability 0.5 and $Y(x = 1) = a_3$ with probability 0.5. The risk of this randomized decision procedure is
$$0.5\, R(\theta, \delta_4) + 0.5\, R(\theta, \delta_6) = \begin{cases} 4.75 & \text{if } \theta = \theta_1 \\ 4.20 & \text{if } \theta = \theta_2, \end{cases}$$
which has lower maximum risk (4.75) than decision rule 4 (the minimax rule among nonrandomized decision rules, with maximum risk 5.4).
Randomized decision procedures are somewhat impractical: they make the statistician's inferences seem less credible if she has to explain to a scientist that she flipped a coin after observing the data to determine the inferences.

We will show in Section 1.5 that a randomized decision procedure cannot lower the maximum risk if the loss function is convex.