Statistics 550 Notes 3
Reading: Section 1.3
Decision Theoretic Framework: Framework for evaluating
and choosing statistical inference procedures
I. Motivating Example
A cofferdam protecting a construction site was designed to
withstand flows of up to 1870 cubic feet per second (cfs).
An engineer wishes to estimate the probability that the dam
will be overtopped during the upcoming year. Over the
previous 25 years, the annual maximum flood level at the dam
has exceeded 1870 cfs 5 times. The engineer
models the data on whether the flood level has exceeded
1870 cfs as independent Bernoulli trials with the same
probability p that the flood level will exceed 1870 cfs in
each year.
Some possible estimates of $p$ based on iid Bernoulli trials $X_1, \ldots, X_n$:

(1) $\hat{p} = \frac{\sum_{i=1}^n X_i}{n}$;

(2) $\hat{p} = \frac{1 + \sum_{i=1}^n X_i}{n+2}$, the posterior mean for a uniform prior on $p$;

(3) $\hat{p} = \frac{2 + \sum_{i=1}^n X_i}{n+4}$, the posterior mean for a Beta(2,2) prior on $p$ (called the Wilson estimate, recommended by Moore and McCabe, Introduction to the Practice of Statistics).
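A minimal Python sketch of these three estimates, using the cofferdam data above (5 exceedances in n = 25 years); the function name is just for illustration:

```python
# Three candidate estimates of p for iid Bernoulli data
# (illustrative sketch; 5 exceedances in n = 25 years, from the example above).

def estimates(sum_x, n):
    """Return the three estimates of p discussed above."""
    mle = sum_x / n                        # (1) sample proportion
    uniform_post = (1 + sum_x) / (n + 2)   # (2) posterior mean, uniform prior
    wilson = (2 + sum_x) / (n + 4)         # (3) posterior mean, Beta(2,2) prior (Wilson)
    return mle, uniform_post, wilson

print(estimates(sum_x=5, n=25))   # -> (0.2, 0.2222..., 0.2413...)
```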
How should we decide which of these estimates to use?
The answer depends in part on how errors in the estimation
of p affect us.
Example of a decision problem: The firm can spend $f(d)$ dollars to shore up the dam and prevent a proportion $d$ of the overflows that would have occurred without shoring up the dam. The cost of an overflow to the firm is $\$C$. The expected cost to the firm of a choice of $d$ is
$f(d) + Cp(1-d)$.
If the firm has a utility $U(c)$ for overall cost $c$, the expected utility of a choice of $d$ is
$E[U(c)] = (1 - p + pd)\,U[f(d)] + p(1-d)\,U[f(d) + C]$.
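To make the trade-off concrete, here is a small sketch that minimizes the expected cost $f(d) + Cp(1-d)$ over a grid of $d$ values. The shoring-up cost function f(d) and the overflow cost C used below are hypothetical choices (not from the notes), and p is plugged in as the sample proportion 5/25:

```python
# Minimal sketch: choose d to minimize expected cost f(d) + C*p*(1-d).
# f(d) and C below are hypothetical; p is the sample proportion from the example.

p = 5 / 25          # estimated probability of exceeding 1870 cfs in a year
C = 100_000         # hypothetical cost of an overflow (dollars)

def f(d):
    # hypothetical cost (in dollars) of preventing a fraction d of overflows
    return 30_000 * d ** 2

def expected_cost(d):
    return f(d) + C * p * (1 - d)

# Grid search over d in [0, 1]
grid = [i / 1000 for i in range(1001)]
best_d = min(grid, key=expected_cost)
print(best_d, expected_cost(best_d))
```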
The decision theoretic framework involves:
(1) clarifying the objectives of the study;
(2) pointing to what the different possible actions are;
(3) providing assessments of risk, accuracy, and
reliability of statistical procedures;
(4) providing guidance in the choice of procedures for
analyzing outcomes of experiments.
II. Components of the Decision Theory Framework
(Section 1.3.1)
We observe data $X$ from a distribution $P$, where we do not
know the true $P$ but only know that
$P \in \mathcal{P} = \{P_\theta,\ \theta \in \Theta\}$ (the statistical model).
The true parameter vector $\theta$ is sometimes called the “state
of nature.”
Action space: The action space A is the set of possible
actions, decisions or claims that we can contemplate
making after observing the data X .
For the example decision problem, the action space is the set of
possible choices of $d$: $A = [0,1]$.
Loss function: The loss function $l(\theta, a)$ is the loss
incurred by taking the action $a$ when the true parameter
vector is $\theta$.
The loss function is assumed to be nonnegative. We want
the loss to be small.
Relationship between the loss function and the utility function
in economics: If the utility of taking the action $a$
when the true state of nature is $\theta$ is $U(\theta, a)$, then we can
define the loss as
$l(\theta, a) = [\max_{\theta', a'} U(\theta', a')] - U(\theta, a)$.
When there is uncertainty about the outcome of interest
after taking the action (as in Example 1), then we can
replace the utility with the expected utility under the von
Neumann-Morgenstern axioms for decision making under
uncertainty (W. Nicholson, Microeconomic Theory, 6th ed.,
Ch. 12).
Commonly used loss functions for point estimation of a real-valued parameter $q(\theta)$:

Denote our estimate of $q(\theta)$ by $a$.

The most commonly used loss function is quadratic (squared error) loss:
$l(\theta, a) = (q(\theta) - a)^2$.

Other choices that are less computationally convenient but that penalize large errors less severely, which is perhaps more realistic, are:

(1) absolute value loss, $l(\theta, a) = |q(\theta) - a|$;

(2) Huber's loss functions,
$l(\theta, a) = \begin{cases} (q(\theta) - a)^2 & \text{if } |q(\theta) - a| \le k \\ 2k|q(\theta) - a| - k^2 & \text{if } |q(\theta) - a| > k \end{cases}$
for some constant $k$;

(3) zero-one loss,
$l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k \end{cases}$
for some constant k
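For concreteness, a minimal sketch of these loss functions as Python functions of the estimation error $e = q(\theta) - a$ (the function names and the default value of k are arbitrary illustrative choices):

```python
# Sketch of the loss functions above, written as functions of the error e = q(theta) - a.

def quadratic_loss(e):
    return e ** 2

def absolute_loss(e):
    return abs(e)

def huber_loss(e, k=1.0):
    # quadratic near zero, linear (slope 2k) for |e| > k
    return e ** 2 if abs(e) <= k else 2 * k * abs(e) - k ** 2

def zero_one_loss(e, k=1.0):
    # 0 if the estimate is within k of q(theta), 1 otherwise
    return 0 if abs(e) <= k else 1

for e in [0.5, 2.0]:
    print(quadratic_loss(e), absolute_loss(e), huber_loss(e), zero_one_loss(e))
```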
Decision procedures: A decision procedure or decision rule
specifies how we use the data to choose an action $a$. A
decision procedure $\delta$ is a function $\delta(X)$ from the sample
space of the experiment to the action space.
For Example 1, decision procedures include
$\delta(X) = \frac{\sum_{i=1}^n X_i}{n}$ and $\delta(X) = \frac{1 + \sum_{i=1}^n X_i}{n+2}$.
Risk function: The loss of a decision procedure will vary
over repetitions of the experiment because the data from
the experiment, $X$, is random. The risk function $R(\theta, \delta)$ is
the expected loss from using the decision procedure
$\delta$ when the true parameter vector is $\theta$:
$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))]$
Example: For quadratic loss in point estimation of $q(\theta)$,
the risk function is the mean squared error:
$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))] = E_\theta[(q(\theta) - \delta(X))^2]$
This mean squared error can be decomposed as bias squared plus variance.

Proposition 3.1:
$E_\theta[(q(\theta) - \delta(X))^2] = (q(\theta) - E_\theta[\delta(X)])^2 + E_\theta\{(\delta(X) - E_\theta[\delta(X)])^2\}$

Proof: We have
$E_\theta[(q(\theta) - \delta(X))^2] = E_\theta[(\{q(\theta) - E_\theta[\delta(X)]\} + \{E_\theta[\delta(X)] - \delta(X)\})^2]$
$= (q(\theta) - E_\theta[\delta(X)])^2 + 2\{q(\theta) - E_\theta[\delta(X)]\}\,E_\theta\{E_\theta[\delta(X)] - \delta(X)\} + E_\theta[\{E_\theta[\delta(X)] - \delta(X)\}^2]$
$= (q(\theta) - E_\theta[\delta(X)])^2 + E_\theta[\{E_\theta[\delta(X)] - \delta(X)\}^2]$
(the cross term vanishes because $E_\theta\{E_\theta[\delta(X)] - \delta(X)\} = 0$)
$= \{\mathrm{Bias}[\delta(X)]\}^2 + \mathrm{Variance}[\delta(X)]$.
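As a numerical sanity check on Proposition 3.1, the sketch below estimates the MSE, bias, and variance of the sample proportion for iid Bernoulli data by Monte Carlo (the values of p and n are arbitrary illustrative choices) and confirms that MSE ≈ bias² + variance:

```python
# Monte Carlo check that MSE = bias^2 + variance for the sample proportion
# of iid Bernoulli(p) data (p and n are arbitrary illustrative values).
import random

random.seed(0)
p_true, n, reps = 0.2, 25, 100_000

estimates = []
for _ in range(reps):
    x = [1 if random.random() < p_true else 0 for _ in range(n)]
    estimates.append(sum(x) / n)

mean_est = sum(estimates) / reps
bias = mean_est - p_true
variance = sum((e - mean_est) ** 2 for e in estimates) / reps
mse = sum((e - p_true) ** 2 for e in estimates) / reps

print(mse, bias ** 2 + variance)   # the two numbers should be nearly equal
```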
Example 3: Suppose that an iid sample $X_1, \ldots, X_n$ is drawn
from the uniform distribution on $[0, \theta]$, where $\theta$ is an
unknown parameter and the density of $X_i$ is
$f_X(x; \theta) = \begin{cases} \frac{1}{\theta} & 0 < x < \theta \\ 0 & \text{elsewhere} \end{cases}$
Several point estimators:

1. $W_1 = \max_i X_i$. Note: $W_1$ is biased, $E_\theta(W_1) = \frac{n}{n+1}\theta$.

2. $W_2 = \frac{n+1}{n}\max_i X_i$. Note: Unlike $W_1$, $W_2$ is unbiased
because
$E_\theta(W_2) = \frac{n+1}{n} E_\theta(W_1) = \frac{n+1}{n} \cdot \frac{n}{n+1}\theta = \theta$, so $E_\theta(W_2) - \theta = 0$.

3. $W_3 = 2\bar{X}$. Note: $W_3$ is unbiased, because
$E_\theta[X] = \int_0^\theta x \cdot \frac{1}{\theta}\, dx = \frac{x^2}{2\theta}\Big|_0^\theta = \frac{\theta}{2}$
so
$E_\theta[W_3] = 2 E_\theta[\bar{X}] = 2 \cdot \frac{\theta}{2} = \theta$.
Comparison of three estimators for uniform example using
mean squared error criterion
1. $W_1 = \max_i X_i$

The sampling distribution of $W_1$ is
$P(W_1 \le w_1) = P(X_1 \le w_1, \ldots, X_n \le w_1) = \left(\frac{w_1}{\theta}\right)^n$
$f_{W_1}(w_1) = \begin{cases} \frac{n w_1^{n-1}}{\theta^n} & 0 \le w_1 \le \theta \\ 0 & \text{elsewhere} \end{cases}$
and
$E_\theta[W_1] = \int_0^\theta w_1 f_{W_1}(w_1)\, dw_1 = \int_0^\theta w_1 \frac{n w_1^{n-1}}{\theta^n}\, dw_1 = \frac{n w_1^{n+1}}{(n+1)\theta^n}\Big|_0^\theta = \frac{n}{n+1}\theta$
$\mathrm{Bias}(W_1) = E_\theta[W_1] - \theta = -\frac{1}{n+1}\theta$

To calculate $\mathrm{Var}(W_1)$, we calculate $E(W_1^2)$ and use the
formula $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$:
$E(W_1^2) = \int_0^\theta w_1^2 f_{W_1}(w_1)\, dw_1 = \int_0^\theta w_1^2 \frac{n w_1^{n-1}}{\theta^n}\, dw_1 = \frac{n w_1^{n+2}}{(n+2)\theta^n}\Big|_0^\theta = \frac{n}{n+2}\theta^2$
$\mathrm{Var}(W_1) = \frac{n}{n+2}\theta^2 - \left(\frac{n}{n+1}\right)^2 \theta^2 = \frac{n}{(n+2)(n+1)^2}\theta^2$

Thus,
$\mathrm{MSE}(W_1) = \{\mathrm{Bias}(W_1)\}^2 + \mathrm{Var}(W_1) = \frac{\theta^2}{(n+1)^2} + \frac{n\theta^2}{(n+2)(n+1)^2} = \frac{2\theta^2}{(n+1)(n+2)}$.
2. $W_2 = \frac{n+1}{n}\max_i X_i$

Note $W_2 = \frac{n+1}{n} W_1$. Thus,
$E_\theta(W_2) = \frac{n+1}{n} E_\theta(W_1) = \frac{n+1}{n}\cdot\frac{n}{n+1}\theta = \theta$,
so $\mathrm{Bias}(W_2) = 0$, and
$\mathrm{Var}(W_2) = \mathrm{Var}\!\left(\frac{n+1}{n} W_1\right) = \left(\frac{n+1}{n}\right)^2 \mathrm{Var}(W_1) = \left(\frac{n+1}{n}\right)^2 \frac{n}{(n+2)(n+1)^2}\theta^2 = \frac{1}{n(n+2)}\theta^2$

Because $W_2$ is unbiased,
$\mathrm{MSE}(W_2) = \mathrm{Var}(W_2) = \frac{1}{n(n+2)}\theta^2$.
3. $W_3 = 2\bar{X}$

To find the mean squared error, we use the fact that if
$X_1, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then
$\bar{X} = \frac{X_1 + \cdots + X_n}{n}$ has mean $\mu$ and variance $\sigma^2/n$.

We have
$E_\theta(X) = \int_0^\theta x\,\frac{1}{\theta}\,dx = \frac{x^2}{2\theta}\Big|_0^\theta = \frac{\theta}{2}$
$E_\theta(X^2) = \int_0^\theta x^2\,\frac{1}{\theta}\,dx = \frac{x^3}{3\theta}\Big|_0^\theta = \frac{\theta^2}{3}$
$\mathrm{Var}(X) = \frac{\theta^2}{3} - \left(\frac{\theta}{2}\right)^2 = \frac{\theta^2}{12}$

Thus, $E(\bar{X}) = \frac{\theta}{2}$, $\mathrm{Var}(\bar{X}) = \frac{\theta^2}{12n}$, and
$E(W_3) = 2E(\bar{X}) = \theta$ and $\mathrm{Var}(W_3) = 4\,\mathrm{Var}(\bar{X}) = \frac{\theta^2}{3n}$.

$W_3$ is unbiased and has mean squared error $\frac{\theta^2}{3n}$.
The mean squared errors of the three estimators are the following:

  Estimator   MSE
  $W_1$       $\frac{2\theta^2}{(n+1)(n+2)}$
  $W_2$       $\frac{\theta^2}{n(n+2)}$
  $W_3$       $\frac{\theta^2}{3n}$

For $n = 1$, the three estimators have the same MSE.

For $n = 2$, $\frac{1}{n(n+2)}\theta^2 < \frac{2}{(n+1)(n+2)}\theta^2 = \frac{1}{3n}\theta^2$.

For $n > 2$, $\frac{1}{n(n+2)}\theta^2 < \frac{2}{(n+1)(n+2)}\theta^2 < \frac{1}{3n}\theta^2$.
So W2 is best, W1 is second best and W3 is the worst.
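The analytic comparison can be verified by simulation; a minimal sketch (with arbitrary illustrative values of θ and n) estimates the MSE of W1, W2, W3 by Monte Carlo and compares them with the formulas above:

```python
# Monte Carlo comparison of W1 = max X_i, W2 = (n+1)/n * max X_i, W3 = 2*Xbar
# for Uniform(0, theta) data; theta and n are arbitrary illustrative values.
import random

random.seed(0)
theta, n, reps = 5.0, 10, 100_000

mse = {"W1": 0.0, "W2": 0.0, "W3": 0.0}
for _ in range(reps):
    x = [random.uniform(0, theta) for _ in range(n)]
    w1 = max(x)
    w2 = (n + 1) / n * w1
    w3 = 2 * sum(x) / n
    mse["W1"] += (w1 - theta) ** 2
    mse["W2"] += (w2 - theta) ** 2
    mse["W3"] += (w3 - theta) ** 2
mse = {k: v / reps for k, v in mse.items()}

analytic = {
    "W1": 2 * theta**2 / ((n + 1) * (n + 2)),
    "W2": theta**2 / (n * (n + 2)),
    "W3": theta**2 / (3 * n),
}
print(mse)       # simulated mean squared errors
print(analytic)  # analytic formulas from the comparison above
```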
III. Admissibility/Inadmissibility of Decision Procedures
A decision procedure $\delta$ is inadmissible if there exists
another decision procedure $\delta'$ such that
$R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta \in \Theta$ and $R(\theta, \delta') < R(\theta, \delta)$ for at
least one $\theta \in \Theta$. The decision procedure $\delta'$ is said to
dominate $\delta$; there is no justification for using $\delta$ rather than
$\delta'$.

In Example 3, $W_1$ and $W_3$ are inadmissible point estimators
under squared error loss for $n > 1$.

A decision procedure $\delta$ is admissible if it is not
inadmissible, i.e., if there does not exist a decision
procedure $\delta'$ such that $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta \in \Theta$ and
$R(\theta, \delta') < R(\theta, \delta)$ for at least one $\theta \in \Theta$.
IV. Selection of a decision procedure:
We would like to choose a decision procedure which has a
“good” risk function.
Ideal: We would like to construct a decision procedure that
is at least as good as all other decision procedures for all
$\theta \in \Theta$, i.e., a $\delta(x)$ such that $R(\theta, \delta) \le R(\theta, \delta')$ for all
$\theta \in \Theta$ and all other decision procedures $\delta'$.
This is generally impossible!
Example 2: For $X_1, \ldots, X_n$ iid $N(\mu, 1)$, $\delta(X) = 1$ is an
admissible point estimator of $\mu$ for squared error loss.

Proof: Suppose $\delta(X) = 1$ is inadmissible. Then there exists
a decision procedure $\delta'$ that dominates $\delta$. This implies
that $R(1, \delta') \le R(1, \delta) = 0$.
Hence, $0 = R(1, \delta') = E_{\mu=1}[(\delta'(X_1, \ldots, X_n) - 1)^2]$. Since
$(\delta'(x_1, \ldots, x_n) - 1)^2$ is nonnegative, this implies
$P_{\mu=1}[(\delta'(X_1, \ldots, X_n) - 1) = 0] = 1$.

Let $B$ be the event that $(\delta'(X_1, \ldots, X_n) - 1) \neq 0$. We will
show that $P_\mu(B) = 0$ for all $\mu \in (-\infty, \infty)$. This means that
$\delta'(X_1, \ldots, X_n) = 1$ with probability 1 for all $\mu \in (-\infty, \infty)$,
which means that $R(\mu, \delta) = R(\mu, \delta')$ for all $\mu \in (-\infty, \infty)$;
this contradicts the assumption that $\delta'$ dominates $\delta$ and proves that $\delta(X) = 1$ is
admissible.
To show that $P_\mu(B) = 0$ for all $\mu \in (-\infty, \infty)$, we use the
importance sampling idea that the expectation of a random
variable $X$ under a density $f$ can be evaluated as the
expectation of the random variable $Xf(X)/g(X)$ under a
density $g$, as long as $f$ and $g$ have the same support:

$P_\mu(B) = \int \cdots \int I_B \, \frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{\sum_{i=1}^n (x_i - \mu)^2}{2} \right) dx_1 \cdots dx_n$

$= \int \cdots \int I_B \, \frac{\frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{\sum_{i=1}^n (x_i - \mu)^2}{2} \right)}{\frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{\sum_{i=1}^n (x_i - 1)^2}{2} \right)} \, \frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{\sum_{i=1}^n (x_i - 1)^2}{2} \right) dx_1 \cdots dx_n$

$= E_{\mu=1}\left[ I_B \, \frac{\frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{\sum_{i=1}^n (x_i - \mu)^2}{2} \right)}{\frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{\sum_{i=1}^n (x_i - 1)^2}{2} \right)} \right]$   (0.1)

Since $P_{\mu=1}(B) = 0$, the random variable
$I_B \, \frac{\frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{\sum_{i=1}^n (x_i - \mu)^2}{2} \right)}{\frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{\sum_{i=1}^n (x_i - 1)^2}{2} \right)}$
is zero with probability one under $\mu = 1$. Thus, by (0.1),
$P_\mu(B) = 0$ for all $\mu \in (-\infty, \infty)$.
■
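The importance sampling identity used in the proof can also be checked numerically. The sketch below (an illustration, not part of the notes) estimates $P_\mu(\bar{X} > 2)$ for $N(\mu,1)$ data in two ways: directly by sampling under $\mu$, and as an expectation under $\mu = 1$ of the indicator times the likelihood ratio; the event and parameter values are arbitrary choices:

```python
# Numerical check of the importance sampling identity:
# E_mu[h(X)] = E_1[h(X) * f_mu(X) / f_1(X)], here with h = indicator of an event B.
# The event and parameter values are arbitrary illustrative choices.
import math
import random

random.seed(0)
mu, n, reps = 1.5, 5, 200_000

def log_density(x, m):
    # log of the N(m, 1) joint density of the sample x (constants cancel in the ratio)
    return -0.5 * sum((xi - m) ** 2 for xi in x)

def indicator_B(x):
    # event B: the sample mean exceeds 2
    return 1.0 if sum(x) / len(x) > 2 else 0.0

direct, weighted = 0.0, 0.0
for _ in range(reps):
    x_mu = [random.gauss(mu, 1) for _ in range(n)]   # sample under mu
    x_1 = [random.gauss(1, 1) for _ in range(n)]     # sample under mu = 1
    direct += indicator_B(x_mu)
    weighted += indicator_B(x_1) * math.exp(log_density(x_1, mu) - log_density(x_1, 1))

print(direct / reps, weighted / reps)   # the two estimates should be close
```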
Comparison of risk under squared error loss for
$\delta_1(X) = 1$ and $\delta_2(X) = \bar{X}$:
$R(\mu, \delta_1) = E_\mu[(1 - \mu)^2] = (1 - \mu)^2$
$R(\mu, \delta_2) = E_\mu[(\bar{X} - \mu)^2] = \mathrm{Var}(\bar{X}) = \frac{1}{n}$

Although $\delta_1(X) = 1$ is admissible, it does not have good
risk properties for many values of $\mu$.
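A brief sketch comparing the two risk functions on a grid of $\mu$ values (n and the grid are illustrative choices); $\delta_1$ has smaller risk only when $|1 - \mu| < 1/\sqrt{n}$:

```python
# Compare R(mu, delta1) = (1 - mu)^2 and R(mu, delta2) = 1/n on a grid of mu values.
n = 25
for mu in [0.0, 0.5, 0.9, 1.0, 1.1, 1.5, 2.0]:
    risk1 = (1 - mu) ** 2   # risk of the constant estimator delta1(X) = 1
    risk2 = 1 / n           # risk of the sample mean delta2(X) = Xbar
    better = "delta1" if risk1 < risk2 else "delta2"
    print(f"mu={mu:4.1f}  R(mu,delta1)={risk1:5.2f}  R(mu,delta2)={risk2:5.2f}  better: {better}")
```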
Approaches to choosing a decision procedure with good
risk properties:
(1) Restrict the class of decision procedures and try to choose
an optimal procedure within this class, e.g., for point
estimation, we might only consider unbiased estimators
$\delta(X)$ of $q(\theta)$, i.e., estimators such that $E_\theta[\delta(X)] = q(\theta)$ for all $\theta \in \Theta$.
(2) Compare risk functions by a global criterion. We shall
discuss Bayes and minimax criteria.
V. Example 4 (Example 1.3.5 from Bickel and Doksum)
We are trying to decide whether to drill a location for oil.
There are two possible states of nature,
$\theta_1$ = location contains oil and $\theta_2$ = location doesn't contain
oil. We are considering three actions: $a_1$ = drill for oil,
$a_2$ = sell the location, or $a_3$ = sell partial rights to the location.
The following loss function is decided on:

                       $a_1$ (Drill)   $a_2$ (Sell)   $a_3$ (Partial rights)
  $\theta_1$ (Oil)           0              10               5
  $\theta_2$ (No oil)       12               1               6
An experiment is conducted to obtain information about $\theta$,
resulting in the random variable $X$ with possible values 0, 1
and frequency function $p(x, \theta)$ given by the following
table:

                         Rock formation $X$
                          $x = 0$   $x = 1$
  $\theta_1$ (Oil)          0.3       0.7
  $\theta_2$ (No oil)       0.6       0.4
X  1 represents the presence of a certain geological
formation that is more likely to be present when there is oil.
The possible nonrandomized decision procedures $\delta(x)$ are

  Rule      1      2      3      4      5      6      7      8      9
  x = 0    $a_1$  $a_1$  $a_1$  $a_2$  $a_2$  $a_2$  $a_3$  $a_3$  $a_3$
  x = 1    $a_1$  $a_2$  $a_3$  $a_1$  $a_2$  $a_3$  $a_1$  $a_2$  $a_3$
The risk of $\delta$ at $\theta$ is
$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))] = l(\theta, a_1) P_\theta[\delta(X) = a_1]$
$\quad + l(\theta, a_2) P_\theta[\delta(X) = a_2] + l(\theta, a_3) P_\theta[\delta(X) = a_3]$
The risk functions are

  Rule                     1     2     3     4     5     6     7     8     9
  $R(\theta_1, \delta)$    0     7     3.5   3     10    6.5   1.5   8.5   5
  $R(\theta_2, \delta)$    12    7.6   9.6   5.4   1     3     8.4   4     6
The decision rules 2, 3, 8 and 9 are inadmissible but the
decision rules 1, 4, 5, 6 and 7 are all admissible.
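The risk table and these admissibility claims can be reproduced with a short script written from the loss and frequency tables above (a sketch, with illustrative names):

```python
# Reproduce the risk table for the oil-drilling example and flag inadmissible rules.
from itertools import product

loss = {  # loss[theta][action]
    "oil":    {"a1": 0,  "a2": 10, "a3": 5},
    "no_oil": {"a1": 12, "a2": 1,  "a3": 6},
}
p_x = {  # p_x[theta][x], frequency function of X
    "oil":    {0: 0.3, 1: 0.7},
    "no_oil": {0: 0.6, 1: 0.4},
}

# Rules 1..9: (action taken when x = 0, action taken when x = 1)
rules = list(product(["a1", "a2", "a3"], repeat=2))

def risk(theta, rule):
    return sum(p_x[theta][x] * loss[theta][rule[x]] for x in (0, 1))

risks = [(risk("oil", r), risk("no_oil", r)) for r in rules]
for i, (r1, r2) in enumerate(risks, start=1):
    print(f"Rule {i}: R(theta1)={r1:4.1f}  R(theta2)={r2:4.1f}")

# A rule is inadmissible if some other rule is at least as good at both states
# of nature and strictly better at one.
inadmissible = [
    i + 1
    for i, r in enumerate(risks)
    if any(s[0] <= r[0] and s[1] <= r[1] and (s[0] < r[0] or s[1] < r[1]) for s in risks)
]
print("Inadmissible rules:", inadmissible)   # expected: [2, 3, 8, 9]
```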