Statistics 550 Notes 3

Reading: Section 1.3

Decision Theoretic Framework: a framework for evaluating and choosing statistical inference procedures.

I. Motivating Example

A cofferdam protecting a construction site was designed to withstand flows of up to 1870 cubic feet per second (cfs). An engineer wishes to estimate the probability that the dam will be overtopped during the upcoming year. Over the previous 25-year period, the annual maximum flood level exceeded 1870 cfs 5 times. The engineer models the data on whether the flood level exceeded 1870 cfs in each year as independent Bernoulli trials with the same probability $p$ that the flood level exceeds 1870 cfs in a given year.

Some possible estimates of $p$ based on iid Bernoulli trials $X_1, \ldots, X_n$:

(1) $\hat{p} = \frac{\sum_{i=1}^n X_i}{n}$, the sample proportion;

(2) $\hat{p} = \frac{\sum_{i=1}^n X_i + 1}{n+2}$, the posterior mean for a uniform prior on $p$;

(3) $\hat{p} = \frac{\sum_{i=1}^n X_i + 2}{n+4}$, the posterior mean for a Beta(2,2) prior on $p$ (called the Wilson estimate, recommended by Moore and McCabe, Introduction to the Practice of Statistics).

How should we decide which of these estimates to use? The answer depends in part on how errors in the estimation of $p$ affect us.

Example of a decision problem: The firm can spend $f(d)$ dollars to shore up the dam and prevent a proportion $d$ of the overflows that would have occurred without shoring up the dam. The cost of an overflow to the firm is $C$ dollars. The expected cost to the firm of a choice of $d$ is $f(d) + Cp(1-d)$. If the firm has a utility $U(c)$ for overall cost $c$, the expected utility of a choice of $d$ is
$$E[U(c)] = (1 - p + pd)\,U[-f(d)] + p(1-d)\,U[-f(d) - C].$$

The decision theoretic framework involves:
(1) clarifying the objectives of the study;
(2) pointing to what the different possible actions are;
(3) providing assessments of risk, accuracy, and reliability of statistical procedures;
(4) providing guidance in the choice of procedures for analyzing the outcomes of experiments.
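As a quick numerical illustration for the cofferdam data (5 exceedances in $n = 25$ years), a small Python sketch of the three estimates; the function name `estimates` is my own choice, not from the notes:

```python
def estimates(successes, n):
    """Return the three estimates of p discussed above."""
    mle = successes / n                        # (1) sample proportion
    uniform_post = (successes + 1) / (n + 2)   # (2) posterior mean, uniform prior
    wilson = (successes + 2) / (n + 4)         # (3) posterior mean, Beta(2,2) prior
    return mle, uniform_post, wilson

mle, unif, wilson = estimates(5, 25)
print(mle, unif, wilson)  # 0.2, ~0.2222, ~0.2414
```

Note that the three estimates already disagree noticeably at this sample size, which is exactly why a criterion for choosing among them is needed.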
II. Components of the Decision Theory Framework (Section 1.3.1)

We observe data $X$ from a distribution $P$, where we do not know the true $P$ but only know that $P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$ (the statistical model). The true parameter vector $\theta$ is sometimes called the "state of nature."

Action space: The action space $\mathcal{A}$ is the set of possible actions, decisions, or claims that we can contemplate making after observing the data $X$. For the example decision problem, the action space is the set of choices of $d$: $\mathcal{A} = [0,1]$.

Loss function: The loss function $l(\theta, a)$ is the loss incurred by taking the action $a$ when the true parameter vector is $\theta$. The loss function is assumed to be nonnegative; we want the loss to be small.

Relationship between the loss function and the utility function in economics: If the utility of taking the action $a$ when the true state of nature is $\theta$ is $U(\theta, a)$, then we can define the loss as
$$l(\theta, a) = \left[\max_{\theta', a'} U(\theta', a')\right] - U(\theta, a).$$
When there is uncertainty about the outcome of interest after taking the action (as in Example 1), we can replace the utility with the expected utility under the von Neumann-Morgenstern axioms for decision making under uncertainty (W. Nicholson, Microeconomic Theory, 6th ed., Ch. 12).

Commonly used loss functions for point estimation of a real-valued parameter $q(\theta)$: Denote our estimate of $q(\theta)$ by $a$. The most commonly used loss function is quadratic (squared error) loss: $l(\theta, a) = (q(\theta) - a)^2$. Other choices that are less computationally convenient but perhaps more realistic in penalizing large errors less severely are:

(1) absolute value loss, $l(\theta, a) = |q(\theta) - a|$;

(2) Huber's loss function,
$$l(\theta, a) = \begin{cases} (q(\theta) - a)^2 & \text{if } |q(\theta) - a| \le k \\ 2k|q(\theta) - a| - k^2 & \text{if } |q(\theta) - a| > k \end{cases}$$
for some constant $k$;

(3) the zero-one loss function,
$$l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k \end{cases}$$
for some constant $k$.

Decision procedures: A decision procedure or decision rule $\delta$ specifies how we use the data to choose an action $a$.
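The loss functions above can be written compactly as functions of the estimation error $e = q(\theta) - a$; a minimal sketch (function names are mine, not from the notes):

```python
def quadratic_loss(e):
    return e ** 2

def absolute_loss(e):
    return abs(e)

def huber_loss(e, k):
    # quadratic for small errors, linear beyond the threshold k
    # (continuous at |e| = k: both branches equal k^2 there)
    return e ** 2 if abs(e) <= k else 2 * k * abs(e) - k ** 2

def zero_one_loss(e, k):
    return 0 if abs(e) <= k else 1

print(huber_loss(0.5, 1), huber_loss(3, 1))  # 0.25 5
```

Comparing `huber_loss(3, 1) = 5` with `quadratic_loss(3) = 9` shows how the Huber loss grows only linearly for large errors.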
A decision procedure is a function $\delta(X)$ from the sample space of the experiment to the action space. For Example 1, decision procedures include
$$\delta(X) = \frac{\sum_{i=1}^n X_i}{n} \quad \text{and} \quad \delta(X) = \frac{\sum_{i=1}^n X_i + 1}{n+2}.$$

Risk function: The loss of a decision procedure $\delta$ will vary over repetitions of the experiment because the data $X$ from the experiment is random. The risk function $R(\theta, \delta)$ is the expected loss from using the decision procedure $\delta$ when the true parameter vector is $\theta$:
$$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))].$$

Example: For quadratic loss in point estimation of $q(\theta)$, the risk function is the mean squared error:
$$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))] = E_\theta[(q(\theta) - \delta(X))^2].$$
This mean squared error can be decomposed as bias squared plus variance.

Proposition 3.1:
$$E_\theta[(q(\theta) - \delta(X))^2] = (q(\theta) - E_\theta[\delta(X)])^2 + E_\theta\{(\delta(X) - E_\theta[\delta(X)])^2\}.$$

Proof: We have
$$E_\theta[(q(\theta) - \delta(X))^2] = E_\theta[(\{q(\theta) - E_\theta[\delta(X)]\} + \{E_\theta[\delta(X)] - \delta(X)\})^2]$$
$$= (q(\theta) - E_\theta[\delta(X)])^2 + 2\{q(\theta) - E_\theta[\delta(X)]\}\,E_\theta\{E_\theta[\delta(X)] - \delta(X)\} + E_\theta[\{E_\theta[\delta(X)] - \delta(X)\}^2]$$
$$= (q(\theta) - E_\theta[\delta(X)])^2 + E_\theta[\{E_\theta[\delta(X)] - \delta(X)\}^2]$$
(the cross term vanishes because $E_\theta\{E_\theta[\delta(X)] - \delta(X)\} = 0$)
$$= \{\text{Bias}[\delta(X)]\}^2 + \text{Variance}[\delta(X)]. \quad \blacksquare$$

Example 3: Suppose that an iid sample $X_1, \ldots, X_n$ is drawn from the uniform distribution on $[0, \theta]$, where $\theta$ is an unknown parameter; the density of $X_i$ is
$$f_X(x; \theta) = \begin{cases} 1/\theta & 0 < x < \theta \\ 0 & \text{elsewhere.} \end{cases}$$

Several point estimators:

1. $W_1 = \max_i X_i$. Note: $W_1$ is biased, $E_\theta(W_1) = \frac{n}{n+1}\theta$.

2. $W_2 = \frac{n+1}{n}\max_i X_i$. Note: Unlike $W_1$, $W_2$ is unbiased because $E_\theta(W_2) = \frac{n+1}{n}E_\theta(W_1) = \frac{n+1}{n}\cdot\frac{n}{n+1}\theta = \theta$.

3. $W_3 = 2\bar{X}$. Note: $W_3$ is unbiased, because $E_\theta[X] = \int_0^\theta x\,\frac{1}{\theta}\,dx = \frac{\theta}{2}$, so $E_\theta[W_3] = 2E_\theta[\bar{X}] = 2\cdot\frac{\theta}{2} = \theta$.

Comparison of the three estimators for the uniform example using the mean squared error criterion:

1. $W_1 = \max_i X_i$. The sampling distribution of $W_1$ is given by
$$P(W_1 \le w_1) = P(X_1 \le w_1, \ldots, X_n \le w_1) = \left(\frac{w_1}{\theta}\right)^n,$$
so
$$f_{W_1}(w_1) = \begin{cases} n w_1^{n-1}/\theta^n & 0 \le w_1 \le \theta \\ 0 & \text{elsewhere} \end{cases}$$
and
$$E_\theta[W_1] = \int_0^\theta w_1 f_{W_1}(w_1)\,dw_1 = \int_0^\theta w_1 \frac{n w_1^{n-1}}{\theta^n}\,dw_1 = \left.\frac{n w_1^{n+1}}{(n+1)\theta^n}\right|_0^\theta = \frac{n}{n+1}\theta.$$
To calculate $\text{Var}(W_1)$, we calculate $E(W_1^2)$ and use the formula $\text{Var}(X) = E(X^2) - [E(X)]^2$.
$$\text{Bias}(W_1) = E_\theta[W_1] - \theta = \frac{n}{n+1}\theta - \theta = -\frac{\theta}{n+1}.$$
$$E_\theta(W_1^2) = \int_0^\theta w_1^2 f_{W_1}(w_1)\,dw_1 = \int_0^\theta w_1^2 \frac{n w_1^{n-1}}{\theta^n}\,dw_1 = \left.\frac{n w_1^{n+2}}{(n+2)\theta^n}\right|_0^\theta = \frac{n}{n+2}\theta^2.$$
$$\text{Var}(W_1) = \frac{n}{n+2}\theta^2 - \left(\frac{n}{n+1}\theta\right)^2 = \frac{n}{(n+2)(n+1)^2}\theta^2.$$
Thus,
$$\text{MSE}(W_1) = \{\text{Bias}(W_1)\}^2 + \text{Var}(W_1) = \frac{\theta^2}{(n+1)^2} + \frac{n\theta^2}{(n+2)(n+1)^2} = \frac{2\theta^2}{(n+1)(n+2)}.$$

2. $W_2 = \frac{n+1}{n}\max_i X_i = \frac{n+1}{n}W_1$. Note that $E_\theta(W_2) = \frac{n+1}{n}E_\theta(W_1) = \theta$. Thus $\text{Bias}(W_2) = 0$ and
$$\text{Var}(W_2) = \left(\frac{n+1}{n}\right)^2 \text{Var}(W_1) = \frac{(n+1)^2}{n^2}\cdot\frac{n\theta^2}{(n+2)(n+1)^2} = \frac{\theta^2}{n(n+2)}.$$
Because $W_2$ is unbiased,
$$\text{MSE}(W_2) = \text{Var}(W_2) = \frac{\theta^2}{n(n+2)}.$$

3. $W_3 = 2\bar{X}$. To find the mean squared error, we use the fact that if $X_1, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$ has mean $\mu$ and variance $\sigma^2/n$. We have
$$E(X) = \int_0^\theta x\,\frac{1}{\theta}\,dx = \frac{\theta}{2}, \qquad E(X^2) = \int_0^\theta x^2\,\frac{1}{\theta}\,dx = \frac{\theta^2}{3},$$
$$\text{Var}(X) = \frac{\theta^2}{3} - \frac{\theta^2}{4} = \frac{\theta^2}{12}.$$
Thus $E(\bar{X}) = \theta/2$ and $\text{Var}(\bar{X}) = \theta^2/(12n)$, so $E(W_3) = 2E(\bar{X}) = \theta$ and $\text{Var}(W_3) = 4\text{Var}(\bar{X}) = \theta^2/(3n)$. $W_3$ is unbiased and has mean squared error $\theta^2/(3n)$.

The mean squared errors of the three estimators are the following:

Estimator   MSE
$W_1$       $\frac{2\theta^2}{(n+1)(n+2)}$
$W_2$       $\frac{\theta^2}{n(n+2)}$
$W_3$       $\frac{\theta^2}{3n}$

For $n = 1$, the three estimators have the same MSE.
For $n = 2$, $\frac{\theta^2}{n(n+2)} < \frac{2\theta^2}{(n+1)(n+2)} = \frac{\theta^2}{3n}$.
For $n > 2$, $\frac{\theta^2}{n(n+2)} < \frac{2\theta^2}{(n+1)(n+2)} < \frac{\theta^2}{3n}$.
So $W_2$ is best, $W_1$ is second best, and $W_3$ is worst.

III. Admissibility/Inadmissibility of Decision Procedures

A decision procedure $\delta$ is inadmissible if there exists another decision procedure $\delta'$ such that $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta$ and $R(\theta, \delta') < R(\theta, \delta)$ for at least one $\theta$. The decision procedure $\delta'$ is said to dominate $\delta$; there is no justification for using $\delta$ rather than $\delta'$.

In Example 3, $W_1$ and $W_3$ are inadmissible point estimators under squared error loss for $n > 1$.

A decision procedure $\delta$ is admissible if it is not inadmissible, i.e., if there does not exist a decision procedure $\delta'$ such that $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta$ and $R(\theta, \delta') < R(\theta, \delta)$ for at least one $\theta$.

IV. Selection of a decision procedure

We would like to choose a decision procedure which has a "good" risk function.
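The closed-form MSEs for $W_1$, $W_2$, $W_3$ in Example 3 can be checked by Monte Carlo simulation. A sketch, where $\theta = 1$, $n = 5$, the seed, and the number of replications are arbitrary choices of mine:

```python
import random

def mse_simulation(theta=1.0, n=5, reps=200_000, seed=0):
    # Monte Carlo MSEs of W1 = max X_i, W2 = (n+1)/n * max X_i, W3 = 2 * mean(X)
    rng = random.Random(seed)
    se1 = se2 = se3 = 0.0
    for _ in range(reps):
        x = [rng.uniform(0, theta) for _ in range(n)]
        w1 = max(x)
        w2 = (n + 1) / n * w1
        w3 = 2 * sum(x) / n
        se1 += (w1 - theta) ** 2
        se2 += (w2 - theta) ** 2
        se3 += (w3 - theta) ** 2
    return se1 / reps, se2 / reps, se3 / reps

theta, n = 1.0, 5
m1, m2, m3 = mse_simulation(theta, n)
print(m1, 2 * theta**2 / ((n + 1) * (n + 2)))  # both approx 0.0476
print(m2, theta**2 / (n * (n + 2)))            # both approx 0.0286
print(m3, theta**2 / (3 * n))                  # both approx 0.0667
```

The simulated values reproduce the ordering derived above: $W_2$ beats $W_1$, which beats $W_3$, for $n > 2$.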
Ideal: We would like to construct a decision procedure $\delta$ that is at least as good as every other decision procedure for all $\theta$, i.e., $\delta(x)$ such that $R(\theta, \delta) \le R(\theta, \delta')$ for all $\theta$ and all other decision procedures $\delta'$. This is generally impossible!

Example 2: For $X_1, \ldots, X_n$ iid $N(\theta, 1)$, $\delta(X) \equiv 1$ is an admissible point estimator of $\theta$ for squared error loss.

Proof: Suppose $\delta(X) \equiv 1$ is inadmissible. Then there exists a decision procedure $\delta'$ that dominates $\delta$. This implies that
$$R(1, \delta') \le R(1, \delta) = 0.$$
Hence $0 = R(1, \delta') = E_{\theta=1}[(\delta'(X_1, \ldots, X_n) - 1)^2]$. Since $(\delta'(x_1, \ldots, x_n) - 1)^2$ is nonnegative, this implies
$$P_{\theta=1}[(\delta'(X_1, \ldots, X_n) - 1)^2 = 0] = 1.$$
Let $B$ be the event that $(\delta'(X_1, \ldots, X_n) - 1)^2 > 0$. We will show that $P_\theta(B) = 0$ for all $\theta \in (-\infty, \infty)$. This means that $\delta'(X_1, \ldots, X_n) = 1$ with probability 1 for all $\theta$, which means that $R(\theta, \delta) = R(\theta, \delta')$ for all $\theta$; this contradicts the assumption that $\delta'$ dominates $\delta$ and proves that $\delta(X) \equiv 1$ is admissible.

To show that $P_\theta(B) = 0$ for all $\theta$, we use the importance sampling idea: the expectation of a random variable $X$ under a density $f$ can be evaluated as the expectation of the random variable $X f(X)/g(X)$ under a density $g$, as long as $f$ and $g$ have the same support:
$$P_\theta(B) = \int I_B \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^n (x_i - \theta)^2\right) dx_1 \cdots dx_n$$
$$= \int I_B \frac{\exp\left(-\frac{1}{2}\sum_{i=1}^n (x_i - \theta)^2\right)}{\exp\left(-\frac{1}{2}\sum_{i=1}^n (x_i - 1)^2\right)} \cdot \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^n (x_i - 1)^2\right) dx_1 \cdots dx_n$$
$$= E_{\theta=1}\left[ I_B \frac{\exp\left(-\frac{1}{2}\sum_{i=1}^n (X_i - \theta)^2\right)}{\exp\left(-\frac{1}{2}\sum_{i=1}^n (X_i - 1)^2\right)} \right] \qquad (0.1)$$
Since $P_{\theta=1}(B) = 0$, the random variable
$$I_B \frac{\exp\left(-\frac{1}{2}\sum_{i=1}^n (X_i - \theta)^2\right)}{\exp\left(-\frac{1}{2}\sum_{i=1}^n (X_i - 1)^2\right)}$$
is zero with probability one under $\theta = 1$. Thus, by (0.1), $P_\theta(B) = 0$ for all $\theta \in (-\infty, \infty)$. $\blacksquare$

Comparison of risk under squared error loss for $\delta_1(X) \equiv 1$ and $\delta_2(X) = \bar{X}$:
$$R(\theta, \delta_1) = E_\theta[(1 - \theta)^2] = (1 - \theta)^2,$$
$$R(\theta, \delta_2) = E_\theta[(\bar{X} - \theta)^2] = \text{Var}(\bar{X}) = \frac{1}{n}.$$
Although $\delta_1(X) \equiv 1$ is admissible, it does not have good risk properties for many values of $\theta$.
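The risk comparison between the constant estimator and the sample mean can be made concrete: $\delta_1 \equiv 1$ beats $\bar{X}$ only when $\theta$ is within $1/\sqrt{n}$ of 1. A sketch (the grid of $\theta$ values and $n = 25$ are my own choices):

```python
# R(theta, delta_1) = (1 - theta)^2 for the constant estimator delta_1(X) = 1,
# R(theta, delta_2) = 1/n for the sample mean delta_2(X) = Xbar.
def risk_constant(theta):
    return (1 - theta) ** 2

def risk_mean(n):
    return 1.0 / n

n = 25
for theta in [0.0, 0.9, 1.0, 1.1, 2.0]:
    winner = "delta_1" if risk_constant(theta) < risk_mean(n) else "delta_2"
    print(theta, risk_constant(theta), risk_mean(n), winner)
# delta_1 wins only when |theta - 1| < 1/sqrt(n) = 0.2
```

This illustrates the point above: admissibility alone does not make $\delta_1$ a sensible estimator, since it is better than $\bar{X}$ only on a shrinking neighborhood of $\theta = 1$.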
Approaches to choosing a decision procedure with good risk properties:

(1) Restrict the class of decision procedures and try to choose an optimal procedure within this class; e.g., for point estimation, we might only consider unbiased estimators $\delta(x)$ of $q(\theta)$, i.e., those with $E_\theta[\delta(X)] = q(\theta)$ for all $\theta$.

(2) Compare risk functions by a global criterion. We shall discuss the Bayes and minimax criteria.

V. Example (Example 1.3.5 from Bickel and Doksum)

We are trying to decide whether to drill a location for oil. There are two possible states of nature: $\theta_1$ = the location contains oil and $\theta_2$ = the location doesn't contain oil. We are considering three actions: $a_1$ = drill for oil, $a_2$ = sell the location, and $a_3$ = sell partial rights to the location. The following loss function is decided on:

                     $a_1$ (Drill)   $a_2$ (Sell)   $a_3$ (Partial rights)
$\theta_1$ (Oil)          0              10               5
$\theta_2$ (No oil)      12               1               6

An experiment is conducted to obtain information about $\theta$, resulting in the random variable $X$ with possible values 0, 1 and frequency function $p(x, \theta)$ given by the following table:

                     $x = 0$   $x = 1$
$\theta_1$ (Oil)       0.3       0.7
$\theta_2$ (No oil)    0.6       0.4

$X = 1$ represents the presence of a certain geological (rock) formation that is more likely to be present when there is oil.

The possible nonrandomized decision procedures $\delta(x)$ are:

Rule    1    2    3    4    5    6    7    8    9
x=0    a1   a1   a1   a2   a2   a2   a3   a3   a3
x=1    a1   a2   a3   a1   a2   a3   a1   a2   a3

The risk of $\delta$ at $\theta$ is
$$R(\theta, \delta) = E_\theta[l(\theta, \delta(X))] = l(\theta, a_1)P_\theta[\delta(X) = a_1] + l(\theta, a_2)P_\theta[\delta(X) = a_2] + l(\theta, a_3)P_\theta[\delta(X) = a_3].$$

The risk functions are:

Rule                      1     2     3     4     5     6     7     8     9
$R(\theta_1, \delta)$     0     7    3.5    3    10    6.5   1.5   8.5    5
$R(\theta_2, \delta)$    12    7.6   9.6   5.4    1     3    8.4    4     6

The decision rules 2, 3, 8, and 9 are inadmissible, but the decision rules 1, 4, 5, 6, and 7 are all admissible.
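The risk table and the admissibility claims for the oil-drilling example can be reproduced mechanically. A sketch; the labels `'oil'`/`'dry'` and the dictionary encoding of the two tables are my own:

```python
from itertools import product

# Losses l(theta, a) and frequency function p(x, theta) from the tables above.
loss = {('oil', 'a1'): 0, ('oil', 'a2'): 10, ('oil', 'a3'): 5,
        ('dry', 'a1'): 12, ('dry', 'a2'): 1, ('dry', 'a3'): 6}
p_x = {('oil', 0): 0.3, ('oil', 1): 0.7, ('dry', 0): 0.6, ('dry', 1): 0.4}

# Rule i assigns rules[i][0] to x = 0 and rules[i][1] to x = 1
# (the enumeration order matches rules 1..9 in the table above).
rules = list(product(['a1', 'a2', 'a3'], repeat=2))

def risk(theta, rule):
    return sum(p_x[(theta, x)] * loss[(theta, rule[x])] for x in (0, 1))

risks = [(risk('oil', r), risk('dry', r)) for r in rules]

def dominated(i):
    # Rule i is dominated if some rule is at least as good at both states
    # of nature and strictly better at one (i.e., not identical in risk).
    ri = risks[i]
    return any(rj[0] <= ri[0] and rj[1] <= ri[1] and rj != ri for rj in risks)

admissible = [i + 1 for i in range(9) if not dominated(i)]
print(admissible)  # [1, 4, 5, 6, 7]
```

Running this recovers the risk pairs in the table and confirms that rules 2, 3, 8, and 9 are each dominated (e.g., rule 4 dominates rules 2, 3, and 9, and rule 6 dominates rule 8).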