Statistics 550 Notes 16

Schedule:
1. I will e-mail Homework 7 to you tonight.
2. I will return your midterms on Tuesday.
3. Homework 6 is due on Friday.

Reading: Section 3.2

I. Computation of Bayes Procedures

Review

The Bayes risk of a decision procedure $\delta$ for a prior distribution $\pi(\theta)$, denoted by $r(\delta)$, is the expected value of the loss function over the joint distribution of $(X, \theta)$ (given the prior $\pi(\theta)$), which is the expected value of the risk function over the prior distribution of $\theta$:

$$r(\delta) = E_\theta\big[ E[\, l(\theta, \delta(X)) \mid \theta \,] \big] = E_\theta[ R(\theta, \delta) ].$$

The decision procedure which minimizes the Bayes risk for a prior $\pi(\theta)$ is called the Bayes rule (Bayes procedure) for the prior $\pi(\theta)$. For a person with prior distribution $\pi(\theta)$, the Bayes rule is the best procedure from this person's point of view.

The posterior risk of an action $a$ is the expected loss from taking action $a$ under the posterior distribution $p(\theta \mid x)$:

$$r(a \mid x) = E_{p(\theta \mid x)}[\, l(\theta, a) \,].$$

Proposition 3.2.1: A procedure $\delta^*(x)$ which takes an action that minimizes the posterior risk for each $x$ in the sample space is a Bayes procedure.

Note: To follow the Bayes procedure for the data $x$ that we actually obtain, we just need to find the action that minimizes the posterior risk $r(a \mid x)$, i.e., we do not need to find the entire Bayes procedure.

Bayes procedures for common loss functions:

Bayes decision procedures for point estimation of $g(\theta)$ for some common loss functions, using Proposition 3.2.1:

(1) Squared Error Loss ($l(\theta, a) = (g(\theta) - a)^2$): The action (point estimate) taken by the Bayes rule is the action that minimizes the posterior expected squared error loss:

$$\delta^*(x) = \arg\min_a E_{p(\theta \mid x)}[(a - g(\theta))^2].$$

By Lemma 1.4.1, $\delta^*(x)$ is the posterior mean $E[g(\theta) \mid x]$; when $g(\theta) = \theta$, this is the mean of the posterior distribution $p(\theta \mid x)$.

(2) Absolute Error Loss ($l(\theta, a) = |g(\theta) - a|$).
The action (point estimate) taken by the Bayes rule is the action that minimizes the posterior expected absolute loss:

$$\delta^*(x) = \arg\min_a E_{p(\theta \mid x)}[\, |a - g(\theta)| \,] \tag{1.1}$$

The minimizer of (1.1) is any median of the posterior distribution of $g(\theta)$ (of $p(\theta \mid x)$ itself when $g(\theta) = \theta$), so that a Bayes rule is to use any median of the posterior distribution.

Proof that the minimizer of $E[|X - a|]$ is a median of $X$: Let $X$ be a random variable and let the interval $m_0 \le m \le m_1$ be the medians of $X$, i.e., the values $m$ with

$$P(X \ge m) \ge \tfrac{1}{2}, \qquad P(X \le m) \ge \tfrac{1}{2}.$$

For $c > m_1$ and a continuous random variable,

$$E[|X - c|] - E[|X - m|] = \int_{x \le m} (c - m)\, dP(x) + \int_{m < x \le c} \big((c - x) - (x - m)\big)\, dP(x) + \int_{x > c} (m - c)\, dP(x)$$

$$= (c - m)\,[P(X \le m) - P(X > m)] + 2 \int_{m < x \le c} (c - x)\, dP(x) \ge 0.$$

The last inequality holds because $P(X \le m) \ge \tfrac{1}{2} \ge P(X > m)$ and $c - x \ge 0$ on the range of integration. A similar result holds for $c < m_0$. A similar argument holds for discrete random variables.

(3) Zero-one loss

$$l(\theta, a) = \begin{cases} 1 & \text{if } |a - g(\theta)| > c \\ 0 & \text{if } |a - g(\theta)| \le c \end{cases}$$

The Bayes rule is the midpoint of the interval of length $2c$ that maximizes the posterior probability that $g(\theta)$ belongs to the interval.

(4) Weighted squared error loss

$$l(\theta, a) = w(\theta)\,(g(\theta) - a)^2$$

The Bayes rule is

$$\delta^*(x) = \frac{E_{p(\theta \mid x)}[\, w(\theta)\, g(\theta) \,]}{E_{p(\theta \mid x)}[\, w(\theta) \,]}.$$

Example 1: Recall from Notes 2 (Chapter 1.2) that for $X_1, \ldots, X_n$ iid Bernoulli($p$) and a Beta($r$, $s$) prior for $p$, the posterior distribution for $p$ is Beta($r + \sum_{i=1}^n x_i$, $s + n - \sum_{i=1}^n x_i$). Thus, for squared error loss, the Bayes estimate of $p$ is the mean of Beta($r + \sum_{i=1}^n x_i$, $s + n - \sum_{i=1}^n x_i$), which equals

$$\frac{r + \sum_{i=1}^n x_i}{\left(r + \sum_{i=1}^n x_i\right) + \left(s + n - \sum_{i=1}^n x_i\right)} = \frac{r + \sum_{i=1}^n x_i}{r + s + n}.$$

For absolute error loss, the Bayes estimate of $p$ is the median of the Beta($r + \sum_{i=1}^n x_i$, $s + n - \sum_{i=1}^n x_i$) distribution, which does not have a closed form.

For n = 10, here are the Bayes estimators and the MLE for the Beta(1,1) = uniform prior.
$\sum_{i=1}^{10} X_i$    MLE       Bayes absolute error loss    Bayes squared error loss
0                        .0000     .0611                        .0833
1                        .1000     .1480                        .1667
2                        .2000     .2358                        .2500
3                        .3000     .3238                        .3333
4                        .4000     .4119                        .4167
5                        .5000     .5000                        .5000
6                        .6000     .5881                        .5833
7                        .7000     .6762                        .6667
8                        .8000     .7642                        .7500
9                        .9000     .8520                        .8333
10                       1.0000    .9389                        .9167

Example 2: Suppose $X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$, $\sigma^2$ known, and our prior on $\mu$ is $N(\eta, b^2)$. We showed in Notes 3 that the posterior distribution for $\mu$ is

$$N\left( \frac{\dfrac{n \bar{X}}{\sigma^2} + \dfrac{\eta}{b^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{b^2}},\ \frac{1}{\dfrac{n}{\sigma^2} + \dfrac{1}{b^2}} \right).$$

The mean and median of the posterior distribution are both equal to the first parameter above, so that the Bayes estimator for both squared error loss and absolute error loss is

$$\delta^*(X) = \frac{\dfrac{n \bar{X}}{\sigma^2} + \dfrac{\eta}{b^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{b^2}}.$$

Note on Bayes procedures and sufficiency: Suppose the prior distribution $\pi(\theta)$ has support on $\Theta$ and the family of distributions for the data $\{p(X \mid \theta), \theta \in \Theta\}$ has sufficient statistic $T(X)$. Then to find the Bayes procedure, we can reduce the data to $T(X)$. This is because the posterior distribution of $\theta \mid X$ is the same as the posterior distribution of $\theta \mid T(X)$, since

$$p(\theta \mid X) \propto p(X \mid \theta)\, \pi(\theta) = p(X \mid T(X), \theta)\, p(T(X) \mid \theta)\, \pi(\theta) \propto p(T(X) \mid \theta)\, \pi(\theta),$$

where the last step uses the sufficiency of $T(X)$: the factor $p(X \mid T(X), \theta)$ does not depend on $\theta$.

Computation of Bayes procedures for complex problems:

For nonconjugate priors, the posterior mean (which is the Bayes estimator under squared error loss) is not typically available in closed form.

Example: $X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$, $\sigma^2$ known, and our prior on $\mu$ is a logistic($a$, $b$) distribution:

$$\pi(\mu) = \frac{1}{b} \cdot \frac{\exp\{-(\mu - a)/b\}}{[1 + \exp\{-(\mu - a)/b\}]^2}.$$

The logistic distribution has heavier tails than the normal distribution. Since $\bar{X}$ is sufficient, we can just compute the posterior given $\bar{X}$. The posterior pdf is

$$p(\mu \mid \bar{X}) = \frac{p(\bar{X} \mid \mu)\, \pi(\mu)}{p(\bar{X})} = \frac{\dfrac{1}{\sqrt{2\pi \sigma^2/n}} \exp\left\{ -\dfrac{(\bar{X} - \mu)^2}{2\sigma^2/n} \right\} \dfrac{1}{b} \dfrac{e^{-(\mu - a)/b}}{[1 + e^{-(\mu - a)/b}]^2}}{\displaystyle\int_{-\infty}^{\infty} \dfrac{1}{\sqrt{2\pi \sigma^2/n}} \exp\left\{ -\dfrac{(\bar{X} - \mu)^2}{2\sigma^2/n} \right\} \dfrac{1}{b} \dfrac{e^{-(\mu - a)/b}}{[1 + e^{-(\mu - a)/b}]^2}\, d\mu}.$$

The numerator is not proportional to any commonly used density function, and the denominator cannot be evaluated in closed form.
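Even without a closed form, the posterior mean in the logistic-prior example is a ratio of two one-dimensional integrals, so it can be approximated numerically. The sketch below is only an illustration and is not a method from these notes: it uses a plain trapezoid rule on a grid centred at $\bar{X}$. The function names, the grid half-width (10 likelihood standard deviations) and the number of grid points are all assumptions chosen for illustration, and the grid would need widening if the prior were concentrated far from $\bar{X}$.

```python
import math


def logistic_pdf(mu, a, b):
    """Density of the logistic(a, b) prior at mu.

    The density is symmetric about a, so it is evaluated at |mu - a|
    to avoid overflow in exp for large negative arguments.
    """
    z = math.exp(-abs(mu - a) / b)
    return z / (b * (1.0 + z) ** 2)


def posterior_mean(xbar, n, sigma, a, b, half_width=10.0, num_points=20001):
    """Approximate E[mu | xbar] for a N(mu, sigma^2) sample of size n
    with a logistic(a, b) prior on mu, using the trapezoid rule.

    half_width and num_points are illustrative choices (assumptions),
    not values taken from the notes.
    """
    se = sigma / math.sqrt(n)              # standard deviation of xbar given mu
    lo = xbar - half_width * se            # left end of the integration grid
    step = 2.0 * half_width * se / (num_points - 1)
    numerator = 0.0                        # approximates integral of mu * lik * prior
    denominator = 0.0                      # approximates integral of lik * prior
    for i in range(num_points):
        mu = lo + i * step
        end_weight = 0.5 if i in (0, num_points - 1) else 1.0
        # Normalising constants cancel in the ratio, so they are omitted.
        kernel = math.exp(-0.5 * ((xbar - mu) / se) ** 2) * logistic_pdf(mu, a, b)
        numerator += end_weight * mu * kernel
        denominator += end_weight * kernel
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical inputs: xbar = 2 from n = 5 observations, sigma = 1,
    # and a logistic(0, 1) prior.
    print(posterior_mean(xbar=2.0, n=5, sigma=1.0, a=0.0, b=1.0))
```

Because the prior is symmetric and unimodal about $a$, the result should lie between $a$ and $\bar{X}$, which gives a quick sanity check on the output. Grid-based integration of this kind is practical only in low dimensions.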
Monte Carlo methods can be used to sample from the posterior distribution and approximate the Bayes estimator. For discussion of Monte Carlo methods for Bayesian inference, see Bayesian Data Analysis by Gelman, Carlin, Stern and Rubin; these methods are discussed in Statistics 540 (will become Stat 542 next year) taught by Prof. Shane Jensen.

II. Improper priors

Bayes estimators are defined with respect to a proper prior distribution $\pi(\theta)$. Often a weighting function $\pi(\theta)$ is considered that is not a probability distribution; this is called an improper prior.

Example 3: $X \sim N(\theta, 1)$, $\pi(\theta) = 1$.

We can still consider the "Bayes" risk of a decision procedure:

$$r(\delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta \tag{1.2}$$

An estimator $\delta^*(x)$ is called a generalized Bayes estimator with respect to a weighting function $\pi(\theta)$ (even if it is not a proper probability distribution) if it minimizes the "Bayes" risk (1.2) over all estimators.

We can write the "Bayes" risk (1.2) as

$$r(\delta) = \int \left[ \int l(\theta, \delta(X))\, p(X \mid \theta)\, \pi(\theta)\, d\theta \right] dX = \int \left[ \frac{\int l(\theta, \delta(X))\, p(X \mid \theta)\, \pi(\theta)\, d\theta}{\int p(X \mid \theta)\, \pi(\theta)\, d\theta} \right] \left( \int p(X \mid \theta)\, \pi(\theta)\, d\theta \right) dX$$

$$= \int q(X) \left[ \frac{\int l(\theta, \delta(X))\, p(X \mid \theta)\, \pi(\theta)\, d\theta}{\int p(X \mid \theta)\, \pi(\theta)\, d\theta} \right] dX, \qquad q(X) = \int p(X \mid \theta)\, \pi(\theta)\, d\theta.$$

A decision procedure $\delta(X)$ which minimizes

$$\frac{\int l(\theta, \delta(X))\, p(X \mid \theta)\, \pi(\theta)\, d\theta}{\int p(X \mid \theta)\, \pi(\theta)\, d\theta} \tag{1.3}$$

for each $X$ is a generalized Bayes estimator (this is the analogue of Proposition 3.2.1).

Sometimes

$$\frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta)\, \pi(\theta)\, d\theta} \tag{1.4}$$

is a proper probability distribution even if $\pi(\theta)$ is not a proper probability distribution, and then we can think of (1.4) as the "posterior" density function of $\theta \mid X$ and (1.3) as the "posterior" risk.

Example 3 continued: $X \sim N(\theta, 1)$, $\pi(\theta) = 1$. Here (1.4) equals

$$\frac{\dfrac{1}{\sqrt{2\pi}} \exp\left\{ -\dfrac{(\theta - X)^2}{2} \right\}}{\displaystyle\int_{-\infty}^{\infty} \dfrac{1}{\sqrt{2\pi}} \exp\left\{ -\dfrac{(\theta - X)^2}{2} \right\} d\theta} = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(\theta - X)^2}{2} \right\},$$

where the equality follows because the denominator on the left is just the integral of a normal density for $\theta$ with mean $X$, which equals 1. The function $\frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(\theta - X)^2}{2} \right\}$ is a normal density for $\theta$ with mean $X$ and variance 1, and consequently (1.3) is just the expected loss given $X$ with respect to a $N(X, 1)$ density for $\theta$.
Thus, the generalized Bayes estimator for squared error loss (and absolute error loss) is $X$.

Another useful variant of Bayes estimators is the limit of Bayes estimators. A nonrandomized estimator $\delta(x)$ is a limit of Bayes estimators if there exists a sequence of proper priors $\pi_v$ and Bayes estimators $\delta_v$ with respect to these prior distributions such that $\delta_v(x) \to \delta(x)$ for all $x$.

Example 3 continued: For next week's homework, I will ask you to show that $X$ is a limit of Bayes estimators.

From a Bayesian point of view, estimators that are limits of Bayes estimators are somewhat more desirable than generalized Bayes estimators. This is because, by construction, a limit of Bayes estimators must be close to a proper Bayes estimator. In contrast, a generalized Bayes estimator may not be close to any proper Bayes estimator (an example will be given for next week's homework).

Admissibility of Bayes rules:

Under mild conditions, Bayes rules are admissible.

Theorem: Suppose that $\Theta$ is an interval and $\delta^*$ is a Bayes rule with respect to a prior density function $\pi(\theta)$ such that $\pi(\theta) > 0$ for all $\theta \in \Theta$, and that $R(\theta, \delta)$ is a continuous function of $\theta$ for all $\delta$. Then $\delta^*$ is admissible.

Proof: The proof is by contradiction. Suppose that $\delta^*$ is inadmissible. There is then another estimate, $\delta$, such that $R(\theta, \delta^*) \ge R(\theta, \delta)$ for all $\theta$, with strict inequality for some $\theta$, say $\theta_0$. Since $R(\theta, \delta^*) - R(\theta, \delta)$ is a continuous function of $\theta$, there is an $\varepsilon > 0$ and an interval $\theta_0 \pm h$ such that

$$R(\theta, \delta^*) - R(\theta, \delta) > \varepsilon \quad \text{for } \theta_0 - h \le \theta \le \theta_0 + h.$$

Then,

$$\int \big[ R(\theta, \delta^*) - R(\theta, \delta) \big]\, \pi(\theta)\, d\theta \ge \int_{\theta_0 - h}^{\theta_0 + h} \big[ R(\theta, \delta^*) - R(\theta, \delta) \big]\, \pi(\theta)\, d\theta > \varepsilon \int_{\theta_0 - h}^{\theta_0 + h} \pi(\theta)\, d\theta > 0.$$

But this contradicts the fact that $\delta^*$ is a Bayes rule, because a Bayes rule has the property that

$$B(\delta^*) - B(\delta) = \int \big[ R(\theta, \delta^*) - R(\theta, \delta) \big]\, \pi(\theta)\, d\theta \le 0,$$

where $B(\cdot)$ denotes the Bayes risk (written $r(\cdot)$ above). The proof is complete.

The theorem can be regarded as both a positive and a negative result. It is positive in that it identifies a certain class of estimates as being admissible, in particular, any Bayes estimate.
It is negative in that there are apparently so many admissible estimates (one for every prior distribution that satisfies the hypotheses of the theorem), and some of these might make little sense (like $\delta(X) \equiv 3$ for the normal distribution above).

Complete class theorems characterize the class of all admissible estimators. Roughly, for most models the class of all admissible estimators is the class of all Bayes estimators and limits of Bayes estimators.
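As a closing check on Example 1, the three estimators compared there (MLE, posterior median, posterior mean) can be recomputed for the Beta(1,1) prior. The sketch below is an illustration under stated assumptions, not part of the original notes: the function names are hypothetical, and the posterior median is found by bisection on the Beta CDF, which for positive integer parameters can be written as a binomial tail sum. That identity is why the code only handles integer $r$ and $s$.

```python
import math


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x, valid for positive integers a and b.

    Uses the identity P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a).
    """
    m = a + b - 1
    return sum(math.comb(m, j) * x ** j * (1.0 - x) ** (m - j)
               for j in range(a, m + 1))


def beta_median_int(a, b, tol=1e-10):
    """Median of Beta(a, b) for positive integers a, b, by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf_int(mid, a, b) < 0.5:
            lo = mid      # median lies to the right of mid
        else:
            hi = mid      # median lies at or to the left of mid
    return 0.5 * (lo + hi)


def estimates(total, n, r=1, s=1):
    """Return (MLE, Bayes estimate under absolute error loss,
    Bayes estimate under squared error loss) for `total` successes
    in n Bernoulli trials with a Beta(r, s) prior (integer r, s)."""
    a, b = r + total, s + n - total       # posterior is Beta(a, b)
    return total / n, beta_median_int(a, b), a / (a + b)


if __name__ == "__main__":
    n = 10
    for total in range(n + 1):
        mle, med, mean = estimates(total, n)
        print(f"{total:2d}  {mle:.4f}  {med:.4f}  {mean:.4f}")
```

Both Bayes estimates are pulled toward 1/2 relative to the MLE, and the two coincide only at 5 successes, where the posterior is symmetric.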