Statistics 512 Notes 27: Bayesian Statistics

Views of Probability

"The probability that this coin will land heads up is 1/2."

Frequentist (sometimes called objectivist) viewpoint: This statement means that if the experiment were repeated many, many times, the long-run average proportion of heads would tend to 1/2.

Bayesian (sometimes called subjectivist or personal) viewpoint: This statement means that the person making the statement has a prior opinion about the coin toss such that he or she would as soon guess heads as tails if the rewards are equal.

In the frequentist viewpoint, the probability of an event $A$, $P(A)$, represents the long-run frequency of event $A$ in repeated experiments. In the Bayesian viewpoint, $P(A)$ has the following meaning: for a game in which the Bayesian will be paid \$1 if $A$ occurs, $P(A)$ is the amount of money the Bayesian would be willing to pay to buy into the game. Thus, if the Bayesian is willing to pay 50 cents to buy in, $P(A) = 0.5$. Note that this concept of probability is personal: $P(A)$ may vary from person to person depending on their opinions.

In the Bayesian viewpoint, we can make probability statements about many things, not just data that are subject to random variation. For example, I might say that "the probability that Franklin D. Roosevelt had a cup of coffee on February 21, 1935" is 0.68. This does not refer to any limiting frequency; it reflects my strength of belief that the proposition is true.

Rules for Manipulating Subjective Probabilities

All the usual rules for manipulating probabilities apply to subjective probabilities. For example:

Theorem 11.1: If $C_1$ and $C_2$ are mutually exclusive, then $P(C_1 \cup C_2) = P(C_1) + P(C_2)$.

Proof: Suppose a person thinks a fair price for $C_1$ is $p_1 = P(C_1)$ and that for $C_2$ is $p_2 = P(C_2)$, but believes that the fair price for $C_1 \cup C_2$ is some $p_3$ that differs from $p_1 + p_2$. Say $p_3 < p_1 + p_2$, and let the difference be $d = (p_1 + p_2) - p_3$.
A gambler offers this person the price $p_3 + d/4$ for $C_1 \cup C_2$. The person takes the offer because it is better than $p_3$. The gambler then sells $C_1$ to the person at the discount price $p_1 - d/4$ and sells $C_2$ at the discount price $p_2 - d/4$. Being a rational person with the given prices $p_1$, $p_2$, and $p_3$, the person finds all three of these deals satisfactory.

At this point, the person has received $p_3 + d/4$ and paid $p_1 + p_2 - d/2$. Thus, before any bets are paid off, the person has

$$p_3 + \frac{d}{4} - \left(p_1 + p_2 - \frac{d}{2}\right) = p_3 - p_1 - p_2 + \frac{3d}{4} = -\frac{d}{4}.$$

That is, the person is down $d/4$ before any bets are settled.

We now show that no matter what event happens, the person will pay and receive the same amount in settling the bets. Suppose $C_1$ happens: the gambler holds $C_1 \cup C_2$ and the person holds $C_1$, so they exchange \$1's and the person is still down $d/4$. The same thing occurs if $C_2$ happens. If neither $C_1$ nor $C_2$ happens, then the gambler and the person each receive zero, and the person is still down $d/4$. ($C_1$ and $C_2$ cannot occur together, since they are mutually exclusive.)

Thus, we see that it is bad for the person to assign $p_3 = P(C_1 \cup C_2) < p_1 + p_2 = P(C_1) + P(C_2)$, because the gambler can put the person in a position to lose $(p_1 + p_2 - p_3)/4$ no matter what happens. This is sometimes referred to as a Dutch book. The argument when $p_3 > p_1 + p_2$ is similar and also leads to a Dutch book. Thus $p_3$ must equal $p_1 + p_2$ to avoid a Dutch book; that is, $P(C_1 \cup C_2) = P(C_1) + P(C_2)$.

The Bayesian can also consider subjective conditional probabilities such as $P(C_1 \mid C_2)$, the fair price of $C_1$ payable only if $C_2$ is true; if $C_2$ is not true, the bet is off. Of course, $P(C_1 \mid C_2)$ could differ from $P(C_1)$. To illustrate, say $C_2$ is the event "it will rain today" and $C_1$ is the event "a certain person who will be outside on that day will catch a cold." Most of us would probably assign the fair prices so that $P(C_1) < P(C_1 \mid C_2)$; consequently, a person has a better chance of catching a cold on a rainy day.
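The bookkeeping in this proof is easy to check numerically. Below is a small sketch with hypothetical incoherent prices ($p_1 = 0.30$, $p_2 = 0.40$, $p_3 = 0.50$, so $d = 0.20$); the person ends up down $d/4$ under every outcome.

```python
# Hypothetical incoherent prices: p3 = P(C1 or C2) < p1 + p2.
p1, p2, p3 = 0.30, 0.40, 0.50
d = (p1 + p2) - p3                      # d = 0.20 > 0

# The gambler buys C1-or-C2 from the person at p3 + d/4 (better than p3
# from the person's point of view) and sells the person C1 and C2 at the
# discounted prices p1 - d/4 and p2 - d/4.
cash = (p3 + d / 4) - (p1 - d / 4) - (p2 - d / 4)    # person's net: -d/4

# Settle the bets under every possible outcome (C1, C2 are exclusive).
for outcome in ("C1", "C2", "neither"):
    collects = 1.0 if outcome != "neither" else 0.0  # person's C1 or C2 pays $1
    pays = 1.0 if outcome != "neither" else 0.0      # person owes $1 on C1-or-C2
    print(outcome, round(cash + collects - pays, 10))  # -0.05 every time
```

Whatever happens, the settlement payments cancel, so the person's loss is fixed at $d/4$ the moment the three trades are made.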
Frequentist vs. Bayesian statistics

The frequentist point of view toward statistics is based on the following postulates:

F1: Probability refers to limiting relative frequencies. Probabilities are objective properties of the real world.

F2: Parameters are fixed, unknown constants. Because they are not fluctuating, no useful probability statements can be made about parameters.

F3: Statistical procedures should be designed to have well-defined long-run frequency properties. For example, a 95 percent confidence interval should trap the true value of the parameter with limiting frequency at least 95 percent.

The Bayesian approach to statistics is based on the following postulates:

B1: Probability describes a person's degree of belief, not a limiting frequency.

B2: We can make probability statements about parameters that reflect our degree of belief about the parameters, even though the parameters are fixed constants.

B3: We can make inferences about a parameter $\theta$ by producing a probability distribution for $\theta$. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution.

Bayesian inference

Bayesian inference about a parameter $\theta$ is usually carried out in the following way:

1. We choose a probability density $\pi(\theta)$ -- called the prior distribution -- that expresses our beliefs about the parameter before we see any data.

2. We choose a probability model $f(x \mid \theta)$ that reflects our beliefs about $x$ given $\theta$.

3. After observing data $X_1, \ldots, X_n$, we update our beliefs and calculate the posterior distribution $h(\theta \mid X_1, \ldots, X_n)$.

As in our discussion of the Bayesian approach in decision theory, the posterior distribution is calculated using Bayes rule:

$$h(\theta \mid x) = \frac{f_{X,\Theta}(x, \theta)}{f_X(x)} = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,d\theta}.$$

Note that $h(\theta \mid x) \propto f(x \mid \theta)\,\pi(\theta)$ as $\theta$ varies, so the posterior distribution is proportional to the likelihood times the prior.
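The proportionality "posterior $\propto$ likelihood $\times$ prior" can be used directly for numerical work. The following sketch uses a hypothetical example, not from the notes: a Bernoulli parameter with a flat prior and assumed data of 7 heads in 10 tosses, with the posterior approximated on a grid.

```python
import numpy as np

# Hypothetical example: posterior for a Bernoulli success probability theta.
theta = np.linspace(0.001, 0.999, 999)            # grid over the parameter
prior = np.ones_like(theta)                       # flat prior pi(theta)
heads, n = 7, 10                                  # assumed data: 7 heads in 10 tosses
like = theta**heads * (1.0 - theta)**(n - heads)  # likelihood f(x | theta)

# Bayes rule: posterior proportional to likelihood times prior,
# normalized so it integrates to one over the grid.
unnorm = like * prior
dtheta = theta[1] - theta[0]
post = unnorm / (unnorm.sum() * dtheta)

post_mean = (theta * post).sum() * dtheta  # close to the Beta(8, 4) mean, 2/3
```

With a flat prior the exact posterior here is Beta(8, 4), so the grid approximation of the posterior mean lands near $8/12 \approx 0.667$; the same three steps (prior, model, normalize) work for any one-parameter model.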
Based on the posterior distribution, we can get a point estimate or an interval estimate and carry out hypothesis tests, as we shall discuss below.

Bayesian inference for the normal distribution

Suppose that we observe a single observation $x$ from a normal distribution with unknown mean $\theta$ and known variance $\sigma^2$. Suppose that our prior distribution for $\theta$ is $N(\mu_0, \sigma_0^2)$. The posterior distribution of $\theta$ is

$$h(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,d\theta} \propto f(x \mid \theta)\,\pi(\theta).$$

Now

$$f(x \mid \theta)\,\pi(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x-\theta)^2\right\} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left\{-\frac{1}{2\sigma_0^2}(\theta-\mu_0)^2\right\}$$

$$\propto \exp\left\{-\frac{1}{2}\left[\frac{(x-\theta)^2}{\sigma^2} + \frac{(\theta-\mu_0)^2}{\sigma_0^2}\right]\right\}$$

$$= \exp\left\{-\frac{1}{2}\left[\theta^2\left(\frac{1}{\sigma^2}+\frac{1}{\sigma_0^2}\right) - 2\theta\left(\frac{x}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right) + \left(\frac{x^2}{\sigma^2}+\frac{\mu_0^2}{\sigma_0^2}\right)\right]\right\}.$$

Let $a$, $b$, and $c$ be the coefficients in the quadratic polynomial in $\theta$ in the last expression:

$$a = \frac{1}{\sigma^2}+\frac{1}{\sigma_0^2}, \qquad b = \frac{x}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}, \qquad c = \frac{x^2}{\sigma^2}+\frac{\mu_0^2}{\sigma_0^2}.$$

The last expression may then be written as

$$\exp\left\{-\frac{a}{2}\left(\theta^2 - \frac{2b}{a}\theta + \frac{c}{a}\right)\right\}.$$

To simplify this further, we use the technique of completing the square and rewrite the expression as

$$\exp\left\{-\frac{a}{2}\left(\theta - \frac{b}{a}\right)^2\right\}\exp\left\{-\frac{a}{2}\left(\frac{c}{a} - \frac{b^2}{a^2}\right)\right\}.$$

The second factor does not depend on $\theta$, and we thus have that

$$h(\theta \mid x) \propto \exp\left\{-\frac{a}{2}\left(\theta - \frac{b}{a}\right)^2\right\}.$$

This is the density of a normal random variable with mean $b/a$ and variance $1/a$. Thus, the posterior distribution of $\theta$ is normal with mean

$$\frac{\dfrac{x}{\sigma^2}+\dfrac{\mu_0}{\sigma_0^2}}{\dfrac{1}{\sigma^2}+\dfrac{1}{\sigma_0^2}}$$

and variance

$$\frac{1}{\dfrac{1}{\sigma^2}+\dfrac{1}{\sigma_0^2}}.$$

Comments about the role of the prior in the posterior distribution: The posterior mean is a weighted average of the prior mean and the data, with weights proportional to the respective precisions of the prior and the data, where the precision is equal to 1/variance. If we assume that the experiment (the observation of $X$) is much more informative than the prior distribution, in the sense that $\sigma_0^2 \gg \sigma^2$, then

$$\frac{1}{a} \approx \sigma^2 \qquad\text{and}\qquad \frac{b}{a} \approx x.$$

Thus, the posterior distribution of $\theta$ is nearly normal with mean $x$ and variance $\sigma^2$. This result illustrates that if the prior distribution is quite flat relative to the likelihood, then

1. the prior distribution has little influence on the posterior;
2. the posterior distribution is approximately proportional to the likelihood function.
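This conjugate update takes only a few lines to compute. The numbers below ($\sigma^2 = 4$, $\mu_0 = 0$, $\sigma_0^2 = 100$, $x = 3$) are assumptions for illustration; because the prior is relatively flat, the posterior mean is pulled almost all the way to $x$ and the posterior variance is close to $\sigma^2$.

```python
# Assumed numbers: known sampling variance, fairly flat normal prior.
sigma2 = 4.0                  # sigma^2, variance of x given theta
mu0, sigma0_2 = 0.0, 100.0    # prior mean mu_0 and prior variance sigma_0^2
x = 3.0                       # the single observation

a = 1 / sigma2 + 1 / sigma0_2                  # posterior precision
post_var = 1 / a                               # posterior variance 1/a
post_mean = (x / sigma2 + mu0 / sigma0_2) / a  # precision-weighted average b/a

print(post_mean, post_var)  # about 2.885 and 3.846: close to x and sigma^2
```

Shrinking $\sigma_0^2$ toward $\sigma^2$ in this sketch pulls the posterior mean back toward $\mu_0$, which is the precision-weighting story in miniature.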
On a heuristic level, the first point says that if one does not have strong prior opinions, one's posterior opinion is mainly determined by the data one observes. Such a prior distribution is often called a vague or noninformative prior.

Bayesian decision making

When faced with a decision, the Bayesian wants to minimize the expected loss (i.e., maximize the expected utility) of a decision rule under the prior distribution $\pi(\theta)$ for $\theta$. In other words, the Bayesian chooses the decision rule $d$ that minimizes the Bayes risk

$$B(d) = E_{\pi(\theta)}[R(\theta, d)];$$

that is, the Bayesian chooses to use the Bayes rule for the Bayesian's prior distribution $\pi(\theta)$.

As we showed in the last class, for point estimation with squared error loss, the Bayes rule is to use the posterior mean as the estimate. Thus, for the normal distribution setup above, the Bayesian's estimate of $\theta$ is

$$\frac{\dfrac{x}{\sigma^2}+\dfrac{\mu_0}{\sigma_0^2}}{\dfrac{1}{\sigma^2}+\dfrac{1}{\sigma_0^2}}.$$
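The optimality of the posterior mean under squared error loss can be sanity-checked by simulation. The sketch below uses an assumed posterior $N(2.885,\, 3.846)$ (hypothetical numbers, chosen to match the flat-prior flavor of the normal example): the posterior mean achieves a smaller Monte Carlo estimate of expected squared error than nearby alternative estimates.

```python
import random

random.seed(0)
post_mean, post_var = 2.885, 3.846   # assumed posterior N(mean, var)
draws = [random.gauss(post_mean, post_var ** 0.5) for _ in range(100_000)]

def exp_sq_loss(est):
    """Monte Carlo estimate of E[(theta - est)^2] under the posterior."""
    return sum((t - est) ** 2 for t in draws) / len(draws)

# Expected squared error loss is minimized (up to simulation error)
# at the posterior mean.
assert exp_sq_loss(post_mean) < exp_sq_loss(post_mean + 0.5)
assert exp_sq_loss(post_mean) < exp_sq_loss(post_mean - 0.5)
```

Since $E[(\theta - c)^2] = \mathrm{Var}(\theta) + (E[\theta] - c)^2$, shifting the estimate away from the posterior mean adds the squared shift to the loss, which is exactly what the simulation shows.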