Statistics 550 Notes 2

Reading: Section 1.2.

Homework note: As I mentioned in class on Thursday, Problem 1(b) should be Problem 1.1.2 in the book.

I. Frequentist vs. subjective probability (Section 1.2)

Model: We toss a coin 3 times. The tosses are iid Bernoulli trials with probability $p$ of landing heads. What does $p$ mean here?

Mathematical definition – A probability function for a random experiment is a function $P$ on the sample space $S$ of possible outcomes of the experiment that satisfies:

Axiom 1: For all (measurable) events $E$, $0 \le P(E) \le 1$.

Axiom 2: $P(S) = 1$.

Axiom 3 (countable additivity): For any sequence of mutually exclusive (measurable) events $E_1, E_2, \ldots$,
$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i).$$

For a single coin toss, the axioms imply that $p$ must be between 0 and 1 and that the probability of the coin landing tails must be $1-p$.

Meaning of probability as a model for the real world:

Frequentist probability – In "many" independent coin tosses, the proportion of heads would be about $p$. The French naturalist Count Buffon (1707-1788) tossed a coin 4040 times. Result: 2048 heads, or relative frequency 2048/4040 = 0.5069 for heads. Around 1900, the English statistician Karl Pearson heroically tossed a coin 24,000 times. Result: 12,012 heads, a relative frequency of 0.5005. While imprisoned by the Germans during World War II, the South African mathematician John Kerrich tossed a coin 10,000 times. Result: 5067 heads, a relative frequency of 0.5067.

Subjective (personal) probability – Probability describes a person's degree of belief about a statement.

"The probability that the coin lands heads is 0.5." Subjective interpretation: this represents a person's personal degree of belief that a toss of the coin will land heads. Specifically, if offered the chance to make a bet in which the person will win $1-p$ dollars if the coin lands heads and lose $p$ dollars if the coin lands tails, the maximum $p$ at which the person would be willing to play the game is $p = 0.5$.

In general, a person's subjective probability of an event $A$, $P(A)$, is the value of $p$ for which the person thinks a bet in which she will win $1-p$ dollars if $A$ occurs and lose $p$ dollars if $A$ does not occur is a fair bet (i.e., has expected profit zero).

Subjective probability rejects the view of probability as a physical feature of the world and interprets probability as a statement about an individual's state of knowledge. "Coins don't have probabilities, people have probabilities." – Persi Diaconis.

An advantage of subjective probability is that it can be applied to things about which we are uncertain but which cannot be envisioned as being part of a sequence of repeated trials: "Will the Philadelphia Eagles win the Super Bowl this year?" "Was there life on Mars 1 billion years ago?" "Did Lee Harvey Oswald act alone in assassinating John F. Kennedy?" "What is the 353rd digit of $\pi$?"

If I say the probability that the 353rd digit of $\pi$ is 5 is 0.2, I mean that I would consider it a fair bet for me to gain 0.8 dollars if the 353rd digit of $\pi$ is 5 and lose 0.2 dollars if the 353rd digit of $\pi$ is not 5.
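To illustrate the frequentist interpretation, here is a minimal simulation sketch (not part of the original notes; the seed is an arbitrary choice, and the sample sizes are borrowed from the historical experiments above): the relative frequency of heads settles near $p$ as the number of tosses grows.

import random

random.seed(0)  # arbitrary seed, for reproducibility
p = 0.5         # true probability of heads

# Relative frequency of heads at the historical sample sizes
for n in (4040, 10000, 24000):
    heads = sum(random.random() < p for _ in range(n))
    print(f"{n:6d} tosses: {heads} heads, relative frequency {heads / n:.4f}")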
II. Coherence and the axioms of probability

A rational person should have a "coherent" system of subjective probabilities: a system is said to be incoherent if there exists some set of bets such that the bettor will take the set of bets but will lose no matter what happens. If a person's system of probabilities is coherent, then the probabilities must satisfy the axioms of probability.

Example:

Proposition 1: If $A$ and $B$ are mutually exclusive events, then $P(A \cup B) = P(A) + P(B)$.

Proof: Let $P(A) = p_1$, $P(B) = p_2$ and $P(A \cup B) = p_3$ be a person's subjective probabilities for these events. Suppose that $p_3$ differs from $p_1 + p_2$. Then the person thinks that the following bets are fair: (i) a bet in which the person will win $1-p_1$ dollars if $A$ occurs and lose $p_1$ dollars if $A$ does not occur; (ii) a bet in which the person will win $1-p_2$ dollars if $B$ occurs and lose $p_2$ dollars if $B$ does not occur; and (iii) a bet in which the person will win $1-p_3$ dollars if $A \cup B$ occurs and lose $p_3$ dollars if $A \cup B$ does not occur.

Say $p_3 < p_1 + p_2$, and let the difference be $d = (p_1 + p_2) - p_3$. A gambler offers this person the following bets: (a) the person will lose $1 - p_3 - d/4$ dollars if $A \cup B$ occurs and win $p_3 + d/4$ dollars if $A \cup B$ does not occur; (b) the person will win $1 - p_1 + d/4$ dollars if $A$ occurs and lose $p_1 - d/4$ dollars if $A$ does not occur; (c) the person will win $1 - p_2 + d/4$ dollars if $B$ occurs and lose $p_2 - d/4$ dollars if $B$ does not occur. According to the person's subjective probabilities, bets (a)-(c) each have expected profit $d/4 > 0$, so the person takes all of them. However:

(1) Suppose $A$ occurs. Then the person loses $1 - p_3 - d/4$ from bet (a), wins $1 - p_1 + d/4$ from bet (b) and loses $p_2 - d/4$ from bet (c). The person's profit is
$$(1 - p_1 + d/4) - (1 - p_3 - d/4) - (p_2 - d/4) = p_3 + 3d/4 - p_1 - p_2 = (p_1 + p_2 - d) + 3d/4 - p_1 - p_2 = -d/4.$$

(2) Suppose $B$ occurs. The person's profit is
$$(1 - p_2 + d/4) - (1 - p_3 - d/4) - (p_1 - d/4) = p_3 + 3d/4 - p_1 - p_2 = (p_1 + p_2 - d) + 3d/4 - p_1 - p_2 = -d/4.$$

(3) Suppose neither $A$ nor $B$ occurs. The person's profit is
$$(p_3 + d/4) - (p_1 - d/4) - (p_2 - d/4) = p_3 - p_1 - p_2 + 3d/4 = (p_1 + p_2 - d) - p_1 - p_2 + 3d/4 = -d/4.$$

Thus, the gambler has put the person in a position in which the person is guaranteed to lose $d/4$ no matter what happens (when a person accepts a set of bets that guarantees that they will lose no matter what happens, it is said that a Dutch book has been set against them). So it is irrational for the person to assign $P(A \cup B) < P(A) + P(B)$. A similar argument shows that it is irrational to assign $P(A \cup B) > P(A) + P(B)$.

Similar arguments can be made that a rational person's subjective probabilities should satisfy the other axioms of probability: (1) for an event $E$, $0 \le P(E) \le 1$; (2) $P(S) = 1$, where $S$ is the sample space.

Although, from Proposition 1, it is clear that a rational person's personal probabilities should obey finite additivity (i.e., if $E_1, \ldots, E_n$ are mutually exclusive events, $P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$), there is some controversy about additivity for a countably infinite sequence of mutually exclusive events (see J. Williamson, "Countable Additivity and Subjective Probability," The British Journal for the Philosophy of Science, 1999). We assume countable additivity holds for subjective probability.

The mathematical axioms of probability, and hence all results in probability theory, hold for both subjective probabilities and frequentist probabilities -- it is just a matter of how we interpret the probabilities. In particular, Bayes theorem holds for subjective probability. Let $A$ be an event and $B_1, \ldots, B_n$ be mutually exclusive and exhaustive events with $P(B_i) > 0$ for all $i$. Then
$$P(B_j \mid A) = \frac{P(A \mid B_j) P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i) P(B_i)}.$$
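To make the Dutch book construction concrete, here is a minimal numerical check (not part of the original notes; the incoherent assignment $p_1 = 0.3$, $p_2 = 0.4$, $p_3 = 0.5$ is an illustrative assumption): the person's profit is $-d/4$ in every outcome.

# Dutch book check: an incoherent assignment p3 < p1 + p2 for mutually
# exclusive events A and B guarantees a loss of d/4 in every outcome.
p1, p2, p3 = 0.3, 0.4, 0.5      # illustrative incoherent probabilities
d = (p1 + p2) - p3              # here d = 0.2

# Net profit from bets (a), (b), (c) in each of the three possible outcomes:
profit_A       = (1 - p1 + d/4) - (1 - p3 - d/4) - (p2 - d/4)
profit_B       = (1 - p2 + d/4) - (1 - p3 - d/4) - (p1 - d/4)
profit_neither = (p3 + d/4) - (p1 - d/4) - (p2 - d/4)

print(profit_A, profit_B, profit_neither)   # each equals -d/4 = -0.05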
III. The Bayesian Approach to Statistics

Example 1 from Notes 1: Lebron James' free throws in the 2008-2009 season and future seasons are iid Bernoulli trials with probability $p$ of success. In the 2008-2009 season, Lebron made 594 out of the 762 free throws he attempted (78.0%). What inferences can we make about $p$?

The Bayesian approach to statistics uses subjective probability: it regards $p$ as a fixed but unknown quantity about which we have beliefs, given by a subjective probability distribution, and it uses the data to update those beliefs via Bayes rule.

Prior distribution: subjective probability distribution about the parameter vector ($p$) before seeing any data.

Posterior distribution: updated subjective probability distribution about the parameter vector ($p$) after seeing the data.

Bernoulli trials example: Suppose that $X_1, \ldots, X_n$ are iid Bernoulli with probability of success $p$.

We want our prior distribution for $p$ to be a distribution that concentrates on the interval [0,1]. A class of distributions that concentrates on [0,1] is the two-parameter beta family.

The beta family of distributions: The density function $f$ of the beta distribution with parameters $r$ and $s$ is
$$f(x) = \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)} x^{r-1} (1-x)^{s-1}, \quad 0 < x < 1.$$
The mean and variance of the Beta($r$,$s$) distribution are
$$\frac{r}{r+s} \quad \text{and} \quad \frac{rs}{(r+s)^2 (r+s+1)},$$
respectively. See Appendix B.2 for details.

Suppose our prior distribution for $p$ is Beta($r$,$s$) with density $\pi(p)$ and we observe $X_1 = x_1, \ldots, X_n = x_n$. Using Bayes theorem, our posterior distribution for $p$ is
$$\pi(p \mid x_1, \ldots, x_n) = \frac{\pi(p) f(x_1, \ldots, x_n \mid p)}{\int_0^1 \pi(t) f(x_1, \ldots, x_n \mid t)\, dt} \propto \pi(p) f(x_1, \ldots, x_n \mid p)$$
$$= \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)} p^{r-1} (1-p)^{s-1} \, p^{\sum_{i=1}^{n} x_i} (1-p)^{n - \sum_{i=1}^{n} x_i} \propto p^{\,r-1+\sum_{i=1}^{n} x_i} (1-p)^{\,s-1+n-\sum_{i=1}^{n} x_i}.$$
The last expression is proportional to the Beta($r + \sum_{i=1}^{n} x_i$, $s + n - \sum_{i=1}^{n} x_i$) density, so the posterior density for $p$ is Beta($r + \sum_{i=1}^{n} x_i$, $s + n - \sum_{i=1}^{n} x_i$).

Families of distributions for which the prior and posterior distributions belong to the same family are called conjugate families.

The posterior mean is thus equal to
$$\frac{r + \sum_{i=1}^{n} x_i}{r+s+n}$$
and the posterior variance is
$$\frac{\left(r + \sum_{i=1}^{n} x_i\right)\left(s + n - \sum_{i=1}^{n} x_i\right)}{(r+s+n)^2 (r+s+n+1)}.$$

Returning to our example in which Lebron James made 594 out of the 762 free throws he attempted (78.0%) in 2008-2009, if we had a Beta(1,1) [uniform] prior, then the posterior distribution for Lebron's probability of making a free throw in the next season is Beta(594+1, 762-594+1) = Beta(595,169). The posterior mean is 0.779 and the posterior standard deviation is 0.015.

A valuable feature of the Bayesian approach is its ability to incorporate prior information. Returning to our example in which Lebron made 594 out of the 762 free throws he attempted (78.0%) in 2008-2009, we could use information from Lebron's previous seasons:

Season      Free Throws Made   Free Throws Attempted   FT%
2003-2004   347                460                     0.754
2004-2005   477                636                     0.750
2005-2006   601                814                     0.738
2006-2007   489                701                     0.698
2007-2008   549                771                     0.712
Total       2463               3382                    0.728

A way to think about the Beta($r$,$s$) prior is the following: suppose that if we had no informative prior information on $p$, we would use a uniform (Beta(1,1)) prior. Then a Beta($r$,$s$) prior means that our prior information is equivalent to seeing $r-1$ successes and $s-1$ failures before seeing the current data. So, for example, a Beta(82,18) prior says that our prior information is equivalent to seeing 81 made free throws and 17 missed free throws before seeing our data.

If we consider all the free throws from seasons prior to 2008-2009 to be equivalent to free throws in 2008-2009, we would use a Beta(2463+1, 3382-2463+1) = Beta(2464,920) prior. If we consider each free throw in prior seasons to be equivalent to half a free throw of information, we would use a Beta(2463*0.5+1, (3382-2463)*0.5+1) = Beta(1322.5,460.5) prior.
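A minimal sketch of this beta-binomial update for the free-throw example, assuming scipy is available (not part of the original notes; the set of priors compared is drawn from the discussion above):

from scipy import stats

made, attempted = 594, 762    # Lebron's 2008-2009 free throws
misses = attempted - made

# Posterior under each prior: Beta(r + made, s + misses)
priors = [(1, 1, "uniform Beta(1,1)"),
          (2464, 920, "all prior seasons, Beta(2464,920)"),
          (1322.5, 460.5, "half-weight prior seasons, Beta(1322.5,460.5)")]

for r, s, label in priors:
    post = stats.beta(r + made, s + misses)
    print(f"{label}: posterior mean {post.mean():.3f}, sd {post.std():.3f}")

# The uniform prior reproduces the numbers in the notes:
# posterior mean 0.779, posterior standard deviation 0.015.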
When there is more data and prior beliefs are less strong, the prior distribution does not play as strong a role. The posterior mean can be written as
$$\frac{r + \sum_{i=1}^{n} x_i}{r+s+n} = \frac{n}{r+s+n}\,\bar{x} + \frac{r+s}{r+s+n} \cdot \frac{r}{r+s}.$$
The posterior mean is a weighted average of the sample mean and the prior mean, with the weight on the sample mean increasing to 1 as $n \to \infty$.

Example 2: A cofferdam protecting a construction site was designed to withstand flows of up to 1870 cubic feet per second (cfs). An engineer wishes to compute the probability that the dam will be overtopped during the upcoming year. Over the previous 25-year period, the annual maximum flood levels of the dam had ranged from 629 to 4720 cfs, and 1870 cfs had been exceeded 5 times. Modeling the 25 years as 25 independent Bernoulli trials with the same probability $p$ that the flood level will exceed 1870 cfs, and using a uniform prior distribution for $p$ (which corresponds to a Beta(1,1)), the posterior distribution for $p$ is Beta(1+5, 1+25-5) = Beta(6,21).

[Figure: prior (uniform) and posterior Beta(6,21) densities for $p$.]

IV. Bayesian Inference for the Normal Distribution

Suppose $X_1, \ldots, X_n$ are iid $N(\theta, \sigma^2)$ with $\sigma^2$ known, and our prior on $\theta$ is $N(\eta, b^2)$. The posterior distribution is proportional to
$$f(x \mid \theta)\,\pi(\theta) \propto \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x_i - \theta)^2\right) \cdot \frac{1}{\sqrt{2\pi}\,b} \exp\left(-\frac{1}{2b^2}(\theta - \eta)^2\right)$$
$$\propto \exp\left(-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} x_i^2 - 2n\bar{X}\theta + n\theta^2\right) - \frac{1}{2b^2}\left(\theta^2 - 2\eta\theta + \eta^2\right)\right)$$
$$\propto \exp\left(-\frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{b^2}\right)\left(\theta - \frac{n\bar{X}/\sigma^2 + \eta/b^2}{n/\sigma^2 + 1/b^2}\right)^2\right).$$
Thus, the posterior distribution is
$$N\left(\frac{n\bar{X}/\sigma^2 + \eta/b^2}{n/\sigma^2 + 1/b^2}, \; \frac{1}{n/\sigma^2 + 1/b^2}\right).$$
Note that the posterior mean is a weighted average of the sample mean and the prior mean,
$$\frac{n/\sigma^2}{n/\sigma^2 + 1/b^2}\,\bar{X} + \frac{1/b^2}{n/\sigma^2 + 1/b^2}\,\eta,$$
with the weight on the sample mean increasing to 1 as $n \to \infty$.
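A minimal sketch of this normal-normal update (not part of the original notes; the data summary, prior mean $\eta$, prior variance $b^2$, and known $\sigma^2$ are all illustrative assumptions):

import math

sigma2 = 4.0        # known sampling variance sigma^2 (illustrative)
eta, b2 = 0.0, 1.0  # prior: theta ~ N(eta, b2)      (illustrative)
n, xbar = 25, 1.3   # sample size and sample mean    (illustrative)

precision = n / sigma2 + 1 / b2
post_mean = (n * xbar / sigma2 + eta / b2) / precision
post_var = 1 / precision

# The posterior mean is the weighted average of xbar and eta derived above:
w = (n / sigma2) / precision
assert abs(post_mean - (w * xbar + (1 - w) * eta)) < 1e-12

print(post_mean, post_var, math.sqrt(post_var))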
V. Bickel and Doksum's perspective on Bayesian models

(a) Bayesian models are useful as a way to generate statistical procedures which incorporate prior information when appropriate. However, statistical procedures should be evaluated in a frequentist (repeated sampling) way. For example, in the iid Bernoulli trials example, if we use the posterior mean with a uniform prior distribution to estimate $p$, i.e., $\hat{p} = \frac{1 + \sum_{i=1}^{n} x_i}{n+2}$, then we should look at how this estimate would perform in many repetitions of the experiment if the true parameter is $p$, for various values of $p$. More to come on this frequentist perspective in Chapter 1.3.

(b) We can view the parameter $\theta$ as random and view the Bayesian model as providing a joint probability distribution on the parameter and the data in a frequentist probability sense. Consider the frequentist probability model $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ for the probability distribution of the data $X$. The subjective Bayesian perspective is that there is a true unknown $\theta$ and our goal is to describe our beliefs about $\theta$ after seeing the data $X$. This requires specifying a prior distribution $\pi(\theta)$ for our beliefs about $\theta$; the posterior distribution describes our beliefs about $\theta$ after seeing the data $X$.

Bickel and Doksum's viewpoint is to see a Bayesian model as a model for the joint probability distribution of $(\theta, X)$, where $\theta$ is considered random. Specifically, a Bayesian model consists of a specification of the marginal distribution of $\theta$, which is the prior distribution $\pi(\theta)$, and a specification of the conditional distribution of the data $X \mid \theta$, i.e., $P_\theta$. For example, if the data $X_1, \ldots, X_n$ are iid Bernoulli trials with probability $p$ of success and the prior distribution for $p$ is Beta($r$,$s$), then the joint probability distribution for $(p, X_1, \ldots, X_n)$ is generated as follows (a simulation sketch is given at the end of these notes):
1. We first generate $p$ from a Beta($r$,$s$) distribution.
2. Conditional on $p$, we generate $X_1, \ldots, X_n$ iid Bernoulli trials with probability $p$ of success.

Although the conceptual basis for regarding $\theta$ as random in a frequentist sense is not clear, this point of view is useful for developing properties of Bayesian procedures.
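As an illustration of viewing the Bayesian model as a joint distribution for $(p, X_1, \ldots, X_n)$, here is a minimal simulation sketch of the two-step generation above (not part of the original notes; $r$, $s$, $n$, and the seed are illustrative choices):

import random

random.seed(1)
r, s, n = 1, 1, 10   # illustrative prior parameters and sample size

# Step 1: draw p from the Beta(r, s) prior.
p = random.betavariate(r, s)

# Step 2: conditional on p, draw X_1, ..., X_n iid Bernoulli(p).
x = [int(random.random() < p) for _ in range(n)]

print(p, x)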