10.0 Central Limit Theorem

• Answer Questions
• Review Bayes
• Box Models
• Expected Value and Standard Error
• Central Limit Theorem

10.1 Bayes Review

Recall: Let \(A_1, \ldots, A_n\) be mutually exclusive and suppose that \(P[A_1 \text{ or } A_2 \text{ or } \ldots \text{ or } A_n] = 1\). Then

\[
P[A_1 \mid B] = \frac{P[B \mid A_1]\, P[A_1]}{\sum_{i=1}^{n} P[B \mid A_i]\, P[A_i]}.
\]

Suppose that 70% of Republicans favor the Medicare prescription plan, as do 20% of Democrats and 40% of Independents. And assume that 35% of the population are Republicans, 45% are Democrats, and 20% are Independents. You meet someone who says they support the plan; what is the chance that they are Republican?

Here the \(A_i\) are the three parties, which we assume are mutually exclusive and exhaustive.

A1: Republicans
A2: Democrats
A3: Independents

We want to find the probability of a Republican given support in terms of the probabilities of support given the party affiliations.

Pair and Discuss

So you want to find:

\[
\begin{aligned}
P[\text{Rep} \mid \text{supp}] &= \frac{P[\text{Rep and supp}]}{P[\text{supp}]} \\
&= \frac{P[s \mid R]\, P[R]}{P[s \mid R]\, P[R] + P[s \mid D]\, P[D] + P[s \mid I]\, P[I]} \\
&= \frac{(.7)(.35)}{(.7)(.35) + (.2)(.45) + (.4)(.2)} \\
&= .590.
\end{aligned}
\]

10.2 Box Models

A Box Model describes a process in terms of making repeated draws, with replacement, from a box containing numbers.

Since draws are made with replacement, the outcomes in a series of draws are independent. The value on the first draw does not affect the value on the second.

Box models describe: repeated rolls of a die, repeated tosses of a coin (fair or unfair), and (approximately) drawing a random sample of U.S. citizens and asking for whom they plan to vote.

For a box model, the expected value is the average of the numbers in the box. For categorical data, such as voting for Bush or Kerry, one averages zeroes (Bush) and ones (Kerry). When averaging zeroes and ones, the result is just a proportion: the total number of people voting for Kerry divided by the total number of people whom you ask.
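As a minimal sketch of the box-model idea above, the following Python snippet builds a box of zeroes and ones and draws from it with replacement. The 45% share of ones, the seed, and the helper names `box_expected_value` and `draw_average` are illustrative assumptions, not figures from these notes. It shows that the expected value of a 0/1 box is exactly the proportion of ones, and that the average of many draws lands near that proportion.

```python
import random


def box_expected_value(box):
    """Expected value of a box: the average of the numbers on its tickets."""
    return sum(box) / len(box)


def draw_average(box, n, rng):
    """Average of n draws made with replacement from the box."""
    draws = [rng.choice(box) for _ in range(n)]
    return sum(draws) / n


# Hypothetical box: 45 tickets marked 1 and 55 marked 0 (illustrative numbers).
box = [1] * 45 + [0] * 55
rng = random.Random(0)  # fixed seed so the run is repeatable

ev = box_expected_value(box)          # the proportion of ones in the box
avg = draw_average(box, 10_000, rng)  # the proportion of ones among the draws
print(ev, avg)
```

Because `rng.choice` picks each ticket independently of earlier picks, the draws are made with replacement, which is what the box model requires.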
The Law of Averages says that if one makes many draws from the box and averages the results, that average will approach the expected value of the box.

10.3 Expected Value and Standard Error

Suppose that a box contains \(B\) numbers, \(X_1, \ldots, X_B\). Then the expected value of the box is

\[
EV = \bar{X} = \frac{1}{B} \sum_{i=1}^{B} X_i
\]

and the standard deviation of the box \((X_1, \ldots, X_B)\) is

\[
sd = \sqrt{\frac{1}{B}\left[(X_1 - EV)^2 + \cdots + (X_B - EV)^2\right]}
   = \sqrt{\left(\frac{1}{B} \sum_{i=1}^{B} X_i^2\right) - (EV)^2}.
\]

This is just finding the mean and sd for a list of numbers \((X_1, \ldots, X_B)\).

The standard error for the average of \(n\) draws from a box (with replacement) is:

\[
se = \frac{sd}{\sqrt{n}}.
\]

sd ⇐ standard deviation of a box.
se ⇐ standard error for the average of n draws from a box.

se is the sd of the averages found in many samples of size n.

Recall that the Law of Averages says that if one makes many draws from the box and averages the results, that average will approach the expected value of the box.

Another way to understand the Law of Averages: the standard error is the likely size of the difference between the average of \(n\) draws from the box and the expected value of the box. And the standard error se goes to zero as \(n \to \infty\).

This means that the sample average is a good estimate of the average in the box, and the accuracy of the estimate improves as you take more and more draws from the box. In the long run, the average will approach the expected value of the box.

The standard error for the sum of \(n\) draws (with replacement) is:

\[
se = \sqrt{n} \times sd.
\]

This is the sd of many sums of size n.

Analogously to the standard error for averages, the standard error of the sum is the likely size of the difference between the sum of \(n\) draws from the box and \(n\) times the expected value of the box.

Note that as \(n \to \infty\), this standard error does not go to zero. This means that as the number of draws increases, the likely difference between the sum of the draws and \(n\,EV\) gets larger, rather than smaller.
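The contrast between the two standard errors can be checked numerically. The sketch below uses a fair die as the box (one of the box models named earlier); the helper names `box_sd`, `se_average`, and `se_sum` are illustrative and simply encode the formulas sd/√n and √n × sd given above.

```python
import math


def box_sd(box):
    """sd of the box: square root of (mean of squares minus squared mean)."""
    ev = sum(box) / len(box)
    return math.sqrt(sum(x * x for x in box) / len(box) - ev ** 2)


def se_average(box, n):
    """Standard error for the average of n draws: sd / sqrt(n)."""
    return box_sd(box) / math.sqrt(n)


def se_sum(box, n):
    """Standard error for the sum of n draws: sqrt(n) * sd."""
    return math.sqrt(n) * box_sd(box)


box = [1, 2, 3, 4, 5, 6]  # a fair die, used here only as an example box

# As n grows, the first column of standard errors shrinks and the second grows.
for n in (1, 100, 10_000):
    print(n, round(se_average(box, n), 4), round(se_sum(box, n), 4))
```

Multiplying n by 100 divides the standard error of the average by 10 and multiplies the standard error of the sum by 10.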
This concern about the sum, rather than the average, arises in contexts such as investment or gambling, where the total return from multiple trials is important, not the average return.

10.4 The Central Limit Theorem

The Central Limit Theorem is one of the high-water marks of mathematical thinking. It was worked upon by James Bernoulli, Abraham de Moivre, and Alan Turing. Over the centuries, the theory improved from special cases to a very general rule.

Essentially, the Central Limit Theorem allows one to describe how accurately the Law of Averages works. That is, what is the chance that the sample average is more than \(d\) away from the true EV?

Formally, the Central Limit Theorem for averages says:

\[
\bar{X} \mathrel{\dot\sim} N(EV,\ sd/\sqrt{n})
\quad\text{or}\quad
\frac{\bar{X} - EV}{sd/\sqrt{n}} \mathrel{\dot\sim} N(0, 1) \qquad (1)
\]

where \(N(0, 1)\) is the standard normal, \(\bar{X}\) is the average of \(n\) draws, EV is the expected value of the box, and sd is the standard deviation of the box.

This means that \((\bar{X} - EV)/(sd/\sqrt{n})\) is a random number that is approximately normal with mean zero and standard deviation one. Note that EV and sd are fixed numbers and \(\bar{X}\) is a random number.

The approximation gets better as \(n\) gets larger. Modifications of this formula hold for many other situations in practice.

Another version of the Central Limit Theorem holds for sums:

\[
\text{sum} \mathrel{\dot\sim} N(n\,EV,\ \sqrt{n}\, sd)
\quad\text{or}\quad
\frac{\text{sum} - n\,EV}{\sqrt{n}\, sd} \mathrel{\dot\sim} N(0, 1). \qquad (2)
\]

Note that \(\text{sum} = n\bar{X}\).

This formula is useful when calculating the chance of winning a given amount of money when gambling, or getting more than a specific score on a test.

With these two central limit formulas, one can answer all sorts of practical questions.

Problem 1a: You want to estimate the average income of people in Durham. Suppose the true mean income (EV) is $42,000 with sd of $10,000. You draw a random sample of 100 households (n = 100). What is the probability that your sample estimate is over the true value by $500 or more?

\[
P[\bar{X} > 42{,}000 + 500] = \ ?
\]

• What is the box model for this problem?
• What is the expected value of the box?
• What is the sd of the box?
• What formula should we use? (1) [average] or (2) [sum]?

Recall that

\[
\bar{X} \mathrel{\dot\sim} N(EV,\ sd/\sqrt{n})
\quad\text{or}\quad
\frac{\bar{X} - EV}{sd/\sqrt{n}} \mathrel{\dot\sim} N(0, 1) \qquad (1)
\]

where \(N(0, 1)\) is the standard normal, \(\bar{X}\) is the average of \(n\) draws, EV is the expected value of the box, and sd is the standard deviation of the box.

\[
\begin{aligned}
P[\bar{X} > 42{,}500] &= P[\bar{X} - EV > 500] \\
&= P[(\bar{X} - EV)/se > 500/se] \\
&\doteq P[Z > 500/(10{,}000/\sqrt{100})] \\
&= P[Z > .5].
\end{aligned}
\]

From the standard normal table, we know this has chance (1/2)(100 − 38.29)% = 30.85%, so the probability of the estimate being too high by $500 or more is just .3085.

Note that in order to solve this, we made the unreasonable assumption that we knew the true mean of the box and the sd. Later we shall relax this assumption.

This is not really the question one wants to ask in practice, nor is it the kind of information that one really has from a survey.

Problem 1b: You want to estimate the average income of people in Durham. You draw a random sample of 100 households and find the sample mean is $42,500 and the sd of your sample is $10,000. What is the approximate probability that you have overestimated the true average income of Durham by $500 or more?

First, assume that the sd of your sample equals the sd of the box. This is an approximation, and later we shall see a way to improve this.

We want to find

\[
\begin{aligned}
P[\bar{X} - EV > 500] &= P[(\bar{X} - EV)/se > 500/se] \\
&\doteq P[Z > 500/(10{,}000/\sqrt{100})] \\
&= P[Z > .5].
\end{aligned}
\]

The probability is .3085 that $42,500 is $500 too high (or more).

Problem 2: You are playing Red and Black in roulette. (A roulette wheel has 38 pockets; 18 are red, 18 are black, and 2 are green. The house takes all the money on green.)

You pick either red or black; if the ball lands in the color you pick, you win a dollar. Otherwise you lose a dollar.

Suppose you make 100 plays. What is the chance that you lose $10 or more?

What is the box model?

What formula should we use? (1) [average] or (2) [sum]?
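Recall that another version of the Central Limit Theorem holds for sums:

\[
\text{sum} \mathrel{\dot\sim} N(n\,EV,\ \sqrt{n}\, sd)
\quad\text{or}\quad
\frac{\text{sum} - n\,EV}{\sqrt{n}\, sd} \mathrel{\dot\sim} N(0, 1). \qquad (2)
\]

There are 38 tickets: 18 are labelled 1 and the other 20 are labelled −1. So the expected value of the box is

\[
\begin{aligned}
EV &= \frac{1}{38}\left[1 + 1 + \cdots + 1 + (-1) + (-1) + \cdots + (-1)\right] \\
&= \frac{1}{38}[-2] \\
&= -1/19.
\end{aligned}
\]

The standard deviation of the box is

\[
\begin{aligned}
sd &= \sqrt{\left(\frac{1}{38} \sum_{i=1}^{38} X_i^2\right) - (EV)^2} \\
&= \sqrt{1 - (-1/19)^2} \\
&= .998614.
\end{aligned}
\]

The probability of losing $10 or more in 100 plays is

\[
\begin{aligned}
P[\text{sum} < -10] &= P[\text{sum} - n\,EV < -10 - n\,EV] \\
&= P\left[\frac{\text{sum} - n\,EV}{\sqrt{n}\, sd} < \frac{-10 - n\,EV}{\sqrt{n}\, sd}\right] \\
&\doteq P\left[Z < \frac{-10 - n\,EV}{\sqrt{n}\, sd}\right] \\
&= P\left[Z < \frac{-10 - (100)(-1/19)}{10 \times .998614}\right] \\
&= P[Z < -.47434].
\end{aligned}
\]

Reading the standard normal table at the nearby entry z = .45, the chance of falling between −.45 and .45 is about .3473. Thus the chance of being below −.47434 is roughly (1 − .3473)/2 = .3264. (Evaluating the normal curve at z = −.47434 directly, rather than at the nearest table entry, gives about .318.)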
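The roulette calculation can be reproduced numerically. The sketch below is not part of the original notes: it builds the 38-ticket box, applies formula (2), and evaluates the standard normal curve with Python's `math.erf` instead of a printed table (the helper name `normal_cdf` is an illustrative choice). Because it does not round z to a table entry, its answer differs slightly from a table lookup.

```python
import math


def normal_cdf(z):
    """P[Z < z] for a standard normal Z, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Box for a red/black bet: 18 winning tickets (+1) and 20 losing tickets (-1).
box = [1] * 18 + [-1] * 20
n = 100  # number of plays

ev = sum(box) / len(box)                                      # -1/19
sd = math.sqrt(sum(x * x for x in box) / len(box) - ev ** 2)  # about .9986

# Standardize the threshold "sum < -10" using formula (2).
z = (-10 - n * ev) / (math.sqrt(n) * sd)
p = normal_cdf(z)

print(round(ev, 5), round(sd, 6), round(z, 5), round(p, 4))
```

The same three steps (find EV and sd of the box, standardize, look up the normal curve) apply to any sum-of-draws question; only the box and the threshold change.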