Download Applied Data Analysis - University of Rochester

Applied Data Analysis
Spring 2017
Yemen, 2016
Karen Albert
[email protected]
Thursdays, 4-5 PM (Hark 302)

Lecture outline
1. Percentages
2. Central limit theorem
3. Inference

Let’s do some political science
McClatchy-Marist poll released yesterday: 17% said that they trust President Trump to "deliver accurate and factual information to the public."
• The percentage that is reported is a mean.
• More specifically, it is a mean of 1’s and 0’s.
• So the percentage is the expected value.
• We can then figure out the standard error.
The margin of error is twice the standard error. (Why twice?)
The poll numbers are therefore: expected value ± (2 × S.E.)

A short-cut to the SD for a special case
For boxes with only two kinds of tickets, there is a short-cut to finding the standard deviation.
$S = (\text{big} - \text{small}) \times \sqrt{\text{big fraction} \times \text{small fraction}}$
Consider a box with the values 1, 1, 1, 5. The standard deviation is
$S = (5 - 1) \times \sqrt{\frac{1}{4} \times \frac{3}{4}} \approx 1.73$

The short-cut
    x <- c(1,1,1,5)
    sd(x)*sqrt(3/4)
    ## [1] 1.732051
    (5-1)*sqrt(.25*.75)
    ## [1] 1.732051
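The factor sqrt(3/4) in the short-cut check for the box 1, 1, 1, 5 is there because R's sd() divides by n - 1 rather than n. The following is a minimal sketch that makes this explicit; the helper names pop_sd and two_ticket_sd are hypothetical and do not come from the slides.

```r
# Hypothetical helpers for illustration; not part of the original slides.

# SD of the box itself (divides by n), unlike sd(), which divides by n - 1.
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# Short-cut SD for a box that holds only two kinds of tickets.
# frac_big is the fraction of tickets carrying the larger value.
two_ticket_sd <- function(big, small, frac_big) {
  (big - small) * sqrt(frac_big * (1 - frac_big))
}

box <- c(1, 1, 1, 5)
direct   <- pop_sd(box)
shortcut <- two_ticket_sd(big = 5, small = 1, frac_big = 1/4)

# Both routes should give the same SD for this box.
print(c(direct = direct, shortcut = shortcut))
```

For a box of n tickets, sd(x) * sqrt((n - 1) / n) gives the same value as pop_sd(x), which is what the sqrt(3/4) correction does when n = 4.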

Another example
If you bet a dollar on a single number at Nevada roulette, and that number comes up, you get the $1 back together with winnings of $35. If any other number comes up, you lose the dollar. Suppose you play 100 times, betting a dollar on the number 17 each time. What is your chance of breaking even?
Hint: there are 38 spaces on a roulette wheel.

The box model and the expected value
The box has 38 tickets in it: 1 worth 35 and 37 worth -1. The mean of the box is
    35*(1/38)+(-1)*(37/38)
    ## [1] -0.05263158

The standard deviation
Since there are only two kinds of tickets in the box, we can use the short-cut.
    (35-(-1))*sqrt((1/38)*(37/38))
    ## [1] 5.762617
So the standard error of the average winnings per play, over 100 plays, is
    5.76/sqrt(100)
    ## [1] 0.576

End of the example
What is the probability that you break even? The expected value of the average winnings per play is -0.0526, and its standard error is 0.576.
    pnorm(0,-.0526,.576,lower.tail=FALSE)
    ## [1] 0.4636194
So, by the normal approximation, the probability of breaking even or better is roughly 46%.

SRS and chance error
Let’s consider the percentage of Democrats in a sample. Assume that we took a simple random sample (SRS). With an SRS, the expected value for the sample percentage equals the population percentage. The sample percentage won’t match exactly, of course. The amount it is off is governed by chance, and we can use a box model to understand it.
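To make the chance-error idea concrete, here is a small simulation sketch. The box contents (10,000 tickets, 40% of them 1’s), the sample size of 100, the 2,000 repetitions, and the seed are all illustrative assumptions, not figures from the slides.

```r
set.seed(2017)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical box: 10,000 tickets, 40% of them 1's (illustrative numbers only).
box <- rep(c(1, 0), times = c(4000, 6000))
box_pct <- mean(box)

# Draw one simple random sample (without replacement) and
# return the fraction of 1's in it.
sample_pct <- function(n) mean(sample(box, size = n, replace = FALSE))

# Repeat the "survey" many times to see how the chance error behaves.
pcts <- replicate(2000, sample_pct(100))
chance_error <- pcts - box_pct

mean(chance_error)                    # near 0: right on average
sd(chance_error)                      # typical size of the chance error
sqrt(box_pct * (1 - box_pct) / 100)   # box-model standard error, for comparison
```

The average chance error should sit near zero, while its spread should be close to the box-model standard error, since a sample of 100 is small relative to this box.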

How BIG?
% of 1’s in the sample = % of 1’s in the box + chance error
How big is the error likely to be? The standard error tells us.
Calculation:
1. Find the standard deviation for the number of 1’s in the sample.
2. Find the standard error.
3. Convert to a percentage.

Calculating the standard error (part one)
Consider a box with 3,091 1’s and 3,581 0’s. Assume that we take 100 draws from the box. The fraction of 1’s in the box is 0.46. The standard deviation is therefore
    sqrt(.46*.54)
    ## [1] 0.4983974

Calculating the standard error (part deux)
Now that we have the SD, the standard error is
    sqrt(.46*.54/100)
    ## [1] 0.04983974
The percentage of 1’s in the sample is 46% give or take 5% or so.

What if the sample size increases?
First, what happens to the standard error of the sum?
$\sqrt{100} \times 0.5 = 5 \qquad \sqrt{400} \times 0.5 = 10$
Now, what about the standard error for the percentage?
$\frac{0.5}{\sqrt{100}} = 0.05 \qquad \frac{0.5}{\sqrt{400}} = 0.025$

The normal curve yet again
Remember a box filled with 3,091 1’s and 3,581 0’s. The percentage of 1’s in the sample is 46% give or take 5% or so. Can we estimate the chances of the percentage being greater than 56%?
We did this with the normal curve before, but the data this time are not normal....

The Central Limit Theorem
When drawing at random with replacement from a box, the sampling distribution of the sample mean will follow the normal curve, even if the contents of the box do not.
There are conditions:
1. the sampling distribution must be put into standard units,
2. the number of draws must be reasonably large.

Central Limit Theorem
• Consider a variable uniformly distributed between 0 and 10.
• Suppose that we take a random sample of size n from this distribution and compute the mean.
• Next, we put the mean in standard units by subtracting its expected value and dividing by its standard error (the SD divided by the square root of n).
• Repeat this process 4000 times.
• What would the resulting sampling distribution look like?

[Figure: simulated sampling distributions for sample size = 1, 5, 10, and 30]
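The question about the chance of a sample percentage above 56% can be explored with a simulation in the spirit of the Central Limit Theorem slides. This is a sketch under stated assumptions: the seed is arbitrary, the 4,000 repetitions mirror the count used on the slides, and the exact box fraction (about 0.463) is used rather than the rounded 0.46.

```r
set.seed(2017)  # arbitrary seed for reproducibility

# The box from the slides: 3,091 1's and 3,581 0's.
box <- rep(c(1, 0), times = c(3091, 3581))
p  <- mean(box)                 # about 0.463 (the slides round to 0.46)
se <- sqrt(p * (1 - p) / 100)   # standard error of the sample fraction, n = 100

# 4,000 repetitions of: draw 100 tickets with replacement,
# then record the fraction of 1's.
sim <- replicate(4000, mean(sample(box, size = 100, replace = TRUE)))

# Put the simulated fractions in standard units.
z <- (sim - p) / se

# Share of simulated samples above 56%, next to the normal-curve estimate.
mean(sim > 0.56)
pnorm(0.56, mean = p, sd = se, lower.tail = FALSE)

# Uncomment to compare the histogram with the standard normal curve:
# hist(z, breaks = 30, freq = FALSE); curve(dnorm(x), add = TRUE)
```

Although the box holds only 0’s and 1’s, the standardized fractions should have mean near 0 and SD near 1, and the simulated share above 56% should land in the same neighborhood as the normal-curve figure.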

Actual inference
OK, because I’m a nice guy, I keep telling you the contents of the box. What if I don’t tell you the contents of the box, but only the draws from the box? This is the essence of inference. Using the draws to learn about the box is just like using a sample to learn about a population.

Box to sample
If we don’t know the contents of the box, we have to substitute the fractions observed in the sample for the unknown fractions in the box.

Example
In a certain city, there are 100,000 persons age 18 to 24. A simple random sample of 500 such persons is drawn, of whom 194 turn out to be currently enrolled in college. Estimate the percentage of all persons age 18 to 24 in that city who are currently enrolled in college.
The sample percentage is
$\frac{194}{500} = 0.388$

The standard deviation
The SD of the box, using the observed fractions, is
$s = \sqrt{\text{\% of 1's} \times \text{\% of 0's}} = \sqrt{0.388(1 - 0.388)} = 0.487$

The standard error
The standard error is therefore
$\text{s.e.} = \frac{0.487}{\sqrt{500}} = 0.0218$
The solution is therefore 38.8% give or take 2.18%.

Standard error for a percentage
$\text{s.e.} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
• $\hat{p}$ is the observed fraction of “successes” in the sample.
• $n$ is the number of draws (the size of the sample).
Interpretation: average distance of the $\hat{p}$ values from the true, but unknown, percentage.

Another example
There are 4000 undergraduates at the University of Rochester. A simple random sample of 400 students is drawn, of whom 20 turn out to like Nickelback.
Estimate the percentage of all students at UR that like Nickelback.

Solution
The sample percentage is $\hat{p} = 20/400 = 0.05$. The standard error is
$\text{s.e.} = \sqrt{\frac{0.05(1 - 0.05)}{400}} = 0.0109$
The solution is therefore 5% give or take 1.09%.

What did we learn?
• Percentages
• Central limit theorem
• Inference
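
The two worked examples from the inference part of the lecture (college enrollment and Nickelback) follow the same recipe, so they can be reproduced with one small function. This is only a sketch; the name se_pct is hypothetical and does not appear in the slides.

```r
# Hypothetical helper, not from the slides: point estimate and
# standard error for a sample percentage, using the observed fraction.
se_pct <- function(successes, n) {
  p_hat <- successes / n
  c(estimate = p_hat, se = sqrt(p_hat * (1 - p_hat) / n))
}

college    <- se_pct(194, 500)  # enrolled-in-college example
nickelback <- se_pct(20, 400)   # Nickelback example

round(college, 4)
round(nickelback, 4)
```

Each result is read as "estimate, give or take one standard error or so", matching the wording used in the solutions.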
