Lecture #14: Confidence Intervals for the Proportion

In the last lecture we covered estimating a population mean, $\mu$, from a sample, first using a point estimate, $\bar{x}$, and then generating an interval, $\bar{x} - E < \mu < \bar{x} + E$, which we could state with a certain level of confidence contains the population mean. This time we’ll deal with estimating a different parameter, called the population proportion, $p$.

When we have a quantitative data set at the interval or ratio level of measurement, we have a set of numbers for which we can calculate a mean and a standard deviation. But when we have a binomial variable, like a ‘yes’ or ‘no,’ or a ‘male’ or ‘female,’ the only thing we can calculate is the fraction of the set that falls in one of these categories or the other. Whichever of these categories we choose to concentrate on, we can look at a sample of the data set and calculate the sample proportion for this category.

Let’s say we’re interested in the fraction of Mendocino College students who are teenagers. From the Class Data Base, we find that 27 out of 95 are teenagers. We call 27, the number in our category (teenagers), $x$. As usual, the size of the set is $n$. The fraction $\frac{x}{n}$, or $\frac{27}{95}$ in this case, is the sample proportion, and its symbol is $\hat{p}$, pronounced ‘p-hat.’ So the formula for the sample proportion is $\hat{p} = \frac{x}{n}$.

Point Estimate for the Population Proportion

When we use our sample to make inferences about the population, we begin, as we did in the case of the population mean, with a point estimate. Not surprisingly, we use the sample proportion $\hat{p}$ as the estimate for the population proportion $p$. So our estimate for the proportion of Mendocino College students who are teenagers is $\frac{27}{95} \approx 0.284$. (We round these proportions to the nearest thousandth, or tenth of a percent: 28.4% in this case.)

The Margin of Error

But, just as with the mean, we have to consider what would happen if we chose a different sample of 95 Mendocino College students.
Probably it would have a different number of teenagers, not exactly 27. So we want to go down from our point estimate $\hat{p}$ and up from it a certain amount, in an attempt to capture the population proportion $p$. We call this amount, just like with the mean, the margin of error, $E$. We use it to generate a confidence interval, which in this case will look like this: $\hat{p} - E < p < \hat{p} + E$. Take careful note of the parameter in the middle: it’s $p$, not $\mu$, which we are estimating here.

The formula we used for $E$ in estimating the mean simply won’t work here. There’s no $\bar{x}$, and no $s$. Before I reveal it, there’s one more symbol you have to understand: $\hat{q}$ (q-hat). It’s the fraction of the data set that is in the other category than the one we used for $\hat{p}$. Here it’s the fraction of the Class Data Base that aren’t teenagers. If 27 out of 95 are teenagers, then $95 - 27 = 68$ aren’t. The formula for $\hat{q}$ is $\hat{q} = 1 - \hat{p}$, or $\hat{q} = \frac{n - x}{n}$. The two formulas are the same because $\frac{n - x}{n} = \frac{n}{n} - \frac{x}{n} = 1 - \hat{p}$. So in this case $\hat{q} = 1 - \frac{27}{95} = \frac{95 - 27}{95} = \frac{68}{95} \approx 0.716$.

So here’s the formula for the margin of error when estimating population proportions: $E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$. The $z_{\alpha/2}$ is the same as we used in estimating the mean:

Confidence Level    $z_{\alpha/2}$
99%                 2.576
95%                 1.960
90%                 1.645

Let’s find the margin of error for estimating the population proportion of students who are teenagers at Mendocino College at the 99% confidence level: $E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} = 2.576 \sqrt{\frac{0.284 \cdot 0.716}{95}} \approx 0.119$.

Constructing Confidence Intervals

We’ve already done most of the work for constructing a confidence interval at the 99% confidence level for the population proportion of students who are teenagers at Mendocino College. Here it is: $\hat{p} - E < p < \hat{p} + E$, so $0.284 - 0.119 < p < 0.284 + 0.119$, or $0.165 < p < 0.403$. We can be 99% confident that the percent of Mendocino College students who are teenagers is between 16.5% and 40.3%.

Let’s have the calculator do the work now, using 1-PropZInt.
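For readers without the calculator at hand, the same interval computation can be sketched in Python. This is only a minimal sketch under stated assumptions, not the calculator's actual routine: the names `prop_z_interval` and `Z_TABLE` are made up here for illustration, and the z values are the rounded ones from the confidence-level table in this lecture.

```python
import math

# Rounded critical values z_{alpha/2}, copied from the lecture's table.
Z_TABLE = {0.99: 2.576, 0.95: 1.960, 0.90: 1.645}


def prop_z_interval(x, n, conf_level):
    """Hypothetical helper: one-proportion z-interval.

    x is the count in the category of interest, n is the sample size.
    Returns (p_hat, E, lower, upper).
    """
    z = Z_TABLE[conf_level]
    p_hat = x / n                          # sample proportion
    q_hat = 1 - p_hat                      # proportion in the other category
    E = z * math.sqrt(p_hat * q_hat / n)   # margin of error
    return p_hat, E, p_hat - E, p_hat + E


# The lecture's example: 27 teenagers in a sample of 95 students.
for level in (0.99, 0.95, 0.90):
    p_hat, E, lower, upper = prop_z_interval(27, 95, level)
    print(f"{level:.0%}: {lower:.3f} < p < {upper:.3f}  (E = {E:.3f})")
```

The sketch keeps the unrounded $\hat{p}$ through the calculation and rounds only when printing, so the last digit may occasionally differ from a hand computation that rounds $\hat{p}$ and $\hat{q}$ first.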
Here are the results for our three levels of confidence:

99%: $0.165 < p < 0.403$
95%: $0.194 < p < 0.375$
90%: $0.208 < p < 0.360$

Notice again how the interval narrows as the level of confidence decreases; we estimate more closely but with less confidence that we’ve actually captured the population proportion.

Looking at Polls

Estimating the population proportion is probably the application of statistics that you are most likely to encounter in everyday life. It shows up all over in the form of opinion polls, in which random samples of people are asked their opinions about candidates and issues. Here’s an article about one such poll from the Rasmussen Reports:

Obama Up Big In California Against Romney, Santorum

President Obama leads both Mitt Romney and Rick Santorum by more than 20 points in California, as nearly six-out-of-ten voters approve of the way he's handling his job. New Rasmussen Reports data shows that if Romney is the Republican nominee, Obama leads 57% to 35%. If Santorum becomes the GOP standard bearer, the president leads 58% to 30%. President Obama leads Romney by 23 points among unaffiliated voters and Santorum by 34 points among the same group.

This California survey of 500 Likely Voters was conducted February 8-16, 2012 by Rasmussen Reports. The margin of sampling error is +/- 4.5 percentage points with a 95% level of confidence.

The important facts are that $n$ was 500 and $\hat{p}$ was 57% for Obama vs. Romney and 58% for Obama vs. Santorum. The confidence level was 95%, and the article says that the margin of sampling error, $E$, was 4.5%. Check it out: $E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} = 1.960 \sqrt{\frac{0.57 \cdot 0.43}{500}} \approx 0.043$. Pretty close!

Here’s another, from the L. A. Times:

Jerry Brown’s approval rating steady, new poll finds.

Despite a punishing economic environment that has whittled away at the popularity of many elected officials, Gov. Jerry Brown’s favorability rating in California has largely held steady, according to a new USC Dornsife/Los Angeles Times poll.
Forty-six percent of voters have a favorable impression of the governor, comparable to findings in April, when 44% favored him, and in July -- right after he signed the state budget -- when the number ticked up to 48%. Brown drew a higher rating from Latinos -- 54% -- in the new poll, conducted from Oct. 30 to Nov 9.

Brown is almost through the first year of his return engagement as governor. He failed to find Republican support for tax increases to balance the state spending plan, instead relying on rosy revenue projections. He also signed the California Dream Act, making illegal immigrant students at public universities eligible for taxpayer-funded scholarships. Last month, he proposed a sweeping overhaul of public pensions that would curtail retirement benefits for all future and some current government workers.

The survey is a bipartisan project by the USC Dornsife College of Letters, Arts and Science and the Los Angeles Times. It was conducted among 1,500 registered California voters by the Democratic firm Greenberg Quinlan Rosner Research and the Republican firm American Viewpoint. The overall margin of sampling error is plus or minus 2.5 percentage points.

Here $n$ was 1,500, and $\hat{p}$ was 46%, which gives 54% for $\hat{q}$. Assuming that the 95% confidence level was used, we get $E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} = 1.960 \sqrt{\frac{0.46 \cdot 0.54}{1{,}500}} \approx 0.025$, confirming the last sentence of the article. The confidence interval for this poll is thus $0.46 - 0.025 < p < 0.46 + 0.025$, or $0.435 < p < 0.485$.

Sample Size

Our goal here is to produce a formula for the sample size, given the desired margin of error, much as we did in the lecture on confidence intervals for the mean. The formula there was $n = \left(\frac{z_{\alpha/2}\, s}{E}\right)^2$. That won’t work for proportions, because we don’t have an $s$, and anyway we developed a different formula for the margin of error for estimating proportions. (Remember, it was rearranging the formula for $E$ in estimating means that produced the formula for $n$.)
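As a reminder of that rearrangement, here is a short worked derivation. It assumes the margin of error for the mean from the previous lecture has the form $E = z_{\alpha/2}\, s / \sqrt{n}$, which is the form consistent with the formula for $n$ mentioned above:

```latex
E = z_{\alpha/2}\,\frac{s}{\sqrt{n}}
\quad\Longrightarrow\quad
\sqrt{n} = \frac{z_{\alpha/2}\, s}{E}
\quad\Longrightarrow\quad
n = \left(\frac{z_{\alpha/2}\, s}{E}\right)^{2}
```

The first step multiplies both sides by $\sqrt{n}$ and divides by $E$; the second squares both sides.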
Perhaps you had a doubt about the formula $n = \left(\frac{z_{\alpha/2}\, s}{E}\right)^2$, namely, that if you were setting out to estimate the mean, how on earth would you already know the standard deviation of the sample whose size you are now determining? This is a valid objection; the books just say that you’d know it from other studies of the variable whose mean you want to estimate. This seems kind of glib. The good news is that in estimating the population proportion, we can do better.

It appears that if we’re going to rearrange the formula $E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$ to solve for $n$, we’d have to know $\hat{p}$ (and thus $\hat{q}$) before we found the appropriate sample size. But let’s take a look at $\hat{p}$ and $\hat{q}$ and, more importantly, their product $\hat{p}\hat{q}$:

$\hat{p}$    $\hat{q}$    $\hat{p}\hat{q}$
0.1          0.9          0.09
0.2          0.8          0.16
0.3          0.7          0.21
0.4          0.6          0.24
0.5          0.5          0.25
0.6          0.4          0.24
0.7          0.3          0.21
0.8          0.2          0.16
0.9          0.1          0.09

You can see that the largest value $\hat{p}\hat{q}$ ever takes on is 0.25, which happens when $\hat{p}$ and $\hat{q}$ are both 0.5 (when the sample is evenly divided between the two categories). So if we just replace $\hat{p}\hat{q}$ by 0.25, we’ll be on the safe side; that is, we’ll be using a sample that might be bigger than necessary for a certain margin of error, but it will never be smaller than necessary.

Time for a little algebra. First replace $\hat{p}\hat{q}$ by 0.25 in the formula, so that $E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$ becomes $E = z_{\alpha/2} \sqrt{\frac{0.25}{n}}$.

Then square both sides: $E^2 = z_{\alpha/2}^2 \cdot \frac{0.25}{n}$.

Then multiply both sides by $n$: $n \cdot E^2 = n \cdot z_{\alpha/2}^2 \cdot \frac{0.25}{n} = 0.25\, z_{\alpha/2}^2$.

Finally, divide both sides by $E^2$: $\frac{n \cdot E^2}{E^2} = \frac{0.25\, z_{\alpha/2}^2}{E^2}$.

So $n = \frac{0.25\, z_{\alpha/2}^2}{E^2}$, or $n = 0.25 \left(\frac{z_{\alpha/2}}{E}\right)^2$.

The way we’ve done this formula, so that it doesn’t depend on a particular $\hat{p}$, means that the sample size for a given confidence level and a given margin of error will be the same no matter what the binomial variable is: for or against Candidate X, male or female, whatever.

How large a sample should we use if we want to estimate a proportion at the 99% confidence level within 4%? $E$ is 0.04, and $z_{\alpha/2}$ is 2.576. So $n = 0.25 \left(\frac{2.576}{0.04}\right)^2 = 1036.84$. Here we would round up anyway, but even if the digit in the tenths’ place were less than five we would still round up, since a smaller sample would not guarantee the desired margin of error. So the sample size we should use is 1,037.

Comparing Estimating the Mean and Estimating the Proportion

Here is a side-by-side comparison of the confidence intervals covered in this lecture and the last one:

                       Estimating the Mean                               Estimating the Proportion
Parameter              $\mu$                                             $p$
Point Estimate         $\bar{x}$                                         $\hat{p}$
Confidence Interval    $\bar{x} - E < \mu < \bar{x} + E$                 $\hat{p} - E < p < \hat{p} + E$
Formula for E          $E = z_{\alpha/2} \frac{s}{\sqrt{n}}$             $E = z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$
Sample Size Formula    $n = \left(\frac{z_{\alpha/2}\, s}{E}\right)^2$   $n = 0.25 \left(\frac{z_{\alpha/2}}{E}\right)^2$

© 2009 by Deborah H. White