Lecture #14: Confidence Intervals for the Proportion
In the last lecture we covered estimating a population mean, $\mu$, from a sample,
first using a point estimate, $\bar{x}$, and then generating an interval, $\bar{x} - E < \mu < \bar{x} + E$, which
we could state with a certain level of confidence contains the population mean. This time
we’ll deal with estimating a different parameter, called the population proportion, p.
When we have a quantitative data set at the interval or ratio level of measurement,
we have a set of numbers for which we can calculate a mean and a standard deviation.
But when we have a binomial variable, like a ‘yes’ or ‘no,’ or a ‘male’ or ‘female,’ the
only thing we can calculate is the fraction of the set that are in one of these categories or
the other.
Whichever of these categories we choose to concentrate on, we can look at a
sample of the data set and calculate the sample proportion for this category. Let’s say
we’re interested in the fraction of Mendocino College students who are teenagers. From
the Class Data Base, we find that 27 out of 95 are teenagers. We call 27, the number in
our category (teenagers), x. As usual, the size of the set is n. The fraction $\frac{x}{n}$, or $\frac{27}{95}$ in
this case, is the sample proportion, and its symbol is p̂ , pronounced ‘p-hat.’ So the
formula for the sample proportion is $\hat{p} = \frac{x}{n}$.
Point Estimate for the Population Proportion
When we use our sample to make inferences about the population, we begin, as
we did in the case of the population mean, with a point estimate. Not surprisingly we use
the sample proportion p̂ as the estimate for the population proportion p. So our estimate
for the proportion of Mendocino College students who are teenagers is $\frac{27}{95} \approx 0.284$. (We
round these proportions to the nearest thousandth, or tenth of a percent – 28.4% in this
case.)
The Margin of Error
But, just as with the mean, we have to consider what would happen if we chose a
different sample of 95 Mendocino College students. Probably it would have a different
number of teenagers, not exactly 27. So we want to go down from our point estimate p̂
and up from it a certain amount, in an attempt to capture the population proportion p. We
call this amount, just like with the mean, the margin of error, E. We use it to generate a
confidence interval, which in this case will look like this: $\hat{p} - E < p < \hat{p} + E$. Take
careful note of the parameter in the middle: it’s p, not $\mu$, which we are estimating here.
The formula we used for E in estimating the mean simply won’t work here.
There’s no $\bar{x}$, and no s. Before I reveal it, there’s one more symbol you have to
understand: q̂ (q-hat). It’s the fraction of the data set in the category other than the
one we used for p̂ . Here it’s the fraction of the Class Data Base that aren’t teenagers. If
27 out of 95 are teenagers, then $95 - 27 = 68$ aren’t. The formula for q̂ is $\hat{q} = 1 - \hat{p}$, or
$\hat{q} = \frac{n - x}{n}$. The two formulas are the same because $\frac{n - x}{n} = \frac{n}{n} - \frac{x}{n} = 1 - \hat{p}$. So in this
case $\hat{q} = 1 - \frac{27}{95} = \frac{95 - 27}{95} = \frac{68}{95} \approx 0.716$.
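As a quick check on the arithmetic, here is a minimal Python sketch of the sample proportion and its complement for the Class Data Base example. The function names are illustrative choices only; they don't come from the lecture or from any calculator.

```python
def sample_proportion(x, n):
    """Return p-hat = x / n, the fraction of the sample in the category of interest."""
    return x / n


def complement_proportion(x, n):
    """Return q-hat = (n - x) / n, the fraction of the sample in the other category."""
    return (n - x) / n


x, n = 27, 95  # teenagers and total students in the Class Data Base sample
p_hat = sample_proportion(x, n)
q_hat = complement_proportion(x, n)

print(round(p_hat, 3))  # 0.284
print(round(q_hat, 3))  # 0.716
```

Computing q-hat from the counts rather than as 1 minus p-hat is just a way of confirming that the two formulas agree; the two proportions always add up to 1.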
So here’s the formula for the margin of error when estimating population
proportions: $E = z_{\alpha/2} \sqrt{\frac{\hat{p} \cdot \hat{q}}{n}}$. The $z_{\alpha/2}$ is the same as we used in estimating the mean:

Confidence Level    $z_{\alpha/2}$
99%                 2.576
95%                 1.960
90%                 1.645
Let’s find the margin of error for estimating the population proportion of students
who are teenagers at Mendocino College at the 99% confidence level:
$E = z_{\alpha/2} \sqrt{\frac{\hat{p} \cdot \hat{q}}{n}} = 2.576 \sqrt{\frac{0.284 \cdot 0.716}{95}} \approx 0.119$.
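The same calculation can be sketched in a few lines of Python. The function name `margin_of_error` is a hypothetical helper, and the inputs are the rounded values used in this lecture.

```python
import math


def margin_of_error(p_hat, n, z):
    """Margin of error E = z * sqrt(p_hat * q_hat / n) for a proportion."""
    q_hat = 1 - p_hat
    return z * math.sqrt(p_hat * q_hat / n)


# 99% confidence level (z = 2.576), p-hat = 0.284, n = 95
E = margin_of_error(0.284, 95, 2.576)
print(round(E, 3))  # 0.119
```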
Constructing Confidence Intervals
We’ve already done most of the work for constructing a confidence interval at the
99% confidence level for the population proportion of students who are teenagers at
Mendocino College. Here it is:
$\hat{p} - E < p < \hat{p} + E$ becomes $0.284 - 0.119 < p < 0.284 + 0.119$, or $0.165 < p < 0.403$. We can
be 99% confident that the percent of Mendocino College students who are teenagers is
between 16.5% and 40.3%.
Let’s have the calculator do the work now, using 1-PropZInt. Here are the results
for our three levels of confidence:
99%: $0.165 < p < 0.403$
95%: $0.194 < p < 0.375$
90%: $0.208 < p < 0.360$
Notice again how the interval narrows as the level of confidence decreases; we estimate
more closely but with less confidence that we’ve actually captured the population
proportion.
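If you don’t have the calculator handy, the three intervals can be reproduced with a short Python sketch. This is only an illustration of the formula from this lecture, not a description of how 1-PropZInt works internally. It assumes the unrounded p̂ and a z value taken from the standard library’s `statistics.NormalDist` rather than from the table, and `proportion_interval` is a made-up name.

```python
import math
from statistics import NormalDist


def proportion_interval(x, n, confidence):
    """Return (low, high) for p-hat - E < p < p-hat + E at the given confidence level."""
    p_hat = x / n
    # Critical value z_{alpha/2}: the point with area alpha/2 in the upper tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    E = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - E, p_hat + E


for level in (0.99, 0.95, 0.90):
    low, high = proportion_interval(27, 95, level)
    print(f"{level:.0%}: {low:.3f} < p < {high:.3f}")
```

Running it makes the narrowing easy to see: each drop in confidence level pulls both endpoints in toward p̂.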
Looking at Polls
Estimating the population proportion is probably the application of statistics that
you are most likely to encounter in everyday life. It shows up all over in the form of
opinion polls, in which random samples of people are asked their opinions about
candidates and issues. Here’s an article about one such poll from the Rasmussen
Reports:
Obama Up Big In California Against Romney, Santorum
President Obama leads both Mitt Romney and Rick Santorum by more than 20
points in California, as nearly six-out-of-ten voters approve of the way he's
handling his job.
New Rasmussen Reports data shows that if Romney is the Republican nominee,
Obama leads 57% to 35%.
If Santorum becomes the GOP standard bearer, the president leads 58% to 30%.
President Obama leads Romney by 23 points among unaffiliated voters and
Santorum by 34 points among the same group.
This California survey of 500 Likely Voters was conducted February 8-16,
2012 by Rasmussen Reports. The margin of sampling error is +/- 4.5 percentage
points with a 95% level of confidence.
The important facts are that n was 500 and p̂ was 57% for Obama vs. Romney and 58%
for Obama vs. Santorum. The confidence level was 95%, and the article says that the
margin of sampling error, E, was 4.5%. Check it out:
$E = z_{\alpha/2} \sqrt{\frac{\hat{p} \cdot \hat{q}}{n}} = 1.960 \sqrt{\frac{0.57 \cdot 0.43}{500}} \approx 0.043$.
Pretty close!
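Here is a small Python sketch that runs the same check for both matchups in the article. The third row, with p̂ set to 0.50, is an addition for comparison only: it shows the largest margin this formula can give for n = 500. The article doesn’t say how Rasmussen arrived at its figure, so none of this should be read as their method.

```python
import math


def margin_of_error(p_hat, n, z):
    """Margin of error E = z * sqrt(p_hat * (1 - p_hat) / n)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


n, z = 500, 1.960  # sample size and 95% critical value from the article
cases = [
    ("Obama vs. Romney", 0.57),
    ("Obama vs. Santorum", 0.58),
    ("evenly split sample", 0.50),  # comparison case, not from the article
]
for label, p_hat in cases:
    print(f"{label}: E = {margin_of_error(p_hat, n, z):.3f}")
```

All three values come out between 0.043 and 0.044, a little under the reported 4.5 percentage points.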
Here’s another, from the L. A. Times:
Jerry Brown’s approval rating steady, new poll finds.
Despite a punishing economic environment that has whittled away at the
popularity of many elected officials, Gov. Jerry Brown’s favorability rating in
California has largely held steady, according to a new USC Dornsife/Los Angeles
Times poll.
Forty-six percent of voters have a favorable impression of the governor,
comparable to findings in April, when 44% favored him, and in July -- right after
he signed the state budget -- when the number ticked up to 48%.
Brown drew a higher rating from Latinos -– 54% -- in the new poll, conducted
from Oct. 30 to Nov 9.
Brown is almost through the first year of his return engagement as governor. He
failed to find Republican support for tax increases to balance the state spending
plan, instead relying on rosy revenue projections. He also signed the California
Dream Act, making illegal immigrant students at public universities eligible for
taxpayer-funded scholarships. Last month, he proposed a sweeping overhaul of
public pensions that would curtail retirement benefits for all future and some
current government workers.
The survey is a bipartisan project by the USC Dornsife College of Letters, Arts
and Science and the Los Angeles Times. It was conducted among 1,500 registered
California voters by the Democratic firm Greenberg Quinlan Rosner Research and
the Republican firm American Viewpoint. The overall margin of sampling error is
plus or minus 2.5 percentage points.
Here n was 1,500, and p̂ was 46%, which gives 54% for q̂ . Assuming that the 95%
confidence level was used we get $E = z_{\alpha/2} \sqrt{\frac{\hat{p} \cdot \hat{q}}{n}} = 1.960 \sqrt{\frac{0.46 \cdot 0.54}{1500}} \approx 0.025$,
confirming the last sentence of the article. The confidence interval for this poll is thus
$0.46 - 0.025 < p < 0.46 + 0.025$, or $0.435 < p < 0.485$.
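Since the article never states its confidence level, one way to test the 95% assumption is to run the formula backwards: solve for z from the reported margin, then ask what confidence level that z corresponds to. The sketch below does this with the standard library’s `statistics.NormalDist`; the function name `implied_confidence` is made up for illustration, and the reported margin of 2.5 points is itself rounded, so the answer is only approximate.

```python
import math
from statistics import NormalDist


def implied_confidence(p_hat, n, E):
    """Confidence level whose critical value z reproduces the margin of error E."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    z = E / standard_error
    # Area between -z and +z under the standard normal curve.
    return 2 * NormalDist().cdf(z) - 1


level = implied_confidence(0.46, 1500, 0.025)
print(f"{level:.3f}")
```

The result lands very close to 0.95, which supports the assumption made above.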
Sample Size
Our goal here is to produce a formula for the sample size, given the desired
margin of error, much as we did in the lecture on confidence intervals for the mean. The
formula there was $n = \left( \frac{z_{\alpha/2} \cdot s}{E} \right)^2$. That won’t work for proportions, because we don’t
have an s, and anyway we developed a different formula for the margin of error for
estimating proportions. (Remember, it was rearranging the formula for E in estimating
means that produced the formula for n.)
Perhaps you had a doubt about the formula $n = \left( \frac{z_{\alpha/2} \cdot s}{E} \right)^2$, namely, that if you
were setting out to estimate the mean, how on earth would you already know the standard
deviation of the sample whose size you are now determining? This is a valid objection;
the books just say that you’d know it from other studies of the variable whose mean you
want to estimate. This seems kind of glib.
The good news is that in estimating the population proportion, we can do better.
It appears that if we’re going to rearrange the formula $E = z_{\alpha/2} \sqrt{\frac{\hat{p} \cdot \hat{q}}{n}}$ to solve for n, we’d
have to know p̂ (and thus q̂ ) before we found the appropriate sample size. But let’s take
a look at p̂ and q̂ and more importantly their product $\hat{p} \cdot \hat{q}$:

p̂      q̂      p̂ · q̂
0.1    0.9    0.09
0.2    0.8    0.16
0.3    0.7    0.21
0.4    0.6    0.24
0.5    0.5    0.25
0.6    0.4    0.24
0.7    0.3    0.21
0.8    0.2    0.16
0.9    0.1    0.09
You can see that the largest value $\hat{p} \cdot \hat{q}$ ever takes on is 0.25, which happens when p̂ and
q̂ are both 0.5 (when the sample is evenly divided between the two categories). So if we
just replace $\hat{p} \cdot \hat{q}$ by 0.25, we’ll be on the safe side – that is, we’ll be using a sample that
might be bigger than necessary for a certain margin of error, but it will never be smaller
than necessary.
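The table only samples p̂ in steps of 0.1, so it’s worth one line of algebra to see that 0.25 really is the ceiling for every possible p̂, not just the ones listed. Completing the square (a step that is not part of the original lecture) gives:

```latex
\hat{p} \cdot \hat{q} = \hat{p}\,(1 - \hat{p}) = \hat{p} - \hat{p}^2
  = \frac{1}{4} - \left( \hat{p} - \frac{1}{2} \right)^2 \le \frac{1}{4}
```

The squared term is never negative, so the product can’t exceed 0.25, and it equals 0.25 exactly when p̂ is 0.5.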
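Time for a little algebra. First replace $\hat{p} \cdot \hat{q}$ by 0.25 in the formula
$E = z_{\alpha/2} \sqrt{\frac{\hat{p} \cdot \hat{q}}{n}}$: $E = z_{\alpha/2} \sqrt{\frac{0.25}{n}}$.
Then square both sides: $E^2 = z_{\alpha/2}^2 \cdot \frac{0.25}{n}$.
Then multiply both sides by n: $n \cdot E^2 = n \cdot z_{\alpha/2}^2 \cdot \frac{0.25}{n} = 0.25 \, z_{\alpha/2}^2$.
Finally, divide both sides by $E^2$: $\frac{n \cdot E^2}{E^2} = \frac{0.25 \, z_{\alpha/2}^2}{E^2}$.
So $n = \frac{0.25 \, z_{\alpha/2}^2}{E^2}$, or $n = 0.25 \left( \frac{z_{\alpha/2}}{E} \right)^2$.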
The way we’ve done this formula, so that it doesn’t depend on a particular p̂ ,
means that the sample size for a given confidence level and a given margin of error will
be the same no matter what the binomial variable is – for or against Candidate X, male or
female, whatever.
How large a sample should we use if we want to estimate a proportion at the 99%
confidence level within 4%? E is 0.04, and $z_{\alpha/2}$ is 2.576. So
$n = 0.25 \left( \frac{2.576}{0.04} \right)^2 = 1036.84$. Here we would round up anyway, but even if the digit in
the tenths’ place were less than five we would still round up. So the sample size we
should use is 1,037.
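The formula and the always-round-up rule fit naturally into a short Python sketch. The name `sample_size` is a hypothetical helper, and the second call (95% confidence, within 3%) is an extra illustration that is not worked in the lecture.

```python
import math


def sample_size(E, z):
    """Conservative sample size n = 0.25 * (z / E)**2, always rounded up."""
    return math.ceil(0.25 * (z / E) ** 2)


print(sample_size(0.04, 2.576))  # 1037  (99% confidence, within 4%)
print(sample_size(0.03, 1.960))  # 1068  (95% confidence, within 3%)
```

Using `math.ceil` rather than ordinary rounding is what enforces the rule: any fractional part, however small, bumps the sample size up to the next whole number.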
Comparing Estimating the Mean and Estimating the Proportion
Here is a side-by-side comparison of the confidence intervals covered in this
lecture and the last one:
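                      Estimating the Mean                                     Estimating the Proportion
Parameter             $\mu$                                                   p
Point Estimate        $\bar{x}$                                               p̂
Confidence Interval   $\bar{x} - E < \mu < \bar{x} + E$                       $\hat{p} - E < p < \hat{p} + E$
Formula for E         $E = z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$              $E = z_{\alpha/2} \sqrt{\frac{\hat{p} \cdot \hat{q}}{n}}$
Sample Size Formula   $n = \left( \frac{z_{\alpha/2} \cdot s}{E} \right)^2$    $n = 0.25 \left( \frac{z_{\alpha/2}}{E} \right)^2$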
© 2009 by Deborah H. White