Statistics
16.5_percent.pdf
Michael Hallstone, Ph.D.
[email protected]
Lecture 16.5: Confidence Interval Estimates of Percentages (or
Proportions)
Some Common Sense Assumptions for Interval Estimates of a Percentage
or Proportion
• We can figure out the percentage of the sample that fits a desired characteristic. (A discrete
variable that is nominal or ordinal in scale is best.) (Hint for exam: no student project
should ever violate this nor have to assume it. Your data set will have this sort of
variable. Thus do not put this assumption in your exam answer.)
• The data come from a random sample. (Hint for exam: all student projects violate this
assumption. So you write “I violate random sample” in your exam answer.)
• Both things must be true so we can use the Z distribution:
o n x p >= 500 (n = sample size and p = percentage of sample meeting the characteristic)
and
o n(100-p) >= 500
(Hint: for our take-home exams we will pretend this is true, even if it’s not, and you do
not have to mention it.)
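As a rough illustration of the last assumption, the two conditions can be checked in a few lines of Python. This is only a sketch: the function name `can_use_z` and the sample values are made up for illustration, and p is entered as a percentage (0 to 100), as in this lecture.

```python
def can_use_z(n: int, p: float, threshold: float = 500.0) -> bool:
    """Return True when both n*p and n*(100-p) reach the threshold.

    n is the sample size; p is the sample percentage on a 0-100 scale.
    """
    return n * p >= threshold and n * (100 - p) >= threshold


# Hypothetical samples, for illustration only.
print(can_use_z(1000, 50))  # 50,000 and 50,000: both conditions hold
print(can_use_z(20, 10))    # 200 and 1,800: the first condition fails
```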
You must understand the Central Limit Theorem in Lecture 14 & 15 and
Lecture 16 to understand this lecture
The logic behind this technique is dependent upon the logic of the Central Limit Theorem from
lecture 14 & 15 and the technique of creating confidence intervals of the mean we learned in lecture
16. If you do not understand those, the following will make no sense.
Wondering about the percentage of a population that fits some
characteristic
We may want to know what percentage of a population meets a certain characteristic. The classic
example of this is political polls, but we can use the technique for other things as well. Below are just
some ideas.
• What percentage of the population of registered voters plan on voting for a candidate?
• What percentage of a population has ever been convicted of a crime?
• What percentage of a population has a disaster kit at home?
• What percentage of a population has diabetes?
When we think about it a bit deeper all of these questions really wonder about characteristics of a
population.
Goal for this lecture: estimate the population percentage
The whole point of statistics is to estimate some characteristic of the population based upon data from
a sample. (The population is what we are interested in studying. If a population was made up of
people, we can’t possibly speak to all of them so we speak to a portion of the population called a
sample. We hope the sample is representative of the population.) Remember all the way back to the
first lecture in this class and that funny circle diagram?
In lecture 16 we learned a technique called “confidence interval estimates” or “interval estimates” to
provide an estimate of the population mean using an interval. In plain English, we came up with a
spread of values (an interval) that estimated the population mean. So for example we made an
estimate like this: I am 95% confident that the population mean is between 25 years of age and 29
years of age.
We will do the same thing for percentages of a population. So we will be able to say something like
“we are 95% confident that the percentage of the population of voters who plan on voting for
candidate X is between 47% and 53%.”
The Logic Comes From the Central Limit Theorem
Using the logic of the central limit theorem we have seen that the sample mean is a pretty good
estimator of the population mean. The chance of drawing a sample mean that is “close” to the
population mean is pretty good. Recall the following example modified from the lecture on the
Central Limit Theorem:
We do not know which sample mean we will draw but (if the sampling distribution of means is
normally distributed) the probability of our sample mean falling ±1 standard deviations from the “true
or real” population mean is 68.26%. Or the probability of our sample mean falling ± 2 standard
deviations from the mean is 95.44%. Remember the pictures from z-scores?
source: http://www.sci.sdsu.edu/class/psychology/psy271/Weeks/psy271week06.htm
Here is another picture of the same basic idea:
Well the same logic applies to percentages!
To keep it simple what we learned about the probabilities of sample means being close to the
population mean applies to percentages.
Instead of the sampling distribution of means we have the sampling distribution of percentages and
the same things apply. The population percentage falls in the middle of the normal curve with the
sample percentages distributed in a normal distribution. The same percentages and probabilities
from the z table apply when both of the following are true:
o n x p >=500
o n(100-p)>=500
(n= sample size and p= percentage of sample meeting the characteristic)
In this class we will not cover how to perform this statistical technique when both of the above are not
true.
I won’t repeat it all here but…the probabilities from the Central Limit
Theorem and Lecture 16 remain the same
Recall that 68.26% of all the sample percentages will fall no further away from the middle of the curve
[population percentage] than plus or minus 1 standard deviation, that 95.44% of all the sample
percentages will fall no further away from the middle of the curve [population percentage] than plus or
minus 2 standard deviations, and exactly 95% of all sample percentages will fall no further away
from the middle of the curve [population percentage] than plus or minus 1.96 standard deviations,
etc., etc., etc.
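For readers who want to check these z-table figures, here is a small sketch using Python's standard-library `statistics.NormalDist`. The helper name `share_within` is an assumption made for illustration. The exact normal-curve values come out as 68.27% and 95.45%; the 68.26% and 95.44% quoted above are the usual figures obtained from a printed z table rounded to four decimals.

```python
from statistics import NormalDist


def share_within(z: float) -> float:
    """Percentage of a normal curve lying within plus or minus z standard deviations."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return 100 * (std_normal.cdf(z) - std_normal.cdf(-z))


for z in (1.0, 1.96, 2.0):
    print(f"within plus or minus {z} standard deviations: {share_within(z):.2f}%")
```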
Formula for Confidence Interval Estimates of the Population Percentage
Generic formula for confidence interval estimates of the population percentage
$$p - z\,\hat{\sigma}_p < \pi < p + z\,\hat{\sigma}_p$$
where π = population percentage and
$$\hat{\sigma}_p = \sqrt{\frac{p(100 - p)}{n}}$$
or, in words,
$$p - z\,\hat{\sigma}_p < \text{population percentage} < p + z\,\hat{\sigma}_p$$
Notice how this formula is essentially “adding” and “subtracting” something from the sample
percentage and the population percentage is in the middle.
The z is the number of standard deviations “of rope” you add and subtract and corresponds to your
desired level of confidence.
The $\hat{\sigma}_p$ is a measure of the standard deviation of the sampling distribution of percentages.
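The generic formula can be sketched in Python as follows. This is a minimal illustration, not part of the original lecture: the function name `percentage_interval` is made up, p is a percentage on a 0 to 100 scale, and z defaults to 1.96 for 95% confidence.

```python
from math import sqrt


def percentage_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Confidence interval for a population percentage.

    p is the sample percentage (0-100), n the sample size,
    z the z score for the desired confidence level.
    """
    standard_error = sqrt(p * (100 - p) / n)  # sigma-hat sub p
    margin = z * standard_error               # the "rope" added and subtracted
    return p - margin, p + margin


# Hypothetical sample: 50% of 1,000 respondents.
low, high = percentage_interval(50, 1000)
print(f"{low:.1f} < population percentage < {high:.1f}")
```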
Examples of estimating population percentages
Let’s plug some numbers into the generic formula to get the hang of it.
A political party wants to know what percentage of the population of likely voters will vote for
their candidate
During election season politicians and political parties want to know what percentage of voters are
going to vote a certain way. The funny thing about political polls is representative samples of the
population of “registered voters” are not as good as representative samples of the population of
“registered voters who are likely to vote” because many registered voters do not vote, especially
during non-Presidential election cycles. Thus pollsters are interested in a population often called
“likely voters.”
Pretend we work for a political party and want to know what percentage of the population of likely voters
will vote for our candidate. We want to construct a 95% confidence interval of the percentage of the
population.
We have a representative random sample of the population and have the following information:
n=1,000 and 500 of those people say they will vote for our candidate.
500/1,000 = .5
and .5x100=50% so p=50%
Thus 50% of the people in the survey said they would vote for our candidate
In order to use the z table both of the following must be true:
o n x p >= 500 [1,000 x 50 = 50,000]
o n(100-p) >= 500 [1,000(100-50) = 1,000(50) = 50,000]
Both equal 50,000, which is greater than 500, so we can use the z table.
$$p - z\,\hat{\sigma}_p < \pi < p + z\,\hat{\sigma}_p$$
where π = population percentage and
$$\hat{\sigma}_p = \sqrt{\frac{p(100 - p)}{n}} = \sqrt{\frac{50(100 - 50)}{1{,}000}} = \sqrt{\frac{50(50)}{1{,}000}} = \sqrt{\frac{2{,}500}{1{,}000}} = \sqrt{2.5} = 1.58$$
50 - 1.96(1.58) < π < 50 + 1.96(1.58)
50 – 3.1 < π < 50 + 3.1
46.9 < π < 53.1 or rounded to 47% < π < 53%
What does that mean in plain English?
We are 95% confident that the population percentage is within these values. In plain English relating
the numbers to the question, “We are 95% confident that the percentage of the population of likely
voters who will vote for our candidate is between 47 and 53%.”
The margin of error was plus or minus 3%
Note in the above example we subtracted and added 3.1 from our sample percentage. That is the
“margin of error,” which would most likely be rounded from 3.1 to 3% in the real world. So a way to
understand the margin of error in the problem above is to say the percentage of likely voters who plan
on voting for our candidate is 50%, plus or minus 3%. That’s how we came up with a spread of 47
and 53%!
Notice if we increase the level of confidence the interval gets larger or more spread out
Let’s do a 99% confidence interval where everything else remains the same. The only thing that
changes is the z score.
z(99%)=2.575
$$p - z\,\hat{\sigma}_p < \pi < p + z\,\hat{\sigma}_p$$
50 – 2.575(1.58) < π < 50 + 2.575(1.58)
50 – 4.1 < π < 50 + 4.1
45.9 < π < 54.1 or rounded to 46% < π < 54%
We are 99% confident that the population percentage is within these values. In plain English relating
the numbers to the question, “We are 99% confident that the percentage of the population of likely
voters who will vote for our candidate is between 46 and 54%.”
Again, the whole point of doing two confidence intervals in this example was to show that
when you increase the confidence level, the z-score gets larger and thus the interval estimate
has a larger spread of numbers.
• 95%: between 47% and 53%
• 99%: between 46% and 54%
So as you increase confidence that your interval includes the population percentage, your interval
becomes a greater spread of numbers. You can see how even with more confidence [99% vs. 95%], the
spread of values is wider and thus is less precise.
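The link between the confidence level and the z score can also be sketched in code. The snippet below is an illustration under stated assumptions: it looks up z with the standard library's `NormalDist().inv_cdf` rather than a printed z table (so 99% gives about 2.576 instead of the table value 2.575 used above), and the helper name `z_for_confidence` is made up.

```python
from math import sqrt
from statistics import NormalDist


def z_for_confidence(confidence_pct: float) -> float:
    """Two-sided z score for a confidence level given in percent (e.g. 95)."""
    tail = (1 - confidence_pct / 100) / 2  # area left in each tail
    return NormalDist().inv_cdf(1 - tail)


# Voter example from the text: p = 50%, n = 1,000.
p, n = 50, 1000
standard_error = sqrt(p * (100 - p) / n)

for level in (90, 95, 99):
    z = z_for_confidence(level)
    margin = z * standard_error
    print(f"{level}%: z = {z:.3f}, interval {p - margin:.1f} to {p + margin:.1f}")
```

Each step up in confidence produces a larger z and therefore a wider interval, which is the trade-off described above.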
The relationship between the sample size, accuracy, and cost
Below I have a series of problems where I change the sample size. You will see that as the sample
size gets larger, the estimate gets better; more precisely the margin of error gets smaller. So you will
see how that happens. But I will show you how cost of increasing sample size may lead to the
problem of “diminishing returns” on the extra cost of increasing the size of the sample.
What percent of pharmacy clients served by a public health clinic fall below federal poverty line?
A health care administrator receives a new federal grant to help pay for pharmacy co-payments at a
rural health clinic. Only those who fall below the federal poverty line qualify. She needs to figure out
the percentage of the pharmacy’s clients that fall below the federal poverty line.
p=55 n=50.
Construct a 95% confidence interval estimate of the population percentage falling below the
federal poverty line, say what the numbers mean in plain English, and report the margin of
error.
Answer:
95% confidence interval, pharmacy poverty example (n=50)
p: 55
n: 50
n*p >= 500? 2,750
n(100-p) >= 500? 2,250
z score: 1.96
σp (SE): 7.04
(z)(SE): 13.79
p - (z)(SE) < pop % < p + (z)(SE)
55 - 13.79 < pop % < 55 + 13.79
41.21 < pop % < 68.79
The administrator can be 95% confident that the percentage of the population of the pharmacy’s clients
that fall below the federal poverty line is between 41 and 69%. The estimate is 55% “plus or
minus 14%,” so the margin of error is 14%.
What percent of pharmacy clients served by a public health clinic fall below federal poverty line?
[Let’s see what happens to the margin of error when we increase the sample size from 50 to 500.]
A health care administrator receives a new federal grant to help pay for pharmacy co-payments at a
rural health clinic. Only those who fall below the federal poverty line qualify. She needs to figure out
the percentage of the pharmacy’s clients that fall below the federal poverty line.
p=55 n=500
Construct a 95% confidence interval estimate of the population percentage falling below the
federal poverty line, say what the numbers mean in plain English, and report the margin of
error.
Answer:
95% confidence interval, pharmacy poverty example (n=500)
p: 55
n: 500
n*p >= 500? 27,500
n(100-p) >= 500? 22,500
z score: 1.96
σp (SE): 2.22
(z)(SE): 4.36
p - (z)(SE) < pop % < p + (z)(SE)
55 - 4.36 < pop % < 55 + 4.36
50.64 < pop % < 59.36
The administrator can be 95% confident that the percentage of the population of the pharmacy’s clients
that fall below the federal poverty line is between 51 and 59%. The estimate is 55% “plus or
minus 4%.” See how the margin of error got smaller? It went from plus or minus 14% to 4%.
What percent of pharmacy clients served by a public health clinic fall below federal poverty line?
[Let’s see what happens to the margin of error when we increase the sample size even more.]
So when we want representative samples in the real world we need sample sizes to be about
n=1000. Let’s see the sampling error go down even more when we do this.
A health care administrator receives a new federal grant to help pay for pharmacy co-payments at a
rural health clinic. Only those who fall below the federal poverty line qualify. She needs to figure out
the percentage of the pharmacy’s clients that fall below the federal poverty line.
p=55 n=1000
Construct a 95% confidence interval estimate of the population percentage falling below the
federal poverty line, say what the numbers mean in plain English, and report the margin of
error.
Answer:
95% confidence interval, pharmacy poverty example (n=1000)
p: 55
n: 1000
n*p >= 500? 55,000
n(100-p) >= 500? 45,000
z score: 1.96
σp (SE): 1.57
(z)(SE): 3.08
p - (z)(SE) < pop % < p + (z)(SE)
55 - 3.08 < pop % < 55 + 3.08
51.92 < pop % < 58.08
The administrator can be 95% confident that the percentage of the population of the pharmacy’s clients
that fall below the federal poverty line is between 52 and 58%. The estimate is 55% “plus or
minus 3%.” See how the margin of error got smaller? It went from plus or minus 14% to 4% to
3%. You have a much more precise estimate now. Plus or minus 3% is pretty standard for
most “good” polls of a population with a sample size of about 1,000.
What percent of pharmacy clients served by a public health clinic fall below federal poverty
line? [Let’s see what happens to the margin of error when we increase the sample size even
more. We will double the size of our sample from 1,000 to 2,000 and see an example of
“diminishing returns.”]
So when we want representative samples in the real world we need to balance costs and
benefits. For example, most public opinion studies are done by calling people on the phone.
As of 2016, you could have computers randomly dial home phones or landlines, but you had
to pay for a real person to dial the number of a cell or mobile phone.
By spending more money to increase our sample size we will have more accurate results, but
it will cost more. In this example we double our sample size from 1,000 to 2,000, which
means at least doubling the cost of collecting the data. How much improvement do we get for
doing a study that costs twice as much money?
A health care administrator receives a new federal grant to help pay for pharmacy co-payments at a
rural health clinic. Only those who fall below the federal poverty line qualify. She needs to figure out
the percentage of the pharmacy’s clients that fall below the federal poverty line.
p=55 n=2000
Construct a 95% confidence interval estimate of the population percentage falling below the
federal poverty line, say what the numbers mean in plain English, and report the margin of
error.
Answer:
95% confidence interval, pharmacy poverty example (n=2000)
p: 55
n: 2000
n*p >= 500? 110,000
n(100-p) >= 500? 90,000
z score: 1.96
σp (SE): 1.11
(z)(SE): 2.18
p - (z)(SE) < pop % < p + (z)(SE)
55 - 2.18 < pop % < 55 + 2.18
52.82 < pop % < 57.18
The administrator can be 95% confident that the percentage of the population of the pharmacy’s clients
that fall below the federal poverty line is between 53 and 57%. The estimate is 55% “plus or
minus 2%.” See how the margin of error got smaller? When n=1000 it was plus or minus 3%.
And when we doubled the size of our sample [and our costs] it only went down about 1%.
When n=2,000 it is plus or minus about 2%. It was really less than a whole percentage
reduction if you look at the decimals (3.08 versus 2.18). So would doubling the cost of your
study justify reducing your margin of error less than 1%? It’s up to the person or agency
funding the study!
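The four pharmacy examples can be recomputed in one short loop to see the diminishing returns side by side. This is only a sketch: the function name `margin_of_error` is made up for illustration, and the numbers (p = 55, z = 1.96, and the four sample sizes) are the ones used in the examples above.

```python
from math import sqrt


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Margin of error (in percentage points) for a sample percentage p (0-100)."""
    return z * sqrt(p * (100 - p) / n)


previous = None
for n in (50, 500, 1000, 2000):
    moe = margin_of_error(55, n)
    change = "" if previous is None else f" (down {previous - moe:.2f} points)"
    print(f"n={n:>5}: plus or minus {moe:.2f}%{change}")
    previous = moe
```

Because n sits under a square root, the margin of error shrinks in proportion to 1/√n: cutting it in half takes roughly four times the sample, not twice.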
Practice
Practice problems for this lecture can be found in lecture16.5b_practice.pdf
SPSS – slightly inaccurate but close enough if we trick the program
A text lecture to create SPSS output can be found in lecture16.5c_SPSS.pdf along with a link to a
YouTube video.
Astoundingly SPSS does not offer a command to do this very basic statistical technique. However,
we can “trick it” and come pretty close using the SPSS command for interval estimates of the
population mean if our sample size is sufficiently large. [Smaller sample sizes lead SPSS to be less
precise, as the formula for the standard error used by SPSS is not technically correct for a
percentage, although it’s pretty close when the sample sizes are large.]
95% confidence interval estimate of population percentage of Democrats using SPSS
According to Pew research, in Hawaii approximately 53% of registered voters are registered
Democrats, 31% Republicans and 16% other. So I created a variable with n=100 where n=53 were
Democrats, n=31 were Republicans, and n=16 other. Then I recoded that into a dichotomous
variable where n=53 (or 53%) were Democrats and n=47 (or 47%) were Republicans or other. We
tricked SPSS by recoding this variable as 100=Democrat and 0=Republican or other. We could also
code it 1=Democrat and 0=Republican or other, but the version coded 100=Democrat allows us to
avoid converting decimals to percentages. The key is you want your reference category coded 1
or 100 and the other categories in which you are not interested coded 0.
See these short YouTube videos I created that are linked in our course schedule to see how to
recode: SPSS_recode1.mov and SPSS_recode2.mov.
Below is the SPSS output for a one sample t test:
One-Sample Statistics
Variable: Democrats=100%
N: 100
Mean: 53.0000
Std. Deviation: 50.16136
Std. Error Mean: 5.01614

One-Sample Test (Test Value = 0)
Variable: Democrats=100%
t: 10.566
df: 99
Sig. (2-tailed): .000
Mean Difference: 53.00000
95% Confidence Interval of the Difference: Lower 43.0469, Upper 62.9531
The answer is in the “95% Confidence Interval of the Difference” box (highlighted in green in the
original output). We can round 43.05 and 62.95% to 43 and 63%.
So we can be 95% confident that the percentage of the population of voters planning on voting for the
Democrat is between 43 and 63%. Thus the margin of error is approximately plus or minus 10%.
SPSS uses t value and slightly different formula for standard error
You will note that SPSS’s formula for the standard error of the mean above
$$\hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{50.16136}{\sqrt{100}} = \frac{50.16136}{10} = 5.01614$$
is different from what we calculate below using the standard error of the percentage
$$\hat{\sigma}_p = \sqrt{\frac{53(100 - 53)}{100}} = \sqrt{\frac{53(47)}{100}} = \sqrt{\frac{2{,}491}{100}} = \sqrt{24.91} = 4.99$$
and thus ends up with a slightly different value.
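To see where the small gap between the two standard errors comes from, the sketch below rebuilds the recoded variable (53 values of 100 and 47 values of 0) and computes both. It is an illustration, not SPSS itself: it assumes the standard deviation SPSS reports is the usual sample standard deviation with n-1 in the denominator, which is what Python's `statistics.stdev` computes, and the variable names are made up.

```python
from math import sqrt
from statistics import stdev

# The recoded variable from the text: 100 = Democrat, 0 = Republican or other.
responses = [100] * 53 + [0] * 47
n = len(responses)
p = sum(responses) / n  # sample percentage, 53.0

se_mean = stdev(responses) / sqrt(n)           # s / sqrt(n), the standard error of the mean
se_pct = sqrt(p * (100 - p) / n)               # standard error of the percentage
se_n_minus_1 = sqrt(p * (100 - p) / (n - 1))   # same formula with n-1 in place of n

print(f"s / sqrt(n)           = {se_mean:.5f}")
print(f"sqrt(p(100-p)/n)      = {se_pct:.5f}")
print(f"sqrt(p(100-p)/(n-1))  = {se_n_minus_1:.5f}")
```

For a 0/100 variable the standard error of the mean works out to the percentage formula with n-1 in place of n, which is why the two values converge as the sample size grows.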
Recall z is really part of t and SPSS uses t and df=n-1 or 100-1=99 above
Recall that the t distribution is not a single curve but a family of curves whose shapes differ based
upon the sample size. As the sample size grows larger [and approaches infinity] the t curve
approaches the normal curve and we just start using the z table. But technically speaking there is a
different curve for each sample size. Above SPSS uses the t table and df=99 for a t value of
approximately 1.9842.
Slightly different t value and standard error formulas account for rounding error in math below
So below I show you the math by hand for our Confidence Interval Estimate of the Population
Percentage and it comes up with a slightly different value than SPSS. They are very close. The
reasons are we use slightly different numbers from z table and the standard error of the percentage.
Calculating it by hand
$$p - z\,\hat{\sigma}_p < \pi < p + z\,\hat{\sigma}_p$$
where π = population percentage and
$$\hat{\sigma}_p = \sqrt{\frac{p(100 - p)}{n}}$$
For our variable 53 out of 100 people planned to vote Democrat.
$$\hat{\sigma}_p = \sqrt{\frac{53(100 - 53)}{100}} = \sqrt{\frac{53(47)}{100}} = \sqrt{\frac{2{,}491}{100}} = \sqrt{24.91} = 4.99$$
53 - 1.96(4.99) < π < 53 + 1.96(4.99)
53 – 9.78 < π < 53 + 9.78
43.22 < π < 62.78
Thus the margin of error is plus or minus 9.78% or approximately 10%.
Rounded, 43% < π < 63%, so it matches SPSS with rounding (see the SPSS output below, highlighted in
green in the original).
One-Sample Test (Test Value = 0)
Variable: Democrats=100%
t: 10.566
df: 99
Sig. (2-tailed): .000
Mean Difference: 53.00000
95% Confidence Interval of the Difference: Lower 43.0469, Upper 62.9531
For the Type A personalities here is what SPSS did, which accounts for the rounding errors
The t value for df=99 and .025 in each tail is 1.9842, and the standard error of the mean above is
$$\hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{50.16136}{\sqrt{100}} = \frac{50.16136}{10} = 5.01614$$
(see the One-Sample Statistics output, highlighted in green in the original):
One-Sample Statistics
Variable: Democrats=100%
N: 100
Mean: 53.0000
Std. Deviation: 50.16136
Std. Error Mean: 5.01614
So SPSS does the following:
$$\bar{x} - t\,\hat{\sigma}_{\bar{x}} < \mu < \bar{x} + t\,\hat{\sigma}_{\bar{x}}$$
53 - 1.9842(5.01614) < µ < 53 + 1.9842(5.01614)
53 - 9.953 < µ < 53 + 9.953
43.05 < µ < 62.95
Thus using the SPSS formula for standard error of the mean [which is technically incorrect for a
percentage] the margin of error is plus or minus 9.95 or approximately 10%.
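The SPSS interval can be reproduced outside SPSS with a few lines of Python. This is a sketch under stated assumptions: the t value 1.9842 is copied from the lecture rather than computed (the standard library has no t distribution), the data are the recoded 0/100 variable described earlier, and the variable names are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

T_VALUE = 1.9842  # t for df = 99 with .025 in each tail, taken from the lecture

# The recoded variable: 100 = Democrat (53 cases), 0 = Republican or other (47 cases).
responses = [100] * 53 + [0] * 47

x_bar = mean(responses)                                    # sample mean, i.e. the sample percentage
standard_error = stdev(responses) / sqrt(len(responses))   # s / sqrt(n)
margin = T_VALUE * standard_error

lower, upper = x_bar - margin, x_bar + margin
print(f"standard error = {standard_error:.5f}")
print(f"{lower:.4f} < mu < {upper:.4f}")
```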
Again, rounding gives 43% and 63%, matching the SPSS output below (highlighted in green in the original).
One-Sample Test (Test Value = 0)
Variable: Democrats=100%
t: 10.566
df: 99
Sig. (2-tailed): .000
Mean Difference: 53.00000
95% Confidence Interval of the Difference: Lower 43.0469, Upper 62.9531