Sampling Distribution for a Proportion
Start with a population, adult Americans and a binary variable, whether
they believe in God.
The key parameter is the population proportion p. In this case let us suppose
82% of Americans believe in God.
Take a sample of 400 Americans, ask if they believe in God. The key statistic
is the sample proportion p̂, (“pee-hat”) the number of yes answers divided by
the total (400). The proportion of the sample who believe in God. Each sample
has a different p̂. If we consider all possible samples we can make a histogram
of those values, the sampling distribution of the random variable P̂ .
The sampling distribution of P̂ has
1. A mean of µP̂ = p
2. A standard deviation (standard error) of $\sigma_{\hat{P}} = \sqrt{\frac{p(1-p)}{n}}$
3. A normal distribution if n is big enough.
• Wait, but parameters are supposed to be Greek letters and statistics are supposed to be
Roman, right? p ought to be some cool Greek letter like π, (in some advanced textbooks it's
θ) but somehow people don’t follow that convention here. Notice though that the statistic
has a decoration (the hat) just like the sample mean (the bar over x).
• Of course the mean is p. That is just saying the average sample would have 82% answering
“yes”
• The standard deviation of a sampling distribution (i.e. the standard deviation of a statistic)
is always called the standard error
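The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and the values p = .82 and n = 400 are the ones used in the belief-in-God example.

```python
import math

def sampling_distribution_of_proportion(p, n):
    """Return (mean, standard error) of the sample proportion P-hat.

    Hypothetical helper for illustration; p is the population
    proportion and n is the sample size.
    """
    mean = p                                  # mean of P-hat equals p
    std_error = math.sqrt(p * (1 - p) / n)    # sqrt(p(1-p)/n)
    return mean, std_error

mean, std_error = sampling_distribution_of_proportion(p=0.82, n=400)
print(f"mean = {mean:.2f}, standard error = {std_error:.4f}")
# prints: mean = 0.82, standard error = 0.0192
```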
The Fine Print
There were 3 assumptions underlying the last slide. None is exactly true in real
situations, but Rules of Thumb say when they are close enough.
(a) SRS The sample is assumed to be a Simple Random Sample.
The formula for standard deviation assumes sampling with replacement
so successive individuals sampled are independent. This is close enough if
population is much larger than the sample:
(b) Independence/Large Population Assumption The population size is
at least 20 times the sample size.
As n gets larger the distribution gets more normal, but it happens faster
if p is close to .5. We can use the normal dist. to model sample proportion
if
(c) Normality Assumption/Rule of 15 : the numbers np and n(1 − p) are
both at least 15.
• We will learn a number of rules of thumb for when we can take assumptions as being met.
• Populations are generally big, so the large population assumption is almost always obviously
met. I will expect you to be able to say, if the sample size is say 200, that the population
needs to be at least 4,000.
• In the rare situations where the large population rule is not met, there is a slightly more
complicated formula for the standard deviation that works fine, so failure of this assumption
is a pretty mild problem.
• The numbers np and n(1 − p) are the average or expected number of yes and no answers in
the sample.
• If p is close to 1/2 this means n just has to be a little more than 30. If p is close to 0 or close
to 1, n needs to be quite large.
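The rules of thumb above can be written as a small checker. This is a sketch, not part of the lecture: the function name and the dictionary keys are invented for illustration, and the example value p = .5 is an assumption chosen only to show the n = 200 case.

```python
def check_assumptions(n, p, population_size=None):
    """Apply the lecture's rules of thumb (hypothetical helper).

    Large population: population is at least 20 times the sample size.
    Rule of 15: np and n(1-p) are both at least 15.
    """
    min_population = 20 * n
    expected_yes = n * p          # average number of yes answers
    expected_no = n * (1 - p)     # average number of no answers
    if population_size is None:
        large_population_met = None   # unknown; judge by common sense
    else:
        large_population_met = population_size >= min_population
    return {
        "min_population": min_population,
        "large_population_met": large_population_met,
        "expected_yes": expected_yes,
        "expected_no": expected_no,
        "rule_of_15_met": expected_yes >= 15 and expected_no >= 15,
    }

report = check_assumptions(n=200, p=0.5)
print(report["min_population"])   # prints: 4000
print(report["rule_of_15_met"])   # prints: True
```

Passing population_size is optional because, as noted above, the exact population size is usually unknown but obviously large enough.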
An Example
82% of adult Americans believe in God. Take a SRS of 400 adult Americans
and ask if they believe. What are mean and standard deviation of proportion in
your sample who do? What is chance less than 80% in your sample will believe
in God? Between 80 and 90%?
It says simple random sample, so SRS assumption: Met. The mean is
µP̂ = p = .82.
Check the large population assumption: Need there to be at least 400 · 20 =
8000 adult Americans: obviously true so Met.
$\sigma_{\hat{P}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{.82 \cdot .18}{400}} = 0.0192.$
Check normality assumption/rule of 15: np = 400 · .82 = 328 ≥ 15 and n(1 −
p) = 400 · .18 = 72 ≥ 15, so P̂ is approximately normal. Met.
• Checking the large population assumption was typical. I want to see that you know how big
the population needs to be. Usually you do not know exactly how big the population is, but
it is generally obvious that it is big enough.
• Notice the numbers np and n(1 − p) were the average number of yes and no answers you
would expect in your sample.
More Example
82% of adult Americans believe. Take a SRS of 400 adult Americans and ask
if they believe in God. What are mean and standard deviation of proportion in
your sample who do? What is chance less than 80% in your sample will believe?
Between 80 and 90%?
To find the probability it is less than 80% since P̂ is normal
P(P̂ < .8) = normdist(.8, .82, sqrt(.82*.18/400), 1) = 14.9%
Between 80% and 90%:
P(.8 < P̂ < .9) = normdist(.9, .82, sqrt(.82*.18/400), 1) − normdist(.8, .82, sqrt(.82*.18/400), 1) = 85.1%
• Notice I put the formula for the s.d. into normdist, not just the rounded result. Normdist
calculations are extremely sensitive to the standard deviation, and you can be quite far off
if you round it off too much. So enter the formula directly into Excel and do not round in
the middle for this calculation.
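For readers working outside Excel, the same two probabilities can be reproduced with Python's standard library. This is a sketch that assumes statistics.NormalDist as a stand-in for normdist with the cumulative flag set to 1. The last line uses an s.d. rounded to .02, a value chosen here only to illustrate the rounding warning.

```python
import math
from statistics import NormalDist

p, n = 0.82, 400
std_error = math.sqrt(p * (1 - p) / n)     # keep full precision, do not round
p_hat = NormalDist(mu=p, sigma=std_error)  # normal model for P-hat

below_80 = p_hat.cdf(0.80)                          # P(P-hat < .8)
between_80_90 = p_hat.cdf(0.90) - p_hat.cdf(0.80)   # P(.8 < P-hat < .9)
print(f"P(P-hat < .8)      = {below_80:.1%}")       # prints 14.9%
print(f"P(.8 < P-hat < .9) = {between_80_90:.1%}")  # prints 85.1%

# Same calculation with the s.d. rounded too aggressively (.02).
rounded = NormalDist(mu=p, sigma=0.02).cdf(0.80)
print(f"with s.d. rounded to .02: {rounded:.1%}")   # prints 15.9%
```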
The Example - The Big Picture
So we saw that if we take many samples of n = 400 from a population with
proportion of successes p = .82 and compute the sample proportion p̂ for each
one, these values of P̂ will have a normal distribution with mean µ = .82 and
standard deviation σ = .019
Another Example
You know the answer to 75% of the questions your philosophy professor
might ask. View the 50 questions on the test as a simple random sample of all
questions s/he might ask. Find mean and s. d. of the proportion you will get
right. What is your chance of getting over 90%? Between 80 and 90? Between
70 and 80?
SRS: Met.
Check large population assumption: Need at least 20 · 50 = 1000 questions
s/he could ask. Lots of questions out there, seems reasonable. Met.
Check rule of 15: np = 50 · .75 = 37.5 ≥ 15, but n(1 − p) = 50 · .25 = 12.5, which
is < 15. Cannot assume P̂ is normal. Not Met. Continue with the calculation, but treat
results with skepticism.
Compute mean and s.d. The mean is
µP̂ = p = .75.
$\sigma_{\hat{P}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{.75 \cdot .25}{50}} = 0.0612.$
More Other Example
You get 75% of questions right, test has 50 questions. Find mean and s. d. of
P̂ . Chance over 90%? Between 80 and 90? Between 70 and 80? The distribution
of P̂ is roughly normal with
$\mu_{\hat{P}} = .75$ and $\sigma_{\hat{P}} = \sqrt{.75 \cdot .25/50} = .0612$.
P(P̂ > .9) = 1 − normdist(.9, .75, sqrt(.75*.25/50), 1) = .715%
P(.8 < P̂ < .9) = normdist(.9, .75, sqrt(.75*.25/50), 1) − normdist(.8, .75, sqrt(.75*.25/50), 1) = 20.0%.
P(.7 < P̂ < .8) = normdist(.8, .75, sqrt(.75*.25/50), 1) − normdist(.7, .75, sqrt(.75*.25/50), 1) = 58.6%.
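Because the Rule of 15 is not met here, one way to see how much skepticism the normal answer deserves is to compare it with an exact count. The sketch below is not from the lecture. It assumes each of the 50 questions is answered correctly independently with probability .75 (a binomial model), and it uses Python's statistics.NormalDist in place of normdist.

```python
import math
from statistics import NormalDist

p, n = 0.75, 50
std_error = math.sqrt(p * (1 - p) / n)

# Normal approximation, as in the slide: P(P-hat > .9)
normal_over_90 = 1 - NormalDist(mu=p, sigma=std_error).cdf(0.90)

# Exact version under the binomial assumption: more than 90% of 50
# questions means at least 46 correct answers.
exact_over_90 = sum(
    math.comb(n, k) * p**k * (1 - p) ** (n - k)
    for k in range(46, n + 1)
)

print(f"normal approximation: {normal_over_90:.3%}")
print(f"exact binomial:       {exact_over_90:.3%}")
```

The two printed values differ noticeably in this tail, which is the kind of discrepancy the failed rule of thumb warns about.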
Sampling Distribution
In general consider a population and a variable.
Take a simple random
sample and compute some statistic like mean or proportion. Each time you do
this you get a different answer, so it is a Random variable!
If you consider the value of this statistic for every possible sample, you get
a distribution, the sampling distribution. We want to know its mean, standard
deviation, and shape of its histogram.
• The population distribution is the values of the variable in the population (the 82% of all adult Americans who believe)
• The data distribution is the values of the variable in one particular
sample (maybe 320 yes answers in a sample of 400)
• The sampling distribution is the different values of the statistic (like
P̂ ) in different samples
• One of the trickiest points in the class is the fact that we are taking the statistic as a random
variable. This means in a sense our population has become the set of all possible samples
out of the original population. If you can keep these levels (the original population,
one particular sample, and the population of all samples) straight, you will own this course.
If you can’t, be patient: Your brain takes time to stretch, but it gets there.
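A small simulation can make the three levels concrete. This is a sketch under stated assumptions: it draws each answer independently with probability p = .82 (which matches sampling with replacement from a very large population), it takes 2000 samples rather than every possible sample, and the seed and variable names are arbitrary.

```python
import random
import statistics

random.seed(16)  # arbitrary seed so the run is repeatable

p = 0.82            # population level: proportion who believe
n = 400             # size of each sample
num_samples = 2000  # how many samples to draw (not "all possible")

def sample_proportion(p, n):
    """Data level: draw one sample and return its p-hat."""
    yes_answers = sum(random.random() < p for _ in range(n))
    return yes_answers / n

# Sampling-distribution level: one p-hat per sample.
p_hats = [sample_proportion(p, n) for _ in range(num_samples)]

print(f"mean of p-hats: {statistics.mean(p_hats):.4f}")   # should be near .82
print(f"s.d. of p-hats: {statistics.stdev(p_hats):.4f}")  # should be near .0192
```

A histogram of p_hats would approximate the sampling distribution of P̂ described above.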
Lecture 16 Key Points
After watching this lecture you should be able to
• Know what we mean by the sampling distribution of P̂, and what it represents
• Calculate the mean and standard deviation of P̂
• Check the Independence/Large Population assumption and know what it tells
you (that the s.d. formula is correct)
• Check the Normality/Rule of 15 assumption and know what it tells you (can
use normdist)
• Calculate probabilities of P̂ using normdist.