CHAPTER 8: Sampling Distribution Models
p is the true proportion of the population (parameter for probability of success)
p̂ is the observed proportion in a sample (varies because of different samples)
Sampling error: variability that results from sampling
Sampling Distribution Model: shows the behavior of the statistic over all the possible samples of
the same size n
• For a proportion (categorical data):
o Provided that the sampled values are independent and the sample size is large enough,
the sampling distribution of p̂ is modeled by a Normal model with mean μ(p̂) = p and
standard deviation SD(p̂) = √(pq/n), where q = 1 − p.
o Centered at the population proportion
Assumptions and Conditions (Sampling Distribution Model for proportions):
Assumptions:
*The sampled values must be independent of each other
*The sample size, n, must be large enough
Conditions:
• Randomization Condition: the sample should be an SRS of the population. However, an SRS can be
difficult, if not impossible, to obtain. At a minimum, be very confident that the sampling method
was not biased and that the sample is REPRESENTATIVE of the population.
• 10% Condition (or 5% condition): if sampling is done without replacement, n must be no larger
than 10% (or 5%) of the population.
• Success/Failure Condition: the sample size must be big enough that both np and nq are at least
10.
*checks that the sample size is large enough
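The two arithmetic conditions above can be sketched as a small Python helper. The function name and its arguments are hypothetical, chosen only for illustration. The Randomization Condition cannot be checked by arithmetic and still has to be judged from how the data were collected.

```python
def check_proportion_conditions(n, p, population_size=None):
    """Arithmetic checks for using a Normal model for p-hat.

    Hypothetical helper for illustration only; randomization must be
    judged separately from how the sample was collected.
    """
    q = 1 - p
    eps = 1e-9  # small tolerance so 50 * (1 - 0.8) is not rejected by float rounding
    checks = {
        # Success/Failure Condition: np and nq both at least 10
        "success_failure": n * p >= 10 - eps and n * q >= 10 - eps,
    }
    if population_size is not None:
        # 10% Condition: sample is no larger than 10% of the population
        checks["ten_percent"] = n <= 0.10 * population_size
    return checks


# Example: 50 cars, 80% speeders, assuming (for illustration) 10,000 cars on the road
print(check_proportion_conditions(n=50, p=0.80, population_size=10_000))
```

The tolerance is a design choice for floating-point arithmetic only; it does not loosen the "at least 10" rule in any practical sense.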
**Just Checking – pg. 415
1. You want to poll a random sample of 100 students on campus to see if they are in favor of
the proposed location for the new student center. Of course, you’ll get just one number,
your sample proportion, p̂ . But if you imagined the histogram of all the sample proportions
from these samples, what shape would it have?
A Normal model (approximately)
2. Where would the center of that histogram be?
At the actual proportion of all students who are in favor.
3. If you think that about half the students are in favor of the plan, what would the standard
deviation of the sample proportions be?
SD(p̂) = √((0.5)(0.5)/100) = 0.05
**Step-by-Step – pg. 415
Suppose that about 13% of the population is left-handed. A 200-seat school auditorium has
been built with 15 “lefty seats,” seats that have a built-in desk on the left rather than the right
arm of the chair. In a class of 90 students, what’s the probability that there will not be enough
seats for the left-handed students?
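One way to work this question with the Normal model for p̂ is sketched below in Python. This is an illustration under the stated 13% figure, not the textbook's worked solution; the variable names are made up for the example. There are not enough lefty seats when more than 15 of the 90 students are left-handed, that is, when p̂ exceeds 15/90.

```python
from math import sqrt
from statistics import NormalDist

p, n, lefty_seats = 0.13, 90, 15

# Success/Failure Condition: np = 11.7 and nq = 78.3 are both at least 10
assert n * p >= 10 and n * (1 - p) >= 10

sd = sqrt(p * (1 - p) / n)             # SD(p-hat) = sqrt(pq/n)
cutoff = lefty_seats / n               # seats run out when p-hat exceeds 15/90
z = (cutoff - p) / sd                  # standardize the cutoff
prob_short = 1 - NormalDist().cdf(z)   # upper-tail area under the Normal model

print(f"SD = {sd:.3f}, z = {z:.2f}, P(not enough seats) = {prob_short:.3f}")
```

Under these assumptions the probability comes out at roughly 0.15.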
Means:
pg. 417 (dice)
*means of repeated samples approach the Normal Model with the center approaching the population
mean (Law of Large Numbers)
Sampling Distribution Model
• For means (quantitative data):
o Needs to be independent
o Needs to be collected with randomization
o Centered at the population mean
o SD(ȳ) = σ/√n
o When a random sample is drawn from any population with mean μ and standard
deviation σ, its sample mean, ȳ (may also see it written x̄), has a sampling distribution
with the same mean μ but whose standard deviation is σ/√n.
o No matter what population the random sample comes from, the shape of the sampling
distribution is approximately Normal as long as the sample size is large enough.
o The larger the sample used, the more closely the Normal approximates the sampling
distribution for the mean.
**A SAMPLE DISTRIBUTION AND A SAMPLING DISTRIBUTION ARE TWO DIFFERENT
THINGS!!!
Central Limit Theorem: The mean of a random sample has a sampling distribution whose shape
can be approximated by a Normal model. The larger the sample, the better the approximation will be.
(*this is true regardless of the shape of the population distribution, even if the population distribution is
skewed or bimodal!)
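The CLT can be seen in a short simulation. The Python sketch below draws repeated samples from a strongly right-skewed population and summarizes the sample means. The choice of an Exponential population with mean 1 and SD 1, the seed, and the function name are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so reruns give the same numbers


def sample_means(n, reps=2000):
    """Means of `reps` random samples of size n from an Exponential(1) population."""
    return [sum(random.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]


for n in (2, 10, 50):
    means = sample_means(n)
    # Theory: center stays at mu = 1, spread shrinks like sigma/sqrt(n) = 1/sqrt(n)
    print(n, round(mean(means), 3), round(stdev(means), 3), round(1 / n ** 0.5, 3))
```

A histogram of `means` for n = 2 is still clearly skewed; by n = 50 it looks much closer to a Normal curve, even though the population itself never changes shape.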
CLT Conditions:
• Randomization Condition: The data values must be sampled randomly or be representative of the
population
• Independence Assumption: The sampled values must be mutually independent. There’s no
way to check in general whether the observations are independent. However, when the
sample is drawn without replacement (as is usually the case), you should check the…
10% Condition: The sample size, n, is no more than 10% of the population
• Large Enough Sample Condition: Although the CLT tells us that a Normal model is useful in
thinking about the behavior of sample means when the sample size is large enough, it doesn’t
tell us how large a sample we need. The truth is, it depends; there’s no one-size-fits-all rule. If
the population is unimodal and symmetric, even a fairly small sample is okay. If the population
is strongly skewed, like the compensation for CEOs we looked at in Ch. 4, it can take a pretty
large sample to allow use of a Normal model to describe the distribution of sample means.
We’ll discuss this issue in more detail in later sections. For now you’ll just need to think about
your sample size in the context of what you know about the population, and then tell whether
you believe the Large Enough Sample Condition has been met.
*Don’t confuse the distribution of data in a real-world sample (which takes on the shape of the population
distribution as the sample size increases: skewed, bimodal, etc.) with the sampling distribution (which
approaches the Normal model as the sample size increases)
*means have smaller standard deviations than individuals
*Just Checking – pg. 422
4. Human gestation times have a mean of about 266 days, with a standard deviation of about
16 days. Does this mean that if we record the gestation times of a sample of 100 women,
that histogram will be well modeled by a Normal model?
No, that histogram will approximate the distribution of human gestation periods; that may
not be Normal.
5. Suppose we look at the average gestation times for a sample of 100 women. If we
imagined all the possible random samples of 100 women we could take and looked at the
histogram of all the sample means, what shape would it have?
A Normal model (approximately)
6. Where would the center of that histogram be?
266 days
7. What would be the standard deviation of that histogram?
SD(ȳ) = 16/√100 = 1.6 days
Law of Diminishing Returns: (pg. 423)
*Step-by-Step – pg. 423
Suppose that mean adult weight is 175 pounds with a standard deviation of 25 pounds. An
elevator in our building has a weight limit of 10 persons or 2000 pounds. What’s the probability that
the 10 people who get on the elevator overload its weight limit?
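A possible way to set this up in Python is sketched below. It is an illustration, not the textbook's worked solution. It assumes the 10 riders behave like a random sample and that adult weights are roughly unimodal and symmetric, so that a Normal model for ȳ is reasonable even with n = 10. The elevator is overloaded when the total exceeds 2000 pounds, that is, when the mean weight exceeds 200 pounds.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n, limit = 175, 25, 10, 2000

cutoff_mean = limit / n            # overload means the average weight exceeds 200 lb
sd_mean = sigma / sqrt(n)          # SD(y-bar) = sigma / sqrt(n)
z = (cutoff_mean - mu) / sd_mean   # standardize the cutoff
prob_overload = 1 - NormalDist().cdf(z)

print(f"SD(y-bar) = {sd_mean:.2f}, z = {z:.2f}, P(overload) = {prob_overload:.4f}")
```

With z a little above 3, the probability of an overload is well under 1 in 1000 under these assumptions.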
Standard Error (the standard deviation of a sampling distribution, estimated from sample data):
For a sample proportion, p̂: SE(p̂) = √(p̂q̂/n)
For the sample mean, ȳ: SE(ȳ) = s/√n
*Since a statistic comes from a random sample, it is, itself, a random quantity
*Sampling distributions arise because samples vary. Each random sample will contain different
cases and, so, a different value of the statistic
*Although we can always simulate a sampling distribution, the CLT saves us the trouble for means
and proportions
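To see what "simulating a sampling distribution" means, here is a minimal Python sketch that draws many samples and compares the simulated spread of p̂ with the formula √(pq/n). The values p = 0.52 and n = 300 are borrowed from the voter poll in Example 3 below; the seed, the number of repetitions, and the function name are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(2)  # fixed seed so reruns give the same numbers
p, n, reps = 0.52, 300, 2000


def one_p_hat():
    """Proportion of successes in one random sample of n independent trials."""
    return sum(random.random() < p for _ in range(n)) / n


p_hats = [one_p_hat() for _ in range(reps)]

# Simulated center and spread, followed by the model's SD(p-hat) = sqrt(pq/n)
print(round(mean(p_hats), 3), round(stdev(p_hats), 3), round(sqrt(p * (1 - p) / n), 3))
```

The simulated mean and standard deviation should land close to 0.52 and 0.029, which is the shortcut the CLT provides without running any simulation.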
Suggested Practice:
Old Book: Ch. 18 # 1, 3, 5, 9, 12, 14, 15, 17, 21, 22, 25, 27, 29, 30, 36, 41
New Book: 8.3, 8.10, 8.15, 8.16, 8.18, 8.21, 8.25, 8.30, 8.31
Examples:
1. “Toss Tacks”
• Each group tosses the tack five times and computes the sample proportion of
“ups.” Quickly sketch a histogram (or dotplot) of the results. Describe the
distribution. Note that this is a distribution of sample proportions – the proportions
are now the data.
• Each group tosses the tack 20 more times and recomputes the sample proportion
(out of 25). Make a new histogram (on the same scale). Describe the new
histogram. What does the center represent? They should also see that the
variability is smaller. Why? Someone should recall the Law of Large Numbers.
• Now toss an additional 25 times. Find the sample proportions out of 50 tosses. Do
another histogram. Discuss center and spread again. And now the shape should
also be getting clear. You should be able to go back to the Normal as approximation
to the Binomial and derive the shape, center, and spread of the sampling model for
sample proportions. Be sure to remind students of the assumptions and conditions
that allow use of this model.
2. Of all cars on the interstate, 80% exceed the speed limit. What proportions of speeders
might we see among the next 50 cars?
Solution –
Think: We want to find the distribution of the proportion of the next 50 cars that may be
speeding on the highway. 80% of all cars on the highway are speeding.
10% Condition: The 50 cars can be considered a representative sample of all cars on
the highway, and 50 is less than 10% of all cars on the highway.
Success/Failure Condition: np=50(0.80)=40, nq=50(0.20)=10 are both at least 10.
Therefore, the sampling model for the proportion of speeders in 50 cars has mean 0.80 and
standard deviation SD(p̂) = √(pq/n) = √((0.80)(0.20)/50) ≈ 0.057.
The model for p̂ is N(0.80, 0.057).
Show: Distribution using 68-95-99.7 Rule
Tell: According to the Normal model, we expect 68% of the samples of 50 cars to have
proportions of speeders between 0.743 and 0.857, 95% of the samples to have proportions
between 0.686 and 0.914, and 99.7% of the samples to have proportions between 0.629 and
0.971.
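The intervals in the Tell step can be reproduced with a few lines of Python. The helper name `rule_intervals` is hypothetical. The standard deviation is rounded to 0.057 first so that the results line up with the figures in the notes.

```python
from math import sqrt


def rule_intervals(center, sd):
    """Intervals center +/- k*sd for k = 1, 2, 3 (the 68-95-99.7 Rule)."""
    return {pct: (center - k * sd, center + k * sd)
            for k, pct in ((1, "68%"), (2, "95%"), (3, "99.7%"))}


p, n = 0.80, 50
sd = round(sqrt(p * (1 - p) / n), 3)   # SD(p-hat), rounded to 0.057 as in the notes
for pct, (lo, hi) in rule_intervals(p, sd).items():
    print(f"{pct}: {lo:.3f} to {hi:.3f}")
```

The same helper works for sample means: pass the population mean as `center` and σ/√n as `sd`.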
3. We don’t know it, but 52% of voters plan to vote “Yes” on the upcoming school budget.
We poll a random sample of 300 voters. What might the percentage of yes-voters appear to be in our
poll?
Solution –
Think: We want to find the distribution of the proportion of 300 voters that may vote “Yes” on
the upcoming school budget. 52% of voters plan to vote “Yes”.
10% Condition: The 300 voters can be considered a representative sample of all voters, and
300 is less than 10% of all voters.
Success/Failure Condition: np=300(0.52)=156, nq=300(0.48)=144 are at least 10.
Therefore, the sampling model for the proportion of “Yes”-voters in a sample of 300 has mean 0.52
and standard deviation SD(p̂) = √(pq/n) = √((0.52)(0.48)/300) ≈ 0.029.
The model for p̂ is N(0.52, 0.029).
Show: Draw distribution using 68-95-99.7 Rule
Tell: According to the Normal model, we expect 68% of the samples of 300 voters to have
proportions of “Yes”-voters between 0.491 and 0.549, 95% of the samples to have proportions
between 0.462 and 0.578, and 99.7% of the samples to have proportions between 0.433 and 0.607.
4. “Groovy” M&M’s are supposed to make up 30% of the candies sold. In a large bag of 250
M&M’s, what is the probability that we get at least 25% groovy candies?
Solution –
Think: We want to find the probability of getting a bag of 250 M&M’s that has at least 25%
“groovy” candies.
10% Condition: The 250 M&M’s can be considered a representative sample of all M&M’s, and
250 is less than 10% of all M&M’s.
Success/Failure Condition: np=250(0.30)=75, nq=250(0.70)=175 are at least 10.
Therefore, the sampling model for the proportion of “groovy” M&M’s in a sample of 250 has mean
0.30 and standard deviation SD(p̂) = √(pq/n) = √((0.30)(0.70)/250) ≈ 0.029.
The model for p̂ is N(0.30, 0.029).
Show: z = (p̂ − p)/√(pq/n) = (0.25 − 0.30)/√((0.30)(0.70)/250) ≈ −1.73
*Sketch distribution
Tell: According to the Normal model, the probability that our bag contains at least 25%
“groovy” M&M’s is approximately 95.8%.
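The table lookup in this example can be checked with Python's standard library. This is only a sketch of the same calculation; the variable names are illustrative. `NormalDist` stands in for the Normal table.

```python
from math import sqrt
from statistics import NormalDist

p, n, observed = 0.30, 250, 0.25

sd = sqrt(p * (1 - p) / n)               # SD(p-hat), about 0.029
model = NormalDist(mu=p, sigma=sd)       # the N(0.30, 0.029) sampling model
z = (observed - p) / sd                  # negative: 25% is below the center
prob_at_least = 1 - model.cdf(observed)  # area to the right of 0.25

print(f"z = {z:.2f}, P(p-hat >= 0.25) = {prob_at_least:.3f}")
```

The result agrees with the 95.8% in the Tell step to within rounding.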
5. SAT scores should have a mean 500 and standard deviation 100. What about the mean of
random samples of 20 students? (Note that the small sample is okay because we believe a Normal
model applies to the population.)
Solution:
Think: We are interested in the distribution of possible means from samples of SAT scores
from 20 students. SAT scores have a mean of 500 and a standard deviation of 100, and since the
SAT is standardized, it’s reasonable to assume that the model for all SAT scores is Normal.
Random Sampling Condition: The 20 students were sampled randomly.
Independence Assumption: It’s reasonable to think that the SAT scores of the 20 randomly
sampled students will be mutually independent, as long as the students weren’t all from the same
university.
10% Condition: 20 students represent less than 10% of all students.
Under these conditions, the sampling distribution of ȳ has a Normal model, with mean 500 and
standard deviation SD(ȳ) = σ/√n = 100/√20 ≈ 22.36.
Show: Draw distribution using 68-95-99.7 Rule.
Tell: According to the Normal model, we can expect 68% of samples of 20 students to have
mean SAT scores between 477.6 and 522.4, 95% of samples to have means between 455.3 and
544.7, and 99.7% of samples to have means between 432.9 and 567.1.
6. Speeds of cars on a highway have mean 52 mph and standard deviation 6 mph, and are
likely to be skewed to the right (a few very fast drivers). Describe what we might see in random
samples of 50 cars.
Solution –
Think: We are interested in the distribution of possible means from samples of speeds from 50
cars on the highway. Speeds have a mean of 52 mph and a standard deviation of 6 mph, with a
distribution that is skewed to the right.
Random Sampling Condition: The 50 speeds were sampled randomly.
Independence Assumption: It’s reasonable to think that the speeds of the 50 randomly
sampled cars will be mutually independent.
10% Condition: 50 cars represent less than 10% of all cars.
Even though the distribution is skewed, the Central Limit Theorem applies, since the sample size, 50
cars, is large. Under these conditions, the sampling distribution of ȳ has a Normal model, with mean
52 mph and standard deviation SD(ȳ) = σ/√n = 6/√50 ≈ 0.85 mph.
Show: Draw distribution using 68-95-99.7 Rule
Tell: According to the Normal model, we can expect 68% of samples of 50 cars to have mean
speeds between 51.15 and 52.85 mph, 95% of samples to have means between 50.30 and 53.70 mph,
and 99.7% of samples to have means between 49.45 and 54.55 mph.
7. At birth, babies average 7.8 pounds, with a standard deviation of 2.1 pounds. A random
sample of 34 babies born to mothers living near a large factory that may be polluting the air and water
shows a mean birthweight of only 7.2 pounds. Is that unusually low?
Solution:
Think: We are interested in the probability that a sample of babies has mean birthweight less
than 7.2 pounds. At birth, babies average 7.8 pounds, with a standard deviation of 2.1 pounds. The
model for birthweights should be roughly unimodal and symmetric, if not Normal.
Random Sampling Condition: The 34 babies were sampled randomly.
Independence Assumption: It’s reasonable to think that the weights of the 34 randomly
sampled babies will be mutually independent.
10% Condition: As long as more than 340 babies were born to mothers living in the vicinity of
the factory, 34 babies represent less than 10% of all babies.
Since the model for all babies is unimodal and symmetric, the Central Limit Theorem applies,
especially since the sample size, 34 babies, is large. Under these conditions, the sampling
distribution of ȳ has a Normal model, with mean 7.8 pounds and standard deviation
SD(ȳ) = σ/√n = 2.1/√34 ≈ 0.36 pounds.
Show: z = (ȳ − μ)/SD(ȳ) = (7.2 − 7.8)/0.36 ≈ −1.67
*Sketch distribution
Tell: According to the Normal model, only about 4.7% of samples of 34 babies are expected to
have mean birthweights below 7.2 pounds. The sample of babies near the factory appears to have
an unusually low mean birthweight.
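The same calculation can be wrapped in a small Python function for any question about a sample mean. The function name and signature are hypothetical. Because the code keeps the unrounded SD, its answer (about 4.8%) differs slightly from the table-based figure above, which uses z rounded to −1.67.

```python
from math import sqrt
from statistics import NormalDist


def sample_mean_lower_tail(mu, sigma, n, observed_mean):
    """Return (z, P(y-bar < observed_mean)) under the Normal sampling model.

    Hypothetical helper: assumes the CLT conditions have already been checked.
    """
    sd_mean = sigma / sqrt(n)              # SD(y-bar) = sigma / sqrt(n)
    z = (observed_mean - mu) / sd_mean     # standardize the observed mean
    return z, NormalDist().cdf(z)          # lower-tail area


z, prob_below = sample_mean_lower_tail(mu=7.8, sigma=2.1, n=34, observed_mean=7.2)
print(f"z = {z:.2f}, P(y-bar < 7.2) = {prob_below:.3f}")
```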