Sampling Distribution Models
Central Limit Theorem
Thought Questions
1. 40% of a large population disagree with a new law.
In parts a and b, think about the role of sample size.

a. If randomly sample 10 people, will exactly four
(40%) disagree with law? Surprised if only two in
sample disagreed? How about if none disagreed?

b. If randomly sample 1000 people, will exactly 400
(40%) disagree with law? Surprised if only 200 in
sample disagreed? How about if none disagreed?
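The questions above can be explored numerically. The following is a minimal sketch, assuming each sampled person independently disagrees with probability 0.40 (a binomial model, which the slide does not state explicitly); the helper name `binom_pmf` is illustrative.

```python
import math

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials), computed via logs for stability."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_comb + k * math.log(p) + (n - k) * math.log(1 - p))

p = 0.40  # assumed population proportion that disagrees with the law

# For each sample size: the "exactly 40%" count and the "only half of that" count
for n, exact, half in ((10, 4, 2), (1000, 400, 200)):
    print(
        f"n={n}: P(exactly {exact}) = {binom_pmf(exact, n, p):.4f}, "
        f"P(exactly {half}) = {binom_pmf(half, n, p):.3g}, "
        f"P(none) = {binom_pmf(0, n, p):.3g}"
    )
```

Hitting exactly 40% is unlikely at either sample size, but a result far from 40% is plausible with 10 people and essentially impossible with 1000.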
The Diversity of Samples from the Same Population
Working Backward from Samples to Populations
• Start with question about population.
• Collect a sample from the population, measure variable.
• Answer question of interest for sample.
• With statistics, determine how close such an answer, based on a
sample, would tend to be to the actual answer for the population.

Understanding Dissimilarity among Samples
We need to understand what kind of differences we should expect to
see in various samples from the same population

What to Expect of Sample Proportions
A slice of the population


40% of population carry a certain gene
Do Not Carry Gene = , Do Carry Gene = X
What to Expect of Sample Proportions





Possible Samples
Sample 1: Proportion with gene = 12/25 = 0.48 = 48%
Sample 2: Proportion with gene = 9/25 = 0.36 = 36%
Sample 3: Proportion with gene = 10/25 = 0.40 = 40%
Sample 4: Proportion with gene = 7/25 = 0.28 = 28%
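Samples like these can be imitated with a short simulation. This is a sketch only, assuming each of the 25 sampled people carries the gene independently with probability 0.40; the seed and the function name `sample_proportion` are arbitrary choices, and the simulated proportions will not match the four samples listed above.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
P_GENE = 0.40           # population proportion carrying the gene (from the slide)
SAMPLE_SIZE = 25

def sample_proportion(n, p, rng):
    """Draw n people and return the proportion who carry the gene."""
    carriers = sum(1 for _ in range(n) if rng.random() < p)
    return carriers / n

proportions = [sample_proportion(SAMPLE_SIZE, P_GENE, rng) for _ in range(4)]
for i, prop in enumerate(proportions, start=1):
    print(f"Sample {i}: proportion with gene = {prop:.2f}")
```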
The Central Limit Theorem for Sample
Proportions

Rather than showing real repeated samples, imagine what
would happen if we were to actually draw many samples.

Now imagine what would happen if we looked at the
sample proportions for these samples.

The histogram we’d get if we could see all the proportions
from all possible samples is called the sampling
distribution of the proportions.

What would the histogram of all the sample proportions
look like?
The Central Limit Theorem for Sample
Proportions

We would expect the histogram of the sample
proportions to center at the true proportion, p, in the
population.

As far as the shape of the histogram goes, we can
simulate a bunch of random samples that we didn’t really
draw.

It turns out that the histogram is unimodal, symmetric,
and centered at p.

More specifically, it’s an amazing and fortunate fact that a
Normal model is just the right one for the histogram of
sample proportions.
Imagine, Imagine, Imagine - Predicting Election Results

Imagine a research organization (R.E.) wants to know if
Candidate A would win the election if held next week.
They completed a well-conducted poll of 100 randomly
selected voters.

Imagine several other research organizations (499 of
them) also completed a well-conducted poll of 100
randomly selected voters to try and answer the same
question. They all conduct the polls on the same day.

Imagine that the answer they are looking for is 0.53,
that is Candidate A would get 53% of the vote if the
election were held on the day of the polls.
Predicting Election Results

R.E.    Results
1       52/100 = .52
2       49/100 = .49
3       62/100 = .62
4       45/100 = .45
5       59/100 = .59
....and so on
[Figure: Results from 500 Polls of 100 voters. Histogram of relative frequency against the proportion in sample that would vote for Candidate A; horizontal axis from 0.2 to 0.7.]
[Figure: Results from 500 Polls of 400 voters. Same axes; horizontal axis from 0.35 to 0.65.]
[Figure: Results from 500 Polls of 400 voters and Results from 500 Polls of 100 voters, shown side by side.]
[Figure: Results from 500 Polls of 1000 voters. Same axes; horizontal axis from 0.45 to 0.57.]
[Figure: Results from 500 Polls of 1000 voters and Results from 500 Polls of 400 voters, shown side by side.]
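The 500 imagined polls can be simulated rather than pictured. A minimal sketch, assuming each voter independently favors Candidate A with probability 0.53; the seed and names are illustrative, and the simulated spread is compared with sqrt(p(1 - p)/n), the standard deviation of a sample proportion.

```python
import math
import random
import statistics

P_TRUE = 0.53   # the "answer they are looking for" from the slides
N_POLLS = 500

def simulate_polls(n_voters, n_polls, p, rng):
    """Return the sample proportion from each of n_polls polls of n_voters voters."""
    return [
        sum(rng.random() < p for _ in range(n_voters)) / n_voters
        for _ in range(n_polls)
    ]

rng = random.Random(2024)  # arbitrary fixed seed
summary = {}
for n_voters in (100, 400, 1000):
    props = simulate_polls(n_voters, N_POLLS, P_TRUE, rng)
    center = statistics.fmean(props)
    spread = statistics.pstdev(props)
    theory = math.sqrt(P_TRUE * (1 - P_TRUE) / n_voters)
    summary[n_voters] = (center, spread, theory)
    print(f"n={n_voters}: mean={center:.3f}, SD={spread:.4f}, sqrt(pq/n)={theory:.4f}")
```

The simulated proportions center near 0.53 at every sample size, while their spread narrows as the polls get larger.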
The Central Limit Theorem for Sample
Proportions

Modeling how sample proportions vary from sample to
sample is one of the most powerful ideas we’ll see in this
course.

A sampling distribution model for how a sample
proportion varies from sample to sample allows us to
quantify that variation and how likely it is that we’d
observe a sample proportion in any particular interval.

To use a Normal model, we need to specify its mean and
standard deviation. We’ll put µ, the mean of the Normal,
at p.
The Central Limit Theorem for Sample
Proportions

When working with proportions, knowing the mean
automatically gives us the standard deviation as well. The
standard deviation we will use is
$\sqrt{\dfrac{pq}{n}}$ (with q = 1 – p)

So, the distribution of the sample proportions is modeled
with a probability model that is
$N\left(p, \sqrt{\dfrac{pq}{n}}\right)$

Sampling Distribution Model for Proportions
– Mean and Standard Deviation

Let Y ~ Binom(n, p), where n is the number of trials and p
is the probability of success.
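The step from the binomial count to the sample proportion can be sketched as follows. This assumes, as an interpretation of the slide, that the sample proportion is defined as p̂ = Y/n and that q = 1 – p; it uses the standard mean and standard deviation of a binomial count.

```latex
\begin{align*}
E(Y) &= np, & SD(Y) &= \sqrt{npq}, \\
E(\hat{p}) &= E\!\left(\frac{Y}{n}\right) = \frac{np}{n} = p, \\
SD(\hat{p}) &= SD\!\left(\frac{Y}{n}\right) = \frac{\sqrt{npq}}{n} = \sqrt{\frac{pq}{n}}.
\end{align*}
```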
The Central Limit Theorem for Sample
Proportions

A picture of what we just discussed is as follows:
The Central Limit Theorem for Sample
Proportions

Because we have a Normal model, for example, we know
that 95% of Normally distributed values are within two
standard deviations of the mean.

So we should not be surprised if 95% of various polls gave
results that were near the mean but varied above and
below that by no more than two standard deviations.

This is what we call sampling variability. The sample
proportions vary from sample to sample. All possible
sample proportions arrange themselves neatly under a
Normal curve.
What to Expect of Sample Proportions

If numerous samples of the same size are taken from
a population, the frequency curve made
from proportions from various samples will be
approximately bell-shaped.

In other words, the sampling distribution of possible
sample proportions is approximately Normal and centered at the
true population proportion, with standard deviation
$\sqrt{\dfrac{(\text{true proportion})(1 - \text{true proportion})}{\text{sample size}}}$
What to Expect of Sample Proportions

In reality, we don’t know the true proportion.

The standard error (SE) for the sampling distribution of
possible sample proportions is
$\sqrt{\dfrac{(\text{sample proportion})(1 - \text{sample proportion})}{\text{sample size}}}$
Assumptions and Conditions
1. Randomization Condition: The sample should be a
random sample of the population.
2. 10% Condition: The sample size, n, must be no larger
than 10% of the population.
3. Success/Failure Condition: The sample size has to be big
enough so that both np (expected number of successes) and nq
(expected number of failures) are at least 10.
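The two numerical conditions can be checked mechanically. The sketch below is illustrative only: the function name and the `population_size` argument are assumptions, and the Randomization Condition cannot be verified by code because it depends on how the data were collected.

```python
def check_proportion_conditions(n, p, population_size):
    """Check the 10% and Success/Failure Conditions for a sample of size n."""
    q = 1 - p
    return {
        # 10% Condition: sample is no larger than 10% of the population
        "ten_percent": n <= 0.10 * population_size,
        # Success/Failure Condition: np and nq are both at least 10
        "success_failure": n * p >= 10 and n * q >= 10,
    }

# A poll of 100 voters from a large electorate with p = 0.53
print(check_proportion_conditions(100, 0.53, 1_000_000))
# A sample of only 10 people with p = 0.40 fails the Success/Failure Condition
print(check_proportion_conditions(10, 0.40, 1_000_000))
```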
A Sampling Distribution Model for a Proportion


A proportion is no longer just a computation from a set
of data.

It is now a random quantity that has a probability
distribution.

This distribution is called the sampling distribution model
for proportions.
Even though we depend on sampling distribution models,
we never actually get to see them.

We never actually take repeated samples from the same
population and make a histogram. We only imagine or
simulate them.
A Sampling Distribution Model for a Proportion

Still, sampling distribution models are important because

they act as a bridge from the real world of data to the
imaginary world of the statistic and

enable us to say something about the population when all we
have is data from the real world.
The Sampling Distribution Model for a Proportion

Provided that the sampled values are
independent and the sample size is large
enough, the sampling distribution of p̂ is
modeled by a Normal model with


Mean: $E(\hat{p}) = p$
Standard deviation: $SD(\hat{p}) = \sqrt{\dfrac{pq}{n}}$
What to Expect of Sample Proportions


Example: Suppose 40% of all voters in the U.S. favor candidate X. Pollsters
take a sample of 2400 people.
What sample proportion would be expected to favor candidate X?
The sample proportion could be anything from a bell-shaped
curve with mean 0.40 and standard deviation:
$\sqrt{\dfrac{(0.40)(1 - 0.40)}{2400}} = 0.01$
For our sample of 2400 people:
• 68% of sample proportions will be between 39% and 41%
• 95% of sample proportions will be between 38% and 42%
• 99.7% of sample proportions will be between 37% and 43%
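The arithmetic in this example can be reproduced in a few lines. A minimal sketch, assuming the Normal model applies; the function name `proportion_intervals` is illustrative.

```python
import math

def proportion_intervals(p, n):
    """Return SD of the sample proportion and the 68/95/99.7% ranges around p."""
    sd = math.sqrt(p * (1 - p) / n)
    ranges = {
        label: (p - k * sd, p + k * sd)
        for k, label in ((1, "68%"), (2, "95%"), (3, "99.7%"))
    }
    return sd, ranges

sd, intervals = proportion_intervals(0.40, 2400)
print(f"standard deviation = {sd:.4f}")
for label, (low, high) in intervals.items():
    print(f"{label} of sample proportions between {low:.2f} and {high:.2f}")
```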
Example: Do Americans Really Vote When
They Say They Do?
Reported in Time magazine (Nov 28, 1994):
• Telephone poll of 800 adults (2 days after election) –
56% reported they had voted.
• Committee for Study of American Electorate stated
only 39% of American adults had voted.
Could it be that the results of the poll simply reflected a
sample that, by chance, voted with greater
frequency than the general population?
Example: Do Americans Really Vote When
They Say They Do?
Suppose only 39% of American adults voted. We can expect
sample proportions to be represented by a bell-shaped curve
with mean 0.39 and standard deviation:
$\sqrt{\dfrac{(0.39)(1 - 0.39)}{800}} = 0.017$ or 1.7%
• 68% of sample proportions will be between 37.3% and 40.7%
• 95% of sample proportions will be between 35.6% and 42.4%
• 99.7% of sample proportions will be between 33.9% and 44.1%
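One way to judge whether chance could explain the poll is to measure how far the reported 56% lies from 39% in standard-deviation units. This is a sketch under the assumption that the Normal model above applies; the z-score framing and the variable names are illustrative additions, not part of the slides.

```python
import math

p_true = 0.39   # proportion who voted, per the Committee figure
p_hat = 0.56    # proportion in the poll who said they voted
n = 800         # poll size

sd = math.sqrt(p_true * (1 - p_true) / n)
z = (p_hat - p_true) / sd  # distance from the mean in standard deviations

print(f"SD = {sd:.4f}, z = {z:.1f}")
```

A poll result this many standard deviations above the mean lies far outside the 99.7% range of 33.9% to 44.1%, so chance alone is not a plausible explanation under this model.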
Thought Questions

2. Mean weight of all women at large university is 135
pounds with a standard deviation of 10 pounds.

a. Recalling Empirical Rule for bell-shaped curves, in what
range would you expect 95% of women’s weights to fall?

b. If randomly sampled 10 women at university, how
close do you think their average weight would be to 135
pounds?

c. If sampled 1000 women, would you expect average
weight to be closer to 135 pounds than for the sample
of only 10 women?
What to Expect of Sample Means
Example
• Want to estimate average weight loss for all who attend national
weight-loss clinic for 10 weeks.
• Unknown to us, population mean weight loss is 8 pounds and standard
deviation is 5 pounds.
• If weight losses are approximately bell-shaped, 95% of individual weight
losses will fall between –2 (a gain of 2 pounds) and 18 pounds lost.

Possible Samples (random samples of 25 people from this population)
Sample 1: 1,1,2,3,4,4,4,5,6,7,7,7,8,8,9,9,11,11,13,13,14,14,15,16,16
Sample 2: –2,–2,0,0,3,4,4,4,5,5,6,6,8,8,9,9,9,9,9,10,11,12,13,13,16
Sample 3: –4,–4,2,3,4,5,7,8,8,9,9,9,9,9,10,10,11,11,11,12,12,13,14,16,18
Sample 4: –3,–3,–2,0,1,2,2,4,4,5,7,7,9,9,10,10,10,11,11,12,12,14,14,14,19
What to Expect of Sample Means

Results:
Sample 1: Mean = 8.32 pounds

Sample 2: Mean = 6.76 pounds

Sample 3: Mean = 8.48 pounds

Sample 4: Mean = 7.16 pounds

Each sample gave a different sample mean, but close to 8.
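The four sample means can be recomputed from the listed data. In this sketch, Sample 2 is entered with its second value as -2, which is what the sorted order of the list and the stated mean of 6.76 imply; the variable names are illustrative.

```python
import statistics

samples = {
    "Sample 1": [1, 1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 7, 8, 8, 9, 9,
                 11, 11, 13, 13, 14, 14, 15, 16, 16],
    # Second value taken as -2 (assumption consistent with the stated mean)
    "Sample 2": [-2, -2, 0, 0, 3, 4, 4, 4, 5, 5, 6, 6, 8, 8, 9, 9,
                 9, 9, 9, 10, 11, 12, 13, 13, 16],
    "Sample 3": [-4, -4, 2, 3, 4, 5, 7, 8, 8, 9, 9, 9, 9, 9, 10, 10,
                 11, 11, 11, 12, 12, 13, 14, 16, 18],
    "Sample 4": [-3, -3, -2, 0, 1, 2, 2, 4, 4, 5, 7, 7, 9, 9, 10, 10,
                 10, 11, 11, 12, 12, 14, 14, 14, 19],
}

means = {name: round(statistics.fmean(values), 2) for name, values in samples.items()}
for name, mean in means.items():
    print(f"{name}: mean = {mean} pounds")
```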
What to Expect of Sample Means
Say the true population mean height is 68 inches and the
population standard deviation is 3 inches
What to Expect of Sample Means
Variation in sample means
Say the actual mean height of a population is 68 inches.
• First sample mean is 67.5 inches
• Second sample mean is 66 inches
• Third sample mean is 69 inches
Sample means are (hopefully) close to the population mean,
but they are not identical.
Why? Due to sampling variability:
each sample is only a subset of the population.
What to Expect of Sample Means : Population of
measurements is bell-shaped, and a random sample of
any size is measured.
What to Expect of Sample Means: Population of
measurements of interest is not bell-shaped, but a large
random sample is measured.

Sample of size 30 is considered “large,” but if there are
extreme outliers, better to have a larger sample.
Population Mean is 21.76
What to Expect of Sample Means: Population of
measurements of interest is not bell-shaped, but a large
random sample is measured.
What to Expect of Sample Means

If numerous samples or repetitions of the same size are taken,
the frequency curve of means from various samples will be
approximately bell-shaped.

The mean for the sampling distribution of the sample mean is
equal to the true population mean

The standard deviation (SD) for the sampling distribution of the
possible sample means is:
$\dfrac{\text{population standard deviation}}{\sqrt{\text{sample size}}}$
What to Expect of Sample Means

In reality, we won’t know the population standard deviation

The standard error (SE) for the sampling distribution of
the possible sample means is
$\dfrac{\text{sample standard deviation}}{\sqrt{\text{sample size}}}$
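As an illustration, the standard error can be computed from a single sample. This sketch uses Sample 1 of the weight-loss data from the earlier slide and assumes the sample standard deviation is the usual n - 1 version, which is what `statistics.stdev` returns.

```python
import math
import statistics

# Sample 1 from the weight-loss example (25 people)
sample = [1, 1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 7, 8, 8, 9, 9,
          11, 11, 13, 13, 14, 14, 15, 16, 16]

n = len(sample)
s = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)
se = s / math.sqrt(n)          # standard error of the sample mean

print(f"n = {n}, sample SD = {s:.2f}, SE = {se:.2f}")
```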
The Central Limit Theorem: The
Fundamental Theorem of Statistics

The sampling distribution of any mean becomes more
nearly Normal as the sample size grows.


We don’t even care about the shape of the population
distribution as long as the sample size is large enough!
The Fundamental Theorem of Statistics is called the
Central Limit Theorem (CLT).
The Central Limit Theorem: The
Fundamental Theorem of Statistics

The CLT is surprising:


Not only does the histogram of the sample means get closer
and closer to the Normal distribution as the sample size
grows, but this is true regardless of the shape of the
population distribution.
The CLT works better (and faster) the closer the
population distribution is to a Normal itself. It also works
better for larger samples.
The Central Limit Theorem: The
Fundamental Theorem of Statistics
The Fundamental Theorem of Statistics
The Central Limit Theorem (CLT)
The mean of a random sample is a random variable
whose sampling distribution can be approximated by a
Normal model. The larger the sample, the better the
approximation will be.
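The CLT can be watched in action with a simulation. The sketch below assumes a strongly right-skewed population (an exponential distribution with mean 1, chosen purely for illustration) and tracks how the skewness of the sample means shrinks toward 0, the value for a Normal model, as the sample size grows. The seed, repetition count, and helper names are arbitrary.

```python
import random
import statistics

def skewness(values):
    """Population skewness: the average cubed z-score."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_means(n, reps, rng):
    """Means of `reps` random samples of size n from an exponential population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
results = {}
for n in (2, 10, 50):
    means = sample_means(n, 2000, rng)
    results[n] = (statistics.fmean(means), skewness(means))
    print(f"n={n}: mean of sample means={results[n][0]:.3f}, skewness={results[n][1]:.2f}")
```

The sample means stay centered at the population mean while the skewness of their histogram fades as n increases.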
Assumptions and Conditions
The CLT requires essentially the same assumptions we
saw for modeling proportions:


Independence Assumption: The sampled values must be
independent of each other.

Sample Size Assumption: The sample size must be
sufficiently large.
Assumptions and Conditions (cont.)
We can’t check these directly, but we can think about
whether the Independence Assumption is plausible.
We can also check some related conditions:


Randomization Condition: The data values must be
sampled randomly.

10% Condition: When the sample is drawn without
replacement, the sample size, n, should be no more
than 10% of the population.

Large Enough Sample Condition: The CLT
doesn’t tell us how large a sample we need. For now,
you need to think about your sample size in the
context of what you know about the population.
Weight-Loss Example

In the weight-loss example, the population mean and standard deviation were
8 pounds and 5 pounds, respectively, and we were taking random
samples of size 25.

Potential sample means are represented by a bell-shaped curve with
mean of 8 pounds and standard deviation:
$\dfrac{5}{\sqrt{25}} = 1$ pound
For our samples of 25 people:
• 68% of sample means will be between 7 and 9 pounds
• 95% of sample means will be between 6 and 10 pounds
• 99.7% of sample means will be between 5 and 11 pounds
Weight-Loss Example


Increasing the Size of the Sample
Suppose a sample of 100 people instead of 25 was taken.
Potential sample means are still represented by a bell-shaped
curve with mean of 8 pounds but standard deviation:
$\dfrac{5}{\sqrt{100}} = 0.5$ pounds
For our sample of 100 people:
• 68% of sample means will be between 7.5 and 8.5 pounds
• 95% of sample means will be between 7 and 9 pounds
• 99.7% of sample means will be between 6.5 and 9.5 pounds
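Both weight-loss calculations follow the same recipe, sketched here for sample sizes 25 and 100. The function name `mean_intervals` is illustrative, and the sketch assumes the Normal model for sample means applies.

```python
import math

POP_MEAN = 8.0  # population mean weight loss, in pounds
POP_SD = 5.0    # population standard deviation, in pounds

def mean_intervals(mu, sigma, n):
    """Return SD of the sample mean and the 68/95/99.7% ranges around mu."""
    sd = sigma / math.sqrt(n)
    return sd, [(mu - k * sd, mu + k * sd) for k in (1, 2, 3)]

for n in (25, 100):
    sd, ranges = mean_intervals(POP_MEAN, POP_SD, n)
    print(f"n={n}: SD of sample means = {sd} pounds")
    for label, (low, high) in zip(("68%", "95%", "99.7%"), ranges):
        print(f"  {label} of sample means between {low} and {high} pounds")
```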
Question

Suppose that test scores on a particular exam have a mean of
77 and standard deviation of 5, and that they have a bell-shaped curve.

Suppose you randomly select 1000 samples of size 100 from
this population and calculate the sample mean test scores.
Between what two values (sample means) would you expect
95% of these sample mean test scores to fall?
A. 72 and 82
B. 67 and 87
C. 76 and 78
But Which Normal?

The CLT says that the sampling distribution of any mean
or proportion is approximately Normal.

But which Normal model?



For proportions, the sampling distribution is centered at the
population proportion.
For means, it’s centered at the population mean.
But what about the standard deviations?
But Which Normal?

The Normal model for the sampling distribution of the
mean has a standard deviation equal to
$SD(\bar{y}) = \dfrac{\sigma}{\sqrt{n}}$
where σ is the population standard deviation.
But Which Normal?

The Normal model for the sampling distribution of the
proportion has a standard deviation equal to
$SD(\hat{p}) = \sqrt{\dfrac{pq}{n}} = \dfrac{\sqrt{pq}}{\sqrt{n}}$
About Variation

The standard deviation of the sampling distribution
declines only with the square root of the sample size (the
denominator contains the square root of n).

Therefore, the variability decreases as the sample size
increases.

While we’d always like a larger sample, the square root
limits how much we can make a sample tell about the
population. (This is an example of the Law of Diminishing
Returns.)
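The diminishing returns are easy to see numerically: each halving of the standard deviation requires four times as many observations. A short sketch, borrowing the population standard deviation of 5 from the weight-loss example as an assumed value:

```python
import math

SIGMA = 5.0  # assumed population standard deviation

def sd_of_mean(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / math.sqrt(n)

for n in (25, 100, 400, 1600):
    print(f"n={n}: SD of sample mean = {sd_of_mean(SIGMA, n)}")
```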
The Real World and the Model World
Be careful! Now we have two distributions to deal with.


The first is the real world distribution of the sample,
which we might display with a histogram.
The second is the math world sampling distribution of the
statistic, which we model with a Normal model based
on the Central Limit Theorem.
Don’t confuse the two!
Sampling Distribution Models

There are two basic truths about sampling
distributions:
1. Sampling distributions arise because samples vary. Each
random sample will have different cases and, so, a different
value of the statistic.
2. Although we can always simulate a sampling distribution,
the Central Limit Theorem saves us the trouble for means
and proportions.
What Can Go Wrong?


Don’t confuse the sampling distribution with the distribution
of the sample.

When you take a sample, you look at the distribution of the values,
usually with a histogram, and you may calculate summary statistics.

The sampling distribution is an imaginary collection of the values that
a statistic might have taken for all random samples—the one you got
and the ones you didn’t get.
Watch out for small samples from skewed populations.

The more skewed the distribution, the larger the sample size we
need for the CLT to work.
Summary

Sample proportions and means will vary from sample to
sample—that’s sampling error (sampling variability).

Sampling variability may be unavoidable, but it is also
predictable!

In statistics, the concept of population parameter is
theoretical

Only God? knows the Truth?

We try our best to find out what it is.
A Scientific Look at the Dangers of High Heels,
NY Times, Jan., 2012

Not long ago, Neil J. Cronin, a postdoctoral
researcher, and two of his colleagues at the
Musculoskeletal Research Program at Griffith
University in Queensland, Australia, were having
coffee on the university’s campus when they
noticed a young woman tottering past in high heels.
“She looked quite uncomfortable and
unstable,” Dr. Cronin says.
A Scientific Look at the Dangers of High Heels,
NY Times, Jan., 2012

Some observers, particularly women, might have winced
in sympathy or, alternatively, wondered where she’d
bought stilettos.

But the three researchers, men who study the
biomechanics of walking, were struck instead by the
scientific implications of her passage. “We began to
consider what might be happening at the muscle and
tendon level” in women who wear heels, Dr. Cronin says.
Study: Long-term use of high heeled shoes
alters the neuromechanics of human walking