Chapter 9: Sampling Distributions
Section 9.1: Sampling Distribution
This chapter introduces “statistical inference”: using statistical concepts to interpret scientific
results from studies and surveys.
Parameter: A number that describes an aspect of a population
Statistic: A number that is computed from sample data; often used to estimate an unknown parameter.
Example: A census of all DHS seniors found that 10% got into college early. An SRS of 30
seniors was also taken and in that sample 12% got into college early. The 10% is a parameter
while the 12% is a statistic.
Sampling distributions and Sampling Variability:
If we take repeated samples from the DHS senior population and measure the proportion of seniors from
those samples that got into college early, we will undoubtedly get different numbers for the different
samples. This is referred to as sampling variability.
We can create a distribution of the proportions of all the samples we took and draw a histogram. I created
a simulation that selected repeated samples (100 to be exact) of size 30 from a population that had a
proportion, p = 0.10 of seniors who got in early to college. I then took the proportions that I got from
these samples and created a histogram. The histogram looks like this:
NOTE: We will use p to represent the population proportion. We will use p̂ to represent the sample
proportion which in turn is used to estimate p if p is unknown.
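A simulation like the one described above can be sketched in Python. This is only a minimal sketch, not the original simulation: the function name simulate_sample_proportions, the seed, and the text histogram are assumptions made for illustration, and each senior is modeled as an independent draw with probability p, which treats the population as much larger than the sample.

```python
import random
from collections import Counter

def simulate_sample_proportions(p=0.10, n=30, num_samples=100, seed=1):
    """Return the sample proportion p-hat from each of num_samples samples of size n."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the run is repeatable
    proportions = []
    for _ in range(num_samples):
        # Assumption: each senior "got in early" independently with probability p.
        successes = sum(rng.random() < p for _ in range(n))
        proportions.append(successes / n)
    return proportions

p_hats = simulate_sample_proportions()

# Text histogram: one row per distinct p-hat value, one star per sample.
counts = Counter(p_hats)
for value in sorted(counts):
    print(f"{value:.3f} {'*' * counts[value]}")
```

Changing the seed gives a different set of 100 proportions, which is exactly the sampling variability described above.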
If we were to take all possible samples of the same size from the population and compute the sample
proportion, p̂ , of each sample and then create a distribution it would be called a sampling distribution
of p̂ . The following properties generally describe a sampling distribution of p̂ created from samples with
a large size (usually n ≥ 30):
• The overall shape of the distribution is symmetric and approximately normal. The larger
  the sample size, the closer the shape is to a normal distribution.
• A rule of thumb used to determine if a normal curve can be used to approximate the
  sampling distribution of sample proportions is if:
  a) np > 10 and
  b) n(1-p) > 10
• There are no outliers or other important deviations from the main pattern.
• The mean (center) of the distribution is equal to the true population parameter, p.
• The variability (spread) of the sampling distribution depends on the sample size. The larger
  the sample size, the smaller the variability of the sampling distribution.
• The standard deviation of the sampling distribution is $\sqrt{p(1-p)/n}$ (as long as the population
  is at least ten times larger than the sample size).
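The rule of thumb and the standard deviation formula in the list above can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper names normal_approximation_ok and sd_of_sample_proportion are hypothetical, and the sample size of 300 is made up for comparison with the n = 30 simulation.

```python
from math import sqrt

def normal_approximation_ok(n, p):
    """Rule of thumb from the notes: both np and n(1 - p) must exceed 10."""
    return n * p > 10 and n * (1 - p) > 10

def sd_of_sample_proportion(n, p):
    """Standard deviation of p-hat; assumes the population is at least 10 times n."""
    return sqrt(p * (1 - p) / n)

# The DHS simulation used n = 30 and p = 0.10, so np = 3 and the rule of thumb fails.
print(normal_approximation_ok(30, 0.10))

# A hypothetical larger sample passes: np = 30 and n(1 - p) = 270.
print(normal_approximation_ok(300, 0.10))
print(round(sd_of_sample_proportion(300, 0.10), 4))
```

So the histogram from samples of size 30 should not be expected to look very normal, even though its center is still near p.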
Not all sampling distributions have these properties (though most do). When a sampling distribution does
not have its center equal to the true population parameter, the statistic used to create that sampling
distribution is said to be biased.
The goal when creating a sampling distribution is to have no bias and low variability. Here is how bias
and variability are related:

The variability of a sampling distribution is determined by the sampling design and the sample
size used to create the sampling distribution. As long as the population is much larger than the
sample (at least 10 times as large) the spread of the sampling distribution is the same for any
population size.

Contrary to popular belief and intuition, the behavior of a statistic from random samples is not
influenced by the size of the population. To see why, think of taking a sample scoop of m&ms
from a well-shuffled 1-pound bag. If the m&ms are well shuffled does the scoop of m&ms really
know whether it was surrounded by a one-pound bag of m&ms or a huge bin of m&ms? Clearly it
does not.

The above realization, that the variability of a sampling distribution is controlled by the size of the
sample, not the size of the population, has major implications for sampling design. It means that a
survey of, say, 2000 people is just as accurate if the sample was taken from the population of a
small state like Rhode Island as when taken from the population of the entire United States. As
long as the sample was an SRS, it can just as easily predict some aspect of the US population as it
could from the much smaller Rhode Island population. In other words, the ratio of the sample to
the population is NOT important. As a matter of fact, we actually want the ratio of the population
to the sample size to be large – more than 10 to 1, in order to be able to conduct most of the
statistical analyses we’ll be learning about.
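The claim that population size barely matters can be checked by simulation. The sketch below is illustrative only: the population sizes, the sample size of 500, the proportion 0.35, and the function name sd_of_p_hat are all assumptions, and each population is modeled simply as index numbers, with the first 35% counted as "successes".

```python
import random
import statistics
from math import sqrt

def sd_of_p_hat(population_size, p=0.35, n=500, reps=400, seed=2):
    """Estimate the spread of p-hat over many SRSs drawn without replacement."""
    rng = random.Random(seed)
    cutoff = round(p * population_size)  # individuals 0..cutoff-1 are the "successes"
    p_hats = []
    for _ in range(reps):
        sample = rng.sample(range(population_size), n)  # SRS of n distinct individuals
        p_hats.append(sum(i < cutoff for i in sample) / n)
    return statistics.stdev(p_hats)

theoretical_sd = sqrt(0.35 * 0.65 / 500)

# Hypothetical populations: 20 times the sample, a small state, a whole country.
results = {size: sd_of_p_hat(size) for size in (10_000, 1_000_000, 300_000_000)}
for size, sd in results.items():
    print(f"population {size:>11,}: sd of p-hat = {sd:.4f}")
print(f"formula sqrt(p(1-p)/n) = {theoretical_sd:.4f}")
```

All three estimates should land close to the formula value of about 0.021, even though the populations differ in size by a factor of 30,000.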
Section 9.2: Sample Proportions
Example: An SRS of 1500 high school seniors in CT was asked whether they applied to college early.
Let’s assume that there are 100,000 high school seniors in the state of Connecticut, and that in fact 35% of
them apply to college early. What is the probability that your sample of 1500 seniors will give a result
within 2 percentage points of the true value of 35%?
a) Since the population size is greater than 10 times the sample size we are OK to proceed (we can use the
formula for standard deviation).
b) Since np = 525 >10
and n(1-p) = 975 >10
we are OK in assuming that the distribution of sample proportions is approximately normal
c) We know that the sampling distribution of sample proportions has a mean of 0.35 (equal to the p in the
population) and that the standard deviation is:
$\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.35 \times 0.65}{1500}} \approx 0.0123$
d) We are looking for the probability that p̂ falls between 0.33 and 0.37 (within 2 percentage points of
35%). So we are looking for $P(0.33 \le \hat{p} \le 0.37)$.
e) Draw a normal curve that approximates the sampling distribution of p̂ :
f) Standardizing the p̂ values we get:
$z = \frac{0.33 - 0.35}{0.0123} \approx -1.63$ and $z = \frac{0.37 - 0.35}{0.0123} \approx 1.63$
g) We can now find the area under the normal curve by using the z-score table in the back of the book (or
our calculator):
$P(-1.63 \le Z \le 1.63) = 0.9484 - 0.0516 = 0.8968$
This result is telling us that almost 90% of all samples of size 1500 we take will be within 2 percentage
points of the truth.
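The steps of this example can be reproduced with Python's standard library in place of the z-table. This is a sketch, not part of the original notes; the variable names are assumptions. Because it uses unrounded z-scores, it gives a value slightly different from a table lookup with z rounded to two decimals.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.35, 1500

# Standard deviation of p-hat (population is more than 10 times the sample).
sigma = sqrt(p * (1 - p) / n)

# Standardize the two endpoints 0.33 and 0.37.
z_low = (0.33 - p) / sigma
z_high = (0.37 - p) / sigma

# Area under the standard normal curve between the two z-scores.
standard_normal = NormalDist()
prob = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(f"sigma = {sigma:.4f}, z = {z_low:.2f} to {z_high:.2f}, probability = {prob:.4f}")
```

The probability comes out at about 0.896, in line with the "almost 90%" conclusion.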
Section 9.3: Sample Means
A couple of things to think about:
1.) Averages are less variable than individual observations.
2.) Averages are more normal than individual observations.
Why are these two things important to us? If we look at a histogram of averages, we will
get a histogram that is more normal and less spread out than a histogram of individual observations. Data
is much easier to work with if it is normal and has a small spread, so it is to our advantage to look at a
distribution of averages.
Mean and Standard Deviation of a Sample Mean
Suppose that x̄ is the mean of an SRS of size n drawn from a large population with mean μ and standard
deviation σ. Then the mean of the sampling distribution of x̄ is $\mu_{\bar{x}} = \mu$ and its standard deviation is
$\sigma_{\bar{x}} = \sigma / \sqrt{n}$.
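These two facts about the sample mean can be checked by simulation. The sketch below assumes a hypothetical population of fair six-sided die rolls (mean 3.5, standard deviation sqrt(35/12)), a sample size of 25, and 5000 repeated samples; none of these choices come from the notes.

```python
import random
import statistics
from math import sqrt

rng = random.Random(3)
n, reps = 25, 5000

# Hypothetical population: fair die rolls, so mu = 3.5 and sigma = sqrt(35/12).
mu = 3.5
sigma = sqrt(35 / 12)

# Mean of each of the repeated samples of size n.
sample_means = [
    statistics.fmean([rng.randint(1, 6) for _ in range(n)])
    for _ in range(reps)
]

print(f"mean of sample means: {statistics.fmean(sample_means):.3f} (mu = {mu})")
print(f"sd of sample means:   {statistics.stdev(sample_means):.3f} "
      f"(sigma / sqrt(n) = {sigma / sqrt(n):.3f})")
```

The simulated center should sit near μ, and the simulated spread near σ divided by the square root of n, which is much smaller than σ itself.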
Sampling Distribution of a Sample Mean from a Normal Population
Draw an SRS of size n from a population that has a normal distribution with mean μ and standard
deviation σ. Then the sample mean x̄ has a normal distribution $N(\mu, \sigma/\sqrt{n})$ with mean μ and standard
deviation $\sigma/\sqrt{n}$.
Example: Suppose the heights of young women are normally distributed with μ = 64.5 inches and σ = 2.5
inches. What is the probability that the mean height of an SRS of 10 young women is greater than 66.5
inches?
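One way to work this example is sketched below in Python; the variable names are assumptions. Because the population is normal, x̄ is exactly normal with mean 64.5 and standard deviation 2.5/√10, so the answer is the area to the right of 66.5 under that curve.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 64.5, 2.5, 10

# Sampling distribution of the sample mean: N(mu, sigma / sqrt(n)).
sd_of_mean = sigma / sqrt(n)
x_bar_distribution = NormalDist(mu=mu, sigma=sd_of_mean)

# P(x-bar > 66.5) is the area to the right of 66.5.
prob = 1 - x_bar_distribution.cdf(66.5)

z = (66.5 - mu) / sd_of_mean
print(f"sd of x-bar = {sd_of_mean:.4f}, z = {z:.2f}, P(x-bar > 66.5) = {prob:.4f}")
```

The z-score is about 2.53, so the probability is roughly 0.006: a sample mean this tall is rare, even though an individual woman taller than 66.5 inches is not.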
The Central Limit Theorem
The CLT answers the question: what does the distribution of x̄ look like if the original population is not
normal?
CLT: Draw an SRS of size n from any population whatsoever with mean μ and finite standard deviation
σ. When n is large the sampling distribution of the sample mean x̄ is close to the normal distribution
$N(\mu, \sigma/\sqrt{n})$ with mean μ and standard deviation $\sigma/\sqrt{n}$.
NOTE: How large a sample size n is needed for x̄ to be close to normal depends on the population
distribution. More observations are required if the shape of the population distribution is far from normal.
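The effect of sample size on the shape of the sampling distribution can be seen in a small simulation. This sketch assumes a hypothetical, strongly right-skewed population (an exponential distribution with mean 1) and uses skewness, the average cubed z-score, as a rough measure of how far a shape is from symmetric. The sample sizes and the helper name skewness are illustrative choices.

```python
import random
import statistics

def skewness(values):
    """Average cubed z-score: near 0 for a symmetric shape, positive for a long right tail."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

rng = random.Random(4)
reps = 4000

skew_by_n = {}
for n in (1, 10, 50):
    # Sample means from repeated samples of size n drawn from the skewed population.
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    skew_by_n[n] = skewness(means)
    print(f"n = {n:>2}: skewness of sample means = {skew_by_n[n]:.2f}")
```

The skewness shrinks toward 0 as n grows, but for a population this skewed it is still noticeable at n = 10, which is why more observations are needed when the population is far from normal.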
Example: The number of flaws per square yard in a type of carpet material varies with mean 1.6 flaws
per square yard and a standard deviation of 1.2 flaws per square yard. The population distribution cannot
be normal because a count takes only whole number values. An inspector studies 200 square yards of the
material, records the number of flaws found in each square yard, and calculates x̄, the mean number of
flaws per square yard inspected. What is the probability the mean number of flaws exceeds 2 per square
yard?
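One way to work this example is sketched below in Python; the variable names are assumptions. With n = 200 the CLT lets us treat x̄ as approximately normal with mean 1.6 and standard deviation 1.2/√200, even though the population of flaw counts is not normal.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1.6, 1.2, 200

# By the CLT, x-bar is approximately N(mu, sigma / sqrt(n)).
sd_of_mean = sigma / sqrt(n)

# Standardize 2 flaws per square yard and find the area to its right.
z = (2 - mu) / sd_of_mean
prob = NormalDist().cdf(-z)  # equals P(Z > z) by symmetry; more precise than 1 - cdf(z)

print(f"sd of x-bar = {sd_of_mean:.4f}, z = {z:.2f}, P(x-bar > 2) = {prob:.2e}")
```

The z-score is about 4.71, so the probability is on the order of one in a million: a mean above 2 flaws per square yard would be very strong evidence that something has changed.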
FINAL THOUGHTS ON CHAPTER 9:
A sampling distribution describes the values a statistic would take in very many repetitions of a sample or
an experiment under the same conditions. Understanding that idea is key to understanding statistical
inference (which is the topic of the next, and last, four chapters of the course). The sampling distributions we
looked at were of two important statistics: a sample proportion p̂ and a sample mean x̄.