The Practice of Statistics, 2nd ed.
Yates, Moore, and Starnes
Chapter 9 – Sampling Distributions
Introduction
• The reasoning of statistical inference rests on asking, “How often would this method
give a correct result if I used it very many times?”
• Exploratory data analysis (calculating means, medians, standard deviations, etc.)
makes sense for any data, but formal inference does not.
• Inference is most useful when we produce data by random sampling or randomized
comparative experiments. It is under these conditions that the laws of probability
answer the question posed above, namely, “What would happen if we did this many
times?”
9.1 – Sampling Distributions
• A parameter is a number that describes the population. In practice, we do not know
this value because we cannot examine the entire population. (If we DID know it, we
wouldn’t NEED statistics!)
• A statistic is a number that describes a sample. The value of a statistic changes from
sample to sample, but it is known (unlike a parameter). We often use a statistic to
estimate a parameter.
We must have notation that distinguishes between population parameters and sample
statistics. The sample mean x̄ can be used to approximate the population mean µ, and
the sample proportion p̂ can be used to approximate the population proportion p.
Sampling variability
How can a sample statistic, which is based on a small percentage of the members of the
population, be an accurate estimate of the population parameter? Wouldn’t a second
sample produce a different sample statistic? This is the concept of sampling variability:
the value of a statistic varies in repeated sampling. We need to know what happens if we
take many samples. To find out, we do the following:
• Take a large number of samples of the same size from the same population
• Calculate the sample mean x̄ or sample proportion p̂ for each sample
• Examine the distribution of the values of x̄ or p̂ (SOCS: shape, outliers, center, spread)
If we were to do the above for all possible samples of a given size n from the same
population, we would be describing the sampling distribution of the statistic. The
sampling distribution of a statistic is the distribution of values taken by the statistic in all
possible samples of the same size from the same population.
In practice, actually finding and calculating the sampling distribution is nearly impossible
because of the huge numbers of samples involved. However, the laws of probability
allow us to obtain sampling distributions without simulation. The same long-run
approach is useful here – haphazard sampling does not give regular and predictable
results. However, when randomization is used in a design for producing data, statistics
computed from the data have a definite pattern of behavior over many repetitions, even
though the result of a single repetition is uncertain.
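The many-samples procedure above can be sketched with a short simulation. This is a hypothetical illustration (the population, sample size, and number of samples are made up, not from the text): draw many SRSs of the same size from one population, compute x̄ for each, and examine the distribution of those means.

```python
import random
import statistics

random.seed(1)

# A hypothetical population: 100,000 values uniform on [0, 100],
# so the population mean is near 50.
population = [random.uniform(0, 100) for _ in range(100_000)]

def sampling_distribution_of_mean(population, n, num_samples):
    """Draw num_samples SRSs of size n and return the sample means."""
    return [statistics.mean(random.sample(population, n))
            for _ in range(num_samples)]

means = sampling_distribution_of_mean(population, n=50, num_samples=1000)

# Any single x-bar varies from sample to sample, but the collection of
# x-bar values centers near the population mean.
print(round(statistics.mean(means), 1))
```

Plotting a histogram of `means` would show the regular, predictable pattern the text describes, even though each individual sample mean is uncertain.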
Unbiased statistics
A statistic used to estimate a parameter is unbiased if the mean of its sampling
distribution is equal to the true value of the parameter being estimated. This means that
there is no “systematic tendency” to overestimate or underestimate the parameter, i.e.,
there is no “bias.”
It can be shown mathematically that p̂ (the sample proportion) is an unbiased estimator
of p (the population proportion) and x̄ (the sample mean) is an unbiased estimator of µ
(the population mean).
Even though p̂ is an unbiased estimator of p, any given sample may produce an
inaccurate estimate of p. This is because of the variability of the sample statistic. Larger
samples have a clear advantage because larger samples have less variability. If a statistic
is unbiased, it’s more likely that a larger sample, because of its lower variability, will
produce an estimate closer to the true value of the parameter.
One important and perhaps surprising result is that the variability (spread) of the
sampling distribution does not depend very much on the size of the population! Instead,
it depends on the size of the sample. As long as the population is much larger than the
sample (say 10 times as large), the spread of the sampling distribution is approximately
the same for any population size.
Our goal is to have low bias and low variability. In a “target” analogy, bias is missing off
to the same direction every time, while variability is missing in an evenly scattered way
around the bull’s-eye. Properly chosen statistics computed from random samples of
sufficient size will have low bias and low variability.
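The claim that larger samples have less variability can be sketched numerically. In this hypothetical example (the 60%-success population and the sample sizes are invented for illustration), quadrupling the sample size should roughly halve the spread of p̂:

```python
import random
import statistics

random.seed(2)

# Hypothetical population: 60% "successes" (p = 0.6).
population = [1] * 60_000 + [0] * 40_000

def proportion_spread(n, num_samples=500):
    """Standard deviation of p-hat across many SRSs of size n."""
    phats = [statistics.mean(random.sample(population, n))
             for _ in range(num_samples)]
    return statistics.stdev(phats)

# Larger samples -> smaller spread; n = 400 should show about half
# the variability of n = 100.
small, large = proportion_spread(n=100), proportion_spread(n=400)
print(small > large)
```

Both sample sizes give unbiased estimates centered at 0.6; the larger sample simply misses the bull’s-eye by less, on average.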
9.2 – Sample Proportions
The sample proportion p̂ is the statistic we use to gain information about the unknown
population parameter p. Sample proportions arise most often when we are interested in
categorical variables such as “the proportion of adults who watch American Idol” or “the
percent of adults who attended church or synagogue last week.”
Sampling distribution of a sample proportion
Suppose we choose an SRS of size n from a large population (at least 10 times the size of
n) with population proportion p having some characteristic of interest. Let p̂ be the
proportion of the sample having that characteristic. Then:
• The sampling distribution of p̂ is the distribution of the values of p̂ in all
possible samples of the same size from the population.
• The mean µp̂ of the sampling distribution of p̂ is exactly p.
• The standard deviation σp̂ of the sampling distribution is √( p(1 − p) / n ).
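Both facts can be checked by simulation. This is a hypothetical sketch (the values p = 0.3 and n = 50 are chosen only for illustration): the average of many p̂ values should land near p, and their standard deviation should land near √( p(1 − p) / n ).

```python
import math
import random
import statistics

random.seed(3)

p, n = 0.3, 50
# Hypothetical large population with proportion p = 0.3 of successes.
population = [1] * 30_000 + [0] * 70_000

# p-hat for each of 2000 SRSs of size n.
phats = [statistics.mean(random.sample(population, n))
         for _ in range(2000)]

theory_sd = math.sqrt(p * (1 - p) / n)  # sqrt(p(1-p)/n)

# Mean of the p-hats is close to p; their spread is close to theory_sd.
print(round(statistics.mean(phats), 2))
print(round(statistics.stdev(phats), 2))
```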
Using the normal approximation for p̂
For large enough values of n (the sample size), the sampling distribution of p̂ is
approximately normal. Furthermore,
• The accuracy of the normal approximation improves as the sample size n
increases.
• Given a fixed sample size n, the normal approximation is most accurate when p is
close to ½ and least accurate when p is close to 0 or 1.
• Rule of Thumb: We will use the normal approximation to the sampling
distribution of p̂ for values of n and p that satisfy np ≥ 10 and n (1 − p ) ≥ 10
(because p̂ = X/n is based on the count X being binomial).
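The rule of thumb is easy to encode as a small check. The helper below is a hypothetical illustration, not from the text:

```python
def normal_approx_ok(n, p):
    """Rule of thumb: the normal approximation to the sampling
    distribution of p-hat is reasonable only when np >= 10
    and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

print(normal_approx_ok(100, 0.5))   # np = 50 and n(1-p) = 50: OK
print(normal_approx_ok(100, 0.05))  # np = 5 < 10: not OK
```

Note how a p near ½ passes the check at much smaller n than a p near 0 or 1, matching the bullet above about where the approximation is most accurate.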
Notes about sample proportions:
• p̂ is less variable in larger samples
• We know exactly how quickly the standard deviation decreases as n increases.
The sample size n is under the square root sign, so to cut the standard deviation in
half, we need a sample four times as large.
• The formula for the standard deviation of p̂ does not apply when the sample is a
large part of the population.
9.3 – Sample Means
We are often interested in quantitative measures of data such as household income,
lifetime of a car brake pad, or blood pressure of a patient. Sometimes we are interested in
the median or the standard deviation, but most often (at least in our course) we’ll be
looking at the sample mean. Two important ideas come up in the discussion of sample
means:
• Averages are less variable than individual observations.
• Averages are more normally distributed than individual observations.
Sampling distribution of a sample mean
Suppose that x̄ is the mean of an SRS of size n drawn from a large population with mean
µ and standard deviation σ. Then:
• The sampling distribution of x̄ is the distribution of the values of x̄ in all
possible samples of the same size from the population.
• The mean µx̄ of the sampling distribution of x̄ is µ.
• The standard deviation σx̄ of the sampling distribution is σ/√n.
Notes about sample means:
• The values of x̄ are less spread out for larger samples.
• Again, we know exactly how quickly the standard deviation decreases as n
increases. The sample size n is under the square root sign, so to cut the standard
deviation in half, we need a sample four times as large.
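The σ/√n formula makes the "four times the sample" arithmetic concrete. The value σ = 12 below is a hypothetical population standard deviation chosen only for illustration:

```python
import math

def sd_of_mean(sigma, n):
    """Standard deviation of x-bar for an SRS of size n: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 12.0  # hypothetical population standard deviation

# Quadrupling the sample size (25 -> 100) cuts the standard
# deviation of x-bar exactly in half.
print(sd_of_mean(sigma, 25))   # 12 / 5  = 2.4
print(sd_of_mean(sigma, 100))  # 12 / 10 = 1.2
```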
The shape of the sampling distribution depends on the shape of the population
distribution. In particular, if the population distribution is normal, then so is the
distribution of the sample mean:
If we draw an SRS from a population that has the normal distribution with
mean µ and standard deviation σ, then the sample mean x̄ has the
normal distribution N(µ, σ/√n).
In other cases, particularly for larger sample sizes, we have what is known as the Central
Limit Theorem to help us determine the shape of the sampling distribution (more on this
later).
IMPORTANT NOTE: The formulas for standard deviations of both sample means and
of sample proportions should only be used if the population size is at least 10 times the
sample size. As the sample size gets closer and closer to the population size, the
formulas for the standard deviation become less and less accurate.
Central Limit Theorem
Many populations have roughly normal distributions, but very few are exactly normal.
What happens to x̄ when the population distribution is not normal? The central limit
theorem is an important result of probability theory that answers this question:
Draw an SRS of size n from any population whatsoever with mean µ and
standard deviation σ. When n is large, the sampling distribution of the
sample mean x̄ is close to the normal distribution N(µ, σ/√n), with
mean µ and standard deviation σ/√n.
The size of the sample required in order for x̄ to be close to normal depends on how
“nonnormal” the population distribution is. More observations are required (i.e., a larger
sample size) if the population distribution is far from normal (see EXAMPLE 9.12).
Why is this important? The Central Limit Theorem (or CLT) allows us to use normal
probability calculations to answer questions about sample means from many observations
even when the original population distribution is not normal!
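The CLT can be sketched with a simulation. In this hypothetical example (the exponential population and sample size are invented; this is not EXAMPLE 9.12 from the text), the population is strongly right-skewed, yet the distribution of x̄ comes out roughly symmetric and centered at the population mean:

```python
import random
import statistics

random.seed(6)

# A strongly right-skewed hypothetical population:
# exponential with mean 1.
population = [random.expovariate(1.0) for _ in range(100_000)]

def sample_means(n, num_samples=2000):
    """x-bar for each of num_samples SRSs of size n."""
    return [statistics.mean(random.sample(population, n))
            for _ in range(num_samples)]

means = sample_means(n=60)

# The population is skewed, but the distribution of x-bar is roughly
# symmetric: its mean and median nearly coincide, and it centers
# near the population mean of about 1.
print(round(statistics.mean(means), 1))
print(round(abs(statistics.mean(means) - statistics.median(means)), 2))
```

Repeating this with a smaller n (say n = 2) would leave visible skewness in the distribution of x̄, which is the sense in which "more nonnormal populations need larger samples."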
Summary
Keep the following figure in mind as we proceed through the rest of the course:
We take lots of samples of the same size (size n) from the same population and
calculate the sample mean x̄ for each of those samples. We then plot the
distribution of all the different values of x̄. We also find the mean and standard
deviation of x̄, which turn out to be µ and σ/√n, respectively. Because of the
CLT, we can use this same mean and standard deviation for large enough samples
regardless of the shape of the population distribution from which the samples
were drawn.
Additionally, here is a side-by-side comparison of the sampling distributions of
the sample mean (x̄) and the sample proportion (p̂):
Sampling Distributions
Sample mean (x̄):
• Mean: µx̄ = µ
• Standard deviation: σx̄ = σ/√n
• Normality: the sampling distribution of the sample mean is exactly normal if
the population distribution is normal, and it is approximately normal if n ≥ 30.
Sample proportion (p̂):
• Mean: µp̂ = p
• Standard deviation: σp̂ = √( p(1 − p) / n )
• Normality: the sampling distribution of the sample proportion is approximately
normal if np ≥ 10 and n (1 − p ) ≥ 10.
The above formulas for the standard deviations are exact if the population is
infinite in size; they are approximate if the sample makes up no more than 10% of
the population.