Chapter 4: Sampling Distribution Models
Statistics that we calculate from data are functions of random variables and so have different
distributions than the population that we drew them from. We will rely on the properties of
sample means and variances that we learned earlier to derive the sampling distributions of some
important statistics that we use to estimate population parameters.
Sampling Distribution for Proportions
If we are interested in estimating a population proportion, what would you guess would be a
good estimator we could calculate from our data? That is, if we want to estimate the parameter p
from a binomial, what would be a plausible sample statistic, p̂ ?
Example: Suppose that 70% of all Florida adults approve of Bush’s handling of the situation in
Iraq. Simulate a random sample of size n = 10 from this population using the table of random
digits with a “random” starting place. Compute p̂ , the proportion of your sample who approve of
Bush’s handling of the situation.
Collecting the results of many repetitions of this simulation approximates the sampling
distribution of p̂ for n = 10.
With the computer, we can simulate thousands of random samples of size 10 or any other size.
Compare the sampling distributions of p̂ for sample sizes 10, 25 and 100 – center, spread, and
shape.
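The comparison above can be sketched in Python. This is only a minimal simulation, assuming p = 0.7 as in the example; the function name simulate_p_hat, the 10,000 repetitions, and the fixed seed are illustrative choices, not part of the original notes.

```python
import random
import statistics


def simulate_p_hat(n, p=0.7, reps=10_000, seed=1):
    """Return `reps` simulated sample proportions, each from a sample of size n.

    Each individual is modeled as an independent Bernoulli trial that
    "approves" with probability p (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]


for n in (10, 25, 100):
    p_hats = simulate_p_hat(n)
    center = statistics.mean(p_hats)
    spread = statistics.stdev(p_hats)
    print(f"n = {n:3d}   mean of p-hat = {center:.3f}   SD of p-hat = {spread:.3f}")
```

The center should stay near 0.7 for every n, while the spread should shrink as n grows. A histogram of each list would show the shape.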
The simulations verify the following results that can be proved theoretically about the sampling
distribution of p̂ for simple random samples of size n from a population with proportion p:
• The mean of the sampling distribution of p̂ is p.
• The standard deviation of the sampling distribution of p̂ is √(pq/n), where q = 1 – p.
• If n is large enough, then the sampling distribution of p̂ can be approximated by a normal
model with mean p and standard deviation √(pq/n). The conditions under which the
normal model can be used are the same as for the normal approximation to the binomial:
np ≥ 10 and nq ≥ 10.
The first two results follow from results in the previous two chapters and hold for any sample
size n. They’re based on the fact that if the population size is large relative to the sample size (at
least 10 times bigger or so), then selecting a random sample of size n can be modeled as n
independent Bernoulli trials with probability of success p on each trial. Therefore, if we let X be
the number of “successes” in the sample, then X has (approximately) a Binom(n,p) distribution.
• What are E(X) and Var(X)?
• Note that p̂ = X/n. Therefore, by the results in Chapter 4, what are E( p̂ ) and Var( p̂ )?
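One way to work through the two questions above is sketched below. It assumes X is exactly Binom(n, p) and uses only the standard rules E(aX) = aE(X) and Var(aX) = a²Var(X).

```latex
\begin{align*}
E(X) &= np, & \operatorname{Var}(X) &= npq, \\
E(\hat{p}) &= E\!\left(\frac{X}{n}\right) = \frac{E(X)}{n} = \frac{np}{n} = p, \\
\operatorname{Var}(\hat{p}) &= \operatorname{Var}\!\left(\frac{X}{n}\right)
  = \frac{\operatorname{Var}(X)}{n^{2}} = \frac{npq}{n^{2}} = \frac{pq}{n}, \\
SD(\hat{p}) &= \sqrt{\frac{pq}{n}}.
\end{align*}
```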
By the previous result, how can the sampling distribution of p̂ be modeled in the example above
(Bush’s handling of Iraq) if the true proportion who approve is 70% and the sample size is 100?
(Be sure to check that the 10% condition and the success/failure condition are satisfied.)
Use this model to approximate the probability that you will get p̂ greater than .5 in a sample
of size 100.
What does the 68-95-99.7 rule say for this sampling distribution?
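The calculation for this example can be sketched as follows, assuming the normal model with mean p = 0.7 and standard deviation √(pq/n) for n = 100. The variable names are illustrative, and NormalDist comes from Python's standard statistics module (Python 3.8 or later).

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.7, 100          # values taken from the example
q = 1 - p

# Success/failure condition: np and nq should both be at least 10.
print(f"np = {n * p:.0f}, nq = {n * q:.0f}")

sd = sqrt(p * q / n)     # standard deviation of p-hat, about 0.0458
model = NormalDist(mu=p, sigma=sd)

# P(p-hat > 0.5) under the normal model.
prob_above_half = 1 - model.cdf(0.5)
print(f"SD(p-hat) = {sd:.4f},  P(p-hat > 0.5) = {prob_above_half:.6f}")

# 68-95-99.7 rule: intervals of 1, 2, and 3 SDs around p.
for k in (1, 2, 3):
    print(f"within {k} SD: {p - k * sd:.3f} to {p + k * sd:.3f}")
```

Because 0.5 lies more than four standard deviations below 0.7, the probability comes out very close to 1.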
Notes
• It’s only possible to simulate the sampling distribution of a sample proportion (or other
statistic) for simple random samples or other probability samples. If the sample is not a
probability sample (for example, it’s a sample of convenience), then it’s impossible to
know how differently it would have come out if the sampling method were repeated.
• In practice, we don’t know p – that’s why we’re taking the sample. The results above
depend on p. So how can these results help us in determining the accuracy of a sample
proportion as an estimate of a population proportion?
Sampling Distribution of a Sample Mean
What if we’re dealing with a quantitative variable and we want to estimate the population mean,
µ? We estimate the population mean by the sample mean ȳ. Again, an estimate without an
indication of accuracy is not very useful. So we examine the sampling distribution of ȳ to see
how it varies from sample to sample around the true mean. When we say “sample”, we mean
“simple random sample.”
Example: Rolling a die n times is like taking an SRS from a large population with equal
numbers of 1’s, 2’s, 3’s, 4’s, 5’s and 6’s. The mean of this “population” is 3.5 and the standard
deviation is about 1.71 (the variance is about 2.92). Compare the sampling distributions of ȳ
for samples of 1 to 5 dice, using anywhere from 1 to 10,000 rolls – center, spread, and shape.
What will happen with 25 dice, 100 dice?
See a nice applet at the following URL: http://www.stat.sc.edu/~west/javahtml/CLT.html
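For readers without access to the applet, the dice experiment can be approximated with a short Python simulation. This is a sketch only: the function name simulate_dice_means, the 5,000 repetitions, and the fixed seed are assumptions made for illustration. The population standard deviation is computed directly from the six faces.

```python
import random
import statistics
from math import sqrt


def simulate_dice_means(n_dice, reps=5_000, seed=1):
    """Return `reps` sample means, each the average of n_dice fair die rolls."""
    rng = random.Random(seed)
    return [sum(rng.randint(1, 6) for _ in range(n_dice)) / n_dice
            for _ in range(reps)]


# Population SD of one fair die, computed from the faces 1..6.
sigma = statistics.pstdev(range(1, 7))

for n_dice in (1, 2, 5, 25, 100):
    means = simulate_dice_means(n_dice)
    print(f"{n_dice:3d} dice: mean = {statistics.mean(means):.3f}, "
          f"SD = {statistics.stdev(means):.3f}, "
          f"sigma/sqrt(n) = {sigma / sqrt(n_dice):.3f}")
```

The simulated center should stay near 3.5, and the simulated spread should track sigma/sqrt(n) as the number of dice grows.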
Central Limit Theorem (the Fundamental Theorem of Statistics): the sampling distribution of
the mean from simple random samples becomes more and more normal as the sample size n
grows. This is true for any population, any quantitative variable.
Which normal?
• The mean of the sampling distribution of ȳ is the population mean µ.
• The standard deviation of the sampling distribution of ȳ is σ/√n.
• Notation: µ(ȳ) = µ and σ(ȳ) = SD(ȳ) = σ/√n.
The results about the mean and standard deviation of the sampling distribution of ȳ can be
proved by the properties of expected value and variance of random variables from Chapter 4.
They are true for any sample size.
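A sketch of that proof, assuming the observations y_1, ..., y_n are independent draws from a population with mean µ and standard deviation σ (independence is what allows the variances to add):

```latex
\begin{align*}
E(\bar{y}) &= E\!\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right)
  = \frac{1}{n}\sum_{i=1}^{n} E(y_i) = \frac{n\mu}{n} = \mu, \\
\operatorname{Var}(\bar{y}) &= \frac{1}{n^{2}}\sum_{i=1}^{n} \operatorname{Var}(y_i)
  = \frac{n\sigma^{2}}{n^{2}} = \frac{\sigma^{2}}{n}, \\
SD(\bar{y}) &= \frac{\sigma}{\sqrt{n}}.
\end{align*}
```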
Putting the results above together:
• As the sample size n grows, the sampling distribution of the sample mean tends toward a
N(µ, σ/√n) distribution.
Note: proportions are a special case of means since the proportion of successes can be thought of
as the mean when successes are given the value “1” and failures the value “0”.
How big does n need to be for the normal model to be a good approximation?
It depends on the population distribution. If the population distribution is exactly normal, then
the sampling distribution of ȳ is exactly normal for any sample size.
If the population distribution is extremely skewed, then the sampling distribution of ȳ might not
be well-approximated by the normal model until the sample size reaches 20 or 30 or more.
If there are outliers, an even bigger sample size may be needed.
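The effect of skewness can be explored with a small simulation. The sketch below assumes an exponential population (strongly right-skewed) purely as an illustration. The population choice, the function names, and the 5,000 repetitions are not from the original notes.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric shape)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def sample_means(n, reps=5_000, seed=1):
    """Return `reps` sample means of size-n samples from an Exponential(1) population."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]


for n in (2, 10, 30, 100):
    print(f"n = {n:3d}   skewness of sample means = {skewness(sample_means(n)):.2f}")
```

The skewness of the simulated sample means should move toward 0 as n increases, which is the sense in which the normal model becomes a better approximation.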
Example:
Assume that the distribution of durations of human pregnancies follows a normal model with
mean 266 days and standard deviation 16 days.
a) What percentage of pregnancies last between 260 and 270 days?
b) If an obstetrician has 50 patients, what’s the probability that the mean duration of their
pregnancies will be between 260 and 270 days? What assumption must you make to make this
calculation valid?
c) Suppose the distribution of durations is not exactly normal but is skewed to the left. Is your
calculation in a) still valid? Is your calculation in b) still valid?
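Parts a) and b) can be checked numerically with the sketch below. It assumes the stated normal model (mean 266, standard deviation 16) and, for part b), that the 50 patients behave like a simple random sample with independent durations. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 266, 16, 50   # values taken from the example

# a) One pregnancy: duration ~ N(266, 16).
single = NormalDist(mu, sigma)
prob_single = single.cdf(270) - single.cdf(260)

# b) Mean of 50 pregnancies: y-bar ~ N(266, 16 / sqrt(50)),
#    assuming the 50 durations are independent (an SRS).
mean_model = NormalDist(mu, sigma / sqrt(n))
prob_mean = mean_model.cdf(270) - mean_model.cdf(260)

print(f"a) P(260 < y < 270)     = {prob_single:.3f}")
print(f"b) P(260 < y-bar < 270) = {prob_mean:.3f}")
```

The second probability is much larger because the sampling distribution of the mean is far less spread out than the distribution of individual durations.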