Lab 5 for Math 17: Sampling Distributions and Applications
Recall: The distribution formed by considering the value of a statistic for every possible sample of a
given size n from the population is called the sampling distribution of the statistic. It is usually too
difficult to enumerate all possible samples and compute all possible values of the statistic by hand,
but we can approximate the distributions by taking a “large” number of samples (via simulation)
to help visualize the distribution. Statistical theory helps us determine the sampling distributions of some
common statistics.
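As a minimal sketch of the simulation idea described above, the Python code below draws many samples from a made-up population and collects the sample mean of each one. The population values, the sample size, the number of repetitions, and the function name `simulate_sample_means` are all assumptions chosen for illustration; they are not part of the lab.

```python
import random
import statistics


def simulate_sample_means(population, n, reps, seed=0):
    """Approximate the sampling distribution of the sample mean.

    Draws `reps` samples of size `n` (without replacement within each
    sample) and returns the list of sample means.
    """
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(reps)]


# Hypothetical population: 1,000 values between 1960 and 2010 (not real data).
pop_rng = random.Random(1)
population = [pop_rng.randint(1960, 2010) for _ in range(1000)]

means = simulate_sample_means(population, n=30, reps=2000)

print("population mean:     ", round(statistics.mean(population), 2))
print("mean of sample means:", round(statistics.mean(means), 2))
print("SD of sample means:  ", round(statistics.stdev(means), 2))
```

A histogram of `means` would give a picture of the approximate sampling distribution; increasing `reps` makes the approximation smoother.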
1 Coin Activity
Suppose we want to understand how the sample mean year on pennies behaves. The population
of pennies we have available for investigation is a collection of 1002 pennies which were obtained
from the UMass Five College Credit Union on August 25, 2010 ($10 in pennies was asked for).
What do you think the distribution of year looks like for the population of pennies? Explain.
Obtain a sample of 30 pennies, and compute the sample mean year. What value do you get? We
note that due to time constraints, we are not sampling with replacement.
Compare your mean value with the class (class graph). Are the values very different?
What does the distribution of sample mean year look like based on the graph?
Do you think looking at roughly 30 samples of size 30 is good enough to tell us about the distribution of sample mean year when n is 30?
2 Sampling Distribution of the Sample Proportion
For the purposes of this example, the bin filled with balls represents the population of all possible
birds that could be captured as part of an upcoming study looking for a genetic trait which is
known to be harmful to carriers and sometimes fatal to those which exhibit the trait (think of sickle
cell anemia, but for birds). Let white balls denote birds that do not have the trait and are also
not carriers. Let red balls denote birds that are carriers but do not themselves exhibit the trait,
and let green balls denote birds that do exhibit the trait (and are therefore also carriers).
Looking at the bin, what are your initial guesses as to the composition of this population?
____% white, ____% red, ____% green with ____ total balls
With the understanding that you could choose a combination of colors (i.e. red + green = all carriers) and estimate the population proportion for that combination, what combination (or single
color) do you want to investigate? (You cannot choose single green vs. white+red).
What color/combination did the class decide on?
Working in groups of 2 and taking turns as appropriate, each group should come get samples of size 25, 50,
and 100 from the bin and count the number of balls meeting the criteria above (class
color/combination selected). Both members need to count the number of balls meeting the criteria
chosen and agree on the count before you can record your counts for the class. Be sure you take a
sample and then return it to the population without losing members! (This also means you should not take all three
samples at once; take one, return the balls, then take the second, and so on.)
Small (n=25): ____
Medium (n=50): ____
Large (n=100): ____
Explain why this is NOT equivalent to capture-recapture sampling.
What does it look like the counts are close to for each sample size?
What proportion is that (roughly)?
(Class values will be entered into R/Rcmdr for analysis). What values does the class get as the
average of the sample proportions for each sample size?
What values does the class get as the standard deviation of the sample proportions for each sample
size?
The population proportion corresponding to the class color/combination is ____%. Does the sample
proportion appear to be an unbiased statistic?
What effect does sample size appear to have on the standard deviation of the sampling distribution
of $\hat{p}$?
What shapes do the histograms for each sample size have (will be hard to tell with our small number of repetitions)?
The sampling distribution of $\hat{p}$ can be described as: approximately normal for large sample sizes
where p is not too near 0 or 1, with a mean denoted $\mu_{\hat{p}} = p$, the population proportion, and a
standard deviation $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$ (assuming that not more than 10% of the population is used as
a sample).
For n = 25, 50, 100, compute the standard deviations of $\hat{p}$ based on the now-known population
proportion. Do the observed standard deviations of the sample proportions match up?
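The calculation asked for here can be sketched in Python as follows. The value `p = 0.30` is a hypothetical placeholder (the actual population proportion is revealed in class), and the function name `sd_of_sample_proportion` is introduced only for this illustration.

```python
import math


def sd_of_sample_proportion(p, n):
    """Theoretical standard deviation of the sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical population proportion; replace with the value revealed in class.
p = 0.30

for n in (25, 50, 100):
    print(f"n = {n:3d}: SD of sample proportion = {sd_of_sample_proportion(p, n):.4f}")
```

Comparing the n = 25 and n = 100 lines shows how the formula responds to a fourfold increase in sample size.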
3 Sampling Distribution of the Sample Mean
For sample means, we will learn about the sampling distribution via an applet (link online).
Steer your (Java-enabled) browsers to http://onlinestatbook.com/stat_sim/sampling_dist/index.html
In this applet, when you first hit Begin, a histogram of a normal distribution is displayed at the
top of the screen. This is the parent population from which samples are taken (think of it as the
bin of balls) except it’s showing the distribution. The mean of that distribution is indicated by a
small blue line and the median is indicated by a small purple line. Since the mean and median are
the same for a normal distribution, the two lines overlap. The red line extends from the mean one
standard deviation in each direction.
The second histogram displays the sample data. This histogram is initially blank. The third and
fourth histograms show the distribution of statistics computed from the sample data. The option
N in those histograms is the sample size you are drawing from the population. We will be exploring
the distribution of the sample mean by drawing many samples from the parent distribution and
examining the distribution of the sample means we get.
Step 1. Describe the parent population. What distribution is it and what is its mean and standard
deviation?
Step 2. You can see the third histogram is already set to “Mean”, with a sample size of N = 5.
Click Animated sample once. The animation shows five observations being drawn from the parent
distribution. Their mean is computed and dropped down onto the third histogram. For your sample, what was the sample mean?
Step 3. Click Animated sample again. A new set of five observations is drawn, and their mean is
computed and dropped as the second sample mean onto the third histogram. What did the mean
of the sample means (yes, we are interested in the mean of sample means as part of the sampling
distribution) change to?
Step 4. Click Animated sample one more time. What did the mean of the sample means update
to now?
Step 5. Click 10,000. This takes 10,000 samples at once (no more animation) and will place those
10,000 sample means on the third histogram and update the mean and standard deviation of the
sample means. Record the mean and standard deviation of the sample means. What shape does
this third histogram have? How do these findings compare to the parent distribution?
Step 6. Hit Clear Lower 3 in the upper right corner. Change N = 5 to N = 25 for the third
histogram. Do animated sample at least once (convince yourself it is actually samples of 25 now).
Then take 10,000 at once. Record the mean and standard deviation of the sample means. What
shape does the third histogram have? How do these findings compare to the parent distribution?
Step 7. Compare the different standard deviations from Steps 5 and 6. What effect does sample
size appear to have on standard deviation of the sample means?
Step 8. Hit Clear Lower 3. Change the parent distribution to Skewed. What are the new mean
and standard deviation of the parent distribution? Which direction is this distribution skewed?
Step 9. Set N = 5 back for the third histogram. Set “Mean” and N = 25 for the fourth histogram.
Hit 10,000 at once. (This will take 10,000 samples of size 5, compute the sample means and put
those means in the third histogram, as well as take 10,000 samples of size 25, compute the sample
means and put those means in the fourth histogram). What do the distributions look like for the
third and fourth histograms? Are they skewed like the parent population? What are the means
and standard deviations for each histogram?
Step 10. Hit Clear Lower 3. Change the parent distribution to Custom. Draw in a custom distribution (left click and drag the mouse over the top histogram). Sketch your custom distribution
below. What are its mean and standard deviation?
Step 11. Hit 10,000 at once (leave the settings on the third and fourth histograms alone). (You
could click Animated sample once to convince yourself it is really drawing from your new distribution.)
What do the third and fourth histograms look like? Anything like the parent distribution? What
are their means and standard deviations?
The sampling distribution of the sample mean, $\bar{X}$, can be described as having a mean $\mu_{\bar{X}} = \mu$,
the population mean, and a standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. The distribution is exactly normal if the
parent population is normal. Finally, the Central Limit Theorem tells us the distribution will be
approximately normal with the mean and standard deviation stated above if n is sufficiently large
even if the population distribution is not normal.
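For readers without access to the applet, the following Python sketch mimics the skewed-parent experiment. It assumes an exponential parent distribution with rate 1 (so its mean and standard deviation are both 1); that choice, the repetition count, and the function name `sample_means_from_skewed` are illustrative assumptions and do not match the applet's own skewed distribution.

```python
import math
import random
import statistics


def sample_means_from_skewed(n, reps, rate=1.0, seed=0):
    """Return `reps` sample means, each computed from `n` exponential(rate) draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(reps)
    ]


# For an exponential distribution with rate 1, mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0

for n in (5, 25):
    means = sample_means_from_skewed(n, reps=5000)
    print(
        f"n = {n:2d}: mean of sample means = {statistics.fmean(means):.3f}, "
        f"SD of sample means = {statistics.stdev(means):.3f}, "
        f"sigma / sqrt(n) = {sigma / math.sqrt(n):.3f}"
    )
```

The simulated standard deviations can be compared with the last column, and a histogram of `means` for each n shows how much of the parent's skew remains.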
4 Application Example
A rental car company is interested in the number of miles put on their rental cars by their clients
as part of a project where they may trade in some cars in the Cash for Clunkers program. From
past experience, they believe the population distribution of mileage has a mean of 60 miles and
a standard deviation of 60 miles. They obtain a random sample of 50 mileages from their rental
car fleet and obtain a sample mean of 73.31 miles. The company executives are worried: has the
average number of miles put on the cars gone up? Your job is to help them figure out if the data
suggest an increase in average number of miles put on the cars.
a. What is the sampling distribution of the sample mean mileage put on rental cars? (Give
distribution type, mean, and standard deviation.) What result allows you to provide this distribution?
b. What is the probability you would see a sample mean of 73.31 or greater if the population
mean and standard deviation were both really 60?
c. Would you tell the executives that the average number of miles put on the rental cars has
increased? (How unusual is 73.31 if the mean is really 60, assuming the standard deviation is
correct?)
d. In practice, do you think the standard deviation of the parent distribution would be known?
How would you get around it being unknown? What value could you substitute for σ in our calculations relating to the CLT? This swap and its consequences will be a focus of our discussions next
week as we start developing confidence intervals.
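One way to carry out the calculation in part b is sketched below in Python, using the numbers stated in the example and assuming the normal approximation from the Central Limit Theorem applies at n = 50. The variable names are chosen only for this illustration.

```python
import math
from statistics import NormalDist

# Values stated in the example.
mu = 60.0        # believed population mean (miles)
sigma = 60.0     # believed population standard deviation (miles)
n = 50           # sample size
x_bar = 73.31    # observed sample mean (miles)

# Standard deviation of the sample mean: sigma / sqrt(n).
se = sigma / math.sqrt(n)

# Standardize, then take the upper-tail area of the standard normal curve.
z = (x_bar - mu) / se
upper_tail = 1 - NormalDist().cdf(z)

print(f"SD of sample mean = {se:.3f}")
print(f"z = {z:.3f}")
print(f"P(sample mean >= {x_bar}) is approximately {upper_tail:.4f}")
```

Whether the resulting probability counts as "unusual" is the judgment asked for in part c.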
5 More Applications
1. Suppose 40 percent of the voters in a large city prefer candidate Q for mayor. A random sample
of 2400 city voters is taken.
a. What is the sampling distribution of the sample proportion of city voters who prefer candidate Q for mayor? Check that this distribution is valid.
b. What is the probability that the sample taken results in a sample proportion of .426 or higher?
2. A researcher is investigating deaths among a new invasive species of beetles treated with various
insecticides. Age of death is recorded for fully matured adult beetles at various doses of insecticides. Since only fully matured adult beetles are included, and because ages at death are usually
not bell-shaped, the researcher records age at death for 50 beetles at each insecticide/dosage level
to help study average age at death.
a. What is the significance of studying 50 beetles at each treatment level if you know you want to
examine the sample mean?
b. Suppose the population mean age at death for the beetle population at a specific treatment
level is 20 days with a population standard deviation of 3 days. What is the sampling distribution
of the sample mean for that treatment level for the sample of size 50 taken?
c. What is the probability a sample of size 50 results in a sample mean between 19 and 21?
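The two problems above use the same normal-approximation recipe, which can be sketched in Python as follows. The check `n*p >= 10 and n*(1-p) >= 10` is a common rule of thumb and is an assumption here; your course may use a different threshold. Variable names are illustrative only.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

# Problem 1: sample proportion with p = 0.40 and n = 2400.
p, n = 0.40, 2400
# Rule-of-thumb validity check (assumed threshold of 10).
conditions_ok = n * p >= 10 and n * (1 - p) >= 10
sd_p_hat = math.sqrt(p * (1 - p) / n)
z1 = (0.426 - p) / sd_p_hat
prob_1b = 1 - std_normal.cdf(z1)

# Problem 2: sample mean with mu = 20 days, sigma = 3 days, sample size 50.
mu, sigma, m = 20.0, 3.0, 50
sd_x_bar = sigma / math.sqrt(m)
z_low = (19 - mu) / sd_x_bar
z_high = (21 - mu) / sd_x_bar
prob_2c = std_normal.cdf(z_high) - std_normal.cdf(z_low)

print(f"1a. conditions met: {conditions_ok}, SD of p-hat = {sd_p_hat:.4f}")
print(f"1b. z = {z1:.2f}, P(p-hat >= 0.426) is approximately {prob_1b:.4f}")
print(f"2b. SD of sample mean = {sd_x_bar:.4f}")
print(f"2c. P(19 <= sample mean <= 21) is approximately {prob_2c:.4f}")
```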
6 To Turn In
In a recent election, 62 percent of voters voted in favor of a new law. A related law is coming up
for a vote in a neighboring state. A random sample of 80 voters in the neighboring state reveals that
43 of the 80 are in favor of the related law. If the percent in favor is really the same in both states,
how unusual is the result of the sample poll or something more extreme (for direction of extreme
use smaller values)?