Chancellor’s Statistics Institute, College of the Canyons
“Sampling with Proportions: let’s count those chips!”
SAMPLING VARIABILITY AND AN (EMPIRICAL) SAMPLING DISTRIBUTION FOR A PROPORTION
Part 1: Sampling Variability
Note: this part of the activity involves group work. Each group consists of 3 students.
Introduction: Presidential elections are a complex process, but essential for democracy as they empower the
people to choose their leaders who will be making decisions that are important for the entire nation. This year,
on November 8th, the people of the United States will be voting again, for the 58th time in the history of the USA.
Currently, the 2016 Republican presidential nominee is Donald J. Trump, while the Democrats have two possible
candidates, Hillary Clinton and Bernie Sanders. There may also be Third Party and Independent Candidates; for
example, Gary Johnson is currently running as a Libertarian and a Third Party candidate.
We will simulate the elections by drawing chips from
a bag. Our population of eligible voters will be
represented by 200 chips, some of which are blue
(representing Democratic votes), some are red
(representing Republican votes), and some are yellow
(representing other votes). We will define success as
“getting a Democratic vote” (drawing a blue chip).
1. Imagine that you have polled two randomly selected groups of 10 eligible voters. Would you expect
to get the same combination of Democratic, Republican, and “Other” votes each time? Why?
2. Now, suppose that you are analyzing the votes of a randomly selected group of 10 eligible voters. For
that purpose, without looking into the bag, take a sample of 10 chips from the bag. Record the
number of Democratic votes in your sample. Return the chips into the bag and then shuffle all the
chips. Repeat the sampling two more times, so that you have three samples in total. For each sample,
record the number of Democratic votes and calculate the corresponding sample proportion ( p̂ ).
Sample size: n = 10
                              Sample 1    Sample 2    Sample 3
Number of Democratic votes    ________    ________    ________
Sample Proportion ( p̂ )       ________    ________    ________
3. Is the sample proportion ( p̂ ) always the same?
The observed variation in the proportion of Democratic votes in your samples is called Sampling
Variability. Sampling variability is a result of random chance. Therefore, sampling variability means that
just by random chance, each time when we draw a sample, we can get a different sample proportion.
4. How could you use the three sample values ( p̂ ) to estimate the true population proportion ( p )?
5. Calculate the mean sample proportion: Mean of all p̂ = ________________________________
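Note: the chip drawing in this part can also be imitated on a computer. The following is only a minimal sketch, not part of the original activity. The worksheet does not state how many chips of each color are in the bag, so the counts below (90 blue, 80 red, 30 yellow) are assumptions chosen for illustration, as are the names BAG and draw_sample_proportion.

```python
import random

# Hypothetical bag: the worksheet does not give the color counts,
# so these numbers are assumptions used only for illustration.
BAG = ["blue"] * 90 + ["red"] * 80 + ["yellow"] * 30  # 200 chips in total


def draw_sample_proportion(bag, n, rng):
    """Draw n chips without replacement and return the proportion of blue chips."""
    sample = rng.sample(bag, n)  # the bag itself is unchanged, as if chips were returned
    return sample.count("blue") / n


rng = random.Random(2016)  # fixed seed so that the run can be repeated
p_hats = [draw_sample_proportion(BAG, 10, rng) for _ in range(3)]
mean_p_hat = sum(p_hats) / len(p_hats)

print("Sample proportions:", p_hats)
print("Mean of all p-hat: ", round(mean_p_hat, 3))
print("Assumed true p:    ", BAG.count("blue") / len(BAG))
```

Changing the seed gives three different sample proportions, which is the sampling variability described above.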
Part 2: Sampling Distribution
Note: In this part the entire class works together.
1. Let’s now plot all the sample proportions that we obtained in our class. For each sample value make a
dot in the plot. This dotplot represents our empirical Sampling Distribution for the proportion of
Democratic votes in random samples of size n = 10. (Note: a theoretical sampling distribution includes
sample values that correspond to all the possible samples that can be created; since our sampling
yields only a subset of all the possible samples, we use the term empirical.)
2. How many dots are there? What does each dot in the dotplot represent?
3. Has everyone in the class obtained the same proportion of Democratic votes? Why?
4. Let’s now estimate the shape, center, and spread of our empirical sampling distribution.
a) What is the approximate shape of our empirical sampling distribution? Given the shape, which
would be the best measures for the center and the spread of the empirical sampling distribution?
b) Now, recall the Empirical Rule for a Normal Distribution (a.k.a. the 68-95-99.7 rule). This rule says
that the middle 95% of normally distributed data lie within about _____ standard deviations
of the mean. How many dots make up 95% of all the dots?
c) To find the middle 95% of the data, we count off 2.5% of the dots from each end, left and right. Thus, we
will count _____ dots from the left and _____ dots from the right in order to determine the Lower
Limit and the Upper Limit for the middle 95% of data. Use the dotplot to find these values.
Lower Limit:
Upper Limit:
d) Note: the standard deviation of the (theoretical) sampling distribution is called the Standard Error.
The standard error is determined as $SE = \sqrt{p(1-p)/n}$, where n is the sample size and p is the
population proportion. In reality, we do not know the true population value p , so we usually take
one random sample and estimate the standard error using the sample proportion p̂ . Since in this
case we already have a sampling distribution, we can use a different approach and implement the
Empirical Rule to approximate the Standard Error. Namely, we will use the fact that the distance
between the lower and upper limit amounts to approximately _________ standard deviations of
the sampling distribution. Therefore,
Approximation for the Standard Error = $\frac{\text{Upper Limit} - \text{Lower Limit}}{4}$ = __________
e) The midpoint between the lower and upper limit represents an approximation for the _________
of the sampling distribution:
Approximation for the Mean = $\frac{\text{Lower Limit} + \text{Upper Limit}}{2}$ = __________
f) Describe the (empirical) sampling distribution using the above approximations.
Shape:
Center:
Spread:
5. Again, how can you use the (empirical) sampling distribution to estimate the true population value p ?
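Note: the counting procedure from question 4 can be sketched in code as follows. This is an illustration under stated assumptions, not the class data: the 40 sample proportions are made up, and the helper names middle_95_limits and approximate_center_and_se are hypothetical. The sketch also assumes one particular counting convention (round 2.5% of the dots down, skip that many dots on each side, and take the next dot as the limit); the worksheet does not prescribe a rounding rule.

```python
def middle_95_limits(p_hats):
    """Skip 2.5% of the dots on each side and return the (lower, upper) limits."""
    dots = sorted(p_hats)
    k = (len(dots) * 25) // 1000  # 2.5% of the dots, rounded down
    return dots[k], dots[len(dots) - 1 - k]


def approximate_center_and_se(lower, upper):
    """Apply the two approximations from parts d) and e)."""
    mean_approx = (lower + upper) / 2  # midpoint of the two limits
    se_approx = (upper - lower) / 4    # the limits are about 4 standard errors apart
    return mean_approx, se_approx


# Made-up class results: 40 dots, listed as value * number of dots.
class_p_hats = ([0.1] * 1 + [0.2] * 3 + [0.3] * 7 + [0.4] * 10
                + [0.5] * 9 + [0.6] * 6 + [0.7] * 3 + [0.8] * 1)

lower, upper = middle_95_limits(class_p_hats)
mean_approx, se_approx = approximate_center_and_se(lower, upper)

print("Lower Limit:", lower)
print("Upper Limit:", upper)
print("Approximation for the Mean:", round(mean_approx, 3))
print("Approximation for the Standard Error:", round(se_approx, 3))
```

With 40 dots, 2.5% is exactly one dot, so one dot is skipped on each side before the limits are read off.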
Conclusion:
- When taking random samples from a population, due to random chance, each time that we draw a
sample, we usually obtain a different sample with a different sample value. This variation in the
sample value is called the _____________________________.
- By plotting all the sample proportions that we obtained in the class, we created a dotplot which
represents our empirical ____________________________for the proportion of Democratic votes
in random samples of size n = _____. Since the shape of the empirical sampling distribution is
_____________________________, the best measure for the center is the ___________________
and the best measure for the spread is the ___________________. The standard deviation of the
sampling distribution is called the _____________________________; for our empirical sampling
distribution, this value is approximately __________. The approximation we obtained for the
center of our empirical sampling distribution is __________. We will use this value to
___________________ the unknown population proportion.
Part 3: Sampling Distribution for Increased Sample Size
Note: This is group work. Three students work together to obtain 3 samples, and then the class puts all the results together.
1. Let’s now increase the sample size to simulate the polling of 20 randomly selected voters at a time.
Without looking into the bag, take 20 chips and record the number of Democratic votes. Return the
chips into the bag, and then shuffle all the chips. Repeat the sampling so that you have three samples
in total. For each sample record the number of Democratic votes and find the sample proportion ( p̂ ).
Sample size: n = 20
                              Sample 1    Sample 2    Sample 3
Number of Democratic votes    ________    ________    ________
Sample Proportion ( p̂ )       ________    ________    ________
2. Let’s now focus on the mean of the three sample proportions.
a) Calculate the mean sample proportion: Mean of all p̂ = ________________________________
b) Compare this value to the previous case. Write down the values of the two means. Which mean do you
think lies closer to the true population proportion?
Sample size n = 10: Mean of all p̂ = ________________________________
Sample size n = 20: Mean of all p̂ = ________________________________
3. Again, construct a dotplot of all the sample proportions obtained in the class. For each sample value,
put a dot on the plot to construct a new (empirical) sampling distribution.
4. Let’s estimate the two values that contain the middle 95% of our new (empirical) sampling distribution.
The middle 95% of the dots lie between the _________ Limit and the _________ Limit. The total
number of dots is ______. 95% out of this equals ______ dots, so then 5% is ______ dots. We will need
to count ______ dots from the left and ______ dots from the right to find the two limits. Therefore,
Lower Limit:
Upper Limit:
5. Using the two limits, we will now estimate the center and the spread of our new (empirical) sampling
distribution.
Approximation for the Mean =
Approximation for the Standard Error =
6. Now fill out the table and then compare your approximations for the shape, center, and spread for the
two (empirical) sampling distributions that we created in the class.
Sample size    Shape           Center          Spread
n = 10         ____________    ____________    ____________
n = 20         ____________    ____________    ____________
Which similarities/differences do you observe between the two (empirical) sampling distributions?
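Note: as a check on this comparison, the standard error formula from Part 2 can be used to work out how the spread is expected to change when the sample size doubles from 10 to 20. The worked equation below is a sketch that assumes the formula applies as stated and that p is strictly between 0 and 1; the subscripts on SE are notation introduced here.

```latex
\[
\frac{SE_{n=20}}{SE_{n=10}}
  = \frac{\sqrt{p(1-p)/20}}{\sqrt{p(1-p)/10}}
  = \sqrt{\frac{10}{20}}
  = \frac{1}{\sqrt{2}}
  \approx 0.71
\]
```

In words, whatever the value of p, the formula predicts that the standard error for n = 20 is about 71% of the standard error for n = 10. The class dotplots contain only a limited number of dots, so the observed ratio may differ somewhat from this value.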
Part 4: Simulations using Technology
Introduction: Again, we will think of our population as a population of eligible voters, some Democratic,
some Republican, and some with “Other” political preference. As earlier, we will define success as
“getting a Democratic vote.” The difference is that this time we will use statistical software to help us
carry out simulations. For that purpose we will use an online statistics tool called “StatKey,” found at
www.lock5stat.com. To enable simulations, we will need to input our population data into StatKey.
Follow these steps:
- Copy the population data.
  - Open the Excel file “Elections.xlsx” (this file is posted on the class website).
  - To copy the data, click on the header of the data column and copy (right click COPY or CTRL C).
- Input the population data into “StatKey”.
  - Open “StatKey” and under “Sampling Distributions” click on “Proportion”.
  - Edit Data → Select data (CTRL A) → Delete data (DEL) → Paste new data (right click PASTE or CTRL V)
- Carry out simulations, following the instructions below.
1. First, carry out simulations for sample size n = 10.
- Using “Generate 1 Sample” create a random sample. Click onto “Show Data Table” to see the
randomized sample that StatKey generated. Count and record the number of Democratic votes,
and then find the corresponding sample proportion. Repeat this twice, so that you have 3
randomized simulations. Each time record the mean sample proportion (it’s displayed in the top
right corner of the dotplot, as well as under the arrow in the bottom part of the dotplot).
Sample size: n = 10
                              Sample 1    Sample 2    Sample 3
Number of Democratic votes    ________    ________    ________
Sample Proportion ( p̂ )       ________    ________    ________
Mean Sample Proportion        ________    ________    ________
- Use “Generate 1 Sample” to make 7 more samples. Then use “Generate 10 Samples” to make 90
more samples. Finally, click onto “Generate 100 Samples” a few times. Answer the questions:
a) StatKey has simulated an empirical sampling distribution for us. What does each dot represent?
How many dots are there in total?
b) The original population proportion ( p ) is a fixed value, yet, the computer generated different
sample proportions p̂ . Why does this happen?
c) Describe the shape, center, and spread of the simulated empirical sampling distribution.
Shape:
Center:
Spread:
2. Now carry out simulations for sample size n = 20.
- Using “Generate 1 Sample” create 3 random samples. Each time, click on “Show Data Table” to see
the randomized sample that StatKey generated. Count and record the number of Democratic Votes
and then find their proportions. Also, each time record the mean value of the sampled proportions.
Sample size: n = 20
                              Sample 1    Sample 2    Sample 3
Number of Democratic votes    ________    ________    ________
Sample Proportion ( p̂ )       ________    ________    ________
Mean Sample Proportion        ________    ________    ________
- Generate additional random samples, so that the total number of samples is exactly the same as in
the previous case.
- Describe the shape, center, and spread of the simulated empirical sampling distribution.
Shape:
Center:
Spread:
3. If we sampled chips from the bag, it would not be easy to carry out simulations for sample size n = 50.
Let’s see what happens when software is carrying out simulations for us.
- Using “Generate 1 Sample” create 3 random samples. Each time, click on “Show Data Table” to see
the randomized sample that StatKey generated. Count and record the number of Democratic Votes
and find their proportion. Also, each time record the mean of the sample proportions.
Sample size: n = 50
                              Sample 1    Sample 2    Sample 3
Number of Democratic votes    ________    ________    ________
Sample Proportion ( p̂ )       ________    ________    ________
Mean Sample Proportion        ________    ________    ________
- Generate additional random samples, so that the total number of samples is exactly the same as in
the previous two cases.
- Describe the shape, center, and spread of the simulated empirical sampling distribution.
Shape:
Center:
Spread:
4.
5. Compare the three cases (n = 10, n = 20, and n = 50).
a) Which (simulated empirical) sampling distribution has a shape closest to the shape of a normal
distribution?
b) State the center for each (simulated empirical) sampling distribution.
                                                        Case n = 10    Case n = 20    Case n = 50
Mean of the simulated empirical sampling distribution   ________       ________       ________
Compare the three means; which one do you think is closest to the true population proportion?
c) State the spread for each (simulated empirical) sampling distribution.
                                                        Case n = 10    Case n = 20    Case n = 50
Standard Deviation of the simulated empirical
sampling distribution (our approximation for
the Standard Error)                                     ________       ________       ________
How do the three measures of spread compare?
d) Draw a conclusion about how the shape, center, and spread of a (simulated empirical) sampling
distribution change with the sample size.
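Note: for readers without access to StatKey, the same kind of simulation can be sketched in a few lines of code. This is not StatKey's implementation. The contents of “Elections.xlsx” are not reproduced in this worksheet, so the population below (45% Democratic) is an assumption, and the sketch assumes draws with replacement so that the standard error formula from Part 2 applies directly. The names POPULATION, P_TRUE, and simulate_sampling_distribution are hypothetical.

```python
import math
import random
import statistics

# Assumed population: a stand-in for the data file, which is not shown here.
POPULATION = ["Democratic"] * 90 + ["Republican"] * 80 + ["Other"] * 30
P_TRUE = POPULATION.count("Democratic") / len(POPULATION)


def simulate_sampling_distribution(population, n, num_samples, rng):
    """Return one sample proportion per simulated sample of size n."""
    p_hats = []
    for _ in range(num_samples):
        sample = rng.choices(population, k=n)  # draws with replacement (an assumption)
        p_hats.append(sample.count("Democratic") / n)
    return p_hats


rng = random.Random(58)  # fixed seed so that the run can be repeated
results = {}
for n in (10, 20, 50):
    p_hats = simulate_sampling_distribution(POPULATION, n, 1000, rng)
    results[n] = (statistics.mean(p_hats), statistics.stdev(p_hats))
    formula_se = math.sqrt(P_TRUE * (1 - P_TRUE) / n)
    print(f"n = {n:2d}: mean = {results[n][0]:.3f}, "
          f"SD = {results[n][1]:.3f}, formula SE = {formula_se:.3f}")
```

Each printed row gives the center and spread of one simulated empirical sampling distribution next to the value of the standard error formula, so the three cases can be compared side by side.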
Part 5: Test Your Understanding
Let’s now assess your understanding of the sampling variability and the (empirical) sampling distribution.
1. Answer the questions below.
a) What does “Sampling Variability” mean?
b) Briefly (in two sentences or less) describe how the activities that we completed in the class
illustrate the concept of Sampling Variability.
c) Briefly explain the concept of the (empirical) Sampling Distribution of proportions.
d) State the synonym for the standard deviation of the (theoretical) sampling distribution.
e) What was your favorite part of the activities we did?
f) What is the one concept that you are still unclear about?
2. Let’s carry out some more simulations. Imagine that you are considering the population of 225.8
million eligible voters in the USA. Your task is to find an approximation for the number of voters
with Democratic preference. The only information you have is obtained by simulating the sampling of
200 random samples of size 80; for each sample, the proportion of Democratic votes is computed and
displayed as a single dot in the dotplot below. Use the dotplot to carry out the task.
Answer the following questions:
a) What does the dotplot above represent? Circle the correct answer:
- theoretical Sampling Distribution
- empirical Sampling Distribution
- simulated empirical Sampling Distribution
b) How would you describe the shape of the dotplot?
c) What would be the best measure for the center of the dotplot? State this value.
d) What would be the best measure for the spread of the dotplot? State this value.
e) Use your knowledge of the sampling distribution to estimate the proportion of Democratic votes.
f) Using your answer from the previous part, estimate the total number of Democratic votes.
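Note: the arithmetic in part f) is a single multiplication, sketched below. The proportion 0.40 is a made-up placeholder, not a reading from the dotplot, and the function name estimated_count_millions is hypothetical; substitute the center that you found in part c).

```python
def estimated_count_millions(center_of_dotplot, population_millions):
    """Scale an estimated proportion up to an estimated count, in millions."""
    if not 0 <= center_of_dotplot <= 1:
        raise ValueError("a proportion must lie between 0 and 1")
    return center_of_dotplot * population_millions


# 225.8 million eligible voters is the figure given in the exercise;
# 0.40 is only a placeholder for the estimated proportion.
example = estimated_count_millions(0.40, 225.8)
print(f"Estimated Democratic votes: about {example:.1f} million")
```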
3. Again, we will use StatKey to simulate 100 random drawings from a population of 225.8 million
eligible voters in the USA. Each time that a random sample is taken, we record the proportion of
Democratic votes and then we plot this value in a dotplot. Observe how the shape, center, and spread
of the (simulated empirical) sampling distribution change with sample size.
Sample size n = 10
Shape:
Center:
Spread:
Estimated proportion of
Democratic votes in the
population:
Sample size n = 25
Shape:
Center:
Spread:
Estimated proportion of
Democratic votes in the
population:
Sample size n = 50
Shape:
Center:
Spread:
Estimated proportion of
Democratic votes in the
population:
a) How does the shape of the (simulated empirical) sampling distribution change with sample size?
b) What happens to the spread of the (simulated empirical) sampling distribution as sample size
increases?
c) Which do you think would be the best estimate for the unknown population proportion?
d) Based on your estimate for the proportion of Democratic votes in the population, approximate
the number of Democratic votes in the entire population.
4.
Your friend wants to illustrate how
we can use a (simulated) empirical
sampling distribution to estimate
the number of students at the
College of the Canyons who are 40
or more years old. Your friend
seeks data from 20 quintuplets of
COC students.
For the purpose of research, the data are obtained from a college administrator, who generates the data
by using a computer to simulate a random selection of 20 quintuplets of COC students. For each
quintuplet (n = 5) the proportion of students of age 40 or older is computed. Based on this your friend
makes a dotplot which represents a (simulated) empirical sampling distribution. You look at the dotplot
and conclude that the estimate obtained from this (simulated) empirical sampling distribution is not
going to be very good, i.e. your friend should ask the administrator for a new data set in which one
variable should be changed. What is it that should be changed? Why?
5. How would you determine if a coin is fair or tainted?
Recall, if a coin is fair, the probability of getting a head (or a tail) is p =______. This means that if, for
example, we flip a coin n = 60 times, we would expect to obtain about ______ heads on average.
However, due to _____________________________, each sequence of 60 flips can yield a different
number of heads. This means the sample proportion will vary from one sample to another, so if we
plot all the sample proportions we will obtain a dotplot that consists of different values, and this
dotplot will represent the _____________________________ for the given coin. We can then use the
center of the sampling distribution as a point estimate for the probability of getting a head.
Let’s now use StatKey to simulate 60 flips of a coin. We will define success as “getting a head” so we
will focus on the proportion of the heads out of 60 coin flips. To build a (simulated empirical) sampling
distribution, we will carry out 400 simulations. For each simulation the proportion of heads is plotted,
to yield a simulated empirical sampling distribution. Your task is to estimate the proportion of heads
out of 60 flips, and then, based on that value, you should conclude whether the coin is tainted or fair.
Estimated proportion of
heads:
Is the coin fair or tainted?
Why?
Estimated proportion of
heads:
Is the coin fair or tainted?
Why?
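Note: the coin-flip simulation described above can also be sketched in code. This is an illustration, not the StatKey output used in the exercise. The two coins below are assumptions: the first has probability 0.5 of heads, and the second uses a made-up bias of 0.7. The function name simulate_head_proportions is hypothetical.

```python
import random
import statistics


def simulate_head_proportions(p_heads, flips, simulations, rng):
    """Return one proportion of heads per simulated run of `flips` coin flips."""
    proportions = []
    for _ in range(simulations):
        heads = sum(rng.random() < p_heads for _ in range(flips))
        proportions.append(heads / flips)
    return proportions


rng = random.Random(60)  # fixed seed so that the run can be repeated
coin_a = simulate_head_proportions(0.5, 60, 400, rng)  # behaves like a fair coin
coin_b = simulate_head_proportions(0.7, 60, 400, rng)  # 0.7 is a made-up bias

for label, props in (("Coin A", coin_a), ("Coin B", coin_b)):
    print(f"{label}: center = {statistics.mean(props):.3f}, "
          f"spread = {statistics.stdev(props):.3f}")
```

Each list holds 400 sample proportions, one per simulated run of 60 flips, so its center plays the role of the point estimate described above.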
6.
An employee of the Pew Research
Center wants to estimate the percent
of adult Americans who do not use
the Internet. To fulfill the task, they
use census data which they input
into a computer to simulate the
drawing of 300 groups of 100
randomly selected American adults.
For each group the software
calculates the proportion of adults
who do not use the Internet and
based on that makes a dotplot.
Use the dotplot of the simulated empirical sampling distribution to estimate the percent of adult
Americans who do not use the internet. How many people would that be? (Assume that there are
about 242.5 million adult Americans.)
7.
Suppose that you wish to estimate
the number of first generation
students at COC. Your statistics
teacher gives you a dotplot which
represents a simulated empirical
sampling distribution for 100 groups
of COC students, each consisting of
60 randomly selected students. Use
this dotplot to estimate the percent
of first generation students. How
many students would that be?
(Assume that 31,000 students are
enrolled at COC)
References
Note: Many thanks to my colleagues and Professors within the Mathematics Department of the College of the
Canyons. Help and support on project organization has been obtained from Kathy Kubo. The core part of the
activities is based on the lecture notes and activities developed by Matt Teachout and the Statistics Team of the
College of the Canyons. Significant improvements have been carried out based on useful comments and
suggestions provided by Monica Dabos and Joseph Gerda.
Lecture Materials:
- Monica Dabos: Lecture on Sampling Distribution, part of a series of statistics workshops held at the College
  of the Canyons during the Spring 2016 term. Mathematics Department of the College of the Canyons, Santa
  Clarita Community College District.
- Joan Garfield and Dani Ben-Zvi: Reese’s Pieces Activity: Sampling from a Population (an adaptation of an
  activity from Rossman and Chance (2000), Workshop Statistics: Discovery with Data, 2nd Edition).
  Developed through CAUSE as a part of its collaboration with the SERC Pedagogic Service. Web link:
  http://serc.carleton.edu/sp/library/datasim/examples/reeses.htm
- Matt Teachout: Lecture Notes on Confidence Intervals & Sampling Distributions (Act. 1-4). Mathematics
  Department of the College of the Canyons, Santa Clarita Community College District. Web link:
  http://www.teachoutcoc.org/Statistics/index.html
Statistical Software:
- StatKey, a collection of web-based statistics apps written to accompany Statistics: Unlocking the Power of
  Data by Lock, Lock, Lock, Lock, and Lock. Web link: http://lock5stat.com/statkey/
Interesting Facts:
- California Community Colleges, 2016 College of the Canyons Student Success Scorecard. Web link:
  http://scorecard.cccco.edu/reports/OneYear/661_OneYear.pdf
- Pew Research Center, by Monica Anderson and Andrew Perrin: 15% of Americans don’t use the internet.
  Who are they? Fact Tank of the Pew Research Center. Web link:
  http://www.pewresearch.org/fact-tank/2015/07/28/15-of-americans-dont-use-the-internet-who-are-they/
- Pew Research Center, by Jens Manuel Krogstad: 2016 electorate will be the most diverse in U.S. history.
  Fact Tank of the Pew Research Center. Web link:
  http://www.pewresearch.org/fact-tank/2016/02/03/2016-electorate-will-be-the-most-diverse-in-u-s-history/
- Reference.com: How many adults live in the USA? Public record information, obtained from
  quickfacts.census.gov. Web link: https://www.reference.com/government-politics/many-adults-live-usab830ecdfb6047660#
Other Educational Materials:
- e-Nasco, Counting Chips - Blue. Nasco – Modesto, P.O. Box 10, Salida, California 95368. Web link:
  https://www.enasco.com/product/TB16613T
- e-Nasco, Counting Chips - Red. Nasco – Modesto, P.O. Box 10, Salida, California 95368. Web link:
  https://www.enasco.com/product/TB16923T
- e-Nasco, Counting Chips - Yellow. Nasco – Modesto, P.O. Box 10, Salida, California 95368. Web link:
  https://www.enasco.com/product/TB16924T