Download Introduction to Chance Models (Section 1.1) Introduction A key step

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Transcript
Introduction to Chance Models (Section 1.1)
Introduction
A key step in the Statistical Investigation Method is drawing conclusions beyond the observed
data. Statisticians often call this “statistical inference.” There are four main types of conclusions
(inferences) that statisticians can draw from data: significance, estimation, generalization, and
causation. In the remainder of this chapter we will focus on statistical significance.
If you think back to the Organ Donor study from the Preliminaries, there were three groups:
those in the neutral group were asked to make a yes/no choice about becoming a donor; those
in the “opt in” group were told their default was not to be a donor, but they could choose to
become a donor; and those in the “opt out” group were told their default was they were a donor,
but they could choose not to be one if they wished. Let’s further examine two of those groups.
We saw that 23 of 55 or 41.8% in the “opt in” group elected to become donors, compared with
44/56 or 78.6% in the neutral group. A key question here is whether we believe that the 78.6%
is far enough away from the 41.8% to be considered statistically significant, meaning unlikely
to have occurred by random chance alone. True, 78.6% looks very different from 41.8%, but it is
at least possible that the wording of the solicitation to donate actually makes no difference, and
that the difference we observed happened by random chance.
To answer the question “Is our result unlikely to happen by random chance?,” our general
strategy will be to consider what we expect the results to look like if any differences we are
seeing are solely due to random chance. Exploring the random chance results is critical to our
ability to draw meaningful conclusions from data. In this section we will provide a framework for
assessing random chance explanations.
Example 1.1: Can Dolphins Communicate?
A famous study from the 1960s explored whether two dolphins (Doris and Buzz) could
communicate abstract ideas. Researchers believed dolphins could communicate simple feelings
like “Watch out!” or “I’m happy,” but Dr. Jarvis Bastian wanted to explore whether they could
also communicate in a more abstract way, much like humans do. To investigate this, Dr. Bastian
spent many years training Doris and Buzz and exploring the limits of their communicative ability.
During a training period lasting many months, Dr. Bastian placed buttons underwater on each
end of a large pool—two buttons for Doris and two buttons for Buzz. He then used an old
automobile headlight as his signal. When he turned on the headlight and let it shine steadily, he
intended for this signal to mean “push the button on the right.” When he let the headlight blink
on and off, this was meant as a signal to “push the button on the left.” Every time the dolphins
pushed the correct button, Dr. Bastian gave the dolphins a reward of some fish. Over time Doris
and Buzz caught on and could earn their fish reward every time.
Then Dr. Bastian made things a bit harder. Now, Buzz had to push his button before Doris. If
they didn’t push the buttons in the correct order—no fish. After a bit more training, the dolphins
caught on again and could earn their fish reward every time. The dolphins were now ready to
participate in the real study to examine whether they could communicate with each other.
Dr. Bastian placed a large canvas curtain in the middle of the pool. (See Figure 1.1.) Doris was
on one side of the curtain and could see the headlight, whereas Buzz was on the other side of
the curtain and could not see the headlight. Dr. Bastian turned on the headlight and let it shine
June 27, 2014
MAA PREP workshop
1
steadily. He then watched to see what Doris would do. After looking at the light, Doris swam
near the curtain and began to whistle loudly. Shortly after that, Buzz whistled back and then
pressed the button on the right—he got it correct and so both dolphins got a fish. But this single
attempt was not enough to convince Dr. Bastian that Doris had communicated with Buzz
through her whistling. Dr. Bastian repeated the process several times, sometimes having the
light blink (so Doris needed to let Buzz know to push the left button) and other times having it
glow steadily (so Doris needed to let Buzz know to push the right button). He kept track of how
often Buzz pushed the correct button.
In this scenario, even if Buzz and Doris can communicate, we don’t necessarily expect Buzz to
push the correct button every time. We allow for some “randomness” in the process; maybe on
one trial Doris was a bit more underwater when she whistled and the signal wasn’t as clear for
Buzz. Or maybe Buzz and Doris aren’t communicating at all and Buzz guesses which button to
push every time and just happens to guess correctly once in a while. Our goal is to get an idea
of how likely Buzz is to push the correct button in the long run.
Figure 1.1: Depending whether or not the light was blinking or shown steadily, Doris had to
communicate to Buzz as to which button to push.
Let’s see how Dr. Bastian was applying the Six-Step Statistical Investigation Method.
Step 1: Ask a research question. Can dolphins communicate in an abstract manner?
Step 2: Design a study and collect data. Notice Dr. Bastian took some time to train the
dolphins in order to get them to a point where he could test a specific research conjecture. The
research conjecture is that Buzz pushes the correct button more often than he would if he and
Doris could not communicate. If Buzz and Doris could not communicate, Buzz would just be
guessing which button to push. The observational units are Buzz’s attempts and the variable for
each attempt is whether or not on each attempt, Buzz pushes the correct button (a categorical
variable).
Step 3: Explore the data. In one phase of the study, Dr. Bastian had Buzz attempt to push the
correct button a total of 16 different times. In this sample of 16 attempts, Buzz pushed the
correct button 15 out of 16 times. To summarize these results, we report the statistic, a
numerical summary of the sample. For this example, we could report either 15, the number of
correct pushes, or 15/16 = 0.9375, as the statistic.
Definitions: The set of observational units on which we collect data is called the sample. The
number of observational units in the sample is the sample size. A statistic is a number
summarizing the results in the sample.
June 27, 2014
MAA PREP workshop
2
The sample size in this example is 16. Note that the word “sample” is used as both a noun (the
set of observational units being studied) and as an adjective, for example to mean “computed
from the observed data,” as, for example, “sample statistic.”
Step 4: Draw inferences beyond the data. These 16 observations are a mere snapshot of
Buzz’s overall selection process. We will consider this a random process. We are interested in
Buzz’s actual long-run probability of pushing the correct button based on Doris’ whistles. This
unknown long-run probability is called a parameter.
Definition: For a random process, a parameter is a long-run numerical property of the
process.
Note that we are assuming this parameter is not changing over time, at least for the process
used by Buzz in this phase of the study. Because we can’t observe Buzz pushing the button
forever, we need to draw conclusions (possibly incorrect, but hopefully not) about the value of
the parameter based only on these 16 attempts. Buzz certainly pushed the correct button most
of the time, so we might consider either:


Buzz is doing something other than just guessing (his probability of a correct button push
is larger than 0.50).
Buzz is just guessing (his probability of a correct button push is 0.50) and he got lucky in
these 16 attempts.
These are the two possible explanations to be evaluated. Because we can’t collect more data,
we have to base our conclusions only on the data we have. It’s certainly possible that Buzz was
just guessing and got lucky! But does this seem like a reasonable explanation to you? How
would you argue against someone who thought this was the case?
Think about it: Based on these data, do you think Buzz somehow knew which button to push?
Is 15 out of 16 correct pushes convincing to you? Or do you think that Buzz could have just
been guessing? How might you justify your answer?
So how are we going to decide between these two possible explanations? One approach is to
choose a model for the random process (repeated attempts to push the correct button) and
then see whether our model is consistent with the observed data. If it is, then we will conclude
that we have a reasonable model and we will use that model to answer our questions.
The Chance Model
Scientists use models to help understand complicated real world phenomena. Statisticians often
employ chance models to generate data from random processes to help them investigate such
processes. You did this with the Monty Hall Exploration (P.3) to investigate properties of the two
strategies, switching and staying with your original choice of door. In that exploration it was clear
how the underlying chance process worked, even though the probabilities themselves were not
obvious. But here we don’t know for sure what the underlying real world process is. We are
trying to decide whether the process could be Buzz simply guessing or whether the process is
something else, such as Buzz and Doris being able to communicate.
Let us first investigate the “Buzz was simply guessing” process. Because Buzz is choosing
between two options, the simplest chance model to consider is a coin flip. We can flip a coin to
represent or simulate Buzz’s choice assuming he is just guessing which button to push. To
June 27, 2014
MAA PREP workshop
3
generate this artificial data, we can let “heads” represent the outcome that Buzz pushes the
correct button and let “tails” be the outcome that Buzz pushes the incorrect button. This gives
Buzz a 50% chance of pushing the correct button. This can be used to represent the “Buzz was
just guessing” or the “random chance alone” explanation. The correspondence between the real
study and the physical simulation is shown in Table 1.1.
Table 1.1: Parallels between real study and physical simulation
Coin flip
Heads
Tails
=
=
=
guess by Buzz
correct guess
wrong guess
probability of correct button when Buzz is just
Chance of Heads = 1/2 = guessing
One repetition
=
one set of 16 simulated attempts by Buzz
Now that we see how flipping a coin can simulate Buzz guessing, let’s flip some coins to
simulate Buzz’s performance. Imagine that we get heads on the first flip. What does this mean?
This would correspond to Buzz pushing the correct button! But, why did he push the correct
button? In this chance model, the only reason he pushed the correct button is because he
happened to guess correctly—remember the coin is simulating what happens when Buzz is just
guessing which button to push.
What if we keep flipping the coin? Each time we flip the coin we are simulating another attempt
where Buzz guesses which button to push. Remember that heads represents Buzz guessing
correctly and tails represents Buzz guessing incorrectly. How many times do we flip the coin?
Sixteen, to match Buzz’s 16 attempts in the actual study. After 16 tosses, Figure 1.2 shows our
simulated data.
Figure 1.2: Results from tossing a coin 16 times
Here we got 11 heads and 5 tails (11 out of 16 or 0.6875 is the simulated statistic). This gives
us an idea of what could have happened in the study if Buzz had been randomly guessing
which button to push each time.
Will we get this same result every time we flip a coin 16 times? Let’s flip our coin another 16
times and see what happens. When we did this we got 7 heads and 9 tails as shown in Figure
1.3 (7 out of 16 or 0.4375 is the simulated statistic).
Figure 1.3: Results from a second repetition of tossing a coin 16 times
So can we learn anything from these coin tosses when the results vary between the sets of 16
tosses?
Using and evaluating the coin flip chance model
Because coin flipping is a random process, we know that we won’t obtain the same number of
heads with every set of 16 flips. But are some numbers of heads more likely than others? If we
June 27, 2014
MAA PREP workshop
4
continue our repetitions of 16 tosses, we can start to see how the outcomes for the number of
heads are distributed. Does the distribution of the number of heads that result in 16 flips have a
predictable long-run pattern? In particular, how much variability is there in our simulated
statistics between repetitions (sets of 16 flips) just by random chance?
In order to investigate these questions, we need to continue to flip our coin to get many, many
sets of 16 flips (or many repetitions of the 16 choices where we are modeling Buzz simply
guessing each time). We did this, and Figure 1.4 shows what we found when we graphed the
number of heads from each set of 16 coin flips. Here, the process of flipping a coin 16 times
was repeated 1,000 times—this number was chosen for convenience, but also appears to be
large enough to give us a fairly accurate sense of the long-run behavior for the number of heads
in 16 tosses.
Figure 1.4: A dotplot showing 1000 repetitions of flipping a coin 16 times and counting the
number of heads
Each dot represents one set of
16 attempts by Buzz
Let’s think carefully about what the graph in Figure 1.4 shows. For this graph, each dot
represents the number of heads in one set of 16 coin tosses. We see that the resulting number
of heads follows a clear pattern: 7, 8, and 9 heads happened quite a lot, 10 was pretty common
also (though less so than 8), 6 happened some of the time, 1 happened once. But we never got
15 heads in any set of 16 tosses! We might consider any outcome between about 5 and 11
heads to be typical, but getting fewer than 5 heads or more than 11 heads happened rarely
enough we can consider it a bit unusual. We refer to these unusual results as being out in the
‘tails’ of the distribution.
Think about it: How does the analysis above help us address the strength of evidence for our
research conjecture that Buzz was doing something other than just guessing?
What does this have to do with the dolphin communication study? We said that we would flip a
coin to simulate what could happen if Buzz was really just guessing each time he pushed the
button in 16 attempts. We saw that getting results like 15 heads out of 16 never happened in our
1000 repetitions. This shows us that 15 is a very unusual outcome— far out in the tail of the
distribution of the simulated statistics—if Buzz is guessing. In short, even though we expect
some variability in the results for different sets of 16 tosses, the pattern shown in this distribution
indicates that an outcome of 15 heads is outside the typical chance variability we would expect
to see when Buzz is simply guessing.
June 27, 2014
MAA PREP workshop
5
In the actual study, Buzz really did push the correct button 15 times out of 16, an outcome that
we just determined would rarely occur if Buzz was just guessing. So, our coin flip chance model
tells us that we have very strong evidence that Buzz was not just tossing a coin to make his
choices. This means we have strong evidence that Buzz wasn’t just guessing. Therefore, we
don’t believe the “by chance alone” explanation is a good one for Buzz. The results mean we
have strong enough evidence to be statistically significant, not a result that happened by
chance alone, and to convince us that something other than “random chance” was at play.
Definition: A result is statistically significant if it is unlikely to occur just by random chance.
If our observed result appears to be consistent with the chance model, we say that the
chance model is plausible or believable.
What next? A glimpse into Steps 5 and 6
The steps we went through above have helped us evaluate how strong the evidence is that
Buzz is not guessing (Step 4 of the Statistical Investigation Method). In this case, the evidence
provided by this sample is fairly strong that Buzz isn’t guessing. Still, there are some important
questions you should be asking right now, such as: If Buzz isn’t guessing what is he doing?
Step 5: Formulate conclusions. We should also ask ourselves: if Buzz wasn’t guessing, does
this prove that Buzz and Doris can communicate? And if so, what does this say about other
dolphins? As we’ll find in later chapters, the answers to these questions hinge mainly on how
the study was designed and how we view the 16 attempts that we observed (e.g., we assume
Buzz couldn’t see the light himself, the light signal displayed each time was chosen randomly so
Buzz couldn’t figure out a pattern to help him decide which button to push; Buzz’s 16 attempts
are a good representation of what Buzz would do given many more attempts under identical
conditions; but we might still wonder whether Buzz’s behavior representative of dolphin behavior
in general or are there key differences among individual dolphins).
Step 6: Look back and ahead. After completing Steps 1-5 of the Statistical Investigation
Method, we need to revisit the big picture of the initial research question. First, we reflect on the
limitations of the analysis, and think about future studies. In short, we are now stepping back
and thinking about the initial research question more than the specific research conjecture being
tested in the study. In some ways, this is the most important step of the whole study because it
is where we think about the true implications of the scientific study we’ve conducted. For this
study, we would reflect on Dr. Bastian’s methods, summarize the results for Buzz, and reflect on
ways to improve the study to enhance the conclusions we can draw.
The 3S Strategy
Let us summarize the overall approach to assessing statistical significance that we have been
taking in this section. We observed a sample statistic (e.g., the number of “successes” or the
proportion of “successes” in the sample). Then we simulated ‘could-have-been’ outcomes for
that statistic under a specific chance model (just guessing). Then we used the information we
gained about the random variation in the ‘by-chance’ values of the statistics to help us judge
whether the observed value of the statistic is an unusual or a typical outcome. If it is unusual–
we say the observed statistic is statistically significant—it provides strong evidence that the
chance-alone explanation is wrong. If it is typical, we consider the chance model plausible.
You may have noticed that we only simulated results for one specific model. When we saw that
the sample statistic observed in the study was not consistent with these simulated results, we
June 27, 2014
MAA PREP workshop
6
rejected the chance-alone explanation. Often, research analyses stop here. Instead of trying to
simulate results from other models (in particular we may not really don’t have an initial idea what
a more appropriate model might be), we are content to say there is something other than
random chance at play here. This might lead the researchers to reformulate their conjectures
and collect more data in order to investigate different models.
We will call the process of simulating could-have-been statistics under a specific chance model
the 3S Strategy. After forming our research conjecture and collecting the sample data, we will
use the 3S strategy to weigh the evidence against the chance model. This 3S Strategy will
serve as the foundation for addressing the question of statistical significance in Step 4 of the
Statistical Investigation Method.
3S Strategy for Measuring Strength of Evidence
1. Statistic: Compute the statistic from the observed sample data.
2. Simulate: Identify a “by chance alone” explanation for the data. Repeatedly simulate values
of the statistic that could have happened when the chance model is true.
3. Strength of evidence: Consider whether the value of the observed statistic from the
research study is unlikely to occur if the chance model is true. If we decide the observed statistic
is unlikely to occur by chance alone, then we can conclude that the observed data provide
strong evidence against the plausibility of the chance model. If not, then we consider the chance
model to be a plausible (believable) explanation for the observed data; in other words what we
observed could plausibly have happened just by random chance.
Let’s illustrate how we implemented the 3S Strategy for the Doris and Buzz example.
1. Statistic: Our observed statistic was 15, the number of times Buzz pushed the correct button
in 16 attempts.
2. Simulate: If Buzz was actually guessing, the parameter (the probability he would push the
correct button) would equal 0.5. In other words, he would push the correct button 50% of the
time in the long run. We used a coin flip to model what could have happened in 16 attempts
when Buzz is just guessing. We flip the coin 16 times and count how many of the 16 flips are
heads, meaning how many times Buzz pressed the correct button (“success”). We then repeat
this process many more times, each time keeping track of the number of the 16 attempts that
Buzz pushed the correct button. We end up with a distribution of could-have-been statistics
representing typical values for the number of correct pushes when Buzz is just guessing.
3. Strength of evidence: Because 15 successes of 16 rarely happens by chance alone, we
conclude that we have strong evidence that, in the long-run, Buzz is not just guessing.
Notice that we have used the result of 15 out of 16 correct attempts to infer that Buzz’s actual
long-run probability of pushing the correct button was not simply 0.5.
June 27, 2014
MAA PREP workshop
7
Another Doris and Buzz study
One goal of statistical significance is to rule out random chance as a plausible (believable)
explanation for what we have observed. We still need to worry about how well the study was
conducted. For example, are we absolutely sure Buzz couldn’t see the headlight around the
curtain? Are we sure there was no pattern to which headlight setting was displayed that he
might have detected? And of course we haven’t completely ruled out random chance, he may
have had an incredibly lucky day. But the chance of him being that lucky is so small, that we
conclude that other explanations are more plausible or credible.
One option that Dr. Bastian pursued was to re-do the study except now he replaced the curtain
with a wooden barrier between the two sides of the tank in order to ensure a more complete
separation between the dolphins to see whether that would diminish the effectiveness of their
communication.
Step1: Ask a research question. The research question remains the same: Can dolphins
communicate in a deep abstract manner?
Step 2: Design a study and collect data. The study design is similar with some adjustments to
the barrier between Doris and Buzz. The canvas curtain is replaced by a plywood board. The
research conjecture, observational units, and variable remain the same.
In this case, Buzz pushed the correct button only 16 out of 28 times. The variable is the same
(whether or not Buzz pushed the correct button), but the number of observational units (sample
size) has changed to 28 (the number of attempts).
Think about it: Based on the results for this phase of the study, do you think that Doris could
tell Buzz which button to push, even under these conditions? Or is it believable that Buzz could
have just been guessing?
Step 3: Explore the data. So our observed statistic is 16 out of 28 correct attempts, which is
16/28 × 100% = 57.1% of Buzz’s attempts. This is again more than half the time, but not much
larger than 50%. A simple bar graph of these results is shown in Figure 1.5.
Figure 1.5: Bar graph for Buzz’s 28 attempts
June 27, 2014
MAA PREP workshop
8
Step 4: Draw inferences. Is it plausible (believable) that Buzz was simply guessing in this set
of attempts? How do we measure how much evidence these results provide against the chance
model? Let’s use the same chance model as we used earlier to see what could have happened
if Buzz was just guessing. We will apply the 3S Strategy to this new study.
1. Statistic: The new observed sample statistic is 16 out of 28, or about 0.571.
Think about it: Consider again our simulation of the chance model assuming Buzz is guessing.
What do we need to change for this new phase of the study?
2. Simulation: This time we need to do repetitions of 28 coin flips, not just 16. A distribution of
the number of heads in 1000 repetitions of 28 coin flips is shown in Figure 1.6. This models
1000 repetitions of 28 attempts with Buzz randomly pushing one of the buttons (guessing) each
time.
Figure 1.6: A graph showing 1000 repetitions of flipping a coin 28 times and counting the
number of heads. This models the number of correct pushes in 28 attempts when Buzz is
guessing each time.
3. Strength of evidence: Now we need to consider the new observed statistic (16 out of 28 or
0.571). We see from the graph that 16 out of 28 is a fairly typical outcome if Buzz is just
randomly guessing. What does this tell us? It tells us that the results of this study are something
that could easily have happened if Buzz was just randomly guessing. So what can we
conclude? We can say his 16 successes are not convincing evidence against the “by chance
alone” model.
The graph in Figure 1.6 shows what happens for the hypothetical Buzz who just guesses. An
actual outcome far out in the tail of that distribution would be strong evidence against the “just
guessing” hypothesis. But be careful: The opposite result – an actual outcome near the center –
is not strong evidence in support of the guessing hypothesis. Yes, the result is consistent with
that hypothesis, but it is also consistent with many other hypotheses as well.
Bottom line: In this second study we conclude that there is not enough evidence that the “by
chance alone” model is wrong. That model is still a plausible explanation for the statistic we
observed in the study (16 out of 28). Based on this set of attempts, we do not have convincing
evidence against the possibility that Buzz is just guessing, but other explanations also remain
plausible. For example, the results are consistent with very weak communication between the
dolphins. All we know from this analysis is that one plausible explanation for the observed data
is that Buzz was guessing.
June 27, 2014
MAA PREP workshop
9
In fact, Dr. Bastian soon discovered that in this set of attempts, the equipment malfunctioned
and the food dispenser for Doris did not operate and so Doris was not receiving her fish rewards
during the study. Because of this malfunction, it’s not so surprising that removing the incentive
hindered the communication between the dolphins and we cannot refute that Buzz was just
guessing for these attempts.
Dr. Bastian fixed the equipment and ran the study again. This time he found convincing
evidence that Buzz was not guessing.
For a bit more discussion on processes and parameters, see FAQ 1.1
Exploration 1.1: Can Dogs Understand Human Cues?
Dogs have been domesticated for about 14,000 years. In that time, have they been able to
develop an understanding of human gestures such as pointing or glancing? How about similar
non-human cues? Researchers Udell, Giglio, and Wynne tested a small number of dogs in
order to answer these questions.
In this exploration, we will first see whether dogs can understand human gestures as well as
non-human gestures. To test this, the researchers positioned the dogs about 2.5 meters from
the experimenter. On each side of the experimenter were two cups. The experimenter would
perform some sort of gesture (pointing, bowing, looking) toward one of the cups or there would
be some other non-human gesture (a mechanical arm pointing, a doll pointing, or a stuffed
animal looking) toward one of the cups. The researchers would then see whether the dog would
go to the cup that was indicated. There were six dogs tested. We will look at one of the dogs in
two of his sets of trials. This dog, a 4 year-old mixed breed, was named Harley. Each trial
involved one gesture and one pair of cups, with a total of ten trials in a set.
We will start out by looking at one set of trials where the experimenter bowed toward one of the
cups to see whether Harley would go to that cup.
Step 1. State the research question.
1. Based on the description of the study, state the research question.
Step 2. Design a study and collect data. Harley was tested 10 times and 9 of those times he
chose the correct cup.
2. What are the observational units?
3. Identify the variable in the study. What are the possible outcomes of this variable? Is this
variable quantitative or categorical?
June 27, 2014
MAA PREP workshop
10
Step 3. Explore the data.
Definitions: The set of observational units on which we collect data is called the sample.
The number of observational units in the sample is the sample size. A statistic is a number
summarizing the results in the sample.
With categorical data, we typically report the number of “successes” or the proportion of
successes as the statistic.
4. What is the number of observational units (sample size) in this study?
5. Determine the observed statistic and produce a simple bar graph of the data (have one
bar for the proportion of times Harley picked the correct cup, and another for the
proportion of times he picked the wrong cup)
6. If the research conjecture is that Harley can understand what the experimenter means
when they bow toward an object, is the statistic in the direction suggested by the
research conjecture?
7. Could Harley have gotten 9 out of 10 correct even if he really didn’t understand the
human gesture and so was randomly guessing between the two cups?
8. Do you think it is likely Harley would have gotten 9 out of 10 correct if he was just
guessing randomly each time?
Step 4. Draw inferences beyond the data.
There are two possibilities for why Harley chose the correct cup nine out of ten times:


He is merely picking a cup at random, and in these 10 trials happened to guess correctly
in 9 of them. That is, he got more than half correct just by random chance alone.
He is doing something other than merely guessing and perhaps understands what the
experimenter means when they bow towards the cup.
The unknown long-run probability that Harley will chose the correct cup is called a parameter.
Definition: For a random process, a parameter is a long-run numerical property of the
process.
June 27, 2014
MAA PREP workshop
11
We don’t know the value of the parameter, but the two possibilities listed above suggest two
different possibilities.
9. What is the value of the parameter if Harley is picking a cup at random? Give a specific
value.
10. What is the possible range of values (greater than or less than some value) for the
parameter Harley is not just guessing and perhaps understands the experimenter?
We will know show you how statisticians use simulation to make a statement about the strength
of evidence for these two possible statements about the parameter’s value.
The Chance Model
Statisticians often use chance models to generate data from random processes to help them
investigate the process. In particular, they can see whether the observed statistic is consistent
with the values of the statistic simulated by the chance model. If we determine that Harley’s
results are not consistent with the results from the chance model, we will consider this to be
evidence against the chance model and in favor of the research conjecture, that he understands
the bowing gesture. In this case, we would say Harley’s results are statistically significant,
meaning unlikely to have occurred by chance alone.
We can’t perform the actual study more times in order to assess the second possibility, but we
can simulate the behavior of Harley’s choices if we were to assume the first possibility (that he
is simply guessing every time).
11. Explain how you could use a coin toss to represent Harley’s choices if he is guessing
between the two cups each time. How many times do you have to flip the coin to
represent one set of Harley’s attempts? What does heads represent?
12. If Harley was guessing randomly each time, on average, how many out of the 10 times
would you expect him to choose the correct cup?
13. Simulate one repetition of Harley guessing randomly by flipping a coin 10 times (why
10?) and letting heads represent selecting the correct cup (“success”) and tails represent
selecting the incorrect cup (“failure”). Count the number of heads in your 10 flips.
Combine your results with the rest of the class to create a dotplot of the distribution for
the number of heads out of 10 flips of a coin.
a. Where does 9 heads fall in the distribution? Would you consider it an unusual
outcome or a fairly typical outcome for the number of heads in 10 flips?
June 27, 2014
MAA PREP workshop
12
b. Based on your answer to the previous question, do you think it is plausible
(believable) that Harley was just guessing which cup to choose?
Using an applet to simulate flipping a coin many times
To really assess the typical values for the number of heads in 10 coin tosses (number of correct
picks by Harley assuming he is guessing at random), we need to simulate many more outcomes
of the chance model. Open the One Proportion applet from the textbook webpage.
Notice that the probability of heads has been set to be 0.50, representing the guessing between
the two cups. Set the number of tosses to 10 and press the Toss Coins button. What was the
resulting number of heads?
Notice that the number of heads in this set of 10 tosses is then displayed by a dot on the graph.
Uncheck the Animate box and press the Toss Coins button 9 more times. This will
demonstrate the how the number of heads varies randomly across each set of 10 tosses. Nine
more dots have been added to your dotplot. Is a pattern starting to emerge?
Now change the Number of repetitions from 1 to 990 and press Toss Coins. The applet will
now show the results for the number of heads in 1000 different sets of 10 coin tosses. So each
dot represents the number of times Harley chooses the correct cup out of ten attempts
assuming he is just guessing.
Remember why we conducted this simulation: to assess whether Harley’s result (9 correct in 10
attempts) would be unlikely to occur by chance alone if he were just guessing between the pair
of cups for each attempt.
14. Locate the result of getting 9 heads in the dotplot created by the applet. Would you
consider this an unlikely result, in the tail of the distribution of the number of heads?
15. Based on the results of 1000 simulated sets of 10 coin flips each, would you conclude
that Harley would be very unlikely to have picked the correct cup in 9 times in 10
attempts if he was randomly guessing between the two cups each time? Explain how
your answer relates to the applet’s dotplot.
Definition: A result is statistically significant if it is unlikely to occur just by random chance.
If our observed result appears to be consistent with the chance model, we say that the
chance model is plausible or believable.
16. Do the results of this study appear to be statistically significant?
June 27, 2014
MAA PREP workshop
13
17. Do the results of this study suggest that Harley just guessing is a plausible explanation
for Harley picking the correct cup 9 out of 10 times?
Summarizing your understanding
18. To make sure that you understand the coin flipping chance model fill in the following
table indicating what parts of the real study correspond to the physical (coin-flipping)
simulation.
Table 1.2: Parallels between real study and physical simulation
Coin flip
=
Heads
=
Tails
=
Chance of Heads = 1/2 =
One repetition
=
one set of ___ simulated attempts by Harley
The 3S Strategy
We will call the process of simulating could-have-been statistics under a specific chance model
the 3S Strategy. After forming our research conjecture and collecting the sample data, we will
use the 3S strategy to weigh the evidence against the chance model. This 3S Strategy will
serve as the foundation for addressing the question of statistical significance in Step 4 of the
Statistical Investigation Method.
3S Strategy for Measuring Strength of Evidence
1. Statistic: Compute the statistic from the observed sample data.
2. Simulate: Identify a “by chance alone” explanation for the data. Repeatedly simulate values
of the statistic that could have happened when the chance model is true.
3. Strength of evidence: Consider whether the value of the observed statistic from the
research study is unlikely to occur if the chance model is true. If we decide the observed statistic
is unlikely to occur by chance alone, then we can conclude that the observed data provide
strong evidence against the plausibility of the chance model. If not, then we consider the chance
model to be a plausible (believable) explanation for the observed data; in other words what we
observed could plausibly have happened just by random chance.
Let’s review how we have already applied the 3S strategy to this study.
19. Statistic. What is the statistic in this study?
20. Simulate. Fill in the blanks to describe the simulation.
June 27, 2014
MAA PREP workshop
14
We flipped a coin ____ times and kept track of how many times it came up heads. We
then repeated this process _____ more times, each time keeping track of how many
heads were obtained in each of the ______ flips.
21. Strength of evidence. Fill in the blanks to summarize how we are assessing the
strength of evidence for this study.
Because we rarely obtained a value of ________ heads when flipping the coin _______
times, this means that it is ____________ (believable/unlikely) that Harley is just
guessing, since if Harley was just guessing he ____________ (rarely/often) would get a
value like _________ correct out of ______ attempts.
Step 5: Formulate conclusions.
22. Based on this analysis, are you convinced that Harley can understand human cues?
Why or why not?
Another Study
One important step in a statistical investigation is to consider other models and whether the
results can be confirmed in other settings.
23. In a different study, the researchers used a mechanical arm (roughly the size of a human
arm) to point at one of the two cups. The researchers tested this to see whether dogs
understood non-human gestures. In 10 trials, Harley chose the correct cup 6 times.
a. Using the dotplot you obtained when you simulated 1000 sets of 10 coin flips
assuming Harley was just guessing, locate the result of getting 6 heads. Would
you consider this an unlikely result, in the tail of the distribution?
b. Based on the results of 1000 simulated sets of 10 coin flips each, would you
conclude that Harley would be very unlikely to have picked the correct cup in 6
times in 10 attempts, if he was randomly guessing between the two cups each
time? Explain how your answer relates to the applet’s dotplot.
June 27, 2014
MAA PREP workshop
15
c. Is this study’s result statistically significant?
d. Do the results of this study suggest that Harley just guessing is a plausible
explanation for Harley picking the correct cup 6 out of 10 times?
e. Does this study prove that Harley cannot understand the mechanical arm?
Step 6: Look back and ahead.
24. Compare the analyses between the two studies. How does the unusualness of the
observed statistic compare between the two studies? Does this make sense based on
the value of the observed statistic in the two studies? Does this make sense based on
how the two studies were designed? Explain. (Hint: Why might the results differ for
human and mechanical arms? Why would this matter?)
25. A single study will not provide all of the information needed to fully understand a broad,
complex research question. Thinking back to the original research question, what
additional studies would you suggest conducting next?
Section 1.1 Summary
The set of observational units on which we collect data is called a sample. The number of
observational units is the sample size. A number computed to summarize the variable
measured on a sample is called a statistic.
For a chance process, a parameter is a long-run numerical property of that process, such as a
probability (long-run proportion).
A simulation analysis based on a chance model can assess the strength of evidence provided
by sample data against a particular claim about the chance model. The logic of assessing
statistical significance employs what we call the 3S Strategy:
 Statistic: Compute an observed statistic from the data.
 Simulate: Identify a model for the “by chance alone” explanation. Repeatedly simulate
values of the statistic that could have occurred from that chance model.
 Strength of evidence: Examine how unusual the observed value of the statistic would
be under repeated application of the chance model.
o If the observed value of the sample statistic is unlikely to have occurred from the
chance model, then the data provide strong evidence against the chance model
as the explanation.
June 27, 2014
MAA PREP workshop
16
o
If the observed value of the sample statistic is not unlikely to have occurred from
the chance model, then the chance model is a plausible explanation for the
observed data.
The chance model considered in this section involved tossing a fair coin. This chance model
allowed for assessing whether an observed number of “successes” in a study provided strong
evidence that the two outcomes of a categorical variable were not equally likely. In the next
section you will consider other chance models, but the reasoning process will remain the same.
June 27, 2014
MAA PREP workshop
17
FAQ 1.1: What is a random process?
Q: So a parameter is a numerical property of a process… but what is the process? All I have is Buzz’s 16
attempts.
A: We can think of Buzz’s 16 attempts as a sample from his long-run selection process. He has some
underlying probability of pushing the correct button, but he’s unlikely to be correct every time, even if the
dolphins are communicating. There is randomness in his choices. His choices might be affected by Doris’
squeaks, but also by how hungry he is, how tired he is, how warm the water is. We are trying to see how
much Doris’ communication is influencing him. Everything else gets lumped together into “random
chance.” Any one outcome from a random process is unknown to us in advance (like tossing a coin), but
if we observe outcomes from the process for long enough, we should start to see some patterns amidst
the randomness. So you can think of a random process as an unlimited source of potential observations
for your sample.
Q: Can I use any old set of observations from a process as a sample?
A: Some samples are good and others are terrible.
Q: How do I tell the good apples from the rotten ones?
A: For a Yes/No process like Buzz’s selections, we need to be willing to make certain assumptions along
the way. For example, we assumed Buzz’s probability of a correct guess was the same every time. In
fact, under the null hypothesis, we assumed it was 50/50 every time. But even if the probability is not 0.5,
we are assuming it is not changing. In other words, we assume that Buzz doesn’t get better through
learning and that he doesn’t get tired or bored or quit trying. We are also assuming that his previous trials
don’t influence his future guesses – like a coin, he would have no memory of his past guesses.
Q: So how do I know if I have a good sample?
A: You need to believe that nothing about the process is changing and that each outcome does not
depend on past outcomes. If you’re a coin, your chance of heads doesn’t change, and you have no
memory of how you’ve landed in the past. If this is true about the process, then you will have a good
sample! On the other hand, if Buzz’s probability of success is different in the morning than in the
afternoon, then only observing him in the morning will not give you a good representation of his overall
probability of success.
Q: So then how does the parameter come into play?
A: Parameter is a hard concept, precisely because we never see it! For the dolphin example, we want to
know whether Buzz is just guessing. We think of his guesses as a potentially never-ending process (like
coin tossing), and the parameter is the probability that Buzz will be correct in his next attempt. As you saw
in the Preliminaries, this means if we were to observe the random process forever, this probability is the
long-run proportion of times that Buzz pushes the correct button.
Q: So I see why the parameter isn’t just 15/16, that is his sample proportion and we may not have hit that
long-run probability yet. So the parameter is just 0.5?
A: Not necessarily. The parameter equals 0.5 if he is just guessing. That’s the chance model we
simulated. But his probability may actually be different from 0.5.
Q: So the parameter is his actual probability of pushing the correct button. It’s a number, but we don’t
know its value.
A: Correct! But keep in mind that we will define lots of different kinds of parameters in future chapters.
The “probability of success” parameter only applies in this type of Yes/No random process.
June 27, 2014
MAA PREP workshop
18
Formalizing Terms
(adapted from Exploration 1.2: Tasting Water and Exploration 1.3: Do People Use
Facial Prototyping?)
A study in Psychonomic Bulletin and Review (Lea, Thomas, Lamkin, & Bell, 2007) presented
evidence that “people use facial prototypes when they encounter different names.” Participants
were given two faces and asked to identify which one was Tim and which one was Bob. The
researchers wrote that their participants “overwhelmingly agreed” on which face belonged to
Tim and which face belonged to Bob, but did not provide the exact results of their study.
Step 1: Ask a research question. We will gather data from your class to investigate the
research question of whether students have a tendency to associate certain facial features with
a name.
Step 2: Design a study and collect data.
1. Identify the observational units in this study.
2. Identify the variable. Is the variable quantitative or categorical?
3. How many outcomes can the variable have?
Definition: A binary variable is a categorical variable with only two outcomes. We also
(arbitrarily) define one outcome to be a success and one to be a failure.
4. Describe the parameter of interest (in words). (Hint: The parameter is the long-run
probability of …?)
5. One possibility here is that subjects are equally likely to associate “Tim” with the face on
the left as they are to associate “Bob” with the face on the left and so are essentially
guessing which name is associated with the face on the left. If this is the case, what is
the long-run probability that a subject would associate “Tim” with the face on the left?
6. Another possibility is that subjects are more likely to associate “Tim” with the face on the
left than they are to associate “Bob” with the face on the left. If this is the case, what can
you say about the long-run probability that a subject would associate “Tim” with the face
on the left? (Hint: You are not to specify a particular value this time; instead indicate a
direction from a particular value.)
June 27, 2014
MAA PREP workshop
19
Definition:
 The null hypothesis typically represents the “by chance alone” explanation. The chance
model (or “null model”) is chosen to reflect this hypothesis.
 The alternative hypothesis typically represents the “there is an effect” explanation that
contradicts the null hypothesis. It represents what the researchers are hoping to gather
evidence to support.
7. Your answers to #5 and #6 should be the null and alternative hypotheses for this study.
Which is which?
Step 3: Explore the data.
8. How many students put Tim as the name on the left? How many students participated in
this study (sample size)? What proportion put Tim’s name on the left?
Number (Tim on left):
Sample size:
Proportion (Tim on left):
Use of symbols
We can use mathematical symbols to represent quantities and simplify our writing. Throughout
the book we will emphasize written explanations, but will also show you mathematical symbols
which you are free to use as a short-hand once you are comfortable with the material. The
distinction between parameter and statistic is so important that we always use different symbols
to refer to them.
When dealing with a parameter that is a long-run probability, such as the probability that a
(future) subject in this study would associate “Tim” with the face on the left, we use the Greek
letter  (pronounced “pie”). But when working with a statistic that is the proportion of
“successes” in a sample, such as the proportion of subjects who associated “Tim” with the face
on the left, we use the symbol ̂ (pronounced “p-hat”). Finally, we use the symbol n to represent
the sample size.
9. What is the value of ̂ in this study?
10. What is the value of n in this study?
11. Hypotheses are always conjectures about the unknown parameter, . You can also use
H0 and Ha as short-hand notation for the null and alternative hypotheses, respectively. A
colon, ‘:’, is used to represent the word ‘is.’ Restate the null and alternative hypotheses
using .
H0:
Ha:
June 27, 2014
MAA PREP workshop
20
Step 4: Draw Inferences.
12. Is the sample proportion who associated “Tim” with the face on the left greater than the
probability specified in the null hypothesis?
13. Is it possible that the sample proportion could turn out to be this large even if the null
hypothesis were true?
As we did with Buzz and Doris in Section 1.1, we will use simulation to investigate how
surprising the observed sample result would be, if in fact subjects were just as likely to
associate “Tim” with the face on the left as they were to associate “Bob” with the face on the left.
(Note also that our null model assumes the same probability for all subjects.)
14. We will now use the One Proportion applet to conduct this simulation analysis.
a. First enter the probability of success value specified in the null hypothesis.
b. Enter the appropriate sample size (number of subjects in this study).
c. Enter 1 for the number of samples, and press Draw Samples. Report the number of
“successes” in this simulated sample.
d. Now, select the radio button for “Proportion of successes.” Report the proportion of
successes in this simulated sample. Use your answer to “c” to verify how this value is
calculated.
e. Leaving the “Proportion of successes” radio button selected, click on Draw Samples
four more times. Do you get the same results each time?
f.
Now enter 995 for the number of samples and click on Draw Samples, bringing the
number of simulated samples to 1000. Comment on the center, variability, and shape
of the resulting distribution of sample proportions.
This distribution of simulated sample proportions is called the null distribution, because it is
based on assuming the null hypothesis to be true.
15. Restate the observed value of the sample proportion, p̂ , who associated “Tim” with the
face on the left.
June 27, 2014
MAA PREP workshop
21
16. Looking at the null distribution you have simulated, is this a very unlikely result, when the
null hypothesis is true? In other words, is this value far in the tail of the null distribution?
You might find that #16 is a bit of a tough call if the observed sample statistic is not far in the tail
of the distribution. To help make a judgment about strength of evidence in this case, we can
count how many (and what proportion) of the simulated sample proportions are as extreme, or
more extreme, than the observed value.
17. Use the applet to count how many (and what proportion) of the simulated sample
proportions are more extreme than the observed value, p̂ .
To do this, enter the observed p̂ (the observed sample proportion who associated “Tim”
with the face on the left) in the box to the left of the Count button. Then press the Count
button. Record the number and proportion of simulated sample proportions that are as
extreme, or more than, than the observed p̂ .
Definition: The p-value is estimated as the proportion of simulated statistics in the null
distribution that are at least as extreme (in the direction of the alternative hypothesis) as the
value of the statistic actually observed in the research study.
How do we evaluate this p-value as a judgment about strength of evidence provided by the
sample data against the null hypothesis? One answer is: The smaller the p-value, the stronger
the evidence against the null hypothesis and in favor of the alternative hypothesis. But how
small is small enough to regard as convincing? There is no definitive answer, but here are some
guidelines:
Guidelines for evaluating strength of evidence from p-values
0.10 < p-value
not much evidence against null hypothesis; null is plausible
0.05 < p-value < 0.10
moderate evidence against the null hypothesis
0.01 < p-value < 0.05
strong evidence against the null hypothesis
p-value < 0.01
very strong evidence against the null hypothesis
The smaller the p-value, the stronger the evidence against the null hypothesis.
18. Is the approximate p-value from your simulation analysis (your answer to #17) small
enough to provide much evidence against the null hypothesis that subjects are equally
likely to associate “Tim” with the face on the left as they are to associate “Bob” with the
face on the left? If so, how strong is this evidence? Explain.
June 27, 2014
MAA PREP workshop
22
Step 5: Formulate conclusions.
19. Do you consider the observed sample result to be statistically significant? Recall that this
means that the observed result is unlikely to have occurred by chance alone.
20. How broadly are you willing to generalize your conclusions? Would you be willing to
generalize your conclusions beyond the subjects in this study? How broadly? Explain
your reasoning.
Step 6: Look back and ahead.
21. Suggest a new research question that you might investigate next, building on what you
learned in this study.
June 27, 2014
MAA PREP workshop
23
June 27, 2014
MAA PREP workshop
24
Alternate Measures for Strength of Evidence
(adapted from Exploration 1.3: Do People Use Facial Prototyping?)
A study in Psychonomic Bulletin and Review (Lea, Thomas, Lamkin, & Bell, 2007) presented
evidence that “people use facial prototypes when they encounter different names.” Participants
were given two faces and asked to identify which one was Tim and which one was Bob. The
researchers wrote that their participants “overwhelmingly agreed” on which face belonged to
Tim and which face belonged to Bob, but did not provide the exact results of their study.
Step 1: Ask a research question. We will gather data from your class to investigate the
research question of whether students have a tendency to associate certain facial features with
a name.
Step 2: Design a study and collect data. Each student in your class will be shown the same
two pictures of men’s faces used in the research study. You will be asked to assign the name
Bob to one photo and the name Tim to the other. Each student will then submit the name that
he or she assigned to the picture on the left. Then the name that the researchers identify with
the face on the left will be revealed.
The parameter of interest here is the long-run probability that a student in your class would
assign the same name to the face on the left.
1. State the null and alternative hypotheses to be tested when the data are collected.
Express these both in words and symbols. (Hint: Think about the parameter and the
research question of interest here.)
Step 3: Explore the data.
2. How many students put Tim as the name on the left? How many students participated in
this study (sample size)? What proportion put Tim’s name on the left?
Number (Tim on left):
Sample size:
Proportion (Tim on left):
When we conduct analyses with binary variables, we often call one of the outcomes a success
and the other a failure, and then focus the analysis on the success outcome. It is arbitrary which
outcome is defined to be a success, but you need to make sure you do so consistently
throughout the analysis. In this case we’ll call “Tim on left” a success because that’s what
previous studies have found to be a popular choice.
June 27, 2014
MAA PREP workshop
25
Step 4: Draw inferences. You will use the One Proportion applet to investigate how surprising
the observed class statistic would be, if students were just randomly selecting which name to
put with which face.
3. Before you use the applet, indicate what you will enter for the following values:
a. Probability of success:
b. Sample size:
c. Number of repetitions:
4. Conduct this simulation analysis. Make sure the “Proportion of heads” button is selected
in the applet and not the “Number of heads.”
a. Indicate how to calculate the approximate p-value (count the number of simulated
statistics that equal __________ or are ___________).
b. Report the approximate p-value.
c. Use the p-value to evaluate the strength of evidence provided by the sample data
against the null hypothesis, in favor of the alternative that students really do tend to
assign the name Tim (as the researchers predicted) to the face on the left.
The p-value is the most common way to evaluate strength of evidence, but now we will explore
a common alternative way to evaluate strength of evidence. The goal of any measure of
strength of evidence is to use a number to assess whether the observed statistic falls in the tail
of the null distribution (and is therefore surprising if the null hypothesis is true) or among the
typical values we see when the null hypothesis is true.
5. Check the Summary Stats box in the applet.
a. Report the mean (average) value of the simulated statistics.
b. Explain why it makes sense that this mean is close to 0.5.
c. Report the standard deviation (SD) of the simulated statistics.
June 27, 2014
MAA PREP workshop
26
d. Report (again) the observed class value of the statistic. (What proportion of students
in your class put Tim’s name on the left?)
p̂ =
e. Calculate how many standard deviations the observed class value of the statistic is
from the hypothesized mean of the null distribution, 0.5. In other words, subtract the
0.5 from the observed value, and then divide by the standard deviation. This is the
standardized statistic
z = (observed statistic p̂ – 0.5) / SD of null distribution.
f.
Once you calculate this value, you interpret it as “how many standard deviations the
observed statistic falls from the hypothesized parameter value.” What strength of
evidence against the null does your standardized statistic provide?
g. How closely does your evaluation of strength of evidence based on the standardized
statistic compare to the strength of evidence based on the p-value in #4c?
Guidelines for evaluating strength of evidence from standardized values of statistics
Standardizing gives us a quick, informal way to evaluate the strength of evidence against the
null hypothesis:
between -1.5 and 1.5:
below -1.5 or above 1.5:
below -2 or above 2:
below -3 or above 3:
little or no evidence against the null hypothesis;
moderate evidence against the null hypothesis;
strong evidence against the null hypothesis;
very strong evidence against the null hypothesis.
Step 5: Formulate conclusions.
6. Now, let’s step back a bit further and think about the scope of inference. We have found
that in most classes, the observed data provide strong evidence that students do better
than random guessing which face is Tim’s and which is Bob’s. In that case, do you think
that most students at your school would agree on which face is Tim’s? Do you think this
means that most people can agree on which face belongs to Tim? Furthermore, does
this mean that all people do ascribe to the same facial prototyping? We will look more at
the scope of inference in chapters 2 and 4.
June 27, 2014
MAA PREP workshop
27
Step 6: Look back and ahead.
7. Discuss limitations to this study?
8. Based on the limitations of this study, suggest a new research question that you would
investigate next.
Extensions
9. In #2 you recorded the proportion of students in your class who put Tim’s name with the
photo on the left. Imagine that the proportion was actually larger than that (e.g., if your
class was 60%, imagine it was 70%).
a. How would this have affected the p-value:
Larger
Same
Smaller
b. How would this have affected the standardized value:
Larger
Same
Smaller
c. How would this have affected the strength of evidence against the null hypothesis:
Stronger
Same
Weaker
10. What if the proportion who put Tim’s name on the left was the same, but 100 students
were in your class.
a. How would this have affected the variability in the null distribution?
Larger
Same
Smaller
b. How would this have affected the p-value?
Larger
Same
Smaller
c. How would this have affected the standardized value:
Larger
Same
Smaller
d. How would this have affected the strength of evidence against the null hypothesis:
Stronger
June 27, 2014
Same
Weaker
MAA PREP workshop
28
11. Suppose that less than half of the students in your class had put Tim’s name on the left,
so your class result was in the opposite direction of the alternative hypothesis. What can
you say about the standardized value of the statistic in this case? Explain. (Hint: You
cannot give a value for the standardized statistic, but you can say something specific
about its value.)
June 27, 2014
MAA PREP workshop
29
Alternative Values for Probability of Success
(adapted from Chapter 1 Investigation)
Background: A legendary story on college campuses concerns two students who miss a
chemistry exam because of excessive partying but blame their absence on a flat tire. The
professor allowed them to take a make-up exam, and he sent them to separate rooms to take it.
The first question, worth five points, was quite easy. The second question, worth ninety-five
points, asked: Which tire was it?
Wednesday we collected data from you, asking which of the four tires you would pick if you
were caught in this situation (and had to make up a tire on the spot). For some reason, I
conjecture that college students appear to have a tendency to pick the
tire. You will
use these data to conduct a test of whether the sample data provide convincing evidence that
this tire is picked more often than we would expect if the four tires were picked equally often
(purely at random).
Step 1: Ask a research question
(a) Identify the research conjecture.
Step 2: Design a study and collect data
(b) What are the observational units?
(c) What is the variable that is measured/recorded on each observational unit? (Hint: State as a
binary variable.)
(d) Describe the parameter of interest in words.
(e) State the null and alternative hypotheses to be tested.
Null hypothesis (H0) >>
Alternative hypothesis (Ha) >>
Step 3: Explore the data
Fill in the tallies of your responses.
Front left
Front right
Back left
Back right
Total
(f) Calculate the observed statistic: What proportion of students selected the left front tire? How
many students participated? (Include the symbols you would use to represent these value
(copy and paste one of these symbols below: p̂ , , ̅ , , , s, n)
June 27, 2014
MAA PREP workshop
30
Step 4: Draw inferences beyond the data
Because I want to focus on the right front tire, I will make the variable binary (right front or not)
and the parameter of interest in this study is the long-run probability that a Cal Poly student
picks the left front tire. What symbol should I use to refer to this unknown parameter value:
p̂ , , ̅ , , , s, n – copy and paste below.
(g) Restate the hypotheses in terms of this parameter.
(h) Is the sample proportion who selected the left front tire greater than one-fourth? (If not,
there’s no need to conduct a simulation analysis. You will have no evidence that this tire is
selected more than one-fourth of the time in the long run.)
(i) Simulate: Use the One Proportion applet to simulate 1000 repetitions of this study,
assuming that every student in class has a 0.25 probability of selecting the right front tire.
For your simulation
“Success” represents:
“Chance of success” =
One repetition =
(j) Using the proportion of successes for the values on the horizontal axis, what is the center of
your null distribution? Does it make sense that this is the center? Explain.
(k) Strength of evidence: Determine the p-value from your simulation results. Also interpret
what this p-value probability represents (i.e., the long-run proportion of what that do what
assuming what?!).
(l) Evaluate this p-value: Does this p-value provide strong evidence against the null hypothesis?
(m) Determine and interpret the standardized statistic from your simulation analysis.
Standardized statistic =
observed statistic  hypothesized parameter
=
standard deviation of null distributi on
(n) According to this standardized statistic, is the observed value of the sample proportion
surprising when the null hypothesis is true? Is this consistent with what the p-value told you?
June 27, 2014
MAA PREP workshop
31
Step 5: Formulate Conclusions
(o) Write a paragraph, as if to a member of your family who has not taken statistics,
summarizing the conclusion that you would draw from this study and your simulation analysis.
Clearly explain to this intelligent but not-statistical family member how you are coming to this
conclusion/the reasoning behind your conclusion.
Step 6: Look back and ahead
(p) Now, let’s step back a bit and think about the scope of our inference. For example: Have we
shown that are equally likely to pick among the four tires? Does it follow that people in general
don’t have a tendency to pick the left front? Are there any improvements you would suggest for
how the data were collected? Suggest a future research study that you would be interested in
investigating based on these results.
Limitations in conclusions? >>
Improvements would make to study? >>
Future research questions? >>
June 27, 2014
MAA PREP workshop
32
Two-Sided Tests
(adapted from Example 1.5: Halloween Treats)
Stemming from concern over the nation’s obesity epidemic, researchers investigated whether
children might be as tempted by toys as by candy for Halloween treats. Test households in five
Connecticut neighborhoods offered children two plates: one with lollipops or fruit candy and one
containing small, inexpensive Halloween toys, like plastic bugs that glow in the dark. The
researchers observed the selections of 283 trick-or-treaters between the ages of 3 and 14
(Schwartz, Chen, and Brownell, 2003).
1. Identify the parameter of interest in this study.
2. State the null and alternative hypotheses to be tested. (Note that no direction is
anticipated in advance for this study, so the alternative should be two-sided.)
The researchers found that 148 trick-or-treaters selected the candy, and 135 chose a toy.
3. Calculate the value of the statistic for this study.
4. Use the applet to conduct a simulation analysis with 10,000 repetitions, assuming the
null hypothesis. Comment on the shape, center, and variability of the simulated
distribution of sample proportions.
5. Use the simulation results to approximate the p-value for this study. Be sure to click on
“Two-sided.” Also describe which values the applet counts to determine this p-value,
and explain why these are the correct values to count.
June 27, 2014
MAA PREP workshop
33
6. Calculate the standardized value of the statistic for this study.
7. Interpret what this standardized values reveals.
8. Summarize your conclusion from the simulation analysis and standardized statistic from
this study.
9. Look back and ahead: What limitations do you see with this study? How broadly would
you feel comfortable generalizing your conclusions? What would you like to see in a
follow-up study?
June 27, 2014
MAA PREP workshop
34
Theory-Based Inference
(adapted from Examples 1.3 and 1.4)
Example 1.4: Halloween Treats
Recall that stemming from concern over the nation’s obesity epidemic, researchers investigated
whether children might be as tempted by toys as by candy for Halloween treats. Test
households in five Connecticut neighborhoods offered children two plates: one with lollipops or
fruit candy and one containing small, inexpensive Halloween toys, like plastic bugs that glow in
the dark. The researchers observed the selections of 283 trick-or-treaters between the ages of 3
and 14 (Schwartz, Chen, and Brownell, 2003). 
To investigate whether children show a preference for either the candy or the toys, we test the
following hypotheses.
Null hypothesis: The probability a trick-or-treater would choose candy is 0.5.
Alternative hypothesis: The probability a trick-or-treater would choose candy is not 0.5.
Note that our null model assumes that the probability of choosing candy (π) is the same for all
children. In symbols, these hypotheses translate to
H0:  = 0.5
Ha:  ≠ 0.5.
The researchers collect the reactions of 283 children for the study. With a sample size of 283,
under our null hypothesis, we simulated the null distribution (using 10,000 simulated samples)
shown in Figure 1.1.
Figure 1.1: A null distribution representing 10,000 simulated samples of 283 children where the
probability that an individual child would choose candy is 0.5
1. With regard to the above null distribution:
a. What is the numeric value of the center of this null distribution? Does that make
sense?
b. What is the SD of this null distribution?
June 27, 2014
MAA PREP workshop
35
c. Would it be appropriate to describe the overall shape of this null distribution as
bell-shaped?
2. As you did in the previous session, use the One Proportion applet to run a simulation
(with 10000 repetitions) to find the approximate p-value for this study. Be sure to click
on “Two-sided.” Record the p-value: ___________
Theory-based approach (One proportion z test)
In the early 1900s, and even earlier, computers weren’t available to do simulations, and as
people didn’t want to sit around and flip coins all day long, they focused their attention on
mathematical and probabilistic rules and theories that could predict what would happen if
someone did simulate.
They proved the following key result (often called the Central Limit Theorem):
Central Limit Theorem (CLT): The distribution of sample proportions will be centered at the
long-run probability (), with a standard deviation of √(1 − )/. If the sample size (n) is
large enough, then the shape of the distribution is bell-shaped.
One bit of ambiguity in the statement is how large is large enough for the sample size? As it
turns out, the larger the sample size is, the better the prediction of bell-shaped behavior in the
null distribution is, but there is not a sample size where all of the sudden the prediction is good.
However, some people have used the convention that you should have at least 10 successes
and at least 10 failures in the sample.
Validity conditions: The normal approximation can be thought of as a prediction of what
would occur if simulation was done. Many times this prediction is valid, but not always,
only when the validity condition (at least 10 successes and at least 10 failures) is met.
3. How well does the CLT predict the center, SD, and shape of the null distribution of
sample proportion of children who choose candy, as displayed in Figure 1.1?
a. Center:
b. SD:
c. Shape:
4. Now go back to the One Proportion applet, and check the box next to “Normal
Approximation." Be sure that the “Two-sided” option is also selected. Record the normal
approximation-based p-value: ___________
5. Compare the simulation-based p-value (from #2) to the normal approximation-based pvalue (from #4). Are the very different or about the same?
June 27, 2014
MAA PREP workshop
36
6. In case of discrepancy between the simulation-based p-value and the normal
approximation-based p-value, which do you think would be the more reliable?
7. (Optional) Notice that the applet provides you with an option to report the Exact Binomial
p-value. Check on the option, and report the resulting p-value: ___________________.
8. (Optional) Which p-value is closer to the Exact Binomial p-value: the simulation-based pvalue (from #2) or the normal approximation-based p-value (from #4)?
Figure 1.2 shows a few examples of null distributions.
Figure 1.2: Null distributions for various studies. A bell-shaped distribution, called a normal
distribution, can be used to nicely approximate some of them, but will not work well for others.
St. George’s
(n = 10;  = 0.15)
St. George’s
(n = 361;  = 0.15)
Halloween treats
(n = 283;  = 0.5)
The St. George’s (n = 361) and Halloween treats (n = 283) simulations show bell-shaped
curves. However, the first sample of St. George’s heart patients (n = 10) is a not bell-shaped
curve, in part because the distribution is not symmetric and in part because there are so few
values that the sample proportion can be.
June 27, 2014
MAA PREP workshop
37
A situation where a theory-based approach doesn’t work
Example 1.3: Heart Transplant Operations
In an article published in the British Medical Journal (2004), researchers Poloniecki, Sismanidis,
Bland, and Jones reported that heart transplantations at St. George’s Hospital in London had
been suspended in September 2000, after a sudden spike in mortality rate. Of the last 10 heart
transplants, 80% had resulted in deaths within 30 days of the transplant. Newspapers reported
that this mortality rate was over five times the national average. Based on historical national
data, the researchers used 15% as a reasonable value for comparison.
Think about it: Why do you think the Theory-based (One proportion z-test; Normal
approximation) approach will not work well for this data?
Applying the 3S Strategy
Using the 3S Strategy from the previous section, we observed 0.80 as our statistic and now we
will simulate one thousand repetitions from a process where  = 0.15 (under the null
hypothesis). We did this using the One Proportion applet as shown in Figure 1.2.
Figure 1.2: Null distribution (could-have-been simulated sample proportions) for 1000
repetitions of drawing samples of 10 “patients” from a process where the probability of death is
equal to 0.15. “Success” has been defined to be patient death.
9. How well does the CLT predict the center, SD, and shape of the null distribution of
sample proportion of deaths among a sample of 10 heart transplant cases, as displayed
in Figure 1.2?
a. Center:
Does it make sense?
b. SD:
c. Shape:
June 27, 2014
MAA PREP workshop
38
Another situation where a theory-based approach doesn’t work
Recall the study about the legendary story on college campuses concerns two students who
miss a chemistry exam because of excessive partying but blame their absence on a flat tire.
The professor allowed them to take a make-up exam, and he sent them to separate rooms to
take it. The first question, worth five points, was quite easy. The second question, worth ninetyfive points, asked: Which tire was it?
When I collected data in my class of 36 students, 18 picked the ____________ tire. To explore
whether this is unlikely to have happened by random chance, I generated the following null
distribution.
10. As you can see, I forgot to include the mean of the null distribution in the applet output.
Use the CLT to predict what the mean of this null distribution should be.
11. Now use the CLT to predict the SD of this null distribution.
12. Do you think the One proportion z-test is a valid approach to find the p-value in the
context of this study? Why or why not? Explain. (Hint: Is it appropriate to describe the
overall shape of the null distribution as bell-shaped? Is the validity condition of at least
10 successes and at least 10 failures helpful in deciding whether the theory-based
approach is valid in the context of this study?)
13. Use the One Proportion applet to find the following:
a. Simulation-based p-value: __________
b. Normal approximation-based (theory-based) p-value:
c. Which should you trust more: (a) or (b)?
June 27, 2014
MAA PREP workshop
39
Assessment examples from Chapter 1
Sample quiz #1
1. Which definition of a p-value is the most accurate?
a. If the alternative hypothesis is true, the p-value is the probability of observing an
outcome as extreme or more extreme than the one actually observed in the study
b. The p-value is the probability that the alternative hypothesis is true
c. If the null hypothesis is true, the p-value is the probability of observing an
outcome as extreme or more extreme than the one actually observed in the study
d. The p-value is the probability that the null hypothesis is true
Circle TRUE or FALSE each of the following (#2 through #5)
2. A p-value can be negative
TRUE
FALSE
3. A standardized statistic can be negative
TRUE
FALSE
4. We run tests of significance to determine whether ̂ is larger or smaller than some value.
TRUE
FALSE
5. We run tests of significance to determine whether  is larger or smaller than some value.
TRUE
FALSE
6. Dwayne Wade of the Miami Heat hit 569 of his 1093 field goal attempts in the 2012/2013
season for a shooting percentage of 52.1%. Over the lifetime of Dwayne’s career, can
we say that Dwayne is more likely than not to make a field goal?
a. Is the long-run probability of Dwayne making a field goal a parameter or a
statistic?
b. Is 52.1% a parameter or a statistic?
c. When simulating possible outcomes assuming the chance model, how many
times would you flip a coin for one repetition of the 2012/2013 season?
d. With each repetition, what would you keep track of?
e. What would be a typical value from a single repetition?
June 27, 2014
MAA PREP workshop
40
7. Following up on question #6. Suppose at the beginning of the 2013/2014 season
Dwayne Wade made 12 of his first 20 field goal attempts Is there evidence that in the
long run Dwayne will make his field goal a majority of the time in the 2013/2014 season?
a. Let π be the long-run probability that Dwayne makes his field goal. State the null
and alternative hypotheses using symbols.
b. Based on the simulated null distribution, what is the p-value for your test?
c. Based on the simulated null distribution, what is the standardized statistic for
your test?
d. Based on the p-value or standardized statistic, do you have strong evidence that
in the long-run Dwayne will make a majority of his field goals? Explain.
June 27, 2014
MAA PREP workshop
41
Sample quiz #2
Use the following situation to answer all questions.
“Rock-Paper-Scissors” is a game in which two players simultaneously show one of three hand
symbols (denoted rock, paper, or scissors). Rock beats scissors, scissors beats paper, and
paper beats rock. An article published in College Mathematics Journal (Eyler, Shalla,
Doumaux, and McDevitt, 2009) found that players tend to not prefer scissors, choosing it less
often than expected if the 3 choices were made equally. They collected data from 119 people
who played games against a computer. In the study, 55.5% started with rock, 32.8% started
with paper, and 11.8% with scissors in their first round.
Researchers wanted to know whether these data provided convincing evidence that scissors
are chosen less than one-third of the time in the long run.
1. What are the observational units?
2. What is the variable?
3. State the null hypothesis
4. State the alternative hypothesis.
5. Identify the parameter of interest in this study
A. People playing the game against a computer
B. Whether or not someone chooses scissors
C. 11.8%
D. The long-run probability someone plays scissors
E. 1/3
June 27, 2014
MAA PREP workshop
42
When using the One-proportion inference
applet, indicate which values you would
enter into each of the boxes
6. Box A_______
7. Box B_______
8. Box C_______
9. Box D_______ (Hint: Answer is ≤ or
≥)
10. Box E_______
11. Write a one-sentence interpretation of the p-value for this study (Note: you do not need
to provide the actual VALUE of the p-value).
The p-value is the proportion of simulations which are _______________or ____________
(more/less) assuming ___________________________________ (name what is assumed to
be true)
Here is output from the simulation for use in the following questions:
June 27, 2014
MAA PREP workshop
43
12. What is the approximate value of the standardized statistic in this case?
A. Z = 0
B. Z = -1
C. Z = 3
D. Z = -5
13. Would you consider the results of this study to be convincing evidence that scissors are
chosen less often in the long run than expected?
A.
B.
C.
D.
E.
No, the p-value is going to be large
No, the p-value is going to be small
Yes, the p-value is going to be small
Yes, the p-value is going to be large
No, the distribution is centered at 1/3.
June 27, 2014
MAA PREP workshop
44
Sample quiz #3 (for after Chapters 1 and 2)
1. Which of the following is true about sampling?
a) Larger samples are always better than smaller samples, regardless of how the sample
were collected.
b) Larger samples will have less sampling variability than smaller samples.
c) Larger samples give less biased estimates than smaller samples.
d) Larger samples are needed with larger populations.
2. Recall the Gettysburg Address exploration. Which of the following sampling methods would
be unbiased?
a) Select your sample of words by closing your eyes and dropping a pencil down on the
page, record the word, repeat until you have 10 words.
b) Select your sample of words by cutting each word out of the address, mixing them up in
a hat and drawing out 10 slips of paper.
c) Select your sample of words by typing each word onto the same sized slips of paper,
mixing them up in a hat and drawing out 10 slips of paper.
d) Select your sample of words by randomly choosing words you think represent all the
words in the Gettysburg Address.
3. The following dotplot is the time estimates for a 10 second song snippet we looked at in
class. The highest time estimates were 30 seconds.
Suppose that one of the observations recorded as 30 seconds was a typo, and was meant
to be 20. If we corrected this entry in the dataset, how would the following numerical
statistics change, if at all?
Mean:
SMALLER
SAME
LARGER
Median:
SMALLER
SAME
LARGER
Standard Deviation:
SMALLER
SAME
LARGER
June 27, 2014
MAA PREP workshop
45
4. You have heard that in sports like boxing there might be some competitive advantage to
those wearing red uniforms. You want to test this with your new favorite sport of chessboxing. You randomly assign blue and red uniforms to contestants in 20 matches and find
that those wearing red won 14 times (or 70%). You conduct a test of significance using
simulation and get the following null distribution. (Note this null distribution uses only 100
simulated samples and not the usual 1000 or more.)
a) Suppose you want to see if competitors wearing red win more than 50% of the matches
in the long run, so you test H0:  = 0.5 versus Ha:  > 0.5. What is your p-value based
on the above null distribution?
b) Suppose you now want to see if competitors wearing either red or blue have an
advantage, so you test H0:  = 0.5 versus Ha:  ≠ 0.5. What is your p-value now based
on the above null distribution?
5. Suppose you are testing the hypotheses H0:  = 0.5 versus Ha:  > 0.5. You get a sample
proportion of 0.54 and find that your p-value is 0.08. Now suppose you redid your study with
each of the following changes, will your new p-value be larger or smaller than the 0.08 your
first obtained?
a) You increase the sample size and still find a sample proportion of 0.54.
LARGER
SMALLER
b) Keeping the sample size the same, you take a new sample and find a sample proportion
of 0.55.
LARGER
SMALLER
c) With your original sample, you decided to test a two-sided alternative instead of Ha:  >
0.5.
LARGER
June 27, 2014
SMALLER
MAA PREP workshop
46
A sample exam question (for after Chapters 1 and 2)
Studying at Dordt
A sample of twenty Dordt students is taken, four of whom (4/20=20%) say they study at least 35
hours per week during the academic year. At most state universities, the proportion of students
who report studying at least 35 hours a week is 10%. We wish to see if the Dordt sample
provides strong evidence that the true proportion of Dordt students who study more than 35
hours a week is more than 10%.
Two different approaches were taken in order to yield a p-value.
Option #1. 1000 sets of 20 spins on a spinner were generated where the probability of
the spinner pointing to the blue region was 10%.Out of the 1000 sets of 20 spins, 148
sets had at least 4 spins yield blue, and so a p-value of 0.148 is obtained, showing no
evidence that more than 10% of Dordt students study more than 35 hours a week.
Option #2. The theory-based inference applet was used, generating a standardized zstatistic of 1.49 with a p-value of 0.068, yielding moderate evidence that more than 10%
of Dordt students study more than 35 hours a week.
Option #1
Option #2
Briefly explain which p-value (Option #1 or Option #2) is more valid and why.
Note: While the results obtained from the coin tossing applet are subject to some variation
because it is based on simulation that is NOT the main reason for the discrepancies between
the two p-values.
June 27, 2014
MAA PREP workshop
47
Generalization
(adapted from Exploration 2.1A: Sampling Words)
1. Select a representative set of 10 words from the following passage by circling them.
Four score and seven years ago, our fathers brought forth upon
this continent a new nation: conceived in liberty, and dedicated
to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that
nation, or any nation so conceived and so dedicated, can long
endure. We are met on a great battlefield of that war.
We have come to dedicate a portion of that field as a final resting
place for those who here gave their lives that that nation might
live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we cannot dedicate, we cannot consecrate, we cannot hallow this ground. The brave men, living and
dead, who struggled here have consecrated it, far above our poor
power to add or detract. The world will little note, nor long
remember, what we say here, but it can never forget what they
did here.
It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly
advanced. It is rather for us to be here dedicated to the great
task remaining before us, that from these honored dead we take
increased devotion to that cause for which they gave the last full
measure of devotion, that we here highly resolve that these dead
shall not have died in vain, that this nation, under God, shall
have a new birth of freedom, and that government of the people,
by the people, for the people, shall not perish from the earth.
The authorship of literary works is often a topic for debate. For example, researchers have tried
to determine whether some of the works attributed to Shakespeare were actually written by
Bacon or Marlow. The field of “literary computing” examines ways of numerically analyzing
authors’ works, looking at variables such as sentence length and rates of occurrence of specific
words.
Definitions: The population is the entire collection of observational units we are interested
in, and the sample is a subgroup of the population on which we record data. Numerical
summaries about a population are called parameters, and numerical summaries calculated
from a sample are called statistics.
June 27, 2014
MAA PREP workshop
48
The above passage is, of course, Lincoln’s Gettysburg Address, given November 19, 1863 on
the battlefield near Gettysburg, PA. We are considering this passage a population of words, and
the 10 words you selected are considered a sample from this population. In most studies, we do
not have access to the entire population and can only consider results for a sample from that
population, but to learn more about the process of sampling and its implications we will now
deal with a somewhat artificial scenario where we sample from this known population.
2. Record each word from your sample, and then indicate the length of the word (number of
letters) and whether or not the word contains at least one letter e.
Word
Length
(no. of letters)
Contains e?
(Y or N)
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
3. Identify the observational units and the variables you have recorded on those
observational units. (Keep in mind that observational units do not have to be people!)
4. Is the variable “length of word” quantitative or categorical?
5. Calculate the average length of the 10 words in your sample. Is this number a parameter
or a statistic? Explain how you know. What symbol would you use to refer to this value?
6. The average length of the 268 words in entire speech equals 4.29 letters. Is this number a
parameter or a statistic? Explain how you know. What symbol would you use to refer to
this value?
June 27, 2014
MAA PREP workshop
49
7. Do you think the words you selected are representative of the 268 words in this
passage? Suggest a method for deciding whether you have a representative sample.
(Hint: Whereas any one sample may not produce statistics that exactly equal the
population parameters, what would we like to be true in general?)
8. Combine your results with your classmates’ by producing a dotplot of the distribution of
average word lengths in your samples. Be sure to label the axis of this dotplot
appropriately. Describe the shape, center, and variability of the distribution of average
word lengths as revealed in the dotplot.
9. Let’s compare your sample statistics to the population parameter. How many and what
proportion of students in your class obtained a sample average word lengths larger than
4.29 letters, the average word length in the population?
Definition: A sampling method is biased if, when using that sampling method, statistics from
different samples consistently overestimate or consistently underestimate the population
parameter of interest.
Note that bias is a property of a sampling method, not a property of an individual sample. Also
note that the sampling method must consistently produce non-representative results in order to
be considered biased.
10. What does your answer to #9 tell you about whether the sampling method of asking
students to quickly pick 10 representative words is biased or unbiased? If biased, what is
the direction of the bias (tendency to overestimate or to underestimate)?
11. Do you think asking each of you to take 20 words instead of 10 words have helped with
this issue? Explain.
June 27, 2014
MAA PREP workshop
50
Now consider a different sampling method: What if you were to close your eyes and point blindly
at the page with a pencil 10 times, taking for your sample the 10 words that your pencil lands
on.
12. Do you think this sampling method is likely to be biased? If so, in which direction?
Explain.
A sample is only useful to us if the data we collect on our sample is similar to the results we
would find in the entire population. In this sense, we say the sample is representative of the
population.
13. Suggest another technique for selecting 10 words from this population in order for the
sample to be representative of the population with regard to word length and e-words.
Taking a simple random sample
Key idea: A simple random sample ensures that every sample of size n is equally likely to be
the sample selected from the population. In particular, each observational unit has the same
change of being selected as every other observational unit.
The key to obtaining a representative sample is using some type of random mechanism to
select the observational units from the population, rather than relying on convenience samples
or any type of human judgment. Instead of having you choose “random” words using your own
judgment, we will now ask you to take a simple random sample of words and evaluate your
results. The first step is to obtain a sampling frame - a complete list of every member of the
population where each member of the population can be assigned a number. Below is a copy of
the Gettysburg address that includes numbers in front of every word. For example, the 43rd word
is nation.
14. Go to the Random Numbers generator at http://www.random.org
Generate 10 Numbers in the range from 1 to 268. What should you do if you get a
duplicate number generated?
Enter the random numbers in the table below:
Randomly generated ten ID values from 1-268
Using your randomly generated values, look up the corresponding word from the
sampling frame. Fill in the data table below.
June 27, 2014
MAA PREP workshop
51
Word
Length
(no. of letters)
e-word?
(Y or N)
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
Notice that each simple random sample gave us different values for the statistics. That is, there
is variability from sample to sample (sampling variability). This makes sense because the
samples are chosen randomly and consist of different words each time, so we expect some
differences from sample to sample.
Gettysburg Address Sampling Frame
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
Four
score
and
seven
years
ago
Our
fathers
brought
forth
upon
this
continent
A
new
nation
conceived
In
liberty
and
dedicated
To
The
proposition
that
All
men
Are
created
June 27, 2014
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
are
met
on
a
great
battlefield
of
that
war
We
have
come
To
dedicate
A
portion
Of
that
field
as
a
final
resting
place
For
those
Who
here
gave
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
we
cannot
consecrate
we
cannot
hallow
this
ground
The
brave
men
living
and
dead
who
struggled
here
have
consecrated
It
far
above
our
poor
power
to
add
or
detract
MAA PREP workshop
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
living
rather
to
be
dedicated
here
to
the
unfinished
work
which
they
who
fought
here
have
thus
far
so
nobly
advanced
It
is
rather
for
us
to
be
here
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
full
measure
of
devotion
that
we
here
highly
resolve
that
these
dead
shall
not
have
died
in
vain
that
this
nation
under
God
shall
have
a
new
birth
of
52
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
equal
Now
We
Are
engaged
In
A
great
Civil
War
testing
whether
That
nation
Or
Any
nation
So
conceived
And
So
dedicated
Can
Long
Endure
We
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
their
lives
that
that
nation
might
live
It
Is
altogether
fitting
and
proper
that
we
should
do
this
But
in
a
larger
sense
we
cannot
dedicate
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
The
world
will
little
note
nor
long
remember
what
we
say
here
but
it
can
never
forget
what
they
did
here
It
is
for
us
the
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
dedicated
to
the
great
task
remaining
before
us
that
from
these
honored
dead
we
take
increased
devotion
to
that
cause
for
which
they
gave
the
last
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
freedom
and
that
government
of
the
people
by
the
people
for
the
people
shall
not
perish
from
the
earth
15. Let’s examine your sample and those of your classmates.
a. Calculate the average word length in your random sample.
b. Again produce a dotplot of the distribution of sample average word lengths for
yourself and your classmates.
c. Comment on how this distribution compares to the one from #8 based on nonrandom sampling.
16. Let’s compare your sample statistics to the population parameter.
a. How many and what proportion of students in your class obtained a sample average
word length larger than the population average (4.29 letters)?
June 27, 2014
MAA PREP workshop
53
b. What does this answer reveal about whether simple random sampling is an unbiased
sampling method? Explain.
Key idea: When we use simple random sampling, the average of the sample means from
different samples equals the population mean and the average of the sample proportions
from different samples equals the population proportion. (This means that these statistics are
unbiased estimators for the corresponding parameter when we use simple random
sampling.)
To really see the long-run pattern in these results, let’s look at many more random samples from
this population. Open the Sampling Words java applet. You will see that the population
distribution contains three variables: the lengths of the words, whether or not a word contains at
least one letter e, and whether or not the word is a noun. For now, select only the length
variable.
17. Check the Show Sampling Options box. Specify 10 in the Sample Size box and press
the Draw Samples button. You will see the 10 selected words appear in the box below.
You will also see 10 darkened dots within the population distribution representing the 10
lengths that you have sampled. The value of sample mean is displayed in the population
box. The sample mean is also plotted in Sample Mean Length plot. Press Draw
Samples again. Did you get the same sample this time? Did you get the same sample
mean?
You should now see two dots in the Sample Mean Length plot, one for each sample
mean. The mean of these two sample means is displayed in the upper right corner of
this plot. Confirm this calculation.
First sample mean:
Second sample mean:
Average:
Change the Number of Samples to 998 (for a total of 1000 samples). Uncheck Animate. Press
the Draw Samples button. (This should generate a dotplot of 1000 values of the sample mean
for 1000 different simple random samples, each consisting of 10 words, selected randomly from
the population of all 268 words in the Gettysburg address.)
18. Describe the resulting distribution of sample means: What are roughly the largest and
smallest values? What is the average of the distribution? What is the standard
deviation? How does the average of the 1000 sample means compare to the population
mean  = 4.29?
June 27, 2014
MAA PREP workshop
54
19. Now change the Sample Size from 10 to 30 and the Number of Samples to 1000.
Press the Draw Samples button. The applet allows you to compare the two distributions
(n = 10 to n = 30). How do these two distributions compare? Be sure to identify a feature
that is similar and a feature that is different.
20. Did changing the sample size change the center of the distribution? If we used a biased
sampling method, would increasing the sample size remove the bias? Explain.
You should see that whereas the convenience samples tended to consistently produce samples
that overestimate the length of words, the simple random sampling method does not have that
tendency. In fact, the average of all the statistics from all possible simple random samples will
exactly equal the population parameter value. For this reason, in the real world, where we don’t
have access to the entire population, when we select a simple random sample, we will be willing
to believe that sample is representative of the population. In fact, we would prefer a small
random sample, than a large convenience sample. That doesn’t mean our sample result will
match the population result exactly, but we will be able to predict how far off it might be.
Simple random sampling (and probability sampling methods in general) allow us to believe that
our sample is representative of the larger population and that the observed sample statistic is “in
the ball park” of the population parameter. For this reason, when we use simple random
sampling we will be willing to generalize the characteristics of our sample to the entire
population.
June 27, 2014
MAA PREP workshop
55
Interval Estimation (one proportion/probability)
(adapted from Exploration 3.1: Kissing Right?)
Most people are right-handed, and even the right eye is dominant
for most people. Developmental biologists have suggested that
late-stage human embryos tend to turn their heads to the right. In
a study reported in Nature (2003), German bio-psychologist Onur
Güntürkün conjectured that this tendency to turn to the right
manifests itself in other ways as well, so he studied kissing
couples to see which side they tended to lean their heads while
kissing. He and his researchers observed kissing couples in
public places such as airports, train stations, beaches, and parks.
They were careful not to include couples who were holding
objects such as luggage that might have affected which direction
they turned. For each kissing couple observed, the researchers
noted whether the couple leaned their heads to the right or to the
left. They observed 124 couples, age 13-70 years.
You will first use Güntürkün’s data to test his conjecture that kissing couples tend to lean their
heads to the right. Use the symbol  to denote the proportion of all kissing couples in these
countries that lean their heads to the right.
1. Is  a parameter or a statistic? Explain how you are deciding.
2. Do we know the exact value of  based on the observed data? Explain.
3. State the appropriate null and alternative hypotheses, both in words and in terms of the
parameter , for testing the conjecture that kissing couples tend to lean their heads to
the right more often than by random chance.
Of the 124 couples observed, 80 leaned their heads to the right while kissing.
4. Calculate the sample proportion of the observed couples who leaned their heads to the
right while kissing. Also indicate the symbol used to denote this value.
June 27, 2014
MAA PREP workshop
56
5. Conduct a simulation analysis (using the One Proportion applet) to assess the strength
of evidence that the sample data provide for Güntürkün’s conjecture that kissing couples
tend to lean their heads to the right more often than they would by random chance.
Report the approximate p-value and summarize your conclusion about this strength of
evidence.
Your simulation analysis should convince you that the sample data provide very strong evidence
to believe that kissing couples lean their heads to the right more than half the time in the long
run. That leads to a natural follow-up question: How much more than half the time? In other
words, we have strong evidence that the long-run probability of leaning to the right is greater
than one-half, but can we now estimate the value for that probability? We will do this by testing
many different (null) values for the probability that a couple leans to the right when kissing.
6. Now test whether the data provide evidence that the long-run probability that a couple
leans their heads to the right while kissing () is different from 0.60. Use the One
Proportion applet to determine the p-value for testing the null value of 0.60. Report what
you changed in the applet and report your p-value.
Think about it: What kind of test are we conducting if we want to determine whether the
population parameter is different from a hypothesized value - a one-sided test or a two-sided
test?
Remember from Section 1.4 that if we don’t have a specific direction we are testing, greater
than or less than, we need to use a two-sided test of significance. If you used a one-sided
alternative in #8, determine the two-sided p-value now by checking the Two-sided box.
Recall that a p-value of 0.05 or less indicates that the sample data provide strong evidence
against the null hypothesis and in favor of the alternative hypothesis. Thus, we can reject the
null hypothesis when the p-value is less than or equal to 0.05. Otherwise, when the p-value is
greater than 0.05, we do not have strong enough evidence against the null hypothesis and so
we consider the null value to be plausible for the parameter.
Key Idea: We will consider a value of the parameter to be plausible if the two-sided p-value
for testing that parameter value is larger than the level of significance.
7. Is the p-value for testing the null value of 0.60 less than 0.05? Can the value 0.60 be
rejected, or is the value 0.60 plausible for the long-run probability that a couple leans
their heads to the right while kissing?
June 27, 2014
MAA PREP workshop
57
Recall from Section 2.3, the 0.05 criterion we are using is called the significance level. The pvalue you found in the previous question should not have been smaller than 0.05. Hence, you
do not reject the null hypothesis at the 0.05 level of significance and therefore you do not reject
0.60 as a plausible value for . Thus, it is plausible (i.e., believable) that the long-run probability
that a kissing couple leans their heads to the right is 0.60.
8. Does this mean that you’ve proven that exactly 60% of kissing couples lean right? Why
or why not?
Because there are still other plausible values, now we want to “zoom in” on which values for the
long-run probability are plausible and which can be rejected at the 0.05 significance level.
9. Use the applet to test the probability values given in the following table.
 Each time, change the Probability of success to match the value that you are
testing (keeping the observed sample proportion that you count beyond the
same).
 Everything else should stay the same; press Draw Samples and then Count to
see the new two-sided p-value (with the Two-sided box checked).
Probability
under H0
(Two-sided)
p-value
Reject or
plausible?
Probability
under H0
(Two-sided)
p-value
Reject or
plausible?
0.54
0.55
0.56
0.57
0.58
0.59
0.60
0.70
0.71
0.72
0.73
0.74
0.75
0.76
10. Using a 0.05 significance level and your results from #9, provide a list of plausible values
for , the long-run probability that a kissing couple leans their heads to the right.
This list of values represents an interval containing all values between two endpoints.
Key idea: This interval of plausible values that you have written down in #10 is also called a
95% confidence interval for , the long-run probability that a kissing couple leans their heads
to the right. Why is this called 95% confidence? This corresponds to the 5% (0.05) significance
level that was used to decide whether there was enough evidence against a hypothesized
value. Notice that 95% and 5% add up to give 100%. The 95% is called the confidence level,
and is a measure of how confident we are about our interval estimate of the parameter, in this
case, we are 95% confident that the population proportion of all couples who lean their heads to
the right while kissing is in the interval that you found.
June 27, 2014
MAA PREP workshop
58
11. Does the 95% confidence interval include the value 0.5? Does it include the value 0.6?
Explain how your answers relate to the significance test and p-value that you calculated
in #5 and #6.
12. Now suppose we were to use a significance level of 0.01 instead of 0.05 to decide
whether or not to reject the corresponding null hypothesis for each listed value of . How
would you expect the interval of plausible values to change: wider, narrower, or no
change? Explain your reasoning.
13. Implement the 0.01 significance level to determine plausible values for the population
proportion of all kissing couples that lean their heads to the right. (Hint: Start with the
table from #9, although you might have to test more null values as well.) Report the
interval of plausible values.
14. How confident are you about the interval of plausible values that you listed in #13?
15. What is the primary difference between the 95% confidence interval reported in #10 and
the 99% confidence interval reported in #13? Which interval is wider? Is this consistent
with what you predicted in #12?
16. How would you expect a 90% confidence interval to compare to the 95% and 99%
confidence intervals? Explain.
Key idea: Using a higher confidence level produces a wider confidence interval.
You should have found very strong evidence, based on a small p-value, that a majority of
kissing couples in the population lean their heads to the right. Moreover, your confidence
intervals, which provide plausible values for of the population proportion that lean to the right,
contain only values above 0.5.
June 27, 2014
MAA PREP workshop
59
Confounding, Causation, and Experiments
(adapted from Exploration 4.1: Home Court Disadvantage?)
Sports teams prefer to play in front of their own fans rather than at the opposing team’s site.
Having a sell-out crowd should provide even more excitement and lead to an even better
performance, right? Well, consider the Oklahoma City Thunder, a National Basketball
Association team, in its second season (2008-09) after moving from Seattle. This team had a
win-loss record that was actually worse for home games with a sell-out crowd (3 wins and 15
losses) than for home games without a sell-out crowd (12 wins and 11 losses). (These data
were noted in the April 20, 2009 issue of Sports Illustrated in the Go Figure column.)
1. Identify the observational units and variables in this study. Also classify each variable as
categorical (also binary?) or quantitative.
2. When did the Thunder have a higher winning percentage: in front of a sell-out crowd or a
smaller crowd? Support your answer by calculating the proportion of sell-out games that
they won and also the proportion of non-sell-out games that they won. (Write both these
proportions as decimals.)
Sell-out crowd:
Smaller crowd:
Definition: Two variables are associated or related if the value of one variable gives you
information about the value of the other variable. When comparing two groups this means that
the proportions or means take different values in the two groups.
3. Do the two variables appear to be associated? Explain.
Often when a study involves two associated variables, it is natural to consider one the
explanatory variable and the other the response variable.
Definitions: The explanatory variable is the variable we think is “explaining” the change in the
response variable and the response variable is the variable we think is being impacted or
changed by the explanatory variable. The explanatory variable is sometimes called the
independent variable and the response variable is sometimes called the dependent variable.
June 27, 2014
MAA PREP workshop
60
4. Which would you consider the explanatory variable in this study? Which is the
response? (That is, what are the roles of these variables in this study?)
There are two possible explanations for this odd finding that the team had a better winning
percentage with a smaller crowd:
 The sell-out crowd caused the Thunder to play worse, perhaps because of pressure or
nervousness.
 The sell-out crowd did not cause a worse performance, and some other issue (variable)
explains why they had a worse winning percentage with a sell-out crowd. In other words,
a third variable is at play, which is related to both the crowd size and the game outcome.
(Of course, another explanation is random chance. Using methods you will learn later, we’ve
determined that you can essentially rule out random chance in this case.)
5. Consider the second explanation. Suggest a plausible alternative variable that would
explain why the team would be less likely to win in front of a sell-out crowd than in front
of a smaller crowd. (Make sure it’s clear not just that your explanation would affect the
team’s likelihood of winning, but that the team would be less likely to win in front of a
sell-out crowd compared to a smaller crowd.)
Definition: A confounding variable is a variable that is related both to the explanatory and
to the response variable in such a way that its effects on the response variable cannot be
separated from the effects of the explanatory variable.
6. Identify the confounding variable based on your suggested explanation in #5. Explain
how it is confounding– what is the link between this third variable and the response
variable, and what is the link between this third variable and the explanatory variable?
(Hint: Remember that this variable has to be recorded on the observational units: home
games for the Thunder.)
Another variable recorded for these data was whether or not the opponent had a winning record
the previous season. Of the Thunder’s 41 home games, 22 were against teams that won more
than half of their games. Let’s refer to those 22 teams as strong opponents. Of these 22 games,
13 were sell-outs. Of the 19 games against opponents that won less than half of their games
that season (weak opponents), only 5 of those games were sell-outs.
June 27, 2014
MAA PREP workshop
61
7. Was the Thunder more likely to have a sell-out crowd against a strong opponent or a
weak opponent? Calculate the relevant (conditional) proportions to support your
answer.
When the Thunder played a strong opponent, they won only 4 of 22 games. When they played a
weak opponent, the Thunder won 11 of 19 games.
8. Was the Thunder less likely to win against a strong opponent than a weak one? Again
calculate the relevant (conditional) proportions to support your answer.
9. Explain how your answers to #7 and #8 establish that strength of opponent is a
confounding variable that prevents drawing a cause-and-effect conclusion between
crowd size and game outcome.
10. Summarize your conclusion about whether these data provide evidence that a sell-out
crowd caused the Thunder to play worse. Write as if to a friend who has never studied
statistics. Be sure to address the fact that the Thunder had a much smaller winning
percentage in front of a sell-out crowd.
June 27, 2014
MAA PREP workshop
62
Follow-ups:
1. A recent study classified children according to the type of lighting used in their room (full light,
night light, no light) and their eye condition (near-sighted, normal vision, and far-sighted). The
two-way table of counts is:
Full light Night light No light Total
Near-sighted
41
78
18
137
Normal vision
22
115
114
251
Far-sighted
12
39
40
91
Total
75
232
172
479
a) Is this an observational study or an experiment? Explain briefly.
b) What proportion of the children who slept in a room with full light were near-sighted?
The following segmented bar graph displays these data:
100%
80%
60%
Far-sighted
40%
Normal vision
Near-sighted
20%
0%
Full light
Night light
No light
c) Based on this graph, describe the association (if any) between lighting condition and
eyesight.
d) Provide an argument that parents’ eyesight could be a confounding variable in this study.
e) Would you conclude that this study provides strong evidence of a cause-and-effect
relationship between lighting condition and eyesight? Explain briefly.
2. In a study published in a 2007 issue of the journal Preventive Medicine, researchers found
that smokers were more likely to have used candy cigarettes as children than non-smokers
were. When hearing about this study, John said: “But isn’t the smoking status of the person’s
parents a confounding variable here?” When Karen asked what he meant, John said: “Children
whose parents smoke are more likely to become smokers themselves when they become
adults.” What else does John need to say in order to explain how the parents’ smoking status
can be a confounding variable in this study?
June 27, 2014
MAA PREP workshop
63
June 27, 2014
MAA PREP workshop
64
Randomization Test – Comparing Two Proportions
(adapted from Example 5.1: Dolphin Therapy)
Swimming with dolphins can certainly be fun, but is it also therapeutic for patients suffering from
clinical depression? To investigate this possibility, researchers recruited 30 subjects aged 1865 with a clinical diagnosis of mild to moderate depression. Subjects were required to
discontinue use of any antidepressant drugs or psychotherapy four weeks prior to the
experiment, and throughout the experiment. These 30 subjects went to an island off the coast
of Honduras, where they were randomly assigned to one of two treatment groups. Both groups
engaged in the same amount of swimming and snorkeling each day, but one group (the animal
care program) did so in the presence of bottlenose dolphins and the other group (outdoor nature
program) did not. At the end of two weeks, each subjects’ level of depression was evaluated,
as it had been at the beginning of the study (Antonioli and Reveley, 2005). The response
variable is defined as whether or not the subject achieved substantial reduction (improvement)
of their depression.
1. Why did the researchers include a comparison group in this study? Why didn’t they just see
how many patients showed substantial improvement when given the dolphin therapy?
The null hypothesis for this study should basically say that there is no association between
swimming with dolphins and treatment of depression while the alternative should state that there
is an association. We are going to write our hypotheses in terms of probabilities.
Null: The probability a subject will show improvement in their depression symptoms is the
same between those that swim with dolphins and those that don’t.
Alternative: The probability a subject will show improvement in their depression symptoms
is greater for those that swim with dolphins compared to those that don’t.
The results that the researchers found are shown in the following table:
Dolphin therapy Control group Total
Showed substantial improvement
10
3
13
Did not show substantial improvement
5
12
17
Total
15
15
30
2. Identify which of the variables in this study is explanatory and which is response.
3. A useful first step begin to see whether the data appear to support the claim that dolphin
therapy is effective is to calculate the conditional proportion that improved in each group.
Calculate these conditional proportions. Did the dolphin therapy group have a higher
conditional proportion that showed substantial improvement than the control group?
June 27, 2014
MAA PREP workshop
65
4. Is it possible that this difference could happen even if dolphin therapy was not effective,
simply due simply to the random nature of putting subjects into groups (i.e., the luck of the
draw)?
Sure, this is possible. Consider the following scenario: Assume that the 13 people in the study
whose depression symptoms improved would have improved whether they swam with dolphins
or not, let’s call them the “improvers”. Now, what if, by chance, of the 13 improvers (people who
improve no matter what), 10 randomly ended up in the “swim with dolphins” group. Recall the
important fact that subjects were randomly assigned to swim with dolphins or not. So,
randomly, it is possible that 10 improvers end up in the swim with dolphins group and only 3 in
the not swim with dolphins group. Thus, it is possible we would see 67% (10/15) of dolphin
swimmers improving and only 20% (3/15) non-dolphin swimmers improving even if swimming
with dolphins doesn’t actually make a difference.
So, it is possible….but how unlikely is it? In order to answer this question we will once again
turn to simulation. Remember that simulation is a method we have used to estimate
probabilities. So far you used coin flipping and an applet to simulate the p-value (the probability
we would obtain the observed statistic or something more extreme if the null hypothesis was
true) for different tests of significance. Because the dolphin experiment is more complex than
the previous studies we looked at, we can’t simply flip a coin any longer. Instead, to do our
simulation, we’re going to use playing cards and a computer applet.
To estimate the p-value for this study you will need 30 index cards, 13 of which are blue to
represent the “improvers” and 17 of which are green to represent the “non-improvers.” The null
hypothesis assumes that 13 people (the blue cards) will get better no matter whether they swim
with dolphins or not. So those that improved were going to do so (perhaps just from getting to fly
to Honduras and swim in the water) regardless of which treatment group they were assigned to.
5. Now shuffle your 30 cards and deal them into two stacks of fifteen. One of the two stacks
represents the people who got to swim with dolphins and the other stack represents people
who didn’t. Decide which stack is which and then fill in the table below.
Showed substantial improvement
Did not show substantial improvement
Total
Dolphin therapy Control group Total
13
17
15
15
30
Difference in proportions of improvement (dolphin group minus control group):
Repeat this shuffling and dealing process a second time:
Dolphin therapy Control group Total
Showed substantial improvement
13
Did not show substantial improvement
17
Total
15
15
30
Difference in proportions of improvement (dolphin group minus control group):
June 27, 2014
MAA PREP workshop
66
6. Combine your results with your classmates by constructing a dotplot of the differences in
improvement proportions (dolphin group minus control group).
7. Does it appear that the observed value of this statistic (namely, 0.467) would be unlikely to
occur if there really were no effect of the dolphin therapy? Explain.
Remember the p-value is the probability we would observe a statistic at least as large as we did
in our study, assuming the null hypothesis was true. By shuffling and dealing the cards, we are
simulating a true null hypothesis. We are simulating a random assignment of the 13 subjects
that actually improved to each group.
8. What proportion of the results from your class had a difference in the proportion of improvers
of 0.467 or larger? This is your p-value. Provide a detailed interpretation of this p-value. Be
sure to explain what is meant by “by chance alone,” keeping in mind what was randomized
in this study.
9. Based on the class results, what is your preliminary conclusion about the effectiveness of
swimming with dolphins in improving peoples’ depression symptoms? Explain the reasoning
process behind this conclusion.
We really need to carry out this simulated random assignment process hundreds, preferably
thousands of times. This would be very tedious and time-consuming with cards, so let’s turn to
technology. To do this, use the “Dolphin study” applet found at:
http://www.rossmanchance.com/ISIapplets.html. Check the “Show Shuffle Options” x boand
then press the “Shuffle” button. Notice that the applet does what you have done: shuffle the 30
cards and deal out 15 for the “dolphin therapy” group, separating cards by color. The applet
also determines the 2×2 table for the simulated results (check the Show table box on the left)
and creates a dotplot of the shuffled difference in conditional proportions. Now press the
“Shuffle” button four more times. Then increase the “Number of Shuffles” to 995 and press the
“Shuffle” button. This produces a total of 1000 repetitions of the simulated random assignment
process.
June 27, 2014
MAA PREP workshop
67
10. Use the applet to determine what proportion of your 1000 simulated random assignments
were the results as (or more) extreme as the actual study (which, you’ll recall, saw a 0.467
difference in improvers). This is your p-value. Interpret this p-value, including what is meant
by the phrase “by chance alone” in this context.
11. Is your p-value small enough so that you would consider an difference in the observed
proportion of improvers of 0.467 or more surprising under the null model that dolphin
therapy is not effective? Does this give strong evidence that the dolphin therapy is effective
in treating depression in patients similar to those in the study?
12. Is it reasonable to draw a cause-and-effect conclusion between dolphin therapy and
improvement or not in depression symptoms? Explain why or why not.
13. Is it reasonable to generalize your conclusion to all people who suffer from depression?
Explain why or why not.
Further Exploration
What if the study had found the same overall proportion of improvers, but with the group
proportions closer together? More specifically, suppose that the actual data had turned out with
2 fewer improvers in the dolphin therapy group and 2 more improvers in the control group.
14. How will this affect our analysis, and the p-value, and the conclusion? First make a
conjecture, and explain your reasoning.
15. Then investigate this question with a simulation. [Hints: Think about how the two-way table
will change. Then use your earlier simulation results but explain what you are doing
differently now to find the approximate p-value.] Explain why your answers (about how the
strength of evidence against the null model changes when the group difference is not as
large) make intuitive sense.
June 27, 2014
MAA PREP workshop
68
Randomization Tests Comparing Two Groups on a Quantitative Response
(adapted from Exploration 6.2: Lingering Effects of Sleep Deprivation)
Step 1: Ask a research question. Many students pull “all-nighters” when they have an
important exam or a pressing assignment. Concerns that may arise include: Can you really
function well the next day after a sleepless night? What about several days later: Can you
recover from a sleepless night by getting a full night’s sleep on the following nights?
Step 2: Design a study and collect data. Researchers Stickgold, James, and Hobson
investigated delayed effects of sleep deprivation on learning in a study published in Nature
Neuroscience (2000). Twenty-one volunteers, aged 18 to 25 years, were first trained on a visual
discrimination task that involved watching stimuli appear on a computer screen and reporting
what was seen.
After the training period, subjects were tested. Performance was recorded as the minimum time
(in milliseconds) between the appearance of stimuli and an accurate response. Following these
“baseline” measurements, one group was randomly assigned to be deprived of sleep for 30
hours, followed by two full nights of unrestricted sleep, whereas the other group was allowed to
get unrestricted sleep on all three nights.
Following this, both groups were retested on the task to see how well they remembered the
training from the first day. Researchers recorded the improvement in performance as the
decrease in time required at retest compared to training. (Note: For example, if someone took 5
milliseconds (ms) to respond at the beginning of the study and then 2 ms, to respond at the end,
the improvement score is 3 ms. But if someone took 2 ms at the beginning and then 5 ms at the
end, the improvement score is -3 ms.)
The goal of the study was to see whether the improvement scores tend to be higher for the
unrestricted sleep treatment than for the sleep deprivation treatment.
1. Identify the explanatory and response variables in this study. Also classify them as either
categorical or quantitative.
Explanatory:
Type:
Response:
Type:
2. Was this an experiment or an observational study? Explain how you are deciding.
June 27, 2014
MAA PREP workshop
69
3. Let µunrestricted be the long-run mean improvement on this task three days later when
someone has had unrestricted sleep and let µdeprived denote the long-run mean
improvement when someone is sleep deprived on the first night.
In words and symbols, state the null and the alternative hypotheses to investigate
whether sleep deprivation has a negative effect on improvement in performance on
visual discrimination tasks. (Hint for the alternative hypothesis: Do you expect the people
to do better or worse when sleep deprived? Based on your answer, what sign/direction
should you choose for the alternative hypothesis?)
Here are the data, with positive values indicating better performance at retest than at training,
and negative values indicating worse performance at retest than at training:
Unrestricted-sleep group’s improvement scores (milliseconds):
25.2, 14.5, -7.0, 12.6, 34.5, 45.6, 11.6, 18.6, 12.1, 30.5
Sleep-deprived group’s improvement scores (milliseconds):
-10.7, 4.5, 2.2, 21.3, -14.7, -10.7, 9.6, 2.4, 21.8, 7.2, 10.0
Step 3: Explore the Data.
4. To look at graphical and numerical summaries of the data from the study, go to the
Multiple Means applet. The sleep deprivation data have already been entered into the
applet.
a. Notice that the applet creates parallel dotplots, one for each study group. Based on
these dotplots alone, which group (unrestricted or deprived) appears to have had the
higher mean improvement? How are you deciding?
b. Based on the dotplots alone, which group (unrestricted or deprived) appears to have
had more variability in improvement? How are you deciding?
c. Notice also that the applet also computes numerical summaries of the data, such as,
the mean and standard deviation (SD) for the improvements in each group.
i.
For the unrestricted group, record the sample size (n), mean, and SD.
nunrestricted =
ii.
SDunrestricted =
For the deprived group, record the sample size (n), mean, and SD.
ndeprived=
June 27, 2014
̅ unrestricted =
̅
deprived=
MAA PREP workshop
SDdeprived =
70
Recall from earlier in the course, that standard deviation is a measure of variability. Relatively
speaking, smaller standard deviation values indicate less variability and a distribution whose
data values tend to cluster more closely together, compared to a distribution with a larger
standard deviation.
d. Based on the numerical summaries reported in #4c, which group (unrestricted or
deprived) had the higher mean improvement?
e. Based on the numerical summaries reported in #4c, which group (unrestricted or
deprived) had the higher variability in improvement?
f.
Notice that the applet also reports the observed difference in means for the
improvements of the two groups. Record this value (and its measurement units).
̅ unrestricted - ̅ deprived =
g. Before you conduct an inferential analysis, does this difference in sample means (as
reported in #4f) strike you as a meaningful difference? Explain your answer.
Step 4: Draw Inferences.
5. What are two possible explanations for why we observed the two groups to have
different sample means for improvement in performance?
Next you will be asked to describe how you might go about deciding whether the observed
difference between the two sample means is statistically significant. (Hint: Think about how you
assessed whether an observed difference between two sample proportions was statistically
significant in Chapter 5. Use the same strategy, with an appropriate modification for working
with means instead of proportions.)
Once again the key question is how often random assignment alone would produce a difference
in the groups at least as extreme as the difference observed in this study, if there really were no
effect of sleep condition on improvement score. You addressed similar questions in Chapter 5
when you analyzed the dolphin therapy and yawing studies. The only change is that now the
response variable is quantitative rather than categorical, so the relevant statistic is the
difference in group means rather than the difference in group proportions. Also once again, we
use simulation to investigate how often such an extreme difference would occur by chance
(random assignment) alone (if the null hypothesis of no difference / no effect / no association
were true).
In other words, we will again employ the 3S Strategy.
June 27, 2014
MAA PREP workshop
71
Statistic:
6. A natural statistic for measuring how different the observed group means are from each
other is the difference in the mean improvement scores between the two groups. Report
the value of this statistic, as you did in #4f.
Simulate: You will start by using index cards to perform a tactile simulation of randomly
assigning the 21 subjects between the two groups, assuming that sleep condition has no impact
on improvement. Because the null hypothesis asserts that improvement score is not associated
with sleep condition, we will assume that the 21 subjects would have had exactly the same
improvement scores as they did, regardless of which sleep condition group (unrestricted or
deprived) the subject had been assigned.
7.
a. How many index cards do you need to conduct this simulation?
b. What will you write on each index card?
To conduct one repetition of this simulation:
 Shuffle the stack of 21 cards well, and then randomly distribute cards into two stacks:
one stack with 10 cards (the unrestricted group) and one with 11 (the sleep deprived
group).

Calculate and report the sample means for each re-randomized group:
Re-randomized unrestricted group’s mean:
Re-randomized deprived group’s mean:

Calculate the difference in group means, subtracting unrestricted mean minus sleep
deprived mean. Report this value.

Combine this result with your classmates’ to create a dotplot that shows the
distribution of several possible values of difference in sample means that could have
happened due to pure chance, if sleep condition has no impact on improvement.
Sketch the dotplot below, being sure to label the horizontal axis.
Label:
June 27, 2014
MAA PREP workshop
72
c. At about what value is the dotplot centered? Explain why this makes sense. (Hint:
What are we assuming to be true when we conduct the simulation?)
d. Where is the observed difference in means from the original study (as reported in #6)
on the dotplot? Did this value happen often, somewhat rarely, or very rarely? How
are you deciding?
8. As before with simulation-based analyses, you would now like to conduct many, many
more repetitions to determine what is typical and what is not for the difference in group
means, assuming that sleep condition has no impact on improvement score. We think
you would prefer to use a computer applet to do this rather than continue to shuffle cards
for a very long time, calculating difference of group means by hand. Go back to the
Multiple Means applet, check the Show Shuffle Options box, select the Plot display,
and press Shuffle Responses.
a. Describe what the applet is doing and how this relates to your null hypothesis from
#3. (Note: You can check the Show Original Colors box to see which groups the
observations originally belonged to.)
b. Record the simulated difference in sample means for the re-randomized groups, as
given in the applet output. Is this difference more extreme than the observed
difference from the study (as reported in #6)? How are you deciding?
c. Click on Shuffle Responses again, and record the simulated difference in sample
means for the re-randomized groups. Did it change from #8b?
d. Click on Re-Randomize again, and record the simulated difference in sample means
for the re-randomized groups. Did it change from #8b and #8c?
e. Now to see many more possible values of the difference in sample means, assuming
sleep condition has no impact on improvement, do the following in the Multiple
Means applet:
 Change Number of Shuffles from 1 to 997.
 Press Shuffle Responses, to produce a total of 1000 shuffles and rerandomized statistics.
June 27, 2014
MAA PREP workshop
73
f.
Consider the histogram of the 1000 could-have-been values of difference in sample
means, assuming that sleep condition has no effect on improvement.
i.
What does one dot on the dotplot represent? (Hint: Think about what you
would have to do to put another dot on the graph.)
ii.
Describe the overall shape of the null distribution displayed in this dotplot.
iii.
Where does the observed difference in sample means (as reported in #6) fall
in this dotplot: near the middle or out in a tail? Are there a lot of dots that are
even more extreme than the observed difference, assuming sleep condition
has no impact on improvement score? How are you deciding?
g. To estimate a p-value, continue with the Multiple Means applet.


Type in the observed difference in group means (as reported in #6) in the Count
Samples box (for the one-sided alternative hypothesis), and press Count.
Record the approximate p-value.
h. Fill in the blanks of the following sentence to complete the interpretation of the pvalue.
The p-value of _______ is the probability of observing _________________________
_______________ assuming __________________________________________.
Strength of Evidence:
9. Based on the p-value, evaluate the strength of evidence provided by the experimental
data against the null hypothesis that sleep condition has no effect on improvement
score: not much evidence, moderate evidence, strong evidence, or very strong
evidence? How are you deciding?
June 27, 2014
MAA PREP workshop
74
10. Significance: Summarize your conclusion with regard to strength of evidence in the
context of this study.
11. Estimation:
a. Use the 2SD Method to approximate a 95% confidence interval for the difference in
long-run mean improvement score for subjects who get unrestricted sleep minus the
long-run mean improvement score for subjects who are sleep deprived. (Hints:
Remember the observed value of the difference in group means, and obtain the SD
of the difference in group means from the applet’s simulation results. The interval
should be observed difference in means ± 2 SD, where SD represents the standard
deviation of the null distribution of the difference in group means.)
b. Interpret what this confidence interval reveals, paying particular attention to whether
the interval is entirely positive, entirely negative, or contains zero. (Hint: Be sure to
mention “direction” in your interpretation by saying how much larger improvement
scores are on average for the treatment you find to have the larger long-run mean:
I’m 95% confident that the long-run mean improvement score is
higher with the __
to
________ treatment as opposed to the
____________ treatment.)
Step 5: Formulate Conclusions.
12. Generalization: Were the participants in this study randomly selected from a larger
population? Describe the population to which you would feel comfortable generalizing
the results of this study.
June 27, 2014
MAA PREP workshop
75
13. Causation: Were the participants in the study randomly assigned to a sleep condition?
How does this affect the scope of conclusion that you can draw?
Another statistic:
Could we have chosen a statistic other than the difference in group means to summarize how
different the two groups’ improvement scores were? Yes, for example we could have used the
difference in group medians. Why might we do this? For one reason, the median is less affected
by outliers than the mean (see Section 3.2). And, we can use the Multiple Means applet to do
this.
14. Step 6: Look back and ahead.
Looking back: Did anything about the design and conclusions of this study concern you?
Issues you may want to critique include:
 Any mismatch between the research question and the study design
 How the experimental units were selected
 How the treatments were assigned to the experimental units
 How the measurements were recorded
 The number of experimental units in the study
 Whether what we observed is of practical value
Looking ahead: What should the researchers’ next steps be to fix the limitations or build on this
knowledge?
June 27, 2014
MAA PREP workshop
76
Common exam questions
One proportion (Unit 1)
Questions 1 – 5, refer to the following story. With movie-viewing-at-home made so convenient
by services such as Netflix, Pay-per-view, and Video-on-demand, do a majority of city residents
now prefer watching movies at home rather than going to the theater? To investigate, a local
high school student, Lori, decides to conduct a poll of adult residents in her city. She selects a
random sample of 100 adult residents from the city and gives each participant the choice
between watching a movie at home or the same movie at the theater. She records how many
choose to watch the movie at home.
1. Pick the correct null hypothesis. (Circle only one.)
a) Adult residents of the city are equally likely to choose to watch the movie at home as to
watch the movie at the theater.
b) Adult residents of the city are more likely to choose to watch the movie at home than to
watch the movie at the theater.
c) Adult residents of the city are less likely to choose to watch the movie at home than to
watch at the theater.
d) Other:
2. Pick the correct alternative hypothesis. (Circle only one.)
a) Adult residents of the city are equally likely to choose to watch the movie at home as to
watch the movie at the theater.
b) Adult residents of the city are more likely to choose to watch the movie at home than to
watch the movie at the theater.
c) Adult residents of the city are less likely to choose to watch the movie at home than to
watch at the theater.
d) Other:
3. After analyzing her data, Lori finds that significantly more than half of the sample (p-value
0.012) preferred to watch the movie at home. Below are two possible explanations for the
observed majority. Based on her analysis, which explanation do you find the most plausible
(i.e., reasonable, believable)? (Circle only one.)
a) More than half of the adult residents in her city prefer to watch the movie at home.
b) There is no overall preference for movie-watching-at-home in her city, but by pure
chance her sample just happened to have an unusually high number of people choose
to watch the movie at home.
c) (a) and (b) are equally plausible explanations.
June 27, 2014
MAA PREP workshop
77
4. Which of the following is the most valid interpretation of Lori’s p-value of 0.012? (Circle only
one.)
a) A sample proportion as large as or larger than hers would rarely occur.
b) A sample proportion as large as or larger than hers would rarely occur if the study had
been conducted properly.
c) A sample proportion as large as or larger than hers would rarely occur if 50% of adults
in the population prefer to watch the movie at home.
d) A sample proportion as large as or larger than hers would rarely occur if more than 50%
of adults in the population prefer to watch the movie at home.
5. Reconsider Lori’s results. If we were to construct a 95% confidence interval for the
proportion of all adult residents of this city that would choose to watch the movie at home,
will this interval contain the value of 0.5? (Circle only one.)
a) Yes
b) No
c) Don’t have enough information to decide. Explain:
June 27, 2014
MAA PREP workshop
78
Two proportions (Unit 2)
You want to investigate a claim that women are more likely than men to dream in color. You
take a random sample of men and a random sample of women (in your community) and ask
whether they dream in color, and compare the proportions of each gender that dream in color.
(Questions 1-6 are based on this study.)
1. If the difference in the proportions (who dream in color) between the two samples turns out
not to be statistically significant (p-value was not small), which of the following would be the
best conclusion to draw?
a) You have found strong evidence that there is no difference between the proportions
of men and women in your community that dream in color.
b) You have not found enough evidence to conclude that there is a difference between
the proportions of men and women in your community that dream in color.
c) You have found strong evidence against the claim that there is a difference between
the proportions of men and women that dream in color.
d) Because the result is not significant, we can’t conclude anything from this study.

Instructors may want to give partial credit for (d)
2. If the difference in the proportions (who dream in color) between the two samples does turn
out to have a small p-value, which one of the following would be the best interpretation?
a) It would not be very surprising to obtain the observed sample results if there is really
no difference between the proportions of men and women in your community that
dream in color.
b) It would be very surprising to obtain the observed sample results if there is really no
difference between the proportions of men and women in your community that dream
in color.
c) It would be very surprising to obtain the observed sample results if there is really a
difference between the proportion of men and women in your community that dream
in color.
d) The probability is very small that there is no difference between the proportions of
men and women in your community that dream in color.
e) The probability is very small that there is a difference between the proportions of
men and women in your community that dream in color.
3. Your review of the research suggests that there is a genuine difference between men and
women on this variable. Suppose that the difference between the sample groups turns out
not to be statistically significant (not a small p-value). Which conclusion would be most
reasonable?
a) Something went wrong with the analysis, and the results of this study cannot be
trusted.
b) There must not be a difference after all and the other research studies were flawed.
c) The sample size might have been too small to detect a difference even if there is
one.
June 27, 2014
MAA PREP workshop
79
4. Suppose that two different studies are conducted on this issue.
 Study A finds that 40 of 100 women sampled dream in color, compared to 20 of 100
men.
 Study B finds that 35 of 100 women dream in color, compared to 25 of 100 men.
Which study (A or B) would provide stronger evidence that there is a genuine difference
between men and women on this issue?
a) Study A
b) Study B
c) The strength of evidence would be similar for these two studies
5. Suppose two more studies are conducted on this issue. Both studies find 30% of women
sampled dream in color, compared to 20% of men. But Study C consists of 100 people of
each sex, whereas Study D consists of 40 people of each gender. Which study would
provide stronger evidence that there is a genuine difference between men and women on
this issue?
a) Study C
b) Study D
c) The strength of evidence would be similar for these two studies
6. Suppose the difference in the sample proportions who dream in color does turn out to be
statistically significant (small p-value). Below are two possible explanations for this observed
difference. Which do you consider to be the more plausible (i.e., believable or reasonable)
explanation for the observed result?
a) Men and women in your community do not differ on this issue but by chance alone
the random sampling led to the difference we observed between the two groups.
b) Men and women in your community differ on this issue.
c) (a) and (b) are equally plausible explanations.
Possible alternative to Q5 and Q6: Suppose I were to use a simulation-based method to
randomly assign yes/no responses about dreaming in color to the males and females in the
study under the assumption that the likelihood of dreaming in color is the same for both groups.
What would be the main purpose of this use of randomness in this simulation?
a. To allow me to draw a cause-and-effect conclusion from the study.
b. To allow me to generalize my results to a larger population.
c. To simulate values of the statistic under the null hypothesis.
d. To replicate the study and increase the accuracy of the results.
.
e. Other:
June 27, 2014
MAA PREP workshop
80
Two means (Unit 2)
A psychology student conducted a research study in which the research question is whether
financial incentives can improve performance on video games. The student prepares an
experiment in which 40 subjects are randomly assigned to one of two groups. The first group
was offered $5 for a score above 100 and the other group was simply told to “do your best.”
Each of the 40 students played the video game and achieved a certain score, from which the
student calculated the following summary statistics:
$5 incentive “do your best”
Mean score
98
80
Sample size
20
20
These average scores differ by 98 – 80 = 18 points, but the student is concerned that this
difference might be just due to random chance. To examine the question of random chance the
student conducted the following simulation:
 She took 40 index cards. She wrote each subject’s score on one card.
o She then shuffled the cards and randomly placed the cards into two stacks of 20
cards each. One stack represents $5 incentive and the other “do your best.”
o She computed the mean score for each group.
o She found the difference in means for the two groups by taking average for the $5
incentive group minus the average for the “do your best” group.
 She repeated the previous three steps 1000 times.
 She graphed the 1000 statistics (differences in group means) she calculated from these
trials to produce the following histogram:
June 27, 2014
MAA PREP workshop
81
Use this information to circle the best answer among the options to the following questions:
1. What is the motivation for the process the student followed?
a) This process allows her to compare her actual result to what could have happened
by chance if gamers’ performances were not affected by whether they were asked to
do their best or offered an incentive.
b) This process allows her to determine the percentage of time the $5 incentive strategy
would outperform the “do your best" strategy for all possible scenarios.
c) This process allows her to determine how many times she needs to replicate the
experiment for valid results.
d) This process allows her to determine whether the normal distribution fits the data.
2. Which of the following was assumed to be true in carrying out the simulation?
a) The $5 incentive is more effective than the “do your best” incentive for improving
performance.
b) The $5 incentive and the “do your best” incentive are equally effective at improving
performance.
c) The “do your best” incentive is more effective than a $5 incentive for improving
performance.
d) Both (a) and (b) but not (c).
3. Which of the following could be the approximate p-value in this situation? Recall that the
research question is “Does the $5 incentive improve performance?”
a) 0.501
c) 0.022
b) 0.047
d) 0.001
4. What does the histogram tell you about whether $5 incentives are effective in improving
performance on the video game?
a) The $5 incentive is not effective because the distribution of differences generated is
centered at zero.
b) The $5 incentive is effective because distribution of differences generated is
centered at zero.
c) The $5 incentive is not effective because the p-value is greater than 0.05.
d) The $5 incentive is effective because the p-value is less than 0.05.
5. Which of the following is the appropriate interpretation of the p-value?
a) The p-value is the probability that the $5 incentive is not really helpful.
b) The p-value is the probability that the $5 incentive is really helpful.
c) The p-value is the probability that she would get a result as least as extreme as the
one she actually found, if the $5 incentive is really not helpful.
d) The p-value is the probability that a student wins on the video game.
June 27, 2014
MAA PREP workshop
82
Transferability (Final exam)
A consumer organization was concerned that a
computer manufacturer was misleading customers by
overstating the battery life of its laptop batteries
(measured in hours). The manufacturer states the
batteries will last 3 hours. To investigate this claim,
researchers selected a random sample of 10 laptops
from the current production process. The same exact
operations were performed on each laptop (watching
a movie, writing a term paper, surfing the internet,
etc.), recording the number of hours each battery
lasted.
Consider the data displayed on the lifetimes of the sample of 10 laptop batteries (in hours).
Suppose we want to investigate whether there is convincing evidence that the distribution of all
battery lifetimes is skewed to the right. One possible statistic for measuring skewness is the
ratio:
(sample mean  sample median)
(a) Calculate the value of this statistic (sample mean divided by the sample median) for this
sample.
(b) If the distribution of battery lifetimes in the population is symmetric, then we would expect
this skewness ratio statistic to be
LESS THAN / CLOSE TO / GREATER THAN (circle one)
the value of
(fill in a number).
(c) If the distribution of battery lifetimes in the population is skewed to the right, then we would
expect this skewness ratio statistic to be
LESS THAN / CLOSE TO / GREATER THAN
(circle one) the value of
(fill in a number).
Even when the battery lifetimes in a sample of 10 batteries is skewed to the right, it is still
possible that the actual population distribution of battery lifetimes is symmetric, and that the
skewness observed in the sample was due to random chance alone. To simulate the selection
of random samples from a hypothetical population for which the null hypothesis is true, 100
random samples, each of size 10, were taken from a perfectly symmetrically distributed
population with the same mean and standard deviation as the original sample. For each of
those 100 samples, the statistic sample mean divided by the sample median was calculated. A
dotplot of the 100 simulated statistics is shown below.
June 27, 2014
MAA PREP workshop
83
(d) (2 pts) Select the best response below to finish this statement: “Based on this simulation
output, our observed sample statistic…
 does not provide strong evidence that the actual population distribution is skewed to the
right because this null distribution is symmetric.
 provides strong evidence that the actual population distribution is skewed to the right
because this null distribution is symmetric.
 does not provide strong evidence that the actual population distribution is skewed to the
right because this null distribution is centered around one.
 provides strong evidence that the actual population distribution is skewed to the right
because this null distribution is centered around one.
 does not provide strong evidence that the actual population distribution is skewed to the
right because this null distribution ranges from .95 to 1.05.
 provides strong evidence that the actual population distribution is skewed to the right
because this null distribution ranges from .95 to 1.05.
June 27, 2014
MAA PREP workshop
84
Paired Tests
(adapted from Exploration 7.1: Rounding First Base)
Imagine you are at the plate in baseball and have hit a hard line drive. You want to try to stretch
your hit from a single to a double. Does the path that you take to “round” first base make much
of a difference? For example, is it better to take a “narrow angle” or a “wide angle” around first
base?
1. Suppose 22 baseball players volunteered to participate in an experiment. Suppose also
that you plan to assign a single angle, either wide or narrow, to each player. How would
you decide which player ran with which base running angle?
A reasonable experimental design would be to randomly assign 11 of the 22 players to run with
the wide angle and the other 11 to run with the narrow angle.
2. Some runners are faster than others. Explain how random assignment controls for this,
so that speed is not likely to be a confounding variable in this study.
Even though random assignment tends to balance out other variables (such as speed) between
the two groups, there’s still a chance that most of the fast runners could be in one group and
most of the slow runners in the other group. More importantly, there’s likely to be a good bit of
variability in the runners’ speeds, and that variability would make it harder to spot a difference
between the base running angles even if one angle really is superior to the other.
3. Suggest a different way of conducting the experiment, to make sure that speed is
completely balanced between the two groups.
The data below are the times (seconds) for the first 10 off 22 runners in a matched-pairs
experiment to study this question, and the graph shows dotplots of the times for all 22 runners.
Subject
1
2
3
4
5
6
7
8
9
10
narrow angle 5.50 5.70 5.60 5.50 5.85 5.55 5.40 5.50 5.15 5.80 …
wide angle
5.55 5.75 5.50 5.40 5.70 5.60 5.35 5.35 5.00 5.70 …
June 27, 2014
MAA PREP workshop
85
One thing to notice in these distributions is that there is a lot of overlap in the two distributions
(narrow mean: 5.534 sec, wide mean: 5.459 sec) and a fair bit of variability (narrow SD: 0.260
sec, wide SD: 0.273 sec). The slowest times took over six seconds, whereas the fastest times
were just under five seconds. As you saw in Chapter 6, it will be difficult to detect a difference
between the two methods when there is so much natural variation in the running times.
However, these data are clearly paired. As discussed earlier, such data should not be treated
as coming from two independent samples. Instead, we will focus our analysis on the difference
in times for each runner between the two methods. Thus, we can define our parameter of
interest to be
µd = long-run mean difference in running speed when rounding first base with a wide
angle versus a narrow angle in the population of interest.
4. State the null and alternative hypotheses (using µd), to test whether the mean difference
in running speeds between the two running angles is not 0.
5. Go to the Matched Pairs applet. The default dataset is the baseball running data.
Comment on the distribution of differences as revealed in the dotplot.
6. Report the mean and SD of the differences, with appropriate symbols.
7. Describe how you could flip a coin repeatedly to simulate the randomness in the way this
experiment was conducted.
8. The Matched Pairs applet will perform the simulation similar to what you could do with
flipping a coin.
 Set the number of times to Randomize to 1 and click on Randomize.
 One the coin tosses have determined which time will be in which column, the
applet displays the re-randomized data (the colors show you the original column
for each observation, so you should see a mix in each group now).
 The could-have-been value for the mean difference is added to the Average
Difference graph.
June 27, 2014
MAA PREP workshop
86
a. How does the simulated mean difference compare to observed mean difference.
Circle one.
More extreme
Less extreme
Similar
b. How are you deciding?
9. Update the number of Randomizations to 99 (for a total of 100 repetitions), uncheck
Animate, and press Randomize. Consider the graph “Average Difference” that the
applet has created...
a. How many dots are in this graph?
b. What does each dot represent?
10. To see many more possible values of mean difference in sample means that could have
been, IF the two running angles were swappable, update the Number of
Randomizations to 900, and press Randomize (for a total of 1000 repetitions).
Describe the updated “Average Difference” graph with the 1000 samples or repetitions,
with regard to the following features.
a. Shape:
b. About what number is this graph centered? Explain why you were expecting this.
c. This graph also reports a value for standard deviation, SD. Report this value,
and give a simple interpretation of this value. As in, what is this value
measuring?
11. You now should have generated 1000 possible values of the mean difference in running
times between the two angles, assuming the null hypothesis was true and these times
were the same, on average. How does the observed mean difference from your data
compare to these simulated values? Is an average difference in running times like that
observed in the actual study unlikely to happen by chance alone, if the two angles are
the same, on average? How are you deciding?
June 27, 2014
MAA PREP workshop
87
12. To quantify the strength of evidence against the null hypothesis, you can find the pvalue.
 Go back to the Matched Pairs applet.
 In the Count Samples box, make an appropriate selection from the drop down
menu (Hint: what direction does your alternative hypothesis look), and enter the
appropriate number in the box (Hint: at least as extreme as what number).
 Report the approximate p-value.
13. Use the p-value to state a conclusion in the context of the study. Be sure to comment on
statistical significance. Can you conclude that there is strong evidence the wide angle
and the narrow angle differ? Why or why not? Can you draw a cause-and-effect
conclusion? To what population are you willing to generalize the results?
14. Alternatively, you can summarize the strength of evidence using a standardized statistic.
Find the standardized statistic and confirm that the strength of evidence you receive
from the p-value is approximately the same as with the standardized statistic.
15. We can again use the 2SD Method to approximate a 95% confidence interval for the
mean difference in running angles. The overall structure of the formula is the same:
estimate ± 2(SD)
where the estimate is the sample mean difference in narrow angle minus wide angle and
SD is the standard deviation of your null distribution when you did 1000 repetitions in the
applet. Use these numbers to find an approximate 95% confidence interval for the
population mean difference in running times among all baseball players who could use
these two running angles.
16. Provide an interpretation of this confidence interval, being sure to explain the parameter
in this context.
17. Describe what would change in your analysis if you had calculated differences in running
times by subtracting in the other order.
June 27, 2014
MAA PREP workshop
88
Randomization Tests (comparing many groups – categorical response)
(adapted from Exploration 8.1: Recruiting Organ Donors)
In Example P.1 we considered a study that investigated how to encourage people to be more
likely to become organ donors. At the time, we did not formally analyze the data from the study.
Now we are studying methods for comparing three groups, so we can further analyze the data
and draw inferential conclusions from this study.
Recall that researchers asked volunteer subjects to imagine that they had moved to a new
state, applied for a driver’s license, and needed to decide whether or not to become an organ
donor. Researchers created three different versions of the survey, each with a different donor
recruiting strategy. Subjects were randomly assigned to one of the three types of recruiting
strategies:



The default option is to be an organ donor, and individuals have to “opt-out” of organ
donation (Opt-out group)
The default option is to not be an organ donor, and individuals have to “opt-in” to
organ donation (Opt-in group)
There is no default option, and individuals have to choose whether or not they will be
an organ donor (Neutral group)
1. Write the appropriate null and alternative hypotheses in words or in symbols.
2. Researchers found that 23 of 55 subjects in the opt-in group chose to be organ donors,
41 of 50 in the opt-out group chose to be organ donors, and 44 of 56 in the neutral group
chose to be organ donors. To see these data in the Multiple Proportions applet, open
the OrganDonor data file and paste the two columns (with column names) into the
applet and press Use Data. (Or enter/paste in the two-way table of counts in the Sample
data window and press Use Table.)
a. Check the Show table box to have the applet calculate the conditional proportions
who agreed to become an organ donor across the three groups and produce a
segmented bar graph. What are the three proportions?
b. Comment on what your calculations and graph reveal about whether the default
option used in the question appears to affect how likely the person is to become an
organ donor.
June 27, 2014
MAA PREP workshop
89
We see some differences in the sample proportions of individuals who chose to be organ
donors across the three treatments, but are these differences large enough to be statistically
significant? In other words, do these data provide strong evidence of an association between
donor recruiting strategy (opt-in, opt-out, neutral) and choosing to become an organ donor in the
population?
3. Suggest two possible explanations for how these observed proportions could have
turned out to be as different from each other as they are. (Hint: Think about what our null
and alternative hypotheses are.)
Applying the 3S Strategy
To investigate we will see how we can apply the 3S Strategy to these data. Because
researchers are attempting to minimize (or at least control) the probability of Type I Errors, in
cases with multiple groups they first do an overall test (see the Unit 3 introduction for more
discussion on this). However, this requires computing a single statistic and considering the null
distribution of that statistic. Let’s look at this approach now.
1. Statistic
4.
a. Propose a formula for a statistic that could be used to measure how different the
three sample proportions are. (Hint: The key here is that the statistic needs to
provide a single number that somehow focuses on differences between/among the
observed sample proportions.)
b. For the statistic you propose, would a large value or small value give evidence for
the alternative hypothesis that at least one proportion is different in the population?
When we compared two proportions in Chapter 5, we used the difference in two proportions as
the statistic. In this chapter, though, there are multiple proportions. There are actually several
reasonable ways of summarizing the differences among the groups in one number. Once you
have settled on a statistic, you apply the 3S process procedures as before – simulate the
distribution of the statistic under the null hypothesis and then see where the observed value of
the statistic falls in that distribution.
June 27, 2014
MAA PREP workshop
90
5. One possible statistic is max p̂ – min p̂ . Review the conditional proportions you
determined in #2b and calculate the observed value of this statistic for these data. Why
might this not be a less than satisfactory choice?
However, this statistic ignores some of the groups. Another reasonable statistic to calculate is
the mean of the absolute values of differences for each pair. (We could call this statistic “MAD”
for mean of absolute differences.)
6. Let’s construct the MAD statistic for these data by going through the following steps.
a. Review the conditional proportions of organ donors you determined in #5b and
calculate the differences in these proportions for each pair:
Opt-out minus opt-in:
Opt-out minus neutral:
Opt-in minus neutral:
b. Calculate the mean of the absolute values of these differences.
So which statistic should you use? And how can you use the statistic to estimate a p-value?
We’ll consider those questions in the following sections.
2. Simulation
In Chapter 5 we first saw how we can simulate a null hypothesis of no association by shuffling
the response variable outcomes across the explanatory variable groups. This models the
random assignment process used in the data collection process, and, assuming the null
hypothesis is true, “breaks any potential association” between the response and explanatory
variables. We modeled this with playing cards in Chapter 5.
7. Describe how you would model a simulation of the null hypothesis with red and black
playing cards now that there are three explanatory variable groups instead of two. Make
sure to tell how many black cards you would need and how many red ones. Also tell
what each color of cards would represent. How many cards would you deal out to each
group? What would you calculate after dealing them out?
8. Now let’s use the Multiple Proportions applet to simulate a null distribution of these
MAD statistics. This null distribution is again used to measure the strength of evidence
provided by the observed statistic against the null hypothesis.
June 27, 2014
MAA PREP workshop
91
a. Now check Show Shuffle Options, and leave Number of Shuffles at 1. Press
Shuffle Response to perform one shuffle of the response variable values. You
should see the shuffled response variable, the new two-way table for the simulated
data, and the value of the simulated statistic (Most Recent Shuffled MAD) which is
also placed in blue on the graph on the right. Select the Plot radio button to see the
shuffled segmented bar graph. How does the distribution across the groups for the
shuffled data compare to the original data? Is the simulated MAD statistic value
closer to zero than the observed value of the MAD statistic?
b. Now enter 999 for the Number of Shuffles, and press Shuffle Responses. This
repeats the shuffling of the response variable 999 more times for a total of 1000
repetitions. You should see a graph of a null distribution of the MAD statistics. What
is the shape of this distribution? Why is it not a bell-shaped curve centered at zero?
3. Strength of Evidence
9. To estimate a p-value, determine how often the observed value of the statistic, or
something even larger, occurred in the null distribution. (Hint: Enter the observed value
for the MAD statistic from the research study in the Count Samples box and press
Count.)
10. Based on the p-value, summarize the strength of evidence that the sample data provide
against the null hypothesis.
June 27, 2014
MAA PREP workshop
92
Randomization Tests Comparing Many Groups on a Quantitative Response
Research Question: An investigation of strategies that may affect memory scores.
Data collection:
You will be asked to study a sequence of letters for 20 seconds and then to write down as many
as you can remember, in order. Your score will be the number that you remember correctly
before your first error of any kind.
1) Note your score here: ____________
2) Note your treatment group here: ______________ (the instructor will tell you the answer
to this)
3) On the instructor’s computer, add your score to the appropriate column. Sketch the
dotplots of the class data here. Make sure to draw a rough sketch of the three graphs in
the space below, with labeled axes.
4) Did all the students in the study get the same memory score? YES
NO
What are a few reasons why the memory scores vary?
5) Record the averages (means) and SDs for each group. Record these below.
Group
Sample size
Sample average
Sample SD
6) Do the graphical and numerical summaries suggest that the three “treatment” groups
tended to produce different responses? Explain.
June 27, 2014
MAA PREP workshop
93
7) Write out the null and alternative hypotheses of what we are trying to test about memory
in this experiment.
Null hypothesis, H0:
Alternative hypothesis, Ha:
Drawing Inference
Once again, we arrive at the question: how surprising are the study results to have happened by
chance alone? And, we can answer that question by finding a p-value.
Let us now use the 3S Strategy to find a p-value in the context of our study. That is, let us find
the probability of seeing the kind of differences we are seeing between the three groups or
something even more extreme, if the grouping of letters had no effect on memory scores.
8) First though, how can we measure or quantify the observed difference between the three
groups? Suggest one or two ways to do this.
In answering #8, you have identified a few possible statistics for the “statistic” step of the 3S
Strategy.
One option is to use the following statistic to quantify the observed difference in mean scores
between the three groups, and call it the MAD statistic:
MAD = Mean Absolute Difference =
| avg1  avg2 |  | avg1  avg3 |  | avg2  avg3 |
3
Using the 3S Strategy:
Statistic
9) For our data, find the numeric value for the MAD statistic.
June 27, 2014
MAA PREP workshop
94
10) What kind of numeric values would you expect the MAD statistic to take if the type of
grouping of letters had no effect on memory scores? Why?
Recall that the issue of Statistical Significance asks whether the observed results would be
surprising to occur by random chance alone, if there were really no effect/difference between
the groups.
In this case, we are exploring whether the observed MAD statistic value is statistically significant
by investigating how often a MAD value as large as ours would occur by random chance alone,
if there really were no effect/difference between the groups.
11) How was randomness (or random chance) used in this study: random sampling or
random assignment?
12) Is it possible that even with the random assignment, we ended up with three groups
where the good memorizers ended up in one group, and the not-so-good ones in the
others?
13) Is it possible that the grouping of letters has absolutely no effect on memory score, and
the difference observed here in our experiment is merely the result of an unlucky random
assignment?
Note that the more interesting question is, if the grouping of letters has absolutely no effect on
memory score, what is the probability that an unlucky random assignment alone leads to the
difference observed between groups in our study (as measured by the MAD statistic) or an even
greater difference?
We investigate this issue by assuming that all the subjects would have obtained the same
memory score regardless of which group they had been in, and we repeat the random
assignment process over and over, and calculate a could-have-been MAD value for each
repetition. Then, we note how surprising the observed MAD value from our study would be in
the long run, if the type of grouping of letters had no effect on memory scores.
Of course, like we did in Chapter 6, we could shuffle index cards with the scores on them, to
produce several possible values of the MAD statistic that could have happened by random
chance (that is, random assignment) if the type of grouping of letters had no effect on memory
scores. But we will directly go to an applet from here.
NOTE: In the interest of time, we have presented the results from data collected previously.
June 27, 2014
MAA PREP workshop
95
14) The data collected were copied and pasted the data into the Multiple Means applet.
Pressing Use My Data on the applet produces the following output:
Thus, the observed MAD statistic is: ____________________
Simulate
15) Similar to how the Multiple Proportions applet was used, we used the Multiple Means
applet to produce the following randomization-based null distribution of the MAD
statistic.
With regard to the randomization-based null distribution of the MAD statistic:
a. Is the null distribution centered at zero? Explain why your answer makes sense.
June 27, 2014
MAA PREP workshop
96
b. Describe the shape of the null distribution. Is it symmetric or skewed? If it is
skewed, in which direction is it skewed? Explain why your answer makes sense.
Strength of evidence
16) Indicate the region of the null distribution that denotes the p-value. How did you decide
where to look?

Did the center and shape of the null distribution have a bearing on how you
decided about what region of the null distribution denoted the p-value?
17) The (approximate) p-value was calculated to be: 0.570. Write a conclusion for the result
of your hypothesis test in the context of this problem. In your conclusion, be sure to
address the following:
 Can we conclude that type of letter grouping affects average memory scores?
Why or why not?
 To whom can we say that these results are applicable? How do you decide?
June 27, 2014
MAA PREP workshop
97
June 27, 2014
MAA PREP workshop
98
Correlation Test
(adapted from Exploration 10.2: Draft Lottery)
In 1970, the United States Selective Service conducted a lottery to decide which young men
would be drafted into the armed forces (Fienberg, 1971). Each of the 366 birthdays in a year
(including February 29) was assigned a draft number. Young men born on days assigned low
draft numbers were drafted.
We will regard the 366 dates of the year as observational units. We will consider two variables
recorded on each date: draft number assigned to the date, and sequential date in the year (so
January 31 is sequential date 31, February 1 is sequential date 32, and so on).
1. In a perfectly fair, random lottery, what should be the value of the correlation coefficient
between draft number and sequential date of birthday?
The graph below displays a scatterplot of the assigned draft numbers and the sequential dates.
There are 366 dots, one for each day of the (leap) year.
2.
a. Does the scatterplot reveal much of an association between draft number and
sequential date?
b. Based on the scatterplot, guess the value of the correlation coefficient.
c.
Does it appear that this was a fair, random lottery?
It’s difficult to see much of a pattern or association in the scatterplot, so it seems reasonable to
conclude that this was a fair, random lottery with a correlation coefficient near zero.
But let’s dig a little deeper …
June 27, 2014
MAA PREP workshop
99
3. The 1970 Draft Lottery data sheet (at the end of the exploration) shows the draft number
assigned to each of the 366 birthdays.
a. Find and report the draft number assigned to your birthday.
b. Is your draft number in the bottom third (1-122), middle third (123-244), or top third
(245-366)?
4. The second table at the end of the exploration has ordered the draft numbers within
each month.
a. Use this table to determine the median draft number for your birth month.
b. Collaborate with your classmates to determine and report the median draft number
for all twelve months.
Month
Median draft
number
Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec
c. Do you notice any pattern or trend in the median draft numbers over the course of
the year? (Hint: If you do not see a trend, compare the six medians from Jan – June
with the 6 medians from July – Dec.)
5. The correlation coefficient for these data is r = -0.226. What does this number reveal? Is
it consistent with the scatterplot?
6.
a. Suggest two possible explanations (hypotheses) which could have generated the
value of the observed correlation coefficient.
June 27, 2014
MAA PREP workshop
100
b. In your own words, how could we go about determining whether random chance is a
plausible explanation for the observed correlation value between sequential date and
draft number? Explain how the 3S strategy could be applied here, in particular
identifying a simulation strategy you could conduct “by hand.” Note: You do not need
to actually carry out a simulation analysis.
i.
What is the statistic?
ii. How would you simulate?
iii. How would you evaluate the strength of evidence?
The null hypothesis to be tested is that the lottery was conducted with a fair, random process.
The null hypothesis would therefore mean that there is no association between sequential date
and draft number for this process. The alternative hypothesis is that this lottery was not
conducted with a fair, random process, so there is an association between sequential date and
draft number.
How can we assess whether the observed correlation coefficient of r = -0.226 is far enough from
zero to provide convincing evidence that the lottery process was not random? Like always, we
ask how unlikely it would be for a fair, random lottery to produce a correlation value as far from
zero as -0.226. Also like always, we answer that question by simulating a large number of fair
random lotteries, calculating the correlation coefficient for each one, and seeing how often we
obtain a correlation coefficient as or more extreme (as far from zero) as -0.226.
7. Open the Corr/Regression applet. Copy the data from DraftLottery into the applet
(remember to include the column titles).
a. Check the Correlation Coefficient box and confirm that the correlation coefficient is
-0.226.
b. Check the Show Shuffle Options box and select the Correlation radio button to the
right to keep track of that statistic. Then press Shuffle Y-values to simulate one fair,
random lottery. (Select the Plot radio button to see the corresponding scatterplot.)
Record the value of the correlation coefficient between the shuffled draft numbers
and sequential date (in blue).
c. Press Shuffle Y-values four more times to generate results of four more fair,
random lotteries. Record the values of the correlation coefficients in the table below.
June 27, 2014
MAA PREP workshop
101
Repetition
Correlation coefficient
1
2
3
4
5
d. Did any of these simulated statistics from fair, random lotteries produce a
correlation coefficient as extreme (far from zero) as the observed -0.226?
e. Change the Number of Shuffles from 1 to 995 and press Shuffle Y-values to
simulate 995 more fair, random lotteries. Look at the null distribution of these
1000 correlation coefficients. Where is this distribution centered? Why does this
make sense?
f.
Use Count Samples pull-down menu to select Beyond. Specify the observed
correlation coefficient (-.226) and press Count. What proportion of the 1000
simulated random lotteries produced a correlation coefficient at least as extreme
(as far from zero in either direction) as -0.226? Report the approximate p-value.
g. Interpret this p-value: This is the probability of what, assuming what?
h. What conclusion would you draw from this p-value? Do you have strong
evidence that the 1970 draft lottery was not conducted with a fair, random
process? Explain the reasoning behind your conclusion.
Once they saw these results, statisticians were quick to point out that something fishy happened
with the 1970 draft lottery. The irregularity can be attributed to improper mixing of the balls used
in the lottery drawing process. (Balls with birthdays early in the year were placed in the bin first,
and balls with birthdays late in the year were placed in the bin last. Without thorough mixing,
balls with birthdays late in the year settled near the top of the bin and so tended to be selected
earlier.) The mixing process was changed for the 1971 draft lottery (e.g., two bins, one for the
draft numbers and one for the birthdays), for which the correlation coefficient turned out to be r
= 0.014.
8. Use your simulation results to approximate the p-value for the 1971 draft lottery. Is there
any reason to suspect that this 1971 draft lottery was not conducted with a fair, random
process? Explain the reasoning behind your conclusion. Also explain why you don’t
need to paste in the data from the 1971 lottery first.
June 27, 2014
MAA PREP workshop
102
1970 Draft Lottery Data
date
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
Jan
305
159
251
215
101
224
306
199
194
325
329
221
318
238
17
121
235
140
58
280
186
337
118
59
52
92
355
77
349
164
211
June 27, 2014
Feb
86
144
297
210
214
347
91
181
338
216
150
68
152
4
89
212
189
292
25
302
363
290
57
236
179
365
205
299
285
Mar
108
29
267
275
293
139
122
213
317
323
136
300
259
354
169
166
33
332
200
239
334
265
256
258
343
170
268
223
362
217
30
Apr
32
271
83
81
269
253
147
312
219
218
14
346
124
231
273
148
260
90
336
345
62
316
252
2
351
340
74
262
191
208
May
330
298
40
276
364
155
35
321
197
65
37
133
295
178
130
55
112
278
75
183
250
326
319
31
361
357
296
308
226
103
313
Jun
249
228
301
20
28
110
85
366
335
206
134
272
69
356
180
274
73
341
104
360
60
247
109
358
137
22
64
222
353
209
Jul
93
350
115
279
188
327
50
13
277
284
248
15
42
331
322
120
98
190
227
187
27
153
172
23
67
303
289
88
270
287
193
MAA PREP workshop
Aug
111
45
261
145
54
114
168
48
106
21
324
142
307
198
102
44
154
141
311
344
291
339
116
36
286
245
352
167
61
333
11
Sep
225
161
49
232
82
6
8
184
263
71
158
242
175
1
113
207
255
246
177
63
204
160
119
195
149
18
233
257
151
315
Oct
359
125
244
202
24
87
234
283
342
220
237
72
138
294
171
254
288
5
241
192
243
117
201
196
176
7
264
94
229
38
79
Nov
19
34
348
266
310
76
51
97
80
282
46
66
126
127
131
107
143
146
203
185
156
9
182
230
132
309
47
281
99
174
Dec
129
328
157
165
56
10
12
105
43
41
39
314
163
26
320
96
304
128
240
135
70
53
162
95
84
173
78
123
16
3
100
103
The following table arranges the draft numbers in order for each month:
rank
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
Jan
17
52
58
59
77
92
101
118
121
140
159
164
186
194
199
211
215
221
224
235
238
251
280
305
306
318
325
329
337
349
355
June 27, 2014
Feb
4
25
57
68
86
89
91
144
150
152
179
181
189
205
210
212
214
216
236
285
290
292
297
299
302
338
347
363
365
Mar
29
30
33
108
122
136
139
166
169
170
200
213
217
223
239
256
258
259
265
267
268
275
293
300
317
323
332
334
343
354
362
Apr
2
14
32
62
74
81
83
90
124
147
148
191
208
218
219
231
252
253
260
262
269
271
273
312
316
336
340
345
346
351
May
31
35
37
40
55
65
75
103
112
130
133
155
178
183
197
226
250
276
278
295
296
298
308
313
319
321
326
330
357
361
364
Jun
20
22
28
60
64
69
73
85
104
109
110
134
137
180
206
209
222
228
247
249
272
274
301
335
341
353
356
358
360
366
Jul
13
15
23
27
42
50
67
88
93
98
115
120
153
172
187
188
190
193
227
248
270
277
279
284
287
289
303
322
327
331
350
MAA PREP workshop
Aug
11
21
36
44
45
48
54
61
102
106
111
114
116
141
142
145
154
167
168
198
245
261
286
291
307
311
324
333
339
344
352
Sep
1
6
8
18
49
63
71
82
113
119
149
151
158
160
161
175
177
184
195
204
207
225
232
233
242
246
255
257
263
315
Oct
5
7
24
38
72
79
87
94
117
125
138
171
176
192
196
201
202
220
229
234
237
241
243
244
254
264
283
288
294
342
359
Nov
9
19
34
46
47
51
66
76
80
97
99
107
126
127
131
132
143
146
156
174
182
185
203
230
266
281
282
309
310
348
Dec
3
10
12
16
26
39
41
43
53
56
70
78
84
95
96
100
105
123
128
129
135
157
162
163
165
173
240
304
314
320
328
104
FAQ 10.3.1 Why least squares?
Buzz: Why least squares? Why do we square the deviations? Why not absolute values?
Doris: Not only is your question a thoughtful one, it’s also a question that invites us back in
history. In the 1700s, scientists tried to do what you suggest – to minimize the sum of
absolute deviations: they wanted to use least MAD (deviation) instead of least sum of
squared deviations. The scientific logic was good, but the math turned out to be too messy.
Sometimes it worked, and there was a single “best” line that minimized the MAD statistic. But
sometimes there would be infinitely many “best” lines, with no way to choose from among
them. “Best” and “infinitely many” don’t go together. Even when there was a single best line,
it was devilishly hard to find. Overall, the math was just too hard.
Buzz: So least squares is pretty much just the fallback Plan B after least absolute deviations
failed?
Doris: Yes. The justification for least absolute deviations makes good sense, but the theory
is much simpler for least squares. As Fred Mosteller, one of the greatest statisticians of the
last century, used to tell his students at Harvard, “Mathematicians love calculus, and calculus
loves squares.”
Buzz: Does the choice – absolute values or squares -- make much difference in the line you
get?
Doris: Another good question. Sometimes not, but sometimes it does.
Buzz: Say more.
Doris: Remember SD and IQR and outliers?
Buzz: You mean SD is sensitive to outliers and IQR is resistant?
Doris: Exactly. Remember why?
Buzz: Because the SD squares the deviations, and the IQR does not. Squaring gives extra
weight to large deviations.
Doris: Apply the same logic to the least squares line.
Buzz: Outliers have a big effect because their large deviations get extra weight from the
squaring.
Doris: Bravo! This is why some observations can become influential points – the least
squares line won’t let their residuals get too large. If this were second grade in the 1950s
you’d get to stay after class and dust the erasers. If this were the 1970s and you were a
statistician, you’d be able to get a big research grant from the National Science Foundation.
If this was Dr. Bastian’s lab and you were a dolphin, you’d have just earned a fish.
June 27, 2014
MAA PREP workshop
105
Regression
(adapted from Exploration 10.3: Predicting Height from Footprints)
Can a footprint taken at the scene of a crime help to predict the height of the criminal? In other
words, is there an association between height and foot length? To investigate this question, a
sample of 20 statistics students measured their height (in inches) and their foot length (in
centimeters). The sample data for the 20 students appear in the Correlation/Regression
applet.
1. Open this applet, and look at the scatterplot for the data already there. Check the Show
Movable Line box to add a blue line to the scatterplot. If you place your mouse over one
of the green squares at the ends of the line and drag, you can change the slope of the
line and move it. You can also use the mouse to move the green dot up and down
vertically to change the intercept of the line.
a. Move the line until you believe your line “best” summarizes the relationship
between height and foot length for these data. Write down the resulting equation
for your line (using traditional statistical notation).
b. Why do you believe that your line is “best?”
c. Did all students in your class obtain the same line/equation?
d. How can we decide whether your line provides a better fit to the data than other
students’ lines? Suggest a criterion for deciding which line “best” summarizes the
relationship.
One way to draw the best fit line is to minimize the distance of the points to the line (these
distances are called residuals).
Key idea: A residual is the difference between an observed response and the corresponding
prediction made by the least squares regression line (residual = observed – predicted). Thus,
negative residuals occur when points are below the best fit line and positive residuals occur
when points are above the best fit line.
June 27, 2014
MAA PREP workshop
106
2. Check the Show Residuals box to visually represent these residuals for your line on the
scatterplot. The applet also reports the sum of the values of the residuals (SAE). SAE
stands for “Sum of the Absolute Errors.” The acronym indicates that we need to make
residuals positive before we add them up and that sometimes people call residuals
“errors.”
Record the SAE value for your line _____________
What is the best (lowest) SAE in the class?_________________
3. It turns out that a more common criterion for determining the “best” line is to instead look
at the sum of the squared residuals (SSE). This approach is similar to simply adding up
the residuals, but is even more strict in not letting individual residuals get too large.
Check the Show Squared Residuals box to visually represent the squared residual for
each observation. Note that we can visually represent the squared residual as the area
of a square where each side of the square has length equal to the residual.
a. What is the SSE (sum of squared residuals) for your line?____________
What is the best (lowest) SSE in the class?________
b. Now continue to adjust your line until you think you have minimized the sum
of the squared residuals.
Report your new equation_________________________________
Report your new SSE value_____________________________
What is the best SSE in the class?_________________________
Key Idea: The least squares regression line minimizes the sum of squared residuals.
4. Now check the Show Regression Line box to determine and display the equation for
the line that actually does minimize (as can be shown using calculus) the sum of the
squared residuals.
a. Record the equation of the least squares regression line by indicating the
appropriate slope and intercept of the line. Note that we’ve used variable names
in the equation, not generic x and y. And put a carat (“hat”) over the y variable
name to emphasize that the line gives predicted values of the y (response)
variable.
̂ = _______ + _______(ℎ)
ℎ
Notation: The equation of the best fit line is written as ̂ =  + () where
  is the y-intercept
  is the slope
  is a value of the explanatory variable
 ̂ is the predicted value for the response variable
June 27, 2014
MAA PREP workshop
107
b. Did everyone in your class obtain the same equation?
5. Uncheck the Show Movable Line box to remove it from the display and check the Move
observations box.
a. Now click on one student’s point in the scatterplot and drag the point up and
down (changing the height, without changing the foot length). Does the
regression line change much as you change this student’s height?
b. Repeat the previous question, using a student with a very small x (foot size)
value and then a point with an x value near the middle and then a point with a
very large x value. Which of these seem(s) to have the most influence on the
least squares regression line? Explain.
Key idea: An observation or set of observations is considered influential if removing the
observation from the dataset substantially changes the values of the correlation coefficient
and/or the least squares regression equation. Typically, observations that have extreme
explanatory variable values (far below or far above x ) are potentially influential. They may not
have large residuals, having pulled the line close to them.
Residuals also help us measure how accurate our predictions are from using the regression
line. In particular, we can compare the “prediction errors” from the regression line to the
prediction errors if we made no use of the explanatory variable.
6. Press the Revert button to reload the original data set. Recheck the Show Movable
Line box to redisplay the blue line. Notice that this line is flat at the mean of the y
(height) values.
a. Check the Show Squared Residuals box (under the Movable Line information) to
determine the SSE if we were to use the average height ( y ) as the predicted value
for every x (foot size). Record this value.
b. What is the slope of this line?
c. If the slope of the best fit line is zero, our data shows __________________
(positive/negative/no) association between the explanatory and response variables.
June 27, 2014
MAA PREP workshop
108
7. Now compare this to the SSE value from the regression line. Which is smaller? Why
does that make sense?
Coefficient of Determination (r2)
8. A quantity related to the correlation coefficient is called the coefficient of determination,
or r-squared (r2).
a. To measure how much the regression line decreases the “unexplained variability” in
the response variable, we can calculate the percentage reduction in SSE.
SSE( y ) – SSE(regression line) x 100% =
SSE( y )
Key Idea: The coefficient of determination indicates the percentage of variation in the
response variable (height) that is explained by changes in the explanatory variable (foot
length).
b. Complete the following statement: The coefficient of determination is
______________percent, this means that _____________ percent of the
variation in people’s ______________________ is attributable to
changes in their ______________________________.
If we think of (̅) as the “worst” line (no association), then we are seeing how
much better the actual line is than this worst line. The best possible line would get rid
of all of the residuals, and would have an r-squared of 100%.
c. Find the value of the correlation coefficient using the applet, and confirm that when
you square the correlation coefficient you get the same number as the applet reports
for the coefficient of determination.
June 27, 2014
MAA PREP workshop
109