Introduction to Chance Models (Section 1.1)

Introduction

A key step in the Statistical Investigation Method is drawing conclusions beyond the observed data. Statisticians often call this “statistical inference.” There are four main types of conclusions (inferences) that statisticians can draw from data: significance, estimation, generalization, and causation. In the remainder of this chapter we will focus on statistical significance.

If you think back to the Organ Donor study from the Preliminaries, there were three groups: those in the neutral group were asked to make a yes/no choice about becoming a donor; those in the “opt in” group were told their default was not to be a donor, but they could choose to become a donor; and those in the “opt out” group were told their default was they were a donor, but they could choose not to be one if they wished. Let’s further examine two of those groups. We saw that 23 of 55 or 41.8% in the “opt in” group elected to become donors, compared with 44 of 56 or 78.6% in the neutral group. A key question here is whether we believe that the 78.6% is far enough away from the 41.8% to be considered statistically significant, meaning unlikely to have occurred by random chance alone. True, 78.6% looks very different from 41.8%, but it is at least possible that the wording of the solicitation to donate actually makes no difference, and that the difference we observed happened by random chance.

To answer the question “Is our result unlikely to happen by random chance?” our general strategy will be to consider what we expect the results to look like if any differences we are seeing are solely due to random chance. Exploring the random chance results is critical to our ability to draw meaningful conclusions from data. In this section we will provide a framework for assessing random chance explanations.

Example 1.1: Can Dolphins Communicate?
A famous study from the 1960s explored whether two dolphins (Doris and Buzz) could communicate abstract ideas. Researchers believed dolphins could communicate simple feelings like “Watch out!” or “I’m happy,” but Dr. Jarvis Bastian wanted to explore whether they could also communicate in a more abstract way, much like humans do. To investigate this, Dr. Bastian spent many years training Doris and Buzz and exploring the limits of their communicative ability. During a training period lasting many months, Dr. Bastian placed buttons underwater on each end of a large pool—two buttons for Doris and two buttons for Buzz. He then used an old automobile headlight as his signal. When he turned on the headlight and let it shine steadily, he intended for this signal to mean “push the button on the right.” When he let the headlight blink on and off, this was meant as a signal to “push the button on the left.” Every time the dolphins pushed the correct button, Dr. Bastian gave the dolphins a reward of some fish. Over time Doris and Buzz caught on and could earn their fish reward every time. Then Dr. Bastian made things a bit harder. Now, Buzz had to push his button before Doris. If they didn’t push the buttons in the correct order—no fish. After a bit more training, the dolphins caught on again and could earn their fish reward every time. The dolphins were now ready to participate in the real study to examine whether they could communicate with each other. Dr. Bastian placed a large canvas curtain in the middle of the pool. (See Figure 1.1.) Doris was on one side of the curtain and could see the headlight, whereas Buzz was on the other side of the curtain and could not see the headlight. Dr. Bastian turned on the headlight and let it shine June 27, 2014 MAA PREP workshop 1 steadily. He then watched to see what Doris would do. After looking at the light, Doris swam near the curtain and began to whistle loudly. 
Shortly after that, Buzz whistled back and then pressed the button on the right. He got it correct, and so both dolphins got a fish. But this single attempt was not enough to convince Dr. Bastian that Doris had communicated with Buzz through her whistling. Dr. Bastian repeated the process several times, sometimes having the light blink (so Doris needed to let Buzz know to push the left button) and other times having it glow steadily (so Doris needed to let Buzz know to push the right button). He kept track of how often Buzz pushed the correct button.

In this scenario, even if Buzz and Doris can communicate, we don’t necessarily expect Buzz to push the correct button every time. We allow for some “randomness” in the process; maybe on one trial Doris was a bit more underwater when she whistled and the signal wasn’t as clear for Buzz. Or maybe Buzz and Doris aren’t communicating at all and Buzz guesses which button to push every time and just happens to guess correctly once in a while. Our goal is to get an idea of how likely Buzz is to push the correct button in the long run.

Figure 1.1: Depending on whether the light was blinking or shone steadily, Doris had to communicate to Buzz which button to push.

Let’s see how Dr. Bastian was applying the Six-Step Statistical Investigation Method.

Step 1: Ask a research question. Can dolphins communicate in an abstract manner?

Step 2: Design a study and collect data. Notice Dr. Bastian took some time to train the dolphins in order to get them to a point where he could test a specific research conjecture. The research conjecture is that Buzz pushes the correct button more often than he would if he and Doris could not communicate. If Buzz and Doris could not communicate, Buzz would just be guessing which button to push. The observational units are Buzz’s attempts, and the variable recorded for each attempt is whether or not Buzz pushes the correct button (a categorical variable).

Step 3: Explore the data.
In one phase of the study, Dr. Bastian had Buzz attempt to push the correct button a total of 16 different times. In this sample of 16 attempts, Buzz pushed the correct button 15 out of 16 times. To summarize these results, we report the statistic, a numerical summary of the sample. For this example, we could report either 15, the number of correct pushes, or 15/16 = 0.9375, as the statistic.

Definitions: The set of observational units on which we collect data is called the sample. The number of observational units in the sample is the sample size. A statistic is a number summarizing the results in the sample.

The sample size in this example is 16. Note that the word “sample” is used both as a noun (the set of observational units being studied) and as an adjective meaning “computed from the observed data,” as in “sample statistic.”

Step 4: Draw inferences beyond the data. These 16 observations are a mere snapshot of Buzz’s overall selection process. We will consider this a random process. We are interested in Buzz’s actual long-run probability of pushing the correct button based on Doris’ whistles. This unknown long-run probability is called a parameter.

Definition: For a random process, a parameter is a long-run numerical property of the process.

Note that we are assuming this parameter is not changing over time, at least for the process used by Buzz in this phase of the study. Because we can’t observe Buzz pushing the button forever, we need to draw conclusions (possibly incorrect, but hopefully not) about the value of the parameter based only on these 16 attempts. Buzz certainly pushed the correct button most of the time, so we might consider either:
- Buzz is doing something other than just guessing (his probability of a correct button push is larger than 0.50).
- Buzz is just guessing (his probability of a correct button push is 0.50) and he got lucky in these 16 attempts.
These are the two possible explanations to be evaluated. Because we can’t collect more data, we have to base our conclusions only on the data we have. It’s certainly possible that Buzz was just guessing and got lucky! But does this seem like a reasonable explanation to you? How would you argue against someone who thought this was the case? Think about it: Based on these data, do you think Buzz somehow knew which button to push? Is 15 out of 16 correct pushes convincing to you? Or do you think that Buzz could have just been guessing? How might you justify your answer? So how are we going to decide between these two possible explanations? One approach is to choose a model for the random process (repeated attempts to push the correct button) and then see whether our model is consistent with the observed data. If it is, then we will conclude that we have a reasonable model and we will use that model to answer our questions. The Chance Model Scientists use models to help understand complicated real world phenomena. Statisticians often employ chance models to generate data from random processes to help them investigate such processes. You did this with the Monty Hall Exploration (P.3) to investigate properties of the two strategies, switching and staying with your original choice of door. In that exploration it was clear how the underlying chance process worked, even though the probabilities themselves were not obvious. But here we don’t know for sure what the underlying real world process is. We are trying to decide whether the process could be Buzz simply guessing or whether the process is something else, such as Buzz and Doris being able to communicate. Let us first investigate the “Buzz was simply guessing” process. Because Buzz is choosing between two options, the simplest chance model to consider is a coin flip. We can flip a coin to represent or simulate Buzz’s choice assuming he is just guessing which button to push. 
To generate this artificial data, we can let “heads” represent the outcome that Buzz pushes the correct button and let “tails” be the outcome that Buzz pushes the incorrect button. This gives Buzz a 50% chance of pushing the correct button. This can be used to represent the “Buzz was just guessing” or the “random chance alone” explanation. The correspondence between the real study and the physical simulation is shown in Table 1.1.

Table 1.1: Parallels between real study and physical simulation
Coin flip = guess by Buzz
Heads = correct guess
Tails = wrong guess
Chance of Heads = 1/2 = probability of correct button when Buzz is just guessing
One repetition = one set of 16 simulated attempts by Buzz

Now that we see how flipping a coin can simulate Buzz guessing, let’s flip some coins to simulate Buzz’s performance. Imagine that we get heads on the first flip. What does this mean? This would correspond to Buzz pushing the correct button! But why did he push the correct button? In this chance model, the only reason he pushed the correct button is that he happened to guess correctly. Remember, the coin is simulating what happens when Buzz is just guessing which button to push.

What if we keep flipping the coin? Each time we flip the coin we are simulating another attempt where Buzz guesses which button to push. Remember that heads represents Buzz guessing correctly and tails represents Buzz guessing incorrectly. How many times do we flip the coin? Sixteen, to match Buzz’s 16 attempts in the actual study. After 16 tosses, Figure 1.2 shows our simulated data.

Figure 1.2: Results from tossing a coin 16 times

Here we got 11 heads and 5 tails (11 out of 16 or 0.6875 is the simulated statistic). This gives us an idea of what could have happened in the study if Buzz had been randomly guessing which button to push each time. Will we get this same result every time we flip a coin 16 times?
Let’s flip our coin another 16 times and see what happens. When we did this we got 7 heads and 9 tails as shown in Figure 1.3 (7 out of 16 or 0.4375 is the simulated statistic).

Figure 1.3: Results from a second repetition of tossing a coin 16 times

So can we learn anything from these coin tosses when the results vary between the sets of 16 tosses?

Using and evaluating the coin flip chance model

Because coin flipping is a random process, we know that we won’t obtain the same number of heads with every set of 16 flips. But are some numbers of heads more likely than others? If we continue our repetitions of 16 tosses, we can start to see how the outcomes for the number of heads are distributed. Does the distribution of the number of heads that result in 16 flips have a predictable long-run pattern? In particular, how much variability is there in our simulated statistics between repetitions (sets of 16 flips) just by random chance?

In order to investigate these questions, we need to continue to flip our coin to get many, many sets of 16 flips (or many repetitions of the 16 choices where we are modeling Buzz simply guessing each time). We did this, and Figure 1.4 shows what we found when we graphed the number of heads from each set of 16 coin flips. Here, the process of flipping a coin 16 times was repeated 1000 times. This number was chosen for convenience, but it also appears to be large enough to give us a fairly accurate sense of the long-run behavior for the number of heads in 16 tosses.

Figure 1.4: A dotplot showing 1000 repetitions of flipping a coin 16 times and counting the number of heads. Each dot represents one set of 16 attempts by Buzz.

Let’s think carefully about what the graph in Figure 1.4 shows. For this graph, each dot represents the number of heads in one set of 16 coin tosses.
We see that the resulting number of heads follows a clear pattern: 7, 8, and 9 heads happened quite a lot, 10 was pretty common also (though less so than 8), 6 happened some of the time, 1 happened once. But we never got 15 heads in any set of 16 tosses! We might consider any outcome between about 5 and 11 heads to be typical, but getting fewer than 5 heads or more than 11 heads happened rarely enough that we can consider it a bit unusual. We refer to these unusual results as being out in the ‘tails’ of the distribution.

Think about it: How does the analysis above help us address the strength of evidence for our research conjecture that Buzz was doing something other than just guessing?

What does this have to do with the dolphin communication study? We said that we would flip a coin to simulate what could happen if Buzz was really just guessing each time he pushed the button in 16 attempts. We saw that getting results like 15 heads out of 16 never happened in our 1000 repetitions. This shows us that 15 is a very unusual outcome, far out in the tail of the distribution of the simulated statistics, if Buzz is guessing. In short, even though we expect some variability in the results for different sets of 16 tosses, the pattern shown in this distribution indicates that an outcome of 15 heads is outside the typical chance variability we would expect to see when Buzz is simply guessing.

In the actual study, Buzz really did push the correct button 15 times out of 16, an outcome that we just determined would rarely occur if Buzz was just guessing. So, our coin flip chance model tells us that we have very strong evidence that Buzz was not just tossing a coin to make his choices. This means we have strong evidence that Buzz wasn’t just guessing. Therefore, we don’t believe the “by chance alone” explanation is a good one for Buzz.
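The coin-tossing simulation described above can also be carried out on a computer. The following is a minimal Python sketch, not the procedure that was used to produce Figure 1.4. The function name simulate_heads, the seed, and the text-based display are our own illustrative choices. It repeats 16 fair coin tosses 1000 times, tallies the number of heads in each repetition, and counts how many repetitions gave 15 or more heads. For comparison it also computes the exact probability of 15 or more heads in 16 fair tosses, a calculation that goes beyond the simulation approach used in this section.

```python
import random
from collections import Counter
from math import comb


def simulate_heads(n_flips, n_reps, seed=None):
    """Return the number of heads in each of n_reps sets of n_flips fair coin tosses.

    Hypothetical helper for illustration; heads stands for a correct guess by Buzz.
    """
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(n_flips)) for _ in range(n_reps)]


# 1000 repetitions of 16 tosses, mirroring the physical simulation in the text.
counts = simulate_heads(n_flips=16, n_reps=1000, seed=1)

# Crude text version of a dotplot: one star per 5 repetitions (display choice only).
tally = Counter(counts)
for heads in range(17):
    print(f"{heads:2d} | {'*' * (tally[heads] // 5)} ({tally[heads]})")

# How often did the simulation produce a result as extreme as Buzz's 15 out of 16?
at_least_15 = sum(c >= 15 for c in counts)
print("Repetitions with 15 or more heads:", at_least_15, "out of", len(counts))

# Exact chance of 15 or 16 heads in 16 fair tosses (binomial count over 2**16 outcomes).
exact = (comb(16, 15) + comb(16, 16)) / 2 ** 16
print("Exact probability of 15 or more heads:", exact)
```

The exact probability works out to 17 in 65,536, or about 0.00026, which agrees with the observation that 15 heads essentially never appears in 1000 repetitions. The simulated tallies themselves will change from run to run unless the seed is fixed.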
The evidence is strong enough for the result to be called statistically significant (that is, unlikely to have happened by chance alone) and to convince us that something other than “random chance” was at play.

Definition: A result is statistically significant if it is unlikely to occur just by random chance. If our observed result appears to be consistent with the chance model, we say that the chance model is plausible or believable.

What next? A glimpse into Steps 5 and 6

The steps we went through above have helped us evaluate how strong the evidence is that Buzz is not guessing (Step 4 of the Statistical Investigation Method). In this case, the evidence provided by this sample is fairly strong that Buzz isn’t guessing. Still, there are some important questions you should be asking right now, such as: If Buzz isn’t guessing, what is he doing?

Step 5: Formulate conclusions. We should also ask ourselves: if Buzz wasn’t guessing, does this prove that Buzz and Doris can communicate? And if so, what does this say about other dolphins? As we’ll find in later chapters, the answers to these questions hinge mainly on how the study was designed and how we view the 16 attempts that we observed (e.g., we assume Buzz couldn’t see the light himself; the light signal displayed each time was chosen randomly so Buzz couldn’t figure out a pattern to help him decide which button to push; Buzz’s 16 attempts are a good representation of what Buzz would do given many more attempts under identical conditions; but we might still wonder whether Buzz’s behavior is representative of dolphin behavior in general or whether there are key differences among individual dolphins).

Step 6: Look back and ahead. After completing Steps 1-5 of the Statistical Investigation Method, we need to revisit the big picture of the initial research question. First, we reflect on the limitations of the analysis, and think about future studies.
In short, we are now stepping back and thinking about the initial research question more than the specific research conjecture being tested in the study. In some ways, this is the most important step of the whole study because it is where we think about the true implications of the scientific study we’ve conducted. For this study, we would reflect on Dr. Bastian’s methods, summarize the results for Buzz, and reflect on ways to improve the study to enhance the conclusions we can draw.

The 3S Strategy

Let us summarize the overall approach to assessing statistical significance that we have been taking in this section. We observed a sample statistic (e.g., the number of “successes” or the proportion of “successes” in the sample). Then we simulated ‘could-have-been’ outcomes for that statistic under a specific chance model (just guessing). Then we used the information we gained about the random variation in the ‘by-chance’ values of the statistics to help us judge whether the observed value of the statistic is an unusual or a typical outcome. If it is unusual, we say the observed statistic is statistically significant: it provides strong evidence that the chance-alone explanation is wrong. If it is typical, we consider the chance model plausible.

You may have noticed that we only simulated results for one specific model. When we saw that the sample statistic observed in the study was not consistent with these simulated results, we rejected the chance-alone explanation. Often, research analyses stop here. Instead of trying to simulate results from other models (in particular, we may not really have an initial idea what a more appropriate model might be), we are content to say there is something other than random chance at play here. This might lead the researchers to reformulate their conjectures and collect more data in order to investigate different models.
We will call the process of simulating could-have-been statistics under a specific chance model the 3S Strategy. After forming our research conjecture and collecting the sample data, we will use the 3S Strategy to weigh the evidence against the chance model. This 3S Strategy will serve as the foundation for addressing the question of statistical significance in Step 4 of the Statistical Investigation Method.

3S Strategy for Measuring Strength of Evidence
1. Statistic: Compute the statistic from the observed sample data.
2. Simulate: Identify a “by chance alone” explanation for the data. Repeatedly simulate values of the statistic that could have happened when the chance model is true.
3. Strength of evidence: Consider whether the value of the observed statistic from the research study is unlikely to occur if the chance model is true. If we decide the observed statistic is unlikely to occur by chance alone, then we can conclude that the observed data provide strong evidence against the plausibility of the chance model. If not, then we consider the chance model to be a plausible (believable) explanation for the observed data; in other words what we observed could plausibly have happened just by random chance.

Let’s illustrate how we implemented the 3S Strategy for the Doris and Buzz example.
1. Statistic: Our observed statistic was 15, the number of times Buzz pushed the correct button in 16 attempts.
2. Simulate: If Buzz was actually guessing, the parameter (the probability he would push the correct button) would equal 0.5. In other words, he would push the correct button 50% of the time in the long run. We used a coin flip to model what could have happened in 16 attempts when Buzz is just guessing. We flip the coin 16 times and count how many of the 16 flips are heads, meaning how many times Buzz pressed the correct button (“success”).
We then repeat this process many more times, each time keeping track of the number of the 16 attempts that Buzz pushed the correct button. We end up with a distribution of could-have-been statistics representing typical values for the number of correct pushes when Buzz is just guessing.
3. Strength of evidence: Because 15 successes of 16 rarely happens by chance alone, we conclude that we have strong evidence that, in the long run, Buzz is not just guessing.

Notice that we have used the result of 15 out of 16 correct attempts to infer that Buzz’s actual long-run probability of pushing the correct button was not simply 0.5.

Another Doris and Buzz study

One goal of statistical significance is to rule out random chance as a plausible (believable) explanation for what we have observed. We still need to worry about how well the study was conducted. For example, are we absolutely sure Buzz couldn’t see the headlight around the curtain? Are we sure there was no pattern to which headlight setting was displayed that he might have detected? And of course we haven’t completely ruled out random chance; he may have had an incredibly lucky day. But the chance of him being that lucky is so small that we conclude that other explanations are more plausible or credible. One option that Dr. Bastian pursued was to redo the study, this time replacing the curtain with a wooden barrier between the two sides of the tank in order to ensure a more complete separation between the dolphins and to see whether that would diminish the effectiveness of their communication.

Step 1: Ask a research question. The research question remains the same: Can dolphins communicate in a deep abstract manner?

Step 2: Design a study and collect data. The study design is similar with some adjustments to the barrier between Doris and Buzz. The canvas curtain is replaced by a plywood board. The research conjecture, observational units, and variable remain the same.
In this case, Buzz pushed the correct button only 16 out of 28 times. The variable is the same (whether or not Buzz pushed the correct button), but the number of observational units (sample size) has changed to 28 (the number of attempts).

Think about it: Based on the results for this phase of the study, do you think that Doris could tell Buzz which button to push, even under these conditions? Or is it believable that Buzz could have just been guessing?

Step 3: Explore the data. So our observed statistic is 16 out of 28 correct attempts, which is 16/28 × 100% = 57.1% of Buzz’s attempts. This is again more than half the time, but not much larger than 50%. A simple bar graph of these results is shown in Figure 1.5.

Figure 1.5: Bar graph for Buzz’s 28 attempts

Step 4: Draw inferences. Is it plausible (believable) that Buzz was simply guessing in this set of attempts? How do we measure how much evidence these results provide against the chance model? Let’s use the same chance model as we used earlier to see what could have happened if Buzz was just guessing. We will apply the 3S Strategy to this new study.

1. Statistic: The new observed sample statistic is 16 out of 28, or about 0.571.

Think about it: Consider again our simulation of the chance model assuming Buzz is guessing. What do we need to change for this new phase of the study?

2. Simulate: This time we need to do repetitions of 28 coin flips, not just 16. A distribution of the number of heads in 1000 repetitions of 28 coin flips is shown in Figure 1.6. This models 1000 repetitions of 28 attempts with Buzz randomly pushing one of the buttons (guessing) each time.

Figure 1.6: A graph showing 1000 repetitions of flipping a coin 28 times and counting the number of heads. This models the number of correct pushes in 28 attempts when Buzz is guessing each time.

3. Strength of evidence: Now we need to consider the new observed statistic (16 out of 28 or 0.571).
We see from the graph that 16 out of 28 is a fairly typical outcome if Buzz is just randomly guessing. What does this tell us? It tells us that the results of this study are something that could easily have happened if Buzz was just randomly guessing. So what can we conclude? We can say his 16 successes are not convincing evidence against the “by chance alone” model.

The graph in Figure 1.6 shows what happens for the hypothetical Buzz who just guesses. An actual outcome far out in the tail of that distribution would be strong evidence against the “just guessing” hypothesis. But be careful: The opposite result, an actual outcome near the center, is not strong evidence in support of the guessing hypothesis. Yes, the result is consistent with that hypothesis, but it is also consistent with many other hypotheses.

Bottom line: In this second study we conclude that there is not enough evidence that the “by chance alone” model is wrong. That model is still a plausible explanation for the statistic we observed in the study (16 out of 28). Based on this set of attempts, we do not have convincing evidence against the possibility that Buzz is just guessing, but other explanations also remain plausible. For example, the results are consistent with very weak communication between the dolphins. All we know from this analysis is that one plausible explanation for the observed data is that Buzz was guessing.

In fact, Dr. Bastian soon discovered that in this set of attempts the equipment had malfunctioned: the food dispenser for Doris did not operate, so Doris was not receiving her fish rewards during the study. Given this malfunction, it is not so surprising that removing the incentive hindered the communication between the dolphins, and we cannot refute that Buzz was just guessing for these attempts. Dr. Bastian fixed the equipment and ran the study again. This time he found convincing evidence that Buzz was not guessing.
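To compare the two phases of the study side by side, the 3S Strategy can be wrapped in a small reusable function. This is a sketch under stated assumptions: the name three_s and its parameters are hypothetical, and summarizing “unusual” by the proportion of simulated repetitions that are at least as large as the observed statistic is one possible choice that the discussion above does not prescribe (it judges the graph by eye).

```python
import random


def three_s(observed, n_attempts, n_reps=1000, p_chance=0.5, seed=None):
    """Sketch of the 3S Strategy for a count of successes (hypothetical helper).

    Statistic: `observed` is the number of successes seen in the study.
    Simulate: generate n_reps sets of n_attempts guesses, each correct
              with probability p_chance (0.5 models a fair coin flip).
    Strength of evidence: return the proportion of simulated counts that are
              at least as large as `observed` (an assumed summary, see lead-in).
    """
    rng = random.Random(seed)
    simulated = [
        sum(rng.random() < p_chance for _ in range(n_attempts))
        for _ in range(n_reps)
    ]
    as_extreme = sum(s >= observed for s in simulated)
    return as_extreme / n_reps


# First study (canvas curtain): 15 correct pushes in 16 attempts.
prop_curtain = three_s(observed=15, n_attempts=16, seed=2014)

# Second study (plywood board): 16 correct pushes in 28 attempts.
prop_board = three_s(observed=16, n_attempts=28, seed=2014)

print("15 of 16: proportion of repetitions at least this large =", prop_curtain)
print("16 of 28: proportion of repetitions at least this large =", prop_board)
```

With 1000 repetitions, the proportion for 15 out of 16 should be at or very near 0, whereas the proportion for 16 out of 28 should land somewhere near 0.3. A small proportion corresponds to an outcome far out in the tail, and a large one to a typical outcome under the “just guessing” model, which matches the conclusions reached for the two studies.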
For a bit more discussion on processes and parameters, see FAQ 1.1.

Exploration 1.1: Can Dogs Understand Human Cues?

Dogs have been domesticated for about 14,000 years. In that time, have they been able to develop an understanding of human gestures such as pointing or glancing? How about similar non-human cues? Researchers Udell, Giglio, and Wynne tested a small number of dogs in order to answer these questions.

In this exploration, we will first see whether dogs can understand human gestures as well as non-human gestures. To test this, the researchers positioned the dogs about 2.5 meters from the experimenter. On each side of the experimenter were two cups. The experimenter would perform some sort of gesture (pointing, bowing, looking) toward one of the cups or there would be some other non-human gesture (a mechanical arm pointing, a doll pointing, or a stuffed animal looking) toward one of the cups. The researchers would then see whether the dog would go to the cup that was indicated. There were six dogs tested. We will look at one of the dogs in two of his sets of trials. This dog, a 4-year-old mixed breed, was named Harley. Each trial involved one gesture and one pair of cups, with a total of ten trials in a set.

We will start out by looking at one set of trials where the experimenter bowed toward one of the cups to see whether Harley would go to that cup.

Step 1. State the research question.
1. Based on the description of the study, state the research question.

Step 2. Design a study and collect data. Harley was tested 10 times and 9 of those times he chose the correct cup.
2. What are the observational units?
3. Identify the variable in the study. What are the possible outcomes of this variable? Is this variable quantitative or categorical?

Step 3. Explore the data.

Definitions: The set of observational units on which we collect data is called the sample. The number of observational units in the sample is the sample size.
A statistic is a number summarizing the results in the sample. With categorical data, we typically report the number of “successes” or the proportion of successes as the statistic.

4. What is the number of observational units (sample size) in this study?
5. Determine the observed statistic and produce a simple bar graph of the data (have one bar for the proportion of times Harley picked the correct cup, and another for the proportion of times he picked the wrong cup).
6. If the research conjecture is that Harley can understand what the experimenter means when they bow toward an object, is the statistic in the direction suggested by the research conjecture?
7. Could Harley have gotten 9 out of 10 correct even if he really didn’t understand the human gesture and so was randomly guessing between the two cups?
8. Do you think it is likely Harley would have gotten 9 out of 10 correct if he was just guessing randomly each time?

Step 4. Draw inferences beyond the data. There are two possibilities for why Harley chose the correct cup nine out of ten times:
- He is merely picking a cup at random, and in these 10 trials happened to guess correctly in 9 of them. That is, he got more than half correct just by random chance alone.
- He is doing something other than merely guessing and perhaps understands what the experimenter means when they bow toward the cup.

The unknown long-run probability that Harley will choose the correct cup is called a parameter.

Definition: For a random process, a parameter is a long-run numerical property of the process.

We don’t know the value of the parameter, but the two possibilities listed above suggest two different sets of values for it.

9. What is the value of the parameter if Harley is picking a cup at random? Give a specific value.
10. What is the possible range of values (greater than or less than some value) for the parameter if Harley is not just guessing and perhaps understands the experimenter?
We will know show you how statisticians use simulation to make a statement about the strength of evidence for these two possible statements about the parameter’s value. The Chance Model Statisticians often use chance models to generate data from random processes to help them investigate the process. In particular, they can see whether the observed statistic is consistent with the values of the statistic simulated by the chance model. If we determine that Harley’s results are not consistent with the results from the chance model, we will consider this to be evidence against the chance model and in favor of the research conjecture, that he understands the bowing gesture. In this case, we would say Harley’s results are statistically significant, meaning unlikely to have occurred by chance alone. We can’t perform the actual study more times in order to assess the second possibility, but we can simulate the behavior of Harley’s choices if we were to assume the first possibility (that he is simply guessing every time). 11. Explain how you could use a coin toss to represent Harley’s choices if he is guessing between the two cups each time. How many times do you have to flip the coin to represent one set of Harley’s attempts? What does heads represent? 12. If Harley was guessing randomly each time, on average, how many out of the 10 times would you expect him to choose the correct cup? 13. Simulate one repetition of Harley guessing randomly by flipping a coin 10 times (why 10?) and letting heads represent selecting the correct cup (“success”) and tails represent selecting the incorrect cup (“failure”). Count the number of heads in your 10 flips. Combine your results with the rest of the class to create a dotplot of the distribution for the number of heads out of 10 flips of a coin. a. Where does 9 heads fall in the distribution? Would you consider it an unusual outcome or a fairly typical outcome for the number of heads in 10 flips? June 27, 2014 MAA PREP workshop 12 b. 
Based on your answer to the previous question, do you think it is plausible (believable) that Harley was just guessing which cup to choose?
Using an applet to simulate flipping a coin many times
To really assess the typical values for the number of heads in 10 coin tosses (number of correct picks by Harley assuming he is guessing at random), we need to simulate many more outcomes of the chance model. Open the One Proportion applet from the textbook webpage. Notice that the probability of heads has been set to be 0.50, representing the guessing between the two cups. Set the number of tosses to 10 and press the Toss Coins button. What was the resulting number of heads? Notice that the number of heads in this set of 10 tosses is then displayed by a dot on the graph. Uncheck the Animate box and press the Toss Coins button 9 more times. This will demonstrate how the number of heads varies randomly across each set of 10 tosses. Nine more dots have been added to your dotplot. Is a pattern starting to emerge? Now change the Number of repetitions from 1 to 990 and press Toss Coins. The applet will now show the results for the number of heads in 1000 different sets of 10 coin tosses. So each dot represents the number of times Harley chooses the correct cup out of ten attempts assuming he is just guessing. Remember why we conducted this simulation: to assess whether Harley’s result (9 correct in 10 attempts) would be unlikely to occur by chance alone if he were just guessing between the pair of cups for each attempt.
14. Locate the result of getting 9 heads in the dotplot created by the applet. Would you consider this an unlikely result, in the tail of the distribution of the number of heads?
15. Based on the results of 1000 simulated sets of 10 coin flips each, would you conclude that Harley would be very unlikely to have picked the correct cup 9 times in 10 attempts if he was randomly guessing between the two cups each time?
Explain how your answer relates to the applet’s dotplot.
Definition: A result is statistically significant if it is unlikely to occur just by random chance. If our observed result appears to be consistent with the chance model, we say that the chance model is plausible or believable.
16. Do the results of this study appear to be statistically significant?
17. Do the results of this study suggest that Harley just guessing is a plausible explanation for Harley picking the correct cup 9 out of 10 times?
Summarizing your understanding
18. To make sure that you understand the coin flipping chance model, fill in the following table indicating what parts of the real study correspond to the physical (coin-flipping) simulation.
Table 1.2: Parallels between real study and physical simulation
Coin flip =
Heads =
Tails =
Chance of Heads = 1/2 =
One repetition = one set of ___ simulated attempts by Harley
The 3S Strategy
We will call the process of simulating could-have-been statistics under a specific chance model the 3S Strategy. After forming our research conjecture and collecting the sample data, we will use the 3S strategy to weigh the evidence against the chance model. This 3S Strategy will serve as the foundation for addressing the question of statistical significance in Step 4 of the Statistical Investigation Method.
3S Strategy for Measuring Strength of Evidence
1. Statistic: Compute the statistic from the observed sample data.
2. Simulate: Identify a “by chance alone” explanation for the data. Repeatedly simulate values of the statistic that could have happened when the chance model is true.
3. Strength of evidence: Consider whether the value of the observed statistic from the research study is unlikely to occur if the chance model is true. If we decide the observed statistic is unlikely to occur by chance alone, then we can conclude that the observed data provide strong evidence against the plausibility of the chance model.
If not, then we consider the chance model to be a plausible (believable) explanation for the observed data; in other words, what we observed could plausibly have happened just by random chance.
Let’s review how we have already applied the 3S strategy to this study.
19. Statistic. What is the statistic in this study?
20. Simulate. Fill in the blanks to describe the simulation. We flipped a coin ____ times and kept track of how many times it came up heads. We then repeated this process _____ more times, each time keeping track of how many heads were obtained in each of the ______ flips.
21. Strength of evidence. Fill in the blanks to summarize how we are assessing the strength of evidence for this study. Because we rarely obtained a value of ________ heads when flipping the coin _______ times, this means that it is ____________ (believable/unlikely) that Harley is just guessing, since if Harley was just guessing he ____________ (rarely/often) would get a value like _________ correct out of ______ attempts.
Step 5: Formulate conclusions.
22. Based on this analysis, are you convinced that Harley can understand human cues? Why or why not?
Another Study
One important step in a statistical investigation is to consider other models and whether the results can be confirmed in other settings.
23. In a different study, the researchers used a mechanical arm (roughly the size of a human arm) to point at one of the two cups. The researchers tested this to see whether dogs understood non-human gestures. In 10 trials, Harley chose the correct cup 6 times.
a. Using the dotplot you obtained when you simulated 1000 sets of 10 coin flips assuming Harley was just guessing, locate the result of getting 6 heads. Would you consider this an unlikely result, in the tail of the distribution?
b.
Based on the results of 1000 simulated sets of 10 coin flips each, would you conclude that Harley would be very unlikely to have picked the correct cup 6 times in 10 attempts, if he was randomly guessing between the two cups each time? Explain how your answer relates to the applet’s dotplot.
c. Is this study’s result statistically significant?
d. Do the results of this study suggest that Harley just guessing is a plausible explanation for Harley picking the correct cup 6 out of 10 times?
e. Does this study prove that Harley cannot understand the mechanical arm?
Step 6: Look back and ahead.
24. Compare the analyses between the two studies. How does the unusualness of the observed statistic compare between the two studies? Does this make sense based on the value of the observed statistic in the two studies? Does this make sense based on how the two studies were designed? Explain. (Hint: Why might the results differ for human and mechanical arms? Why would this matter?)
25. A single study will not provide all of the information needed to fully understand a broad, complex research question. Thinking back to the original research question, what additional studies would you suggest conducting next?
Section 1.1 Summary
The set of observational units on which we collect data is called a sample. The number of observational units is the sample size. A number computed to summarize the variable measured on a sample is called a statistic.
For a chance process, a parameter is a long-run numerical property of that process, such as a probability (long-run proportion).
A simulation analysis based on a chance model can assess the strength of evidence provided by sample data against a particular claim about the chance model. The logic of assessing statistical significance employs what we call the 3S Strategy:
Statistic: Compute an observed statistic from the data.
Simulate: Identify a model for the “by chance alone” explanation.
Repeatedly simulate values of the statistic that could have occurred from that chance model.
Strength of evidence: Examine how unusual the observed value of the statistic would be under repeated application of the chance model.
o If the observed value of the sample statistic is unlikely to have occurred from the chance model, then the data provide strong evidence against the chance model as the explanation.
o If the observed value of the sample statistic is not unlikely to have occurred from the chance model, then the chance model is a plausible explanation for the observed data.
The chance model considered in this section involved tossing a fair coin. This chance model allowed for assessing whether an observed number of “successes” in a study provided strong evidence that the two outcomes of a categorical variable were not equally likely. In the next section you will consider other chance models, but the reasoning process will remain the same.
FAQ 1.1: What is a random process?
Q: So a parameter is a numerical property of a process… but what is the process? All I have is Buzz’s 16 attempts.
A: We can think of Buzz’s 16 attempts as a sample from his long-run selection process. He has some underlying probability of pushing the correct button, but he’s unlikely to be correct every time, even if the dolphins are communicating. There is randomness in his choices. His choices might be affected by Doris’ squeaks, but also by how hungry he is, how tired he is, how warm the water is. We are trying to see how much Doris’ communication is influencing him. Everything else gets lumped together into “random chance.” Any one outcome from a random process is unknown to us in advance (like tossing a coin), but if we observe outcomes from the process for long enough, we should start to see some patterns amidst the randomness.
So you can think of a random process as an unlimited source of potential observations for your sample.
Q: Can I use any old set of observations from a process as a sample?
A: Some samples are good and others are terrible.
Q: How do I tell the good apples from the rotten ones?
A: For a Yes/No process like Buzz’s selections, we need to be willing to make certain assumptions along the way. For example, we assumed Buzz’s probability of a correct guess was the same every time. In fact, under the null hypothesis, we assumed it was 50/50 every time. But even if the probability is not 0.5, we are assuming it is not changing. In other words, we assume that Buzz doesn’t get better through learning and that he doesn’t get tired or bored or quit trying. We are also assuming that his previous trials don’t influence his future guesses – like a coin, he would have no memory of his past guesses.
Q: So how do I know if I have a good sample?
A: You need to believe that nothing about the process is changing and that each outcome does not depend on past outcomes. If you’re a coin, your chance of heads doesn’t change, and you have no memory of how you’ve landed in the past. If this is true about the process, then you will have a good sample! On the other hand, if Buzz’s probability of success is different in the morning than in the afternoon, then only observing him in the morning will not give you a good representation of his overall probability of success.
Q: So then how does the parameter come into play?
A: Parameter is a hard concept, precisely because we never see it! For the dolphin example, we want to know whether Buzz is just guessing. We think of his guesses as a potentially never-ending process (like coin tossing), and the parameter is the probability that Buzz will be correct in his next attempt.
As you saw in the Preliminaries, this means if we were to observe the random process forever, this probability is the long-run proportion of times that Buzz pushes the correct button.
Q: So I see why the parameter isn’t just 15/16: that is his sample proportion, and we may not have hit that long-run probability yet. So the parameter is just 0.5?
A: Not necessarily. The parameter equals 0.5 if he is just guessing. That’s the chance model we simulated. But his probability may actually be different from 0.5.
Q: So the parameter is his actual probability of pushing the correct button. It’s a number, but we don’t know its value.
A: Correct! But keep in mind that we will define lots of different kinds of parameters in future chapters. The “probability of success” parameter only applies in this type of Yes/No random process.
Formalizing Terms (adapted from Exploration 1.2: Tasting Water and Exploration 1.3: Do People Use Facial Prototyping?)
A study in Psychonomic Bulletin and Review (Lea, Thomas, Lamkin, & Bell, 2007) presented evidence that “people use facial prototypes when they encounter different names.” Participants were given two faces and asked to identify which one was Tim and which one was Bob. The researchers wrote that their participants “overwhelmingly agreed” on which face belonged to Tim and which face belonged to Bob, but did not provide the exact results of their study.
Step 1: Ask a research question.
We will gather data from your class to investigate the research question of whether students have a tendency to associate certain facial features with a name.
Step 2: Design a study and collect data.
1. Identify the observational units in this study.
2. Identify the variable. Is the variable quantitative or categorical?
3. How many outcomes can the variable have?
Definition: A binary variable is a categorical variable with only two outcomes. We also (arbitrarily) define one outcome to be a success and one to be a failure.
4. Describe the parameter of interest (in words). (Hint: The parameter is the long-run probability of …?)
5. One possibility here is that subjects are equally likely to associate “Tim” with the face on the left as they are to associate “Bob” with the face on the left and so are essentially guessing which name is associated with the face on the left. If this is the case, what is the long-run probability that a subject would associate “Tim” with the face on the left?
6. Another possibility is that subjects are more likely to associate “Tim” with the face on the left than they are to associate “Bob” with the face on the left. If this is the case, what can you say about the long-run probability that a subject would associate “Tim” with the face on the left? (Hint: You are not to specify a particular value this time; instead indicate a direction from a particular value.)
Definition: The null hypothesis typically represents the “by chance alone” explanation. The chance model (or “null model”) is chosen to reflect this hypothesis. The alternative hypothesis typically represents the “there is an effect” explanation that contradicts the null hypothesis. It represents what the researchers are hoping to gather evidence to support.
7. Your answers to #5 and #6 should be the null and alternative hypotheses for this study. Which is which?
Step 3: Explore the data.
8. How many students put Tim as the name on the left? How many students participated in this study (sample size)? What proportion put Tim’s name on the left?
Number (Tim on left):
Sample size:
Proportion (Tim on left):
Use of symbols
We can use mathematical symbols to represent quantities and simplify our writing. Throughout the book we will emphasize written explanations, but will also show you mathematical symbols which you are free to use as a short-hand once you are comfortable with the material.
The distinction between parameter and statistic is so important that we always use different symbols to refer to them. When dealing with a parameter that is a long-run probability, such as the probability that a (future) subject in this study would associate “Tim” with the face on the left, we use the Greek letter π (pronounced “pie”). But when working with a statistic that is the proportion of “successes” in a sample, such as the proportion of subjects who associated “Tim” with the face on the left, we use the symbol 𝑝̂ (pronounced “p-hat”). Finally, we use the symbol n to represent the sample size.
9. What is the value of 𝑝̂ in this study?
10. What is the value of n in this study?
11. Hypotheses are always conjectures about the unknown parameter, π. You can also use H0 and Ha as short-hand notation for the null and alternative hypotheses, respectively. A colon, ‘:’, is used to represent the word ‘is.’ Restate the null and alternative hypotheses using π.
H0:
Ha:
Step 4: Draw Inferences.
12. Is the sample proportion who associated “Tim” with the face on the left greater than the probability specified in the null hypothesis?
13. Is it possible that the sample proportion could turn out to be this large even if the null hypothesis were true?
As we did with Buzz and Doris in Section 1.1, we will use simulation to investigate how surprising the observed sample result would be, if in fact subjects were just as likely to associate “Tim” with the face on the left as they were to associate “Bob” with the face on the left. (Note also that our null model assumes the same probability for all subjects.)
14. We will now use the One Proportion applet to conduct this simulation analysis.
a. First enter the probability of success value specified in the null hypothesis.
b. Enter the appropriate sample size (number of subjects in this study).
c. Enter 1 for the number of samples, and press Draw Samples.
Report the number of “successes” in this simulated sample.
d. Now, select the radio button for “Proportion of successes.” Report the proportion of successes in this simulated sample. Use your answer to “c” to verify how this value is calculated.
e. Leaving the “Proportion of successes” radio button selected, click on Draw Samples four more times. Do you get the same results each time?
f. Now enter 995 for the number of samples and click on Draw Samples, bringing the number of simulated samples to 1000. Comment on the center, variability, and shape of the resulting distribution of sample proportions.
This distribution of simulated sample proportions is called the null distribution, because it is based on assuming the null hypothesis to be true.
15. Restate the observed value of the sample proportion, p̂ , who associated “Tim” with the face on the left.
16. Looking at the null distribution you have simulated, is this a very unlikely result, when the null hypothesis is true? In other words, is this value far in the tail of the null distribution?
You might find that #16 is a bit of a tough call if the observed sample statistic is not far in the tail of the distribution. To help make a judgment about strength of evidence in this case, we can count how many (and what proportion) of the simulated sample proportions are as extreme as, or more extreme than, the observed value.
17. Use the applet to count how many (and what proportion) of the simulated sample proportions are as extreme as, or more extreme than, the observed value, p̂ . To do this, enter the observed p̂ (the observed sample proportion who associated “Tim” with the face on the left) in the box to the left of the Count button. Then press the Count button. Record the number and proportion of simulated sample proportions that are as extreme as, or more extreme than, the observed p̂ .
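The simulate-and-count procedure in #14 through #17 can also be sketched in a few lines of Python. This is only a rough illustration of the idea, not how the One Proportion applet is implemented. The function names and the seed are hypothetical, and because the class data are not recorded here, the example reuses Harley’s result of 9 correct in 10 attempts with a null probability of 0.5.

```python
import random


def simulate_null_proportions(n, prob_success, repetitions, seed=None):
    """Simulate sample proportions under the chance (null) model.

    Each repetition mimics one sample of n yes/no outcomes, every outcome
    being a success with probability prob_success.
    """
    rng = random.Random(seed)  # seed is an arbitrary choice, for repeatability
    proportions = []
    for _ in range(repetitions):
        successes = sum(rng.random() < prob_success for _ in range(n))
        proportions.append(successes / n)
    return proportions


def proportion_at_least(simulated, observed):
    """Proportion of simulated statistics at least as large as the observed one."""
    tolerance = 1e-12  # guards against floating-point round-off in the comparison
    count = sum(p >= observed - tolerance for p in simulated)
    return count / len(simulated)


# Illustration: 1000 sets of 10 "coin tosses", compared with 9 out of 10.
null_dist = simulate_null_proportions(n=10, prob_success=0.5,
                                      repetitions=1000, seed=1)
approx_p_value = proportion_at_least(null_dist, observed=9 / 10)
print("Proportion of simulated results at least as extreme:", approx_p_value)
```

To apply the same sketch to the class data, replace n with the class size and observed with the class value of p̂.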
Definition: The p-value is estimated as the proportion of simulated statistics in the null distribution that are at least as extreme (in the direction of the alternative hypothesis) as the value of the statistic actually observed in the research study.
How do we evaluate this p-value as a judgment about strength of evidence provided by the sample data against the null hypothesis? One answer is: The smaller the p-value, the stronger the evidence against the null hypothesis and in favor of the alternative hypothesis. But how small is small enough to regard as convincing? There is no definitive answer, but here are some guidelines:
Guidelines for evaluating strength of evidence from p-values
0.10 < p-value: not much evidence against null hypothesis; null is plausible
0.05 < p-value < 0.10: moderate evidence against the null hypothesis
0.01 < p-value < 0.05: strong evidence against the null hypothesis
p-value < 0.01: very strong evidence against the null hypothesis
The smaller the p-value, the stronger the evidence against the null hypothesis.
18. Is the approximate p-value from your simulation analysis (your answer to #17) small enough to provide much evidence against the null hypothesis that subjects are equally likely to associate “Tim” with the face on the left as they are to associate “Bob” with the face on the left? If so, how strong is this evidence? Explain.
Step 5: Formulate conclusions.
19. Do you consider the observed sample result to be statistically significant? Recall that this means that the observed result is unlikely to have occurred by chance alone.
20. How broadly are you willing to generalize your conclusions? Would you be willing to generalize your conclusions beyond the subjects in this study? How broadly? Explain your reasoning.
Step 6: Look back and ahead.
21. Suggest a new research question that you might investigate next, building on what you learned in this study.
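The p-value guidelines given in this exploration can be written as a small lookup, which may help when comparing results across several simulations. This is an informal sketch and the function name is hypothetical. The guidelines do not say how to classify a p-value that falls exactly on a cutoff (0.10, 0.05, or 0.01), so the choice below to place such values in the stronger category is an assumption.

```python
def strength_of_evidence(p_value):
    """Translate a p-value into the informal guideline wording.

    Assumption: a p-value exactly equal to a cutoff is placed in the
    stronger-evidence category; the guidelines leave this case open.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must be between 0 and 1")
    if p_value > 0.10:
        return "not much evidence against the null hypothesis; null is plausible"
    if p_value > 0.05:
        return "moderate evidence against the null hypothesis"
    if p_value > 0.01:
        return "strong evidence against the null hypothesis"
    return "very strong evidence against the null hypothesis"


# A few illustrative (made-up) p-values.
for p in (0.32, 0.07, 0.03, 0.002):
    print(p, "->", strength_of_evidence(p))
```

These cutoffs are guidelines rather than rules, so the labels are a starting point for a written conclusion, not a substitute for one.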
Alternate Measures for Strength of Evidence (adapted from Exploration 1.3: Do People Use Facial Prototyping?)
A study in Psychonomic Bulletin and Review (Lea, Thomas, Lamkin, & Bell, 2007) presented evidence that “people use facial prototypes when they encounter different names.” Participants were given two faces and asked to identify which one was Tim and which one was Bob. The researchers wrote that their participants “overwhelmingly agreed” on which face belonged to Tim and which face belonged to Bob, but did not provide the exact results of their study.
Step 1: Ask a research question.
We will gather data from your class to investigate the research question of whether students have a tendency to associate certain facial features with a name.
Step 2: Design a study and collect data.
Each student in your class will be shown the same two pictures of men’s faces used in the research study. You will be asked to assign the name Bob to one photo and the name Tim to the other. Each student will then submit the name that he or she assigned to the picture on the left. Then the name that the researchers identify with the face on the left will be revealed. The parameter of interest here is the long-run probability that a student in your class would assign the same name to the face on the left.
1. State the null and alternative hypotheses to be tested when the data are collected. Express these both in words and symbols. (Hint: Think about the parameter and the research question of interest here.)
Step 3: Explore the data.
2. How many students put Tim as the name on the left? How many students participated in this study (sample size)? What proportion put Tim’s name on the left?
Number (Tim on left):
Sample size:
Proportion (Tim on left):
When we conduct analyses with binary variables, we often call one of the outcomes a success and the other a failure, and then focus the analysis on the success outcome.
It is arbitrary which outcome is defined to be a success, but you need to make sure you do so consistently throughout the analysis. In this case we’ll call “Tim on left” a success because that’s what previous studies have found to be a popular choice.
Step 4: Draw inferences.
You will use the One Proportion applet to investigate how surprising the observed class statistic would be, if students were just randomly selecting which name to put with which face.
3. Before you use the applet, indicate what you will enter for the following values:
a. Probability of success:
b. Sample size:
c. Number of repetitions:
4. Conduct this simulation analysis. Make sure the “Proportion of heads” button is selected in the applet and not the “Number of heads.”
a. Indicate how to calculate the approximate p-value (count the number of simulated statistics that equal __________ or are ___________).
b. Report the approximate p-value.
c. Use the p-value to evaluate the strength of evidence provided by the sample data against the null hypothesis, in favor of the alternative that students really do tend to assign the name Tim (as the researchers predicted) to the face on the left.
The p-value is the most common way to evaluate strength of evidence, but now we will explore a common alternative way to evaluate strength of evidence. The goal of any measure of strength of evidence is to use a number to assess whether the observed statistic falls in the tail of the null distribution (and is therefore surprising if the null hypothesis is true) or among the typical values we see when the null hypothesis is true.
5. Check the Summary Stats box in the applet.
a. Report the mean (average) value of the simulated statistics.
b. Explain why it makes sense that this mean is close to 0.5.
c. Report the standard deviation (SD) of the simulated statistics.
d. Report (again) the observed class value of the statistic.
(What proportion of students in your class put Tim’s name on the left?) p̂ =
e. Calculate how many standard deviations the observed class value of the statistic is from the hypothesized mean of the null distribution, 0.5. In other words, subtract 0.5 from the observed value, and then divide by the standard deviation. This is the standardized statistic z = (observed statistic p̂ – 0.5) / SD of null distribution.
f. Once you calculate this value, you interpret it as “how many standard deviations the observed statistic falls from the hypothesized parameter value.” What strength of evidence against the null does your standardized statistic provide?
g. How closely does your evaluation of strength of evidence based on the standardized statistic compare to the strength of evidence based on the p-value in #4c?
Guidelines for evaluating strength of evidence from standardized values of statistics
Standardizing gives us a quick, informal way to evaluate the strength of evidence against the null hypothesis:
between -1.5 and 1.5: little or no evidence against the null hypothesis;
below -1.5 or above 1.5: moderate evidence against the null hypothesis;
below -2 or above 2: strong evidence against the null hypothesis;
below -3 or above 3: very strong evidence against the null hypothesis.
Step 5: Formulate conclusions.
6. Now, let’s step back a bit further and think about the scope of inference. We have found that in most classes, the observed data provide strong evidence that students do better than random guessing which face is Tim’s and which is Bob’s. In that case, do you think that most students at your school would agree on which face is Tim’s? Do you think this means that most people can agree on which face belongs to Tim? Furthermore, does this mean that all people do ascribe to the same facial prototyping? We will look more at the scope of inference in chapters 2 and 4.
Step 6: Look back and ahead.
7.
Discuss limitations of this study.
8. Based on the limitations of this study, suggest a new research question that you would investigate next.
Extensions
9. In #2 you recorded the proportion of students in your class who put Tim’s name with the photo on the left. Imagine that the proportion was actually larger than that (e.g., if your class was 60%, imagine it was 70%).
a. How would this have affected the p-value: Larger Same Smaller
b. How would this have affected the standardized value: Larger Same Smaller
c. How would this have affected the strength of evidence against the null hypothesis: Stronger Same Weaker
10. What if the proportion who put Tim’s name on the left was the same, but 100 students were in your class?
a. How would this have affected the variability in the null distribution? Larger Same Smaller
b. How would this have affected the p-value? Larger Same Smaller
c. How would this have affected the standardized value: Larger Same Smaller
d. How would this have affected the strength of evidence against the null hypothesis: Stronger Same Weaker
11. Suppose that less than half of the students in your class had put Tim’s name on the left, so your class result was in the opposite direction of the alternative hypothesis. What can you say about the standardized value of the statistic in this case? Explain. (Hint: You cannot give a value for the standardized statistic, but you can say something specific about its value.)
Alternative Values for Probability of Success (adapted from Chapter 1 Investigation)
Background: A legendary story on college campuses concerns two students who miss a chemistry exam because of excessive partying but blame their absence on a flat tire. The professor allowed them to take a make-up exam, and he sent them to separate rooms to take it. The first question, worth five points, was quite easy. The second question, worth ninety-five points, asked: Which tire was it?
Wednesday we collected data from you, asking which of the four tires you would pick if you were caught in this situation (and had to make up a tire on the spot). For some reason, I conjecture that college students appear to have a tendency to pick the tire. You will use these data to conduct a test of whether the sample data provide convincing evidence that this tire is picked more often than we would expect if the four tires were picked equally often (purely at random).
Step 1: Ask a research question
(a) Identify the research conjecture.
Step 2: Design a study and collect data
(b) What are the observational units?
(c) What is the variable that is measured/recorded on each observational unit? (Hint: State as a binary variable.)
(d) Describe the parameter of interest in words.
(e) State the null and alternative hypotheses to be tested.
Null hypothesis (H0) >>
Alternative hypothesis (Ha) >>
Step 3: Explore the data
Fill in the tallies of your responses.
Front left Front right Back left Back right Total
(f) Calculate the observed statistic: What proportion of students selected the left front tire? How many students participated? (Include the symbols you would use to represent these values; copy and paste one of these symbols below: p̂ , 𝜋, 𝑥̅ , 𝜇, 𝜎, s, n)
Step 4: Draw inferences beyond the data
Because I want to focus on the right front tire, I will make the variable binary (right front or not) and the parameter of interest in this study is the long-run probability that a Cal Poly student picks the left front tire. What symbol should I use to refer to this unknown parameter value: p̂ , 𝜋, 𝑥̅ , 𝜇, 𝜎, s, n – copy and paste below.
(g) Restate the hypotheses in terms of this parameter.
(h) Is the sample proportion who selected the left front tire greater than one-fourth? (If not, there’s no need to conduct a simulation analysis. You will have no evidence that this tire is selected more than one-fourth of the time in the long run.)
(i) Simulate: Use the One Proportion applet to simulate 1000 repetitions of this study, assuming that every student in class has a 0.25 probability of selecting the right front tire. For your simulation:
“Success” represents:
“Chance of success” =
One repetition =
(j) Using the proportion of successes for the values on the horizontal axis, what is the center of your null distribution? Does it make sense that this is the center? Explain.
(k) Strength of evidence: Determine the p-value from your simulation results. Also interpret what this p-value probability represents (i.e., the long-run proportion of what that do what assuming what?!).
(l) Evaluate this p-value: Does this p-value provide strong evidence against the null hypothesis?
(m) Determine and interpret the standardized statistic from your simulation analysis.
Standardized statistic = (observed statistic – hypothesized parameter) / (standard deviation of null distribution) =
(n) According to this standardized statistic, is the observed value of the sample proportion surprising when the null hypothesis is true? Is this consistent with what the p-value told you?
Step 5: Formulate Conclusions
(o) Write a paragraph, as if to a member of your family who has not taken statistics, summarizing the conclusion that you would draw from this study and your simulation analysis. Clearly explain to this intelligent but not-statistical family member how you are coming to this conclusion/the reasoning behind your conclusion.
Step 6: Look back and ahead
(p) Now, let’s step back a bit and think about the scope of our inference. For example: Have we shown that students are equally likely to pick among the four tires? Does it follow that people in general don’t have a tendency to pick the left front? Are there any improvements you would suggest for how the data were collected? Suggest a future research study that you would be interested in investigating based on these results.
Limitations in conclusions?
>>
Improvements you would make to the study? >>
Future research questions? >>

Two-Sided Tests (adapted from Example 1.5: Halloween Treats)
Stemming from concern over the nation’s obesity epidemic, researchers investigated whether children might be as tempted by toys as by candy for Halloween treats. Test households in five Connecticut neighborhoods offered children two plates: one with lollipops or fruit candy and one containing small, inexpensive Halloween toys, like plastic bugs that glow in the dark. The researchers observed the selections of 283 trick-or-treaters between the ages of 3 and 14 (Schwartz, Chen, and Brownell, 2003).
1. Identify the parameter of interest in this study.
2. State the null and alternative hypotheses to be tested. (Note that no direction is anticipated in advance for this study, so the alternative should be two-sided.)
The researchers found that 148 trick-or-treaters selected the candy, and 135 chose a toy.
3. Calculate the value of the statistic for this study.
4. Use the applet to conduct a simulation analysis with 10,000 repetitions, assuming the null hypothesis. Comment on the shape, center, and variability of the simulated distribution of sample proportions.
5. Use the simulation results to approximate the p-value for this study. Be sure to click on “Two-sided.” Also describe which values the applet counts to determine this p-value, and explain why these are the correct values to count.
6. Calculate the standardized value of the statistic for this study.
7. Interpret what this standardized value reveals.
8. Summarize your conclusion from the simulation analysis and standardized statistic from this study.
9. Look back and ahead: What limitations do you see with this study? How broadly would you feel comfortable generalizing your conclusions? What would you like to see in a follow-up study?
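For readers who want to see what the applet is doing in steps 4 through 6, the following is a minimal Python sketch of the same simulation, not the applet's actual code. The function names and the random seed are illustrative assumptions. The two-sided count used here (simulated proportions at least as far from 0.5 as the observed one, in either direction) is also an assumption about how "Two-sided" is tallied. It is a natural choice when the null value is 0.5 and the null distribution is symmetric.

```python
import random
import statistics


def simulate_null_proportions(n, pi, repetitions, seed=None):
    """Return simulated sample proportions of successes under the null model."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < pi for _ in range(n)) / n
        for _ in range(repetitions)
    ]


def two_sided_p_value(null_proportions, observed, pi):
    """Proportion of simulated values at least as far from pi as the observed one."""
    distance = abs(observed - pi)
    tolerance = 1e-12  # guards against floating-point ties
    extreme = sum(abs(p - pi) >= distance - tolerance for p in null_proportions)
    return extreme / len(null_proportions)


def standardized_statistic(null_proportions, observed, pi):
    """(observed statistic - hypothesized parameter) / SD of the null distribution."""
    return (observed - pi) / statistics.stdev(null_proportions)


# Halloween treats data from the text: 148 of 283 children chose candy.
n_children, n_candy = 283, 148
observed = n_candy / n_children

# Seed chosen arbitrarily so the run is reproducible (an assumption, not part of the study).
null_proportions = simulate_null_proportions(n_children, 0.5, 10_000, seed=1)
p_value = two_sided_p_value(null_proportions, observed, 0.5)
z = standardized_statistic(null_proportions, observed, 0.5)

print(f"observed proportion = {observed:.3f}")
print(f"two-sided p-value   = {p_value:.3f}")
print(f"standardized stat   = {z:.2f}")
```

Because the results come from simulation, the p-value and standardized statistic will vary slightly with a different seed or number of repetitions, just as they do from run to run in the applet.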
Theory-Based Inference (adapted from Examples 1.3 and 1.4)
Example 1.4: Halloween Treats
Recall that stemming from concern over the nation’s obesity epidemic, researchers investigated whether children might be as tempted by toys as by candy for Halloween treats. Test households in five Connecticut neighborhoods offered children two plates: one with lollipops or fruit candy and one containing small, inexpensive Halloween toys, like plastic bugs that glow in the dark. The researchers observed the selections of 283 trick-or-treaters between the ages of 3 and 14 (Schwartz, Chen, and Brownell, 2003).
To investigate whether children show a preference for either the candy or the toys, we test the following hypotheses.
Null hypothesis: The probability a trick-or-treater would choose candy is 0.5.
Alternative hypothesis: The probability a trick-or-treater would choose candy is not 0.5.
Note that our null model assumes that the probability of choosing candy (𝜋) is the same for all children. In symbols, these hypotheses translate to
H0: 𝜋 = 0.5
Ha: 𝜋 ≠ 0.5.
The researchers collected the reactions of 283 children for the study. With a sample size of 283, under our null hypothesis, we simulated the null distribution (using 10,000 simulated samples) shown in Figure 1.1.
Figure 1.1: A null distribution representing 10,000 simulated samples of 283 children where the probability that an individual child would choose candy is 0.5
1. With regard to the above null distribution:
a. What is the numeric value of the center of this null distribution? Does that make sense?
b. What is the SD of this null distribution?
c. Would it be appropriate to describe the overall shape of this null distribution as bell-shaped?
2. As you did in the previous session, use the One Proportion applet to run a simulation (with 10,000 repetitions) to find the approximate p-value for this study.
Be sure to click on “Two-sided.” Record the p-value: ___________
Theory-based approach (One proportion z test)
In the early 1900s, and even earlier, computers weren’t available to do simulations, and because people didn’t want to sit around and flip coins all day long, they focused their attention on mathematical and probabilistic rules and theories that could predict what would happen if someone did simulate. They proved the following key result (often called the Central Limit Theorem):
Central Limit Theorem (CLT): The distribution of sample proportions will be centered at the long-run probability (𝜋), with a standard deviation of √(𝜋(1 − 𝜋)/𝑛). If the sample size (n) is large enough, then the shape of the distribution is bell-shaped.
One bit of ambiguity in the statement is how large the sample size must be to count as large enough. As it turns out, the larger the sample size is, the better the prediction of bell-shaped behavior in the null distribution is, but there is no sample size at which the prediction all of a sudden becomes good. However, some people have used the convention that you should have at least 10 successes and at least 10 failures in the sample.
Validity conditions: The normal approximation can be thought of as a prediction of what would occur if simulation was done. This prediction is often valid, but only when the validity condition (at least 10 successes and at least 10 failures) is met.
3. How well does the CLT predict the center, SD, and shape of the null distribution of the sample proportion of children who choose candy, as displayed in Figure 1.1?
a. Center:
b. SD:
c. Shape:
4. Now go back to the One Proportion applet, and check the box next to “Normal Approximation.” Be sure that the “Two-sided” option is also selected. Record the normal approximation-based p-value: ___________
5. Compare the simulation-based p-value (from #2) to the normal approximation-based p-value (from #4). Are they very different or about the same?
6. In case of discrepancy between the simulation-based p-value and the normal approximation-based p-value, which do you think would be the more reliable?
7. (Optional) Notice that the applet provides you with an option to report the Exact Binomial p-value. Check the option, and report the resulting p-value: ___________________.
8. (Optional) Which p-value is closer to the Exact Binomial p-value: the simulation-based p-value (from #2) or the normal approximation-based p-value (from #4)?
Figure 1.2 shows a few examples of null distributions.
Figure 1.2: Null distributions for various studies. A bell-shaped distribution, called a normal distribution, can be used to nicely approximate some of them, but will not work well for others.
Panels: St. George’s (n = 10; 𝜋 = 0.15); St. George’s (n = 361; 𝜋 = 0.15); Halloween treats (n = 283; 𝜋 = 0.5)
The St. George’s (n = 361) and Halloween treats (n = 283) simulations show bell-shaped curves. However, the null distribution for the first sample of St. George’s heart patients (n = 10) is not a bell-shaped curve, in part because the distribution is not symmetric and in part because there are so few values that the sample proportion can take.

A situation where a theory-based approach doesn’t work
Example 1.3: Heart Transplant Operations
In an article published in the British Medical Journal (2004), researchers Poloniecki, Sismanidis, Bland, and Jones reported that heart transplantations at St. George’s Hospital in London had been suspended in September 2000, after a sudden spike in mortality rate. Of the last 10 heart transplants, 80% had resulted in deaths within 30 days of the transplant. Newspapers reported that this mortality rate was over five times the national average. Based on historical national data, the researchers used 15% as a reasonable value for comparison.
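The CLT's predicted standard deviation, √(𝜋(1 − 𝜋)/𝑛), and the validity condition can be checked with a few lines of arithmetic. The sketch below applies them to the three null distributions named in Figure 1.2. The function names are illustrative assumptions. The validity check is run only for the two samples whose counts appear in this handout: 8 deaths and 2 survivors among the 10 transplants, and 148 candy and 135 toy choices among the 283 children.

```python
import math


def clt_sd(pi, n):
    """Standard deviation of the sample proportion predicted by the CLT."""
    return math.sqrt(pi * (1 - pi) / n)


def meets_validity_condition(successes, failures, minimum=10):
    """Convention from the text: at least 10 successes and at least 10 failures."""
    return successes >= minimum and failures >= minimum


# (study label, sample size n, null value pi) as listed under Figure 1.2
cases = [
    ("St. George's", 10, 0.15),
    ("St. George's", 361, 0.15),
    ("Halloween treats", 283, 0.5),
]
for label, n, pi in cases:
    print(f"{label} (n = {n}, pi = {pi}): predicted SD = {clt_sd(pi, n):.4f}")

# Validity condition for the samples whose counts are given in the handout.
print("St. George's, n = 10:", meets_validity_condition(successes=8, failures=2))
print("Halloween treats:    ", meets_validity_condition(successes=148, failures=135))
```

The predicted SD is only part of the picture. When the validity condition fails, the shape of the null distribution may be far from bell-shaped even if its center and SD match the CLT's prediction.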
Think about it: Why do you think the theory-based (One proportion z-test; Normal approximation) approach will not work well for these data?
Applying the 3S Strategy
Using the 3S Strategy from the previous section, we observed 0.80 as our statistic and now we will simulate one thousand repetitions from a process where 𝜋 = 0.15 (under the null hypothesis). We did this using the One Proportion applet as shown in Figure 1.2.
Figure 1.2: Null distribution (could-have-been simulated sample proportions) for 1000 repetitions of drawing samples of 10 “patients” from a process where the probability of death is equal to 0.15. “Success” has been defined to be patient death.
9. How well does the CLT predict the center, SD, and shape of the null distribution of the sample proportion of deaths among a sample of 10 heart transplant cases, as displayed in Figure 1.2?
a. Center: Does it make sense?
b. SD:
c. Shape:

Another situation where a theory-based approach doesn’t work
Recall the legendary story on college campuses about two students who miss a chemistry exam because of excessive partying but blame their absence on a flat tire. The professor allowed them to take a make-up exam, and he sent them to separate rooms to take it. The first question, worth five points, was quite easy. The second question, worth ninety-five points, asked: Which tire was it?
When I collected data in my class of 36 students, 18 picked the ____________ tire. To explore whether this is unlikely to have happened by random chance, I generated the following null distribution.
10. As you can see, I forgot to include the mean of the null distribution in the applet output. Use the CLT to predict what the mean of this null distribution should be.
11. Now use the CLT to predict the SD of this null distribution.
12. Do you think the One proportion z-test is a valid approach to find the p-value in the context of this study? Why or why not? Explain.
(Hint: Is it appropriate to describe the overall shape of the null distribution as bell-shaped? Is the validity condition of at least 10 successes and at least 10 failures helpful in deciding whether the theory-based approach is valid in the context of this study?)
13. Use the One Proportion applet to find the following:
a. Simulation-based p-value: __________
b. Normal approximation-based (theory-based) p-value: __________
c. Which should you trust more: (a) or (b)?

Assessment examples from Chapter 1
Sample quiz #1
1. Which definition of a p-value is the most accurate?
a. If the alternative hypothesis is true, the p-value is the probability of observing an outcome as extreme or more extreme than the one actually observed in the study
b. The p-value is the probability that the alternative hypothesis is true
c. If the null hypothesis is true, the p-value is the probability of observing an outcome as extreme or more extreme than the one actually observed in the study
d. The p-value is the probability that the null hypothesis is true
Circle TRUE or FALSE for each of the following (#2 through #5)
2. A p-value can be negative. TRUE FALSE
3. A standardized statistic can be negative. TRUE FALSE
4. We run tests of significance to determine whether 𝑝̂ is larger or smaller than some value. TRUE FALSE
5. We run tests of significance to determine whether 𝜋 is larger or smaller than some value. TRUE FALSE
6. Dwyane Wade of the Miami Heat hit 569 of his 1093 field goal attempts in the 2012/2013 season for a shooting percentage of 52.1%. Over the lifetime of Dwyane’s career, can we say that Dwyane is more likely than not to make a field goal?
a. Is the long-run probability of Dwyane making a field goal a parameter or a statistic?
b. Is 52.1% a parameter or a statistic?
c. When simulating possible outcomes assuming the chance model, how many times would you flip a coin for one repetition of the 2012/2013 season?
d.
With each repetition, what would you keep track of?
e. What would be a typical value from a single repetition?
7. Following up on question #6: Suppose at the beginning of the 2013/2014 season Dwyane Wade made 12 of his first 20 field goal attempts. Is there evidence that in the long run Dwyane will make his field goal a majority of the time in the 2013/2014 season?
a. Let 𝜋 be the long-run probability that Dwyane makes his field goal. State the null and alternative hypotheses using symbols.
b. Based on the simulated null distribution, what is the p-value for your test?
c. Based on the simulated null distribution, what is the standardized statistic for your test?
d. Based on the p-value or standardized statistic, do you have strong evidence that in the long run Dwyane will make a majority of his field goals? Explain.

Sample quiz #2
Use the following situation to answer all questions. “Rock-Paper-Scissors” is a game in which two players simultaneously show one of three hand symbols (denoted rock, paper, or scissors). Rock beats scissors, scissors beats paper, and paper beats rock. An article published in College Mathematics Journal (Eyler, Shalla, Doumaux, and McDevitt, 2009) found that players tend to not prefer scissors, choosing it less often than expected if the 3 choices were made equally often. They collected data from 119 people who played games against a computer. In the study, 55.5% started with rock, 32.8% started with paper, and 11.8% with scissors in their first round. Researchers wanted to know whether these data provided convincing evidence that scissors are chosen less than one-third of the time in the long run.
1. What are the observational units?
2. What is the variable?
3. State the null hypothesis.
4. State the alternative hypothesis.
5. Identify the parameter of interest in this study.
A. People playing the game against a computer
B. Whether or not someone chooses scissors
C. 11.8%
D.
The long-run probability someone plays scissors
E. 1/3
When using the One-proportion inference applet, indicate which values you would enter into each of the boxes.
6. Box A_______
7. Box B_______
8. Box C_______
9. Box D_______ (Hint: Answer is ≤ or ≥)
10. Box E_______
11. Write a one-sentence interpretation of the p-value for this study. (Note: you do not need to provide the actual VALUE of the p-value.)
The p-value is the proportion of simulations which are _______________ or ____________ (more/less) assuming ___________________________________ (name what is assumed to be true)
Here is output from the simulation for use in the following questions:
12. What is the approximate value of the standardized statistic in this case?
A. Z = 0
B. Z = -1
C. Z = 3
D. Z = -5
13. Would you consider the results of this study to be convincing evidence that scissors are chosen less often in the long run than expected?
A. No, the p-value is going to be large
B. No, the p-value is going to be small
C. Yes, the p-value is going to be small
D. Yes, the p-value is going to be large
E. No, the distribution is centered at 1/3.

Sample quiz #3 (for after Chapters 1 and 2)
1. Which of the following is true about sampling?
a) Larger samples are always better than smaller samples, regardless of how the samples were collected.
b) Larger samples will have less sampling variability than smaller samples.
c) Larger samples give less biased estimates than smaller samples.
d) Larger samples are needed with larger populations.
2. Recall the Gettysburg Address exploration. Which of the following sampling methods would be unbiased?
a) Select your sample of words by closing your eyes and dropping a pencil down on the page, record the word, repeat until you have 10 words.
b) Select your sample of words by cutting each word out of the address, mixing them up in a hat and drawing out 10 slips of paper.
c) Select your sample of words by typing each word onto the same sized slips of paper, mixing them up in a hat and drawing out 10 slips of paper.
d) Select your sample of words by randomly choosing words you think represent all the words in the Gettysburg Address.
3. The following dotplot shows the time estimates for a 10 second song snippet we looked at in class. The highest time estimates were 30 seconds. Suppose that one of the observations recorded as 30 seconds was a typo, and was meant to be 20. If we corrected this entry in the dataset, how would the following numerical statistics change, if at all?
Mean: SMALLER SAME LARGER
Median: SMALLER SAME LARGER
Standard Deviation: SMALLER SAME LARGER
4. You have heard that in sports like boxing there might be some competitive advantage to those wearing red uniforms. You want to test this with your new favorite sport of chessboxing. You randomly assign blue and red uniforms to contestants in 20 matches and find that those wearing red won 14 times (or 70%). You conduct a test of significance using simulation and get the following null distribution. (Note this null distribution uses only 100 simulated samples and not the usual 1000 or more.)
a) Suppose you want to see if competitors wearing red win more than 50% of the matches in the long run, so you test H0: 𝜋 = 0.5 versus Ha: 𝜋 > 0.5. What is your p-value based on the above null distribution?
b) Suppose you now want to see if competitors wearing either red or blue have an advantage, so you test H0: 𝜋 = 0.5 versus Ha: 𝜋 ≠ 0.5. What is your p-value now based on the above null distribution?
5. Suppose you are testing the hypotheses H0: 𝜋 = 0.5 versus Ha: 𝜋 > 0.5. You get a sample proportion of 0.54 and find that your p-value is 0.08. Now suppose you redid your study with each of the following changes. Will your new p-value be larger or smaller than the 0.08 you first obtained?
a) You increase the sample size and still find a sample proportion of 0.54. LARGER SMALLER
b) Keeping the sample size the same, you take a new sample and find a sample proportion of 0.55. LARGER SMALLER
c) With your original sample, you decided to test a two-sided alternative instead of Ha: 𝜋 > 0.5. LARGER SMALLER

A sample exam question (for after Chapters 1 and 2)
Studying at Dordt
A sample of twenty Dordt students is taken, four of whom (4/20 = 20%) say they study at least 35 hours per week during the academic year. At most state universities, the proportion of students who report studying at least 35 hours a week is 10%. We wish to see if the Dordt sample provides strong evidence that the true proportion of Dordt students who study more than 35 hours a week is more than 10%. Two different approaches were taken in order to yield a p-value.
Option #1. 1000 sets of 20 spins on a spinner were generated where the probability of the spinner pointing to the blue region was 10%. Out of the 1000 sets of 20 spins, 148 sets had at least 4 spins yield blue, and so a p-value of 0.148 is obtained, showing no evidence that more than 10% of Dordt students study more than 35 hours a week.
Option #2. The theory-based inference applet was used, generating a standardized z-statistic of 1.49 with a p-value of 0.068, yielding moderate evidence that more than 10% of Dordt students study more than 35 hours a week.
(Applet output: Option #1, Option #2)
Briefly explain which p-value (Option #1 or Option #2) is more valid and why. Note: While the results obtained from the coin tossing applet are subject to some variation because they are based on simulation, that is NOT the main reason for the discrepancy between the two p-values.

Generalization (adapted from Exploration 2.1A: Sampling Words)
1. Select a representative set of 10 words from the following passage by circling them.
Four score and seven years ago, our fathers brought forth upon this continent a new nation: conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we cannot dedicate, we cannot consecrate, we cannot hallow this ground. The brave men, living and dead, who struggled here have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember, what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us, that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion, that we here highly resolve that these dead shall not have died in vain, that this nation, under God, shall have a new birth of freedom, and that government of the people, by the people, for the people, shall not perish from the earth. The authorship of literary works is often a topic for debate. For example, researchers have tried to determine whether some of the works attributed to Shakespeare were actually written by Bacon or Marlow. The field of “literary computing” examines ways of numerically analyzing authors’ works, looking at variables such as sentence length and rates of occurrence of specific words. 
Definitions: The population is the entire collection of observational units we are interested in, and the sample is a subgroup of the population on which we record data. Numerical summaries about a population are called parameters, and numerical summaries calculated from a sample are called statistics.
The above passage is, of course, Lincoln’s Gettysburg Address, given November 19, 1863 on the battlefield near Gettysburg, PA. We are considering this passage a population of words, and the 10 words you selected are considered a sample from this population. In most studies, we do not have access to the entire population and can only consider results for a sample from that population, but to learn more about the process of sampling and its implications we will now deal with a somewhat artificial scenario where we sample from this known population.
2. Record each word from your sample, and then indicate the length of the word (number of letters) and whether or not the word contains at least one letter e.
      Word      Length (no. of letters)      Contains e? (Y or N)
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
3. Identify the observational units and the variables you have recorded on those observational units. (Keep in mind that observational units do not have to be people!)
4. Is the variable “length of word” quantitative or categorical?
5. Calculate the average length of the 10 words in your sample. Is this number a parameter or a statistic? Explain how you know. What symbol would you use to refer to this value?
6. The average length of the 268 words in the entire speech equals 4.29 letters. Is this number a parameter or a statistic? Explain how you know. What symbol would you use to refer to this value?
7. Do you think the words you selected are representative of the 268 words in this passage? Suggest a method for deciding whether you have a representative sample.
(Hint: Whereas any one sample may not produce statistics that exactly equal the population parameters, what would we like to be true in general?)
8. Combine your results with your classmates’ by producing a dotplot of the distribution of average word lengths in your samples. Be sure to label the axis of this dotplot appropriately. Describe the shape, center, and variability of the distribution of average word lengths as revealed in the dotplot.
9. Let’s compare your sample statistics to the population parameter. How many and what proportion of students in your class obtained a sample average word length larger than 4.29 letters, the average word length in the population?
Definition: A sampling method is biased if, when using that sampling method, statistics from different samples consistently overestimate or consistently underestimate the population parameter of interest. Note that bias is a property of a sampling method, not a property of an individual sample. Also note that the sampling method must consistently produce non-representative results in order to be considered biased.
10. What does your answer to #9 tell you about whether the sampling method of asking students to quickly pick 10 representative words is biased or unbiased? If biased, what is the direction of the bias (tendency to overestimate or to underestimate)?
11. Do you think asking each of you to take 20 words instead of 10 words would have helped with this issue? Explain.
Now consider a different sampling method: What if you were to close your eyes and point blindly at the page with a pencil 10 times, taking for your sample the 10 words that your pencil lands on?
12. Do you think this sampling method is likely to be biased? If so, in which direction? Explain.
A sample is only useful to us if the data we collect on our sample are similar to the results we would find in the entire population. In this sense, we say the sample is representative of the population.
13.
Suggest another technique for selecting 10 words from this population in order for the sample to be representative of the population with regard to word length and e-words.
Taking a simple random sample
Key idea: A simple random sample ensures that every sample of size n is equally likely to be the sample selected from the population. In particular, each observational unit has the same chance of being selected as every other observational unit.
The key to obtaining a representative sample is using some type of random mechanism to select the observational units from the population, rather than relying on convenience samples or any type of human judgment. Instead of having you choose “random” words using your own judgment, we will now ask you to take a simple random sample of words and evaluate your results. The first step is to obtain a sampling frame - a complete list of every member of the population where each member of the population can be assigned a number. Below is a copy of the Gettysburg address that includes numbers in front of every word. For example, the 43rd word is nation.
14. Go to the Random Numbers generator at http://www.random.org and generate 10 numbers in the range from 1 to 268. What should you do if you get a duplicate number generated? Enter the random numbers in the table below:
Randomly generated ten ID values from 1-268:
Using your randomly generated values, look up the corresponding word from the sampling frame. Fill in the data table below.
      Word      Length (no. of letters)      e-word? (Y or N)
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
Notice that each simple random sample gave us different values for the statistics. That is, there is variability from sample to sample (sampling variability). This makes sense because the samples are chosen randomly and consist of different words each time, so we expect some differences from sample to sample.
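As an alternative to the random.org generator, the same draw of 10 distinct ID numbers from 1 to 268 can be sketched in Python. This is an illustrative sketch rather than part of the original exploration. The function names and the seed are assumptions, and the ten example words are simply the opening words of the address, used only to show the summary calculation.

```python
import random


def simple_random_sample_ids(population_size, sample_size, seed=None):
    """Draw distinct ID numbers from 1..population_size, each subset equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so duplicates cannot occur.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


def summarize_words(words):
    """Return (average word length, proportion of words containing the letter e)."""
    average_length = sum(len(w) for w in words) / len(words)
    proportion_with_e = sum("e" in w.lower() for w in words) / len(words)
    return average_length, proportion_with_e


# Seed chosen arbitrarily for reproducibility (an assumption).
ids = simple_random_sample_ids(268, 10, seed=2014)
print("Sampled ID numbers:", ids)

# Illustration only: the first ten words of the address, NOT a random sample.
example_words = ["Four", "score", "and", "seven", "years",
                 "ago", "our", "fathers", "brought", "forth"]
average_length, proportion_with_e = summarize_words(example_words)
print("Average length:", average_length)
print("Proportion of e-words:", proportion_with_e)
```

Sampling without replacement is one way to handle the duplicate-number question in #14. Each sampled ID would then be looked up in the sampling frame to get the actual word.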
Gettysburg Address Sampling Frame
1 Four  2 score  3 and  4 seven  5 years  6 ago  7 Our  8 fathers  9 brought  10 forth
11 upon  12 this  13 continent  14 A  15 new  16 nation  17 conceived  18 In  19 liberty  20 and
21 dedicated  22 To  23 The  24 proposition  25 that  26 All  27 men  28 Are  29 created  30 equal
31 Now  32 We  33 Are  34 engaged  35 In  36 A  37 great  38 Civil  39 War  40 testing
41 whether  42 That  43 nation  44 Or  45 Any  46 nation  47 So  48 conceived  49 And  50 So
51 dedicated  52 Can  53 Long  54 Endure  55 We  56 are  57 met  58 on  59 a  60 great
61 battlefield  62 of  63 that  64 war  65 We  66 have  67 come  68 To  69 dedicate  70 A
71 portion  72 Of  73 that  74 field  75 as  76 a  77 final  78 resting  79 place  80 For
81 those  82 Who  83 here  84 gave  85 their  86 lives  87 that  88 that  89 nation  90 might
91 live  92 It  93 Is  94 altogether  95 fitting  96 and  97 proper  98 that  99 we  100 should
101 do  102 this  103 But  104 in  105 a  106 larger  107 sense  108 we  109 cannot  110 dedicate
111 we  112 cannot  113 consecrate  114 we  115 cannot  116 hallow  117 this  118 ground  119 The  120 brave
121 men  122 living  123 and  124 dead  125 who  126 struggled  127 here  128 have  129 consecrated  130 It
131 far  132 above  133 our  134 poor  135 power  136 to  137 add  138 or  139 detract  140 The
141 world  142 will  143 little  144 note  145 nor  146 long  147 remember  148 what  149 we  150 say
151 here  152 but  153 it  154 can  155 never  156 forget  157 what  158 they  159 did  160 here
161 It  162 is  163 for  164 us  165 the  166 living  167 rather  168 to  169 be  170 dedicated
171 here  172 to  173 the  174 unfinished  175 work  176 which  177 they  178 who  179 fought  180 here
181 have  182 thus  183 far  184 so  185 nobly  186 advanced  187 It  188 is  189 rather  190 for
191 us  192 to  193 be  194 here  195 dedicated  196 to  197 the  198 great  199 task  200 remaining
201 before  202 us  203 that  204 from  205 these  206 honored  207 dead  208 we  209 take  210 increased
211 devotion  212 to  213 that  214 cause  215 for  216 which  217 they  218 gave  219 the  220 last
221 full  222 measure  223 of  224 devotion  225 that  226 we  227 here  228 highly  229 resolve  230 that
231 these  232 dead  233 shall  234 not  235 have  236 died  237 in  238 vain  239 that  240 this
241 nation  242 under  243 God  244 shall  245 have  246 a  247 new  248 birth  249 of  250 freedom
251 and  252 that  253 government  254 of  255 the  256 people  257 by  258 the  259 people  260 for
261 the  262 people  263 shall  264 not  265 perish  266 from  267 the  268 earth

15. Let’s examine your sample and those of your classmates.
a. Calculate the average word length in your random sample.
b. Again produce a dotplot of the distribution of sample average word lengths for yourself and your classmates.
c. Comment on how this distribution compares to the one from #8 based on nonrandom sampling.
16. Let’s compare your sample statistics to the population parameter.
a. How many and what proportion of students in your class obtained a sample average word length larger than the population average (4.29 letters)?
b. What does this answer reveal about whether simple random sampling is an unbiased sampling method? Explain.
Key idea: When we use simple random sampling, the average of the sample means from different samples equals the population mean and the average of the sample proportions from different samples equals the population proportion. (This means that these statistics are unbiased estimators for the corresponding parameter when we use simple random sampling.)
To really see the long-run pattern in these results, let’s look at many more random samples from this population. Open the Sampling Words java applet. You will see that the population distribution contains three variables: the lengths of the words, whether or not a word contains at least one letter e, and whether or not the word is a noun. For now, select only the length variable.
17. Check the Show Sampling Options box. Specify 10 in the Sample Size box and press the Draw Samples button.
You will see the 10 selected words appear in the box below. You will also see 10 darkened dots within the population distribution representing the 10 lengths that you have sampled. The value of the sample mean is displayed in the population box. The sample mean is also plotted in the Sample Mean Length plot.
Press Draw Samples again. Did you get the same sample this time? Did you get the same sample mean?
You should now see two dots in the Sample Mean Length plot, one for each sample mean. The mean of these two sample means is displayed in the upper right corner of this plot. Confirm this calculation.
First sample mean:
Second sample mean:
Average:
Change the Number of Samples to 998 (for a total of 1000 samples). Uncheck Animate. Press the Draw Samples button. (This should generate a dotplot of 1000 values of the sample mean for 1000 different simple random samples, each consisting of 10 words, selected randomly from the population of all 268 words in the Gettysburg address.)
18. Describe the resulting distribution of sample means: What are roughly the largest and smallest values? What is the average of the distribution? What is the standard deviation? How does the average of the 1000 sample means compare to the population mean 𝜇 = 4.29?
19. Now change the Sample Size from 10 to 30 and the Number of Samples to 1000. Press the Draw Samples button. The applet allows you to compare the two distributions (n = 10 to n = 30). How do these two distributions compare? Be sure to identify a feature that is similar and a feature that is different.
20. Did changing the sample size change the center of the distribution? If we used a biased sampling method, would increasing the sample size remove the bias? Explain.
You should see that whereas the convenience samples tended to consistently produce samples that overestimate the length of words, the simple random sampling method does not have that tendency.
In fact, the average of all the statistics from all possible simple random samples will exactly equal the population parameter value. For this reason, in the real world, where we don’t have access to the entire population, when we select a simple random sample, we will be willing to believe that the sample is representative of the population. In fact, we would prefer a small random sample to a large convenience sample. That doesn’t mean our sample result will match the population result exactly, but we will be able to predict how far off it might be. Simple random sampling (and probability sampling methods in general) allows us to believe that our sample is representative of the larger population and that the observed sample statistic is “in the ball park” of the population parameter. For this reason, when we use simple random sampling we will be willing to generalize the characteristics of our sample to the entire population.
Interval Estimation (one proportion/probability) (adapted from Exploration 3.1: Kissing Right?)
Most people are right-handed, and even the right eye is dominant for most people. Developmental biologists have suggested that late-stage human embryos tend to turn their heads to the right. In a study reported in Nature (2003), German bio-psychologist Onur Güntürkün conjectured that this tendency to turn to the right manifests itself in other ways as well, so he studied kissing couples to see to which side they tended to lean their heads while kissing. He and his researchers observed kissing couples in public places such as airports, train stations, beaches, and parks. They were careful not to include couples who were holding objects such as luggage that might have affected which direction they turned. For each kissing couple observed, the researchers noted whether the couple leaned their heads to the right or to the left. They observed 124 couples, aged 13-70 years.
You will first use Güntürkün’s data to test his conjecture that kissing couples tend to lean their heads to the right. Use the symbol 𝜋 to denote the proportion of all kissing couples in these countries that lean their heads to the right.
1. Is 𝜋 a parameter or a statistic? Explain how you are deciding.
2. Do we know the exact value of 𝜋 based on the observed data? Explain.
3. State the appropriate null and alternative hypotheses, both in words and in terms of the parameter 𝜋, for testing the conjecture that kissing couples tend to lean their heads to the right more often than by random chance.
Of the 124 couples observed, 80 leaned their heads to the right while kissing.
4. Calculate the sample proportion of the observed couples who leaned their heads to the right while kissing. Also indicate the symbol used to denote this value.
5. Conduct a simulation analysis (using the One Proportion applet) to assess the strength of evidence that the sample data provide for Güntürkün’s conjecture that kissing couples tend to lean their heads to the right more often than they would by random chance. Report the approximate p-value and summarize your conclusion about this strength of evidence.
Your simulation analysis should convince you that the sample data provide very strong evidence to believe that kissing couples lean their heads to the right more than half the time in the long run. That leads to a natural follow-up question: How much more than half the time? In other words, we have strong evidence that the long-run probability of leaning to the right is greater than one-half, but can we now estimate the value for that probability? We will do this by testing many different (null) values for the probability that a couple leans to the right when kissing.
6. Now test whether the data provide evidence that the long-run probability that a couple leans their heads to the right while kissing (𝜋) is different from 0.60.
Use the One Proportion applet to determine the p-value for testing the null value of 0.60. Report what you changed in the applet and report your p-value.
Think about it: What kind of test are we conducting if we want to determine whether the population parameter is different from a hypothesized value: a one-sided test or a two-sided test?
Remember from Section 1.4 that if we don’t have a specific direction we are testing, greater than or less than, we need to use a two-sided test of significance. If you used a one-sided alternative in #6, determine the two-sided p-value now by checking the Two-sided box. Recall that a p-value of 0.05 or less indicates that the sample data provide strong evidence against the null hypothesis and in favor of the alternative hypothesis. Thus, we can reject the null hypothesis when the p-value is less than or equal to 0.05. Otherwise, when the p-value is greater than 0.05, we do not have strong enough evidence against the null hypothesis and so we consider the null value to be plausible for the parameter.
Key Idea: We will consider a value of the parameter to be plausible if the two-sided p-value for testing that parameter value is larger than the level of significance.
7. Is the p-value for testing the null value of 0.60 less than 0.05? Can the value 0.60 be rejected, or is the value 0.60 plausible for the long-run probability that a couple leans their heads to the right while kissing?
Recall from Section 2.3 that the 0.05 criterion we are using is called the significance level. The p-value you found in the previous question should not have been smaller than 0.05. Hence, you do not reject the null hypothesis at the 0.05 level of significance and therefore you do not reject 0.60 as a plausible value for 𝜋. Thus, it is plausible (i.e., believable) that the long-run probability that a kissing couple leans their heads to the right is 0.60.
8.
Does this mean that you’ve proven that exactly 60% of kissing couples lean right? Why or why not?
Because there are still other plausible values, now we want to “zoom in” on which values for the long-run probability are plausible and which can be rejected at the 0.05 significance level.
9. Use the applet to test the probability values given in the following table. Each time, change the Probability of success to match the value that you are testing (keep the observed sample proportion, the value you count beyond, the same). Everything else should stay the same; press Draw Samples and then Count to see the new two-sided p-value (with the Two-sided box checked).
Probability under H0   (Two-sided) p-value   Reject or plausible?
0.54
0.55
0.56
0.57
0.58
0.59
0.60
0.70
0.71
0.72
0.73
0.74
0.75
0.76
10. Using a 0.05 significance level and your results from #9, provide a list of plausible values for 𝜋, the long-run probability that a kissing couple leans their heads to the right. This list of values represents an interval containing all values between two endpoints.
Key idea: This interval of plausible values that you have written down in #10 is also called a 95% confidence interval for 𝜋, the long-run probability that a kissing couple leans their heads to the right. Why is this called 95% confidence? This corresponds to the 5% (0.05) significance level that was used to decide whether there was enough evidence against a hypothesized value. Notice that 95% and 5% add up to give 100%. The 95% is called the confidence level, and is a measure of how confident we are about our interval estimate of the parameter. In this case, we are 95% confident that the population proportion of all couples who lean their heads to the right while kissing is in the interval that you found.
11. Does the 95% confidence interval include the value 0.5? Does it include the value 0.6?
Explain how your answers relate to the significance test and p-value that you calculated in #5 and #6.
12. Now suppose we were to use a significance level of 0.01 instead of 0.05 to decide whether or not to reject the corresponding null hypothesis for each listed value of 𝜋. How would you expect the interval of plausible values to change: wider, narrower, or no change? Explain your reasoning.
13. Implement the 0.01 significance level to determine plausible values for the population proportion of all kissing couples that lean their heads to the right. (Hint: Start with the table from #9, although you might have to test more null values as well.) Report the interval of plausible values.
14. How confident are you about the interval of plausible values that you listed in #13?
15. What is the primary difference between the 95% confidence interval reported in #10 and the 99% confidence interval reported in #13? Which interval is wider? Is this consistent with what you predicted in #12?
16. How would you expect a 90% confidence interval to compare to the 95% and 99% confidence intervals? Explain.
Key idea: Using a higher confidence level produces a wider confidence interval.
You should have found very strong evidence, based on a small p-value, that a majority of kissing couples in the population lean their heads to the right. Moreover, your confidence intervals, which provide plausible values for the population proportion that lean to the right, contain only values above 0.5.
Confounding, Causation, and Experiments (adapted from Exploration 4.1: Home Court Disadvantage?)
Sports teams prefer to play in front of their own fans rather than at the opposing team’s site. Having a sell-out crowd should provide even more excitement and lead to an even better performance, right? Well, consider the Oklahoma City Thunder, a National Basketball Association team, in its second season (2008-09) after moving from Seattle.
This team had a win-loss record that was actually worse for home games with a sell-out crowd (3 wins and 15 losses) than for home games without a sell-out crowd (12 wins and 11 losses). (These data were noted in the April 20, 2009 issue of Sports Illustrated in the Go Figure column.)
1. Identify the observational units and variables in this study. Also classify each variable as categorical (also binary?) or quantitative.
2. When did the Thunder have a higher winning percentage: in front of a sell-out crowd or a smaller crowd? Support your answer by calculating the proportion of sell-out games that they won and also the proportion of non-sell-out games that they won. (Write both these proportions as decimals.)
Sell-out crowd:
Smaller crowd:
Definition: Two variables are associated or related if the value of one variable gives you information about the value of the other variable. When comparing two groups this means that the proportions or means take different values in the two groups.
3. Do the two variables appear to be associated? Explain.
Often when a study involves two associated variables, it is natural to consider one the explanatory variable and the other the response variable.
Definitions: The explanatory variable is the variable we think is “explaining” the change in the response variable and the response variable is the variable we think is being impacted or changed by the explanatory variable. The explanatory variable is sometimes called the independent variable and the response variable is sometimes called the dependent variable.
4. Which would you consider the explanatory variable in this study? Which is the response? (That is, what are the roles of these variables in this study?)
There are two possible explanations for this odd finding that the team had a better winning percentage with a smaller crowd:
The sell-out crowd caused the Thunder to play worse, perhaps because of pressure or nervousness.
The sell-out crowd did not cause a worse performance, and some other issue (variable) explains why they had a worse winning percentage with a sell-out crowd. In other words, a third variable is at play, which is related to both the crowd size and the game outcome.
(Of course, another explanation is random chance. Using methods you will learn later, we’ve determined that you can essentially rule out random chance in this case.)
5. Consider the second explanation. Suggest a plausible alternative variable that would explain why the team would be less likely to win in front of a sell-out crowd than in front of a smaller crowd. (Make sure it’s clear not just that your explanation would affect the team’s likelihood of winning, but that the team would be less likely to win in front of a sell-out crowd compared to a smaller crowd.)
Definition: A confounding variable is a variable that is related both to the explanatory and to the response variable in such a way that its effects on the response variable cannot be separated from the effects of the explanatory variable.
6. Identify the confounding variable based on your suggested explanation in #5. Explain how it is confounding: what is the link between this third variable and the response variable, and what is the link between this third variable and the explanatory variable? (Hint: Remember that this variable has to be recorded on the observational units: home games for the Thunder.)
Another variable recorded for these data was whether or not the opponent had a winning record the previous season. Of the Thunder’s 41 home games, 22 were against teams that won more than half of their games. Let’s refer to those 22 teams as strong opponents. Of these 22 games, 13 were sell-outs. Of the 19 games against opponents that won less than half of their games that season (weak opponents), only 5 of those games were sell-outs.
7.
Was the Thunder more likely to have a sell-out crowd against a strong opponent or a weak opponent? Calculate the relevant (conditional) proportions to support your answer.
When the Thunder played a strong opponent, they won only 4 of 22 games. When they played a weak opponent, the Thunder won 11 of 19 games.
8. Was the Thunder less likely to win against a strong opponent than a weak one? Again calculate the relevant (conditional) proportions to support your answer.
9. Explain how your answers to #7 and #8 establish that strength of opponent is a confounding variable that prevents drawing a cause-and-effect conclusion between crowd size and game outcome.
10. Summarize your conclusion about whether these data provide evidence that a sell-out crowd caused the Thunder to play worse. Write as if to a friend who has never studied statistics. Be sure to address the fact that the Thunder had a much smaller winning percentage in front of a sell-out crowd.
Follow-ups:
1. A recent study classified children according to the type of lighting used in their room (full light, night light, no light) and their eye condition (near-sighted, normal vision, and far-sighted). The two-way table of counts is:
                Full light   Night light   No light   Total
Near-sighted    41           78            18         137
Normal vision   22           115           114        251
Far-sighted     12           39            40         91
Total           75           232           172        479
a) Is this an observational study or an experiment? Explain briefly.
b) What proportion of the children who slept in a room with full light were near-sighted?
The following segmented bar graph displays these data:
[Segmented bar graph: percentage (0% to 100%) of near-sighted, normal vision, and far-sighted children for each lighting condition (full light, night light, no light)]
c) Based on this graph, describe the association (if any) between lighting condition and eyesight.
d) Provide an argument that parents’ eyesight could be a confounding variable in this study.
e) Would you conclude that this study provides strong evidence of a cause-and-effect relationship between lighting condition and eyesight? Explain briefly.
2. In a study published in a 2007 issue of the journal Preventive Medicine, researchers found that smokers were more likely to have used candy cigarettes as children than non-smokers were. When hearing about this study, John said: “But isn’t the smoking status of the person’s parents a confounding variable here?” When Karen asked what he meant, John said: “Children whose parents smoke are more likely to become smokers themselves when they become adults.” What else does John need to say in order to explain how the parents’ smoking status can be a confounding variable in this study?
Randomization Test – Comparing Two Proportions (adapted from Example 5.1: Dolphin Therapy)
Swimming with dolphins can certainly be fun, but is it also therapeutic for patients suffering from clinical depression? To investigate this possibility, researchers recruited 30 subjects aged 18-65 with a clinical diagnosis of mild to moderate depression. Subjects were required to discontinue use of any antidepressant drugs or psychotherapy four weeks prior to the experiment, and throughout the experiment. These 30 subjects went to an island off the coast of Honduras, where they were randomly assigned to one of two treatment groups. Both groups engaged in the same amount of swimming and snorkeling each day, but one group (the animal care program) did so in the presence of bottlenose dolphins and the other group (outdoor nature program) did not. At the end of two weeks, each subject’s level of depression was evaluated, as it had been at the beginning of the study (Antonioli and Reveley, 2005). The response variable is defined as whether or not the subject achieved substantial reduction (improvement) of their depression.
1.
Why did the researchers include a comparison group in this study? Why didn’t they just see how many patients showed substantial improvement when given the dolphin therapy?
The null hypothesis for this study should basically say that there is no association between swimming with dolphins and treatment of depression while the alternative should state that there is an association. We are going to write our hypotheses in terms of probabilities.
Null: The probability a subject will show improvement in their depression symptoms is the same between those that swim with dolphins and those that don’t.
Alternative: The probability a subject will show improvement in their depression symptoms is greater for those that swim with dolphins compared to those that don’t.
The results that the researchers found are shown in the following table:
                                       Dolphin therapy   Control group   Total
Showed substantial improvement         10                3               13
Did not show substantial improvement   5                 12              17
Total                                  15                15              30
2. Identify which of the variables in this study is explanatory and which is response.
3. A useful first step to begin to see whether the data appear to support the claim that dolphin therapy is effective is to calculate the conditional proportion that improved in each group. Calculate these conditional proportions. Did the dolphin therapy group have a higher conditional proportion that showed substantial improvement than the control group?
4. Is it possible that this difference could happen even if dolphin therapy was not effective, due simply to the random nature of putting subjects into groups (i.e., the luck of the draw)?
Sure, this is possible. Consider the following scenario: Assume that the 13 people in the study whose depression symptoms improved would have improved whether they swam with dolphins or not; let’s call them the “improvers”.
Now, what if, by chance, of the 13 improvers (people who improve no matter what), 10 randomly ended up in the “swim with dolphins” group? Recall the important fact that subjects were randomly assigned to swim with dolphins or not. So, randomly, it is possible that 10 improvers end up in the swim with dolphins group and only 3 in the not swim with dolphins group. Thus, it is possible we would see 67% (10/15) of dolphin swimmers improving and only 20% (3/15) of non-dolphin swimmers improving even if swimming with dolphins doesn’t actually make a difference. So, it is possible... but how unlikely is it?
In order to answer this question we will once again turn to simulation. Remember that simulation is a method we have used to estimate probabilities. So far you used coin flipping and an applet to simulate the p-value (the probability we would obtain the observed statistic or something more extreme if the null hypothesis was true) for different tests of significance. Because the dolphin experiment is more complex than the previous studies we looked at, we can’t simply flip a coin any longer. Instead, to do our simulation, we’re going to use cards and a computer applet.
To estimate the p-value for this study you will need 30 index cards, 13 of which are blue to represent the “improvers” and 17 of which are green to represent the “non-improvers.” The null hypothesis assumes that 13 people (the blue cards) will get better no matter whether they swim with dolphins or not. So those that improved were going to do so (perhaps just from getting to fly to Honduras and swim in the water) regardless of which treatment group they were assigned to.
5. Now shuffle your 30 cards and deal them into two stacks of fifteen. One of the two stacks represents the people who got to swim with dolphins and the other stack represents people who didn’t. Decide which stack is which and then fill in the table below.
                                       Dolphin therapy   Control group   Total
Showed substantial improvement         ____              ____            13
Did not show substantial improvement   ____              ____            17
Total                                  15                15              30
Difference in proportions of improvement (dolphin group minus control group):
Repeat this shuffling and dealing process a second time:
                                       Dolphin therapy   Control group   Total
Showed substantial improvement         ____              ____            13
Did not show substantial improvement   ____              ____            17
Total                                  15                15              30
Difference in proportions of improvement (dolphin group minus control group):
6. Combine your results with your classmates by constructing a dotplot of the differences in improvement proportions (dolphin group minus control group).
7. Does it appear that the observed value of this statistic (namely, 0.467) would be unlikely to occur if there really were no effect of the dolphin therapy? Explain.
Remember the p-value is the probability we would observe a statistic at least as large as we did in our study, assuming the null hypothesis was true. By shuffling and dealing the cards, we are simulating a true null hypothesis. We are simulating a random assignment of the 13 subjects that actually improved to each group.
8. What proportion of the results from your class had a difference in the proportion of improvers of 0.467 or larger? This is your p-value. Provide a detailed interpretation of this p-value. Be sure to explain what is meant by “by chance alone,” keeping in mind what was randomized in this study.
9. Based on the class results, what is your preliminary conclusion about the effectiveness of swimming with dolphins in improving peoples’ depression symptoms? Explain the reasoning process behind this conclusion.
We really need to carry out this simulated random assignment process hundreds, preferably thousands of times. This would be very tedious and time-consuming with cards, so let’s turn to technology. To do this, use the “Dolphin study” applet found at: http://www.rossmanchance.com/ISIapplets.html.
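The card simulation can also be sketched in code. The following Python sketch is an illustration, not the applet's implementation: it shuffles 13 "improver" cards and 17 "non-improver" cards, deals 15 to the dolphin group, and records the difference in improvement proportions (dolphin group minus control group). The function names, the seed, and the choice of 1000 repetitions are assumptions made for this example.

```python
import random

N_IMPROVERS = 13       # blue cards: subjects who showed substantial improvement
N_NON_IMPROVERS = 17   # green cards: subjects who did not
GROUP_SIZE = 15        # subjects in each treatment group
OBSERVED_DOLPHIN_IMPROVERS = 10  # improvers observed in the dolphin group


def shuffle_once(rng):
    """Shuffle the 30 cards, deal 15 to the dolphin group, and return the
    number of improvers that land in the dolphin group."""
    cards = [1] * N_IMPROVERS + [0] * N_NON_IMPROVERS
    rng.shuffle(cards)
    return sum(cards[:GROUP_SIZE])


def difference_in_proportions(dolphin_improvers):
    """Dolphin-group proportion minus control-group proportion."""
    control_improvers = N_IMPROVERS - dolphin_improvers
    return (dolphin_improvers - control_improvers) / GROUP_SIZE


rng = random.Random(2014)   # arbitrary seed (assumption) for reproducibility
repetitions = 1000

counts = [shuffle_once(rng) for _ in range(repetitions)]
differences = [difference_in_proportions(c) for c in counts]

observed_difference = difference_in_proportions(OBSERVED_DOLPHIN_IMPROVERS)

# A shuffle is "at least as extreme" when the dolphin group gets 10 or more
# improvers; comparing counts avoids floating-point equality issues.
as_extreme = sum(c >= OBSERVED_DOLPHIN_IMPROVERS for c in counts)
p_value = as_extreme / repetitions

print("observed difference:", round(observed_difference, 3))
print("approximate p-value:", p_value)
```

Because the result depends on the random shuffles, the approximate p-value will vary slightly with the seed and the number of repetitions, just as it varies from one class or applet run to the next.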
Check the “Show Shuffle Options” box and then press the “Shuffle” button. Notice that the applet does what you have done: shuffle the 30 cards and deal out 15 for the “dolphin therapy” group, separating cards by color. The applet also determines the 2×2 table for the simulated results (check the Show table box on the left) and creates a dotplot of the shuffled difference in conditional proportions. Now press the “Shuffle” button four more times. Then increase the “Number of Shuffles” to 995 and press the “Shuffle” button. This produces a total of 1000 repetitions of the simulated random assignment process.
10. Use the applet to determine what proportion of your 1000 simulated random assignments produced results as extreme as (or more extreme than) the actual study (which, you’ll recall, saw a 0.467 difference in improvers). This is your p-value. Interpret this p-value, including what is meant by the phrase “by chance alone” in this context.
11. Is your p-value small enough so that you would consider a difference in the observed proportion of improvers of 0.467 or more surprising under the null model that dolphin therapy is not effective? Does this give strong evidence that the dolphin therapy is effective in treating depression in patients similar to those in the study?
12. Is it reasonable to draw a cause-and-effect conclusion between dolphin therapy and improvement or not in depression symptoms? Explain why or why not.
13. Is it reasonable to generalize your conclusion to all people who suffer from depression? Explain why or why not.
Further Exploration
What if the study had found the same overall proportion of improvers, but with the group proportions closer together? More specifically, suppose that the actual data had turned out with 2 fewer improvers in the dolphin therapy group and 2 more improvers in the control group.
14. How will this affect our analysis, and the p-value, and the conclusion?
First make a conjecture, and explain your reasoning.
15. Then investigate this question with a simulation. [Hints: Think about how the two-way table will change. Then use your earlier simulation results but explain what you are doing differently now to find the approximate p-value.] Explain why your answers (about how the strength of evidence against the null model changes when the group difference is not as large) make intuitive sense.
Randomization Tests Comparing Two Groups on a Quantitative Response (adapted from Exploration 6.2: Lingering Effects of Sleep Deprivation)
Step 1: Ask a research question. Many students pull “all-nighters” when they have an important exam or a pressing assignment. Concerns that may arise include: Can you really function well the next day after a sleepless night? What about several days later: Can you recover from a sleepless night by getting a full night’s sleep on the following nights?
Step 2: Design a study and collect data. Researchers Stickgold, James, and Hobson investigated delayed effects of sleep deprivation on learning in a study published in Nature Neuroscience (2000). Twenty-one volunteers, aged 18 to 25 years, were first trained on a visual discrimination task that involved watching stimuli appear on a computer screen and reporting what was seen. After the training period, subjects were tested. Performance was recorded as the minimum time (in milliseconds) between the appearance of stimuli and an accurate response. Following these “baseline” measurements, one group was randomly assigned to be deprived of sleep for 30 hours, followed by two full nights of unrestricted sleep, whereas the other group was allowed to get unrestricted sleep on all three nights. Following this, both groups were retested on the task to see how well they remembered the training from the first day. Researchers recorded the improvement in performance as the decrease in time required at retest compared to training.
(Note: For example, if someone took 5 milliseconds (ms) to respond at the beginning of the study and then 2 ms to respond at the end, the improvement score is 3 ms. But if someone took 2 ms at the beginning and then 5 ms at the end, the improvement score is -3 ms.) The goal of the study was to see whether the improvement scores tend to be higher for the unrestricted sleep treatment than for the sleep deprivation treatment.
1. Identify the explanatory and response variables in this study. Also classify them as either categorical or quantitative.
Explanatory: Type:
Response: Type:
2. Was this an experiment or an observational study? Explain how you are deciding.
3. Let µ_unrestricted be the long-run mean improvement on this task three days later when someone has had unrestricted sleep and let µ_deprived denote the long-run mean improvement when someone is sleep deprived on the first night. In words and symbols, state the null and the alternative hypotheses to investigate whether sleep deprivation has a negative effect on improvement in performance on visual discrimination tasks. (Hint for the alternative hypothesis: Do you expect the people to do better or worse when sleep deprived? Based on your answer, what sign/direction should you choose for the alternative hypothesis?)
Here are the data, with positive values indicating better performance at retest than at training, and negative values indicating worse performance at retest than at training:
Unrestricted-sleep group’s improvement scores (milliseconds): 25.2, 14.5, -7.0, 12.6, 34.5, 45.6, 11.6, 18.6, 12.1, 30.5
Sleep-deprived group’s improvement scores (milliseconds): -10.7, 4.5, 2.2, 21.3, -14.7, -10.7, 9.6, 2.4, 21.8, 7.2, 10.0
Step 3: Explore the Data.
4. To look at graphical and numerical summaries of the data from the study, go to the Multiple Means applet. The sleep deprivation data have already been entered into the applet.
a.
Notice that the applet creates parallel dotplots, one for each study group. Based on these dotplots alone, which group (unrestricted or deprived) appears to have had the higher mean improvement? How are you deciding?
b. Based on the dotplots alone, which group (unrestricted or deprived) appears to have had more variability in improvement? How are you deciding?
c. Notice that the applet also computes numerical summaries of the data, such as the mean and standard deviation (SD) for the improvements in each group.
i. For the unrestricted group, record the sample size (n), mean, and SD.
n_unrestricted =        x̄_unrestricted =        SD_unrestricted =
ii. For the deprived group, record the sample size (n), mean, and SD.
n_deprived =        x̄_deprived =        SD_deprived =
Recall from earlier in the course that standard deviation is a measure of variability. Relatively speaking, smaller standard deviation values indicate less variability and a distribution whose data values tend to cluster more closely together, compared to a distribution with a larger standard deviation.
d. Based on the numerical summaries reported in #4c, which group (unrestricted or deprived) had the higher mean improvement?
e. Based on the numerical summaries reported in #4c, which group (unrestricted or deprived) had the higher variability in improvement?
f. Notice that the applet also reports the observed difference in means for the improvements of the two groups. Record this value (and its measurement units).
x̄_unrestricted - x̄_deprived =
g. Before you conduct an inferential analysis, does this difference in sample means (as reported in #4f) strike you as a meaningful difference? Explain your answer.
Step 4: Draw Inferences.
5. What are two possible explanations for why we observed the two groups to have different sample means for improvement in performance?
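The numerical summaries asked for in #4c and #4f can also be computed directly from the listed improvement scores. The sketch below is only a cross-check on the applet output: the variable and function names are made up for this example, and it assumes the SD is the sample standard deviation (n - 1 divisor), which the text does not state explicitly.

```python
from statistics import mean, stdev

# Improvement scores (milliseconds) as listed in the study data.
unrestricted = [25.2, 14.5, -7.0, 12.6, 34.5, 45.6, 11.6, 18.6, 12.1, 30.5]
deprived = [-10.7, 4.5, 2.2, 21.3, -14.7, -10.7, 9.6, 2.4, 21.8, 7.2, 10.0]


def summarize(scores):
    """Return (n, mean, SD). stdev uses the sample (n - 1) formula, which is
    an assumption about what the applet reports."""
    return len(scores), mean(scores), stdev(scores)


n_u, mean_u, sd_u = summarize(unrestricted)
n_d, mean_d, sd_d = summarize(deprived)

# Observed statistic: unrestricted mean minus deprived mean.
observed_difference = mean_u - mean_d

print(f"unrestricted: n = {n_u}, mean = {mean_u:.2f} ms, SD = {sd_u:.2f} ms")
print(f"deprived:     n = {n_d}, mean = {mean_d:.2f} ms, SD = {sd_d:.2f} ms")
print(f"difference in means: {observed_difference:.2f} ms")
```

The order of subtraction (unrestricted minus deprived) matches the statistic requested in #4f, so a positive value means the unrestricted-sleep group improved more on average.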
Next you will be asked to describe how you might go about deciding whether the observed difference between the two sample means is statistically significant. (Hint: Think about how you assessed whether an observed difference between two sample proportions was statistically significant in Chapter 5. Use the same strategy, with an appropriate modification for working with means instead of proportions.)
Once again the key question is how often random assignment alone would produce a difference in the groups at least as extreme as the difference observed in this study, if there really were no effect of sleep condition on improvement score. You addressed similar questions in Chapter 5 when you analyzed the dolphin therapy and yawning studies. The only change is that now the response variable is quantitative rather than categorical, so the relevant statistic is the difference in group means rather than the difference in group proportions. Also once again, we use simulation to investigate how often such an extreme difference would occur by chance (random assignment) alone (if the null hypothesis of no difference / no effect / no association were true). In other words, we will again employ the 3S Strategy.
Statistic:
6. A natural statistic for measuring how different the observed group means are from each other is the difference in the mean improvement scores between the two groups. Report the value of this statistic, as you did in #4f.
Simulate: You will start by using index cards to perform a tactile simulation of randomly assigning the 21 subjects between the two groups, assuming that sleep condition has no impact on improvement. Because the null hypothesis asserts that improvement score is not associated with sleep condition, we will assume that the 21 subjects would have had exactly the same improvement scores as they did, regardless of which sleep condition group (unrestricted or deprived) the subject had been assigned.
7. a.
How many index cards do you need to conduct this simulation?
b. What will you write on each index card?
To conduct one repetition of this simulation:
Shuffle the stack of 21 cards well, and then randomly distribute the cards into two stacks: one stack with 10 cards (the unrestricted group) and one with 11 (the sleep deprived group).
Calculate and report the sample means for each re-randomized group:
Re-randomized unrestricted group’s mean:
Re-randomized deprived group’s mean:
Calculate the difference in group means, taking the unrestricted mean minus the sleep deprived mean. Report this value.
Combine this result with your classmates’ to create a dotplot that shows the distribution of several possible values of the difference in sample means that could have happened due to pure chance, if sleep condition has no impact on improvement. Sketch the dotplot below, being sure to label the horizontal axis.
Label:
c. At about what value is the dotplot centered? Explain why this makes sense. (Hint: What are we assuming to be true when we conduct the simulation?)
d. Where is the observed difference in means from the original study (as reported in #6) on the dotplot? Did this value happen often, somewhat rarely, or very rarely? How are you deciding?
8. As before with simulation-based analyses, you would now like to conduct many, many more repetitions to determine what is typical and what is not for the difference in group means, assuming that sleep condition has no impact on improvement score. We think you would prefer to use a computer applet to do this rather than continue to shuffle cards for a very long time, calculating the difference of group means by hand. Go back to the Multiple Means applet, check the Show Shuffle Options box, select the Plot display, and press Shuffle Responses.
a. Describe what the applet is doing and how this relates to your null hypothesis from #3.
(Note: You can check the Show Original Colors box to see which groups the observations originally belonged to.)
b. Record the simulated difference in sample means for the re-randomized groups, as given in the applet output. Is this difference more extreme than the observed difference from the study (as reported in #6)? How are you deciding?
c. Click on Shuffle Responses again, and record the simulated difference in sample means for the re-randomized groups. Did it change from #8b?
d. Click on Shuffle Responses again, and record the simulated difference in sample means for the re-randomized groups. Did it change from #8b and #8c?
e. Now to see many more possible values of the difference in sample means, assuming sleep condition has no impact on improvement, do the following in the Multiple Means applet:
Change Number of Shuffles from 1 to 997.
Press Shuffle Responses, to produce a total of 1000 shuffles and re-randomized statistics.
f. Consider the dotplot of the 1000 could-have-been values of the difference in sample means, assuming that sleep condition has no effect on improvement.
i. What does one dot on the dotplot represent? (Hint: Think about what you would have to do to put another dot on the graph.)
ii. Describe the overall shape of the null distribution displayed in this dotplot.
iii. Where does the observed difference in sample means (as reported in #6) fall in this dotplot: near the middle or out in a tail? Are there a lot of dots that are even more extreme than the observed difference, assuming sleep condition has no impact on improvement score? How are you deciding?
g. To estimate a p-value, continue with the Multiple Means applet. Type in the observed difference in group means (as reported in #6) in the Count Samples box (for the one-sided alternative hypothesis), and press Count. Record the approximate p-value.
h. Fill in the blanks of the following sentence to complete the interpretation of the p-value.
The p-value of _______ is the probability of observing ________________________________________ assuming __________________________________________.
Strength of Evidence:
9. Based on the p-value, evaluate the strength of evidence provided by the experimental data against the null hypothesis that sleep condition has no effect on improvement score: not much evidence, moderate evidence, strong evidence, or very strong evidence? How are you deciding?
10. Significance: Summarize your conclusion with regard to strength of evidence in the context of this study.
11. Estimation:
a. Use the 2SD Method to approximate a 95% confidence interval for the difference in long-run mean improvement score for subjects who get unrestricted sleep minus the long-run mean improvement score for subjects who are sleep deprived. (Hints: Remember the observed value of the difference in group means, and obtain the SD of the difference in group means from the applet’s simulation results. The interval should be observed difference in means ± 2 SD, where SD represents the standard deviation of the null distribution of the difference in group means.)
b. Interpret what this confidence interval reveals, paying particular attention to whether the interval is entirely positive, entirely negative, or contains zero. (Hint: Be sure to mention “direction” in your interpretation by saying how much larger improvement scores are on average for the treatment you find to have the larger long-run mean: I’m 95% confident that the long-run mean improvement score is ____ to ____ higher with the ____________ treatment as opposed to the ____________ treatment.)
Step 5: Formulate Conclusions.
12. Generalization: Were the participants in this study randomly selected from a larger population? Describe the population to which you would feel comfortable generalizing the results of this study.
13.
Causation: Were the participants in the study randomly assigned to a sleep condition? How does this affect the scope of conclusion that you can draw?
Another statistic: Could we have chosen a statistic other than the difference in group means to summarize how different the two groups’ improvement scores were? Yes, for example we could have used the difference in group medians. Why might we do this? For one reason, the median is less affected by outliers than the mean (see Section 3.2). We can also use the Multiple Means applet to do this.
14. Step 6: Look back and ahead.
Looking back: Did anything about the design and conclusions of this study concern you? Issues you may want to critique include:
Any mismatch between the research question and the study design
How the experimental units were selected
How the treatments were assigned to the experimental units
How the measurements were recorded
The number of experimental units in the study
Whether what we observed is of practical value
Looking ahead: What should the researchers’ next steps be to fix the limitations or build on this knowledge?

Common exam questions

One proportion (Unit 1)
Questions 1 – 5 refer to the following story.
With movie-viewing-at-home made so convenient by services such as Netflix, Pay-per-view, and Video-on-demand, do a majority of city residents now prefer watching movies at home rather than going to the theater? To investigate, a local high school student, Lori, decides to conduct a poll of adult residents in her city. She selects a random sample of 100 adult residents from the city and gives each participant the choice between watching a movie at home or the same movie at the theater. She records how many choose to watch the movie at home.
1. Pick the correct null hypothesis. (Circle only one.)
a) Adult residents of the city are equally likely to choose to watch the movie at home as to watch the movie at the theater.
b) Adult residents of the city are more likely to choose to watch the movie at home than to watch the movie at the theater.
c) Adult residents of the city are less likely to choose to watch the movie at home than to watch at the theater.
d) Other:
2. Pick the correct alternative hypothesis. (Circle only one.)
a) Adult residents of the city are equally likely to choose to watch the movie at home as to watch the movie at the theater.
b) Adult residents of the city are more likely to choose to watch the movie at home than to watch the movie at the theater.
c) Adult residents of the city are less likely to choose to watch the movie at home than to watch at the theater.
d) Other:
3. After analyzing her data, Lori finds that significantly more than half of the sample (p-value 0.012) preferred to watch the movie at home. Below are two possible explanations for the observed majority. Based on her analysis, which explanation do you find the most plausible (i.e., reasonable, believable)? (Circle only one.)
a) More than half of the adult residents in her city prefer to watch the movie at home.
b) There is no overall preference for movie-watching-at-home in her city, but by pure chance her sample just happened to have an unusually high number of people choose to watch the movie at home.
c) (a) and (b) are equally plausible explanations.
4. Which of the following is the most valid interpretation of Lori’s p-value of 0.012? (Circle only one.)
a) A sample proportion as large as or larger than hers would rarely occur.
b) A sample proportion as large as or larger than hers would rarely occur if the study had been conducted properly.
c) A sample proportion as large as or larger than hers would rarely occur if 50% of adults in the population prefer to watch the movie at home.
d) A sample proportion as large as or larger than hers would rarely occur if more than 50% of adults in the population prefer to watch the movie at home.
5.
Reconsider Lori’s results. If we were to construct a 95% confidence interval for the proportion of all adult residents of this city that would choose to watch the movie at home, will this interval contain the value of 0.5? (Circle only one.)
a) Yes
b) No
c) Don’t have enough information to decide.
Explain:

Two proportions (Unit 2)
You want to investigate a claim that women are more likely than men to dream in color. You take a random sample of men and a random sample of women (in your community) and ask whether they dream in color, and compare the proportions of each gender that dream in color. (Questions 1-6 are based on this study.)
1. If the difference in the proportions (who dream in color) between the two samples turns out not to be statistically significant (p-value was not small), which of the following would be the best conclusion to draw?
a) You have found strong evidence that there is no difference between the proportions of men and women in your community that dream in color.
b) You have not found enough evidence to conclude that there is a difference between the proportions of men and women in your community that dream in color.
c) You have found strong evidence against the claim that there is a difference between the proportions of men and women that dream in color.
d) Because the result is not significant, we can’t conclude anything from this study.
Instructors may want to give partial credit for (d).
2. If the difference in the proportions (who dream in color) between the two samples does turn out to have a small p-value, which one of the following would be the best interpretation?
a) It would not be very surprising to obtain the observed sample results if there is really no difference between the proportions of men and women in your community that dream in color.
b) It would be very surprising to obtain the observed sample results if there is really no difference between the proportions of men and women in your community that dream in color.
c) It would be very surprising to obtain the observed sample results if there is really a difference between the proportions of men and women in your community that dream in color.
d) The probability is very small that there is no difference between the proportions of men and women in your community that dream in color.
e) The probability is very small that there is a difference between the proportions of men and women in your community that dream in color.
3. Your review of the research suggests that there is a genuine difference between men and women on this variable. Suppose that the difference between the sample groups turns out not to be statistically significant (not a small p-value). Which conclusion would be most reasonable?
a) Something went wrong with the analysis, and the results of this study cannot be trusted.
b) There must not be a difference after all and the other research studies were flawed.
c) The sample size might have been too small to detect a difference even if there is one.
4. Suppose that two different studies are conducted on this issue. Study A finds that 40 of 100 women sampled dream in color, compared to 20 of 100 men. Study B finds that 35 of 100 women dream in color, compared to 25 of 100 men. Which study (A or B) would provide stronger evidence that there is a genuine difference between men and women on this issue?
a) Study A
b) Study B
c) The strength of evidence would be similar for these two studies
5. Suppose two more studies are conducted on this issue. Both studies find 30% of women sampled dream in color, compared to 20% of men. But Study C consists of 100 people of each sex, whereas Study D consists of 40 people of each sex.
Which study would provide stronger evidence that there is a genuine difference between men and women on this issue?
a) Study C
b) Study D
c) The strength of evidence would be similar for these two studies
6. Suppose the difference in the sample proportions who dream in color does turn out to be statistically significant (small p-value). Below are two possible explanations for this observed difference. Which do you consider to be the more plausible (i.e., believable or reasonable) explanation for the observed result?
a) Men and women in your community do not differ on this issue but by chance alone the random sampling led to the difference we observed between the two groups.
b) Men and women in your community differ on this issue.
c) (a) and (b) are equally plausible explanations.
Possible alternative to Q5 and Q6: Suppose I were to use a simulation-based method to randomly assign yes/no responses about dreaming in color to the males and females in the study under the assumption that the likelihood of dreaming in color is the same for both groups. What would be the main purpose of this use of randomness in this simulation?
a. To allow me to draw a cause-and-effect conclusion from the study.
b. To allow me to generalize my results to a larger population.
c. To simulate values of the statistic under the null hypothesis.
d. To replicate the study and increase the accuracy of the results.
e. Other:

Two means (Unit 2)
A psychology student conducted a research study in which the research question is whether financial incentives can improve performance on video games. The student prepared an experiment in which 40 subjects are randomly assigned to one of two groups.
The first group was offered $5 for a score above 100 and the other group was simply told to “do your best.” Each of the 40 students played the video game and achieved a certain score, from which the student calculated the following summary statistics:

                $5 incentive    “do your best”
Mean score      98              80
Sample size     20              20

These average scores differ by 98 – 80 = 18 points, but the student is concerned that this difference might be just due to random chance. To examine the question of random chance the student conducted the following simulation:
She took 40 index cards. She wrote each subject’s score on one card.
  She then shuffled the cards and randomly placed the cards into two stacks of 20 cards each. One stack represents $5 incentive and the other “do your best.”
  She computed the mean score for each group.
  She found the difference in means for the two groups by taking the average for the $5 incentive group minus the average for the “do your best” group.
She repeated the previous three steps 1000 times.
She graphed the 1000 statistics (differences in group means) she calculated from these trials to produce the following histogram:
Use this information to circle the best answer among the options to the following questions:
1. What is the motivation for the process the student followed?
a) This process allows her to compare her actual result to what could have happened by chance if gamers’ performances were not affected by whether they were asked to do their best or offered an incentive.
b) This process allows her to determine the percentage of time the $5 incentive strategy would outperform the “do your best" strategy for all possible scenarios.
c) This process allows her to determine how many times she needs to replicate the experiment for valid results.
d) This process allows her to determine whether the normal distribution fits the data.
2. Which of the following was assumed to be true in carrying out the simulation?
a) The $5 incentive is more effective than the “do your best” incentive for improving performance.
b) The $5 incentive and the “do your best” incentive are equally effective at improving performance.
c) The “do your best” incentive is more effective than a $5 incentive for improving performance.
d) Both (a) and (b) but not (c).
3. Which of the following could be the approximate p-value in this situation? Recall that the research question is “Does the $5 incentive improve performance?”
a) 0.501
b) 0.047
c) 0.022
d) 0.001
4. What does the histogram tell you about whether $5 incentives are effective in improving performance on the video game?
a) The $5 incentive is not effective because the distribution of differences generated is centered at zero.
b) The $5 incentive is effective because the distribution of differences generated is centered at zero.
c) The $5 incentive is not effective because the p-value is greater than 0.05.
d) The $5 incentive is effective because the p-value is less than 0.05.
5. Which of the following is the appropriate interpretation of the p-value?
a) The p-value is the probability that the $5 incentive is not really helpful.
b) The p-value is the probability that the $5 incentive is really helpful.
c) The p-value is the probability that she would get a result at least as extreme as the one she actually found, if the $5 incentive is really not helpful.
d) The p-value is the probability that a student wins on the video game.

Transferability (Final exam)
A consumer organization was concerned that a computer manufacturer was misleading customers by overstating the battery life of its laptop batteries (measured in hours). The manufacturer states the batteries will last 3 hours. To investigate this claim, researchers selected a random sample of 10 laptops from the current production process.
The same exact operations were performed on each laptop (watching a movie, writing a term paper, surfing the internet, etc.), recording the number of hours each battery lasted. Consider the data displayed on the lifetimes of the sample of 10 laptop batteries (in hours).
Suppose we want to investigate whether there is convincing evidence that the distribution of all battery lifetimes is skewed to the right. One possible statistic for measuring skewness is the ratio: (sample mean) / (sample median)
(a) Calculate the value of this statistic (sample mean divided by the sample median) for this sample.
(b) If the distribution of battery lifetimes in the population is symmetric, then we would expect this skewness ratio statistic to be LESS THAN / CLOSE TO / GREATER THAN (circle one) the value of ______ (fill in a number).
(c) If the distribution of battery lifetimes in the population is skewed to the right, then we would expect this skewness ratio statistic to be LESS THAN / CLOSE TO / GREATER THAN (circle one) the value of ______ (fill in a number).
Even when the battery lifetimes in a sample of 10 batteries are skewed to the right, it is still possible that the actual population distribution of battery lifetimes is symmetric, and that the skewness observed in the sample was due to random chance alone. To simulate the selection of random samples from a hypothetical population for which the null hypothesis is true, 100 random samples, each of size 10, were taken from a perfectly symmetrically distributed population with the same mean and standard deviation as the original sample. For each of those 100 samples, the statistic sample mean divided by the sample median was calculated. A dotplot of the 100 simulated statistics is shown below.
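The simulation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the symmetric population is taken to be normal (the text says only "perfectly symmetrically distributed"), the population mean and SD below are hypothetical placeholders because the battery data are not reproduced here, and the function names are made up for this sketch.

```python
import random
from statistics import mean, median


def skewness_ratio(sample):
    """Sample mean divided by sample median."""
    return mean(sample) / median(sample)


def simulate_null_ratios(pop_mean, pop_sd, n=10, reps=100, seed=0):
    """Draw `reps` samples of size `n` from a normal (symmetric) population
    and return the skewness ratio of each sample."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        ratios.append(skewness_ratio(sample))
    return ratios


# Hypothetical values standing in for the original sample's mean and SD (hours).
HYPOTHETICAL_MEAN = 3.0
HYPOTHETICAL_SD = 0.5

null_ratios = simulate_null_ratios(HYPOTHETICAL_MEAN, HYPOTHETICAL_SD)


def approximate_p_value(observed_ratio, simulated):
    """Proportion of simulated ratios at least as large as the observed ratio
    (one-sided, since right skew pushes the mean above the median)."""
    return sum(r >= observed_ratio for r in simulated) / len(simulated)
```

Comparing the observed ratio with these simulated values, for example through approximate_p_value, is what indicates whether the observed skewness is unusual under a symmetric population.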
(d) (2 pts) Select the best response below to finish this statement: “Based on this simulation output, our observed sample statistic…”
does not provide strong evidence that the actual population distribution is skewed to the right because this null distribution is symmetric.
provides strong evidence that the actual population distribution is skewed to the right because this null distribution is symmetric.
does not provide strong evidence that the actual population distribution is skewed to the right because this null distribution is centered around one.
provides strong evidence that the actual population distribution is skewed to the right because this null distribution is centered around one.
does not provide strong evidence that the actual population distribution is skewed to the right because this null distribution ranges from .95 to 1.05.
provides strong evidence that the actual population distribution is skewed to the right because this null distribution ranges from .95 to 1.05.

Paired Tests (adapted from Exploration 7.1: Rounding First Base)
Imagine you are at the plate in baseball and have hit a hard line drive. You want to try to stretch your hit from a single to a double. Does the path that you take to “round” first base make much of a difference? For example, is it better to take a “narrow angle” or a “wide angle” around first base?
1. Suppose 22 baseball players volunteered to participate in an experiment. Suppose also that you plan to assign a single angle, either wide or narrow, to each player. How would you decide which player ran with which base running angle?
A reasonable experimental design would be to randomly assign 11 of the 22 players to run with the wide angle and the other 11 to run with the narrow angle.
2. Some runners are faster than others. Explain how random assignment controls for this, so that speed is not likely to be a confounding variable in this study.
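The random assignment described in #1 could be carried out, for instance, along the following lines. This is a minimal sketch: the function name assign_angles and the use of player numbers 1 through 22 as labels are assumptions made for illustration.

```python
import random


def assign_angles(players, seed=None):
    """Randomly split the players into two equal-sized groups,
    one per base running angle."""
    rng = random.Random(seed)
    shuffled = list(players)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)      # every ordering is equally likely
    half = len(shuffled) // 2
    return {"wide": shuffled[:half], "narrow": shuffled[half:]}


# Label the 22 volunteers 1..22 and assign 11 to each angle.
groups = assign_angles(range(1, 23), seed=1)
print(groups["wide"])
print(groups["narrow"])
```

Because every player has the same chance of landing in either group, fast and slow runners tend to be spread across both groups, although any single assignment can still be unbalanced.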
Even though random assignment tends to balance out other variables (such as speed) between the two groups, there’s still a chance that most of the fast runners could be in one group and most of the slow runners in the other group. More importantly, there’s likely to be a good bit of variability in the runners’ speeds, and that variability would make it harder to spot a difference between the base running angles even if one angle really is superior to the other.
3. Suggest a different way of conducting the experiment, to make sure that speed is completely balanced between the two groups.
The data below are the times (seconds) for the first 10 of 22 runners in a matched-pairs experiment to study this question, and the graph shows dotplots of the times for all 22 runners.

Subject        1     2     3     4     5     6     7     8     9     10    …
narrow angle   5.50  5.70  5.60  5.50  5.85  5.55  5.40  5.50  5.15  5.80  …
wide angle     5.55  5.75  5.50  5.40  5.70  5.60  5.35  5.35  5.00  5.70  …

One thing to notice in these distributions is that there is a lot of overlap in the two distributions (narrow mean: 5.534 sec, wide mean: 5.459 sec) and a fair bit of variability (narrow SD: 0.260 sec, wide SD: 0.273 sec). The slowest times took over six seconds, whereas the fastest times were just under five seconds. As you saw in Chapter 6, it will be difficult to detect a difference between the two methods when there is so much natural variation in the running times.
However, these data are clearly paired. As discussed earlier, such data should not be treated as coming from two independent samples. Instead, we will focus our analysis on the difference in times for each runner between the two methods. Thus, we can define our parameter of interest to be µd = long-run mean difference in running speed when rounding first base with a wide angle versus a narrow angle in the population of interest.
4.
State the null and alternative hypotheses (using µd), to test whether the mean difference in running speeds between the two running angles is not 0.
5. Go to the Matched Pairs applet. The default dataset is the baseball running data. Comment on the distribution of differences as revealed in the dotplot.
6. Report the mean and SD of the differences, with appropriate symbols.
7. Describe how you could flip a coin repeatedly to simulate the randomness in the way this experiment was conducted.
8. The Matched Pairs applet will perform a simulation similar to what you could do by flipping a coin. Set the number of times to Randomize to 1 and click on Randomize. Once the coin tosses have determined which time will be in which column, the applet displays the re-randomized data (the colors show you the original column for each observation, so you should see a mix in each group now). The could-have-been value for the mean difference is added to the Average Difference graph.
a. How does the simulated mean difference compare to the observed mean difference? Circle one.
More extreme     Less extreme     Similar
b. How are you deciding?
9. Update the number of Randomizations to 99 (for a total of 100 repetitions), uncheck Animate, and press Randomize. Consider the graph “Average Difference” that the applet has created.
a. How many dots are in this graph?
b. What does each dot represent?
10. To see many more possible values of the mean difference that could have been, IF the two running angles were swappable, update the Number of Randomizations to 900, and press Randomize (for a total of 1000 repetitions). Describe the updated “Average Difference” graph with the 1000 samples or repetitions, with regard to the following features.
a. Shape:
b. About what number is this graph centered? Explain why you were expecting this.
c. This graph also reports a value for standard deviation, SD.
Report this value, and give a simple interpretation of this value. That is, what is this value measuring?
11. You now should have generated 1000 possible values of the mean difference in running times between the two angles, assuming the null hypothesis was true and these times were the same, on average. How does the observed mean difference from your data compare to these simulated values? Is an average difference in running times like that observed in the actual study unlikely to happen by chance alone, if the two angles are the same, on average? How are you deciding?
12. To quantify the strength of evidence against the null hypothesis, you can find the p-value. Go back to the Matched Pairs applet. In the Count Samples box, make an appropriate selection from the drop-down menu (Hint: In what direction does your alternative hypothesis look?), and enter the appropriate number in the box (Hint: At least as extreme as what number?). Report the approximate p-value.
13. Use the p-value to state a conclusion in the context of the study. Be sure to comment on statistical significance. Can you conclude that there is strong evidence the wide angle and the narrow angle differ? Why or why not? Can you draw a cause-and-effect conclusion? To what population are you willing to generalize the results?
14. Alternatively, you can summarize the strength of evidence using a standardized statistic. Find the standardized statistic and confirm that the strength of evidence you receive from the p-value is approximately the same as with the standardized statistic.
15. We can again use the 2SD Method to approximate a 95% confidence interval for the mean difference in running angles. The overall structure of the formula is the same: estimate ± 2(SD), where the estimate is the sample mean difference in narrow angle minus wide angle and SD is the standard deviation of your null distribution when you did 1000 repetitions in the applet.
Use these numbers to find an approximate 95% confidence interval for the population mean difference in running times among all baseball players who could use these two running angles.
16. Provide an interpretation of this confidence interval, being sure to explain the parameter in this context.
17. Describe what would change in your analysis if you had calculated differences in running times by subtracting in the other order.

Randomization Tests (comparing many groups – categorical response) (adapted from Exploration 8.1: Recruiting Organ Donors)
In Example P.1 we considered a study that investigated how to encourage people to be more likely to become organ donors. At the time, we did not formally analyze the data from the study. Now we are studying methods for comparing three groups, so we can further analyze the data and draw inferential conclusions from this study.
Recall that researchers asked volunteer subjects to imagine that they had moved to a new state, applied for a driver’s license, and needed to decide whether or not to become an organ donor. Researchers created three different versions of the survey, each with a different donor recruiting strategy. Subjects were randomly assigned to one of the three types of recruiting strategies:
The default option is to be an organ donor, and individuals have to “opt-out” of organ donation (Opt-out group)
The default option is to not be an organ donor, and individuals have to “opt-in” to organ donation (Opt-in group)
There is no default option, and individuals have to choose whether or not they will be an organ donor (Neutral group)
1. Write the appropriate null and alternative hypotheses in words or in symbols.
2. Researchers found that 23 of 55 subjects in the opt-in group chose to be organ donors, 41 of 50 in the opt-out group chose to be organ donors, and 44 of 56 in the neutral group chose to be organ donors.
To see these data in the Multiple Proportions applet, open the OrganDonor data file and paste the two columns (with column names) into the applet and press Use Data. (Or enter/paste in the two-way table of counts in the Sample data window and press Use Table.)
a. Check the Show table box to have the applet calculate the conditional proportions who agreed to become an organ donor across the three groups and produce a segmented bar graph. What are the three proportions?
b. Comment on what your calculations and graph reveal about whether the default option used in the question appears to affect how likely the person is to become an organ donor.
We see some differences in the sample proportions of individuals who chose to be organ donors across the three treatments, but are these differences large enough to be statistically significant? In other words, do these data provide strong evidence of an association between donor recruiting strategy (opt-in, opt-out, neutral) and choosing to become an organ donor in the population?
3. Suggest two possible explanations for how these observed proportions could have turned out to be as different from each other as they are. (Hint: Think about what our null and alternative hypotheses are.)
Applying the 3S Strategy
To investigate we will see how we can apply the 3S Strategy to these data. Because researchers are attempting to minimize (or at least control) the probability of Type I Errors, in cases with multiple groups they first do an overall test (see the Unit 3 introduction for more discussion on this). However, this requires computing a single statistic and considering the null distribution of that statistic. Let’s look at this approach now.
1. Statistic
4. a. Propose a formula for a statistic that could be used to measure how different the three sample proportions are.
(Hint: The key here is that the statistic needs to provide a single number that somehow focuses on differences between/among the observed sample proportions.)
b. For the statistic you propose, would a large value or a small value give evidence for the alternative hypothesis that at least one proportion is different in the population?

When we compared two proportions in Chapter 5, we used the difference in two proportions as the statistic. In this chapter, though, there are multiple proportions. There are actually several reasonable ways of summarizing the differences among the groups in one number. Once you have settled on a statistic, you apply the 3S Strategy as before: simulate the distribution of the statistic under the null hypothesis and then see where the observed value of the statistic falls in that distribution.

5. One possible statistic is max p̂ – min p̂. Review the conditional proportions you determined in #2a and calculate the observed value of this statistic for these data. Why might this be a less than satisfactory choice?

This statistic uses only the largest and smallest proportions, so it ignores some of the groups. Another reasonable statistic to calculate is the mean of the absolute values of the differences for each pair. (We could call this statistic “MAD” for mean of absolute differences.)

6. Let’s construct the MAD statistic for these data by going through the following steps.
a. Review the conditional proportions of organ donors you determined in #2a and calculate the differences in these proportions for each pair:
Opt-out minus opt-in:
Opt-out minus neutral:
Opt-in minus neutral:
b. Calculate the mean of the absolute values of these differences.

So which statistic should you use? And how can you use the statistic to estimate a p-value? We’ll consider those questions in the following sections.

2.
Simulation In Chapter 5 we first saw how we can simulate a null hypothesis of no association by shuffling the response variable outcomes across the explanatory variable groups. This models the random assignment process used in the data collection process, and, assuming the null hypothesis is true, “breaks any potential association” between the response and explanatory variables. We modeled this with playing cards in Chapter 5. 7. Describe how you would model a simulation of the null hypothesis with red and black playing cards now that there are three explanatory variable groups instead of two. Make sure to tell how many black cards you would need and how many red ones. Also tell what each color of cards would represent. How many cards would you deal out to each group? What would you calculate after dealing them out? 8. Now let’s use the Multiple Proportions applet to simulate a null distribution of these MAD statistics. This null distribution is again used to measure the strength of evidence provided by the observed statistic against the null hypothesis. June 27, 2014 MAA PREP workshop 91 a. Now check Show Shuffle Options, and leave Number of Shuffles at 1. Press Shuffle Response to perform one shuffle of the response variable values. You should see the shuffled response variable, the new two-way table for the simulated data, and the value of the simulated statistic (Most Recent Shuffled MAD) which is also placed in blue on the graph on the right. Select the Plot radio button to see the shuffled segmented bar graph. How does the distribution across the groups for the shuffled data compare to the original data? Is the simulated MAD statistic value closer to zero than the observed value of the MAD statistic? b. Now enter 999 for the Number of Shuffles, and press Shuffle Responses. This repeats the shuffling of the response variable 999 more times for a total of 1000 repetitions. You should see a graph of a null distribution of the MAD statistics. 
What is the shape of this distribution? Why is it not a bell-shaped curve centered at zero? 3. Strength of Evidence 9. To estimate a p-value, determine how often the observed value of the statistic, or something even larger, occurred in the null distribution. (Hint: Enter the observed value for the MAD statistic from the research study in the Count Samples box and press Count.) 10. Based on the p-value, summarize the strength of evidence that the sample data provide against the null hypothesis. June 27, 2014 MAA PREP workshop 92 Randomization Tests Comparing Many Groups on a Quantitative Response Research Question: An investigation of strategies that may affect memory scores. Data collection: You will be asked to study a sequence of letters for 20 seconds and then to write down as many as you can remember, in order. Your score will be the number that you remember correctly before your first error of any kind. 1) Note your score here: ____________ 2) Note your treatment group here: ______________ (the instructor will tell you the answer to this) 3) On the instructor’s computer, add your score to the appropriate column. Sketch the dotplots of the class data here. Make sure to draw a rough sketch of the three graphs in the space below, with labeled axes. 4) Did all the students in the study get the same memory score? YES NO What are a few reasons why the memory scores vary? 5) Record the averages (means) and SDs for each group. Record these below. Group Sample size Sample average Sample SD 6) Do the graphical and numerical summaries suggest that the three “treatment” groups tended to produce different responses? Explain. June 27, 2014 MAA PREP workshop 93 7) Write out the null and alternative hypotheses of what we are trying to test about memory in this experiment. Null hypothesis, H0: Alternative hypothesis, Ha: Drawing Inference Once again, we arrive at the question: how surprising are the study results to have happened by chance alone? 
And, we can answer that question by finding a p-value. Let us now use the 3S Strategy to find a p-value in the context of our study. That is, let us find the probability of seeing the kind of differences we are seeing between the three groups or something even more extreme, if the grouping of letters had no effect on memory scores.

8) First though, how can we measure or quantify the observed difference between the three groups? Suggest one or two ways to do this.

In answering #8, you have identified a few possible statistics for the “statistic” step of the 3S Strategy. One option is to use the following statistic to quantify the observed difference in mean scores between the three groups, and call it the MAD statistic:

$$\text{MAD} = \text{Mean Absolute Difference} = \frac{|\text{avg}_1 - \text{avg}_2| + |\text{avg}_1 - \text{avg}_3| + |\text{avg}_2 - \text{avg}_3|}{3}$$

Using the 3S Strategy: Statistic

9) For our data, find the numeric value for the MAD statistic.
10) What kind of numeric values would you expect the MAD statistic to take if the type of grouping of letters had no effect on memory scores? Why?

Recall that the issue of Statistical Significance asks whether the observed results would be surprising to occur by random chance alone, if there were really no effect/difference between the groups. In this case, we are exploring whether the observed MAD statistic value is statistically significant by investigating how often a MAD value as large as ours would occur by random chance alone, if there really were no effect/difference between the groups.

11) How was randomness (or random chance) used in this study: random sampling or random assignment?
12) Is it possible that even with the random assignment, we ended up with three groups where the good memorizers ended up in one group, and the not-so-good ones in the others?
13) Is it possible that the grouping of letters has absolutely no effect on memory score, and the difference observed here in our experiment is merely the result of an unlucky random assignment?

Note that the more interesting question is, if the grouping of letters has absolutely no effect on memory score, what is the probability that an unlucky random assignment alone leads to the difference observed between groups in our study (as measured by the MAD statistic) or an even greater difference?

We investigate this issue by assuming that all the subjects would have obtained the same memory score regardless of which group they had been in. We then repeat the random assignment process over and over, and calculate a could-have-been MAD value for each repetition. Finally, we note how surprising the observed MAD value from our study would be in the long run, if the type of grouping of letters had no effect on memory scores.

Of course, as we did in Chapter 6, we could shuffle index cards with the scores on them to produce several possible values of the MAD statistic that could have happened by random chance (that is, random assignment) if the type of grouping of letters had no effect on memory scores. But we will go directly to an applet from here.

NOTE: In the interest of time, we have presented the results from data collected previously.

14) The data collected were copied and pasted into the Multiple Means applet. Pressing Use My Data on the applet produces the following output:
Thus, the observed MAD statistic is: ____________________

Simulate

15) Similar to how the Multiple Proportions applet was used, we used the Multiple Means applet to produce the following randomization-based null distribution of the MAD statistic. With regard to the randomization-based null distribution of the MAD statistic:
a. Is the null distribution centered at zero? Explain why your answer makes sense.
b.
Describe the shape of the null distribution. Is it symmetric or skewed? If it is skewed, in which direction is it skewed? Explain why your answer makes sense. Strength of evidence 16) Indicate the region of the null distribution that denotes the p-value. How did you decide where to look? Did the center and shape of the null distribution have a bearing on how you decided about what region of the null distribution denoted the p-value? 17) The (approximate) p-value was calculated to be: 0.570. Write a conclusion for the result of your hypothesis test in the context of this problem. In your conclusion, be sure to address the following: Can we conclude that type of letter grouping affects average memory scores? Why or why not? To whom can we say that these results are applicable? How do you decide? June 27, 2014 MAA PREP workshop 97 June 27, 2014 MAA PREP workshop 98 Correlation Test (adapted from Exploration 10.2: Draft Lottery) In 1970, the United States Selective Service conducted a lottery to decide which young men would be drafted into the armed forces (Fienberg, 1971). Each of the 366 birthdays in a year (including February 29) was assigned a draft number. Young men born on days assigned low draft numbers were drafted. We will regard the 366 dates of the year as observational units. We will consider two variables recorded on each date: draft number assigned to the date, and sequential date in the year (so January 31 is sequential date 31, February 1 is sequential date 32, and so on). 1. In a perfectly fair, random lottery, what should be the value of the correlation coefficient between draft number and sequential date of birthday? The graph below displays a scatterplot of the assigned draft numbers and the sequential dates. There are 366 dots, one for each day of the (leap) year. 2. a. Does the scatterplot reveal much of an association between draft number and sequential date? b. Based on the scatterplot, guess the value of the correlation coefficient. c. 
Does it appear that this was a fair, random lottery? It’s difficult to see much of a pattern or association in the scatterplot, so it seems reasonable to conclude that this was a fair, random lottery with a correlation coefficient near zero. But let’s dig a little deeper … June 27, 2014 MAA PREP workshop 99 3. The 1970 Draft Lottery data sheet (at the end of the exploration) shows the draft number assigned to each of the 366 birthdays. a. Find and report the draft number assigned to your birthday. b. Is your draft number in the bottom third (1-122), middle third (123-244), or top third (245-366)? 4. The second table at the end of the exploration has ordered the draft numbers within each month. a. Use this table to determine the median draft number for your birth month. b. Collaborate with your classmates to determine and report the median draft number for all twelve months. Month Median draft number Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec c. Do you notice any pattern or trend in the median draft numbers over the course of the year? (Hint: If you do not see a trend, compare the six medians from Jan – June with the 6 medians from July – Dec.) 5. The correlation coefficient for these data is r = -0.226. What does this number reveal? Is it consistent with the scatterplot? 6. a. Suggest two possible explanations (hypotheses) which could have generated the value of the observed correlation coefficient. June 27, 2014 MAA PREP workshop 100 b. In your own words, how could we go about determining whether random chance is a plausible explanation for the observed correlation value between sequential date and draft number? Explain how the 3S strategy could be applied here, in particular identifying a simulation strategy you could conduct “by hand.” Note: You do not need to actually carry out a simulation analysis. i. What is the statistic? ii. How would you simulate? iii. How would you evaluate the strength of evidence? 
The null hypothesis to be tested is that the lottery was conducted with a fair, random process. The null hypothesis would therefore mean that there is no association between sequential date and draft number for this process. The alternative hypothesis is that this lottery was not conducted with a fair, random process, so there is an association between sequential date and draft number. How can we assess whether the observed correlation coefficient of r = -0.226 is far enough from zero to provide convincing evidence that the lottery process was not random? Like always, we ask how unlikely it would be for a fair, random lottery to produce a correlation value as far from zero as -0.226. Also like always, we answer that question by simulating a large number of fair random lotteries, calculating the correlation coefficient for each one, and seeing how often we obtain a correlation coefficient as or more extreme (as far from zero) as -0.226. 7. Open the Corr/Regression applet. Copy the data from DraftLottery into the applet (remember to include the column titles). a. Check the Correlation Coefficient box and confirm that the correlation coefficient is -0.226. b. Check the Show Shuffle Options box and select the Correlation radio button to the right to keep track of that statistic. Then press Shuffle Y-values to simulate one fair, random lottery. (Select the Plot radio button to see the corresponding scatterplot.) Record the value of the correlation coefficient between the shuffled draft numbers and sequential date (in blue). c. Press Shuffle Y-values four more times to generate results of four more fair, random lotteries. Record the values of the correlation coefficients in the table below. June 27, 2014 MAA PREP workshop 101 Repetition Correlation coefficient 1 2 3 4 5 d. Did any of these simulated statistics from fair, random lotteries produce a correlation coefficient as extreme (far from zero) as the observed -0.226? e. 
Change the Number of Shuffles from 1 to 995 and press Shuffle Y-values to simulate 995 more fair, random lotteries. Look at the null distribution of these 1000 correlation coefficients. Where is this distribution centered? Why does this make sense? f. Use Count Samples pull-down menu to select Beyond. Specify the observed correlation coefficient (-.226) and press Count. What proportion of the 1000 simulated random lotteries produced a correlation coefficient at least as extreme (as far from zero in either direction) as -0.226? Report the approximate p-value. g. Interpret this p-value: This is the probability of what, assuming what? h. What conclusion would you draw from this p-value? Do you have strong evidence that the 1970 draft lottery was not conducted with a fair, random process? Explain the reasoning behind your conclusion. Once they saw these results, statisticians were quick to point out that something fishy happened with the 1970 draft lottery. The irregularity can be attributed to improper mixing of the balls used in the lottery drawing process. (Balls with birthdays early in the year were placed in the bin first, and balls with birthdays late in the year were placed in the bin last. Without thorough mixing, balls with birthdays late in the year settled near the top of the bin and so tended to be selected earlier.) The mixing process was changed for the 1971 draft lottery (e.g., two bins, one for the draft numbers and one for the birthdays), for which the correlation coefficient turned out to be r = 0.014. 8. Use your simulation results to approximate the p-value for the 1971 draft lottery. Is there any reason to suspect that this 1971 draft lottery was not conducted with a fair, random process? Explain the reasoning behind your conclusion. Also explain why you don’t need to paste in the data from the 1971 lottery first. 
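The shuffling carried out by the Corr/Regression applet in questions 7 and 8 can be sketched in a few lines of Python. This is only an illustrative sketch, not the applet's actual code: the function names `correlation` and `shuffle_p_value` and the fixed seed are assumptions made here. It relies on the idea stated above that, under the null hypothesis, a fair lottery is simply a random ordering of the draft numbers 1 through 366 across the 366 sequential dates.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def shuffle_p_value(xs, ys, observed_r, shuffles=1000, seed=0):
    """Approximate two-sided p-value: the proportion of shuffles whose
    correlation is at least as far from zero as observed_r."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    shuffled = list(ys)
    extreme = 0
    for _ in range(shuffles):
        rng.shuffle(shuffled)  # one simulated fair, random lottery
        if abs(correlation(xs, shuffled)) >= abs(observed_r):
            extreme += 1
    return extreme / shuffles


# Sequential dates 1..366 and the draft numbers 1..366. The starting order of
# the draft numbers does not matter, because they are shuffled before each use.
dates = list(range(1, 367))
draft_numbers = list(range(1, 367))

p_1970 = shuffle_p_value(dates, draft_numbers, observed_r=-0.226)
p_1971 = shuffle_p_value(dates, draft_numbers, observed_r=0.014)
print("approximate p-value, 1970 lottery:", p_1970)
print("approximate p-value, 1971 lottery:", p_1971)
```

Because every shuffle is some ordering of the same 366 draft numbers, the same simulated null distribution can be compared against either observed correlation; only the value passed as `observed_r` changes.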
1970 Draft Lottery Data

Each row lists the draft numbers assigned to the dates 1, 2, 3, ... of that month, in calendar order.

date 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Jan 305 159 251 215 101 224 306 199 194 325 329 221 318 238 17 121 235 140 58 280 186 337 118 59 52 92 355 77 349 164 211
Feb 86 144 297 210 214 347 91 181 338 216 150 68 152 4 89 212 189 292 25 302 363 290 57 236 179 365 205 299 285
Mar 108 29 267 275 293 139 122 213 317 323 136 300 259 354 169 166 33 332 200 239 334 265 256 258 343 170 268 223 362 217 30
Apr 32 271 83 81 269 253 147 312 219 218 14 346 124 231 273 148 260 90 336 345 62 316 252 2 351 340 74 262 191 208
May 330 298 40 276 364 155 35 321 197 65 37 133 295 178 130 55 112 278 75 183 250 326 319 31 361 357 296 308 226 103 313
Jun 249 228 301 20 28 110 85 366 335 206 134 272 69 356 180 274 73 341 104 360 60 247 109 358 137 22 64 222 353 209
Jul 93 350 115 279 188 327 50 13 277 284 248 15 42 331 322 120 98 190 227 187 27 153 172 23 67 303 289 88 270 287 193
Aug 111 45 261 145 54 114 168 48 106 21 324 142 307 198 102 44 154 141 311 344 291 339 116 36 286 245 352 167 61 333 11
Sep 225 161 49 232 82 6 8 184 263 71 158 242 175 1 113 207 255 246 177 63 204 160 119 195 149 18 233 257 151 315
Oct 359 125 244 202 24 87 234 283 342 220 237 72 138 294 171 254 288 5 241 192 243 117 201 196 176 7 264 94 229 38 79
Nov 19 34 348 266 310 76 51 97 80 282 46 66 126 127 131 107 143 146 203 185 156 9 182 230 132 309 47 281 99 174
Dec 129 328 157 165 56 10 12 105 43 41 39 314 163 26 320 96 304 128 240 135 70 53 162 95 84 173 78 123 16 3 100

The following table arranges the draft numbers in order for each month:

rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Jan 17 52 58 59 77 92 101 118 121 140 159 164 186 194 199 211 215 221 224 235 238 251 280 305 306 318 325 329 337 349 355
Feb 4 25 57 68 86 89 91 144 150 152 179 181 189 205 210 212 214 216 236 285 290 292 297 299 302 338 347 363 365
Mar 29 30 33 108 122 136 139 166 169 170 200 213 217 223 239 256 258 259 265 267 268 275 293 300 317 323 332 334 343 354 362
Apr 2 14 32 62 74 81 83 90 124 147 148 191 208 218 219 231 252 253 260 262 269 271 273 312 316 336 340 345 346 351
May 31 35 37 40 55 65 75 103 112 130 133 155 178 183 197 226 250 276 278 295 296 298 308 313 319 321 326 330 357 361 364
Jun 20 22 28 60 64 69 73 85 104 109 110 134 137 180 206 209 222 228 247 249 272 274 301 335 341 353 356 358 360 366
Jul 13 15 23 27 42 50 67 88 93 98 115 120 153 172 187 188 190 193 227 248 270 277 279 284 287 289 303 322 327 331 350
Aug 11 21 36 44 45 48 54 61 102 106 111 114 116 141 142 145 154 167 168 198 245 261 286 291 307 311 324 333 339 344 352
Sep 1 6 8 18 49 63 71 82 113 119 149 151 158 160 161 175 177 184 195 204 207 225 232 233 242 246 255 257 263 315
Oct 5 7 24 38 72 79 87 94 117 125 138 171 176 192 196 201 202 220 229 234 237 241 243 244 254 264 283 288 294 342 359
Nov 9 19 34 46 47 51 66 76 80 97 99 107 126 127 131 132 143 146 156 174 182 185 203 230 266 281 282 309 310 348
Dec 3 10 12 16 26 39 41 43 53 56 70 78 84 95 96 100 105 123 128 129 135 157 162 163 165 173 240 304 314 320 328

FAQ 10.3.1 Why least squares?

Buzz: Why least squares? Why do we square the deviations? Why not absolute values?
Doris: Not only is your question a thoughtful one, it’s also a question that invites us back in history. In the 1700s, scientists tried to do what you suggest – to minimize the sum of absolute deviations: they wanted to use least MAD (deviation) instead of least sum of squared deviations. The scientific logic was good, but the math turned out to be too messy. Sometimes it worked, and there was a single “best” line that minimized the MAD statistic. But sometimes there would be infinitely many “best” lines, with no way to choose from among them. “Best” and “infinitely many” don’t go together. Even when there was a single best line, it was devilishly hard to find.
Overall, the math was just too hard.
Buzz: So least squares is pretty much just the fallback Plan B after least absolute deviations failed?
Doris: Yes. The justification for least absolute deviations makes good sense, but the theory is much simpler for least squares. As Fred Mosteller, one of the greatest statisticians of the last century, used to tell his students at Harvard, “Mathematicians love calculus, and calculus loves squares.”
Buzz: Does the choice – absolute values or squares – make much difference in the line you get?
Doris: Another good question. Sometimes not, but sometimes it does.
Buzz: Say more.
Doris: Remember SD and IQR and outliers?
Buzz: You mean SD is sensitive to outliers and IQR is resistant?
Doris: Exactly. Remember why?
Buzz: Because the SD squares the deviations, and the IQR does not. Squaring gives extra weight to large deviations.
Doris: Apply the same logic to the least squares line.
Buzz: Outliers have a big effect because their large deviations get extra weight from the squaring.
Doris: Bravo! This is why some observations can become influential points – the least squares line won’t let their residuals get too large. If this were second grade in the 1950s you’d get to stay after class and dust the erasers. If this were the 1970s and you were a statistician, you’d be able to get a big research grant from the National Science Foundation. If this were Dr. Bastian’s lab and you were a dolphin, you’d have just earned a fish.

Regression (adapted from Exploration 10.3: Predicting Height from Footprints)

Can a footprint taken at the scene of a crime help to predict the height of the criminal? In other words, is there an association between height and foot length? To investigate this question, a sample of 20 statistics students measured their height (in inches) and their foot length (in centimeters). The sample data for the 20 students appear in the Correlation/Regression applet.

1.
Open this applet, and look at the scatterplot for the data already there. Check the Show Movable Line box to add a blue line to the scatterplot. If you place your mouse over one of the green squares at the ends of the line and drag, you can change the slope of the line and move it. You can also use the mouse to move the green dot up and down vertically to change the intercept of the line.
a. Move the line until you believe your line “best” summarizes the relationship between height and foot length for these data. Write down the resulting equation for your line (using traditional statistical notation).
b. Why do you believe that your line is “best?”
c. Did all students in your class obtain the same line/equation?
d. How can we decide whether your line provides a better fit to the data than other students’ lines? Suggest a criterion for deciding which line “best” summarizes the relationship.

One way to draw the best fit line is to minimize the vertical distances of the points from the line (these distances are called residuals).

Key idea: A residual is the difference between an observed response and the corresponding prediction made by the least squares regression line (residual = observed – predicted). Thus, negative residuals occur when points are below the best fit line and positive residuals occur when points are above the best fit line.

2. Check the Show Residuals box to visually represent these residuals for your line on the scatterplot. The applet also reports the sum of the absolute values of the residuals (SAE). SAE stands for “Sum of the Absolute Errors.” The acronym indicates that we need to make residuals positive before we add them up and that sometimes people call residuals “errors.”
Record the SAE value for your line _____________
What is the best (lowest) SAE in the class?_________________

3. It turns out that a more common criterion for determining the “best” line is to instead look at the sum of the squared residuals (SSE).
This approach is similar to adding up the absolute residuals, but is even stricter about not letting individual residuals get too large. Check the Show Squared Residuals box to visually represent the squared residual for each observation. Note that we can visually represent the squared residual as the area of a square where each side of the square has length equal to the residual.
a. What is the SSE (sum of squared residuals) for your line?____________
What is the best (lowest) SSE in the class?________
b. Now continue to adjust your line until you think you have minimized the sum of the squared residuals.
Report your new equation_________________________________
Report your new SSE value_____________________________
What is the best SSE in the class?_________________________

Key Idea: The least squares regression line minimizes the sum of squared residuals.

4. Now check the Show Regression Line box to determine and display the equation for the line that actually does minimize (as can be shown using calculus) the sum of the squared residuals.
a. Record the equation of the least squares regression line by indicating the appropriate slope and intercept of the line. Note that we’ve used variable names in the equation, not generic x and y. We have also put a caret (“hat”) over the y variable name to emphasize that the line gives predicted values of the y (response) variable.

$\widehat{\mathit{Height}} = \underline{\hspace{1.5cm}} + \underline{\hspace{1.5cm}}\,(\mathit{FootLength})$

Notation: The equation of the best fit line is written as $\hat{y} = a + b(x)$ where
$a$ is the y-intercept
$b$ is the slope
$x$ is a value of the explanatory variable
$\hat{y}$ is the predicted value for the response variable

b. Did everyone in your class obtain the same equation?

5. Uncheck the Show Movable Line box to remove it from the display and check the Move observations box.
a. Now click on one student’s point in the scatterplot and drag the point up and down (changing the height, without changing the foot length).
Does the regression line change much as you change this student’s height?
b. Repeat the previous question, using a student with a very small x (foot size) value and then a point with an x value near the middle and then a point with a very large x value. Which of these seem(s) to have the most influence on the least squares regression line? Explain.

Key idea: An observation or set of observations is considered influential if removing the observation from the dataset substantially changes the values of the correlation coefficient and/or the least squares regression equation. Typically, observations that have extreme explanatory variable values (far below or far above $\bar{x}$) are potentially influential. They may not have large residuals, having pulled the line close to them.

Residuals also help us measure how accurate our predictions are from using the regression line. In particular, we can compare the “prediction errors” from the regression line to the prediction errors if we made no use of the explanatory variable.

6. Press the Revert button to reload the original data set. Recheck the Show Movable Line box to redisplay the blue line. Notice that this line is flat at the mean of the y (height) values.
a. Check the Show Squared Residuals box (under the Movable Line information) to determine the SSE if we were to use the average height ($\bar{y}$) as the predicted value for every x (foot size). Record this value.
b. What is the slope of this line?
c. If the slope of the best fit line is zero, our data shows __________________ (positive/negative/no) association between the explanatory and response variables.

7. Now compare this to the SSE value from the regression line. Which is smaller? Why does that make sense?

Coefficient of Determination ($r^2$)

8. A quantity related to the correlation coefficient is called the coefficient of determination, or r-squared ($r^2$).
a.
To measure how much the regression line decreases the “unexplained variability” in the response variable, we can calculate the percentage reduction in SSE.

$$\frac{SSE(\bar{y}) - SSE(\text{regression line})}{SSE(\bar{y})} \times 100\% = \underline{\hspace{2cm}}$$

Key Idea: The coefficient of determination indicates the percentage of variation in the response variable (height) that is explained by changes in the explanatory variable (foot length).

b. Complete the following statement: The coefficient of determination is ______________ percent; this means that _____________ percent of the variation in people’s ______________________ is attributable to changes in their ______________________________.

If we think of the flat line at $\bar{y}$, whose sum of squared residuals is $SSE(\bar{y})$, as the “worst” line (no association), then we are seeing how much better the actual line is than this worst line. The best possible line would get rid of all of the residuals, and would have an r-squared of 100%.

c. Find the value of the correlation coefficient using the applet, and confirm that when you square the correlation coefficient you get the same number as the applet reports for the coefficient of determination.
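The quantities used in this exploration (residuals, SAE, SSE, the least squares line, and the coefficient of determination) can be checked with a short Python sketch. The foot length and height values below are made up for illustration and are not the 20 students' measurements from the applet; the helper names (`least_squares`, `residuals`, `sae`, `sse`) are likewise assumptions of this sketch. The slope and intercept use the standard closed-form least squares formulas.

```python
# Hypothetical (made-up) data: foot length in cm, height in inches.
# These are NOT the measurements of the 20 students in the applet.
foot = [24.0, 25.5, 26.0, 27.0, 28.5, 29.0, 30.0, 31.5]
height = [62.0, 64.0, 66.5, 66.0, 69.0, 71.0, 70.5, 74.0]


def mean(values):
    return sum(values) / len(values)


def least_squares(xs, ys):
    """Return (intercept a, slope b) of the line that minimizes the SSE."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def residuals(xs, ys, a, b):
    """Residual = observed - predicted, for the line y-hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def sae(xs, ys, a, b):
    """Sum of the absolute errors (absolute residuals)."""
    return sum(abs(e) for e in residuals(xs, ys, a, b))


def sse(xs, ys, a, b):
    """Sum of the squared errors (squared residuals)."""
    return sum(e ** 2 for e in residuals(xs, ys, a, b))


a, b = least_squares(foot, height)
sse_line = sse(foot, height, a, b)
sse_flat = sse(foot, height, mean(height), 0.0)  # flat line at y-bar

# Percentage reduction in SSE, expressed here as a proportion.
r_squared = (sse_flat - sse_line) / sse_flat

# Correlation coefficient, computed directly for comparison with r_squared.
x_bar, y_bar = mean(foot), mean(height)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(foot, height))
sxx = sum((x - x_bar) ** 2 for x in foot)
syy = sum((y - y_bar) ** 2 for y in height)
r = sxy / (sxx * syy) ** 0.5

print("intercept a:", a, "slope b:", b)
print("SAE:", sae(foot, height, a, b), "SSE:", sse_line)
print("SSE of flat line at y-bar:", sse_flat)
print("r squared from SSE reduction:", r_squared, "r * r:", r * r)
```

Nudging `a` or `b` away from the values returned by `least_squares` and recomputing `sse` mimics dragging the movable line in the applet: the SSE should only go up.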