Preliminaries: Introduction to Statistical Investigations

Have you ever heard statements like these?

“Don’t get your child vaccinated. I vaccinated my child and now he is autistic.”
“I will never start jogging because my friend’s dad jogged his whole life but he died at age 46 of a heart attack.”
“Teenagers shouldn’t be allowed to drive. Just last year there was a terrible accident at our high school.”

The people making these statements are using anecdotal evidence (personal observations or striking examples) to support broad conclusions. The first person concludes that vaccinations cause autism, based solely on her own child. The second concludes that running is too risky and could cause heart attacks, based entirely on the experience of one acquaintance. The third person also judges risk based on a single striking incident.

Scientific conclusions cannot be based on anecdotal evidence. Science requires evidence from data. Statistics is the science of producing useful data to address a research question, analyzing the resulting data, and drawing appropriate conclusions from the data. For example, suppose you are running for a student government office and have two different campaign slogans in mind. You’re curious about whether your fellow students would react more positively to one slogan than the other. Would you ask only for your roommate’s opinion, or several of your friends? Or could you conduct a more systematic study? What might that look like? The study of Statistics will help you see how to design and carry out such a study, and you will see how Statistics can also help to answer many important research questions from a wide variety of fields of application.

Example P.1: Organ Donations

Organ donations save lives.
But recruiting organ donors is difficult, even though surveys show that about 85% of Americans approve of organ donations in principle and many states offer a simple organ donor registration process when people apply for a driver’s license. However, only about 38% of licensed drivers in the United States are registered to be organ donors. Some people prefer not to make an active decision about organ donation because the topic can be unpleasant to think about. But perhaps phrasing the question differently could affect people’s willingness to become a donor?

Johnson and Goldstein (2003) recruited 161 participants for a study, published in the journal Science, to address this question of organ donor recruitment. The participants were asked to imagine they had moved to a new state and were applying for a driver’s license. As part of this application, the participants were to decide whether or not to become an organ donor. What differed was the default option that the participants were presented: Some of the participants were forced to make a choice of becoming a donor or not, without being given a default option (the “neutral” group). Other participants were told that the default option was not to be a donor but that they could choose to become a donor if they wished (the “opt-in” group). The remaining participants were told that the default option was to be a donor but that they could choose not to become a donor if they wished (the “opt-out” group).

What did the researchers find? Those given the “opt-in” strategy were much less likely to agree to become donors. Consequently, policy makers have argued that we should employ an “opt-out” strategy instead. Individuals can still choose not to donate, but would have to do so more actively rather than accept the default.

© Fall 2013, Tintle et al.; to be published by Wiley and Sons, not to be modified without permission
Based on their results, Johnson and Goldstein stated that their data “suggest changes in defaults could increase donations in the United States of additional thousands of donors a year.” In fact, as of 2010, 24 European countries had some form of the opt-out system (which some call “presumed consent”), with Spain, Austria, and Belgium yielding high donor rates. Why were Johnson and Goldstein able to make such a strong recommendation? Because rather than relying on their own opinions or on anecdotal evidence, they conducted a carefully planned study of the issue using sound principles of science and statistics. Similar to the scientific method, we now identify six steps of a statistical investigation.

Six Steps of a Statistical Investigation

Step 1: Ask a research question that can be addressed by collecting data. These questions often involve comparing groups, asking whether something affects something else, or assessing people’s opinions.
Step 2: Design a study and collect data. This involves selecting the people or objects to be studied and deciding how to gather relevant data on them.
Step 3: Explore the data, looking for patterns related to your research question as well as unexpected outcomes that might point to additional questions to pursue.
Step 4: Draw inferences beyond the data by determining whether any findings in your data reflect a genuine tendency and estimating the size of that tendency.
Step 5: Formulate conclusions that consider the scope of the inference made in Step 4. To what underlying process or larger group can these conclusions be generalized? Is a cause-and-effect conclusion warranted?
Step 6: Look back and ahead to point out limitations of the study and suggest new studies that could be performed to build on the findings of the study.

Let’s see how the organ donation study followed these steps.

Step 1: Ask a research question.
The general question here is whether a method can be found to increase the likelihood that a person agrees to become an organ donor. This question was then sharpened into a more focused one: Does the default option presented to driver’s license applicants influence the likelihood of someone becoming an organ donor?

Step 2: Design a study and collect data.

The researchers decided to recruit various participants and ask them to pretend to apply for a new driver’s license. The participants did not know in advance that different options were given for the donor question, or even that this issue was the main focus of the study. These researchers recruited participants for their study through various general interest bulletin boards on the internet. They offered an incentive of $4.00 for completing an online survey. After the results were collected, the researchers removed data arising from multiple responses from the same IP address, surveys completed in less than five seconds, and respondents whose residential address could not be verified.

Step 3: Explore the data.

The results of this study were: 44 of the 56 participants in the neutral group agreed to become organ donors, 23 of 55 participants in the opt-in group agreed to become organ donors, and 41 of 50 participants in the opt-out group agreed to become organ donors. The proportions who agreed to become organ donors are 44/56 ≈ .786 (or 78.6%) for the neutral group, 23/55 ≈ .418 (or 41.8%) for the opt-in group, and 41/50 = .820 (or 82.0%) for the opt-out group. The Science article displayed a graph of these data similar to Figure P.1.
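Each of these proportions is simply the number of agreements divided by the group size; a quick sketch in Python, using the counts reported above:

```python
# Agreement counts and group sizes from the organ donation study
groups = {"neutral": (44, 56), "opt-in": (23, 55), "opt-out": (41, 50)}

for name, (agreed, total) in groups.items():
    print(f"{name}: {agreed}/{total} = {agreed / total:.3f}")
# neutral: 44/56 = 0.786, opt-in: 23/55 = 0.418, opt-out: 41/50 = 0.820
```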
Figure P.1: Percentages for Organ Donation Study

These results indicate that the neutral version of the question, forcing participants to make a choice between becoming an organ donor or not, and the opt-out version, for which the default is to be an organ donor, produced a higher percentage who agreed to become donors than the opt-in version, for which the default is not to be a donor.

Step 4: Draw inferences beyond the data.

Using methods that you will learn in this course, the researchers analyzed whether the observed differences between the groups were large enough to indicate that the default option had a genuine effect, and then estimated the size of that effect. In particular, this study reported strong evidence that the neutral and opt-out versions do lead to a higher chance of agreeing to become a donor, as compared to the opt-in version currently used in many states. In fact, they could be quite confident that the neutral version increases the chances that a person agrees to become a donor by between 20 and 54 percentage points, a difference large enough to save thousands of lives per year in the United States.

Step 5: Formulate conclusions.

Based on the analysis of the data and the design of the study, it is reasonable for these researchers to conclude that the neutral version causes an increase in the proportion who agree to become donors. But because the participants in the study were volunteers recruited from internet bulletin boards, generalizing conclusions beyond these participants is only legitimate if they are representative of a larger group of people.

Step 6: Look back and ahead.

The organ donation study provides strong evidence that the neutral or opt-out wording could be helpful for improving organ donation proportions. One limitation of the study is that participants were asked to imagine how they would respond, which might not mirror how people would actually respond in such a situation.
A new study might look at people’s actual responses to questions about organ donation or could monitor donor rates for states that adopt a new policy. Researchers could also examine whether presenting educational material on organ donation might increase people’s willingness to donate. Another improvement would be to include participants from wider demographic groups than these volunteers. Part of looking back also considers how an individual study relates to similar studies that have been conducted previously. Johnson and Goldstein compare their study to two others: one by Gimbel et al. (2003) that found similar results with European countries and one by Caplan (1994) that did not find large differences in the proportion agreeing to donate between the three default options.

Figure P.2 displays the six steps of a statistical investigation that we have identified:

Figure P.2: Six Steps of a Statistical Investigation
1. Ask a research question (Research Hypothesis)
2. Design a study and collect data
3. Explore the data
4. Draw inferences (Logic of Inference: Significance, Estimation)
5. Formulate conclusions (Scope of Inference: Generalization, Causation)
6. Look back and ahead
(Significance, Estimation, Generalization, and Causation form the Four Pillars of Statistical Inference.)

Notice from Figure P.2 that Step 4 can be considered as the logic of statistical inference and Step 5 as the scope of statistical inference. Furthermore, each of these two steps involves two components. The following questions comprise the four pillars of statistical inference:

1. Significance: How strong is the evidence of an effect? You will learn how to provide a measure of the strength of the evidence provided by the data that the neutral and opt-out versions increase the chance of agreeing to become an organ donor, as compared to the opt-in version.
2. Estimation: What is the size of the effect? You will learn how to estimate how much higher (if any) the chance is that someone agrees to donate organs when asked with the neutral version, compared to the other versions.

3. Generalization: How broadly do the conclusions apply? You will learn to consider what larger group of individuals you believe these conclusions can be applied to.

4. Causation: Can we say what caused the observed difference? You will learn whether we can legitimately conclude that the version of the question was the cause of the observed differences in the proportion who agreed to become organ donors.

These four concepts are so important that they should be addressed in virtually all statistical studies. Chapters 1-4 of this book will be devoted to introducing and exploring these four pillars of inference. To begin our study of the six steps of statistical investigation, we now introduce some basic terminology that will be used throughout the text.

Basic Terminology

Data can be thought of as the values measured or categories recorded on individual entities of interest. These individual entities on which data are recorded are called observational units. The recorded characteristics of the observational units are the variables of interest.
o Some variables are quantitative, taking numerical values on which ordinary arithmetic operations make sense.
o Other variables are categorical, taking category designations.
The distribution of a variable describes the pattern of value/category outcomes.

In the organ donation study, the observational units are the participants in the study. The two variables recorded on these participants are the version of the question that the participant received, and whether or not the participant agreed to become an organ donor. Both of these are categorical variables.
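One way to make this terminology concrete is to picture the data as a set of records, one per observational unit, with one field per variable. A hypothetical sketch in Python (the field names and example records are ours, not the researchers’):

```python
from collections import Counter

# Each dict is one observational unit (a participant);
# each key is a variable, and both variables here are categorical.
participants = [
    {"version": "neutral", "agreed": "yes"},
    {"version": "opt-in",  "agreed": "no"},
    {"version": "opt-out", "agreed": "yes"},
    # ... one record per participant, 161 in all
]

# Tallying a categorical variable shows its distribution.
print(Counter(p["agreed"] for p in participants))
```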
The graph in Figure P.1 displays the distributions of the donation variable for each default option category. The observational units in a study are not always people. For example, you might take the Reese’s Pieces candies in a small bag as your observational units, on which you could record variables such as the color (a categorical variable) and weight (a quantitative variable) of each individual candy. Or you might take all of the Major League Baseball games being played this week as your observational units, on which you could record data on variables such as the total number of runs scored, whether the home team wins the game, and the attendance at the game.

Think about it: For each of the three variables just mentioned (about Major League Baseball games), identify the type of variable: categorical or quantitative.

The total number of runs scored and attendance at the game are quantitative variables. Whether or not the home team won the game is a categorical variable.

Think about it: Identify the observational units and variable for a recent study (Ackerman, Griskevicius, and Li, 2011) that investigated this research question: Among heterosexual couples in a committed romantic relationship, are men more likely than women to say “I love you” first?

The observational units in this study are the heterosexual couples, and the variable is whether the man or the woman was the first to say “I love you.”

Coming Next

Next you will explore two situations where observing and summarizing data are helpful in making decisions. In Example P.2, you will encounter data arising from a natural “data-generating” process that repeats the same “random event” many, many times, which allows us to see a pattern (distribution) in the resulting data.
Then in Exploration P.3, you will examine data from a purely random process (like rolling dice), to see how to use that information to make better decisions. In subsequent chapters, you will analyze both data-generating processes and random processes, often with the goal of seeing how well a random process models what you find in data.

Example P.2: Old Faithful

Millions of people from around the world flock to Yellowstone Park in order to watch eruptions of Old Faithful geyser. But just how faithful is this geyser? How predictable is it? How long does a person usually have to wait between eruptions? Suppose the park ranger gives you a prediction for the next eruption time, and then that eruption occurs five minutes after that predicted time. Would you conclude that predictions by the Park Service are not very accurate? We hope not, because that would be using anecdotal evidence. To investigate these questions about the reliability of Old Faithful, it is much better to collect data. (A live webcam of Old Faithful and surrounding geysers is available at: http://www.nps.gov/yell/photosmultimedia/yellowstonelive.htm.)

Researchers collected data on 222 eruptions of Old Faithful taken over a number of days in August 1978 and August 1979. Figure P.3 contains a graph (called a dotplot) displaying the times until the next eruption (in minutes) for these 222 eruptions. Each dot on the dotplot represents a single eruption.

Figure P.3: Times between eruptions of Old Faithful geyser (dotplot of time until next eruption, in minutes, ranging from about 40 to 100)

Think about it: What are the observational units and variable in this study? Is the variable quantitative or categorical?

The observational units are the 222 geyser eruptions, and the variable is the time until the next eruption, which is a quantitative variable.
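Numerical summaries of a quantitative variable like this are easy to compute once the times are recorded. A sketch with a few made-up inter-eruption times (the actual 222 recorded times are not reproduced here):

```python
import statistics

# Hypothetical inter-eruption times in minutes, for illustration only
times = [55, 58, 74, 80, 56, 82, 78, 51, 85, 60]

print(min(times), max(times))   # shortest and longest waits
print(statistics.mean(times))   # a measure of center
print(statistics.stdev(times))  # a measure of variability (the standard deviation, discussed shortly)
```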
The dotplot displays the distribution of this variable, which means the values taken by the variable and how many eruptions have those values. The dotplot helps us see the patterns in times until the next eruption. The most obvious point to be seen from this graph is that even Old Faithful is not perfectly predictable! The time until the next eruption varies from eruption to eruption. In fact, variability is the most fundamental property in studying Statistics.

We can view these times until the next eruption as observations from a process, an endless series of potential observations from which our data constitute a small “snapshot.” Our assumption is that these observations give us a representative view of the long-run behavior of the process. Although we don’t know in advance how long it will take for the next eruption, in part because there are many factors that determine when that will be (e.g., temperature, season, pressure), and in part because of unavoidable, natural variation, we may be able to see a predictable pattern overall if we record enough inter-eruption times. Statistics helps us to describe, measure, and often explain the pattern of variation in these measurements.

Looking more closely at the dotplot, we can notice several things about the distribution of the time until the next eruption:
- The shortest time until the next eruption was 42 minutes, and the longest time was 95 minutes.
- There appear to be two clusters of times, one cluster between roughly 42 and 63 minutes, another between about 66 and 95 minutes.
- The lower cluster of inter-eruption times is centered at approximately 55 minutes, whereas the upper cluster is centered at approximately 80 minutes. Overall, the distribution of times until the next eruption is centered at approximately 75 minutes.
- In the lower cluster, times until the next eruption range from 42 to 63 minutes, with most of the times between 50 and 60 minutes.
- In the upper cluster, times range from 66 to 95 minutes, with most between 75 and 85 minutes.

What are some possible explanations for the variation in the times? One thought is that some of the variability in times until the next eruption might be explained by considering the duration of the previous eruption. It seems to make sense that after a particularly long eruption, Old Faithful might need more time to build up enough pressure to produce another eruption. Similarly, after a shorter eruption, Old Faithful might be ready to erupt again without having to wait very long. Fortunately, the researchers recorded a second variable about each eruption: the duration of the eruption, which is another quantitative variable. For simplicity we can categorize each eruption’s duration as short (less than 3.5 minutes) or long (3.5 minutes or longer), a categorical variable. Figure P.4 displays dotplots of the distribution of time until the next eruption for short and long eruptions separately.

Figure P.4: Times between eruptions of Old Faithful geyser, separated by duration of previous eruption (less than 3.5 minutes or at least 3.5 minutes)

We can make several observations about the distributions of times until the next eruption, comparing eruptions with short and long durations, from this dotplot:
- The shapes of the individual distributions no longer reveal the two distinct clusters (bimodality) that were apparent in the original distribution before separating by duration length. Each of these distributions seems to have a single peak.
- The centers of these two distributions are quite different: After a short eruption, a typical time until the next eruption is between 50 and 60 minutes.
In contrast, after a long eruption, a typical time until the next eruption is between 75 and 85 minutes.
- The variability in the times until the next eruption is much smaller for each individual distribution (times tend to fall closer to the mean within each duration type) than the variability for the overall distribution, as we have been able to take into account one source of variability in the data. But of course the times still vary, partly due to other factors that we have not yet accounted for and partly due to natural variability inherent in all random processes.

One way to measure the center of a distribution is with the average, also called the mean. One way to measure variability is with the standard deviation, which is roughly the average distance between a data value in the distribution and the mean of the distribution. (See the Appendix for details about calculating standard deviation, which will also be explored further in Chapter 3.) These values for the time until the next eruption, both for the overall distribution and for short and long eruptions separately, are given in Table P.1.

Table P.1: Means and Standard Deviations of Inter-Eruption Times (minutes)

                       Mean   Standard deviation
Overall                71.0   12.8
After short duration   56.3    8.5
After long duration    78.7    6.3

Notice that the standard deviations (SD) of time until next eruption are indeed smaller for the separate groups than for the overall dataset, as suggested by examining the variability in the dotplots.

Figure P.5: Times between eruptions of Old Faithful geyser, separated by duration of previous eruption, with mean and standard deviation shown (short: mean = 56.3, SD = 8.5; long: mean = 78.7, SD = 6.3)

So, what do you learn from this analysis?
First, you can better predict when Old Faithful will erupt if you know how long the previous eruption lasted. Second, with that information in hand, Old Faithful is rather reliable, because it often erupts within six to nine minutes of the time you would predict based on the duration of the previous eruption. So if the park ranger’s prediction was only off by five minutes, that’s pretty good.

Basic Terminology

From this example you should have learned that a graph such as a dotplot can display the distribution of a quantitative variable. Some aspects to look for in that distribution are:
- Shape: Is the distribution symmetric? Mound-shaped? Are there several peaks or clusters?
- Center: Where is the distribution centered? What is a typical value?
- Variability: How spread out are the data? Are most within a certain range of values?
- Unusual observations: Are there outliers that deviate markedly from the overall pattern of the other data values? If so, identify them to see if you can explain why those observations are so different. Are there other unusual features in the distribution?

You have also begun to think about ways to measure the center and variability in a distribution. In particular, the standard deviation is a tool we will use quite often as a measure of variability. At this point, we want you to be comfortable visually comparing the variability among distributions and anticipating which variables you might expect to have more variability than others.

Think about it: Suppose that Mary records the ages of people entering a McDonald’s fast-food restaurant near the interstate today, while Colleen records the ages of people entering a snack bar on a college campus. Who would you expect to have the larger standard deviation of these ages: Mary (McDonald’s) or Colleen (campus snack bar)? Explain briefly.
The customers at McDonald’s are likely to include people of all ages, from young children to elderly people. But customers at the campus snack bar are most likely to be college-aged students, with some older people who work on campus and perhaps a few younger people. Therefore the ages of the customers at McDonald’s will vary more than the ages of those at the campus snack bar. Mary is therefore more likely to have a larger standard deviation of ages than Colleen.

Exploration P.3: Cars or Goats

A popular television game show (Let’s Make a Deal from the 1960s and 1970s) featured a new car hidden behind one of three doors, selected at random. Behind the other two doors were less appealing prizes (e.g., goats!). When a contestant played the game, he or she was asked to pick one of the three doors. If the contestant picked the correct door, he or she won the car!

1. Suppose you are a contestant on this show. Intuitively, what do you think is the probability that you win the car (i.e., that the door you pick has the car hidden behind it)?

2. Give a one-sentence description of what you think probability means in this context.

Assuming there is no set pattern to where the game show puts the car initially, this game is an example of a random process: Although the outcome for an individual game is not known in advance, we expect to see a very predictable pattern in the results if you play this game many, many times. This pattern is called a probability distribution, similar to a data distribution such as you examined with Old Faithful inter-eruption times. We are interested in features such as how common certain outcomes are; e.g., are you more likely to win this game (select the door with the car) or lose this game? To investigate what we mean by the term probability, we ask you to play the game many times.
As the game show no longer exists, we will simulate (artificially re-create) playing the game, keeping track of how often you win the car.

3. Use three playing cards with identical backs, but two of the card faces should match and one should differ. The different card represents the car. Work with a partner (playing the role of game show host), who will shuffle the three cards and then randomly arrange them face down. You pick a card and then reveal whether you have won the car or selected a goat. Play this game a total of 15 times, keeping track of whether you win the car (C) or a goat (G) each time:

Game #:                  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
Outcome (car or goat):

4. In what proportion of these 15 games did you win the car? Is this close to what you expected? Explain.

These fifteen “trials” or “repetitions” mimic the behavior of the game show’s random process, where you are introducing randomness into the process by shuffling the cards between games. To get a sense of the long-run behavior of this random process, we want to observe many, many more trials. Because it is not realistic to ask you to perform thousands of repetitions with your partner, we will turn to technology to generate a large number of outcomes from this random process.

5. Suppose that you were to play this game 1000 times. In what proportion of those games would you expect to win the car? Explain.

6. Use the website http://www.grand-illusions.com/simulator/montysim.htm to simulate playing this game 10 times. Be sure to use the “keep” choice on the left side, and change the Run times to 10. Click on the “Start” button. Record the proportion of wins in these 10 games. Then simulate another 10 games, and record the overall proportion of wins at this point.
Keep doing this in multiples of 10 games until you reach 100 games played. Record the overall proportion of wins after each additional multiple of 10 games in the table below.

Number of games:      10  20  30  40  50  60  70  80  90  100
Proportion of wins:

7. What do you notice about how the proportion of wins changes as you play more games? Does this proportion appear to be approaching some common value?

8. Now change the Run times to 100 and click on “Start.” Repeat this until you reach a total of 1000 games played. Calculate the proportion of wins by dividing the number of wins by 1000. Is this close to what you expected in #5?

You should see that the proportion of wins generally gets closer and closer to 1/3 (or .3333) as you play more and more games. This is what it means to say that the probability of winning is 1/3: If you play the game repeatedly under the same conditions, then after a very large number of games, your proportion of wins should be very close to 1/3. Figure P.6 displays a graph showing how the proportion of wins changed over time for one simulation of 1000 games. Notice that the proportion of wins bounces around a lot at first but then gradually settles down and approaches a long-run value of 1/3.

Figure P.6: Proportion of wins as more and more games are played (proportion of wins plotted against number of games, from 0 to 1000)

Now consider a fun twist that the game show host adds to this game: Before revealing what’s behind your door, the host will first reveal what’s behind a different door that the host knows to hide a goat. Then the host asks whether you (the contestant) prefer to stay with (keep) the door you picked originally or switch (change) to the remaining door.

9.
Prediction: Do you think the probability of winning is different between the “stay” (keep) and “switch” (change) strategies?

Whether the “stay” or “switch” strategy is better is a famous mathematical question known as the Monty Hall Problem, named for the host of the game show. Many people, including some renowned mathematicians, got the solution wrong when this problem became popular through the “Ask Marilyn” column in Parade magazine in 1990. We can approach this question as a statistical one that you can investigate by collecting data.

10. Investigate the probability of winning with the “switch” strategy by playing with three cards for 15 games. This time your partner should randomly arrange the three cards in his/her hand, making sure that your partner (playing the role of game show host) knows where the car is but you do not. You pick a card. Then your partner reveals one of the cards known to be a goat, but not the card you chose. Play with the “switch” strategy for a total of 15 games, keeping track of the outcome each time:

Repetition:              1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
Outcome (car or goat):

11. In what proportion of these 15 games did you win the car? Is this more or less than (or the same as) when you stayed with the original door (question #4)?

12. To investigate what would happen in the longer run, return to http://www.grand-illusions.com/simulator/montysim.htm. Notice that you can change from “keep” your original choice to “change” your original choice. Clear any previous work and then simulate playing 1000 games with each strategy, and record the number of times you win/lose with each:

                     Wins (cars)   Losses (goats)   Total
“Stay” strategy                                     1000
“Switch” strategy                                   1000

13. Do you believe that the simulation has been run for enough repetitions to declare one strategy as superior?
Which strategy is better? Explain how you can tell.

14. Based on the 1000 simulated repetitions of playing this game, what is your estimate for the probability of winning the game with the "switch" strategy?

15. How could you use simulation to obtain a better estimate of this probability?

16. The probability of winning with the "switch" strategy can be shown mathematically to be 2/3. (One way to see this is to recognize that with the "switch" strategy, you lose only when your initial pick was the correct door.) Explain what it means to say that the probability of winning equals 2/3.

Extension

17. Suppose that you watch the game show over many years and find that door #1 hides the car 50% of the time, door #2 has the car 40% of the time, and door #3 has the car 10% of the time. What then is your optimal strategy? In other words, which door should you pick initially, and then should you stay or switch? What is your probability of winning with the optimal strategy? Explain.

Basic Terminology

Through this investigation you should have learned:
* A random process is one that can be repeated a very large number of times (in principle, forever) under identical conditions with the following property:
  o Outcomes for any one instance cannot be known in advance, but the proportion of times that particular outcomes occur in the long run can be predicted well in advance.
* The probability of an outcome refers to the long-run proportion of times that the outcome would occur if the random process were repeated a very large number of times.
* Simulation (artificially re-creating a random process) can be used to estimate a probability.
  o Simulations can be conducted both with tactile methods (e.g., cards) and with computers.
  o Using a larger number of repetitions in a simulation generally produces a better estimate of the probability.
* Simulation can be used for making good decisions involving random processes.
  o A "good" decision (in this context) means you can accurately predict which strategy would result in a larger probability of winning. This tells you which strategy to use if you do find yourself on this game show, but of course it does not guarantee that you will win!

Preliminaries Summary

This concludes your study of the preliminary but important ideas necessary to begin studying Statistics. We hope you have learned that:
* Collecting data from carefully designed studies is more dependable than relying on anecdotes for answering questions and making decisions.
* Statistical investigations, which can address interesting and important research questions from a wide variety of fields of application, follow the six steps illustrated in Figure P.2.
* Some data arise from processes that include a mix of systematic elements and natural variation. All data display variability.
* Distributions of quantitative data can be analyzed with dotplots, where we look for shape, center, variability, and unusual observations.
* Standard deviation is a widely used tool for quantifying variability in data.
* Random processes that arise from chance mechanisms display predictable long-run patterns of outcomes. Probability is the language of random processes.
* Beginning in Chapter 1, we will use a random process to model a data-generating process in order to assess whether the data-generating process appears to behave similarly to the random process.
* Using data to draw conclusions and make decisions requires careful planning in collecting and analyzing the data, paying particular attention to issues of variability and randomness.
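The simulations described in this investigation can also be carried out with a few lines of computer code rather than cards or the web applet. The sketch below (in Python; the function names `play_monty_hall` and `estimate_win_probability` are our own illustrative choices, not part of the text or its applet) estimates the probability of winning with the "stay" and "switch" strategies by playing many simulated games. The optional `car_probs` argument lets you explore the unequal-door scenario from Extension question 17.

```python
import random

def play_monty_hall(switch, rng, car_probs=None):
    """Play one game: 3 doors, one hides a car, the others hide goats.

    switch    -- True to use the "switch" strategy, False to "stay"
    car_probs -- optional per-door probabilities for where the car is
                 (defaults to equal chances, as in the classic game)
    Returns True if the contestant wins the car.
    """
    doors = [0, 1, 2]
    if car_probs is None:
        car = rng.choice(doors)
    else:
        car = rng.choices(doors, weights=car_probs)[0]
    pick = rng.choice(doors)  # contestant's initial pick
    # The host opens a goat door that is not the contestant's pick.
    goat_doors = [d for d in doors if d != pick and d != car]
    opened = rng.choice(goat_doors)
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def estimate_win_probability(switch, n_games=10_000, seed=0):
    """Estimate the long-run proportion of wins for a strategy."""
    rng = random.Random(seed)
    wins = sum(play_monty_hall(switch, rng) for _ in range(n_games))
    return wins / n_games

stay = estimate_win_probability(switch=False)
change = estimate_win_probability(switch=True)
print(f"stay: {stay:.3f}, switch: {change:.3f}")
```

With 10,000 simulated games, the two estimates should land close to the values derived above, 1/3 for "stay" and 2/3 for "switch"; increasing `n_games` tightens the estimates further, just as question 15 suggests.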
Preliminaries Glossary

anecdotal evidence: Personal experience or striking example (P-1)
categorical variable: A variable whose outcomes are category designations (P-5)
center: A middle or typical value of a quantitative variable (P-9)
distribution: The pattern of outcomes of a variable (P-5, P-7)
dotplot: A graph with one dot representing the variable outcome for each observational unit (P-6)
observational units: The individual entities on which data are recorded (P-5)
probability distribution: The pattern of long-run outcomes from a random process (P-10)
probability: The long-run proportion of times an outcome from a random process occurs (P-11)
process: An endless series of potential observations (P-7)
quantitative variable: A variable taking numerical values on which ordinary arithmetic operations make sense (P-5)
random process: A repeatable process with unknown individual outcomes but a long-run pattern (P-10)
shape: A characteristic of the distribution of a quantitative variable (P-9)
simulation: Artificial re-creation of a random process (P-10)
six steps of a statistical investigation: (P-2)
standard deviation: A measure of variability of a quantitative variable (P-8)
Statistics: A discipline that guides researchers in collecting, exploring, and drawing conclusions from data (P-1)
variability: The spread in observations for a quantitative variable (P-7)
variables: Recorded characteristics of the observational units (P-5)