12.1 A BOOTSTRAP HYPOTHESIS TEST OF THE POPULATION MEAN

In Chapter 6 we looked at sets of data and asked whether the data could have arisen from a binomial model with a certain population probability of success such as that given by a hypothesized drug cure rate. If the observed proportion of successes was too large, we rejected the hypothesized model. In Chapter 7 we looked at a set of data and asked whether it could have arisen from a particular hypothesized model such as that of a six-sided fair die or a many-sided loaded die. The chi-square statistic was calculated. If it was improbably large, we concluded that the data did not come from the hypothesized model. If the chi-square statistic was not too large, we concluded that the data may very well have come from that model. That is, we “rejected” or “accepted” the null hypothesis on the basis of the above considerations.

The six-step decision-making process of Chapters 6 and 7 can be used for many other statistical purposes. Which of two new drugs, if either, is more effective? Are husbands more likely to be older than their wives? Is a particular hypothesized population blood pressure average correct for 60-year-olds? This chapter shows how to formally assess such hypotheses. The chapter is best viewed as a continuation of Chapter 6.

Let us consider the Key Problem. We will assume that the body temperature readings constitute a random sample from a population of adults and hence are representative of this population. The population is taken to be a large one, such as the residents of Chicago. Confidence intervals (discussed in Chapter 11) give a set of reasonable values for the unknown theoretical mean of the population under consideration. We need the standard deviation of the data, which turns out to be 0.73. Thus, as explained in Chapter 11, we can use the theoretical result that a sample mean when appropriately standardized is approximately distributed as a standard normal variable.
From this, the approximate 95% confidence interval for the population mean is computed by

$$98.25 \pm 1.96\,\frac{0.73}{\sqrt{130}} = 98.25 \pm 0.13 = (98.12,\ 98.38)$$

The “normal” body temperature value 98.6 is not in that interval, even though the theory of confidence intervals tells us the interval covers the true population average temperature about 95% of the time. This suggests that the population mean is lower than 98.6, that is, that it is not just chance causing the observed mean to be as low as 98.25.

Another approach to this question is to formally set out a null hypothesis:

H0: The population mean is 98.6.

and then use the data to see if the hypothesis holds up or whether the population average temperature is in fact lower. The H0 is the null hypothesis: H is for hypothesis and 0 is for null. This is the hypothesis-testing approach of this chapter. The idea is the same as in Chapter 6, where we asked whether the data conformed closely enough to a given hypothesized model.

The null hypothesis represents the status quo: it is believed that the mean of the population is 98.6. It is usually the hypothesis of “business as usual” or “nothing of interest here.” We then look at the data, which have a mean of 98.25. We have two choices:

1. It is plausible that the observed sample mean of 98.25 could have come from a null hypothesis population with a mean of 98.6. We thus accept the null hypothesis, meaning that the evidence tells us either that the null hypothesis is true or that there is not enough evidence to say it is false. Thus “accept” does not necessarily mean we have strong evidence that the null hypothesis is true.

2. It is not plausible that the observed sample mean could have come from a null hypothesis population with a mean of 98.6. We thus reject the null hypothesis and conclude that the true population average temperature is less than 98.6.

How far does the sample average have to be from 98.6 before we are compelled to reject the null hypothesis?
In particular, what is the chance that if the null hypothesis is true, we could observe a sample mean as low as or lower than 98.25? To decide, we take the six-step hypothesis-testing approach of Chapter 6.

A key step in our six-step simulation approach to statistical hypothesis testing is step 1: making a realistic choice of the model to be randomly sampled from. One cannot be effective as a statistician without understanding how to realistically specify the model generating the data to be analyzed. Because we are hypothesis testing, the specified model of step 1 must satisfy the null hypothesis as well as be a realistic model in terms of shape and spread for the problem at hand. In Chapters 6 and 7 supplying such a model was fairly straightforward. For example, in Chapter 7 we often used a fair many-sided die null hypothesis model, and in the Chapter 6 problem about community attitudes toward raising the driving age, a fair coin null hypothesis model sufficed.

In this section something more subtle is often called for. When understood, this new approach to building the null hypothesis model of step 1 will seem most reasonable. The approach is called bootstrapping the observed data. We were introduced to bootstrapping in Section 11.3 as a method of approximately obtaining the standard error of an estimate. Bootstrapping’s central idea is to use the shape of the data to supply a good estimate of the unknown model population that we wish to sample from.

The name bootstrapping comes from the cliché of pulling oneself up by one’s own bootstraps, that is, climbing upward with no assistance other than one’s own body, clearly a feat not literally possible. In a statistical context it means making a statistical inference using only the data to produce the model: that is, we do not make the usual specification of a model (such as binomial, normal, or uniform distribution), which is usually arrived at independently of the data.
The statistical bootstrap has a valid justification and indeed often works well in applications. From the viewpoint of this book, it is a special version of our six-step method of hypothesis testing in which the model of step 1 is entirely determined by the observed data and the null hypothesis. Bootstrap methods are beginning to be heavily used in modern statistical practice; hence the method you will learn here is part of the modern statistical arsenal.

Let us return to the Key Problem.

1. Choice of a Model (Definition of the Null Hypothesis Population): We must choose a realistic box model (a population) that conforms with the null hypothesis. In particular, the population must represent the body temperatures of the large set of adults actually being sampled from, but with the null hypothesis being true and hence having a population average of 98.6. But simply saying the population mean is 98.6 does not define the null hypothesis population, for many populations have a theoretical mean of 98.6. What is the population standard deviation? What does the shape of the population relative frequency histogram, also called a probability histogram, look like (that is, the theoretical probability distribution of the temperature of a randomly sampled person; see the introduction to Chapter 8, and see Figure 8.1 for an example of a probability histogram)? For example, we could decide it is unwise to assume a normal shape for our population distribution, even though many would take this approach. Indeed, we will instead presume that the shape of the population relative frequency histogram or theoretical distribution is exactly that of the sample relative frequency histogram in Figure 12.1, except that it is shifted so that its theoretical mean is 98.6. (Note that the rectangles in Figure 12.1 are 0.2 in width, and keep in mind that the probability of an interval is given by the area, not the height, of the rectangle.)
Note that the population probability histogram represents the temperature distribution for all the individuals in the population. Since on average the data points are 98.6 − 98.25 = 0.35 lower than the hypothesized value, we simply add 0.35 to all the data points. Now we replicate each such shifted data point many times to create the needed large population of adults satisfying the null hypothesis. (Example 12.1 below exhibits this procedure more explicitly.) This invented population (box model) that characterizes H0 being true then has a theoretical mean of 98.6 but otherwise is shaped like the data. The advantage of this approach is that we have not needed to assume any particular theoretical distributional shape, but have let the data alone determine our estimate of the shape.

An approach that makes no assumptions about the particular shape (for example, a bell shape) of the population distribution is called nonparametric because it is free of restrictive assumptions that are usually given by parameters that specify the shape of the population distribution. The nonparametric approach is very powerful because the user takes no risk of being deceived by assuming an incorrect shape for the population histogram. The nonparametric, data-driven approach we are embarking on is one version of what statisticians call the nonparametric bootstrap.

[Figure 12.1: Adult temperature data. Relative frequency histogram; horizontal axis: temperature (°F), 96 to 101; vertical axis: relative frequency, 0 to 0.45; the region with temperature ≥ 98.6 is marked.]

2. Definition of a Trial (Sample): A trial consists of randomly choosing 130 readings from the population, sampling without replacement, because sampling without replacement is the realistic way to sample from a population. Indeed, the actual 130 observations were obtained by sampling without replacement.

3. Definition of a Successful Trial: Because we are concerned with the average temperature of 130 people, the statistic is the average of the (new) 130 readings sampled from the population.
The trial is a success if the average is less than or equal to 98.25.

4. Repetition of Trials: We perform the sampling 100 times, each time obtaining a new mean. The stem-and-leaf plot in Table 12.1 contains these means. The average of the means is 98.6048, very close to the null hypothesis value, as expected because the population mean is 98.6. The standard deviation of the means can be computed: it is 0.0606.

5. Estimation of the Probability of the Obtained Average or Less (Probability of a Successful Trial): We want to know the chance of obtaining a bootstrap sampled mean that is ≤ 98.25 from this null hypothesis population designed to have a population mean of 98.6. It turns out that all the simulated sample averages from step 4 are higher than 98.25: they range from 98.45 to 98.75. Thus we estimate the probability to be 0.

Table 12.1 Sample Means of Temperatures from 100 Samples of Size 130

Stem | Leaf
984 | 567
985 | 1222223334444444
985 | 5555566777777888889999
986 | 00000000001111112222222333344444
986 | 55555555666777788899
987 | 00014
987 | 55

Key: “984 | 567” stands for 98.45, 98.46, 98.47 degrees.

6. Decision: If the null hypothesis is true, the chance that a bootstrap sample mean is ≤ 98.25 is estimated to be 0, much less than the conventional value of 0.05 for an unlikely event. Thus there is strong evidence against the null hypothesis: the observed value of 98.25 cannot be ascribed to chance under the null hypothesis. We reject the null hypothesis, believing the evidence to be very strong. That is, the evidence is strong that for the population from which the data were sampled, the average temperature is not the “normal” value of 98.6 degrees.

Since it is a totally new idea to form a nonparametric bootstrap null hypothesis population, we now examine the crucial step 1 in detail.
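The six steps for the Key Problem can be sketched in a few lines of code. This is only an illustration under stated assumptions: the 130 actual readings are not listed in this section, so the sketch generates hypothetical stand-in readings with a mean near 98.25 and a standard deviation near 0.73, and the function name `bootstrap_lower_tail` and the choice of 1000 copies per data point are ours. It follows the description above literally: shift the data, replicate each shifted point to make a large population, and sample from it without replacement.

```python
import random
import statistics

def bootstrap_lower_tail(sample, null_mean, trials=100, copies=1000, seed=0):
    """Estimate P(sample mean <= observed mean) under the null hypothesis."""
    rng = random.Random(seed)
    n = len(sample)
    observed = statistics.fmean(sample)
    # Step 1: shift the data so its mean equals the null value, then
    # replicate each shifted point to form a large box-model population.
    shift = null_mean - observed
    population = [x + shift for x in sample] * copies
    successes = 0
    means = []
    for _ in range(trials):  # Step 4: repeat the trial many times.
        # Step 2: one trial is n readings drawn without replacement.
        draw = rng.sample(population, n)
        # Step 3: a trial succeeds if its mean is <= the observed mean.
        m = statistics.fmean(draw)
        means.append(m)
        if m <= observed:
            successes += 1
    # Step 5: the proportion of successful trials estimates the probability.
    return successes / trials, means

# Hypothetical stand-in data (NOT the book's readings): 130 values
# centered near 98.25 with spread near 0.73.
data_rng = random.Random(1)
readings = [round(data_rng.gauss(98.25, 0.73), 1) for _ in range(130)]

p_hat, means = bootstrap_lower_tail(readings, null_mean=98.6)
print(p_hat, round(statistics.fmean(means), 2), round(statistics.stdev(means), 3))
```

Step 6 then compares `p_hat` with the conventional 0.05. With real data one would replace `readings` by the observed sample; the stand-in values here only show the mechanics.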
To understand how to construct the null hypothesis population model from the sample data, let’s simplify the body temperature problem by assuming that the sample is of 5 people rather than 130. Suppose their observed temperatures are 97.3, 97.5, 98.4, 98.6, and 99.2. The goal is to invent a large population (to be sampled from without replacement) shaped like these data but with a mean of 98.6, thereby satisfying the null hypothesis.

The mean of the five points is 98.2. To start building the large population, we have to add something to each point so that the mean of the five points is 98.6. Because 98.6 − 98.2 = 0.4, we have to add 0.4. We then have 97.7, 97.9, 98.8, 99.0, 99.6. (Check that these five have mean 98.6.)

We now create the desired realistically large population by replicating each of these five points many times. Suppose the large population to be modeled is of size 5000. Any large number is acceptable: for instance, 500 would be fine, too. Then we replicate each point 1000 times so that the total in our invented large population is 5000. The result is a box model of size 5000 that satisfies the null hypothesis and, we believe, is shaped approximately like the real population distribution of all people’s temperatures. (See the following table.) This invented population is our best guess of what the real population looks like in terms of centering, spread, and overall distributional shape if the null hypothesis is true. If the sample of size 5 is reasonably representative of the unknown population (spread, shape, and so on), then our population created from the data should be (roughly) shaped like the unknown true population distribution we would use in step 1 if we knew it.

97.7  97.9  98.8  99.0  99.6
97.7  97.9  98.8  99.0  99.6
97.7  97.9  98.8  99.0  99.6
97.7  97.9  98.8  99.0  99.6
97.7  97.9  98.8  99.0  99.6
97.7  97.9  98.8  99.0  99.6
97.7  97.9  98.8  99.0  99.6
97.7  97.9  98.8  99.0  99.6
97.7  97.9  98.8  99.0  99.6
97.7  97.9  98.8  99.0  99.6
...   ...   ...   ...   ...
97.7  97.9  98.8  99.0  99.6

Now our plan is to repeatedly sample five numbers without replacement from this large invented population. Because the population size (5000) is large relative to the planned sampling size (5), we know from Section 5.7 that the probability of a successful trial is almost unaffected if we instead sample with replacement, so we can sample with replacement with no harm done. But if we sample with replacement, the probability law of the five-observation sample average needed for step 5 is the same if we use the shifted original (five-member) sample as our step 1 population instead of the population of 5000 formed from the 1000 replications.

By switching to sampling with replacement to form our simulated samples, we can avoid all the effort of creating and sampling without replacement from a large invented population. We can rather merely take the basic five shifted null hypothesis values, 97.7, 97.9, 98.8, 99.0, and 99.6, to be the entire null hypothesis population, and repeatedly randomly sample five observations from this five-member population with replacement. That is, we randomly choose one, record its value, and then put it back. We then randomly choose another, which could be the same as the first, record its value, and then put it back. We do this five times to obtain each step 2 sample.

Such random sampling with replacement from the actually observed data set is bootstrap sampling. Indeed, in Section 11.3 we introduced the bootstrap by sampling with replacement from a small (nonreplicated) population. This sampling with replacement from the (possibly small) sample is what statisticians actually do in practice.
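The five-person illustration is small enough to carry out directly. The sketch below uses the five temperatures given above, shifts them by 0.4, and then bootstrap-samples with replacement from the five shifted values. The number of trials (1000), the seed, and the helper name `bootstrap_means` are our own choices for illustration, not part of the text.

```python
import random
import statistics

observed = [97.3, 97.5, 98.4, 98.6, 99.2]  # the five temperatures from the text
null_mean = 98.6

# Step 1: shift every point by (null mean - sample mean) = 0.4.
shift = null_mean - statistics.fmean(observed)
shifted = [round(x + shift, 1) for x in observed]

rng = random.Random(0)  # fixed seed so the run is repeatable (our choice)

def bootstrap_means(values, trials):
    """Means of `trials` samples of size len(values), drawn with replacement."""
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n)) for _ in range(trials)]

# Steps 2-4: each trial draws five values with replacement from the
# five-member null hypothesis population.
means = bootstrap_means(shifted, trials=1000)

# Step 5: proportion of bootstrap means at or below the observed mean.
observed_mean = statistics.fmean(observed)
p_hat = sum(m <= observed_mean for m in means) / len(means)
print(shifted, round(shift, 1), p_hat)
```

Drawing with `rng.choices` from the five shifted values plays the role of drawing without replacement from the 5000-member replicated population, which is why the large population never has to be built.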
In summary, we can bootstrap-sample repeatedly (that is, sample with replacement from the actual observed sample, translated to make the null hypothesis true). This is an acceptable substitute for repeatedly sampling without replacement from the large invented population that we would have created to be shaped like the observed sample, translated in the same way.

Now imagine the above bootstrap sampling with replacement using as our box model the original 130 temperature readings shifted by +0.35 = 98.6 − 98.25. That is, we now return to the original data set of 130 body temperature measurements. We sample 130 observations with replacement 100 times. A statistician could use such a nonparametric bootstrap procedure in this situation in order to verify the choice of the distribution of the sample average under the null hypothesis for use in step 5.

Indeed, in cases in which the population distributional shape is not known, the above bootstrap approach is very appealing and would be used by many professional statisticians when the sample size is too small for application of the central limit theorem, that is, well under 30 (and hence certainly in the case of 5). But when the sample size used to compute the sample mean is large, as 130 is in the example, most statisticians would appeal to the central limit theorem of Chapter 11 because it tells us that the distribution of these sample means will be well approximated by the normal distribution regardless of the shape of the distribution of temperatures in the population. This normal approach to carrying out step 5 is developed in Section 12.2. In Example 12.1 below we illustrate the bootstrap approach as a special case of our six-step method of hypothesis testing.
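As a rough numerical check on the remarks above, the normal-theory quantities for the temperature data can be computed from the three summary numbers given in this section (n = 130, mean 98.25, standard deviation 0.73). The sketch below is only a preview under our own assumptions; in particular, using the standard normal lower-tail probability for step 5 is our reading of the central limit theorem argument, and Section 12.2 gives the actual development.

```python
import math
from statistics import NormalDist

n, mean, sd, null_mean = 130, 98.25, 0.73, 98.6

se = sd / math.sqrt(n)                       # standard error of the sample mean
half_width = 1.96 * se
ci = (mean - half_width, mean + half_width)  # approximate 95% confidence interval
z = (mean - null_mean) / se                  # standardized distance from the null mean
p_normal = NormalDist().cdf(z)               # lower-tail probability under H0

print(round(se, 3), tuple(round(v, 2) for v in ci), round(z, 2), p_normal)
```

The standard error comes out near 0.064, close to the standard deviation of 0.0606 observed among the 100 simulated means in step 4, and the interval reproduces (98.12, 98.38). Both approaches therefore point to the same decision for these data.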
Example 12.1 A Paired Comparison of Two Population Means

Data on 177 Illinois husband-wife couples from the 1989 Current Population Survey* yielded the comparison of attained educational levels presented in the relative frequency histogram of Figure 12.2. The distributions look reasonably similar, although it appears that more of the husbands go through two years of college (14 years of education) and more wives have but one year of college (13 years). The average for the 177 husbands is 12.89, and that for the 177 wives is 12.65. The husbands average 0.24 year more of education. Could that be due to chance, or are husbands more educated than their wives on average in the population we are sampling from?

[Figure 12.2: Years of education data for married couples. Two relative frequency histograms (Wives, Husbands); horizontal axis: number of years of education, 0 to 20; vertical axis: relative frequency, 0 to 0.50.]

*From data disk accompanying D. Freedman, R. Pisani, R. Purves, and A. Adhikari, Statistics (New York: Norton, 1991).

Because we are interested in differences, the variable we look at is

Difference = husband’s education − wife’s education

for each couple. Figure 12.3 shows the relative frequency histogram of the 177 differences. Over 30% of the couples have the same educational level. One wife (with 14 years) has 11 more years than her husband (with 3). Otherwise, the largest difference is seven years. The average of these differences is the 0.24 we saw above. This is called a paired comparison test of two population means, because the (X, Y) values are paired (a wife and a husband).

Is it plausible that these data could be a sample from a population with a difference in number of years of education exactly 0? We wish to test the following null hypothesis:

H0: The average difference in years of education for the population of Illinois husband-wife couples is 0.

Let us proceed to the six steps, using the nonparametric bootstrap method of this section.

1.
Choice of a Model (Definition of the Population): We seek a null hypothesis population. It will be our best estimate of the Illinois husband-wife population satisfying the null hypothesis. Consistent with the bootstrap approach, we will use the sample to create this population. Because in the sample the observed difference is 0.24 year, we subtract 0.24 from each of the 177 differences in the data to produce a null hypothesis population of differences whose mean is exactly 0. That is, we shift all the differences by the same amount so that they can be viewed as a step 1 population for which the null hypothesis holds. Again we have the choice of creating a large realistic population by replicating each of the 177 members many times and then repeatedly sampling 177 differences without replacement. But once again we will more simply let the 177 translated differences be our entire null hypothesis population and produce our large number of size 177 sampled means by bootstrap-sampling with replacement between draws.

[Figure 12.3: Bootstrap-simulated differences in years of education for married couples. Horizontal axis: difference in years of education, −12 to 10; vertical axis: relative frequency, 0 to 0.35.]

Table 12.2 Bootstrap-Simulated Sample Means of Education Differences from 100 Samples of Size 177

Stem | Leaf
−6 | 1
−5 |
−4 | 64
−3 | 84433331
−2 | 88877777554433
−1 | 988865554221110000
−0 | 877777653322221110
0 | 1113388889
1 | 01223345556667789
2 | 0024589
3 | 1334
4 | 0

Key: “−6 | 1” stands for −0.61.

2. Definition of a Trial (Sample): A trial consists of randomly choosing 177 differences (husband’s years of education minus wife’s years of education) from the population of 177 by sampling with replacement.

3. Definition of a Successful Trial: The statistic of interest is the average of the 177 differences sampled from the invented population. A trial is a success if the observed average of differences is as large as or larger than 0.24.

4. Repetition of Trials: We perform the sampling 100 times.
The means of the 100 samples are shown in the stem-and-leaf plot in Table 12.2.

5. Estimation of the Probability of the Obtained Average or More (Probability of a Successful Trial): We want to know the chance of obtaining an average of differences as large as or larger than 0.24. From Table 12.2, we can count 9 that are 0.24 or above, so the probability is estimated to be 0.09.

6. Decision: If the null hypothesis were true, the chance that the sample mean difference is as high as 0.24 would be about 0.09. That is fairly small, but it is certainly larger than our 0.05 convention for rejection. We decide to accept the null hypothesis that the average difference in the population is 0. (A tentative rejection would be reasonable, too; the results are borderline. The statistically knowledgeable researcher would definitely want to revisit this problem, perhaps with a new and larger sample. For although 0.09 does not meet our “gold standard” for strong statistical evidence for rejecting the null hypothesis, it surely raises our scientific suspicions that the hypothesis may be false!)

The decision we make to “accept” H0 here in step 6 means that although there is an observed difference in educational levels in the sample, the difference is not large enough to take as convincing evidence that there is a difference in educational levels for the entire population. That does not mean we are convinced there is not a difference; there may be no difference, or there could be a small one. It is therefore more accurate to say we fail to reject H0.

In Chapter 7, where we did chi-square testing, we learned that statisticians can, in fact, bypass the simulations of the six-step method and appeal to the method of chi-square density. In contrast, the bootstrap simulation method, which does not bypass the simulations of the six-step method, is often the professional statistician’s method of choice when the sample size is small (say, under 30).

SECTION 12.1 EXERCISES

1.
Suppose a sample of 100 heights of men has a mean of 70.53 inches and a standard deviation of 3.22 inches. Explain how to use the bootstrap to test the hypothesis that the mean height of men in the population is 69 inches.

2. Suppose for Exercise 1 we have 100 bootstrapped means each from a bootstrap sample of 100 taken from the invented population with a mean height of 69 inches. The bootstrapped means are recorded in the following stem-and-leaf plot. Is the mean height of the men in the population really 69 inches?

Stem: 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697
Leaf: 2 2 08 2 01117 233459 01112334444559 003357778899 1122333445555678 001233459 000113333799 2223339 1234444478 1 26 5

Key: “697 5” stands for 69.75 inches.

3. Suppose 100 small specimens were taken from a certain batch of concrete. The mean compression strength of the specimens was 4129.58 pounds with a standard deviation of 164.12 pounds. Explain how to test the hypothesis that the mean compression strength of the batch of concrete is 4200 pounds using the bootstrap.

4. For Exercise 3, we found 100 bootstrapped means each from a bootstrap sample of size 100 taken from the invented population with mean 4200 pounds. The bootstrapped means are recorded in the following stem-and-leaf plot. Is the compression strength of the batch of concrete really 4200 pounds?

Stem | Leaf
416 | 778
417 | 2
417 | 557789
418 | 0022244
418 | 5567779
419 | 00001122334
419 | 55667777899999
420 | 11122333334444
420 | 5556777888999
421 | 011144
421 | 5666699
422 | 0012
422 | 5667
423 |
423 | 57
424 |
424 | 6

Key: “424 | 6” stands for 4246 pounds.

5. In a random sample of 100 husbands, the mean age was 46.62 years with a standard deviation of 4.40 years. The mean age of the 100 wives of the husbands was 41.88 years with standard deviation of 3.41 years. Explain how to test whether the difference between the ages of husbands and wives is 0 using the bootstrap.

6. Refer to Exercise 5.
Here is a frequency table of 100 bootstrapped mean differences obtained from 100 bootstrap samples of 100 taken from an invented population with a mean difference of 0. Test whether the difference between the ages of the husbands and wives is 0.

Difference (years) | Frequency
−1.4 | 2
−1.3 | 0
−1.2 | 1
−1.1 | 1
−1.0 | 0
−0.9 | 2
−0.8 | 1
−0.7 | 1
−0.6 | 3
−0.5 | 5
−0.4 | 7
−0.3 | 3
−0.2 | 10
−0.1 | 5
0.0 | 8
0.1 | 9
0.2 | 5
0.3 | 9
0.4 | 7
0.5 | 5
0.6 | 1
0.7 | 8
0.8 | 0
0.9 | 2
1.0 | 3
1.1 | 1
1.2 | 0
1.3 | 0
1.4 | 1

7. In a random sample of 200 high school seniors who took the SAT test, the mean SAT verbal score was 544.7 with a standard deviation of 36.36, and the mean SAT math score was 531.7 with a standard deviation of 47.10. Explain how to test whether the difference between SAT verbal and math scores is 0.

8. Refer to Exercise 7. The 100 bootstrapped mean differences (verbal − math) shown in the following stem-and-leaf plot were obtained from 100 bootstrap samples of 200 taken from the invented population with a mean difference of 0. Test whether the difference between SAT verbal and math scores is 0.

Stem: −16 −15 −14 −13 −12 −11 −10 −9 −8 −7 −6 −5 −4 −3 −2 −1 −0 0 1 2 3 4 5 6 7 8 9 10 11
Leaf: 5 0 4 2 976 9852 98732 9966621 87653210 77410 97554 55333321000 665 011268 02358999 1578 3356888 1245 244469 0 37 24 0235 2

Key: “11 2” stands for 11.2.

For additional exercises, see page 731.
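Step 5 of Example 12.1 reads a tail count off the stem-and-leaf display of Table 12.2, and the same kind of count is needed for the displays in the exercises. The sketch below decodes Table 12.2 and repeats the count. It rests on one assumption that we flag clearly: the stem −5 is taken to be empty, which is the reading of the table under which the leaves total 100 and the count of means at or above 0.24 equals the 9 reported in step 5. The names `TABLE_12_2` and `decode` are ours.

```python
# Table 12.2 as (sign, stem digit, leaves). The stem is the tenths digit and
# each leaf is a hundredths digit, so (-1, 6, "1") encodes -0.61.
TABLE_12_2 = [
    (-1, 6, "1"),
    (-1, 5, ""),  # ASSUMPTION: this stem has no leaves
    (-1, 4, "64"),
    (-1, 3, "84433331"),
    (-1, 2, "88877777554433"),
    (-1, 1, "988865554221110000"),
    (-1, 0, "877777653322221110"),
    (+1, 0, "1113388889"),
    (+1, 1, "01223345556667789"),
    (+1, 2, "0024589"),
    (+1, 3, "1334"),
    (+1, 4, "0"),
]

def decode(rows):
    """Return the plotted values in hundredths (integers avoid float rounding)."""
    return [
        sign * (10 * stem + int(leaf))
        for sign, stem, leaves in rows
        for leaf in leaves
    ]

values = decode(TABLE_12_2)
threshold = 24  # the observed mean difference 0.24, in hundredths

# Step 5: a trial is a success if its mean is as large as or larger than 0.24.
successes = sum(v >= threshold for v in values)
p_hat = successes / len(values)
print(len(values), successes, p_hat)  # prints: 100 9 0.09
```

Working in integer hundredths keeps the comparison with 0.24 exact. For a lower-tail test such as the Key Problem, the same helper applies with `<=` in place of `>=`.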