1 Assessment of uncertainty margins around population estimates: An alternative introduction to inferential statistics

1.1 Preface

The present chapter gives a very fundamental description of resampling techniques, in particular the bootstrap procedure. The text is set up as a first introduction to inferential statistics. This may give the reader the feeling that it takes rather long before the text 'comes to the point' (i.e., starts explaining the bootstrap), but one should realize that this text specifically aims at clarifying the rationale of the bootstrap: why is the bootstrap useful for inferential purposes? The text is much less aimed at describing the mechanics of the bootstrap, the philosophy being that, if one really understands the rationale underlying the bootstrap, it will be relatively easy to modify the bootstrap and adjust it to one's own purposes.

1.2 Introduction

When data are collected, we very often have data on a sample of observation units from a (usually much) wider population. To give a simple example, suppose our research goal is to assess whether male students at the University of Groningen drink more beer (measured as their weekly average) than female students. It is very unlikely that the university administration will offer the means to study all of the 20000 students that form the Groningen student population. So what we do in practice is draw a sample of, for example, 50 male students and 50 female students. The data we thus collect will depend on which persons happen to be in our sample. Now, if we found for our samples that, on average, the male students drink 8.98 glasses of beer per week, while the females have an average of 7.14, then obviously this does not mean that these averages hold for the populations of male and female students. So, when we use the sample averages as estimates for the population averages, we make an error. Several approaches are available to get some idea of the possible size of this error, and we will treat these in detail here. First, we will study the fictitious situation where we know the beer consumption in the complete population.

1.3 Fictitious situation: Complete population known

Suppose that the complete population of Groningen students consists of 20000 students, of whom 9888 are male and 10112 female, and that we know for each of these their weekly average beer consumption. Part of these population data is given in Table 1. A graphical representation of the data is given in Figure 1, separately for males and females. In these populations the average for males is 9.0 and for females 7.0.

Table 1. Part of the beer consumption data of the Groningen student population.

number  weekly average beer consumption (in glasses)  gender
1       9    male
2       7    male
3       9    male
4       7    female
5       8    male
6       10   male
7       8    female
8       7    female
9       7    female
10      9    male
11      9    male
12      8    female
13      8    male
14      11   male
15      7    female
16      7    female
17      10   male
18      9    male
19      7    female
20      8    male
...     ...  ...
19992   9    male
19993   9    male
19994   7    female
19995   6    female
19996   7    female
19997   7    female
19998   8    female
19999   7    female
20000   9    male

[Figure 1 shows two frequency histograms, one for the population of male students and one for the population of female students, with weekly beer consumption (0 to 16 glasses) on the horizontal axis and frequency (0 to 4000) on the vertical axis.]

Figure 1.
Histograms with frequencies of different weekly beer consumption scores (in glasses per week) for populations of male and female students.

We now draw from this population a random sample of 50 male students and a random sample of 50 female students. The sample data are given in Table 2. We see that the average for the 50 males is 8.98, whereas that for the 50 females is 7.14. This does not differ much from the averages in the population, but it differs nevertheless.

Table 2. Samples of 50 male and 50 female students.

male students                        female students
number glasses   number glasses      number glasses   number glasses
10     10        3926   10           7359   9         14942  8
4799   8         12672  9            2513   8         7341   8
1629   10        5818   10           17985  6         420    8
19907  8         18726  9            3599   8         18516  7
18718  9         11268  10           11014  7         2141   5
19725  10        6017   9            19549  9         2194   7
12571  9         6953   10           9363   8         39     7
1956   9         15820  11           11558  6         358    7
19570  8         2580   8            8825   8         5011   7
2734   9         17046  9            5429   8         7705   6
301    8         11476  9            8582   7         3453   9
19654  9         2880   9            666    6         8696   7
13350  8         1683   10           1709   7         9284   6
5331   9         10198  9            17963  8         6964   7
3071   7         7312   9            8012   8         15040  7
9333   9         8817   7            8923   8         2642   7
10938  10        6631   8            17758  6         11981  6
17333  10        222    9            14569  9         12165  6
7332   10        5631   9            9242   4         2552   8
8180   7         1833   10           16385  9         8258   6
7979   9         18840  9            12770  8         620    6
10571  9         11745  9            16866  8         14645  7
172    7         13230  8            3923   7         9584   6
17145  9         8389   10           16224  6         18145  6
7524   9         1718   9            18503  7         4892   8
mean   8.98                          mean   7.14

If we drew different samples of male and female students, we would find different means. To illustrate this, we drew 19 additional samples of 50 male and 50 female students and computed the mean beer consumption in all these samples. The means are given in Table 3.

Table 3. Sample means for 20 different samples of 50 male and 50 female students each.

            mean for males   mean for females
sample 1        8.98             7.14
sample 2        9.08             6.78
sample 3        8.90             6.74
sample 4        9.06             7.02
sample 5        9.10             6.92
sample 6        9.08             7.02
sample 7        9.00             6.96
sample 8        8.80             7.00
sample 9        8.96             7.06
sample 10       8.90             6.90
sample 11       9.00             6.98
sample 12       9.20             7.10
sample 13       9.18             6.88
sample 14       9.20             7.24
sample 15       9.28             6.74
sample 16       9.00             6.74
sample 17       9.00             7.08
sample 18       9.20             6.94
sample 19       9.10             7.16
sample 20       9.00             6.98
mean            9.05             6.97

We see that the mean beer consumption in the samples of 50 male students fluctuates around 9, or, more precisely, ranges from 8.80 to 9.28. For the females, the means fluctuate around 7, ranging from 6.74 to 7.24. This gives a reasonable impression of what you would find if you drew various different samples. Obviously, one could draw many more samples. To give an impression of what we find then, we drew 1000 samples of 50 male and of 50 female students, and computed the mean beer consumption for each sample. The means are represented by the histograms in Figure 2. Each bar indicates how often an average was found in the interval represented by the bar.

[Figure 2 shows two histograms of the 1000 sample means, one for male students (horizontal axis from 8.4 to 9.8) and one for female students (horizontal axis from 6.4 to 7.8), with the sample mean on the horizontal axis and the number of means in each interval on the vertical axis.]

Figure 2. Distributions of 1000 sample means for male and female students. Note: the Dutch labels translate as "mannelijke studenten" = "male students"; "vrouwelijke studenten" = "female students"; "aantal gemiddelden in interval" = "number of means in interval"; "steekproefgemiddelde" = "sample mean".
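The experiment above is easy to reproduce. The following Python sketch is hypothetical code, not part of the original text: it assumes numpy, and the composition of the synthetic population is merely an assumption chosen to give a mean near 9.0. It draws 1000 samples of 50 students and collects the sample means, as in Figure 2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the male student population of Section 1.3:
# 9888 scores drawn once from a discrete distribution with mean near 9.0.
population = rng.choice([7, 8, 9, 10, 11, 12],
                        size=9888,
                        p=[0.08, 0.20, 0.40, 0.26, 0.04, 0.02])

# Draw 1000 random samples of 50 students and record each sample mean,
# as in Figure 2.
sample_means = [rng.choice(population, size=50, replace=False).mean()
                for _ in range(1000)]

print(f"population mean: {population.mean():.2f}")
print(f"sample means range from {min(sample_means):.2f} "
      f"to {max(sample_means):.2f}")
```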
In Figure 2, we see the distribution of the 1000 sample means. Just as in our example of 20 samples, the sample means for the male students vary around 9, ranging roughly from 8.6 to 9.4, whereas those for the female students vary around 7, roughly between 6.6 and 7.4. Thus, we see that, if you draw 1000 random samples of 50 male students from a population with a mean of 9, you may find sample means that deviate a bit from this value (ranging from 8.6 to 9.4). In our original sample (see Section 1.2), we found an average of 8.98, but, if the population were what we assumed it to be just now, we could just as well have found a mean value some tenths higher or lower. Likewise, for the female students, for whom we found a mean of 7.14, we could just as well have found a mean value some tenths higher or lower, if it were a sample from our fictitious population.

What is the use of all this? If we really knew the beer consumption of each member of the Groningen student population, we would be able to predict the mean we would find in a sample of 50 male or female students, and we could predict its fluctuation from sample to sample. However, in practice, we do not know the beer consumption of each member of the population. In fact, if we already knew this, there would not have been any reason for carrying out the research in the first place. Hence, the situation we have here is the reverse: by means of our research based on a sample, we want to get an impression of the beer consumption in the complete Groningen student population. Whereas above we saw that, if we know the population, we can predict the mean beer consumption in a sample, we actually want the reverse: knowing the sample mean beer consumption, we want to estimate the mean beer consumption in the population. But how can we do that at all if we do not have any idea of what the population looks like? A first step is made in the next section.

1.4 More realistic situation: We don't know the population, but we are able to draw several samples from it

As mentioned above, in practice we do not know the mean beer consumption in the population. The question thus is: how can we get some insight into the mean beer consumption in the population on the basis of samples? In our original sample of 50 male and 50 female students, we found that the mean beer consumption for male students was 8.98, and for female students 7.14. Thanks to the analysis in Section 1.3, we realize that these means may very well have been found if in the complete population the mean beer consumptions are 9.0 and 7.0, respectively. Likewise, we can imagine that, if the means in the population were slightly different, for instance 9.1 and 7.1, then the story in Section 1.3 would be similar, and sample means would fluctuate around 9.1 and 7.1, respectively. So our sample means of 8.98 and 7.14 could also very well have been found if the population means were 9.1 and 7.1, respectively. In this way, there are many populations that might reasonably well have led to the sample means that we found. Hence we should not expect to be able to determine exactly what the population means are; we ought to be satisfied with a rough estimate of the mean beer consumption in the population, and of the uncertainty margins we should take into account for this estimate.
We would, for instance, like to make a statement like "We expect the mean beer consumption in the population of male students to be between 8.6 and 9.5", or, a bit more precisely, "We are 90% sure that the mean beer consumption in the population of male students is between 8.6 and 9.5".[1] How can we get to such statements? A first answer is "by drawing another sample", as follows.

We draw a second random sample of male and female students. This is easier said than done in practice, but suppose we are able to do it. Then we can compare the means from both samples. Table 4 gives an example of what you could find.

Table 4. Mean beer consumption of male and female students in two samples (in both samples 50 male and 50 female students).

                 mean beer consumption (in glasses of beer)
                 male students    female students
first sample         9.04              6.84
second sample        9.08              6.94

Here we see that the means in the two samples differ - not surprisingly; actually, it would be strange if they were exactly equal - but do not differ much. This gives an indication that, for the sample mean, it does not make a lot of difference which sample you draw. Indeed, when in two samples you find almost equal sample means, you may expect that you will find similar means in other samples as well. For instance, you could estimate that such sample means will not differ by more than 0.1 or 0.2 from what you found in the original sample. Of course, you have to be careful here, because it is possible that you found the almost equal sample means purely accidentally. However, you might use it as a first indication of the sampling fluctuation.

A totally different situation is the following. Suppose the sample means in your second samples are 8.20 and 8.50, respectively; see Table 5.

Table 5. Mean beer consumption of male and female students in two samples (in both samples 50 male and 50 female students).

                 mean beer consumption (in glasses of beer)
                 male students    female students
first sample         9.04              6.84
second sample        8.20              8.50

If we found the means in Table 5 in a second sample, we would certainly not conclude that it hardly matters which sample you draw. Indeed, the two samples give very different means, and even the order of the means for the males and females differs between the two samples. Moreover, we would now expect that, if we drew yet another random sample, the means would again be rather different from what we found before. Of course, we do not know this, but the comparison of two samples gives an indication that this may very well be the case. In short, if you want to give a rough indication of the sampling fluctuation, you could state that sample means may easily differ by one or even two glasses.

The above estimates of sampling fluctuation give just a rough indication. They are based on only one extra sample. It would be much better to draw many new samples, because this would give a complete picture of possible sampling fluctuation. However, in practice this is impossible: even drawing one extra sample is rarely possible, because of financial or time limitations.

[1] Absolute certainty can never be attained, because one can never exclude that the sample drawn is an extreme sample. Consider the situation where, in a population of 10000 males, there are 50 who drink 9.04 glasses of beer weekly, whereas the other 9950 drink only 1 glass per week. If your sample happens to consist of exactly these 50 'heavy drinkers', then the sample mean is 9.04, whereas the population mean is 1.04. The chance of drawing such a sample is very low (it happens once in every 3·10^136 times), but it cannot be totally excluded.
In practice, we usually must rely on only one sample. Even then, we can get insight into sampling fluctuations, which is the topic of the next section.

1.5 Practice: We have only one sample

We are now going to describe how we can get an indication of sampling fluctuation in cases where we have only one sample. Moreover, we will use this to give an indication of the uncertainty around our estimates of the population means. In fact, in this way we want something that is in principle impossible: we want to assess how much our sample means deviate from the population means, whereas we do not know the population means. In Section 1.4 we also did not know the population means, but there at least we had information on a second sample from the same population. There our reasoning was as follows: if we do not know anything about the population, but we do know how much different samples from the population may vary, we can use this to get a rough indication of how much sample means can fluctuate in general. If we cannot draw a new sample, it is harder to say something about sampling fluctuation, but not impossible.

1.5.1 First approach: the jackknife

A first approach to get insight into what could happen if you drew a different sample is simply to compute the mean that you get after leaving out one of the 50 male students from your sample. Specifically, you can compute the sample mean upon leaving out one individual from your sample, and then repeat this, leaving out each individual once. This can be done as follows. Suppose you leave out the data for one male student, and you find for the remaining 49 a mean of 8.6 (instead of the 8.98 you found for the sample of 50). Then you would immediately see that it is rather crucial whether or not this individual is present in the sample, because apparently the presence or absence of this single individual makes a difference of about 0.4 to the mean. In other words, you find that even a sample that very closely resembles the original sample has quite a different mean, so you can immediately conclude that there is quite a bit of sampling fluctuation in your mean. Now suppose that, upon leaving out one male student, you find for the remaining 49 a mean of 9.00. Then you might conclude that leaving out this single individual does not make a lot of difference (compared to the mean of 8.98 for the sample of 50). This is a first indication (and no more than that) that it does not make a lot of difference which sample you use. By repeating this approach, leaving out one individual from your sample each time, you can get reasonable insight into the sampling fluctuation of the mean of samples drawn from your population. This procedure is called the 'leave-one-out' or 'jackknife' procedure (e.g., see Efron & Tibshirani, 1993). Rather than repeatedly, randomly, leaving out one individual, in practice one usually leaves out each individual exactly once, and compares all the means thus obtained to get insight into the sampling fluctuation of the mean, for instance by computing the standard deviation of all such jackknife means. This is a first, and relatively simple, procedure for getting insight into the sampling fluctuation; a minimal sketch is given below.
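As an illustration, here is a minimal Python sketch of the jackknife. This is hypothetical code, not part of the original text; the sample array is reconstructed from the score counts in Table 7.

```python
import numpy as np

# The original male sample (composition as in Table 7): mean 8.98.
sample = np.array([7]*4 + [8]*8 + [9]*24 + [10]*13 + [11]*1)

# Leave-one-out (jackknife) means: drop each individual exactly once.
n = len(sample)
jackknife_means = np.array([np.delete(sample, i).mean() for i in range(n)])

print(f"sample mean: {sample.mean():.2f}")
print(f"jackknife means range from {jackknife_means.min():.3f} "
      f"to {jackknife_means.max():.3f}")
print(f"standard deviation of jackknife means: {jackknife_means.std():.4f}")
```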
In principle, the jackknife can also be used to get insight into the uncertainty margins one has to maintain around estimates of the population means, but rather than describing this here, we move on to the second procedure for assessing sampling fluctuation, which is better suited for finding uncertainty estimates around population means and other population parameters: the so-called bootstrap.

1.5.2 The bootstrap

Before we explain the bootstrap, we repeat what the goal of our analysis actually is; it is described schematically in Figure 3. The main goal of our analysis is to estimate the population means (of beer consumption in the whole population of male students and of female students). We now focus on the population of male students. The ideal way to assess the mean beer consumption in the population of male students is simply to study the whole population of male students, and compute the mean beer consumption from that. Unfortunately, in practice this is unfeasible (indicated by the light gray characters in Figure 3: there is a population mean, but we do not know it). To nevertheless give an estimate of the population mean, we draw a sample from that population, and compute the mean in the sample of male students. We use this sample mean as an estimate of the population mean. However, in practice this estimate will never be perfectly equal to the population mean. We can simply 'hope that it will at least be a reasonable estimate', but this is not a very rational approach, since in this way we do not have any idea of the reliability of our estimate. As soon as we want to use such an estimate (e.g., for theory building, or decision making), we would also like to know to what extent we can take our result seriously. We would like to have an idea of how much our estimate could, reasonably, deviate from the population mean. The best way to study how much sample means usually deviate from the population mean is to draw many samples, and compute the sample mean for each sample. In Figure 3 we see four such sample means: the actual sample mean and three others. From the complete collection of sample means we would find a broad range of possible sample mean values (e.g., see Table 3), which would give a good indication of how much our estimate (i.e., our sample mean) could, reasonably, deviate from the population mean. In practice, however, we cannot use this: in practice, we draw only one sample. (Therefore, in Figure 3, all other samples are in light gray.) So we now are in the situation where we have only one sample, and yet we want to estimate both the population mean and the uncertainty margins of our estimate of the population mean. For this purpose we use the bootstrap.

[Figure 3 is a schematic: a population (mean = 9.0) from which several samples are drawn, with sample means 8.98 (the actual sample), 9.08, 9.10, 8.96, and so on.]

Figure 3. Schematic overview of results of drawing samples. Everything set in light gray is unknown in practice.

With the bootstrap we reason as follows. We try to imagine what would happen if we had a different random sample from the same population. Then we could reason: suppose that, instead of each person in my sample, I had an arbitrary other person from the population; what would then be the sample mean? The next question is how we could replace each person by an arbitrary other person from the population. In the bootstrap, this is done as follows. Our sample consists of 50 persons from the population of male students for whom we know the beer consumption scores.
The scores are (ordered):

7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 12

We now reason that 'an arbitrary other person from this population' must be someone with an arbitrary score that fits in with our population. Because we (only) know that our sample is a sample from that population, it is reasonable to assume that 'an arbitrary person' from this population would have one of the scores that we actually encountered in the sample. Obviously, 'an arbitrary other' could have a different score, but this is less likely than that this 'arbitrary other' has a score that also occurred in the sample. This is because we know for sure that the scores 7, 8, 9, 10, 11 and 12 occur in the population, and some of these (like 8, 9 and 10) even relatively frequently, as suggested by what we found in our sample. The crucial step in the bootstrap is that we assume that 'an arbitrary other person' from the population will have one of the scores that also occurred in the sample, and that the probability of this person having a particular score is directly related to the frequency of this score in the sample. More precisely, in our example, we assume that 'an arbitrary other' has a chance of 4/50 to have a score of 7, a chance of 10/50 to have a score of 8, a chance of 20/50 to have a score of 9, a chance of 13/50 to have a score of 10, a chance of 2/50 to have a score of 11, and a chance of 1/50 to have a score of 12. Thus, we have actually defined a probability distribution, also given in Table 6. In the bootstrap we thus assume that the distribution in the sample is identical to that in the population. In reality, this will obviously not be exactly true. However, when we try to guess how the scores in the population could be distributed, knowing only the scores in the sample, it is not unreasonable to assume that the scores in the population are distributed more or less as in the sample. In any case, of all ideas we might have about the distribution of scores in the population, it is the 'most likely'[2] idea, because it is based on the only information we have, that is, the scores in the sample.

Table 6. Hypothetical probability distribution of scores for arbitrary others from the population.

Score        7     8     9     10    11    12
Probability  0.08  0.20  0.40  0.26  0.04  0.02

[2] Efron & Tibshirani (1993) therefore call this a nonparametric maximum likelihood estimate of the population distribution.

Next, we can replace each person in the sample by an 'arbitrary other' with a score drawn randomly from the probability distribution in Table 6. This is exactly what happens in the bootstrap. As we replace each of the 50 persons by an 'arbitrary other', we could also say that we simply draw a new sample of 50 persons with scores drawn randomly from the probability distribution in Table 6. This new sample is called a 'bootstrap sample', and we treat it as a real new sample. So we can compute its mean, and inspect how much this deviates from the mean of our original sample. Obviously, we repeat this procedure many times to get many such bootstrap samples, and compute the mean in each. This collection of means then shows us 'everything' that could happen to the mean if we had a somewhat different sample. The above procedure is the bootstrap.
This term was chosen to indicate, via an analogy, that we do not use any information other than that available in our original sample, and that we do not invoke any outside help. The term refers to the bootstraps used by the legendary Baron Munchausen, who was able to lift himself out of a swamp (and avoid drowning) by pulling on his own bootstraps (see Efron & Tibshirani, 1993). The analogy is that the baron, like the researcher, actually needs outside help, which, however, is not available. The baron, like the researcher, finds an emergency solution in using the most usable thing that is available: his own bootstraps and the sample data, respectively. Against all physical laws, the baron actually succeeds in salvaging himself; likewise, seemingly against all logic, the researcher is able to use only his own sample data to estimate the sampling fluctuations and (as we will see) the uncertainty in his population estimate.

A slightly more concrete image of the above idea is to consider a big population, for instance of 10000 persons with scores 7 through 12, such that 800 have score 7, 2000 score 8, 4000 score 9, 2600 score 10, 400 score 11, and 200 score 12. If you randomly draw 50 persons from this population, then for each draw the probability of each score entering the sample is given by the probability distribution in Table 6. So, in fact, in the bootstrap we draw random samples from a hypothetical population with a distribution of scores as in Table 6. Thus, the bootstrap is similar to what we did in Section 1.3, namely drawing samples from a population. However, in Section 1.3 we drew from the real population, whereas in the bootstrap we use a substitute for the population (a 'plug-in', according to Efron & Tibshirani, 1993). This plug-in is based on the distribution of scores in the sample, which we use as an estimate of the population distribution.

Two important questions have remained unanswered. These will be treated in the next sections.

1.5.3 How do we get a bootstrap sample?

The first question, how to get a bootstrap sample, is easy to answer. We get a bootstrap sample by drawing a sample of size n from our original sample, drawing with replacement. (In fact, if we drew without replacement, we would always get exactly the same bootstrap sample.) In this way we mimic the process of sampling from a large population with the same distribution of scores as in the original sample, but we avoid actually setting up such a large population. To get a random sample with replacement from our own sample, we use a computer program with a random number generator (e.g., SPSS, Excel, Matlab). We assign sequence numbers 1 through 50 to all scores in our sample, randomly generate 50 integers in this range, and consider these as our new sample of 50 sequence numbers. Then, we construct our bootstrap sample as the set of scores associated with these 50 sequence numbers; a minimal sketch of this procedure is given below. In Table 7 you find, next to the original sample, the scores in two bootstrap samples (all ordered by sequence number). We see clearly that some individuals appear twice in the bootstrap samples and others not at all, which is a consequence of drawing with replacement. This is no problem because, in fact, we are no longer interested in the actual individuals, but only in the scores, and in the distribution of scores in the hypothetical population. The bottom line of Table 7 contains the means of the bootstrap samples.
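The following minimal Python sketch is hypothetical code, not part of the original text; numpy's random generator plays the role of the random number generator mentioned above, and the sample array is reconstructed from the counts in Table 7. It draws two bootstrap samples and computes their means.

```python
import numpy as np

rng = np.random.default_rng(42)

# The original male sample (composition as in Table 7): mean 8.98.
sample = np.array([7]*4 + [8]*8 + [9]*24 + [10]*13 + [11]*1)

def bootstrap_means(sample, n_boot):
    """Draw n_boot bootstrap samples (with replacement, same size as
    the original sample) and return the mean of each."""
    n = len(sample)
    return np.array([rng.choice(sample, size=n, replace=True).mean()
                     for _ in range(n_boot)])

means = bootstrap_means(sample, n_boot=2)
print(f"original sample mean: {sample.mean():.2f}")
print(f"two bootstrap sample means: {means[0]:.2f}, {means[1]:.2f}")
```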
The mean in the first bootstrap sample is 9.02; in the second it is 8.98. Thus, we start seeing that if you draw different samples from the hypothetical population, these have means that lie rather close to the original sample mean. With this, we start getting an answer to our second question, as described in the next section.

Table 7. Beer consumption scores in the original sample of 50 male students and in two bootstrap samples (sorted by student number).

original sample:
no.    score   no.    score
10     10      8180   7
172    7       8389   10
222    9       8817   7
301    8       9333   9
1629   10      10198  9
1683   10      10571  9
1718   9       10938  10
1833   10      11268  10
1956   9       11476  9
2580   8       11745  9
2734   9       12571  9
2880   9       12672  9
3071   7       13230  8
3926   10      13350  8
4799   8       15820  11
5331   9       17046  9
5631   9       17145  9
5818   10      17333  10
6017   9       18718  9
6631   8       18726  9
6953   10      18840  9
7312   9       19570  8
7332   10      19654  9
7524   9       19725  10
7979   9       19907  8
mean: 8.98

bootstrap sample 1:
no.    score   no.    score
10     10      8180   7
172    7       8180   7
1629   10      8389   10
1629   10      8817   7
1683   10      8817   7
1683   10      9333   9
2734   9       9333   9
2734   9       10198  9
2880   9       10571  9
2880   9       10938  10
3071   7       11268  10
3926   10      11268  10
4799   8       11745  9
4799   8       12571  9
4799   8       13230  8
5331   9       17046  9
5331   9       17046  9
5631   9       17046  9
5818   10      17046  9
5818   10      17333  10
6953   10      18718  9
6953   10      18840  9
7332   10      18840  9
7524   9       19654  9
7979   9       19725  10
mean: 9.02

bootstrap sample 2:
no.    score   no.    score
172    7       10571  9
301    8       10938  10
1683   10      10938  10
1683   10      11268  10
1833   10      11268  10
2580   8       11268  10
2580   8       11268  10
2580   8       11268  10
2734   9       11745  9
2880   9       12571  9
3071   7       12672  9
3926   10      12672  9
3926   10      12672  9
3926   10      13350  8
5631   9       15820  11
5631   9       17145  9
5818   10      18718  9
6631   8       18726  9
7312   9       18840  9
8389   10      19570  8
8389   10      19570  8
8817   7       19654  9
8817   7       19725  10
8817   7       19907  8
9333   9       19907  8
mean: 8.98

1.5.4 How do we use the information from several bootstrap samples to assess the uncertainty margins around our sample mean when used as an estimate of the population mean?

To get a well-founded answer to this question, we draw very many bootstrap samples, for instance 500, and compute the mean of each. We then get a distribution of bootstrap means, and consider this as an indication of the distribution of sample means we would get if we drew many samples from the real population. This is done as follows. In Table 8 we have the means of 100 bootstrap samples, and the distribution of the bootstrap sample means is represented graphically in the histogram in Figure 4.

Table 8. Means of 100 bootstrap samples.

no.  mean   no.  mean   no.  mean   no.  mean   no.   mean
1    8.84   21   9.04   41   9.04   61   8.88   81    9.06
2    8.86   22   8.84   42   8.94   62   8.80   82    9.10
3    9.08   23   8.82   43   9.00   63   9.00   83    8.76
4    9.02   24   8.98   44   8.96   64   8.88   84    8.96
5    8.86   25   9.04   45   8.86   65   8.92   85    9.22
6    8.56   26   8.98   46   9.06   66   8.82   86    9.34
7    8.88   27   9.12   47   8.96   67   8.88   87    8.94
8    8.88   28   9.04   48   9.02   68   9.18   88    8.96
9    9.10   29   9.16   49   8.92   69   9.00   89    8.96
10   9.00   30   9.10   50   8.96   70   9.12   90    8.96
11   9.10   31   9.08   51   8.98   71   8.96   91    8.94
12   8.94   32   9.04   52   9.02   72   8.98   92    9.08
13   8.86   33   8.98   53   8.86   73   9.16   93    9.04
14   8.88   34   9.06   54   9.06   74   9.06   94    8.84
15   8.96   35   8.94   55   9.14   75   8.76   95    9.20
16   9.06   36   9.04   56   8.86   76   8.86   96    9.06
17   8.98   37   9.10   57   8.92   77   8.82   97    8.88
18   9.12   38   9.04   58   8.84   78   9.02   98    8.72
19   9.00   39   9.00   59   8.94   79   9.04   99    9.26
20   9.00   40   8.88   60   9.06   80   8.88   100   8.96

[Figure 4 is a histogram of the 100 bootstrap sample means, with the mean (from 8.4 to 9.4) on the horizontal axis and the frequency (0 to 40) on the vertical axis.]

Figure 4. Distribution of bootstrap sample means.

We now see that the bootstrap means vary between 8.55 and 9.35, and that the average of all bootstrap means is almost equal to the mean of the original sample. If we disregard the 10% most extreme values, we find that 90% of the means lie between 8.8 and 9.2. Such an interval is called a '90% percentile interval', as it indicates the range of the middle 90% of the bootstrap mean values. We could also express the variation in bootstrap means by the standard deviation of the bootstrap means, which in this case is 0.12.

How does all this help us estimate our uncertainty margins when we use our sample mean as an estimate of the population mean? We have now found that, if the population had the same distribution as the sample, then, when randomly sampling from this population, 90% of the sample means would lie between 8.8 and 9.2, and their spread, as measured by the standard deviation, would be 0.12. The 90% percentile interval hence is a '90% prediction interval': it gives the interval in which we expect (with 90% certainty) to find the means of samples drawn from a population with the same score distribution as our sample. This indicates what could happen if we drew a new sample from a population with a distribution similar to that of our sample. Yet, this does not answer our actual question. We actually wanted to estimate the mean in the population and to provide this estimate with uncertainty margins. We can now do so by the following reasoning. We assume that, whatever the population mean is, the type and degree of variation of means of samples from this population will be the same as that of our bootstrap samples. If this assumption holds reasonably well, then the information we have on the bootstrap means can be used to estimate the amount of variation in the sample means that you would get when sampling from the actual population. This can be done as follows. We have found that 90% of the bootstrap means lie between 8.8 and 9.2, hence in an interval of width 0.4. If sample means from the real population vary over a similar range, we can conclude that, whatever the population mean is, approximately 90% of the sample means will differ at most (approximately) 0.2 from the population mean. Conversely, then, in 90% of the cases where you draw a sample, the population mean will not deviate more than (approximately) 0.2 from the sample mean. In other words, we can thus be 90% sure that the population mean lies between 8.98 - 0.2 = 8.78 and 8.98 + 0.2 = 9.18. Such an interval, which indicates the location of the population value with 90% certainty, is called a '90% confidence interval'. Thus, we have (at last) an answer to the question of how to get an estimate of the uncertainty margins around a sample mean when this is used as an estimate of the population mean. We can say that our estimate of the population mean is 8.98, and that we are 90% sure that the actual population mean lies within an (uncertainty) margin of 0.2 from this value.

Above, we have thus, on the basis of our bootstrap means, found a 90% confidence interval from 8.78 to 9.18. This interval is very similar to the earlier 90% percentile interval running from 8.8 to 9.2. That is no coincidence.
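The computations above can be condensed into a few lines of Python. The sketch below is hypothetical code; the exact numbers will differ slightly from those in the text, since they depend on the random draws. It computes the 90% percentile interval and the corresponding 90% confidence interval following the reasoning just given.

```python
import numpy as np

rng = np.random.default_rng(7)

# Original male sample (composition as in Table 7): mean 8.98.
sample = np.array([7]*4 + [8]*8 + [9]*24 + [10]*13 + [11]*1)

boot_means = np.array([rng.choice(sample, size=len(sample)).mean()
                       for _ in range(500)])

# 90% percentile interval: 5th to 95th percentile of the bootstrap means.
lo, hi = np.percentile(boot_means, [5, 95])

# 90% confidence interval by the reasoning in the text: the population
# mean should lie within half the width of the percentile interval
# around the observed sample mean.
half_width = (hi - lo) / 2
mean = sample.mean()

print(f"90% percentile interval: [{lo:.2f}, {hi:.2f}]")
print(f"90% confidence interval: "
      f"[{mean - half_width:.2f}, {mean + half_width:.2f}]")
print(f"standard deviation of bootstrap means: {boot_means.std():.3f}")
```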
If the distribution of bootstrap means is roughly symmetric[3], and the mean of the bootstrap means is close to the sample mean, then the 90% percentile interval will be roughly equal to the 90% confidence interval. The percentile interval is easier to determine, because it simply runs from the 5th percentile to the 95th percentile. In many practical cases the percentile interval suffices to give a reasonable indication of the uncertainty margins around a sample mean (when used as an estimate of the population mean). More advanced bootstrap procedures have been developed, which give more accurate estimates of confidence intervals (see Efron & Tibshirani, 1993), but here we will limit ourselves to the simplest case only.

[3] Or if a monotone transformation of the measure at hand exists for which the distribution is roughly symmetric.

In a similar way as for the male students, we can construct a 90% percentile interval for the mean beer consumption of the female students. In our sample, we found a mean of 7.14. We drew 100 bootstrap samples, computed the mean scores in all these bootstrap samples, ordered them, and determined the 90% percentile interval by determining the 5th and 95th percentiles of these values. These were 6.8 and 7.4, respectively. Thus, we can state that for male students our 90% certainty estimate is that the mean beer consumption lies between 8.8 and 9.2, whereas that for the female students lies between 6.8 and 7.4. It follows that it is very likely that in the population of male students the average beer consumption is higher than in the population of female students.

1.6 Uncertainty margins for other measures

Above, we determined uncertainty margins for sample means, for situations where the sample mean is used as an estimate of the population mean. Besides means, in practice other measures are often used for summarizing data; examples are the median and the correlation. For such measures too, one will use sample data to estimate their population values, and again one can use the bootstrap to determine uncertainty margins around such population estimates. In fact, one can always use the bootstrap to get insight into the uncertainty margins around measures computed for a sample, when these are used as estimates of the population values of these measures. Thus, the bootstrap is a very powerful procedure, usable for assessing uncertainty margins for whichever measure you want to use. It should be noted, though, that the bootstrap cannot perform miracles. If an estimate of a population value cannot logically be made on the basis of sample data, then the bootstrap cannot give a sensible measure of uncertainty around such a nonsensical estimate. An example is using the maximum score in a sample as an estimate of the maximum score in a population (see Efron & Tibshirani, 1993, p. 81ff). In some situations, like estimating means, there are easier ways to construct confidence intervals. For other measures, like the median, such alternatives are unavailable or of dubious quality, and the bootstrap is recommendable. In some situations, simple percentile intervals are not good enough and advanced bootstrap procedures are called for, but even in such cases simple percentile intervals give at least some insight into the uncertainty margins in situations where other methods are unavailable. As an illustration, we now first describe the construction of bootstrap percentile intervals for the median obtained in a particular sample.
Next, as a second example, we describe the construction of bootstrap percentile intervals for the correlation between two variables.

1.6.1 Bootstrapping the median

We consider a study in which reaction times were measured for 40 subjects (20 in each of two conditions). The first condition was the control condition; in the second, experimental condition, subjects received a dose of a sleep-inducing drug. The scores are given in Table 9. Clearly, in the sleep-inducing drug condition there are a few extremely high reaction times. These strongly affect the mean of the scores (see Table 9). Therefore, the median is considered a better measure here to summarize the 'average' reaction time. The medians in both conditions are given in Table 9 as well. We would now like to know how well the sample medians can be used as estimates of the population medians. In other words, we would like to know the uncertainty margins around the sample medians when these are used as estimates of population medians.

Table 9. Reaction times (in ms) in two conditions.

control condition        sleep-inducing drug condition
179    114               208    171
196    101               892    188
152    189               202    228
188    120               183    218
117    130               193    196
198    166               173    207
127    128               208    229
125    147               226    1275
188    106               203    210
174    199               214    155
Mean   152.2             Mean   288.95
Median 149.5             Median 207.5

To get an answer to the above question, we drew 500 bootstrap samples from each of our two samples, and computed the median in each bootstrap sample. The frequencies of these medians are represented graphically in the histograms in Figure 5.

[Figure 5 shows two histograms of the 500 bootstrap medians, one for the control condition and one for the sleep-inducing drug condition.]

Figure 5. Histograms for 500 bootstrap medians for the control and sleep-inducing drug conditions. Note: the Dutch titles translate as "Bootstrapmedianen voor controle conditie" = "Bootstrap medians for control condition"; "Bootstrapmedianen voor slaapmiddel conditie" = "Bootstrap medians for sleep-inducing drug condition".

For the medians in the control condition, the 90% percentile interval runs from 127 to 181. In the sleep-inducing drug condition, the 90% percentile interval for the median runs from 196 to 212.5. We thus see that it is likely that the median reaction time in the sleep-inducing drug population is higher than that in the control population.
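A minimal Python sketch of this computation follows. This is hypothetical code; the resulting intervals will differ slightly from those reported above, since they depend on the random draws.

```python
import numpy as np

rng = np.random.default_rng(3)

# Reaction times (ms) from Table 9.
control = np.array([179, 196, 152, 188, 117, 198, 127, 125, 188, 174,
                    114, 101, 189, 120, 130, 166, 128, 147, 106, 199])
drug = np.array([208, 892, 202, 183, 193, 173, 208, 226, 203, 214,
                 171, 188, 228, 218, 196, 207, 229, 1275, 210, 155])

def percentile_interval(data, stat, n_boot=500, coverage=90):
    """Bootstrap percentile interval for an arbitrary statistic."""
    boot = [stat(rng.choice(data, size=len(data))) for _ in range(n_boot)]
    alpha = (100 - coverage) / 2
    return np.percentile(boot, [alpha, 100 - alpha])

for name, data in [("control", control), ("drug", drug)]:
    low, high = percentile_interval(data, np.median)
    print(f"{name}: median {np.median(data):.1f}, "
          f"90% percentile interval [{low:.1f}, {high:.1f}]")
```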
1.6.2 Bootstrapping the correlation: An example of bootstrapping multivariate data

In a small study with 19 students, we recorded their age and the number of sexual partners they had had in the past year. The data are given in Table 10, and represented by a scatter plot in Figure 6. We also computed the (linear) correlation coefficient between these variables, which was -.57.

Table 10. Data on age and number of sexual partners in the past year.

subj  age    no. of sexual partners
1     18.2   5
2     19.1   3
3     24.5   2
4     18.7   9
5     18.3   4
6     20.9   3
7     22.0   3
8     24.2   5
9     21.1   3
10    25.4   1
11    19.6   4
12    18.2   3
13    24.1   1
14    19.8   4
15    23.9   2
16    22.7   2
17    19.8   4
18    20.2   4
19    21.5   5

[Figure 6 is a scatter plot of the data in Table 10, with age (16 to 30) on the horizontal axis and the number of partners in the past year (0 to 10) on the vertical axis; r = -.57.]

Figure 6. Scatter plot of the data in Table 10. Note: "leeftijd" = "age"; "aantal partners in afgelopen jaar" = "number of sexual partners in past year".

It seems to hold, roughly, that the older the students, the fewer sexual partners they had in the past year. The obtained correlation is quite strong, but it should be noted that it is based on only a small sample. Therefore, we may have serious doubts about the accuracy of this correlation as an estimate of the correlation between age and the number of sexual partners in the population of all students.

To get insight into this, we can again use the bootstrap, but the situation is a bit more special than in the univariate cases (i.e., with only one variable) described above. To use the bootstrap in the present situation, where we deal with scores on two variables (hence we have multivariate data), a bootstrap sample is obtained by drawing, with replacement, a number of pairs of scores. Specifically, we can use the subject numbers as sequence numbers, randomly generate (with replacement) a sample of these sequence numbers, and then use the data associated with these sequence numbers as our bootstrap sample. Two examples are given in Table 11. It should be emphasized that, in producing bootstrap samples, pairs of scores are never separated. Hence, in the bootstrap samples we do encounter the score pair (24.2, 5), but not, for instance, (24.2, 4), because this pair did not occur in the original sample.

Table 11. Two examples of bootstrap samples.

bootstrap sample 1        bootstrap sample 2
age    partners           age    partners
21.1   3                  19.8   4
18.3   4                  19.1   3
18.2   5                  19.8   4
21.1   3                  22.7   2
24.2   5                  18.3   4
20.9   3                  21.5   5
22.7   2                  18.2   5
21.5   5                  24.2   5
24.1   1                  21.1   3
25.4   1                  21.5   5
19.6   4                  24.5   2
18.2   3                  18.2   3
22.7   2                  24.1   1
18.3   4                  18.7   9
23.9   2                  18.7   9
24.5   2                  19.8   4
24.5   2                  20.9   3
21.5   5                  20.2   4
19.8   4                  23.9   2
r = -.60                  r = -.51

In this way, 5000 bootstrap samples were drawn from the original sample, and in each of them the correlation was computed. The values of these 5000 correlations are summarized in the histogram in Figure 7. We see that the correlation varies quite a bit over these bootstrap samples, and in a number of bootstrap samples we even encounter positive correlations. These, however, pertain to the extreme findings. When we consider only the middle 90% of the values, hence determine the 90% percentile interval, we find that it runs from -.83 to -.29. We thus conclude that the accuracy of the sample correlation (of -.57) as an estimate of the population correlation is not very high; indeed, in the population the correlation could 'very well' be a different value between -.83 and -.29. On the other hand, we are reasonably confident that the correlation in the population will be at least mildly negative, hence that younger students tend to have had somewhat more sexual partners in the past year than older students. A sketch of this pairwise bootstrap is given below.

Figure 7. Histogram of correlations between age and number of sexual partners in 5000 bootstrap samples.

In the histogram of bootstrap correlations we can discern a special feature: there are notably more values lower than -.57 than there are higher than -.57. Moreover, the values higher than -.57 are more spread out; in other words, the distribution of the bootstrap correlations is skewed to the right. This is no coincidence. If we drew samples from a population in which the correlation is -.57, we would also obtain a picture like this. In this way, the distribution of the bootstrap correlations mirrors that of the correlations in samples drawn from a population. However, we do not know whether in reality the population correlation is (close to) -.57; in fact, given the large spread, this correlation might just as well be .2 or .3 lower or higher.
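Here is a minimal Python sketch of the pairwise resampling just described (hypothetical code, not part of the original text). Note that whole subjects are resampled, so pairs of scores are never separated.

```python
import numpy as np

rng = np.random.default_rng(5)

# Age and number of partners from Table 10 (paired observations).
age = np.array([18.2, 19.1, 24.5, 18.7, 18.3, 20.9, 22.0, 24.2, 21.1, 25.4,
                19.6, 18.2, 24.1, 19.8, 23.9, 22.7, 19.8, 20.2, 21.5])
partners = np.array([5, 3, 2, 9, 4, 3, 3, 5, 3, 1, 4, 3, 1, 4, 2, 2, 4, 4, 5])

n = len(age)
boot_r = np.empty(5000)
for b in range(5000):
    # Resample whole subjects: each drawn index carries both scores along.
    idx = rng.integers(0, n, size=n)
    boot_r[b] = np.corrcoef(age[idx], partners[idx])[0, 1]

low, high = np.percentile(boot_r, [5, 95])
print(f"sample correlation: {np.corrcoef(age, partners)[0, 1]:.2f}")
print(f"90% percentile interval: [{low:.2f}, {high:.2f}]")
```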
1.6.3 Other possibilities

In Sections 1.5 and 1.6.1, means and medians, respectively, were compared across populations. In both cases, we gave uncertainty margins for each population estimate. Instead, in both cases we could have given uncertainty margins for the difference between the populations. Specifically, we could have computed, for instance, the difference between the medians in the two original samples, and next, many times, drawn one bootstrap sample from the one original sample and one from the other, and computed the difference between the two bootstrap medians. Thus, we would obtain a large number of bootstrap differences in medians, from which we could determine a 90% percentile interval, which in turn can be used to set up a 90% confidence interval around the observed difference in medians. A minimal sketch of this procedure follows below.
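The following Python sketch of this difference bootstrap is hypothetical code, using the reaction time data of Table 9; the interval obtained depends on the random draws.

```python
import numpy as np

rng = np.random.default_rng(11)

# Reaction times (ms) from Table 9.
control = np.array([179, 196, 152, 188, 117, 198, 127, 125, 188, 174,
                    114, 101, 189, 120, 130, 166, 128, 147, 106, 199])
drug = np.array([208, 892, 202, 183, 193, 173, 208, 226, 203, 214,
                 171, 188, 228, 218, 196, 207, 229, 1275, 210, 155])

# Bootstrap the difference between the two medians: resample each
# group separately, then take the difference of the two medians.
diffs = np.array([
    np.median(rng.choice(drug, size=len(drug)))
    - np.median(rng.choice(control, size=len(control)))
    for _ in range(500)
])

low, high = np.percentile(diffs, [5, 95])
print(f"observed difference in medians: "
      f"{np.median(drug) - np.median(control):.1f}")
print(f"90% percentile interval for the difference: [{low:.1f}, {high:.1f}]")
```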
1.7 Relation to classical statistics

In Section 1.5 it was described that the basic principle of the bootstrap is the following reasoning. We wish to guess what would happen to our summary measure (e.g., mean, correlation, etc.) if we had an arbitrary other sample than the one we actually have. For this purpose, we imagine what would happen if we replaced each given observation unit by an arbitrary other unit from the same population. The question then is how we can actually carry out this replacement. In classical statistics, the answer to this question is given in a rather different way. In classical statistics, it is usually assumed that the sampling distribution of the measure at hand (e.g., the mean, the median, etc.) has a particular shape that is easily characterized, usually by means of only a few parameters, the most common example being the normal distribution. This assumption only pinpoints the shape of the distribution; to fully specify the distribution we would need (estimates of) the population mean and the population standard deviation, but to obtain confidence intervals, we only need the standard deviation. In classical statistics, this standard deviation is usually estimated on the basis of the data. Thus, in this respect, classical statistics is also a kind of bootstrap procedure, in that information in the actual sample is used to estimate the uncertainty of our population estimate. The difference with the bootstrap described above is that in classical statistics the shape of the distribution of sample measures is specified by assuming that it is given by a simple mathematical function, whereas the bootstrap makes no such assumption. Instead, in the bootstrap, the distribution of sample measures is obtained by repeatedly drawing samples from a hypothetical population, which itself is our best guess of what the actual population looks like.

The classical statistics assumption that sample means are normally distributed is reasonable when the scores themselves are (nearly) normally distributed in the population, or when the sample size is quite large. In fact, even when the scores in the population are clearly not normally distributed, but sample sizes exceed, say, 30, sample means are often nearly normally distributed. However, in the case of small samples from populations that are clearly not normally distributed, confidence intervals obtained by classical procedures can be grossly incorrect, and bootstrap procedures will be more reliable. It should be emphasized that this does not mean that bootstrap procedures work perfectly in the case of small samples and strongly nonnormal distributions. Indeed, in such cases, to get good confidence intervals it becomes relatively important to use the more advanced bootstrap procedures, and even these will not always give exactly proper confidence intervals. However, in such situations bootstrap confidence intervals can be expected to at least work better than classical confidence intervals, because the classical ones rely on assumptions that, in such cases, are clearly violated, whereas the bootstrap intervals do not rely on such assumptions.

It may be noteworthy that, when a sufficient number of bootstrap samples is drawn, the standard deviation of the bootstrap sample means will be about the same as the standard deviation of the sample mean as estimated in classical statistics; however, the classical confidence interval will differ from the percentile interval, because the intervals are also based on the shape of the distribution (and not only on the standard deviation). In practice, the classical approach, when available, is often preferred, because it is easiest to apply: instead of drawing a large number of bootstrap samples and computing the measure of interest in each of them, it suffices to compute some sample statistics and insert these into a specific formula for a confidence interval around a population estimate; a comparison of the two approaches is sketched below. This works fine for a number of well-known measures, provided that the required assumptions are not violated too heavily. However, the assumptions are often violated to quite a large extent. Moreover, for new measures, classical procedures are not yet available, and using them would first require the mathematical statistical derivation of such formulas, which may be cumbersome and/or require untenable assumptions. In such situations of heavy violation of assumptions, or of unavailable 'classical' results, one can always use the bootstrap, which, provided that in certain complex cases (see Efron & Tibshirani, 1993) the advanced bootstrap procedures are used, gives an estimate of the uncertainty margins for whatever measure one wishes to use. The procedure is also easily generalized to situations with more than one outcome measure. A brief introduction to this is given in the next, and final, section.
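To make the comparison concrete, the following sketch (hypothetical code, using scipy only for the t quantile) computes both a classical 90% confidence interval for the mean and a bootstrap percentile interval for the male sample; with a sample of this size and shape, the two come out very similar.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)

# Original male sample (composition as in Table 7): mean 8.98.
sample = np.array([7]*4 + [8]*8 + [9]*24 + [10]*13 + [11]*1)
n = len(sample)

# Classical 90% confidence interval: mean +/- t * (s / sqrt(n)).
sem = sample.std(ddof=1) / np.sqrt(n)
t = stats.t.ppf(0.95, df=n - 1)
print(f"classical 90% CI: [{sample.mean() - t*sem:.2f}, "
      f"{sample.mean() + t*sem:.2f}]")

# Bootstrap 90% percentile interval.
boot_means = [rng.choice(sample, size=n).mean() for _ in range(2000)]
low, high = np.percentile(boot_means, [5, 95])
print(f"bootstrap 90% percentile interval: [{low:.2f}, {high:.2f}]")
print(f"classical SE: {sem:.3f}, bootstrap SD: {np.std(boot_means):.3f}")
```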
1.8 Bootstrapping several outcome parameters jointly

In many multivariate applications, the result of the analysis of a data set consists of several outcome values. A simple, and common, example is multiple regression. Suppose we want to predict the scores on a criterion variable Y from the scores on a number (p) of predictor variables. We collect data for a number of observation units on all p predictor variables and on the criterion variable, and next submit this data set to a procedure for carrying out a multiple regression analysis. Note that this could be any variant of multiple regression analysis, not just the classical one. The result of such a multiple regression analysis is a set of regression weights, say (b1, ..., bp). These weights are based on the sample scores, but we will use them as estimates of the regression weights that apply to the complete population. The question, again, is how accurate these estimates are, or, in other words, what the uncertainty margins around these regression weights are. For ordinary multiple regression, classical procedures (based on normality assumptions) are available for the construction of confidence intervals around regression weights. However, these do not work for special types of multiple regression procedures. A general procedure that does always apply, however, is the bootstrap, which here works as follows.

In the multiple regression situation, we have a data set with scores of n observation units on p predictor variables and one criterion variable, so each observation unit has p+1 scores. From the sample of n observation units we draw a number of bootstrap samples in the same way as we did when bootstrapping the correlation. That is, we can use the observation unit numbers as sequence numbers, randomly generate (with replacement) a sample of these sequence numbers, and then use, for each (re)drawn observation unit, the p+1 scores associated with it, and thus constitute a bootstrap sample. In this manner we draw a large number (say 500) of bootstrap samples, and in each bootstrap sample we compute the regression weights according to the multiple regression method that we also used for our original sample. Thus, we get 500 bootstrap values of regression weight b1, 500 bootstrap values of regression weight b2, and so on. For each of these regression weights individually, we can now construct a 90% percentile interval, and consider this as a 90% confidence interval around the sample value. In this way, we can obtain the uncertainty margins for each separate regression weight; a minimal sketch is given at the end of this section.

The above procedure can be used for any statistical procedure that yields multiple outcome measures. A prerequisite for the procedure to work well is that the outcome measures are uniquely identified (which is, for instance, not the case in a method like principal components analysis). In the case of methods with non-uniquely identified outcome measures, special procedures have to be invoked in order to use the bootstrap, but this is beyond the scope of the present text.
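As no regression data are given in the text, the following Python sketch uses made-up data purely for illustration; the data, the number of predictors, and the use of ordinary least squares are all assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(17)

# Hypothetical illustration: n = 40 observation units, p = 2 predictors,
# one criterion variable y (so each unit carries p+1 scores).
n, p = 40, 2
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=n)

def regression_weights(X, y):
    """Ordinary least squares weights (intercept omitted for brevity)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Resample whole observation units (rows), keeping each unit's p+1
# scores together, and recompute the weights in each bootstrap sample.
boot_weights = np.array([
    regression_weights(X[idx], y[idx])
    for idx in (rng.integers(0, n, size=n) for _ in range(500))
])

# A 90% percentile interval for each regression weight separately.
sample_weights = regression_weights(X, y)
for j in range(p):
    low, high = np.percentile(boot_weights[:, j], [5, 95])
    print(f"b{j+1}: sample value {sample_weights[j]:.2f}, "
          f"90% percentile interval [{low:.2f}, {high:.2f}]")
```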