1 Assessment of uncertainty margins around population estimates: An alternative introduction to inferential statistics

1.1 Preface
The present chapter gives a very elementary description of resampling techniques, in particular the bootstrap procedure. It is set up as a first introduction to inferential statistics. This may give the reader the feeling that it takes rather long before the text 'comes to the point' (i.e., starts explaining the bootstrap), but one should realize that this text aims in particular at clarifying the rationale of the bootstrap: why is the bootstrap useful for inferential purposes? The text is much less aimed at describing the mechanics of the bootstrap, the philosophy being that, if one really understands the rationale underlying the bootstrap, it will be relatively easy to modify the bootstrap and adjust it to one's own purposes.
1.2 Introduction
When data are collected, very often we have data on a sample of observation units from a (usually much) wider population. To give a simple example, suppose our research goal is to assess whether male students at the University of Groningen drink more beer (measured as their weekly average) than female students. It is very unlikely that the university administration will offer the means to study all of the 20000 students that form the Groningen student population. So what we do in practice is draw a sample of, for example, 50 male students and 50 female students. The data we thus collect will depend on which persons happen to be in our sample. Now, if we found for our samples that, on average, the male students drink 8.98 glasses of beer per week, while the females have an average of 7.14, then obviously this does not mean that these averages are the averages for the populations of male and female students. So, when we use the sample averages as estimates for the population averages, we make an error. To get some idea of the possible size of this error, several approaches are available, and we will treat these in detail here. First, we will study the fictitious situation where we know the beer consumption in the complete population.
1.3 Fictitious situation: Complete population known
Suppose that the complete population of Groningen students consists of 20000 students, 9888 of them male and 10112 female, and that we know for each of these their weekly average beer consumption. Part of these population data is given in Table 1. A graphical representation of the data, separated for males and females, is given in Figure 1. In these populations the average for the males is 9.0 and for the females 7.0.
Table 1. Part of the beer consumption data of the Groningen student population (weekly average beer consumption in glasses).

number  consumption  gender      number  consumption  gender
1        9           male        16       7           female
2        7           male        17      10           male
3        9           male        18       9           male
4        7           female      19       7           female
5        8           male        20       8           male
6       10           male        ...     ...          ...
7        8           female      19992    9           male
8        7           female      19993    9           male
9        7           female      19994    7           female
10       9           male        19995    6           female
11       9           male        19996    7           female
12       8           female      19997    7           female
13       8           male        19998    8           female
14      11           male        19999    7           female
15       7           female      20000    9           male
[Figure 1: two histograms (population of male students; population of female students); horizontal axis: weekly beer consumption (0 to 16 glasses); vertical axis: frequency (0 to 4000).]
Figure 1. Histograms with frequencies of different weekly beer consumption
scores (in glasses per week) for populations of male and female students.
We now draw from this population a random sample of 50 male students and a random sample of 50 female students. The sample data are given in Table 2. We see that the average for the 50 males is 8.98, whereas that for the 50 females is 7.14. These values do not differ much from the averages in the population, but they differ nevertheless.
Table 2. Samples of 50 male and 50 female students.

            male students                        female students
number  glasses   number  glasses        number  glasses   number  glasses
10      10        3926    10             7359    9         14942   8
4799    8         12672   9              2513    8         7341    8
1629    10        5818    10             17985   6         420     8
19907   8         18726   9              3599    8         18516   7
18718   9         11268   10             11014   7         2141    5
19725   10        6017    9              19549   9         2194    7
12571   9         6953    10             9363    8         39      7
1956    9         15820   11             11558   6         358     7
19570   8         2580    8              8825    8         5011    7
2734    9         17046   9              5429    8         7705    6
301     8         11476   9              8582    7         3453    9
19654   9         2880    9              666     6         8696    7
13350   8         1683    10             1709    7         9284    6
5331    9         10198   9              17963   8         6964    7
3071    7         7312    9              8012    8         15040   7
9333    9         8817    7              8923    8         2642    7
10938   10        6631    8              17758   6         11981   6
17333   10        222     9              14569   9         12165   6
7332    10        5631    9              9242    4         2552    8
8180    7         1833    10             16385   9         8258    6
7979    9         18840   9              12770   8         620     6
10571   9         11745   9              16866   8         14645   7
172     7         13230   8              3923    7         9584    6
17145   9         8389    10             16224   6         18145   6
7524    9         1718    9              18503   7         4892    8

mean (males):  8.98                      mean (females):  7.14
If we were to draw different samples of male and female students, we would find different means. To illustrate this, we drew 19 additional samples of 50 male and 50 female students and computed the mean beer consumption in all these samples. The means are given in Table 3.
Table 3. Sample means for 20 different samples of 50 male and 50 female students each.

            mean for males   mean for females
sample 1    8.98             7.14
sample 2    9.08             6.78
sample 3    8.90             6.74
sample 4    9.06             7.02
sample 5    9.10             6.92
sample 6    9.08             7.02
sample 7    9.00             6.96
sample 8    8.80             7.00
sample 9    8.96             7.06
sample 10   8.90             6.90
sample 11   9.00             6.98
sample 12   9.20             7.10
sample 13   9.18             6.88
sample 14   9.20             7.24
sample 15   9.28             6.74
sample 16   9.00             6.74
sample 17   9.00             7.08
sample 18   9.20             6.94
sample 19   9.10             7.16
sample 20   9.00             6.98
mean        9.05             6.97
We see that the mean beer consumption in the samples of 50 male students fluctuates around 9 or, more precisely, ranges from 8.80 to 9.28. For the females, the means fluctuate around 7, ranging from 6.74 to 7.24. This gives a reasonable impression of what you would find if you drew various different samples. Obviously, one could draw many more samples. To give an impression of what we find then, we drew 1000 samples of 50 male and of 50 female students and computed the mean beer consumption for each sample. The means are represented by the histograms in Figure 2. Each bar indicates how often an average was found in the interval represented by the bar.
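To make this experiment concrete, here is a minimal sketch in Python (a language not used in the original text; the numpy library is assumed). The population scores below are illustrative, chosen only so that the population mean is close to 9.0, since the full population data are not reproduced here.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Illustrative population of 9888 male students with mean close to 9.0.
    scores = np.array([7, 8, 9, 10, 11])
    probs = np.array([0.08, 0.16, 0.48, 0.26, 0.02])
    population = rng.choice(scores, size=9888, p=probs)

    # Draw 1000 random samples of 50 students and record each sample mean.
    sample_means = np.array([
        rng.choice(population, size=50, replace=False).mean()
        for _ in range(1000)
    ])

    # The means fluctuate around 9, roughly between 8.6 and 9.4 (cf. Figure 2).
    print(sample_means.min(), sample_means.mean(), sample_means.max())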
[Figure 2: two histograms of sample means; male students: sample means ranging from 8.4 to 9.8; female students: sample means ranging from 6.4 to 7.8; vertical axis: number of means in interval.]

Figure 2. Distributions of 1000 sample means for male and female students.
In Figure 2, we see the distribution of the 1000 sample means. Just as in our example of 20 samples, the sample means for the male students vary around 9, ranging roughly from 8.6 to 9.4, whereas those for the female students vary around 7, roughly between 6.6 and 7.4. Thus, we see that, if you draw 1000 random samples of 50 male students from a population with a mean of 9, in the samples you may find means that deviate a bit from this value (ranging from 8.6 to 9.4). In our original sample (see Section 1.3), we found an average of 8.98, but, if the population were what we assumed it to be just now, we could just as well have found a mean value some tenths higher or lower. Likewise, for the female students, for whom we found a mean of 7.14, we could just as well have found a mean value some tenths higher or lower, if the sample came from our fictitious population.
What is the use of all this? If we really knew the beer consumption of each member of the Groningen student population, we would be able to predict the mean we would find in a sample of 50 male or female students, and we could predict its fluctuation from sample to sample. However, in practice, we do not know the beer consumption of each member of the population. In fact, if we already knew this, there would not have been any reason for carrying out this research in the first place. Hence, the situation we have here is the reverse: by means of our research based on a sample, we want to get an impression of the beer consumption in the complete Groningen student population. Whereas above we saw that, if we know the population, we can predict the mean beer consumption in a sample, we actually want the reverse: knowing the sample mean beer consumption, we want to estimate the mean beer consumption in the population. But how can we possibly do that if we have no idea what the population looks like? A first step is made in the next section.
1.4 More realistic situation: We don't know the population, but we are able to draw several samples from it
As mentioned above, in practice we do not know the mean beer consumption in the population. The question thus is: how can we get some insight into the mean beer consumption in the population on the basis of samples? In our original sample of 50 male and 50 female students, we found that the mean beer consumption for male students was 8.98, and for female students 7.14. Thanks to the analysis in Section 1.3, we realize that these means may very well have been found if in the complete population the mean beer consumptions are 9.0 and 7.0, respectively. Likewise, we can imagine that, if the means in the population were slightly different, for instance 9.1 and 7.1, then the story in Section 1.3 would be similar, and sample means would fluctuate around 9.1 and 7.1, respectively. So our sample means of 8.98 and 7.14 could also very well have been found if the population means were 9.1 and 7.1, respectively. In this way, there are many populations that might reasonably well have led to the sample means that we found. Hence we should not expect to be able to determine exactly what the means in the population are; we ought to be satisfied with a rough estimate of the mean beer consumption in the population, together with the uncertainty margins we should take into account for this estimate. We would, for instance, like to make a statement like "We expect the mean beer consumption in the population of male students to be between 8.6 and 9.5" or, a bit more precisely, "We are 90% sure that the mean beer consumption in the population of male students is between 8.6 and 9.5".¹ How can we get to such statements? A first answer is "by drawing another sample", as follows.
¹ Absolute certainty can never be attained, because one can never exclude that the sample drawn is an extreme sample. Consider the situation where, in a population of 100000 males, there are 50 who drink 9.04 glasses of beer weekly, whereas the other 99950 drink only 1 glass per week. If your sample happens to consist only of these 50 'heavy drinkers', then the sample mean is 9.04, whereas the population mean is 1.004. The chance of drawing such a sample is very low (it happens once in every 3·10^136 times), but it cannot be totally excluded.
We draw a second random sample of male and female students. This is easier said than done in practice, but suppose we are able to do it. Then we can compare the means from both samples. Table 4 gives an example of what you could find.
Table 4. Mean beer consumption of male and female students in two samples (in both samples 50 male and 50 female students).

                 mean beer consumption (in glasses of beer)
                 male students    female students
first sample     8.98             7.14
second sample    9.08             6.94
Here we see that the means in the two samples differ (not surprisingly; actually it would be strange if they were exactly equal), but do not differ much. This gives an indication that, for the sample mean, it does not make a lot of difference which sample you draw. Indeed, when in two samples you find almost equal sample means, you may expect to find similar means in other samples as well. For instance, you could estimate that such sample means will not differ by more than 0.1 or 0.2 from what you found in the original sample. Of course, you have to be careful here, because it is possible that you found the almost equal sample means purely accidentally. However, you might use this as a first indication of the sampling fluctuation.

A totally different situation is the following. Suppose the sample means in your second samples are 8.20 and 8.50, respectively; see Table 5.
Table 5. Mean beer consumption of male and female students in two samples (in both samples 50 male and 50 female students).

                 mean beer consumption (in glasses of beer)
                 male students    female students
first sample     8.98             7.14
second sample    8.20             8.50
If we found the means in Table 5 in a second sample, we would certainly not conclude that it hardly matters which sample you draw. Indeed, the two samples give very different means, and even the order of the means for the males and females differs between the two samples. Moreover, we would now expect that, if we drew yet another random sample, the means would again be rather different from what we found before. Of course, we do not know this, but the comparison of two samples gives an indication that this may very well be the case. In short, if you want to give a rough indication of the sampling fluctuation, you could state that sample means may easily differ by one or even two glasses.

The above estimates of sampling fluctuation give just a rough indication. They are based on only one extra sample. It would be much better to draw many new samples, because this would give a complete picture of the possible sampling fluctuation. However, in practice this is impossible: even drawing one extra sample is rarely possible, because of financial or time limitations. In
practice, we usually must rely on only one sample. Even then, we can get insight
into sampling fluctuations, which is the topic of the next section.
1.5 Practice: We have only one sample
We are now going to describe how we can get an indication of sampling fluctuation in cases where we have only one sample. Moreover, we will use this to give an indication of the uncertainty around our estimates of the population means. In fact, in this way we want to achieve something that is in principle impossible: we want to assess how much our sample means deviate from the population means, whereas we do not know the population means. In Section 1.4 we also did not know the population means, but there at least we had information on a second sample from the same population. There our reasoning was as follows: if we do not know anything about the population, but we do know how much different samples from the population may vary, we can use this to get a rough indication of how much sample means can fluctuate in general. If we cannot draw a new sample, it is harder to say something about sampling fluctuation, but not impossible.
1.5.1 First approach: the jackknife
A first approach to getting insight into what could happen if you drew a different sample is simply to compute the mean that you get after leaving out one of the 50 male students from your sample. Specifically, you can compute the sample mean upon leaving out one individual from your sample, and then repeat this, leaving out each individual once. This works as follows. Suppose you leave out the data for one male student, and you find for the remaining 49 a mean of 8.6 (instead of the 8.98 you found in the sample of 50). Then you would immediately see that it is rather crucial whether or not this individual is present in the sample, because apparently the presence or absence of this single individual makes a difference of about 0.4 to the mean. In other words, you find that even a sample that very closely resembles the original sample has quite a different mean, so you can immediately conclude that there is quite a bit of sampling fluctuation in your mean.
Now suppose that, upon leaving out one male student, you find for the remaining 49 a mean of 9.00; then you might conclude that leaving out this single individual does not make a lot of difference (compared to the mean of 8.98 for the sample of 50). This is a first indication (and not more than that) that it does not make a lot of difference which sample you use. By repeating this approach many times, each time leaving out one individual from your sample, you can get reasonable insight into the sampling fluctuation of the mean for samples drawn from your population. This procedure is called the 'leave-one-out' or 'jackknife' procedure (e.g., see Efron & Tibshirani, 1993). Rather than repeatedly, randomly, leaving out one individual, in practice one often leaves out each individual exactly once, and compares all the different means thus obtained to get insight into the sampling fluctuation of the mean, for instance by computing the standard deviation of all these jackknife means. This is a first, and relatively simple, procedure for getting insight into the sampling fluctuation. In principle it can also be used for getting insight into the uncertainty margins one has to maintain around one's estimates of the population means, but rather than describing this here, we move on to the second procedure for assessing sampling fluctuation, which is better suited for finding uncertainty estimates around population means and other population parameters. This is the so-called bootstrap.
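As an illustration, a minimal sketch of the jackknife in Python (with numpy; the 50 scores are those of the male sample in Table 7):

    import numpy as np

    # The 50 beer consumption scores of the male sample (see Table 7).
    scores = np.array([7]*4 + [8]*8 + [9]*24 + [10]*13 + [11]*1)

    # Leave each individual out once and compute the mean of the remaining 49.
    n = len(scores)
    jackknife_means = np.array([np.delete(scores, i).mean() for i in range(n)])

    # The spread of these leave-one-out means indicates the sampling
    # fluctuation of the mean.
    print(jackknife_means.min(), jackknife_means.max())
    print(jackknife_means.std())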
1.5.2 The bootstrap
Before we explain the bootstrap, we repeat what is actually the goal of our
analysis, which is described schematically in Figure 3. The main goal of our
analysis is to estimate the population means (of beer consumption in the whole
population of male students and of female students). We now focus on the
population of male students. The ideal way to assess the mean beer consumption
in the population of male students is simply to study the whole population of all
male students and compute the mean beer consumption there. Unfortunately, in practice this is unfeasible (indicated by the light gray characters in Figure 3: there is a population mean, but we don't know it). To give an estimate of the population mean nonetheless, we draw a sample from that population and compute the mean in the sample of male students. We use this sample mean as an estimate of the population mean. However, in practice this estimate will never be perfectly equal to the population mean. We could simply hope that it will at least be a reasonable estimate, but this is not a very rational approach, since in this way we do not have any idea of the reliability of our estimate. As soon as we want to use such an estimate (e.g., for theory building or decision making), we would also like to know to what extent we can take our result seriously. We would like to have an idea of how much our estimate could, reasonably, deviate from the population mean.

The best way to study how much sample means usually deviate from the population mean is to draw many samples and compute the sample mean of each. In Figure 3 we see four such sample means: the actual sample mean and three others. From the complete collection of sample means we would find a broad range of possible sample mean values (e.g., see Table 3), which would give a good indication of how much our estimate (i.e., our sample mean) could, reasonably, deviate from the population mean. In practice, however, we cannot use this: in practice, we draw only one sample. (Therefore, in Figure 3 all other samples are in light gray.) So we are now in the situation where we have only one sample, and yet we want to estimate both the population mean and the uncertainty margins of our estimate of the population mean. For this purpose we use the bootstrap.
[Figure 3: a diagram of the population (mean = 9.0) with arrows pointing to the actual sample (mean = 8.98) and to other possible samples (means 9.08, 9.10, 8.96, ...).]
Figure 3. Schematic overview of results of drawing samples. Everything set in
light gray is unknown in practice.
With the bootstrap we reason as follows. We try to imagine what would happen if we had drawn a different random sample from the same population. Then we could reason as follows:

    Suppose that, instead of each person in my sample, I had an
    arbitrary other person from the population. What would then be the
    sample mean?

The next question is: how can we replace each person by an arbitrary other person from the population? In the bootstrap this is done as follows. Our sample consists of 50 persons from the population of male students for whom we know the beer consumption scores. The scores are (ordered):

    7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
    9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10,
    10, 10, 10, 10, 11
We now reason that 'an arbitrary other person from this population' must be someone with an arbitrary score that fits in with our population. Because we (only) know that our sample is a sample from that population, it is reasonable to assume that 'an arbitrary person' from this population would have one of the scores that we actually encountered in the sample. Obviously, 'an arbitrary other' could have a different score, but this is less likely than that this 'arbitrary other' has a score that also occurred in the sample. This is because we know for sure that the scores 7, 8, 9, 10 and 11 occur in the population, and some of these (like 8, 9 and 10) even relatively frequently, as suggested by what we found in our sample.

The crucial step in the bootstrap is that we assume that 'an arbitrary other person' from the population will have one of the scores that also occurred in the sample, and that the probability of this person having a particular score is directly related to the frequency of this score in the sample. More precisely, in our example, we assume that 'an arbitrary other' has a chance of 4/50 to have a score of 7, a chance of 8/50 to have a score of 8, a chance of 24/50 to have a score of 9, a chance of 13/50 to have a score of 10, and a chance of 1/50 to have a score of 11. Thus, we have actually defined a probability distribution, also given in Table 6. In the bootstrap we thus assume that the distribution in the sample is identical to that in the population. In reality, this will obviously not be exactly true. However, when we try to guess how the scores in the population could be distributed, when we only know the scores in the sample, then it is not unreasonable to assume that the scores in the population are distributed more or less as in the sample. In any case, of all ideas we might have about the distribution of scores in the population, it is the 'most likely'² idea, because it is based on the only information we have, that is, the scores in the sample.
Table 6. Hypothetical probability distribution of scores for arbitrary others from the population.

Score        7       8       9       10      11
Probability  0.08    0.16    0.48    0.26    0.02
² Efron & Tibshirani (1993) therefore call this a nonparametric maximum likelihood estimate of the population distribution.
Next, we can replace each person in the sample by an 'arbitrary other' whose score is found by randomly drawing a score from the probability distribution in Table 6. This is exactly what happens in the bootstrap. As we replace each of the 50 persons by an 'arbitrary other', we could also say that we simply draw a new sample of 50 persons with scores drawn randomly from the probability distribution in Table 6. This new sample is called a 'bootstrap sample', and we will treat it as a real new sample. So we can compute its mean, and inspect how much it deviates from the mean of our original sample. Obviously, we repeat this procedure many times to get many such bootstrap samples, and compute the mean of each. This collection of means then shows us 'everything' that could happen to the mean if we had drawn a somewhat different sample.

The above procedure is the bootstrap. This term is chosen to indicate, via an analogy, that we do not use any information other than that available in our original sample, and that we do not invoke any outside help. The term refers to the bootstraps used by the legendary Baron Munchausen, who was able to lift himself out of the moors (and avoid drowning) by pulling on his own bootstraps (see Efron & Tibshirani, 1993). The analogy is that the baron, like the researcher, actually needs outside help, which, however, is not available. The baron, like the researcher, finds an emergency solution in using the most usable things that are available: his own bootstraps and the sample data, respectively. Against all physical laws, the baron actually succeeds in salvaging himself, and likewise, seemingly against all logic, the researcher is able to use only his own sample data to estimate the sampling fluctuations and (as we will see) the uncertainty in his population estimate.
A slightly more concrete image of the above idea is to consider that you have a big population, for instance one with 10000 persons with scores 7 through 11, in such a way that 800 have score 7, 1600 score 8, 4800 score 9, 2600 score 10, and 200 score 11. If you randomly draw 50 persons from this population, then for each draw the probability of a particular score is given by the probability distribution in Table 6. So, in fact, in the bootstrap, we draw random samples from a hypothetical population with a distribution of scores as in Table 6. Thus, the bootstrap is similar to what we did in Section 1.3, namely drawing samples from a population. However, in Section 1.3 we drew from the real population, whereas in the bootstrap we use a substitute for the population ('plug-in' according to Efron & Tibshirani, 1993). This plug-in is based on the distribution of scores in the sample, which we use as an estimate of the population distribution.
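This equivalence can be made concrete in a few lines of Python (with numpy; illustrative code, not part of the original text):

    import numpy as np

    rng = np.random.default_rng()
    scores = np.array([7, 8, 9, 10, 11])
    probs = np.array([0.08, 0.16, 0.48, 0.26, 0.02])

    # Drawing 50 scores directly from the Table 6 distribution ...
    sample_a = rng.choice(scores, size=50, p=probs)

    # ... is, for all practical purposes, the same as drawing 50 persons
    # from a hypothetical population of 10000 persons in which the scores
    # occur with these relative frequencies.
    population = np.repeat(scores, [800, 1600, 4800, 2600, 200])
    sample_b = rng.choice(population, size=50)

    print(sample_a.mean(), sample_b.mean())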
Two important questions have remained unanswered. These will be treated
in the next sections.
1.5.3 How do we get a bootstrap sample?
The first question, how to get a bootstrap sample, is easy to answer. We get a bootstrap sample by drawing a sample of the same size n as our original sample from that original sample, drawing with replacement. (If we were to draw without replacement, we would always get back exactly the original sample.) In this way we mimic the process of sampling from a large population with the same distribution of scores as in the original sample, but we avoid actually setting up such a large population.
To draw a random sample with replacement from our own sample, we use a computer program with a random number generator (e.g., SPSS, Excel, Matlab). We assign sequence numbers 1 through 50 to the scores in our sample, randomly generate 50 integers between 1 and 50 (with replacement), and consider these as our new sample of 50 sequence numbers. Then, we construct our bootstrap sample as the set of scores belonging to these 50 sequence numbers. In Table 7 you find, next to the original sample, the scores in two bootstrap samples (all ordered by sequence number). We clearly see that some individuals appear twice in the bootstrap samples and others not at all, which is a consequence of drawing with replacement. This is no problem because, in fact, we are no longer interested in the actual individuals, but only in the scores and the distribution of scores in the hypothetical population.
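In a language with a random number generator this takes only a few lines; a sketch in Python (with numpy; the scores are those of the male sample in Table 7):

    import numpy as np

    rng = np.random.default_rng()

    # Original sample of 50 scores (see Table 7).
    sample = np.array([7]*4 + [8]*8 + [9]*24 + [10]*13 + [11]*1)

    # Generate 50 random sequence numbers with replacement (0..49 here,
    # since Python indexes from 0) and take the corresponding scores.
    sequence_numbers = rng.integers(low=0, high=50, size=50)
    bootstrap_sample = sample[sequence_numbers]

    print(bootstrap_sample.mean())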
The bottom line of Table 7 contains the means of the bootstrap samples. The mean of the first bootstrap sample is 9.02; that of the second is 8.98. Thus, we begin to see that if you draw different samples from the hypothetical population, these have means that lie rather close to the original sample mean. With this, we begin to get an answer to our second question, as described in the next section.
Table 7. Beer consumption scores in the original sample of 50 male students and in two bootstrap samples (sorted by sequence number).

       original sample              bootstrap sample 1            bootstrap sample 2
no.    score   no.    score     no.    score   no.    score     no.    score   no.    score
10     10      8180   7         10     10      8180   7         172    7       10571  9
172    7       8389   10        172    7       8180   7         301    8       10938  10
222    9       8817   7         1629   10      8389   10        1683   10      10938  10
301    8       9333   9         1629   10      8817   7         1683   10      11268  10
1629   10      10198  9         1683   10      8817   7         1833   10      11268  10
1683   10      10571  9         1683   10      9333   9         2580   8       11268  10
1718   9       10938  10        2734   9       9333   9         2580   8       11268  10
1833   10      11268  10        2734   9       10198  9         2580   8       11268  10
1956   9       11476  9         2880   9       10571  9         2734   9       11745  9
2580   8       11745  9         2880   9       10938  10        2880   9       12571  9
2734   9       12571  9         3071   7       11268  10        3071   7       12672  9
2880   9       12672  9         3926   10      11268  10        3926   10      12672  9
3071   7       13230  8         4799   8       11745  9         3926   10      12672  9
3926   10      13350  8         4799   8       12571  9         3926   10      13350  8
4799   8       15820  11        4799   8       13230  8         5631   9       15820  11
5331   9       17046  9         5331   9       17046  9         5631   9       17145  9
5631   9       17145  9         5331   9       17046  9         5818   10      18718  9
5818   10      17333  10        5631   9       17046  9         6631   8       18726  9
6017   9       18718  9         5818   10      17046  9         7312   9       18840  9
6631   8       18726  9         5818   10      17333  10        8389   10      19570  8
6953   10      18840  9         6953   10      18718  9         8389   10      19570  8
7312   9       19570  8         6953   10      18840  9         8817   7       19654  9
7332   10      19654  9         7332   10      18840  9         8817   7       19725  10
7524   9       19725  10        7524   9       19654  9         8817   7       19907  8
7979   9       19907  8         7979   9       19725  10        9333   9       19907  8

Mean           8.98             Mean           9.02             Mean           8.98
1.5.4 How do we use the information from several bootstrap samples to assess the uncertainty margins around our sample mean when used as an estimate of the population mean?
To get a well-founded answer to this second question, we draw a large number of bootstrap samples (here, 100) and compute the mean of each. We then get a distribution of bootstrap means, and we consider this as an indication of the distribution of sample means that we would get if we drew many samples from the real population. This is done as follows. In Table 8 we have the means of 100 bootstrap samples, and the distribution of these bootstrap sample means is represented graphically in the histogram in Figure 4.
Table 8. Means of 100 bootstrap samples.

no.  mean   no.  mean   no.  mean   no.  mean   no.  mean
1    8.84   21   9.04   41   9.04   61   8.88   81   9.06
2    8.86   22   8.84   42   8.94   62   8.80   82   9.10
3    9.08   23   8.82   43   9.00   63   9.00   83   8.76
4    9.02   24   8.98   44   8.96   64   8.88   84   8.96
5    8.86   25   9.04   45   8.86   65   8.92   85   9.22
6    8.56   26   8.98   46   9.06   66   8.82   86   9.34
7    8.88   27   9.12   47   8.96   67   8.88   87   8.94
8    8.88   28   9.04   48   9.02   68   9.18   88   8.96
9    9.10   29   9.16   49   8.92   69   9.00   89   8.96
10   9.00   30   9.10   50   8.96   70   9.12   90   8.96
11   9.10   31   9.08   51   8.98   71   8.96   91   8.94
12   8.94   32   9.04   52   9.02   72   8.98   92   9.08
13   8.86   33   8.98   53   8.86   73   9.16   93   9.04
14   8.88   34   9.06   54   9.06   74   9.06   94   8.84
15   8.96   35   8.94   55   9.14   75   8.76   95   9.20
16   9.06   36   9.04   56   8.86   76   8.86   96   9.06
17   8.98   37   9.10   57   8.92   77   8.82   97   8.88
18   9.12   38   9.04   58   8.84   78   9.02   98   8.72
19   9.00   39   9.00   59   8.94   79   9.04   99   9.26
20   9.00   40   8.88   60   9.06   80   8.88   100  8.96
[Histogram: bootstrap sample means on the horizontal axis (8.4 to 9.4); number of means per interval on the vertical axis (0 to 40).]
Figure 4. Distribution of bootstrap sample means.
We now see that the bootstrap means vary between 8.56 and 9.34, and that the average of all bootstrap means is almost equal to the mean of the original sample. If we disregard the 10% most extreme values (5% on each side), we find that 90% of the means lie between 8.8 and 9.2. Such an interval is called a '90% percentile interval', as it indicates which bootstrap mean values belong to the middle 90 percentiles. We could also express the variation in the bootstrap means by their standard deviation, which in this case is 0.12.
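The whole procedure of this section fits in a few lines of Python (with numpy; 1000 bootstrap samples are drawn here, where the text uses 100):

    import numpy as np

    rng = np.random.default_rng()

    # Original sample of 50 scores (see Table 7), with mean 8.98.
    sample = np.array([7]*4 + [8]*8 + [9]*24 + [10]*13 + [11]*1)

    # Draw many bootstrap samples and compute the mean of each.
    B = 1000
    boot_means = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(B)
    ])

    # 90% percentile interval: the 5th and 95th percentiles of the
    # bootstrap means.
    low, high = np.percentile(boot_means, [5, 95])
    print(low, high)          # roughly 8.8 and 9.2
    print(boot_means.std())   # roughly 0.12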
How does all this help us in estimating our uncertainty margins when we use our sample mean as an estimate of the population mean? We have now found that, if the population had the same distribution as the sample, then, when sampling randomly from this population, 90% of the sample means would lie between 8.8 and 9.2, and their spread, as measured by the standard deviation, would be 0.12. The 90% percentile interval hence is a '90% prediction interval': it gives the interval in which we expect (with 90% certainty) to find the means of samples drawn from a population with the same score distribution as our sample. This indicates what could happen if we drew a new sample from a population with a distribution similar to that of our sample.
Yet, this does not answer our actual question. We actually wanted to find an estimate of the mean in the population and to provide this estimate with uncertainty margins. We can now do so by the following reasoning. We assume that, whatever the population mean may be, the type and degree of variation of means of samples from this population is the same as that of our bootstrap samples. If this assumption holds reasonably well, then the information we now have on the bootstrap means can be used to estimate the amount of variation in actual sample means that you would get when sampling from the actual population. This can be done as follows.
We have found that 90% of the bootstrap means lie between 8.8 and 9.2, hence in an interval of width 0.40. If sample means from the real population vary over the same range, we can conclude that, whatever the population mean is, approximately 90% of the sample means will differ at most (approximately) 0.2 from the population mean. Conversely, then, in 90% of the cases where you draw a sample, the population mean will not deviate more than (approximately) 0.2 from the sample mean. In other words, we can thus be 90% sure that the population mean lies between 8.98 - 0.2 = 8.78 and 8.98 + 0.2 = 9.18. Such an interval, which indicates the location of the population value with 90% certainty, is called a '90% confidence interval'. Thus, we have (at last) an answer to the question of how to get an estimate of the uncertainty margins around a sample mean when this is used as an estimate of the population mean. We can say that our estimate of the population mean is 8.98 and that we are 90% sure that the actual population mean lies within a(n uncertainty) margin of 0.2 from this value.
Above, we have thus, on the basis of our bootstrap means, found a 90% confidence interval running from 8.78 to 9.18. This interval is very similar to the earlier 90% percentile interval running from 8.8 to 9.2. That is no coincidence. If the distribution of the bootstrap means is roughly symmetric³, and the mean of the bootstrap means is close to the sample mean, then the 90% percentile interval will be roughly equal to the 90% confidence interval. The percentile interval is easier to determine, because it simply runs from the 5th percentile to the 95th percentile.

³ Or if a monotone transformation of the measure at hand exists for which the distribution is roughly symmetric.
In many practical cases, the percentile interval suffices to give a reasonable indication of the uncertainty margins around a sample mean (when used as an estimate of the population mean). More advanced bootstrap procedures have been developed which give more accurate confidence intervals (see Efron & Tibshirani, 1993), but here we limit ourselves to the simplest case.
In the same way as for the male students, we can construct a 90% percentile interval for the mean beer consumption of the female students. In our sample, we found a mean of 7.14. We drew 100 bootstrap samples, computed the mean scores in all these bootstrap samples, ordered them, and determined the 90% percentile interval from the 5th and 95th percentiles of these values. These were 6.8 and 7.4, respectively. Thus, we can state that for the male students our 90% certainty estimate is that the mean beer consumption lies between 8.8 and 9.2, whereas that for the female students lies between 6.8 and 7.4. It follows that it is very likely that the average beer consumption in the population of male students is higher than in the population of female students.
1.6 Uncertainty margins for other measures
Above, we determined uncertainty margins for sample means, for situations where the sample mean is used as an estimate of the population mean. Besides means, other measures are often used in practice for summarizing data. Examples are the median and the correlation. For such measures, too, one will use sample data to estimate the population values, and again one can use the bootstrap to determine uncertainty margins around such population estimates. In fact, one can always use the bootstrap to get insight into the uncertainty margins around measures computed for a sample, when these are used as estimates of the population values of these measures. Thus, the bootstrap is a very powerful procedure, usable for assessing uncertainty margins for whichever measure you want to use. It should be noted, though, that the bootstrap cannot perform miracles. If an estimate of a population value based on sample data cannot logically be made, then the bootstrap cannot give a sensible measure of uncertainty around such a nonsense estimate. An example is using the maximum score in a sample as an estimate of the maximum score in a population (see Efron & Tibshirani, 1993, p. 81ff).

In some situations, like estimating means, there are easier ways to construct confidence intervals. For other measures, like the median, such alternatives are unavailable or of dubious quality, and the bootstrap is recommendable. In some situations, simple percentile intervals are not good enough and advanced bootstrap procedures are called for, but even in such cases simple percentile intervals give at least some insight into the uncertainty margins in situations where other methods are unavailable.
As an illustration, we now first describe the construction of bootstrap
percentile intervals for the median obtained in a particular sample. Next, as a
second example, we describe the construction of bootstrap percentile intervals for
the correlation between two variables.
1.6.1 Bootstrapping the median
We consider a study in which a reaction time was measured for 40 subjects, 20 in a control condition and 20 in an experimental condition in which the subjects received a dose of a sleep-inducing drug. The scores are given in Table 9. Clearly, in the sleep-inducing drug condition there are a few extremely high reaction times. These strongly affect the mean of the scores (see Table 9). Therefore, the median is here considered a better measure to summarize the 'average' reaction time. The medians in both conditions are also given in Table 9. We would now like to know how well the sample medians can be used as estimates of the population medians. In other words, we would like to know the uncertainty margins around the sample medians when these are used as estimates of the population medians.
Table 9. Reaction times (in ms) in two conditions.

control condition         sleep-inducing drug condition
179     114               208     171
196     101               892     188
152     189               202     228
188     120               183     218
117     130               193     196
198     166               173     207
127     128               208     229
125     147               226     1275
188     106               203     210
174     199               214     155

Mean    152.2             Mean    288.95
Median  149.5             Median  207.5
To get an answer to the above question, we drew 500 bootstrap samples
from both our samples, and computed the median in each bootstrap sample. The
frequencies of these medians are represented graphically in the histograms in
Figure 5.
Figure 5. Histograms for 500 bootstrap medians for control and sleep-inducing
drug conditions.
Note: The titles of the histograms can be translated as:“Bootstrapmedianen voor controle
conditie” = “Bootstrap medians for control condition”; “Bootstrapmedianen voor slaapmiddel
conditie” = “Bootstrap medians for sleep-inducing drug condition”.
For the medians in the control condition, the 90% percentile interval runs from
127 to 181. In the sleep-inducing drug condition the 90% percentile interval for
the median runs from 196 to 212.5. We thus see that it is likely that the median
reaction time for the sleep-inducing drug population is higher than that for the
control population.
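A sketch of this computation in Python (with numpy; the reaction times are those of Table 9):

    import numpy as np

    rng = np.random.default_rng()

    # Reaction times (in ms) from Table 9.
    control = np.array([179, 196, 152, 188, 117, 198, 127, 125, 188, 174,
                        114, 101, 189, 120, 130, 166, 128, 147, 106, 199])
    drug = np.array([208, 892, 202, 183, 193, 173, 208, 226, 203, 214,
                     171, 188, 228, 218, 196, 207, 229, 1275, 210, 155])

    def median_interval(x, B=500):
        # 90% percentile interval for the median, from B bootstrap samples.
        medians = np.array([
            np.median(rng.choice(x, size=x.size, replace=True))
            for _ in range(B)
        ])
        return np.percentile(medians, [5, 95])

    print(median_interval(control))  # roughly 127 to 181
    print(median_interval(drug))     # roughly 196 to 212.5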
1.6.2 Bootstrapping the correlation: An example of bootstrapping multivariate data
In a small study with 19 students, we recorded their age and the number of sexual partners they had had in the past year. The data are given in Table 10 and represented by a scatterplot in Figure 6. We also computed the (linear) correlation coefficient between these variables, which was -.57.
Table 10. Data on age and number of sexual partners in past year.

subj   age    no. of sexual partners
1      18.2   5
2      19.1   3
3      24.5   2
4      18.7   9
5      18.3   4
6      20.9   3
7      22.0   3
8      24.2   5
9      21.1   3
10     25.4   1
11     19.6   4
12     18.2   3
13     24.1   1
14     19.8   4
15     23.9   2
16     22.7   2
17     19.8   4
18     20.2   4
19     21.5   5
[Figure 6: scatterplot with age (16 to 30) on the horizontal axis and number of partners in the past year (0 to 10) on the vertical axis; r = -.57.]

Figure 6. Scatterplot of the data in Table 10.
It seems to hold, roughly, that the older the students, the fewer sexual partners they had in the past year. The obtained correlation is quite strong, but it should be noted that it is based on only a small sample. Therefore, we may have serious doubts about the accuracy of this correlation as an estimate of the correlation between age and number of sexual partners in the population of all students. To get insight into this, we can again use the bootstrap, but the situation is slightly more involved than in the univariate cases (i.e., with only one variable) described above.
To use the bootstrap in the present situation, where we deal with scores on two variables (hence we have multivariate data), a bootstrap sample is obtained by drawing, with replacement, a number of pairs of scores. Specifically, we can use the subject numbers as sequence numbers, randomly generate (with replacement) a sample of these sequence numbers, and then use the data associated with these sequence numbers as our bootstrap sample. Two examples are given in Table 11. It should be emphasized that, in producing bootstrap samples, pairs of scores are never separated. Hence, in the bootstrap samples we may encounter the score pair (24.2, 5) but not, for instance, (24.2, 4), because the latter pair did not occur in the original sample.
Table 11. Two examples of bootstrap samples.

bootstrap sample 1            bootstrap sample 2
age    no. of partners        age    no. of partners
21.1   3                      19.8   4
18.3   4                      19.1   3
18.2   5                      19.8   4
21.1   3                      22.7   2
24.2   5                      18.3   4
20.9   3                      21.5   5
22.7   2                      18.2   5
21.5   5                      24.2   5
24.1   1                      21.1   3
25.4   1                      21.5   5
19.6   4                      24.5   2
18.2   3                      18.2   3
22.7   2                      24.1   1
18.3   4                      18.7   9
23.9   2                      18.7   9
24.5   2                      19.8   4
24.5   2                      20.9   3
21.5   5                      20.2   4
19.8   4                      23.9   2

r = -.60                      r = -.51
In this way, 5000 bootstrap samples were drawn from the original sample, and in each of them the correlation was computed. The values of these 5000 correlations are summarized in the histogram in Figure 7. We see that the correlation varies quite a bit over these bootstrap samples, and in a number of bootstrap samples we even encounter positive correlations. These, however, are among the extreme findings. When we consider only the middle 90% of the values, hence determine the 90% percentile interval, we find that it runs from -.83 to -.29. We thus conclude that the accuracy of the sample correlation (of -.57) as an estimate of the population correlation is not very high; indeed, in the population the correlation could 'very well' be a different value between -.83 and -.29. On the other hand, we are reasonably confident that the correlation in the population is at least mildly negative, hence that younger students tend to have had somewhat more sexual partners in the past year than older students.
Figure 7. Histogram of correlations between age and number of sexual partners
in 5000 bootstrap samples.
In the histogram of bootstrap correlations we can discern a special feature: there are notably more values lower than -.57 than there are higher than -.57. Moreover, values higher than -.57 are more spread out; in other words, the distribution of the bootstrap correlations is skewed to the right. This is no coincidence. If we drew samples from a population in which the correlation is -.57, we would also obtain a picture like this. In this way, the distribution of the bootstrap correlations mirrors that of the correlations in samples drawn from a population. However, we do not know whether in reality the population correlation is (close to) -.57; in fact, given the large spread, this correlation might just as well be .2 or .3 lower or higher.
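A sketch of the paired resampling in Python (with numpy; the data are those of Table 10):

    import numpy as np

    rng = np.random.default_rng()

    # Age and number of partners for the 19 students in Table 10.
    age = np.array([18.2, 19.1, 24.5, 18.7, 18.3, 20.9, 22.0, 24.2, 21.1, 25.4,
                    19.6, 18.2, 24.1, 19.8, 23.9, 22.7, 19.8, 20.2, 21.5])
    partners = np.array([5, 3, 2, 9, 4, 3, 3, 5, 3, 1, 4, 3, 1, 4, 2, 2, 4, 4, 5])

    B = 5000
    boot_r = np.empty(B)
    for b in range(B):
        # Resample whole subjects, so that (age, partners) pairs stay together.
        idx = rng.integers(0, age.size, size=age.size)
        boot_r[b] = np.corrcoef(age[idx], partners[idx])[0, 1]

    print(np.corrcoef(age, partners)[0, 1])  # about -.57
    print(np.percentile(boot_r, [5, 95]))    # roughly -.83 to -.29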
1.6.3 Other possibilities
In Sections 1.5 and 1.6.1, means and medians, respectively, were compared across populations. In both cases, we gave uncertainty margins for each population estimate separately. Instead, in both cases we could have given uncertainty margins for the difference across populations. Specifically, we could have computed, for instance, the difference between the medians in the two original samples, and next repeatedly drawn one bootstrap sample from the one original sample and one from the other, each time computing the difference between the two bootstrap medians. Thus, we would obtain a large number of bootstrap differences in medians, and we could then determine a 90% percentile interval from these bootstrap differences, which in turn can be used to set up a 90% confidence interval around the observed difference in medians.
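A sketch of this difference-of-medians bootstrap in Python (with numpy; the reaction times are those of Table 9):

    import numpy as np

    rng = np.random.default_rng()
    control = np.array([179, 196, 152, 188, 117, 198, 127, 125, 188, 174,
                        114, 101, 189, 120, 130, 166, 128, 147, 106, 199])
    drug = np.array([208, 892, 202, 183, 193, 173, 208, 226, 203, 214,
                     171, 188, 228, 218, 196, 207, 229, 1275, 210, 155])

    B = 500
    diffs = np.empty(B)
    for b in range(B):
        # One bootstrap sample from each original sample, drawn independently.
        med_c = np.median(rng.choice(control, size=control.size, replace=True))
        med_d = np.median(rng.choice(drug, size=drug.size, replace=True))
        diffs[b] = med_d - med_c

    print(np.median(drug) - np.median(control))  # observed difference: 58.0
    print(np.percentile(diffs, [5, 95]))         # 90% percentile interval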
1.7 Relation to classical statistics
In Section 1.5, we described the basic principle of the bootstrap as the following reasoning. We wish to guess what would happen to our summary measure (e.g., the mean, a correlation, etc.) if we had an arbitrary other sample than the one we actually have. For this purpose, we imagine what would happen if we replaced each given observation unit by an arbitrary other unit from the same population. The question then is how we can actually carry out this replacement. In classical statistics the answer to this question is given in a rather different way. In classical statistics, it is usually assumed that the distribution of the sample measure at hand (e.g., the mean, the median, etc.) has a particular shape that is easily characterized, usually by means of only a few parameters, the most common example being the normal distribution. This assumption only pinpoints the shape of the distribution; to fully specify the distribution we would need (estimates of) the population mean and the population standard deviation, but to obtain confidence intervals, we only need the standard deviation. In classical statistics, this standard deviation is usually estimated on the basis of the data. Thus, in this respect, classical statistics is also a kind of bootstrap procedure, in that information in the actual sample is used to get an estimate of the uncertainty of our population estimate. The difference with the bootstrap described above is that in classical statistics the shape of the distribution of sample measures is specified by assuming that it is given by a simple mathematical function, whereas in the bootstrap no such assumption is made. Instead, in the bootstrap, the distribution of sample measures is obtained by repeatedly drawing samples from a hypothetical population, which itself is our best guess of what the actual population looks like.
The classical statistics assumption that sample means are normally
distributed is a reasonable assumption when in the population the scores
themselves are (nearly) normally distributed, or when the sample size is quite
large. In fact, even when the scores in the population are clearly not normally
distributed, but sample sizes exceed, say, 30, often sample means are nearly
normally distributed. However, in case of small samples from populations that are
clearly not normally distributed, confidence intervals obtained by classical
procedures can be grossly incorrect, and bootstrap procedures will be more
reliable. It should be emphasized that this does not mean that bootstrap procedures
work perfectly in case of small samples and strongly nonnormal distributions.
Indeed, in such cases, to get good confidence intervals, it becomes relatively
important to use the more advanced bootstrap procedures, and even these will not
always give exactly proper confidence intervals. However, in such situations
bootstrap confidence intervals can be expected to at least work better than
classical statistical confidence intervals, because the classical ones rely on
assumptions that, in such cases, are clearly violated, whereas the bootstrap
intervals do not rely on such assumptions.
It may be noteworthy that, when a sufficient number of bootstrap samples is drawn, the standard deviation of the bootstrap sample means will be virtually the same as the standard deviation of the sample mean as estimated in classical statistics; however, the classical confidence interval will differ from the percentile interval, because the intervals are based on the shape of the distribution (and not only on the standard deviation).
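This near-equality is easy to verify in Python (with numpy; the scores are those of Table 7):

    import numpy as np

    rng = np.random.default_rng()
    sample = np.array([7]*4 + [8]*8 + [9]*24 + [10]*13 + [11]*1)
    n = sample.size

    # Classical estimate of the standard deviation of the sample mean.
    classical_se = sample.std(ddof=1) / np.sqrt(n)

    # Bootstrap estimate: the standard deviation of many bootstrap means.
    boot_means = np.array([
        rng.choice(sample, size=n, replace=True).mean() for _ in range(10000)
    ])
    print(classical_se, boot_means.std())  # nearly equal

    # The intervals can still differ: the classical 90% interval imposes
    # the normal shape, whereas the percentile interval uses the shape of
    # the bootstrap distribution itself.
    print(sample.mean() - 1.645 * classical_se,
          sample.mean() + 1.645 * classical_se)
    print(np.percentile(boot_means, [5, 95]))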
In practice, the classical approach, when available, is often preferred because it is easiest to apply: instead of drawing a large number of bootstrap samples and computing the measure of interest in each of them, it suffices to compute some sample statistics and insert these into a specific formula for a confidence interval around a population estimate. This approach works fine for a number of well-known measures, provided that the required assumptions are not violated too heavily. However, the assumptions are often violated to quite a large extent. Moreover, for new measures, classical procedures are not yet available, and using them would first require the mathematical-statistical derivation of such formulas, which may be cumbersome and/or require untenable assumptions. In such situations of heavy violation of assumptions, or of unavailable 'classical' results, one can always use the bootstrap, which, provided that in certain complex cases (see Efron & Tibshirani, 1993) the advanced bootstrap procedures are used, gives an estimate of the uncertainty margins for whatever measure one wishes to use. The procedure is also easily generalized to situations with more than one outcome measure. A brief introduction to this is given in the next, and final, section.
1.8 Bootstrapping several outcome parameters jointly
In many multivariate applications, the result of the analysis of a data set consists of several outcome values. A simple and common example is multiple regression. Suppose we want to predict the scores on a criterion variable Y from the scores on a number (p) of predictor variables. We collect data on a number of observation units for all p predictor variables and for the criterion variable, and submit this data set to a procedure for carrying out a multiple regression analysis. Note that this could be any variant of multiple regression analysis, not just the classical one. The result of such a multiple regression analysis is a set of regression weights, say (b1, ..., bp). These weights are based on the sample scores, but we will use them as estimates of the regression weights that apply to the complete population. The question then, again, is how accurate these estimates are or, in other words, what the uncertainty margins around these regression weights are. For ordinary multiple regression, classical procedures (based on normality assumptions) are available for the construction of confidence intervals around regression weights. However, these do not work for special types of multiple regression procedures. A general procedure that does always apply, however, is the bootstrap, which here works as follows.
In the multiple regression situation we have a data set with the scores of n observation units on p predictor variables and one criterion variable, so each observation unit has p + 1 scores. From the sample of n observation units we draw a number of bootstrap samples in the same way as we did when bootstrapping the correlation. That is, we can use the observation unit numbers as sequence numbers, randomly generate (with replacement) a sample of these sequence numbers, and then use, for each (re)drawn observation unit, the p + 1 scores associated with it, and thus constitute a bootstrap sample. In this manner we construct a large number (say 500) of bootstrap samples, and in each bootstrap sample we compute the regression weights according to the multiple regression method that we also used for our original sample. Thus, we get 500 bootstrap regression weights b1, 500 bootstrap regression weights b2, etc. For each of these bootstrap regression weights individually, we can now construct a 90% percentile interval and consider this as a 90% confidence interval around the sample value. In this way, we obtain the uncertainty margins for each separate regression weight.
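A sketch in Python (with numpy; the data here are simulated, since the text introduces no regression data set, and the intercept is omitted for brevity):

    import numpy as np

    rng = np.random.default_rng()

    # Simulated data: n observation units, p predictors, one criterion.
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    y = X @ np.array([0.5, -0.2, 0.8]) + rng.normal(size=n)

    def regression_weights(X, y):
        # Ordinary least squares regression weights.
        return np.linalg.lstsq(X, y, rcond=None)[0]

    b_sample = regression_weights(X, y)

    B = 500
    boot_b = np.empty((B, p))
    for i in range(B):
        idx = rng.integers(0, n, size=n)  # resample whole observation units
        boot_b[i] = regression_weights(X[idx], y[idx])

    # A 90% percentile interval for each regression weight separately.
    for j in range(p):
        low, high = np.percentile(boot_b[:, j], [5, 95])
        print(f"b{j+1} = {b_sample[j]:.2f}, 90% interval [{low:.2f}, {high:.2f}]")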
The above procedure can be used for any statistical procedure that yields multiple outcome measures. A prerequisite for the procedure to work well is that the outcome measures are uniquely identified (which is, for instance, not the case in a method like principal components analysis). For methods with nonuniquely identified outcome measures, special procedures have to be invoked to use the bootstrap, which is beyond the scope of the present text.