12.1 A BOOTSTRAP HYPOTHESIS TEST
OF THE POPULATION MEAN
In Chapter 6 we looked at sets of data and asked whether the data could
have arisen from a binomial model with a certain population probability of
success such as that given by a hypothesized drug cure rate. If the observed
proportion of successes was too large, we rejected the hypothesized model.
In Chapter 7 we looked at a set of data and asked whether it could have
arisen from a particular hypothesized model such as that of a six-sided fair
die or a many-sided loaded die. The chi-square statistic was calculated. If
it was improbably large, we concluded that the data did not come from
the hypothesized model. If the chi-square statistic was not too large, we
concluded that the data may very well have come from that model. That is,
we “rejected” or “accepted” the null hypothesis on the basis of the above
considerations. The six-step decision-making process of Chapters 6 and 7
can be used for many other statistical purposes. Which of two new drugs,
if either, is more effective? Are husbands more likely to be older than their
wives? Is a particular hypothesized population blood pressure average
correct for 60-year-olds? This chapter shows how to formally assess such
hypotheses. The chapter is best viewed as a continuation of Chapter 6.
Let us consider the Key Problem. We will assume that the body temperature readings constitute a random sample from a population of adults and
hence are representative of this population. The population is taken to be a
large one, such as the residents of Chicago. Confidence intervals (discussed
in Chapter 11) give a set of reasonable values for the unknown theoretical
mean of the population under consideration. We need the standard deviation of the data, which turns out to be 0.73. Thus, as explained in Chapter
11, we can use the theoretical result that a sample mean when appropriately
standardized is approximately distributed as a standard normal variable.
From this, the approximate 95% confidence interval for the population mean
is computed by
$$98.25 \pm \frac{1.96(0.73)}{\sqrt{130}} = 98.25 \pm 0.13 = (98.12,\ 98.38)$$
The “normal” body temperature value 98.6 is not in that interval, even
though the theory of confidence intervals tells us the interval covers the true
population average temperature about 95% of the time. This suggests that
the population mean is lower than 98.6, and that it is not just chance causing
the observed mean to be as low as 98.25.
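The interval arithmetic above can be checked with a few lines of Python. This is only a sketch of the computation as stated in the text; the variable names are ours, and the inputs (mean 98.25, standard deviation 0.73, sample size 130, multiplier 1.96) are the values quoted above.

```python
import math

# Values quoted in the text for the body temperature data.
sample_mean = 98.25
sample_sd = 0.73
n = 130
z = 1.96  # standard normal multiplier for approximate 95% coverage

# Margin of error: z times the estimated standard error of the mean.
margin = z * sample_sd / math.sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(f"margin of error: {margin:.2f}")
print(f"interval: ({lower:.2f}, {upper:.2f})")
print("98.6 inside interval:", lower <= 98.6 <= upper)
```

Rounded to two decimals, the margin and endpoints reproduce the 0.13 and (98.12, 98.38) given above, and 98.6 falls outside the interval.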
Another approach to this question is to formally set out a null hypothesis:
H0 : The population mean is 98.6.
and then use the data to see if the hypothesis holds up or whether
the population average temperature is in fact lower. Here H0 denotes the null
hypothesis: H is for hypothesis and 0 is for null. This is the hypothesis-testing
approach of this chapter. The idea is the same as in Chapter 6, where we
asked whether the data conformed closely enough to a given hypothesized
model. The null hypothesis represents the status quo: it is believed that the
mean of the population is 98.6. It is usually the hypothesis of “business as
usual” or “nothing of interest here.” We then look at the data, which have a
mean of 98.25. We have two choices:
1. It is plausible that the observed sample mean of 98.25 could have come
from a null hypothesis population with a mean of 98.6. We thus accept
the null hypothesis, meaning that the evidence tells us either that the
null hypothesis is true or that there is not enough evidence to say it is
false. Thus “accept” does not necessarily mean we have strong evidence
that the null hypothesis is true.
2. It is not plausible that the observed sample mean could have come from
a null hypothesis population with a mean of 98.6. We thus reject the null
hypothesis and conclude that the true population average temperature
is less than 98.6.
How far does the sample average have to be from 98.6 before we are
compelled to reject the null hypothesis? In particular, what is the chance
that if the null hypothesis is true, we could observe a sample mean as low
as or lower than 98.25? To decide, we take the six-step hypothesis-testing
approach of Chapter 6.
A key step in our six-step simulation approach to statistical hypothesis
testing is step 1: making a realistic choice of the model to be randomly sampled from. One cannot be effective as a statistician without understanding
how to realistically specify the model generating the data to be analyzed.
Because we are hypothesis testing, the specified model of step 1 must
satisfy the null hypothesis as well as be a realistic model in terms of shape
and spread for the problem at hand. In Chapters 6 and 7 supplying such
a model was fairly straightforward. For example, in Chapter 7 we often
used a fair many-sided die null hypothesis model, and in the Chapter 6
problem about community attitudes toward raising the driving age, a fair
coin null hypothesis model sufficed. In this section something more subtle
is often called for. When understood, this new approach to building the
null hypothesis model of step 1 will seem most reasonable. The approach is
called bootstrapping the observed data. We were introduced to bootstrapping
in Section 11.3 as a method of approximately obtaining the standard error
of an estimate. Bootstrapping’s central idea is to use the shape of the data
to supply a good estimate of the unknown model population that we wish
to sample from. The name bootstrapping comes from the cliché of pulling
oneself up by one’s own bootstraps, that is, climbing upward with no
assistance other than one’s own body, clearly a feat not literally possible.
In a statistical context it means making a statistical inference using only the
data to produce the model: that is, we do not make the usual specification of a
model (such as binomial, normal, or uniform distribution), which is usually
arrived at independently of the data. The statistical bootstrap has a valid
justification and indeed often works well in applications. From the viewpoint
of this book, it is a special version of our six-step method of hypothesis testing
in which the model of step 1 is entirely determined by the observed data and
the null hypothesis. Bootstrap methods are beginning to be heavily used in
modern statistical practice; hence the method you will learn here is part of
the modern statistical arsenal. Let us return to the Key Problem.
1. Choice of a Model (Definition of the Null Hypothesis Population):
We must choose a realistic box model—a population—that conforms with
the null hypothesis. In particular, the population must represent the body
temperatures of the large set of adults actually being sampled from, but with
the null hypothesis being true and hence having a population average of
98.6. But simply saying the population mean is 98.6 does not define the
null hypothesis population, for many populations have a theoretical mean
of 98.6. What is the population standard deviation? What does the shape
of the population relative frequency histogram, also called a probability
histogram, look like (that is, the theoretical probability distribution of
the temperature of a randomly sampled person—see the introduction to
Chapter 8, and see Figure 8.1 for an example of a probability histogram)?
For example, we could decide it is unwise to assume a normal shape for
our population distribution, even though many would take this approach.
Indeed, we will instead presume that the shape of the population relative
frequency histogram or theoretical distribution is exactly that of the sample
relative frequency histogram in Figure 12.1, except that it is shifted so that
its theoretical mean is 98.6. (Note that the rectangles in Figure 12.1 are 0.2 in
width, and keep in mind that the probability of an interval is given by the
area, not the height, of the rectangle.) Note that the population probability
histogram represents the temperature distribution for all the individuals in
the population. Since on average the data points are 98.6 − 98.25 = 0.35
lower than the hypothesized value, we simply add 0.35 to all the data points.
Now we replicate each such shifted data point many times to create the needed large population of adults satisfying the null hypothesis.
(Example 12.1 below exhibits this procedure more explicitly.) This invented population (box model) that characterizes H0 being true then has
a theoretical mean of 98.6 but otherwise is shaped like the data. The
advantage of this approach is that we have not needed to assume any
particular theoretical distributional shape, but have let the data alone
determine our estimate of the shape. An approach that makes no assumptions about the particular shape (for example, a bell shape) of the
population distribution is called nonparametric because it is free of restrictive
assumptions that are usually given by parameters that specify the
shape of the population distribution. The nonparametric approach is very
powerful because the user takes no risk of being deceived by assuming
an incorrect shape for the population histogram. The nonparametric, data-driven approach we are embarking on is one version of what statisticians
call the nonparametric bootstrap.
Figure 12.1 Adult temperature data. (Relative frequency histogram; horizontal axis: Temperature (°F), 96 to 101; vertical axis: relative frequency, 0 to .45; in-figure label: Temperature ≥ 98.6.)
2. Definition of a Trial (Sample): A trial consists of randomly choosing
130 readings from the population, sampling without replacement, because
sampling without replacement is the realistic way to sample from a population. Indeed, the actual 130 observations were obtained by sampling
without replacement.
3. Definition of a Successful Trial: Because we are concerned with the
average temperature of 130 people, the statistic is the average of the (new)
130 readings sampled from the population. The trial is a success if the
average is less than or equal to 98.25.
4. Repetition of Trials: We perform the sampling 100 times, each time
obtaining a new mean. The stem-and-leaf plot in Table 12.1 contains
these means. The average of the means is 98.6048—very close to the null
hypothesis value, as expected because the population mean is 98.6. The
standard deviation of the means can be computed: it is 0.0606.
5. Estimation of the Probability of the Obtained Average or Less (Probability of a Successful Trial): We want to know the chance of obtaining
a bootstrap sampled mean that is ≤ 98.25 from this null hypothesis population designed to have a population mean of 98.6. It turns out that all the
simulated sample averages from step 4 are higher than 98.25: they range from
98.45 to 98.75. Thus we estimate the probability to be 0.
Table 12.1 Sample Means of Temperatures from 100 Samples of Size 130
Stem | Leaf
984 | 567
985 | 1222223334444444
985 | 5555566777777888889999
986 | 00000000001111112222222333344444
986 | 55555555666777788899
987 | 00014
987 | 55
Key: “984 | 567” stands for 98.45, 98.46, 98.47 degrees.
6. Decision: If the null hypothesis is true, the chance that a bootstrap
sample mean is ≤ 98.25 is estimated to be 0, much less than the conventional
value of 0.05 for an unlikely event. Thus the null hypothesis does not
appear plausible: the observed value of 98.25
cannot reasonably be ascribed to chance under the null hypothesis. We reject the null
hypothesis, believing the evidence to be very strong. That is, the evidence
is strong that for the population from which the data were sampled, the
average temperature is not the “normal” value of 98.6 degrees.
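Steps 4 through 6 can be retraced from the 100 simulated means listed in Table 12.1. The sketch below transcribes the stem-and-leaf rows and recomputes the summary figures. The names are ours, and the use of the population form of the standard deviation (`pstdev`) is an assumption, since the text does not say which form was used.

```python
from statistics import mean, pstdev

# Rows transcribed from Table 12.1: stem "984" with leaf "5" encodes 98.45.
rows = [
    ("984", "567"),
    ("985", "1222223334444444"),
    ("985", "5555566777777888889999"),
    ("986", "00000000001111112222222333344444"),
    ("986", "55555555666777788899"),
    ("987", "00014"),
    ("987", "55"),
]

# Expand each (stem, leaves) row into individual simulated sample means.
boot_means = [int(stem + leaf) / 100 for stem, leaves in rows for leaf in leaves]

observed_mean = 98.25
# A trial is a success if its mean is at or below the observed mean (step 3).
successes = sum(m <= observed_mean for m in boot_means)
p_hat = successes / len(boot_means)  # step 5

print("number of trials:", len(boot_means))
print("average of means:", round(mean(boot_means), 4))
print("SD of means:", round(pstdev(boot_means), 4))
print("smallest and largest mean:", min(boot_means), max(boot_means))
print("estimated probability:", p_hat)
```

Because the table lists means rounded to two decimals, the recomputed average and standard deviation should agree with the quoted 98.6048 and 0.0606 only up to rounding. No simulated mean is at or below 98.25, so the estimated probability is 0.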
Since it is a totally new idea to form a nonparametric bootstrap null
hypothesis population, we now examine the crucial step 1 in detail. To
understand how to construct the null hypothesis population model from
the sample data, let’s simplify the body temperature problem by assuming
that the sample is of 5 people rather than 130. Suppose their observed
temperatures are 97.3, 97.5, 98.4, 98.6, and 99.2. The goal is to invent a large
population (to be sampled from without replacement) shaped like these data
but with a mean of 98.6, thereby satisfying the null hypothesis. The mean
of the five points is 98.2. To start building the large population, we have
to add something to each point so that the mean of the five points is 98.6.
Because 98.6 − 98.2 = 0.4, we have to add 0.4. We then have 97.7, 97.9, 98.8,
99.0, 99.6. (Check that these five have mean 98.6.) We now create the desired
realistically large population by replicating each of these five points many
times. Suppose the large population to be modeled is of size 5000. Any large
number is acceptable: for instance, 500 would be fine, too. Then we replicate
each point 1000 times so that the total in our invented large population is 5000.
The result is a box model of size 5000 that satisfies the null hypothesis and,
we believe, is shaped approximately like the real population distribution of
all people’s temperatures. (See the following table.)
97.7   97.9   98.8   99.0   99.6
97.7   97.9   98.8   99.0   99.6
97.7   97.9   98.8   99.0   99.6
97.7   97.9   98.8   99.0   99.6
97.7   97.9   98.8   99.0   99.6
97.7   97.9   98.8   99.0   99.6
97.7   97.9   98.8   99.0   99.6
97.7   97.9   98.8   99.0   99.6
97.7   97.9   98.8   99.0   99.6
97.7   97.9   98.8   99.0   99.6
...    ...    ...    ...    ...
97.7   97.9   98.8   99.0   99.6
This invented population is our best guess of what the real population
looks like in terms of centering, spread, and overall distributional shape if
the null hypothesis is true. If the sample of size 5 is reasonably representative
of the unknown population (spread, shape, and so on), then our population
created from the data should be (roughly) shaped like the unknown true
population distribution we would use in step 1 if we knew it.
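The construction of the invented population from the five temperatures can be written out in a few lines. This is a minimal sketch of the shift-and-replicate step described above; the variable names are ours, and rounding the shifted values to one decimal is a convenience to match the figures in the text.

```python
from statistics import mean

sample = [97.3, 97.5, 98.4, 98.6, 99.2]  # the five observed temperatures
null_mean = 98.6                          # mean required by the null hypothesis

# Shift every point by the same amount so the sample mean becomes 98.6.
shift = null_mean - mean(sample)
shifted = [round(x + shift, 1) for x in sample]

# Replicate each shifted point 1000 times to form a population of 5000.
copies = 1000
population = shifted * copies

print("shift:", round(shift, 1))
print("shifted values:", shifted)
print("population size:", len(population))
print("population mean:", round(mean(population), 4))
```

Replicating the points changes the size of the population but not its mean or shape, which is why the choice of 1000 copies is not critical.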
Now our plan is to repeatedly sample five numbers without replacement
from this large invented population. Because the population size (5000) is
large relative to the planned sampling size (5), we know from Section 5.7
that the probability of a successful trial is almost unaffected if we instead
sample with replacement, so we can sample with replacement with no
harm done. But if we sample with replacement, the probability law of the
five-observation sample average needed for step 5 is the same if we use
the shifted original (five-member) sample as our step 1 population instead
of the population of 5000 formed from the 1000 replications. By switching
to sampling with replacement to form our simulated samples, we can
avoid all the effort of creating and sampling without replacement from a
large invented population. We can rather merely take the basic five shifted
null hypothesis values, 97.7, 97.9, 98.8, 99.0, and 99.6 to be the entire null
hypothesis population, and repeatedly randomly sample five observations
from this five-member population with replacement. That is, we randomly
choose one, record its value, and then put it back. We then randomly choose
another, which could be the same as the first, record its value, and then put
it back. We do this five times to obtain each step 2 sample. Such random
sampling with replacement from the actually observed data set is bootstrap
sampling. Indeed, in Section 11.3 we introduced the bootstrap by sampling
with replacement from a small (nonreplicated) population. This sampling
with replacement from the (possibly small) sample is what statisticians
actually do in practice.
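The claim that the two sampling schemes behave almost identically can be explored by simulation. The sketch below draws samples of five both ways and compares the resulting means. The seed and the number of trials (2000, larger than the 100 used in the text so that the comparison is stable) are our assumptions.

```python
import random
from statistics import mean, pstdev

shifted = [97.7, 97.9, 98.8, 99.0, 99.6]  # null hypothesis values from the text
population = shifted * 1000                # invented population of 5000

rng = random.Random(12)  # arbitrary seed, fixed only for reproducibility
n_trials = 2000          # assumption: enough trials to make the comparison stable

# Scheme A: five draws without replacement from the large invented population.
without_repl = [mean(rng.sample(population, 5)) for _ in range(n_trials)]

# Scheme B: five draws with replacement from the five shifted values.
with_repl = [mean(rng.choices(shifted, k=5)) for _ in range(n_trials)]

for label, means in (("without replacement", without_repl),
                     ("with replacement", with_repl)):
    print(f"{label}: average {mean(means):.3f}, SD {pstdev(means):.3f}")
```

Both schemes should give an average of means near 98.6 and a similar spread, in line with the argument that sampling with replacement from the five shifted values does no harm.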
In summary, we can bootstrap-sample repeatedly (that is, sample with replacement from the actual observed sample, translated to make the null
hypothesis true) as an acceptable substitute for repeatedly sampling without replacement from the large invented population that replicates
that same translated sample many times.
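The whole bootstrap version of the six-step method can be sketched as one short function. This is an illustration under our own assumptions rather than a definitive implementation: the function name `bootstrap_mean_test`, its parameters, and the seed are hypothetical, and the data are the five temperatures of the simplified example.

```python
import random
from statistics import mean

def bootstrap_mean_test(sample, null_mean, n_trials=100, tail="lower", seed=None):
    """Estimate the chance, under H0, of a bootstrap mean at least as
    extreme as the observed sample mean (hypothetical helper)."""
    if tail not in ("lower", "upper"):
        raise ValueError("tail must be 'lower' or 'upper'")
    rng = random.Random(seed)
    observed = mean(sample)
    # Step 1: shift the data so that the null hypothesis holds.
    null_values = [x + (null_mean - observed) for x in sample]
    successes = 0
    for _ in range(n_trials):                      # step 4: repeat the trials
        # Step 2: one trial is a sample of the same size, drawn with replacement.
        resample = rng.choices(null_values, k=len(sample))
        m = mean(resample)                         # step 3: the statistic
        if tail == "lower":
            successes += m <= observed
        else:
            successes += m >= observed
    return successes / n_trials                    # step 5: estimated probability

p_hat = bootstrap_mean_test([97.3, 97.5, 98.4, 98.6, 99.2], 98.6,
                            n_trials=1000, seed=1)
print("estimated probability:", p_hat)
print("reject H0 at the 0.05 convention:", p_hat < 0.05)  # step 6
```

With only five observations the estimate is far from 0, so these five points alone would not lead to rejection. The strength of the conclusion for the full data set comes from having 130 readings.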
Now imagine the above bootstrap sampling with replacement using as
our box model the original 130 temperature readings shifted by +0.35 =
98.6 − 98.25. That is, we now return to the original data set of 130 body
temperature measurements. We sample 130 observations with replacement 100 times. A statistician could use such a nonparametric bootstrap
procedure in this situation in order to verify the choice of the distribution
of the sample average under the null hypothesis for use in step 5. Indeed,
in cases in which the population distributional shape is not known, the
above bootstrap approach is very appealing and would be used by many
professional statisticians when the sample size is too small for application
of the central limit theorem—that is, well under 30 (and hence certainly in
the case of 5). But when the sample size used to compute the sample mean
is large, as 130 is in the example, most statisticians would appeal to the
central limit theorem of Chapter 11 because it tells us that the distribution of
these sample means will be well approximated by the normal distribution
regardless of the shape of the distribution of temperatures in the population.
This normal approach to carrying out step 5 is developed in Section 12.2.
In Example 12.1 below we illustrate the bootstrap approach as a special
case of our six-step method of hypothesis testing.
Example 12.1
A Paired Comparison of Two Population Means
Data on 177 Illinois husband-wife couples from the 1989 Current Population
Survey* yielded the comparison of attained educational levels presented in the
relative frequency histogram of Figure 12.2. The distributions look reasonably
similar, although it appears that more of the husbands go through two years of
college (14 years of education) and more wives have but one year of college (13
years). The average for the 177 husbands is 12.89, and that for the 177 wives is 12.65.
The husbands average 0.24 year more of education. Could that be due to chance, or
are husbands more educated than their wives on average in the population we are
sampling from?
Figure 12.2 Years of education data for married couples. (Two relative frequency histograms, one panel for Wives and one for Husbands; horizontal axis: Number of years of education, 0 to 20; vertical axis: relative frequency, 0 to 0.50.)
*From data disk accompanying D. Freedman, R. Pisani, R. Purves, and A. Adhikari, Statistics
(New York: Norton, 1991).
Because we are interested in differences, the variable we look at is
Difference = husband’s education − wife’s education
for each couple. Figure 12.3 shows the relative frequency histogram of the 177
differences. Over 30% of the couples have the same educational level. One wife
(with 14 years) has 11 more years than her husband (with 3). Otherwise, the largest
difference is seven years. The average of these differences is the 0.24 we saw above.
This is called a paired comparison test of two population means, because the (X, Y)
values are paired (a wife and a husband). Is it plausible that these data could be a
sample from a population with a difference in number of years of education exactly
0? We wish to test the following null hypothesis:
H0 : The average difference in years of education for the population
of Illinois husband-wife couples is 0.
Let us proceed to the six steps, using the nonparametric bootstrap method of
this section.
1. Choice of a Model (Definition of the Population): We seek a null hypothesis
population. It will be our best estimate of the Illinois husband-wife population
satisfying the null hypothesis. Consistent with the bootstrap approach, we will use
the sample to create this population. Because in the sample the observed difference
is 0.24 year, we subtract 0.24 from each of the 177 differences in the data to produce
a null hypothesis population of differences whose mean is exactly 0. That is, we
shift all the differences by the same amount so that they can be viewed as a step
1 population for which the null hypothesis holds. Again we have the choice of
creating a large realistic population by replicating each of the 177 members many
times and then repeatedly sampling 177 differences without replacement. But once
again we will more simply let the 177 translated differences be our entire null
hypothesis population and produce our large number of size 177 sampled means
by bootstrap-sampling with replacement between draws.
Figure 12.3 Bootstrap-simulated differences in years of education for married couples. (Relative frequency histogram; horizontal axis: Difference in years of education, −12 to 10; vertical axis: relative frequency, 0 to 0.35.)
Table 12.2 Bootstrap-Simulated Sample Means of Education Differences from 100 Samples of Size 177
Stem | Leaf
−6 | 1
−5 |
−4 | 64
−3 | 84433331
−2 | 88877777554433
−1 | 988865554221110000
−0 | 877777653322221110
0 | 1113388889
1 | 01223345556667789
2 | 0024589
3 | 1334
4 | 0
Key: “−6 | 1” stands for −0.61.
2. Definition of a Trial (Sample): A trial consists of randomly choosing 177
differences (husband’s years of education minus wife’s years of education) from
the population of 177 by sampling with replacement.
3. Definition of a Successful Trial: The statistic of interest is the average of the
177 differences sampled from the invented population. A trial is a success if the
observed average of differences is as large as or larger than 0.24.
4. Repetition of Trials: We perform the sampling 100 times. The means of the
100 samples are shown in the stem-and-leaf plot in Table 12.2.
5. Estimation of the Probability of the Obtained Average or More (Probability
of a Successful Trial): We want to know the chance of obtaining an average of
differences as large as or larger than 0.24. From Table 12.2, we can count 9 that are
0.24 or above, so the probability is estimated to be 0.09.
6. Decision: If the null hypothesis were true, the chance that the sample mean
difference is as high as 0.24 would be about 0.09. That is fairly small, but it is
certainly larger than our 0.05 convention for rejection. We decide to accept the null
hypothesis that the average difference in the population is 0. (A tentative rejection
would be reasonable, too; the results are borderline. The statistically knowledgeable researcher would definitely want to revisit this problem, perhaps with a new
and larger sample. For although 0.09 does not meet our “gold standard” for strong
statistical evidence for rejecting the null hypothesis, it surely raises our scientific
suspicions that the hypothesis may be false!) The decision we make to “accept” H0
here in step 6 means that although there is an observed difference in educational
levels in the sample, the difference is not large enough to take as convincing
evidence that there is a difference in educational levels for the entire population.
That does not mean we are convinced there is not a difference; there may be no
difference, or there could be a small one. It is therefore more accurate to say we fail
to reject H0 .
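The count used in step 5 can be checked directly from the upper rows of Table 12.2. The sketch below works in hundredths of a year to avoid rounding issues. Only the rows with stems 2, 3, and 4 are transcribed, since every mean on a lower stem is at most 0.19 and cannot reach 0.24; the variable names are ours.

```python
# Upper rows of Table 12.2: the stem is the tenths digit, each leaf a hundredths digit.
upper_rows = {2: "0024589", 3: "1334", 4: "0"}
n_trials = 100   # total number of bootstrap means in the table
threshold = 24   # observed mean difference of 0.24, in hundredths

# Convert each (stem, leaf) pair to hundredths of a year.
hundredths = [10 * stem + int(leaf)
              for stem, leaves in upper_rows.items() for leaf in leaves]

# A trial is a success if its mean is as large as or larger than 0.24.
successes = sum(h >= threshold for h in hundredths)
p_hat = successes / n_trials

print("successes:", successes)
print("estimated probability:", p_hat)
print("below the 0.05 convention:", p_hat < 0.05)
```

The count of 9 and the estimate 0.09 match step 5, and since 0.09 is above 0.05 the result agrees with the decision in step 6 not to reject H0.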
In Chapter 7, where we did chi-square testing, we learned that statisticians can, in fact, bypass the simulations of the six-step method and appeal
to the method of chi-square density. In contrast, the bootstrap simulation
method, which does not bypass the simulations of the six-step method, is
often the professional statistician’s method of choice when the sample size
is small (say, under 30).
SECTION 12.1 EXERCISES
1. Suppose a sample of 100 heights of men has
a mean of 70.53 inches and a standard deviation of 3.22 inches. Explain how to use the
bootstrap to test the hypothesis that the mean
height of men in the population is 69 inches.
2. Suppose for Exercise 1 we have 100 bootstrapped means each from a bootstrap sample
of 100 taken from the invented population
with a mean height of 69 inches. The bootstrapped means are recorded in the following
stem-and-leaf plot. Is the mean height of the
men in the population really 69 inches?
Stem
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
Leaf
2
2
08
2
01117
233459
01112334444559
003357778899
1122333445555678
001233459
000113333799
2223339
1234444478
1
26
5
Key: “697 5” stands for
69.75 inches.
3. Suppose 100 small specimens were taken from
a certain batch of concrete. The mean compression strength of the specimens was 4129.58
pounds with a standard deviation of 164.12
pounds. Explain how to test the hypothesis
that the mean compression strength of the
batch of concrete is 4200 pounds using the
bootstrap.
4. For Exercise 3, we found 100 bootstrapped
means each from a bootstrap sample of size
100 taken from the invented population with
mean 4200 pounds. The bootstrapped means
are recorded in the following stem-and-leaf
plot. Is the compression strength of the batch
of concrete really 4200 pounds?
Stem | Leaf
416 | 778
417 | 2
417 | 557789
418 | 0022244
418 | 5567779
419 | 00001122334
419 | 55667777899999
420 | 11122333334444
420 | 5556777888999
421 | 011144
421 | 5666699
422 | 0012
422 | 5667
423 |
423 | 57
424 |
424 | 6
Key: “424 | 6” stands for 4246 pounds.
5. In a random sample of 100 husbands, the mean
age was 46.62 years with a standard deviation
of 4.40 years. The mean age of the 100 wives of
the husbands was 41.88 years with standard
deviation of 3.41 years. Explain how to test
whether the difference between the ages of
husbands and wives is 0 using the bootstrap.
6. Refer to Exercise 5. Here is a frequency table of
100 bootstrapped mean differences obtained
from 100 bootstrap samples of 100 taken from
an invented population with a mean difference of 0. Test whether the difference between
the ages of the husbands and wives is 0.
Difference (years) | Frequency
−1.4 | 2
−1.3 | 0
−1.2 | 1
−1.1 | 1
−1.0 | 0
−0.9 | 2
−0.8 | 1
−0.7 | 1
−0.6 | 3
−0.5 | 5
−0.4 | 7
−0.3 | 3
−0.2 | 10
−0.1 | 5
0.0 | 8
0.1 | 9
0.2 | 5
0.3 | 9
0.4 | 7
0.5 | 5
0.6 | 1
0.7 | 8
0.8 | 0
0.9 | 2
1.0 | 3
1.1 | 1
1.2 | 0
1.3 | 0
1.4 | 1
7. In a random sample of 200 high school seniors
who took the SAT test, the mean SAT verbal
score was 544.7 with a standard deviation of
36.36, and the mean SAT math score was 531.7
with a standard deviation of 47.10. Explain
how to test whether the difference between
SAT verbal and math scores is 0.
8. Refer to Exercise 7. The 100 bootstrapped
mean differences (verbal − math) shown in the
following stem-and-leaf plot were obtained
from 100 bootstrap samples of 200 taken from
the invented population with a mean difference of 0. Test whether the difference between
SAT verbal and math scores is 0.
Stem
Leaf
−16
−15
−14
−13
−12
−11
−10
−9
−8
−7
−6
−5
−4
−3
−2
−1
−0
0
1
2
3
4
5
6
7
8
9
10
11
Key:
5
0
4
2
976
9852
98732
9966621
87653210
77410
97554
55333321000
665
011268
02358999
1578
3356888
1245
244469
0
37
24
0235
2
“11
2” stands for 11.2.
For additional exercises, see page 731.