Bootstrapping: described and illustrated
Comparing the standard mean, 10 and 20% trimmed means, & median
Before discussing the main topic, let us quickly review sampling distributions so that everyone is
clear on the major theoretical background.
Because the concept of a sampling distribution of a statistic (especially a mean) is so fundamental
to bootstrapping – what it’s about, why it works as it does – I want to review the following: The
sampling distribution of the mean has three principal characteristics you should remember: (1)
For any sample size n, the mean of all (!) sample means [drawn from a finite population] is
necessarily equal to the population mean (such a statistic is said to be unbiased); (2) The
variance of the distribution of all means (always) equals the population variance divided by n;
and (perhaps surprisingly), (3) As sample size, n, grows larger, the shape of the sampling
distribution of the mean tends toward that of a normal curve, regardless of the shape or form of
the parent population. Thus, it is properly said that the distribution of the sample mean
necessarily has mean µ and variance σ²/n, where these Greek letters stand for the population
mean and variance, respectively. Moreover, such a sampling distribution approaches normal
form as the sample size n grows larger for [virtually] every population!
It is the generality of the last point that is so distinctive. The square root of the variance σ²/n
(written σ/√n, so we may write σ_mean = σ/√n) is called the standard error of the mean (i.e., the
standard deviation of the sampling distribution of the mean, σ_mean), a term that is frequently
encountered in statistical practice. What has just been stated are the principal results of the
'Central limit theorem.' (Stop to note that the sampling distribution [of any statistic] is itself a
population; but such a distribution is wholly distinct from the distribution of the parent
population, and from the distribution of any particular sample. Because of its status as a
population, the characteristics of a sampling distribution are generally denoted by Greek letters
[consider that all possible samples of a given size were the sources of the statistics]. But don't
confuse the sampling distribution of any statistic (and there is generally a different one for
different statistics) with the parent population from which samples were drawn.)
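These three facts are easy to verify by simulation. Here is a minimal R sketch; the exponential
parent population and the sample size n = 30 are arbitrary choices, used only for illustration:
set.seed(1)
n <- 30
xbar <- replicate(10000, mean(rexp(n, rate = 1)))  # 10000 sample means from an exponential population
mean(xbar)   # close to the population mean, 1          (unbiasedness)
sd(xbar)     # close to sigma/sqrt(n) = 1/sqrt(30)      (variance sigma^2/n)
hist(xbar)   # roughly normal in shape, despite the skewed parent population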
The preceding points are fundamental to the bootstrap method (see below). But note that when
we speak about a bootstrap distribution of a statistic we are talking about an approximate
sampling distribution of a particular statistic (not just the mean!), based on a 'large' number of
bootstrap samples; for each sample, the sampling is done with replacement from a particular
'sample-as-population', and each sample is of the same size as the original, i.e., n. Still, that
'large' number [1000 below] of statistics is far smaller than the total of all possible samples,
which is generally n to the power n (n^n) in a bootstrap context, or N to the power n (N^n) for a
finite population of size N.
Bootstrapping entails sampling with replacement from a vector (or from the rows of a matrix or data
frame; see the R notes below for how to do this), so that each bootstrap sample is always the
same size as the original. But don't confine your thinking to just the mean as we begin to
consider bootstrapping; in general, bootstrap distributions can be created for any statistic that can
be computed, and each statistic is based on a set of resampled data points.
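A minimal sketch of this resampling step in R, for any numeric data vector y already in your
session (the median is just one example of a statistic):
y.star <- y[sample(length(y), replace = TRUE)]  # one bootstrap sample: same size as y, drawn WITH replacement
median(y.star)                                  # any computable statistic can then be evaluated on the resample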
The following illustration begins from a vector y that contains n = 100 values, originally
generated as a random sample from the t3 distribution, i.e. t w/ 3 degrees of freedom, and then
scaled to have a mean of 20 and a standard deviation of about 3. This accounts for the relatively
long tails of y, compared with a Gaussian (normal) distribution that you see below. See plot of y
(which is both a sample and a population, depending on your point of view – both ideas are
relevant); its summary statistics (parameters?) are given below.
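The exact code used to construct y is not shown here; the following is a sketch of one way such a
vector could be produced (the seed and the exact rescaling are assumptions):
set.seed(1)                              # hypothetical seed, just for reproducibility
y0 <- rt(100, df = 3)                    # a random sample from t with 3 degrees of freedom
y  <- 20 + 3 * (y0 - mean(y0)) / sd(y0)  # rescale to mean 20 and s.d. about 3
round(c(mean(y), sd(y)), 2)              # check the center and spread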
Here is the R function I used to obtain four central tendency estimates for each of 1000
bootstrap samples: (Copy and paste means4 into your R session)
means4 <- function(x, tr1 = .1, tr2 = .2) {
  # function that, given vector x, computes FOUR statistics to assess central tendency
  xm1   <- mean(x)               # conventional mean
  xmt.1 <- mean(x, trim = tr1)   # 10% trimmed mean
  xmt.2 <- mean(x, trim = tr2)   # 20% trimmed mean
  xm.5  <- median(x)             # 50% trimmed mean = median
  xms <- c(xm1, xmt.1, xmt.2, xm.5)
  xms <- round(xms, 2)
  list(xms = xms) }
# the four 'means' above are given as mean1...mean4 below.
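A quick check on the original sample, assuming the vector y from above is in your session:
means4(y)   # returns a list with component xms: the mean, 10% and 20% trimmed means, and median (rounded)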
Now we use the bootstrap function from the library bootstrap; the command used for the main
bootstrap run (1000 replicates; nboot >= 1000 is advisable for 'good' C.I.'s) was:
mns4.y <- bootstrap(y, nboot = 1000, means4)
That is, I generated 1000 bootstrap replications of the four statistics [library: bootstrap in R].
All R commands used are given below.
Numerical summary of bootstrap results: > cbind(my.summary(y=x),my.summary(mns4.500))

          pop.   mean1   mean2   mean3   mean4   # mean1 is just the conventional mean
means    20.00   19.99   20.01   20.00   20.02   # departures from 20 indicate bias
s.d.s     3.03    0.30    0.26    0.26    0.24   # first value is the 'population' s.d.; the rest are bootstrap s.e.'s
skewns   -1.04   -0.07    0.00   -0.09   -0.06

The key results are the bootstrap s.e.'s (the s.d.s of mean1 through mean4); these will be
discussed below. See the plots that follow.
A second run, again with 1000 bootstrap replications, gave these results (I ignore the pop. values
here, as they did not change):

means    19.99   20.02   20.01   20.03
s.d.s     0.30    0.25    0.25    0.24
skewns   -0.07    0.10    0.00   -0.02

For practical purposes, the two runs are identical.
Following are the four bootstrap distributions for the first set:
NB: The initial 'population' was a long-tailed sample. Its use affords an opportunity to study the
conventional sample mean as an estimate of the 'center' of a distribution when normality does
not hold. We in fact see below that the conventional mean is the worst of the four estimators of
the center of the distribution of the parent population, based on 1000 bootstrap samples.
Remember that the initial sample-as-population had n = 100 scores, so that n = 100 for each
bootstrap sample as well. Thus the first s.e., for mean1, can be calculated by theory; that theory
says to divide the population s.d. by sqrt(n); here 3.03/sqrt(100) = .303.
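A one-line check in R, with y playing the role of the bootstrap population (so sd(y) is essentially
the 3.03 above):
sd(y) / sqrt(length(y))   # about 0.30 here, matching the bootstrap s.e. reported for mean1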
We are most informed by the computed standard error estimates; these quantify how well the
different estimators of the ‘center’ of this distribution work in relation to one another.
To repeat, each bootstrap sample entails sampling WITH replacement from the elements of the
initial data vector y. Each of the B = 1000 bootstrap samples contains n = 100 resampled scores,
and all four statistics (‘means’; three ‘trimmed’) were computed for each bootstrap sample. The
summary results, and especially standard error estimates, based on the bootstrap replicates are the
principal results on which one will usually focus in studies like this. See the documentation for
‘bootstrap’ for more information as to what this function does, or can do.
The first major book on the bootstrap was written by Bradley Efron, inventor of the bootstrap,
and Tibshirani: ‘An introduction to the bootstrap’, 1993. There are now at least a dozen books,
many of them technical, about bootstrapping. (The May 2003 issue of Statistical Science is
devoted exclusively to articles on bootstrapping, for its 25th anniversary.)
Some things you may find useful about bootstrapping within the world of R:
1. A vector such as y can be regarded as y[1:n], and bracket notation controls which elements are
selected; e.g. y[c(1,3)] gives the 1st and 3rd elements of y; y[n:1] presents the y values in
reverse order; and y[sample(1:n,n,repl=T)] yields a bootstrap sample of y, of size n. The latter,
repeated (using sampling WITH replacement), becomes the basis for a bootstrap analysis.
2. A matrix such as yy, regarded as yy[1:n,1:p] (of order n x p), can be examined in parts using
bracket notation; e.g. yy[1:3, ] displays the first 3 rows of yy. Also, to sample the rows of yy,
use yy[sample(1:n,n,repl=T), ], where the comma in [ , ] separates the row and column
designations. (Both points are illustrated in the short sketch just below.)
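A minimal sketch illustrating both points; v and vv are small made-up objects used only for
illustration, and replace = TRUE is the full name of the argument abbreviated as repl=T above:
n  <- 10
v  <- rnorm(n)                          # a small illustrative vector
vv <- matrix(rnorm(n * 3), nrow = n)    # a small illustrative 10 x 3 matrix
v[c(1, 3)]                              # 1st and 3rd elements
v[n:1]                                  # values in reverse order
v[sample(1:n, n, replace = TRUE)]       # one bootstrap sample of v
vv[1:3, ]                               # first 3 rows of vv
vv[sample(1:n, n, replace = TRUE), ]    # one bootstrap sample of the rows of vv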
R commands used to get the preceding numerical results:
> bt.mn4 <- bootstrap(x, nboot = 1000, theta = means4)  # x is called xt3 in Fig 1
> bt.mns4 <- as.data.frame(bt.mn4$thetastar)            # the output thetastar is of class 'list';
                                                        # it needs to be a data frame or a matrix for what follows
> bt.mns4 <- t(bt.mns4)                                 # transpose of the matrix (data frame) bt.mns4,
                                                        # taken for convenience (below)
> gpairs(bt.mns4)                                       # the gpairs function is from package YaleToolkit
> my.summary(bt.mns4)                                   # I wrote my.summary; copy it into your R session,
                                                        # or just use summary from R
my.summary <- function(xxx, dig = 3) {
  # xxx is taken to be an input data frame or matrix
  xxx <- as.matrix(xxx)
  xm  <- apply(xxx, 2, mean)          # column means
  s.d <- sqrt(apply(xxx, 2, var))     # column standard deviations
  xs  <- scale(xxx)                   # standardize each column
  sk  <- apply(xs^3, 2, mean)         # skewness of each column
  kr  <- apply(xs^4, 2, mean) - 3     # (excess) kurtosis of each column
  rg  <- apply(xxx, 2, range)         # column minima and maxima
  sumry <- round(rbind(xm, s.d, sk, kr, rg), 3)
  dimnames(sumry)[1] <- list(c("means", "s.d.s", "skewns",
                               "krtsis", "low", "high"))
  sumry <- round(sumry, dig)
  sumry }
Bootstrapping sources in R
R functions for bootstrapping can be found in the bootstrap and the boot libraries, so you should
examine the help files for several of the functions in these libraries to see how to proceed. Note
that bootstrap is a much smaller library than boot, and generally easier to use effectively. I
recommend that you begin with the function bootstrap in the library of the same name, but the
greater generality of 'boot' will be most helpful in some more advanced situations. The
help.search function will yield more packages that reference bootstrapping, so give that a try
too. Naturally, many introductions and discussions can be found on the web; let's see what you
like: post a URL or two.
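For comparison, here is a minimal sketch of the same run using the boot library; boot requires a
statistic function that accepts the data and a vector of resampled indices, and the wrapper name
means4.boot is just illustrative:
library(boot)
means4.boot <- function(data, indices) means4(data[indices])$xms  # adapt means4 to boot's (data, indices) interface
b.out <- boot(y, statistic = means4.boot, R = 1000)               # 1000 bootstrap replicates of the four statistics
b.out                                                             # prints the bias and std. error of each statistic
boot.ci(b.out, type = "perc", index = 2)                          # percentile CI for the 10% trimmed mean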
These pages summarize key points, and offer more general principles. Of particular interest
is the concept of confidence intervals, and ways to use bootstrap methods to generate limits for
CIs. Note the (deliberate) redundancy between what follows and what you have read above.
The essence of bootstrapping:
We begin with a quantitative variable that we want to characterize or describe
using numerous different statistics (means, medians, trimmed means, variances, s.d.s, skewness,
etc.). Our goal is to make inferences about 'parent population' parameters using confidence
intervals that have in turn been constructed using the information from a reasonably large
number of computer-generated bootstrap samples. (Take note: we will NOT introduce any
mathematical theory here; all that follows involves computer-intensive computation, but no
'theory' as such. A short R sketch implementing the steps below appears after the list.)
1. Begin from an initial sample, not too small (say, 30 or 40 cases at a minimum); this should be a
random sample, or a sample for which we can reasonably argue that it 're-presents' some
'larger universe' of scores that we shall think of as our parent population. Let y represent the
sample data.
2. Decide what feature(s) of the parent population we would like to make inferences about
('center', 'spread', 'skewness', etc.); then, given one or two choices, say center and spread,
decide on what statistics we want to use for inferences. We might have two, three or more
alternative measures of each feature (e.g., four 'means' for center; s.d.s and IQRs to assess
spread, etc.), a total of S statistics, say. One goal here is to compare various estimators with
one another with respect to their purposes in helping to make inferences. We might also have
begun with difference scores in our initial vector.
3. Compute and save each of these ‘statistics’ for our initial sample of y values; we shall call
them by a special name: bootstrap parameters (which are also statistics, see below). Reflect
on this point since it is easy to be confused here.
4. Choose or write functions that we will be able to 'apply' to each bootstrap sample, where
each bootstrap sample is simply a sample drawn with replacement from the initial sample.
The initial sample will now be regarded – for the purposes of bootstrapping – as our
'bootstrap population'. (Note carefully that we must take care in what follows to distinguish
the parent population from the bootstrap population. The latter population can be said to
have bootstrap parameters that are also properly labeled as 'conventional sample statistics'.)
5. Generate bootstrap samples a substantial number of times (say B = 500 to 1000 of these),
saving the bootstrap replicates (those that 'measure' center, spread, skewness) for
each of the bootstrap samples. It is best to generate an array (a matrix of order B x S) that contains
all of these; they shall be called 'replicate values' for the respective statistics, and they will be
the basis for the ultimate inferences.
6. Summarize the preceding table by columns, one for each statistic that relates to a particular
feature of the initial bootstrap population (recalling that our bootstrap population began as our
initial sample). Both numeric and graphical methods should usually be employed.
7. Compute and compare the (conventional) means of the replicate statistics (columns) with
the bootstrap population parameters; the differences may be positive or negative, and these
differences measure 'bias'. Ideally, we might seek zero bias, but small amounts of bias are
usually tolerated, particularly if the biased statistics have compensating virtues, especially
relatively small variation across the set of bootstrap samples.
8. Then compute and compare the s.d.s of the respective statistics; often the main goal of the
entire bootstrapping study is to find which statistics have the smallest s.d.s (which is to say,
the smallest bootstrap standard errors), since these are the statistics that will have the narrowest
confidence intervals. If a statistic is found to be notably biased, we may want to 'adjust' the
statistics (nominally used as 'centers' of our ultimate confidence intervals).
9. Generate the density distributions (histograms are fine) and, more importantly, selected
quantiles of any bootstrap statistics (the bootstrap replicates) we generated. For example, if
we aim to get a 95% interval for a trimmed mean, we find the 2.5% and the 97.5% quantile
points of the distribution of that trimmed mean, and (supposing it has minimal bias) these
become our confidence limits for a 95% interval. We will surely want to compare these limits
with those for the conventional mean. Statistics with the narrowest CIs can usually be said
to be 'best', particularly if they were found to be 'minimally biased.' Similar methods are
used for 99% CIs, etc. Graphics can be useful in this context, but be sure to note that all the
information is based on a (rather arbitrary) initial sample, so care has to be taken not to
misinterpret, or over-interpret, results.
10. Summarize by comprehensively describing the main results, also noting that this
methodology has bypassed normal-theory methods, which, strictly speaking, apply only when
normality assumptions can be invoked; moreover, we have made no assumptions about
shapes or other features of the (putative [look it up!]) parent population. In particular, we
have not assumed normality at any point, and we have gone (well) beyond the mean to
consider virtually any statistic of interest (review these; add others).
11. Finally, recognize that the interpretation of any such bootstrap CI is essentially the same
as that for a conventional CI obtained by normal-theory methods.
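Here is a minimal, self-contained R sketch of these steps for one feature, the 'center', comparing
the conventional mean with the 20% trimmed mean; the made-up initial sample, B = 1000, and the
use of quantile() for the percentile limits are illustrative choices only:
set.seed(2)
y.init <- rnorm(40)                                # step 1: an initial sample (made up here)
B <- 1000                                          # step 5: number of bootstrap samples

boot.pars <- c(mean = mean(y.init),                # step 3: bootstrap parameters
               trim20 = mean(y.init, trim = 0.2))  #         (statistics of the initial sample)

reps <- t(replicate(B, {                           # steps 4-5: a B x S matrix of replicate values
  y.star <- y.init[sample(length(y.init), replace = TRUE)]   # one bootstrap sample, WITH replacement
  c(mean = mean(y.star), trim20 = mean(y.star, trim = 0.2))
}))

colMeans(reps) - boot.pars                         # step 7: estimated bias of each statistic
apply(reps, 2, sd)                                 # step 8: bootstrap standard errors
apply(reps, 2, quantile, c(.025, .975))            # step 9: percentile limits for 95% CIs
By the criteria above, the statistic with the smaller standard error and narrower percentile
interval (given minimal bias) would be the preferred estimator of the center.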
These ideas readily generalize to statistics that are vectors, such as vectors of regression
coefficients. This means we are all free to invent and use bootstrap methods to study the
comparative merits and demerits of a wide variety of statistics, without regard for whether
they are supported by ‘normal theory.’ We need not invoke normality assumptions, nor
make any other so-called parametric assumptions in the process. The main thing to note is
that any bootstrap sample drawn from a vector or matrix is just a sample drawn with
replacement from the ROWS of an initial sample data matrix, and (vector) bootstrap statistics
are computed for each bootstrap matrix, analogous to what has been described above. The
computation may be 'intense', but with modern computers such operations are readily carried
out for rather large matrices (thousands of rows, hundreds of columns) if efficient
computational routines are used. Conventional statistics can be notably inferior to certain newer
counterparts, a point that often needs to be considered seriously.
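As one concrete illustration of the vector case, here is a minimal sketch of bootstrapping a vector
of regression coefficients by resampling the rows of a data matrix; the data frame dat, the model
y ~ x1 + x2, and B = 1000 are made up for illustration:
set.seed(3)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)   # made-up data with known coefficients

B <- 1000
coef.reps <- t(replicate(B, {
  d.star <- dat[sample(n, replace = TRUE), ]        # resample the ROWS, with replacement
  coef(lm(y ~ x1 + x2, data = d.star))              # the vector-valued bootstrap statistic
}))

apply(coef.reps, 2, sd)                             # bootstrap s.e.'s of the three coefficients
apply(coef.reps, 2, quantile, c(.025, .975))        # percentile 95% limits for each coefficient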