Bootstrapping

D. Patterson, Dept. of Mathematical Sciences, U. of Montana

A good reference is Davison, A.C. and D.V. Hinkley, Bootstrap Methods and their Application, Cambridge, 1997.

Let X = (X1, ..., Xn) be a random sample from some distribution with d.f. F. Suppose also that T(X) is an estimator of some parameter θ(F), where the notation θ(F) indicates that the parameter is a function of the distribution F. We are often interested in the sampling distribution of T(X) or in some quantity which is a function of T(X) and θ(F). Examples of quantities we might be interested in:

1. Bias(T) = E[T(X)] − θ(F)
2. Var(T) = E[(T(X) − E[T(X)])^2]
3. MSE(T) = E[(T(X) − θ(F))^2] = Var(T) + Bias^2(T)
4. MAE(T) = E[|T(X) − θ(F)|]
5. Quantiles of the distribution of T(X)

Nonparametric bootstrap

The nonparametric bootstrap is used when we don't want to make any assumptions about the distribution F (e.g., that it's normal or exponential). The bootstrap says to estimate any desired quantity which is a function of T(X) and θ(F) by the corresponding quantity for the empirical distribution F̂. We'll denote these by X* and θ(F̂), where X* denotes a random sample of size n from F̂. F̂ is the MLE of F: it puts probability 1/n at each Xi in the sample. That is, we assume the population looks exactly like the sample. The quantities above are then estimated by

1. E[T(X*)] − θ(F̂)
2. E[(T(X*) − E[T(X*)])^2]
3. E[(T(X*) − θ(F̂))^2]
4. E[|T(X*) − θ(F̂)|]
5. Quantiles of the distribution of T(X*)

Note: We implicitly assume T(X) = θ(F̂), as you can verify in the examples below.

For a more specific example, let T(X) = X̄ (the sample mean) and let θ(F) = µ (the population mean). Note that T(X) = θ(F̂) = x̄, the mean of the original sample, because the mean of a distribution which puts probability 1/n at each Xi, i = 1, ..., n, is the mean of the Xi's.
Note that T(X) is the mean of n random observations from F, while T(X*) is the mean of n random observations from F̂. Going through the quantities above for this example, assuming that we have observed a specific sample x:

1. We know that the bias of X̄ is 0, but let's look at what the bootstrap estimate of Bias(X̄) is. It's

   E[X̄*] − θ(F̂) = x̄ − x̄ = 0.

   Note that the expected value of the mean of a sample from F̂ is the mean of F̂, which is x̄. Here, the bootstrap estimate of bias is exactly right.

2. The bootstrap estimate of the variance of X̄ is

   E[(X̄* − E[X̄*])^2] = Var(X̄*) = Var(F̂)/n = σ̂^2/n,

   where σ̂^2 = (1/n) Σ_{i=1}^{n} (xi − x̄)^2. This is the usual estimate except that the divisor n is used rather than n − 1, a trivial difference.

3. The bootstrap estimate of MSE(X̄) is the same as the estimate of the variance from item 2, because the bootstrap estimate of bias is 0.

4. The bootstrap estimate of MAE(X̄) is

   E[|X̄* − θ(F̂)|] = E[|X̄* − x̄|].

   Unfortunately, this expectation cannot be easily computed because we would have to know the sampling distribution of X̄*, which is very complicated for a discrete distribution. Therefore, we will have to estimate it by simulation as described below.

5. The bootstrap estimate of any quantile of the distribution of X̄ is the corresponding quantile of the distribution of X̄*. These quantiles must also be estimated by simulation.

Simulation must often be used to compute the bootstrap estimates, as in items 4 and 5. In fact, simulation is so commonly used to obtain bootstrap estimates that some people mistakenly think that bootstrapping requires simulation. The structure of the simulation is to generate many, many random samples of size n from F̂ and then to compute the quantity of interest for each bootstrap sample. Simulating a sample from F̂ is the same as drawing a random sample with replacement from x1, ..., xn, the observed data.
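The simulation recipe just described (resample the data with replacement, recompute the statistic, repeat B times) can be sketched in a few lines of Python. This is my own illustrative sketch, not part of the notes; the function name bootstrap_replicates and the small sample are hypothetical.

```python
import random
import statistics

def bootstrap_replicates(data, stat, B):
    """Draw B samples of size n with replacement from `data` (i.e., from
    the empirical distribution F-hat) and return the statistic of each."""
    n = len(data)
    return [stat(random.choices(data, k=n)) for _ in range(B)]

# The bootstrap estimate of Bias(sample mean) should be near 0 (item 1 above);
# it is not exactly 0 here only because of simulation error.
random.seed(1)
sample = [47.9, 48.5, 49.8, 50.2, 50.6, 51.1, 52.7, 53.1, 54.5, 55.9]
reps = bootstrap_replicates(sample, statistics.mean, B=5000)
bias_est = statistics.mean(reps) - statistics.mean(sample)
```

The simulation error of bias_est shrinks like 1/√B, so with B = 5000 the computed value should already be very close to the theoretical answer of 0.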
Since we're sampling with replacement, the bootstrap samples will generally differ from the original sample and from one another.

M&M example

The data consist of the net weights in grams of 18 bags of peanut M&M's with claimed net weight of 49.3 grams.

> mm
 [1] 47.9 48.5 49.8 49.8 50.2 50.6 50.8 50.9 51.1 51.2 51.5 52.7 53.0 53.1 54.5 54.7 55.8 55.9
> mean(mm)
[1] 51.77778

While we can easily estimate the variance (and standard deviation) of X̄ without simulation as in item 2, suppose we're interested in estimating the MAE of X̄. The bootstrap estimate must be obtained by simulation. Here's the R code and output:

> # Bootstrap estimate of mean absolute error of sample mean
> B <- 100000                               # number of bootstrap samples (should be large)
> n <- length(mm)                           # sample size, here 18
> x <- matrix(sample(mm,n*B,rep=T),ncol=n)  # create B samples of size n
> mn <- rowMeans(x)                         # compute mean of each bootstrap sample
> mean(abs(mn-mean(mm)))                    # mean absolute error over the bootstrap samples
[1] 0.4299903

The estimated MAE of X̄ is 0.43. If you run the program again, you'll get a slightly different estimate of MAE, but if B is large, it shouldn't be too different. Remember, the simulation is estimating the bootstrap estimate of MAE; you can increase the accuracy of the simulation estimate by increasing B, but the accuracy of the bootstrap estimator itself is controlled by the sample size n.

We could also estimate quantiles of the distribution of X̄ by the corresponding quantiles of the B bootstrap values of the mean (the quantile function in R can be used to compute quantiles of a vector of values, e.g., quantile(mn,.95)).

Continuing with the M&M example, suppose we were interested in estimating the population median weight. The obvious estimator is the sample median (51.15 grams), but the sampling distribution of the sample median is not easy to derive and depends on the population distribution F. The bias, standard deviation (standard error), and root MSE of the sample median can be estimated by bootstrapping.
Given that the bootstrap samples have already been generated as above in a B by 18 matrix x, here's the R code with results for the run above:

> median(mm)   # median of original sample
[1] 51.15
> med <- apply(x,1,median)   # medians of bootstrap samples
> mean(med)-median(mm)       # estimated bias
[1] 0.2265355
> sd(med)                    # estimated standard error
[1] 0.7037118
> sqrt(mean((med-median(mm))^2))   # estimated root MSE
[1] 0.7392724

The estimated bias is 0.23 grams, which indicates that the sample median tends to overestimate the population median by an average of 0.23 grams for data like these. We therefore might guess that our sample median of 51.15 grams overestimates the population median by 0.23 grams, and we could adjust our estimate downward by 0.23 grams if having an approximately unbiased estimator were important. The bootstrap estimated standard error of 0.704 could be used to form normal-based confidence intervals for the population median. For example, a 95% confidence interval would be 51.15 ± 1.960(0.7037118) = (49.8, 52.5) grams. To assess the validity of the assumption of normality for the distribution of the sample median, we could look at a histogram or normal quantile plot of the B bootstrap medians (hist(med) or qqnorm(med)). I won't show it here since it appears below as part of the output of the boot command.

Using the boot command in R

The boot function in library boot in R can be used to carry out nonparametric bootstrapping. As we've seen above, it's not hard to program the bootstrap in R, but the boot command does give some other output which is useful.
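As an aside before moving to boot: the hand-rolled median bootstrap above (bias, standard error, root MSE) can be mirrored outside R. Here is a Python sketch of mine using the M&M data; because it uses a different random number stream, its results will differ slightly from the R run above.

```python
import math
import random
import statistics

mm = [47.9, 48.5, 49.8, 49.8, 50.2, 50.6, 50.8, 50.9, 51.1,
      51.2, 51.5, 52.7, 53.0, 53.1, 54.5, 54.7, 55.8, 55.9]

random.seed(2)
B = 20000
med0 = statistics.median(mm)                        # median of original sample (51.15)
meds = [statistics.median(random.choices(mm, k=len(mm)))  # medians of bootstrap samples
        for _ in range(B)]

bias = statistics.mean(meds) - med0                 # bootstrap estimate of bias
se = statistics.stdev(meds)                         # bootstrap standard error
rmse = math.sqrt(sum((m - med0) ** 2 for m in meds) / B)  # bootstrap root MSE
```

With B this large, bias, se, and rmse should land close to the 0.23, 0.70, and 0.74 reported for the R run.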
To understand how to use the boot function in R, first examine the following alternative way to generate a single bootstrap sample for the M&M data:

> i = sample(18,18,replace=T)
> i
 [1] 10 16 11 16 11  5  5 15 13 16  7  2  4  3 10 16 14  4
> mm[i]
 [1] 51.2 54.7 51.5 54.7 51.5 50.2 50.2 54.5 53.0 54.7 50.8 48.5 49.8 49.8 51.2
[16] 54.7 53.1 49.8
> median(mm[i])
[1] 51.35

We would want to store the value of the median for the bootstrap sample (the value 51.35) and then repeat the process (with a loop) several thousand times, getting a new vector i each time we called sample. Notice that rather than sampling with replacement from the data vector (e.g., sample(mm,18,rep=T)), I sampled from the integers 1 to n and then obtained the data values for the bootstrap sample with the expression mm[i]. There is no difference between the two ways of doing it, but this way illustrates how the boot command does it and will help in understanding how to use boot.

The three required arguments to boot are, in order: 1) the original data vector (or data frame), 2) a function which tells R how to compute the statistic from a bootstrap sample (described more below), and 3) the number of bootstrap replications.

The second argument is a function you define which computes the statistic of interest. You can read about defining your own functions in the documentation at the R website. The function you define must itself have two arguments: the first represents the original data set, and the second is a vector representing the indices of the observations in a bootstrap sample (like i above). It does not matter what names you give to these arguments because they are only placeholders internal to the function. The boot function will call this function B times, using the original data as the first argument in each call and its vector of random indices as the second argument.
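The calling convention boot expects, statistic(data, indices), can be imitated in a few lines of Python. Everything here (boot_py, bmed_py) is a hypothetical sketch of mine meant only to illustrate the index-vector idea, not a real interface.

```python
import random
import statistics

def boot_py(data, statistic, B):
    """Minimal imitation of R's boot(): B times, draw a vector of n random
    indices with replacement and pass (data, indices) to the statistic."""
    n = len(data)
    return [statistic(data, [random.randrange(n) for _ in range(n)])
            for _ in range(B)]

def bmed_py(x, i):
    # analogous to: bmed <- function(x,i) median(x[i])
    return statistics.median(x[j] for j in i)

random.seed(3)
mm = [47.9, 48.5, 49.8, 49.8, 50.2, 50.6, 50.8, 50.9, 51.1,
      51.2, 51.5, 52.7, 53.0, 53.1, 54.5, 54.7, 55.8, 55.9]
reps = boot_py(mm, bmed_py, B=1000)
```

As with boot, the names of the two arguments of bmed_py are arbitrary placeholders; only their order matters.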
For example, to compute the median for a bootstrap sample, the function would be

bmed = function(x,i) median(x[i])

I would then call the boot function as follows:

b = boot(mm,bmed,10000)

Of course, the function name bmed can be anything you want. The argument names can also be anything you want; for example, bmed = function(data,index) median(data[index]) works just as well. The output from the command is

> bmed <- function(x,i) median(x[i])
> b <- boot(mm,bmed,100000)
> b

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mm, statistic = bmed, R = 1e+05)

Bootstrap Statistics :
    original      bias    std. error
t1*    51.15   0.2288135   0.7057048

The results are very similar to those above. In addition, you can get a plot of the bootstrap values with

> plot(b)

[Figure: histogram of the bootstrap medians t* and a normal quantile plot of t*.]

If you want to calculate something more complicated, like estimated MSE, then you need to access the B bootstrap values of the statistic, which are stored in b$t. The value of the statistic for the original data is in b$t0. For example, the bootstrap estimates of root MSE and mean absolute error are:

> sqrt(mean((b$t-b$t0)^2))
[1] 0.7423753
> mean(abs(b$t-b$t0))
[1] 0.4823145

Bootstrap Confidence Intervals

There are several alternatives to normal-based confidence intervals that have been proposed based on bootstrapping. These attempt to adjust for the bias and/or non-normality of the sampling distribution of the statistic as estimated by bootstrapping. The boot.ci command produces five different confidence intervals: Normal, Basic, Percentile, BCa, and Studentized. The default confidence level is 0.95 but can be changed with the conf= option. Descriptions of the five confidence intervals follow.

1. Normal: This is a normal-based CI with the bootstrap estimated SE, adjusted by the bootstrap estimated bias.
It is based on the assumption that T(X) − E[T(X)] ~ N(0, v), that is, that the sampling distribution of T(X) is normal. We can then write

   T(X) − E[T(X)] = T(X) − {E[T(X)] − θ(F)} − θ(F) = T(X) − bias[T(X)] − θ(F) ~ N(0, v).

Let γ = 1 − α be the confidence level. Abbreviating the quantities in the equation above, we can write

   P( −z_{1−α/2} < (T − b − θ)/v^{1/2} < z_{1−α/2} ) = γ,

from which it follows that

   P( T − b − z_{1−α/2} v^{1/2} < θ < T − b + z_{1−α/2} v^{1/2} ) = γ.

Hence, a γ level confidence interval for θ can be approximated by

   T(X) − b* ± z_{1−α/2} (v*)^{1/2},

where b* and v* are the bootstrap estimates of bias and variance. In the M&M bootstrap output above, the estimated bias of the sample median is 0.2288 and the bootstrap SE is 0.7057, so the bias-adjusted normal confidence interval is 51.15 − 0.2288 ± 1.960(0.7057) = (49.54, 52.30), as given in the R code below and in the output from boot.ci further below.

> # Normal CI
> bias <- mean(b$t)-b$t0
> SE <- sd(b$t)
> c(bias,SE)
[1] 0.2257875 0.7049703
> b$t0-bias+c(-1,1)*qnorm(.975)*SE
[1] 49.54250 52.30593

2. Basic: This assumes that T(X) − θ(F) has some unspecified distribution which can be approximated by the distribution of T(X*) − θ(F̂). Let t_{α/2} and t_{1−α/2} be quantiles of the distribution of T(X) (note: t does not denote a t-distribution here). These quantiles can be estimated by t*_{α/2} and t*_{1−α/2}, the quantiles of the distribution of T(X*). The corresponding quantiles of T(X) − θ(F) can then be estimated by t*_{α/2} − θ(F̂) and t*_{1−α/2} − θ(F̂). Recalling that we assume T(X) = θ(F̂), we can write

   P( t*_{α/2} − θ(F̂) < θ(F̂) − θ < t*_{1−α/2} − θ(F̂) ) = γ,

from which it follows that

   P( 2θ(F̂) − t*_{1−α/2} < θ < 2θ(F̂) − t*_{α/2} ) = γ.

Hence, a γ level confidence interval for θ can be approximated by

   ( 2θ(F̂) − t*_{1−α/2}, 2θ(F̂) − t*_{α/2} ).

For the M&M example, we can calculate the Basic CI by hand from the quantiles of b$t, as below, or from the boot.ci command further below.
> # Basic CI
> 2*b$t0 - quantile(b$t,c(.975,.025))
97.5%  2.5%    # ignore these labels, which come from the quantile command
49.25 51.90

3. Percentile: This interval uses the α/2 and 1 − α/2 quantiles of the distribution of T(X). These are estimated by the corresponding quantiles of T(X*) which, in turn, can be estimated by the corresponding quantiles of the B bootstrap estimates. That is, this gives quantile(b$t,c(.025,.975)) for a 95% confidence interval. This method implicitly assumes the sampling distribution of T(X) is symmetric, but not necessarily normal, and centered at θ(F) (unbiased). It can be computed by hand, as below, or from the boot.ci command further below.

> # Percentile CI
> quantile(b$t,c(.025,.975))
 2.5% 97.5%
50.40 53.05

4. Studentized: The Studentized interval is a variation on the Basic interval. The Basic interval approximates the distribution of T(X) − θ(F) by the distribution of T(X*) − θ(F̂). Looking at the difference between the statistic and what it's estimating compensates, in one way, for the fact that the original sample is not perfectly representative of the population distribution F; that is, the distribution of T(X*) may be shifted up or down from that of T(X), but the distribution of T(X*) − θ(F̂) is hopefully not too different from that of T(X) − θ(F). It might also be that the variance of T(X*) − θ(F̂) is quite different from the variance of T(X) − θ(F) because the original sample does not look exactly like the population. We might hope, then, that we could get a better approximation by approximating the distribution of [T(X) − θ(F)]/S(X), where S(X) is a statistic which estimates the spread of the distribution F from a sample. The distribution of this ratio can be approximated by that of [T(X*) − θ(F̂)]/S(X*). That is, for each bootstrap sample, we compute the value of this ratio, then compute the α/2 and 1 − α/2 quantiles of the bootstrap distribution.
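The Normal, Basic, and Percentile intervals are all simple functions of the same vector of bootstrap replicates. The following Python sketch (the helper names are mine, and the empirical quantile rule is deliberately crude) computes all three at once; note how the Basic endpoints are just 2·t0 minus the Percentile endpoints in reverse order.

```python
import random
import statistics

def emp_quantile(v, p):
    """Crude empirical quantile: an order statistic of the sorted values."""
    s = sorted(v)
    return s[min(len(s) - 1, int(p * len(s)))]

def bootstrap_cis(t0, reps, z=1.959964):
    bias = statistics.mean(reps) - t0        # bootstrap bias estimate
    se = statistics.stdev(reps)              # bootstrap SE
    lo, hi = emp_quantile(reps, 0.025), emp_quantile(reps, 0.975)
    return {
        "normal":     (t0 - bias - z * se, t0 - bias + z * se),
        "basic":      (2 * t0 - hi, 2 * t0 - lo),
        "percentile": (lo, hi),
    }

random.seed(4)
mm = [47.9, 48.5, 49.8, 49.8, 50.2, 50.6, 50.8, 50.9, 51.1,
      51.2, 51.5, 52.7, 53.0, 53.1, 54.5, 54.7, 55.8, 55.9]
t0 = statistics.median(mm)
reps = [statistics.median(random.choices(mm, k=len(mm))) for _ in range(20000)]
cis = bootstrap_cis(t0, reps)
```

Run on the M&M data, the three intervals should be close to the (49.54, 52.30), (49.25, 51.90), and (50.40, 53.05) shown for the R computations.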
We multiply these quantiles by S(X) for the original sample to estimate the quantiles of T(X) − θ(F). I have not computed Studentized intervals for the M&M example. You must provide the values of S(X) for the bootstrap samples and for the original sample in the boot.ci command.

5. BCa: The BCa interval attempts to adjust for both bias and non-normality of the sampling distribution. It is the most commonly used non-normal confidence interval method.

Here are the intervals (except the Studentized) as computed by boot.ci:

> boot.ci(b)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100000 bootstrap replicates

CALL :
boot.ci(boot.out = b)

Intervals :
Level      Normal              Basic
95%   (49.54, 52.30 )   (49.25, 51.90 )

Level     Percentile            BCa
95%   (50.40, 53.05 )   (50.40, 53.00 )
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(b) : bootstrap variances needed for studentized intervals

The choice of confidence interval type affects how many bootstrap replications you need. Normal-based confidence intervals, which require just the bootstrap SE and possibly an estimate of the bias, require fewer bootstrap replications than percentile-based methods. However, there's no reason not to do as many replications as can be done in a reasonable time, perhaps at least 10,000 for normal-based and 100,000 for percentile-based methods.

Nonparametric bootstrap of a ratio

Here's another example of the nonparametric bootstrap. This example illustrates the use of bootstrapping on a ratio of two quantities, something not addressed in this course, but it's illustrative of how powerful and easy to use the bootstrap can be. In the data below, x is the weight (lbs) and y is the sugar content (lbs) for 10 randomly selected oranges from a shipment. The goal is to estimate the ratio of sugar to total weight since that is how the shipment is valued.
A good estimator is the mean sugar content divided by the mean weight or, equivalently, the total sugar divided by the total weight of the 10 oranges. There are only asymptotic expressions for the variance of a ratio, so it's a good candidate for bootstrapping. A bootstrap sample will consist of 10 oranges selected with replacement, that is, 10 pairs of values. We do not select 10 x values and 10 y values independently because that is not how the shipment was sampled, and the bootstrap sampling must mimic the original sampling.

> x <- c(.4,.48,.43,.42,.5,.46,.39,.41,.42,.44)
> y <- c(.021,.03,.025,.022,.033,.027,.019,.021,.023,.025)
> r <- sum(y)/sum(x)
> r
[1] 0.05655172

The estimated ratio is 0.05655. To bootstrap this data set, we must put the two variables together in a data frame, which is simply a spreadsheet with cases as rows and variables as columns:

> orange <- data.frame(x,y)
> orange
      x     y
1  0.40 0.021
2  0.48 0.030
3  0.43 0.025
4  0.42 0.022
5  0.50 0.033
6  0.46 0.027
7  0.39 0.019
8  0.41 0.021
9  0.42 0.023
10 0.44 0.025

We refer to a particular column of the data frame with the notation orange$x or orange$y, where $ is the separator between the data frame name and the variable name. The second argument to boot must be a function that takes the original data frame and an index of rows selected for the bootstrap sample and computes the ratio of sugar to weight for the sample. The bootstrap can then be executed. You can see from the output below that the estimated bias is small and the sampling distribution is close to normal. Hence, the Normal confidence interval from boot.ci is very close to the BCa confidence interval. The former would also be almost identical to a normal-based confidence interval without a bias adjustment.

> ratio <- function(d,i) sum(d$y[i])/sum(d$x[i])
> b2 <- boot(orange,ratio,100000)
> b2

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = orange, statistic = ratio, R = 1e+05)

Bootstrap Statistics :
      original        bias     std. error
t1* 0.05655172 -3.644977e-05 0.001655720

> boot.ci(b2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100000 bootstrap replicates

CALL :
boot.ci(boot.out = b2)

Intervals :
Level      Normal               Basic
95%   ( 0.0533,  0.0598 )   ( 0.0533,  0.0598 )

Level     Percentile             BCa
95%   ( 0.0533,  0.0598 )   ( 0.0536,  0.0601 )
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(b2) : bootstrap variances needed for studentized intervals

> plot(b2)

[Figure: histogram of the bootstrap ratios t* and a normal quantile plot of t*.]

Parametric Bootstrap

Bootstrapping can also be used to estimate properties of estimators for parametric models where exact results are not available. In parametric bootstrapping, resampling is done from a parametric model fit to the original sample. For example, suppose we have a random sample of size n from a gamma distribution where α is unknown, but β is known to be .25. The MLE of α must be computed numerically (using, for example, fitdistr in library MASS). Although the asymptotic distribution of α̂ is known (see problem 14, Sec. 7.8, p. 445), the small-sample distribution is not (and the variance of the asymptotic distribution depends on α, so it can only be estimated by a plug-in estimate). Parametric bootstrapping can be used in this case.

First, I show the results from using fitdistr, which computes the estimated asymptotic standard error. It does this by plugging the MLE α̂ into the expression for the asymptotic variance in problem 7.14 of Section 7.8.
That expression turns out to be the inverse of n times the trigamma function evaluated at α̂, so we can make that calculation directly in R and verify the result from fitdistr:

> library(MASS)
> n <- 10
> beta <- .25
> x <- rgamma(n,2,beta)
> x
 [1]  7.624342 10.885505  9.954169 17.217671  1.218637  7.452474 15.795296
 [8]  6.671382 12.304400  3.762149
> y <- mean(x)
> m <- fitdistr(x,"gamma",rate=beta)
Warning message:
In optim(x = c(7.62434209283415, 10.8855049212113, 9.95416929929193,  :
  one-diml optimization by Nelder-Mead is unreliable: use optimize
> m
    shape
  2.3747306
 (0.4377666)
> sqrt(1/(n*trigamma(m$estimate)))   # how the SE was computed (plug-in estimate)
    shape
0.4377666
>
> # Normal-based asymptotic CIs
> m$estimate +c(-1,1)*qnorm(.95)*m$sd    # 90% CI
[1] 1.654669 3.094793
> m$estimate +c(-1,1)*qnorm(.975)*m$sd   # 95% CI
[1] 1.516724 3.232737

I illustrate two ways to do the parametric bootstrap below: programming it manually, or using the boot function in library boot. The key is that the bootstrap samples are drawn from a gamma distribution with α set equal to the MLE from the original sample (and β = .25). After generating each bootstrap sample, I refit the gamma model using fitdistr (using the original MLE as the starting value for α).
> # Parametric bootstrap
> bb <- rep(0,10000)
> for(i in 1:10000){
+   xx <- rgamma(n,m$estimate,beta)
+   bb[i] <- fitdistr(xx,"gamma",start=list(shape=m$estimate),rate=beta)$estimate
+ }
There were 50 or more warnings (use warnings() to see the first 50)
> mean(bb)-m$estimate   # estimate of bias
     shape
0.05040127
> se <- sd(bb)
> se
[1] 0.448613
>
> # Normal-based CI without bias correction
> m$estimate +c(-1,1)*qnorm(.975)*se   # 95% CI
[1] 1.495465 3.253996
>
> # Type "Normal" bootstrap CI
> bias <- mean(bb)-m$estimate
> bias
     shape
0.05040127
> m$estimate - bias + c(-1,1)*qnorm(.975)*se   # 95% CI
[1] 1.445064 3.203595
>
> # Type "Percentile" bootstrap CI
> t1 <- quantile(bb,.025)
> t2 <- quantile(bb,.975)
> c(t1,t2)   # 95% CI
    2.5%    97.5%
1.634459 3.398833
>
> # Type "Basic" bootstrap CI
> 2*m$estimate-c(t2,t1)   # 95% CI
   97.5%     2.5%
1.350628 3.115002

The bootstrap SE is a little bigger than the asymptotic estimate (0.4486 vs. 0.4378). In addition, the estimated bias is not trivial, and the sampling distribution is somewhat non-normal. Therefore, I would use the Basic confidence interval. Note: the warnings from the boot command all come from the fitdistr command being used by boot. These, in turn, are warnings from the function optim being used by fitdistr and can be ignored.

To do parametric bootstrapping with boot, you must define a function to fit the model and another function to generate random values from the parametric model. It is shown below. See the help for the boot command for more details and an example.
> library(boot)
> galpha = function(y) fitdistr(y,"gamma",start=list(shape=2.375),rate=.25)$estimate
> alpha.rg = function(data,mle) rgamma(length(data),mle,.25)
>
> gamma.boot <- boot(x, galpha, R=10000, sim="parametric",
+                    ran.gen=alpha.rg, mle=m$estimate)
There were 50 or more warnings (use warnings() to see the first 50)
> gamma.boot

PARAMETRIC BOOTSTRAP

Call:
boot(data = x, statistic = galpha, R = 10000, sim = "parametric",
    ran.gen = alpha.rg, mle = m$estimate)

Bootstrap Statistics :
    original      bias    std. error
t1*    2.375   0.0485918   0.4385587

> boot.ci(gamma.boot,type=c("norm","basic","perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL :
boot.ci(boot.out = gamma.boot, type = c("norm", "basic", "perc"))

Intervals :
Level      Normal              Basic
95%   ( 1.467,  3.186 )   ( 1.384,  3.086 )

Level     Percentile
95%   ( 1.664,  3.366 )
Calculations and Intervals on Original Scale

> plot(gamma.boot)

[Figure: histogram of the bootstrap shape estimates t* and a normal quantile plot of t*.]
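To close, here is a dependency-free Python sketch of the parametric-bootstrap idea. Since refitting a gamma shape requires a numerical optimizer (fitdistr's job above), this sketch of mine substitutes an exponential model, whose MLE λ̂ = 1/x̄ is closed form; the structure — fit the model, resample from the fitted model, refit on each bootstrap sample — is exactly the same.

```python
import random
import statistics

random.seed(5)
n = 10
data = [random.expovariate(0.5) for _ in range(n)]  # the "observed" sample
lam_hat = 1 / statistics.mean(data)                 # MLE of the exponential rate

B = 5000
reps = []
for _ in range(B):
    # Parametric resampling: draw from the FITTED model, not from the data.
    boot_sample = [random.expovariate(lam_hat) for _ in range(n)]
    reps.append(1 / statistics.mean(boot_sample))   # refit on each bootstrap sample

bias = statistics.mean(reps) - lam_hat              # bootstrap estimate of bias
se = statistics.stdev(reps)                         # bootstrap standard error
```

For this model the bias estimate should come out positive, reflecting the known upward bias of 1/x̄ (its expectation is nλ/(n−1)), which is exactly the kind of small-sample information the parametric bootstrap recovers.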