Bootstrapping - Rossman/Chance
Fall, 2013 Wednesday, Dec. 4
Stat 301 – Brief Overview of Bootstrapping

Suppose we did not believe the population distribution was approximately normal. How could we construct a confidence interval for the population mean? We would still want to start with x̄ as our estimate, but then what about the margin of error? Or what if we wanted to use a statistic other than x̄?

(a) What does the margin of error measure?

Central Limit Theorem for a sample mean: If we take repeated samples of size n from a population with mean μ and standard deviation σ, then, if n is large, the “what if” distribution of x̄ will be approximately normal with mean μ and standard deviation σ/√n.

(b) Remind yourself of what this theorem claims by opening the Sampling Pennies applet. These are the ages in a population of 1000 pennies (actually collected and recorded by a statistics professor).
    0. Uncheck the Animate box.
    1. Take 1000 samples of 5 pennies each and describe the center, spread, and shape of the distribution.
    2. Repeat for 1000 samples of 50 pennies each.

              Mean      SD        Shape
    n = 5
    n = 50

(c) Do these sampling distributions behave as expected/predicted by the CLT?

Proposal: To assess the variability of our statistic from random sample to random sample, we can sample with replacement from our sample. The steps are:
    1. Take a random sample of n observations from the population.
    2. Take a random sample of n observations from the sample, with replacement (think of this as repeating the first sample infinitely many times and using that as the population).
    3. Calculate the “bootstrap statistic.”
    4. Repeat steps 2 and 3 a large number of times to create a bootstrap sampling distribution.

The claim is that the standard deviation of the bootstrap sampling distribution will be a reasonable approximation of the standard deviation of the statistic.

(d) Consider this claim using the Bootstrap Sampling Change applet.
    1. Draw a sample of 5 pennies.
    2. Press Bootstrap Population a few times. (This creates the bootstrap population.)
    3. Now set the Number of samples to 1000, uncheck the Animate box, and press Draw Bootstrap Samples. Record the characteristics of the distribution below.
    4. Repeat for an initial sample of 50 pennies. Compare the results to those above.

              Mean      SD        Shape
    n = 5
    n = 50

The real advantage of bootstrapping is that it gives an estimate of the standard error of the statistic for more interesting statistics, when the “normal theory” (CLT) does not apply.

Simulation in R – Using the penny ages data with variable name “pennyages”

    > iscamsummary(pennyages)
          n     Min      Q1  Median      Q3     Max    Mean      SD
    1000.00    0.00    4.00   11.00   19.00   59.00   12.30    9.61

Suppose we take a random sample of 50 pennies:

    > pennysample=sample(pennyages, 50)
    > iscamsummary(pennysample)
          n     Min      Q1  Median      Q3     Max    Mean      SD
       50.0     0.0     4.0    10.0    20.0    40.0    13.2    10.9

Now we are going to take “bootstrap samples” from this sample.

    > I=1000                # number of bootstrap samples
    > bootstrapsample =0
    > bootstrapmean = 0
    > for (i in 1:I){
    +   # resample 50 ages from the sample, with replacement
    +   bootstrapsample=sample(pennysample, 50, replace=TRUE)
    +   # store the mean of this bootstrap sample
    +   bootstrapmean[i]=mean(bootstrapsample)
    + }
    > iscamsummary(bootstrapmean)
          n     Min      Q1  Median      Q3     Max    Mean      SD
    1000.00    8.86   12.20   13.10   14.10   17.80   13.10    1.51

(e) How could you construct a 95% confidence interval from this bootstrap sampling distribution?

(f) How could you decide if this is a 95% confidence interval procedure?

Other Statistics: What about the median? Inside the loop, compute:

    bootstrapmedian[i]=median(bootstrapsample)

    > iscamsummary(bootstrapmedian)
          n     Min      Q1  Median      Q3     Max    Mean      SD
     989.00    4.00    9.50   11.00   12.00   17.50   10.70    2.21
    > hist(bootstrapmedian, nclass=20)

Two groups: To compare two groups, we can find the standard error of the difference by sampling with replacement from each group, calculating the statistic comparing the two samples, and building the bootstrap distribution of that statistic. This approach does not require a hypothesized parameter value in the simulation (there is no assumption that the samples come from the same population).
It also allows you to model random sampling instead of random assignment. [Investigation 3.9 required assuming particular populations to sample from.]

Simulation in Minitab – Using the penny data with the sample of 50 ages in column 1

Build a .mac file:

    Do k1=1:1000
    sample 50 c1 c2;
    replace.
    let c3(k1)=mean(c2)
    Enddo

    > Describe c3
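The handout describes the two-group comparison only in words, so here is a minimal R sketch of one way it might look. The data (groupA, groupB), the function name bootstrap_diff_means, and the choice of the difference in means as the statistic are all assumptions made for illustration; none of them come from the handout. The percentile interval at the end is one common way to turn a bootstrap distribution into an interval, not necessarily the one intended in class.

```r
# Hypothetical two-group data (placeholders, NOT from the handout).
set.seed(301)
groupA <- rexp(40, rate = 1 / 12)   # 40 right-skewed observations
groupB <- rexp(35, rate = 1 / 10)   # 35 right-skewed observations

# Resample each group separately, with replacement, and return the
# bootstrap distribution of the difference in sample means.
bootstrap_diff_means <- function(x, y, reps = 1000) {
  diffs <- numeric(reps)            # preallocate the result vector
  for (i in seq_len(reps)) {
    xstar <- sample(x, length(x), replace = TRUE)
    ystar <- sample(y, length(y), replace = TRUE)
    diffs[i] <- mean(xstar) - mean(ystar)
  }
  diffs
}

bootdiffs <- bootstrap_diff_means(groupA, groupB)

# Observed statistic and its bootstrap standard error.
observed_diff <- mean(groupA) - mean(groupB)
se_boot <- sd(bootdiffs)

# One possible interval: the middle 95% of the bootstrap differences.
ci_percentile <- quantile(bootdiffs, c(0.025, 0.975))

observed_diff
se_boot
ci_percentile
```

Each group is resampled at its own size, so no common population is assumed. Replacing mean with median (or another statistic) inside the function gives the bootstrap distribution for that statistic instead.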