Confidence Intervals Bootstrap Resampling Bootstrap Confidence Intervals Bootstrap Percentile Intervals

STAT 113
Bootstrap Confidence Intervals
Colin Reimer Dawson
Oberlin College
3 March 2017

Using Samples to Make Estimates About Populations

Statistic : Sample :: Parameter : Population

We want to use our sample statistic to estimate the corresponding population parameter.

Standard Error

Definition: The distribution of a quantitative variable has a standard deviation. The sampling distribution of a quantitative sample statistic (like a mean) has a standard deviation too. This has a special name: the standard error (e.g., “of the mean”).

Confidence Intervals

• A point estimate of some population parameter (like a mean), together with some measure of our confidence/uncertainty (e.g., MoE), defines a confidence interval.
• Can be written in the form “statistic ± MoE”.
• “With 95% confidence, the mean flavor-life of our gumballs is between 65.3 and 67.1 minutes.”
• “With 95% confidence, between 39 (42 − 3) and 45 (42 + 3) percent of U.S. adults approve of Donald Trump’s job performance as president.”

How to Determine the Margin of Error?

The population mean µ is within 2 Standard Errors of most (about 95%) sample means (from simple random samples).

Margin of Error: A 95% margin of error of 3 points means that 95% of surveys with the same procedure and sample size will yield sample statistics which are within 3 points of the corresponding population parameter.
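The claim that the population mean lies within about 2 standard errors of roughly 95% of sample means can be checked by simulation. The sketch below is a minimal illustration in base R. It assumes a made-up Normal population (mean 66, SD 3; hypothetical values chosen only to echo the gumball example) and a sample size of 50, none of which come from course data.

```r
set.seed(113)  # arbitrary seed so the simulation is reproducible

# Hypothetical population of 100,000 values (assumed, not course data)
population <- rnorm(100000, mean = 66, sd = 3)
mu <- mean(population)  # the population parameter

# Draw many simple random samples of size 50 and record each sample mean
sample_means <- replicate(5000, mean(sample(population, size = 50)))

# Standard error = standard deviation of the sampling distribution
se <- sd(sample_means)

# Proportion of sample means that land within 2 SE of the population mean
coverage <- mean(abs(sample_means - mu) <= 2 * se)
coverage
```

If the sampling distribution is roughly Normal, the printed proportion should be close to 0.95.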
If the sampling distribution is approximately Normal, then a 95% Margin of Error is about 2 Standard Errors.

Interpretations of CIs

• 95% CIs contain 95% of the cases in the population. False. They represent uncertainty about a population parameter, not about individual points.
• There is a 95% chance that the sample mean falls in the 95% CI. False. Any given CI is centered around the sample mean for that sample, so the sample mean is inside 100% of the time.
• 95% of samples produce confidence intervals that contain the population parameter. True: This is the definition of a confidence interval.

Correct or Incorrect?

A 98% confidence interval for mean pulse rate in the Oberlin student population is 65 to 71. The interpretation “I am 98% sure that all students will have pulse rates between 65 and 71.” is

A. Correct
B. Incorrect

Correct or Incorrect?

A 98% confidence interval for mean pulse rate in the Oberlin student population is 65 to 71. The interpretation “I am 98% sure that the mean pulse rate for this sample of students will fall between 65 and 71” is

A. Correct
B. Incorrect

Correct or Incorrect?

A 98% confidence interval for mean pulse rate in the Oberlin student population is 65 to 71. The interpretation “I am 98% sure that the mean pulse rate for the population of all students will fall between 65 and 71” is

A. Correct
B. Incorrect

Correct or Incorrect?

A 98% confidence interval for mean pulse rate in the Oberlin student population is 65 to 71.
The interpretation “98% of the pulse rates for students at this college will fall between 65 and 71” is

A. Correct
B. Incorrect

Summary

To create a 95% confidence interval for a parameter:

1. Take many random samples from the population, and compute the sample statistic for each sample
2. Compute the standard error as the standard deviation of all these statistics
3. For your actual sample, use statistic ± 2 SE

Ok, but...

In reality we only have one sample. How do we know what the standard error is?

• Standard error depends on population characteristics, particularly variability
• We can use the sample to estimate not only the parameter of interest (e.g., mean, proportion), but also the variability.
• Two approaches: (1) Simulation, (2) Probability theory

Estimating the Margin of Error from One Sample

• Since we only have one sample, we have to estimate the Margin of Error using only the information it contains.
• Idea: Let the whole sample (not just the statistic of interest) serve as an estimate for the whole population

Note: We do not literally make copies of the data, or increase our sample size, by bootstrapping!

Sampling from the Pseudo-Population

• Sampling from the estimated population is equivalent to sampling from the sample, but never “using up” the cases.
• In other words, we sample with replacement from the sample.
• The resulting sample is called a bootstrap sample.

Bootstrap Statistic and Bootstrap Distribution

• We compute the relevant statistic (e.g., mean) on the bootstrap sample. This is a bootstrap statistic.
• Over many bootstrap samples, each contributing a bootstrap statistic, we get a bootstrap distribution.
• Each bootstrap statistic differs from the “pseudo-population parameter” (which is really the real sample statistic).
• We hope these differences are similar in size to the differences between true sample statistics and the population parameter.

Bootstrap statistic : Actual sample statistic :: Actual sample statistic : Actual population parameter

Examples: StatKey
http://lock5stat.com/statkey

Population vs. Sample vs. Sampling Dist. vs. Bootstrap Dist.

    library(mosaic)  # provides do(), resample(), and the formula interface to mean()
    Population <- read.file("http://colindawson.net/data/ames.csv")
    Sample <- sample(Population, size = 50)  # one simple random sample of 50 cases
    # Sampling distribution: means of 5000 fresh samples from the population
    SamplingDist <- do(5000) * sample(Population, size = 50) %>% mean(~Price, data = .)
    # Bootstrap distribution: means of 5000 resamples (with replacement) from the one sample
    BootstrapDist <- do(5000) * resample(Sample) %>% mean(~Price, data = .)

Population vs. Sample vs. Sampling Dist. vs. Bootstrap Dist.

[Figure: four panels. Pop. Cases vs. Price; Samp. Cases vs. Price; Samples vs. Mean Price; Boot. Samples vs. Mean Price.]

• What is the center of the sampling distribution?
• What is the center of the bootstrap distribution?
• How does the spread compare?

Estimating the Margin of Error

[Figure: two panels plotted against Mean Price (Samples and Boot. Samples), each with a 95% range marked.]

• The spread of the bootstrap distribution approximates the spread of the true sampling distribution.
• We can use the bootstrap distribution to get a Margin of Error for our Confidence Interval
• Where should the center of the CI be?

Adjusting the Confidence Level

If the sampling distribution is approximately Normal, then a 95% Margin of Error is about 2 Standard Errors.

If the bootstrap distribution is approximately Normal, 95% of the bootstrap statistics are within 2 SE of the bootstrap center (i.e., original sample stat.). That is, 95% of bootstrap statistics are within the 95% CI.

If the bootstrap distribution is symmetric, then capturing the middle X% of the bootstrap statistics yields an X% confidence interval!

Estimating the Margin of Error

[Figure: two panels plotted against Mean Price (Samples and Boot. Samples), each with a 99% range marked.]

• If we want a 99% CI, we need a MoE such that 99% of sample stats are within that MoE of the population parameter.
• Since the bootstrap dist. has similar spread to the true sampling dist., we can estimate such an MoE there
• Then build a CI around the sample stat. (aka center of bootstrap dist.) with that MoE.

CI with Arbitrary Confidence Level

    ### 99% CI goes from 0.5 percentile to 99.5 percentile of bootstrap dist.
    CI <- quantile(~result, data = BootstrapDist, probs = c(0.005, 0.995))
    CI
        0.5%    99.5%
    142708.9 190408.7
    ### Histogram of the bootstrap means with a fitted Normal curve; v = CI marks the interval endpoints
    histogram(~result, data = BootstrapDist, fit = "normal", nint = 100, v = CI)

[Figure: density histogram of result (the bootstrap means) with a fitted Normal curve and vertical lines at the CI endpoints.]

Example: Atlanta Commutes
http://lock5stat.com/StatKey

Summary: Bootstrap CIs

To generate a bootstrap distribution, we

1. Generate bootstrap samples by sampling with replacement from the original sample, using the same sample size
2. Compute the statistic of interest, a bootstrap statistic, for each of the bootstrap samples
3. Collect the statistics for many bootstrap samples to form a bootstrap distribution

If the bootstrap distribution is symmetric, an X% CI can be estimated by taking the range of the middle X% of the bootstrap statistics.
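The three steps above, plus the percentile interval, can be sketched in base R. This is a minimal illustration, not the course's own code: it uses `sample(..., replace = TRUE)` and `replicate()` in place of the mosaic functions `resample()` and `do()`. The vector `prices` is simulated as a hypothetical stand-in for a real sample of 50 house prices, and the helper `percentile_ci` is a name introduced here for illustration.

```r
set.seed(2017)  # arbitrary seed for reproducibility

# Hypothetical sample of 50 prices (simulated; stands in for real data)
prices <- rlnorm(50, meanlog = 12, sdlog = 0.4)
n <- length(prices)

# Steps 1-3: resample with replacement at the original sample size,
# compute the mean of each bootstrap sample, and collect the results
boot_means <- replicate(5000, mean(sample(prices, size = n, replace = TRUE)))

# Bootstrap standard error and the "statistic +/- 2 SE" interval
boot_se <- sd(boot_means)
ci_se <- mean(prices) + c(-2, 2) * boot_se

# Percentile interval: the middle X% of the bootstrap statistics
percentile_ci <- function(stats, level = 0.95) {
  alpha <- 1 - level
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2))
}
ci_95 <- percentile_ci(boot_means, 0.95)
ci_99 <- percentile_ci(boot_means, 0.99)

ci_se
ci_95
ci_99
```

Because the 99% interval has to capture more of the bootstrap statistics, it should come out at least as wide as the 95% interval. When the bootstrap distribution is roughly symmetric and Normal, the ± 2 SE interval and the 95% percentile interval should be similar.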