Confidence Intervals
Bootstrap Resampling
Bootstrap Confidence Intervals
Bootstrap Percentile Intervals
STAT 113
Bootstrap Confidence Intervals
Colin Reimer Dawson
Oberlin College
3 March 2017
Using Samples to Make Estimates About Populations
Statistic : Sample :: Parameter : Population
We want to use our sample statistic to estimate the corresponding
population parameter
Standard Error
Standard Error Definition
The distribution of a quantitative variable has a standard deviation.
The sampling distribution of a quantitative sample statistic (like
a mean) has a standard deviation too. This has a special name: the
standard error (e.g., “of the mean”).
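The distinction between the two standard deviations can be sketched in base R. This is only an illustration under assumed values: the population below is simulated, and the names `population`, `sample_means`, `sd_variable`, and `standard_error` are made up for the example.

```r
set.seed(113)  # arbitrary seed, used only so the sketch is reproducible

# Hypothetical population of 10,000 values (simulated, not real data)
population <- rnorm(10000, mean = 66, sd = 5)

# Draw many random samples of size 50 and record the mean of each one
sample_means <- replicate(2000, mean(sample(population, size = 50)))

# Standard deviation of the quantitative variable itself
sd_variable <- sd(population)

# Standard deviation of the sample means: the standard error of the mean
standard_error <- sd(sample_means)

c(sd_variable = sd_variable, standard_error = standard_error)
```

The second number should come out much smaller than the first, since sample means vary less than individual cases do.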
Confidence Intervals
• A point estimate of some population parameter (like a mean),
together with some measure of our confidence/uncertainty
(e.g., MoE), defines a confidence interval.
• Can be written in the form “statistic ± MoE”.
• “With 95% confidence, the mean flavor-life of our gumballs is
between 65.3 and 67.1 minutes.”
• “With 95% confidence, between 39 (42 − 3) and 45 (42 + 3)
percent of U.S. adults approve of Donald Trump’s job
performance as president.”
How to Determine the Margin of Error?
The population mean µ is within 2 Standard Errors of most
(about 95%) sample means (from simple random samples).
Margin of Error
A 95% margin of error of 3 points means that 95% of surveys with
the same procedure and sample size will yield sample statistics
which are within 3 points of the corresponding population
parameter.
If the sampling distribution is approximately Normal, then a 95%
Margin of Error is about 2 Standard Errors.
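One way to check the "about 2 Standard Errors" rule is to simulate it. The base-R sketch below assumes a simulated, roughly Normal population (all values and names are hypothetical). It finds the distance that 95% of sample means stay within, which is the 95% margin of error as defined above, and compares it with 2 SE.

```r
set.seed(113)  # arbitrary seed for reproducibility

# Hypothetical population and its true mean (the parameter)
population <- rnorm(10000, mean = 42, sd = 10)
mu <- mean(population)

# Sampling distribution of the mean for samples of size 100
sample_means <- replicate(5000, mean(sample(population, size = 100)))
se <- sd(sample_means)

# Distance between each sample mean and the population mean
errors <- abs(sample_means - mu)

# 95% margin of error: the distance that 95% of the sample means stay within
moe_95 <- unname(quantile(errors, probs = 0.95))

c(moe_95 = moe_95, two_se = 2 * se)
```

The two values should be close, but not identical; "2" is a rounded multiplier.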
Interpretations of CIs
• 95% CIs contain 95% of the cases in the population. False.
They represent uncertainty about a population parameter, not
about individual points.
• There is a 95% chance that the sample mean falls in the 95%
CI. False. Any given CI is centered around the sample mean for
that sample, so the sample mean is inside 100% of the time.
• 95% of samples produce confidence intervals that contain the
population parameter. True: This is the definition of a
confidence interval.
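The true statement, and the second false one, can both be checked by simulation. The following base-R sketch is an illustration only: the pulse-rate population is simulated, the names are hypothetical, and the standard error is taken from the simulated sampling distribution, which is possible here only because the whole population is available.

```r
set.seed(2017)  # arbitrary seed for reproducibility

# Hypothetical population of pulse rates and its true mean
population <- rnorm(10000, mean = 68, sd = 8)
mu <- mean(population)

# Many samples of size 40, one sample mean per sample
sample_means <- replicate(4000, mean(sample(population, size = 40)))
se <- sd(sample_means)

# One interval per sample: statistic +/- 2 SE
lower <- sample_means - 2 * se
upper <- sample_means + 2 * se

# Proportion of intervals that contain the population mean
coverage <- mean(lower <= mu & mu <= upper)
coverage

# Every interval contains its own sample mean, by construction
own_mean_inside <- all(lower <= sample_means & sample_means <= upper)
own_mean_inside
```

`coverage` should land near 0.95, while `own_mean_inside` is always `TRUE`, which is why "the sample mean falls in the CI" carries no information.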
Correct or Incorrect?
A 98% confidence interval for mean pulse rate in the Oberlin
student population is 65 to 71. The interpretation “I am 98% sure
that all students will have pulse rates between 65 and 71.” is
A. Correct
B. Incorrect
Correct or Incorrect?
A 98% confidence interval for mean pulse rate in the Oberlin
student population is 65 to 71. The interpretation “I am 98% sure
that the mean pulse rate for this sample of students will fall
between 65 and 71” is
A. Correct
B. Incorrect
Correct or Incorrect?
A 98% confidence interval for mean pulse rate in the Oberlin
student population is 65 to 71. The interpretation “I am 98% sure
that the mean pulse rate for the population of all students will fall
between 65 and 71” is
A. Correct
B. Incorrect
Correct or Incorrect?
A 98% confidence interval for mean pulse rate in the Oberlin
student population is 65 to 71. The interpretation “98% of the
pulse rates for students at this college will fall between 65 and 71”
is
A. Correct
B. Incorrect
Summary
To create a 95% confidence interval for a parameter:
1. Take many random samples from the population, and compute
the sample statistic for each sample
2. Compute the standard error as the standard deviation of all
these statistics
3. For your actual sample, use statistic ± 2SE
Ok, but...
In reality we only have one sample. How do we know what the
standard error is?
• Standard error depends on population characteristics,
particularly variability
• We can use the sample to estimate not only the parameter of
interest (e.g., mean, proportion), but also the variability.
• Two approaches: (1) Simulation, (2) Probability theory
Estimating the Margin of Error from One Sample
• Since we only have one sample, we have to estimate the
Margin of Error using only the information it contains.
• Idea: Let the whole sample (not just the statistic of interest)
serve as an estimate for the whole population
Note: We do not literally make copies of the data, or increase our
sample size, by bootstrapping!
Sampling from the Pseudo-Population
• Sampling from the estimated population is equivalent to
sampling from the sample, but never “using up” the cases.
• In other words, we sample with replacement from the sample.
• The resulting sample is called a bootstrap sample.
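In base R, sampling with replacement is what `sample()` does when `replace = TRUE`. A minimal sketch, assuming a small made-up sample (the values and the names `original_sample` and `boot_sample` are hypothetical):

```r
set.seed(113)  # arbitrary seed for reproducibility

# Hypothetical original sample of 8 cases
original_sample <- c(12, 15, 9, 20, 17, 11, 14, 18)

# One bootstrap sample: same size as the original, drawn with replacement,
# so a case can appear more than once and some cases may not appear at all
boot_sample <- sample(original_sample,
                      size = length(original_sample),
                      replace = TRUE)

boot_sample
table(boot_sample)  # how many times each value was drawn
```

The bootstrap sample has the same size as the original sample and contains only values that occur in it.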
Bootstrap Statistic and Bootstrap Distribution
• We compute the relevant statistic (e.g., mean) on the
bootstrap sample. This is a bootstrap statistic.
• Over many bootstrap samples, each contributing a bootstrap
statistic, we get a bootstrap distribution.
• Each bootstrap statistic differs from the “pseudopopulation
parameter” (which is really the real sample statistic).
• We hope these differences are similar in size to the differences
between true sample statistics and population parameter.
Bootstrap statistic : Actual sample statistic :: Actual sample statistic : Actual population parameter
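A bootstrap distribution can be built in base R by repeating the resampling step many times. The sketch below is illustrative only: the sample is simulated, and `original_sample`, `boot_stats`, and `boot_differences` are hypothetical names.

```r
set.seed(113)  # arbitrary seed for reproducibility

# Hypothetical original sample of 50 prices (simulated, not real data)
original_sample <- rgamma(50, shape = 4, scale = 45000)
sample_stat <- mean(original_sample)  # the actual sample statistic

# Bootstrap distribution: one bootstrap statistic per bootstrap sample
boot_stats <- replicate(
  5000,
  mean(sample(original_sample, size = length(original_sample), replace = TRUE))
)

# Differences between bootstrap statistics and the "pseudopopulation parameter"
boot_differences <- boot_stats - sample_stat

# The bootstrap distribution is centered near the sample statistic;
# its standard deviation is the bootstrap estimate of the standard error
c(sample_stat = sample_stat,
  boot_center = mean(boot_stats),
  boot_se     = sd(boot_stats))
```

The size of `boot_differences` stands in for the size of the unobservable differences between sample statistics and the population parameter.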
Examples: StatKey
http://lock5stat.com/statkey
Population vs. Sample vs. Sampling Dist. vs. Bootstrap Dist.
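library(mosaic)  # provides do(), resample(), and the formula interface to mean()
## Read the full population, then draw one random sample of 50 cases
Population <- read.file("http://colindawson.net/data/ames.csv")
Sample <- sample(Population, size = 50)
## Sampling distribution: mean Price in 5000 fresh samples from the population
SamplingDist <- do(5000) *
  (sample(Population, size = 50) %>% mean(~Price, data = .))
## Bootstrap distribution: mean Price in 5000 resamples (with replacement) of Sample
BootstrapDist <- do(5000) *
  (resample(Sample) %>% mean(~Price, data = .))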
Population vs. Sample vs. Sampling Dist. vs. Bootstrap Dist.
[Figure: four histograms. Pop. Cases and Samp. Cases are plotted against Price; Samples (the sampling distribution) and Boot. Samples (the bootstrap distribution) are plotted against Mean Price.]
• What is the center of the sampling distribution?
• What is the center of the bootstrap distribution?
• How does the spread compare?
Estimating the Margin of Error
[Figure: histograms of Mean Price for Samples and for Boot. Samples, each with the middle 95% marked.]
• The spread of the bootstrap distribution approximates the
spread of the true sampling distribution.
• We can use the bootstrap distribution to get a Margin of Error
for our Confidence Interval
• Where should the center of the CI be?
Adjusting the Confidence Level
If the sampling distribution is approximately Normal, then a 95%
Margin of Error is about 2 Standard Errors.
If the bootstrap distribution is approximately Normal, 95% of the
bootstrap statistics are within 2 SE of the bootstrap center (i.e.,
original sample stat.). That is, 95% of bootstrap statistics are
within the 95% CI.
If the bootstrap distribution is symmetric, then capturing the middle
X% of the bootstrap statistics yields an X% confidence interval!
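The agreement between the two constructions can be checked on a simulated example. In this base-R sketch the sample is made up and roughly symmetric, and all names are hypothetical; it computes a 95% interval both ways from the same bootstrap distribution.

```r
set.seed(113)  # arbitrary seed for reproducibility

# Hypothetical, roughly symmetric sample of 50 prices
original_sample <- rnorm(50, mean = 170000, sd = 60000)

# Bootstrap distribution of the mean (sample() defaults to the full sample size)
boot_means <- replicate(5000, mean(sample(original_sample, replace = TRUE)))
se_boot <- sd(boot_means)

# Interval 1: sample statistic +/- 2 SE
ci_2se <- mean(original_sample) + c(-2, 2) * se_boot

# Interval 2: middle 95% of the bootstrap statistics (percentile interval)
ci_percentile <- unname(quantile(boot_means, probs = c(0.025, 0.975)))

rbind(ci_2se, ci_percentile)
```

With a symmetric, bell-shaped bootstrap distribution the two rows should nearly match; a strongly skewed bootstrap distribution is the case where they can disagree.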
Estimating the Margin of Error
[Figure: histograms of Mean Price for Samples and for Boot. Samples, each with the middle 99% marked.]
• If we want a 99% CI, we need a MoE such that 99% of sample
stats are within that MoE of the population parameter.
• Since the bootstrap dist. has similar spread to the true
sampling dist., we can estimate such an MoE there
• Then build a CI around the sample stat. (aka center of
bootstrap dist.) with that MoE.
CI with Arbitrary Confidence Level
### 99% CI goes from 0.5 percentile to 99.5 percentile of bootstrap dist.
CI <- quantile(~result, data = BootstrapDist, probs = c(0.005, 0.995))
CI
    0.5%    99.5%
142708.9 190408.7
histogram(~result, data = BootstrapDist, fit = "normal", nint = 100, v = CI)
[Figure: density histogram of result with a fitted Normal curve and vertical lines at the two CI endpoints.]
Example: Atlanta Commutes
http://lock5stat.com/StatKey
Summary: Bootstrap CIs
To generate a bootstrap distribution, we
1. Generate bootstrap samples by sampling with replacement
from the original sample, using the same sample size
2. Compute the statistic of interest, a bootstrap statistic, for
each of the bootstrap samples
3. Collect the statistics for many bootstrap samples to form a
bootstrap distribution
If the bootstrap distribution is symmetric, an X% CI can be
estimated by taking the range of the middle X% of the bootstrap
statistics.
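The three steps and the percentile rule can be collected into one small helper. This is a base-R sketch, not the mosaic code used on the earlier slides: the function name `bootstrap_ci`, its arguments, and the simulated `commute_times` data are all hypothetical.

```r
# Percentile bootstrap CI for any statistic (hypothetical helper)
bootstrap_ci <- function(x, stat = mean, level = 0.95, reps = 5000) {
  # Steps 1-3: resample with replacement at the original sample size,
  # compute the statistic each time, and collect the results
  boot_stats <- replicate(
    reps,
    stat(sample(x, size = length(x), replace = TRUE))
  )
  # Middle `level` proportion of the bootstrap statistics
  alpha <- 1 - level
  quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2))
}

set.seed(113)  # arbitrary seed for reproducibility

# Simulated commute times in minutes (made-up data, not the StatKey dataset)
commute_times <- rexp(500, rate = 1 / 30)

ci_95 <- bootstrap_ci(commute_times, level = 0.95)
ci_99 <- bootstrap_ci(commute_times, level = 0.99)

ci_95
ci_99
```

The 99% interval should come out wider than the 95% interval, since it has to capture more of the bootstrap statistics. The symmetry caveat above still applies: the helper does not check the shape of the bootstrap distribution.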