2-13-08a
95% CI for the population mean via the usual formula.
The file stat200 contains an example population of size N = 4000 called popSALES.
Our objective is to estimate the mean of this population based on a random sample.
Here is a with-replacement random sample dat1 of n = 100 drawn from this population.
The formula for the usual 95% confidence interval for the population mean is
(sample mean) +/- 1.96 (sample standard deviation) / root(n).
There is roughly a 95% probability that an interval calculated in this way from a random
sample will cover the population mean.
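As a rough illustration of the formula, here is a minimal Python sketch. It is not the
stat200 call used in this handout. The function name usual_ci and the toy data are
hypothetical, and the sample standard deviation is assumed to use the n - 1 divisor.

```python
import math
import statistics


def usual_ci(sample, z=1.96):
    """Return (lower, upper) for (sample mean) +/- z * s / root(n)."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # assumption: n - 1 divisor
    margin = z * s / math.sqrt(n)
    return xbar - margin, xbar + margin


# Hypothetical toy data, not the handout's dat1.
toy = [12.0, 15.5, 9.8, 22.1, 14.3, 18.7, 11.2, 16.4, 13.9, 20.5]
low, high = usual_ci(toy)
print(f"95% CI: ({low:.4f}, {high:.4f})")
```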
stat200 can evaluate this 95% confidence interval using the call
From this output we read off the confidence interval
Around 95% of such intervals are supposed to cover the actual population mean.
But we actually have the entire population in the computer and can find out if the CI has
covered the population mean.
Comparing the actual population mean 14.8564 with the evaluated confidence interval
{14.2879, 17.8089} we see that this particular sample of n = 100 has produced a 95% CI
covering the population mean, as 95% of such CI are supposed to do.
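The claim that around 95% of such intervals cover the population mean can be checked by
simulation when the whole population is available. The sketch below is in Python rather
than stat200, and the population is a made-up, right-skewed stand-in for popSALES (an
assumption, since the actual data are not reproduced here).

```python
import math
import random
import statistics


def usual_ci(sample, z=1.96):
    """(sample mean) +/- z * (sample standard deviation) / root(n)."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(n)
    return xbar - margin, xbar + margin


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical stand-in for popSALES: N = 4000 right-skewed values.
population = [rng.expovariate(1 / 15) for _ in range(4000)]
pop_mean = statistics.fmean(population)

trials = 1000
covered = 0
for _ in range(trials):
    sample = rng.choices(population, k=100)  # with-replacement sample, n = 100
    low, high = usual_ci(sample)
    if low <= pop_mean <= high:
        covered += 1

coverage = covered / trials
print(f"fraction of intervals covering the population mean: {coverage:.3f}")
```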
In actual statistical applications we usually do not know the population mean. The
whole point of the exercise is to estimate the population mean and provide an estimated
margin of error
1.96 (sample standard deviation) / root(n).
Bootstrap 95% confidence interval for the population mean.
Here is a little thought exercise that helps us understand the bootstrap.
Pretend the impossible: that we already know the population mean (!) and can draw as
many samples of n from the population as we wish. The margin of error could then be
determined by finding a symmetric interval around the population mean which contains
around 95% of all possible means of samples of n. This works because the population
mean is then within that same distance of the sample mean for around 95% of all samples
of n (i.e. if I am within a foot of you then you are within a foot of me).
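The thought exercise can be acted out on a computer when a population is on hand. The
Python sketch below uses a made-up population (an assumption, not popSALES) and compares
the simulated margin with 1.96 (population standard deviation) / root(n).

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population of N = 4000 values whose mean we pretend to know.
population = [rng.expovariate(1 / 15) for _ in range(4000)]
mu = statistics.fmean(population)
n = 100

# Means of many with-replacement samples of n drawn from the population.
sample_means = [statistics.fmean(rng.choices(population, k=n)) for _ in range(2000)]

# Half-width h such that about 95% of the sample means lie within h of mu.
deviations = sorted(abs(m - mu) for m in sample_means)
h = deviations[math.ceil(0.95 * len(deviations)) - 1]

# For comparison: 1.96 * (population standard deviation) / root(n).
formula_margin = 1.96 * statistics.pstdev(population) / math.sqrt(n)
print(f"simulated margin {h:.3f}, formula margin {formula_margin:.3f}")
```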
The idea behind the bootstrap is that virtually every sample of n (if n is large enough)
will resemble the population closely enough to stand in for it in the exercise above.
That is, the estimated margin of error of the sample mean (usually obtained as
1.96 sx / root(n)) may also be obtained by the following script:
a. Produce very many samples of n WITH-REPLACEMENT FROM THE
ORIGINAL SAMPLE
b. For each such sample of n (bootstrap sample) find its sample mean denoted
x*BAR.
c. Find a symmetric interval around the original sample mean xBAR containing
around 95% of these many bootstrap means x*BAR.
It is customary to use around 2000 or more bootstrap samples of n.
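Steps a to c can be sketched in a few lines of Python. This is an illustration under
stated assumptions, not the stat200 implementation: the function name bootstrap_mean_ci
and the sample are hypothetical, and the symmetric interval is found by taking the 95th
percentile of the distances between the bootstrap means and the original sample mean.

```python
import math
import random
import statistics


def bootstrap_mean_ci(sample, n_boot=2000, level=0.95, seed=0):
    """Symmetric bootstrap CI for the mean, following steps a to c."""
    rng = random.Random(seed)
    n = len(sample)
    xbar = statistics.fmean(sample)
    # a, b: draw n values WITH REPLACEMENT from the original sample and
    # record the mean of each bootstrap sample.
    boot_means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)]
    # c: half-width h such that about `level` of the bootstrap means
    # fall within h of the original sample mean.
    deviations = sorted(abs(m - xbar) for m in boot_means)
    h = deviations[math.ceil(level * n_boot) - 1]
    return xbar - h, xbar + h


# Hypothetical sample of n = 100 (not the handout's dat1).
data_rng = random.Random(42)
sample = [data_rng.expovariate(1 / 15) for _ in range(100)]

low, high = bootstrap_mean_ci(sample)
usual_margin = 1.96 * statistics.stdev(sample) / math.sqrt(len(sample))
print(f"bootstrap CI: ({low:.4f}, {high:.4f})")
print(f"bootstrap margin {(high - low) / 2:.4f} vs usual margin {usual_margin:.4f}")
```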
stat200 can do this very quickly even though the computational requirements are very
great! Here is the call to produce the bootstrap 95% CI for the population mean:
The bootstrap CI is reported at the end. Compare it with the 95% CI produced by the
usual method xBAR +/- 1.96 sx / root(n) which is {14.2879, 17.8089}.
When comparing CI produced by two methods it is not necessary that they agree very
closely on the same data. For example, two archers may be equally accurate and yet
shoot to different places when they both step to the line under the same conditions.
Having said this, it is nonetheless remarkable just how closely the two CI methods agree
on the same data.
Why bootstrap?
Analogues of the formula xBAR +/- 1.96 sx / root(n) exist for estimating other population
quantities. For example
med[dat1] +/- ?
iqr[dat1] +/- ?
s[dat1] +/- ?
The associated margins of error for estimates such as those above are each dependent on
their own specific formulas and tables. Here is how bootstrap would do it for median:
bootci[med, dat1, 2000, 0.95]
The bootstrap CI for the population median simply drops in “med” for “mean” in the script.
It is worth emphasizing that the bootstrap CI is not just selecting from specialized
mathematical formulas. It substitutes an entirely new paradigm that bypasses the
formulas and tables.
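To show how little changes when the statistic is swapped, here is a Python sketch in
which the statistic is passed in as a function. The name bootstrap_ci, its argument
order, the iqr helper and the sample are all hypothetical. This is not the bootci
function from stat200, and the quartile convention used by statistics.quantiles may
differ from the one stat200 uses.

```python
import math
import random
import statistics


def iqr(values):
    """Interquartile range using the default statistics.quantiles method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def bootstrap_ci(stat, sample, n_boot=2000, level=0.95, seed=0):
    """Symmetric bootstrap CI around stat(sample) for any statistic `stat`."""
    rng = random.Random(seed)
    n = len(sample)
    estimate = stat(sample)
    boot_stats = [stat(rng.choices(sample, k=n)) for _ in range(n_boot)]
    deviations = sorted(abs(b - estimate) for b in boot_stats)
    h = deviations[math.ceil(level * n_boot) - 1]
    return estimate - h, estimate + h


# Hypothetical sample of n = 100 (not the handout's dat1).
data_rng = random.Random(7)
sample = [data_rng.expovariate(1 / 15) for _ in range(100)]

for name, stat in [("median", statistics.median), ("iqr", iqr), ("sd", statistics.stdev)]:
    low, high = bootstrap_ci(stat, sample)
    print(f"{name}: estimate {stat(sample):.3f}, 95% CI ({low:.3f}, {high:.3f})")
```

The same resampling loop serves all three statistics; only the function passed in changes.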
Smoothing.
As previously mentioned, our population is a list of 4000 prices. Can we look at such a
large list of numbers? Here is a very narrow bandwidth smoothing. It looks like a very
narrow bin-width histogram.
You cannot reliably read much into the spikes in the plots above. Here are
larger bandwidth (smoother) versions.
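The effect of bandwidth can be sketched in Python with a Gaussian kernel smoother. The
kernel type is an assumption (the handout does not say which smoother stat200 uses), and
the data, the function names smooth and roughness, and the bandwidth values are
hypothetical.

```python
import math
import random


def smooth(data, grid, bandwidth):
    """Gaussian kernel density estimate of `data` at each grid point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
        for x in grid
    ]


def roughness(curve):
    """Total up-and-down movement of the curve across the grid."""
    return sum(abs(b - a) for a, b in zip(curve, curve[1:]))


rng = random.Random(2)
# Hypothetical two-group list of 100 prices (not the handout's data).
data = [rng.gauss(12, 3) if rng.random() < 0.7 else rng.gauss(25, 4) for _ in range(100)]

step = 0.1
grid = [-10 + step * i for i in range(601)]  # grid from -10 to 50

narrow = smooth(data, grid, bandwidth=0.2)  # spiky, histogram-like
wide = smooth(data, grid, bandwidth=3.0)    # much smoother

print(f"roughness narrow: {roughness(narrow):.3f}, wide: {roughness(wide):.3f}")
```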
So much for smoothing the population.
Does a sample dat1 of n = 100 reveal this population detail?
Comparing the two plots above it appears that even a seemingly large bandwidth of 1
taxes the ability of a random sample of n = 100 to faithfully estimate the population
counterpart.
How can we choose a good bandwidth if we (as usual) have only the sample of n (not the
population) to go on? One approach is to smooth the sample and compare that with
similarly smoothed samples drawn from the sample (again, the bootstrap idea).
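The handout does not say how the comparison is to be made, so the Python sketch below is
only one possible reading. It smooths the sample, smooths a number of with-replacement
resamples of the sample at the same bandwidth, and averages the integrated squared gap
between the curves. The Gaussian kernel, the discrepancy measure, the function names and
the data are all assumptions.

```python
import math
import random


def smooth(data, grid, bandwidth):
    """Gaussian kernel density estimate of `data` at each grid point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
        for x in grid
    ]


def bootstrap_discrepancy(data, grid, step, bandwidth, n_boot=20, seed=0):
    """Average integrated squared gap between smoothed sample and smoothed resamples."""
    rng = random.Random(seed)
    base = smooth(data, grid, bandwidth)
    total = 0.0
    for _ in range(n_boot):
        resample = rng.choices(data, k=len(data))  # bootstrap sample of n
        curve = smooth(resample, grid, bandwidth)
        total += step * sum((c - b) ** 2 for c, b in zip(curve, base))
    return total / n_boot


rng = random.Random(3)
# Hypothetical sample of n = 100 (not the handout's dat1).
data = [rng.gauss(12, 3) if rng.random() < 0.7 else rng.gauss(25, 4) for _ in range(100)]

step = 0.25
grid = [step * i for i in range(161)]  # grid from 0 to 40

for bw in (0.5, 1.0, 2.0):
    d = bootstrap_discrepancy(data, grid, step, bw)
    print(f"bandwidth {bw}: discrepancy {d:.5f}")
```

A bandwidth at which the resampled curves disagree strongly with the smoothed sample is
showing detail that a sample of this size cannot support. How small the discrepancy
should be before a bandwidth is accepted is left open here.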