2-13-08a 95% CI for the population mean via the usual formula.

The file stat200 contains an example population of size N = 4000 called popSALES. Our objective is to estimate the mean of this population based on a random sample. We draw a with-replacement random sample dat1 of n = 100 from this population.

The usual 95% confidence interval for the population mean is

(sample mean) +/- 1.96 (sample standard deviation) / root(n).

There is around 95% probability that an interval calculated this way from a random sample will cover the population mean. stat200 can evaluate this 95% confidence interval with a single call, and from its output we read off the confidence interval {14.2879, 17.8089}.

Around 95% of such intervals are supposed to cover the actual population mean. But we actually have the entire population in the computer and can find out whether this CI has covered the population mean. Comparing the actual population mean 14.8564 with the evaluated confidence interval {14.2879, 17.8089}, we see that this particular sample of n = 100 has produced a 95% CI covering the population mean, as 95% of such CI are supposed to do.

In actual statistical applications we usually do not know the population mean. The whole point of the exercise is to estimate the population mean and to provide an estimated margin of error 1.96 (sample standard deviation) / root(n).

Bootstrap 95% confidence interval for the population mean.

Here is a little thought exercise that helps us understand bootstrap. Pretend the impossible: that we already know the population mean (!) and can have as many samples of n from the population as we wish! The margin of error could then be determined by finding a symmetric interval around the population mean which contains around 95% of all possible means of samples of n. The half-width of that interval serves as the margin of error around the sample mean, simply because the population mean is then within that distance of the sample mean for around 95% of all samples of n (i.e. if I am within a foot of you then you are within a foot of me).
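The usual interval and the thought exercise can both be tried out in a short Python sketch. This is not stat200 code: the population below is a hypothetical right-skewed stand-in for popSALES (which is not reproduced here), and the names usual_ci and draw_sample are made up for illustration. The sketch draws many samples of n = 100, finds the symmetric interval around the known population mean that holds about 95% of the sample means, and checks how often the usual CI covers the population mean.

```python
import math
import random
import statistics

random.seed(200)

# ASSUMPTION: a synthetic stand-in for popSALES (4000 right-skewed "prices").
N = 4000
population = [random.lognormvariate(2.5, 0.6) for _ in range(N)]
pop_mean = statistics.fmean(population)


def usual_ci(sample, z=1.96):
    """Return (low, high) for (sample mean) +/- z * s / root(n)."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(n)
    return xbar - margin, xbar + margin


def draw_sample(n):
    """With-replacement random sample of n from the population."""
    return random.choices(population, k=n)


n = 100
low, high = usual_ci(draw_sample(n))
print(f"population mean = {pop_mean:.4f}")
print(f"usual 95% CI    = ({low:.4f}, {high:.4f})")

# Thought exercise: pretend we may take as many samples of n as we wish.
trials = 2000
samples = [draw_sample(n) for _ in range(trials)]
sample_means = [statistics.fmean(s) for s in samples]

# Half-width of the symmetric interval around the population mean
# that contains about 95% of the sample means.
distances = sorted(abs(m - pop_mean) for m in sample_means)
ideal_margin = distances[math.ceil(0.95 * trials) - 1]

# Margin the formula would give if the population sd were known.
theory_margin = 1.96 * statistics.pstdev(population) / math.sqrt(n)

# Fraction of the usual CIs that cover the population mean.
hits = 0
for s in samples:
    lo_s, hi_s = usual_ci(s)
    if lo_s <= pop_mean <= hi_s:
        hits += 1
coverage = hits / trials

print(f"ideal margin    = {ideal_margin:.4f}")
print(f"theory margin   = {theory_margin:.4f}")
print(f"coverage        = {coverage:.3f}")
```

With a population this skewed the coverage typically lands a little below 0.95, which is one reason the text says "around 95%".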
The idea behind bootstrap is that virtually every sample of n (if n is large enough) will sufficiently resemble the population to stand in for it in the above. That is, the estimated margin of error of the sample mean (usually obtained as 1.96 sx / root(n)) may also be obtained by the following script:

a. Produce very many samples of n WITH-REPLACEMENT FROM THE ORIGINAL SAMPLE.
b. For each such sample of n (bootstrap sample) find its sample mean, denoted x*BAR.
c. Find a symmetric interval around the original sample mean xBAR containing around 95% of these many bootstrap means x*BAR.

It is customary to use around 2000 or more bootstrap samples of n. stat200 can do this very quickly even though the computational requirements are very great! Here is the call to produce the bootstrap 95% CI for the population mean:

bootci[mean, dat1, 2000, 0.95]

The bootstrap CI is reported at the end of the output. Compare it with the 95% CI produced by the usual method xBAR +/- 1.96 sx / root(n), which is {14.2879, 17.8089}. When comparing CI produced by two methods it is not necessary that they agree very closely on the same data. For example, two archers may be equally accurate and yet hit different places when they both step to the line under the same conditions. Having said this, it is nonetheless remarkable just how closely the two CI methods agree on the same data.

Why bootstrap?

Analogues of the formula xBAR +/- 1.96 sx / root(n) exist for estimating other population quantities. For example

med[dat1] +/- ?
iqr[dat1] +/- ?
s[dat1] +/- ?

The margins of error for estimates such as those above each depend on their own specific formulas and tables. Here is how bootstrap would do it for the median:

bootci[med, dat1, 2000, 0.95]

The bootstrap CI for the population median just drops in “med” for “mean” in the script. It is worth emphasizing that the bootstrap CI is not just selecting from specialized mathematical formulas. It is substituting an entirely new paradigm that bypasses the formulas and tables.

Smoothing.
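Before going into smoothing, the bootstrap script (steps a to c) can be sketched in Python. This is an illustration only, not the stat200 implementation of bootci: the function name boot_ci, its argument order, and the synthetic data standing in for dat1 are all assumptions. The symmetric interval is taken as the smallest half-width around the original estimate that contains about 95% of the bootstrap statistics.

```python
import math
import random
import statistics


def boot_ci(stat, data, reps=2000, level=0.95, rng=None):
    """Symmetric bootstrap CI for stat, centred on stat(data).

    Hypothetical analogue of a call like bootci[stat, data, reps, level].
    """
    rng = rng or random.Random()
    n = len(data)
    estimate = stat(data)
    # Steps a and b: resample WITH replacement from the original sample
    # and evaluate the statistic on each bootstrap sample.
    boot_stats = [stat(rng.choices(data, k=n)) for _ in range(reps)]
    # Step c: smallest half-width around the original estimate that
    # contains about `level` of the bootstrap statistics.
    distances = sorted(abs(b - estimate) for b in boot_stats)
    index = min(reps, math.ceil(level * reps)) - 1
    margin = distances[index]
    return estimate - margin, estimate + margin


rng = random.Random(2008)
# ASSUMPTION: synthetic data standing in for the sample dat1 (n = 100).
dat1 = [rng.lognormvariate(2.5, 0.6) for _ in range(100)]

mean_ci = boot_ci(statistics.fmean, dat1, 2000, 0.95, rng)
median_ci = boot_ci(statistics.median, dat1, 2000, 0.95, rng)

# The usual formula, for comparison with the bootstrap interval for the mean.
xbar = statistics.fmean(dat1)
usual_margin = 1.96 * statistics.stdev(dat1) / math.sqrt(len(dat1))
boot_margin = (mean_ci[1] - mean_ci[0]) / 2

print("bootstrap CI, mean:  ", mean_ci)
print("usual CI, mean:      ", (xbar - usual_margin, xbar + usual_margin))
print("bootstrap CI, median:", median_ci)
```

Swapping statistics.fmean for statistics.median is the only change needed to move from the mean to the median, which mirrors dropping “med” in for “mean”.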
As previously mentioned, our population is a list of 4000 prices. Can we look at such a large list of numbers? Here is a very narrow bandwidth smoothing. It looks like a very narrow bin-width histogram. You really cannot reliably read too much into the spikes in the plots above. Here are larger bandwidth (more smooth) versions.

So much for smoothing the population. Does a sample dat1 of n = 100 reveal this population detail? Comparing the two plots above, it appears that even a seemingly large bandwidth of 1 taxes the ability of a random sample of n = 100 to faithfully estimate the population counterpart.

How can we choose a good bandwidth if we (as usual) have only the sample of n (not the population) to go on? One approach is to smooth the sample and compare that smooth with the same smoothing applied to samples drawn with replacement from the sample (again, the bootstrap idea).
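A rough Python sketch of that last idea follows. It assumes a Gaussian kernel smoother, which may differ from the smoother and the bandwidth convention stat200 uses, and the names kernel_smooth and bootstrap_wobble are hypothetical, as is the synthetic data standing in for dat1. For each bandwidth it smooths the sample, smooths many bootstrap resamples the same way, and reports how far the resample smooths stray from the sample smooth on average.

```python
import math
import random


def kernel_smooth(data, bandwidth, grid):
    """Gaussian kernel density estimate of data at each grid point."""
    scale = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        scale * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
        for x in grid
    ]


def bootstrap_wobble(data, bandwidth, grid, reps, rng):
    """Average, over bootstrap resamples, of the largest gap between
    the resample smooth and the original sample smooth."""
    base = kernel_smooth(data, bandwidth, grid)
    total = 0.0
    for _ in range(reps):
        resample = rng.choices(data, k=len(data))  # with replacement
        boot = kernel_smooth(resample, bandwidth, grid)
        total += max(abs(b - s) for b, s in zip(boot, base))
    return total / reps


rng = random.Random(200)
# ASSUMPTION: synthetic data standing in for the sample dat1 (n = 100).
dat1 = [rng.lognormvariate(2.5, 0.6) for _ in range(100)]

bandwidths = [0.25, 1.0, 2.0]
pad = 3 * max(bandwidths)
grid_low, grid_high = min(dat1) - pad, max(dat1) + pad
points = 200
step = (grid_high - grid_low) / (points - 1)
grid = [grid_low + i * step for i in range(points)]

wobble = {h: bootstrap_wobble(dat1, h, grid, 30, rng) for h in bandwidths}
for h in bandwidths:
    peak = max(kernel_smooth(dat1, h, grid))
    print(f"bandwidth {h}: peak {peak:.4f}, bootstrap wobble {wobble[h]:.4f}")

# Sanity check: the widest smooth should integrate to roughly 1.
area = sum(kernel_smooth(dat1, 2.0, grid)) * step
```

A bandwidth whose bootstrap wobble is large relative to the height of the smooth is showing detail that a sample of n = 100 cannot pin down. This is one way to read the comparison, not a rule taken from the text.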