Statistics Worksheet 5 (Looking back at #1)
Math& 146

Find a confidence interval for the "true" mean for the following data sets. Note that the variance is unknown in all cases, so the intervals are based on the t distribution: at confidence level 1 - a, the interval is x-bar +/- t(a/2, n-1) * s / sqrt(n), where x-bar is the sample mean, s the sample standard deviation, and n the sample size. These are the data sets we worked on in Worksheet #1.

Data Set 1

We look at the following data set, describing hypothetical observations of the voltage of a set of 9V batteries. The data is sorted (down the columns) to make it easier to identify specific indexes.

8.517  8.952  9.056  9.312
8.590  8.954  9.056  9.330
8.640  8.955  9.059  9.336
8.667  8.958  9.073  9.360
8.683  8.979  9.107  9.405
8.904  9.031  9.182  9.469
8.929  9.034  9.215  9.486
8.944  9.043  9.246  9.671

We also have the following summaries:

count            32
sum              290.141
sum of squares   2632.996

Solutions

A spreadsheet like Gnumeric will produce confidence intervals automatically, at any level you may request. Here are a few examples of confidence intervals for the mean:

Level   From                To
90%     8.98499780333134    9.14881512448977
93%     8.97624510556133    9.15756782225977
95%     8.96837979682357    9.16543313099754
97%     8.95702235238557    9.17679057543553
99%     8.93434479576723    9.19946813205387

The simulation assumed a normal distribution with a true mean of 9.1, and this number is inside all the confidence intervals.
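If you want to check the spreadsheet's arithmetic, the interval can be reproduced directly from the printed summaries. Here is a minimal sketch in Python with SciPy (an assumption on my part; the worksheet itself only relies on Gnumeric), using the computational formula s^2 = (SS - n * x-bar^2) / (n - 1) for the sample variance:

```python
from math import sqrt
from scipy import stats

# Printed summaries for Data Set 1
n = 32                # count
total = 290.141       # sum
sum_sq = 2632.996     # sum of squares

mean = total / n
# Sample variance via the computational formula: s^2 = (SS - n*xbar^2) / (n - 1)
s2 = (sum_sq - n * mean ** 2) / (n - 1)
se = sqrt(s2 / n)     # standard error of the mean

for level in (0.90, 0.93, 0.95, 0.97, 0.99):
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)
    print(f"{level:.0%} CI: {mean - t_crit * se:.5f} to {mean + t_crit * se:.5f}")
```

The 95% line, for instance, comes out to about 8.96838 to 9.16543, matching the table above up to the rounding of the printed summaries.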
Data Set 2

We look at the following data set, describing very hypothetical observations of the interarrival times of a bus at a bus stop. The data is sorted (down the columns) to make it easier to identify specific indexes.

0.1  2.5   8.3   28.6
0.3  2.7   9.1   29.0
0.9  3.0  11.6   33.6
1.2  3.2  12.4   41.6
1.3  3.6  13.5   45.0
1.6  3.9  15.0   46.6
1.7  4.2  18.0   49.5
1.7  4.5  18.7   71.9
2.3  4.6  21.2   92.2
2.5  6.5  21.5  115.0

We also have the following summaries:

count            40
sum              754.7
sum of squares   40684.6

Solutions

A spreadsheet like Gnumeric will produce confidence intervals automatically, at any level you may request. Here are a few examples of confidence intervals for the mean:

Level   From                To
90%     11.9288807739416    25.8036282947808
93%     11.195283815658     26.5372252530643
95%     10.5379441846598    27.1945648840626
97%     9.59212447335202    28.1403845953703
99%     7.71658200210244    30.0159270666199

The data was the result of a simulation of exponential data with a true mean of 21. There are much better methods to estimate the mean of an exponential distribution, if we know that this is the right model. Nonetheless, 21 is included in all the confidence intervals, which is good news. The bad news is that these intervals are extremely wide, giving a lot less insight than those obtained from the properly normal data in the previous data set.

Data Set 3

We look at the following data set, describing hypothetical observations of yearly income, in units of $10,000. The data is sorted (down the columns) to make it easier to identify specific indexes.

1.008  1.604  4.279    20.416
1.021  1.921  4.422    35.329
1.163  2.409  4.652    40.490
1.389  3.129  5.236   131.974
1.395  3.608  5.892   201.500
1.585  3.899  9.993  1265.355

We also have the following summaries:

count            24
sum              1753.667
sum of squares   1662727.764

Solutions

A spreadsheet like Gnumeric will produce confidence intervals automatically, at any level you may request. Here are a few examples of confidence intervals for the mean:

Level   From                 To
90%     -42.969649135377     309.46032752025
93%     -62.1383705297281    328.629048914601
95%     -79.4476607043657    345.938339089238
97%     -104.593938746235    371.084617131108
99%     -155.396263599157    421.886941984029

The data was simulated from a distribution that has no mean at all. Thus, this exercise is essentially meaningless. A red flag is raised by the huge gap between the lower and the upper estimate, and, of course, even the raw descriptive statistics showed a really oversized range, and a consequently large sample standard deviation (note that both the sample mean and the sample/population variance and standard deviation are extremely sensitive to outliers). One superficial approach in such cases is to chop off the data points that are classified as outliers when matching the result against a hypothetical normal distribution. However, this should be done only when you are totally, absolutely sure that the "exceptional" data points are indeed spurious (typing errors, extraneous data that somehow got mixed in with the proper data, and so on). Otherwise, you might be losing the most important information provided by your sample!
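To see the red flags in action, here is a short sketch (again Python with SciPy, as an assumption) that computes the 95% interval directly from the raw income data, the way a spreadsheet would. Do not be surprised if the endpoints differ somewhat from the table above, which was produced by Gnumeric, possibly from the unrounded simulation output; the qualitative warning signs are the same either way:

```python
import numpy as np
from scipy import stats

# Raw Data Set 3 values (yearly income, in units of $10,000)
incomes = np.array([
    1.008, 1.021, 1.163, 1.389, 1.395, 1.585,
    1.604, 1.921, 2.409, 3.129, 3.608, 3.899,
    4.279, 4.422, 4.652, 5.236, 5.892, 9.993,
    20.416, 35.329, 40.490, 131.974, 201.500, 1265.355,
])

n = len(incomes)
# t-based interval straight from the data: mean +/- t * (s / sqrt(n)).
# stats.sem uses the sample standard deviation (ddof=1) by default.
lo, hi = stats.t.interval(0.95, df=n - 1,
                          loc=incomes.mean(), scale=stats.sem(incomes))
print(f"95% CI: {lo:.2f} to {hi:.2f}")
# The lower endpoint comes out negative (impossible for an income), and the
# interval is far wider than the sample mean itself: the red flags discussed
# above.
```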