Statistics Worksheet 5 (Looking back at #1)
Math& 146
Find a confidence interval for the “true” mean for the following data sets. Note that the variance is
unknown in all cases. These are the data sets we worked on in Worksheet #1.
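Recall that, since the variance is unknown, every interval below is a t interval: writing x̄ for the sample mean, s for the sample standard deviation, and n for the count, the confidence interval at level 1 − α is

    x̄ ± t(α/2, n − 1) · s/√n

where t(α/2, n − 1) is the two-sided critical value of the t distribution with n − 1 degrees of freedom. This is the formula a spreadsheet applies behind the scenes to produce the results quoted in the solutions.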
Data Set 1
We look at the following data set, describing hypothetical observations of the voltage of a set of 9V batteries. The data is sorted to make it easier to identify some indexes.
8.517 8.952 9.056 9.312
8.590 8.954 9.056 9.330
8.640 8.955 9.059 9.336
8.667 8.958 9.073 9.360
8.683 8.979 9.107 9.405
8.904 9.031 9.182 9.469
8.929 9.034 9.215 9.486
8.944 9.043 9.246 9.671
We also have the following summaries:
count            32
sum              290.141
sum of squares   2632.996
Solutions
A spreadsheet like Gnumeric will produce confidence intervals automatically, at any level you may
request. Here are a few examples:
90% CI for the Mean: from 8.98499780333134 to 9.14881512448977
93% CI for the Mean: from 8.97624510556133 to 9.15756782225977
95% CI for the Mean: from 8.96837979682357 to 9.16543313099754
97% CI for the Mean: from 8.95702235238557 to 9.17679057543553
99% CI for the Mean: from 8.93434479576723 to 9.19946813205387
The simulation assumed a normal distribution with a true mean of 9.1, and this number is inside all of the confidence intervals.
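These intervals are easy to reproduce from the summaries alone. Here is a minimal sketch in Python (the helper name t_interval is ours, purely for illustration), using scipy.stats only for the t critical value:

    from math import sqrt
    from scipy.stats import t

    def t_interval(n, total, sum_sq, level):
        # Recover the mean and the sample variance from the
        # summaries: count, sum, and sum of squares.
        mean = total / n
        var = (sum_sq - total ** 2 / n) / (n - 1)
        se = sqrt(var / n)                       # standard error of the mean
        crit = t.ppf((1 + level) / 2, df=n - 1)  # two-sided t critical value
        return mean - crit * se, mean + crit * se

    # Data Set 1: reproduces the 95% interval listed above.
    print(t_interval(32, 290.141, 2632.996, 0.95))
    # -> (8.96837..., 9.16543...)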
Data Set 2
We look at the following data set, describing very hypothetical observations of interarrival times of buses at a bus stop. The data is sorted to make it easier to identify some indexes.
0.1    2.5    8.3    28.6
0.3    2.7    9.1    29.0
0.9    3.0   11.6    33.6
1.2    3.2   12.4    41.6
1.3    3.6   13.5    45.0
1.6    3.9   15.0    46.6
1.7    4.2   18.0    49.5
1.7    4.5   18.7    71.9
2.3    4.6   21.2    92.2
2.5    6.5   21.5   115.0
We also have the following summaries:
count            40
sum              754.7
sum of squares   40684.6
Solutions
A spreadsheet like Gnumeric will produce confidence intervals automatically, at any level you may request. Here are a few
examples:
90% CI for the Mean: from 11.9288807739416 to 25.8036282947808
93% CI for the Mean: from 11.195283815658 to 26.5372252530643
95% CI for the Mean: from 10.5379441846598 to 27.1945648840626
97% CI for the Mean: from 9.59212447335202 to 28.1403845953703
99% CI for the Mean: from 7.71658200210244 to 30.0159270666199
The data was the result of a simulation of exponential data with true mean 21. There are much better methods to estimate the mean of an exponential distribution, if we know that this is the right model; one such method is sketched below. Nonetheless, 21 is included in all of the confidence intervals, which is good news. The bad news is that these intervals are extremely wide, giving much less insight than those obtained from the properly normal data of the previous data set.
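As an aside, here is a sketch of one of those better methods (not part of the original worksheet): if the data is exponential with true mean θ, then 2·(sum of the observations)/θ follows a chi-square distribution with 2n degrees of freedom, which yields an exact interval for θ.

    from scipy.stats import chi2

    def exp_mean_interval(n, total, level):
        # Exact CI for the mean of an exponential distribution,
        # based on 2 * sum(x) / theta ~ chi-square with 2n d.f.
        alpha = 1 - level
        lower = 2 * total / chi2.ppf(1 - alpha / 2, df=2 * n)
        upper = 2 * total / chi2.ppf(alpha / 2, df=2 * n)
        return lower, upper

    # Data Set 2: about (14.2, 26.4), noticeably narrower than the
    # t interval (10.54, 27.19), and still containing the true mean 21.
    print(exp_mean_interval(40, 754.7, 0.95))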
Data Set 3
We look at the following data set, describing hypothetical observations of yearly income, in units of $10,000. The data is sorted to make it easier to identify some indexes.
1.008    1.604    4.279      20.416
1.021    1.921    4.422      35.329
1.163    2.409    4.652      40.490
1.389    3.129    5.236     131.974
1.395    3.608    5.892     201.500
1.585    3.899    9.993    1265.355
We also have the following summaries:
count            24
sum              1753.667
sum of squares   1662727.764
Solutions
A spreadsheet like Gnumeric will produce confidence intervals automatically, at any level you may request. Here are a few
examples:
90% CI for the Mean: from -42.969649135377 to 309.46032752025
93% CI for the Mean: from -62.1383705297281 to 328.629048914601
95% CI for the Mean: from -79.4476607043657 to 345.938339089238
97% CI for the Mean: from -104.593938746235 to 371.084617131108
99% CI for the Mean: from -155.396263599157 to 421.886941984029
The data was simulated from a distribution that has no mean at all. Thus, this exercise is essentially meaningless. A red flag is raised by the huge gap between the lower and the upper estimate, and, of course, even the raw descriptive statistics showed a really oversized range, and a consequently large sample standard deviation (note that both the sample mean and the sample/population variance and standard deviation are extremely sensitive to outliers; the sketch below illustrates this). One superficial approach in such cases is to chop off the data points classified as outliers when matching the result against a hypothetical normal distribution. However, this is something to be done only when you are totally, absolutely sure that the “exceptional” data points are indeed spurious (typing errors, extraneous data that somehow got mixed in with the proper data, and so on). Otherwise, you might be losing the most important information provided by your sample!
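To make the sensitivity of the mean and standard deviation concrete, here is a quick sketch (illustration only: dropping the largest point is precisely the kind of trimming warned against above, unless that point is known to be spurious):

    import statistics

    data = [1.008, 1.021, 1.163, 1.389, 1.395, 1.585,
            1.604, 1.921, 2.409, 3.129, 3.608, 3.899,
            4.279, 4.422, 4.652, 5.236, 5.892, 9.993,
            20.416, 35.329, 40.490, 131.974, 201.500, 1265.355]

    print(statistics.mean(data), statistics.stdev(data))
    # mean is about 73.07, standard deviation about 258.3

    trimmed = [x for x in data if x != max(data)]  # drop 1265.355 only
    print(statistics.mean(trimmed), statistics.stdev(trimmed))
    # mean drops to about 21.23, standard deviation to about 48.3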