SOLUTIONS TO HOMEWORK 01
SECTION 1.1 (NOT means it’s an extra problem that I included, but it’s NOT to be graded)
NOT
1.15
10 | 9
11 | 0
12 | 1 3 4 4 6 7 7 8 8 9
13 | 0 0 1 2 4 5 5 5 6 6 7 8 9 9 9 9
14 | 1 1 2 2 2 3 4 4 4 4 5 7 8 9
15 | 2 4 4 7 8 9 9 9
a.) 10.9% and 11.0% are the smallest values in the dataset.
b.) The shape is rather symmetric if you ignore the 2 smallest. The center of the distribution is about 13.9%. The
spread is ± 1 or 2% from center.
1.18
The shape is skewed to the right. This means that there are more short words (3-4 letters), but there are a few
quite long words (>10 letters). We would expect the distribution for other authors to be similar, because short
words are common.
Notice: we could define the center as: 1) halfway between 0 and 12; 2) the 50th percentile, the median = 4 (you can
stack the bars: 5 + 17 + 23 = only 45, so 4, the next bar, contains the median); 3) the mode = the tallest bar = 4;
or 4) the mean, x̄, which is too complicated to calculate from the graph, but it would be 4 or more since the data
are skewed to the right. Whichever we choose, because the center is fairly close to zero (somewhere around 4), word
lengths can be only a little less than the center but can be much greater than the center, hence the tail on the
right.
1.25 These are modified stemplots from SPSS to show the ‘extremes’ in a different way and to make them easier to
compare by ‘lining up’ the stems.
Stem-and-Leaf Plot for GENDER = F
Frequency   Stem &  Leaf
   0.00        0 .
   2.00        0 .  69
   8.00        1 .  12222222
  15.00        1 .  555578888888888
   4.00        2 .  0444
   0.00        2 .
   0.00        3 .
   1.00        3 .  6
Stem width: 100
Each leaf: 1 case(s)

Stem-and-Leaf Plot for GENDER = M
Frequency   Stem &  Leaf
   6.00        0 .  033334
   8.00        0 .  66679999
   7.00        1 .  2222222
   3.00        1 .  558
   5.00        2 .  00344
   0.00        2 .
   1.00        3 .  0
   0.00        3 .
Stem width: 100
Each leaf: 1 case(s)
It is now easy to see that the center for women is higher, and their times are more concentrated, than for men; the
outlier among the women also looks more extreme. In fact, without the outlier, women’s times look fairly normal.
Men’s times are skewed to the right.
a.) The times are in multiples of 10 minutes, probably because it is difficult to estimate to the exact minute. One
woman claimed to study 360 minutes (= 6 hours) a night.
b.) Mid for Men = 120, Mid for Women = 180. In the original plots, the true median is marked with a box.
1.27 The stemplot and histogram show a low outlier (4.88) and otherwise a mound-shaped distribution from 5.07 to
5.85, with the center between 5.44 and 5.46. The histogram seems to show two low areas (actually three measurements),
then a larger peak at 5.3, with irregular peaks extending up to 5.8. Using the center as the estimate for the density:
from the histogram, I would estimate the density to be between 5.38 and 5.5; from the stemplot, I would estimate
5.46, the median.
Stem-and-Leaf Plot
Frequency   Stem &  Leaf
   1.00       48 .  8
    .00       49 .
   1.00       50 .  7
   1.00       51 .  0
   4.00       52 .  6799
   5.00       53 .  04469
   4.00       54 .  2467
   5.00       55 .  03578
   5.00       56 .  12358
   2.00       57 .  59
   1.00       58 .  5
Stem width: .10
Each leaf: 1 case(s)

[Histogram of the density measurements: bars centered at 4.88, 5.00, 5.13, ..., 5.88; Std. Dev = .22, Mean = 5.45, N = 29.00]

This is the data in each bar. It doesn’t exactly match the stemplot, but it’s close.
Bar     Data
4.88    4.88
5.00    (none)
5.13    5.07 5.10
5.25    5.26 5.27 5.29 5.29 5.30
5.38    5.34 5.34 5.36 5.39 5.42
5.50    5.44 5.46 5.47 5.50 5.53 5.55
5.63    5.57 5.58 5.61 5.62 5.63 5.65 5.68
5.75    5.75 5.79
5.88    5.85
NOT
1.32
Both the stemplot and the histogram suggest that the midpoint is somewhat above 100, more like 109.
Modified Stem-and-Leaf Plot
Frequency   Stem &  Leaf
   4.00        7 .  2479
   2.00        8 .  69
   8.00        9 .  01336778
  22.00       10 .  0022333344555666777789
  27.00       11 .  000011112222333444455688999
  12.00       12 .  003344677888
   3.00       13 .  026
Stem width: 10
Each leaf: 1 case(s)

Shape: slightly skewed to the left
Center: median = 110, x̄ = 108.9
Spread: ± 15 to 20

[Histogram: bars from 70.0 to 135.0 in steps of 5.0; Std. Dev = 13.17, Mean = 108.9, N = 78.00]
SECTION 1.2
1.46
10 | 9
11 | 0
12 | 1 3 4 4 6 7 7 8 8 9
13 | 0 0 1 2 4 5 5 5 6 6 7 8 9 9 9 9
14 | 1 1 2 2 2 3 4 4 4 4 5 7 8 9
15 | 2 4 4 7 8 9 9 9
median = middle number; if there is an even number of points, average the 2 middle numbers; if odd, it’s THE
middle number
a.) The 25th and 26th largest values are 13.9 (rounded), so the median is 13.9. Q1 is the median of the lower half
= 13.0. Q3 is the median of the upper half = 14.4.
b.) IQR = Q3 - Q1 = 14.4 - 13.0 = 1.4. To find out if there are any outliers, you use the 1.5*IQR Rule:
Q3 + 1.5*1.4 = 14.4 + 2.1 = 16.5, so there are no outliers above.
Q1 - 1.5*1.4 = 13.0 - 2.1 = 10.9, so there are no outliers below (barely). Had there been a 10.8, it would have
been an outlier.
c.) Omitting the 2 smallest points, Montana and Wyoming, the median will move up one place in the list, but that
is still 13.9.
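The quartile and 1.5*IQR steps can be checked with a short script. This is only a sketch: it assumes the 50 values are read off the stemplot as stem.leaf, takes the quartiles as the medians of the lower and upper halves (as in part a.), and stores the values in tenths of a percent so the arithmetic stays exact. The helper name `quartiles` is made up for the illustration.

```python
from statistics import median

# Values read off the stemplot, in tenths of a percent (109 means 10.9%).
stemplot = {
    10: [9],
    11: [0],
    12: [1, 3, 4, 4, 6, 7, 7, 8, 8, 9],
    13: [0, 0, 1, 2, 4, 5, 5, 5, 6, 6, 7, 8, 9, 9, 9, 9],
    14: [1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 5, 7, 8, 9],
    15: [2, 4, 4, 7, 8, 9, 9, 9],
}
data = sorted(stem * 10 + leaf for stem, leaves in stemplot.items() for leaf in leaves)


def quartiles(values):
    """Return (Q1, median, Q3); quartiles are medians of the two halves."""
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half:] if n % 2 == 0 else values[half + 1:]
    return median(lower), median(values), median(upper)


q1, med, q3 = quartiles(data)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr    # anything below this is a suspected outlier
high_fence = q3 + 1.5 * iqr   # anything above this is a suspected outlier
outliers = [x for x in data if x < low_fence or x > high_fence]

print("Q1, median, Q3:", q1 / 10, med / 10, q3 / 10)
print("fences:", low_fence / 10, high_fence / 10)
print("outliers:", [x / 10 for x in outliers])
```

Because the low fence lands exactly on the smallest value, 10.9 is not flagged: the rule only flags values strictly beyond a fence.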
1.47 (refer to the graph in 1.18)
The method for finding the median from a histogram is explained in 1.18. The min = 1, Q1 = 2, median = 3 or 4,
Q3 = 5, max (given) = 12. You can find Q1 by stacking the bars until you reach 25% and Q3 until you reach 75%.
1.53 with DC
Descriptives
                                                      Statistic   Std. Error
Mean                                                   244.1961      12.6862
95% Confidence Interval for Mean    Lower Bound        218.7152
                                    Upper Bound        269.6770
5% Trimmed Mean                                        232.7516
Median                                                 225.0000
Variance                                               8207.881
Std. Deviation                                          90.5974
Minimum                                                     154
Maximum                                                     737
Range                                                       583
Interquartile Range                                     47.0000
Skewness                                                  3.570         .333
Kurtosis                                                 17.243         .656

[Histogram with DC: bars from 150.0 to 750.0 in steps of 50.0; Std. Dev = 90.60, Mean = 244.2, N = 51.00]

without DC
Descriptives
                                                      Statistic   Std. Error
Mean                                                   234.3400       8.1487
95% Confidence Interval for Mean    Lower Bound        217.9647
                                    Upper Bound        250.7153
5% Trimmed Mean                                        229.5667
Median                                                 224.5000
Variance                                               3320.025
Std. Deviation                                          57.6197
Minimum                                                     154
Maximum                                                     412
Range                                                       258
Interquartile Range                                     46.7500
Skewness                                                  1.414         .337
Kurtosis                                                  1.967         .662

[Histogram without DC: bars from 160.0 to 420.0 in steps of 20.0; Std. Dev = 57.62, Mean = 234.3, N = 50.00]

Percentiles
                                         5         10         25         50         75         90         95
Weighted Average (Definition 1)   165.2000   171.2000   199.5000   224.5000   246.2500   334.7000   379.8500
Tukey's Hinges                                          200.0000   224.5000   246.0000
a.) Mean, x̄ = 234.34 (down from 244.2 since DC is an outlier above), StDev, s = 57.6 (down from 90.6, again
because of DC), Min = 154 (still the same since DC was on the other end), Q1 = 199.5 (almost the same as
200 since we only dropped 1 point), median = 224.5 (down only slightly from 225, as opposed to how much the
mean changed), Q3 = 246.25 (again, down only slightly from 247), Max = 412 (down considerably since 737
was DC, the outlier that we dropped). Even without the outlier, the distribution is still skewed, so the median
and IQR are preferred summary numbers.
b.) Mean and StDev do not reveal skewness, so the 5-number summary is a better representation. You can ‘see’
the skewness by noticing that the minimum is closer to Q1 (199.5 - 154 = 45.5) than the maximum is to
Q3 (412 - 246.25 = 165.75), but it still doesn’t show the gaps.
1.62
The total of the observations is 11200. Divided by 7, this gives a mean of x̄ = Σx/7 = 11200/7 = 1600. Subtracting
this from each observation gives the following table:
          data    Diff   Squares
          1792     192     36864
          1666      66      4356
          1362    -238     56644
          1614      14       196
          1460    -140     19600
          1867     267     71289
          1439    -161     25921
Totals   11200       0    214870
Dividing the total of the Squares by 6 gives s² = Σ(x - x̄)²/(n - 1) = 214870/6 ≈ 35811.67. Taking the square root
yields s = 189.24.
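The hand calculation can be reproduced step by step. The sketch below follows the same columns as the table (data, difference from the mean, squared difference) and uses the n - 1 divisor; the variable names are chosen only for this illustration.

```python
from math import sqrt

data = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(data)
mean = sum(data) / n                  # x-bar
diffs = [x - mean for x in data]      # deviations from the mean
squares = [d ** 2 for d in diffs]     # squared deviations
variance = sum(squares) / (n - 1)     # s^2, dividing by n - 1
std_dev = sqrt(variance)              # s

for x, d, sq in zip(data, diffs, squares):
    print(f"{x:6d} {d:7.0f} {sq:9.0f}")
print(f"mean = {mean:.0f}, sum of diffs = {sum(diffs):.0f}, sum of squares = {sum(squares):.0f}")
print(f"s^2 = {variance:.3f}, s = {std_dev:.2f}")
```

The differences always sum to 0, which is a quick check on the arithmetic before squaring.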
Descriptives
                                                      Statistic   Std. Error
Mean                                                  1600.0000      71.5259
95% Confidence Interval for Mean    Lower Bound       1424.9825
                                    Upper Bound       1775.0175
5% Trimmed Mean                                       1598.3889
Median                                                1614.0000
Variance                                              35811.667
Std. Deviation                                         189.2397
Minimum                                                    1362
Maximum                                                    1867
Range                                                       505
Interquartile Range                                    353.0000
Skewness                                                   .207         .794
Kurtosis                                                 -1.498        1.587
NOT
1.65
a.) Choose all four to be the same, such as {1,1,1,1}. To have a standard deviation of 0 means that you have NO
spread at all. It doesn’t matter what the value is, it will be the mean and the difference between the mean and
any other value will be 0 (since they will all be equal to the mean).
b.) We want the greatest spread; {0,0,10,10} is the answer. Putting two values at each extreme places every value
as far from the mean as possible. Replacing any of them with another number (not 0 or 10) will decrease the
standard deviation since on average the distance to the mean will be smaller.
c.) There are many answers to a.) but only one to b.)
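Parts a.) through c.) can be confirmed by brute force. The sketch below assumes the four values must be whole numbers from 0 to 10 with repeats allowed (the range implied by the answers above). Since n is fixed at 4, ranking by the sum of squared deviations is the same as ranking by standard deviation, so the comparison is done in integers; the helper name is hypothetical.

```python
from itertools import combinations_with_replacement


def n_times_sum_sq_dev(values):
    """n * sum((x - mean)^2), written as n*sum(x^2) - (sum x)^2 to stay in integers."""
    n = len(values)
    return n * sum(x * x for x in values) - sum(values) ** 2


# Every way to choose four whole numbers from 0..10, repeats allowed, order ignored.
choices = list(combinations_with_replacement(range(11), 4))

smallest = min(n_times_sum_sq_dev(c) for c in choices)
largest = max(n_times_sum_sq_dev(c) for c in choices)
min_sets = [c for c in choices if n_times_sum_sq_dev(c) == smallest]
max_sets = [c for c in choices if n_times_sum_sq_dev(c) == largest]

print(len(min_sets), "choices give the smallest SD, e.g.", min_sets[0])
print(len(max_sets), "choice(s) give the largest SD:", max_sets)
```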
1.68
Descriptive Statistics
                        N   Minimum   Maximum     Mean   Std. Deviation
                       83    -26.00     19.20   1.9072           7.4853
Valid N (listwise)     83

a.) Mean, x̄ = 1.9072%. SD, s = 7.4853
New value = investment + average rate of return*investment
= $100 + 1.91%($100) = $101.91
b.) New value = $100 + (-26.6%)($100) = $100 - $26.60
= $73.40.

[Histogram of the returns: bars from -25.0 to 20.0 in steps of 2.5; Std. Dev = 7.49, Mean = 1.9, N = 83.00]

Descriptive Statistics
                        N   Minimum   Maximum      Sum     Mean   Std. Deviation
                       82    -14.00     19.20   184.30   2.2476           6.8548
Valid N (listwise)     82

Mean, x̄ = 2.2476. SD, s = 6.8548. Omitting one point will not change the median or quartiles by much. (Look
at the graph.)
NOT
1.72
a.) Since this is just a scale change, the shift, a = 0. If there are 0.62 mi/km, we would need to multiply the miles
by the scale change, b = 1/0.62. So, kilometers = (1/0.62)*65 miles = 104.84.
b.) 746 watts = 1 horsepower → watts = 0 + 746*hp → a = 0, b = 746. So, watts = 746*140 hp = 104,440, or
140*746 watts = 140 hp → 104,440 watts = 140 hp.
1.74
Descriptives
                                                      Statistic   Std. Error
Mean                                                     5.4479    4.103E-02
95% Confidence Interval for Mean    Lower Bound          5.3639
                                    Upper Bound          5.5320
5% Trimmed Mean                                          5.4549
Median                                                   5.4600
Variance                                              4.882E-02
Std. Deviation                                            .2209
Minimum                                                    4.88
Maximum                                                    5.85
Range                                                       .97
Interquartile Range                                       .3200
Skewness                                                  -.468         .434
Kurtosis                                                   .354         .845
a.) Mean, x̄ = 5.4479, SD, s = 0.2209
b.) Cavendish found the density of the earth to be 5.5 times the density of water. The density of water is
62.43 lb/cu-ft, so his value of 5.5 is 5.5*62.43 = 343.365 lb/cu-ft. The mean is 340.11 lb/cu-ft (5.4479*62.43)
and the standard deviation is 13.79 lb/cu-ft (0.2209*62.43). Remember, scale changes affect BOTH locations and
spreads.
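The effect of a scale change can be seen by converting every measurement and recomputing. This is a sketch assuming the 29 values are the ones listed with problem 1.27; `a` and `b` are the shift and scale in new = a + b*old, named only for this example.

```python
from statistics import mean, stdev

# The 29 density measurements from problem 1.27 (as multiples of the density of water).
density = [
    4.88, 5.07, 5.10, 5.26, 5.27, 5.29, 5.29, 5.30, 5.34, 5.34,
    5.36, 5.39, 5.42, 5.44, 5.46, 5.47, 5.50, 5.53, 5.55, 5.57,
    5.58, 5.61, 5.62, 5.63, 5.65, 5.68, 5.75, 5.79, 5.85,
]

a, b = 0.0, 62.43                         # pure scale change: shift a = 0
converted = [a + b * x for x in density]  # each value in lb/cu-ft

old_mean, old_sd = mean(density), stdev(density)
new_mean, new_sd = mean(converted), stdev(converted)

print(f"mean: {old_mean:.4f} -> {new_mean:.2f} lb/cu-ft (a + b*mean = {a + b * old_mean:.2f})")
print(f"SD:   {old_sd:.4f} -> {new_sd:.2f} lb/cu-ft (|b|*SD = {abs(b) * old_sd:.2f})")
```

A nonzero shift `a` would move the mean but leave the standard deviation unchanged; only the scale `b` affects the spread.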
SECTION 1.3
1.79
a.) In order for the total area to be 1, a width of 2 means the height must be ½.
b.) Half the area is to the left of 1, so half the outcomes are less than 1.
c.) The area is (1.3-0.5) * 0.5 = 0.4.
1.81
a.) Mean is C, median is B. Skewed right means mean > median.
b.) Mean and median are both A since the distribution is symmetric.
c.) Mean is A, median is B since the distribution is skewed to the left: mean < median.
1.83
a.) 99.7% falls within 3 sd’s of the mean → μ ± 3σ = 336 ± 3*3 = 327 to 345 days
b.) 336 + 1*3 = 339, so 339 days is 1 standard deviation above the mean.
It’s really asking what percent of the distribution is above 339. The z-score = (339 - 336)/3 = 1. We know that
68% of the distribution is within μ ± 1σ, so 32% falls outside. The percent of the distribution more than 1
standard deviation above the mean = percent of the distribution more than 1 sd below the mean, so half of 32% =
16% falls above 339.
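The 68-95-99.7 rule gives rounded figures; the exact normal areas can be compared with it using Python's standard library. A minimal sketch, assuming a normal distribution with mean 336 and standard deviation 3 as in the solution:

```python
from statistics import NormalDist

dist = NormalDist(mu=336, sigma=3)   # mean and SD used in the solution

# a.) the middle 99.7%: three standard deviations either side of the mean
low = dist.mean - 3 * dist.stdev
high = dist.mean + 3 * dist.stdev
within_3sd = dist.cdf(high) - dist.cdf(low)

# b.) proportion above 339 days, which is z = 1
z = (339 - dist.mean) / dist.stdev
above_339 = 1 - dist.cdf(339)

print(f"mean ± 3 SD: {low:.0f} to {high:.0f} days, exact area = {within_3sd:.4f}")
print(f"z = {z:.1f}, exact area above 339 = {above_339:.4f} (the rule gives 0.16)")
```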
1.87
Cobb’s batting avg = 0.420 → z = (0.420 - 0.266)/0.0371 = 4.15
Williams’s = 0.406 → z = (0.406 - 0.267)/0.0326 = 4.26
Brett’s = 0.390 → z = (0.390 - 0.261)/0.0317 = 4.07
All three are more than 4 standard deviations above the mean, an astounding distance above the typical hitter.
(Williams has a slight edge, but perhaps not large enough to declare him “the best.”) Notice that although Cobb’s
average is higher than Williams’s, it’s not relatively higher: Williams actually did better vs. his peers than Cobb
did. Also, even though Brett’s average is 0.03 (about 7%) lower than Cobb’s, his z-score is only 0.08 of a standard
deviation lower (closer to the mean).
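The three standardizations can be run side by side. In this sketch the means and standard deviations are the ones used in the calculations above; Cobb's standard deviation is taken as 0.0371, the value that reproduces the stated z-score of 4.15. The function name `z_score` is chosen for the illustration.

```python
# (batting average, league mean, league standard deviation) for each hitter
hitters = {
    "Cobb": (0.420, 0.266, 0.0371),
    "Williams": (0.406, 0.267, 0.0326),
    "Brett": (0.390, 0.261, 0.0317),
}


def z_score(x, mean, sd):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z = {name: z_score(*stats) for name, stats in hitters.items()}

# List the hitters from highest to lowest standardized average.
for name, value in sorted(z.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:9s} z = {value:.2f}")
```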
1.88 Draw a curve, locate the point on the line, then shade in the direction of the sign: < means shade the area to the
left, > means shade the area to the right. See the handout on the web for more help.
a.) 0.9978
b.) 0.0022
Note: if we add the last 2 together we’d get 1 since it would cover the entire curve.
c.) 0.9515
d.) 0.9515 - 0.0022 = 0.9493
NOT
1.89 Since Z is continuous, it doesn’t matter whether we include the ‘line’ (=) or not. We are looking at areas under
the curve and adding the width of a line (at the exact point, e.g., 2.25) doesn’t add anything to the area.
a.) 0.0122
b.) 0.9878
c.) 0.0384
d.) 0.9878 - 0.0384 = 0.9494
1.99
a.) z = -1.625. The area to the left of -1.625 (probability of less than) = 0.0521.
b.) z = 0.25 for 270 days. The area between z = -1.625 and z = 0.25 is 0.5466.
c.) Longest 20% are 0.84 standard deviations above the mean or 279.4 days.
The area to the right of 279.4 (= 266 + 0.84*16) is 20%. Look up 0.20 in the body of the Z table and read off the
z-score = -0.84, but since it’s the area to the right (longest), it’s the negative of that, or +0.84.
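The table lookups can be cross-checked with software instead of the Z table. A sketch under the assumption that the mean is 266 and the standard deviation 16 (the values in the expression 266 + 0.84*16); small differences from the table answers come from rounding z to two decimals.

```python
from statistics import NormalDist

lengths = NormalDist(mu=266, sigma=16)   # mean and SD taken from part c.)
std_normal = NormalDist()                # Z, with mean 0 and SD 1

# a.) area to the left of z = -1.625
p_below = std_normal.cdf(-1.625)

# b.) area between z = -1.625 and the z-score for 270 days
z_270 = (270 - lengths.mean) / lengths.stdev
p_between = std_normal.cdf(z_270) - std_normal.cdf(-1.625)

# c.) cutoff for the longest 20%, i.e. the 80th percentile
z_cut = std_normal.inv_cdf(0.80)
days_cut = lengths.inv_cdf(0.80)

print(f"a.) P(Z < -1.625) = {p_below:.4f}")
print(f"b.) z for 270 days = {z_270:.2f}, area between = {p_between:.4f}")
print(f"c.) z = {z_cut:.2f}, cutoff = {days_cut:.1f} days")
```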