Homework #2 Key - users.miamioh.edu

Homework #2: Due Friday Sept. 10, 2004
1. A clinician administered the Liebowitz Social Anxiety Scale to a sample of 12 individuals.
People with “moderate” social anxiety score between 55-65, people with “marked” social anxiety
score between 65-80, people with “severe” social anxiety score between 80-95, and people with
“very severe” social anxiety score above 95. The data are shown below:
54, 58, 68, 70, 64, 82, 80, 65, 60, 71, 64, 116
a). What is the mean, median, and mode of these data? (3 points)
Mean:
$$\bar{X} = \frac{\sum X}{n} = \frac{54 + 58 + 68 + 70 + 64 + 82 + 80 + 65 + 60 + 71 + 64 + 116}{12} = \frac{852}{12} = 71$$
Median location = (n +1) / 2 = (12 + 1)/2 = 13/2 = 6.5
Values in order: 54, 58, 60, 64, 64, 65, 68, 70, 71, 80, 82, 116
Median = value 6.5 up from the bottom = average of 65 and 68 = 66.5
Mode = most frequently occurring score = 64
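As a quick check, the three values above can be reproduced with a short Python sketch. Python and its standard `statistics` module are my choice for illustration and are not part of the original assignment.

```python
from statistics import mean, median, multimode

# The 12 Liebowitz Social Anxiety Scale scores from question 1
scores = [54, 58, 68, 70, 64, 82, 80, 65, 60, 71, 64, 116]

mean_score = mean(scores)       # sum of scores divided by n
median_score = median(scores)   # average of the two middle values when n is even
modes = multimode(scores)       # list of the most frequently occurring score(s)

print(mean_score, median_score, modes)
```

`multimode` returns a list, so a dataset with tied modes would show all of them rather than only the first.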
b). What is the range, interquartile range, variance, and standard deviation of these data? (4 points)
Range = Largest score – smallest score = 116 – 54 = 62
Median location = (n + 1)/2 = (12 + 1)/2 = 6.5
Quartile location = (median location + 1)/2 = (6 + 1)/2 = 3.5 (drop the fraction from the median location first)
Values in order: 54, 58, 60, 64, 64, 65, 68, 70, 71, 80, 82, 116
Q1= 3.5 up from bottom = average of 60 & 64 = 62
Q3 = 3.5 down from top = average of 71 & 80 = 75.5
IQR = Q3 – Q1 = 75.5 – 62 = 13.5
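The quartile-location rule used here can be sketched in Python as follows. The function name `quartiles_by_location` is hypothetical and introduced only for illustration. Statistical software often uses other quartile conventions, so library functions may return slightly different quartiles for the same data.

```python
def quartiles_by_location(values):
    """Return (Q1, Q3) using the quartile-location rule from this solution."""
    ordered = sorted(values)
    n = len(ordered)
    median_location = (n + 1) / 2
    # Drop any fraction from the median location before computing the quartile location
    quartile_location = (int(median_location) + 1) / 2

    def value_at(location, data):
        # location is 1-based; a location ending in .5 averages the two neighbouring values
        index = int(location) - 1
        if location == int(location):
            return data[index]
        return (data[index] + data[index + 1]) / 2

    q1 = value_at(quartile_location, ordered)        # count up from the bottom
    q3 = value_at(quartile_location, ordered[::-1])  # count down from the top
    return q1, q3


scores = [54, 58, 68, 70, 64, 82, 80, 65, 60, 71, 64, 116]
score_range = max(scores) - min(scores)
q1, q3 = quartiles_by_location(scores)
iqr = q3 - q1
print(score_range, q1, q3, iqr)
```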
Sample Variance using the computational formula:
$$s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}$$
$$\sum X^2 = 54^2 + 58^2 + 68^2 + 70^2 + 64^2 + 82^2 + 80^2 + 65^2 + 60^2 + 71^2 + 64^2 + 116^2$$
$$= 2916 + 3364 + 4624 + 4900 + 4096 + 6724 + 6400 + 4225 + 3600 + 5041 + 4096 + 13456 = 63442$$
$$\sum X = 54 + 58 + 68 + 70 + 64 + 82 + 80 + 65 + 60 + 71 + 64 + 116 = 852$$
$$s^2 = \frac{63442 - \frac{852^2}{12}}{12 - 1} = \frac{63442 - 60492}{11} = \frac{2950}{11} = 268.18$$
Using the definitional formula:
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
$$= \frac{(54-71)^2 + (58-71)^2 + (68-71)^2 + (70-71)^2 + (64-71)^2 + (82-71)^2 + (80-71)^2 + (65-71)^2 + (60-71)^2 + (71-71)^2 + (64-71)^2 + (116-71)^2}{12 - 1}$$
$$= \frac{289 + 169 + 9 + 1 + 49 + 121 + 81 + 36 + 121 + 0 + 49 + 2025}{11} = \frac{2950}{11} = 268.18$$
We get the same answer using either formula.
Sample Standard deviation:
$$s = \sqrt{s^2} = \sqrt{268.18} = 16.38$$
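The agreement between the two variance formulas can also be checked numerically. The sketch below is illustrative only; the variable names are my own, and `statistics.variance` is used as an independent cross-check because it computes the sample variance with an n - 1 denominator.

```python
from math import isclose, sqrt
from statistics import variance

scores = [54, 58, 68, 70, 64, 82, 80, 65, 60, 71, 64, 116]
n = len(scores)

# Computational formula: (sum of X^2 - (sum of X)^2 / n) / (n - 1)
sum_x = sum(scores)
sum_x2 = sum(x * x for x in scores)
computational = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Definitional formula: sum of squared deviations from the mean / (n - 1)
mean_x = sum_x / n
definitional = sum((x - mean_x) ** 2 for x in scores) / (n - 1)

# Sample standard deviation is the square root of the sample variance
sd = sqrt(definitional)

print(round(computational, 2), round(definitional, 2), round(sd, 2))
```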
2. Construct a boxplot of the data reported in question 1. (1 point)
[Boxplot of the anxiety scores; vertical axis runs from 50 to 115. Labeled parts, from top to bottom: outlier (*) at 116; end of upper whisker at 82; Q3 = 75.5; median = 66.5; Q1 = 62; end of lower whisker at 54.]
a). Describe/identify what each part of the plot signifies. (1 point)
See plot above. Note that the end of the upper whisker is the largest value ≤ Q3 + (1.5)(IQR) =
75.5 + (1.5)(13.5) = 95.75
Therefore, the end of the upper whisker is 82. The score of 116 exceeds 95.75, so it is plotted as an outlier.
The end of the lower whisker is the smallest value ≥ Q1 – (1.5)(IQR) =
62 – (1.5)(13.5) = 41.75
Therefore, the end of the lower whisker is 54
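The whisker rule can be sketched in Python as below. The quartiles are taken directly from the answer to question 1 rather than recomputed, and the names `upper_fence` and `lower_fence` are my own labels for the two cutoff values.

```python
scores = [54, 58, 68, 70, 64, 82, 80, 65, 60, 71, 64, 116]

# Quartiles as computed in question 1
q1, q3 = 62, 75.5
iqr = q3 - q1

# Cutoffs 1.5 IQRs beyond each quartile
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Whiskers end at the most extreme scores that still fall inside the cutoffs
inside = [x for x in scores if lower_fence <= x <= upper_fence]
upper_whisker = max(inside)
lower_whisker = min(inside)

# Anything beyond a cutoff is drawn as a separate outlier point
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print(upper_fence, lower_fence, upper_whisker, lower_whisker, outliers)
```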
b). Is the distribution of anxiety scores symmetric, positively, or negatively
skewed, and how can you tell? (1 point)
The distribution is positively skewed. I can tell that the distribution is not symmetric because the
black line inside the box (the median) is not centered. If the distribution were symmetric, the 1st and
3rd quartiles would be about equally far from the median, and the line would sit in the middle of the
box. In this case though, the median (66.5) is closer to the bottom of the box (the 1st quartile, 62)
than to the top of the box (the 3rd quartile, 75.5). I can tell that the skew is positive because the
outlier is an extremely large value. Thus, the tail of the distribution extends toward this large value
or, if plotted as a histogram, to the right.
3. Remove the outlier from the dataset.
a). Calculate the mean, median and mode based on this new dataset. (3 points)
Mean:
$$\bar{X} = \frac{\sum X}{n} = \frac{54 + 58 + 68 + 70 + 64 + 82 + 80 + 65 + 60 + 71 + 64}{11} = \frac{736}{11} = 66.91$$
Median location = (n +1) / 2 = (11 + 1)/2 = 12/2 = 6
Values in order: 54, 58, 60, 64, 64, 65, 68, 70, 71, 80, 82
Median = 6 up from the bottom = 65
Mode = most frequently occurring score = 64
b). Compared to the answers you obtained for question 1, which values changed substantially,
which changed very little, and which stayed the same and why? (2 points)
The mode is exactly the same as before. The median changed slightly, by 1.5 units (from 66.5 to 65).
The mean changed more substantially, by 4.09 units (from 71 to 66.91). The median and mode are fairly
resistant to outliers. An outlying value that occurs only once will not change the mode here, because
the mode is just the most frequently occurring score. The median is the middle location in the
distribution. It ignores how far values are away from it. So even though an outlying value is quite far
away from the median, the median is not affected by the magnitude of that distance. Thus, outliers have
little influence on medians. The mean is not very resistant to outliers. The mean is the mathematical
center of the distribution. It is constructed such that the sum of the deviations around it will equal
zero. Thus, it is the “balancing point” of the distribution. Because the mean is the value for which the
sum of the deviations is zero, the mean is sensitive to how far values are away from it. Therefore, an
outlying value will pull the mean toward it.
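The differing sensitivity of the three measures can be seen by computing each one with and without the outlier. This is an illustrative Python sketch using the standard `statistics` module; the variable names are my own.

```python
from statistics import mean, median, multimode

with_outlier = [54, 58, 68, 70, 64, 82, 80, 65, 60, 71, 64, 116]
without_outlier = [x for x in with_outlier if x != 116]  # drop the single outlier

# Print each measure for both datasets side by side
for name, stat in [("mean", mean), ("median", median), ("mode", multimode)]:
    print(name, stat(with_outlier), stat(without_outlier))

# How far each measure moved when the outlier was removed
mean_shift = mean(with_outlier) - mean(without_outlier)
median_shift = median(with_outlier) - median(without_outlier)
mode_unchanged = multimode(with_outlier) == multimode(without_outlier)

print(round(mean_shift, 2), median_shift, mode_unchanged)
```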
c). Calculate the range, interquartile range, and variance based on this new dataset (3 points)
Range = Largest score – smallest score = 82 – 54 = 28
Median location = (n + 1)/ 2 = (11 + 1) / 2 = 6
Quartile location = (median location + 1)/2 = (6 + 1) / 2 = 3.5
Values in order: 54, 58, 60, 64, 64, 65, 68, 70, 71, 80, 82
Q1= 3.5 up from bottom = average of 60 & 64 = 62
Q3 = 3.5 down from top = average of 70 & 71 = 70.5
IQR = Q3 – Q1 = 70.5 – 62 = 8.5
Above, I illustrated that the definitional and computational methods yield the same answer. So this
time, I will only use the computational formula (though using either formula is correct).
$$s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}$$
$$\sum X^2 = 54^2 + 58^2 + 68^2 + 70^2 + 64^2 + 82^2 + 80^2 + 65^2 + 60^2 + 71^2 + 64^2$$
$$= 2916 + 3364 + 4624 + 4900 + 4096 + 6724 + 6400 + 4225 + 3600 + 5041 + 4096 = 49986$$
$$\sum X = 54 + 58 + 68 + 70 + 64 + 82 + 80 + 65 + 60 + 71 + 64 = 736$$
$$s^2 = \frac{49986 - \frac{736^2}{11}}{11 - 1} = \frac{49986 - 49245.09}{10} = 74.091$$
d). Compared to the answers you obtained for question 1, which values changed substantially and
which changed very little and why? (2 points)
The range changed substantially (from 62 in problem 1 to 28 now). By definition, the range is the
difference between the two most extreme scores in the distribution. When there is a single, extreme
outlier, its removal will have an obvious and direct impact on the range.
The variance also changed substantially (from 268.18 in problem 1 to 74.091 now). The variance is
calculated by finding the average squared distance between the data points and the mean. An
outlier, by definition, is a far distance from the mean value. Thus, its squared distance from the
mean is quite large and inflates the variance estimate. Therefore, removal of an outlier will reduce
the variance.
The IQR changed somewhat (from 13.5 in problem 1 to 8.5 now), but not nearly as much as the
variance or range did. The IQR is fairly resistant to outliers. This is because the IQR is the distance
between the 3rd and 1st quartiles. In other words, the IQR spans the middle 50% of the scores in
the distribution. The IQR completely ignores the extreme scores by focusing exclusively on values
in the middle 50% of the distribution. Therefore, the existence of an outlier will have little impact
on the calculation of the IQR.
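The three measures of spread can be compared side by side with a short Python sketch. The helper `iqr_by_location` is a hypothetical name; it assumes the same quartile-location rule used in this answer key, so other software may report a slightly different IQR.

```python
from statistics import variance


def iqr_by_location(values):
    """IQR using the quartile-location rule (median location fraction dropped)."""
    ordered = sorted(values)
    location = ((len(ordered) + 1) // 2 + 1) / 2
    k = int(location)
    if location == k:
        q1, q3 = ordered[k - 1], ordered[-k]
    else:
        # A location ending in .5 averages the two neighbouring values
        q1 = (ordered[k - 1] + ordered[k]) / 2
        q3 = (ordered[-k] + ordered[-k - 1]) / 2
    return q3 - q1


with_outlier = [54, 58, 68, 70, 64, 82, 80, 65, 60, 71, 64, 116]
without_outlier = [x for x in with_outlier if x != 116]

spread = {}
for label, data in [("with", with_outlier), ("without", without_outlier)]:
    spread[label] = {
        "range": max(data) - min(data),
        "iqr": iqr_by_location(data),
        "variance": round(variance(data), 2),  # sample variance, n - 1 denominator
    }
    print(label, spread[label])
```

Removing the one score cuts the range and the variance by more than half, while the IQR drops by a smaller proportion.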