Putting Things Together Part 2
These exercises blend ideas from various graphs (histograms and boxplots), differing shapes of
distributions, and values summarizing the data. Data for exercises 1, 2 and 5 are in the
instructor’s shared folder on LakerApps: Putting Things Together Part 2.
1. Data on the number of laps completed by drivers in a “100-lap” car race. (The race was called
off due to rain with 5 laps to go.) n = 39.
a) Obtain a histogram and identify the shape of this distribution.
b) Determine the 5 number summary.
{ __________ , __________ , __________ , __________ , __________ }
c) Draw a simple boxplot below the histogram.
d) Determine the interquartile range and range:
IQR = __________
Range = __________
e) Determine the mean and standard deviation:
Mean = ________
SD = __________
f) Compute Range ÷ Standard Deviation to determine how many times bigger the range is than the
standard deviation.
The range is __________ times bigger than the standard deviation.
g) Mode = __________ . Technically the mode (most common value) is every value in this data
set (because all of them appear once; no ties). One way to more meaningfully identify a mode is
to use a representative value from the interval that occurs most often in the histogram. For this
histogram, values in the interval 80-90 occur most often, and it’s correct to say “85 is
(approximately) the mode.”
h) For left skewed data (like this), how do the mean, mode and median compare? Which is
largest? Which is smallest?
[Figure: histogram of the laps data. Horizontal axis: 20 to 100. Vertical axis: Frequency, 0 to 10. Space for the boxplot is below the histogram.]
2. Here are histograms for data sets A, B, C and D. Notice that all are drawn to the same scales in
both the data scale (horizontal) and the frequency scale (vertical).
[Figure: four histograms, panels A, B, C and D. Horizontal axis: 0 to 30. Vertical axis: Frequency, 0 to 9.]
a) Determine the 5 number summary for A. Identify the shape of each distribution. Complete the
table below.
Data Set   Shape                   Min       Q1        Median    Q3        Max
A          _____________________   _______   _______   _______   _______   _______
B          _____________________   9.1       15.25     16.95     18.98     29.40
C          _____________________   3.60      16.55     24.70     26.78     29.90
D          _____________________   7.90      16.75     21.70     23.30     26.70
c) Use these 5 number summaries to construct a simple boxplot for each data set. (Space is
provided below. Stack the boxplots one atop the other.)
[Figure: blank plotting area with rows A, B, C and D over a horizontal Data axis from 0 to 30.]
d) Look at the boxplots (5 # summaries) and histograms. While “shape” is generally defined in
terms of a histogram, you should see that the orientation of a boxplot (the “pattern” of the 5 #
summary) provides a good indication of its shape. For a right skewed distribution (such as A) the
5 # summary has the min, Q1 and median rather close, then Q3 and the max are relatively
distant. For a symmetric distribution (B), the distance between the min and Q1 is about equal to
that for Q3 and the max; Q1 to the median is about the same as the median to Q3. The left
skewed distribution is a mirror image of the right skewed distribution.
A good quantitative key is to compare the distances from the quartiles to the median. If the first
quartile is a lot closer to the median than the third, then you’re probably looking at right skew. If
the third quartile is a lot closer to the median than is the first, then you’re probably looking at left
skew.
Distances
      Q1 to median             median to Q3   Shape
A:    1.63           < < [1]   6.18           Right skewed
B:    1.70           ≈         2.03           Symmetric
C:    8.15           > >       2.08           Left skewed
D:    4.95           > >       1.60           Left skewed
Of course, you want to “see” these comparisons through the boxplot – you don’t want to be
computing all this.
[1] The < < means “a good deal less than”; similarly, > > means “a good deal greater than.” Finally,
≈ means “approximately equal.”
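The quartile-distance comparison can also be sketched in Python. This is only an illustration: the function name `shape_from_quartiles` and the cutoff `factor=2.0` (how much larger one distance must be before it counts as “a good deal” larger) are assumptions, not part of the worksheet. The quartiles are the ones given in the table for sets B, C and D.

```python
def shape_from_quartiles(q1, median, q3, factor=2.0):
    """Guess the shape of a distribution from its quartiles.

    factor is a hypothetical cutoff: one distance must be more than
    `factor` times the other before the data is called skewed.
    """
    lower = median - q1   # distance from Q1 to the median
    upper = q3 - median   # distance from the median to Q3
    if upper > factor * lower:
        return "right skewed"
    if lower > factor * upper:
        return "left skewed"
    return "roughly symmetric"


# (Q1, median, Q3) as given in the worksheet for sets B, C and D
quartiles = {
    "B": (15.25, 16.95, 18.98),
    "C": (16.55, 24.70, 26.78),
    "D": (16.75, 21.70, 23.30),
}

for name, (q1, med, q3) in quartiles.items():
    print(name, shape_from_quartiles(q1, med, q3))
```

A different cutoff would move the boundary between “symmetric” and “skewed”, which is one reason the worksheet recommends judging this by eye from the boxplot.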
e) Here are means and standard deviations for the four data sets:
Data Set   Mean     StDev
A          8.84     7.82
B          17.18    4.49
C          21.09    8.08
D          19.78    4.91
For which data set are the mean and median closest? Notice the shape of the distribution for this
data set.
For sets C and D the mean is a decent amount less than the median. What shape are the
distributions for C and D?
For set A the mean is a decent amount greater than the median. What shape is the distribution for
A?
Fill in the blanks below with one of these phrases: “less than” “greater than” “about equal to.”

For a right skewed distribution the mean is __________________________ the median.

For a left skewed distribution the mean is __________________________ the median.

For a symmetric distribution the mean is __________________________ the median.
For continuous data, it is rare for the mean and median to be exactly the same. Distributions of
real data are virtually never exactly symmetric. Slight discrepancies between mean and median
exist even for a distribution that is best described as symmetric. (A “slight discrepancy” means
that the mean and median plot very close to each other on the horizontal (x) axis of the
histogram or boxplot.)
f) Identify a reasonable value for the mode for each of the sets A – D.
A: ________   B: ________   C: ________   D: ________
You probably answered “2.5” for set A. Fine. Most statisticians would say “0.” Why? Because
the histogram shows a pattern of increasing frequency for values closer to 0. In the same way, a
statistician would say that the mode for set C is _________.
Now, examine the relationship between mean, median and mode for the data sets, keeping in
mind the distribution shapes.
A:   Mode = 0 (or 2.5)      Median = 5.45     Mean = 8.84     Right skewed
B:   Mode = 17.5            Median = 16.95    Mean = 17.18    Symmetric
C:   Mode = 30 (or 27.5)    Median = 24.70    Mean = 21.09    Left skewed
D:   Mode = 22.5            Median = 21.70    Mean = 19.78    Left skewed
Suppose the mean and mode are quite different – which generally happens when there is skew.
Where does the median generally fall, relative to the mean and mode?
3. a) For each histogram, identify the shape of the distribution. Also give reasonable values for
the mode of each distribution.
[Figure: four histograms, panels A, B, C and D. Horizontal axis: 0.00 to 0.90. Vertical axis: Percent, 0 to 20.]
b) Match the histograms to the boxplots.
c) For each of the distributions identified by histogram letters A, B, C and D, determine how the
mean and median compare to each other, as well as to the mode.
[Figure: four boxplots numbered 1 to 4. Horizontal axis: Data, 0.0 to 1.0.]
4. Here’s a boxplot of the amounts people paid for an identical model of car. (Different people
pay different amounts because automobile prices are usually negotiated.)
Answer from the boxplot alone. (Do the best you can. No one can be exact.)
a) Determine the 5-# summary.
b) Determine values for the range and IQR.
c) About what % of people paid over 36,800?
d) About what % of people paid between 37,150 and 37,600?
e) Should the mean price be less than, more than, or about equal to the median price?
5. Consider the four data sets:
Set A: 4.3 3.8 3.0 2.9 3.8 2.5 5.8 2.4 3.5 1.1 3.3 4.7 4.4
Set B: 5.0 5.4 5.8 5.3 3.7 4.7 1.2 1.6 4.8 1.9 1.1 2.3 2.7
Set C: 3.8 3.7 3.3 2.5 1.1 3.5 3.4 3.4 3.8 5.8 3.2 3.5 4.5
Set D: 4.9 5.4 3.2 2.2 1.1 3.8 4.7 1.6 3.6 1.9 4.4 2.9 5.8
a) Obtain histograms for all four sets. (The preferred method is to use a computer to do this. If
you do that, the scales of the histograms might be somewhat different from those shown
below, which have been forced to be identical. That’s OK.)
[Figure: four histograms, panels A, B, C and D, drawn to identical scales. Horizontal axis: 0 to 6. Vertical axis: Frequency, 0 to 9.]
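One way to obtain the histograms by computer is sketched here in Python, using only the standard library and printing text bars rather than drawing a plot. The unit-width bins 1-2, 2-3, ..., 5-6 and the helper name `bin_counts` are assumptions made for this sketch; other software may choose different bins, so the picture can look somewhat different.

```python
import math
from collections import Counter

# The four data sets from exercise 5
data_sets = {
    "A": [4.3, 3.8, 3.0, 2.9, 3.8, 2.5, 5.8, 2.4, 3.5, 1.1, 3.3, 4.7, 4.4],
    "B": [5.0, 5.4, 5.8, 5.3, 3.7, 4.7, 1.2, 1.6, 4.8, 1.9, 1.1, 2.3, 2.7],
    "C": [3.8, 3.7, 3.3, 2.5, 1.1, 3.5, 3.4, 3.4, 3.8, 5.8, 3.2, 3.5, 4.5],
    "D": [4.9, 5.4, 3.2, 2.2, 1.1, 3.8, 4.7, 1.6, 3.6, 1.9, 4.4, 2.9, 5.8],
}


def bin_counts(values, low=1, high=6):
    """Count values in the unit-width bins [low, low+1), ..., [high-1, high)."""
    counts = Counter(math.floor(v) for v in values)
    return [counts[b] for b in range(low, high)]


for name, values in data_sets.items():
    print(f"Set {name}")
    for left, count in zip(range(1, 6), bin_counts(values)):
        # One star per observation in the bin
        print(f"  {left}-{left + 1} | {'*' * count}")
```

Each star stands for one observation, so the longest row of stars marks the interval that occurs most often.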
b) Examine the histograms. Without any computing:
i) What approximately are the means of these four sets?
ii) Which set do you think has the most variability? The least? Rank the sets A, B, C and D,
from least to most variable.
c) Obtain the five number summary for data set A. Similar five number summaries are shown for
the other three data sets. Also determine the range and interquartile range (IQR).
A: { ______ , ______ , ______ , ______ , ______ }   IQR = _____   Range = _____
B: { 1.1, 1.9, 3.7, 5.0, 5.8 }                      IQR = 3.1     Range = 4.7
C: { 1.1, 3.3, 3.5, 3.8, 5.8 }                      IQR = 0.5     Range = 4.7
D: { 1.1, 2.2, 3.6, 4.7, 5.8 }                      IQR = 2.5     Range = 4.7
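A five number summary can be obtained by computer along the following lines. This is a minimal Python sketch, not necessarily the method used to prepare the worksheet. The helper name `five_number_summary` is made up for the example, and the choice of `method="inclusive"` is an assumption: software packages use different quartile conventions, and this one happens to reproduce the summary printed above for set B.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Set B from exercise 5, used here as a check against the printed summary
set_b = [5.0, 5.4, 5.8, 5.3, 3.7, 4.7, 1.2, 1.6, 4.8, 1.9, 1.1, 2.3, 2.7]

low, q1, median, q3, high = five_number_summary(set_b)
print("5 number summary:", (low, q1, median, q3, high))
print("IQR   =", round(q3 - q1, 1))      # width of the box
print("Range =", round(high - low, 1))   # max minus min
```

Replacing `set_b` with the values for set A gives the numbers needed for the blanks.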
d) Use the Range Rule of Thumb to guess the standard deviations for these four sets of data.
e) Obtain mean and standard deviation for set A.
Set    Mean     SD
A      _____    _____
B      3.50     1.68
C      3.50     1.02
D      3.50     1.45
f) Make some comparisons:
• How do the means compare?
Standard deviation, range and IQR are all measures of variability. The aim of this exercise is
to demonstrate a finer point: the range alone is a somewhat flawed measure of variability.
• How do the ranges compare? Rank the data sets A, B, C and D from smallest range to
largest.
• How do the standard deviations compare? Rank the data sets A, B, C and D from smallest
standard deviation to largest. (This is how you want to answer (ii) of part (b) above.) Do
these rankings agree with those for the range?
• Could you use the order of ranges to predict the order of standard deviations?
• How do the IQRs compare? Rank the data sets A, B, C and D from smallest IQR to largest. Do
these rankings agree with those for the standard deviation?
Comment: The Range Rule of Thumb tends to work better when the shape is near Normal (bell).
Data set A is closest to Normal shaped. The Range Rule of Thumb predicts 4.7 / 4 = 1.175 for the
standard deviation, and, in fact, for set A the actual standard deviation is quite close to that:
1.143.
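The comparison in the comment can be checked with a short Python sketch. The function name `range_rule_sd` is hypothetical. Note one assumption: `statistics.pstdev` (which divides by n) is used because it reproduces the 1.143 quoted above; `statistics.stdev` (which divides by n - 1) would give a slightly larger value.

```python
import statistics

# Set A from exercise 5
set_a = [4.3, 3.8, 3.0, 2.9, 3.8, 2.5, 5.8, 2.4, 3.5, 1.1, 3.3, 4.7, 4.4]


def range_rule_sd(values):
    """Range Rule of Thumb: estimate the standard deviation as range / 4."""
    return (max(values) - min(values)) / 4


estimate = range_rule_sd(set_a)
# pstdev divides by n; this is the form that matches the worksheet's 1.143
actual = statistics.pstdev(set_a)

print(f"Range rule estimate: {estimate:.3f}")
print(f"Standard deviation:  {actual:.3f}")
```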
6. The histograms below are all drawn to the same scale on the horizontal (x) axis.
[Figure: four histograms, panels A, B, C and D, each running from Min to Max on the same horizontal scale, and four boxplots labeled L, U, P and N.]
a) How do the ranges compare for these four distributions?
b) The standard deviations for the distributions are: 35, 28, 15 and 10. Which standard deviation
goes with each of the four histograms?
c) Match the boxplots to the histograms.
d) Which of these distributions (A, B, C, D) has the largest interquartile range? Second largest?
Second smallest? Which has the smallest? How does this ranking compare to that for the
standard deviations?
e) For which distribution is the Range Rule of Thumb (Range / 4 ≈ SD) going to work best?
f) Suppose you know the means of these distributions are all 100. Use your answer to part e to
guess the values of Min and Max.
Solutions
1. a) The distribution is left skewed. b) The five number summary is
{ 27.8, 63.8, 78.7, 88.5, 95.0 }
d) IQR = 24.7, Range = 67.2. e) The mean is 74.11, the standard deviation is 17.03. f) 3.95.
h) Mean = 74.11; median = 78.7; mode = 85. So for left skewed data: mean < median < mode.
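The arithmetic behind parts d) and f) can be reproduced from the reported values. This is a small Python sketch that simply reuses the five number summary and standard deviation stated in the solution; the variable names are chosen for the example.

```python
# Five number summary and SD as reported in the solution to exercise 1
minimum, q1, median, q3, maximum = 27.8, 63.8, 78.7, 88.5, 95.0
sd = 17.03

iqr = q3 - q1                   # width of the box in the boxplot
data_range = maximum - minimum  # max minus min
ratio = data_range / sd         # how many times bigger the range is than the SD

print(f"IQR = {iqr:.1f}, Range = {data_range:.1f}, Range / SD = {ratio:.2f}")
```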
2. a) A is right skewed; B is symmetric; C and D are both left skewed.
b) { 1.10, 3.83, 5.45, 11.63, 27.40 } is the 5 # summary for A. The mean is 8.84 with standard
deviation 7.82.
c) [Figure: boxplots for A, B, C and D stacked over a horizontal Data axis from 0 to 30.]
e) The mean and median are the closest for the symmetric distribution for data set B. The mean
is 17.18 and the median is 16.95; these are very close when marked on the scale of the
histogram or boxplot. When the mean is below the median (as for C and D) we see left skew.
When the mean is above the median (as for A) we see right skew.

For a right skewed distribution the mean is greater than the median.

For a left skewed distribution the mean is less than the median.

For a symmetric distribution the mean is about equal to the median.
f) The median is generally between the mean and mode. We can extend the results of part e:
If Mean < Median < Mode then you are probably looking at a left skewed distribution.
If Mean > Median > Mode then you are probably looking at a right skewed distribution.
If the three are fairly close to each other, you are probably looking at a fairly symmetric
distribution.
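These three rules of thumb can be written out as a short Python sketch. It is only an illustration: the function name `shape_from_centers` and the `tolerance` that decides when the three values count as “fairly close” (here 1.0, in the units of the data) are assumptions, not part of the worksheet. The values are taken from the table in exercise 2, using the statistician's choice of mode for A and C.

```python
def shape_from_centers(mean, median, mode, tolerance=1.0):
    """Guess the shape from how the mean, median and mode are ordered.

    tolerance is a hypothetical cutoff for "fairly close to each other".
    """
    values = (mean, median, mode)
    if max(values) - min(values) <= tolerance:
        return "fairly symmetric"
    if mean < median < mode:
        return "left skewed"
    if mean > median > mode:
        return "right skewed"
    return "unclear"


# (mean, median, mode) for the four data sets in exercise 2
centers = {
    "A": (8.84, 5.45, 0.0),
    "B": (17.18, 16.95, 17.5),
    "C": (21.09, 24.70, 30.0),
    "D": (19.78, 21.70, 22.5),
}

for name, (mean, median, mode) in centers.items():
    print(name, shape_from_centers(mean, median, mode))
```

The word “probably” in the rules matters: the ordering is a guide, and the histogram or boxplot remains the better check.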
3. a) A and D are right skewed (D is more skewed than is A); B is symmetric; C is left skewed.
For the modes, see the table below. b) A-3; B-2; C-1; D-4. c)
A: Mode = 0.225 < Median < Mean
B: Mode = 0.500 ≈ Median ≈ Mean (so the mean and median are about 0.500)
C: Mean < Median < Mode = 0.725
D: Mode = 0 < Median < Mean
Your modes may be a little different, but should basically be in the same place when you
compare to mine with marks under the horizontal (x) axis of the histograms.
4. a) {36525, 36800, 37150, 37600, 39725} (if you are fairly close, that’s good). b) The range is
about 3200 and the IQR about 800. Again: Your values should be close. c) About 75% from the
boxplot. d) From the boxplot: About 25%.
e) This is the sort of boxplot you’d see for a right skewed distribution. So, the mean should be
larger than the median.
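The percentages in c) and d) come from the fact that each of the four sections of a boxplot holds about 25% of the data. A minimal Python sketch of that reasoning follows; the function name `percent_between` is made up for the example, and it only handles cut points that are themselves values in the five number summary.

```python
def percent_between(summary, low, high):
    """Approximate percent of the data between two boxplot landmarks.

    summary is (min, Q1, median, Q3, max); low and high must each be one
    of those five values. Each gap between neighbors holds about 25%.
    """
    return 25 * (summary.index(high) - summary.index(low))


# Five number summary read from the boxplot in exercise 4
prices = (36525, 36800, 37150, 37600, 39725)

print(percent_between(prices, 36800, 39725))  # paid over 36,800
print(percent_between(prices, 37150, 37600))  # paid between 37,150 and 37,600
```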
5. c) and e) Here are some corresponding statistics.
Variable   Mean    StDev   Minimum   Q1     Median   Q3     Maximum   Range   IQR
A          3.50    1.14    1.1       2.9    3.5      4.3    5.8       4.7     1.4
B          3.50    1.68    1.1       1.9    3.7      5.0    5.8       4.7     3.1
C          3.50    1.02    1.1       3.3    3.5      3.8    5.8       4.7     0.5
D          3.50    1.45    1.1       2.2    3.6      4.7    5.8       4.7     2.5
d) The range rule of thumb does not discriminate the differences in variability among these 4
data sets. The range rule of thumb anticipates a standard deviation of (5.8 – 1.1) / 4 = 1.175 for
each of the data sets.
f)
• The means are identical. (The medians are nearly the same, and fairly close to the means,
which goes hand in hand with the symmetry shown in all these distributions.)
• The ranges are identical.
• The standard deviations are somewhat different. C has the lowest, then A, then D, with B
highest. Standard deviation measures a “standard (typical) deviation from the mean.”
Take C for instance: Almost all the data is very near the mean. So the standard deviation
is small relative to the others. For B, however, much of the data is at the extremes, far
from the mean. The standard deviation is large relative to the others. The standard
deviation is a more subtle measure of variability than is the range.
1.175 is a decent guess for all four, but this exercise is designed to reinforce this idea:
While standard deviation tends to be about ¼ the range, there is more to it than that.
Standard deviation takes into account not only what the largest deviations from the center
(mean) are, but also how often these occur relative to smaller deviations.
• The IQRs also measure variability, and they also discriminate the differences in
variability among the data sets better than do the ranges. You can see that the order of
IQRs (from smallest to largest: C, A, D, B) is the same as for standard deviations.
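The three rankings can be checked by computer. The Python sketch below rests on two assumptions: quartiles use the `"inclusive"` method, and the standard deviation is `statistics.pstdev` (dividing by n), since those choices reproduce the values in the table above. The helper name `spread_measures` is made up for the example.

```python
import statistics

# The four data sets from exercise 5
data_sets = {
    "A": [4.3, 3.8, 3.0, 2.9, 3.8, 2.5, 5.8, 2.4, 3.5, 1.1, 3.3, 4.7, 4.4],
    "B": [5.0, 5.4, 5.8, 5.3, 3.7, 4.7, 1.2, 1.6, 4.8, 1.9, 1.1, 2.3, 2.7],
    "C": [3.8, 3.7, 3.3, 2.5, 1.1, 3.5, 3.4, 3.4, 3.8, 5.8, 3.2, 3.5, 4.5],
    "D": [4.9, 5.4, 3.2, 2.2, 1.1, 3.8, 4.7, 1.6, 3.6, 1.9, 4.4, 2.9, 5.8],
}


def spread_measures(values):
    """Return (range, IQR, SD) for one data set, rounded as in the table."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (
        round(max(values) - min(values), 1),
        round(q3 - q1, 1),
        round(statistics.pstdev(values), 2),
    )


measures = {name: spread_measures(v) for name, v in data_sets.items()}

# Rank the sets from least to most variable under each measure
by_sd = sorted(measures, key=lambda name: measures[name][2])
by_iqr = sorted(measures, key=lambda name: measures[name][1])
ranges = {m[0] for m in measures.values()}

print("Order by SD: ", by_sd)
print("Order by IQR:", by_iqr)
print("Distinct ranges:", ranges)
```

Because every set has the same range, sorting by range cannot separate the sets at all, while the IQR and the standard deviation produce the same ordering.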
6.
a) They are about the same.
b) C-35, A-28, B-15, D-10.
c) U-A, N-B, P-C, L-D.
d) P has largest; then U; then N; L has smallest. The IQR is the width of the box, so a
comparison is simple. Then using the result to part c we can order the histograms by IQR: C has
largest; then A; then B; D has smallest. IQRs rank the same as standard deviations. (Which is
good. They are just different ways of measuring the variability, and generally they discriminate
the same way.) You can see that the range can, in some cases, be unable to make this
discrimination. Look at C and D. Clearly there’s more variability in C: extremes (values far
from the mean/center) are very likely for C and very uncommon in D.
e) B – the one that has a Normal (bell) shape.
f) Range rule of thumb works best with Normal shapes. For B we have a standard deviation of
15, anticipating a range of 60. With a mean at 100, we’d have Min around 70 and Max around
130.
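The guess in part f) runs the Range Rule of Thumb in reverse. A minimal Python sketch, assuming a symmetric bell-shaped distribution so that the extremes sit equally far from the mean; the function name `guess_min_max` is hypothetical.

```python
def guess_min_max(mean, sd):
    """Invert the Range Rule of Thumb for a symmetric, bell-shaped distribution.

    Range is about 4 * SD, so Min and Max sit about 2 * SD either side of the mean.
    """
    half_range = 4 * sd / 2
    return mean - half_range, mean + half_range


# Distribution B: mean 100, standard deviation 15
low, high = guess_min_max(mean=100, sd=15)
print(low, high)
```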