Introduction to Business Statistics, 6e
Kvanli, Pavur, Keeling
Chapter 3 – Data Summary Using Descriptive Measures
Slides prepared by Jeff Heyl, Lincoln University
Thomson/South-WesternLearning™
1
©2003 South-Western/Thomson
Types of Descriptive Measures
• Measures of central tendency
• Measures of variation
• Measures of position
• Measures of shape
Measures of Central Tendency
• The Mean
• The Median
• The Midrange
• The Mode
The Mean
The Mean is simply the average of the data
A Sample Mean
Each value in the sample is represented by x.
Thus, to get the sample mean ($\bar{x}$), simply add all the values in the sample and divide by the number of values in the sample (n):
$$\bar{x} = \frac{\sum x}{n}$$
The Population Mean
Each value in the population is represented by x.
Thus, to get the population mean ($\mu$), simply add all the values in the population and divide by the number of values in the population (N):
$$\mu = \frac{\sum x}{N}$$
The Accident Data Set
$$\bar{x} = \frac{6 + 9 + 7 + 23 + 5}{5} = 10.0$$
If we remove the last value from the data set, then
$$\bar{x} = \frac{6 + 9 + 7 + 23}{4} = 11.25$$
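The two calculations above can be reproduced with a minimal Python sketch. The names `accidents` and `sample_mean` are illustrative choices, not part of the original slides.

```python
# Accident data in the order shown on the slide.
accidents = [6, 9, 7, 23, 5]

def sample_mean(values):
    """Add all the values and divide by the number of values (n)."""
    return sum(values) / len(values)

mean_all = sample_mean(accidents)                # all five values
mean_without_last = sample_mean(accidents[:-1])  # last value removed

print(mean_all, mean_without_last)
```

Dropping a single value moves the mean from 10.0 to 11.25, which shows how sensitive the mean is to individual observations in a small sample.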
The Median
The Median (Md) of a set of data is the value in the center of the data values when they are arranged from lowest to highest
Accident Data
Ordered array: 5, 6, 7, 9, 23
The value that has an equal number of items to the right and left is the median
Md = 7
If n is an odd number, Md is the center data value of the ordered data set:
$$Md = \left(\frac{n+1}{2}\right)\text{th ordered value}$$
Here n = 5, so Md is the (5 + 1)/2 = 3rd ordered value, 7.
Even Numbered Data
Ordered array: 3, 8, 12, 14
The value that has an equal number of items to the right and left is the median
Md = (8 + 12)/2 = 10
If n is an even number, Md is the average of the two center values of the ordered data set
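The odd and even cases can be sketched together in a few lines of Python. The function name `median` and the variable names are illustrative; the logic simply follows the two rules stated on the slides.

```python
def median(values):
    """Return the center value of the ordered data (odd n),
    or the average of the two center values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th ordered value, index mid in 0-based terms.
        return ordered[mid]
    # Even n: average of the two center values.
    return (ordered[mid - 1] + ordered[mid]) / 2

md_odd = median([6, 9, 7, 23, 5])   # accident data
md_even = median([3, 8, 12, 14])    # even-numbered data

print(md_odd, md_even)
```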
The Midrange
The Midrange (Mr) provides an easy-to-grasp measure of central tendency. With L the lowest and H the highest data value:
$$Mr = \frac{L + H}{2}$$
Accident Data
Ordered array: 5, 6, 7, 9, 23
$$Mr = \frac{5 + 23}{2} = 14$$
Note that the Midrange is severely affected by outliers
Compare Mr to $\bar{x} = 10$ and Md = 7
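A short sketch, assuming the accident data above, shows the outlier effect numerically. The helper `midrange` is an illustrative name, and removing the value 23 is an experiment added here rather than something the slides do.

```python
def midrange(values):
    """Average of the lowest (L) and highest (H) data values."""
    return (min(values) + max(values)) / 2

accidents = [5, 6, 7, 9, 23]

mr = midrange(accidents)                        # (5 + 23) / 2
mr_without_outlier = midrange(accidents[:-1])   # (5 + 9) / 2, outlier 23 dropped

print(mr, mr_without_outlier)
```

With the outlier present the midrange is 14.0; without it the midrange is 7.0, so one extreme value changes the result by a factor of two.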
The Mode
• The Mode (Mo) of a data set is the value that occurs more than once and the most often
• The Mode is not always a measure of central tendency; this value need not occur in the center of the data
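The definition above can be sketched with `collections.Counter` from the Python standard library. The function name `modes` and the first sample list are hypothetical; returning a list is a design choice made here so that ties and the "no mode" case can both be represented.

```python
from collections import Counter

def modes(values):
    """Return the value(s) occurring more than once and most often.
    An empty list means no value repeats, so there is no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest < 2:
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 3, 3, 5, 9, 9, 9]))   # hypothetical data with one mode
print(modes([6, 9, 7, 23, 5]))        # accident data: no repeated value
```

In the first list the mode (9) is also the largest value, which illustrates that the mode need not sit in the center of the data.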
Bellaire College Example
Figure 3.2
Figure 3.3
Figure 3.4
Level of Measurement and Measure of Central Tendency
Summary of levels of measurement and appropriate measure of central tendency. A “Y” indicates this measure can be used with the corresponding level of measurement.
Table 3.1
Measure of Central Tendency | Nominal | Ordinal | Interval | Ratio
Mean                        |         |         | Y        | Y
Median                      |         | Y       | Y        | Y
Midrange                    |         |         | Y        | Y
Mode                        | Y       | Y       | Y        | Y
Measures of Variation
• Homogeneity refers to the degree of similarity within a set of data
• The more homogeneous a set of data is, the better the mean will represent a typical value
• Variation is the tendency of data values to scatter about the mean, $\bar{x}$
Common Measures of Variation
• Range
• Variance
• Standard Deviation
• Coefficient of Variation
The Range
For the Accident data:
Range = H - L = 23 - 5 = 18
The range is a rather crude measure, but it is easy to calculate and contains valuable information in some situations
The Variance and Standard Deviation
Both measures describe the variation of the values about the mean
Data Value (x) | $x - \bar{x}$ | $(x - \bar{x})^2$
5              | -5            | 25
6              | -4            | 16
7              | -3            | 9
9              | -1            | 1
23             | 13            | 169
Totals: $\sum (x - \bar{x}) = 0$ and $\sum (x - \bar{x})^2 = 220$
Sample Variance
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Using the accident data:
$$s^2 = \frac{220}{5 - 1} = \frac{220}{4} = 55.0$$
Sample Standard Deviation
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Using the accident data:
$$s = \sqrt{55.0} = 7.416$$
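The deviation table and the two formulas above can be checked step by step with a small Python sketch. All variable names are illustrative.

```python
import math

accidents = [5, 6, 7, 9, 23]
n = len(accidents)
mean = sum(accidents) / n                      # 10.0

# Deviations from the mean and their squares, as in the table.
deviations = [x - mean for x in accidents]
squared_deviations = [d ** 2 for d in deviations]

sum_of_squares = sum(squared_deviations)       # numerator of s^2
s2 = sum_of_squares / (n - 1)                  # sample variance
s = math.sqrt(s2)                              # sample standard deviation

print(sum(deviations), sum_of_squares, s2, round(s, 3))
```

The deviations always sum to zero, which is why they are squared before being averaged.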
Population Variance and Standard Deviation
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
The Coefficient of Variation
The Coefficient of Variation (CV) is used to compare the variation of two or more data sets where the values of the data differ greatly
$$CV = \frac{s}{\bar{x}} \times 100$$
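As a rough illustration of why CV suits data sets on very different scales, the sketch below computes it for the accident data and for a hypothetical second data set that is the same data multiplied by 100. The function name and the scaled data are assumptions made for this example only.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

accidents = [5, 6, 7, 9, 23]
scaled = [x * 100 for x in accidents]   # hypothetical: same shape, larger units

cv_accidents = coefficient_of_variation(accidents)
cv_scaled = coefficient_of_variation(scaled)

print(round(cv_accidents, 2), round(cv_scaled, 2))
```

The standard deviation of the scaled data is 100 times larger, yet both data sets have a CV of about 74.16, because CV measures variation relative to the mean.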
Machined Parts Example
Figure 3.6
Measures of Position
• Percentile (Quartile)
  • Most common measure of position
  • Quartiles are percentiles with the data divided into quarters
• Z-Score
  • The relative position of a data value expressed in terms of the number of standard deviations above or below the mean
Percentile Example
The 35th Percentile ($P_{35}$) is that value such that at most 35% of the data values are less than $P_{35}$ and at most 65% of the data values are greater than $P_{35}$.
Aptitude Test Scores
Table 3.2
22 25 28 31 34 35 39 39 40 42
44 44 46 48 49 51 53 53 55 55
56 57 59 60 61 63 63 63 65 66
68 68 69 71 72 72 74 75 75 76
78 78 80 82 83 85 88 90 92 96
Ordered array of aptitude test scores for 50 applicants ($\bar{x} = 60.36$, s = 18.61)
Percentile
Texon Industries Data
Number of data values, n = 50
Percentile, P = 35
$$n \cdot \frac{P}{100} = 50 \cdot .35 = 17.5$$
17.5 represents the position of the 35th percentile
Percentile Location Rules
Rule 1: If $n \cdot P/100$ is not a counting number, round it up, and the Pth percentile will be the value in this position of the ordered data
Rule 2: If $n \cdot P/100$ is a counting number, the Pth percentile is the average of the number in this location (of the ordered data) and the number in the next larger location
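The two location rules can be sketched as a single Python function and applied to the 50 aptitude scores of Table 3.2. The function name `percentile` and the restriction to 0 < P < 100 are assumptions of this sketch; library routines such as NumPy's percentile use different interpolation rules by default and may give slightly different answers.

```python
import math

# Ordered aptitude test scores for the 50 applicants (Table 3.2).
scores = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42,
    44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66,
    68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]

def percentile(ordered, p):
    """Pth percentile of already-ordered data using Rules 1 and 2."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    position = len(ordered) * p / 100
    if position != int(position):
        # Rule 1: not a counting number, so round up and take that position.
        return ordered[math.ceil(position) - 1]
    # Rule 2: counting number, so average this position and the next one.
    k = int(position)
    return (ordered[k - 1] + ordered[k]) / 2

p35 = percentile(scores, 35)   # position 17.5 rounds up to the 18th value
print(p35)
```

For P = 35 the position 17.5 rounds up to 18, and the 18th ordered score is 53.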
Aptitude Scores Example
Ms. Jensen received a score of 83 on the aptitude test. What is her percentile value?
83 is the 45th value in the ordered array of 50.
A guess of the percentile would be:
$$P = \frac{45}{50} \cdot 100 = 90$$
Examining the surrounding values clarifies the true percentile
Example 3.5
P  | (n • P)/100     | Pth Percentile
88 | 50 • .88 = 44   | (82 + 83)/2 = 82.5
89 | 50 • .89 = 44.5 | 45th value = 83
90 | 50 • .90 = 45   | (83 + 85)/2 = 84
Only P = 89 gives a percentile value of 83, so her score is the 89th percentile.
Quartiles
Quartiles are merely particular percentiles that divide the data into quarters, namely:
Q1 = 1st quartile = 25th percentile ($P_{25}$)
Q2 = 2nd quartile = 50th percentile = median ($P_{50}$)
Q3 = 3rd quartile = 75th percentile ($P_{75}$)
Quartile Example
Using the applicant data, the first quartile is:
$$n \cdot \frac{P}{100} = (50)(.25) = 12.5$$
Rounded up, Q1 = 13th ordered value = 46
Similarly, the third quartile is:
$$n \cdot \frac{P}{100} = (50)(.75) = 37.5$$
Rounded up, Q3 = 38th ordered value = 75
Interquartile Range
The interquartile range (IQR) is the range spanned by the middle 50% of the data set
IQR = Q3 - Q1
Using the applicant data, the IQR is:
IQR = 75 - 46 = 29
Z-Scores
• A Z-score determines the relative position of any particular data value x and is based on the mean and standard deviation of the data set
• The Z-score expresses the number of standard deviations the value x is from the mean
• A negative Z-score implies that x is to the left of the mean and a positive Z-score implies that x is to the right of the mean
Z Score Equation
$$z = \frac{x - \bar{x}}{s}$$
For a score of 83 from the aptitude data set,
$$z = \frac{83 - 60.36}{18.61} = 1.22$$
For a score of 35 from the aptitude data set,
$$z = \frac{35 - 60.36}{18.61} = -1.36$$
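A minimal sketch of the z-score calculation, using the sample mean 60.36 and standard deviation 18.61 reported with Table 3.2. The function name `z_score` is an illustrative choice.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

mean, s = 60.36, 18.61   # summary values reported with Table 3.2

z_83 = z_score(83, mean, s)
z_35 = z_score(35, mean, s)

print(round(z_83, 2), round(z_35, 2))
```

A score of 83 lies about 1.22 standard deviations above the mean, and a score of 35 lies about 1.36 standard deviations below it.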
Standardizing Sample Data
The process of subtracting the mean and dividing by the standard deviation is referred to as standardizing the sample data.
The corresponding z-score is the standardized score.
Measures of Shape
• Skewness
  • Skewness measures the tendency of a distribution to stretch out in a particular direction
• Kurtosis
  • Kurtosis measures the peakedness of the distribution
Skewness
• In a symmetrical distribution the mean, median, and mode would all be the same value and Sk = 0
• A positive Sk number implies a shape which is skewed right and the mode < median < mean
• In a data set with a negative Sk value the mean < median < mode
Skewness Calculation
Pearsonian coefficient of skewness
$$Sk = \frac{3(\bar{x} - Md)}{s}$$
Values of Sk will always fall between -3 and 3
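As a quick check of the formula, the sketch below plugs in the aptitude-score summary values that appear elsewhere in these slides ($\bar{x} = 60.36$, Md = 62, s = 18.61). The function name `pearson_skewness` is illustrative.

```python
def pearson_skewness(mean, median, sd):
    """Pearsonian coefficient of skewness: Sk = 3 * (mean - median) / s."""
    return 3 * (mean - median) / sd

# Aptitude-score summary values taken from the slides.
sk = pearson_skewness(mean=60.36, median=62, sd=18.61)

print(round(sk, 2))
```

The result is slightly negative (about -0.26), which is consistent with the mean lying a little below the median.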
Frequency Histogram of Symmetric Data
Figure 3.7
$\bar{x}$ = Md = Mo
Relative Frequency Histogram with Right (Positive) Skew
Figure 3.8
Sk > 0
Left to right: Mode (Mo), Median (Md), Mean ($\bar{x}$)
Relative Frequency Histogram with Left (Negative) Skew
Figure 3.9
Sk < 0
Left to right: Mean ($\bar{x}$), Median (Md), Mode (Mo)
Kurtosis
• Kurtosis is a measure of the peakedness of a distribution
• Large values occur when there is a high frequency of data near the mean and in the tails
• The calculation is cumbersome and the measure is used infrequently
Chebyshev’s Inequality
1. At least 75% of the data values are between $\bar{x} - 2s$ and $\bar{x} + 2s$, or
At least 75% of the data values have a z-score value between -2 and 2
2. At least 89% of the data values are between $\bar{x} - 3s$ and $\bar{x} + 3s$, or
At least 89% of the data values have a z-score value between -3 and 3
3. In general, at least $(1 - 1/k^2) \times 100\%$ of the data values lie between $\bar{x} - ks$ and $\bar{x} + ks$ for any k > 1
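The general bound in item 3 is easy to tabulate. The sketch below is illustrative only; the function name `chebyshev_lower_bound` is an assumption, and the value k = 1.5 is an extra case not shown on the slide.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of data values within k standard deviations
    of the mean, valid for any data set when k > 1."""
    if k <= 1:
        raise ValueError("Chebyshev's inequality requires k > 1")
    return (1 - 1 / k ** 2) * 100

for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 1))
```

For k = 2 the bound is exactly 75%, and for k = 3 it is about 88.9%, which the slide rounds to 89%.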
Empirical Rule
Under the assumption of a bell shaped population:
1. Approximately 68% of the data values lie between $\bar{x} - s$ and $\bar{x} + s$ (have z-scores between -1 and 1)
2. Approximately 95% of the data values lie between $\bar{x} - 2s$ and $\bar{x} + 2s$ (have z-scores between -2 and 2)
3. Approximately 99.7% of the data values lie between $\bar{x} - 3s$ and $\bar{x} + 3s$ (have z-scores between -3 and 3)
A Bell-Shaped (Normal) Population
Figure 3.10
Chebyshev’s Versus Empirical
Table 3.3
Between                            | Actual Percentage    | Chebyshev’s Inequality Percentage | Empirical Rule Percentage
$\bar{x} - s$ and $\bar{x} + s$    | 66% (33 out of 50)   | n/a                               | ≈ 68%
$\bar{x} - 2s$ and $\bar{x} + 2s$  | 98% (49 out of 50)   | ≥ 75%                             | ≈ 95%
$\bar{x} - 3s$ and $\bar{x} + 3s$  | 100% (50 out of 50)  | ≥ 89%                             | ≈ 100%
Md = 62
Sk = -.26
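The "Actual Percentage" column can be reproduced from the aptitude scores of Table 3.2 with a short sketch. It uses `statistics.mean` and `statistics.stdev` from the Python standard library; the helper name `count_within` is illustrative.

```python
import statistics

# Ordered aptitude test scores for the 50 applicants (Table 3.2).
scores = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42,
    44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66,
    68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]

mean = statistics.mean(scores)
s = statistics.stdev(scores)   # sample standard deviation (n - 1 divisor)

def count_within(values, center, spread, k):
    """Count the values lying between center - k*spread and center + k*spread."""
    return sum(1 for v in values if center - k * spread <= v <= center + k * spread)

counts = {k: count_within(scores, mean, s, k) for k in (1, 2, 3)}
for k, count in counts.items():
    print(k, count, f"{100 * count / len(scores):.0f}%")
```

The counts are 33, 49 and 50 out of 50, matching the 66%, 98% and 100% in the table.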
Allied Manufacturing Example
Is the Empirical Rule applicable to this data?
Probably yes. Histogram is approximately bell shaped.
$\bar{x} - 2s = 10.275$ and $\bar{x} + 2s = 10.3284$
96 of the 100 data values fall between these limits, closely approximating the 95% called for by the Empirical Rule
Grouped Data
When raw data are not available
Estimate $\bar{x}$ by assuming data values are equal to the midpoint of their class
Table 3.4
Class Number | Class (Age in years) | Frequency
1            | 20 and under 30      | 5
2            | 30 and under 40      | 14
3            | 40 and under 50      | 9
4            | 50 and under 60      | 6
5            | 60 and under 70      | 2
Total        |                      | 36
Grouped Data
When raw data are not available
Estimate $\bar{x}$ by assuming data values are equal to the midpoint of their class
5 values at (20 + 30)/2 = 25
14 values at (30 + 40)/2 = 35
9 values at (40 + 50)/2 = 45
6 values at (50 + 60)/2 = 55
2 values at (60 + 70)/2 = 65
$$\bar{x} = \frac{(5)(25) + (14)(35) + (9)(45) + (6)(55) + (2)(65)}{36} = \frac{1480}{36} = 41.1$$
Grouped Data
When raw data are not available
Estimate $s^2$ by assuming data values are equal to the midpoint of their class and using the usual method
$$s^2 = \frac{\sum (\text{each data value})^2 - \left[\sum (\text{each data value})\right]^2 / n}{n - 1}$$
$$s^2 = \frac{65{,}100 - (1480)^2/36}{35} = 121.59$$
$$s = \sqrt{121.59} = 11.03$$
Grouped Data
Summary of calculations
Table 3.5
Class Number | Class           | f  | m  | f • m | f • m²
1            | 20 and under 30 | 5  | 25 | 125   | 3,125
2            | 30 and under 40 | 14 | 35 | 490   | 17,150
3            | 40 and under 50 | 9  | 45 | 405   | 18,225
4            | 50 and under 60 | 6  | 55 | 330   | 18,150
5            | 60 and under 70 | 2  | 65 | 130   | 8,450
Total        |                 | 36 |    | ∑f • m = 1,480 | ∑f • m² = 65,100
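The grouped-data estimates can be sketched in Python by treating every value in a class as equal to the class midpoint. The list `classes` and the other names are illustrative; the class limits and frequencies are the ones in the age table.

```python
import math

# (lower limit, upper limit, frequency) for each age class.
classes = [(20, 30, 5), (30, 40, 14), (40, 50, 9), (50, 60, 6), (60, 70, 2)]

n = sum(f for _, _, f in classes)                        # total frequency
sum_fm = sum(f * (lo + hi) / 2 for lo, hi, f in classes)         # sum of f * m
sum_fm2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)  # sum of f * m^2

mean = sum_fm / n
s2 = (sum_fm2 - sum_fm ** 2 / n) / (n - 1)   # computational form of s^2
s = math.sqrt(s2)

print(n, sum_fm, sum_fm2)
print(round(mean, 1), round(s2, 2), round(s, 2))
```

These are estimates only: the true mean and variance of the raw ages could differ, because the actual values inside each class are unknown.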
Grouped Data
Figure 3.11
Box Plots
Box plots are graphical representations of data sets that illustrate the lowest data value (L), the first quartile (Q1), the median (Q2, Md), the third quartile (Q3), the interquartile range (IQR), and the highest data value (H)
Box Plots
Given the aptitude test data:
L = 22
Q1 = 46
Q2 = Md = 62
Q3 = 75
IQR = 75 - 46 = 29
H = 96
Figure 3.12 (box plot on a horizontal axis from 20 to 100, marking L = 22, Q1 = 46, Md = 62, Q3 = 75, and H = 96)
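The values a box plot displays can be computed from the Table 3.2 scores with a short sketch. It assumes the round-up/average percentile location rules given earlier in these slides; the helper `percentile` and the dictionary `summary` are illustrative names, and plotting libraries may place the quartiles slightly differently.

```python
import math

# Ordered aptitude test scores for the 50 applicants (Table 3.2).
scores = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42,
    44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66,
    68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]

def percentile(ordered, p):
    """Percentile by the slide rules: round a fractional position up,
    or average two neighbors when the position is a counting number."""
    position = len(ordered) * p / 100
    if position != int(position):
        return ordered[math.ceil(position) - 1]
    k = int(position)
    return (ordered[k - 1] + ordered[k]) / 2

ordered = sorted(scores)
summary = {
    "L": ordered[0],
    "Q1": percentile(ordered, 25),
    "Md": percentile(ordered, 50),
    "Q3": percentile(ordered, 75),
    "H": ordered[-1],
}
summary["IQR"] = summary["Q3"] - summary["Q1"]

print(summary)
```

The box spans Q1 to Q3, so its length is the IQR, and the whiskers extend toward L and H.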
Box Plots
Figure 3.13
Figure 3.14
Figure 3.15
Figure 3.16a
Figure 3.16b
Box Plots
Box Plots for Aptitude Scores
Figure 3.17 (vertical axis: Aptitude Score, 20 to 100; horizontal axis: Sample, 1 and 2)