CHAPTER 3: NUMERICAL DESCRIPTIVE MEASURES
MEASURES OF CENTRAL TENDENCY FOR UNGROUPED DATA
Mean
Median
Mode
Relationships among the Mean, Median, and Mode
Figure 3.1 (figure not reproduced; labels: Center at $56,260, Spread, Position of a particular family; horizontal axis: Income)
Mean
The mean for ungrouped data is
obtained by dividing the sum of all values
by the number of values in the data set.
Thus,
Mean for population data: $\mu = \frac{\sum x}{N}$
Mean for sample data: $\bar{x} = \frac{\sum x}{n}$
Example 3-1
Table 3.1 gives the 2002 total payrolls of
five Major League Baseball (MLB) teams.
Find the mean of the 2002 payrolls of these
five MLB teams.
Table 3.1
MLB Team                 2002 Total Payroll (millions of dollars)
Anaheim Angels           62
Atlanta Braves           93
New York Yankees         126
St. Louis Cardinals      75
Tampa Bay Devil Rays     34
Solution 3-1
$\bar{x} = \frac{\sum x}{n} = \frac{390}{5} = \$78 \text{ million}$
Thus, the mean 2002 payroll of these five MLB
teams was $78 million.
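The calculation in Solution 3-1 can be checked with a few lines of Python. This is a minimal sketch; the function name `mean` and the variable `payrolls` are chosen here for illustration and are not part of the slides.

```python
def mean(values):
    """Arithmetic mean: the sum of all values divided by the number of values."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# 2002 total payrolls (millions of dollars) from Table 3.1
payrolls = [62, 93, 126, 75, 34]
print(mean(payrolls))  # 78.0
```

The same function applies to population data; only the symbol changes (μ instead of x̄).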
Example 3-2
The following are the ages of all eight
employees of a small company:
53 32 61 27 39 44 49 57
Find the mean age of these employees.
Solution 3-2
$\mu = \frac{\sum x}{N} = \frac{362}{8} = 45.25 \text{ years}$
Thus, the mean age of all eight employees of
this company is 45.25 years, or 45 years and
3 months.
Mean cont.
Definition
Values that are very small or very large
relative to the majority of the values in a
data set are called outliers or extreme
values.
Example 3-3
Table 3.2 lists the 2000 populations (in
thousands) of the five Pacific states.
Table 3.2
State          Population (thousands)
Washington     5894
Oregon         3421
Alaska         627
Hawaii         1212
California     33,872   (an outlier)
Notice that the population of California is
very large compared to the populations of
the other four states. Hence, it is an
outlier. Show how the inclusion of this
outlier affects the value of the mean.
Solution 3-3
If we do not include the population of
California (the outlier), the mean population
of the remaining four states (Washington,
Oregon, Alaska, and Hawaii) is
$\text{Mean} = \frac{5894 + 3421 + 627 + 1212}{4} = 2788.5 \text{ thousand}$
Now, to see the impact of the outlier on the
value of the mean, we include the
population of California and find the mean
population of all five Pacific states. This
mean is
$\text{Mean} = \frac{5894 + 3421 + 627 + 1212 + 33{,}872}{5} = 9005.2 \text{ thousand}$
Median
Definition
The median is the value of the middle term
in a data set that has been ranked in
increasing order.
Median cont.
The calculation of the median consists of
the following two steps:
1. Rank the data set in increasing order.
2. Find the middle term in a data set with n
values. The value of this term is the median.
Value of Median for Ungrouped Data
$\text{Median} = \text{value of the } \left(\frac{n+1}{2}\right)\text{th term in a ranked data set}$
Example 3-4
The following data give the weight lost (in
pounds) by a sample of five members of a
health club at the end of two months of
membership:
10 5 19 8 3
Find the median.
Solution 3-4
First, we rank the given data in increasing
order as follows:
3 5 8 10 19
There are five observations in the data set.
Consequently, n = 5 and
$\text{Position of the middle term} = \frac{n+1}{2} = \frac{5+1}{2} = 3$
Therefore, the median is the value of the
third term in the ranked data, which is 8.
The median weight loss for this sample of
five members of this health club is 8
pounds.
Example 3-5
Table 3.3 lists the total revenue for the 12
top-grossing North American concert tours
of all time.
Find the median revenue for these data.
Table 3.3
Tour                         Artist                   Total Revenue (millions of dollars)
Steel Wheels, 1989           The Rolling Stones       98.0
Magic Summer, 1990           New Kids on the Block    74.1
Voodoo Lounge, 1994          The Rolling Stones       121.2
The Division Bell, 1994      Pink Floyd               103.5
Hell Freezes Over, 1994      The Eagles               79.4
Bridges to Babylon, 1997     The Rolling Stones       89.3
Popmart, 1997                U2                       79.9
Twenty-Four Seven, 2000      Tina Turner              80.2
No Strings Attached, 2000    ‘N-Sync                  76.4
Elevation, 2001              U2                       109.7
Popodyssey, 2001             ‘N-Sync                  86.8
Black and Blue, 2001         The Backstreet Boys      82.1
Solution 3-5
First we rank the given data in increasing order, as
follows:
74.1 76.4 79.4 79.9 80.2 82.1 86.8 89.3 98.0 103.5 109.7 121.2
There are 12 values in this data set. Hence, n = 12
and
$\text{Position of the middle term} = \frac{n+1}{2} = \frac{12+1}{2} = 6.5$
Therefore, the median is given by the mean of the sixth and
the seventh values in the ranked data.
74.1 76.4 79.4 79.9 80.2 82.1 86.8 89.3 98.0 103.5 109.7 121.2
$\text{Median} = \frac{82.1 + 86.8}{2} = 84.45 = \$84.45 \text{ million}$
Thus the median revenue for the 12 top-grossing North
American concert tours of all time is $84.45 million.
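The two-step procedure (rank the data, then locate the (n + 1)/2-th term) can be sketched in Python as follows. The function name `median` and the handling of a fractional position by averaging the two neighbouring terms follow Solutions 3-4 and 3-5; the code itself is illustrative only.

```python
def median(values):
    """Median via the (n + 1) / 2 position rule for ranked data."""
    ranked = sorted(values)              # step 1: rank in increasing order
    n = len(ranked)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    position = (n + 1) / 2               # step 2: 1-based position of the middle term
    lower = ranked[int(position) - 1]    # term at the whole-number part of the position
    if position == int(position):        # odd n: the middle term itself
        return lower
    upper = ranked[int(position)]        # even n: average the two middle terms
    return (lower + upper) / 2


weight_loss = [10, 5, 19, 8, 3]          # Example 3-4
revenues = [98.0, 74.1, 121.2, 103.5, 79.4, 89.3,
            79.9, 80.2, 76.4, 109.7, 86.8, 82.1]  # Table 3.3
print(median(weight_loss))
print(round(median(revenues), 2))
```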
Median cont.
The median gives the center of a histogram,
with half the data values to the left of the
median and half to the right of the median.
The advantage of using the median as a
measure of central tendency is that it is not
influenced by outliers. Consequently, the
median is preferred over the mean as a
measure of central tendency for data sets that
contain outliers.
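To see this contrast numerically, the sketch below recomputes the mean and the median of the Pacific-state populations from Table 3.2 with and without California, using Python's standard `statistics` module. The dictionary and variable names are illustrative.

```python
import statistics

# 2000 populations (thousands) of the five Pacific states, Table 3.2
populations = {
    "Washington": 5894,
    "Oregon": 3421,
    "Alaska": 627,
    "Hawaii": 1212,
    "California": 33872,   # the outlier
}

all_states = list(populations.values())
without_outlier = [p for state, p in populations.items() if state != "California"]

mean_without = statistics.mean(without_outlier)
mean_with = statistics.mean(all_states)
median_without = statistics.median(without_outlier)
median_with = statistics.median(all_states)

print("mean:  ", mean_without, "->", mean_with)
print("median:", median_without, "->", median_with)
```

Adding California more than triples the mean, while the median moves only from the average of the two middle values to the single middle value.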
Mode
Definition
The mode is the value that occurs with the
highest frequency in a data set.
Example 3-6
The following data give the speeds (in miles
per hour) of eight cars that were stopped
on I-95 for speeding violations.
77 69 74 81 71 68 74 73
Find the mode.
Solution 3-6
In this data set, 74 occurs twice and each of
the remaining values occurs only once.
Because 74 occurs with the highest
frequency, it is the mode. Therefore,
Mode = 74 miles per hour
Mode cont.

A data set may have no mode, one mode, or
more than one mode, whereas it will have
only one mean and only one median.
A data set with only one mode is called
unimodal.
A data set with two modes is called
bimodal.
A data set with more than two modes is
called multimodal.
Example 3-7
Last year’s incomes of five randomly
selected families were $36,150, $95,750,
$54,985, $77,490, and $23,740. Find the
mode.
Solution 3-7
Because each value in this data set occurs
only once, this data set contains no mode.
Example 3-8
The prices of the same brand of television
set at eight stores are found to be $495,
$486, $503, $495, $470, $505, $470 and
$499. Find the mode.
Solution 3-8
In this data set, each of the two values $495
and $470 occurs twice and each of the
remaining values occurs only once.
Therefore, this data set has two modes:
$495 and $470.
Example 3-9
The ages of 10 randomly selected students
from a class are 21, 19, 27, 22, 29, 19, 25,
21, 22 and 30. Find the mode.
Solution 3-9
This data set has three modes: 19, 21 and
22. Each of these three values occurs with
a (highest) frequency of 2.
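Examples 3-6 through 3-9 can be reproduced with a short frequency count. The sketch below assumes the convention used in Solution 3-7, namely that a data set in which every value occurs only once has no mode; the function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return all values with the highest frequency (empty list if no mode)."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                       # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


speeds = [77, 69, 74, 81, 71, 68, 74, 73]                     # Example 3-6
incomes = [36150, 95750, 54985, 77490, 23740]                 # Example 3-7
prices = [495, 486, 503, 495, 470, 505, 470, 499]             # Example 3-8
ages = [21, 19, 27, 22, 29, 19, 25, 21, 22, 30]               # Example 3-9

for data in (speeds, incomes, prices, ages):
    print(modes(data))
```

Because the function only counts occurrences, it also works for qualitative data such as a list of class standings.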
Mode cont.
One advantage of the mode is that it can be
calculated for both kinds of data,
quantitative and qualitative, whereas the
mean and median can be calculated for
only quantitative data.
Example 3-10
The status of five students who are
members of the student senate at a college
are senior, sophomore, senior, junior, senior.
Find the mode.
Solution 3-10
Because senior occurs more frequently than
the other categories, it is the mode for this
data set.
We cannot calculate the mean and median
for this data set.
Relationships among the Mean, Median, and Mode
1. For a symmetric histogram and frequency
curve with one peak (Figure 3.2), the
values of the mean, median, and mode
are identical, and they lie at the center of
the distribution.
Figure 3.2 Mean, median, and mode for a symmetric
histogram and frequency curve.
2. For a histogram and a frequency curve skewed
to the right (Figure 3.3), the value of the mean
is the largest, that of the mode is the smallest,
and the value of the median lies between these
two. Notice that the mode always occurs at the
peak point. The value of the mean is the largest
in this case because it is sensitive to outliers
that occur in the right tail. These outliers pull
the mean to the right.
Figure 3.3
Mean, median, and mode for a histogram
and frequency curve skewed to the right.
3. If a histogram and a frequency curve are
skewed to the left (Figure 3.4), the value
of the mean is the smallest and that of the
mode is the largest, with the value of the
median lying between these two. In this case,
the outliers in the left tail pull the mean to
the left.
Figure 3.4
Mean, median, and mode for a histogram
and frequency curve skewed to the left.
MEASURES OF DISPERSION
FOR UNGROUPED DATA



Range
Variance and Standard Deviation
Population Parameters and Sample
Statistics
Range
Finding Range for Ungrouped Data
Range = Largest value – Smallest Value
Example 3-11
Table 3.4 gives the total areas in square
miles of the four western South-Central
states of the United States.
Find the range for this data set.
Table 3.4
State        Total Area (square miles)
Arkansas     53,182
Louisiana    49,651
Oklahoma     69,903
Texas        267,277
Solution 3-11
Range = Largest value – Smallest Value
= 267,277 – 49,651
= 217,626 square miles
Thus, the total areas of these four states
are spread over a range of 217,626 square
miles.
Range cont.
Disadvantages
• The range, like the mean, has the
disadvantage of being influenced by
outliers.
• Its calculation is based on two values only:
the largest and the smallest.
Variance and Standard Deviation


The standard deviation is the most used
measure of dispersion.
The value of the standard deviation tells
how closely the values of a data set are
clustered around the mean.
Variance and Standard Deviation
cont.


In general, a lower value of the standard
deviation for a data set indicates that the
values of that data set are spread over a
relatively smaller range around the mean.
In contrast, a large value of the standard
deviation for a data set indicates that the
values of that data set are spread over a
relatively large range around the mean.
Variance and Standard Deviation
cont.
The variance calculated for population data
is denoted by σ² (read as sigma squared),
and the variance calculated for sample data
is denoted by s².
The standard deviation calculated for
population data is denoted by σ, and the
standard deviation calculated for sample data
is denoted by s.
Table 3.5
x      x – x̄
82     82 – 84 = –2
95     95 – 84 = +11
67     67 – 84 = –17
92     92 – 84 = +8
       ∑(x – x̄) = 0
Variance and Standard Deviation cont.
Short-cut Formulas for the Variance and Standard
Deviation for Ungrouped Data
$\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$ and $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$
where σ² is the population variance and s² is the
sample variance.
The standard deviation is obtained by taking
the positive square root of the variance.
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Example 3-12
Refer to data in Table 3.1 on the 2002 total
payroll (in millions of dollars) of five MLB
teams.
Find the variance and standard deviation of
these data.
Solution 3-12
Table 3.6
x            x²
62           3844
93           8649
126          15,876
75           5625
34           1156
∑x = 390     ∑x² = 35,150
$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{35{,}150 - \frac{(390)^2}{5}}{5 - 1} = \frac{35{,}150 - 30{,}420}{4} = 1182.50$
$s = \sqrt{1182.50} = 34.387498 \text{ million} = \$34{,}387{,}498$
Thus, the standard deviation of the 2002
payrolls of these five MLB teams is
$34,387,498.
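The short-cut formulas translate directly into code. The following is a minimal sketch that mirrors the formulas term by term; the function name `variance` and its `population` flag are illustrative choices, not notation from the slides.

```python
import math


def variance(values, population=False):
    """Short-cut formula: (sum of x^2 - (sum of x)^2 / n) / (n or n - 1)."""
    n = len(values)
    if n < 1 or (n < 2 and not population):
        raise ValueError("not enough values to compute the variance")
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    numerator = sum_x2 - sum_x ** 2 / n
    divisor = n if population else n - 1   # N for a population, n - 1 for a sample
    return numerator / divisor


payrolls = [62, 93, 126, 75, 34]           # sample data, Table 3.1 (millions of dollars)
s2 = variance(payrolls)
s = math.sqrt(s2)
print(s2, round(s, 6))
```

With floating-point data the short-cut form subtracts two large, nearly equal numbers, so for data with a large mean and small spread the definitional form based on deviations from the mean is usually the safer choice in software.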
Two Observations
1. The values of the variance and the
standard deviation are never negative.
2. The measurement units of variance are
always the square of the measurement
units of the original data.
Example 3-13
The following data are the 2002 earnings
(in thousands of dollars) before taxes for all
six employees of a small company.
48.50 38.40 65.50
22.60 79.80 54.60
Calculate the variance and standard
deviation for these data.
Solution 3-13
Table 3.7
x              x²
48.50          2352.25
38.40          1474.56
65.50          4290.25
22.60          510.76
79.80          6368.04
54.60          2981.16
∑x = 309.40    ∑x² = 17,977.02
$\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} = \frac{17{,}977.02 - \frac{(309.40)^2}{6}}{6} = 337.0489$
$\sigma = \sqrt{337.0489} = 18.359 \text{ thousand} = \$18{,}359$
Thus, the standard deviation of the 2002
earnings of all six employees of this
company is $18,359.
Population Parameters and
Sample Statistics


A numerical measure such as the mean,
median, mode, range, variance, or
standard deviation calculated for a
population data set is called a population
parameter, or simply a parameter.
A summary measure calculated for a
sample data set is called a sample statistic,
or simply a statistic.
MEAN, VARIANCE AND STANDARD
DEVIATION FOR GROUPED DATA


Mean for Grouped Data
Variance and Standard Deviation for
Grouped Data
Mean for Grouped Data
Calculating Mean for Grouped Data
Mean for population data: $\mu = \frac{\sum mf}{N}$
Mean for sample data: $\bar{x} = \frac{\sum mf}{n}$
where m is the midpoint and f is the frequency of
a class.
Example 3-14
Table 3.8 gives the frequency distribution of
the daily commuting times (in minutes) from
home to work for all 25 employees of a
company.
Calculate the mean of the daily commuting
times.
Table 3.8
Daily Commuting Time (minutes)    Number of Employees
0 to less than 10                 4
10 to less than 20                9
20 to less than 30                6
30 to less than 40                4
40 to less than 50                2
Solution 3-14
Table 3.9
Daily Commuting Time (minutes)    f         m     mf
0 to less than 10                 4         5     20
10 to less than 20                9         15    135
20 to less than 30                6         25    150
30 to less than 40                4         35    140
40 to less than 50                2         45    90
                                  N = 25          ∑mf = 535
$\mu = \frac{\sum mf}{N} = \frac{535}{25} = 21.40 \text{ minutes}$
Thus, the employees of this company spend
an average of 21.40 minutes a day
commuting from home to work.
Example 3-15
Table 3.10 gives the frequency distribution of
the number of orders received each day
during the past 50 days at the office of a
mail-order company.
Calculate the mean.
Table 3.10
Number of Orders    Number of Days
10 – 12             4
13 – 15             12
16 – 18             20
19 – 21             14
Solution 3-15
Table 3.11
Number of Orders    f         m     mf
10 – 12             4         11    44
13 – 15             12        14    168
16 – 18             20        17    340
19 – 21             14        20    280
                    n = 50          ∑mf = 832
$\bar{x} = \frac{\sum mf}{n} = \frac{832}{50} = 16.64 \text{ orders}$
Thus, this mail-order company received an
average of 16.64 orders per day during
these 50 days.
Variance and Standard Deviation
for Grouped Data
Short-Cut Formulas for the Variance and
Standard Deviation for Grouped Data
$\sigma^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{N}}{N}$ and $s^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n - 1}$
where σ² is the population variance, s² is
the sample variance, and m is the midpoint
of a class.
The standard deviation is obtained by taking
the positive square root of the variance.
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Example 3-16
Table 3.8 gives the frequency distribution of
the daily commuting times (in minutes) from
home to work for all 25 employees of a
company.
Calculate the variance and standard
deviation.
Solution 3-16
Table 3.12
Daily Commuting Time (minutes)    f         m     mf           m²f
0 to less than 10                 4         5     20           100
10 to less than 20                9         15    135          2025
20 to less than 30                6         25    150          3750
30 to less than 40                4         35    140          4900
40 to less than 50                2         45    90           4050
                                  N = 25          ∑mf = 535    ∑m²f = 14,825
$\sigma^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{N}}{N} = \frac{14{,}825 - \frac{(535)^2}{25}}{25} = \frac{3376}{25} = 135.04$
Hence, the standard deviation is
$\sigma = \sqrt{\sigma^2} = \sqrt{135.04} = 11.62 \text{ minutes}$
Thus, the standard deviation of the daily
commuting times for these employees is
11.62 minutes.
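The grouped-data calculations follow one pattern: take each class midpoint m, weight it by the class frequency f, and apply the short-cut formulas. A minimal sketch is given below; the function name `grouped_stats` and the representation of a class as a `(lower, upper, frequency)` tuple are assumptions made for illustration.

```python
import math


def grouped_stats(classes, population=False):
    """Mean, variance, and standard deviation from (lower, upper, frequency) classes."""
    n = sum(f for _, _, f in classes)
    if n < 2:
        raise ValueError("need at least two observations")
    sum_mf = 0.0
    sum_m2f = 0.0
    for lower, upper, f in classes:
        m = (lower + upper) / 2            # class midpoint
        sum_mf += m * f
        sum_m2f += m * m * f
    mean = sum_mf / n
    divisor = n if population else n - 1
    var = (sum_m2f - sum_mf ** 2 / n) / divisor
    return mean, var, math.sqrt(var)


# Table 3.8: commuting times for all 25 employees (population data)
commuting = [(0, 10, 4), (10, 20, 9), (20, 30, 6), (30, 40, 4), (40, 50, 2)]
mu, sigma2, sigma = grouped_stats(commuting, population=True)
print(mu, sigma2, round(sigma, 2))
```

Calling the same function with `population=False` on the order counts of Table 3.10 gives the sample versions of these measures.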
Example 3-17
Table 3.10 gives the frequency distribution of
the number of orders received each day
during the past 50 days at the office of a
mail-order company.
Calculate the variance and standard
deviation.
Solution 3-17
Table 3.13
Number of Orders    f         m     mf           m²f
10 – 12             4         11    44           484
13 – 15             12        14    168          2352
16 – 18             20        17    340          5780
19 – 21             14        20    280          5600
                    n = 50          ∑mf = 832    ∑m²f = 14,216
$s^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n - 1} = \frac{14{,}216 - \frac{(832)^2}{50}}{50 - 1} = 7.5820$
Hence, the standard deviation is
$s = \sqrt{s^2} = \sqrt{7.5820} = 2.75 \text{ orders}$
Thus, the standard deviation of the number of
orders received at the office of this mail-order
company during the past 50 days is 2.75.
USE OF STANDARD
DEVIATION


Chebyshev’s Theorem
Empirical Rule
Chebyshev’s Theorem
Definition
For any number k greater than 1, at least
(1 – 1/k²) of the data values lie within k
standard deviations of the mean.
Figure 3.5 Chebyshev’s theorem. At least 1 – 1/k² of the values lie in the shaded area between μ – kσ and μ + kσ.
Figure 3.6 Percentage of values within two standard deviations of the mean for Chebyshev’s theorem. At least 75% of the values lie in the shaded area between μ – 2σ and μ + 2σ.
Figure 3.7 Percentage of values within three standard deviations of the mean for Chebyshev’s theorem. At least 89% of the values lie in the shaded area between μ – 3σ and μ + 3σ.
Example 3-18
The average systolic blood pressure for
4000 women who were screened for high
blood pressure was found to be 187 with a
standard deviation of 22. Using Chebyshev’s
theorem, find at least what percentage of
women in this group have a systolic blood
pressure between 143 and 231.
Solution 3-18
Let μ and σ be the mean and the standard
deviation, respectively, of the systolic blood
pressures of these women.
μ = 187 and σ = 22
The two endpoints are equally far from the mean:
143 – 187 = –44 and 231 – 187 = 44
The value of k is obtained by dividing the
distance between the mean and each point
by the standard deviation. Thus
k = 44/22 = 2
$1 - \frac{1}{k^2} = 1 - \frac{1}{(2)^2} = 1 - \frac{1}{4} = 1 - .25 = .75$ or 75%
Hence, according to Chebyshev's theorem,
at least 75% of the women have systolic
blood pressure between 143 and 231. This
percentage is shown in Figure 3.8.
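The steps of Solution 3-18 can be sketched as a small function: find k from the distance between the mean and the endpoints, then evaluate 1 – 1/k². The function name `chebyshev_lower_bound` is illustrative, and the sketch assumes, as in the example, that the interval is symmetric about the mean.

```python
def chebyshev_lower_bound(mean, sd, low, high):
    """Minimum proportion of values in [low, high] by Chebyshev's theorem."""
    if abs((mean - low) - (high - mean)) > 1e-9:
        raise ValueError("the interval must be symmetric about the mean")
    k = (high - mean) / sd               # number of standard deviations
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


# Example 3-18: mean 187, standard deviation 22, interval 143 to 231
bound = chebyshev_lower_bound(187, 22, 143, 231)
print(f"at least {bound:.0%}")
```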
Figure 3.8 Percentage of women with systolic blood pressure between 143 and 231. At least 75% of the women have systolic blood pressure between 143 (μ – 2σ) and 231 (μ + 2σ), with μ = 187.
Empirical Rule
For a bell-shaped distribution, approximately
1. 68% of the observations lie within one
standard deviation of the mean
2. 95% of the observations lie within two
standard deviations of the mean
3. 99.7% of the observations lie within three
standard deviations of the mean
Figure 3.9 Illustration of the empirical rule: 68% of the observations lie between μ – σ and μ + σ, 95% between μ – 2σ and μ + 2σ, and 99.7% between μ – 3σ and μ + 3σ.
Example 3-19
The age distribution of a sample of 5000
persons is bell-shaped with a mean of 40
years and a standard deviation of 12 years.
Determine the approximate percentage of
people who are 16 to 64 years old.
Solution 3-19
From the given information, for this distribution,
x̄ = 40 years and s = 12 years
Each of the two points, 16 and 64, is 24 units
away from the mean, that is, two standard
deviations (24/12 = 2).
Because the area within two standard deviations
of the mean is approximately 95% for a bell-shaped
curve, approximately 95% of the people in
the sample are 16 to 64 years old.
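The lookup in Solution 3-19 can be written as a small sketch: compute how many standard deviations the endpoints lie from the mean and read off the corresponding empirical-rule percentage. The names `EMPIRICAL_RULE` and `empirical_percentage` are illustrative, and the sketch only handles intervals that are exactly one, two, or three standard deviations wide on each side.

```python
# Approximate percentages for a bell-shaped distribution
EMPIRICAL_RULE = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_percentage(mean, sd, low, high):
    """Approximate percentage of observations in [low, high] (bell-shaped data)."""
    k_low = (mean - low) / sd
    k_high = (high - mean) / sd
    if abs(k_low - k_high) > 1e-9:
        raise ValueError("the interval must be symmetric about the mean")
    k = round(k_high)
    if abs(k - k_high) > 1e-9 or k not in EMPIRICAL_RULE:
        raise ValueError("the empirical rule covers only 1, 2, or 3 standard deviations")
    return EMPIRICAL_RULE[k]


# Example 3-19: mean 40 years, standard deviation 12 years, ages 16 to 64
print(empirical_percentage(40, 12, 16, 64))
```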
Figure 3.10 Percentage of people who are 16 to 64 years old. The ages 16 and 64 lie at x̄ – 2s and x̄ + 2s, since 16 – 40 = –24 = –2s and 64 – 40 = 24 = 2s, with x̄ = 40.
MEASURES OF POSITION


Quartiles and Interquartile Range
Percentiles and Percentile Rank
Quartiles and Interquartile
Range
Definition
Quartiles are three summary measures that divide
a ranked data set into four equal parts. The second
quartile is the same as the median of a data set.
The first quartile is the value of the middle term
among the observations that are less than the
median, and the third quartile is the value of the
middle term among the observations that are
greater than the median.
Figure 3.11 Quartiles. Q1, Q2, and Q3 divide a data set arranged in increasing order into four portions, each containing 25% of the observations.
Calculating Interquartile Range
The difference between the third and first
quartiles gives the interquartile range;
that is,
IQR = Interquartile range = Q3 – Q1
Example 3-20
Refer to Table 3.3 in Example 3-5 that lists
the total revenues for the 12 top-grossing
North American concert tours of all time.
a) Find the values of the three quartiles.
Where does the revenue of $103.5 million
fall in relation to these quartiles?
b) Find the interquartile range.
Solution 3-20
a)
Values less than the median: 74.1 76.4 79.4 79.9 80.2 82.1
Values greater than the median: 86.8 89.3 98.0 103.5 109.7 121.2
$Q_1 = \frac{79.4 + 79.9}{2} = 79.65$
$Q_2 = \frac{82.1 + 86.8}{2} = 84.45$ (also the median)
$Q_3 = \frac{98.0 + 103.5}{2} = 100.75$
By looking at the position of $103.5 million, we
can state that this value lies in the top 25% of
the revenues.
b)
IQR = Interquartile range = Q3 – Q1
= 100.75 – 79.65
= $21.10 million
Example 3-21
The following are the ages of nine
employees of an insurance company:
47 28 39 51 33 37 59 24 33
a) Find the values of the three quartiles.
Where does the age of 28 fall in relation to
the ages of the employees?
b) Find the interquartile range.
Solution 3-21
a)
Values less than the median: 24 28 33 33
Median: 37
Values greater than the median: 39 47 51 59
$Q_1 = \frac{28 + 33}{2} = 30.5$
$Q_2 = 37$
$Q_3 = \frac{47 + 51}{2} = 49$
The age of 28 falls in the lowest 25% of
the ages.
b)
IQR = Interquartile range = Q3 – Q1
= 49 – 30.5
= 18.5 years
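The quartile method used in Solutions 3-20 and 3-21 (median of the lower half, median of the whole set, median of the upper half) can be sketched as follows. The function name `quartiles` is illustrative. Note that the halves are split by position in the ranked data, leaving out the middle term when n is odd; statistical software often uses other quartile conventions and may give slightly different values.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ranked = sorted(values)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ranked[:half]                  # terms below the median position
    upper = ranked[half + n % 2:]          # terms above it (skip the middle term if n is odd)
    return (statistics.median(lower),
            statistics.median(ranked),
            statistics.median(upper))


ages = [47, 28, 39, 51, 33, 37, 59, 24, 33]      # Example 3-21
q1, q2, q3 = quartiles(ages)
print(q1, q2, q3, "IQR =", q3 - q1)
```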
Percentiles and Percentile
Rank
Figure 3.12 Percentiles. P1, P2, ..., P99 divide a data set arranged in increasing order into 100 portions, each containing 1% of the observations.
Calculating Percentiles
The (approximate) value of the kth percentile,
denoted by Pk, is
$P_k = \text{value of the } \left(\frac{kn}{100}\right)\text{th term in a ranked data set}$
where k denotes the number of the percentile and
n represents the sample size.
Example 3-22
Refer to the data on revenues for the 12
top-grossing North American concert tours
of all time given in Example 3-20. Find the
value of the 42nd percentile. Give a brief
interpretation of the 42nd percentile.
Solution 3-22
The data arranged in increasing order are as follows:
74.1 76.4 79.4 79.9 80.2 82.1
86.8 89.3 98.0 103.5 109.7 121.2
The position of the 42nd percentile is
$\frac{kn}{100} = \frac{(42)(12)}{100} = 5.04\text{th term}$
The value of the 5.04th term can be
approximated by the value of the fifth term
in the ranked data. Therefore,
P42 = 42nd percentile = 80.2 = $80.2 million
Thus, approximately 42% of the revenues in
the given data are equal to or less than
$80.2 million and 58% are greater than
$80.2 million.
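A minimal sketch of this approximation is shown below. The slides only show that the 5.04th term is approximated by the fifth term; rounding the position kn/100 to the nearest whole term (and keeping it between 1 and n) is an assumption made here, and the function name `approximate_percentile` is illustrative.

```python
def approximate_percentile(values, k):
    """Approximate kth percentile: value of the (k * n / 100)th ranked term."""
    if not 0 < k < 100:
        raise ValueError("k must be between 0 and 100")
    ranked = sorted(values)
    n = len(ranked)
    position = k * n / 100
    term = int(position + 0.5)           # assumption: round to the nearest whole term
    term = min(max(term, 1), n)          # keep the term inside the data set
    return ranked[term - 1]


revenues = [98.0, 74.1, 121.2, 103.5, 79.4, 89.3,
            79.9, 80.2, 76.4, 109.7, 86.8, 82.1]   # Table 3.3
print(approximate_percentile(revenues, 42))
```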
Percentiles and Percentile Rank cont.
Finding Percentile Rank of a Value
$\text{Percentile rank of } x_i = \frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100$
Example 3-23
Refer to the data on revenues for the 12
top-grossing North American concert tours
of all time given in Example 3-20. Find the
percentile rank for the revenue of $89.3
million. Give a brief interpretation of this
percentile rank.
Solution 3-23
The data on revenues arranged in increasing order
are as follows:
74.1 76.4 79.4 79.9 80.2 82.1
86.8 89.3 98.0 103.5 109.7 121.2
In this data set, 7 of the 12 revenues are less than
$89.3 million. Hence,
$\text{Percentile rank of } 89.3 = \frac{7}{12} \times 100 = 58.33\%$
Rounding this answer to the nearest integral
value, we can state that about 58% of the
revenues are less than $89.3 million. In
other words, about 58% of these 12 North
American concert tours grossed less than
$89.3 million.
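The percentile-rank formula amounts to counting the values strictly below the value of interest. A minimal sketch, with the illustrative function name `percentile_rank`:

```python
def percentile_rank(values, x):
    """Percentage of values in the data set that are less than x."""
    if not values:
        raise ValueError("the data set is empty")
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100


revenues = [98.0, 74.1, 121.2, 103.5, 79.4, 89.3,
            79.9, 80.2, 76.4, 109.7, 86.8, 82.1]   # Table 3.3
print(round(percentile_rank(revenues, 89.3), 2))
```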
BOX-AND-WHISKER PLOT
Definition
A plot that shows the center, spread, and
skewness of a data set. It is constructed by
drawing a box and two whiskers that use the
median, the first quartile, the third quartile,
and the smallest and the largest values in
the data set between the lower and the
upper inner fences.
Example 3-24
The following data are the incomes (in
thousands of dollars) for a sample of 12
households.
35 29 44 72 34 64 41 50 54 104 39 58
Construct a box-and-whisker plot for these
data.
Solution 3-24
Step 1.
29 34 35 39 41 44 50 54 58 64 72 104
Median = (44 + 50) / 2 = 47
Q1 = (35 + 39) / 2 = 37
Q3 = (58 + 64) / 2 = 61
IQR = Q3 – Q1 = 61 – 37 = 24
Step 2.
1.5 × IQR = 1.5 × 24 = 36
Lower inner fence = Q1 – 36 = 37 – 36 = 1
Upper inner fence = Q3 + 36 = 61 + 36 = 97
Step 3.
Smallest value within the two inner fences = 29
Largest value within the two inner fences = 72
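Steps 2 and 3 can be sketched in code as follows. The quartiles are taken as given from Step 1; the function name `fences_and_whiskers` is illustrative, and values outside the inner fences are reported separately so they can be marked on the plot.

```python
def fences_and_whiskers(values, q1, q3):
    """Inner fences, whisker end points, and values outside the fences."""
    step = 1.5 * (q3 - q1)                       # 1.5 x IQR
    lower_fence, upper_fence = q1 - step, q3 + step
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outside = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return (lower_fence, upper_fence), (min(inside), max(inside)), outside


# Example 3-24: incomes (thousands of dollars), with Q1 = 37 and Q3 = 61 from Step 1
incomes = [35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58]
fences, whiskers, outside = fences_and_whiskers(incomes, q1=37, q3=61)
print(fences, whiskers, outside)
```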
Step 4.
Figure 3.13 (figure not reproduced): a box drawn above an Income axis running from 25 to 105, with its sides at the first quartile and the third quartile and a line inside it at the median.
Step 5.
Figure 3.14 (figure not reproduced): the box from Figure 3.13 with two whiskers drawn to the smallest and the largest values within the two inner fences, and one value marked as an outlier, on an Income axis running from 25 to 105.