Business Statistics, Can. ed.
By Black, Chakrapani & Castillo
Chapter 3
Descriptive Statistics
Prepared by Dr. Clarence S. Bayne
JMSB, Concordia University
Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd.
Learning Objectives
• How to describe data and transform data to provide information
• Define and quantify concepts of central tendency, variability, shape, and association.
• Understand and interpret the mean, median, mode, percentiles (including quartiles), range, variance, and standard deviation.
• Compute the mean, median, mode, percentiles (and quartiles), range, mean absolute deviation, variance, and standard deviation using ungrouped data.
• Differentiate between sample and population variance and standard deviation.
Learning Objectives -- Continued
• Use grouped data to compute the mean, mode, standard deviation, and variance.
• Use the empirical rule and Chebyshev’s theorem to understand probability distributions.
• Understand the meaning of the standard deviation in the context of the empirical rule and Chebyshev’s theorem.
• Understand skewness as a description of the shape of a distribution.
• Use the box and whisker plot to describe data.
Measures of Central Tendency:
Ungrouped Data
• Ungrouped data are any array of numbers that has not been summarized by statistical techniques.
• Measures of central tendency reveal information about the values at the center, or middle part, of a group of numbers (or ordered array).
• Common measures of central tendency are the:
– Mode
– Median
– Mean
– Percentiles
– Quartiles
The Mode
• The mode is the value that occurs most frequently in the data or array.
• This conceptualization of the mode applies to all levels of data measurement.
• Unimodal: describes data sets with a single mode
• Bimodal: describes data sets that have two modes
• Multimodal: describes data sets that contain more than two modes
Example of the Mode
• The arrangement of the numbers in the frame below is nonspecific and represents an array.
• 44 is the data value that occurs most frequently (5 times).
• The mode is 44.
35  41  44  45
37  41  44  46
37  43  44  46
39  43  44  46
40  43  44  46
40  43  45  48
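The tally behind this slide can be reproduced with a short sketch. The helper name `modes` is an assumption made for illustration; it returns every value that shares the highest frequency, so it also covers the bimodal and multimodal cases.

```python
from collections import Counter

# The 24 values from the mode example above.
data = [35, 41, 44, 45, 37, 41, 44, 46, 37, 43, 44, 46,
        39, 43, 44, 46, 40, 43, 44, 46, 40, 43, 45, 48]


def modes(values):
    """Return (sorted list of modal values, their shared frequency)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top), top


mode_values, frequency = modes(data)
print(mode_values, frequency)  # [44] 5
```

A list with more than one entry in `mode_values` would indicate a bimodal or multimodal data set.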
The Median
• The median is the middle value in an ordered array of numbers
• The median is unaffected by extremely large and extremely small values in the data set (array).
Computing the Median

First Procedure
– Arrange the observations in an ordered array.
– If there is an odd number of observations, the median is the observation located at the middle of the ordered array.

Second Procedure
– If there is an even number of observations, the median is the average of the two middle observations.

General Procedure
– The median’s position in an ordered array is given by (n+1)/2.
Median: Example with an Odd Number of Terms
• Let X be an ordered array with the following values:
{3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22}
• There are 17 elements in the ordered array.
• Position of median = (n+1)/2 = (17+1)/2 = 9th position
• Counting from left to right, the median is 15.
• Extreme values do not distort the median.
• If 22 (the maximum) is replaced by 100, the median is still 15.
• If 3 (the minimum) is replaced by ‐103, the median is still 15.
Median: Example with an Even Number of Terms
• Let X be an ordered array with the following values:
{3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21}
• There are 16 terms in the ordered array.
• Position of median = (n+1)/2 = (16+1)/2 = 8.5th position
• That is, the median lies halfway between the observations in the 8th and 9th positions of the ordered array.
• The median is 14 + 0.5(15‐14) = 14.5 or, simply, (14+15)/2 = 14.5
• If the 21 is replaced by 100, the median is still 14.5.
• If the 3 is replaced by ‐88, the median is still 14.5.
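The (n+1)/2 position rule from the two median examples can be sketched as follows. The function name `median` and the variable names are illustrative assumptions, and the sketch assumes a non-empty list.

```python
def median(values):
    """Median via the (n + 1) / 2 position rule on the ordered array."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2              # 1-based position; ends in .5 when n is even
    lower = ordered[int(position) - 1]  # observation at the whole-number position
    if position.is_integer():           # odd n: a single middle observation
        return lower
    upper = ordered[int(position)]      # even n: the next observation up
    return (lower + upper) / 2


odd = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
even = odd[:-1]  # the 16-term array from the second example

print(median(odd))                  # 15
print(median(even))                 # 14.5
print(median(odd[:-1] + [100]))     # 15 (maximum replaced by 100)
print(median([-88] + even[1:]))     # 14.5 (minimum replaced by -88)
```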
The Arithmetic Mean
• The arithmetic mean is commonly called ‘the mean’
• It is the average of a group of numbers
• The mean is computed by summing all values in the data set and dividing the sum by the number of values in the data set
• Thus, its value is affected by each value in the data set, including extreme values
Application of Arithmetic Mean in Statistics
• The arithmetic mean is used as a summary statistic of central tendency in data produced by business and economic processes.
• When used in these settings it is important to make the distinction between
– The population mean: µ
– The sample mean: X̄
• The population mean is based on measurements of all the possible outcomes of a process.
• The sample mean is based on some of the observed outcomes making up the population.
Population Mean
$$\mu = \frac{\sum X}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N}$$
$$\mu = \frac{24 + 13 + 19 + 26 + 11}{5} = \frac{93}{5} = 18.6$$
Sample Mean
$$\bar{X} = \frac{\sum X}{n} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$$
$$\bar{X} = \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} = 63.167$$
Impact of Extreme Values on the
Mean
• The mean is the most commonly used measure of central tendency because of its mathematical properties and because it uses all the data points in the data set.
• However, the mean is affected by extremely large or extremely small numbers.
• Note that for the sample mean example, if the last number, 66, is replaced by 1 000, the mean becomes 218.833 as opposed to 63.167.
• If the first number, 57, is replaced by 5, the mean becomes 54.5 as opposed to 63.167.
• The distortions are significant in both cases.
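The arithmetic on this slide can be checked with a few lines. This is a minimal sketch; the names `mean` and `sample` are illustrative assumptions.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


sample = [57, 86, 42, 38, 90, 66]
with_large_value = [57, 86, 42, 38, 90, 1000]  # 66 replaced by 1 000
with_small_value = [5, 86, 42, 38, 90, 66]     # 57 replaced by 5

print(round(mean(sample), 3))            # 63.167
print(round(mean(with_large_value), 3))  # 218.833
print(round(mean(with_small_value), 3))  # 54.5
```

A single replaced value moves the mean substantially in both cases, whereas the median of the same six numbers would change far less.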
Percentiles
• In general, percentiles are not influenced by extreme values in the data set.
• Percentiles are measures of central tendency that divide a group of data into 100 parts
• The nth percentile: at least n% of the data lie below the nth percentile, and at most (100 ‐ n)% of the data lie above the nth percentile
• For example: the 90th percentile is a value such that at least 90% of the data lie below it, and at most (no more than) 10% of the data lie above it
• The median is, de facto, the 50th percentile and has the same value.
• Percentiles are stair-step values: the 88th and 89th percentiles have no values between them.
Percentiles are Stair Steps
• Percentiles are discrete values that serve to separate lower values from upper values in the data set.
• A percentile indicates the proportion of observations that have values below it, and it is the lower bound for the complementary proportion of observations with values above it.
• Thus, graphically, a percentile represents a single point at which there is a step up from the proportion of observations below it to the proportion of observations above it.
• Note that there are no values between the 87th and 88th percentiles.
[Figure: Stair Step Percentiles]
Percentiles: Computational Procedure
• Organize the data into an ascending ordered array.
• Calculate the percentile location index using:
$$i = \frac{P}{100}(n)$$
where P = percentile, i = percentile location, n = sample size
• Search the ordered array counting from left to right to find where the percentile is located and determine its value.
• If i is a whole number, the percentile is the average of the values in the ith and (i + 1)th positions.
• If i is not a whole number, the percentile is at the whole number part of (i + 1) in the ordered array.
Calculating Percentiles: An Example
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Problem: Find 30th percentile
• Number of observations n=8
• Location of 30th percentile: $i = \frac{30}{100}(8) = 2.4$
• The location index, i, is not a whole number.
• Therefore take the whole number portion of (i + 1) = 2.4 + 1 = 3.4.
• The whole number portion is 3. The 30th percentile is at the 3rd location of the array: 30th percentile = 13.
Quartiles
• Quartiles are measures of central tendency that divide a group of data into four subgroups
• Quartile values are not necessarily members of the data set
– Q1: 25% of the data set is below the first quartile
– Q2: 50% of the data set is below the second quartile
– Q3: 75% of the data set is below the third quartile
• Relationship between quartiles and percentiles
– Q1 is equal to the 25th percentile
– Q2 is located at the 50th percentile and equals the median
– Q3 is equal to the 75th percentile
Calculating Quartiles: An Example
• Let X be an ordered array: X = {106, 109, 114, 116, 121, 122, 125, 129}
• Q1: $i = \frac{25}{100}(8) = 2$, so $Q_1 = \frac{109 + 114}{2} = 111.5$
• Q2: $i = \frac{50}{100}(8) = 4$, so $Q_2 = \frac{116 + 121}{2} = 118.5$
• Q3: $i = \frac{75}{100}(8) = 6$, so $Q_3 = \frac{122 + 125}{2} = 123.5$
• Note that when i is a whole number, the quartile is the average of the ith and (i+1)th values in the ordered set
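The location-index procedure used in the percentile and quartile examples can be sketched in one function. The name `percentile` is an assumption for illustration, and the sketch assumes an integer P strictly between 0 and 100. It follows the slide rule, which can differ from the interpolation rules used by statistical software.

```python
def percentile(values, p):
    """Percentile by the slide rule: location index i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic avoids floating-point doubt about "whole number":
    # i = whole + remainder / 100.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # i is a whole number: average the i-th and (i + 1)-th values.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise use the value at the whole-number part of (i + 1).
    return ordered[whole]


raw = [14, 12, 19, 23, 5, 13, 28, 17]
print(percentile(raw, 30))  # 13

x = [106, 109, 114, 116, 121, 122, 125, 129]
print([percentile(x, p) for p in (25, 50, 75)])  # [111.5, 118.5, 123.5]
```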
Dispersion and Convergence
• Things tend to be alike or dissimilar, or to be associated in some way
• A company’s sales effectiveness would be better understood if one knew by how much sales staff are exceeding, falling short of, or just meeting the company’s historical standards.
• The mean tells us about those staff in the middle, the average performers. But it does not tell us about the differences in performance.
• Measures of dispersion or spread provide the tools that answer these latter questions.
• When used with measures of central tendency they make possible a more complete numerical description of the data
• This variability is most frequently expressed in terms of deviation from the norm or mean. The images in the next slides express this visually
Variability
[Figure: cash flows plotted around the mean. Panels: "No Variability in Cash Flow (same amounts)" and "Variability in Cash Flow (different amounts)".]
Quantitative Indicators of Variability for Ungrouped Data
• Measures of variability describe the spread or the dispersion of a set of data.
• Common measures of variability are:
– Range
– Interquartile Range
– Mean Absolute Deviation
– Variance
– Standard Deviation
– Z scores
– Coefficient of Variation
Range
• The range is the difference between the largest and smallest values in the data set
• Usefulness:
– Simple to compute
• Disadvantages:
– Ignores all data points except extremes
– Influenced by extreme values
– Has no reference point
– Has limited use by itself
• Example of range using the data provided: Range = 48 − 35 = 13
35  41  44  45
37  41  44  46
37  43  44  46
39  43  44  46
40  43  44  46
40  43  45  48
Interquartile Range
• The interquartile range is the distance between the first and third quartiles
• The interval from Q1 to Q3 accounts for the middle 50% of values in the ordered data set
• The interquartile range is especially useful in situations where data users are more interested in values toward the middle and less interested in extremes.
• The interquartile range is less influenced by extremes
$$\text{Interquartile Range} = Q_3 - Q_1$$
Deviation from the Mean
• Data set: 5, 9, 16, 17, 18
• µ = 13
• An examination of deviations from the mean can reveal information about the variability of data.
• However, the individual deviations are used mostly as a tool to compute other measures of variability
• (X ‐ µ) shows the distance of each value from the mean, or individual deviation from the mean: ‐8, ‐4, 3, 4, 5
Mean Absolute Deviation
• Shows the average of the absolute deviations, that is, the tendency for observations to differ on average from the norm for the process or situation.
• Easy to calculate, but statistically less useful than the variance and standard deviation.
Observation   X     X − µ   |X − µ|
1             5     −8      8
2             9     −4      4
3             16    +3      3
4             17    +4      4
5             18    +5      5
Totals        65    0       24
$$M.A.D. = \frac{\sum |X - \mu|}{N} = \frac{24}{5} = 4.8$$
Population Variance
• Average of the squared deviations from the arithmetic mean
• Statistics measured in squared units are problematic to interpret, so it is customary to use the standard deviation
X        X − µ   (X − µ)²
5        −8      64
9        −4      16
16       +3      9
17       +4      16
18       +5      25
Totals   0       130
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$$
Population Standard Deviation
• Square root of the variance
• Easier to interpret in practice
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$$
$$\sigma = \sqrt{\sigma^2} = \sqrt{26.0} = 5.1$$
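The deviation, mean absolute deviation, population variance, and population standard deviation slides all use the same five values, so one short sketch can retrace them. Variable names are illustrative assumptions.

```python
import math

data = [5, 9, 16, 17, 18]
N = len(data)

mu = sum(data) / N                          # population mean: 13.0
deviations = [x - mu for x in data]         # -8, -4, 3, 4, 5 (they sum to 0)

mad = sum(abs(d) for d in deviations) / N   # mean absolute deviation
variance = sum(d ** 2 for d in deviations) / N
sigma = math.sqrt(variance)

print(mu, sum(deviations))   # 13.0 0.0
print(mad)                   # 4.8
print(variance)              # 26.0
print(round(sigma, 1))       # 5.1
```

Because the deviations always sum to zero, they have to be made non-negative (absolute value or squaring) before averaging.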
Sample Variance
• Sum of the squared deviations from the sample mean, divided by n − 1
X        X − X̄   (X − X̄)²
2,398    625     390,625
1,844    71      5,041
1,539    −234    54,756
1,311    −462    213,444
7,092    0       663,866
$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67$$
Sample Standard Deviation
• Square root of the sample variance
• Easier to interpret in practice than squared units.
$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67$$
$$S = \sqrt{S^2} = \sqrt{221{,}288.67} = 470.41$$
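The sample calculation differs from the population one only in the n − 1 divisor. The sketch below retraces the slide and cross-checks the result against Python's standard `statistics` module, whose `variance` and `stdev` functions also use n − 1. Variable names are illustrative assumptions.

```python
import math
import statistics

data = [2398, 1844, 1539, 1311]
n = len(data)

x_bar = sum(data) / n                                # sample mean: 1773.0
sum_squares = sum((x - x_bar) ** 2 for x in data)    # 663866.0
s2 = sum_squares / (n - 1)                           # sample variance
s = math.sqrt(s2)                                    # sample standard deviation

print(x_bar, sum_squares)        # 1773.0 663866.0
print(round(s2, 2))              # 221288.67
print(round(s, 2))               # 470.41

# Cross-check against the standard library (both divide by n - 1).
print(math.isclose(s2, statistics.variance(data)))   # True
print(math.isclose(s, statistics.stdev(data)))       # True
```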
Uses of Standard Deviation
• Indicator of financial risk
• Quality Control
– construction of quality control charts
– process capability studies
• Comparing two or more populations
– household incomes in two cities
– employee absenteeism at two plants
– expressed as a percentage of the mean, it gives the coefficient of variation (CV)
Standard Deviation as an Indicator of Financial Risk
                     Annualized Rate of Return
Financial Security   µ      σ
A                    15%    3%
B                    15%    7%
Symmetric and Asymmetric Distributions
• Data are either symmetric or non‐symmetric with respect to some measure of central tendency
• Statisticians have observed that distributions describing many types of business and economic data tend to be symmetric or have a normal shape
• In practical terms, the processes that generate such data have special properties (the empirical rule) with respect to data concentration.
• Non‐symmetric distributions, in practice and theory, obey at least minimum specified rules with respect to the concentration of data values in a population (Chebyshev’s theorem).
Empirical Rule
When data are normally distributed or approximately normal:
Distance from the Mean   Percentage of Values Falling Within Distance
µ ± 1σ                   68
µ ± 2σ                   95
µ ± 3σ                   99.7
Chebyshev’s Theorem: When Data Are Normally Distributed or Nonsymmetric
• Chebyshev’s theorem applies to all distributions
• It gives the minimum mass or concentration of data that lies within a specified number of standard deviations around the mean.
$$P(\mu - k\sigma < X < \mu + k\sigma) \ge 1 - \frac{1}{k^2} \quad \text{for } k > 1$$
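To see how conservative the Chebyshev bound is next to the empirical rule, the bound can be evaluated for a few values of k. The function name `chebyshev_lower_bound` is an assumption for illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is stated for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 2.5, 3):
    print(k, round(chebyshev_lower_bound(k), 3))
# 2 0.75
# 2.5 0.84
# 3 0.889
```

For k = 2 the theorem guarantees at least 75% of values for any distribution, while the empirical rule gives about 95% when the data are approximately normal.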
Coefficient of Variation
• Ratio of the standard deviation to the mean, expressed as a percentage
• Measurement of relative dispersion
$$CV = \frac{\sigma}{\mu}(100)$$
Coefficient of Variation
$$\mu_1 = 29, \quad \sigma_1 = 4.6$$
$$CV_1 = \frac{\sigma_1}{\mu_1}(100) = \frac{4.6}{29}(100) = 15.86$$
$$\mu_2 = 84, \quad \sigma_2 = 10$$
$$CV_2 = \frac{\sigma_2}{\mu_2}(100) = \frac{10}{84}(100) = 11.90$$
3.2 MEASURES OF VARIABILITY: UNGROUPED DATA
Solution
The researcher computes the mean absolute deviation, the variance, and the standard
deviation for these data in the following manner.
x      |x − x̄|   (x − x̄)²
55     41        1,681
100    4         16
125    29        841
140    44        1,936
60     36        1,296
Σx = 480    Σ|x − x̄| = 154    Σ(x − x̄)² = 5,770
$$\bar{x} = \frac{\sum x}{n} = \frac{480}{5} = 96$$
$$MAD = \frac{\sum |x - \bar{x}|}{n} = \frac{154}{5} = 30.8$$
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{5{,}770}{4} = 1{,}442.5$$
$$s = \sqrt{s^2} = 37.98$$
She then uses computational formulas to solve for s² and s and compares the results.
x      x²
55     3,025
100    10,000
125    15,625
140    19,600
60     3,600
Σx = 480    Σx² = 51,850
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{51{,}850 - \frac{(480)^2}{5}}{4} = \frac{51{,}850 - 46{,}080}{4} = \frac{5{,}770}{4} = 1{,}442.5$$
$$s = \sqrt{1{,}442.5} = 37.98$$
The results are the same. The sample standard deviation obtained by both methods is
37.98, or 38, years.
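The agreement between the definitional and computational formulas in this solution can be confirmed numerically. This is a sketch with illustrative variable names; the data are the five values from the solution.

```python
import math

data = [55, 100, 125, 140, 60]
n = len(data)
x_bar = sum(data) / n                                # 96.0

# Definitional formulas: work from deviations about the mean.
mad = sum(abs(x - x_bar) for x in data) / n          # 154 / 5
s2_definitional = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Computational formula: needs only the sum of x and the sum of x squared.
sum_x = sum(data)                                    # 480
sum_x2 = sum(x ** 2 for x in data)                   # 51850
s2_computational = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(mad)                                           # 30.8
print(s2_definitional, s2_computational)             # 1442.5 1442.5
print(round(math.sqrt(s2_computational), 2))         # 37.98
```

The computational form avoids computing each deviation, which is why it was favored for hand calculation.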
z SCORES
A z score represents the number of standard deviations a value (x) is above or below the
mean of a set of numbers when the data are normally distributed. Using z scores allows a
value's raw distance from the mean to be translated into units of standard deviations.
z Score
$$z = \frac{x - \mu}{\sigma}$$
For samples,
$$z = \frac{x - \bar{x}}{s}$$
If a z score is negative, the raw value (x) is below the mean. If the z score is positive,
the raw value (x) is above the mean.
For example, for a data set that is normally distributed with a mean of 50 and a standard deviation of 10, suppose a statistician wants to determine the z score for a value of 70.
This value (x = 70) is 20 units above the mean, so the z value is
$$z = \frac{70 - 50}{10} = +2.00$$
This z score signifies that the raw score of 70 is two standard deviations above the
mean. How is this z score interpreted? The empirical rule states that 95% of all values
are within two standard deviations of the mean if the data are approximately normally
distributed. Figure 3.7 shows that because the value of 70 is two standard deviations above
the mean (z = +2.00), 95% of the values are between 70 and the value (x = 30) that is two
standard deviations below the mean, or z = (30 − 50)/10 = −2.00. Because 5% of the values
are outside the range of two standard deviations from the mean and the normal
distribution is symmetrical, 2½% (½ of the 5%) are below the value of 30. Thus 97½% of
the values are below the value of 70. Because a z score is the number of standard deviations an individual data value is from the mean, the empirical rule can be restated in terms
of z scores.
Between z = −1.00 and z = +1.00 are approximately 68% of the values.
Between z = −2.00 and z = +2.00 are approximately 95% of the values.
Between z = −3.00 and z = +3.00 are approximately 99.7% of the values.
The topic of z scores is discussed more extensively in Chapter 6.
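The worked z score example reduces to one line of arithmetic, sketched here with an illustrative helper named `z_score` (an assumption, not a function from the text).

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


print(z_score(70, 50, 10))   # 2.0
print(z_score(30, 50, 10))   # -2.0

# Share of values below x = 70 under the empirical rule, as reasoned in the text:
# 95% within two standard deviations, plus half of the remaining 5% in the lower tail.
print(95 + 5 / 2)            # 97.5
```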
COEFFICIENT OF VARIATION
The coefficient of variation is a statistic that is the ratio of the standard deviation to the
mean expressed in percentage and is denoted CV.
Coefficient of Variation
$$CV = \frac{\sigma}{\mu}(100)$$
The coefficient of variation is essentially a relative comparison of a standard deviation
to its mean. The coefficient of variation can be useful in comparing standard deviations
that have been computed from data with different means.
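As a small illustration of comparing dispersion across different means, the sketch below evaluates the coefficient of variation for the two mean and standard deviation pairs used on the earlier Coefficient of Variation slide (29 and 4.6; 84 and 10). The function name is an assumption.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


cv1 = coefficient_of_variation(4.6, 29)
cv2 = coefficient_of_variation(10, 84)
print(round(cv1, 2), round(cv2, 2))   # 15.86 11.9
```

The second data set has the larger standard deviation (10 versus 4.6) but the smaller relative dispersion, which is the distinction the coefficient of variation is meant to expose.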
Measures of Central Tendency and Variability: Grouped Data
• Measures of Central Tendency
– Mean
– Median
– Mode
• Measures of Variability
– Variance
– Standard Deviation
Mean of Grouped Data
• Weighted average of class midpoints
• Relative class frequencies are the weights
• Weights are: $f_i / N$ for $i = 1, 2, 3, \dots, k$
$$\mu = \frac{\sum_{i=1}^{k} f_i M_i}{\sum_{i=1}^{k} f_i} = \frac{\sum fM}{N} = \frac{f_1 M_1 + f_2 M_2 + f_3 M_3 + \dots + f_k M_k}{f_1 + f_2 + f_3 + \dots + f_k}$$
Calculation of Grouped Mean
Class Interval   Frequency (f)   Class Midpoint (M)   fM
20-under 30      6               25                   150
30-under 40      18              35                   630
40-under 50      11              45                   495
50-under 60      11              55                   605
60-under 70      3               65                   195
70-under 80      1               75                   75
Totals           50                                   2150
$$\mu = \frac{\sum fM}{\sum f} = \frac{2150}{50} = 43.0$$
Variance and Standard Deviation from Grouped Data
Population
$$\sigma^2 = \frac{\sum f (M - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2}$$
Sample
$$S^2 = \frac{\sum f (M - \bar{X})^2}{n - 1}, \qquad S = \sqrt{S^2}$$
Population Variance and Standard Deviation of Grouped Data
Class Interval   f    M    fM     M − µ   (M − µ)²   f(M − µ)²
20-under 30      6    25   150    −18     324        1944
30-under 40      18   35   630    −8      64         1152
40-under 50      11   45   495    2       4          44
50-under 60      11   55   605    12      144        1584
60-under 70      3    65   195    22      484        1452
70-under 80      1    75   75     32      1024       1024
Totals           50        2150                      7200
$$\sigma^2 = \frac{\sum f (M - \mu)^2}{N} = \frac{7200}{50} = 144$$
$$\sigma = \sqrt{\sigma^2} = \sqrt{144} = 12$$
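The grouped-data tables can be reproduced from the class limits and frequencies alone. In this sketch each class is a hypothetical `(lower, upper, frequency)` tuple; that representation and the variable names are assumptions for illustration.

```python
import math

# (lower limit, upper limit, frequency) for each class interval in the table.
classes = [(20, 30, 6), (30, 40, 18), (40, 50, 11),
           (50, 60, 11), (60, 70, 3), (70, 80, 1)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]   # 25, 35, ..., 75
freqs = [f for _, _, f in classes]
N = sum(freqs)                                          # 50

sum_fM = sum(f * m for f, m in zip(freqs, midpoints))   # 2150.0
mu = sum_fM / N                                         # grouped mean

sum_sq = sum(f * (m - mu) ** 2 for f, m in zip(freqs, midpoints))  # 7200.0
variance = sum_sq / N                                   # population variance
sigma = math.sqrt(variance)

print(N, sum_fM, mu)             # 50 2150.0 43.0
print(sum_sq, variance, sigma)   # 7200.0 144.0 12.0
```

Because every observation in a class is represented by the class midpoint, these results approximate the values that the raw, ungrouped data would give.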
Descriptions and Measures of Shape
Skewness
– Absence of symmetry
– Presence of extreme values on one side or the other of a distribution
Box and Whisker Plots
– Graphic display of a distribution using five summary statistics
– Reveals skewness and data location or clustering
Probability Distributions Showing Symmetry and Skewness
[Figure: three probability distributions. Panels: "Symmetrical", "Right or Positively Skewed", and "Left or Negatively Skewed".]
[Figure: Symmetrical Shape Frequency Histogram Showing Relationship of Mean, Median and Mode]
[Figure: Relationship of Mean, Median and Mode When Data Are Negatively Skewed (to the Left)]
[Figure: Relationship of Mean, Median and Mode When Data Are Positively Skewed (to the Right)]
Requirements for a Box and Whisker Plot
• Five specific values are used:
– Median, Q2
– First quartile, Q1
– Third quartile, Q3
– Minimum value in the data set
– Maximum value in the data set
• Inner Fences: First indicators of extreme values
– IQR = Q3 ‐ Q1
– Lower inner fence = Q1 ‐ 1.5 IQR
– Upper inner fence = Q3 + 1.5 IQR
• Outer Fences: Strong indicators of extreme values
– Lower outer fence = Q1 ‐ 3.0 IQR
– Upper outer fence = Q3 + 3.0 IQR
Box and Whisker Plot
[Figure: box and whisker plot with the minimum, Q1, Q2, Q3, and maximum marked.]
BOX AND WHISKER PLOTS
Another way to describe a distribution of data is by using a box and whisker plot. A box
and whisker plot, sometimes called a box plot, is a diagram that utilizes the upper and lower
quartiles along with the median and the two most extreme values to depict a distribution
graphically. The plot is constructed by using a box to enclose the median. This box is extended outward from the median along a continuum to the lower and upper quartiles, enclosing not only the median but also the middle 50% of the data. From the lower and upper
quartiles, lines referred to as whiskers are extended out from the box toward the outermost
data values. The box and whisker plot is determined from five specific numbers.
1. The median (Q2)
2. The lower quartile (Q1)
3. The upper quartile (Q3)
4. The smallest value in the distribution
5. The largest value in the distribution
The box of the plot is determined by locating the median and the lower and upper
quartiles on a continuum. A box is drawn around the median with the lower and upper
quartiles (Q1 and Q3) as the box endpoints. These box endpoints (Q1 and Q3) are referred to
as the hinges of the box.
Next, the value of the interquartile range (IQR) is computed by Q3 − Q1. The interquartile range includes the middle 50% of the data and should equal the length of the box.
However, here the interquartile range is used outside the box also. At a distance of 1.5 · IQR
outward from the lower and upper quartiles are what are referred to as inner fences. A
whisker, a line segment, is drawn from the lower hinge of the box outward to the smallest
data value. A second whisker is drawn from the upper hinge of the box outward to the
largest data value. The inner fences are established as follows:
Q1 − 1.5 · IQR
Q3 + 1.5 · IQR
If data fall beyond the inner fences, then outer fences can be constructed:
Q1 − 3.0 · IQR
Q3 + 3.0 · IQR
Figure 3.13 shows the features of a box and whisker plot.
Data values outside the mainstream of values in a distribution are viewed as outliers.
Outliers can be merely the more extreme values of a data set. However, sometimes outliers occur due to measurement or recording errors. Other times they are values so unlike
the other values that they should not be considered in the same analysis as the rest of the
distribution. Values in the data distribution that are outside the inner fences but within
the outer fences are referred to as mild outliers. Values that are outside the outer fences are
called extreme outliers. Thus, one of the main uses of a box and whisker plot is to identify
outliers. In some computer-produced box and whisker plots, the whiskers are drawn to
the largest and smallest data values within the inner fences. An asterisk is then printed for
each data value located between the inner and outer fences to indicate a mild outlier. Values outside the outer fences are indicated by a zero on the graph. These values are extreme
outliers.
[Figure 3.13: Box and Whisker Plot, showing the two hinges and the 1.5 · IQR and 3.0 · IQR distances to the inner and outer fences.]
Data for Box and Whisker Plot (Table 3.10)
71  76  70  82  74  87  82  79
79  65  63  74  74  73  62  64
68  64  68  62  72  75  80  81
84  73  73  84  72  82  81  85
77  81  69  69  71  73  65  71
Another use of box and whisker plots is to determine whether a distribution is
skewed. The location of the median in the box can relate information about the skewness
of the middle 50% of the data. If the median is located on the right side of the box, then
the middle 50% are skewed to the left. If the median is located on the left side of the box,
then the middle 50% are skewed to the right. By examining the length of the whiskers on
each side of the box, a business researcher can make a judgement about the skewness of
the outer values. If the longest whisker is to the right of the box, then the outer data are
skewed to the right, and vice versa. We shall use the data given in Table 3.10 to construct a
box and whisker plot.
After organizing the data into an ordered array, as shown in Table 3.11, it is relatively
easy to determine the values of the lower quartile (Q1), the median, and the upper quartile
(Q3). From these, the value of the interquartile range can be computed.
The hinges of the box are located at the lower and upper quartiles, 69 and 80.5.
The median is located within the box at distances of 4 from the lower quartile and 7.5
from the upper quartile. The distribution of the middle 50% of the data is skewed right,
because the median is nearer to the lower or left hinge. The inner fence is
constructed by:
Q1 − 1.5 · IQR = 69 − 1.5(11.5) = 69 − 17.25 = 51.75
and
Q3 + 1.5 · IQR = 80.5 + 1.5(11.5) = 80.5 + 17.25 = 97.75
The whiskers are constructed by drawing a line segment from the lower hinge outward to the smallest data value and a line segment from the upper hinge outward to the
largest data value. An examination of the data reveals that no data values in this set of
numbers are outside the inner fence. The whiskers are constructed outward to the lowest
value, which is 62, and to the highest value, which is 87.
To construct an outer fence, we calculate Q1 − 3 · IQR and Q3 + 3 · IQR, as follows:
Q1 − 3 · IQR = 69 − 3(11.5) = 69 − 34.5 = 34.5
Q3 + 3 · IQR = 80.5 + 3(11.5) = 80.5 + 34.5 = 115.0
Figure 3.14 is the computer printout for this box and whisker plot.
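The fence arithmetic and the outlier check in this example can be retraced with a short sketch. The quartiles are taken as given from the text (Q1 = 69, Q3 = 80.5) rather than recomputed, and the variable names are illustrative assumptions.

```python
# The 40 data values used for the box and whisker plot.
data = [71, 76, 70, 82, 74, 87, 82, 79, 79, 65,
        63, 74, 74, 73, 62, 64, 68, 64, 68, 62,
        72, 75, 80, 81, 84, 73, 73, 84, 72, 82,
        81, 85, 77, 81, 69, 69, 71, 73, 65, 71]

q1, q3 = 69, 80.5          # quartiles as stated in the worked example
iqr = q3 - q1              # 11.5

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # outer fences

# Mild outliers lie beyond an inner fence but not beyond the outer fence;
# extreme outliers lie beyond an outer fence.
mild = [x for x in data
        if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
extreme = [x for x in data if x < outer[0] or x > outer[1]]

print(inner, outer)            # (51.75, 97.75) (34.5, 115.0)
print(min(data), max(data))    # 62 87
print(mild, extreme)           # [] []
```

With no values beyond the inner fences, the whiskers run to the minimum (62) and maximum (87), matching the description in the text.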
Box and Whisker Plot
[Figure 3.14: computer printout of the box and whisker plot for these data, on an axis running from 60 to 90.]
Data in Ordered Array with Quartiles and Median (Table 3.11)
87  80  73  69
85  79  73  68
84  79  73  68
84  77  72  65
82  76  72  65
82  75  71  64
82  74  71  64
81  74  71  63
81  74  70  62
81  73  69  62
Q1 = 69
Q2 = median = 73
Q3 = 80.5
IQR = Q3 - Q1 = 80.5 - 69 = 11.5
# text table 3.10/3.11, N=40
# lower inner fence: 69-1.5*11.5 = 51.75
# upper inner fence: 80.5+1.5*11.5 = 97.75
# min 62: no outlier; max 87: no outlier
# Side-by-side box plots of two small vectors; the value 16 in x lies more than
# 1.5 box lengths above the upper hinge, so R draws it as an outlier point.
x <- c(1:9, 16)
y <- 1:10
boxplot(x, y, col = rainbow(2))