Chapter 3: Numerically
Summarizing Data
3.1 Measures of Central Tendency
3.2 Measures of Dispersion
3.3 Measures of Central Tendency and Dispersion from Grouped Data
3.4 Measures of Position
3.5 The Five-Number Summary and Boxplots
September 25, 2008
The Mean of a Set
Suppose we have a set of numerical values $\{x_1, x_2, \dots, x_n\}$ in an observation of
a population. We define the mean of this set to be the number:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
Hence, the mean is the sum of the observations divided by the number of observations.
Section 3.1
Remark
Your book distinguishes between two types of means: population and sample.
Population: $x_1, x_2, \dots, x_N$, $\mu = \frac{1}{N}\sum_{j=1}^{N} x_j$
Sample: $x_1, x_2, \dots, x_n$, $\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j$, $n \le N$.
Note that they are both calculated in exactly the same way.
The Median of a Set
Consider a set of numerical observations, x1 , x2 ,K , xn , such that x1  x2  L  xn .
If n is an odd integer, then then the median of the set is the data point in the set: x(n1)/21 .
If n is an even interger, then the median of the set the number:
xn /2  xn /21
.
2
In other words, the median is the midpoint of the observations when they are
ordered from smallest to largest or vice-versa.
Example 1
Find the mean and median of the set of observations: {20, -3, 4, 10, 6, -1}.
Here $n = 6$. Therefore, mean $= \bar{x} = \frac{20 - 3 + 4 + 10 + 6 - 1}{6} = \frac{36}{6} = 6$.
Since $n = 6$ is an even integer, the median is the average of the two "central" data points after ordering.
We reorder the set: -3, -1, 4, 6, 10, 20. Then, median $= \frac{4 + 6}{2} = \frac{10}{2} = 5$.
Example 2
Find the mean and median of the set of observations: {-10, -6 ,0, 4, 9}.
Here $n = 5$. Therefore, mean $= \bar{x} = \frac{-10 - 6 + 0 + 4 + 9}{5} = \frac{-3}{5} = -\frac{3}{5}$.
Since $n = 5$ is an odd integer, the median is the "central" data point after ordering.
The set is already ordered. Then, median $= 0$.
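The two examples above can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative and not part of the slides.

```python
from statistics import mean, median

example_1 = [20, -3, 4, 10, 6, -1]
example_2 = [-10, -6, 0, 4, 9]

for name, data in (("Example 1", example_1), ("Example 2", example_2)):
    # median() sorts internally and averages the two central points when n is even
    print(name, "mean =", mean(data), "median =", median(data))
```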
Mean and Dot Plot
Consider the set of observations: 1,1,2,3,5,5,5,6,7,7. The mean of this set is:
$$\bar{x} = \frac{1 + 1 + 2 + 3 + 5 + 5 + 5 + 6 + 7 + 7}{10} = \frac{42}{10} = 4.2.$$
Next we construct the dot plot of this set and show the location of the mean on it.

Notice that the mean is a fulcrum for the
distribution of point masses on the lever (x-axis).
Add Points (“Weights”)
Suppose that we add two new points to the set
1,1,2,3,5,5,5,6,7,7; namely, -2 and -4.
The mean of the new set is:
$$\bar{x} = \frac{-4 - 2 + 1 + 1 + 2 + 3 + 5 + 5 + 5 + 6 + 7 + 7}{12} = \frac{36}{12} = 3.$$
Next we construct the dot plot of this set and show the location of the mean on it.
The fulcrum has moved 1.2 units to the left.
Shape, Mean and Median
Right-skewed $\Rightarrow$ median < mean
Left-skewed $\Rightarrow$ mean < median
Symmetric $\Rightarrow$ mean = median


Outlier
• An outlier is an observation (data point) that falls well
above or below the overall set of data.
• The mean can be highly influenced by an outlier.
• The median is said to be resistant to outliers, i.e., its value
is not changed significantly by the addition or removal of
an outlier.
Example
Consider the two sets: $S_1 = \{1, 3, 5, 6, 7\}$ and $S_2 = \{1, 3, 5, 6, 7, 25\}$. The point 25 in $S_2$ is an outlier.
We compute the mean and median for each set. For $S_1$ the mean is 4.4 and the median is 5. For
$S_2$, the mean is approximately 7.8 and the median is 5.5.
Mode
• The mode is the most frequent observation of the
variable.
• It is most often used with categorical data.
• For numerical data, it can be used when the data is
discrete.
Color    Count
Black    20
White    10
Red      35
Blue     15
Green    10
Other    20
The mode of the categorical variable color is red, the category with the largest count (35).
Example
Mia Hamm, who retired at the 2004 Olympics, is considered to be the most
prolific player in international soccer. Here is a list of the number of goals
scored over her 18-year career.
MHG = {0,0,0,4,10,1,10,10,19,9,18,20,13,13,2,7,8,13}.
Considering the population as the number of goals scored by Mia Hamm,
find the mean and median and mode of this set.
Sorted: MHG = {0, 0, 0, 1, 2, 4, 7, 8, 9, 10, 10, 10, 13, 13, 13, 18, 19, 20}
$\mu = \frac{1}{18}\sum_{j=1}^{18} x_j = \frac{157}{18} \approx 8.7222$
median $= \frac{9 + 10}{2} = \frac{19}{2} = 9.5$
modes: 0, 10, and 13 (each value occurs 3 times)
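As a quick check of the Mia Hamm example, the sketch below recomputes the three measures of center with Python's standard `statistics` module (`multimode` requires Python 3.8 or later). The variable name `goals` is illustrative.

```python
from statistics import mean, median, multimode

# goals per year, in the order listed on the slide
goals = [0, 0, 0, 4, 10, 1, 10, 10, 19, 9, 18, 20, 13, 13, 2, 7, 8, 13]

print("mean   =", round(mean(goals), 4))
print("median =", median(goals))
# multimode() returns every value that attains the highest frequency
print("modes  =", multimode(goals))
```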
Mean, Median and Mode and
Distribution Shape
Measures of Dispersion
Consider the following sets of observations:
S1 = {0,0,0,0,0,0,0,0,0,0}
S2 = {-5,-4,-3,-2,-1,1,2,3,4,5}.
Both sets have the same mean and median (namely, 0). However, their
histograms and dot plots are quite different.
Notice that the difference between the smallest and
largest number in each set is quite different.
Section 3.2
Range of a Set of Observations
Consider a set of numerical observations: $S = \{x_1, x_2, \dots, x_n\}$.
Let $\alpha = \min_{1 \le i \le n} x_i$ and $\beta = \max_{1 \le i \le n} x_i$. The number $r = \beta - \alpha$ is called
the range of the set. It is a measure of how spread out the
observations are.
Example: $S = \{3, 1, 9, 2, -4, 6, 8, 1, 9, 8, 9\}$.
$\alpha = \min\{3, 1, 9, 2, -4, 6, 8, 1, 9, 8, 9\} = -4$
$\beta = \max\{3, 1, 9, 2, -4, 6, 8, 1, 9, 8, 9\} = 9$
$r = \beta - \alpha = 9 - (-4) = 13$
Remark: The range is completely determined by only two points of the set of observations.
Example
Lance Armstrong won the Tour de France seven consecutive times (1999-2005). Here is data
about his victories.
Year   Winning Time (h)   Distance (km)   Winning Speed (km/h)   Winning Margin (min)
1999   91.538             3687            40.28                  7.617
2000   92.552             3662            39.46                  6.033
2001   86.291             3453            40.02                  6.733
2002   82.087             3278            39.93                  7.283
2003   83.687             3427            40.94                  1.017
2004   83.601             3391            40.56                  6.317
2005   86.251             3593            41.65                  4.667
The ranges for each category are:
Winning Time: range = 92.552 - 82.087 = 10.465
Distance: range = 3687 - 3278 = 409
Winning Speed: range = 41.65 - 39.46 = 2.19
Winning Margin: range = 7.617 - 1.017 = 6.600
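The range of each column in the Tour de France table can be recomputed as max minus min. The sketch below is illustrative only; the tuple layout and the names `rows`, `columns`, and `ranges` are assumptions made for this example.

```python
# each row: (year, winning time in h, distance in km, speed in km/h, margin in min)
rows = [
    (1999, 91.538, 3687, 40.28, 7.617),
    (2000, 92.552, 3662, 39.46, 6.033),
    (2001, 86.291, 3453, 40.02, 6.733),
    (2002, 82.087, 3278, 39.93, 7.283),
    (2003, 83.687, 3427, 40.94, 1.017),
    (2004, 83.601, 3391, 40.56, 6.317),
    (2005, 86.251, 3593, 41.65, 4.667),
]
columns = {"Winning Time": 1, "Distance": 2, "Winning Speed": 3, "Winning Margin": 4}

# range = largest value minus smallest value in the column
ranges = {
    name: max(row[i] for row in rows) - min(row[i] for row in rows)
    for name, i in columns.items()
}
for name, value in ranges.items():
    print(f"{name}: range = {value:.3f}")
```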
The Spread of Quantitative Data
Consider the frequency distributions of two different data sets.
Notice how the tails of each distribution change from being close together to being far apart.
Section 2.4
The Deviation from the Mean
Consider a set of numerical observations: $S = \{x_1, x_2, \dots, x_n\}$. Let $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
be the mean. If $z \in S$, e.g., $z = x_j$, then the deviation of $z$ is defined to be the number $v = z - \bar{x}$.
If $z > \bar{x}$, then $v > 0$; if $z < \bar{x}$, then $v < 0$.
Example: $S = \{-2, 0, 1, 3, 4\}$
$\bar{x} = \frac{-2 + 0 + 1 + 3 + 4}{5} = \frac{6}{5}$
$z = -2 \Rightarrow v = -2 - \frac{6}{5} = -\frac{16}{5}$
$z = 1 \Rightarrow v = 1 - \frac{6}{5} = -\frac{1}{5}$
$z = 4 \Rightarrow v = 4 - \frac{6}{5} = \frac{14}{5}$
Variance and Standard Deviation
Definition: The “average” of the square of all deviations in a
sample is called the variance of the sample. The standard
deviation of a sample is defined as the square root of the
variance.
$v_i = x_i - \bar{x}$ = variation of $x_i$
$$s^2 = \frac{\sum_{i=1}^{n} v_i^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \text{variance}$$
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \text{standard deviation}$$
Question: Why $n - 1$ instead of $n$ in these formulas?
Remark
There is an unfortunate ambiguity in how the words variance and standard deviation are
used. These quantities are computed in different ways, depending on whether the set under
consideration is a population or a sample of a population. It turns out that if we use the
formulas for variance and standard deviation where we divide by n instead of n-1, then the
standard deviation of the sample will consistently underestimate the standard deviation of
the population. This is called bias. Hence, we will sometimes use the following definitions
and will distinguish between sample standard deviation and population standard deviation.
Population:
variance of population $= s^2_{\text{population}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
standard deviation of population $= s_{\text{population}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
Sample:
variance of sample $= s^2_{\text{sample}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
standard deviation of sample $= s_{\text{sample}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
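The bias mentioned in the remark can be seen exactly on a tiny case. The sketch below uses a hypothetical population {1, 2, 3} (not from the slides) and enumerates every ordered sample of size 2 drawn with replacement. It compares the variance formulas rather than the standard deviations: averaged over all samples, dividing by n-1 recovers the population variance, while dividing by n falls short.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]  # hypothetical toy population, chosen for illustration

def sum_sq_dev(values):
    """Sum of squared deviations from the mean, in exact arithmetic."""
    m = Fraction(sum(values), len(values))
    return sum((v - m) ** 2 for v in values)

pop_var = sum_sq_dev(population) / len(population)

# all 9 ordered samples of size n = 2, drawn with replacement
samples = list(product(population, repeat=2))
avg_div_n = sum(sum_sq_dev(s) / 2 for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / 1 for s in samples) / len(samples)

print("population variance     :", pop_var)
print("average, divide by n    :", avg_div_n)
print("average, divide by n - 1:", avg_div_n_minus_1)
```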
Example
For the set of observations (sample), {0,-3,10,7,5,-3,0},
• Find the range of the sample.
• Find the mean and median of the sample.
• Find the variance of the sample.
• Find the standard deviation of the sample.
$\alpha = \min\{0, -3, 10, 7, 5, -3, 0\} = -3$
$\beta = \max\{0, -3, 10, 7, 5, -3, 0\} = 10$
$r = \beta - \alpha = 10 - (-3) = 13$
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1}{7}(0 - 3 + 10 + 7 + 5 - 3 + 0) = \frac{16}{7}$
Ordered set: -3, -3, 0, 0, 5, 7, 10 $\Rightarrow$ median $= 0$
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{1}{6}\left[\left(0 - \frac{16}{7}\right)^2 + \left(-3 - \frac{16}{7}\right)^2 + \cdots + \left(0 - \frac{16}{7}\right)^2\right] = \frac{544}{21} \approx 25.9$
$s = \sqrt{\frac{544}{21}} = 4\sqrt{\frac{34}{21}} \approx 5.1$
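The four quantities asked for in this example can be cross-checked with the standard `statistics` module, where `variance` and `stdev` use the n-1 divisor. The variable names below are illustrative.

```python
from statistics import mean, median, variance, stdev

sample = [0, -3, 10, 7, 5, -3, 0]

sample_range = max(sample) - min(sample)
print("range    =", sample_range)
print("mean     =", mean(sample))        # 16/7 as a float
print("median   =", median(sample))
print("variance =", variance(sample))    # divides by n - 1
print("std dev  =", stdev(sample))       # square root of the variance above
```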
Example
For the two sets of observations, S = {-1,0,0,0,1} and T = {-1,-1,-1,-1,0,1,1,1,1},
• Find the mean and median for each set.
• Find the standard deviation for each set.
For the set S: $\bar{x} = 0$, median $= 0$, $s = \frac{1}{\sqrt{2}} \approx 0.71$
For the set T: $\bar{x} = 0$, median $= 0$, $s = 1$
We see from the dot plot that
the set T has more points that
vary from the mean and
hence, has a larger standard
deviation.
Properties of the Standard Deviation
• The larger the spread (variation) in the data, the larger the standard
deviation.
• The standard deviation is zero if and only if all elements of the set from
which it is computed are the same, in which case the mean
of the set is this common value.
• The standard deviation is influenced by outliers. This is true because
the deviation from the mean of the set to the outlier is a large number
in absolute value.
• The standard deviation yields more information than the range of the
set. (Why?)
Example
The following data represents the walking time (in minutes) from the dorm or apartment to
Professor Bisch’s course on operator algebras. We treat the nine students as the population of
Prof. Bisch’s class.
Student   Time      Student   Time
T.S.      39        S.Q.      45
P.C.      21        E.W.      11
A.A.      9         T.B.      12
C.S.      32        G.W.      39
N.G.      30
(a) Find the population mean and standard deviation.
(b) Choose a sample of 4 and compute the mean and standard deviation of the sample.
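population $= \{39, 21, 9, 32, 30, 45, 11, 12, 39\} = \{x_i\}$
$\mu = \frac{39 + 21 + 9 + 32 + 30 + 45 + 11 + 12 + 39}{9} = \frac{238}{9} \approx 26.444$
$\sigma = \sqrt{\frac{\sum_{j=1}^{9} \left(x_j - \frac{238}{9}\right)^2}{9}} = \frac{\sqrt{13358}}{9} \approx 12.8419$
sample $= \{21, 30, 45, 12\} = \{x_i\}$
$\bar{x} = \frac{21 + 30 + 45 + 12}{4} = 27$
$s = \sqrt{\frac{\sum_{j=1}^{4} (x_j - 27)^2}{4 - 1}} = \sqrt{\frac{594}{3}} = \sqrt{198} \approx 14.0712$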
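The distinction between the two divisors is built into Python's `statistics` module: `pstdev` divides by n (population) and `stdev` divides by n-1 (sample). The sketch below applies both to the walking-time data; the sample of four is the one chosen on the slide.

```python
from statistics import mean, pstdev, stdev

population = [39, 21, 9, 32, 30, 45, 11, 12, 39]
sample = [21, 30, 45, 12]

mu, sigma = mean(population), pstdev(population)  # pstdev divides by n
x_bar, s = mean(sample), stdev(sample)            # stdev divides by n - 1

print(f"population: mean = {mu:.3f}, standard deviation = {sigma:.4f}")
print(f"sample:     mean = {x_bar:.3f}, standard deviation = {s:.4f}")
```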
Bell-shaped (symmetric)
Distributions
Consider a set of observations that is bell-shaped.
All three distributions have different
standard deviations.
Empirical Rule for almost Bell-shaped Distributions
Let $\sigma$ denote the standard deviation of the distribution (population) and $\mu$ the mean of the distribution.
• About 68% of the observations fall within the interval $(\mu - \sigma, \mu + \sigma)$.
• About 95% of the observations fall within the interval $(\mu - 2\sigma, \mu + 2\sigma)$.
• Greater than 99% of the observations fall within the interval $(\mu - 3\sigma, \mu + 3\sigma)$.
Caution
The Empirical Rule for bell-shaped distributions is an
empirical law, not a fact. The closer the distribution is
to being perfectly bell-shaped, the more accurate the
law. It is useful in telling us how the data is
concentrated about the mean of the distribution.
Example
Consider the population: 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9.
Note that histogram of this data is approximately bell-shaped.
98
85
 5.2 and  
 2.1.
19
19
Hence,    ,      3.1, 7.3 and   2 ,   2   1.0, 9.4 
The mean and standard deviation of this set are:  
Think of the area between the yellow lines. Each bar is of width 1 and hence, its area
is 1 times its height. For example, the total area of all bars is 19 and the area between the yellow bars in the first
plot is
2
3
25
25/2 25
1 4  3 2 
. Then

 0.657.
2
4
2
19
38
30
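A simple way to compare this data set with the Empirical Rule is to count the observations inside each interval. Note that this counts data points rather than histogram area, so the share within one standard deviation differs slightly from the area-based figure on the slide. The sketch assumes the n-1 divisor (`stdev`); both divisors round to 2.1 for this data.

```python
from statistics import mean, stdev

population = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9]
mu = mean(population)
sigma = stdev(population)  # assumption: n - 1 divisor

def share_within(k):
    """Fraction of observations strictly inside (mu - k*sigma, mu + k*sigma)."""
    lo, hi = mu - k * sigma, mu + k * sigma
    return sum(lo < x < hi for x in population) / len(population)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.3f}")
```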
Detailed Empirical Rule
Example
The distribution of the length of bolts produced by the Acme Bolt Company is approximately bell-shaped with a mean of 4 inches and a standard deviation of 0.007 inches.
(a)
What is the range of length for 68% of the bolts produced by this company?
(b)
What percentage of bolts will be between 3.986 inches and 4.014 inches?
(c)
If the company discards any bolts that are less than 3.986 inches or greater than 4.014 inches,
what percentage of bolts will be discarded?
(d)
What percentage of the bolts will be between 4.007 inches and 4.021 inches?
(a) $\mu = 4$ and $\sigma = 0.007 \Rightarrow (\mu - \sigma, \mu + \sigma) = (4 - 0.007, 4 + 0.007) = (3.993, 4.007)$
The Empirical Rule states that 68% of the bolts will lie in this interval.
(b) $(\mu - 2\sigma, \mu + 2\sigma) = (4 - 0.014, 4 + 0.014) = (3.986, 4.014)$
The Empirical Rule states that 95% of the bolts will lie in this interval.
(c) The bolts that lie outside of the interval $(3.986, 4.014)$ comprise 100% - 95%, or 5%, of the bolts.
(d) $4.007 = 4 + 0.007 = \mu + \sigma$ and $4.021 = 4 + 0.021 = \mu + 3\sigma$. Hence, the interval
$(4.007, 4.021) = (\mu + \sigma, \mu + 3\sigma)$.
The area under the distribution for this interval is
13.5% + 2.35% = 15.85%.
Chebyshev Inequality
Theorem: Consider a set of data, $S = \{x_1, x_2, \dots, x_n\}$, with mean $\mu$ and standard
deviation $\sigma$. Let $k$ be any number greater than 1. Then at least $100\left(1 - \frac{1}{k^2}\right)\%$ of the
points of S will lie in the interval $(\mu - k\sigma, \mu + k\sigma)$.
Example: Suppose that a population has a mean of 73.5 and a standard
deviation of 5.5. Find an interval that contains at least 75% of the data points
in the population.
$75 = 100\left(1 - \frac{1}{k^2}\right) \Rightarrow 0.75 = 1 - \frac{1}{k^2} \Rightarrow \frac{1}{k^2} = 0.25 \Rightarrow k^2 = 4 \Rightarrow k = 2$
$(\mu - k\sigma, \mu + k\sigma) = (73.5 - 2 \cdot 5.5, 73.5 + 2 \cdot 5.5) = (62.5, 84.5)$
Example
In December 2004, the average price of regular unleaded gasoline excluding taxes in the United States was
$1.37 per gallon. Researchers in the Department of Energy estimated that the standard deviation for this
mean price was $0.05. Using Chebyshev’s Inequality, estimate the percentage of gasoline stations that had
prices within 3 standard deviations of the mean. What percentage had prices within 2.5 standard
deviations?
From Chebyshev's Inequality, at least $100\left(1 - \frac{1}{3^2}\right)\% = 100\left(\frac{8}{9}\right)\% \approx 88.9\%$ of the gas stations were
selling gas in the range: $(\mu - 3\sigma, \mu + 3\sigma) = (1.22, 1.52)$.
From Chebyshev's Inequality, at least $100\left(1 - \frac{1}{2.5^2}\right)\% = 100\left(\frac{21}{25}\right)\% = 84\%$ of the gas stations were
selling gas in the range: $(\mu - 2.5\sigma, \mu + 2.5\sigma) = (1.245, 1.495)$.
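The Chebyshev computations above follow one pattern, which the short sketch below captures. The function names `chebyshev_share` and `chebyshev_interval` are made up for this illustration.

```python
def chebyshev_share(k):
    """Lower bound on the fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

def chebyshev_interval(mean, sd, k):
    """Interval (mean - k*sd, mean + k*sd) to which the bound applies."""
    return (mean - k * sd, mean + k * sd)

# population with mean 73.5 and standard deviation 5.5, k = 2
print(chebyshev_share(2), chebyshev_interval(73.5, 5.5, 2))

# gasoline prices: mean 1.37, standard deviation 0.05
for k in (3, 2.5):
    print(k, round(100 * chebyshev_share(k), 1), chebyshev_interval(1.37, 0.05, k))
```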
Remark
• Chebyshev’s Inequality does not place any preconditions on the
shape of the data set.
• It is true for populations and samples.
• The theorem does not say that exactly $100(1 - 1/k^2)\%$ of the points lie
in the interval within $k$ standard deviations of the mean, but rather
that at least this percentage do.
Mean and Standard Deviation
for Grouped Data
Suppose that we have a set (sample or population), S, for which we have a histogram.
Let $x_1^b, x_2^b, \dots, x_k^b$ be the midpoints of the bins for the histogram and
let $f_1, f_2, \dots, f_k$ be the frequencies for the k bins. Then an approximation
for the mean is given by
$$\mu \approx \frac{x_1^b f_1 + x_2^b f_2 + \cdots + x_k^b f_k}{f_1 + f_2 + \cdots + f_k}.$$
Section 3.3
Example
$S = \{-1, 1, 1, 0, 1, 0, 2, 3, 1, 0, 2, 1\}$
$x_1^b = -0.5,\ x_2^b = 0.5,\ x_3^b = 1.5,\ x_4^b = 2.5,\ x_5^b = 3.5$
$f_1 = 1,\ f_2 = 3,\ f_3 = 5,\ f_4 = 2,\ f_5 = 1$
$\mu = \frac{11}{12} \approx 0.916667$
$\mu \approx \frac{x_1^b f_1 + x_2^b f_2 + \cdots + x_5^b f_5}{f_1 + f_2 + \cdots + f_5} \approx 1.41667$
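The grouped-data approximation is a frequency-weighted average of the bin midpoints. The sketch below applies it to the small example and compares it with the mean of the raw data; `grouped_mean` is a name chosen for this illustration, and the leading -1 in the raw data is inferred from the bin with midpoint -0.5 having frequency 1.

```python
def grouped_mean(midpoints, freqs):
    """Approximate the mean from bin midpoints and bin frequencies."""
    return sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs)

midpoints = [-0.5, 0.5, 1.5, 2.5, 3.5]
freqs = [1, 3, 5, 2, 1]

# raw data; the sign of the first entry is an assumption (see lead-in)
raw = [-1, 1, 1, 0, 1, 0, 2, 3, 1, 0, 2, 1]

approx = grouped_mean(midpoints, freqs)
exact = sum(raw) / len(raw)
print(f"grouped approximation = {approx:.5f}, exact mean = {exact:.5f}")
```

Every raw value here sits at the left edge of its bin, half a bin width below the midpoint, which is why the approximation overshoots by exactly 0.5.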
Example
S  {-0.233419, -1.74643, -1.17611, 0.115127, -0.387499, -0.243923,
0.935241, -1.40094, 1.00318, -0.29893, -1.1775, -1.05954, -1.75079,
-0.570382, 1.78043, -0.890746, 0.274231, -1.88105, 0.431684,
-1.52741, 1.05588, -0.122219, 1.14102, -0.00826077, 0.81772,
-1.66893, -0.26497, -1.99627, -0.279399, 0.0530089, -1.15805,
-1.72074, -1.93831, -1.45983, 1.0851, -0.532795, 0.0568446,
-0.447141, 1.53799, 0.989186, 0.0532697, -0.178675, 1.68054,
-0.0318339, -1.51951, 0.519102, -0.545774, -0.64818, -1.76854,
-0.0157137, -1.56891, 1.55986, -1.37954, -1.81756, -0.357188,
0.430748, 1.49016, -1.32359, 0.503981, 1.88901, -0.690596, 0.457233,
1.29942, 0.431846, 0.538415, 1.48462, 0.979356, 1.18019, 1.30296,
1.50126, 1.75375, 0.281253, 0.917936, -1.57578, -1.93716, -0.876824,
1.87008, -1.8755, 0.117552, 0.851759, -1.47976, 0.37836, -0.826459,
-1.94213, 1.21858, -1.91226, -0.0167282, -0.716761, -0.383359,
1.00214, 0.853372, 0.668228, 0.395186, 0.913779, -0.749079,
-0.198149, 1.77186, 0.41528, -1.9636, -1.23352}
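$n = 100$
$x_1^b = -1.75,\ x_2^b = -1.25,\ x_3^b = -0.75,\ \dots,\ x_8^b = 1.75$
$f_1 = 18,\ f_2 = 10,\ f_3 = 10,\ f_4 = 16,\ f_5 = 14,\ f_6 = 12,\ f_7 = 11,\ f_8 = 9$
$\mu \approx -0.1349$ (computed directly from the 100 data points)
$\mu \approx \frac{x_1^b f_1 + x_2^b f_2 + \cdots + x_8^b f_8}{f_1 + f_2 + \cdots + f_8} = -0.135$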
Weighted Mean of a Set
Given a set of numbers, suppose that we believe that some of the
numbers are more important than other numbers in the set. To reflect
this notion, we define the weighted mean of a set of numbers.
Consider a set of numbers: $S = \{x_1, x_2, \dots, x_n\}$. Furthermore, suppose each number of
the set is assigned a weight: $w_1, w_2, \dots, w_n$. The weighted mean of the set with respect
to its weights is defined as
$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{j=1}^{n} w_j x_j}{\sum_{j=1}^{n} w_j}.$$
Example
Consider the set S = {-3, 1, 0, 3, -1, 1, 0} and the weights {1.5, 0, 1, -1, 1, 2, 1}. Find the
weighted mean of this set with respect to the given weights.
$$\bar{x}_w = \frac{(1.5)(-3) + (0)(1) + (1)(0) + (-1)(3) + (1)(-1) + (2)(1) + (1)(0)}{1.5 + 0 + 1 - 1 + 1 + 2 + 1} = \frac{-6.5}{5.5} \approx -1.18182$$
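The weighted mean is a ratio of two sums, so it is easy to evaluate term by term. A minimal sketch, with `weighted_mean` as an illustrative name and the set and weights taken from the example:

```python
def weighted_mean(values, weights):
    """Sum of w_j * x_j divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

values = [-3, 1, 0, 3, -1, 1, 0]
weights = [1.5, 0, 1, -1, 1, 2, 1]

numerator = sum(w * x for x, w in zip(values, weights))
print("numerator   =", numerator)
print("denominator =", sum(weights))
print("weighted mean =", weighted_mean(values, weights))
```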
Approximation for Standard Deviation and
Variance for Grouped Data
Suppose that we have a set (sample or population), S, for which we have a histogram.
Let $x_1^b, x_2^b, \dots, x_k^b$ be the midpoints of the bins for the histogram and let $f_1, f_2, \dots, f_k$ be the frequencies
for the k bins. Then approximations for the standard deviation (depending on population or sample) are given by
$$\sigma \approx \sqrt{\frac{\sum_{j=1}^{k} f_j \left(x_j^b - \mu\right)^2}{\sum_{j=1}^{k} f_j}} \quad \text{and} \quad s \approx \sqrt{\frac{\sum_{j=1}^{k} f_j \left(x_j^b - \bar{x}\right)^2}{\left(\sum_{j=1}^{k} f_j\right) - 1}}$$
where $\mu$ is the mean of the population and $\bar{x}$ is the mean of the sample.
Example
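sample $= \{-1, 1, 1, 0, 1, 0, 2, 3, 1, 0, 2, 1\}$
$x_1^b = -0.5,\ x_2^b = 0.5,\ x_3^b = 1.5,\ x_4^b = 2.5,\ x_5^b = 3.5$
$f_1 = 1,\ f_2 = 3,\ f_3 = 5,\ f_4 = 2,\ f_5 = 1$
$\bar{x} \approx \frac{x_1^b f_1 + x_2^b f_2 + \cdots + x_5^b f_5}{f_1 + f_2 + \cdots + f_5} \approx 1.41667$
$s \approx \sqrt{\frac{\sum_{j=1}^{5} f_j \left(x_j^b - \bar{x}\right)^2}{\left(\sum_{j=1}^{5} f_j\right) - 1}} = \sqrt{\frac{\left(x_1^b - \bar{x}\right)^2 f_1 + \cdots + \left(x_5^b - \bar{x}\right)^2 f_5}{f_1 + f_2 + \cdots + f_5 - 1}} \approx 1.08362$
Note: computed from the raw data, $\bar{x} = \frac{11}{12}$ and $s = \frac{\sqrt{155/33}}{2} \approx 1.08362$.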
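The grouped standard deviation can be sketched the same way as the grouped mean. The function below is an illustration (the name `grouped_std` and the `sample` flag are assumptions); it uses the grouped mean as the center and divides by the total frequency, minus 1 for a sample.

```python
from math import sqrt

def grouped_std(midpoints, freqs, sample=True):
    """Approximate standard deviation from bin midpoints and frequencies."""
    n = sum(freqs)
    center = sum(m * f for m, f in zip(midpoints, freqs)) / n
    sum_sq = sum(f * (m - center) ** 2 for m, f in zip(midpoints, freqs))
    return sqrt(sum_sq / (n - 1 if sample else n))

midpoints = [-0.5, 0.5, 1.5, 2.5, 3.5]
freqs = [1, 3, 5, 2, 1]

print(f"s (sample, grouped)     = {grouped_std(midpoints, freqs):.5f}")
print(f"sigma (population form) = {grouped_std(midpoints, freqs, sample=False):.5f}")
```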
Approximating the Median of
grouped Data
In the problem section of 3.3 there is a formula for approximating the median of data that is given in
frequency tables:
$$\text{median} \approx L_{\text{median}} + \left(\frac{\frac{n}{2} - CF}{f_{\text{median}}}\right) x^b_{\text{median}}$$
where
$L_{\text{median}}$ = lower limit of the bin that contains the median
$n$ = number of data points
$CF$ = cumulative frequency of the bin before the bin that contains the median
$f_{\text{median}}$ = frequency of the bin that contains the median
$x^b_{\text{median}}$ = width of the bin that contains the median
The bin that contains the median is the first bin whose cumulative frequency reaches $\frac{n}{2}$.
Example
Bin       Frequency   Cumulative Frequency
[0,10)    24          24
[10,20)   14          38
[20,30)   39          77
[30,40)   18          95
[40,50]   5           100
$n = 100 \Rightarrow \frac{n}{2} = 50 \Rightarrow$ median bin is the third bin, i.e., data in the interval [20, 30).
$L_{\text{median}} = 20,\ CF = 38,\ x^b_{\text{median}} = 10,\ f_{\text{median}} = 39 \Rightarrow$ median $\approx 20 + \left(\frac{50 - 38}{39}\right) 10 = \frac{300}{13} \approx 23.0769$
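The grouped-median formula can be turned into a short routine that walks the frequency table until the cumulative frequency reaches n/2. This is a sketch under the assumption that bins are given as (lower, upper) pairs in increasing order; `grouped_median` is an illustrative name.

```python
def grouped_median(bins, freqs):
    """Approximate the median from a frequency table.

    bins  -- list of (lower, upper) limits in increasing order
    freqs -- frequency of each bin
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("empty frequency table")
    cumulative = 0  # CF: cumulative frequency before the current bin
    for (lower, upper), f in zip(bins, freqs):
        if cumulative + f >= n / 2:
            width = upper - lower
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f

bins = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [24, 14, 39, 18, 5]
print(grouped_median(bins, freqs))
```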
Measures of Position in a Distribution
• The mean and median give us information about the “center” of a set of
observations (the distribution).
• The range and standard deviation give us information about the “spread” of the
distribution.
• We now introduce a concept that captures the “position” of an observation in a distribution.
It uses the concept of percentiles. Percentiles show how the distribution can
be divided into parts (sometimes equal), which in turn gives us the notion of
position within the distribution.
Section 3.4
z-score
Consider a population distribution with mean $\mu$ and standard deviation $\sigma$. Let $x$ be a point in
the distribution. Then the z-score of $x$ in the population is defined as the number: $z_{\text{population}} = \frac{x - \mu}{\sigma}$.
Consider a sample with mean $\bar{x}$ and standard deviation $s$. Let $x$ be a point in
the sample. Then the z-score of $x$ in the sample is defined as the number: $z_{\text{sample}} = \frac{x - \bar{x}}{s}$.
Remark: $z = \frac{x - \mu}{\sigma} \Rightarrow x = \mu + z\sigma \Rightarrow x$ is $z$ standard deviations from the mean. When $z = 1$, $x = \mu + \sigma$.
When $z = 2$, $x = \mu + 2\sigma$. When $z = 2.5$, $x = \mu + 2.5\sigma$.
Example
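Example: Consider the sample: {-1,0,1,5,19}. Compute the z-score for each data
point.
$\bar{x} = \frac{-1 + 0 + 1 + 5 + 19}{5} = \frac{24}{5} = 4.8$
$s = \sqrt{\frac{(-1 - 24/5)^2 + (0 - 24/5)^2 + (1 - 24/5)^2 + (5 - 24/5)^2 + (19 - 24/5)^2}{4}} = \sqrt{\frac{341}{5}} \approx 8.26$
$x = -1 \Rightarrow z = -\frac{29}{\sqrt{1705}} \approx -0.70$
$x = 0 \Rightarrow z = -\frac{24}{\sqrt{1705}} \approx -0.58$
$x = 1 \Rightarrow z = -\frac{19}{\sqrt{1705}} \approx -0.46$
$x = 5 \Rightarrow z = \frac{1}{\sqrt{1705}} \approx 0.02$
$x = 19 \Rightarrow z = \frac{71}{\sqrt{1705}} \approx 1.72$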
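The z-scores in this example can be reproduced directly from the definition. A short sketch using the standard `statistics` module, where `stdev` uses the n-1 divisor as in the sample formula:

```python
from statistics import mean, stdev

sample = [-1, 0, 1, 5, 19]
x_bar, s = mean(sample), stdev(sample)

# z-score: signed distance from the mean, measured in standard deviations
z_scores = [(x - x_bar) / s for x in sample]

for x, z in zip(sample, z_scores):
    print(f"x = {x:>2}: z = {z:+.2f}")
```

By construction the z-scores of a data set sum to zero (up to floating-point round-off), which is a handy sanity check.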
Application of z-score
The average 20- to 29- year old man is 69.6 inches tall with a standard deviation of 2.7 inches.
The average 20- to 29- year old woman is 64.1 inches with a standard deviation of 2.6 inches.
With respect to their population, who is relatively taller: a 75-inch man or a 70-inch woman?
As a measure of relativeness within each population we use the z-score.
Man: $z = \frac{x - \mu}{\sigma} = \frac{75 - 69.6}{2.7} = 2.0$
Woman: $z = \frac{x - \mu}{\sigma} = \frac{70 - 64.1}{2.6} \approx 2.26923$
Hence, the 75-inch man is 2 standard deviations above the mean of his population
and the 70-inch woman is about 2.27 standard deviations above the mean of her population. Hence, she is relatively
taller.
Percentile
Definition: The kth percentile in a distribution, $P_k$, is a value such that
k% of the observations fall at or below it. In other words, it
subdivides the total area enclosed by the distribution into two sub-areas, A1 and
A2, so that the total area is divided into two parts: k% and (100-k)%.
Algorithm for Percentiles
$S = \{x_1, x_2, \dots, x_n\}$ (sample or population), ordered so that $x_1 \le x_2 \le \cdots \le x_n$.
Let k be the percentile to be computed.
Compute: $i = \left(\frac{k}{100}\right)(n + 1)$. Note that $1 \le i \le n$.
If $i$ is an integer, then $P_k = x_i$.
If $i$ is not an integer, then let $j$ be the largest integer such that $j < i$. Then $P_k = \frac{x_j + x_{j+1}}{2}$.
For example, if $i = 10.31$, then $j = 10$ and $P_k = \frac{x_{10} + x_{11}}{2}$.
Example
Find the 20th percentile of the set: S = {-1,0,3,5,9,12,15,18,25}.
Next find the 45th percentile.
n9
k  20
n9
20
 k 
i
n  1 
10  2


 100 
100
45
 k 
i
n  1 
10  4.5 (not an integer)


 100 
100
 P20  x2  0
 P45 
k  45
x 4  x5 5  9

7
2
2
51
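The percentile algorithm above translates almost line for line into code. The sketch below is one possible rendering, not the textbook's own implementation: the function name `percentile` is illustrative, exact fractions are used so that the "is i an integer?" test is not affected by floating-point round-off, and positions outside 1 to n are rejected as an assumption about how to treat that case.

```python
from fractions import Fraction

def percentile(data, k):
    """k-th percentile using the position i = (k/100)(n + 1)."""
    xs = sorted(data)
    n = len(xs)
    i = Fraction(k) * (n + 1) / 100       # exact position, 1-based
    if i < 1 or i > n:
        raise ValueError("position i must satisfy 1 <= i <= n")
    j = i.numerator // i.denominator      # largest integer not exceeding i
    if i.denominator == 1:                # i is an integer: take x_i
        return xs[j - 1]
    return (xs[j - 1] + xs[j]) / 2        # otherwise average x_j and x_(j+1)

S = [-1, 0, 3, 5, 9, 12, 15, 18, 25]
print(percentile(S, 20), percentile(S, 45))
```

The same function gives quartiles by calling it with k = 25, 50, and 75.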
Remark
Your book proposes an algorithm on page 169 which tells us what percentile a particular data point represents in the set.
$S = \{x_1, x_2, \dots, x_n\}$ so that $x_1 \le x_2 \le \cdots \le x_n$.
Let $x_k \in S$. Then the percentile of $x_k$ is: $P_{x_k} = 100\left(\frac{\text{number of data points less than } x_k}{n}\right)$.
If $P_{x_k}$ is not an integer, we round to the nearest integer.
Example: What percentile does the number 3 represent in the set {2, 3, 1, 1, 4, 2}?
$n = 6$
Ordered set: 1, 1, 2, 2, 3, 4
Number of data points less than 3: 4
$P_3 = 100\left(\frac{4}{6}\right) \approx 66.67\% \Rightarrow$ 67th percentile
Quartiles
When k = 50%, half of the observations are above and half are below this
position. One can argue that this is equivalent to the notion of the median of
the set of observations. When k = 25%, one quarter of the observations are
below this position and three quarters are above. Similarly for k = 75%.
These demarcation points are given special names. When k = 25%, it is
called the first quartile (Q1). When k = 50%, it is called the second quartile or
median (Q2) and finally, when k = 75%, it is called the third quartile (Q3).
To Find Quartiles
• To calculate Q1, we calculate P25%.
• To calculate Q2, we calculate P50%.
• To calculate Q3, we calculate P75%.
Example
Find the quartiles for the set {-1,1,5,5,0,7,2,7}.
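Reordered Set: -1, 0, 1, 2, 5, 5, 7, 7
$n = 8$
$k = 25 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{25 \cdot 9}{100} = \frac{225}{100} = 2.25 \Rightarrow P_{25} = \frac{x_2 + x_3}{2} = \frac{0 + 1}{2} = 0.5 \Rightarrow Q_1 = 0.5$
$k = 50 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{50 \cdot 9}{100} = \frac{450}{100} = 4.5 \Rightarrow P_{50} = \frac{x_4 + x_5}{2} = \frac{2 + 5}{2} = 3.5 \Rightarrow Q_2 = 3.5$
$k = 75 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{75 \cdot 9}{100} = \frac{675}{100} = 6.75 \Rightarrow P_{75} = \frac{x_6 + x_7}{2} = \frac{5 + 7}{2} = 6 \Rightarrow Q_3 = 6$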
Example
Find the quartiles for the set {-1,1,5,5,0,7,2,7,2}. Same set as previous
example with the data point 2 added.
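Reordered Set: -1, 0, 1, 2, 2, 5, 5, 7, 7
$n = 9$
$k = 25 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{25 \cdot 10}{100} = \frac{25}{10} = 2.5 \Rightarrow P_{25} = \frac{x_2 + x_3}{2} = \frac{0 + 1}{2} = 0.5 \Rightarrow Q_1 = 0.5$
$k = 50 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{50 \cdot 10}{100} = \frac{50}{10} = 5 \Rightarrow P_{50} = x_5 = 2 \Rightarrow Q_2 = 2$
$k = 75 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{75 \cdot 10}{100} = \frac{75}{10} = 7.5 \Rightarrow P_{75} = \frac{x_7 + x_8}{2} = \frac{5 + 7}{2} = 6 \Rightarrow Q_3 = 6$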
Example
Find the median, Q1, and Q3 for the set of data:
{68,76,60,88,69,80,75,67,71,100,63,62,71,74,64,48,100,72,65,50,72,100,63,45,54,60,75,57,74,84,83}.
Reordered Set: {45, 48, 50, 54, 57, 60, 60, 62, 63, 63, 64, 65, 67, 68, 69, 71, 71, 72, 72, 74, 74,
75, 75, 76, 80, 83, 84, 88, 100, 100, 100}.
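$n = 31$
$k = 25 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{25 \cdot 32}{100} = \frac{800}{100} = 8 \Rightarrow P_{25} = x_8 = 62 \Rightarrow Q_1 = 62$
$k = 50 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{50 \cdot 32}{100} = \frac{1600}{100} = 16 \Rightarrow P_{50} = x_{16} = 71 \Rightarrow Q_2 = 71$
$k = 75 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{75 \cdot 32}{100} = \frac{2400}{100} = 24 \Rightarrow P_{75} = x_{24} = 76 \Rightarrow Q_3 = 76$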
Interquartile Range
Definition: Let Q1, Q2, and Q3 denote the quartiles for a set of
observations. The interquartile range (IQR) of the set is defined as
IQR = Q3 - Q1.
Hence, it is simply the distance between the first and third quartiles.
Example: Consider {-1,1,5,5,0,7,2,7}. Previously, we showed that
Q1 = 0.5 and Q3 = 6.0. Hence, IQR = 6.0 - 0.5 = 5.5.
IQR and Outlier Criterion
Criterion: Consider a set of observations. An observation may be
a possible outlier on the left if it lies more than 1.5(IQR)
below Q1. It may be a possible outlier on the right if it
lies more than 1.5(IQR) above Q3. We call these
demarcation values the lower and upper fences of the set:
LF = Q1 - 1.5(IQR)
UF = Q3 + 1.5(IQR)
Example
Example: Consider a set of data points: {-1,0,3,5,9,10,26}. Does it have any potential
outliers?
Reordered Set: {-1, 0, 3, 5, 9, 10, 26}
$n = 7$
$k = 25 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{25 \cdot 8}{100} = \frac{200}{100} = 2 \Rightarrow P_{25} = x_2 = 0 \Rightarrow Q_1 = 0$
$k = 50 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{50 \cdot 8}{100} = \frac{400}{100} = 4 \Rightarrow P_{50} = x_4 = 5 \Rightarrow Q_2 = 5$
$k = 75 \Rightarrow i = \left(\frac{k}{100}\right)(n + 1) = \frac{75 \cdot 8}{100} = \frac{600}{100} = 6 \Rightarrow P_{75} = x_6 = 10 \Rightarrow Q_3 = 10$
$\Rightarrow IQR = Q_3 - Q_1 = 10 - 0 = 10$
Since $1.5 \cdot IQR = 15$ and $26 > Q_3 + 1.5 \cdot IQR = 25$, we consider the largest data point to be a possible outlier.
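Once Q1 and Q3 are known, the fence test is mechanical. The sketch below takes the quartiles as inputs (here the values found in the example) rather than recomputing them; the function names are illustrative.

```python
def fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def possible_outliers(data, q1, q3):
    """Observations that fall below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in data if x < lower or x > upper]

data = [-1, 0, 3, 5, 9, 10, 26]
print(fences(0, 10))
print(possible_outliers(data, q1=0, q3=10))
```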
Example
The following is a sample of the concentration of dissolved organic carbon (mg/L) in mineral soil: {8.5, 10.3, 5.5, 8.05,
3.02, 12.57, 8.37, 4.6, 7.9, 9.11, 3.91, 11.56, 4.71,10.72, 7.45, 12.89, 7.92, 8.5, 11.72, 8.79, 9.29, 7, 7.66, 21.82,
11.33, 9.81, 17.9, 4.8, 4.85, 21, 3.99, 11.72, 22.62, 7.11, 17.99, 7.31, 4.9, 11.97,10.89, 3.79, 11.8, 10.74, 9.6, 21.4,
16.92, 9.1, 7.85}. Calculate the quartiles and IQR for this sample. Lastly, compute the upper and lower fences.
We first sort the set: {3.02, 3.79, 3.91, 3.99, 4.6, 4.71, 4.8, 4.85, 4.9, 5.5, 7, 7.11, 7.31, 7.45,7.66, 7.85, 7.9, 7.92, 8.05,
8.37, 8.5, 8.5, 8.79, 9.1, 9.11, 9.29, 9.6,9.81, 10.3, 10.72, 10.74, 10.89, 11.33, 11.56, 11.72, 11.72, 11.8, 11.97, 12.57,
12.89, 16.92, 17.9, 17.99, 21, 21.4, 21.82, 22.62} and notice that there are 47 points. Hence, the median (Q2) is the
middle point of the sorted set: 9.1 (the 24th point). Therefore, Q2 = 9.1. To calculate Q1 and Q3, we use the 25th and 75th
percentiles: Q1 = 7.16 and Q3 = 11.72.
Therefore, IQR = 11.72 - 7.16 = 4.56.
The upper and lower fences are:
LF = Q1 - 1.5(IQR) = 7.16 - 1.5(4.56) = 0.32
UF = Q3 + 1.5(IQR) = 11.72 + 1.5(4.56) = 18.56
The Five Number Summary of Position
It is often convenient to summarize the quartile information, the
smallest and largest values in the set as a 5-tuple: (smallest, Q1,
median, Q3, largest).
Example: Find the five number summary for the set:
{-2,-1,0,1,5,6,6,8,10,11,12}.
The smallest number is -2, the largest number is 12, Q2 = 6, Q1 = 0 and Q3 = 10.
Hence, the 5-tuple is (-2,0,6,10,12).
Section 3.5
Box-whisker Plot of the Five Number Summary
(smallest, Q1, median, Q3, largest)
Some Remarks
• A box-whisker plot is a very compact way of summarizing the spread
of the distribution.
• It does not give the shape of the distribution and hence, a
histogram and a box-whisker plot often go together.
• A box-whisker plot is a convenient way to compare two sets of
data.
Comparing Two Sets
Which set has the larger mean?
Which set has the larger median?
Which set has the largest member?
Which has the larger standard deviation?
Can Descriptive Summaries be Misleading?
Example: Suppose a sample of Vanderbilt students are asked to estimate how many
miles that they have driven during the month of August. After receiving the sample for
this population of Vanderbilt students, we compute the following statistics:
• number in sample: 954
• smallest value: 0
• largest value: 25,000
• mean: 2,072.6
• median: 1,903
• standard deviation: 1,662.9
• IQR: 1,908
Is it reasonable to say that the average Vanderbilt student drove approximately 2,073 miles,
with a median of 1,903 miles, during the month of August?
Actual Data Set
{0,1,5,9,13,...,2997,3000,3010,3020,3030,…,5000,24000,25000}
954 data points
25,000 and 24,000 are outliers
Remove outliers: mean = 2,025.5, median = 1,899, SD = 1307.5
Home Ownership in America
• SmartMoney Magazine, October 2006.
• Question: Is real estate making millionaires of the
average citizen?
• Median value of homes has risen from $131,000 in
2001 to $160,000 in 2004.
• 69% of Americans own homes.
• Average net worth of homeowners: $625,000.
• Median net worth of homeowners: $184,000.
Summary
• Center of a Distribution
– Mean
– Median
– Mode
• Spread of a Distribution
– Range
– Variance
– Standard Deviation
• Position in a Distribution
– Quartile
– Percentile
– IQR
– Five number summary
– Box-whisker plot
– z-score
• Grouped Data