Download Sullivan 2nd ed Chapter 3

Chapter 3
Numerically
Summarizing Data
Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 1 of 3
Chapter 3
● Chapter 3 – Numerically Summarizing Data
3.1 Measures of Central Tendency
3.2 Measures of Dispersion
3.3 Measures of Central Tendency and Dispersion from Grouped Data
3.4 Measures of Position
3.5 The Five-Number Summary and Boxplots
Chapter 3
Section 1
Measures of
Central Tendency
Chapter 3 – Section 1
● Analyzing populations versus analyzing samples
● For populations
 We know all of the data
 Descriptive measures of populations are called
parameters
 Parameters are often written using Greek letters ( μ )
● For samples
 We know only part of the entire data
 Descriptive measures of samples are called statistics
 Statistics are often written using Roman letters ( x̄ )
Chapter 3 – Section 1
● The arithmetic mean of a variable is often what
people mean by the “average” … add up all the
values and divide by how many there are
● Compute the arithmetic mean of
6, 1, 5
● Add up the three numbers and divide by 3
(6 + 1 + 5) / 3 = 4.0
● The arithmetic mean is 4.0
Chapter 3 – Section 1
● The arithmetic mean is usually called the mean
● For a population … the population mean
 Is computed using all the observations in a population
 Is denoted μ
 Is a parameter
● For a sample … the sample mean
 Is computed using only the observations in a sample
 Is denoted x̄
 Is a statistic
Chapter 3 – Section 1
● The median of a variable is the “center”
● When the data is sorted in order, the median is
the middle value
● The calculation of the median of a variable is
slightly different depending on
 If there are an odd number of points, or
 If there are an even number of points
Chapter 3 – Section 1
● To calculate the median (M) of a data set
 Arrange the data in order
 Count the number of observations, n
● If n is odd
 There is a value that’s exactly in the middle
 That value is the median M
● If n is even
 There are two values on either side of the exact
middle
 Take their mean to be the median M
Chapter 3 – Section 1
● An example with an odd number of observations
(5 observations)
● Compute the median of
6, 1, 11, 2, 11
● Sort them in order
1, 2, 6, 11, 11
● The middle number is 6, so the median is 6
Chapter 3 – Section 1
● An example with an even number of
observations (4 observations)
● Compute the median of
6, 1, 11, 2
● Sort them in order
1, 2, 6, 11
● Take the mean of the two middle values
(2 + 6) / 2 = 4
● The median is 4
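The odd and even cases above can be sketched in Python. This is a minimal illustration rather than part of the original slides; the function name `median` is our own choice.

```python
def median(values):
    """Median computed with the sort-and-pick procedure from the slides."""
    ordered = sorted(values)      # arrange the data in order
    n = len(ordered)              # count the number of observations
    mid = n // 2
    if n % 2 == 1:
        # n is odd: one value sits exactly in the middle
        return ordered[mid]
    # n is even: take the mean of the two values around the middle
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([6, 1, 11, 2, 11]))  # odd number of observations
print(median([6, 1, 11, 2]))      # even number of observations
```

Python's standard library offers `statistics.median`, which applies the same odd/even rule.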
Chapter 3 – Section 1
● One interpretation
● The median splits the data into halves
62, 68, 71, 74, 77, 82, 84, 88, 90, 94
M = 79.5
62, 68, 71, 74, 77 (5 on the left)
82, 84, 88, 90, 94 (5 on the right)
Chapter 3 – Section 1
● The mode of a variable is the most frequently
occurring value
● Find the mode of
6, 1, 2, 6, 11, 7, 3
● The distinct values are
1, 2, 3, 6, 7, 11
● The value 6 occurs twice, all the other values
occur only once
● The mode is 6
Chapter 3 – Section 1
● Qualitative data
 Values are one of a set of categories
 Cannot add or order them … the mean and median
do not exist
 The mode is the only one of these three
measurements that exists
● Find the mode of
blue, blue, blue, red, green
● The mode is “blue” because it is the value that
occurs the most often
Chapter 3 – Section 1
● Quantitative data
 The mode can be computed but sometimes it is not
meaningful
 Sometimes each value will only occur once (which
can often happen with precise measurements)
● Find the mode of
5.1, 6.6, 6.8, 9.3, 1.9
● Each value occurs only once
● The mode is not a meaningful measurement
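The three mode examples above can be reproduced with a short Python sketch. The helper name `modes` is our own; it returns every value tied for the highest count, which makes the "each value occurs only once" case easy to spot.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency."""
    counts = Counter(values)            # frequency of each distinct value
    highest = max(counts.values())      # the largest frequency observed
    return [value for value, count in counts.items() if count == highest]


print(modes([6, 1, 2, 6, 11, 7, 3]))                     # quantitative data
print(modes(["blue", "blue", "blue", "red", "green"]))   # qualitative data
print(modes([5.1, 6.6, 6.8, 9.3, 1.9]))                  # every value is tied
```

When the result lists every data value, as in the last call, the mode is not a meaningful measurement.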
Chapter 3 – Section 1
● One interpretation
● In primary elections, the candidate who receives
the most votes is often called “the winner”
● Votes (data values) are
Candidate   Number of votes
Henry       194
Kayla       215
Jason       172
● The mode is “Kayla” … Kayla is the winner
Chapter 3 – Section 1
● The mean and the median are often different
● This difference gives us clues about the shape
of the distribution
 Is it symmetric?
 Is it skewed left?
 Is it skewed right?
 Are there any extreme values?
Chapter 3 – Section 1
● Symmetric – the mean will usually be close to
the median
● Skewed left – the mean will usually be smaller
than the median
● Skewed right – the mean will usually be larger
than the median
Chapter 3 – Section 1
● If a distribution is symmetric, the data values
above and below the mean will balance
 The mean will be in the “middle”
 The median will be in the “middle”
● Thus the mean will be close to the median, in
general, for a distribution that is symmetric
Chapter 3 – Section 1
● If a distribution is skewed left, there will be some
data values that are smaller than the others
 The mean will decrease
 The median will not decrease as much
● Thus the mean will be smaller than the median,
in general, for a distribution that is skewed left
Chapter 3 – Section 1
● If a distribution is skewed right, there will be
some data values that are larger than the others
 The mean will increase
 The median will not increase as much
● Thus the mean will be larger than the median, in
general, for a distribution that is skewed right
Chapter 3 – Section 1
● For a mostly symmetric distribution, the mean
and the median will be roughly equal
● Many variables, such as birth weights below, are
approximately symmetric
Chapter 3 – Section 1
● What if one value is extremely different from the
others (a so-called outlier)?
● What if we made a mistake and
6, 1, 2
was recorded as
6000, 1, 2
● The mean is now ( 6000 + 1 + 2 ) / 3 = 2001
● The median is still 2
● The median is “resistant to extreme values”
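The data-entry mistake above is easy to replay in Python. This is only an illustrative sketch using the standard `statistics` module; the variable names are our own.

```python
from statistics import mean, median

correct = [6, 1, 2]
mistyped = [6000, 1, 2]   # the 6 was recorded as 6000

# The mean moves from 3 to 2001, while the median stays at 2
print("mean:", mean(correct), "->", mean(mistyped))
print("median:", median(correct), "->", median(mistyped))
```

One extreme value changes the mean by a factor of several hundred here, while the median is unchanged.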
Summary: Chapter 3 – Section 1
● Mean
 The center of gravity
 Useful for roughly symmetric quantitative data
● Median
 Splits the data into halves
 Useful for highly skewed quantitative data
● Mode
 The most frequent value
 Useful for qualitative data
Chapter 3
Section 2
Measures of
Dispersion
Chapter 3 – Section 2
● Learning objectives
1 The range of a variable
2 The variance of a variable
3 The standard deviation of a variable
4 Use the Empirical Rule
5 Use Chebyshev’s inequality
Chapter 3 – Section 2
● Comparing two sets of data
● The measures of central tendency (mean,
median, mode) compare the “average” or
“typical” values of two sets of data
● The measures of dispersion in this section
compare how far “spread out” the data values
are
Chapter 3 – Section 2
● The range of a variable is the largest data value
minus the smallest data value
● Compute the range of
6, 1, 2, 6, 11, 7, 3, 3
● The largest value is 11
● The smallest value is 1
● Subtracting the two … 11 – 1 = 10 … the range
is 10
Chapter 3 – Section 2
● The range only uses two values in the data set –
the largest value and the smallest value
● The range is not resistant
● If we made a mistake and
6, 1, 2
was recorded as
6000, 1, 2
● The range is now ( 6000 – 1 ) = 5999
Chapter 3 – Section 2
● The variance is based on the deviations from the
mean
 ( xᵢ – μ ) for populations
 ( xᵢ – x̄ ) for samples
● To treat positive and negative differences alike,
we square the deviations
 ( xᵢ – μ )² for populations
 ( xᵢ – x̄ )² for samples
Chapter 3 – Section 2
● The population variance of a variable is the sum
of these squared deviations divided by the
number in the population
$$\sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \dots + (x_N - \mu)^2}{N} = \frac{\sum (x_i - \mu)^2}{N}$$
● The population variance is represented by σ²
● Note: For accuracy, use as many decimal places
as allowed by your calculator
Chapter 3 – Section 2
● Compute the population variance of
6, 1, 2, 11
● Compute the population mean first
μ = (6 + 1 + 2 + 11) / 4 = 5
● Now compute the squared deviations
(1–5)² = 16, (2–5)² = 9, (6–5)² = 1, (11–5)² = 36
● Average the squared deviations
(16 + 9 + 1 + 36) / 4 = 15.5
● The population variance σ² is 15.5
Chapter 3 – Section 2
● The sample variance of a variable is the sum of
these squared deviations divided by one less
than the number in the sample
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
● The sample variance is represented by s²
● We say that this statistic has n – 1 degrees of
freedom
Chapter 3 – Section 2
● Compute the sample variance of
6, 1, 2, 11
● Compute the sample mean first
x̄ = (6 + 1 + 2 + 11) / 4 = 5
● Now compute the squared deviations
(1–5)² = 16, (2–5)² = 9, (6–5)² = 1, (11–5)² = 36
● Sum the squared deviations and divide by n – 1 = 3
(16 + 9 + 1 + 36) / 3 ≈ 20.7
● The sample variance s² is about 20.7
Chapter 3 – Section 2
● Why are the population variance (15.5) and the
sample variance (20.7) different for the same set
of numbers?
● In the first case, { 6, 1, 2, 11 } was the entire
population (divide by N)
● In the second case, { 6, 1, 2, 11 } was just a
sample from the population (divide by n – 1)
● These are two different situations
Chapter 3 – Section 2
● Why do we use different formulas?
● The reason is that using the sample mean is not
quite as accurate as using the population mean
● If we used “n” in the denominator for the sample
variance calculation, we would get a “biased”
result
● Bias here means that we would tend to
underestimate the true variance
Chapter 3 – Section 2
● The standard deviation is the square root of the
variance
● The population standard deviation
 Is the square root of the population variance (σ²)
 Is represented by σ
● The sample standard deviation
 Is the square root of the sample variance (s²)
 Is represented by s
Chapter 3 – Section 2
● If the population is { 6, 1, 2, 11 }
 The population variance σ² = 15.5
 The population standard deviation σ = √15.5 ≈ 3.9
● If the sample is { 6, 1, 2, 11 }
 The sample variance s² ≈ 20.7
 The sample standard deviation s = √20.7 ≈ 4.5
● The population standard deviation and the
sample standard deviation apply in different
situations
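The population and sample calculations for { 6, 1, 2, 11 } can be checked with a short Python sketch. The function names below are our own; the only difference between the two is the denominator (N versus n – 1).

```python
from math import sqrt


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [6, 1, 2, 11]
sigma_squared = population_variance(data)
s_squared = sample_variance(data)
print(sigma_squared, round(sqrt(sigma_squared), 1))       # variance and sigma
print(round(s_squared, 1), round(sqrt(s_squared), 1))     # variance and s
```

The standard library's `statistics` module provides `pvariance`, `variance`, `pstdev` and `stdev` for the same four quantities.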
Chapter 3 – Section 2
● The standard deviation is very useful for
estimating probabilities
Chapter 3 – Section 2
● The empirical rule
● If the distribution is roughly bell shaped, then
 Approximately 68% of the data will lie within 1
standard deviation of the mean
 Approximately 95% of the data will lie within 2
standard deviations of the mean
 Approximately 99.7% of the data (i.e. almost all) will
lie within 3 standard deviations of the mean
Chapter 3 – Section 2
● For a variable with mean 17 and standard
deviation 3.4
 Approximately 68% of the values will lie between
(17 – 3.4) and (17 + 3.4), i.e. 13.6 and 20.4
 Approximately 95% of the values will lie between
(17 – 2 × 3.4) and (17 + 2 × 3.4), i.e. 10.2 and 23.8
 Approximately 99.7% of the values will lie between
(17 – 3 × 3.4) and (17 + 3 × 3.4), i.e. 6.8 and 27.2
● A value of 2.1 and a value of 33.2 would both be
very unusual
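The three intervals above follow one pattern, which a small Python sketch makes explicit. The function name `empirical_rule_intervals` is our own, and the percentages only apply when the distribution is roughly bell shaped.

```python
def empirical_rule_intervals(mean, sd):
    """Map each approximate percentage to its (low, high) interval."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}   # standard deviations -> percent
    return {pct: (mean - k * sd, mean + k * sd) for k, pct in coverage.items()}


intervals = empirical_rule_intervals(17, 3.4)
for pct, (low, high) in intervals.items():
    print(f"about {pct}% of values between {low:.1f} and {high:.1f}")
```

Values such as 2.1 and 33.2 fall outside even the widest of these intervals, which is why they would be very unusual.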
Chapter 3 – Section 2
● Chebyshev’s inequality gives a lower bound on
the percentage of observations that lie within k
standard deviations of the mean (where k > 1)
● This lower bound is
 A conservative estimate of the percentage
 The actual percentage for any variable cannot be
lower than this number
● Therefore the actual percentage must be this
value or higher
Chapter 3 – Section 2
● Chebyshev’s inequality
● For any data set, at least
$$\left(1 - \frac{1}{k^2}\right) \cdot 100\%$$
of the observations will lie within k standard
deviations of the mean, where k is any number
greater than 1
Chapter 3 – Section 2
● How much of the data lies within 1.5 standard
deviations of the mean?
● From Chebyshev’s inequality
$$\left(1 - \frac{1}{1.5^2}\right) \cdot 100\% \approx 55.6\%$$
so that at least 55.6% of the data will lie within
1.5 standard deviations of the mean
Chapter 3 – Section 2
● If the mean is equal to 20 and the standard
deviation is equal to 4, how much of the data lies
between 14 and 26?
● 14 and 26 are each 1.5 standard deviations from 20
$$\left(1 - \frac{1}{1.5^2}\right) \cdot 100\% \approx 55.6\%$$
so that at least 55.6% of the data will lie between
14 and 26
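Both Chebyshev examples can be reproduced with a few lines of Python. This is a sketch under our own naming (`chebyshev_lower_bound`); it simply evaluates (1 – 1/k²) × 100% and rejects k ≤ 1, where the inequality gives no useful bound.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's inequality requires k > 1")
    return (1 - 1 / k ** 2) * 100


# Mean 20, standard deviation 4, interval 14 to 26
mean, sd, upper = 20, 4, 26
k = (upper - mean) / sd          # number of standard deviations: 1.5
print(k, round(chebyshev_lower_bound(k), 1))
```

For comparison, k = 2 gives a guaranteed minimum of 75%, well below the roughly 95% that the empirical rule suggests for bell-shaped data.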
Summary: Chapter 3 – Section 2
● Range
 The maximum minus the minimum
 Not a resistant measurement
● Variance and standard deviation
 Measures deviations from the mean
 Not a resistant measurement
● Empirical rule
 About 68% of the data is within 1 standard deviation
 About 95% of the data is within 2 standard deviations
Chapter 3
Section 3
Measures of Central Tendency and
Dispersion from Grouped Data
Chapter 3 – Section 3
● Learning objectives
1 The mean from grouped data
2 The weighted mean
3 The variance and standard deviation for grouped data
Chapter 3 – Section 3
● Data may come in groups rather than
individually
● The values may have been summarized in
frequency distributions
 Ranges of ages (20 – 29, 30 – 39, ...)
 Ranges of incomes ($10,000 – $19,999, $20,000 –
$39,999, $40,000 – $79,999, ...)
● The exact values for the mean, variance, and
standard deviation cannot be calculated
Chapter 3 – Section 3
● To compute the mean for grouped data
 Assume that, within each class, the mean of the data
is equal to the class midpoint
 Use the class midpoint in the formula for the mean
 The number of times the class midpoint value is used
is equal to the frequency of the class
● If 6 values are in the interval [ 8, 10 ] , then we
assume that all 6 values are equal to 9 (the
midpoint of [ 8, 10 ])
Chapter 3 – Section 3
● As an example, for the following frequency table,
Class      Midpoint   Frequency
0 – 1.9    1          3
2 – 3.9    3          7
4 – 5.9    5          6
6 – 7.9    7          1
we calculate the mean as if
 The value 1 occurred 3 times
 The value 3 occurred 7 times
 The value 5 occurred 6 times
 The value 7 occurred 1 time
Chapter 3 – Section 3
Class      Midpoint   Frequency
0 – 1.9    1          3
2 – 3.9    3          7
4 – 5.9    5          6
6 – 7.9    7          1
● The calculation for the mean would be
$$\frac{1 + 1 + 1 + 3 + 3 + 3 + 3 + 3 + 3 + 3 + 5 + 5 + 5 + 5 + 5 + 5 + 7}{17}$$
or
$$\frac{(1 \times 3) + (3 \times 7) + (5 \times 6) + (7 \times 1)}{17}$$
Chapter 3 – Section 3
● Evaluating this formula
(1 3)  (3  7)  (5  6)  (7  1)
61

 3.6
3  7  6 1
17
● The mean is about 3.6
● In mathematical notation
 xi fi
 fi
● This would be μ for the population mean and x
for the sample mean
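The grouped-mean formula can be sketched in Python using the midpoints and frequencies from the table. The function name `grouped_mean` is our own, and the result is an approximation because every value in a class is replaced by the class midpoint.

```python
def grouped_mean(midpoints, frequencies):
    """Approximate mean: sum of (midpoint * frequency) / sum of frequencies."""
    weighted_total = sum(x * f for x, f in zip(midpoints, frequencies))
    return weighted_total / sum(frequencies)


midpoints = [1, 3, 5, 7]      # class midpoints
frequencies = [3, 7, 6, 1]    # number of observations in each class
approx_mean = grouped_mean(midpoints, frequencies)
print(round(approx_mean, 1))
```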
Chapter 3 – Section 3
● Sometimes not all data values are equally
important
● To compute a grade point average (GPA), a
grade in a 4 credit class is worth more than a
grade in a 1 credit class
● The weights wi quantify the relative importance
of the different values
● Higher weights correspond to more important
values
Chapter 3 – Section 3
● As an example, the following grades
Course              Credits   Grade
Statistics          3         A
French Literature   3         B
Biochemistry        5         B
Badminton           1         D
would yield a GPA (on a 4 point scale) of
$$\frac{(3 \times 4) + (3 \times 3) + (5 \times 3) + (1 \times 1)}{3 + 3 + 5 + 1} = \frac{37}{12} \approx 3.08$$
Chapter 3 – Section 3
● In mathematical notation, if wi is the weight
corresponding to the data value xi, then the
weighted mean is
$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$$
● This formula looks similar to one for the mean
for grouped data, and the concepts are similar
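The GPA example is a weighted mean with credits as the weights. A minimal Python sketch follows; the names `weighted_mean`, `grade_points` and `courses` are our own, and the grade-to-point mapping covers only the grades that appear in the example (A = 4, B = 3, D = 1, as used in the calculation above).

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


grade_points = {"A": 4, "B": 3, "D": 1}   # 4 point scale, grades from the example
courses = [
    ("Statistics", 3, "A"),
    ("French Literature", 3, "B"),
    ("Biochemistry", 5, "B"),
    ("Badminton", 1, "D"),
]

points = [grade_points[grade] for _, _, grade in courses]
credits = [credit for _, credit, _ in courses]
gpa = weighted_mean(points, credits)
print(round(gpa, 2))
```

Replacing the credits with class frequencies and the grade points with class midpoints turns the same function into the grouped-data mean.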
Chapter 3 – Section 3
● To compute the variance for grouped data
 Assume again that, within each class, the mean of the
data is equal to the class midpoint
 Use the class midpoint in the formula for the variance
 The number of times the class midpoint value is used
is equal to the frequency of the class
● If 6 values are in the interval [ 8, 10 ] , then we
assume that all 6 values are equal to 9 (the
midpoint of [ 8, 10 ])
● The same approach as for the mean
Chapter 3 – Section 3
● As an example, for the following frequency table,
Class      Midpoint   Frequency
0 – 1.9    1          3
2 – 3.9    3          7
4 – 5.9    5          6
6 – 7.9    7          1
we calculate the variance as if
 The value 1 occurred 3 times
 The value 3 occurred 7 times
 The value 5 occurred 6 times
 The value 7 occurred 1 time
Chapter 3 – Section 3
Class      Midpoint   Frequency
0 – 1.9    1          3
2 – 3.9    3          7
4 – 5.9    5          6
6 – 7.9    7          1
● From our previous example, the mean is 3.6
● Just as for the mean, the calculation for the
variance would then be
$$\frac{((1 - 3.6)^2 \times 3) + ((3 - 3.6)^2 \times 7) + ((5 - 3.6)^2 \times 6) + ((7 - 3.6)^2 \times 1)}{17}$$
Chapter 3 – Section 3
● Evaluating this formula
((1  3.6)2  3)  ((3  3.6)2  7)  ((5  3.6)2  6)  ((7  3.6)2  1)
17

46.1
 2.7
17
● The variance is about 2.7
● The standard deviation would be about
2.7  1.6
Chapter 3 – Section 3
● In mathematical notation
● The population variance would be
$$\sigma^2 = \frac{\sum (x_i - \mu)^2 f_i}{\sum f_i}$$
● The sample variance would be
$$s^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\left(\sum f_i\right) - 1}$$
● The standard deviations would be the
corresponding square roots
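Both grouped-data variance formulas can be sketched in one Python function. The name `grouped_variance` and its `sample` flag are our own; the sketch uses the unrounded grouped mean (61/17) rather than 3.6, which gives the same result to one decimal place.

```python
from math import sqrt


def grouped_variance(midpoints, frequencies, sample=False):
    """Approximate variance from class midpoints and frequencies."""
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
    squared = sum((x - mean) ** 2 * f for x, f in zip(midpoints, frequencies))
    # Divide by (sum of frequencies) - 1 for a sample, by the sum for a population
    return squared / (n - 1 if sample else n)


midpoints = [1, 3, 5, 7]
frequencies = [3, 7, 6, 1]
pop_var = grouped_variance(midpoints, frequencies)
samp_var = grouped_variance(midpoints, frequencies, sample=True)
print(round(pop_var, 1), round(sqrt(pop_var), 1))
print(round(samp_var, 2), round(sqrt(samp_var), 2))
```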
Summary: Chapter 3 – Section 3
● The mean for grouped data
 Use the class midpoints
 Obtain an approximation for the mean
● The variance and standard deviation for grouped
data
 Use the class midpoints
 Obtain an approximation for the variance and
standard deviation
Chapter 3
Section 4
Measures of
Position
Chapter 3 – Section 4
● Learning objectives
1 Determine and interpret z-scores
2 Determine and interpret percentiles
3 Determine and interpret quartiles
4 Check a set of data for outliers
Chapter 3 – Section 4
● Mean / median describe the “center” of the data
● Variance / standard deviation describe the
“spread” of the data
● This section discusses more precise ways to
describe the relative position of a data value
within the entire set of data
Chapter 3 – Section 4
● The standard deviation is a measure of
dispersion that uses the same dimensions as the
data (remember the empirical rule)
● The distance of a data value from the mean,
calculated as the number of standard deviations,
would be a useful measurement
● This distance is called the z-score
Chapter 3 – Section 4
● If the mean was 20 and the standard deviation
was 6
 The value 26 would have a z-score of 1.0 (1.0
standard deviation higher than the mean)
 The value 14 would have a z-score of –1.0 (1.0
standard deviation lower than the mean)
 The value 17 would have a z-score of –0.5 (0.5
standard deviations lower than the mean)
 The value 20 would have a z-score of 0.0
Chapter 3 – Section 4
● The population z-score is calculated using the
population mean and population standard
deviation
$$z = \frac{x - \mu}{\sigma}$$
● The sample z-score is calculated using the
sample mean and sample standard deviation
$$z = \frac{x - \bar{x}}{s}$$
Chapter 3 – Section 4
● z-scores can be used to compare the relative
positions of data values in different samples
 Pat received a grade of 82 on her statistics exam
where the mean grade was 74 and the standard
deviation was 12
 Pat received a grade of 72 on her biology exam
where the mean grade was 65 and the standard
deviation was 10
 Pat received a grade of 91 on her kayaking exam
where the mean grade was 88 and the standard
deviation was 6
Chapter 3 – Section 4
● Statistics
 Grade of 82
 z-score of (82 – 74) / 12 ≈ 0.67
● Biology
 Grade of 72
 z-score of (72 – 65) / 10 = 0.70
● Kayaking
 Grade of 91
 z-score of (91 – 88) / 6 = 0.50
● Biology was the highest relative grade
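Pat's three exams can be compared with a short Python sketch. The names `z_score` and `exams` are our own; each entry holds the grade, the class mean and the class standard deviation from the example.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in units of standard deviations."""
    return (x - mean) / sd


exams = {
    "statistics": (82, 74, 12),   # (grade, mean, standard deviation)
    "biology": (72, 65, 10),
    "kayaking": (91, 88, 6),
}

scores = {name: z_score(*values) for name, values in exams.items()}
best = max(scores, key=scores.get)   # exam with the highest relative grade

for name, z in scores.items():
    print(f"{name}: z = {z:.2f}")
print("highest relative grade:", best)
```

The kayaking grade is the highest raw score but the lowest z-score, which is the point of standardizing before comparing.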
Chapter 3 – Section 4
● The median divides the lower 50% of the data
from the upper 50%
● The median is the 50th percentile
● If a number divides the lower 34% of the data
from the upper 66%, that number is the 34th
percentile
Chapter 3 – Section 4
● The computation is similar to the one for the
median
● Calculation
 Arrange the data in ascending order
 Compute the index i using the formula
$$i = \left(\frac{k}{100}\right)(n + 1)$$
● If i is an integer, take the ith data value
● If i is not an integer, take the mean of the two
values on either side of i
Chapter 3 – Section 4
● Compute the 60th percentile of
1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 34
● Calculations
 There are 14 numbers (n = 14)
 The 60th percentile (k = 60)
 The index
$$i = \left(\frac{k}{100}\right)(n + 1) = \left(\frac{60}{100}\right)(14 + 1) = 9$$
● Take the 9th value, or P60 = 23, as the 60th
percentile
Chapter 3 – Section 4
● Compute the 28th percentile of
1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54
● Calculations
 There are 14 numbers (n = 14)
 The 28th percentile (k = 28)
 The index
$$i = \left(\frac{k}{100}\right)(n + 1) = \left(\frac{28}{100}\right)(14 + 1) = 4.2$$
● Take the average of the 4th and 5th values, or
P28 = (7 + 8) / 2 = 7.5, as the 28th percentile
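The index rule used in these percentile examples can be sketched in Python. The function name `percentile` is our own, k is assumed to be a whole number, and cases where the index falls outside the data are rejected because the slides do not say how to handle them. Software packages often interpolate instead of averaging, so their results may differ slightly.

```python
def percentile(values, k):
    """k-th percentile using the index i = (k / 100) * (n + 1)."""
    ordered = sorted(values)                 # arrange the data in ascending order
    n = len(ordered)
    # Integer arithmetic: i = whole + remainder / 100
    whole, remainder = divmod(k * (n + 1), 100)
    if whole < 1 or whole > n or (remainder and whole == n):
        raise ValueError("index falls outside the data")
    if remainder == 0:
        # i is an integer: take the i-th data value
        return ordered[whole - 1]
    # i is not an integer: mean of the two values on either side of i
    return (ordered[whole - 1] + ordered[whole]) / 2


data = [1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54]
print(percentile(data, 28))   # index 4.2
print(percentile(data, 60))   # index 9
```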
Chapter 3 – Section 4
● The quartiles are the 25th, 50th, and 75th
percentiles
 Q1 = 25th percentile
 Q2 = 50th percentile = median
 Q3 = 75th percentile
● Quartiles are the most commonly used
percentiles
● The 50th percentile and the second quartile Q2
are both other ways of defining the median
Chapter 3 – Section 4
● Quartiles divide the data set into four equal parts
● The top quarter are the values between Q3 and
the maximum
● The bottom quarter are the values between the
minimum and Q1
Chapter 3 – Section 4
● Quartiles divide the data set into four equal parts
● The interquartile range (IQR) is the difference
between the third and first quartiles
IQR = Q3 – Q1
● The IQR is a resistant measurement of
dispersion
Chapter 3 – Section 4
● Extreme observations in the data are referred to
as outliers
● Outliers should be investigated
● Outliers could be
 Chance occurrences
 Measurement errors
 Data entry errors
 Sampling errors
● Outliers are not necessarily invalid data
Chapter 3 – Section 4
● One way to check for outliers uses the quartiles
● Outliers can be detected as values that are
significantly too high or too low, based on the
known spread
● The fences used to identify outliers are
 Lower fence = LF = Q1 – 1.5 × IQR
 Upper fence = UF = Q3 + 1.5 × IQR
● Values less than the lower fence or more than
the upper fence could be considered outliers
Chapter 3 – Section 4
● Is the value 54 an outlier?
1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54
● Calculations
 Q1 = (4 + 7) / 2 = 5.5
 Q3 = (27 + 31) / 2 = 29
 IQR = 29 – 5.5 = 23.5
 UF = Q3 + 1.5 × IQR = 29 + 1.5 × 23.5 = 64.25
● Using the fence rule, the value 54 is not an
outlier
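The fence rule can be sketched in Python. The function name `fences` is our own, and the quartiles are passed in directly using the values computed in the example (Q1 = 5.5, Q3 = 29).

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the first and third quartiles."""
    iqr = q3 - q1                      # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54]
lower, upper = fences(5.5, 29)         # quartiles taken from the example
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper)
print(outliers)                        # empty: 54 lies inside the fences
```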
Summary: Chapter 3 – Section 4
● z-scores
 Measure the distance from the mean in units of
standard deviations
 Can compare relative positions in different samples
● Percentiles and quartiles
 Divide the data so that a certain percent is lower and
a certain percent is higher
● Outliers
 Extreme values of the variable
 Can be identified using the upper and lower fences
Chapter 3
Section 5
The Five-Number Summary
And Boxplots
Chapter 3 – Section 5
● Learning objectives
1 Compute the five-number summary
2 Draw and interpret boxplots
Chapter 3 – Section 5
● The five-number summary is the collection of
 The smallest value
 The first quartile (Q1 or P25)
 The median (M or Q2 or P50)
 The third quartile (Q3 or P75)
 The largest value
● These five numbers give a concise description of
the distribution of a variable
Chapter 3 – Section 5
● The median
 Information about the center of the data
 Resistant
● The first quartile and the third quartile
 Information about the spread of the data
 Resistant
● The smallest value and the largest value
 Information about the tails of the data
 Not resistant
Chapter 3 – Section 5
● Compute the five-number summary for
1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54
● Calculations
 The minimum = 1
 Q1 = P25, the index i = 3.75, Q1 = (4 + 7) / 2 = 5.5
 M = Q2 = P50 = (16 + 19) / 2 = 17.5
 Q3 = P75, the index i = 11.25, Q3 = (27 + 31) / 2 = 29
 The maximum = 54
● The five-number summary is
1, 5.5, 17.5, 29, 54
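The calculation above can be sketched in Python. This assumes the index rule the slide appears to use, i = (k / 100)(n + 1): take the i-th value when i is a whole number, and otherwise average the two values on either side of i. The function names are illustrative, not from the text.

```python
import math


def percentile(sorted_data, k):
    """k-th percentile located with the index i = (k / 100) * (n + 1).

    If i is a whole number, return the i-th value (counting from 1);
    otherwise return the mean of the values on either side of i.
    """
    n = len(sorted_data)
    i = k / 100 * (n + 1)
    if i < 1 or i > n:
        raise ValueError("index falls outside the data; need more observations")
    if i == int(i):
        return sorted_data[int(i) - 1]
    below = sorted_data[math.floor(i) - 1]
    above = sorted_data[math.ceil(i) - 1]
    return (below + above) / 2


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    values = sorted(data)
    return (values[0], percentile(values, 25), percentile(values, 50),
            percentile(values, 75), values[-1])


data = [1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54]
print(five_number_summary(data))  # (1, 5.5, 17.5, 29.0, 54)
```

Library quartile routines generally use other interpolation rules, so they may not reproduce 5.5 and 29 for this data.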
Chapter 3 – Section 5
● The five-number summary can be illustrated
using a graph called the boxplot
● An example of a (basic) boxplot is
[Figure: basic boxplot]
● The middle box shows Q1, Q2, and Q3
● The horizontal lines (sometimes called
“whiskers”) show the minimum and maximum
Chapter 3 – Section 5
● To draw a (basic) boxplot:
 Calculate the five-number summary
 Draw a horizontal line that will cover all the data from
the minimum to the maximum
 Draw a box with the left edge at Q1 and the right edge
at Q3
 Draw a line inside the box at M = Q2
 Draw a horizontal line from the Q1 edge of the box to
the minimum and one from the Q3 edge of the box to
the maximum
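The drawing steps can be imitated in plain text. The sketch below is only an illustration: the function name `ascii_boxplot`, the characters used, and the default width are assumptions, and it expects the five numbers in increasing order.

```python
def ascii_boxplot(minimum, q1, median, q3, maximum, width=60):
    """Draw a basic horizontal boxplot as one line of text."""
    if maximum <= minimum:
        raise ValueError("maximum must exceed minimum")

    def column(x):
        # Map a data value to a character position on the line
        return round((x - minimum) * (width - 1) / (maximum - minimum))

    line = ["-"] * width                  # line from the minimum to the maximum
    for c in range(column(q1), column(q3) + 1):
        line[c] = "="                     # the box from Q1 to Q3
    line[column(q1)] = "["                # left edge of the box at Q1
    line[column(q3)] = "]"                # right edge of the box at Q3
    line[column(median)] = "|"            # median line inside the box
    return "".join(line)


# Five-number summary from the earlier example
print(ascii_boxplot(1, 5.5, 17.5, 29, 54))
```

The dashes outside the brackets play the role of the whiskers, running from the box edges out to the minimum and the maximum.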
Chapter 3 – Section 5
● To draw a (basic) boxplot
 Draw the middle box
 Draw in the median
 Draw the minimum and maximum
 Voila!
Chapter 3 – Section 5
● An example of a more sophisticated boxplot is
[Figure: boxplot with an outlier marked by an asterisk]
● The middle box shows Q1, Q2, and Q3
● The horizontal lines (sometimes called
“whiskers”) extend to the smallest and largest
values that are not outliers
● The asterisk on the right shows an outlier
(determined by using the upper fence)
Chapter 3 – Section 5
● To draw this boxplot (in a slightly different way
than the text)
 Draw the center box and mark the median, as before
 Compute the upper fence and the lower fence
 Temporarily remove the outliers as identified by the
upper fence and the lower fence (but we will add
them back later with asterisks)
 Draw the horizontal lines to the new minimum and
new maximum
 Mark each of the outliers with an asterisk
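The pieces needed for this boxplot can be computed as in the sketch below. The function name is illustrative, the quartiles are supplied by the caller, and the data set is hypothetical: it is the earlier example with 54 replaced by 80 so that an outlier appears.

```python
def modified_boxplot_parts(data, q1, q3):
    """Return (whisker_low, whisker_high, outliers) for a boxplot with fences."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Values beyond either fence are set aside and later drawn as asterisks
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    kept = [x for x in data if lower_fence <= x <= upper_fence]
    # The whiskers reach the new minimum and new maximum
    return min(kept), max(kept), outliers


# Hypothetical data: the earlier example with 54 replaced by 80
data = [1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 80]
q1, q3 = 5.5, 29  # unchanged, since only the largest value was altered
print(modified_boxplot_parts(data, q1, q3))  # (1, 33, [80])
```

With the original value 54 in place, the outlier list would be empty and the whiskers would run from 1 to 54.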
Chapter 3 – Section 5
● To draw this boxplot
 Draw the middle box and the median
 Draw in the fences, remove the outliers (temporarily)
 Draw the minimum and maximum
 Draw the outliers as asterisks
Chapter 3 – Section 5
● The distribution shape and boxplot are related
 Symmetry (or lack of symmetry)
 Quartiles
 Maximum and minimum
● Relate the distribution shape to the boxplot for
 Symmetric distributions
 Skewed left distributions
 Skewed right distributions
Chapter 3 – Section 5
● Symmetric distributions
 Distribution: Q1 is equally far from the median as Q3 is
 Boxplot: the median line is in the center of the box
 Distribution: the min is equally far from the median as the max is
 Boxplot: the left whisker is equal to the right whisker
[Figure: symmetric distribution and its boxplot, each marked Min, Q1, M, Q3, Max]
Chapter 3 – Section 5
● Skewed left distributions
 Distribution: Q1 is further from the median than Q3 is
 Boxplot: the median line is to the right of center in the box
 Distribution: the min is further from the median than the max is
 Boxplot: the left whisker is longer than the right whisker
[Figure: skewed left distribution and its boxplot, each marked Min, Q1, M, Q3, Max]
Chapter 3 – Section 5
● Skewed right distributions
 Distribution: Q1 is closer to the median than Q3 is
 Boxplot: the median line is to the left of center in the box
 Distribution: the min is closer to the median than the max is
 Boxplot: the left whisker is shorter than the right whisker
[Figure: skewed right distribution and its boxplot, each marked Min, Q1, M, Q3, Max]
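The three cases can be read off a five-number summary by comparing the two halves of the box and the two whiskers. The sketch below does only that comparison; the function names and the example summaries are hypothetical, and real data will rarely give exactly equal sides.

```python
def longer_side(left, right):
    """Say which of two distances is longer."""
    if left == right:
        return "equal"
    return "left longer" if left > right else "right longer"


def shape_hints(minimum, q1, median, q3, maximum):
    """Compare the two halves of the box and the two whiskers."""
    return {
        "box": longer_side(median - q1, q3 - median),
        "whiskers": longer_side(q1 - minimum, maximum - q3),
    }


# Hypothetical five-number summaries
print(shape_hints(0, 2, 5, 8, 10))   # {'box': 'equal', 'whiskers': 'equal'}
print(shape_hints(0, 6, 8, 9, 10))   # {'box': 'left longer', 'whiskers': 'left longer'}
print(shape_hints(0, 1, 2, 4, 10))   # {'box': 'right longer', 'whiskers': 'right longer'}
```

Equal sides suggest symmetry, longer left sides suggest a skewed left distribution, and longer right sides suggest a skewed right distribution. When the box and the whiskers disagree, the boxplot alone does not settle the shape.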
Chapter 3 – Section 5
● We can compare two distributions by examining
their boxplots
● We draw the boxplots on the same horizontal
scale
 We can visually compare the centers
 We can visually compare the spreads
 We can visually compare the extremes
Chapter 3 – Section 5
● Comparing the “flight” with the “control” samples
[Figure: boxplots of the two samples, annotated with Center and Spread]
Summary: Chapter 3 – Section 5
● 5-number summary
 Minimum, first quartile, median, third quartile,
maximum
 Resistant measures of center (median) and spread
(interquartile range)
● Boxplots
 Visual representation of the 5-number summary
 Related to the shape of the distribution
 Can be used to compare multiple distributions
Chapter 3 Summary
● Numeric summaries of data
 Means, medians, modes
 Ranges, variances, standard deviations, IQRs
 Calculations for grouped data
● Measures of relative position
 z-scores
 Percentiles and quartiles
● Exploratory data analysis
 Five-number summaries
 Box plots