Chapter 3
Numerical Summaries of Center and Variation
Copyright © 2017, 2014 Pearson Education, Inc.
Slide 1
Chapter 3 Topics
• Summaries for center and spread in:
• Symmetric distributions: Mean and standard
deviation
• Skewed distributions: Median and IQR
• Other summaries for variation (variance, range)
• The Empirical Rule and z-scores
• Boxplots, Five Number Summary, and outliers
• Comparing distributions
Slide 2
Todd Taulman. Shutterstock
Section 3.1
SUMMARIES FOR SYMMETRIC
DISTRIBUTIONS
• Measure for Center (Balance Point): Mean
• Measure of Horizontal Spread (Variability):
Standard Deviation
Slide 3
Appropriate Measures
Recall:
When dealing with numerical data, you need
to describe the distribution using these 3
characteristics:
Characteristic    Which Is
Shape             Symmetric, skewed, etc.
Center            Typical value
Spread            Horizontal variability
In Chapter 3 we learn specific ways to describe
the center and spread of a distribution.
Slide 4
The Mean
• Can be thought of as the “balancing point of
the distribution”
Africa Studio. Shutterstock
Slide 5
The Mean: Symmetric Distributions
This dotplot shows the distribution of ACT
scores for a sample of statistics students.
For symmetric distributions, the mean is a good
representation of a “typical value” of the data set.
Slide 6
The Mean: Skewed Distributions
This distribution shows the salaries of professional
baseball players in 2010. Do you think the mean is a
good representation of the “typical” baseball salary for
that year?
Slide 7
Using the Mean to Describe
“Typical” Values
• The mean represents a typical value in a set of
data when the data is roughly symmetric.
• For skewed distributions, the mean is NOT a
good estimate of a typical value.
Slide 8
Computing the Mean
For small data sets:
• Add data values.
• Divide by number of numbers.
For larger data sets:
• Use some kind of appropriate
technology.
Slide 9
The Mean: Example
Suppose a sample of prices for 1 gallon of
regular gas at 10 different gas stations in a
neighborhood in Austin, Texas, is taken on one
fall day in 2013. Find and interpret the mean.
$3.19, $3.09, $3.09, $2.93, $2.95,
$3.09, $2.99, $2.99, $2.95, $2.97
(A dotplot will show this distribution is roughly symmetric.)
Slide 10
Dotplot of Gas Prices
What do you think a good estimate would be for a
“typical” gas price in that neighborhood?
Slide 11
Calculating the Mean
\[ \bar{x} = \frac{3.19+3.09+3.09+2.93+2.95+3.09+2.99+2.99+2.95+2.97}{10} = \frac{30.24}{10} \approx 3.02 \]
INTERPRETATION of the Mean:
The typical price of 1 gallon of gas at these
gas stations in Austin, Texas, was $3.02 on
this particular day in 2013.
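NOTE: For readers working in software rather than on a calculator, the same calculation can be sketched in Python. This is only an illustration, not part of the original slides; the variable names are our own.

```python
# Gas prices (dollars per gallon) from the example above.
prices = [3.19, 3.09, 3.09, 2.93, 2.95, 3.09, 2.99, 2.99, 2.95, 2.97]

# Mean = add the data values, then divide by the number of values.
mean_price = sum(prices) / len(prices)

print(f"Mean price: ${mean_price:.2f}")
```

The printed value is rounded to cents for reporting; the unrounded mean is 3.024.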
Slide 12
Dotplot of Gas Prices
Mean = $3.02
Slide 13
Using the TI-84 Calculator
NOTE:
Your calculator can find the mean, but you must
be able to interpret the results.
To find the mean on the TI-84 calculator:
1. Push STAT then select option 1: Edit.
2. Enter the data set in L1.
3. Push STAT and arrow over to Calc.
4. Choose option 1: 1-var stats (press ENTER).
5. The mean is given at the top (x̄).
Note that the calculator gives you a lot more information.
We will cover this shortly.
Slide 14
Measuring the Spread
Recall:
The variability in a distribution can be
measured by the horizontal spread.
Why care? We need to know if most of the data
is near the center or far from it.
However: Assigning a number to the horizontal
spread is not straightforward.
Slide 15
Measuring the Spread: Example
The following histograms record the daily high
temperatures in degrees Fahrenheit over one recent
year at two locations:
Provo, Utah (elevation 4500 feet)
San Francisco, CA (at sea level)
Slide 16
Measuring the Spread: Example
Note:
• Both distributions have roughly the same:
– Shape: Symmetric
– Center: 67°F in Provo vs 65°F in SF
• The spread in both distributions, however, is very different:
– Provo: More spread out (data values farther from center)
– SF: Less spread out (data values closer to the center)
Slide 17
Measure Spread: Standard Deviation
Standard deviation
• A number that measures how far away the typical
observation is from the mean (center)
• For most distributions, a majority of the data is
within one standard deviation of the mean.
Note:
• Think of the standard deviation as the typical distance
of the observations from their mean.
Slide 18
Standard Deviation: Example
The graph below shows the distribution of the amount of
particulate matter, or smog, in the air in 333 cities in the
United States in 2008, as reported by the Environmental
Protection Agency (EPA). The mean particulate matter is 10.7
micrograms per cubic meter, and the standard deviation is 2.6
micrograms per cubic meter.
Slide 19
Standard Deviation: Example
1. Find the level of particulate matter one standard deviation
above the mean and one standard deviation below the
mean.
2. Keeping in mind that the EPA says that levels over 15
micrograms per cubic meter are unsafe, what can we
conclude about the air quality of most of the cities in this
sample?
Slide 20
Standard Deviation: Example
1. The level of particulate matter one standard
deviation above/below the mean is:
10.7 + 2.6 = 13.3 micrograms per cubic meter
10.7 - 2.6 = 8.1 micrograms per cubic meter
2. Since most cities in this sample have a
particulate level between 8.1 and 13.3
micrograms per cubic meter, which is less than
15 micrograms per cubic meter, the air quality in
most of the cities in this sample is safe.
Slide 21
Standard Deviation: Formula
The formula for the standard deviation is:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
1. Deviation (or distance) of observation, x, from the mean.
2. Square to make positive.
3. Add all squared deviations.
4. Divide by 1 less than the sample size (see text). Think of this as averaging the squared deviations.
5. Take square root to restore original units.
Slide 22
Standard Deviation: Example
Suppose a sample of prices for 1 gallon of
regular gas at 10 different gas stations in a
neighborhood in Austin, Texas, is taken on one
fall day in 2013. Find and interpret the standard
deviation.
$3.19, $3.09, $3.09, $2.93, $2.95,
$3.09, $2.99, $2.99, $2.95, $2.97
Slide 23
Standard Deviation: Example
From before, the mean is $3.02 (rounded).
Using the standard deviation formula, we have:
x       x − x̄                     (x − x̄)²
3.19    3.19 − 3.02 = 0.17        (0.17)² = 0.0289
3.09    3.09 − 3.02 = 0.07        (0.07)² = 0.0049
3.09    3.09 − 3.02 = 0.07        (0.07)² = 0.0049
2.93    2.93 − 3.02 = −0.09       (−0.09)² = 0.0081
2.95    2.95 − 3.02 = −0.07       (−0.07)² = 0.0049
3.09    3.09 − 3.02 = 0.07        (0.07)² = 0.0049
2.99    2.99 − 3.02 = −0.03       (−0.03)² = 0.0009
2.99    2.99 − 3.02 = −0.03       (−0.03)² = 0.0009
2.95    2.95 − 3.02 = −0.07       (−0.07)² = 0.0049
2.97    2.97 − 3.02 = −0.05       (−0.05)² = 0.0025
\[ s = \sqrt{\frac{0.0289 + 0.0049 + \cdots + 0.0025}{9}} = \sqrt{\frac{0.0658}{9}} \approx 0.0855 \]
Note:
This is slightly off because of rounding the mean.
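NOTE: The effect of rounding the mean can be checked with a short Python sketch. It is an illustration only; the helper name sample_sd is our own, and statistics.stdev is the standard-library function for the sample standard deviation (it divides by n − 1).

```python
import math
import statistics

prices = [3.19, 3.09, 3.09, 2.93, 2.95, 3.09, 2.99, 2.99, 2.95, 2.97]

def sample_sd(values, mean):
    """Sample standard deviation around a supplied mean (divides by n - 1)."""
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (len(values) - 1))

# Mean rounded to 3.02, as in the hand calculation above.
sd_rounded_mean = sample_sd(prices, 3.02)

# Unrounded mean (3.024), and the library function as a cross-check.
sd_exact_mean = sample_sd(prices, statistics.mean(prices))
sd_library = statistics.stdev(prices)

print(f"With rounded mean: {sd_rounded_mean:.4f}")
print(f"With exact mean:   {sd_exact_mean:.4f}")
```

With the rounded mean the result is about 0.0855; with the unrounded mean it is about 0.0854. Either way the standard deviation is about 9 cents.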
Slide 24
Standard Deviation: Example
Therefore, we know:
• The mean is $3.02.
• The standard deviation is about $0.09.
INTERPRETATION of the standard deviation:
At most of these gas stations, the price of a gallon of
gas is within 9 cents of $3.02.
Slide 25
Using the TI-84 Calculator
NOTE: Usually technology is used to calculate the
standard deviation.
To find the standard deviation on the TI-84
calculator:
1. Push STAT then select option 1: Edit.
2. Enter the data set in L1.
3. Push STAT and arrow over to Calc.
4. Choose option 1: 1-var stats (press ENTER).
5. The standard deviation for a sample is given by Sx.
Slide 26
Variance: Formula
The variance is the standard deviation squared:
\[ s^2 = \left( \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \right)^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
The standard deviation is preferred over the
variance since it has the same units as the
original data set.
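NOTE: The relationship between the two measures can be illustrated in Python with the gas price data used earlier. This is a sketch, not part of the original slides; statistics.variance and statistics.stdev are the standard-library sample versions (both divide by n − 1).

```python
import math
import statistics

prices = [3.19, 3.09, 3.09, 2.93, 2.95, 3.09, 2.99, 2.99, 2.95, 2.97]

variance = statistics.variance(prices)  # units: dollars squared
sd = statistics.stdev(prices)           # units: dollars

print(f"variance = {variance:.5f} (dollars squared)")
print(f"sd       = {sd:.4f} (dollars)")
print("variance equals sd squared:", math.isclose(variance, sd ** 2))
```

The variance is in squared dollars, which is hard to interpret; the standard deviation is back in dollars.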
Slide 27
Section 3.2
Ljupco Smokovski. Shutterstock
WHAT’S UNUSUAL?
THE EMPIRICAL RULE AND Z-SCORES.
• The Empirical Rule (for Symmetric Distributions)
and z-Scores
• Determining if a Data Value Is Unusual
Slide 28
The Empirical Rule
The empirical rule is a rough guideline for the
approximate percentage of data within 1 to 3
standard deviations of the mean in unimodal,
symmetric distributions.
Guidelines: (% of data within __ standard deviation of the mean)
– About 68% of the data is within 1 standard deviation.
– About 95% of the data is within 2 standard deviations.
– Almost all of the data is within 3 standard
deviations.
Slide 29
Empirical Rule: Example
Data on smog levels in a sample of cities was
collected and the distribution was found to be
roughly symmetric and unimodal. The mean
particulate level for the sample was 10.7
micrograms per cubic meter, with a standard
deviation of 2.6 micrograms per cubic meter.
Slide 30
Empirical Rule: Smog Levels
Slide 31
Empirical Rule: Smog Levels
• About 68% of the cities will have smog levels
between 8.1 and 13.3 (10.7 ± SD).
• About 95% of the cities will have smog levels
between 5.5 and 15.9 (10.7 ± 2SD).
• Almost all of the cities will have smog levels
between 2.9 and 18.5 (10.7 ± 3SD).
• Note: SD = 2.6
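NOTE: The three intervals above can be reproduced with a small Python sketch. The function name empirical_rule_intervals is hypothetical, and the rule itself is only a rough guideline for unimodal, symmetric distributions.

```python
def empirical_rule_intervals(mean, sd):
    """Return mean ± 1, 2, and 3 SD intervals, keyed by approximate coverage."""
    labels = {1: "about 68%", 2: "about 95%", 3: "almost all"}
    return {labels[k]: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

# Smog example: mean 10.7 and SD 2.6 micrograms per cubic meter.
smog = empirical_rule_intervals(10.7, 2.6)
for coverage, (low, high) in smog.items():
    print(f"{coverage}: {low:.1f} to {high:.1f}")
```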
Slide 32
The Empirical Rule: Temperatures
The mean daily high temperature in San Francisco
is 65°F with a standard deviation of 8°F.
1. Find the temperature ranges for 68%, 95%, and
99.7% of the data.
2. Using the Empirical Rule, decide whether it is
unusual to have a day when the maximum
temperature is colder than 49°F in San
Francisco.
Slide 33
The Empirical Rule: Example
1. The temperature ranges are:
• 68% of data: 65°F ± 8°F → 57°F to 73°F
• 95% of data: 65°F ± 16°F → 49°F to 81°F
• 99.7% of data: 65°F ± 24°F → 41°F to 89°F
2. Since about 95% of the daily high temperatures are
between 49°F and 81°F according to the Empirical
Rule, only about 5% of the temperatures are outside
this range. Due to the symmetry of the
distribution, about 2.5% of the days are colder than
49°F (and about 2.5% are warmer than 81°F), so having
a temperature colder than 49°F in San Francisco
is fairly unusual.
Slide 34
Z-Scores
Z-score
Measures how many standard
deviations an observed data value is
from the mean
Example:
A z-score of 1.5 means the observed data value
is 1.5 standard deviations above the mean.
A z-score of –1.5 means the observed data value
is 1.5 standard deviations below the mean.
Slide 35
Z-Scores
This dotplot shows heights for a sample of men. How
many men would have z-scores:
• Greater than 2?
• Less than –2?
Slide 36
Z-Scores
This dotplot shows heights for a sample of men. How
many men would have z-scores:
• Two men have z-scores greater than 2
• Two men have z-scores less than –2
Slide 37
Z-Scores: Usefulness Example
Note:
Z-scores allow us to compare
observations in different distributions.
Example:
Suppose Road A has a mean speed of 60 mph
with a standard deviation of 5 mph, and Road B
has a mean speed of 60 mph with a standard
deviation of 10 mph. Is a driver going 70 mph on
Road A traveling relatively faster or slower than
a driver going 70 mph on Road B?
Slide 38
Z-Scores: Usefulness Example
Known facts:
Road A: x̄ = 60 mph, s = 5 mph
Road B: x̄ = 60 mph, s = 10 mph
Solution:
Although both drivers are traveling at 70 mph,
the driver on Road A is traveling relatively faster
since 70 mph is 2 standard deviations above the
mean on Road A and only 1 standard deviation
above the mean on Road B.
Slide 39
Z-Scores: Formula
\[ z = \frac{x - \bar{x}}{s} \]
Numerator: distance from the mean.
Denominator: divide by the standard deviation to determine how many
standard deviations x is from the mean.
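NOTE: As an illustration, the formula can be applied to the Road A and Road B example with a few lines of Python. The function name z_score is our own choice, not from the slides.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Both drivers travel at 70 mph; the roads differ only in spread.
z_road_a = z_score(70, mean=60, sd=5)
z_road_b = z_score(70, mean=60, sd=10)

print(f"Road A: z = {z_road_a}")
print(f"Road B: z = {z_road_b}")
```

The larger z-score on Road A matches the earlier conclusion that this driver is going relatively faster.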
Slide 40
Z-Scores: Example
Maria scored 80 out of 100 on her first stats
exam in a course and 85 out of 100 on her
second stats exam. On the first exam, the mean
was 70 and the standard deviation was 10. On
the second exam, the mean was 80 and the
standard deviation was 5.
On which exam did Maria perform better when
compared to the whole class?
Slide 41
Z-Scores: Example
On which exam did Maria perform better when compared
to the whole class?
First exam: \( z = \frac{80 - 70}{10} = 1 \)
Second exam: \( z = \frac{85 - 80}{5} = 1 \)
Conclusion
The second exam was a little easier; on average,
students scored higher and there was less variability in
the scores. But Maria scored one standard deviation
above average on both exams, so she did equally well
on both when compared to the whole class.
Slide 42
Section 3.3
afoto6267. Shutterstock
SUMMARIES FOR SKEWED
DISTRIBUTIONS
• Measure for Center (Middle Point): Median
• Measure of Horizontal Spread (Variability): IQR
Slide 43
The Median
The median is the middle number when the
data has been sorted from smallest to largest.
Slide 44
The Median
This graph shows the distribution of incomes
for a sample of New York City residents. About half
the residents have incomes above $25,200 and
about half have incomes below $25,200.
Slide 45
Typical Value:
The Mean vs. the Median
Which is a better measure of the “typical”
income of a New York City resident: the mean or
the median?
Slide 46
The Median
Median
The middle number
• Arrange all numbers in order
• Find the middle number
(If two numbers in the middle, average them)
Note:
• Think of the median as the middle point.
• A good measure of a typical value for skewed
distributions
Slide 47
The Median: Example
Suppose a sample of prices for 1 gallon of
regular gas at 10 different gas stations in a
neighborhood in Austin, Texas is taken on one
fall day in 2013. Find and interpret the median.
$3.19, $3.09, $3.09, $2.93, $2.95,
$3.09, $2.99, $2.99, $2.95, $2.97
Slide 48
The Median: Example
First arrange the numbers in order:
2.93, 2.95, 2.95, 2.97, 2.99, 2.99, 3.09, 3.09, 3.09, 3.19
Since there are 2 numbers in the middle (the 5th and 6th values, 2.99 and 2.99), average them:
\[ M = \frac{2.99 + 2.99}{2} = 2.99 \]
INTERPRETATION of the Median:
The median price of 1 gallon of gas at these gas stations in
Austin, Texas, was $2.99 on this particular day in 2013.
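NOTE: The sort-then-find-the-middle procedure can be sketched in Python as follows. This is an illustration only; statistics.median from the standard library is used as a cross-check.

```python
import statistics

prices = [3.19, 3.09, 3.09, 2.93, 2.95, 3.09, 2.99, 2.99, 2.95, 2.97]

# Step 1: arrange the numbers in order.
sorted_prices = sorted(prices)
n = len(sorted_prices)

# Step 2: take the middle number, or average the two middle numbers.
if n % 2 == 1:
    median_price = sorted_prices[n // 2]
else:
    median_price = (sorted_prices[n // 2 - 1] + sorted_prices[n // 2]) / 2

print(f"Median price: ${median_price:.2f}")
print("Matches statistics.median:", median_price == statistics.median(prices))
```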
Slide 49
Using the TI-84 Calculator
NOTE: Your calculator can find the median, but
you must be able to interpret the results.
To find the median on the TI-84 calculator:
1. Push STAT then select option 1: Edit.
2. Enter the data set in L1.
3. Push STAT and arrow over to Calc.
4. Choose option 1: 1-var stats (press ENTER).
5. Arrow down to see the median (Med).
Slide 50
Measuring the Spread
Recall:
The standard deviation measured
spread using the distance from the
mean.
Now?
Since we don’t use the mean in skewed
distributions, we need a measure of
spread related to the median.
So:
We use the Interquartile Range (IQR).
Slide 51
Range
Before we get to the IQR, we need to talk about the range
and quartiles.
Range
Difference between the largest and smallest
values
Example:
A group of eight children have the following heights
(in inches):
48.0, 48.0, 53.0, 53.5, 54.0, 60.0, 62.0, and 71.0
The range in the children’s heights is 71.0 - 48.0 =
23.0 inches.
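NOTE: In software the range is a one-line calculation. A minimal Python sketch using the heights above (the variable names are our own):

```python
# Heights (in inches) of the eight children.
heights = [48.0, 48.0, 53.0, 53.5, 54.0, 60.0, 62.0, 71.0]

# Range = largest value minus smallest value.
height_range = max(heights) - min(heights)

print(f"Range: {height_range} inches")
```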
Slide 52
Quartiles
Quartiles
Divide the distribution into fourths.
Each quartile contains 25% of the data.
Example:
The dotplot shows the distribution of weights for
a class of introductory statistics students. The
vertical lines slice the distribution into four parts,
so each part has about 25% of the observations.
25% of the weights are between 101 and 121
pounds, and so on.
Slide 53
Interquartile Range: IQR
IQR
The range of the middle 50% of the data
Example:
The dotplot shows the distribution of weights for a
class of introductory statistics students. The vertical
lines slice the distribution into four parts, so each part
has about 25% of the observations.
IQR = 160 – 121 = 39 pounds
(distance between the first and third “slice”)
Slide 54
Using the TI-84 Calculator
NOTE: Your calculator can find the quartiles, and
you must use them to find the IQR.
To find the IQR on the TI-84 calculator:
1. Push STAT then select option 1: Edit.
2. Enter the data set in L1.
3. Push STAT and arrow over to Calc.
4. Choose option 1: 1-var stats (press ENTER).
5. Arrow down to see the first and third quartiles (Q1 and Q3).
6. Calculate the IQR: IQR = Q3 – Q1.
Slide 55
Interquartile Range: Example
A group of eight children have the following heights
(in inches):
48.0, 48.0, 53.0, 53.5, 54.0, 60.0, 62.0, and 71.0
Find the interquartile range for the distribution of the
children’s heights using your calculator.
Slide 56
Interquartile Range: Example
The calculator display should look like this:
Solution:
IQR = Q3 - Q1 = 61.00 - 50.50 = 10.50
The interquartile range of the heights of the eight
children is 10.5 inches.
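NOTE: The quartiles above can be reproduced in Python under one stated assumption: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data. That convention gives the values shown here; other software may use a different quartile convention and report slightly different numbers. The function name is hypothetical.

```python
import statistics

def quartiles_median_of_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data.

    Assumed convention: with an odd number of values, the middle value is
    left out of both halves.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[-half:]
    return statistics.median(lower_half), statistics.median(upper_half)

heights = [48.0, 48.0, 53.0, 53.5, 54.0, 60.0, 62.0, 71.0]
q1, q3 = quartiles_median_of_halves(heights)
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr} inches")
```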
Slide 57
Section 3.4
ArtWell. Shutterstock
COMPARING MEASURES OF CENTER
• Symmetric Distributions: Mean and Standard
Deviation
• Skewed Distributions: Median and IQR
• How to Compare Measures
Slide 58
Appropriate Measures
Recall:
When dealing with numerical data,
you need to describe the distribution
using these 3 characteristics:
Characteristic    Which Is
Shape             Symmetric or Skewed
Center            Mean or Median
Spread            Standard Deviation or IQR
Slide 59
Choosing a Measure
Primary Goal: Choose a pair of measures that is best suited for the data, which depends on the shape!
Always: Begin with a picture!
Shape        Measure for Center    Measure for Spread
Symmetric    Mean                  Standard Deviation
Skewed       Median                IQR
Slide 60
Choosing a Measure: Example
One of the authors created a data set of the songs on his
mp3 player. He wants to describe the distribution of song
lengths.
1. What shape do you expect the distribution to have?
2. What measures should you use for this shape?
Slide 61
Choosing a Measure: Example
1. Shape:
• No song can be shorter than 0 seconds.
• Most songs on the radio are around 4 minutes long.
• A few songs (e.g., classical tracks) are extremely long.
• The distribution is probably right-skewed.
2. Measures: Median and IQR
Slide 62
Choosing a Measure: Example
1. Shape: Right-Skewed
2. Measures: Median and IQR
It turns out the median length is 226 seconds (roughly 3
minutes and 46 seconds) and the interquartile range is 117
seconds (close to 2 minutes).
In other words, the typical track on the author’s mp3 player is
about 4 minutes, with the middle 50% of the tracks differing
by about 2 minutes.
Slide 63
Mean vs. Median
Things to keep in mind:
• Skewed data and outliers affect the mean (and standard deviation).
• The median and IQR are resistant to (are not affected greatly by)
skewed data/outliers.
Roughly Speaking:
Shape           Mean vs. Median
Skewed Left     Mean < Median
Symmetric       Mean = Median
Skewed Right    Mean > Median
Slide 64
Mean vs. Median: Example
A (very small) fast-food restaurant has five employees, all
of whom work full-time for $7 per hour. Each employee’s
annual income is about $16,000. The owner, on
the other hand, makes $100,000 per year.
Find both the mean and the median. Which would you use
to represent the typical income at this business—the mean
or the median? Which value is smaller?
Slide 65
Mean vs. Median: Example
• The mean income is $30,000.
• The median income is $16,000.
• Use the median income (since skewed) – better represents
typical income.
• Mean > Median (since skewed right)
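NOTE: The two values can be checked with a short Python sketch (an illustration only; the variable names are our own). The single large income pulls the mean up but leaves the median unchanged.

```python
import statistics

# Five employees at about $16,000 each, plus the owner at $100,000.
incomes = [16_000] * 5 + [100_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"Mean income:   ${mean_income:,.0f}")
print(f"Median income: ${median_income:,.0f}")
```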
Slide 66
Comparing Different Distributions
When comparing two distributions:
• Always use the same measures of center and spread for
both distributions. Otherwise, the comparison is not
valid.
• If one of the distributions is skewed, use Median and IQR
to compare both distributions!
Slide 67
Comparing Different Distributions:
Example
The distributions of running times for
amateur and Olympic marathon runners are
compared below.
Slide 68
Comparing Different Distributions:
Example
Note the Shapes:
– Olympic runners: Right-skewed (use median)
– Amateur runners: Fairly symmetric (use median since
the Olympic runners' distribution is skewed)
Answer:
The typical woman Olympic runner finishes the marathon
considerably faster: a median time of 154.8 minutes (about
2.6 hours) compared to 240.0 minutes (about 4 hours) for
the amateur athlete.
Slide 69
prochasson frederic. Shutterstock
Section 3.5
USING BOXPLOTS TO DISPLAY
SUMMARIES
• Boxplots
• Five Number Summary
Slide 70
Finding Outliers
Outliers
Extreme data values
General Rule for finding outliers:
– Find the fences (“cutoffs” ) for usual data values:
Lower fence = Q1 – 1.5 (IQR)
Upper fence = Q3 + 1.5 (IQR)
– Values more extreme than the fences are outliers
(values less than lower fence or greater than upper fence).
Slide 71
Finding Outliers: Example
The first and third quartiles in the distribution of
daily high temperatures in San Francisco are
59°F and 70°F respectively. Using these values,
what temperatures would be considered
outliers in San Francisco?
Slide 72
Finding Outliers: Example
The first and third quartiles in the distribution of daily high
temperatures in San Francisco are 59°F and 70°F
respectively. Using these values, what temperatures would
be considered outliers in San Francisco?
• Fences:
Lower fence = 59 – 1.5(70 – 59) = 42.5°F
Upper fence = 70 + 1.5(70 – 59) = 86.5°F
• Outliers:
Any temperature below 42.5°F or above 86.5°F
would be considered an outlier.
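NOTE: The fence rule can be sketched in Python as follows. The function names outlier_fences and is_potential_outlier are hypothetical; the quartiles are the San Francisco values given above.

```python
def outlier_fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_potential_outlier(value, lower_fence, upper_fence):
    """True if the value is more extreme than either fence."""
    return value < lower_fence or value > upper_fence

lower, upper = outlier_fences(q1=59, q3=70)
print(f"Lower fence: {lower}°F, upper fence: {upper}°F")

# Example checks: a 90°F day lies beyond the upper fence, a 60°F day does not.
print(is_potential_outlier(90, lower, upper))
print(is_potential_outlier(60, lower, upper))
```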
Slide 73
Boxplots
– Help us visualize certain summary statistics.
– Show where the bulk of the data lie.
– The box is drawn from Q1 to Q3 with a line for
the median inside the box.
– Whiskers are drawn to the most extreme
values within the fences (extreme values that
are not outliers).
– Potential outliers are marked with an asterisk.
CAUTION! Boxplots work best for unimodal
distributions. (They hide multi-modal
information!)
Slide 74
Boxplots: Example
The boxplot for the daily high temperatures in San Francisco
is given below.
• What is the minimum data value?
• What is the median data value?
• How many potential outliers are there?
Slide 75
Boxplots: Example
The boxplot for the daily high temperatures in San Francisco
is given below.
• What is the minimum data value? 49°F
• What is the median data value? 64°F
• How many potential outliers are there? 5
Slide 76
Using the TI-84 Calculator
To create a boxplot on the TI-84 calculator:
1. Push STAT then select option 1: Edit.
2. Enter the data set in L1.
3. Push 2nd Y= (for Stat Plot).
4. Turn on Plot1 (press ENTER twice).
5. Use the down arrow, followed by the right arrow, to
select Type that looks like a boxplot with outliers (first
boxplot option) and push ENTER.
6. Make sure Xlist is set to L1.
7. Push GRAPH > ZOOM followed by the number 9 (for
option 9:Zoom Stat) to see the boxplot.
8. Use TRACE and the arrow keys to navigate about the
graph (you will see relevant information on the screen).
Slide 77
Boxplots: Comparing Two Distributions
We examined the temperatures in Provo and SF earlier.
Note:
• Both distributions have roughly the same:
– Shape: Symmetric
– Center: 67°F in Provo vs 65°F in SF
• The spread in both distributions, however, is very different:
– Provo: More spread out (data values farther from center)
– SF: Less spread out (data values closer to the center)
Slide 78
Boxplots: Comparing Two Distributions
Now, looking at the boxplots, we can see a few more facts:
• Describing the distributions:
– Shape: Symmetric (median in the center of the box)
– Center: 67°F in Provo vs 64°F in SF (close to the same)
– Spread: Provo: More spread out (box is wider)
          SF: Less spread out (narrower box); outliers!
• Both cities have days near 100 degrees; these days are unusual in San Francisco but
merely fall in the upper 25% for Provo.
Slide 79
Five Number Summary
The key summary statistics that boxplots reveal are known as the
five number summary:
Minimum, Q1, Median, Q3, Maximum
Example: Consider the boxplot of daily high temperatures in SF.
The five number summary is: 49, 59, 64, 70, 97
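NOTE: The raw San Francisco temperatures are not listed in these slides, so the sketch below computes a five number summary for the eight children's heights used earlier instead. It assumes quartiles are the medians of the lower and upper halves of the sorted data; the function name is hypothetical, and other software may use a slightly different quartile convention.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves of the sorted data (middle value excluded when the count is odd).
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[-half:])
    return data[0], q1, statistics.median(data), q3, data[-1]

heights = [48.0, 48.0, 53.0, 53.5, 54.0, 60.0, 62.0, 71.0]
summary = five_number_summary(heights)
print("Minimum, Q1, Median, Q3, Maximum:", summary)
```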
Slide 80
Case Study
• Question: How do people perceive risk?
– Measure: 500 subjects rated activities on risks involved
(0 = no risk, 100 = greatest possible risk)
– Focus: Risk of appliances and risk of x-rays
– Look for: Differences between men's and women's perceptions of
risk
• Analysis of data:
– The distributions for men and women
were similar. Both were right skewed.
– Summary stats:
– Why are the median and IQR being reported?
Slide 81