Chapter 3
Displaying and
Summarizing
Quantitative Data
Copyright © 2014, 2012, 2009 Pearson Education, Inc.
NOTE on slides / What we can and cannot do

The following notice accompanies these slides, which have been downloaded from
the publisher’s Web site:
“This work is protected by United States copyright laws and is provided solely for the
use of instructors in teaching their courses and assessing student learning.
Dissemination or sale of any part of this work (including on the World Wide Web) will
destroy the integrity of the work and is not permitted. The work and materials from
this site should never be made available to students except by instructors using the
accompanying text in their classes. All recipients of this work are expected to abide
by these restrictions and to honor the intended pedagogical purposes and the needs
of other instructors who rely on these materials.”

Some of these slides are taken from the Third Edition; others are my own additions.
We can use these slides because we are using the text for this course. Please help
us stay legal. Do not distribute these slides any further.

The original slides are done in green / red and black. My additions are in red and blue.

Topics in brown and maroon are optional.
Overview – Organization of the
chapter


Pictorial Display
• Histogram
• Stem-and-Leaf Plot
• Dotplot
Numerical summary
• Shape of data
• Center
• Spread
Division of Mathematics, HCC
Course Objectives for Chapter 3
After studying this chapter, the student will be able to:
8. Appropriately display quantitative data using a frequency distribution, histogram, relative frequency histogram, stem-and-leaf display, dotplot.
9. Describe the general shape of a distribution in terms of shape, center and spread.
10. Describe any anomalies or extraordinary features revealed by the display of a variable.
11. Compute and apply the concepts of mean and median to a set of data.
12. Compute and apply the concept of the standard deviation and IQR to a set of data.
13. Select a suitable measure of center/spread for a variable based on information about its distribution.
14. Create a five-number summary of a variable.
15. Construct a boxplot by hand and with technology.
16. Use the 1.5 IQR rule to identify possible outliers.
3.1
Displaying
Quantitative
Variables
Dealing With a Lot of Numbers…
• Summarizing the data will help us when we look at large sets of quantitative data.
• Without summaries of the data, it’s hard to grasp what the data tell us.
• The best thing to do is to make a picture…
• We can’t use bar charts or pie charts for quantitative data, since those displays are for categorical variables.
Histograms
A histogram of tsunami-generating earthquakes. (The authors did not provide the raw data.)
• Histogram: A chart that displays quantitative data
• Term introduced by Karl Pearson in the 1890s; later popularized as a quality-control tool by Kaoru Ishikawa (Japan)
• Great for seeing the distribution of the data
• Most tsunami-generating earthquakes have magnitudes between 6.5 and 8.
• Japan and Sumatra quakes (9.0 and 9.1) are rare.
• Quakes under 5 rarely cause tsunamis.
• Quakes between 7.0 and 7.5 are the most common cause of tsunamis.
Choosing the Bin Width
• Different bin widths tell different stories.
• Choose the width that best shows the important features.
• Presentations can feature two histograms that present the same data in different ways.
• A gap in the histogram means that there were no occurrences in that range.
Relative Frequency Histograms
• Relative Frequency Histogram
• The vertical axis represents the relative frequency, the frequency divided by the total.
• The horizontal axis is the same
as the horizontal axis for the frequency histogram.
• The shape of the relative frequency histogram is the
same as the frequency histogram.
• Only the scale of the y-axis is different.
Histograms
Both histograms “look” the same.
The only difference is the vertical axis.
Did we see this in Chapter 2?
Histograms
• They can be displayed horizontally as well as vertically.
• I rotated this one 90 degrees clockwise.
• To publish this, I would put the “% of Earthquakes” axis across the bottom instead of the top.
• I’d also retype the labels so they can be more easily read!
Histogram with the TI
• Example data: 62, 63, 65, 66, 68, 70, 71, 73, 75
• Use [STAT][EDIT] to put the dataset in L1.
• The first few data points are shown.
• NOTE: You will do this a lot in this course!
Histogram with the TI
• First, select [Y=] and turn off or erase any functions from Algebra class!
• Press [2nd][Y=] and go to one of the three plots. Turn it on.
• Select the histogram.
• Make sure that L1 (or wherever you put the data) is in Xlist.
• Make sure that 1 is in Freq (unless there is a separate column giving frequency).
Histogram with the TI (default)
• You can get a default window by selecting Zoom and then 9.
• Below is the window. It shows a bin width of 3.25. It includes all of the values.
• Because we have integers, I’d rather have 3 as a bin width.
Histogram with the TI
• Choose X:[60,78]; Y:[-1,3] as the window. You may have to play with this.
– For X, I picked a little lower than the min and a little higher than the max.
– For Y, I picked a little bigger than the largest bin frequency I expected.
• Xscl is the length of the bin. In this case, choosing 3 makes cut points at 60, 63, 66, 69, 72, 75, and 78.
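The binning behind this window can be sketched in a few lines of Python. This is only an illustration, not the calculator's own algorithm: the helper name `bin_counts` is made up here, and the sketch assumes each bin includes its left cut point and excludes its right one.

```python
def bin_counts(data, start, width, n_bins):
    """Count the values falling in each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        k = int((x - start) // width)  # index of the bin that contains x
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

data = [62, 63, 65, 66, 68, 70, 71, 73, 75]  # dataset from the TI example
counts = bin_counts(data, start=60, width=3, n_bins=6)
rel_freq = [c / len(data) for c in counts]   # relative frequency = frequency / total

for k, (c, r) in enumerate(zip(counts, rel_freq)):
    lo = 60 + 3 * k
    print(f"[{lo}, {lo + 3}): frequency {c}, relative frequency {r:.2f}")
```

No bin holds more than two values here, which is why a Y window reaching a little above 2 is enough.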
Usefulness of the Trace function
Use the horizontal arrows to navigate the bins.
Histograms and StatCrunch
• Enter Data.
• Graphics → Histogram
• Click on the data variable and Next.
• Select Frequency or Relative Frequency.
• Put in starting value and/or Binwidth if desired.
• Click Next twice, and type in labels. Click Create Graph.
Histogram with StatCrunch
• Select Graphics
• Select Histogram
• Select the column you want graphed.
• Select Next. (Do not select “Create Graph” unless you do not want to have control over the bin size.)
• For the same bins as with the TI, “Start Bins” at 60 and set Bin Width equal to 3. Then select “Create Graph”.
Results
With default bin size
Better size
How many bins?
• No “hard and fast” rule. There is even some disagreement among professionals.
• Recommendations from slides from two Johns Hopkins graduate Biostatistics courses. Both depend on the number (n) of data points.
○ Biostatistics 612: √n
○ Biostatistics 651: 2√n
• I personally would use √n, but would try different numbers to see what looks best.
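The two rules of thumb are easy to compare for a given n. A minimal sketch in Python; the function names are made up here, rounding to the nearest whole number is an assumption of this sketch, and n = 150 is just a sample size chosen for illustration.

```python
import math

def bins_sqrt(n):
    """About sqrt(n) bins (the Biostatistics 612 recommendation)."""
    return round(math.sqrt(n))

def bins_two_sqrt(n):
    """About 2*sqrt(n) bins (the Biostatistics 651 recommendation)."""
    return round(2 * math.sqrt(n))

n = 150  # illustrative sample size
print("sqrt(n) rule:  ", bins_sqrt(n), "bins")
print("2*sqrt(n) rule:", bins_two_sqrt(n), "bins")
```

Either number is only a starting point; the advice to try several bin counts and look at the result still applies.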
Publisher Instructions: Histogram
• Histogram: Displays the frequency, relative frequency or density for numerical data combined into classes. Select the column(s) to be displayed in the plot(s). A separate plot will be generated for each column selected.
• Enter an optional Where clause to specify the data rows to be included in the computation.
• Select an optional Group by column to construct a histogram for each distinct value of this column.
• Click the Next button to select either the Frequency, Relative Frequency or Density histogram. In addition, optional values for the starting point of the bins and the bin width may be specified. These parameters will apply to all of the histograms to be constructed.
• Click the Next button again to specify graph layout options.
• Click the Create Graph! button to create the plot(s).
Thoughts on Histograms
• Histograms are useful and easy to apply to almost all types of quantitative data.
• This is especially true for larger data sets.
• They can use a lot of ink and space! Color is more useful than black-and-white or grayscale.
• It can be difficult to display several related datasets at the same time for comparison.
• When you get a default, accept it if you can live with it! If not, at least save (or remember) what you did.
“Reading” Histograms
The percent of movie
lengths for 150 selected
movies is given in the
histogram on the right.
(Data are from the
StatCrunch collection for
Chapter 3.)
“Reading” Histograms
Question 1: How many
movies had lengths more
than two hours (120
minutes)?
Answer:
18 + 4 + 5 + 2 + 1 = 30
“Reading” Histograms
Question 2: What
percentage of movies were
more than two hours (120
minutes) in length?
Answer:
18 + 4 + 5 + 2 + 1 = 30
30/150 = 20%
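The same reading can be written as a short Python calculation. The bar heights are the ones quoted in the answer; the variable names are made up for this sketch.

```python
counts_over_120 = [18, 4, 5, 2, 1]  # bar heights for the bins above 120 minutes
total_movies = 150                  # number of movies in the histogram

n_long = sum(counts_over_120)             # movies longer than two hours
pct_long = 100 * n_long / total_movies    # the same count as a percentage

print(n_long, "movies,", pct_long, "percent")
```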
Histograms of test scores for two classes of the same course and instructor.
Period 1: 95, 93, 93, 91, 87, 87, 86, 84, 83, 82, 81, 81, 78, 77, 69, 28
Period 2: 98, 96, 92, 87, 84, 82, 80, 77, 77, 75, 75, 73, 72, 72, 70, 69, 63, 58
Comparing the two data sets using histograms
• Make sure that you use the same starting point and bin width for both histograms.
• Do each histogram separately and look at them side by side.
• The gap and suspected outlier are apparent in the first class.
• The second one is a little more symmetric.
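The comparison can be sketched in Python by binning both classes with the same start and bin width. This is only an illustration: the score lists are as read from the table of test scores, and the start (20), the width (10) and the helper name `bin_counts` are choices made for this sketch, not values from the slides.

```python
period1 = [95, 93, 93, 91, 87, 87, 86, 84, 83, 82, 81, 81, 78, 77, 69, 28]
period2 = [98, 96, 92, 87, 84, 82, 80, 77, 77, 75, 75, 73, 72, 72, 70, 69, 63, 58]

START, WIDTH, N_BINS = 20, 10, 8  # bins [20,30), [30,40), ..., [90,100)

def bin_counts(data, start, width, n_bins):
    """Count the values in each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        k = int((x - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

counts1 = bin_counts(period1, START, WIDTH, N_BINS)
counts2 = bin_counts(period2, START, WIDTH, N_BINS)

# Print the two text histograms one after the other, using identical bins.
for name, counts in (("Period 1", counts1), ("Period 2", counts2)):
    print(name)
    for k, c in enumerate(counts):
        lo = START + WIDTH * k
        print(f"  {lo:3d}-{lo + WIDTH - 1:3d} | " + "*" * c)
```

With shared bins, the empty bins between the lowest Period 1 score and the rest of that class show up as a visible gap, which is harder to see if each histogram picks its own scale.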
(Same for L2)
Stem-and-Leaf Displays
• Stem-and-Leaf: Shows both the shape of the distribution and all of the individual values
• Not as visually pleasing as a histogram; more technical looking
• Can only be used for small collections of data
• The first column (stems) represents the leftmost digit.
• The second column (leaves) shows the remaining digit(s).
Stem-and-Leaf Displays
• Stem-and-leaf displays show the distribution of a quantitative variable, like histograms do, while preserving the individual values.
• Stem-and-leaf displays contain all the information found in a histogram and, when carefully drawn, satisfy the area principle and show the distribution.
Stem-and-Leaf Displays
• They can show a complete dataset in very little space.
• It is easy to put them back-to-back to compare groups.
• Invented in 1972 by John Tukey (1915 – 2000)
○ Bell Labs, NJ
○ “Exploratory Data Analysis”, 1977
Stem-and-Leaf Example
Compare the histogram and stem-and-leaf display for the pulse
rates of 24 women at a health clinic. Which graphical display do
you prefer?
Stem-and-Leaf Displays
• Stem-and-Leaf plots give all of the data in pictorial form.
• Stem-and-Leaf plots are useful for smaller datasets.
• It is not possible to do a stem-and-leaf plot with the TI.
• They can be done with StatCrunch, but you have no control over the bin sizes.
• But if the data set is ordered, they are easy to do by hand.
Constructing a Stem-and-Leaf Display
(by hand)
• First, draw a vertical line.
• Next, cut each data value into leading digits (“stems”), written to the left of the line, and trailing digits (“leaves”), written to the right of the line.
• Use the stems to label the bins.
• Use only one digit for each leaf: either round or truncate the data values to one decimal place after the stem.
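The by-hand steps can be sketched in Python for two-digit whole numbers. This is a minimal illustration, not a general tool: the function name `stem_and_leaf` is made up here, it assumes non-negative integers with the tens digit as stem and the ones digit as leaf, and it reuses the nine-value dataset from the TI histogram example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} with stem = tens digit and leaf = ones digit."""
    plot = defaultdict(list)
    for v in sorted(values):          # order the data first
        stem, leaf = divmod(v, 10)    # cut each value into stem and leaf
        plot[stem].append(leaf)
    return dict(plot)

data = [62, 63, 65, 66, 68, 70, 71, 73, 75]
plot = stem_and_leaf(data)

# Print every stem from smallest to largest, so that empty stems still
# appear as gaps in the display.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```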
Stem and Leaf with StatCrunch
• Enter Data
• Graphics → Stem and Leaf
• Click on the variable name and Next
• Select Outlier Trimming Type and Create Graph!
Stem and Leaf with StatCrunch
(using movie length data)
• Select Graphics.
• Select “Stem and Leaf”.
• Select the variable you want graphed.
• Select a choice for “Leaf unit” (the rough equivalent of
a bin width.)
• You have more limited control over the leaf unit than
you did with the bin width in a histogram.
• You can trim outliers if you wish to (after we study
outliers later in Chapter 3.)
Publisher’s Instructions
• Stem and Leaf: Displays a character-based plot of a column that is similar to a histogram turned on its side. The actual (or approximate) data values are represented in the plot. Select the column(s) to be displayed in the plot(s). A separate plot will be generated for each column selected.
• Enter an optional Where clause to specify the data rows to be included in the computation.
• Select an optional Group by column to construct a separate stem and leaf plot for each distinct value of this column.
• Click the Create Graph! button to create the plot(s).
Stem and leaf plot of test scores for two classes of the same course and instructor.
Period 1: 95, 93, 93, 91, 87, 87, 86, 84, 83, 82, 81, 81, 78, 77, 69, 28
Period 2: 98, 96, 92, 87, 84, 82, 80, 77, 77, 75, 75, 73, 72, 72, 70, 69, 63, 58
Stem and leaf plot of test scores for two
classes of the same course and instructor.
Dotplots
• Dotplot: Displays dots to describe the shape of the distribution
• There were 30 races with a winning time of 122 seconds.
• Good for smaller data sets
• Visually more appealing than stem-and-leaf
• In StatCrunch: Graphics → Dotplot
Dotplots with StatCrunch
• Dotplots can’t be done with the TI or Excel.
• With StatCrunch, again select “Graphics”, then “DotPlot” (as with the Histogram and the Stem and Leaf).
• In the next panel, you can input axis labels and draw grid lines if you wish. In the following one, you can pick a color scheme.
• But you have no control over the bin size (see next slide for an example of a dotplot that is not very useful).
I personally recommend the histogram
over the dotplot.
Publisher’s Instructions
• Dotplot: Displays a graphical representation of numerical values as points on a number line. Points with the same pixel representation are stacked on top of each other. If the number of points in a stack exceeds the height of the graphic, each point on the plot may represent more than one observation. If this occurs, the number of observations per point will be shown in the title of the graphic. Select the column(s) to be displayed in the plot(s). If multiple columns are selected, the plots will be stacked in the reverse order of selection in the same graphic.
• Enter an optional Where clause to specify the data rows to be included in the computation.
• Select an optional Group by column to construct dotplots for each distinct value of this column. If a Group by column is specified, select either to stack the plots of each group for each column or to stack plots of each column for each group.
• Click the Next button to specify graph layout options.
• Click the Create Graph! button to create the plot(s).
Think Before you Draw
• Is the variable quantitative? Is the answer to the survey question or result of the experiment a number whose units are known?
• Histograms, stem-and-leaf diagrams, and dotplots can only display quantitative data.
• Bar and pie charts display categorical data.
Constructing Effective Graphs
Source: Agresti & Franklin
• Label both axes and provide proper headings.
• To better compare relative size, the vertical axis should start at 0 (if practical).
• Be cautious in using anything other than bars, lines, or points. Don’t use birds, dollar signs, ships, etc.!
• It can be difficult to portray more than one group on a single graph when the variable values differ greatly.
3.2
Shape
Shape, Center, and Spread
• When describing a distribution, make sure to always tell about three things: shape, center, and spread…
What is the Shape of the
Distribution?
1. Does the histogram have a single,
central hump or several separated
humps?
2. Is the histogram symmetric?
3. Do any unusual features stick out?
Modes
• A Mode of a histogram is a hump or high-frequency bin.
• One mode → Unimodal
• Two modes → Bimodal
• 3 or more → Multimodal
(Figure panels: Unimodal, Bimodal, Multimodal)
Uniform Distributions
• Uniform Distribution: All the bins have the same frequency, or at least close to the same frequency.
• The histogram for a uniform distribution will be flat.
Symmetry
• The histogram for a symmetric distribution will look the same on the left and the right of its center.
(Figure panels: Symmetric, Not Symmetric, Symmetric)
Skew
• A histogram is skewed right if the longer tail is on the right side of the mode.
• A histogram is skewed left if the longer tail is on the left side of the mode.
(Figure panels: Skewed Right, Skewed Left)
Examples of Skewness
Source: Agresti & Franklin, “Statistics: The Art and Science
of Learning from Data”; Pearson, 2007
Examples of Skewness
Source: Agresti & Franklin, “Statistics: The Art and Science
of Learning from Data”; Pearson, 2007
Outliers
• An Outlier is a data value that is far above or far below the rest of the data values.
• An outlier is sometimes just an error in the data collection.
• An outlier can also be the most important data value.
○ Income of a CEO
○ Temperature of a person with a high fever
○ Elevation at Death Valley
My note on outliers
• Currently, points that appear as outliers are labeled (by me) as “suspected” outliers.
• There is a method (explained later in this chapter) for detecting outliers.
• Once we learn this method and apply it to our data, we have confirmed an outlier (or not), and if a data point is an outlier, we can remove the word “suspected” for that data point.
Example
• The histogram shows the amount of money spent by a credit card company’s customers. Describe and interpret the distribution.
• The distribution is unimodal. Customers most commonly spent a small amount of money.
• The distribution is skewed right. Many customers spent only a small amount and a few were spread out at the high end.
• There is a suspected outlier at around $7000. One customer spent much more than the rest of the customers.
3.3
Center
The Median
• Median: The center of the data values
• Half of the data values are to the left of the median and half are to the right of the median.
• For symmetric distributions, the median is directly in the middle.
• The median was first proposed by Sir Francis Galton in 1875 as a way of getting to the “average” value of a dataset without cumbersome calculations.
Calculating the Median: Odd Sample Size
• First order the numbers.
• If there are an odd number of numbers, n, the median is at position $\frac{n+1}{2}$.
• Find the median of the numbers: 2, 4, 5, 6, 7, 9, 9.
• $\frac{n+1}{2} = \frac{7+1}{2} = 4$
• The median is the fourth number: 6
• Note that there are 3 numbers to the left of 6 and 3 to the right.
Calculating the Median: Even Sample Size
• First order the numbers.
• If there are an even number of numbers, n, the median is the average of the two middle numbers, at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.
• Find the median of the numbers: 2, 2, 4, 6, 7, 8.
• $\frac{n}{2} = \frac{6}{2} = 3$
• The median is the average of the third and the fourth numbers: $\text{Median} = \frac{4 + 6}{2} = 5$
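Both cases, odd and even sample size, can be sketched as one small Python function. The name `median` is chosen for this sketch; Python lists count positions from 0, so position k on the slides is index k - 1 in the code.

```python
def median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)  # first order the numbers
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]   # position (n + 1) / 2, counting from 1
    # positions n/2 and n/2 + 1, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 5, 6, 7, 9, 9]))  # odd sample size
print(median([2, 2, 4, 6, 7, 8]))     # even sample size
```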
3.4
Spread
Spread
• Locating the center is only part of the story.
• Are the data all near the center or are they spread out?
• Is the highest value much higher than the lowest value?
• To describe data, we must discuss both the center and the spread.
Range
• The range is the difference between the maximum and minimum values.
Range = Maximum – Minimum
• The ages of the guests at your dinner party are: 16, 18, 23, 23, 27, 35, 74
• The range is: 74 – 16 = 58
• The range is sensitive to outliers. A single high or low value will affect the range significantly. This makes the range a poor measure of spread when outliers are present.
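A short Python sketch shows how much a single value can move the range, using the dinner-party ages. The variable names are made up here, and dropping the oldest guest is just one way to illustrate the sensitivity.

```python
ages = [16, 18, 23, 23, 27, 35, 74]

full_range = max(ages) - min(ages)  # Range = Maximum - Minimum

# Remove the single oldest guest and recompute the range.
without_oldest = sorted(ages)[:-1]
range_without_oldest = max(without_oldest) - min(without_oldest)

print("range with all guests:       ", full_range)
print("range without the oldest one:", range_without_oldest)
```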
Percentiles and Quartiles
• Percentiles divide the data into one hundred groups.
• The pth percentile is the data value such that p percent of the data lies below that value.
• For large data sets, the median is the 50th percentile.
• The median of the lower half of the data is the 25th percentile and is called the first quartile (Q1).
• The median of the upper half of the data is the 75th percentile and is called the third quartile (Q3).
StatCrunch, Q1, Median, and Q3
• Enter the data.
• Stat → Summary Stats → Columns
• Click on the variable and then Calculate.
The Interquartile Range
• The Interquartile Range (IQR) is the difference between the upper quartile and the lower quartile.
IQR = Q3 – Q1
• The IQR measures the range of the middle half of the data.
• Example: If Q1 = 23 and Q3 = 44 then IQR = 44 – 23 = 21
The Interquartile Range
• The Interquartile Range for tsunami-causing earthquakes is 0.9.
• The picture below shows the meaning of the IQR.
Benefits and Drawbacks of the IQR
• The Interquartile Range is not sensitive to outliers.
• The IQR provides a reasonable summary of the spread of the distribution.
• The IQR shows where typical values are, except for the case of a bimodal distribution.
• The IQR is not great for a general audience since most people do not know what it is.
3.5
Boxplots and
5-Number
Summaries
5-Number Summary
• The 5-Number Summary provides a numerical description of the data. It consists of
○ Minimum
○ First Quartile (Q1)
○ Median
○ Third Quartile (Q3)
○ Maximum
• The list to the right shows the 5-Number Summary for the tsunami data.
Interpreting the 5-Number Summary
• The smallest tsunami-causing earthquake had magnitude 3.7.
• The largest tsunami-causing earthquake had magnitude 9.1.
• The middle half of tsunami-causing earthquakes is between 6.7 and 7.6.
• Half of tsunami-causing earthquakes have magnitudes below 7.2 and half are above 7.2.
• A tsunami-causing earthquake less than 6.7 is small.
• A tsunami-causing earthquake more than 7.6 is large.
Example – Text data, page 53
• The ordered values from the first batch:
• -17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8
• Let’s verify the text results with our technology.
• Odd number of points
• Min = -17.5, Max = 45.8, Med = 13.9
Example – Text data, page 53
How about Q1 and Q3?
Book’s method:
• For Q1, take the median of the first four points
(i.e. including the median). That is, take the
median of -17.5, 2.8, 3.2, 13.9, which is 3.0.
• For Q3, take the median of the last four points
(i.e. including the median). That is, take the
median of 13.9, 14.1, 25.3, 45.8, which is 19.7.
5 – number summary – TI
(except newer 84’s)
Select [STAT]
Select [CALC]
Select #1, “1-Var Stats”, and then add L1.
5 – number summary – TI
(newer 84’s)
Old operating system:
1-var Stats L1
Hmmmmm!
For Q1, the text got 3.0 and the TI got 2.8.
For Q3, the text got 19.7 and the TI got 25.3.
Difference in methodology.
The text included the median in both the lower-half and upper-half datasets; the TI did not.
Let’s go on to StatCrunch.
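The two conventions can be compared side by side with a small Python sketch. The function names and the `include_median` flag are made up for this illustration; the sketch only implements the "median of each half" idea described on these slides and is not the actual code of the text's authors or of the calculator.

```python
def median(values):
    """Middle value, or average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values, include_median):
    """Q1 and Q3 as the medians of the lower and upper halves.

    include_median=True  -> for odd n, the middle value joins both halves
    include_median=False -> for odd n, the middle value is left out of both
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[n - half:]
    return median(lower), median(upper)

batch = [-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8]
text_q1, text_q3 = quartiles(batch, include_median=True)
ti_q1, ti_q3 = quartiles(batch, include_median=False)

print("median included:", round(text_q1, 1), round(text_q3, 1))
print("median excluded:", round(ti_q1, 1), round(ti_q3, 1))
```

For an even number of points the two conventions coincide, since there is no single middle value to include or leave out.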
5-Number Summary - StatCrunch
Select Stat, then Summary Statistics, then Columns.
Then select the column you want summarized.
You will see a list of summary statistics. De-select all
except those you want; i.e. Max, Min, Q1, Q3 and
Median.
The Result with StatCrunch
Summary statistics:
Column  Median  Min    Max   Q1   Q3
var1    13.9    -17.5  45.8  2.8  25.3
Publisher Instructions for
Summary Statistics
Columns : Provides the following descriptive statistics in tabular format for the column(s) selected:
sample size (n), mean, variance, standard deviation (Std. Dev.), Standard Error (Std. Err.),
median, range, minimum, maximum, first quartile (Q1) and third quartile (Q3). Select the columns
for which summary statistics will be computed.
Enter an optional Where clause to specify the data rows to be included in the computation.
Select an optional Group By column to group results. If a Group By column is selected, choose
whether to display the output in separate tables for each column selected or in separate tables
for each group.
Click the Next button to select the summary statistics (by default, all are selected) to be computed.
The statistics will be displayed in the order in which they are selected (from right to left).
Additional percentiles may also be entered as a space or comma delimited list.
Check the Store output in data table option if the output is to be placed in the data table.
Click the Calculate button to view the results.
Other technologies
• SAS, StatDisk and MINITAB all agree with the TI and StatCrunch.
• EXCEL: PERCENTILE(Array,.25)=3, PERCENTILE(Array,.75)=19.7!
• Data Desk, an add-on to EXCEL, gives Q1 = 2.9 and Q3 = 22.5!
• There are different ways of computing Q1 (same for Q3)
○ Split list into two halves, include median in each (text)
○ Split list into two halves; don’t include median (TI, SC)
• I think that Data Desk used cut points of 0, (1/6), (2/6), (3/6), (4/6), (5/6) and 1, and interpolated.
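Python is one more technology that makes the choice explicit. A minimal sketch, assuming Python 3.8 or later, where the standard-library function `statistics.quantiles` accepts `method="inclusive"` or `method="exclusive"`. For this seven-point dataset the two methods happen to give the same two pairs of values quoted above; that is an observation about this dataset, not a claim that the packages use the same algorithm.

```python
import statistics

batch = [-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8]

# n=4 asks for the three cut points that split the data into quarters:
# [Q1, median, Q3].
q_incl = statistics.quantiles(batch, n=4, method="inclusive")
q_excl = statistics.quantiles(batch, n=4, method="exclusive")

print("inclusive:", [round(q, 2) for q in q_incl])
print("exclusive:", [round(q, 2) for q in q_excl])
```

Whatever tool is used, it helps to state the tool and method along with the quartiles.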
Boxes on pp. 53 and 54 of text
• There are several ways to compute a quartile (we’ve seen 3; the authors have seen 6; other texts say that there are 9).
• For large datasets, it makes very little difference.
• For smaller datasets (where it might make a difference), you do as well to just give the whole dataset rather than the summary statistics!
• You will be using technology – state which (StatCrunch or TI).
• Even StatCrunch and the TI do not agree on some datasets!
• The IQR can also be different!
Professional Disagreement on basic concepts
is common.
• How do English grammar manuals direct us in writing about a hat that belongs to Boris?
○ Chicago Manual of Style: Boris’s hat.
○ American Psychological Association (APA): Boris’ hat.
○ The HCC English Department uses MLA (Modern Language Association), which accepts either one.
• Historians disagree on the date of birth of Portuguese explorer Vasco da Gama – 1460 or 1469?
• Astronomers disagree on the definition of a planet! Is Pluto a real planet or a dwarf planet?
Boxplots
• A Boxplot is a chart that displays the 5-Number Summary and the outliers.
• Boxplots were invented in 1977 by John Tukey (1915 – 2000) of Bell Labs.
• The Box shows the Interquartile Range.
• The dashed lines are called fences; the outliers lie outside the fences.
• Above and below the box are the whiskers that display the most extreme data values within the fences.
• The line inside the box shows the median.
Finding the Fences
• The lower fence is defined by
Lower Fence = Q1 – 1.5 × IQR
• The upper fence is defined by
Upper Fence = Q3 + 1.5 × IQR
• Tsunami Example: Q1 = 6.7, Q3 = 7.6
IQR = 7.6 – 6.7 = 0.9
• Lower Fence = 6.7 – 1.5 × 0.9 = 5.35
• Upper Fence = 7.6 + 1.5 × 0.9 = 8.95
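The fence formulas translate directly into a short Python function. The name `fences` and the parameter `k` are made up for this sketch; results are rounded for printing because decimal values such as 6.7 are not stored exactly in floating point.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = fences(6.7, 7.6)  # tsunami example: Q1 = 6.7, Q3 = 7.6
print("lower fence:", round(lower, 2))
print("upper fence:", round(upper, 2))
```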
Constructing Boxplots by hand
1. Draw a single vertical axis spanning the range of the data. Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box.
Constructing Boxplots by hand (cont.)
2. Erect “fences” around the main part of the data.
• The upper fence is 1.5 IQRs above the upper quartile.
• The lower fence is 1.5 IQRs below the lower quartile.
• Note: the fences only help with constructing the boxplot and should not appear in the final display.
Constructing Boxplots by hand (cont.)
3. Use the fences to grow “whiskers.”
• Draw lines from the ends of the box up and down to the most extreme data values found within the fences.
• If a data value falls outside one of the fences, we do not connect it with a whisker.
Constructing Boxplots by hand (cont.)
4. Add the outliers by displaying any data values beyond the fences with special symbols.
• We often use a different symbol for “far outliers” that are farther than 3 IQRs from the quartiles.
BOXPLOTS on the TI
Clear any prior graph by going
to [Y=] and then [CLEAR] for
each function.
Do [2nd][Y=]
Turn Plot1 ON and have your
data in list L1.
If the other plots are on, turn
them off.
Select the picture of the
BoxPlot.
BOXPLOTS on the TI
With the standard window, you
will likely not get anything!
Try Zoom, 9. If that does not
work, …
Make the window reflect the
data, i.e. X[60,80].
Y could be -- say [0,10].
Here’s what you get.
TI: 5-number summary, boxplot
Step through with the TRACE button:
Identifying Outliers: The 1.5 IQR rule
• The lower fence is Q1 – 1.5 IQR. Anything below that is a low outlier.
• The upper fence is Q3 + 1.5 IQR. Anything above that is a high outlier.
• If a number is more than 3 IQRs away, it is a far outlier.
• We can now check suspected outliers.
• Error in the text: on page 55, the lower fence is stated incorrectly (it has a + instead of a –). The math is right, though.
Identifying Outliers: The 1.5 IQR rule
• Example: SAT Math scores for a group of students:
  770, 740, 570, 560, 560, 560, 550, 540, 530, 530, 420
• Q1 = 530, Median = 560, Q3 = 570
• IQR = 570 – 530 = 40
• Check for low outliers:
  • Q1 – (1.5 × IQR) = 530 – (1.5 × 40) = 470
  • 420 is a low outlier.
• Check for high outliers:
  • Q3 + (1.5 × IQR) = 570 + (1.5 × 40) = 630
  • 770 and 740 are high outliers.
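The SAT check can be reproduced with a short script. This is a minimal sketch that assumes the quartiles are found as the medians of the lower and upper halves of the sorted data (leaving out the middle value when n is odd); calculators and software may use slightly different quartile rules. The helper name `quartiles_by_halves` is made up for this example.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    # Skip the middle value when the number of values is odd.
    upper_half = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower_half), median(data), median(upper_half)

sat = [770, 740, 570, 560, 560, 560, 550, 540, 530, 530, 420]
q1, med, q3 = quartiles_by_halves(sat)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

low_outliers = [x for x in sat if x < lower_fence]
high_outliers = [x for x in sat if x > upper_fence]
print("Q1, median, Q3:", q1, med, q3)
print("Fences:", lower_fence, upper_fence)
print("Low outliers:", low_outliers, "High outliers:", high_outliers)
```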
Movie lengths with StatCrunch
• Enter data and go to Graphics → Boxplot.
• Click on the variable and Next.
• Check “Use fences to identify outliers.” Then Next.
• Type in labels and click on Create Graph.
Summary statistics:
Column         Min   Q1   Median   Q3    Max   IQR
Running Time   43    98   104.5    116   160   18
Outlier Movie lengths
• Upper outliers: Q3 + (1.5 × IQR) = 116 + (1.5 × 18) = 143
• Any movie with a running time above 143 min. is an upper outlier.
• There are seven, but you cannot see them all on the boxplot.
• Lower outliers: Q1 – (1.5 × IQR) = 98 – (1.5 × 18) = 71
• Any movie with a running time below 71 min. is a lower outlier.
• There is one.
Source: http://www.causeweb.org/resources/fun/
StatCrunch and Boxplots
• Enter data and go to Graphics → Boxplot.
• Click on the variable and Next.
• Check “Use fences to identify outliers.” Then Next.
• Type in labels and click on Create Graph.
Step-by-Step Example of Shape, Center,
Spread: Flight Cancellations
• Question: How often are flights cancelled?
• Who? Months
• What? Percentage of flights cancelled at U.S. airports
• When? 1995 – 2011
• Where? United States
• How? Bureau of Transportation Statistics data
Flight Cancellations: Think
• Identify the variable
  • Percent of flight cancellations at U.S. airports
  • Quantitative: units are percentages.
• How will the data be summarized?
  • Histogram
  • Numerical summary
  • Boxplot
Flight Cancellations: Show
•
Use StatCrunch (or the TI) to create the histogram,
boxplot, and numerical summary.
Flight Cancellations: Tell
• Describe the shape, center, and spread of the distribution. Report on the symmetry, number of modes, and any gaps or outliers. You should also mention any concerns you may have about the data.
• Skewed to the right: a percentage cannot be negative, while bad weather and other airport troubles can cause extreme cancellations.
• The IQR is small (1.23%), which shows consistency among the cancellation percentages.
• Extraordinary outlier at 20.2%: September 2001.
Boxplots of test scores for two classes of
the same course and instructor.
Period 1
Period 2
95
83
98
75
93
82
96
75
93
81
92
73
91
81
87
72
87
78
84
72
87
77
82
70
86
69
80
69
84
28
77
63
77
58
Boxplots of test scores for two classes of the
same course and instructor.
• Put Period 1 in L1 and Period 2 in L2.
• Select 2nd [STAT PLOT].
• Turn the first two plots on.
• Set up Plot 1 for a boxplot with L1.
• Set up Plot 2 for a boxplot with L2.
• Execute Zoom-9.
• We can tell some things about the scores.
3.6
The Center of
Symmetric
Distributions:
The Mean
The Mean
• The Mean is what most people think of as the average.
• Add up all the numbers and divide by the number of numbers:
  $\bar{y} = \dfrac{\sum y}{n}$
• Recall that $\sum$ means “Add them all.”
• In StatCrunch, the mean is listed in the Summary Statistics.
First use of the “average”
• 450 BC: Hippias used the average length of a king’s reign to estimate the date of the first Olympic Games (about 300 years before then, or about 750 BC).
• How he estimated the “average” is unknown. His estimate may have been subjective.
• He estimated the date by multiplying his “average” by the number of kings, which was precisely documented.
• The most accurate estimate that we have is 776 BC, based on engravings found at Olympia giving the names of the winners of a foot race held every four years since that date.
The Mean is the “Balancing Point”
•
If you put your finger
on the mean, the
histogram will
balance perfectly.
Mean Vs. Median
• For symmetric distributions, the mean and the median are equal.
  • The balancing point is at the center.
• For skewed distributions, the tail “pulls” the mean toward it more than it pulls the median.
• The mean is more sensitive to outliers than the median.
The Mean Is Attracted to the Outlier
•
The mean is larger
than the median
since it is “pulled”
to the right by the
outlier.
•
The median is a better
measure of the center
for data that is skewed.
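A quick numerical check makes the “pull” visible. The numbers below are made up for illustration (they are not from the slides): a small symmetric data set, and the same data with one large outlier added.

```python
from statistics import mean, median

# Hypothetical values, chosen so the arithmetic is easy to follow.
base = [30, 32, 35, 36, 38, 40, 41]
with_outlier = base + [200]

print("Without outlier: mean =", mean(base), " median =", median(base))
print("With outlier:    mean =", mean(with_outlier), " median =", median(with_outlier))
```

In this made-up example the outlier moves the mean from 36 to 56.5, while the median only moves from 36 to 37.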
Why Use the Mean?
• Although the median is a better measure of the center for skewed data, the mean takes the size of every value, large and small, into account.
• The mean is easier to work with.
• For symmetric data, statisticians would rather use the mean.
• For skewed data, statisticians prefer the median.
• It is always OK to report both the mean and the median.
5 – number summary – TI
(except newer 84’s)
Press [STAT]
Select [CALC]
Select #1, “1-Var Stats” and
Add L1
5 – number summary – TI
(newer 84’s)
What’s wrong with these quotes?
“We look forward to the day when everyone will receive more than the average wage.”
○ Australian Minister of Labour, 1973
“Lake Wobegon, Minnesota: where all the women are strong, all the men are good-looking, and all the children are above average.”
○ Garrison Keillor (made in jest on the show “A Prairie Home Companion”)
*Weighted Arithmetic Mean
The weighted arithmetic mean is computed using the following formula:
$\bar{x}_w = \dfrac{\sum w x}{\sum w}$
where:
$\bar{x}_w$ stands for the weighted arithmetic mean,
$x$ stands for the values of the items, and
$w$ stands for the weight of each item.
Source: http://www.emathzone.com/tutorials/basicstatistics/weighted-arithmetic-mean.html
*Example: Weighted Mean - GPA
A freshman receives the following grades. Assume 4 points for an A, 3 for a B.
What is his grade point average?
Course                 Credits   Grade   Points
Intro to Literature    3         B       3
Russian I              3         A       4
Physics I              4         A       4
Calculus I             4         A       4
Chemistry I            4         B       3
Physical Education I   1         A       4
*Example: Weighted Mean - GPA
Use GPA = ∑(Credits × Points) / ∑Credits
∑(Credits × Points) = 69
∑Credits = 19
GPA = 69/19 = 3.63
Credits   Grade   Points   Credits × Points
3         B       3        9
3         A       4        12
4         A       4        16
4         A       4        16
4         B       3        12
1         A       4        4
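The GPA calculation is a weighted mean with credits as the weights. A minimal sketch follows; the function name `weighted_mean` is made up for this example, and the numbers are the ones from the table above.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

points = [3, 4, 4, 4, 3, 4]    # grade points for B, A, A, A, B, A
credits = [3, 3, 4, 4, 4, 1]   # credit hours for each course

gpa = weighted_mean(points, credits)
print(f"GPA = {gpa:.2f}")
```

The same function works for the customer ratings example by passing the ratings as `values` and the customer counts as `weights`.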
*Example: Weighted Mean –
Customer Ratings
Amazon.com is reviewing the ratings on a line of products.
Customers rate from 1 to 5 (1 = worst, 5 = best).
The ratings, and the number of customers giving each rating, are shown below.
What is the average rating for the product?
Rating   Number of customers
5        57
4        73
3        36
2        7
1        10
*Example: Weighted Mean –
Customer Ratings
Use ∑(Rating × Customers) / ∑Customers
∑(Rating × Customers) = 709
∑Customers = 183
709/183 = 3.874
Rating   Customers   Total (Rating × Customers)
5        57          285
4        73          292
3        36          108
2        7           14
1        10          10
With the TI
Put the ratings (5 to 1) in L1
and the number in L2.
Old operating system:
Do 1-varStats L1,L2.
L1 comma L2
New operating system:
1-var Stats
List: L1
Frequency: L2
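One way to think about the frequency list is that each rating is repeated as many times as its count. The sketch below (variable names are made up for illustration) expands the ratings table into 183 individual ratings and checks that the ordinary mean of that list matches the weighted-mean formula.

```python
from statistics import mean

ratings = [5, 4, 3, 2, 1]
counts = [57, 73, 36, 7, 10]

# Repeat each rating once per customer who gave it.
expanded = [r for r, c in zip(ratings, counts) for _ in range(c)]

plain_mean = mean(expanded)
weighted = sum(r * c for r, c in zip(ratings, counts)) / sum(counts)
print(len(expanded), round(plain_mean, 3), round(weighted, 3))
```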
3.7
The Spread of
Symmetric
Distributions:
The Standard
Deviation
The Variance
$s^2 = \dfrac{\sum (y - \bar{y})^2}{n - 1}$
• The variance is a measure of how far the data are spread out from the mean.
• The difference from the mean is $y - \bar{y}$.
• To make it positive, square it.
• Then find the average of all of these squared distances, except instead of dividing by n, divide by n – 1.
• Use $s^2$ to represent the variance.
• The variance will mostly be used to find the standard deviation s, which is the square root of the variance.
Standard Deviation
$s = \sqrt{\dfrac{\sum (y - \bar{y})^2}{n - 1}}$
• The variance’s units are the square of the original units.
• Taking the square root of the variance gives the standard deviation, which will have the same units as y.
• The standard deviation was first used by Karl Pearson in 1894.
• The standard deviation is a number that is close to the average distance of the y values from the mean.
• If data values are close to the mean (less spread out), then the standard deviation will be small.
• If data values are far from the mean (more spread out), then the standard deviation will be large.
Standard Deviation by hand
(Technology is easier!)
X     (X – Xbar)   (X – Xbar)²
98    9            81
96    7            49
92    3            9
87    –2           4
85    –4           16
83    –6           36
82    –7           49
• The mean is Xbar = 623 / 7 = 89, and the squared deviations add up to 244.
• 244 / 6 = 40.66667 square dollars (What are those?).
• The square root of 40.66667 is $6.38.
• We have obtained the standard deviation.
• Units are the same as in the original data (dollars).
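The by-hand table can be mirrored step by step in code. This is a sketch of the same arithmetic, using the seven dollar amounts from the slide; the variable names are chosen here for readability.

```python
from math import sqrt

x = [98, 96, 92, 87, 85, 83, 82]

xbar = sum(x) / len(x)                      # mean
deviations = [v - xbar for v in x]          # (X - Xbar) column
squared = [d ** 2 for d in deviations]      # (X - Xbar)^2 column
variance = sum(squared) / (len(x) - 1)      # divide by n - 1, not n
sd = sqrt(variance)

print("mean:", xbar)
print("sum of squared deviations:", sum(squared))
print(f"variance: {variance:.5f}, standard deviation: {sd:.2f}")
```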
Questions about Variance
$s^2 = \dfrac{\sum (y - \bar{y})^2}{n - 1}$
Why n – 1 instead of n? It has to do with a concept called degrees of freedom.
We will see this in later chapters (Chapter 18).
Essentially, it is the number of entities that can be freely changed if the sum (or the mean) remains constant.
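One informal way to see the “freely changed” idea: deviations from the mean always add up to zero, so once n – 1 of them are known, the last one is forced. The sketch below reuses the seven values from the by-hand standard deviation slide; it illustrates the idea only and is not a proof.

```python
data = [98, 96, 92, 87, 85, 83, 82]
mean_value = sum(data) / len(data)
deviations = [v - mean_value for v in data]

# Pretend we only know the first n - 1 deviations.
known = deviations[:-1]
implied_last = -sum(known)   # the last deviation must cancel the others

print("sum of all deviations:", sum(deviations))
print("last deviation:", deviations[-1], " implied by the rest:", implied_last)
```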
The Standard Deviation and Histograms
Order the histograms below from smallest
standard deviation to largest standard deviation.
A
B
C
Answer: C, A, B
Mean and Standard Deviation - TI
For the numbers 62, 63, 65, 66, 68, 70, 71, 73, 75:
Press [STAT], [CALC], 1-Var Stats
The mean is 68.1111
The st. dev. is 4.4845
Use Sx instead of σx (will explain later in the course.)
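For readers without a TI, the same two numbers can be checked with Python's standard `statistics` module. In this sketch, `stdev` divides by n – 1 (which corresponds to the calculator's Sx) and `pstdev` divides by n (which corresponds to σx).

```python
from statistics import mean, stdev, pstdev

data = [62, 63, 65, 66, 68, 70, 71, 73, 75]

m = mean(data)
sx = stdev(data)        # sample standard deviation (divide by n - 1)
sigma_x = pstdev(data)  # population standard deviation (divide by n)

print(f"mean = {m:.4f}")
print(f"Sx = {sx:.4f}, sigma_x = {sigma_x:.4f}")
```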
Mean and Standard Deviation - StatCrunch
Numbers in Var1.
Select Stat, then Summary Stats, then Column as before.
Give Var1 as your input column.
Under Statistics, make sure that Mean and Standard Deviation are
checked. (You can check others.)
Click Create
Summary statistics:
Column   n   Mean        Std. Dev.
var1     9   68.111115   4.4845414
**EXCEL summary statistics
Data (cells A1:A7): 2.8, 3.2, 13.9, 14.1, 25.3, 45.8, -17.5
Summary Statistic    EXCEL function       Answer
Mean                 =average(a1:a7)      12.514
Standard Deviation   =stdev(a1:a7)        19.824
Median               =median(a1:a7)       13.9
1st quartile         =quartile(A1:A7,1)   3
3rd quartile         =quartile(A1:A7,3)   19.7
Minimum              =min(a1:a7)          -17.5
Maximum              =max(a1:a7)          45.8
Skewness             =skew(a1:a7)         0.3059
Kurtosis             =KURT(A1:A7)         0.8924
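The same summary can be approximated with Python's standard library. This is a sketch, not a statement about how Excel works internally: it assumes that `statistics.quantiles` with `method="inclusive"` uses the same interpolation as the QUARTILE function, which does reproduce the quartiles 3 and 19.7 for this data set. Skewness and kurtosis are left out because the standard library has no functions for them.

```python
from statistics import mean, median, stdev, quantiles

data = [2.8, 3.2, 13.9, 14.1, 25.3, 45.8, -17.5]

# quantiles() sorts the data itself; n=4 gives the three quartile cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

summary = {
    "Mean": mean(data),
    "Standard Deviation": stdev(data),
    "Median": median(data),
    "1st quartile": q1,
    "3rd quartile": q3,
    "Minimum": min(data),
    "Maximum": max(data),
}
for name, value in summary.items():
    print(f"{name:<20}{value:.3f}")
```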
*Other summary measures: Skewness
For data points Y1, Y2, …, Yn, the skewness is defined as
Note that it involves “cubes”, the third power.
The data are positively or negatively skewed depending on
whether this quantity is greater than or less than 0.
The magnitude of this quantity is a measure of how skewed
the data are.
Source: Wikipedia
*Other summary measures: Kurtosis
Kurtosis is a measure of how peaked or flat your data
are.
Mathematically, kurtosis is defined as:
$\dfrac{\frac{1}{n}\sum (x - \bar{x})^4}{\left(\frac{1}{n}\sum (x - \bar{x})^2\right)^2} - 3$
Note that this involves the fourth power.
A value of 0 indicates a perfect bell shape
Greater than 0: More peaked
Less than 0: Flatter
*Other summary measures:
Coefficient of Variation
You may see this in upper level textbooks.
The “coefficient of variation” is the standard deviation
divided by the mean.
For the most recent example, CV = 0.06584.
This is normally expressed as a percent, i.e. CV=6.584%.
Notice that the CV is “unitless”.
This is an advantage since it allows us to compare
different populations. We will see this a lot in the
course.
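The CV quoted above can be reproduced as follows, assuming that “the most recent example” is the nine numbers from the TI mean and standard deviation slide (their ratio does give 0.06584).

```python
from statistics import mean, stdev

data = [62, 63, 65, 66, 68, 70, 71, 73, 75]

# Coefficient of variation: sample standard deviation divided by the mean.
cv = stdev(data) / mean(data)
print(f"CV = {cv:.5f}  ({cv:.3%})")
```

Because the units of the standard deviation and the mean cancel, the result would be the same whether the data were recorded in inches or in centimeters.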
Thinking About Variation
Since Statistics is about variation, spread is an important
fundamental concept of Statistics.
Measures of spread help us talk about what we don’t
know.
When the data values are tightly clustered around the
center of the distribution, the IQR and standard
deviation will be small.
When the data values are scattered far from the center,
the IQR and standard deviation will be large.
Tell - Draw a Picture
When telling about quantitative variables, start by
making a histogram or stem-and-leaf display
and discuss the shape of the distribution.
3.8
Summary—What
to Tell About a
Quantitative
Variable
What to Tell
• Histogram, Stem-and-Leaf, Boxplot
  • Describe modality, symmetry, outliers
• Center and Spread
  • Median and IQR if not symmetric
  • Mean and standard deviation if symmetric
  • For unimodal symmetric data, the IQR is usually a bit larger than s. If it is not, check for errors.
• Unusual Features
  • For multiple modes, possibly split the data into groups.
  • When there are outliers, report the mean and standard deviation with and without the outliers.
Example: Fuel Efficiency
• The car owner has checked the fuel efficiency each time he filled the tank. How would you describe the fuel efficiency?
• Plan: Summarize the distribution of the car’s fuel efficiency.
• Variable: mpg for 100 fill-ups, quantitative
• Mechanics: show a histogram
  • Fairly symmetric
  • Low outlier
Fuel Efficiency Continued
•
Which to report?
• The mean and median are close.
• Report the mean and standard deviation.
•
Conclusion
• Distribution is unimodal and symmetric.
• Mean is 22.4 mpg.
• Low outlier may be investigated, but limited effect on
the mean
• s = 2.45; from one filling to the next, fuel efficiency
differs from the mean by an average of about 2.45 mpg.
Practice
Recall: Suppose a basketball player scored the following
number of points in his last 15 games: 4, 4, 3, 4, 7, 16,
12, 15, 6, 8, 5, 9, 8, 25, 11
Describe the shape of the distribution (modality, skew,
and unusual features) . Use a starting point of 3 and a
bin width of 4. Reordered, the points are:
3, 4, 4, 4, 5, 6, 7, 8, 8, 9, 11, 12, 15, 16, 25
If you are using technology, you need not reorder.
What measures of center or spread would be most
appropriate for this data set?
Source: Mrs. Emily Francis, Instructor of Mathematics, HCC
Answer to practice exercise
• Modality: Unimodal
• Symmetry: Skewed right.
• Gap and suspected outlier.
• Because of skewness:
  • Measure of center: Median
  • Measure of spread: IQR
Practice
#26: A meteorologist preparing a talk about global warming compiled a list of weekly low temperatures (in degrees Fahrenheit) he observed at his south Florida home last year. The coldest temperature for any week was 36°F, but he inadvertently recorded the Celsius value of 2 degrees. Assuming he correctly listed all the other temperatures, explain how this error will affect these summary statistics:
• Measures of center: mean and median
• Measures of spread: range, IQR, and standard deviation
Source: Mrs. Emily Francis, Instructor of Mathematics, HCC
Answer to practice exercise
• Recording 2°C instead of 36°F:
• Mean: Will decrease.
• Median: Should remain the same or decrease by a small amount.
• Range: Should increase.
• IQR: Should remain the same or increase by a small amount.
• Standard deviation: Should increase unless there are a lot of cold temperatures recorded.
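The reasoning above can be tried out on a small example. The temperatures below are invented for illustration (the exercise does not give the actual list); only the coldest value, 36, and its mis-recorded replacement, 2, come from the problem. The helper name `summarize` is also made up.

```python
from statistics import mean, median, stdev, quantiles

def summarize(temps):
    """Return the five summary statistics discussed in the exercise."""
    q1, _, q3 = quantiles(temps, n=4)  # default "exclusive" quartile rule
    return {
        "mean": mean(temps),
        "median": median(temps),
        "range": max(temps) - min(temps),
        "IQR": q3 - q1,
        "sd": stdev(temps),
    }

# Hypothetical weekly lows in degrees Fahrenheit.
correct = [36, 45, 50, 52, 55, 58, 60, 62, 65, 68, 70]
recorded = [2 if t == 36 else t for t in correct]  # the data-entry error

before, after = summarize(correct), summarize(recorded)
for name in before:
    print(f"{name:>6}: {before[name]:7.2f} -> {after[name]:7.2f}")
```

With this made-up list, the mean drops, the range and standard deviation grow, and the median and IQR do not change, in line with the answers on the slide.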
Practice
The table displays the heights (in inches) of 130 members of a choir.
a) Find the median and IQR
b) Find the mean and standard deviation
c) Display these data with a histogram
d) Write a few sentences describing the distribution
Put the data into the TI as we did with the weighted mean calculation.
Height   Count     Height   Count
60       2         69       5
61       6         70       11
62       9         71       8
63       7         72       9
64       5         73       4
65       20        74       2
66       18        75       4
67       7         76       1
68       12
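As an alternative to the TI frequency list, a frequency table like this one can be expanded into individual values in Python. The sketch below is one possible approach; the quartiles use the default rule in `statistics.quantiles`, which may differ slightly from what the TI or StatCrunch reports.

```python
from statistics import mean, median, stdev, quantiles

# Height (inches) -> number of choir members, copied from the table.
height_counts = {
    60: 2, 61: 6, 62: 9, 63: 7, 64: 5, 65: 20, 66: 18, 67: 7, 68: 12,
    69: 5, 70: 11, 71: 8, 72: 9, 73: 4, 74: 2, 75: 4, 76: 1,
}

# Repeat each height once per singer so ordinary functions can be used.
heights = [h for h, count in height_counts.items() for _ in range(count)]

q1, _, q3 = quantiles(heights, n=4)
iqr = q3 - q1
print("n =", len(heights))
print("median =", median(heights), " IQR =", iqr)
print(f"mean = {mean(heights):.2f}  standard deviation = {stdev(heights):.2f}")
```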
Source: Mrs. Emily Francis, Instructor of Mathematics, HCC
3.end
Wrap-up
What Can Go Wrong?
• Don’t make a histogram for categorical data.
• Don’t look for shape, center, and spread for a bar chart.
• Choose a bin width appropriate for the data.
What Can Go Wrong? Continued
• Do a reality check.
  • Don’t blindly trust your calculator. For example, a mean student age of 193 years old is nonsense.
• Sort before finding the median and percentiles.
  • 315, 8, 2, 49, 97 does not have a median of 2.
• Don’t worry about small differences in the quartile calculation.
• Don’t compute numerical summaries for a categorical variable.
  • The mean Social Security number is meaningless.
What Can Go Wrong? Continued
• Don’t report too many decimal places.
  • Citing the mean fuel efficiency as 22.417822453 is going overboard.
• Don’t round in the middle of a calculation.
• For multiple modes, think about separating groups.
  • Heights of people → Separate men and women
• Beware of outliers: the mean and standard deviation are sensitive to them.
  • Use a histogram or dotplot to ensure that the mean and standard deviation really do describe the data.
Division of Mathematics, HCC
Course Objectives for Chapter 3
After studying this chapter, the student will be able to:
8. Appropriately display quantitative data using a frequency distribution, histogram, relative frequency histogram, stem-and-leaf display, dotplot.
9. Describe the general shape of a distribution in terms of shape, center and spread.
10. Describe any anomalies or extraordinary features revealed by the display of a variable.
11. Compute and apply the concepts of mean and median to a set of data.
12. Compute and apply the concept of the standard deviation and IQR to a set of data.
13. Select a suitable measure of center/spread for a variable based on information about its distribution.
14. Create a five-number summary of a variable.
15. Construct a boxplot by hand and with technology.
16. Use the 1.5 IQR rule to identify possible outliers.
Division of Mathematics, HCC
Course Objectives for Chapter 4
After studying this chapter, the student will be able to:
17. Construct side-by-side histograms or boxplots for two or more groups.
18. Compare the distributions of two or more groups by comparing their shapes, centers, spreads, and unusual features.
We have already completed this!