Chapter 4
Displaying and Summarizing
Quantitative Data
Copyright © 2009 Pearson Education, Inc.
NOTE on slides / What we can and cannot do

The following notice accompanies these slides, which have been downloaded
from the publisher’s Web site:
“This work is protected by United States copyright laws and is provided solely
for the use of instructors in teaching their courses and assessing student
learning. Dissemination or sale of any part of this work (including on the
World Wide Web) will destroy the integrity of the work and is not permitted.
The work and materials from this site should never be made available to
students except by instructors using the accompanying text in their classes.
All recipients of this work are expected to abide by these restrictions and to
honor the intended pedagogical purposes and the needs of other instructors
who rely on these materials.”

We can use these slides because we are using the text for this course.
Please help us stay legal. Do not distribute these slides any further.

The original slides are done in orange / brown and black. My additions are in
red and blue. Topics in green are optional.
Slide 4- 3
Overview – Organization of the chapter


Pictorial Display
• Histogram
• Stem-and-Leaf Plot
• Dotplot
Numerical summary
• Shape of data
• Center
• Spread
Slide 4- 4
Division of Mathematics, HCC
Course Objectives for Chapter 4
After studying this chapter, the student will be able to:
7. Appropriately display quantitative data using a frequency distribution,
histogram, relative frequency histogram, stem-and-leaf display, or
dotplot.
8. Describe a distribution in terms of shape, center
and spread.
9. Describe any anomalies or extraordinary features revealed by the
display of a variable.
10. Compute and apply the concepts of mean and median to a set of data.
11. Compute and apply the concepts of the standard deviation and IQR to a
set of data.
12. Select a suitable measure of center/spread for a variable based on
information about its distribution.
13. Create a five-number summary of a variable.
Dealing With a Lot of Numbers…




Summarizing the data will help us when we look
at large sets of quantitative data.
Without summaries of the data, it’s hard to grasp
what the data tell us.
The best thing to do is to make a picture…
We can’t use bar charts or pie charts for
quantitative data, since those displays are for
categorical variables.
Slide 4- 6
Histograms: Earthquake Magnitudes




The chapter example discusses earthquake
magnitudes.
First, slice up the entire span of values covered
by the quantitative variable into equal-width piles
called bins. The bins will be the horizontal axis of
the plot.
The counts (i.e., the number of data points that fall
into each bin, also called the frequency) will be the
vertical axis.
The bins and the counts in each bin give the
distribution of the quantitative variable.
Slide 4- 7
Histograms: Earthquake Magnitudes (cont.)



A histogram plots the bin
counts as the heights of
bars (like a bar chart).
This concept was also
invented by William
Playfair
Here is a histogram of
earthquake magnitudes
Slide 4- 8
Histograms: Earthquake magnitudes (cont.)


A relative frequency histogram displays the percentage
of cases in each bin instead of the count.
• In this way, relative frequency histograms are
faithful to the area principle.
Here is a relative
frequency histogram of
earthquake magnitudes:
Slide 4- 9
Histograms



Both histograms “look” the same.
The only difference is the vertical axis.
Did we see this in Chapter 3?
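To see why the two histograms share a shape, here is a minimal Python sketch. The bin counts are hypothetical values chosen for illustration (they are not the earthquake data); dividing every count by the same total only rescales the vertical axis.

```python
# Hypothetical bin counts, for illustration only (not the earthquake data).
counts = [1, 2, 2, 2, 1, 1]

n = sum(counts)                            # total number of cases
percents = [100 * c / n for c in counts]   # relative frequency, in percent

for c, p in zip(counts, percents):
    # Each bar keeps the same relative height; only the axis units change.
    print(f"count = {c}   percent = {p:5.1f}%")
```

Every percentage is the count times the same constant (100/n), so the bars keep their relative heights.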
Slide 4- 10
Histograms




They can be displayed
horizontally as well as vertically
I rotated this one 90 degrees
clockwise
To publish this, I would put the
“% of Earthquakes” axis across
the bottom instead of the top.
I’d also retype the labels so they
can be more easily read!
Slide 4- 11
Histogram with the TI




Example: Data: 62,
63, 65, 66, 68, 70, 71,
73, 75
Use [STAT][EDIT] to
put the dataset in L1.
The first few data
points are shown.
NOTE: You will do this
a lot in this course!
Slide 4- 12
Histogram with the TI





First, select [Y=] and turn
off any functions from
Algebra class!
Press [2nd][Y=] (STAT PLOT) and go to
one of the three plots.
Turn it on.
Select the histogram.
Make sure that L1 (or
wherever you put the data)
is in Xlist.
Make sure Freq is set to 1.
Slide 4- 13
Histogram with the TI (default)



You can get a window
default by selecting
Zoom and then 9
Below is the window.
It shows a bin width of
3.25. It includes all of
the values.
Because we have
integers, I’d rather
have 3 as a bin width.
Slide 4- 14
Histogram with the TI

Choose as window
X: [60, 78]; Y: [-1, 3]. You
may have to play with this.
For X, I picked a little lower
than the min and a little
higher than the max.
For Y, I picked a little bigger
than the largest bin
frequency I expected.
Xscl is the width of the
bin. In this case, choosing
3 makes cut points at 60,
63, 66, 69, 72, 75, and 78.
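The window settings above can be checked with a short Python sketch. It uses the slide's data, a start of 60, and a bin width of 3. One assumption is labeled in the code: a value that lands exactly on a cut point is counted in the bin to its right, so each bin is [left, right). The function name `bin_counts` is mine, for illustration.

```python
# Data and window settings from the slide.
data = [62, 63, 65, 66, 68, 70, 71, 73, 75]
start, width, stop = 60, 3, 78

def bin_counts(values, start, width, stop):
    """Count the values in each bin [edge, edge + width).

    Assumption: a value sitting exactly on a cut point goes into
    the bin on its right.
    """
    n_bins = (stop - start) // width
    counts = [0] * n_bins
    for v in values:
        if start <= v < stop:
            counts[int((v - start) // width)] += 1
    return counts

edges = list(range(start, stop + 1, width))   # the cut points
counts = bin_counts(data, start, width, stop)

for left, c in zip(edges, counts):
    print(f"[{left}, {left + width}): {'#' * c} ({c})")
```

The tallest bar has a count of 2, which is why a Y maximum of 3 leaves a little room at the top.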
Slide 4- 15
Usefulness of the Trace function
Use the horizontal arrows to navigate the bins.
Slide 4- 16
Histogram with StatCrunch





Select Graphics
Select Histogram
Select the column you want graphed.
Select Next. (Do not select “Create Graph” yet
unless you do not need control over the
bin size.)
For the same bins as with the TI, “Start Bins” at
60 and set Bin Width equal to 3. Then select
“Create Graph”.
Slide 4- 17
Results
With default bin size
Better size
Slide 4- 21
How many bins?



No “hard and fast” rule. There is even some
disagreement among professionals.
Recommendations from slides from two Johns
Hopkins graduate Biostatistics courses. Both
depend on the number (n) of data points.
• Biostatistics 612: √n
• Biostatistics 651: 2√n
I personally would use √n, but would try different
numbers to see what looks best.
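As a rough illustration of the two rules of thumb, here is a small Python sketch. The function name `suggested_bins` and the choice to round up to a whole number are my assumptions; the slides give only the √n and 2√n formulas.

```python
import math

def suggested_bins(n, multiplier=1):
    """Rule-of-thumb bin count: multiplier * sqrt(n), rounded up (assumed)."""
    return math.ceil(multiplier * math.sqrt(n))

n = 9  # size of the small TI example dataset
print("sqrt(n) rule:  ", suggested_bins(n))      # Biostatistics 612 rule
print("2*sqrt(n) rule:", suggested_bins(n, 2))   # Biostatistics 651 rule
```

Either number is only a starting point; try a few nearby values and keep the picture that shows the shape best.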
Slide 4- 22
Publisher Instructions: Histogram






Histogram: Displays the frequency, relative frequency or density for
numerical data combined into classes. Select the column(s) to be
displayed in the plot(s). A separate plot will be generated for each
column selected.
Enter an optional Where clause to specify the data rows to be
included in the computation.
Select an optional Group by column to construct a histogram for
each distinct value of this column.
Click the Next button to select either the Frequency, Relative
Frequency or Density histogram. In addition, optional values for the
starting point of the bins and the bin width may be specified. These
parameters will apply to all of the histograms to be constructed.
Click the Next button again to specify graph layout options.
Click the Create Graph! button to create the plot(s).
Slide 4- 23
Histograms with EXCEL

There is a good YouTube tutorial on this – better
than anything I can provide. See
http://www.youtube.com/watch?v=RyxPp22x9PU
Slide 4- 24
Thoughts on Histograms





Histograms are useful and easy to apply to almost all
types of quantitative data.
This is especially true for larger data sets.
They can use a lot of ink and space! Color is more
useful than black-and-white or grayscale.
It can be difficult to display several related datasets at
the same time to compare datasets.
When you get a default, accept it if you can live with
it! If not, at least save (or remember) what you did.
Slide 4- 25
Stem-and-Leaf Displays


Stem-and-leaf displays show the distribution of a
quantitative variable, like histograms do, while
preserving the individual values.
Stem-and-leaf displays contain all the information
found in a histogram and, when carefully drawn,
satisfy the area principle and show the
distribution.
Slide 4- 26
Stem-and-Leaf Displays



They can show a complete dataset in very little
space.
It is easy to put them back-to-back to compare
groups.
Invented in 1972 by John Tukey (1915 – 2000)


Bell Labs, NJ
“Exploratory Data Analysis”, 1977
Slide 4- 27
Stem-and-Leaf Example

Compare the histogram and stem-and-leaf display for the
pulse rates of 24 women at a health clinic. Which
graphical display do you prefer?
Slide 4- 28
Constructing a Stem-and-Leaf Display





First, draw a vertical line.
Next, to the left of the line, cut each data value
into leading digits (“stems”)
and to the right of the line, trailing digits
(“leaves”).
Use the stems to label the bins.
Use only one digit for each leaf—either round or
truncate the data values to one decimal place
after the stem.
Slide 4- 29
Stem-and-Leaf Displays





Stem-and-Leaf plots give all of the data in
pictorial form.
Stem-and-Leaf plots are useful for smaller
datasets.
It is not possible to do a stem-and-leaf plot with
the TI.
Or EXCEL either.
But if the data set is ordered, they are easy to do
by hand.
Slide 4- 30
Stem and Leaf with StatCrunch








Select Graphics.
Select “Stem and Leaf”.
Select the variable you want graphed.
All you can do is “Create Graph.”
You are not free to select bin sizes.
The TI does not do stem and leaf plots.
Nor does EXCEL
Variable: var1
6 : 23568
7 : 0145
Slide 4- 31
Publisher’s Instructions




Stem and Leaf : Displays a character based plot of a
column that is similar to a histogram turned on its side.
The actual (or approximate) data values are represented
in the plot. Select the column(s) to be displayed in the
plot(s). A separate plot will be generated for each column
selected.
Enter an optional Where clause to specify the data rows
to be included in the computation.
Select an optional Group by column to construct a
separate stem and leaf plot for each distinct value of this
column.
Click the Create Graph! button to create the plot(s).
Slide 4- 34
Dotplots




A dotplot is a simple display. It
just places a dot along an axis
for each case in the data.
The dotplot to the right shows
Kentucky Derby winning times,
plotting each race as its own dot.
You might see a dotplot
displayed horizontally (such as
this one) or vertically.
It looks “sorta” like a histogram.
Slide 4- 35
Dotplots with StatCrunch




Can’t do with the TI or EXCEL.
With StatCrunch, again select “Graphics”, then
“DotPlot” (as with the Histogram and the Stem
and Leaf).
In the next panel, you can input axis labels and
draw grid lines if you wish. In the following one,
you can pick a color scheme.
But you have no control over the bin size (see
next slide for an example of a dotplot that is not
very useful).
Slide 4- 36
Publisher’s Instructions





Dotplot : Displays a graphical representation of numerical values as points
on a number line. Points with the same pixel representation are stacked on
top of each other. If the number of points in a stack exceeds the height of the
graphic, each point on the plot may represent more than one observation. If
this occurs, the number of observations per point will be shown in the title of
the graphic. Select the column(s) to be displayed in the plot(s). If multiple
columns are selected, the plots will be stacked in the reverse order of
selection in the same graphic.
Enter an optional Where clause to specify the data rows to be included in the
computation.
Select an optional Group by column to construct dotplots for each distinct
value of this column. If a Group by column is specified, select either to stack
the plots of each group for each column or to stack plots of each column for
each group.
Click the Next button to specify graph layout options.
Click the Create Graph! button to create the plot(s).
Slide 4- 38
Think Before You Draw, Again



Remember the “Make a picture” rule?
Now that we have options for data displays, you
need to Think carefully about which type of
display to make.
Before making a stem-and-leaf display, a
histogram, or a dotplot, check the
• Quantitative Data Condition: The data are
values of a quantitative variable whose units
are known.
Slide 4- 39
Constructing Effective Graphs
Source: Agresti & Franklin




Label both axes and provide proper headings
To better compare relative size, the vertical axis
should start at 0.
Be cautious in using anything other than bars,
lines, or points. Don’t use birds, dollar signs,
ships, etc!
It can be difficult to portray more than one group
on a single graph when the variable values differ
greatly
Slide 4- 40
Now on over
to . . .

Slide 4- 41
Shape, Center, and Spread

When describing a distribution, make sure to
always tell about three things: shape, center, and
spread…
Slide 4- 42
What is the Shape of the Distribution?
1. Does the histogram have a single, central hump
or several separated humps?
2. Is the histogram symmetric?
3. Do any unusual features stick out?
Slide 4- 43
Humps
1. Does the histogram have a single, central hump
or several separated bumps?

Humps in a histogram are called modes.

A histogram with one main peak is dubbed
unimodal; histograms with two peaks are
bimodal; histograms with three or more peaks
are called multimodal.
Slide 4- 44
Humps (cont.)

A bimodal histogram has two apparent peaks:
Diastolic Blood Pressure
Slide 4- 45
Humps (cont.)

A histogram that doesn’t appear to have any mode and
in which all the bars are approximately the same height
is called uniform:
Proportion of Wins
Slide 4- 46
Symmetry
2. Is the histogram symmetric?

If you can fold the histogram along a vertical line
through the middle and have the edges match
pretty closely, the histogram is symmetric.
Slide 4- 47
Symmetry (cont.)


The (usually) thinner ends of a distribution are called
the tails. If one tail stretches out farther than the other,
the histogram is said to be skewed to the side of the
longer tail.
In the figure below, the histogram on the left is said to
be skewed left, while the histogram on the right is said
to be skewed right.
Slide 4- 48
Symmetry (cont.)




The skewness is in the direction of the tail, not the
hump!
Think of a playground “sliding board” – when you go
down the slide, in which direction are you going?
That’s the direction of the skewness.
There is a numerical measure of skewness that I will
show you later.
Slide 4- 49
Examples of Skewness
Source: Agresti & Franklin, “Statistics: The Art and Science
of Learning from Data”; Pearson, 2007
Slide 4- 50
Examples of Skewness
Source: Agresti & Franklin, “Statistics: The Art and Science
of Learning from Data”; Pearson, 2007
Slide 4- 51
Anything Unusual?
3. Do any unusual features stick out?

Sometimes it’s the unusual features that tell
us something interesting or exciting about the
data.

You should always mention any stragglers, or
suspected outliers, that stand off away from
the body of the distribution.

Are there any gaps in the distribution? If so,
we might have data from more than one
group.
Slide 4- 52
Anything Unusual? (cont.)

The following histogram has suspected outliers—
there are three cities in the leftmost bar:
Slide 4- 53
Center of a Distribution – Median

The median is the value with exactly half the data
values below it and half above it.
• It is the middle data value (once the data values
have been ordered) that divides the histogram into
two equal areas.
• It has the same units as the data.
Slide 4- 54
Finding the median




First, make sure that the data are arranged
smallest to largest (or largest to smallest).
Count the number, say N, of data points.
If N is odd, take the middle one. For example, if
N = 21, the 11th point is the median.
If N is even,
• there is no middle “one”!
• So we average the middle two!
Slide 4- 55
Examples


Data: 62, 63, 65, 66, 68, 70, 71, 73, 75
• N = 9; it’s odd
• The 5th point is the middle one
• 68 is the median
Data: 197, 195, 193, 192, 187, 185, 182, 179
• N = 8; it’s even
• Average the two middle points, 192 and 187
• The median is 189.5
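The odd/even rule can be written as a short Python function. This is a minimal sketch of the procedure described on the slides (sort, then take the middle value or average the middle two); the function name `median` is mine, and Python's built-in `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value of the sorted data; average the middle two if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd N: the middle one
    return (ordered[mid - 1] + ordered[mid]) / 2   # even N: average middle two

odd_example = [62, 63, 65, 66, 68, 70, 71, 73, 75]
even_example = [197, 195, 193, 192, 187, 185, 182, 179]

print(median(odd_example))    # N = 9, the 5th ordered value
print(median(even_example))   # N = 8, average of 187 and 192
```

Note that the data do not need to be typed in order; the function sorts a copy first.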
Slide 4- 56
Notice that …



If N is even, the median does not have to be one
of the data points.
The median can be affected by outliers (but
maybe not that much)
Example: 77, 72, 70, 69, 68, 67. Let’s for now
label 77 as an outlier.
• With the outlier (77), the median is 69.5.
• Without the 77, we have 72, 70, 69, 68, 67
and the median is 69.
Slide 4- 57
Comparing the Mean and Median
(Source: Agresti & Franklin)

In a skewed distribution, the mean is farther
out in the long tail than is the median.
• For skewed distributions the median is
preferred because it better represents
a typical observation.
Slide 4- 58
Spread: Home on the Range




Always report a measure of spread along with a measure
of center when describing a distribution numerically.
The range of the data is the difference between the
maximum and minimum values:
Range = max – min
A disadvantage of the range is that a single extreme value
can make it very large and, thus, not representative of the
data overall.
• Example: 77, 72, 70, 69, 68, 67. The range is 10. But
if we take out the “outlier”, the range drops to 5.
CAUTION: In the above example (with the 77), the range
is not “67 to 77”; it is 10!
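A quick Python check of the example, as a minimal sketch (the function name `data_range` is mine). It returns a single number, max minus min, not a pair of endpoints.

```python
def data_range(values):
    """Range = max - min (one number, not an interval)."""
    return max(values) - min(values)

data = [77, 72, 70, 69, 68, 67]
without_outlier = [v for v in data if v != 77]

print(data_range(data))             # with the 77
print(data_range(without_outlier))  # after removing the 77
```

Removing one value cuts the range in half, which is the sensitivity the slide warns about.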
Slide 4- 59
Spread: The Interquartile Range


The interquartile range (IQR) lets us ignore
extreme data values and concentrate on the
middle of the data.
To find the IQR, we first need to know what
quartiles are…
Slide 4- 60
Spread: The Interquartile Range (cont.)


Quartiles divide the data into four equal sections.
• One quarter of the data lies below the lower
quartile, Q1.
• One quarter of the data lies above the upper
quartile, Q3.
The difference between the quartiles is the
interquartile range (IQR), so
IQR = upper quartile – lower quartile
Slide 4- 61
Spread: The Interquartile Range (cont.)


The lower and upper quartiles are the 25th and 75th percentiles of the
data, so…
The IQR contains the middle 50% of the values of the distribution, as
shown in figure:
Slide 4- 62
5-Number Summary


The 5-number summary of a distribution reports its median, quartiles,
and extremes (maximum and minimum)
The 5-number summary for the recent tsunami earthquake
Magnitudes looks like this:
Slide 4- 63
A little clarification!


What do we mean by “half of the data below the
median” and “half of the data above the
median”?
Data: 197, 195, 193, 192, 187, 185, 182, 179
• N = 8; it’s even – the median was 189.5
• For the first quartile, we take the median of the
last four numbers (the lower half), i.e. 187, 185, 182, 179.
This is 183.5.
• Similarly, the third quartile is 194, the median
of 197, 195, 193, 192.
Slide 4- 64
Example – Text data, page 58





The ordered values from the first batch:
-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8
Let’s verify the text results with our
technology.
Odd number of points
Min = -17.5, Max = 45.8, Med = 13.9
Slide 4- 65
Example – Text data, page 58


How about Q1 and Q3?
Book’s method:
• For Q1, take the median of the first four
points (i.e. including the median). That
is, take the median of -17.5, 2.8, 3.2,
13.9, which is 3.0.
• For Q3, take the median of the last four
points (i.e. including the median). That
is, take the median of 13.9, 14.1, 25.3,
45.8, which is 19.7.
Slide 4- 66
5 – number summary – TI
(except newer 84’s)



Select [STAT]
Select [CALC]
Select #1, “1-Var Stats”
Slide 4- 67
5 – number summary – TI
(newer 84’s)
Slide 4- 68
Hmmmmm!





For Q1, the text got 3.0
and the TI got 2.8.
For Q3, the text got 19.7
and the TI got 25.3.
Difference in methodology.
The text included the
median in each half of the
dataset; the TI did not.
Let’s go on to StatCrunch.
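The two split-the-list methods can be compared side by side in Python. This is a sketch of my reading of the two procedures: when N is odd, the text's method puts the median into both halves, while the TI's method leaves it out of both. The function name `quartiles_split` and its `include_median` flag are mine.

```python
from statistics import median

def quartiles_split(values, include_median):
    """Q1 and Q3 as the medians of the lower and upper halves.

    include_median=True  -> text's method (median joins both halves if N is odd)
    include_median=False -> TI-style method (median left out of both halves)
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[n - half:]
    return median(lower), median(upper)

data = [-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8]

q1_text, q3_text = quartiles_split(data, include_median=True)
q1_ti, q3_ti = quartiles_split(data, include_median=False)

print("Text method:", round(q1_text, 1), round(q3_text, 1))
print("TI method:  ", round(q1_ti, 1), round(q3_ti, 1))
```

When N is even there is no middle value to argue about, so both methods give the same quartiles.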
Slide 4- 69
5-Number Summary - StatCrunch



Select Stat, then Summary Statistics, then
Columns.
Then select the column you want summarized.
You will see a list of summary statistics. Deselect all except those you want; i.e. Max, Min,
Q1, Q3 and Median.
Slide 4- 70
The Result with StatCrunch



Summary statistics:
Column  Median  Min    Max   Q1   Q3
var1    13.9    -17.5  45.8  2.8  25.3
Slide 4- 74
Publisher Instructions for
Summary Statistics






Columns : Provides the following descriptive statistics in tabular format for the
column(s) selected: sample size (n), mean, variance, standard deviation (Std. Dev.),
Standard Error (Std. Err.), median, range, minimum, maximum, first quartile (Q1) and
third quartile (Q3). Select the columns for which summary statistics will be computed.
Enter an optional Where clause to specify the data rows to be included in the
computation.
Select an optional Group By column to group results. If a Group By column is
selected, choose whether to display the output in separate tables for each column
selected or in separate tables for each group.
Click the Next button to select the summary statistics (by default, all are selected) to be
computed. The statistics will be displayed in the order in which they are selected (from
right to left). Additional percentiles may also be entered as a space or comma delimited
list.
Check the Store output in data table option if the output is to be placed in the data
table.
Click the Calculate button to view the results.
Slide 4- 75
Other technologies




SAS, StatDisk and MINITAB all agree with the TI and
StatCrunch.
EXCEL: PERCENTILE(Array,.25)=3,
PERCENTILE(Array,.75)=19.7!
Data Desk, an add-on to EXCEL, gives Q1 = 2.9 and Q3
= 22.5!
There are different ways of computing Q1 (same for Q3):
• Split list into two halves; include median in each (text)
• Split list into two halves; don’t include median (TI, SC)
• I think that Data Desk used cut points of 0, (1/6),
(2/6), (3/6), (4/6), (5/6) and 1, and interpolated.
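Python's standard library shows the same disagreement. In the sketch below, `statistics.quantiles` is called with its two documented methods, `"inclusive"` and `"exclusive"`. For this particular seven-point dataset, the inclusive method reproduces the EXCEL and text values and the exclusive method reproduces the TI and StatCrunch values; I have not checked that this match holds for other datasets, so treat it as an observation about this example only.

```python
from statistics import quantiles

data = [-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8]

# n=4 asks for the three cut points Q1, median, Q3.
q_incl = quantiles(data, n=4, method="inclusive")
q_excl = quantiles(data, n=4, method="exclusive")

print("inclusive:", [round(q, 2) for q in q_incl])
print("exclusive:", [round(q, 2) for q in q_excl])
```

The medians agree; only Q1 and Q3 move, which is why it helps to say which technology produced a reported quartile.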
Slide 4- 76
Boxes in blue – pp. 59 and 68 of text






There are several ways to compute a quartile (we’ve seen
3; the authors have seen 9.)
For large datasets, it makes very little difference.
For smaller datasets (where it might make a difference),
you do as well to just give the whole dataset rather than
the summary statistics!
You will be using the TI on the assessments.
Even StatCrunch and the TI do not agree on some
datasets! Therefore, on homework, say which technology
you used.
The IQR can also be different!
Slide 4- 77
Summarizing Symmetric Distributions – The Mean



When we have symmetric data, there is an
alternative to the median.
If we want to calculate a number, we can average
the data.
We use the Greek letter sigma to mean “sum” and
write:
$\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$
The formula says that to find the
mean, we add up the numbers
and divide by n.
Slide 4- 78
Summarizing Symmetric Distributions – The Mean
(cont)

The mean feels like the center because it is the point
where the histogram balances:
Slide 4- 79
Summarizing Symmetric Distributions – The Mean
(cont)



Because the median considers only the order of values, it
is resistant to values that are extraordinarily large or
small; it simply notes that they are one of the “big ones” or
“small ones” and ignores their distance from center.
To choose between the mean and median, start by
looking at the data. If the histogram is symmetric and
there are no outliers, use the mean.
However, if the histogram is skewed or has outliers, you
are better off with the median.
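A small Python experiment shows what "resistant" means. It starts from the six values used in the earlier median example and then, as a purely hypothetical change, replaces the 77 with 770. The median does not move; the mean does.

```python
from statistics import mean, median

data = [77, 72, 70, 69, 68, 67]

# Hypothetical: pretend the largest value was recorded as 770 instead of 77.
extreme = [770 if v == 77 else v for v in data]

print("original:", "mean =", mean(data), " median =", median(data))
print("extreme: ", "mean =", mean(extreme), " median =", median(extreme))
```

The median only notices that the largest value is still the largest; the mean feels its full distance from the center.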
Slide 4- 80
Source: http://www.causeweb.org/resources/fun/
Slide 4- 81
What’s wrong with these quotes?




“We look forward to the day when everyone will
receive more than the average wage.”
Australian Minister of Labor, 1973
“Lake Wobegon, Minnesota: Where all the
women are strong, all the men are good-looking,
and all the children are above average”
Garrison Keillor (made in jest on the show “A
Prairie Home Companion”)
Slide 4- 82
*Weighted Arithmetic Mean

The weighted arithmetic mean is computed using the
following formula:
$\bar{x}_w = \frac{\sum w x}{\sum w}$
Where:
$\bar{x}_w$ stands for the weighted arithmetic mean,
x stands for the values of the items, and
w stands for the weight of each item.
Source:
http://www.emathzone.com/tutorials/basicstatistics/weighted-arithmetic-mean.html
Slide 4- 83
*Example: Weighted Mean - GPA



A freshman receives the following grades. Assume 4 points for an A, 3 for a B.
What is his grade point average?
Course                Credits  Grade  Points
Intro to Literature   3        B      3
Russian I             3        A      4
Physics I             4        A      4
Calculus I            4        A      4
Chemistry I           4        B      3
Physical Education I  1        A      4
Slide 4- 84
*Example: Weighted Mean - GPA





Use (∑Credits*Points) / (∑Credits)
∑Credits*Points = 69
∑Credits = 19
69/19 = 3.63
Credits  Grade  Points  Credits*Points
3        B      3       9
3        A      4       12
4        A      4       16
4        A      4       16
4        B      3       12
1        A      4       4
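The GPA arithmetic can be reproduced in Python. This is a minimal sketch; the helper name `weighted_mean` is mine, and the credits act as the weights.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Columns from the GPA table, in course order.
points = [3, 4, 4, 4, 3, 4]    # B, A, A, A, B, A
credits = [3, 3, 4, 4, 4, 1]

total_quality_points = sum(c * p for c, p in zip(credits, points))
gpa = weighted_mean(points, credits)

print(total_quality_points, sum(credits))   # numerator and denominator
print(round(gpa, 2))
```

An unweighted average of the six grades would treat the 1-credit course the same as a 4-credit course, which is exactly what the weights prevent.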
Slide 4- 85
*Example: Weighted Mean –
Customer Ratings




Amazon.com is
reviewing the ratings
on a line of products.
Customers rate 1 to 5,
1 = Worst, 5 = Best.
Ratings (and number
giving each rating) are
on the right.
What is the average
rating for the product?
Rating  Number of customers
5       57
4       73
3       36
2       7
1       10
Slide 4- 86
*Example: Weighted Mean –
Customer Ratings





Use
(∑Ratings*Customers) / (∑Customers)
∑Rtgs*Cust = 709
∑Cust = 183
709/183 = 3.874.
This is what you will
use in Project 1.
Rating  Customers  Total
5       57         285
4       73         292
3       36         108
2       7          14
1       10         10
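Another way to look at the same calculation: a frequency table is just shorthand for a long list of individual ratings. The Python sketch below expands the table into that list (the variable name `expanded` is mine) and takes an ordinary mean, which gives the same answer as the weighted formula.

```python
from statistics import mean

ratings = [5, 4, 3, 2, 1]
customers = [57, 73, 36, 7, 10]   # how many customers gave each rating

# Write each rating out once per customer who gave it.
expanded = [r for r, c in zip(ratings, customers) for _ in range(c)]

average = mean(expanded)
print(len(expanded), sum(expanded))   # number of customers, total of all ratings
print(round(average, 3))
```

This is the idea behind giving the calculator a data list and a frequency list together.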
Slide 4- 87
With the TI




Put the ratings (5 to 1)
in L1 and the number in
L2.
Do 1-varStats L1,L2.
L1 comma L2
But there is a setting
that you need if you
have the new operating
system.
Slide 4- 88
With the TI (adjust the TI with the
new operating system)




Make sure that
StatWizard is off.
To do this,
[MODE]
StatWizard is just
above the clock.
If it is on, then
1-varStats L1,L2
will not work.
Slide 4- 89
What About Spread? The Standard Deviation


A more powerful measure of spread than the IQR
is the standard deviation, which takes into
account how far each data value is from the
mean.
A deviation is the distance that a data value is
from the mean.
• Since adding all deviations together would total
zero, we square each deviation and find an
average of sorts for the squared deviations.
Slide 4- 90
Standard Deviation by hand
(Don’t do this yourself!)






A student goes shopping for an external hard
drive for her computer.
She finds the same hard drive in seven places.
The prices are $98, $96, $92, $87, $85, $83, $82.
The mean is easy to compute
(∑x/n) = 623/7 = $89.
Let’s do the standard deviation.
Slide 4- 91
Attempt at a measure of spread
(not a very good one!)
X    (X – Xbar)
98   9
96   7
92   3
87   -2
85   -4
83   -6
82   -7
• However, all of these deviations add to zero.
• This is not a very good measure of spread!
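The same table in Python, as a minimal sketch using the seven hard drive prices. It confirms that the deviations cancel out exactly, so their sum cannot tell us anything about spread.

```python
prices = [98, 96, 92, 87, 85, 83, 82]

xbar = sum(prices) / len(prices)          # the mean
deviations = [x - xbar for x in prices]   # each value minus the mean

print(xbar)
print(deviations)
print(sum(deviations))   # positives and negatives cancel
```

This happens for every dataset, not just this one, because the mean is the balance point of the data.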
Slide 4- 92
What About Spread? The Standard Deviation
(cont.)

The variance, notated by $s^2$, is found by summing
the squared deviations and (almost) averaging
them:
$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
The variance will play a role later in our study, but
it is problematic as a measure of spread—it is
measured in squared units!
Slide 4- 93
Variance by hand
(Technology is easier!)
X    (X – Xbar)  (X – Xbar)²
98   9           81
96   7           49
92   3           9
87   -2          4
85   -4          16
83   -6          36
82   -7          49
• By squaring, we get rid of the negatives.
• The squared deviations (last col.) add to 244 square dollars.
• Then 244/6 = 40.67 square dollars.
• This is the variance. (What’s a square dollar?)
Slide 4- 94
What About Spread? The Standard Deviation
(cont.)

The standard deviation, s, is just the square root
of the variance and is measured in the same
units as the original data.
$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$
Slide 4- 95
Standard Deviation by hand
(Technology is easier!)
X    (X – Xbar)  (X – Xbar)²
98   9           81
96   7           49
92   3           9
87   -2          4
85   -4          16
83   -6          36
82   -7          49
• 244 / 6 = 40.66667.
• The square root of 40.66667 is $6.38.
• We have obtained the standard deviation.
• Units are the same as in the original data (dollars).
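The by-hand steps can be checked against Python's `statistics` module, whose `variance` and `stdev` functions divide by n – 1 as in the formula above. This is a minimal sketch using the seven prices.

```python
from statistics import stdev, variance

prices = [98, 96, 92, 87, 85, 83, 82]
n = len(prices)
xbar = sum(prices) / n

# By hand: square each deviation, add them up, divide by n - 1.
sum_sq = sum((x - xbar) ** 2 for x in prices)
var_by_hand = sum_sq / (n - 1)
sd_by_hand = var_by_hand ** 0.5

print(sum_sq, round(var_by_hand, 2), round(sd_by_hand, 2))

# Library versions for comparison (both use n - 1).
print(round(variance(prices), 2), round(stdev(prices), 2))
```

The variance comes out in square dollars; taking the square root brings the answer back to dollars.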
Slide 4- 96
Questions about Variance
$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
Why n – 1 instead of n? It has to do with a concept
called degrees of freedom.
We will see this in later chapters (Chapter 23).
Essentially, it is the number of values that can be freely
changed if the sum (or the mean) remains constant.
Slide 4- 97
Source: http://www.causeweb.org/resources/fun/
Slide 4- 98
Mean and Standard Deviation - TI





For the numbers 62,
63, 65, 66, 68, 70, 71,
73, 75:
Press [STAT], [CALC],
1-Var Stats
The mean is 68.1111
The st. dev is 4.4845
Use the sx instead of
the σx (will explain
later in the course.)
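For comparison, Python's `statistics` module offers both versions of the standard deviation. In the sketch below, `stdev` divides by n – 1 and corresponds to the calculator's sx, while `pstdev` divides by n and corresponds to σx; the mapping to the TI labels is my reading of the two formulas, not something stated in the slides.

```python
from statistics import mean, pstdev, stdev

data = [62, 63, 65, 66, 68, 70, 71, 73, 75]

print("mean   =", round(mean(data), 4))
print("sx     =", round(stdev(data), 4))    # divides by n - 1 (use this one)
print("sigmax =", round(pstdev(data), 4))   # divides by n
```

Dividing by n always gives the smaller of the two values, and the gap shrinks as n grows.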
Slide 4- 99
Mean and Standard Deviation - StatCrunch
• Numbers in Var1.
• Select Stat, then Summary Stats, then Column as before.
• Give Var1 as your input column.
• Under Statistics, make sure that Mean and Standard Deviation are checked. (You can check others.)
• Click Create.

Summary statistics:
Column    n    Mean         Std. Dev.
var1      9    68.111115    4.4845414
Slide 4- 100
**EXCEL summary statistics
Data in cells A1:A7: 2.8, 3.2, 13.9, 14.1, 25.3, 45.8, -17.5

Summary Statistic      EXCEL function         Answer
Mean                   =average(a1:a7)        12.514
Standard Deviation     =stdev(a1:a7)          19.824
Median                 =median(a1:a7)         13.9
1st quartile           =quartile(A1:A7,1)     3
3rd quartile           =quartile(A1:A7,3)     19.7
Minimum                =min(a1:a7)            -17.5
Maximum                =max(a1:a7)            45.8
Skewness               =skew(a1:a7)           0.3059
Kurtosis               =KURT(A1:A7)           0.8924
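The same summary can be sketched in Python. The seven data values below are the ones that appear alongside the EXCEL answers on this slide (they reproduce the stated mean of 12.514). The choice of `method="inclusive"` is an assumption made so that the quartiles match EXCEL's `quartile` function; other software may interpolate differently.

```python
from statistics import mean, median, quantiles, stdev

data = [2.8, 3.2, 13.9, 14.1, 25.3, 45.8, -17.5]

# "inclusive" interpolation is assumed here to mirror EXCEL's QUARTILE
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

summary = {
    "mean": mean(data),
    "std dev": stdev(data),
    "median": median(data),
    "1st quartile": q1,
    "3rd quartile": q3,
    "minimum": min(data),
    "maximum": max(data),
}
for name, value in summary.items():
    print(f"{name:>13}: {value:.3f}")
```

Small differences between packages in the quartiles are normal and not a cause for concern.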
Slide 4- 101
*Other summary measures: Skewness
• For data points Y1, Y2, …, YN, the skewness is defined by a formula (shown as an image on the original slide).
• Note that it involves “cubes”, the third power.
• The data are positively or negatively skewed depending on whether this quantity is greater than or less than 0.
• The magnitude of this quantity is a measure of how skewed the data are.
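There are several slightly different skewness formulas in use. The sketch below uses the adjusted sample formula documented for EXCEL's `skew` function, which may not be the exact formula pictured on the slide. The function name `excel_style_skew` is hypothetical, and the data are the seven values from the EXCEL summary statistics slide.

```python
from statistics import mean, stdev

def excel_style_skew(values):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - xbar)/s)^3)."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)                        # sample SD (divides by n - 1)
    cubes = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubes

data = [2.8, 3.2, 13.9, 14.1, 25.3, 45.8, -17.5]
print(round(excel_style_skew(data), 4))      # compare with the EXCEL answer 0.3059
```

The result is positive, so these data are (mildly) skewed to the right.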
Source: Wikipedia
Slide 4- 102
*Other summary measures: Kurtosis
• Kurtosis is a measure of how peaked or flat your data are.
• Mathematically, kurtosis is defined as:

$$\text{kurtosis} = \frac{\frac{1}{n}\sum (x - \bar{x})^4}{\left(\frac{1}{n}\sum (x - \bar{x})^2\right)^2} - 3$$

• Note that this involves the fourth power.
• A value of 0 indicates a perfect bell shape.
• Greater than 0: More peaked
• Less than 0: Flatter
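As with skewness, software packages differ in the details. The sketch below computes two versions for the seven values from the EXCEL summary statistics slide: a plain moment-based excess kurtosis (fourth power over squared second power, minus 3) and the small-sample adjusted version documented for EXCEL's `KURT`. Both function names are hypothetical.

```python
from statistics import mean, stdev

def moment_kurtosis(values):
    """Excess kurtosis from central moments: m4 / m2^2 - 3."""
    n = len(values)
    x_bar = mean(values)
    m2 = sum((x - x_bar) ** 2 for x in values) / n
    m4 = sum((x - x_bar) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

def excel_style_kurt(values):
    """Adjusted sample excess kurtosis, as documented for EXCEL's KURT."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)
    fourths = sum(((x - x_bar) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourths - correction

data = [2.8, 3.2, 13.9, 14.1, 25.3, 45.8, -17.5]
print(round(moment_kurtosis(data), 4))
print(round(excel_style_kurt(data), 4))      # compare with the EXCEL answer 0.8924
```

With only seven values the two versions can disagree noticeably, even in sign, so always note which formula your software uses.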
Slide 4- 103
*Other summary measures:
Coefficient of Variation
• You may see this in upper level textbooks.
• The “coefficient of variation” is the standard deviation divided by the mean.
• For the most recent example, CV = 0.06584.
• This is normally expressed as a percent, i.e. CV = 6.584%.
• Notice that the CV is “unitless”.
• This is an advantage since it allows us to compare different populations. We will see this a lot in the course.
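A minimal sketch of the CV calculation, assuming "the most recent example" refers to the nine numbers used in the calculator and StatCrunch slides (mean 68.1111, standard deviation 4.4845):

```python
from statistics import mean, stdev

data = [62, 63, 65, 66, 68, 70, 71, 73, 75]

cv = stdev(data) / mean(data)   # standard deviation divided by the mean
print(f"{cv:.5f}")              # 0.06584
print(f"{cv:.3%}")              # 6.584%
```

Because the units cancel in the division, the result would be the same whether the data were recorded in inches or centimeters.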
Slide 4- 104
Thinking About Variation
• Since Statistics is about variation, spread is an important fundamental concept of Statistics.
• Measures of spread help us talk about what we don’t know.
• When the data values are tightly clustered around the center of the distribution, the IQR and standard deviation will be small.
• When the data values are scattered far from the center, the IQR and standard deviation will be large.
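A small illustration, using two made-up data sets (not from the text) that share the same center of 10 but differ in spread. The `iqr` helper is hypothetical and uses one of several common quartile conventions.

```python
from statistics import quantiles, stdev

def iqr(values):
    """Interquartile range, Q3 - Q1 (inclusive interpolation assumed)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

tight = [9, 10, 10, 10, 11]        # hypothetical: clustered around 10
scattered = [2, 6, 10, 14, 18]     # hypothetical: same center, spread out

for name, values in (("tight", tight), ("scattered", scattered)):
    print(f"{name:>9}: IQR = {iqr(values):.1f}, SD = {stdev(values):.2f}")
```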
Slide 4- 105
Tell - Draw a Picture
• When telling about quantitative variables, start by making a histogram or stem-and-leaf display and discuss the shape of the distribution.
Slide 4- 106
Tell - Shape, Center, and Spread
• Next, always report the shape of its distribution, along with a center and a spread.
  - If the shape is skewed, report the median and IQR.
  - Note:
      Skewed to the right: Mean is larger than median.
      Skewed to the left: Median is larger.
  - If the shape is symmetric, report the mean and standard deviation and possibly the median and IQR as well.
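The note about skewness pulling the mean away from the median can be seen with two small made-up data sets (hypothetical values chosen only for illustration):

```python
from statistics import mean, median

right_skewed = [1, 2, 2, 3, 3, 4, 20]       # hypothetical: long tail to the right
left_skewed = [1, 17, 18, 18, 19, 19, 20]   # hypothetical: long tail to the left

for name, values in (("right", right_skewed), ("left", left_skewed)):
    print(f"skewed {name}: mean = {mean(values)}, median = {median(values)}")
```

In each case the mean is dragged toward the tail, while the median stays with the bulk of the data. That is why the median (with the IQR) is the safer summary for skewed distributions.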
Slide 4- 107
Tell - What About Unusual Features?
• If there are multiple modes, try to understand why. If you identify a reason for the separate modes, it may be good to split the data into two groups.
• If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.
Slide 4- 108
What Can Go Wrong?
• Don’t make a histogram of a categorical variable; bar charts or pie charts should be used for categorical data.
• Don’t look for shape, center, and spread of a bar chart.
Slide 4- 109
What Can Go Wrong? (cont.)
• Don’t use bars in every display; save them for histograms and bar charts.
• Below is a badly drawn plot and the proper histogram for the number of juvenile bald eagles sighted in a collection of weeks:
Slide 4- 110
What Can Go Wrong? (cont.)
• Choose a bin width appropriate to the data.
  - Changing the bin width changes the appearance of the histogram:
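To see the effect of bin width without drawing anything, the sketch below counts how many values fall in each bin for three different widths. It reuses the nine numbers from the calculator example purely for illustration; the helper `bin_counts` is hypothetical.

```python
from collections import Counter

def bin_counts(values, width):
    """Count values per bin; each bin is labeled by its left edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

data = [62, 63, 65, 66, 68, 70, 71, 73, 75]

for width in (2, 5, 10):
    print(f"bin width {width:>2}: {bin_counts(data, width)}")
```

Narrow bins produce many short bars, while wide bins lump everything into a couple of tall bars. Neither extreme shows the shape well.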
Slide 4- 111
What Can Go Wrong? (cont.)
• Don’t forget to do a reality check – don’t let the calculator do the thinking for you.
• Don’t forget to sort the values before finding the median or percentiles.
• Don’t worry about small differences when using different methods.
• Don’t compute numerical summaries of a categorical variable.
Slide 4- 112
Source: http://www.causeweb.org/resources/fun/
Slide 4- 113
What Can Go Wrong? (cont.)
• Don’t report too many decimal places.
• Don’t round in the middle of a calculation.
• Watch out for multiple modes.
• Beware of outliers.
• Make a picture!!!
• Check for typos. (An example follows.)
Slide 4- 114
An example that we’ll use in Chapter 6
• Here are the prices per gallon for regular gas as reported by 12 gas stations in and around HCC’s Zip code 21044 on the morning of August 4, 2012.
• Source: http://www.marylandgasprices.com

3.459   3.539   3.539   3.539   3.559   3.629
3.649   3.699   3.699   3.699   3.699   3.699
Slide 4- 115
Summary statistics and histogram
Mean: $3.6173 / gallon
Standard Deviation: $0.086, or about 8.6 cents a gallon
Slide 4- 116
Let’s pretend we’re seeing this
for the first time
• Here are the prices per gallon for regular-grade gasoline as reported by twelve Columbia area gas stations on the morning of August 4, 2012.
• Source: http://www.marylandgasprices.com

3459 ←  3.539   3.539   3.539   3.559   3.629
3.649   3.699   3.699   3.699   3.699   3.699
Slide 4- 117
OOPS!
Mean: $291.58 / gallon!
Standard Deviation: $997.48 / gallon
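The damage done by the single typo can be reproduced with a short Python sketch. The twelve prices are those listed on the earlier gas-price slide; the only change in the second list is the first entry, typed as 3459 instead of 3.459.

```python
from statistics import mean, median, stdev

correct = [3.459, 3.539, 3.539, 3.539, 3.559, 3.629,
           3.649, 3.699, 3.699, 3.699, 3.699, 3.699]
with_typo = [3459.0] + correct[1:]          # decimal point left out of the first price

for label, prices in (("correct", correct), ("with typo", with_typo)):
    print(f"{label:>9}: mean = {mean(prices):.4f}, "
          f"SD = {stdev(prices):.3f}, median = {median(prices):.3f}")
```

The mean and standard deviation are wrecked by the one bad entry, while the median barely moves. A quick look at the summary (or a picture) is usually enough to catch this kind of error.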
Slide 4- 118
Results of our mistake
• Mean and standard deviation – HUGE effect!
• Histogram – totally obscured the real data; not just the typo but everything else!
• Five-number summary – moderate effect on everything but the maximum (huge effect there).
• The effect on the five-number summary depends on where you made your mistake.
• Two courses of action:
  - Best action: Correct your mistake if you know the correct entry.
  - If you don’t know what the entry should be, remove the 3459 as an outlier and document what happened.
Slide 4- 119
What have we learned?
• We’ve learned how to make a picture for quantitative data to help us see the story the data have to Tell.
• We can display the distribution of quantitative data with a histogram, stem-and-leaf display, or dotplot.
• We’ve learned how to summarize distributions of quantitative variables numerically.
  - Measures of center for a distribution include the median and mean.
  - Measures of spread include the range, IQR, and standard deviation.
  - Use the median and IQR when the distribution is skewed. Use the mean and standard deviation if the distribution is symmetric.
Slide 4- 120
What have we learned? (cont.)
• We’ve learned to Think about the type of variable we are summarizing.
  - All methods of this chapter assume the data are quantitative.
  - The Quantitative Data Condition serves as a check that the data are, in fact, quantitative.
Slide 4- 121
Overview – Organization of the chapter
• Pictorial Display
  - Histogram
  - Stem-and-Leaf Plot
  - Dotplot
• Numerical summary
  - Shape of data
  - Center
  - Spread
• First set of measures
  - Median
  - Range
  - Quartiles
  - IQR
• Second set
  - Mean
  - Variance
  - Standard Deviation
Slide 4- 122
Division of Mathematics, HCC
Course Objectives for Chapter 4
After studying this chapter, the student will be able to:
7. Appropriately display quantitative data using a frequency distribution, histogram, relative frequency histogram, stem-and-leaf display, dotplot.
8. Describe the general shape of a distribution in terms of shape, center and spread.
9. Describe any anomalies or extraordinary features revealed by the display of a variable.
10. Compute and apply the concepts of mean and median to a set of data.
11. Compute and apply the concept of the standard deviation and IQR to a set of data.
12. Select a suitable measure of center/spread for a variable based on information about its distribution.
13. Create a five-number summary of a variable.