Chapter 5
Describing
Distributions
Numerically
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Finding the Center: The Median


When we think of a typical value, we usually look
for the center of the distribution.
For a unimodal, symmetric distribution, it’s easy
to find the center—it’s just the center of
symmetry.
Finding the Center: The Median (cont.)


As a measure of center, the midrange (the
average of the minimum and maximum values) is
very sensitive to skewed distributions and
outliers.
The median is a more reasonable choice for
center than the midrange.
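A minimal sketch of why the midrange is so sensitive, assuming a small batch of made-up values (the numbers are hypothetical, not taken from the text):

```python
from statistics import median

def midrange(values):
    """Average of the minimum and maximum values."""
    return (min(values) + max(values)) / 2

data = [13, 17, 19, 22]          # hypothetical values
with_extreme = data + [47]       # same batch plus one extreme value

print(midrange(data), median(data))                   # 17.5 18.0
print(midrange(with_extreme), median(with_extreme))   # 30.0 19
```

One extreme value moves the midrange from 17.5 to 30, while the median moves only from 18 to 19.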
Finding the Center: The Median (cont.)

The median is the value with exactly half the data values below it and half above it.
• It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas.
• It has the same units as the data.
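The definition above can be sketched in a few lines. Averaging the two middle values when the count is even is a common convention and is an assumption here, since the slide does not state it:

```python
def median(values):
    """Middle value of the ordered data (assumed: average the two middle values if n is even)."""
    ordered = sorted(values)      # the data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([5, 1, 3]))      # 3
print(median([4, 1, 3, 2]))   # 2.5
```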
Spread: Home on the Range



Always report a measure of spread along with a measure
of center when describing a distribution numerically.
The range of the data is the difference between the
maximum and minimum values:
Range = max – min
A disadvantage of the range is that a single extreme value
can make it very large and, thus, not representative of the
data overall.
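A short sketch of the range and its weakness, using hypothetical values chosen only for illustration:

```python
def data_range(values):
    """Range = max - min."""
    return max(values) - min(values)

data = [3, 5, 6, 8]              # hypothetical values
print(data_range(data))          # 5
print(data_range(data + [40]))   # 37, one extreme value dominates the range
```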
Spread: The Interquartile Range


The interquartile range (IQR) lets us ignore
extreme data values and concentrate on the
middle of the data.
To find the IQR, we first need to know what
quartiles are…
Spread: The Interquartile Range (cont.)


Quartiles divide the data into four equal sections.
• The lower quartile is the median of the half of the data below the median.
• The upper quartile is the median of the half of the data above the median.
The difference between the quartiles is the IQR,
so
IQR = upper quartile – lower quartile
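The definitions above might be sketched as follows. When the number of values is odd, this sketch leaves the median out of both halves; that is one common convention and an assumption here, and other conventions give slightly different quartiles. The data values are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Lower and upper quartiles as medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]        # values below the median
    upper_half = ordered[n - half:]    # values above the median
    return median(lower_half), median(upper_half)

def iqr(values):
    """IQR = upper quartile - lower quartile."""
    q1, q3 = quartiles(values)
    return q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical values
print(quartiles(data))            # (2.5, 6.5)
print(iqr(data))                  # 4.0
```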
Spread: The Interquartile Range (cont.)


The lower and upper quartiles are the 25th and 75th
percentiles of the data, so…
The IQR contains the
middle 50% of the
values of the distribution,
as shown in Figure 5.3
from the text:
The Five-Number Summary

The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum).
• Example: The five-number summary for the ages at death for rock concert goers who died from being crushed is

Statistic   Age (years)
Max         47
Q3          22
Median      19
Q1          17
Min         13
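A five-number summary can be computed along these lines. This is a sketch under the assumption that quartiles are medians of the two halves, with the median excluded when the count is odd; the data values are hypothetical, since the slides give only the summary for the concert data and not the raw ages.

```python
from statistics import median

def five_number_summary(values):
    """Return min, Q1, median, Q3, max (quartiles by the median-of-halves convention)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return {
        "min": ordered[0],
        "Q1": median(ordered[:half]),
        "median": median(ordered),
        "Q3": median(ordered[n - half:]),
        "max": ordered[-1],
    }

print(five_number_summary([2, 4, 4, 5, 7, 9, 10, 12]))   # hypothetical values
# {'min': 2, 'Q1': 4.0, 'median': 6.0, 'Q3': 9.5, 'max': 12}
```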
Rock Concert Deaths: Making Boxplots


A boxplot is a graphical display of the five-number
summary.
Boxplots are particularly useful when comparing
groups.
Constructing Boxplots
1. Draw a single vertical axis spanning the range of the data. Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box.
Constructing Boxplots (cont.)
2. Erect “fences” around the main part of the data.
• The upper fence is 1.5 IQRs above the upper quartile.
• The lower fence is 1.5 IQRs below the lower quartile.
• Note: the fences only help with constructing the boxplot and should not appear in the final display.
Constructing Boxplots (cont.)
3. Use the fences to grow “whiskers.”
• Draw lines from the ends of the box up and down to the most extreme data values found within the fences.
• If a data value falls outside one of the fences, we do not connect it with a whisker.
Constructing Boxplots (cont.)
4. Add the outliers by displaying any data values beyond the fences with special symbols.
• We often use a different symbol for “far outliers” that are farther than 3 IQRs from the quartiles.
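The fence and outlier rules can be sketched with the quartiles from the rock concert summary (Q1 = 17, Q3 = 22). The function names are illustrative only:

```python
def fences(q1, q3, k=1.5):
    """Lower and upper fences, k IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def classify(value, q1, q3):
    """Label a value relative to the 1.5-IQR fences and the 3-IQR 'far outlier' limits."""
    lower, upper = fences(q1, q3, 1.5)
    far_lower, far_upper = fences(q1, q3, 3.0)
    if value < far_lower or value > far_upper:
        return "far outlier"
    if value < lower or value > upper:
        return "outlier"
    return "inside fences"

q1, q3 = 17, 22                  # quartiles from the concert example
print(fences(q1, q3))            # (9.5, 29.5)
print(classify(47, q1, q3))      # far outlier
print(classify(13, q1, q3))      # inside fences
```

With an IQR of 5, the maximum age of 47 lies more than 3 IQRs above Q3, so it would be drawn with the far-outlier symbol, while the minimum of 13 stays inside the lower fence.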
Rock Concert Deaths: Making Boxplots
(cont.)

Compare the histogram and boxplot for rock
concert deaths:

How does each display represent the
distribution?
Box Plots on Calculator

Performance of fourth grade boys and girls on
agility test. The numbers represent the number of
lines they can clear in 30 seconds.

Boys: 22,17,18,29,22,22,23,24,23,17,21
Girls: 25,20,12,19,28,24,22,21,25,26,25,16,27,22


How do these fourth graders compare in terms
of agility?
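As a cross-check on the calculator, here is a sketch that computes the five-number summary and flags outliers for both groups. It assumes quartiles are medians of the two halves with the median excluded when the count is odd; a calculator or package using another convention may report slightly different quartiles.

```python
from statistics import median

boys = [22, 17, 18, 29, 22, 22, 23, 24, 23, 17, 21]
girls = [25, 20, 12, 19, 28, 24, 22, 21, 25, 26, 25, 16, 27, 22]

def five_number_summary(values):
    """(min, Q1, median, Q3, max) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return (ordered[0], median(ordered[:half]), median(ordered),
            median(ordered[n - half:]), ordered[-1])

def outliers(values):
    """Values beyond the 1.5-IQR fences."""
    _, q1, _, q3, _ = five_number_summary(values)
    step = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - step or v > q3 + step]

for name, group in (("Boys", boys), ("Girls", girls)):
    print(name, five_number_summary(group), outliers(group))
# Boys (17, 18, 22, 23, 29) []
# Girls (12, 20, 23.0, 25, 28) [12]
```

Under this convention both groups have an IQR of 5, the girls' median is slightly higher, and the girls' score of 12 falls just below the lower fence of 12.5.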
Comparing Groups With Boxplots

The following set of boxplots compares the
effectiveness of various coffee containers:

What does this graphical display tell you?
Summarizing Symmetric Distributions



Medians do a good job of identifying the center of
skewed distributions.
When we have symmetric data, the mean is a
good measure of center.
We find the mean by adding up all of the data
values and dividing by n, the number of data
values we have.
Summarizing Symmetric Distributions (cont.)

The distribution of pulse rates for 52 adults is
generally symmetric, with a mean of 72.7 beats
per minute (bpm) and a median of 73 bpm:
The Formula for Averaging

The formula for the mean is given by
$$\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$$

The formula says that to find the mean, we add
up the numbers and divide by n.
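In code, the formula is a one-liner. The pulse values below are hypothetical and are not the 52 values from the text:

```python
def mean(values):
    """Add up the data values and divide by n."""
    return sum(values) / len(values)

pulse = [68, 72, 75, 77]   # hypothetical values, in bpm
print(mean(pulse))         # 73.0
```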
Mean or Median?

Regardless of the
shape of the
distribution, the
mean is the point
at which a
histogram of the
data would
balance:
Mean or Median? (cont.)


In symmetric distributions, the mean and median
are approximately the same in value, so either
measure of center may be used.
For skewed data, though, it’s better to report the
median than the mean as a measure of center.
What About Spread? The Standard Deviation


A more powerful measure of spread than the IQR
is the standard deviation, which takes into
account how far each data value is from the
mean.
A deviation is the difference between a data value and the mean, so values below the mean have negative deviations.
• Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the squared deviations.
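A quick numerical check of why the raw deviations cannot be averaged directly, using hypothetical values:

```python
values = [1, 2, 6]                          # hypothetical values
y_bar = sum(values) / len(values)           # mean is 3.0
deviations = [y - y_bar for y in values]    # [-2.0, -1.0, 3.0]

print(sum(deviations))                      # 0.0, positives and negatives cancel
print(sum(d ** 2 for d in deviations))      # 14.0, squaring removes the cancellation
```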
What About Spread? The Standard Deviation
(cont.)

The variance, notated by $s^2$, is found by summing
the squared deviations and (almost) averaging
them:

$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$

The variance will play a role later in our study, but
it is problematic as a measure of spread—it is
measured in squared units!
What About Spread? The Standard Deviation
(cont.)

The standard deviation, s, is just the square root
of the variance and is measured in the same units
as the original data.
yy

s
2
n
1
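The two formulas translate directly into code. This is a sketch with illustrative function names and hypothetical data; Python's standard library offers the same n - 1 calculations as `statistics.variance` and `statistics.stdev`.

```python
import math

def variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

def standard_deviation(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values))

data = [1, 3, 5]                    # hypothetical values
print(variance(data))               # 4.0
print(standard_deviation(data))     # 2.0
```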
Practicing Standard Deviation

Suppose the batch of values is 1,2,3,4,5

Find the 5 number summary

Find the standard deviation
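
One way to work the standard deviation for this batch, applying the mean, variance, and standard deviation formulas step by step:

```latex
\bar{y} = \frac{1 + 2 + 3 + 4 + 5}{5} = 3

\sum (y - \bar{y})^2 = (-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2 = 10

s^2 = \frac{10}{5 - 1} = 2.5, \qquad s = \sqrt{2.5} \approx 1.58
```

For the five-number summary, the min is 1, the median is 3, and the max is 5. The quartiles depend on the convention: 1.5 and 4.5 if the median is left out of each half, or 2 and 4 if it is included in both.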
Thinking About Variation




Since Statistics is about variation, spread is an
important fundamental concept of Statistics.
Measures of spread help us talk about what we
don’t know.
When the data values are tightly clustered around
the center of the distribution, the IQR and
standard deviation will be small.
When the data values are scattered far from the
center, the IQR and standard deviation will be
large.
Shape, Center, and Spread

When telling about a quantitative variable, always report the shape of its distribution, along with a center and a spread.
• If the shape is skewed, report the median and IQR.
• If the shape is symmetric, report the mean and standard deviation and possibly the median and IQR as well.
What About Outliers?


If there are any clear outliers and you are
reporting the mean and standard deviation, report
them with the outliers present and with the
outliers removed. The differences may be quite
revealing.
Note: The median and IQR are not likely to be
affected by the outliers.
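A sketch of this comparison on hypothetical data, reporting each summary with the outlier removed and with it present. The IQR helper assumes the median-of-halves convention for quartiles:

```python
from statistics import mean, median, stdev

def iqr(values):
    """Q3 - Q1, with quartiles taken as medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[len(ordered) - half:]) - median(ordered[:half])

clean = [17, 18, 19, 20, 21, 22, 23]   # hypothetical values
with_outlier = clean + [90]            # same values plus one clear outlier

for label, values in (("outlier removed", clean), ("outlier present", with_outlier)):
    print(f"{label}: mean={mean(values):.2f} sd={stdev(values):.2f} "
          f"median={median(values)} IQR={iqr(values)}")
```

Here the mean moves from 20 to 28.75 and the standard deviation from about 2.2 to about 24.8, while the median moves only from 20 to 20.5 and the IQR stays at 4.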
What Can Go Wrong?




Don’t forget to do a reality check—don’t let
technology do your thinking for you.
Don’t forget to sort the values before finding the
median or percentiles.
Don’t compute numerical summaries of a
categorical variable.
Watch out for multiple modes—multiple modes
might indicate multiple groups in your data.
What Can Go Wrong? (cont.)



Be aware of slightly different methods—different
statistics packages and calculators may give you
different answers for the same data.
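Quartiles are a typical place where methods differ. The sketch below compares the median-of-halves rule with the two interpolation methods in Python's `statistics.quantiles` (available in Python 3.8 and later), using hypothetical data:

```python
from statistics import median, quantiles

data = [2, 4, 6, 8, 10]   # hypothetical values

def halves_quartiles(values):
    """Quartiles as medians of the two halves, median excluded when n is odd."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[len(ordered) - half:])

def library_quartiles(values, method):
    """Quartiles from statistics.quantiles with the given interpolation method."""
    q1, _, q3 = quantiles(values, n=4, method=method)
    return q1, q3

print(halves_quartiles(data))                 # (3.0, 9.0)
print(library_quartiles(data, "exclusive"))   # (3.0, 9.0)
print(library_quartiles(data, "inclusive"))   # (4.0, 8.0)
```

The same five values give an IQR of 6 under two of the methods and 4 under the third, so it helps to know which rule a tool uses before comparing answers.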
Beware of outliers.
Make a picture (make a picture, make a picture).
What Can Go Wrong? (cont.)

Be careful when comparing groups that have very different spreads.
• Consider these side-by-side boxplots of cotinine levels:
*Re-expressing to Equalize the
Spread of Groups

Here are the
side-by-side boxplots
of the log(cotinine)
values:
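
To see how taking logs can equalize spread, here is a sketch on two hypothetical groups (not the cotinine data, which the slides show only as a figure). The groups differ by a factor of 100 in scale; base-10 logs are an arbitrary choice for the illustration.

```python
import math
from statistics import median

def iqr(values):
    """Q3 - Q1, with quartiles taken as medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[len(ordered) - half:]) - median(ordered[:half])

group_a = [1, 2, 4, 8, 16, 32, 64, 128]                    # hypothetical values
group_b = [100, 200, 400, 800, 1600, 3200, 6400, 12800]    # hypothetical values

for name, group in (("A", group_a), ("B", group_b)):
    logged = [math.log10(v) for v in group]
    print(name, "raw IQR:", iqr(group), "log IQR:", round(iqr(logged), 3))
```

The raw IQRs are 45 and 4500, but after re-expression both groups have a log IQR of about 1.204, so their boxplots become directly comparable.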
What have we learned?


We can now summarize distributions of quantitative variables numerically.
• The 5-number summary displays the min, Q1, median, Q3, and max.
• Measures of center include the mean and median.
• Measures of spread include the range, IQR, and standard deviation.
We know which measures to use for symmetric
distributions and skewed distributions.
What have we learned? (cont.)

We can also display distributions with boxplots.
• While histograms better show the shape of the distribution, boxplots reveal the center, middle 50%, and any outliers in the distribution.
• Boxplots are useful for comparing groups.
Want to know a shortcut to get your values?

On the calculator, press STAT, choose CALC, select 1-Var Stats, and press ENTER.
Voilà!
The IQR is not given; to find it, take Q3 - Q1.