Center: Finding the Median; Spread: Home on the Range

Transcript
Center: Finding the Median
• When we think of a typical value, we usually look
for the center of the distribution.
• For a unimodal, symmetric distribution, it’s easy
to find the center: it’s just the center of
symmetry.
• We could average the minimum and maximum
data values (called the midrange) as a measure
of center, but the midrange is very sensitive to
skewed distributions and outliers.
Center: Finding the Median (cont.)
• A more reasonable choice for center than
the midrange is the value with exactly half
the data values below it and half above it.
This particular value is called the median.
• The median is the middle data value (once
the data values have been ordered) that
divides the histogram into two equal areas.
• The median has the same units as the
data.
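As a rough illustration of why the midrange is a fragile measure of center, the sketch below compares it with the median before and after one extreme value is added. The data values and the helper name `midrange` are invented for this example and do not come from the slides.

```python
from statistics import median

def midrange(values):
    """Average of the minimum and maximum data values."""
    return (min(values) + max(values)) / 2

# Hypothetical data, invented for illustration only.
data = [3, 5, 6, 7, 9]
with_outlier = data + [60]

# Without the extreme value the two measures agree.
print(midrange(data), median(data))                  # 6.0 6
# One extreme value drags the midrange far from the bulk of the data,
# while the median barely moves.
print(midrange(with_outlier), median(with_outlier))  # 31.5 6.5
```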
Copyright © 2004 Pearson Education, Inc.
Spread: Home on the Range
• When describing a distribution numerically, we
always report a measure of its spread along with
its center.
• The range of the data is the difference between
the maximum and minimum values: Range =
max – min.
• A disadvantage of the range is that a single
extreme value can make it very large and, thus,
not representative of the data overall.
Median
The sample median is the $\frac{n+1}{2}$-th largest observation.
If $\frac{n+1}{2}$ is not a whole number, the median is the
average of the two observations on either side.
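The position rule for the sample median can be sketched in a few lines. The function name `sample_median` and the data are assumptions made for illustration; positions are counted from 1, as in the rule, and converted to Python's 0-based indexing.

```python
def sample_median(values):
    """Median via the (n + 1) / 2 position rule (positions counted from 1)."""
    ordered = sorted(values)  # the rule only applies to ordered data
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    position = (n + 1) / 2
    if position.is_integer():
        # Whole-number position: take that observation directly.
        return ordered[int(position) - 1]
    # Otherwise average the two observations on either side of the position.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

# Hypothetical data, invented for illustration only.
print(sample_median([7, 1, 5, 3, 9]))  # 5    (n = 5, position 3)
print(sample_median([7, 1, 5, 3]))     # 4.0  (n = 4, position 2.5)
```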
The Interquartile Range
• The interquartile range (IQR) allows us to
ignore extreme data values and
concentrate on the middle of the data.
• To find the IQR, we first need to know
what quartiles are…
Quartiles
Quartiles split the data into quarters
• Lower quartile (Q1) divides bottom half of data
into two
– median of observations below the median
• Upper quartile (Q3) divides upper half of data
into two
– median of observations above the median
• The difference between the quartiles is the
IQR, so
IQR = upper quartile – lower quartile.
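A minimal sketch of the quartile definition above, taking Q1 and Q3 as the medians of the observations below and above the median. It assumes that for an odd number of observations the middle value is left out of both halves. The helper name `quartiles` and the data are invented for illustration. The last two lines contrast the IQR with the range when one extreme value is added.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves of the data."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]      # observations below the median
    upper = ordered[n - half:]  # observations above the median
    return median(lower), median(upper)

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12]
q1, q3 = quartiles(data)
print(q3 - q1, max(data) - min(data))              # IQR 6.0, range 10

extreme = data + [100]
q1e, q3e = quartiles(extreme)
print(q3e - q1e, max(extreme) - min(extreme))      # IQR 7, range 98
```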
The Interquartile Range (cont.)
• The lower and upper quartiles are the 25th and
75th percentiles of the data, so…
• The IQR contains the middle 50% of the values
of the distribution, as shown in Figure 5.3 from
the text:
The Five-Number Summary
• Five number summary
{ Min, Q1, Median, Q3, Max }
• Example:
Boxplots
• A boxplot is a graphical display of the five-number
summary. The steps involved in constructing a boxplot
can also be found on pages 60-61 of the text.
• Boxplots are particularly useful when
comparing groups.
Boxplot
[Figure 2.4.4: Construction of a box plot. The box marks Q1, Med, and Q3 against the data scale; each whisker reaches at most 1.5 IQR beyond the box and is pulled back until it hits an observation. From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.]
Comparing Groups With Boxplots
• The following set of boxplots compares the
effectiveness of various coffee containers:
• What does this graphical display tell you?
Construction of Boxplot
Data: breaking strength of wire in kilograms
220 214 222 218 223 210 223 210 227 225 212
Stem-and-leaf display (Leaf Unit = 1.0 kg):
   4   21 | 0024
   5   21 | 8
  (4)  22 | 0233
   2   22 | 57
• Find Median
• Find Quartiles: Q1 = , Q3 =
• Calculate Interquartile range: Q3 - Q1 =
• Calculate whisker length: 1.5 x (Q3 - Q1) =
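One way to check the boxplot construction steps for the wire breaking-strength data above is sketched below. It assumes the quartiles are the medians of the observations below and above the median, and that each whisker is pulled back to the most extreme observation within 1.5 IQR of the box. The variable names are illustrative, not taken from the slides.

```python
from statistics import median

# Breaking strength of wire in kilograms, as listed on the slide.
wire = [220, 214, 222, 218, 223, 210, 223, 210, 227, 225, 212]

ordered = sorted(wire)
n = len(ordered)
half = n // 2

med = median(ordered)
q1 = median(ordered[:half])      # median of observations below the median
q3 = median(ordered[n - half:])  # median of observations above the median
iqr = q3 - q1
reach = 1.5 * iqr                # maximum whisker length

# Pull each whisker back until it hits an observation.
low_whisker = min(x for x in ordered if x >= q1 - reach)
high_whisker = max(x for x in ordered if x <= q3 + reach)
# Anything beyond the whiskers would be plotted as a separate point.
outside = [x for x in ordered if x < low_whisker or x > high_whisker]

print(med, q1, q3, iqr, reach)             # 220 212 223 11 16.5
print(low_whisker, high_whisker, outside)  # 210 227 []
```

Other quartile conventions can give slightly different values for the same data, so a calculator or statistics package may not match these numbers exactly.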
Summarizing Symmetric Distributions
• Medians do a good job of identifying the
center of skewed distributions. When we
have symmetric data, the mean is a good
measure of center.
• We find the mean by adding up all of the
data values and dividing by n, the number
of data values we have.
Sample Mean – average
• The sample mean is denoted by $\bar{x}$.
$$\bar{x} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$$
[Figure 2.4.1: Mechanical construction representing a dot plot, with the mean marked: (a) shows a balanced rod while (b) and (c) show unbalanced rods.]
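The sample mean formula and the balanced-rod picture can be connected with a short sketch: the deviations from the mean cancel out, which is what makes the mean the balance point. The function name `sample_mean` and the data are invented for illustration.

```python
def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(values) / len(values)

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 6, 9]
x_bar = sample_mean(data)

# Distances to the left of the mean (negative) cancel those to the right.
deviations = [x - x_bar for x in data]

print(x_bar)            # 5.0
print(sum(deviations))  # 0.0
```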
Mean or Median?
• Regardless of the shape of the distribution, the
mean is the point at which a histogram of the
data would balance.
• In symmetric distributions, the mean and median
are approximately the same in value, so either
measure of center may be used.
• For skewed data, though, it’s better to report the
median than the mean as a measure of center.
Relationship between mean and median
[Figure 2.4.2: The mean and the median. (a) Data symmetric about P, where Med = $\bar{x}$. (b) Two largest points moved to the right, so Med and $\bar{x}$ no longer coincide. Grey disks in (b) are the "ghosts" of the points that were moved. From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.]
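The effect shown in Figure 2.4.2 can be imitated numerically. In this sketch the data are invented: a symmetric set, and a copy in which the two largest points are moved to the right. The mean follows the moved points while the median stays where it was.

```python
from statistics import mean, median

# Hypothetical data, invented for illustration only.
symmetric = [1, 2, 3, 4, 5, 6, 7]  # symmetric about 4
shifted = [1, 2, 3, 4, 5, 13, 14]  # two largest points moved right by 7

# For symmetric data the mean and the median coincide (both are 4).
print(mean(symmetric), median(symmetric))

# After the move the mean rises to 6, but the median is still 4.
print(mean(shifted), median(shifted))
```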
What About Spread?
Variance
• A more powerful measure of spread than
the IQR is the standard deviation, which
takes into account how far each data value
is from the mean.
• A deviation is the distance that a data
value is from the mean. Since adding all
deviations together would total zero, we
square each deviation and find an average
of sorts for the deviations.
• The sample variance, denoted by $s^2$, is
found using the formula
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
Sample Standard Deviation
$$s_x = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
• In same units as data
– So preferable to sample variance
• Equals zero only if all observations identical
• Sensitive to outliers (extreme observations)
• Button on calculator – learn to use it!
– Much simpler than applying formula
Shape, Center, and Spread
• When telling about a quantitative variable,
always report the shape of its distribution,
along with a center and a spread.
• If the shape is skewed, report the median
and IQR.
• If the shape is symmetric, report the mean
and standard deviation and possibly the
median and IQR as well.
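When the mean and standard deviation are the summaries being reported, the n − 1 formula for the sample variance given earlier can be applied by hand and checked against Python's built-in `statistics.variance` and `statistics.stdev`, which use the same divisor. The helper name `sample_variance` and the data are invented for illustration.

```python
from math import isclose, sqrt
from statistics import mean, stdev, variance

def sample_variance(values):
    """s^2: squared deviations from the mean, summed and divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 6, 9]
s2 = sample_variance(data)
s = sqrt(s2)  # the standard deviation is back in the units of the data

print(s2)  # 7.0
print(isclose(s2, variance(data)), isclose(s, stdev(data)))  # True True

# The spread is zero when all observations are identical.
print(sample_variance([3, 3, 3]))  # 0.0
```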
What About Outliers?
• If there are any clear outliers and you are
reporting the mean and standard
deviation, report them with the outliers
present and with the outliers removed. The
differences may be quite revealing.
• Note: The median and IQR are not likely to
be affected by the outliers.
What Can Go Wrong?
• Do a reality check—don’t let technology do
your thinking for you.
• Don’t forget to sort the values before
finding the median or percentiles.
• Don’t compute numerical summaries of a
categorical variable.
• Watch out for multiple modes—multiple
modes might indicate multiple groups in
your data.
What Can Go Wrong? (cont.)
• Be aware of slightly different methods—
different statistics packages and
calculators may give you different answers
for the same data.
• Beware of outliers.
• Make a picture (make a picture, make a
picture).
• Be careful when comparing groups that
have very different spreads.
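The warning about slightly different methods can be seen with the wire breaking-strength data from earlier in the transcript. Python's standard `statistics.quantiles` function offers two interpolation methods, and as a sketch of the point they already disagree on the lower quartile here. Other packages and calculators use further conventions, so this shows only two of several possibilities.

```python
from statistics import quantiles

# Breaking strength of wire in kilograms, from the boxplot construction slide.
wire = [220, 214, 222, 218, 223, 210, 223, 210, 227, 225, 212]

# n=4 asks for the three cut points that split the data into quarters.
exclusive = quantiles(wire, n=4, method="exclusive")
inclusive = quantiles(wire, n=4, method="inclusive")

print(exclusive)  # [212.0, 220.0, 223.0]
print(inclusive)  # [213.0, 220.0, 223.0]
```

Both results are defensible, which is why it helps to state the method used when reporting quartiles.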
So What Do We Know?
• We describe distributions in terms of shape,
center, and spread.
• For symmetric distributions, it’s safe to use the
mean and standard deviation; for skewed
distributions, it’s better to use the median and
interquartile range.
• Always make a picture—don’t make judgments
about which measures of center and spread to
use by just looking at the data.