Download Outliers - Lyndhurst Schools

Chapter 6: Interpreting the Measures of Variability
• To date we have discussed three measures of
central tendency (mean, median, and mode)
and three measures of variability (range, IQR,
and standard deviation).
• To adequately summarize a set of data we
need one measure of each kind: a measure of
central tendency and a measure of variability.
• Remember that the mean and range may not
be the best measures to use when a
distribution is skewed or contains outliers.
Better options would be the median and IQR.
Five-Number Summary
• The five-number summary of a distribution for
numerical data consists of:
– The smallest observation (minimum)
– The first quartile
– The median
– The third quartile
– The largest observation (maximum)
• These five values are used to draw a boxplot (aka
box-and-whisker plot).
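The five values listed above can be computed with a short Python sketch. It assumes the quartiles are found as the medians of the lower and upper halves of the ordered data (leaving the overall median out of both halves when the count is odd), which agrees with the worked example later in this chapter but is only one of several conventions in use. The function name five_number_summary is a made-up illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the ordered data. Software packages often use other rules and can
    report slightly different quartiles.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]
    # Skip the middle observation when the count is odd.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


# The ten values from the worked example later in this chapter.
summary = five_number_summary([1, 6, 5, 4, 10, 16, 8, 3, 18, 13])
print(summary)  # (1, 4, 7.0, 13, 18)
```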
Boxplots
• Boxplots are a statistical device used to examine
graphically the shape of a distribution, the range
and IQR of a distribution, and the sides with the
greatest concentration of observations.
• Boxplots serve as a statistical tool for summarizing
and comparing numerical data from two or more
samples, in particular where their medians are
located, how spread out they are, and whether
they are symmetric, positively skewed (skewed
right) or negatively skewed (skewed left).
• Boxplots also identify outliers of a distribution.
• The whiskers of the boxplot extend to the lowest
and highest values in the data set that are not
outliers.
• Outliers are marked separately with an asterisk or
circle.
Positively Skewed Data
• When the median is closer to the bottom of
the box, and if the whisker is shorter on the
lower end of the box, then the distribution is
positively skewed (skewed right).
Negatively Skewed Data
• When the median is closer to the top of the
box, and if the whisker is shorter on the upper
end of the box, then the distribution is
negatively skewed (skewed left).
Symmetrical Data
• When the median is in the middle of the box,
and the whiskers are about the same on both
sides of the box, then the distribution is
symmetric.
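The three reading rules above (positively skewed, negatively skewed, symmetric) can be written as a small Python sketch. The function name describe_shape and the 10% tolerance used to decide what counts as "about the same" are assumptions made for illustration; the slides do not give a numeric threshold.

```python
def describe_shape(whisker_low, q1, median, q3, whisker_high, tolerance=0.1):
    """Classify a boxplot's shape from its five plotted values."""
    lower_box = median - q1            # median to bottom of box
    upper_box = q3 - median            # median to top of box
    lower_whisker = q1 - whisker_low
    upper_whisker = whisker_high - q3

    def about_equal(a, b):
        # Assumed rule: lengths within 10% of the longer one count as equal.
        return abs(a - b) <= tolerance * max(a, b, 1e-12)

    if about_equal(lower_box, upper_box) and about_equal(lower_whisker, upper_whisker):
        return "symmetric"
    if lower_box < upper_box and lower_whisker < upper_whisker:
        return "skewed right"
    if lower_box > upper_box and lower_whisker > upper_whisker:
        return "skewed left"
    return "unclear"  # the two clues disagree


print(describe_shape(1, 4, 7, 13, 18))    # skewed right
print(describe_shape(0, 5, 10, 15, 20))   # symmetric
```

When the median position and the whisker lengths point in different directions, the sketch returns "unclear" rather than forcing a label.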
Outliers
• An outlier is any value that falls outside the
pattern of the rest of the data (an unusually high
or unusually low value in a distribution).
• Rule of thumb: an observation is an outlier if it
lies more than 1.5 × IQR below the first quartile
or more than 1.5 × IQR above the third quartile.
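A minimal Python sketch of the 1.5 × IQR rule of thumb follows. The function name find_outliers and the sample values are made up for illustration; the quartiles are passed in so that the sketch does not commit to a particular method of computing them.

```python
def find_outliers(values, q1, q3, k=1.5):
    """Return (lower_cutoff, upper_cutoff, outliers) for the k * IQR rule."""
    iqr = q3 - q1
    lower_cutoff = q1 - k * iqr
    upper_cutoff = q3 + k * iqr
    outliers = [x for x in values if x < lower_cutoff or x > upper_cutoff]
    return lower_cutoff, upper_cutoff, outliers


# Hypothetical data with Q1 = 3.5 and Q3 = 7.5, so IQR = 4.
sample = [2, 3, 4, 5, 6, 7, 8, 30]
print(find_outliers(sample, q1=3.5, q3=7.5))  # (-2.5, 13.5, [30])
```

A value exactly on a cutoff is not flagged, since the rule says "more than" 1.5 × IQR from the quartile.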
Example: On September 20, 2009, the Tennessee
Titans played the Houston Texans. Here are the
rushing yards Titans’ running back Chris Johnson had
on each of his 16 rushing attempts. Determine if
there are any outliers.
Steps to Make a Boxplot
1) Draw a central box (rectangle) from the first
quartile to the third quartile
2) Draw a vertical line to mark the median
3) Draw horizontal lines (whiskers) that extend
from the box out to the smallest and largest
observations that are not outliers
4) If there are any outliers, mark them
separately
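The four steps above can be imitated with a rough text-only drawing in Python. This is a sketch, not a plotting tool: the function name ascii_boxplot, the symbols, and the default width are all invented for illustration. The example call uses the five values 1, 4, 7, 13, 18 from the worked example in this chapter.

```python
def ascii_boxplot(whisker_low, q1, median, q3, whisker_high,
                  outliers=(), width=35):
    """Draw a one-line horizontal boxplot: |---[==M====]---| plus * marks."""
    lo = min([whisker_low, *outliers])
    hi = max([whisker_high, *outliers])
    if hi == lo:
        raise ValueError("all values are identical; nothing to draw")

    def col(x):
        # Map a data value to a character position on the line.
        return round((x - lo) / (hi - lo) * (width - 1))

    row = [" "] * width
    # Step 3: whiskers run between the most extreme non-outlier values.
    for i in range(col(whisker_low), col(whisker_high) + 1):
        row[i] = "-"
    # Step 1: the box runs from the first quartile to the third quartile.
    for i in range(col(q1), col(q3) + 1):
        row[i] = "="
    row[col(whisker_low)] = "|"
    row[col(whisker_high)] = "|"
    row[col(q1)] = "["
    row[col(q3)] = "]"
    # Step 2: mark the median.
    row[col(median)] = "M"
    # Step 4: mark each outlier separately.
    for x in outliers:
        row[col(x)] = "*"
    return "".join(row)


print(ascii_boxplot(1, 4, 7, 13, 18))
# |-----[=====M===========]---------|
```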
Example: Draw a boxplot for the Chris Johnson data.
• From the boxplot, we can see that our
distribution contains no outliers. In addition,
due to the location of the median and the
shorter left whisker, we can also state that our
distribution is slightly skewed right (positively
skewed).
Example: Construct a boxplot for the following set of
data: 1, 6, 5, 4, 10, 16, 8, 3, 18, 13
Order: 1, 3, 4, 5, 6, 8, 10, 13, 16, 18
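Median: (6 + 8)/2 = 7
Q1: 4 (median of 1, 3, 4, 5, 6)
Q3: 13 (median of 8, 10, 13, 16, 18)
IQR: 13 − 4 = 9
Outlier cutoffs: 4 − 1.5(9) = −9.5 and 13 + 1.5(9) = 26.5
Every value lies between −9.5 and 26.5, so there are no
outliers.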
The Empirical Rule
• In symmetric (bell-shaped) distributions, the
values are distributed symmetrically about the
mean in such a way that the values are clustered
most densely around the mean and become rarer
as the distance between the values and the mean
widens.
• This distribution is called a Normal distribution
and the graph is called the Normal curve.
• The empirical rule states what proportion of the
observations falls within one, two, and three
standard deviations of the mean when the
distribution is a Normal distribution.
• The empirical rule is also known as the 68-95-99.7
rule, after the three percentages it describes.
• In general, in a Normal distribution:
– Approximately 68% of the observations will be
within 1 standard deviation of the mean.
– Approximately 95% of the observations will be
within 2 standard deviations of the mean.
– Approximately 99.7% of the observations will be
within 3 standard deviations of the mean.
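The three percentages above are rounded values. As a quick check, the exact proportion of a Normal distribution within k standard deviations of the mean is erf(k/√2), which Python's standard math module can evaluate. The function name share_within is made up for this sketch.

```python
import math


def share_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.2%}")
# within 1 SD: 68.27%
# within 2 SD: 95.45%
# within 3 SD: 99.73%
```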
A Visual Summary of the Empirical Rule
Example: Suppose a sample of scores yields a mean
of 100 and a standard deviation of 15. Assume that
the distribution is Normal.
What percent of scores should fall between 85 and
115? (Hint: Draw a diagram first!)
85 = 100 − 15 and 115 = 100 + 15 are both one standard deviation from
the mean, so approximately 68% of scores fall between 85 and 115.
Let’s try some more with the same distribution…
What percent of scores should fall:
a) Between 70 and 130?
95%
b) Between 55 and 145?
99.7%
c) Between 70 and 115?
13.5% + 68% = 81.5%
d) Greater than 115?
13.5% + 2.5% = 16%
e) Less than 70?
2.5%
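The answers above can be reproduced by splitting the Normal curve into bands that are one standard deviation wide and adding up the bands between the two cutoffs. The Python sketch below assumes both cutoffs land exactly 0, 1, 2, or 3 standard deviations from the mean (or are unbounded). The band sizes follow from the 68-95-99.7 figures: 68/2 = 34, (95 − 68)/2 = 13.5, (99.7 − 95)/2 = 2.35, and (100 − 99.7)/2 = 0.15; the 2.5% used in the answers is 2.35 + 0.15. The names empirical_percent, EDGES, and BAND_PERCENTS are illustrative.

```python
INF = float("inf")
# Band edges in standard deviation units, and the percent in each band.
EDGES = [-INF, -3, -2, -1, 0, 1, 2, 3, INF]
BAND_PERCENTS = [0.15, 2.35, 13.5, 34.0, 34.0, 13.5, 2.35, 0.15]


def empirical_percent(low, high, mean, sd):
    """Approximate percent of a Normal distribution between low and high."""
    z_low = (low - mean) / sd
    z_high = (high - mean) / sd
    if z_low not in EDGES or z_high not in EDGES:
        raise ValueError("cutoffs must be 0 to 3 whole standard deviations from the mean")
    total = sum(
        percent
        for left, right, percent in zip(EDGES, EDGES[1:], BAND_PERCENTS)
        if left >= z_low and right <= z_high
    )
    return round(total, 2)


# Scores with mean 100 and standard deviation 15.
print(empirical_percent(70, 130, 100, 15))    # 95.0
print(empirical_percent(55, 145, 100, 15))    # 99.7
print(empirical_percent(70, 115, 100, 15))    # 81.5
print(empirical_percent(115, INF, 100, 15))   # 16.0
print(empirical_percent(-INF, 70, 100, 15))   # 2.5
```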