Outlier Question Using Std Deviation

Outlier Question – Christian Cruze
8/4/15 8:27 PM

Question that came up in my class today.
We were discussing outliers, using the basic idea that they are values that are unusual
compared with the rest of the herd. I mentioned to my students that we'll learn an algebraic way of
detecting outliers later (with the 1.5 x IQR rule in mind). One of my students mentioned that
in another class (I believe some sort of science class) they used the standard deviation to detect
outliers.
My initial thought was that you shouldn't be able to use standard deviation for this...as I was
under the impression you generally only want to use mean and standard deviation with a set of
data that is approximately normal with no outliers.
Then I thought about how it could be possible...after all, if a data value is more than 2 or 3
standard deviations away from the mean, it could be considered an outlier.
So my question is this...can you use standard deviation for this purpose or no? Is it worthwhile or
useful to do so? Is it totally wrong? Any thoughts?
RE: Outlier Question
8/5/15 3:10 AM as a reply to Christian Cruze.

Christian,
Your student is correct: one can use a rule based on a sample standard deviation calculation to
detect outliers in a random sample. But you raise a good point: if a data value is an outlier, it will
not only be far from the center of the data, but (because of that) will have an effect on the sample
standard deviation as well.
For this reason, the standard deviation-based method compares a potential outlier to the
mean of the remaining data values, standardized to get a t-statistic: the difference is scaled by a
measure of standard deviation that excludes the candidate outlier, so that the s.d. is not affected
by the possible outlier.
There are a lot of issues with this approach that the 1.5 x IQR rule does not suffer from (see the
end of this post), and all of this is BAP (beyond AP Statistics). But in case you (or anyone else) are
interested, here are the details:
Suppose that you have a random sample of n values: x1, x2, …, xn, and you want to know if xn is
an outlier relative to the remaining values. (A common way to think about this is to view all but the
last x as coming from a normal population with mean mu and standard deviation sigma, and
suppose that the last x, xn comes from a normal population with a mean different from mu, but
with the same sigma as all of the other x’s.)
First, calculate the value of x-bar you would get excluding xn and call it xbar*. This is your
estimate of mu based on x1, …, x(n-1) only. The difference d = xn - xbar* measures how far the
possible outlier xn is from the best guess for mu (xbar*) when you exclude xn from the mean
calculation. If xn is not an outlier, then d should be close to zero (the mean of the
sampling distribution of d is zero in that case), but if xn is an observation from a population
with a different mean than all of the other x’s, then you would expect d to be far from zero in
absolute value (the mean of the sampling distribution of d is non-zero). In either case, the
mathematical theory says that the standard deviation of d is sigma x sqrt[n/(n-1)].
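For readers who want the missing step, here is a short sketch of where the sqrt[n/(n-1)] factor comes from. It assumes, as in the setup above, that the observations are independent and share the same sigma; the subscripted notation is just a typeset version of xn and xbar*.

```latex
\begin{aligned}
d &= x_n - \bar{x}^{*}, \qquad \bar{x}^{*} = \frac{1}{n-1}\sum_{i=1}^{n-1} x_i \\
\operatorname{Var}(d) &= \operatorname{Var}(x_n) + \operatorname{Var}(\bar{x}^{*})
  = \sigma^2 + \frac{\sigma^2}{n-1} = \sigma^2\,\frac{n}{n-1} \\
\operatorname{SD}(d) &= \sigma\sqrt{\frac{n}{n-1}}
\end{aligned}
```

The variances add because xn is not used in xbar*, so the two terms are independent.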
To decide if this d is big enough to call xn an outlier, you need to scale d by an estimate of its
standard deviation. To get this, first you need to estimate sigma, but without using xn in the
calculation. So calculate s* as the sample standard deviation based on all but the last
observation:
s* = sqrt[ sum( (x - xbar*)^2 ) / (n-2) ]
where the sum is over all the x values except xn, the possible outlier.
The scaled version of d is
T = d / [s* x sqrt(n/(n-1))]
T can be interpreted as how far, in units of standard deviation, xn is from the mean of the rest of
the data. Large values (in absolute value) support that xn is an outlier. If the assumption that all of
your data come from the same normal population is true, then T behaves like a t random
variable with (n-2) degrees of freedom. Assuming you’ve got a sample size n larger than 30 or so,
this suggests that if |T| is larger than 1.96 (or round it to 2.00), xn is a
somewhat unusual observation compared to the remaining values, and if |T| is bigger than 3 or 4, it
is very unusual.
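The calculation of T described above can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `leave_one_out_t` and the small sample in the usage lines are made up for this sketch and are not from the thread.

```python
import math
import statistics


def leave_one_out_t(values, candidate_index=-1):
    """Return T = d / (s* x sqrt(n/(n-1))) for the value at candidate_index.

    The name and signature are hypothetical, chosen for this sketch.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values so that s* can be computed")
    k = candidate_index % n                      # allow negative indices such as -1
    candidate = values[k]
    rest = [v for i, v in enumerate(values) if i != k]
    xbar_star = statistics.mean(rest)            # mean without the candidate
    s_star = statistics.stdev(rest)              # n-1 values, so the divisor is n-2
    d = candidate - xbar_star
    # Note: if all remaining values are identical, s_star is 0 and this fails.
    return d / (s_star * math.sqrt(n / (n - 1)))


# Hypothetical sample, for illustration only: is the last value unusual?
sample = [10, 12, 11, 13, 12, 11, 25]
print(leave_one_out_t(sample))
```

Comparing |T| with 2 or 3 is the rough rule from the post; for small n the cutoff would come from a t distribution with n-2 degrees of freedom, which this sketch does not compute.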
The problem with this procedure is that it relies fairly heavily on the assumption that your random
sample, including the outlier, comes from normal distributions (or approximately normal distributions),
and it requires a lot of calculation! The advantage of the procedure we teach in AP Statistics (e.g.,
the 1.5 x IQR rule) is that it doesn’t have these restrictive assumptions involving normal
distributions. In addition, because the IQR is relatively insensitive to outliers, we don’t have to
recalculate a comparison scale factor every time we assess whether another point in our dataset
is an outlier.
Here is an example:
Consider the sample of n = 25 data values:
106 118 109 110 89 97 120 93 96 105 85 87
92 108 106 104 101 89 102 104 103 113 104 103 130
A boxplot and histogram of all 25 data values are in the attached file (from JMP). The
sample mean is 102.96 and the sample s.d. is 10.73. It is clear from the boxplot that the
25th observation (130) is an outlier. (JMP calculates the lower quartile as 94.5 and the upper
quartile as 108.5, so that the IQR is 14, 1.5 x IQR = 21, and the upper cutoff is 108.5 + 21 = 129.5.)
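The 1.5 x IQR check on this sample can be reproduced with Python's standard library, as a rough sketch. Quartile conventions differ between packages; the "exclusive" method used here happens to give the same quartiles as the JMP values quoted above for this sample, but that agreement should not be assumed for other data or other software. The variable names are illustrative.

```python
import statistics

data = [106, 118, 109, 110, 89, 97, 120, 93, 96, 105, 85, 87,
        92, 108, 106, 104, 101, 89, 102, 104, 103, 113, 104, 103, 130]

# method="exclusive" interpolates at position (n+1)p in the sorted data.
q1, _median, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # values below this are flagged
upper_fence = q3 + 1.5 * iqr   # values above this are flagged
flagged = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, lower_fence, upper_fence, flagged)
# prints: 94.5 108.5 14.0 73.5 129.5 [130]
```

Only 130 falls outside the cutoffs, and only just: it exceeds the upper cutoff by 0.5.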
If you exclude the last data value (130) and recalculate the sample mean and standard deviation,
you get xbar* = 101.83 and s* = 9.32. Using these values, d = 130 - 101.83 = 28.17, so
T = d / [s* x sqrt(25/24)] = 2.96, which indicates that the point is almost 3 standard deviations away from
the mean of the remaining points. So using this method, based on the assumption that the data
come from a normal distribution, you would identify x25 = 130 as an outlier in the dataset.
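As a check on the arithmetic, the sketch below recomputes these numbers and, for contrast, the plain z-score you would get if 130 were left in the mean and s.d. (the effect raised at the start of this reply). It uses only the Python standard library; the variable names are illustrative.

```python
import math
import statistics

data = [106, 118, 109, 110, 89, 97, 120, 93, 96, 105, 85, 87,
        92, 108, 106, 104, 101, 89, 102, 104, 103, 113, 104, 103, 130]

candidate = data[-1]   # the 25th observation, 130
rest = data[:-1]       # the other 24 values
n = len(data)

# Plain z-score: mean and s.d. computed from all 25 values, candidate included.
z_all = (candidate - statistics.mean(data)) / statistics.stdev(data)

# Leave-one-out version: mean and s.d. computed without the candidate.
xbar_star = statistics.mean(rest)
s_star = statistics.stdev(rest)
t_stat = (candidate - xbar_star) / (s_star * math.sqrt(n / (n - 1)))

print(round(z_all, 2), round(xbar_star, 2), round(s_star, 2), round(t_stat, 2))
# prints: 2.52 101.83 9.32 2.96
```

Including 130 pulls the mean up and inflates the s.d., so the same point looks about 2.5 standard deviations out instead of almost 3.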
- Roy
Roy St Laurent | Mathematics & Statistics | Northern Arizona University |
Flagstaff, AZ
RE: Outlier Question
8/5/15 6:54 AM as a reply to Roy St Laurent.

My experience with lab scientists is that they do NOT exclude the outlier when computing the
mean and s.d. And excluding just ONE outlier is not very robust against multiple outliers. The
IQR approach has only been around for fifty years or so and has not penetrated into application
areas.
Robert Hayden