Outlier Question – Christian Cruze, 8/4/15 8:27 PM

A question came up in my class today. We were discussing outliers in the basic sense of values that are unusual compared with the rest of the herd. I mentioned to my students that we'll learn an algebraic way of detecting outliers later (with the 1.5 × IQR rule in mind). One of my students mentioned that in another class (I believe some sort of science class) they used the standard deviation to detect outliers.

My initial thought was that you shouldn't be able to use the standard deviation for this, as I was under the impression that you generally only want to use the mean and standard deviation with data that are approximately normal with no outliers. Then I thought about how it could be possible: after all, if a data value is more than 2 or 3 standard deviations away from the mean, it could be considered an outlier.

So my question is this: can you use the standard deviation for this purpose or not? Is it worthwhile or useful to do so? Is it totally wrong? Any thoughts?

RE: Outlier Question – 8/5/15 3:10 AM, as a reply to Christian Cruze

Christian,

Your student is correct: one can use a rule based on a sample standard deviation calculation to detect outliers in a random sample. But you raise a good point: if a data value is an outlier, it will not only be far from the center of the data, but (because of that) it will also affect the sample standard deviation.

For this reason, the standard-deviation-based method compares a potential outlier to the mean of the remaining data values, standardized to give a t-statistic. The difference is scaled by a measure of standard deviation that excludes the candidate outlier, so that the s.d. is not affected by the possible outlier.

There are a lot of issues with this approach that the 1.5 × IQR rule does not suffer from (see the end of this post), and all of this is BAP (beyond AP Statistics).
But in case you (or anyone else) are interested, here are the details.

Suppose that you have a random sample of n values x_1, x_2, ..., x_n, and you want to know whether x_n is an outlier relative to the remaining values. (A common way to think about this is to view all but the last x as coming from a normal population with mean mu and standard deviation sigma, and to suppose that the last x, x_n, comes from a normal population with a mean different from mu but with the same sigma as all of the other x's.)

First, calculate the value of x-bar you would get excluding x_n, and call this xbar*. This is your estimate of mu based on x_1, ..., x_(n-1) only. The difference

d = x_n - xbar*

measures how far the possible outlier x_n is from the best guess for mu (xbar*) when you exclude x_n from the mean calculation. If x_n is not an outlier, then d should be close to zero (the mean of the sampling distribution of d is zero in that case). If x_n is an observation from a population with a different mean than all of the other x's, then you would expect d to be far from zero in absolute value (the mean of the sampling distribution of d is non-zero).

In either case, the mathematical theory says that the standard deviation of d is

sigma × sqrt[n/(n-1)]

because the variance of d is the variance of x_n (sigma^2) plus the variance of xbar* (sigma^2/(n-1)), given that x_n is independent of the other values.

To decide whether d is big enough to call x_n an outlier, you need to scale d by an estimate of its standard deviation. To get this, you first need to estimate sigma without using x_n in the calculation. So calculate s* as the sample standard deviation based on all but the last observation:

s* = sqrt[ sum( (x_i - xbar*)^2 ) / (n-2) ]

where the sum is over all the x values EXCEPT x_n, the possible outlier. (The divisor is n-2 because only n-1 values enter the sum.)

The scaled version of d is

T = d / [s* × sqrt(n/(n-1))]

T can be interpreted as how far, in units of standard deviation, x_n is from the mean of the rest of the data. Large values (in absolute value) support the conclusion that x_n is an outlier.
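For anyone who wants to try the calculation in software, the steps above could be sketched in Python roughly as follows. This is only an illustration, not something from the original post: the function name `leave_one_out_t` and its `index` parameter are made up for this sketch, and it relies on `statistics.stdev`, which divides by one less than the number of values it is given (here n-2, since the candidate has been removed).

```python
import math
from statistics import mean, stdev


def leave_one_out_t(values, index=-1):
    """Return (d, s_star, T) for the candidate outlier values[index].

    Hypothetical helper for illustration: the candidate is removed before
    the mean and standard deviation of the remaining values are computed.
    """
    rest = list(values)
    candidate = rest.pop(index)      # remove the candidate; works for index=-1
    n = len(rest) + 1                # n counts the candidate as well
    if n < 3:
        raise ValueError("need at least 3 values (2 besides the candidate)")

    xbar_star = mean(rest)           # mean of the other n-1 values
    s_star = stdev(rest)             # sample s.d. of the others, divisor n-2
    d = candidate - xbar_star        # distance of candidate from xbar*

    # Scale d by the estimate of its standard deviation, s* x sqrt(n/(n-1)).
    # If all remaining values are identical, s_star is 0 and this division fails.
    t = d / (s_star * math.sqrt(n / (n - 1)))
    return d, s_star, t


if __name__ == "__main__":
    # Tiny made-up sample, used only to show the call; not data from the thread.
    d, s_star, t = leave_one_out_t([1, 2, 3, 10])
    print(d, s_star, t)
```

The candidate defaults to the last value to match the x_n convention used above; passing a different `index` tests another point, with the scale recomputed each time.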
If the assumption that all of your data come from the same normal population is true, then T behaves like a t random variable with (n-2) degrees of freedom. Assuming you've got a sample size n larger than 30 or so, this suggests that if |T| is larger than 1.96 (or round it to 2.00), x_n is a somewhat unusual observation compared to the remaining values, and if |T| is bigger than 3 or 4, it is very unusual.

The problem with this procedure is that it relies fairly heavily on the assumption that your random sample, including the outlier, comes from normal distributions (or approximately normal distributions), and it requires a lot of calculation!

The advantage of the procedure we teach in AP Statistics (the 1.5 × IQR rule) is that it doesn't have these restrictive assumptions involving normal distributions. In addition, because the IQR is relatively insensitive to outliers, we don't have to recalculate a comparison scale factor every time we assess whether another point in our dataset is an outlier.

Here is an example. Consider the sample of n = 25 data values:

106 118 109 110  89
 97 120  93  96 105
 85  87  92 108 106
104 101  89 102 104
103 113 104 103 130

A boxplot and histogram of all 25 data values are in the attached file (from JMP). The sample mean is 102.96 and the sample s.d. is 10.73. It is clear from the boxplot that the 25th observation (130) is an outlier. (JMP calculates the lower quartile as 94.5 and the upper quartile as 108.5, so the IQR is 14, 1.5 × IQR = 21, and the upper cutoff is 108.5 + 21 = 129.5, which 130 exceeds.)

If you exclude the last data value (130) and recalculate the sample mean and standard deviation, you get xbar* = 101.83 and s* = 9.32. Using these values, d = 130 - 101.83 = 28.17, so

T = d / [s* × sqrt(25/24)] = 2.96

which indicates that the point is almost 3 standard deviations away from the mean of the remaining points.
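The numbers in this example can be checked with a short Python sketch using the 25 values listed in the post. The variable names are illustrative. One assumption to flag: the quartiles are computed with `statistics.quantiles(..., method="exclusive")`, which for this sample gives the same 94.5 and 108.5 that JMP reported; that agreement is an observation about this dataset, not a statement about how JMP computes quartiles in general.

```python
import math
from statistics import mean, stdev, quantiles

# The 25 data values from the post; the last one (130) is the candidate outlier.
data = [106, 118, 109, 110, 89, 97, 120, 93, 96, 105, 85, 87, 92,
        108, 106, 104, 101, 89, 102, 104, 103, 113, 104, 103, 130]

n = len(data)
candidate = data[-1]
rest = data[:-1]

# Standard-deviation-based check, excluding the candidate from mean and s.d.
xbar_star = mean(rest)                      # about 101.83
s_star = stdev(rest)                        # about 9.32 (divisor n-2 = 23)
d = candidate - xbar_star                   # about 28.17
t = d / (s_star * math.sqrt(n / (n - 1)))   # about 2.96

# 1.5 x IQR rule on the full sample; quartile method is an assumption (see above).
q1, _median, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = [x for x in data if x < lower_fence or x > upper_fence]

print(f"xbar* = {xbar_star:.2f}, s* = {s_star:.2f}, d = {d:.2f}, T = {t:.2f}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print(f"flagged by 1.5 x IQR rule: {flagged}")
```

Note that the fences are computed once from the full sample and then applied to every point, whereas the T calculation has to be redone for each candidate.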
So using this method, which assumes that the data come from a normal distribution, you would identify x_25 = 130 as an outlier in the dataset.

- Roy
Roy St Laurent | Mathematics & Statistics | Northern Arizona University | Flagstaff, AZ

RE: Outlier Question – 8/5/15 6:54 AM, as a reply to Roy St Laurent

My experience with lab scientists is that they do NOT exclude the outlier when computing the mean and s.d. And excluding just ONE outlier is not very robust against multiple outliers. The IQR approach has only been around for fifty years or so and has not penetrated into application areas.

Robert Hayden
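To see what difference it makes whether the candidate is excluded, here is a small Python comparison on the same 25 values from the thread. It is only a sketch with illustrative variable names: `z_all` standardizes 130 using the mean and s.d. of all 25 values (the practice Robert describes), and `t_loo` is the leave-one-out statistic from Roy's post.

```python
import math
from statistics import mean, stdev

# The 25 data values from the thread; the last one (130) is the candidate.
data = [106, 118, 109, 110, 89, 97, 120, 93, 96, 105, 85, 87, 92,
        108, 106, 104, 101, 89, 102, 104, 103, 113, 104, 103, 130]

n = len(data)
candidate = data[-1]

# Candidate INCLUDED in the mean and s.d. (mean 102.96, s.d. about 10.73).
z_all = (candidate - mean(data)) / stdev(data)          # about 2.52

# Candidate EXCLUDED, as in the leave-one-out method (about 2.96).
rest = data[:-1]
t_loo = (candidate - mean(rest)) / (stdev(rest) * math.sqrt(n / (n - 1)))

print(f"included: z = {z_all:.2f}   excluded: T = {t_loo:.2f}")
```

Including the candidate pulls the mean toward it and inflates the s.d., so the same point looks less extreme (about 2.5 standard deviations instead of almost 3).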