Washington Capital Area Chapter of ICEAA Luncheon Series
March 23, 2016 • Washington, DC
Outlier Analysis
Presented by:
Marc Greenberg
Cost Analysis Division (CAD)
National Aeronautics and Space Administration
D.O.U.S.
Meanwhile in the fire swamp with Westley and Buttercup:
“Inconceivable!”
‘Data of Unusual Size’ (aka D.O.U.S.’s) do exist!
So how do we find these D.O.U.S.’s … and THEN what do we do?
2
Outline
• Introduction
• Three Common Ways to Identify Outliers
  1. Outliers w/ respect to X
  2. Outliers w/ respect to Y
  3. Outliers w/ respect to Yx
• What to do if you find an Outlier
• Other Outlier Detection Methods
• Recap / Conclusion
• Notional “Before-and-After” Example
3
Introduction
• The underlying principle of outlier analysis is to:
– detect whether a small minority of data observations (e.g., 3 or fewer) have an unusual amount of influence on the regression line, and
– apply techniques to mitigate this “unusual” amount of influence
• Determining what is deemed “an outlier” does require some judgment on the part of the analyst.
– For example, there is no true consensus in the cost community on “outlier” thresholds for (X,Y) values. Some analysts prefer 2 standard deviations from the mean, others prefer 3 standard deviations.
– We deal with similar challenges with other statistical measures, such as the lowest acceptable t-stat, the R threshold for determining multicollinearity, the “most preferred” confidence level, etc.
Note: Examples in this outlier analysis section assume data that are normally distributed. The last slide of this section summarizes other methods that tend to also work reasonably well for data that are not normally distributed.
4
3 Common Ways to Identify Outliers
• Outliers can have a significant effect on the regression coefficients
• Which points on the graph would you predict to be influential observations?
• How do we tell? 3 ways:
  1. Outliers w/ respect to X
  2. Outliers w/ respect to Y
  3. Outliers w/ respect to Yx
[Scatter plot of Y (0–500) versus X (0–20); labeled observations 9, 15 and 16 sit apart from the rest of the data.]
• A best practice is to test each data point in all 3 ways
– Test each as a possible outlier w/ respect to X, Y and/or Yx
Assume for our example that we are evaluating a CER where Yx = b1 + b2 X
5
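For readers who want to follow along, here is a minimal Python sketch (not part of the original slides) that sets up the 16-observation dataset used throughout this example and fits the CER Yx = b1 + b2 X by ordinary least squares; the coefficient values in the comments are approximate.

```python
import numpy as np

# Illustrative sketch (not from the ICEAA slides): dataset and CER fit
X = np.array([2, 2.5, 3.2, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 19, 20])
Y = np.array([100, 125, 130, 140, 180, 185, 200, 205, 345, 240,
              280, 290, 330, 335, 500, 350])

# Fit the CER Yx = b1 + b2*X by ordinary least squares
b2, b1 = np.polyfit(X, Y, 1)      # slope ≈ 18.4, intercept ≈ 85.7
Yx = b1 + b2 * X                  # predicted (calculated) values

print(f"b1 = {b1:.1f}, b2 = {b2:.1f}")
print(np.round(Yx, 1))            # ≈ 122.4, 131.6, ..., 434.4, 452.8
```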
Outliers with Respect to X: # Std. Devs
• All data should be from the same population
– Assumes data is normally distributed
• Analyze observations
– Based on the values of X for each data point, are there any data points that look very different from the rest?
• How to identify potential outliers with respect to X
– Calculate the mean and standard deviation of the Xi values in the dataset
– Divide the difference between each Xi and X̄ by sx:

  # Std Deviations = (Xi − X̄) / sx ,  where sx = standard deviation of the X data

– Identify observations that fall more than 2 standard deviations from the mean (or 3 standard deviations from the mean, if preferred)
6
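A minimal Python sketch of this screen (illustrative, not from the slides), using the sample standard deviation exactly as the next chart does:

```python
import numpy as np

# Illustrative sketch (not from the ICEAA slides): z-score screen on X
X = np.array([2, 2.5, 3.2, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 19, 20])

x_bar = X.mean()                     # ≈ 8.73
s_x = X.std(ddof=1)                  # sample std dev, ≈ 5.36
num_std_devs = (X - x_bar) / s_x     # # of standard deviations from the mean

# Flag observations more than 2 standard deviations from the mean
flagged = np.where(np.abs(num_std_devs) > 2)[0] + 1   # 1-based observation IDs
print(flagged)   # only obs 16 (≈ 2.1) exceeds 2; obs 15 (≈ 1.9) is borderline
```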
Outliers with Respect to X: # Std. Devs
Calculated the mean and standard deviation of the Xi values in the dataset, then divided the difference between each Xi and X̄ by sx:

 ID   Xi    X̄ (mean)   Xi − X̄   (Xi − X̄)²   (Xi − X̄)/sx
  1   2       8.73      -6.73     45.2929     -1.256
  2   2.5     8.73      -6.23     38.8129     -1.163
  3   3.2     8.73      -5.53     30.5809     -1.032
  4   4       8.73      -4.73     22.3729     -0.883
  5   5       8.73      -3.73     13.9129     -0.696
  6   6       8.73      -2.73      7.4529     -0.509
  7   7       8.73      -1.73      2.9929     -0.323
  8   8       8.73      -0.73      0.5329     -0.136
  9   8       8.73      -0.73      0.5329     -0.136
 10   9       8.73       0.27      0.0729      0.050
 11  10       8.73       1.27      1.6129      0.237
 12  11       8.73       2.27      5.1529      0.424
 13  12       8.73       3.27     10.6929      0.610
 14  13       8.73       4.27     18.2329      0.797
 15  19       8.73      10.27    105.4729      1.917
 16  20       8.73      11.27    127.0129      2.103

 Σ = 139.7;  n = 16;  mean = 8.73125;  Σ(Xi − X̄)² = 430.7344;  variance = 28.72;  std dev = 5.36

Mean of Xi ’s = 8.73
Std Dev of Xi ’s = 5.36
# Standard Deviations from X̄ = ( Xi – 8.73 ) / 5.36
X15 = 19 and X16 = 20 are about 2 standard deviations from X̄.
Therefore, 2 of the 16 observations are potential outliers.
7
Outliers with Respect to X: Leverage
• Leverage Value is one indicator of the degree of influence a given Xi may have on the regression coefficients
• Looks at how much influence an observation could have on the coefficients of the regression equation
– Leverage values sum to p (# of parameters)
– Average leverage value = p/n (n = # of observations)
– An observation is considered a potential outlier with respect to X if its leverage value is greater than 2(p/n) to 3(p/n)

  Leverage = 1/n + (xi − x̄)² / Σ(xi − x̄)²
8
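An illustrative Python sketch of the leverage screen (not from the slides); for a one-variable CER the formula above is simply the hat value of each observation:

```python
import numpy as np

# Illustrative sketch (not from the ICEAA slides): leverage screen on X
X = np.array([2, 2.5, 3.2, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 19, 20])
n, p = len(X), 2                            # p = 2 parameters (b1, b2)

dev_sq = (X - X.mean()) ** 2
leverage = 1.0 / n + dev_sq / dev_sq.sum()  # ≈ 0.168, 0.153, ..., 0.307, 0.357

threshold = 2 * p / n                       # 2(p/n) = 0.25; use 3(p/n) = 0.375 if preferred
flagged = np.where(leverage > threshold)[0] + 1
print(flagged)                              # observations 15 and 16 exceed 2(p/n)
```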
Outliers with Respect to X: Leverage
p = 2 & n = 16, so (p/n) = 2/16 = 0.125, 2(p/n) = 0.25, 2.5(p/n) = 0.3125, 3(p/n) = 0.375; X̄ = 8.73 & SD of Xi ’s = 5.36

 ID   Xi    (Xi − X̄)²   LV*
  1   2       45.29     0.168
  2   2.5     38.81     0.153
  3   3.2     30.58     0.133
  4   4       22.37     0.114
  5   5       13.91     0.095
  6   6        7.45     0.080
  7   7        2.99     0.069
  8   8        0.53     0.064
  9   8        0.53     0.064
 10   9        0.07     0.063
 11  10        1.61     0.066
 12  11        5.15     0.074
 13  12       10.69     0.087
 14  13       18.23     0.105
 15  19      105.47     0.307
 16  20      127.01     0.357

 Σ(Xi − X̄)² = 430.73;  n = 16;  mean = 8.73125;  variance = 28.72;  std deviation = 5.36
 Thresholds: 2(p/n) = 0.25,  2.5(p/n) = 0.3125,  3(p/n) = 0.375

Recall from Slide 4 that we have a CER with 2 population parameters; therefore, p = 2.
X15 = 19 and X16 = 20 have leverage values exceeding 2(p/n). Therefore, these 2 observations are potential outliers.

* Example for X1 :  LV = 1/n + (X1 − X̄)² / Σ(Xi − X̄)² = 1/16 + 45.29/430.73 = .063 + .105 = .168
9
Outliers with Respect to Y and Yx
• To evaluate potential outliers with respect to Y, use the same method as for “Outliers with Respect to X”
– Refer to the 2 methods shown in Slides 5 – 8, but instead apply them to Y
• Outliers with respect to Yx : These represent observations that the model doesn’t predict well
– The further the observation is from the regression line, the larger the estimating error
– Approaches to evaluating the size of the residual:
– Compare with the standard error of the estimate (SE, SEE, syx), which is based upon the sum of squared errors (aka “squared residuals”)
– Individual variance on the residual: Studentized Residual
10
Outliers with Respect to Yx: # Std. Errors
• Observations that are not predicted well by the regression equation
– Calculate the predicted cost and standard error of the dataset
– Calculate the difference between each Yi and Yx and divide by the standard error of Yx (denoted as SYx):

  # of Standard Errors = (Yi – Yx) / SYx ,  where SYx = standard error of the Yx data

– Identify observations that fall more than 2 standard errors from the calculated Yx (or 3 standard errors from Yx , if preferred)
11
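An illustrative Python sketch of this screen (not from the slides); it refits the CER from the data above and computes SYx with n – p in the denominator, as in Note 2 on the next chart:

```python
import numpy as np

# Illustrative sketch (not from the ICEAA slides): standard-error screen on Yx
X = np.array([2, 2.5, 3.2, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 19, 20])
Y = np.array([100, 125, 130, 140, 180, 185, 200, 205, 345, 240,
              280, 290, 330, 335, 500, 350])

n, p = len(Y), 2
b2, b1 = np.polyfit(X, Y, 1)
Yx = b1 + b2 * X                              # calculated (predicted) Y values

resid = Y - Yx                                # residuals e_i
SYx = np.sqrt(np.sum(resid ** 2) / (n - p))   # standard error of the estimate, ≈ 46.8

num_std_errors = resid / SYx
flagged = np.where(np.abs(num_std_errors) > 2)[0] + 1
print(round(SYx, 2), flagged)                 # observations 9 and 16 exceed ±2 standard errors
```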
Outliers with Respect to Y and Yx
Evaluating “flagged” Obs. #9 and #16 by calculating Standard Deviations & Standard Errors

 Obs   Given   Given      Estimated or     Error           Squared Error
        X      Actual Y   Calculated Yx    ei = Yi – Yx    ei²
  1     2       100         122.4           -22.39             501
  2     2.5     125         131.6            -6.56              43
  3     3.2     130         144.4           -14.41             208
  4     4       140         159.1           -19.10             365
  5     5       180         177.5             2.55               7
  6     6       185         195.8           -10.81             117
  7     7       200         214.2           -14.16             201
  8     8       205         232.5           -27.52             757
  9     8       345         232.5           112.48          12,652
 10     9       240         250.9           -10.87             118
 11    10       280         269.2            10.77             116
 12    11       290         287.6             2.42               6
 13    12       330         305.9            24.07             579
 14    13       335         324.3            10.71             115
 15    19       500         434.4            65.58           4,301
 16    20       350         452.8          -102.77          10,562

 Mean of Y = 245.9    Mean of Y-hat = 245.9    SSE = 30,646
 SE of Y = 108.25                              SE of Yx = 46.79

Evaluate actual Y’s *
 # Standard Deviations from Ȳ = ( Yi – Ȳ ) / SY
 # St Devs = ( Yi – 245.9 ) / 108.25
 # St Devs for Y9 = ( 345 – 245.9 ) / 108.25 = 0.915
 # St Devs for Y16 = ( 350 – 245.9 ) / 108.25 = 0.961
Using the ± 2 std dev rule, neither observation is an outlier with respect to Y.

Evaluate calculated Yx’s *
 # Standard Errors from Yx = ( Yi – Yx ) / SYx
 # St Errors for Y9 = ( 345 – 232.5 ) / 46.79 = 2.404
 # St Errors for Y16 = ( 350 – 452.8 ) / 46.79 = -2.196
Using the ± 2 std error rule, both observations ARE outliers with respect to Yx.

* Note 1: SY = SQRT( Σ(Yi – Ȳ)² / (n–1) ) = SQRT( 175,761 / (16–1) ) = 108.25
* Note 2: SYx = SQRT( Σ(Yi – Yx)² / (n–p) ) = SQRT( 30,646 / (16–2) ) = 46.79
12
Outliers with Respect to Yx
Highlighting steps to calculate Leverage (LV) and Studentized Residual (ei*) for Obs. #16

 Obs   Estimated or     Error           Squared Error   (Yx – Ȳx)²       Leverage   s{ei}   Internally Studentized
       Calculated Yx    ei = Yi – Yx    ei²             vs mean of Yx    LV                  Residual ei* = ei / s{ei}
  1      122.4           -22.39             501            15,264         0.1677     42.7      -0.52
  2      131.6            -6.56              43            13,082         0.1527     43.1      -0.15
  3      144.4           -14.41             208            10,308         0.1335     43.6      -0.33
  4      159.1           -19.10             365             7,541         0.1145     44.0      -0.43
  5      177.5             2.55               7             4,691         0.0948     44.5       0.06
  6      195.8           -10.81             117             2,513         0.0798     44.9      -0.24
  7      214.2           -14.16             201             1,010         0.0695     45.1      -0.31
  8      232.5           -27.52             757               180         0.0637     45.3      -0.61
  9      232.5           112.48          12,652               180         0.0637     45.3       2.48   R
 10      250.9           -10.87             118                24         0.0627     45.3      -0.24
 11      269.2            10.77             116               543         0.0662     45.2       0.24
 12      287.6             2.42               6             1,734         0.0744     45.0       0.05
 13      305.9            24.07             579             3,599         0.0873     44.7       0.54
 14      324.3            10.71             115             6,139         0.1048     44.3       0.24
 15      434.4            65.58           4,301            35,525         0.3073     38.9       1.68
 16      452.8          -102.77          10,562            42,779         0.3573     37.5      -2.74   R, D

 Mean of Y-hat = 245.9    SSE = 30,646    Σ(Yx – Ȳx)² = 145,112    Sum of LV = p = 2.00
 SE of Yx = 46.79         MSE = SSE / (n – k – 1) = 30,646 / 14 = 2,189
 s{ei} = square root of the unbiased estimate of the variance of each residual
 ei* = ei / s{ei} = Internally Studentized Residual

 R: an observation w/ an unusual Dependent variable value
 D: an observation w/ an unusual Cook’s D-statistic value

Leverage for observation #16:
 LV = 1/n + (Yx – Ȳx)² / Σ(Yx – Ȳx)²
 LV = 1/16 + (452.8 – 245.9)² / 145,112
 LV = 0.0625 + 0.2948
 LV = 0.3573

Internally Studentized Residual for observation #16:
 s²{ei} = MSE (1 – LV)
 s²{ei} = 2,189 (1 – 0.3573)
 s²{ei} = 1,406.9
 s{ei} = 37.5
 ei* = ei / s{ei}
 ei* = -102.77 / 37.5
 ei* = -2.74

As already noted in slide 9, obs. #16 has a Leverage > 2(p/n).
Driven by its high ei, obs. #16 has an |ei*| > 2 std dev (unusual Yx).
13
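An illustrative Python sketch of the internally studentized residual calculation (not from the slides); it computes leverage from the X data, which for this one-variable CER matches the LV column above:

```python
import numpy as np

# Illustrative sketch (not from the ICEAA slides): internally studentized residuals
X = np.array([2, 2.5, 3.2, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 19, 20])
Y = np.array([100, 125, 130, 140, 180, 185, 200, 205, 345, 240,
              280, 290, 330, 335, 500, 350])

n, p = len(X), 2
b2, b1 = np.polyfit(X, Y, 1)
resid = Y - (b1 + b2 * X)

# Leverage (hat values) for a simple linear regression
dev_sq = (X - X.mean()) ** 2
lv = 1.0 / n + dev_sq / dev_sq.sum()

mse = np.sum(resid ** 2) / (n - p)      # ≈ 2,189
s_e = np.sqrt(mse * (1 - lv))           # std dev of each residual
r_internal = resid / s_e                # internally studentized residuals

flagged = np.where(np.abs(r_internal) > 2)[0] + 1
print(flagged)                          # observations 9 (≈ 2.5) and 16 (≈ -2.7)
```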
Outliers with Respect to Yx
Observations Influencing the Regression Coefficients
An observation is considered influential by having:
• a moderate leverage value and a large residual,
• a large leverage value and a moderate residual, or
• a large leverage value and a large residual.
Cook’s Distance (Cook’s D) is a statistic that is commonly used to determine if an observation is influential.
• The distance an observation would be from a regression equation built with this observation omitted from the dataset

  Di = [ ei² / ( p × MSE ) ] × [ Leverage / ( 1 – Leverage )² ]

  p = # of population parameters in the equation
  MSE = MSE from the equation with all the observations

If Cook’s D > 50th percentile of the F distribution for (p, n–p) degrees of freedom, then the observation is considered influential.
14
Outliers with Respect to Yx
Observations Influencing the Regression Coefficients
1. Calculation of Cook’s D statistic for observation #16:

  Di = [ ei² / ( p × MSE ) ] × [ Leverage / ( 1 – Leverage )² ]

  p = # of population parameters in the equation = 2
  MSE = MSE from the equation with all the observations = 2,189

  Di = [ (-102.77)² / ( 2 × 2,189 ) ] × [ 0.3573 / ( 1 – 0.3573 )² ] = 2.412 × 0.865 = 2.087

2. Look up the 50th %-tile of the F distribution for (p, n–p) degrees of freedom:
• F distribution with (2, 16–2) degrees of freedom = F distribution (2, 14) = 0.729
– Excel’s F.INV function provided this reference value for F (α = 0.50, numerator = 2, denominator = 14)

Therefore, evaluating observation #16:
Cook’s D > F (0.50, 2, 14)
2.087 > 0.729
Cook’s D indicates that obs. #16 is influential (aka “an unusual value”)
15
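An illustrative Python sketch of the Cook's D screen (not from the slides); scipy.stats.f.ppf returns the same F-distribution percentile the slide obtains from Excel's F.INV:

```python
import numpy as np
from scipy.stats import f

# Illustrative sketch (not from the ICEAA slides): Cook's D screen
X = np.array([2, 2.5, 3.2, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 19, 20])
Y = np.array([100, 125, 130, 140, 180, 185, 200, 205, 345, 240,
              280, 290, 330, 335, 500, 350])

n, p = len(X), 2
b2, b1 = np.polyfit(X, Y, 1)
resid = Y - (b1 + b2 * X)
dev_sq = (X - X.mean()) ** 2
lv = 1.0 / n + dev_sq / dev_sq.sum()
mse = np.sum(resid ** 2) / (n - p)

# Cook's D for every observation
cooks_d = (resid ** 2 / (p * mse)) * (lv / (1 - lv) ** 2)

# Influential if Cook's D exceeds the median of F(p, n - p)
threshold = f.ppf(0.50, p, n - p)          # ≈ 0.729
influential = np.where(cooks_d > threshold)[0] + 1
print(np.round(cooks_d, 3), round(threshold, 3), influential)  # only obs 16 exceeds the threshold
```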
What to do if you find an Outlier
Part 1: Evaluate Outlier with respect to X or Y
A. Investigation
• Do you have the right value for the observation?
• Has the observation been normalized correctly?
• Is the observation part of the population?
• How different is the outlier?
• Were there any unusual events that impacted the value of the observation?
B. Actions based upon results of Investigation
• Correct data entry errors
• Improve normalization process
• Remove data point if not part of population
• Determine if unusual program events make a difference

Part 2: Outlier with respect to Yx (note: do this after completing Part 1)
A. Investigation
• Did you choose the correct functional form?
• Are there any omitted cost drivers?
• Were the same criteria applied to all outliers?
B. Actions based upon results of Investigation
• Add another cost driver and/or choose another functional form
• Dampen or lessen Yx influence by transforming X or Y data
• Create and compare two regression equations:
– One with and one without the outlier(s)
16
Other Outlier Detection Methods
• Median and Median Absolute Deviation Method (MAD)
– For this outlier detection method, the median of the residuals is calculated. Then, the
difference is calculated between each historical value and this median. These differences are
expressed as their absolute values, and a new median is calculated and multiplied by an
empirically derived constant to yield the MAD.
– If a value is a certain number of MAD away from the median of the residuals, that value is
classified as an outlier. The default threshold is 3 MAD.
• This method is generally more effective than the mean and standard deviation method for detecting
outliers, but it can be too aggressive in classifying values that are not really extremely different. Also, if
more than 50% of the data points have the same value, MAD is computed to be 0, so any value
different from the residual median is classified as an outlier.
• Median and Interquartile Deviation Method (IQD)
– For this outlier detection method, the median of the residuals is calculated, along with the 25th percentile and the 75th percentile. The difference between the 25th and 75th percentiles is the IQD. Then, the difference is calculated between each historical value and the residual median. If the historical value is a certain number of IQDs away from the median of the residuals, that value is classified as an outlier.
– The default threshold is 2.22 IQD, which is equivalent to 3 standard deviations or MADs.
• This method is somewhat susceptible to influence from extreme outliers, but less so than the mean and standard deviation method. Box plots are based on this approach. The median and interquartile deviation method can be used for both symmetric and asymmetric data.
17
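An illustrative Python sketch of both robust screens (not from the slides). The 1.4826 scaling constant is the usual consistency factor for normally distributed data, the helper function names are assumptions for illustration, and the 3-MAD and 2.22-IQD thresholds are the defaults described above:

```python
import numpy as np

# Illustrative helpers (not from the ICEAA slides); names are assumptions
def mad_outliers(values, threshold=3.0):
    """Flag values more than `threshold` MADs from the median."""
    med = np.median(values)
    mad = 1.4826 * np.median(np.abs(values - med))  # scaled to ≈ std dev for normal data
    if mad == 0:
        # degenerate case noted above: anything different from the median is flagged
        return np.abs(values - med) > 0
    return np.abs(values - med) / mad > threshold

def iqd_outliers(values, threshold=2.22):
    """Flag values more than `threshold` IQDs from the median."""
    q25, q75 = np.percentile(values, [25, 75])
    iqd = q75 - q25
    med = np.median(values)
    return np.abs(values - med) / iqd > threshold

# Residuals from the earlier example
residuals = np.array([-22.39, -6.56, -14.41, -19.10, 2.55, -10.81, -14.16, -27.52,
                      112.48, -10.87, 10.77, 2.42, 24.07, 10.71, 65.58, -102.77])
print(np.where(mad_outliers(residuals))[0] + 1)   # flags observations 9, 15 and 16
print(np.where(iqd_outliers(residuals))[0] + 1)   # flags observations 9, 15 and 16
```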
Recap / Conclusion
• The main reason for outlier analysis is to identify whether one or more observations have an unusual amount of influence on the regression.
• Outlier analysis calculations are done with respect to X, Y and Yx
• Because there’s no consensus on what is a ‘true’ outlier on a single metric, it’s ‘good practice’ to calculate and account for all metrics:
– # of Standard Deviations with respect to actual X’s and actual Y’s
– # of Standard Errors with respect to calculated Yx ’s
– Leverage Value (LV) with respect to actual X’s (… get same result wrt Yx ’s)
– Residual (denoted as ei ) & Studentized Residual (denoted as ei* )
– Cook’s Distance (‘Cook’s D’) = a function of ei , Mean Squared Error and LV
• Identifying outliers is typically just half the effort; what to do if you find an outlier can & should be handled on a case-by-case basis.
• This “mini-lesson” covered fundamental methods. Keep in mind that there are several other outlier analysis methods out there.
18