Outlier Analysis
Washington Capital Area Chapter of ICEAA Luncheon Series
March 23, 2016 • Washington, DC
Presented by: Marc Greenberg, Cost Analysis Division (CAD), National Aeronautics and Space Administration

D.O.U.S.
Meanwhile, in the fire swamp with Westley and Buttercup: Inconceivable! "Data of Unusual Size" (aka D.O.U.S.'s) do exist!
So how do we find these D.O.U.S.'s ... and THEN what do we do?

Outline
• Introduction
• Three Common Ways to Identify Outliers
  1. Outliers with respect to X
  2. Outliers with respect to Y
  3. Outliers with respect to Yx
• What to Do if You Find an Outlier
• Other Outlier Detection Methods
• Recap / Conclusion
• Notional "Before-and-After" Example

Introduction
• The underlying principle of outlier analysis is to:
  – detect whether a small minority of data observations (e.g., 3 or fewer) have an unusual amount of influence on the regression line, and
  – apply techniques to mitigate this "unusual" amount of influence.
• Determining what is deemed "an outlier" requires some judgment on the part of the analyst.
  – For example, there is no true consensus in the cost community on "outlier" thresholds for (X, Y) values. Some analysts prefer 2 standard deviations from the mean; others prefer 3 standard deviations.
  – We deal with similar challenges with other statistical measures, such as the lowest acceptable t-stat, the R-threshold for determining multicollinearity, the "most preferred" confidence level, etc.
Note: Examples in this outlier analysis section assume data that is normally distributed. The "Other Outlier Detection Methods" slide summarizes methods that also tend to work reasonably well for data that is not normally distributed.

3 Common Ways to Identify Outliers
• Outliers can have a significant effect on the regression coefficients.
• Which points on the graph would you predict to be influential observations?
• How do we tell? 3 ways:
  1. Outliers with respect to X
  2. Outliers with respect to Y
  3. Outliers with respect to Yx
[Figure: scatter plot of Y (0 to 500) versus X (0 to 20) for the 16 observations; observations 9, 15 and 16 are labeled.]
• A best practice is to test each data point in all 3 ways.
  – Test each as a possible outlier with respect to X, Y and/or Yx.
Assume for our example that we are evaluating a CER where Yx = b1 + b2 X.

Outliers with Respect to X: # Std. Devs
• All data should be from the same population.
  – Assumes data is normally distributed.
• Analyze observations.
  – Based on the values of X for each data point, are there any data points that look very different from the rest?
• How to identify potential outliers with respect to X:
  – Calculate the mean and standard deviation of the Xi values in the dataset.
  – Divide the difference between each Xi and the mean of X by the standard deviation of the X data, sx:

$$\text{\# Std Deviations} = \frac{x_i - \bar{x}}{s_x}$$

  – Identify observations that fall more than 2 standard deviations from the mean (or 3 standard deviations from the mean, if preferred).

Outliers with Respect to X: # Std. Devs (worked example)
Calculated the mean and standard deviation of the Xi values in the dataset, then divided the difference between each Xi and the mean by sx.
Mean of the Xi's (Xbar) = 8.73; Std Dev of the Xi's (sx) = 5.36; # Standard Deviations from Xbar = (Xi – 8.73) / 5.36

ID    Xi      Xi - Xbar   (Xi - Xbar)^2   (Xi - Xbar)/sx
 1    2       -6.73        45.2929        -1.256
 2    2.5     -6.23        38.8129        -1.163
 3    3.2     -5.53        30.5809        -1.032
 4    4       -4.73        22.3729        -0.883
 5    5       -3.73        13.9129        -0.696
 6    6       -2.73         7.4529        -0.509
 7    7       -1.73         2.9929        -0.323
 8    8       -0.73         0.5329        -0.136
 9    8       -0.73         0.5329        -0.136
10    9        0.27         0.0729         0.050
11    10       1.27         1.6129         0.237
12    11       2.27         5.1529         0.424
13    12       3.27        10.6929         0.610
14    13       4.27        18.2329         0.797
15    19      10.27       105.4729         1.917
16    20      11.27       127.0129         2.103

Sum of Xi = 139.7; n = 16; mean = 8.73125
Sum of (Xi - Xbar)^2 = 430.7344; variance = 28.72; std dev = 5.36

X15 = 19 and X16 = 20 are about 2 standard deviations from Xbar (1.917 and 2.103, respectively).
Therefore, 2 of the 16 observations are potential outliers.
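The standard-deviation screen above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `std_devs_from_mean` and `flag_outliers` are hypothetical and not part of the presentation. Note that a strict cutoff of 2.0 flags only observation 16, whereas the slide treats observation 15 (about 1.92) as "about 2" as well, so the cutoff is left as a parameter.

```python
import statistics

# X values for the 16 observations in the example dataset
x = [2, 2.5, 3.2, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 19, 20]


def std_devs_from_mean(values):
    """Return (x_i - mean) / s for each value, using the sample std dev (n - 1)."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - mean) / s for v in values]


def flag_outliers(values, threshold=2.0):
    """Return 1-based IDs of observations with |# std devs| >= threshold."""
    z = std_devs_from_mean(values)
    return [i + 1 for i, zi in enumerate(z) if abs(zi) >= threshold]


z_scores = std_devs_from_mean(x)
for obs_id, (xi, zi) in enumerate(zip(x, z_scores), start=1):
    print(f"{obs_id:2d}  X = {xi:5.1f}  # std devs = {zi:6.3f}")

# The analyst chooses the cutoff; 2 and 3 are the conventions named in the text.
print("Flagged at 2.0:", flag_outliers(x, 2.0))
print("Flagged at 1.9:", flag_outliers(x, 1.9))
```

Keeping the threshold as an argument reflects the point made in the introduction: there is no consensus cutoff, so the same screen can be rerun at 2 or 3 standard deviations.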
Outliers with Respect to X: Leverage
• Leverage value is one indicator of the degree of influence a given Xi may have on the regression coefficients.
• It looks at how much influence an observation could have on the coefficients of the regression equation.
  – Leverage values sum to p (# of parameters).
  – Average leverage value = p/n (n = # of observations).
  – An observation is considered a potential outlier with respect to X if its leverage value is greater than 2(p/n) to 3(p/n).

$$\text{Leverage}_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}$$

Outliers with Respect to X: Leverage (worked example)
p = 2 and n = 16, so p/n = 2/16 = 0.125; 2(p/n)** = 0.25; 2.5(p/n) = 0.3125; 3(p/n) = 0.375.
Xbar = 8.73 and the std dev of the Xi's = 5.36.

ID    Xi      (Xi - Xbar)^2   LV*
 1    2        45.29          0.168
 2    2.5      38.81          0.153
 3    3.2      30.58          0.133
 4    4        22.37          0.114
 5    5        13.91          0.095
 6    6         7.45          0.080
 7    7         2.99          0.069
 8    8         0.53          0.064
 9    8         0.53          0.064
10    9         0.07          0.063
11    10        1.61          0.066
12    11        5.15          0.074
13    12       10.69          0.087
14    13       18.23          0.105
15    19      105.47          0.307
16    20      127.01          0.357

Sum of Xi = 139.7; n = 16; mean = 8.73125; p = 2
Sum of (Xi - Xbar)^2 = 430.73; variance = 28.72; std deviation = 5.36

** Recall that our CER, Yx = b1 + b2 X, has 2 population parameters. Therefore, p = 2.

X15 = 19 and X16 = 20 have leverage values exceeding 2(p/n) = 0.25.
Therefore, these 2 observations are potential outliers.
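The leverage screen can be sketched the same way. This is a minimal illustration for the one-driver CER only, assuming the leverage formula 1/n + (xi - xbar)^2 / sum((xi - xbar)^2) given above; the helper name `leverage_values` is hypothetical.

```python
# X values for the 16 observations in the example dataset
x = [2, 2.5, 3.2, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 19, 20]


def leverage_values(values):
    """Leverage for a one-driver regression: 1/n + (x_i - mean)^2 / Sxx."""
    n = len(values)
    mean = sum(values) / n
    sxx = sum((v - mean) ** 2 for v in values)
    return [1.0 / n + (v - mean) ** 2 / sxx for v in values]


p = 2                 # parameters in Yx = b1 + b2*X
n = len(x)
lv = leverage_values(x)
cutoff = 2 * p / n    # 2(p/n); use 3 * p / n for the looser rule

flagged = [i + 1 for i, h in enumerate(lv) if h > cutoff]

for obs_id, h in enumerate(lv, start=1):
    print(f"{obs_id:2d}  LV = {h:.4f}")
print(f"Sum of leverage values = {sum(lv):.2f} (should equal p)")
print(f"Flagged at 2(p/n) = {cutoff}: {flagged}")
```

Printing the sum of the leverage values is a quick self-check, since they should total p.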
* Example for X1:

$$LV_1 = \frac{1}{n} + \frac{(X_1 - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} = \frac{1}{16} + \frac{45.29}{430.73} = 0.063 + 0.105 = 0.168$$

Outliers with Respect to Y and Yx
• To evaluate potential outliers with respect to Y, use the same method as for "Outliers with Respect to X".
  – Apply the 2 methods shown above for X (# of standard deviations and leverage), but to the Y values instead.
• Outliers with respect to Yx: these represent observations that the model doesn't predict well.
  – The further the observation is from the regression line, the larger the estimating error.
  – Approaches to evaluating the size of the residual:
    – Compare with the standard error of the estimate (SE, SEE, syx), which is based upon the sum of squared errors (aka "squared residuals").
    – Individual variance on the residual: Studentized Residual.

Outliers with Respect to Yx: # Std. Errors
• Observations that are not predicted well by the regression equation:
  – Calculate the predicted cost and the standard error of the dataset.
  – Calculate the difference between each Yi and Yx and divide by the standard error of Yx (denoted SYx):

$$\text{\# of Standard Errors} = \frac{Y_i - Y_x}{S_{Y_x}}$$

  – Identify observations that fall more than 2 standard errors from the calculated Yx (or 3 standard errors from Yx, if preferred).

Outliers with Respect to Y and Yx
Evaluating "flagged" Obs. #9 and #16 by calculating Standard Deviations & Standard Errors

Obs   X (given)   Y (actual)   Yx (calculated)   ei = Yi - Yx   ei^2
 1    2           100          122.4              -22.39           501
 2    2.5         125          131.6               -6.56            43
 3    3.2         130          144.4              -14.41           208
 4    4           140          159.1              -19.10           365
 5    5           180          177.5                2.55             7
 6    6           185          195.8              -10.81           117
 7    7           200          214.2              -14.16           201
 8    8           205          232.5              -27.52           757
 9    8           345          232.5              112.48        12,652
10    9           240          250.9              -10.87           118
11    10          280          269.2               10.77           116
12    11          290          287.6                2.42             6
13    12          330          305.9               24.07           579
14    13          335          324.3               10.71           115
15    19          500          434.4               65.58         4,301
16    20          350          452.8             -102.77        10,562

Mean of Y = 245.9; Std Dev of Y (SY) = 108.25
Mean of Yx = 245.9; SSE = 30,646; SE of Yx (SYx) = 46.79

Evaluate actual Y's*: # Standard Deviations from Ybar = (Yi – Ybar) / SY = (Yi – 245.9) / 108.25
  # St Devs for Y9 = (345 – 245.9) / 108.25 = 0.915
  # St Devs for Y16 = (350 – 245.9) / 108.25 = 0.961
Using the ± 2 std dev rule, neither observation is an outlier with respect to Y.

Evaluate calculated Yx's*: # Standard Errors from Yx = (Yi – Yx) / SYx
  # St Errors for Y9 = (345 – 232.5) / 46.79 = 2.404
  # St Errors for Y16 = (350 – 452.8) / 46.79 = -2.196
Using the ± 2 std error rule, both observations ARE outliers with respect to Yx.

* Note 1: SY = SQRT( Sum(Yi – Ybar)^2 / (n – 1) ) = SQRT( 175,761 / (16 – 1) ) = 108.25
* Note 2: SYx = SQRT( Sum(Yi – Yx)^2 / (n – p) ) = SQRT( 30,646 / (16 – 2) ) = 46.79

Outliers with Respect to Yx
Highlighting steps to calculate Leverage (LV) and Studentized Residual (ei*) for Obs. #16

Obs   Yx      ei         ei^2     (Yx - mean Yx)^2   LV       s{ei}   ei*     Flag
 1    122.4   -22.39       501    15,264             0.1677   42.7    -0.52
 2    131.6    -6.56        43    13,082             0.1527   43.1    -0.15
 3    144.4   -14.41       208    10,308             0.1335   43.6    -0.33
 4    159.1   -19.10       365     7,541             0.1145   44.0    -0.43
 5    177.5     2.55         7     4,691             0.0948   44.5     0.06
 6    195.8   -10.81       117     2,513             0.0798   44.9    -0.24
 7    214.2   -14.16       201     1,010             0.0695   45.1    -0.31
 8    232.5   -27.52       757       180             0.0637   45.3    -0.61
 9    232.5   112.48    12,652       180             0.0637   45.3     2.48   R
10    250.9   -10.87       118        24             0.0627   45.3    -0.24
11    269.2    10.77       116       543             0.0662   45.2     0.24
12    287.6     2.42         6     1,734             0.0744   45.0     0.05
13    305.9    24.07       579     3,599             0.0873   44.7     0.54
14    324.3    10.71       115     6,139             0.1048   44.3     0.24
15    434.4    65.58     4,301    35,525             0.3073   38.9     1.68
16    452.8  -102.77    10,562    42,779             0.3573   37.5    -2.74   R, D

Mean of Yx = 245.9; SSE = 30,646; SE of Yx = 46.79
Sum of (Yx - mean Yx)^2 = 145,112; sum of LV = p = 2.00
MSE = SSE / (n – k – 1) = 30,646 / 14 = 2,189 (k = 1 independent variable; MSE is the unbiased estimate of the variance)
s{ei} = square root of the estimated variance of residual ei; ei* = ei / s{ei} = internally studentized residual
R: an observation with an unusual dependent variable value
D: an observation with an unusual Cook's D-statistic value

Steps for observation #16:

$$LV = \frac{1}{n} + \frac{(Y_x - \bar{Y}_x)^2}{\sum (Y_x - \bar{Y}_x)^2} = \frac{1}{16} + \frac{(452.8 - 245.9)^2}{145{,}112} = 0.0625 + 0.2948 = 0.3573$$

$$s^2\{e_i\} = MSE\,(1 - LV) = 2{,}189\,(1 - 0.3573) = 1{,}406.9, \qquad s\{e_i\} = 37.5$$

$$e_i^* = \frac{e_i}{s\{e_i\}} = \frac{-102.77}{37.5} = -2.74$$

This is the internally studentized residual for observation #16.
As already noted in the leverage table for X, obs. #16 has a leverage > 2(p/n).
Driven by its large residual, obs. #16 also has |ei*| > 2 (unusual Yx).

Outliers with Respect to Yx
Observations Influencing the Regression Coefficients
An observation is considered influential by having:
• a moderate leverage value and a large residual,
• a large leverage value and a moderate residual, or
• a large leverage value and a large residual.
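The residual-based steps above (fit the CER, compute MSE, scale each residual by its own standard deviation) can be sketched as follows. This is a minimal illustration using ordinary least squares on the example data and only the standard library; the function names are hypothetical. Leverage is computed here from the X deviations, which for a one-driver linear CER gives the same values as the Yx-based calculation shown for obs. #16.

```python
import math

# Example dataset: 16 (X, Y) observations
x = [2, 2.5, 3.2, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 19, 20]
y = [100, 125, 130, 140, 180, 185, 200, 205, 345, 240, 280, 290, 330, 335, 500, 350]


def fit_line(xs, ys):
    """Ordinary least squares for Yx = b1 + b2*X; returns (b1, b2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((a - x_bar) ** 2 for a in xs)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    b2 = sxy / sxx
    return y_bar - b2 * x_bar, b2


def studentized_residuals(xs, ys, p=2):
    """Internally studentized residuals: e_i / sqrt(MSE * (1 - LV_i))."""
    n = len(xs)
    b1, b2 = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((a - x_bar) ** 2 for a in xs)
    residuals = [b - (b1 + b2 * a) for a, b in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - p)   # SSE / (n - p)
    result = []
    for a, e in zip(xs, residuals):
        lv = 1.0 / n + (a - x_bar) ** 2 / sxx        # leverage of this observation
        result.append(e / math.sqrt(mse * (1.0 - lv)))
    return result


r = studentized_residuals(x, y)
unusual = [i + 1 for i, ri in enumerate(r) if abs(ri) > 2.0]

for obs_id, ri in enumerate(r, start=1):
    print(f"{obs_id:2d}  e* = {ri:6.2f}")
print("Observations with |e*| > 2:", unusual)
```

Small differences from the slide values in the last decimal place can arise because the slide tables carry rounded intermediate figures.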
Cook's Distance (Cook's D) is a statistic that is commonly used to determine if an observation is influential.
• It measures the distance an observation would be from a regression equation built with this observation omitted from the dataset.

$$D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{\text{Leverage}_i}{(1 - \text{Leverage}_i)^2}$$

  p = # of population parameters in the equation
  MSE = MSE from the equation with all the observations
If Cook's D > 50th percentile of the F distribution for (p, n-p) degrees of freedom, then the observation is considered influential.

Outliers with Respect to Yx
Observations Influencing the Regression Coefficients
1. Calculation of the Cook's D statistic for observation #16, with p = 2 and MSE = 2,189:

$$D_{16} = \frac{(-102.77)^2}{2 \times 2{,}189} \cdot \frac{0.3573}{(1 - 0.3573)^2} = 2.412 \times 0.865 = 2.087$$

2. Look up the 50th percentile of the F distribution for (p, n-p) degrees of freedom:
• F distribution (2, 16-2) degrees of freedom = F distribution (2, 14) = 0.729
  – Excel's F.INV function provided this reference value for F (α = 0.50, numerator = 2, denominator = 14).
Therefore, evaluating observation #16: Cook's D > F (0.50, 2, 14), since 2.087 > 0.729.
Cook's D indicates that obs. #16 is influential (aka "an unusual value").

What to Do if You Find an Outlier
Part 1: Evaluate Outlier with respect to X or Y
A. Investigation
• Do you have the right value for the observation?
• Has the observation been normalized correctly?
• Is the observation part of the population?
• How different is the outlier?
• Were there any unusual events that impacted the value of the observation?
B. Actions based upon results of Investigation
• Correct data entry errors
• Improve normalization process
• Remove data point if not part of population
• Determine if unusual program events make a difference
Part 2: Outlier with respect to Yx (note: do this after completing Part 1)
A. Investigation
• Did you choose the correct functional form?
• Are there any omitted cost drivers?
• Were the same criteria applied to all outliers?
B. Actions based upon results of Investigation
• Add another cost driver and/or choose another functional form
• Dampen or lessen Yx influence by transforming X or Y data
• Create and compare two regression equations:
  – One with and one without the outlier(s)

Other Outlier Detection Methods
• Median and Median Absolute Deviation Method (MAD)
  – For this outlier detection method, the median of the residuals is calculated. Then, the difference is calculated between each historical value and this median. These differences are expressed as their absolute values, and a new median is calculated and multiplied by an empirically derived constant to yield the MAD.
  – If a value is a certain number of MAD away from the median of the residuals, that value is classified as an outlier. The default threshold is 3 MAD.
  • This method is generally more effective than the mean and standard deviation method for detecting outliers, but it can be too aggressive in classifying values that are not really extremely different. Also, if more than 50% of the data points have the same value, MAD is computed to be 0, so any value different from the residual median is classified as an outlier.
• Median and Interquartile Deviation Method (IQD)
  – For this outlier detection method, the median of the residuals is calculated, along with the 25th percentile and the 75th percentile. The difference between the 25th and 75th percentile is the IQD. Then, the difference is calculated between each historical value and the residual median. If the historical value is a certain number of IQD away from the median of the residuals, that value is classified as an outlier.
  – The default threshold is 2.22, which is equivalent to 3 standard deviations or MADs.
  • This method is somewhat susceptible to influence from extreme outliers, but less so than the mean and standard deviation method. Box plots are based on this approach.
The median and interquartile deviation method can be used for both symmetric and asymmetric data.

Recap / Conclusion
• The main reason for outlier analysis is to identify whether one or more observations have an unusual amount of influence on the regression.
• Outlier analysis calculations are done with respect to X, Y and Yx.
• Because there is no consensus on what is a "true" outlier on a single metric, it is good practice to calculate and account for all metrics:
  – # of Standard Deviations with respect to actual X's and actual Y's
  – # of Standard Errors with respect to calculated Yx's
  – Leverage Value (LV) with respect to actual X's (... get same result wrt Yx's)
  – Residual (denoted ei) and Studentized Residual (denoted ei*)
  – Cook's Distance ("Cook's D") = a function of ei, Mean Squared Error and LV
• Identifying outliers is typically just half the effort; what to do if you find an outlier can and should be handled on a case-by-case basis.
• This "mini-lesson" covered fundamental methods. Keep in mind that there are several other outlier analysis methods out there.
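The Cook's D check named in the recap (a function of ei, MSE and LV, compared against the 50th percentile of F(p, n-p)) can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the standard form of Cook's D with (1 - LV)^2 in the denominator, and it obtains the F median from a closed form that holds only when the numerator degrees of freedom equal 2, as in this example; for other values of p a statistics library would be needed. Function names are hypothetical. With these inputs obs. #16 evaluates to roughly 2.09; leaving the (1 - LV) term unsquared would give about 1.34 instead, and either value exceeds 0.729.

```python
# Example dataset: 16 (X, Y) observations
x = [2, 2.5, 3.2, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 19, 20]
y = [100, 125, 130, 140, 180, 185, 200, 205, 345, 240, 280, 290, 330, 335, 500, 350]


def cooks_distance(xs, ys, p=2):
    """Cook's D per observation: e^2 / (p * MSE) * LV / (1 - LV)^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((a - x_bar) ** 2 for a in xs)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    b2 = sxy / sxx                      # slope of Yx = b1 + b2*X
    b1 = y_bar - b2 * x_bar             # intercept
    residuals = [b - (b1 + b2 * a) for a, b in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for a, e in zip(xs, residuals):
        lv = 1.0 / n + (a - x_bar) ** 2 / sxx
        distances.append(e ** 2 / (p * mse) * lv / (1.0 - lv) ** 2)
    return distances


def f_median_numerator_df_2(denominator_df):
    """50th percentile of F(2, d2). Valid ONLY for numerator df = 2,
    where the CDF is 1 - (1 + 2x/d2)^(-d2/2)."""
    return (denominator_df / 2.0) * (2.0 ** (2.0 / denominator_df) - 1.0)


p = 2
n = len(x)
d = cooks_distance(x, y, p)
f50 = f_median_numerator_df_2(n - p)
influential = [i + 1 for i, di in enumerate(d) if di > f50]

print(f"F(0.50; {p}, {n - p}) = {f50:.3f}")
for obs_id, di in enumerate(d, start=1):
    print(f"{obs_id:2d}  Cook's D = {di:.3f}")
print("Influential observations:", influential)
```

The closed-form median keeps the sketch free of third-party dependencies; it reproduces the 0.729 reference value that the presentation obtained from Excel's F.INV.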