Regression Analysis
Defense Resources Management Institute

Unscheduled Maintenance Issue:
- 36 flight squadrons
- Each experiences unscheduled maintenance actions (UMAs)
- Each UMA costs $1000 to repair, on average

You’ve got the Data… Now What?
Unscheduled Maintenance Actions (UMAs)

Sq    Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
101    36   53   51   61   63   54   50   65   62   51   68   45
104    60   42   56   63   39   65   63   67   66   52   59   60
108    53   61   59   87   61   46   52   85   84   75   78   68

What do you want to know?
- How many UMAs will there be next month?
- What is the average number of UMAs?

Sample Mean
\[ \bar{x} = \frac{\sum x_i}{n} = 60 \]

Sample Standard Deviation
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = 12.05 \]

UMA Sample Statistics

Statistic                 UMAs
Mean                      60
Standard Error of Mean    2.01
Median                    60.5
Mode                      61
Standard Deviation        12.05
Minimum                   36
Maximum                   87
Count                     36

UMAs Next Month: 95% Confidence Interval
\[ \bar{x} \pm 2s = 60 \pm 2(12) \quad\Rightarrow\quad 36 \le x \le 84 \]

Average UMAs: 95% Confidence Interval
\[ 60 \pm 2\left(\frac{12}{\sqrt{36}}\right) \quad\Rightarrow\quad 56 \le \mu \le 64 \]

Model: Cost of UMAs for one squadron
- If the cost per UMA = $1000, the expected cost for one squadron = 60 * $1000 = $60,000

Model: Total Cost of UMAs
- Expected cost for all squadrons = 60 * $1000 * 36 = $2,160,000
- How confident are we about this estimate?

[Figure: normal curve from -3 to 3 standard units; areas .0215, .1359, .3413, .3413, .1359, .0215; ~95% within 2 units of the mean]
- mean = 60
- standard error = 12/\sqrt{36} = 2

[Figure: the same curve with the axis in UMAs: ~56, ~58, 60, ~62, ~64 (1 standard unit = 2); ~95% between ~56 and ~64]

95% Confidence Interval on our estimate of UMAs and costs
- 60 ± 2(2) = [56, 64]
- low cost: 56 * $1000 * 36 = $2,016,000
- high cost: 64 * $1000 * 36 = $2,304,000

What do you want to know?
- How many UMAs will there be next month?
- What is the average number of UMAs?
- Is there a relationship between UMAs and some other variable that may be used to predict UMAs?
- What is that relationship?

Relationships
What might be related to UMAs?
- Pilot experience?
- Flight hours?
- Sorties flown?
- Mean time to failure (for specific parts)?
- Number of landings / takeoffs?
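The sample statistics, the two 95% intervals, and the cost range from the slides above can be reproduced with a short Python sketch. The variable names are illustrative assumptions, not part of the original deck. The multiplier of 2 follows the slides' rounding of the 95% value, and the data are the 36 monthly UMA counts as transcribed from the table.

```python
import math
import statistics

# UMA counts transcribed from the slide table (3 rows x 12 months = 36 values).
umas = [
    36, 53, 51, 61, 63, 54, 50, 65, 62, 51, 68, 45,  # Sq 101
    60, 42, 56, 63, 39, 65, 63, 67, 66, 52, 59, 60,  # Sq 104
    53, 61, 59, 87, 61, 46, 52, 85, 84, 75, 78, 68,  # Sq 108
]
COST_PER_UMA = 1000  # dollars per UMA, as stated on the slides
N_SQUADRONS = 36     # number of squadrons, as stated on the slides
MULTIPLIER = 2       # the slides round the 95% multiplier to 2

n = len(umas)
mean = statistics.mean(umas)
sd = statistics.stdev(umas)   # sample standard deviation (divides by n - 1)
sem = sd / math.sqrt(n)       # standard error of the mean

# Interval for a single month's UMAs: mean +/- 2 standard deviations.
next_month = (mean - MULTIPLIER * sd, mean + MULTIPLIER * sd)

# Interval for the average number of UMAs: mean +/- 2 standard errors.
average = (mean - MULTIPLIER * sem, mean + MULTIPLIER * sem)

# Cost range implied by the interval on the average.
low_cost, high_cost = [bound * COST_PER_UMA * N_SQUADRONS for bound in average]

print(f"mean = {mean}, s = {sd:.2f}, standard error = {sem:.2f}")
print(f"UMAs next month: {next_month[0]:.0f} to {next_month[1]:.0f}")
print(f"average UMAs:    {average[0]:.0f} to {average[1]:.0f}")
print(f"total cost:      ${low_cost:,.0f} to ${high_cost:,.0f}")
```

Because the sketch uses the unrounded standard deviation (about 12.05 rather than 12), the cost bounds differ slightly from the rounded slide figures of $2,016,000 and $2,304,000.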
Regression:
- To estimate the expected or mean value of UMAs for next month:
  - look for a linear relationship between UMAs and a “predictive” variable
  - If a linear relationship exists, use regression analysis

Regression analysis:
- describes and evaluates relationships between one variable (dependent or explained variable), and one or more other variables (called the independent or explanatory variables).

What is a good estimating variable for UMAs?
- quantifiable
- predictable
- logical relationship with dependent variable
- must be a linear relationship: \( Y = a + bX \)

Sorties

Sq    Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
101   100  120  114  132  146  124  110  138  140  114  157  106
104   130  106  124  140  100  146  142  141  148  118  128  130
108   122  134  126  190  136  110  120  196  184  154  172  157

Pilot Experience

Sq    Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
101   6.06  2.81  3.37  3.87  4.22  6.67  2.61  1.96  2.96  2.45  3.29  3.73
104   4.61  2.45  4.65  5.71  7.23  3.01  2.53  1.54  4.49  1.73  4.81  5.17
108   1.11  5.75  4.9   3.59  6.88  1.17  2.59  5.87  7.28  7.79  5.87  2.47

Sample Statistics

Statistic                 Sorties   Exp
Mean                      135       4.06
Standard Error of Mean    3.99      0.31
Median                    131       3.80
Mode                      100       #N/A
Standard Deviation        23.92     1.84
Minimum                   100       1.11
Maximum                   196       7.79
Count                     36        36

Describing the Relationship
- Is there a relationship?
  - Do the two variables (UMAs and sorties or experience) move together?
  - Do they move in the same direction or in opposite directions?
- How strong is the relationship?
  - How closely do they move together?

[Figures: example scatter plots of Y against X, titled Positive Relationship, Strong Positive Relationship, Negative Relationship, Strong Negative Relationship, and No Relationship]

Relationship?
[Figure: scatter plot of Y against X]

Correlation Coefficient
- Statistical measure of how closely two variables are moving together in a coordinated fashion
- Measures strength and direction
- Value ranges from -1.0 to +1.0
- +1.0 indicates “perfect” positive linear relation
- -1.0 indicates “perfect” negative linear relation
- 0 indicates no linear relation between the two variables

Correlation Coefficient
\[ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[ n \sum x_i^2 - \left( \sum x_i \right)^2 \right] \left[ n \sum y_i^2 - \left( \sum y_i \right)^2 \right]}} \]

Sorties vs. UMAs
[Figure: scatter plot of UMAs against Sorties]
r = .9788

Experience vs. UMAs
[Figure: scatter plot of UMAs against Pilot Experience]
r = .1896

Correlation Matrix

Correlation   UMAs        Sorties    Exp
UMAs          1
Sorties       0.9787613   1
Exp           0.1895905   0.198641   1

A Word of Caution...
- Correlation does NOT imply causation
- It simply measures the coordinated movement of two variables
- Variation in two variables may be due to a third common variable
- The observed relationship may be due to chance alone

What is the Relationship?
- In order to use the correlation information to help describe the relationship between two variables, we need a model
- The simplest one is a linear model: \( Y = a + bX \)

Fitting a Line to the Data
[Figure: scatter plot of Y against X]

One Possibility
[Figure: the scatter plot with one candidate line]
Sum of errors = 0

Another Possibility
[Figure: the scatter plot with a different candidate line]
Sum of errors = 0

Which is Better?
- Both have sum of errors = 0
- Compare sum of absolute errors:

       Y    Y1    Error   Abs err   Y2    Error   Abs err
       8    6      2      2         2      6      6
       1    5     -4      4         5     -4      4
       6    4      2      2         8     -2      2
       4    5.5   -1.5    1.5       3.5    0.5    0.5
       6    4.5    1.5    1.5       6.5   -0.5    0.5
Sum                0      11               0      13

Fitting a Line to the Data
[Figure: scatter plot of Y against X]

One Possibility
[Figure: the scatter plot with one candidate line]
Sum of absolute errors = 6

Another Possibility
[Figure: the scatter plot with a different candidate line]
Sum of absolute errors = 6

Which is Better?
- Sums of the absolute errors are equal
- Compare sum of errors squared:

       Y    Y1    Abs err   Sq err   Y2    Abs err   Sq err
       4    4     0         0        5.6   1.6       2.56
       7    3     4         16       3.8   3.2       10.24
       2    2     0         0        2     0         0
       5    3.5   1.5       2.25     4.7   0.3       0.09
       2    2.5   0.5       0.25     2.9   0.9       0.81
Sum               6         18.5           6         13.7

The Correct Relationship: \( Y = a + bX + U \)
- systematic part: \( a + bX \)
- random part: \( U \)
[Figure: Y (50 to 100) against X (100 to 130), with points scattered around the line]

Least-Squares Method
- Penalizes large absolute errors
- Slope:
\[ b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} \]
- Y-intercept:
\[ a = \bar{Y} - b\bar{X} \]

Assumptions
- Linear relationship: \( Y = a + bX + U \)
- Errors are random and normally distributed with mean = 0 and variance = \( \sigma^2 \)
  - Supported by Central Limit Theorem

Least Squares Regression for Sorties and UMAs
[Figure: scatter plot of UMAs against Sorties with the fitted line]

Regression Calculations

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.978761339
R Square             0.957973758
Adjusted R Square    0.956737692
Standard Error       2.505836188
Observations         36

ANOVA
             df   SS           MS            F             Significance F
Regression    1   4866.50669   4866.50669    775.0183246   5.51636E-25
Residual     34   213.49331    6.279215001
Total        35   5080

            Coefficients    Standard Error   t Stat          P-value       Lower 95%      Upper 95%
Intercept   -6.542935597    2.426476306      -2.696476195    0.01082052    -11.4741255    -1.611745688
Sorties      0.492910634    0.017705663      27.83915093     5.51636E-25    0.456928421    0.528892848

Sorties vs. UMAs
[Figure: scatter plot of UMAs against Sorties with the fitted line]
\[ Y = -6.54 + .49X \]

Regression Calculations: Confidence in the predictions
From the SUMMARY OUTPUT:

Standard Error       2.505836188
Observations         36

Confidence Interval for Estimate
[Figure: UMAs against Sorties (90 to 200) with the fitted line and interval bands]
\[ Y = a + bX \pm t_{\alpha/2}\, s_e \]

95% Confidence Interval for the model
[Figure: Y against X with the interval around the fitted line]

Testing Model Parameters
- How well does the model explain the variation in the dependent variable?
- Does the independent variable really seem to matter?
- Is the intercept constant statistically significant?

Variation
[Figure: variation in UMAs against Sorties (90 to 200)]

Coefficient of Determination
\[ R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} \]
- Values between 0 and 1
- \( R^2 = 1 \) when all data on line (r = ±1)
- \( R^2 = 0 \) when no correlation (r = 0)

Regression Calculations: How well does the model explain the variation?
From the SUMMARY OUTPUT:

R Square             0.957973758

             df   SS
Regression    1   4866.50669
Residual     34   213.49331
Total        35   5080

Does the Independent Variable Matter?
\( Y = a + bX \)
- If sorties do not help predict UMAs, we expect b = 0
- If b is not 0, is it statistically significant?

Regression Calculations: Does the Independent Variable Matter?
From the SUMMARY OUTPUT:

            Coefficients    Standard Error   t Stat          P-value       Lower 95%      Upper 95%
Intercept   -6.542935597    2.426476306      -2.696476195    0.01082052    -11.4741255    -1.611745688
Sorties      0.492910634    0.017705663      27.83915093     5.51636E-25    0.456928421    0.528892848

95% Confidence Interval for the slope (b)
[Figure: Y against X, with Mean of Y and Mean of X marked]

Confidence Interval for Slope
[Figure: UMAs against Sorties (90 to 200)]

Is the Intercept Statistically Significant?
From the SUMMARY OUTPUT:

            Coefficients    Standard Error   t Stat          P-value       Lower 95%      Upper 95%
Intercept   -6.542935597    2.426476306      -2.696476195    0.01082052    -11.4741255    -1.611745688

Confidence Interval for Y-intercept
[Figure: UMAs against Sorties (90 to 210)]

Basic Steps of Regression Analysis
- Formulate the model
- Plot scatter diagram for visual inspection
- Compute correlation coefficient
- Fit the regression line
- Test the model

Factors affecting estimation accuracy
- Sample size (larger is better)
- Range of X values (wider is better)
- Standard deviation of U (smaller is better)

Uses and Limitations of Regression Analysis
- Identifying relationships
  - Not necessarily cause
  - May be due to chance only
- Forecasting future outcomes
  - Only valid over the range of the data
  - Past may not be good predictor of future

Common pitfalls in regression
- Failure to draw scatter diagrams
- Omitting important variables from the model
- The “two point” phenomenon
- Unfounded claims of model sophistication
- Insufficient attention to interval estimates and predictions
- Predicting too far outside of known range

Lines can be deceiving...
[Figure: X Variable 1 Line Fit Plot, Y against X Variable 1]
\( R^2 = .6662 \)

Nonlinear Relationship
[Figure: Y against X with a fitted curve]
\( y = -0.1267x^2 + 2.7808x - 5.9957 \), \( R^2 = 1 \)

Best fit?
[Figure: X Variable 1 Line Fit Plot, Y against X Variable 1]

Misleading data
[Figure: X Variable 1 Line Fit Plot, Y against X Variable 1]

Summary
- Regression Analysis is a useful tool
  - Helps quantify relationships
- But be careful
  - Does not imply cause and effect
  - Don’t go outside range of data
  - Check linearity assumptions
  - Use common sense!

Non-linear relationship between output and cost
[Figure: Cost against Output]
r = 0.0
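As a worked illustration of the basic steps (compute the correlation coefficient, fit the regression line, test the model), here is a minimal Python sketch of least-squares regression of UMAs on sorties. It assumes the UMA and sorties tables as transcribed from the slides; the variable names are illustrative, not part of the deck. Expect values close to, though not necessarily identical to, the printed SUMMARY OUTPUT (r = .9788, slope = .49), since the tables here are a transcription.

```python
import math

# Monthly values transcribed from the slide tables (Sq 101, 104, 108).
umas = [
    36, 53, 51, 61, 63, 54, 50, 65, 62, 51, 68, 45,
    60, 42, 56, 63, 39, 65, 63, 67, 66, 52, 59, 60,
    53, 61, 59, 87, 61, 46, 52, 85, 84, 75, 78, 68,
]
sorties = [
    100, 120, 114, 132, 146, 124, 110, 138, 140, 114, 157, 106,
    130, 106, 124, 140, 100, 146, 142, 141, 148, 118, 128, 130,
    122, 134, 126, 190, 136, 110, 120, 196, 184, 154, 172, 157,
]

n = len(umas)
x_bar = sum(sorties) / n
y_bar = sum(umas) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in sorties)
s_yy = sum((y - y_bar) ** 2 for y in umas)   # total variation in Y
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sorties, umas))

# Step: compute the correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# Step: fit the regression line Y = a + bX by least squares.
b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # Y-intercept

# Step: test the model.
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(sorties, umas))
r_squared = 1 - residual_ss / s_yy             # explained / total variation
std_error = math.sqrt(residual_ss / (n - 2))   # standard error of the estimate
se_b = std_error / math.sqrt(s_xx)             # standard error of the slope
t_b = b / se_b                                 # t statistic for H0: b = 0

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
print(f"Y = {a:.2f} + {b:.4f} X")
print(f"standard error = {std_error:.3f}, t(slope) = {t_b:.2f}")
```

A large t statistic for the slope is the numerical counterpart of the slide's question of whether the independent variable matters. As the summary cautions, the fitted line should only be used for sortie counts inside the observed range (100 to 196).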