Regression Analysis: How to Develop and Assess a CER (v1.1)

"All models are wrong, but some are useful." – George Box
"In mathematics, context obscures structure. In data analysis, context provides meaning." – George Cobb
"Mathematical theorems are true; statistical methods are sometimes effective when used with skill." – David Moore

Unit III - Module 8
© 2002-2010 SCEA. All rights reserved.

Regression Overview

• Key Ideas
  – Y = a + bX and its estimate Ŷ = â + b̂X
  – Correlation
  – Best fit / minimum error
  – Homoscedasticity
  – Statistical significance
  – Quantification of uncertainty
• Analytical Constructs
  – OLS Regression
  – Analysis of Variance (ANOVA)
  – Confidence Intervals
  – Linear algebra
• Practical Applications
  – CER Development
  – Learning Curves
• Related Topics
  – Parametrics
  – Distributions (Normal, Chi-square, t, F)
  – Hypothesis testing
  – Risk Analysis

Bivariate Data Analysis

• One independent variable and one dependent variable (i.e., y is a function of x)
• Visual display of information – scatter plot, residual plots ("What does it look like?")
• Measures of central tendency – Ŷ from the regression equation ("What's your best guess?")
• Measures of variability – Standard Error of the Estimate (SEE), R² ("How much remains unexplained?")
• Measures of uncertainty – confidence and prediction intervals ("How precise are you?")
• Statistical tests – F test, t test, ANOVA ("How can you be sure?")

Definition of Regression

• Regression analysis is used to describe a statistical relationship between variables
• Specifically, it is the process of estimating the "best fit" parameters of a specified function that relates a dependent variable to one or more independent variables (including implicit uncertainty)
[Figure: a scatter plot of (x, y) data, and the same data with the fitted regression line y = a + bx]
Regression Analysis in Cost Estimating

• If the dependent variable is a cost, the regression equation is often referred to as a Cost Estimating Relationship, or CER
  – The independent variable in a CER is often called a cost driver
• Examples of cost drivers (single driver):
  – Aircraft design cost – number of drawings
  – Software cost – lines of code
  – Power cable cost – linear feet
• A CER may have multiple cost drivers; for example, power cable cost driven by both linear feet and power

Preliminary Concepts – Correlation

• Correlation is quantified by a correlation coefficient, r
  – r can range from -1 to +1
  – Tip: the stronger the correlation, the farther r is from zero (compare scatter plots for r = -1, -0.8, 0, +0.8, +1)
• The presence of correlation is what leads you to regression analysis; a scatter plot is the best way to detect it
• Explicit estimation of r will be addressed later

Preliminary Concepts – Types of Models

• A mathematical function must be specified before regression analysis is performed
  – The specified function is called a regression model
  – Many types of models may be considered:
    – linear: y = a + bx
    – quadratic: y = a + bx + cx²
    – power: y = ax^b (shape depends on whether b < 0, 0 < b < 1, or b > 1)
    – exponential: y = ae^(bx)
    – logarithmic: y = a + b ln x
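The correlation coefficient can be computed directly from its definition. A minimal sketch in Python; the (X, Y) pairs are the toy data set used in the worked regression example later in this module:

```python
# Correlation coefficient r, computed from its textbook definition.
# Data: the five-point toy data set from the worked example in this module.
from math import sqrt

xs = [4, 7, 8, 12, 16]
ys = [5, 5, 10, 7, 13]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Centered sums of squares and cross-products
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

r = sxy / sqrt(sxx * syy)
print(round(r, 2))  # 0.79 -- the "Multiple R" in the Excel output shown later
```

The same value appears as "Multiple R" in the Excel regression output later in the module, and r² = 0.62 matches the reported R Square.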
Key Concept of Regression

• The regression procedure uses the "method of least squares" to find the "best fit" parameters of a specified function
  – We will focus on Ordinary Least Squares (OLS) regression¹
  – The idea is to minimize the sum of squared deviations (called "errors" or "residuals") between the Y data and the regression equation y = a + bx
  – This sum is called the "Sum of Squared Errors," or SSE
• In other words: find the equation that minimizes the sum of the squared vertical distances from the data points to the line

¹ OLS regression is the most commonly used technique and is the foundation for understanding all regression techniques. To learn about variations and more advanced techniques, see Resources and Advanced Topics.

Finding the Regression Equation

• Problem: find â and b̂ such that the SSE is minimized for ŷ = â + b̂x
• The SSE is minimized when:
  – the slope equals the correlation coefficient r multiplied by the ratio of standard deviations, sY/sX, and
  – the line passes through the means of X and Y¹
• This formulation reduces to the following set of equations:

  b̂ = (ΣXY − nX̄Ȳ) / (ΣX² − nX̄²)
  â = Ȳ − b̂X̄

  where X and Y are the raw values of the two variables and X̄ and Ȳ are their means

¹ The geometry of this formulation is provided under Related and Advanced Topics.

Finding the Regression Equation: Example

Data (n = 5):

  X:   4   7   8  12  16
  Y:   5   5  10   7  13
  XY: 20  35  80  84 208
  X²: 16  49  64 144 256

  ΣXY = 427, ΣX² = 529
  X̄ = average of X data = 9.4
  Ȳ = average of Y data = 8

Calculations:

  b̂ = (ΣXY − nX̄Ȳ) / (ΣX² − nX̄²)
  b̂ = (427 − 5·9.4·8) / (529 − 5·9.4²)
  b̂ = 51 / 87.2 ≈ 0.585 (rounded to 0.6 in what follows)
  â = Ȳ − b̂X̄
  â = 8 − 0.585·9.4 ≈ 2.5

  Resulting equation: Ŷ = 2.5 + 0.6X

Statistical Error and Residual Analysis

• In addition to the equation we have just found, we must describe the statistical error – the "fuzz" or "noise" – in the data
• This is done by adding an error term (ε) to the basic regression equation to model the residuals:

  y = a + bx + ε

• There are two key assumptions in OLS regression regarding the error term:
  1. It is independent of X
  2. It is normally distributed with a mean of zero and constant variance for all X – this is called homoscedasticity
• These assumptions need to be checked, as the entire analysis hinges on their validity
• A "residual plot" is useful in determining whether these assumptions apply to the data
• The error about the "true underlying" line generates a data cloud whose best-fit line is slightly different
• The error term is what makes regression more than just curve fitting

Residual Analysis: Example

• Create a residual plot to verify that the OLS assumptions hold: for each data point, plot its distance from the regression line (the residual) against x
  – Model: y = 2.5 + 0.6x + ε
• Questions to ask:
  1. Does the residual plot show independence from x? (yes)
  2. Are the points symmetric about the x-axis with constant variability for all x? (yes)
• The OLS assumptions are reasonable. This tells us:
  – our assumption of linearity is reasonable
  – the error term can be modeled as Normal with mean of zero and constant variance

Example Residual Patterns

• Good residual pattern: independent of x, constant variation
• Residuals do not have constant variation: a Weighted Least Squares or multiplicative-error approach should be examined
• Residuals not independent of x, with curvature: a curvilinear model is probably more appropriate
• Residuals not independent of x, with a shift: e.g., in learning curve analysis, this pattern might indicate loss of learning or injection of new work
• Tip: a residual plot is the primary way of indicating whether a non-linear model (and which one) might be appropriate
• Usually the residual plot provides enough visual insight to determine whether or not linear OLS regression is appropriate. If the picture is inconclusive, statistical tests exist to help determine whether the OLS assumptions hold¹

¹ See references to learn about Weighted Least Squares regression and statistical tests for residuals.
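The residual checks described above can be scripted. A minimal sketch using the toy data, with the fit recomputed so the block is self-contained; the assertions encode two algebraic properties of OLS residuals (they sum to zero and are orthogonal to x), while the constant-variance check would still be done visually on a plot:

```python
# Residual sanity checks for the toy problem.
xs = [4, 7, 8, 12, 16]
ys = [5, 5, 10, 7, 13]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx            # ~0.585 (the slides round this to 0.6)
a = y_bar - b * x_bar    # ~2.50

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print([round(e, 1) for e in residuals])  # [0.2, -1.6, 2.8, -2.5, 1.1]

# Algebraic properties of OLS residuals (when the model has an intercept):
assert abs(sum(residuals)) < 1e-9                             # mean zero
assert abs(sum(e * x for e, x in zip(residuals, xs))) < 1e-9  # orthogonal to x
```

The printed values match the "Residuals" column of the Excel RESIDUAL OUTPUT shown on the next slide.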
Excel Demo – Parameters and Residuals

SUMMARY OUTPUT

Regression Statistics
  Multiple R         0.79
  R Square           0.62
  Adjusted R Square  0.50
  Standard Error     2.5
  Observations       5

ANOVA
              df   SS     MS     F     Significance F
  Regression   1   29.8   29.8   4.9   0.11
  Residual     3   18.2   6.1
  Total        4   48

             Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
  Intercept  2.5           2.7             0.9     0.42     -6.1       11.1
  x          0.6           0.3             2.2     0.11     -0.3       1.4

Equation parameters: Ŷ = 2.5 + 0.6X

RESIDUAL OUTPUT
  Observation  Predicted y  Residuals  Standard Residuals
  1             4.8          0.2        0.1
  2             6.6         -1.6       -0.7
  3             7.2          2.8        1.3
  4             9.5         -2.5       -1.2
  5            11.9          1.1        0.5

Application Note

• Suppose our toy problem defines X and Y as follows:
  – Y is the cost of a seeker in a missile ($M); a seeker is a component of the missile that is used to detect a target
  – X is the weight of the seeker
• Assuming the regression is a good fit¹, the following is true:
  – The cost estimate for a new seeker with weight of 11 is 2.5 + 0.6·(11) = $9.1M
  – The result is the estimated mean of the distribution of all possible costs, which is assumed to be a Normal distribution²
• Tip: the mean is usually calculated in the cost model; the distribution is usually accounted for in the risk analysis

¹ Goodness of fit is described in the next section.
² Standard deviation of the cost distribution will be discussed later.
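The Excel summary and the seeker estimate above can be cross-checked in a few lines. A sketch using the same toy data; note that the displayed 2.5 and 0.6 are rounded (the unrounded fit is â ≈ 2.50, b̂ ≈ 0.585), so predicting with unrounded coefficients gives ≈ $8.9M, while the slide's $9.1M comes from the rounded equation:

```python
# Cross-check of the regression and the seeker cost estimate.
# X = seeker weight, Y = seeker cost ($M), per the Application Note.
xs = [4, 7, 8, 12, 16]
ys = [5, 5, 10, 7, 13]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx          # 0.585..., displayed as 0.6 in the Excel output
a = y_bar - b * x_bar  # 2.502..., displayed as 2.5

# Point estimate (the mean of the cost distribution) for weight = 11
estimate = a + b * 11
print(round(estimate, 2))  # 8.94 -- vs. 9.1 from the rounded 2.5 + 0.6*11
```

Either value is the estimated mean of the cost distribution at that weight; the small difference is purely display rounding.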
Analysis of Variance (ANOVA) – Measures of Variation

1. Total Sum of Squares (SST): the sum of the squared deviations between the data and the average, Σ(Y − Ȳ)²  [χ² distributed with n − 1 degrees of freedom]
2. Residual or Error Sum of Squares (SSE): the sum of the squared deviations between the data and the regression line, Σ(Y − Ŷ)² – "the unexplained variation"  [χ² distributed with n − k degrees of freedom]
3. Regression Sum of Squares (SSR): the sum of the squared deviations between the regression line and the average, Σ(Ŷ − Ȳ)² – "the explained variation"  [χ² distributed with k − 1 degrees of freedom]

  SST = SSE + SSR
  "total" = "unexplained" + "explained"

Uncertainty of Coefficients

• Each estimated coefficient has an associated standard error
  – As a result of the previous assumptions, the coefficients are normally distributed
  – The true coefficients then have a t distribution about the estimated values
• From the Excel output for Ŷ = 2.5 + 0.6X: the intercept 2.5 has standard error 2.7, and the slope 0.6 has standard error 0.3
• The standard error values are affected by the MSE as well as the variability within the x data points
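The ANOVA decomposition and the coefficient standard errors can be reproduced from the toy data. A sketch, with k = 2 estimated coefficients (intercept and slope); the standard-error formulas are the usual simple-regression ones, which reproduce the 2.7 and 0.3 in the Excel output:

```python
# ANOVA decomposition and coefficient standard errors for the toy problem.
from math import sqrt

xs = [4, 7, 8, 12, 16]
ys = [5, 5, 10, 7, 13]
n, k = len(xs), 2                  # k = number of estimated coefficients
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx
a = y_bar - b * x_bar
y_hat = [a + b * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)               # total
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained
print(round(sst, 1), round(sse, 1), round(ssr, 1))    # 48.0 18.2 29.8

mse = sse / (n - k)                            # 6.1 in the ANOVA table
se_b = sqrt(mse / sxx)                         # ~0.26, shown as 0.3
se_a = sqrt(mse * (1 / n + x_bar ** 2 / sxx))  # ~2.7
assert abs(sst - (sse + ssr)) < 1e-9           # SST = SSE + SSR
```

The printed sums of squares match the SS column of the ANOVA table (48, 18.2, 29.8).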
Uncertainty of the Estimate

• The estimated regression equation has a standard error of the estimate, or SEE
• The SEE is the estimated standard deviation of the Normal distribution that models the residuals:

  SEE = √MSE

• From the Excel output: MSE = 6.1, so SEE = √6.1 ≈ 2.5 (reported as "Standard Error")

Coefficient of Variation

• The coefficient of variation, or CV, is the ratio of the SEE to the mean of the dependent variable:

  CV = SEE / Ȳ

• The CV expresses the standard error of the estimate as a percent of the mean
• Example: from our sample problem, SEE = 2.5 and Ȳ = 8, so CV = 2.5 / 8 = 31%
• Tip: a CV of less than 15% is desirable
• The mean is the only value in the CV not found on the regression output

Statistical Significance in Regression

• Statistical significance means the probability is "acceptably low" that a stated hypothesis is true
  – The "acceptable" probability is referred to as a significance level, or α; typical significance levels are α = 0.01 and α = 0.05
  – In regression analysis, the standard hypothesis that is checked is that one or more variables has a coefficient with an actual value of zero
• The statistical tests of interest in regression analysis involve the x coefficient(s) in the regression equation y = a + bx
  – Tip: to say b = 0 implies there is no relationship between x and y
  – Are the statistics good enough to convince us that b is not zero? That is, is the probability that the hypothesis "b = 0" is true less than α?

The t Statistic

• For a regression coefficient, the determination of statistical significance is based on a t test
  – The test depends on the ratio of the coefficient's estimated value to its standard error, called a t statistic
• Example setup, with α = 0.05:
  – Hypotheses: H₀: b = 0 vs. Hₐ: b ≠ 0
  – Test statistic: t = estimated coefficient / standard error = 2.2, with p-value 0.11 (from the Excel output)
  – Decision rule: reject H₀ if the p-value is less than the chosen significance level (0.05)
  – Since 0.11 > 0.05, we cannot reject H₀; therefore this coefficient is not statistically significant
  – We cannot conclude there is a relationship between x and y

Calculating R²

• How much of the total variability is explained by adding x as an independent variable?
• Remember:
  – "total variation" = SST = Σ(Y − Ȳ)²
  – "explained variation" = SSR = Σ(Ŷ − Ȳ)²

  R² = explained variation / total variation = SSR / SST = 1 − SSE / SST

Toy problem calculation (Ŷ = 0.6X + 2.5, Ȳ = 8):

  X:          4     7     8    12    16
  Y:          5     5    10     7    13
  (Y − Ȳ)²:   9     9     4     1    25   → SST = 48
  Ŷ:        4.8   6.6   7.2   9.5  11.9
  (Ŷ − Ȳ)²:  10     2     1     2    15   → SSR ≈ 30

  R² = SSR / SST = 30 / 48 = 0.62

Confidence Intervals

Introduction to Confidence Intervals

• A confidence interval suggests to us that we are (1 − α)·100% confident that the true value of the random variable is contained within the calculated range*
• Confidence intervals are calculated for:
  – regression equation parameters
  – the regression equation at a given value of x
• The calculation combines the Student t distribution with the associated standard error. The general formula is:

  (mean estimate) ± t(α/2, df) · (standard deviation)

• "t" is the Student t distribution with n − k degrees of freedom; critical values can be looked up in a table or calculated in Excel

* This statement provides a general sense of what a confidence interval does for us, in concise language for ease of understanding. The specific statistical interpretation is: if many independent samples are taken where the levels of the predictor variable are the same as in the data set, and a (1 − α)·100% confidence interval is constructed for each sample, then (1 − α)·100% of the intervals will contain the true value of the parameter (or the value of the dependent variable at a given value of x, depending on which interval is being calculated).
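The general formula above can be applied to the toy problem to get a confidence interval for the mean of Y at a chosen X. A sketch; the t critical value 3.182 for df = n − k = 3 at α = 0.05 is taken from a t table rather than computed, and x₀ = 11 is the seeker weight used earlier:

```python
# 95% confidence interval for the mean of Y at X = x0, toy problem.
from math import sqrt

xs = [4, 7, 8, 12, 16]
ys = [5, 5, 10, 7, 13]
n, k = len(xs), 2
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
see = sqrt(sse / (n - k))          # ~2.46, the Excel "Standard Error"

t_crit = 3.182                     # t(0.025, df=3), from a t table
x0 = 11
y0 = a + b * x0
half_width = t_crit * see * sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
ci = (y0 - half_width, y0 + half_width)
print(round(ci[0], 1), round(ci[1], 1))  # roughly 5.2 12.7
```

The (x₀ − X̄)² term is what makes the interval widen as x₀ moves away from the mean, producing the hyperbolic confidence band on the next slide.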
Confidence Bands

• When we plot the confidence intervals for all X, we get a pair of hyperbolic curves called the "confidence band"
• Toy problem: the 95% band is Ŷ ± 3.18 standard deviations (the t critical value for 3 degrees of freedom)
• Note that the interval becomes larger as X moves away from the mean

Prediction Intervals

• It is more common that we would want a confidence interval for a new observation of Y at a given value of X
  – This is referred to as a prediction interval, and has the following formula:

  Ŷ ± t(α/2, df) · SEE · √(1 + 1/n + (X₀ − X̄)² / (ΣX² − nX̄²))

• The t distribution critical values remain the same as previously calculated
• Tip: the prediction interval provides an interval for y (a new observation, with its own standard deviation about the mean at a fixed X) as opposed to ŷ (the mean)
• Note that prediction intervals are wider than confidence intervals

Confidence Intervals when n is Large

• When n is large (at least 30), the t distribution approaches a Normal distribution
• This means critical values from the Normal can be used, which produce the following standard confidence intervals:
  – Ŷ ± 1 standard deviation: 68%
  – Ŷ ± 2 standard deviations: 95%
  – Ŷ ± 3 standard deviations: 99.7%
• e.g., Ŷ ± 1 standard deviation provides a 68% confidence interval for the mean
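The prediction-interval formula can be compared numerically with the confidence interval at the same point. A sketch for the toy problem at x₀ = 11, again with the table value t(0.025, 3) = 3.182; the extra "1 +" under the square root accounts for the scatter of an individual new observation about the mean:

```python
# Prediction interval vs. confidence interval at X = x0, toy problem.
from math import sqrt

xs = [4, 7, 8, 12, 16]
ys = [5, 5, 10, 7, 13]
n, k = len(xs), 2
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
see = sqrt(sse / (n - k))

t_crit = 3.182                    # t(0.025, df=3), from a t table
x0 = 11
ci_half = t_crit * see * sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
pi_half = t_crit * see * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
print(round(ci_half, 1), round(pi_half, 1))  # the PI is much wider

assert pi_half > ci_half          # a new observation is always less certain
```

With only five data points the prediction interval is wide enough to be of limited practical use, which is one reason small-sample CERs carry large risk ranges.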