Chapter 15: Model Building and Model Diagnostics
McGraw-Hill/Irwin. Copyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved.

Chapter Outline
15.1 The Quadratic Regression Model
15.2 Interaction
15.3 Logistic Regression (Optional)
15.4 Model Building, and the Effects of Multicollinearity
15.5 Improving the Regression Model I: Diagnosing and Using Information about Outlying and Influential Observations
15.6 Improving the Regression Model II: Transforming the Dependent and Independent Variables
15.7 Improving the Regression Model III: The Durbin-Watson Test and Dealing with Autocorrelation

The Quadratic Regression Model
• One useful form of linear regression is the quadratic regression model
• Assume we have n observations of x and y
• The quadratic regression model relating y to x is y = β0 + β1x + β2x² + ε
– β0 + β1x + β2x² is the mean value of the dependent variable y when the value of the independent variable is x
– β0, β1, and β2 are unknown regression parameters relating the mean value of y to x
– ε is an error term that describes the effects on y of all factors other than x and x²
• (The slide illustrating the quadratic model visually is not reproduced here)

A Note on the Quadratic Model
• Even though the quadratic model employs the squared term x² and, as a result, assumes a curved relationship between the mean value of y and x, this model is a linear regression model
• This is because β0 + β1x + β2x² expresses the mean value of y as a linear function of the parameters β0, β1, and β2
• As long as the mean value of y is a linear function of the regression parameters, we have a linear regression model

Example 15.1: The Gasoline Additive Case
• (The data and MINITAB output for this example are not reproduced here)
• An oil company wishes to find the value of x (units of additive ST-3000 per gallon) that maximizes predicted mileage
• Using calculus, the fitted curve ŷ = 25.7152 + 4.9762x – 1.01905x² is maximized where its derivative 4.9762 – 2(1.01905)x equals zero, i.e., at x = 4.9762/(2 × 1.01905) = 2.44
– Therefore, the oil company should blend 2.44 units of additive ST-3000 into each gallon
• The resulting (maximized) mileage is:
– ŷ = 25.7152 + 4.9762(2.44) – 1.01905(2.44)²
– ŷ = 31.7901 miles per gallon

More Variables
• We have only looked at the simple case where we have y and x
• That gave us the quadratic regression model y = β0 + β1x + β2x² + ε
• However, we are not limited to just two terms
• The following would also be a valid quadratic regression model: y = β0 + β1x1 + β2x1² + β3x2 + β4x3 + ε
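To make the fitting mechanics concrete, here is a minimal sketch of estimating a quadratic regression by ordinary least squares and locating the maximizing value of x. The data values are hypothetical stand-ins (the chapter's actual gasoline-additive observations are not reproduced in these slides); only numpy is assumed.

import numpy as np

# Hypothetical mileage data (x = units of additive, y = mpg);
# the chapter's actual gasoline-additive data are not reproduced here.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([25.8, 29.9, 32.0, 31.7, 29.4, 25.1])

# Design matrix [1, x, x^2]: the model is linear in b0, b1, b2
# even though it is curved in x.
X = np.column_stack([np.ones_like(x), x, x * x])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# For b2 < 0 the parabola opens downward; the mean response is
# maximized where the derivative b1 + 2*b2*x equals zero.
x_max = -b1 / (2.0 * b2)
y_max = b0 + b1 * x_max + b2 * x_max**2
print(f"b0={b0:.4f}, b1={b1:.4f}, b2={b2:.4f}, "
      f"x*={x_max:.2f}, max predicted y={y_max:.4f}")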
Interaction
• Multiple regression models often contain interaction variables
– These are variables formed by multiplying two independent variables together
– For example, x1·x2
• In this case, the x1·x2 variable would appear in the model along with both x1 and x2
• We use interaction variables when the relationship between the mean value of y and one of the independent variables depends on the value of another independent variable

Interaction Variable Example
• Consider a company that runs both radio and television ads for its products
• It is reasonable to assume that raising either ad amount would raise sales
• However, it is also reasonable to assume that the effectiveness of television ads depends, in part, on how often consumers hear the radio ads
• Thus, an interaction variable would be appropriate

Spotting Interactive Terms
• It is fairly easy to construct data plots to check for interaction when a careful experiment is carried out
• It is often not possible to construct the necessary plots with less structured data
• If an interaction is suspected, we can include the interaction term in the model and see if it is significant

Example 15.3: The Fresh Detergent Case
• Enterprise Industries produces Fresh liquid laundry detergent and would like to predict demand
• It gathers the following data:
– Demand (y)
– Price of Fresh (x1)
– Average industry price (x2)
– Advertising expenditure (x3)
– Price difference, i.e., x4 = x2 – x1
• In Example 15.2 we developed the model ŷ = 17.3244 + 1.3070x4 – 3.6956x3 + 0.3486x3²
• Since there might be interaction between x4 and x3, we wish to add an x4x3 term to the model
• (The data table, the Excel and MegaStat output, and the plot illustrating the interaction are not reproduced here)

A Note on Interactive Model Construction
• When an interaction term (say x1x2) is important to a model, it is the usual practice to leave the corresponding linear terms (x1 and x2) in the model no matter what their p-values
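As a sketch of how an interaction term enters the design matrix, the snippet below simulates demand data from a model of the Fresh Detergent form and fits it by least squares with the x4·x3 product included. The "true" coefficients, predictor ranges, and noise level are invented for illustration; they are not Enterprise Industries' actual data.

import numpy as np

# Hypothetical design: x3 = advertising, x4 = price difference.
# The Fresh data themselves are not reproduced in these slides.
rng = np.random.default_rng(0)
n = 30
x3 = rng.uniform(5.0, 7.0, n)          # advertising expenditure
x4 = rng.uniform(-0.2, 0.6, n)         # price difference x2 - x1
y = (17.3 + 1.3 * x4 - 3.7 * x3 + 0.35 * x3**2 + 0.5 * x4 * x3
     + rng.normal(0.0, 0.2, n))        # simulated demand

# The interaction column is just the elementwise product x4 * x3;
# the linear terms x4 and x3 stay in the model regardless of p-value.
X = np.column_stack([np.ones(n), x4, x3, x3**2, x4 * x3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b4, b3, b33, b43 =", np.round(coef, 4))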
Logistic Regression
• Logistic regression and least squares regression are very similar
– Both produce prediction equations
• The y variable is what makes logistic regression different
– With least squares regression, the y variable is a quantitative variable
– With logistic regression, it is usually a dummy 0/1 variable
• With large data sets, the y variable may be the probability that a set of observations has a dummy variable value of one

Regression Drawbacks When Using a Dummy Dependent Variable
• It is possible to have a predicted y value less than zero or greater than one
• One assumption is constant variance, but this is not possible with a dummy variable
– When 50 percent of the y's are ones, the variance is .25, its maximum value
– As the percentage of y's that are one approaches one or zero, the variance approaches zero
• Another assumption is that the error terms are normally distributed
– Since y can be only 0 or 1, this is hard to justify
• Logistic regression overcomes these drawbacks

Example: Price Reduction Coupons
• The x values are six coupon amounts; each coupon amount was sent to 50 people
• The y values are the number who responded
• The p value is the proportion of that group who responded; this will be the dependent variable

x   1    2    3    4    5    6
y   4    7    20   35   44   46
p   .08  .14  .40  .70  .88  .92

Logistic Curve
• (The plot of the fitted logistic curve is not reproduced here)

Logistic Example
• The logistic curve is p(x) = e^(b0 + b1x) / (1 + e^(b0 + b1x))
• p(x) denotes the probability that a household will redeem a coupon of amount x
• For these data, b0 = –3.7456 and b1 = 1.1109
– In logistic regression, these are computed using maximum likelihood estimation
• Substituting x = 1 through x = 6 gives the fitted probabilities for the six coupon amounts

General Logistic Regression Model
• p(x1, x2, …, xk) = e^(b0 + b1x1 + b2x2 + … + bkxk) / (1 + e^(b0 + b1x1 + b2x2 + … + bkxk))
• p(x1, x2, …, xk) is the probability that the event under consideration will occur when the values of the independent variables are x1, x2, …, xk
• The odds of the event occurring are p(x1, x2, …, xk) / (1 – p(x1, x2, …, xk))
– The probability that the event will occur divided by the probability that it will not occur
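The following sketch evaluates the fitted logistic curve at the six coupon amounts, using the coupon data and the maximum likelihood estimates b0 = –3.7456 and b1 = 1.1109 reported above. Only numpy is assumed; a package such as statsmodels would be used to compute the estimates themselves.

import numpy as np

# Coupon data from the slides: x = coupon amount, each sent to 50
# people; y = number who redeemed; p = y/50 observed proportion.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([4, 7, 20, 35, 44, 46], dtype=float)
p_obs = y / 50.0

# Maximum likelihood estimates reported in the slides.
b0, b1 = -3.7456, 1.1109

# Logistic curve: p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)).
p_hat = np.exp(b0 + b1 * x) / (1.0 + np.exp(b0 + b1 * x))

# Odds = p / (1 - p): probability of redeeming over not redeeming.
odds = p_hat / (1.0 - p_hat)
for xi, po, ph, od in zip(x, p_obs, p_hat, odds):
    print(f"x={xi:.0f}  observed p={po:.2f}  fitted p={ph:.3f}  odds={od:.3f}")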
Model Building and the Effects of Multicollinearity
• Multicollinearity is the condition where the independent variables are dependent on, related to, or correlated with each other
• Effects:
– Hinders our ability to use t statistics and p-values to assess the relative importance of predictors
– Does not hinder our ability to predict the dependent (or response) variable
• Detection:
– Scatter plot matrix
– Correlation matrix
– Variance inflation factors (VIF)

Variance Inflation Factors (VIF)
• The variance inflation factor for the jth independent (or predictor) variable xj is VIFj = 1 / (1 – Rj²)
• where Rj² is the multiple coefficient of determination for the regression model relating xj to the other predictors: xj = β0 + β1x1 + … + βj–1xj–1 + βj+1xj+1 + … + βkxk + ε

The Sales Territory Performance Case
Sales     Time    MktPoten   Adver     MktShare  Change  Accts   WkLoad  Rating
3669.88    43.10  74065.11    4582.88    2.51     0.34    74.86   15.05    4.9
3473.95   108.13  58117.30    5539.78    5.51     0.15   107.32   19.97    5.1
2295.10    13.82  21118.49    2950.38   10.91    -0.72    96.75   17.34    2.9
4675.56   186.18  68521.27    2243.07    8.27     0.17   195.12   13.40    3.4
6125.96   161.79  57805.11    7747.08    9.15     0.50   180.44   17.64    4.6
2134.94     8.94  37806.94     402.44    5.51     0.15   104.88   16.22    4.5
5031.66   365.04  50935.26    3140.62    8.54     0.55   256.10   18.80    4.6
3367.45   220.32  35602.08    2086.16    7.07    -0.49   126.83   19.86    2.3
6519.45   127.64  46176.77    8846.25   12.54     1.24   203.25   17.42    4.9
4876.37   105.69  42053.24    5673.11    8.85     0.31   119.51   21.41    2.8
2468.27    57.72  36829.71    2761.76    5.38     0.37   116.26   16.32    3.1
2533.31    23.58  33612.67    1991.85    5.43    -0.65   142.28   14.51    4.2
2408.11    13.82  21412.79    1971.52    8.48     0.64    89.43   19.35    4.3
2337.38    13.82  20416.87    1737.38    7.80     1.01    84.55   20.02    4.2
4586.95    86.99  36272.00   10694.20   10.34     0.11   119.51   15.26    5.5
2729.24   165.85  23093.26    8618.61    5.15     0.04    80.49   15.87    3.6
3289.40   116.26  26879.59    7747.89    6.64     0.68   136.58    7.81    3.4
2800.78    42.28  39571.96    4565.81    5.45     0.66    78.86   16.00    4.2
3264.20    52.84  51866.15    6022.70    6.31    -0.10   136.58   17.44    3.6
3453.62   165.04  58749.82    3721.10    6.35    -0.03   138.21   17.98    3.1
1741.45    10.57  23990.82     860.97    7.37    -1.63    75.61   20.99    1.6
2035.75    13.82  25694.86    3571.51    8.39    -0.43   102.44   21.66    3.4
1578.00     8.13  23736.35    2845.50    5.15     0.04    76.42   21.46    2.7
4167.44    58.54  34314.29    5060.11   12.88     0.22   136.58   24.78    2.8
2799.97    21.14  22809.53    3552.00    9.14    -0.74    88.62   24.96    3.9

• (The MINITAB correlation matrix and the MegaStat output of t statistics, p-values, and VIFs are not reproduced here)
• From the MegaStat output:
– Maximum VIFj = 5.639 (for Accts)
– Mean VIFj = 2.667
• This suggests the multicollinearity is probably not severe

Variance Inflation Factor Notes
• VIFj = 1 implies xj is not related to the other predictors
• A largest VIFj greater than ten suggests severe multicollinearity
• An average VIF substantially greater than one suggests severe multicollinearity

Impact of Multicollinearity
• Multicollinearity can hinder our ability to use the t statistics and related p-values to assess the importance of the independent variables
– Even when the multicollinearity itself is not severe
• With multicollinearity, the t statistic and p-value measure the additional importance of the independent variable xj over the combined importance of the other independent variables
• When two variables are multicollinear, they contribute redundant information
• This causes the resulting t statistic to be smaller than it would be if the variable were used alone

Comparing Regression Models on R², s, Adjusted R², and Prediction Interval
• Multicollinearity causes problems for evaluating the p-values of the model
• Therefore, we need to evaluate more than the additional importance of each independent variable
• We also need to evaluate how the variables work together
• One way to do this is to determine whether the overall model gives a high R² and adjusted R², a small s, and short prediction intervals

Effect of Adding an Independent Variable
• Adding any independent variable will increase R²
– Even adding an unimportant independent variable
• Thus, R² cannot tell us that adding an independent variable is undesirable

A Better Criterion
• A better criterion is the size of the standard error s
• If s increases when an independent variable is added, we should not add that variable
• However, a decrease in s alone is not enough
– Adding a variable reduces degrees of freedom, and that tends to widen the prediction interval for y
– Therefore, an independent variable should be included only if it reduces s enough to offset the larger t value and shorten the prediction interval for y

C Statistic
• Another quantity for comparing regression models is the C statistic (also known as the Cp statistic)
• First, calculate the mean square error for the model containing all p potential independent variables, denoted sp²
• Next, calculate SSE for a reduced model with k independent variables
• Then calculate C = SSE/sp² – [n – 2(k + 1)]
• We want the value of C to be small
– Adding unimportant independent variables raises the value of C
• While we want C to be small, we also wish to find a model for which C roughly equals k + 1
– A model with C substantially greater than k + 1 has substantial bias and is undesirable
– If a model has a small value of C and C is less than k + 1, then the model is not biased and should be considered desirable

Stepwise Regression and Backward Elimination
• Testing all combinations of variables can be tedious
• In many situations, it is useful to have an iterative model selection procedure
– At each step, a single independent variable is added to or deleted from the model
– The model is then reevaluated
– This continues until a final model is found
• There are two such approaches: stepwise regression and backward elimination

Stepwise Regression
• Assume there are p potential independent variables, where p is large
• Stepwise regression uses t statistics to determine the significance of the independent variables in various models
• Stepwise regression needs two alpha values:
– α(entry), the probability of a Type I error related to entering an independent variable into the model
– α(stay), the probability of a Type I error related to retaining an independent variable that was previously entered into the model
• Step 1: The procedure considers the p possible one-independent-variable regression models and finds the variable with the largest absolute t statistic, denoted x[1]
– If x[1] is not significant at the α(entry) level, the process terminates and we conclude that none of the independent variables are significant
– Otherwise, x[1] is retained for use in Step 2
• Step 2: The procedure considers the p – 1 possible two-independent-variable models of the form y = β0 + β1x[1] + β2xj + ε
– For each new variable xj, it tests H0: β2 = 0 versus Ha: β2 ≠ 0 and picks the variable giving the largest absolute t statistic
– If the resulting variable is significant, the procedure checks x[1] against α(stay) to see whether it should stay in the model
– This check is needed because of multicollinearity
• Further steps: this adding and checking for removal continues until all nonselected independent variables are insignificant and will not enter the model
– The procedure will also terminate when the variable to be added to the model is the one just removed from it

Backward Elimination
• With backward elimination, we begin with a full regression model containing all p potential independent variables
• We then find the variable having the smallest absolute t statistic
– If this variable is significant, we stop
– If it is insignificant, it is dropped and the regression is rerun with p – 1 potential independent variables
• The process continues to remove variables one at a time until all remaining variables are significant
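To show how a VIF is computed directly from its definition, here is a small sketch that regresses each predictor on the remaining predictors and applies VIFj = 1/(1 – Rj²). The predictor data below are simulated for illustration; the sales territory columns (Time, MktPoten, Adver, and so on) could be stacked into X the same way.

import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j of X on the remaining columns (plus an intercept)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        ss_res = np.sum((xj - fitted) ** 2)
        ss_tot = np.sum((xj - xj.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical correlated predictors: x2 is built from x1, so its
# VIF (and x1's) should come out well above 1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.3 * rng.normal(size=100)
x3 = rng.normal(size=100)
print(np.round(vif(np.column_stack([x1, x2, x3])), 2))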
Diagnosing and Using Information About Outlying and Influential Observations
• (The illustrating figure is not reproduced here; it labels three observations)
• Observation 1: outlying with respect to its y value
• Observation 2: outlying with respect to its x value
• Observation 3: outlying with respect to its x value, with a y value not consistent with the regression relationship (influential)

Leverage Values
• Leverage values can help us identify outliers
• The leverage value for an observation is the distance value (discussed earlier)
• This value is a measure of the distance between the observation's x value and the center of the experimental region
• If the leverage value for an observation is large, it is an outlier with respect to its x value
– Large means greater than twice the average of all the leverage values
– This cutoff can be shown to be 2(k + 1)/n

Hospital Labor Needs Data
• y: monthly labor hours required
• x1: monthly X-ray exposures
• x2: monthly occupied bed days
• x3: average length of patient stay (days)

Obs   Hours (y)   Predicted   Residual    Leverage  Studentized  Stud. Deleted  Cook's D
                  Hours                             Residual     Residual
 1      566.52      688.409    -121.889    0.121      -0.211       -0.203        0.002
 2      696.82      721.848     -25.028    0.226      -0.046       -0.044        0.000
 3     1033.15      965.393      67.757    0.130       0.118        0.114        0.001
 4     1603.62     1172.464     431.156    0.159       0.765        0.752        0.028
 5     1611.37     1526.780      84.590    0.085       0.144        0.138        0.000
 6     1613.27     1993.869    -380.599    0.112      -0.657       -0.642        0.014
 7     1854.17     1676.558     177.612    0.084       0.302        0.291        0.002
 8     2160.55     1791.405     369.145    0.083       0.627        0.612        0.009
 9     2305.58     2798.761    -493.181    0.085      -0.838       -0.828        0.016
10     3503.93     4191.333    -687.403    0.120      -1.192       -1.214        0.049
11     3571.89     3190.957     380.933    0.077       0.645        0.630        0.009
12     3741.40     4364.502    -623.102    0.177      -1.117       -1.129        0.067
13     4026.52     4364.229    -337.709    0.064      -0.568       -0.553        0.006
14    10343.81     8713.307    1630.503    0.146       2.871        4.558        0.353
15    11732.17    12080.864    -348.694    0.682      -1.005       -1.006        0.541
16    15414.94    15133.026     281.914    0.785       0.990        0.989        0.897
17    18854.45    19260.453    -406.003    0.863      -1.786       -1.975        5.033

Residuals and Studentized Residuals
• One way to identify an outlier is via the residuals: any residual that is substantially different from the others is suspect
• For a more precise idea, we can calculate the studentized residual: the observation's residual divided by the residual's standard error, ei / (s √(1 – hi))
• An observation is outlying with respect to y if it has a large studentized (or standardized) residual; |StRes| greater than 2 gives some evidence that it is an outlier
• From the table above, the studentized residual for observation 14 is 2.871, which exceeds 2, so this observation is an outlier with respect to y

Calculating the Deleted Residual for Observation i
• Compute the regression model using all observations
• Compute the regression model again using all observations except i, and use this reduced model to predict yi
• The deleted residual is the difference between yi and this leave-one-out prediction
• Dividing the deleted residual by its standard error gives the studentized deleted residual
• The value is compared to –t0.025 and +t0.025; values outside this range are outliers

Studentized Deleted Residuals
• The studentized deleted residual can be computed as ei √[(n – k – 2) / (SSE(1 – hi) – ei²)]
• An observation is outlying with respect to y if it has a large studentized deleted residual: |tRes| greater than t(α/2) with (n – k – 2) degrees of freedom
• From the table above, observation 14 has a studentized deleted residual of 4.558
– The data have n – k – 2 = 17 – 3 – 2 = 12 degrees of freedom, and t0.025 = 2.179
– Since 4.558 > 2.179, observation 14 is outlying with respect to y

Cook's Distance
• Cook's distance is Di = [ei² / ((k + 1)s²)] × [hi / (1 – hi)²]
• An observation is influential with respect to the estimated regression parameters b0, b1, …, bk if it has a large Cook's distance: Di greater than F.50 with k + 1 and n – (k + 1) degrees of freedom
• From the table above, observation 17 has a Cook's D value of 5.033
– F.50 with k + 1 = 3 + 1 = 4 numerator degrees of freedom and n – (k + 1) = 13 denominator degrees of freedom is 0.8845
– Since 5.033 > 0.8845, observation 17 is influential with respect to the estimated regression parameters

What to Do About Outliers
• First, check whether the data were recorded correctly
– If not correct, discard the observation and rerun
• If correct, search for a reason for the observation
– It might be caused by a situation we do not wish to model; if so, drop the observation
• If no reason is found, consider that there might be an important independent variable not currently included in the model
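A compact sketch of the diagnostics in this section, computed from the hat matrix H = X(XᵀX)⁻¹Xᵀ: the leverages are the diagonal of H, the studentized residuals divide each residual by s√(1 – hi), and Cook's D follows the formula above. The small data set here is simulated; the hospital data would be handled the same way with X = [1, x1, x2, x3].

import numpy as np

def influence_measures(X, y):
    """Leverage h_i, studentized residuals, and Cook's D for an OLS
    fit with design matrix X (first column should be all ones)."""
    n, p = X.shape                      # p = k + 1 parameters
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                      # leverage values
    e = y - H @ y                       # residuals
    s2 = e @ e / (n - p)                # mean square error s^2
    stud = e / np.sqrt(s2 * (1.0 - h))  # studentized residuals
    cooks = (e**2 / (p * s2)) * (h / (1.0 - h) ** 2)
    return h, stud, cooks

# Tiny simulated example with one predictor.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 20)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])
h, stud, cooks = influence_measures(X, y)

# Flag x-outliers with leverage above 2(k+1)/n, and y-outliers with
# |studentized residual| above 2.
print("leverage cutoff 2(k+1)/n =", 2 * X.shape[1] / len(x))
print(np.round(h, 3), np.round(stud, 2), np.round(cooks, 3), sep="\n")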
Transforming the Dependent and Independent Variables
• A possible remedy for violations of the constant variance, correct functional form, and normality assumptions is to transform the dependent variable
• Possible transformations include:
– Square root
– Quartic root
– Logarithmic
• The appropriate transformation will depend on the specific problem with the original data set

The Durbin-Watson Test and Dealing with Autocorrelation
• One type of autocorrelation is called first-order autocorrelation
• This is when the error term in time period t (εt) is related to the error term in time period t – 1 (εt–1)
• The Durbin-Watson statistic checks for first-order autocorrelation

Durbin-Watson Test Statistic
• d = Σ(t=2 to n) (et – et–1)² / Σ(t=1 to n) et²
• where e1, e2, …, en are the time-ordered residuals
• Decision rule when testing for positive autocorrelation at level α:
– If d < dL,α, we reject H0 of no positive autocorrelation
– If d > dU,α, we do not reject H0
– If dL,α ≤ d ≤ dU,α, the test is inconclusive
• Tables A.10, A.11, and A.12 give values of dL,α and dU,α at different alpha values
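A minimal sketch of the Durbin-Watson statistic, applied to simulated residuals with built-in first-order autocorrelation (the 0.7 autocorrelation coefficient is an arbitrary choice for illustration). A value well below 2 points toward positive autocorrelation, to be judged against the dL,α and dU,α bounds from the tables.

import numpy as np

def durbin_watson(e):
    """d = sum_{t=2}^n (e_t - e_{t-1})^2 / sum_{t=1}^n e_t^2,
    computed from time-ordered residuals e_1, ..., e_n."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Simulated time-ordered residuals with first-order autocorrelation.
rng = np.random.default_rng(3)
shocks = rng.normal(size=50)
e = np.empty(50)
e[0] = shocks[0]
for t in range(1, 50):
    e[t] = 0.7 * e[t - 1] + shocks[t]   # e_t depends on e_{t-1}

print(f"d = {durbin_watson(e):.3f}")    # well below 2 for this series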