Chapter 3: Diagnostics

One cannot be certain in advance of performing a linear regression that the technique is appropriate. We need to check whether the model is suitable for the given data; that is, we must check the assumptions associated with linear regression before making inferences based on our results.

Simple Diagnostics for X and Y – Graphing

Graphing assists in determining the range of the data and checking for outliers.
1. Stem-and-leaf plots
2. Histograms
3. Box plots

Diagnostics for Residuals

Since the Y values are a function of the level of the X variable, diagnostic plots of Y itself are typically not useful. Instead, diagnostics for Y are usually carried out on the residuals. A box plot of the residuals shows any outliers based on the IQR method. Another way of identifying residual outliers is with semi-studentized residuals.

Semi-studentized (or Standardized) Residuals: "Standardized" is technically not correct because sqrt(MSE) is only an approximation of the standard deviation of the residuals, so the term semi-studentized is used. Minitab does not provide an option for these, so you must divide the residuals by sqrt(MSE), which is S in the output:

    e*_i = e_i / sqrt(MSE)

The general rule is that |e*| ≥ 4 is considered an outlier.

Model Departures Studied by Residuals
1. The regression function is not linear
2. The error terms do not have constant variance (heteroscedasticity)
3. The error terms are not independent
4. The model fits all but one or a few outlier observations
5. The error terms are not normally distributed
6. One or more important predictor variables have been omitted from the model

Residual Diagnostics to Study Departures (bracketed numbers refer to the departures above)
A. Plot of residuals against predictor variable(s) [1, 4]
B. Plot of residuals against fitted values (same as A for the simple linear model) [1, 4]
C. Plot of absolute or squared residuals against predictor variable(s) [2]
D. Plot of residuals against time [3]
E. Plots of residuals against omitted variables [6]
F. Box plots, dot plots, etc.
of residuals [4, 5]
G. Normal probability plot of residuals [5]

Tests

NOTE: the versions of the tests in 2 below may differ from what you will find done in software packages.

1. Normality – Stat > Basic Statistics > Normality Test, or Graph > Probability Plot for A-D (Anderson-Darling)
   a. Kolmogorov-Smirnov
   b. Anderson-Darling (Minitab)
   Hypotheses – Ho: the data follow a normal distribution; Ha: the data are not normal.

2. Constancy of Error Variance – Minitab has no automatic feature for any of these:
   a. White's Test – Regress Y on the predictor(s) and store the residuals and fits. Square both. Regress the squared residuals on the fits and the squared fits. Use the F-test in the ANOVA output to test Ho: all slopes are zero.
   b. Breusch-Pagan – Regress Y on the predictor(s) and store the residuals. Square the residuals. Regress the squared residuals on the predictor(s). Use the F-test in the ANOVA output to test Ho: all slopes are zero. If there is more than one predictor and the F-test is significant, you can use the individual t-tests of this regression to identify which predictor(s) indicate possible non-constant variance.
   c. Modified Levene Test – Split the data by creating a Group variable, then Stat > ANOVA > Test for Equal Variances, and test Ho: the error variance is constant.

3. F Test for Lack of Fit – under the "Options" tab in Regression: a) Pure Error and b) Data Subsetting. [Pure Error assumes the Y values are independent, normally distributed, and have constant variance. It requires replicates at some values of X, though not necessarily all; if there are no replicates, use the Data Subsetting option.]

   F* = MSLF / MSPE, where MSLF = SSLF / (c − 2) and MSPE = SSPE / (n − c)

   SSLF = Σ_j n_j (Ȳ_j − Ŷ_j)²    with DF = c − 2    (Lack of Fit Sum of Squares)
   SSPE = Σ_j Σ_i (Y_ij − Ȳ_j)²   with DF = n − c    (Pure Error Sum of Squares)

   Here c is the number of unique values of X and j runs from 1 to c.

The basic premise of the Lack of Fit test is that if the linear regression function is appropriate, then the group means Ȳ_j will be near the fitted values Ŷ_j calculated from the linear regression model.
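The SSLF/SSPE decomposition can also be computed directly outside Minitab. The following Python sketch uses the Sales/Adv data from the example that follows and reproduces the "Lack of Fit" line of its ANOVA table (the variable names here are illustrative, not Minitab's):

```python
# Lack-of-fit F test computed by hand with numpy/scipy, using the
# Sales/Adv example data from these notes.
import numpy as np
from scipy import stats

x = np.array([1., 2., 2., 3., 4., 4., 5.])   # Adv
y = np.array([1., 1., 2., 2., 2., 3., 4.])   # Sales

# Least-squares fit of Y on X
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fits = b0 + b1 * x

n, levels = len(x), np.unique(x)
c = len(levels)                               # c = number of unique X values

# SSPE: Y values around the group mean at each X level (pure error)
sspe = sum(np.sum((y[x == v] - y[x == v].mean()) ** 2) for v in levels)
# SSLF: group means around the fitted line (lack of fit)
sslf = sum(np.sum((y[x == v].mean() - fits[x == v]) ** 2) for v in levels)

f_star = (sslf / (c - 2)) / (sspe / (n - c))
p_value = stats.f.sf(f_star, c - 2, n - c)
# f_star ≈ 0.35 and p_value ≈ 0.798, matching the Minitab output below
```

Since p ≈ 0.798 is large, there is no evidence of lack of fit for these data.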
The hypotheses are therefore Ho: "The regression function is linear" versus Ha: "The regression function is not linear". Note that Ha includes all possible regression functions that are non-linear.

Example:

Adv (X):    1      2      2      3      4      4      5
Sales (Y):  1      1      2      2      2      3      4
FITS1:      0.81   1.48   1.48   2.14   2.81   2.81   3.48
RESI1:      0.19  -0.48   0.52  -0.14  -0.81   0.19   0.52

Group means: at X = 1, 2, 3, 4, 5 the means Ȳ_j are 1, 1.5, 2, 2.5, 4, so c = 5 and n = 7.

Regression Analysis: Sales versus Adv

Predictor   Coef     SE Coef   T      P
Constant    0.1429   0.5216    0.27   0.795
Adv         0.6667   0.1594    4.18   0.009

S = 0.552052   R-Sq = 77.8%   R-Sq(adj) = 73.3%

Analysis of Variance

Source            DF   SS       MS       F       P
Regression         1   5.3333   5.3333   17.50   0.009
Residual Error     5   1.5238   0.3048
  Lack of Fit      3   0.5238   0.1746   0.35    0.798
  Pure Error       2   1.0000   0.5000
Total              6   6.8571

3 rows with no replicates

Transformations

1. Non-linear relation – Look for the best transformation of the X's. We transform the X's in this case, especially when the residuals are assumed to follow a normal distribution and have constant error variance. Transforming the Y's in such a situation could change the assumptions regarding the residuals, since the residuals depend on the Y's.

2. Non-normality or unequal error variances – Start with transformations of the Y's, but keep in mind that similar transformations of X may need to be performed to preserve the linearity of the relationship.

A common transformation approach is the Box-Cox method. In Minitab, go to Stat > Control Charts > Box-Cox Transformation. Select "All observations for a chart are in one column," select the variable, and for subgroup size enter the number of observations. The resulting graph provides the best lambda transformation value.

NOTE: if you transform a variable, then the regression must be performed using the transformed values and the diagnostics must be re-evaluated. Just as important, any reporting must be based on the transformed variable and not the original variable.
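Outside Minitab, the Box-Cox lambda can be estimated by maximum likelihood; a scipy sketch on hypothetical data (the value Minitab's chart reports may differ slightly, and rounding to the nearest half is a convention assumed here, not something Minitab enforces):

```python
# Box-Cox transformation sketch with scipy; the data are illustrative
# and Y must be strictly positive for Box-Cox to apply.
import numpy as np
from scipy import stats

y = np.array([1., 1., 2., 2., 2., 3., 4.])   # hypothetical response values
y_trans, lmbda = stats.boxcox(y)             # maximum-likelihood lambda
# Round to a convenient value before transforming, as the notes suggest
lmbda_rounded = round(lmbda * 2) / 2         # nearest 0.5 (assumed convention)
```

As the NOTE above says, any subsequent regression and diagnostics would then use `y_trans`, not the original Y.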
For instance, if the X variable is transformed using a log, then a significant relationship must be reported as a significant relationship between log X and Y, not between X and Y.

Chapter 3 – Lab

1. Create box plots of Y and X.
2. Check for duplicate values of X via Stat > Tables > Tally Individual Variables.
3. Regress Y on X and store the residuals and fits.
4. Make a scatterplot of the residuals vs. X.
5. Calculate the semi-studentized residuals. Go to Calc > Calculator, type Semi in 'Store result', and in the expression window create an expression that divides the residuals by the value of S in the regression output. Check whether any of these new 'Semi' values fall outside −4 to 4.
6. Use Graph > Probability Plot to check the normality assumption of the residuals.
7. Conduct the three tests of constant variance.
   1. Modified Levene:
      a. Create 'Group' via Calc > Calculator. Type Group in 'Store result' and in the expression window enter c2 <= medi(c2).
      b. Go to Stat > ANOVA > Test for Equal Variances. Enter the residuals as the Response and Group as the Factor.
   2. Breusch-Pagan:
      a. Create 'sqres' via Calc > Calculator. Type sqres in 'Store result' and in the expression window enter an expression that squares the residuals.
      b. Regress sqres on X. Check the F-test in the ANOVA output.
   3. White's Test:
      a. Create 'sqres' as in the previous test.
      b. Create 'sqfits' via Calc > Calculator. Type sqfits in 'Store result' and in the expression window enter an expression that squares the fits.
      c. Regress sqres on fits and sqfits (i.e., enter both fits and sqfits together in the Predictor window). Check the F-test in the ANOVA output.
8. Check Lack of Fit using both Pure Error and Data Subsetting: in the Regression window, click Options and check both boxes under Lack of Fit.
9. Use the Box-Cox method to check whether a transformation of Y is needed.
- Stat > Control Charts > Box-Cox Transformation
- From the dropdown select "All observations for a chart are in one column" (the default) and in the window below enter the variable of interest, Y.
- For subgroup size, enter the size of the sample, 10.
- The resulting graph provides a legend with the best estimate and a rounded value; use the rounded value.
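Several of the lab steps above can also be sketched outside Minitab. The Python sketch below reuses the Sales/Adv data from the chapter example and covers steps 3, 5, and 7; `center='median'` in scipy's Levene test gives the Brown-Forsythe (modified Levene) version, and the auxiliary-regression helper is an illustrative name, not a library function:

```python
# numpy/scipy sketch of lab steps 3, 5, and 7 (semi-studentized residuals
# and the three constant-variance tests) on the Sales/Adv example data.
import numpy as np
from scipy import stats

x = np.array([1., 2., 2., 3., 4., 4., 5.])   # Adv
y = np.array([1., 1., 2., 2., 2., 3., 4.])   # Sales
n = len(x)

# Step 3: regress Y on X, store residuals and fits
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fits = b0 + b1 * x
res = y - fits

# Step 5: semi-studentized residuals e* = e / S, where S = sqrt(MSE)
s = np.sqrt(np.sum(res ** 2) / (n - 2))      # matches S = 0.552052 above
semi = res / s                               # flag any |e*| >= 4

# Step 7.1: modified Levene via a median split of X
group = x <= np.median(x)
lev_stat, lev_p = stats.levene(res[group], res[~group], center='median')

def aux_f(Z):
    """Regress res^2 on the columns of Z (plus intercept) and return the
    overall F test of Ho: all slopes are zero, with its p-value."""
    sq = res ** 2
    X = np.column_stack([np.ones(n), Z])
    bhat, *_ = np.linalg.lstsq(X, sq, rcond=None)
    sse = np.sum((sq - X @ bhat) ** 2)
    sst = np.sum((sq - sq.mean()) ** 2)
    k = Z.shape[1]
    F = ((sst - sse) / k) / (sse / (n - k - 1))
    return F, stats.f.sf(F, k, n - k - 1)

# Step 7.2: Breusch-Pagan - regress squared residuals on X
bp_F, bp_p = aux_f(x[:, None])
# Step 7.3: White - regress squared residuals on fits and fits^2
w_F, w_p = aux_f(np.column_stack([fits, fits ** 2]))
```

In the lab you would substitute your own Y and X; the menus in the steps above remain the authoritative Minitab procedure.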