Lecture 6: Regression Diagnostics
STAT 512, Spring 2011
Background Reading: KNNL 3.1-3.6

Topic Overview
Chapter 3: Diagnostics & Remedial Measures for Simple Linear Regression
- Diagnostics: Look at the data to diagnose situations where the assumptions of our model are violated. (Lecture 6)
- Remedies: Changes in analytic strategy to fix these problems. (Lecture 7)

What Do We Need To Check?
- Main Assumptions: Errors are independent, normal random variables with common variance $\sigma^2$.
- Does the assumption of linearity make sense?
- Were any important predictors excluded from the model?
- Are there "outlying" values for the predictor variables (X) that could unduly influence the regression model?
- Are there outliers? (Generally, the term outlier refers to a response (Y) that is vastly different from the other responses; see KNNL p. 108.)
How to Get Started? Look at the Data!

Diagnostics for Predictors (X)
- We do not make any specific assumptions about X. However, understanding what is going on with X is necessary for interpreting what is going on with the response (Y).
- So, we can look at some basic summaries of the X variables to get oriented. However, we are not checking our assumptions at this point.
- Dot plots, stem-and-leaf plots, box plots, and histograms can be useful in identifying potential outlying observations in X.
- An outlying observation in X will not necessarily create a problem in the analysis, but it is a data point that will probably have higher influence over the regression estimates.
- Sequence plots can be useful for identifying potential problems with independence.

SAS Procedures
- PROC UNIVARIATE for getting basic statistics and creating histograms for both response and predictor variables. Check for outliers, unusual skewness, clumping.
- PROC GPLOT to create a scatter plot of Y against X. Assess linearity visually.
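
The scatter plot step can be sketched as follows. This is a minimal example, assuming the `muscle` data set with the variables `mmass` (response) and `age` (predictor) that the other code in this lecture uses; the symbol settings and the title are illustrative choices, not taken from the slides.

```sas
* Sketch: scatter plot of the response against the predictor.
  Assumes data set muscle with variables mmass and age;
symbol1 v=circle i=none;   * plot points only, no connecting line;
PROC GPLOT data=muscle;
   plot mmass*age;         * vertical axis = mmass, horizontal axis = age;
   title 'Muscle Mass vs. Age';
RUN;
QUIT;
```

Changing `i=none` to `i=rl` overlays the fitted least-squares line, which can make departures from linearity easier to see.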
Reminder: Scatterplot
[Figure: scatterplot]

UNIVARIATE Procedure (1) (06_misc.sas)
    PROC UNIVARIATE data=muscle plot;
       var age;
       histogram age / normal(mu=est sigma=est);
       title 'Histogram for Age';
    RUN;

    Basic Statistical Measures
          Location                     Variability
      Mean     59.98333     Std Deviation            11.79700
      Median   60.00000     Variance                139.16921
      Mode     78.00000     Range                    37.00000
                            Interquartile Range      20.50000

UNIVARIATE Procedure (2)
    Stem Leaf            #  Boxplot
      78 00000           5     |
      76 0000            4     |
      74 0               1     |
      72 000             3     |
      70 000             3  +-----+
      68 000             3  |     |
      66 0               1  |     |
      64 0000            4  |     |
      62 000             3  |     |
      60 0000            4  *--+--*
      58 000             3  |     |
      56 0000            4  |     |
      54 000             3  |     |
      52 000             3  |     |
      50 0               1  |     |
      48 00              2  +-----+
      46 0000            4     |
      44 00              2     |
      42 0000            4     |
      40 000             3     |
         ----+----+----+----+

UNIVARIATE Procedure (3)
[Figure]

UNIVARIATE Procedure (4)
What if we add in a data point for: age=100, mmass=40?
    Stem Leaf            #  Boxplot
      10 0               1     0
       9
       9
       8
       8
       7 5666788888     10     |
       7 001223          6  +-----+
       6 5556889         7  |     |
       6 00013334        8  *--+--*
       5 56777999        8  |     |
       5 123344          6  +-----+
       4 5677788         7     |
       4 11122334        8     |
         ----+----+----+----+
    Multiply Stem.Leaf by 10**+1

UNIVARIATE Procedure (5)
[Figure]

Diagnostics for Residuals (1)
Basic Distributional Assumptions on Errors
- Model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
   o where $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ (i.e., the $\varepsilon_i$ are independent, normal, and have constant variance).
- The $e_i$ (residuals) should be similar to the $\varepsilon_i$.
- How do we check this? Plot the Residuals!

Diagnostics for Residuals (2)
Basic questions addressed by diagnostics for residuals:
   o Is the relationship linear?
   o Does the variance depend on X?
   o Are the errors normal?
   o Are the errors independent?
   o Are there outliers?
   o Are any important predictors omitted?

Checking Linearity
- Plot Y vs. X (scatterplot)
- Plot e vs. X (or $\hat{Y}$): residual plot
- A nonlinear relationship is generally visible in the scatter plot
- Patterns in residual plots can emphasize deviations from a linear pattern

Checking Constant Variance
- Plot e vs. X (or $\hat{Y}$): residual plot
- Patterns suggest issues!
   o Megaphone shape indicates increasing/decreasing variance with X
   o Other shapes can indicate non-linearity
   o Outliers show up in obvious way

SAS Code
    PROC REG data=muscle;
       model mmass=age;
       output out=diag p=pred r=resid;
    RUN;

    *Plot residuals vs age;
    symbol1 v=dot i=none;
    PROC GPLOT data=diag;
       plot resid*age;
       title 'Residuals for Muscle Mass Data';
    RUN;

Checking for Normality
- Plot residuals in a Normal Probability Plot
   o Compare residuals to their expected value under normality (normal quantiles)
   o Should be linear IF normal
- Plot residuals in a Histogram
- PROC UNIVARIATE is used for both of these
- Book shows method to do this by hand; you do not need to worry about having to do that.

SAS Code
    PROC REG data=muscle;
       model mmass=age;
       output out=diag p=pred r=resid;
    RUN;

    *Check normality assumption;
    PROC UNIVARIATE data=diag normal;
       var resid;
       histogram resid / normal(mu=est sigma=est);
       qqplot resid / normal;
       title 'Check for Normality';
    RUN;

Normality Plot
- Outliers show up in a quite obvious way.
- Non-normal distributions can look very wacky.
- Symmetric / Heavy tailed distributions show an "S" shape.
- Skewed distributions show exponential looking curves (see figure 3.9)

[Figure: normal quantile plot of residuals; vertical axis Residual (-100000 to 150000), horizontal axis Normal Quantiles (-4 to 4)]

Checking Independence
- Sequence Plot: Residuals against time/order
- Patterns suggest non-independence
- See figure 3.8 in KNNL.

Additional Predictors
- Plot residuals against other potential predictors (not predictors from the model)
- Patterns indicate an important predictor that maybe should be in the model.
- Example: Suppose we use a muscle mass dataset that includes both men and women.

Residuals vs Age
- Plot looks great, right?
- But what happens if we separate male and female?
    PROC GPLOT data=diag;
       plot resid*age=gender / overlay;
    RUN;

Additional Predictors
- Gender appears to be an important predictor of muscle mass as well. (Gender is categorical, so we will have to wait until later in the semester for further analysis.)
- For continuous variables, look for a linear pattern with a non-zero slope.

Summary of Diagnostic Plots
You will have noticed that the same plots are used for checking more than one assumption. These are your basic tools.
   o Plot Y vs. X (check for linearity, outliers)
   o Plot Residuals vs. X (check for constant variance, outliers, linearity)
   o Normal Probability Plot and/or Histogram of residuals (normality, outliers)
If it makes sense, consider also doing a sequence plot of the residuals (independence).

Plots vs. Significance Tests
- If you are uncertain what to conclude after examining the plots, you may additionally wish to perform hypothesis tests for model assumptions (normality, homogeneity of variance, independence).
- These tests are not a replacement for the plots, but rather a supplement to them.
- Note of caution: Plots are more likely to suggest a remedy, and significance test results are very dependent on sample size.

Significance Tests for Model Assumptions
- Constancy of Variance:
   o Brown-Forsythe (modified Levene)
   o Breusch-Pagan
- Normality:
   o Kolmogorov-Smirnov, etc.
- Independence of Errors:
   o Durbin-Watson Test

Tests for Normality
    PROC UNIVARIATE data=diag normal;
       var resid;
    RUN;

                       Tests for Normality
    Test                  --Statistic---    -----p Value------
    Shapiro-Wilk          W     0.979585    Pr < W      0.4112
    Kolmogorov-Smirnov    D     0.079433    Pr > D     >0.1500
    Cramer-von Mises      W-Sq  0.057805    Pr > W-Sq  >0.2500
    Anderson-Darling      A-Sq  0.383556    Pr > A-Sq  >0.2500

Small p-values indicate non-normality.

Upcoming in Lecture 7...
Remedial Measures: What to do when there is a problem with your model assumptions (KNNL: 3.8-3.11)
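
The variance and independence tests named under Significance Tests for Model Assumptions are listed without code. A hedged sketch of how two of them might be requested in SAS follows. It assumes the `muscle` data set and the `diag` data set (containing `resid` and `age`) created by the PROC REG step earlier in the lecture, and that the observations are stored in time or sequence order, without which the Durbin-Watson statistic is not meaningful. The split at age 60 for the Brown-Forsythe test is an illustrative choice, and the data set name `bf` and variable `grp` are hypothetical.

```sas
* Durbin-Watson statistic for independence of errors.
  Only meaningful if the rows of muscle are in time or sequence order;
PROC REG data=muscle;
   model mmass=age / dw;
RUN;
QUIT;

* Brown-Forsythe (modified Levene) test for constant variance.
  The cutoff of 60 is an illustrative choice, not from the lecture;
DATA bf;
   set diag;
   grp = (age > 60);   * 0 = lower age group, 1 = higher age group;
RUN;

PROC GLM data=bf;
   class grp;
   model resid=grp;
   means grp / hovtest=bf;   * compares spread of residuals between groups;
RUN;
QUIT;
```

The Brown-Forsythe step works on the residuals rather than the raw response, so a small p-value from `hovtest=bf` would suggest that the residual spread differs between the lower and higher age groups. As with the normality tests, these results supplement the residual plots and do not replace them.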