Advanced Statistics for Interventional Cardiologists What you will learn • • • • • • • • • • • • • • • Introduction Basics of multivariable statistical modeling Advanced linear regression methods Hands-on session: linear regression Bayesian methods Logistic regression and generalized linear model Resampling methods Meta-analysis Hands-on session: logistic regression and meta-analysis Multifactor analysis of variance Cox proportional hazards analysis Hands-on session: Cox proportional hazard analysis Propensity analysis Most popular statistical packages Conclusions and take home messages 1st day 2nd day What you will learn • Multiple Linear Regression – Basic concepts • • • • • – – – – – – – – Some examples Linear regression model Estimation and testing the regression coefficients Testing and evaluating the regression model Predictions Multiple regression models The model building process Selection of predictor variables Model diagnostics Remedial measures Model validation Qualitative Predictor variables Practical examples Multiple linear regression Example from cardiology How can I predict the impact of balloon dilation pressure on post-procedure minimum lumen diameter (MLD), taking concomitantly into account diabetes status and ACC/AHA lesion type? In other words, how can I predict the impact of a given variable (aka independent) on another continuous variable (aka dependent), taking concomitantly into account other variables ? Multiple linear regression Example from cardiology Time to restenosis (days) (mm) lumen diameter Minimum 400 350 300 250 200 150 100 50 0 0 10 20 30 40 Dilation pressureLesion duringLenght stenting (ATM) 50 60 Multiple linear regression Example from cardiology Briguori et al, Eur Heart J 2002 Multiple linear regression Example from cardiology Briguori et al, Eur Heart J 2002 Multiple linear regression Example from cardiology Mauri et al, Circulation 2005 Multiple linear regression Example from cardiology Mauri et al, Circulation 2005 Multiple linear regression Fitness demo example Aerobic fitness can be evaluated using a special test that measures the oxygen uptake of a person running on a treadmill for a prescribed distance. However, it would be more economical to evaluate fitness with a formula that predicts oxygen uptake using simple measurements such as running time and pulse. The table on the next slide shows the partial listing of the measurements for 31 subjects. Objective Find the regression equation that allows us to make the most reliable predictions of O2 uptake. 
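As a preview, the sketch below shows how this regression could be fitted in practice; Python with statsmodels is used purely for illustration, the file name fitness.csv is a hypothetical placeholder, and the column names follow the fitness table on the next slide.

```python
# Hypothetical sketch of the fitness regression (oxygen uptake ~ running time + pulse).
# "fitness.csv" is a placeholder; Oxy, RunTime and RunPulse match the table that follows.
import pandas as pd
import statsmodels.formula.api as smf

fitness = pd.read_csv("fitness.csv")
model = smf.ols("Oxy ~ RunTime + RunPulse", data=fitness).fit()
print(model.summary())        # coefficients with t-tests, R², overall F-test
# Predicted O2 uptake for a new subject (10-minute run time, run pulse of 170)
print(model.predict(pd.DataFrame({"RunTime": [10.0], "RunPulse": [170]})))
```

Adding the remaining candidate predictors (Age, Weight, RstPulse, MaxPulse) to the formula gives the larger models compared later during variable selection.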
Multiple Regression Fitness Data Subject • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • Donna Gracie Luanne Mimi Chris Allen Nancy Patty Suzanne Teresa Bob Harriett Jane Harold Sammy Buffy Trent Jackie Ralph Jack Annie Kate Carl Don Effie George Iris Mark Steve Vaughn William Sex F F F F M M F F F F M F F M M F M F M M F F M M F M F M M M M Age 42 38 43 50 49 38 49 52 57 51 40 49 44 48 54 52 52 47 43 51 51 45 54 44 48 47 40 57 54 44 45 Weight 68,15 81,87 85,84 70,87 81,42 89,02 76,32 76,32 59,08 77,91 75,07 73,37 73,03 91,63 83,12 73,71 82,78 79,15 81,19 69,63 67,25 66,45 79,38 89,47 61,24 77,45 75,98 73,37 91,63 81,42 87,66 Oxy 59,57 60,06 54,30 54,63 49,16 49,87 48,67 45,44 50,55 46,67 45,31 50,39 50,54 46,77 51,85 45,79 47,47 47,27 49,09 40,84 45,12 44,75 46,08 44,61 47,92 44,81 45,68 39,41 39,20 39,44 37,39 RunTime 8,17 8,63 8,65 8,92 8,95 9,22 9,40 9,63 9,93 10,00 10,07 10,08 10,13 10,25 10,33 10,47 10,50 10,60 10,85 10,95 11,08 11,12 11,17 11,37 11,50 11,63 11,95 12,63 12,88 13,08 14,03 RunPulse 166 170 156 146 180 178 186 164 148 162 185 168 168 162 166 186 170 162 162 168 172 176 156 178 170 176 176 174 168 174 186 RstPulse 40 48 45 48 44 55 56 48 49 48 62 67 45 48 50 59 53 47 64 57 48 51 62 62 52 58 70 58 44 63 56 MaxPulse 172 186 168 155 185 180 188 166 155 168 185 168 168 164 170 188 172 164 170 172 172 176 165 182 176 176 180 176 172 176 192 Multiple linear regression • Simple linear regression is a statistical model to predict the value of one continuous variable Y (dependent, response) from another continuous variable X (independent, predictor, covariate, prognostic factor). • Multiple linear regression is a natural extension of the simple linear regression model – We use it to investigate the effect on the response variable of several predictor variables, simultaneously – It is a hypothetical model of the relationship between several independent variables and a response variable. • Let’s start by reviewing the concepts of the simple linear regression model. Simple linear regression The theoretical model Y = β0 + β1 X + ε Independent variable Distribution of the dependent variable Mean of the distribution of values of the dependent variable Regression line Simple linear regression The estimated model Yestimated = b0 + b1 X Yˆi Yestimated Y Re siduals : ei Yi Yˆi Yi b1 : slope unit b0: intercept X (independent) Linear Regression An estimation problem • Estimate the model parameters β0 and β1 as good as possible. • Find the ‘best-fitting’ line (Y = b0 + b1.X) through the measured coördinates. • How do we find this line? Minimize the sum of squared differences (least squares) between y and yestimated • Parameter Estimators b1 = (n Σ Xi Yi - Σ Xi Σ Yi) / ( n Σ Xi2 – (Σ Xi)2 ) b0 = mean Y – (b1 . mean X) Linear regression Assumptions Least square assumptions Linear relation between X and Y : E(εi) = 0 for all i Homoscedasticity (constant variance): Var(εi) = σ2 for all i Uncorrelated residuals : E(εi εj) = σ2 for all i ≠ j Significance tests assumptions Residuals are normally distributed : εi ≈ N ( 0, σ2 ) for all i Linear Regression Testing parameter significance • Testing significance of the regression parameters allows us to evaluate if there is an effect of the independent variable on the dependent variable. • If the slope β1 is significantly different from zero than we will conclude that the independent variable has a significant effect on the dependent variable. • Is the slope β1 significantly different from 0 ? 
– No : the value of x will not improve the prediction of y over the ordinary mean, – Yes : knowledge of the x-values will significantly improve the predictions • We can test if the slope is significantly different from 0 in two ways – Using a classical t-test – Construction of a confidence interval Linear Regression Testing parameter significance • The t- test is based on a function of the slope estimate b1 which has the t-distribution when the null hypothesis of ‘zero slope’ is true : tdf = b1/SE(b1) Decision rule: reject H0 if t t / 2,n2 • Testing with confidence intervals : b1 - tn-2,α/2 SEb < β1 < b1 - tn-2,α/2 SEb Decision rule: reject H0 if 0 is not in the confidence region Simple Linear Regression Growth example Examine how weight to height ratio changes as kids grow up. Measurements were taken from 72 children between birth and 70 months. What are your conclusions looking at the scatterplot ? Linear Regression Growth Example Linear Fit ratio = 0,66562 + 0,00528 age Summary of Fit RSquare 0,822535 RSquare Adj 0,819999 Root Mean Square Error 0,051653 Mean of Response 0,855556 Observ ations (or Sum Wgts) 72 Analy sis of Variance Source Model DF Sum of Squares Mean Square F Ratio 1 0,8656172 0,865617 324,4433 Error 70 0,1867605 0,002668 Prob>F C Total 71 1,0523778 <,0001 Parameter Estimates Term Std Error t Ratio Prob>|t| Intercept 0,6656231 Estimate 0,012176 54,67 <,0001 Lower 95% 0,6413397 Upper 95% 0,6899065 age 0,0052759 0,000293 18,01 <,0001 0,0046917 0,0058601 Evaluate the effect of age on ratio using the parameter estimates table? What you will learn • Multiple Linear Regression – Basic concepts • • • • • – – – – – – – – Some examples Linear regression model Estimation and testing the regression coefficients Testing and evaluating the regression model Predictions Multiple regression models The model building process Selection of predictor variables Model diagnostics Remedial measures Model validation Qualitative Predictor variables Practical examples Linear regression Assessing the fit of the model Analysis of Variance summarizes info about the sources of variation in the data by splitting the total sum of squares into two or more components. Y X SSTotal = SSModel + SSError n n n i 1 i 1 2 2 ˆ ˆ (Yi Y ) (Yi Y ) (Yi Yi ) 2 i 1 Linear regression ANOVA table for simple regression Source of Variation Sum of Squares Degrees of Freedom Mean Square F-ratio Model SSM 1 SSM/dfM MSM/MSE Error SSE n-2 SSE/dfE Total SST n-1 SST/dfT P-value In general, dfM equals the number of predictor terms in the model; dfE equals the number of observations minus the number of estimated coefficients in the model; and dfT equals the number of observations minus 1 (if the intercept is included in the model) Linear regression Significance of the model • F-ratio (and its p-value) is used to evaluate significance of the regression model. • MSModel / MSError ≈ F1;n-2 for simple regression • If the observed F-ratio is greater than a critical F-value, than we can conclude that this ratio is significantly greater than 1 and that the regression model explains a significant portion of the variation of the response variable. • Since the simple regression model has only one predictor variable, the F-ratio can also be used to determine if β1 = 0, i.e. 
if there is a significant effect of the predictor on the response variable (note: squared t-ratio = F-ratio) Linear regression Measure of Fit • A natural measure of ‘goodness of fit’ for the regression model is the coefficient of determination: R2 • R2 expresses the % of variability of the dependent variable explained by the variations in the independent variable • R2 = total variation (SST) – unexplained variation (SSE) total variation (SST) • Properties – R2 varies between 0 and 1 (perfect fit) – The larger R2 is, the more variation of Y is explained by the predictor X – Large values indicate a “strong” relationship between predictor and response variables. – For simple linear regression, R2 is the square of the correlation coefficient. Linear Regression Growth Example Linear Fit ratio = 0,66562 + 0,00528 age Summary of Fit RSquare 0,822535 RSquare Adj 0,819999 Root Mean Square Error 0,051653 Mean of Response 0,855556 Observ ations (or Sum Wgts) 72 Analy sis of Variance Source Model DF Sum of Squares Mean Square F Ratio 1 0,8656172 0,865617 324,4433 Error 70 0,1867605 0,002668 Prob>F C Total 71 1,0523778 <,0001 Parameter Estimates Term Std Error t Ratio Prob>|t| Intercept 0,6656231 Estimate 0,012176 54,67 <,0001 Lower 95% 0,6413397 0,6899065 age 0,0052759 0,000293 18,01 <,0001 0,0046917 0,0058601 Is this regression model significant ? What % of the variation of the response is explained by the model? Upper 95% Linear regression Examine residuals It is always a good idea to look at the residuals from a regression (the difference between the actual values and the predicted values). Residuals should be scattered randomly about a mean of zero. Linear Regression Residual analysis • Residual is the difference between the observed value and the fitted value at a certain level of X ^ ei Yi Y i • Once a model has been fit, the residuals are used to: • Validate the assumptions of the model • Diagnose departures from those assumptions • Identify corrective methods to refine the model Linear Regression Predictions • Two type of predictions of the response Y at new levels of X, the predictor variable, can be made using a validated regression equation. – Estimating the mean response (mean of the distribution of the response Y at new level of X) – Estimating a new observation of Y (individual outcome drawn from the distribution of the response Y at new level of X) • We calculate prediction intervals using the variance of the estimators. The estimation interval for an individual outcome is always larger than the one for a mean response, since the variance of the individual responses is greater than the variance of the mean response. Linear Regression Confidence band for regression line • This band allows us to see the region in which the entire regression line lies. It is useful for determining the appropriateness of a fitted regression function. Linear Regression Demonstration How to do a linear regression analysis with the EXCEL data analysis option ? What you will learn • Multiple Linear Regression – – – – – – – – – Basic concepts Multiple regression models The model building process Selection of predictor variables Model diagnostics Remedial measures Model validation Qualitative Predictor variables Practical examples Multiple linear regression Fitness demo example Aerobic fitness can be evaluated using a special test that measures the oxygen uptake of a person running on a treadmill for a prescribed distance. 
However, it would be more economical to evaluate fitness with a formula that predicts oxygen uptake using simple measurements such as running time and pulse. The table on the next slide shows the partial listing of the measurements for 31 subjects. Objective Find the regression equation that allows us to make the most reliable predictions of O2 uptake. Multiple Regression Fitness Data Subject • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • Donna Gracie Luanne Mimi Chris Allen Nancy Patty Suzanne Teresa Bob Harriett Jane Harold Sammy Buffy Trent Jackie Ralph Jack Annie Kate Carl Don Effie George Iris Mark Steve Vaughn William Sex F F F F M M F F F F M F F M M F M F M M F F M M F M F M M M M Age 42 38 43 50 49 38 49 52 57 51 40 49 44 48 54 52 52 47 43 51 51 45 54 44 48 47 40 57 54 44 45 Weight 68,15 81,87 85,84 70,87 81,42 89,02 76,32 76,32 59,08 77,91 75,07 73,37 73,03 91,63 83,12 73,71 82,78 79,15 81,19 69,63 67,25 66,45 79,38 89,47 61,24 77,45 75,98 73,37 91,63 81,42 87,66 Oxy 59,57 60,06 54,30 54,63 49,16 49,87 48,67 45,44 50,55 46,67 45,31 50,39 50,54 46,77 51,85 45,79 47,47 47,27 49,09 40,84 45,12 44,75 46,08 44,61 47,92 44,81 45,68 39,41 39,20 39,44 37,39 RunTime 8,17 8,63 8,65 8,92 8,95 9,22 9,40 9,63 9,93 10,00 10,07 10,08 10,13 10,25 10,33 10,47 10,50 10,60 10,85 10,95 11,08 11,12 11,17 11,37 11,50 11,63 11,95 12,63 12,88 13,08 14,03 RunPulse 166 170 156 146 180 178 186 164 148 162 185 168 168 162 166 186 170 162 162 168 172 176 156 178 170 176 176 174 168 174 186 RstPulse 40 48 45 48 44 55 56 48 49 48 62 67 45 48 50 59 53 47 64 57 48 51 62 62 52 58 70 58 44 63 56 MaxPulse 172 186 168 155 185 180 188 166 155 168 185 168 168 164 170 188 172 164 170 172 172 176 165 182 176 176 180 176 172 176 192 Multiple linear regression • To investigate the effect on the response variable Y, of several independent X variables, simultaneously. • Even if we are interested in the effect of only one variable, it is wise to include other variables as regressors to reduce the residual variance and improve significance tests of the effects. • Multiple regression models often improve precision of the predictions. Multiple linear regression • The model : yi 0 1 xi1 2 xi 2 ... p xip i • βi represents the change in the response for an incremental change in the ith predictor variable, while all other predictor variables are held constant. βi is referred to as the partial regression coefficient. • Assumptions: residuals (or errors) εi are independent and normally distributed with mean 0 and standard deviation σ Multiple linear regression Estimated additive model with two predictors yi = b0 + b1 xi1 + b2 xi2 + ei b0 : Y value when X1 and X2 equal 0 b1 : effect of X1 on Y controlling for X2 b2 : effect of X2 on Y controlling for X1 Multiple linear regression Fitness Example Response: O2 Uptake Summary of Fit RSquare 0,761424 RSquare Adj 0,744383 Root Mean Square Error 2,693374 Mean of Response 47,37581 Observ ations (or Sum Wgts) Model explains 76% of the variation around the mean of 02 uptake. 
31 Parameter Estimates Term Estimate Std Error t Ratio Prob>|t| Lower 95% 76,191947 Upper 95% Intercept 93,088766 8,248823 11,29 <,0001 Run Time -3,140188 0,373265 -8,41 <,0001 -3,90478 -2,375595 Run Pulse -0,073509 0,050514 -1,46 0,1567 -0,176983 0,0299637 Ef fect Test Source Nparm DF Sum of Squares F Ratio Prob>F Run Time 1 1 513,41745 70,7746 <,0001 Run Pulse 1 1 15,36208 2,1177 0,1567 109,98559 Evaluate the effect of Run Time and Run Pulse on O2 uptake If you take an effect away from the model, then the SSError will be higher. Difference in SS is used to construct an F-test on whether the contribution of the variable is significant. Prediction Equation: O2 uptake = 93,089 – 3,14 Run time – 0.074 Run Pulse Multiple Regression Whole Model Leverage Plot Graphical method to view the wholemodel hypothesis using a scatterplot of actual response values against the predicted values. The vertical distance from a point to the 45° line is the residual error. The idea is to get a feel for how much better the slope line fits than the horizontal line at the mean. If the confidence curves cross the horizontal line, the whole model F test is significant. Mulitple Regression Effect Leverage Plots partial plot, partial regression leverage plot, added variable plot Plot shows how each effect contributes to the fit after all the other effects have been included in the model. The distance from each point to the sloped line measures the residual for the full model. The distance from each point to the horizontal line measures the residual for a model without the effect (reduced model). Multiple regression models Interaction model with two predictor variables Y = β0 + β1 X1 + β2 X2 + β12 X1X2 + ε The change in the response associated with X1 depends on the level of X2 (and vice versa) Multiple regression models Quadratic model with two predictor variables Y = β0 + β1 X1 + β2 X2 + β12 X1X2 + β11 X12 + β22 X22 + ε Quadratic models can only represent three basic types of shapes: mountains, valleys, saddles. Multiple regression models • Model terms may be divided into the following categories – – – – – Constant term Linear terms / main effects (e.g. X1) Interaction terms (e.g. X1X2) Quadratic terms (e.g. X12) Cubic terms (e.g. X13) • Models are usually described by the highest term present – Linear models have only linear terms – Interaction models have linear and interaction terms – Quadratic models have linear, quadratic and first order interaction terms – Cubic models have terms up to third order. What you will learn • Multiple Linear Regression – – – – – – – – – – Basic concepts Multiple regression models The model-building process Selection of predictor variables Model diagnostics Remedial measures Model validation Qualitative Predictor variables Interaction effects Practical examples The model-building process Source: Applied Linear Statistical Models, Neter, Kutner, Nachtsheim, Wasserman The model-building process Aerobic fitness can be evaluated using a special test that measures the oxygen uptake of a person running on a treadmill for a prescribed distance. However, it would be more economical to evaluate fitness with a formula that predicts oxygen uptake using simple measurements such as running time and pulse. The table shows the partial listing of the measurements for 31 subjects. 
Age Weight O2 uptake Runtime Rest pulse Run pulse Max Pulse 38 81,87 60,055 8,63 48 170 186 38 89,02 49,874 9,22 55 178 180 40 75,07 45,313 10,07 62 185 185 40 75,98 45,681 11,95 70 176 180 42 68,15 59,571 8,17 40 166 172 44 85,84 54,297 8,65 45 156 184 Objective Find the regression equation that allows us to make the most reliable predictions of O2 uptake. Selection of predictor variables Objective • Goal is to find the “best” model that is able to predict well over the range of interest. Many variables (especially in exploratory observational studies) may contribute to the response variation. Include them all in the model ? • No, the model must also be parsimonious – A simple model with only a few relevant explanatory variables is easier to understand and use than a complex model – Pareto principle – a few variables contribute the most of information • Reducing the number of variables reduces ‘multicollinearity’ • Increasing ratio observations/variables reduces the variability of b and improves prediction. • Gathering and maintaining data on many factors is difficult and expensive • In the words of Albert Einstein : “Make things as simple as possible but no simpler” Model Selection Methods • Find the ‘best’ model by comparing all possible regression models using all combinations of the explanatory variables. • Automatic model selection methods – Forward selection – Backward elimination – Stepwise selection Model selection methods All Possible Subsets • A regression model is estimated for each possible subset of the predictor variables. The constant term is included in each of the subset models. • If there are k possible terms to be included in the regression model, there are 2k possible subsets to be estimated. • Purpose of the all subsets approach is to identify a small group of models that are “good” according to a specified criterion, so that further detailed examination can be done of these models. Determining the Best Model • A variety of statistics have been developed to help determine the “best” subset model. These include : – – – – – – Coefficient of determination : R2 Adjusted R2 Relative PRESS Mallows Cp Akaike information criterion (AIC) Swarz Information Criterion (SIC or BIC) • Residual analysis and various graphical displays also help to select the “best” subset regression model. Coefficient of Determination • R2 measures the proportion of the total variation in the response that is explained by the regression model, i.e. R2 SSModel SSError 1 SSTotal SSTotal • Values of R2 close to 1 indicate a good fit to the data. However, R2 can be arbitrarily increased by adding extra terms in the model without necessarily improving the predictive ability of the model. • To correct for the number of terms in the model, Adjusted R2 is defined as SSError / df Error 2 2 with Radj 1 Radj 1 SSTotal / dfTotal 2 • Large difference between R2 and Radj indicates the presence of unnecessary terms in the model. Relative PRESS • The PRedictive Error Sums of Squares is given by, n 2 ˆ ( Y Y ) i (i ) PRESS = i 1 where Yˆ represents the predicted value of Yi using the model that was fitted (i ) with the ith observation deleted. 
• Relative PRESS is similar to R² and R²adj:
Relative PRESS = 1 − PRESS / SSTotal
• It can be shown that Relative PRESS ≤ 1

Mallows Cp Statistic
Cp = (n − p) sp² / s² − (n − 2p)
• n is the number of observations
• p is the number of terms in the model (including the intercept)
• sp² is the estimate of error from the subset model containing p terms
• s² is the estimate of error from the model containing all possible terms. s² is assumed to be a “good” estimate of the experimental error.
If a model with p terms is adequate, Cp ≈ p. Otherwise, if the model contains unnecessary terms, Cp > p.

AIC and BIC
AIC (Akaike Information Criterion) and BIC (Schwarz Information Criterion) are two popular model selection criteria. They not only reward goodness of fit, but also include a penalty that is an increasing function of the number of estimated parameters. This penalty discourages overfitting. The preferred model is the one with the lowest value of AIC or BIC. These criteria attempt to find the model that best explains the data with a minimum of free parameters. AIC penalizes free parameters less strongly than the Schwarz criterion does. A short computational sketch of these criteria is given after the next slide.
AIC = 2k + n ln(SSError / n)
BIC = n ln(SSError / n) + k ln(n)

Model selection methods
Stepwise regression
• While the All Subsets procedure is the only way to guarantee that the “best” subset is chosen, other techniques have been developed that are less computationally intensive.
• Stepwise regression is an automatic model selection procedure which enters or removes terms sequentially.
• The basic steps in the procedure are:
– Compute an initial regression model
– Enter terms that significantly improve the fit (p-value less than the p-to-enter value), or remove terms that do not significantly harm the fit (p-value greater than the p-to-remove value)
– Compute the new regression model
– Stop when entering or removing terms will not significantly improve the model
• Unfortunately, the order in which the terms are entered or removed can lead to different models.
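The criteria above can be computed directly from any least-squares fit. The sketch below does so with plain numpy, using the SSE-based forms of AIC and BIC shown on these slides; the function name subset_criteria and the variable names are illustrative only, and the design matrices are assumed to include the intercept column.

```python
import numpy as np

def subset_criteria(X_sub, X_full, y):
    """Model-selection criteria for one candidate subset model.
    X_sub, X_full: design matrices *including* the intercept column."""
    n, p = X_sub.shape                                   # p = number of terms incl. intercept
    beta = np.linalg.lstsq(X_sub, y, rcond=None)[0]
    resid = y - X_sub @ beta
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)

    r2 = 1 - sse / sst                                   # coefficient of determination
    r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))       # adjusted for the number of terms

    # PRESS via the leave-one-out identity e_i / (1 - h_ii), with h_ii the hat-matrix diagonal
    h = np.diag(X_sub @ np.linalg.inv(X_sub.T @ X_sub) @ X_sub.T)
    press = np.sum((resid / (1 - h)) ** 2)
    rel_press = 1 - press / sst

    # Mallows Cp: subset SSE compared with the full-model error estimate s²
    k_full = X_full.shape[1]
    resid_full = y - X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]
    s2_full = resid_full @ resid_full / (n - k_full)
    cp = sse / s2_full - (n - 2 * p)

    aic = 2 * p + n * np.log(sse / n)                    # SSE-based AIC, with k = p terms
    bic = n * np.log(sse / n) + p * np.log(n)            # SSE-based BIC
    return dict(R2=r2, R2_adj=r2_adj, PRESS=press, relPRESS=rel_press,
                Cp=cp, AIC=aic, BIC=bic)
```

Running this for every candidate subset of the fitness predictors reproduces the kind of comparison used in the variable-selection step on the next slide.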
Variable Selection Fitness Example p-values to enter or remove a variable from the model model comparison criteria Overview of the variable selection procedure Best model Best Model with Leverage Plot Fitness Example Response: Oxy Summary of Fit RSquare 0,835425 RSquare Adj 0,810106 Root Mean Square Error 2,321437 Mean of Response 47,37581 Observ ations (or Sum Wgts) 31 Parameter Estimates Term Estimate Std Error t Ratio Prob>|t| 11,65754 8,34 <,0001 Intercept 97,185202 Age -0,189218 0,09439 -2,00 0,0555 Runtime -2,775606 0,341602 -8,13 <,0001 RunPulse -0,345272 0,118209 -2,92 0,0071 MaxPulse 0,2714364 0,134383 2,02 0,0538 Ef fect Test Source F Ratio Prob>F Age Nparm 1 DF 1 Sum of Squares 21,65647 4,0186 0,0555 Runtime 1 1 355,78610 66,0199 <,0001 RunPulse 1 1 45,97614 8,5314 0,0071 MaxPulse 1 1 21,98663 4,0799 0,0538 Multiple linear regression Example from cardiology Mauri et al, Circulation 2005 Multiple linear regression Example from cardiology Mauri et al, Circulation 2005 What you will learn • Multiple Linear Regression – – – – – – – – – Basic concepts Multiple regression models The model-building process Selection of predictor variables Model diagnostics Remedial measures Model validation Qualitative Predictor variables Practical examples What you will learn • Multiple Linear Regression – Model diagnostics and Remedial measures • • • • • • Assumptions Scaling of Residuals Distribution of Residuals Unequal Variances Outliers and Influential Observations Multicollinearity – Remedial Measures – Model Validation Building the Regression Model Diagnostics Remember that the residuals are useful for : Validating the assumptions of the model Diagnosing departures from those assumptions Identifying corrective methods to refine the model ei Yi Yˆi Linear regression Assumptions Least square assumptions Linear relation between X and Y : E(εi) = 0 for all i Homoscedasticity (constant variance): Var(εi) = σ2 for all i Uncorrelated residuals : E(εi εj) = σ2 for all i ≠ j Significance tests assumptions Residuals are normally distributed : εi ≈ N ( 0, σ2 ) for all i Scaling of Residuals • Raw – the unscaled residuals These are in the original units of the response. They are used to diagnose extreme values in magnitude and direction. • Standardized – to make the residuals more comparable Standardized residuals are calculated dividing the residuals by their standard deviation. They are approximately normal with a mean of zero and a standard deviation of one. Hence approximately 95% of the standardized residuals should be between + 2σ. Examining residuals Growth example The picture you hope to see is the residuals scattered randomly about a mean of zero. Look for patterns and for points that violate this random scatter. The plot above is suspicious. Why? Violation against which assumption? Examining residuals Regression line with extreme points excluded. Now the residuals are scattered randomly around a mean of zero. But excluding points is probably not the best solution. Examining residuals Comparison of Linear and Second Order Polynomial Fit Ratio = b0 + b1 age + b2 age2 + residual There still apppears to be a pattern in the residuals. Continue by fitting a model with higher order terms. Examining residuals Comparison of Linear and Higher Order Polynomial Fit The residuals of the fourth order polynomial fit do not show any pattern anymore. 
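A minimal sketch of this kind of residual check is shown below: polynomials of increasing order are fitted and the standardized residuals are plotted against the fitted values to look for leftover structure. The growth data are not reproduced here, so age and ratio are simulated stand-ins.

```python
import numpy as np
import matplotlib.pyplot as plt

def standardized_residuals(x, y, degree):
    """Fit a polynomial of the given degree and return fitted values and
    residuals scaled to unit standard deviation."""
    coeffs = np.polyfit(x, y, degree)              # least-squares polynomial fit
    fitted = np.polyval(coeffs, x)
    resid = y - fitted
    return fitted, resid / resid.std(ddof=degree + 1)

# Simulated stand-in for the growth data (72 children, age in months)
rng = np.random.default_rng(1)
age = np.linspace(0, 70, 72)
ratio = 0.66 + 0.005 * age + 0.03 * np.sin(age / 8) + rng.normal(0, 0.02, 72)

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, deg in zip(axes, (1, 2, 4)):
    fitted, std_res = standardized_residuals(age, ratio, deg)
    ax.scatter(fitted, std_res)                    # residuals vs fitted values
    ax.axhline(0, color="grey")                    # roughly 95% should fall within ±2
    ax.set_title(f"degree {deg}")
plt.show()
```

As the polynomial order increases, the systematic pattern in the residuals should disappear, as in the growth example above.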
Violation Linearity Assumption Remedial measures • Polynomial regression: previous example • Transformation (Box-Cox) of the independent variable • Broken line (or piecewise) regression Distribution of residuals Normality check • Graphical Methods – Histogram, Box-plot – Normal Probability Plot (NPP) • Formal Tests given in many stat. packages – Shapiro-Wilk statistic W for small samples (< 50) – Kolmogorov-Smirnov test • If violation against the normality assumption of Y we can try to solve the problem using transformations of Y (use NPP and Box-Cox to decide which transformations). Distribution of residuals Normality check Polynomial 4th order model Simple regression 0,10 .01 .05 .10 .25 .50 .75 .90 .95 0,08 .99 .01 .05 .10 .25 .50 .75 .90 .95 .99 0,06 0,05 0,04 -0,00 0,02 -0,05 0,00 -0,10 -0,02 -0,15 -0,04 -0,20 -0,06 -0,25 -0,08 -3 -2 -1 0 1 2 3 -3 Normal Quantile -2 Normal Quantile Test f or Normality Test f or Normality Shapiro-Wilk W Test Shapiro-Wilk W Test W 0,862849 W Prob<W <,0001 -1 Conclusions ? 0,973852 Prob<W 0,3949 0 1 2 3 Normality check Example from cardiology: late loss Mauri et al, Circulation 2005 Normality check Example from cardiology: late loss Mauri et al, Circulation 2005 Why graphics are important ? Linear Fit Statistical reports for four analysis Linear Fit Y 1 = 3,00009 + 0,50009 X1 Y 2 = 3,00091 + 0,5 X2 Summary of Fit Summary of Fit RSquare 0,666542 RSquare 0,666242 RSquare Adj 0,629492 RSquare Adj 0,629158 Root Mean Square Error 1,236603 Root Mean Square Error 1,237214 Mean of Response 7,500909 Mean of Response 7,500909 Observ ations (or Sum Wgts) 11 Observ ations (or Sum Wgts) Analy sis of Variance Source DF Model What do you expect about the underlying data ? C Total Analy sis of Variance Sum of Squares Mean Square F Ratio Source DF Model 1 27,500000 27,5000 17,9656 9 13,776291 1,5307 Prob>F 10 41,276291 1 27,510001 27,5100 17,9899 9 13,762690 1,5292 Prob>F Error 10 41,272691 0,0022 C Total Error Parameter Estimates Term 11 Sum of Squares Mean Square 0,0022 Parameter Estimates Estimate Std Error t Ratio Prob>|t| Term Intercept 3,0000909 1,124747 2,67 0,0257 Intercept X1 0,5000909 0,117906 4,24 0,0022 X2 Estimate 3,0009091 0,5 Std Error t Ratio Prob>|t| 1,125302 2,67 0,0258 0,117964 4,24 0,0022 Linear Fit Linear Fit Y 4 = 3,00173 + 0,49991 X4 Y 3 = 3,00245 + 0,49973 X3 Summary of Fit Summary of Fit RSquare 0,666324 RSquare 0,666707 RSquare Adj 0,629249 RSquare Adj 0,629675 Root Mean Square Error 1,236311 Root Mean Square Error 1,235695 Mean of Response 7,500909 Mean of Response 7,5 Observ ations (or Sum Wgts) Observ ations (or Sum Wgts) 11 Model Error C Total DF 11 Analy sis of Variance Analy sis of Variance Source F Ratio Sum of Squares Mean Square F Ratio Source Model 1 27,470008 27,4700 17,9723 9 13,756192 1,5285 Prob>F Error 10 41,226200 0,0022 C Total DF Sum of Squares Mean Square F Ratio 1 27,490001 27,4900 18,0033 9 13,742490 1,5269 Prob>F 10 41,232491 0,0022 Parameter Estimates Parameter Estimates Std Error t Ratio Prob>|t| Term Std Error t Ratio Prob>|t| Intercept 3,0024545 1,124481 2,67 0,0256 Intercept 3,0017273 1,123921 2,67 0,0256 X3 0,4997273 0,117878 4,24 0,0022 X4 0,4999091 0,117819 4,24 0,0022 Term Estimate Estimate Why graphics are important ? 
Regression lines for the four analyses.

Unequal Variances
• Model assumption: var(εi) = var(yi) is a constant σ²
• Heteroscedasticity (unequal variances) does not bias the estimates of the regression parameters β, but it inflates the variances of the parameter estimates and can affect R², s² and significance tests substantially.
• Detect heteroscedasticity through plots of the (standardized) residuals against ŷ
• Remedial actions:
– Variance-stabilizing transformations of yi (e.g. square root, logarithm)
– Weighted least squares (WLS) estimation

Unequal Variances example
Plot of the residuals e against the fitted values ŷ.

Outliers and Influential Observations
• Sometimes, while most of the observations fit the model, some of the observations clearly do not. This occurs when there is something wrong with those observations or when the model is faulty.
• A point has great influence when it has a large effect on the parameter estimates.
• There are two types of influential observations:
– Outliers: extreme observations of the dependent variable that exhibit large residuals
– Leverage points: observations with an extreme value on one of the independent variables
• Outliers are detected by examining various types of residuals.
• Leverage points are detected with the leverage hii. This measure describes how far away a point is from the centroid of all points in the space of the independent variables. So, leverage is a measure of remoteness.
• Influence is assessed by examining the residuals and the leverages.

Outliers and Influential Observations
Illustration: an influential outlier versus an influential leverage point.

Outliers and Influential diagnostics
• For detecting leverage points we examine the leverage hii of the observations; for a single predictor,
hii = 1/n + (Xi − X̄)² / Σj (Xj − X̄)²
• The leverage of the ith observation increases with increasing deviation of Xi from the mean of this variable.
• hii takes values between 0 and 1, and the leverages sum to p (the number of parameters).
• Observations with hii greater than 2p/n are called leverage points and need further investigation.

Outliers and Influential diagnostics
• For detecting outliers that do not belong to the model, Studentized (deleted) residuals are mostly used:
ei* = ei / [ s(i) √(1 − hii) ], which follows a t distribution with n − p − 1 degrees of freedom,
where s(i) is the standard deviation s obtained when least squares is run after deleting the ith case, and hii is the leverage.

Outliers and Influential diagnostics
• Not all influential points have large ei*.
• Additional measures have been defined that tell us how much a parameter b or a fitted value ŷ would change if a given point were deleted.
• The most used measures are DFBETAS, DFFITS and Cook’s Distance:
Cook’s Distance Di = [ ei² / ((k + 1) s²) ] · [ hii / (1 − hii)² ]
• Criteria can be defined to decide whether a point is influential. “Large” Cook’s Distance values mark the most influential points and should be investigated.

Checking for Influential Observations
Contour curves show Cook’s Distances corresponding to the 50th, 75th and 95th influence percentiles. As a result, 5% of the residuals will always fall outside the 95th percentile curve.

Multicollinearity
• The quality of the estimates, as measured by their variances, can be seriously affected if the independent variables are closely related (highly correlated) to each other.
• An obvious method of assessing the degree to which each independent variable is related to all other independent variables is to examine R2 • Popular measures: – tolerance TOLj = 1 – Rj2 – variance inflation factor VIFj = TOLj -1 Remedial measures Overview • Depending on the nature of the problem, one or more of the following may be appropriate : – – – – – – – – Consider Transforming the data Consider using Weighted Least Squares Consider using Robust Regression Use a more complicated equation e.g. add quadratic or cubic terms to the model Add an omitted predictor variable Consider more complicated models e.g. time series models Consider variable reduction techniques Consider ridge regression Model validation • After remedial measures have been taken and diagnostics analyzed to make sure that the remedial measures were succesful, the final step of the modelbuilding proces is the validation of the selected regression model. • Three basic ways of validating a regression model are: – Collection of new data to check predictive ability of the model → preferred method, but not practical – Comparison of the results with theory, simulation results or previous empirical results – Split the study data into model-building (training) and validation data set randomly : cross-validation. Validation data set is used to re-estimate and compare the regression coefficients. Sample Size • For the linear regression model the desired sample size is determined via an analysis of the power of the F-test. • For this analysis we need following info : – the number of predictors you want to analyze (rule of thumb: number of observations must be at least 15 times number of predictors in the study) – the significance level alpha (type 1 error) – the size of the effect in the population (use measures for proportion of explained variation such as R2 for the whole model) • The bigger the expected effect, the smaller the size of the sample. What you will learn • Multiple Linear Regression – – – – – – – – – Basic concepts Multiple regression models The model-building process Selection of predictor variables Model diagnostics Remedial measures Model validation Qualitative predictor variables Practical examples Categorical predictors • So far we have utilized only quantitative predictor variables in the regression models. • Qualitative predictor variables can also be incorporated in the linear regression model by using indicator variables. • Indicator variables or dummy variables are variables that take only two values eg. 0 and 1 or -1 and 1. • Let’s have a look at the simplest example : one qualitative predictor with two levels: dichotomous predictor. Dichotomous predictor Fitness example Regression model with one dichotomous predictor variable. yi = β0 + β1 xi + ε with xi = -1 for male and xi = 1 for female Response: Oxy This is a model for the familiar two-sample testing problem : H0: µ1 = µ2 against H1: µ1 ≠ µ2 Summary of Fit RSquare 0,234692 RSquare Adj 0,208302 Root Mean Square Error 4,740032 Mean of Response 47,37581 Observ ations (or Sum Wgts) 31 Parameter Estimates Term Estimate Std Error t Ratio Prob>|t| Intercept 47,293867 0,851778 55,52 <,0001 Sex[F-M] 2,5401333 0,851778 2,98 0,0057 Ef fect Test Source Sex Nparm 1 DF 1 Sum of Squares 199,81246 F Ratio Prob>F 8,8932 0,0057 What can you conclude from the statistical output about the mean response for the males versus the females ? 
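To make the interpretation explicit, the short sketch below reconstructs the two group means from the estimates reported above (written with decimal points rather than commas). Under the ±1 coding stated on the slide, the intercept is the average of the two group means and the Sex[F-M] coefficient is half of their difference.

```python
# Estimates copied from the output above (effect coding: +1 = female, -1 = male)
b0 = 47.293867        # intercept = average of the two group means
b1 = 2.5401333        # Sex[F-M] coefficient = half of the F - M difference

mean_female = b0 + b1 * (+1)
mean_male   = b0 + b1 * (-1)
print(round(mean_female, 2), round(mean_male, 2))   # about 49.83 vs 44.75
print(round(2 * b1, 2))                             # estimated F - M difference, about 5.08
```

The t-test on the Sex[F-M] coefficient (p = 0,0057) is therefore the familiar two-sample test of H0: µ1 = µ2 set up at the top of this example.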
Dichotomous predictor Leverage Plots Polychotomous predictor • To incorporate a categorical variable with more than two levels in the regression model, we will create several dummy variables. • For a variable with g categories, we need to incorporate only g -1 dummy variables in the model. The category for which the dummy variable is not in the model is the reference category. Category A B X1 (A) 1 0 X2 (B) 0 1 X3 (C) 0 0 C 0 0 1 Polychotomous predictor Example Let’s examine the effect of drug treatment (a, d or placebo) on the response (LBS=bacteria count) in 30 subjects. Which statistical method would you use to tackle this problem ? Polychotomous predictor Example Response: LBS Summary of Fit RSquare 0,227826 RSquare Adj 0,170628 Root Mean Square Error 6,070878 Mean of Response 7,9 Observ ations (or Sum Wgts) 30 Parameter Estimates Term Estimate Std Error t Ratio Prob>|t| 7,9 1,108386 7,13 <,0001 Drug[a-placebo] -2,6 1,567494 -1,66 0,1088 Drug[d-placebo] -1,8 1,567494 -1,15 0,2609 Intercept Ef fect Test Source Drug Nparm 2 DF 2 Sum of Squares 293,60000 F Ratio Prob>F 3,9831 0,0305 From linear regression to the general linear model. Coding scheme for the categorical variable defines the interpretation of the parameter estimates. Polychotomous predictor Example - Regressor construction • Terms are named according to how the regressor variables were constructed. • Drug[a-placebo] means that the regressor variable is coded as 1 when the level is “a”, - 1 when the level is “placebo”, and 0 otherwise. • Drug[d-placebo] means that the regressor variable is coded as 1 when the level is “d”, - 1 when the level is “placebo”, and 0 otherwise. • You can write the notation for Drug[a-placebo] as ([Drug=a][Drug=Placebo]), where [Drug=a] is a one-or-zero indicator of whether the drug is “a” or not. • The regression equation then looks like: Y = b0 + b1*((Drug=a)-(Drug=placebo)) + b2*(Drug=d)-(Drug=placebo)) + error Polychotomous predictor Example – Parameters and Means • With this regression equation, the predicted values for the levels “a”, “d” and “placebo” are the means for these groups. • For the “a” level: Pred y = 7.9 + -2.6*(1-0) + -1.8*(0-0) = 5.3 For the “d” level: Pred y = 7.9 + -2.6*(0-0) + -1.8(1-0) = 6.1 For the “placebo” level: Pred y = 7.9 + -2.6(0-1) + -1.8*(0-1) = 12.3 • The advantage of this coding system is that the regression parameter tells you how different the mean for that group is from the means of the means for each level (the average response across all levels). • Other coding schemes result in different interpretations of the parameters. What did you learn - hopefully • Multiple Linear Regression – – – – – – – – – Basic concepts Multiple regression models The model-building process Selection of predictor variables Model diagnostics Remedial measures Model validation Qualitative predictor variables Practical examples Linear regression: do-it-yourself with SPSS Scatterplot Linear regression Linear regression Linear regression Questions? Take home messages • Multiple regression models are generally used to study the effect of continuous independent variables on a continuous response variable, simultaneously. • Before modelling, look carefully at the data and always start the analysis by plotting all variables individually and against each other. Evaluate correlations. • Keep in mind the model assumptions from the start. • Building the ‘best’ regression model is an iterative process. • Be careful with automatic variable selection procedures. 
• Examine residuals and leverages to diagnose and remedy the model. • Validate the model using an independent data set. And now a real break… For further slides on these topics please feel free to visit the metcardio.org website: http://www.metcardio.org/slides.html