Multiple Linear Regression Model
By Walden University Statsupport Team
March 2011

• Introduction
• Assumptions
• ANOVA for Multiple Linear Regression
• Regression Coefficients
• Examining Multiple Regression Conditions

Introduction

Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. Every set of values of the independent variables x is associated with a value of the dependent variable y. The model for multiple linear regression, given n observations, is:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \dots, n$$

where $Y_i$ is the dependent variable, $\beta_0$ is the intercept, and $\beta_1, \beta_2, \dots, \beta_p$ are the regression coefficients of the independent variables included in the regression model. $\varepsilon_i$ is the random error term. Its estimate from the fitted model is usually called the residual: the difference between the observed and predicted values of the dependent variable. The best-fitting line for the observed data (a plane or hyperplane when there are several predictors) is calculated by minimizing the sum of the squares of the vertical deviations from each data point to the line.

Multiple Linear Regression Assumptions

The assumptions indicated under simple linear regression (in Week 10) also hold for multiple linear regression. An important additional assumption in multiple linear regression is that there is no exact linear relationship among the X variables (they are linearly independent). In other words, there is no multicollinearity problem. Multicollinearity occurs when two or more predictors in the model are highly correlated and provide redundant information about the response. Important indicators of a multicollinearity problem are:
• none of the individual coefficients is statistically significant, but the overall F statistic of the ANOVA model is;
• the regression coefficients are not stable when different samples are used;
• the signs of the regression coefficients change as variables are added to the model.
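The requirement that no exact linear relationship exists among the X variables can be illustrated with a small sketch. Least squares solves the normal equations, which have a unique solution only when the matrix X'X is invertible. The Python below is a minimal illustration under stated assumptions: the data values and the helper names (`gram_matrix`, `det3`, `design`) are hypothetical and are not part of the original material.

```python
# Sketch: exact multicollinearity makes the least-squares normal equations singular.
# All data values below are hypothetical, chosen only for illustration.

def gram_matrix(rows):
    """Return X'X for a design matrix given as a list of rows."""
    k = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def design(x1, x2):
    """Design matrix with an intercept column and two predictors."""
    return [[1, u, v] for u, v in zip(x1, x2)]

x1 = [1, 2, 3, 4, 5]
x2_collinear = [2 * v for v in x1]   # exact linear function of x1
x2_independent = [2, 1, 4, 3, 6]     # not an exact linear function of x1

det_collinear = det3(gram_matrix(design(x1, x2_collinear)))
det_independent = det3(gram_matrix(design(x1, x2_independent)))

print(det_collinear)    # zero: no unique least-squares solution exists
print(det_independent)  # nonzero: the coefficients are uniquely determined
```

In practice predictors are rarely exactly collinear. The more common situation is a determinant close to zero, which leaves the coefficients estimable but unstable.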
ANOVA for Multiple Linear Regression

Analysis of Variance (ANOVA) consists of calculations that provide information about levels of variability within a regression model and form a basis for tests of significance. The ANOVA calculations for multiple regression are nearly identical to the calculations for simple linear regression, except that the degrees of freedom are adjusted to reflect the number of explanatory variables included in the model. For p explanatory variables, the model degrees of freedom (DFM) are equal to p, the error degrees of freedom (DFE) are equal to (n - p - 1), and the total degrees of freedom (DFT) are equal to (n - 1).

The column labeled F gives the overall F-test of H0 that all the regression coefficients are equal to 0 versus Ha that at least one of the regression coefficients does not equal zero. The column labeled significance F has the associated p-value. When the p-value > 0.05, we do not reject H0 at significance level 0.05. When the p-value < 0.05, we reject H0 at significance level 0.05. The p-value attached to each individual coefficient measures the evidence that the variable is linearly associated with the dependent variable after accounting for the other variables in the model: the smaller the p-value, the stronger the evidence. The R-squared of the regression is the fraction of the variation in your dependent variable that is accounted for by your independent variables.

In multiple linear regression, the regression coefficient tells you how much the dependent variable is expected to increase when that independent variable increases by one unit, holding all the other independent variables constant.

A demonstration of multiple linear regression model fitting in SPSS is given next. We will use the data on forced expiratory volume (FEV.sav). In this dataset, we want to examine the effects of parental cigarette smoking status and age on FEV. The FEV.sav dataset is shown on the following slide. Sex is dummy coded as 0 for female and 1 for male.
Likewise, SMOKE is dummy coded 0 for nonsmoker and 1 for smoker.

[Figure: A display of the FEV data in SPSS]

To fit a multiple linear regression model in SPSS using the FEV data, do the following: Analyze > Regression > Linear, then move Forced expiratory volume into the Dependent box and smoke and age into the Independent(s) box. Then click OK. This will give you the model summary table, the ANOVA table and the regression coefficients table in the output window.

[Figure: A demonstration of how to start fitting the multiple regression model in SPSS]

[Figure: A demonstration of how to select the dependent and independent variable(s) for fitting the multiple regression model in SPSS]

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .759a | .577 | .575 | .5650627 |

a. Predictors: (Constant), smoke, age

ANOVAb

| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 283.058 | 2 | 141.529 | 443.254 | .000a |
| Residual | 207.862 | 651 | .319 | | |
| Total | 490.920 | 653 | | | |

a. Predictors: (Constant), smoke, age
b. Dependent Variable: Forced Expiratory Volume (l/sec)

Coefficientsa

| Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | .367 | .081 | | 4.511 | .000 |
| age | .231 | .008 | .786 | 28.176 | .000 |
| smoke | -.209 | .081 | -.072 | -2.588 | .010 |

a. Dependent Variable: Forced Expiratory Volume (l/sec)

Model outputs having fitted a multiple linear regression of FEV as a function of Age and SMOKE

1. In multiple linear regression, it is necessary to look at the adjusted R square instead of R squared to examine the percentage of variability in the dependent variable that is explained by the independent variables. In the multiple regression setting an adjustment is needed because, as predictors are added to the model, some of the variance in Y is explained simply by chance. The adjusted R square in this case shows that about 57.5% of the variability in FEV is explained by age and smoking status.
2. The p-value of the F statistic in the ANOVA table is less than 0.05.
Hence we reject the null hypothesis and state that at least one of the regression coefficients is statistically significantly different from zero.
3. The p-values for the regression coefficients of age and smoke are both less than 0.05. That means they are statistically significantly different from zero. Hence both smoke and age are significant predictors of FEV. Our regression equation based on the indicated regression parameter estimates would be: FEV = 0.367 + 0.231*Age - 0.209*Smoke.

Regression Coefficients

The interpretation of the regression coefficients in multiple regression should be made as a rate of change in the conditional mean of Y instead of as a rate of change in Y. The t-test tests the significance of each regression coefficient. Coefficients that are not statistically significant should not be interpreted further. Such a result should be reported as “No statistically significant linear dependence of the mean of Y on x was found”. In our case the coefficient of smoke = -0.209. This suggests that, at a given age, smokers have on average an FEV 0.209 lower than nonsmokers. The coefficient of age = 0.231. This indicates that, for a given smoking status, each additional year of age is associated with a 0.231 increase in mean FEV.

Examining Multiple Regression Conditions

In multiple regression analysis, a major task is to check for any violations of the multiple linear regression model assumptions stated earlier. This includes checking the linearity assumption, the normality assumption, the presence of outliers and influential observations, multicollinearity and non-constant variance (heteroscedasticity).

1. A violation of the linearity assumption can be checked using scatter plots, by plotting observed versus predicted values, or by plotting residuals versus predicted values. The points should be symmetrically distributed around a diagonal line in the former plot or a horizontal line in the latter plot.
2. A Normal Q-Q plot of the standardized residuals can be used to check for violations of the assumption of normality. Deviations from the diagonal line suggest non-Normality.
3. Residual plots can be used to detect outliers and influential observations. In SPSS, outliers can be detected using the casewise diagnostic analysis by flagging cases whose standardized residuals lie more than 2 or 3 standard deviations from zero. Influential observations can be detected using leverage values. If the leverage value is higher than (3*p-1)/n, where p is the number of parameters in the model including the intercept and n is the number of data points (observations), the observation is declared an influential observation.
4. Multicollinearity can be detected using a correlation matrix before fitting the model. If two independent variables to be included in the model have a strong, statistically significant linear correlation, they are likely to cause multicollinearity problems. A variance inflation factor is also used to detect the problem of multicollinearity. The variance inflation factor (VIF) gives a quick measure of how much the variance of an estimated coefficient is inflated by that variable's correlation with the other predictors. When serious multicollinearity issues exist, the variance inflation factor will be very large for the variables involved. A VIF of 10 or above indicates a multicollinearity problem.
5. A violation of the constant variance assumption can be checked using residual plots of the regression standardized residual versus the regression standardized predicted value.

To generate these diagnostics in SPSS for the FEV data, do the following: Analyze > Regression > Linear, then move Forced expiratory volume into the Dependent box and smoke and age into the Independent(s) box. Then click on Statistics and select Collinearity diagnostics, Durbin-Watson and Casewise diagnostics. Note that outliers outside 3 standard deviations is selected by default. Change 3 to 2. Then click Continue and then click OK.
This will give you output for checking multicollinearity and outliers.

[Figure: A demonstration of how to select diagnostic statistics for checking outliers and multicollinearity issues in SPSS]

Coefficientsa

| Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | .367 | .081 | | 4.511 | .000 | | |
| smoke | -.209 | .081 | -.072 | -2.588 | .010 | .837 | 1.195 |
| age | .231 | .008 | .786 | 28.176 | .000 | .837 | 1.195 |

a. Dependent Variable: Forced Expiratory Volume (l/sec)

Casewise Diagnosticsa

| Case Number | Std. Residual | Forced Expiratory Volume (l/sec) | Predicted Value | Residual |
|---|---|---|---|---|
| 79 | 2.476 | 3.842 | 2.442814 | 1.399186 |
| 226 | 2.191 | 3.681 | 2.442814 | 1.238186 |
| 321 | 2.205 | 4.842 | 3.595837 | 1.246163 |
| 322 | 2.505 | 4.55 | 3.134628 | 1.415372 |
| 332 | -2.037 | 2.236 | 3.386842 | -1.15084 |
| 365 | 2.227 | 4.393 | 3.134628 | 1.258372 |
| 372 | 2.89 | 4.789 | 3.156238 | 1.632762 |
| 404 | 2.055 | 4.065 | 2.904023 | 1.160977 |
| 415 | -2.151 | 1.458 | 2.673419 | -1.21542 |
| 422 | 2.831 | 4.756 | 3.156238 | 1.599762 |
| 442 | 2.989 | 4.593 | 2.904023 | 1.688977 |
| 444 | -2.157 | 1.916 | 3.134628 | -1.21863 |
| 452 | 3.698 | 5.224 | 3.134628 | 2.089372 |
| 652 | -2.947 | 2.853 | 4.518255 | -1.66526 |

a. Dependent Variable: Forced Expiratory Volume (l/sec)

Outputs for checking multicollinearity and outliers (casewise diagnostics output shown only in part)

As can be seen from the output above, both smoke and age have a variance inflation factor (VIF) of 1.195, which is less than 10. Hence there is no evidence of multicollinearity. However, several observations had a standardized residual larger than 2 in absolute value and are hence classified as outliers.

To get diagnostics for the Normality and constant variance assumptions, click on Plots, move ZPRED to X and ZRESID to Y, and select Histogram, Normal probability plot and Produce all partial plots. Then click Continue and then click OK. This will give you several plots, which are shown next.

[Figure: A demonstration of how to select diagnostic plots for Normality and constant variance in SPSS]
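The collinearity statistics and casewise diagnostics output above can be cross-checked by hand. The sketch below assumes two standard relationships that the slides do not state explicitly: VIF is the reciprocal of tolerance, and the standardized residual is the raw residual divided by the standard error of the estimate. All numbers are copied from the SPSS output quoted in this document. Only three cases are included, to keep the example short.

```python
# Cross-check of the SPSS output using only numbers quoted in the text.

tolerance = 0.837                # Tolerance reported for both smoke and age
vif = 1 / tolerance              # assumed relationship: VIF = 1 / tolerance

std_error_estimate = 0.5650627   # "Std. Error of the Estimate" from the Model Summary

# (case number, raw residual) pairs copied from the Casewise Diagnostics table
raw_residuals = [(79, 1.399186), (452, 2.089372), (652, -1.66526)]

# Assumed relationship: standardized residual = raw residual / std. error of estimate
std_residuals = {case: res / std_error_estimate for case, res in raw_residuals}

# Flag cases beyond 2 standard deviations, the cutoff chosen in the walkthrough
flagged = [case for case, z in std_residuals.items() if abs(z) > 2]

print(round(vif, 3))
for case, z in std_residuals.items():
    print(case, round(z, 3))
print(flagged)
```

The recomputed values should agree with the Tolerance/VIF and Std. Residual columns up to rounding.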
[Figure: A histogram plot of standardized residuals]

[Figure: A Normal Probability Plot of standardized residuals]

[Figure: A scatter plot of standardized residuals versus standardized predicted values]

The histogram of the standardized residuals is almost symmetrical, and the points in the normal probability plot lie close to the diagonal line. These suggest that the Normality assumption is not violated. The scatter plot of standardized residuals versus standardized predicted values indicates that there is less variation at the lower end of the predicted values than at the higher end. There is some evidence of heteroscedasticity.

Besides partial regression plots, Cook’s distance and leverage values can be used to examine the presence of influential observations. To obtain these diagnostics in SPSS: having selected your dependent and independent variables, click on Save and then, under Distances, select Cook’s and Leverage values. Then select a folder in which to save the output by clicking on Browse. This will give you diagnostic values for checking the presence of influential points.

[Figure: A demonstration of how to select measures for assessing influential observations in SPSS]

[Figure: Cook’s distance and centered leverage values for assessing outliers and influential points]

Using the formula indicated earlier for detecting influential points, (3*p-1)/n: since we have three parameters and 654 observations, the critical value is (3*3-1)/654 = 8/654 = 0.0122. There are several observations that have a centered leverage value larger than this. It would be important to investigate those observations for accuracy and validity. Observations that have a Cook’s distance greater than 4/n can be flagged as outliers or influential points. Here 4/654 = 0.0061. There are several observations that have a Cook’s distance greater than 0.0061. These observations warrant further investigation.

Final Remarks

In multiple linear regression:
1. a linear combination of two or more predictor variables is used to explain the variation in the response (dependent) variable.
2. we can only ascertain relationships, but never be sure about the underlying causal mechanism.
3. it is important to investigate the residuals to determine whether or not they appear to fit the assumptions made in fitting the model.
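As a rough check on the figures quoted in this walkthrough, the ANOVA quantities and the rule-of-thumb cutoffs can be recomputed from the reported sums of squares and sample size. The sketch below uses the degrees-of-freedom definitions from the ANOVA section (DFM = p, DFE = n - p - 1, DFT = n - 1) and the usual definitions of F, R squared and adjusted R squared. The variable names are illustrative. Because the inputs are rounded SPSS values, small differences in the last digit are expected.

```python
import math

# Figures copied from the SPSS ANOVA table for the FEV model
ss_regression = 283.058
ss_residual = 207.862
n = 654   # observations
p = 2     # explanatory variables (age, smoke)

# Degrees of freedom as defined in the ANOVA section
df_model = p
df_error = n - p - 1
df_total = n - 1

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / df_model
ms_residual = ss_residual / df_error

f_stat = ms_regression / ms_residual
r_squared = ss_regression / ss_total
adj_r_squared = 1 - ms_residual / (ss_total / df_total)
std_error_estimate = math.sqrt(ms_residual)

# Cutoffs used in the text; the parameter count here includes the intercept
n_params = p + 1
leverage_cutoff = (3 * n_params - 1) / n   # centered leverage cutoff
cooks_cutoff = 4 / n                       # Cook's distance cutoff

print(df_model, df_error, df_total)
print(round(f_stat, 2), round(r_squared, 3), round(adj_r_squared, 3))
print(round(std_error_estimate, 4))
print(round(leverage_cutoff, 4), round(cooks_cutoff, 4))
```

The results should land close to the SPSS values (F of about 443.25, R Square .577, Adjusted R Square .575, standard error of the estimate about .565) and reproduce the cutoffs 0.0122 and 0.0061.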