Multiple Linear Regression Model
By
Walden University Statsupport Team
March 2011
Multiple Linear Regression Model
• Introduction
• Assumptions
• ANOVA for Multiple Linear Regression
• Regression Coefficients
• Examining Multiple Regression Conditions
Introduction
• Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. Every value of the independent variable x is associated with a value of the dependent variable y.
• The model for multiple linear regression, given n observations, is
  Yi = β0 + β1xi1 + β2xi2 + ... + βpxip + εi
  where Yi is the dependent variable, β0 is the intercept, and β1, β2, ..., βp are the regression coefficients of the independent variables included in the model. εi is the random error term, usually described as the residual; it is the difference between the observed and predicted values of the dependent variable.
• The best-fitting line for the observed data is calculated by minimizing the sum of the squares of the vertical deviations from each data point to the line.
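To make the least-squares idea concrete, here is a minimal Python sketch. The data and variable names are synthetic, chosen only to illustrate the mechanics; this is not part of the original SPSS workflow shown later.

```python
import numpy as np

# Synthetic data following the slide's model Yi = β0 + β1*xi1 + β2*xi2 + εi.
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 1, n)   # "true" coefficients are 1.0, 2.0, -0.5

# Design matrix with a leading column of ones for the intercept β0.
X = np.column_stack([np.ones(n), x1, x2])

# np.linalg.lstsq minimizes the sum of squared vertical deviations,
# i.e. it returns the β that makes Σ(yi - ŷi)² as small as possible.
beta, rss, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("Estimated [β0, β1, β2]:", beta)
```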
Multiple Linear Regression Assumptions
• The assumptions indicated under simple linear regression (in Week 10) also hold for multiple linear regression.
• An important additional assumption in multiple linear regression is that there is no exact linear relationship among the X variables (they are linearly independent); that is, there is no multicollinearity problem. Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.
• Important indicators of a multicollinearity problem are: none of the individual coefficients is statistically significant even though the overall F statistic of the ANOVA model is; the regression coefficients are not stable when different samples are used; and the signs of the regression coefficients change as variables are added to the model.
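As a rough illustration of how redundant predictors show up, the sketch below builds two nearly collinear synthetic predictors and inspects their correlation matrix. All names and values here are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: x2 is almost a linear copy of x1, so the pair carries
# redundant information; x3 is independent of both.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)

predictors = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# A correlation matrix of the predictors is a quick first check: correlations
# close to ±1 between two X variables flag potential multicollinearity before
# the model is even fitted.
print(predictors.corr().round(2))
```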
ANOVA for Multiple Linear Regression
• Analysis of Variance (ANOVA) consists of calculations that provide information about levels of variability within a regression model and form a basis for tests of significance.
• The ANOVA calculations for multiple regression are nearly identical to the calculations for simple linear regression, except that the degrees of freedom are adjusted to reflect the number of explanatory variables included in the model. For p explanatory variables, the model degrees of freedom (DFM) are equal to p, the error degrees of freedom (DFE) are equal to (n - p - 1), and the total degrees of freedom (DFT) are equal to (n - 1).
ANOVA for Multiple Linear Regression
Continued…
• The column labeled F gives the overall F test of H0: all regression coefficients equal 0 versus Ha: at least one of the regression coefficients does not equal zero.
• The column labeled Significance F has the associated p-value. When the p-value > 0.05, we do not reject H0 at significance level 0.05. When the p-value < 0.05, we reject H0 at significance level 0.05.
• The p-value for each individual coefficient tells you how confident you can be that that variable has some correlation with the dependent variable.
• The R-squared of the regression is the fraction of the variation in your dependent variable that is accounted for by your independent variables.
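A worked check of these quantities, using the sums of squares that the FEV example later in this presentation produces (SSR = 283.058, SSE = 207.862, n = 654 observations, p = 2 explanatory variables), might look like the following Python sketch:

```python
from scipy import stats

# Sums of squares taken from the FEV ANOVA table shown later in this deck.
SSR, SSE = 283.058, 207.862
n, p = 654, 2

DFM = p                # model degrees of freedom
DFE = n - p - 1        # error degrees of freedom
DFT = n - 1            # total degrees of freedom

MSR = SSR / DFM        # mean square for the model
MSE = SSE / DFE        # mean square error
F = MSR / MSE          # overall F statistic
p_value = stats.f.sf(F, DFM, DFE)     # P(F with DFM, DFE df exceeds the observed F)
R_squared = SSR / (SSR + SSE)         # fraction of variation explained

print(f"F = {F:.2f}, p = {p_value:.3g}, R^2 = {R_squared:.3f}")
# Reproduces F ≈ 443.25 and R^2 ≈ 0.577, matching the SPSS output below.
```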
ANOVA for Multiple Linear Regression
Continued…
• In multiple linear regression, the regression coefficient tells you how much the dependent variable is expected to increase when that independent variable increases by one unit, holding all the other independent variables constant.
• A demonstration of multiple linear regression model fitting in SPSS is given next. We will use the data on forced expiratory volumes (FEV.sav). In this dataset, we want to examine the effects of parental cigarette smoking status and age on FEV.
• The FEV.sav dataset is shown on the following slide. Sex is dummy coded as 0 for female and 1 for male. Likewise, SMOKE is dummy coded as 0 for nonsmoker and 1 for smoker.
A display of the FEV data in SPSS
To fit a multiple linear regression model in SPSS using the FEV data, do the following:
Analyze > Regression > Linear, then move Forced expiratory volume into the Dependent box and Smoke and Age into the Independent(s) box. Then click OK.
This will give you the Model Summary table, the ANOVA table, and the Regression Coefficients table in the output window.
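For readers working outside SPSS, a rough Python equivalent of the same fit might look like the sketch below. It assumes the FEV data have been exported to a CSV file named fev.csv with columns fev, age, and smoke; the file name and column names are assumptions, not part of the original dataset.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical CSV export of the FEV.sav data (column names assumed).
data = pd.read_csv("fev.csv")

X = sm.add_constant(data[["age", "smoke"]])   # adds the intercept column
model = sm.OLS(data["fev"], X).fit()

# summary() reports the same pieces SPSS gives: R-squared, the overall
# ANOVA F test, and the table of coefficients with their t tests.
print(model.summary())
```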
A demonstration of how to start fitting the multiple regression model in
SPSS
A demonstration of how to select the dependent and independent variable(s)
for fitting multiple regression model in SPSS.
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .759(a)   .577       .575                .5650627
a. Predictors: (Constant), smoke, age
ANOVA(b)
Model 1        Sum of Squares   df    Mean Square   F         Sig.
  Regression   283.058          2     141.529       443.254   .000(a)
  Residual     207.862          651   .319
  Total        490.920          653
a. Predictors: (Constant), smoke, age
b. Dependent Variable: Forced Expiratory Volume (l/sec)
Coefficients(a)
Model 1        Unstandardized B   Std. Error   Standardized Beta   t        Sig.
  (Constant)   .367               .081                             4.511    .000
  age          .231               .008         .786                28.176   .000
  smoke        -.209              .081         -.072               -2.588   .010
a. Dependent Variable: Forced Expiratory Volume (l/sec)
Model outputs having fitted a multiple linear regression of FEV as a
function of Age and SMOKE
1. In multiple linear regression, it is necessary to look at the adjusted R square instead of R square to examine the percentage of variability in the dependent variable that is explained by the independent variables. In the multiple regression setting this adjustment is needed because, as predictors are added to the model, some of the variance in Y is explained simply by chance. The adjusted R square in this case shows that about 58% of the variability in FEV is explained by age and smoking status.
2. The p-value of the F statistic in the ANOVA table is less than 0.05. Hence we reject the null hypothesis and state that at least one of the regression coefficients is statistically significantly different from zero.
3. The p-values for the regression coefficients of age and smoke are both less than 0.05. That means they are statistically significantly different from zero. Hence both smoke and age are significant predictors of FEV.
Our regression equation based on the indicated regression parameter estimates would be:
FEV = 0.367 + 0.231*Age - 0.209*Smoke
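As a small worked example of using this equation, the sketch below plugs in an illustrative age and smoking status; the specific input values are arbitrary and not drawn from the dataset.

```python
# Predicted FEV from the fitted equation FEV = 0.367 + 0.231*Age - 0.209*Smoke.
def predict_fev(age, smoke):
    return 0.367 + 0.231 * age - 0.209 * smoke

print(predict_fev(age=12, smoke=0))   # 12-year-old, non-smoker -> about 3.14
print(predict_fev(age=12, smoke=1))   # 12-year-old, smoker     -> about 2.93
```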
Regression Coefficients
The interpretation of the regression coefficients in multiple regression should be made as a rate of change in the conditional mean of Y rather than as a rate of change in Y.
The t-test tests the significance of each regression coefficient. Coefficients that are not statistically significant should not be interpreted further; they should be reported as "No statistically significant linear dependence of the mean of Y on x was found."
In our case the coefficient of smoke = -0.209. This suggests that smoking status is associated with an average 0.209 decline in FEV compared to non-smoking status.
The coefficient of age = 0.231. This indicates that each additional year of age is associated with a 0.231 increase in FEV, holding smoking status constant.
Examining Multiple Regression Conditions
In multiple regression analysis, a huge task is to check for any violations of the multiple
linear regression model assumptions stated earlier. These include checking for the
linearity assumption, normality assumption, presence of outliers and influential
observations, multicollinearity and non-constant variance (heteroscedasticity).
1. A violation of the linearity assumption can be checked using scatter plots or by
plotting observed versus predicted values or by plotting residuals versus predicted
values. The points should be symmetrically distributed around a diagonal line in the
former plot or a horizontal line in the latter plot.
2. A Normal Q-Q plot of the standardized residuals can be used to check for violations
of the assumptions of normality. Deviations from the diagonal line suggest non-
Normality.
3. Residual plots can be used to detect outliers and influential observations. In SPSS,
outliers can be detected using the casewise diagnostic analysis by setting the
standard deviations within 2 or 3 units.
Influential observations can be detected using leverage values. If the leverage
value is higher than (3*p-1)/n where p is the number of parameters in the
model including the intercept and n is the number of data points
(observations), an observation is declared an influential observation.
4. Multicollinearity can be detected using a correlation matrix of the predictors before fitting the model. If two independent variables to be included in the model have a statistically significant linear correlation, they are likely to cause multicollinearity problems. A variance inflation factor is also used to detect the problem of multicollinearity. The variance inflation factor (VIF) gives a quick measure of how much a variable is contributing to the standard error in the fitted regression model. When significant multicollinearity issues exist, the variance inflation factor will be very large for the variables involved; a VIF of 10 or above indicates a multicollinearity problem. (A brief sketch of this check is given after this list.)
5. Constant variance assumption violation can be checked using residual
plots of regression standardized residual versus regression standardized
predicted value.
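As referenced in point 4 above, a brief Python sketch of the VIF check might look like this; it reuses the hypothetical fev.csv file and column names assumed in the earlier fitting sketch.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical CSV export of the FEV data (column names assumed).
data = pd.read_csv("fev.csv")
X = sm.add_constant(data[["age", "smoke"]])

# variance_inflation_factor(exog, i) computes the VIF of column i of the
# design matrix; values of 10 or more would signal a multicollinearity problem.
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(name, variance_inflation_factor(X.values, i))
```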
To generate these diagnostics in SPSS for the FEV data, do the following:
Analyze > Regression > Linear, then move Forced expiratory volume into the Dependent box and Smoke and Age into the Independent(s) box. Then click Statistics and select Collinearity diagnostics, Durbin-Watson, and Casewise diagnostics. Note that "outliers outside 3 standard deviations" is selected by default; change 3 to 2. Then click Continue, and then click OK.
This will give you output for checking multicollinearity and outliers.
A demonstration of how to select diagnostic statistics for checking outliers and multicollinearity issues in SPSS.
Coefficients(a)
Model 1        Unstandardized B   Std. Error   Standardized Beta   t        Sig.   Tolerance   VIF
  (Constant)   .367               .081                             4.511    .000
  smoke        -.209              .081         -.072               -2.588   .010   .837        1.195
  age          .231               .008         .786                28.176   .000   .837        1.195
a. Dependent Variable: Forced Expiratory Volume (l/sec)
Casewise Diagnostics(a)
Case Number   Std. Residual   Forced Expiratory Volume (l/sec)   Predicted Value   Residual
79            2.476           3.842                              2.442814          1.399186
226           2.191           3.681                              2.442814          1.238186
321           2.205           4.842                              3.595837          1.246163
322           2.505           4.55                               3.134628          1.415372
332           -2.037          2.236                              3.386842          -1.15084
365           2.227           4.393                              3.134628          1.258372
372           2.89            4.789                              3.156238          1.632762
404           2.055           4.065                              2.904023          1.160977
415           -2.151          1.458                              2.673419          -1.21542
422           2.831           4.756                              3.156238          1.599762
442           2.989           4.593                              2.904023          1.688977
444           -2.157          1.916                              3.134628          -1.21863
452           3.698           5.224                              3.134628          2.089372
652           -2.947          2.853                              4.518255          -1.66526
a. Dependent Variable: Forced Expiratory Volume (l/sec)
Outputs for checking multicollinearity and outliers (casewise diagnostics output shown only partially)
As can be seen from the output presented on the previous slide, both smoke and age have a variance inflation factor (VIF) of less than 10. Hence there is no evidence of multicollinearity.
However, several observations have a standardized residual larger than 2 in absolute value and are hence classified as outliers.
To get diagnostics for the normality and constant variance assumptions, click on Plots, move ZPRED to X and ZRESID to Y, and select Histogram, Normal probability plot, and Produce all partial plots. Then click Continue, and then click OK.
This will give you the several plots shown on the next few slides.
A demonstration of how to select diagnostic plots for Normality and constant
variance in SPSS.
A histogram plot of standardized residuals
A Normal Probability Plot of standardized residual
A scatter plot of standardized residuals versus standardized predicted value
The histogram of the standardized residuals is nearly symmetrical, and the normal probability plot lies close to the diagonal line. These suggest that the normality assumption is not violated.
The scatter plot of standardized residuals versus standardized predicted values indicates that there is less variation at the lower end of the predicted values than at the higher end. There is some evidence of heteroscedasticity.
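Outside SPSS, roughly equivalent diagnostic plots could be produced as in the sketch below, again assuming the hypothetical fev.csv file and column names from the earlier sketches.

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Refit the model on the hypothetical CSV export of the FEV data.
data = pd.read_csv("fev.csv")
X = sm.add_constant(data[["age", "smoke"]])
fit = sm.OLS(data["fev"], X).fit()

influence = fit.get_influence()
std_resid = influence.resid_studentized_internal   # standardized residuals

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(std_resid, bins=20)                    # histogram of residuals
axes[0].set_title("Standardized residuals")
sm.qqplot(std_resid, line="45", ax=axes[1])         # normal probability plot
axes[1].set_title("Normal Q-Q")
axes[2].scatter(fit.fittedvalues, std_resid, s=10)  # residuals vs predicted
axes[2].axhline(0, color="grey")
axes[2].set_title("Residuals vs predicted")
plt.tight_layout()
plt.show()
```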
Besides partial regression plots, Cook's distance and leverage values can be used to examine the presence of influential observations.
To obtain these diagnostics in SPSS:
Having selected your dependent and independent variables, click on Save and then, under Distances, select Cook's and Leverage values. Then select a folder in which to save the output by clicking Browse.
This will give you diagnostic values for checking the presence of influential points.
A demonstration of how to select measures for assessing influential
observations in SPSS
Cook’s distance and centered leverage values for assessing outliers and
influential points
Using the formula indicated earlier for detecting influential points, (3*p - 1)/n: since we have three parameters and 654 observations, the critical value is (3*3 - 1)/654 = 8/654 = 0.0122.
There are several observations that have a centered leverage value larger than this. It would be important to investigate those observations for accuracy and validity.
Observations that have a Cook's distance greater than 4/n can be described as outliers; here 4/654 = 0.0061.
There are several observations that have a Cook's distance greater than 0.0061. These observations warrant further investigation.
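A rough Python sketch of flagging such cases with these thresholds is given below. It again assumes the hypothetical fev.csv file; note that statsmodels' hat_matrix_diag gives ordinary (uncentered) leverage values, which differ from SPSS's centered leverage by 1/n.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Refit the model on the hypothetical CSV export of the FEV data.
data = pd.read_csv("fev.csv")
X = sm.add_constant(data[["age", "smoke"]])
fit = sm.OLS(data["fev"], X).fit()

influence = fit.get_influence()
leverage = influence.hat_matrix_diag          # uncentered leverage values
cooks_d = influence.cooks_distance[0]         # Cook's distance per observation

n = len(data)
p = X.shape[1]                       # number of parameters, intercept included
leverage_cutoff = (3 * p - 1) / n    # (3*3 - 1)/654 = 0.0122 for the FEV data
cooks_cutoff = 4 / n                 # 4/654 = 0.0061 for the FEV data

print("High-leverage cases:", np.where(leverage > leverage_cutoff)[0])
print("High Cook's distance:", np.where(cooks_d > cooks_cutoff)[0])
```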
Final Remarks
In multiple linear regression:
1. a linear combination of two or more predictor variables is used to explain
the variation in the response (dependent) variable.
2. we can only ascertain relationships, but never be sure about the underlying
causal mechanism.
3. it is important to investigate the residuals to determine whether or not
they appear to fit the assumptions made in fitting the model.