YALE School of Management
EMBA MGT511- HYPOTHESIS TESTING AND REGRESSION
K. Sudhir
Lecture 5
Multiple Regression
1. Introduction to Multiple Regression
When we introduced the concept of regression, we used a small dataset of 6 observations
on Sales and Prices. We used the data to illustrate simple linear regression. Suppose that
the dataset actually also includes Advertising, as shown below. (Note that in the example
below we have changed the units of sales to hundreds of lbs, but everything else is the
same in the data).
Sales (hundred lbs)   Price ($)   Advertising ($ thousands)
115                   5           20
105                   5           15
105                   10          25
95                    10          20
95                    15          30
85                    15          25
Previously, we estimated a regression with just Price as the explanatory variable, and we
found that Sales = 120 – 2 Price. We now explore the problem of ignoring Advertising in
this regression.
An inspection of the data on Prices and Advertising suggests that as Price increases,
managers also increase the level of Advertising. Specifically, Price and Advertising
have a correlation of +0.85. Thus, the data can be interpreted as showing that the
sales-reducing effect of an increase in price is at least partly offset by an increase in
sales due to the accompanying increase in advertising.
This means that the coefficient of –2 for Price from the simple regression understates
the magnitude of the true effect of Price on Sales. To see this, we run a multiple
regression using Excel with Price and Advertising as explanatory variables.
The estimated equation is Sales = 95 – 4 Price + 2 Advertising
As expected, the price coefficient is now much larger in magnitude than –2; it is –4. Thus,
by omitting a relevant variable such as Advertising, which is correlated with Price, we
obtained a biased estimate of the Price coefficient.
2. Interpreting the Slope Coefficients in Multiple Regression: Partial Slopes
For a simple regression between Sales and Price, we interpreted the slope coefficient of
Price as follows:
A dollar increase in Price is expected to result in a 200 lb (2 hundred lbs) decrease in
Sales.
In a multiple regression, the coefficient interpretation is as follows:
Controlling for the effects of advertising (or holding Advertising constant), a dollar
increase in Price is expected to result in a 400 lb decrease in Sales.
This interpretation of the coefficient of price as one where the effects of the other
predictor variables such as advertising are accounted for gives rise to the name “partial
slope” for the coefficients of multiple regression. Thus the effect of Price is the partial
effect of Price, as in partial derivative (i.e. holding other variables in the equation
constant).
Many students and managers who are unfamiliar with the inner workings of regression
models are surprised to find that regression coefficients change when other variables are
added to or deleted from the model. Mathematically, there is nothing surprising about
partial slopes changing when variables are added or deleted if these variables are
correlated with variables already in the model. Thus, if the variables are not correlated,
adding or dropping one variable will have no effect on the other coefficients (however,
the statistical uncertainty about the estimates will typically change).
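To make the partial-slope idea concrete, here is a minimal Python sketch (our own illustration; the lecture itself uses Excel) that refits both regressions on the six observations above with ordinary least squares:

```python
import numpy as np

# The six observations from the lecture's table
sales = np.array([115, 105, 105, 95, 95, 85], dtype=float)  # hundred lbs
price = np.array([5, 5, 10, 10, 15, 15], dtype=float)       # $
adv = np.array([20, 15, 25, 20, 30, 25], dtype=float)       # $ thousands

ones = np.ones_like(sales)

# Simple regression: Sales on Price only
b_simple, *_ = np.linalg.lstsq(np.column_stack([ones, price]), sales, rcond=None)
print("Intercept, Price slope:", b_simple.round(2))          # [120.  -2.]

# Multiple regression: Sales on Price and Advertising
b_multi, *_ = np.linalg.lstsq(np.column_stack([ones, price, adv]), sales, rcond=None)
print("Intercept, Price, Advertising:", b_multi.round(2))    # [95. -4.  2.]

# The Price slope changes from -2 to -4 because Price and Advertising are
# correlated; the multiple-regression slope is the partial effect of Price,
# holding Advertising constant.
```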
3. Is the Regression Statistically Significant?
If we perform a multiple regression, the first question we need to address is whether the
independent variables as a group explain a sufficient amount of variation in the y
variable. In the language of hypothesis testing, we use the following null and alternative
hypotheses:
Null: $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
(The regression equation truly has no explanatory power)
Alternative: $H_a:$ at least one $\beta_i \neq 0$, $i = 1, \ldots, k$
(The regression equation truly has explanatory power)
We first explain the logic of the F-test and then illustrate the computations with an
example. In practice, there is no need to do all of these calculations, as Excel provides
the results of the statistical tests, including the F-test. We therefore also discuss the
Excel output and explain how to do the F-test using it.
One simple idea to test the null hypothesis that all slopes are truly zero might be to just
look at the significance of each of the slope coefficients of the regression, which Excel
reports. However the complication is that the equation may have significant explanatory
power and yet all slope coefficients may be insignificant. Why?
Suppose, in the example above, advertising and price are perfectly correlated. Then it will
not be possible to separate out the effect of advertising from the effect of price.
Mathematically, it will not be possible to obtain the marginal effect of one variable,
holding the other one constant. Or, in a multiple regression with one intercept and two
slopes, we have three equations and three unknowns but one equation is redundant
(linearly dependent). In that case, Excel will not provide a result. This is equivalent to
stating that the standard error of one or more slope coefficients is infinitely large.
If the correlation between the predictor variables is high but not +1 or –1, the standard
error of one or more slopes will be high. In that case the partial slope of advertising may
not be significant after taking into account the effect of price. Similarly the partial slope
of price may not be significant after taking into account the effect of advertising. Hence
both the price and advertising partial slope coefficients may be insignificant, even though
the equation as a whole is significant.
When explanatory variables are highly correlated (multicollinear), then all of the
coefficients can become insignificant, even though each variable on its own may have a
significant effect (and the equation has significant explanatory power). This is the
problem of “multicollinearity”.
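The following small simulation sketches this situation (it is our own illustration, not from the lecture, and assumes the statsmodels package is available): the two predictors are almost copies of each other, the overall F-test is highly significant, yet each individual t-test may well be insignificant.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 30

# x2 is nearly a copy of x1, so the two predictors are very highly correlated
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
y = 3 * x1 + 3 * x2 + rng.normal(scale=2, size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print("corr(x1, x2):  ", np.corrcoef(x1, x2)[0, 1].round(3))
print("F-test p-value:", fit.f_pvalue)     # typically tiny: the equation is significant
print("slope p-values:", fit.pvalues[1:])  # each slope can easily be insignificant
```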
The F-test
To test whether the explanatory variables (x) together explain a significant amount of
variation in y, we can use the proportion of variation in the y variable that is explained by
the x variables. We have introduced unadjusted and adjusted R-squares for this purpose.
We now use this idea in a formal hypothesis test called the F-test.
To understand the intuition behind the F-test, consider the ANOVA table (ANOVA
stands for Analysis of Variance). We split the total variance in the y variable into
explained variance (due to the regression) and unexplained variance (residual). The
ANOVA table shows the degrees of freedom, the sum of squares (SS) and the Mean
squares (MS, obtained by dividing the SS by the corresponding degrees of freedom). The
F-statistic is the ratio F = MS(Regression) / MS(Residual).
Under the null hypothesis (truly no explanatory power), the F-value is expected to be
around 1; i.e., the mean square of the variation explained by the regression is about equal
to the mean square of the variation of the residuals. The greater the calculated F-value
from the regression results, the more evidence we have that the regression equation has
true explanatory power. In short, the F-statistic is the explained variance divided by the
residual variance.
ANOVA (for simple linear regression)

Source       df      SS              MS                       F
Regression   1       Σ(ŷ_i − ȳ)²     Σ(ŷ_i − ȳ)² / 1          MS(Regression) / MS(Residual)
Residual     n − 2   Σ(y_i − ŷ_i)²   Σ(y_i − ŷ_i)² / (n − 2)
Total        n − 1   Σ(y_i − ȳ)²     Σ(y_i − ȳ)² / (n − 1)
A Detailed Illustration of the F-Test
We illustrate the F-test using the familiar regression example of Sales and Prices. Recall
that in this simple regression, we use 2 degrees of freedom to compute a slope and an
intercept, leaving (6-2)=4 degrees of freedom for the calculation of residual variance
(denominator = 4 df). For the numerator, we use 1 degree of freedom (only the slope
provides explanatory power).
Sales (y)   Price (x)   ŷ = 120 − 2x   ε = y − ŷ   ε² = (y − ŷ)²   y − ȳ   (y − ȳ)²
115         5           110             5          25               15     225
105         5           110            −5          25                5      25
95          10          100            −5          25               −5      25
105         10          100             5          25                5      25
95          15           90             5          25               −5      25
85          15           90            −5          25              −15     225

ȳ = 100                                            Σε² = 150               Σ(y − ȳ)² = 550

SS(Residual) = 150
SS(Total) = 550
Therefore SS(Regression) = SS(Total) − SS(Residual) = 550 − 150 = 400
MS(Residual) = SS(Residual) / df(Residual) = 150 / 4 = 37.5
MS(Regression) = SS(Regression) / df(Regression) = 400 / 1 = 400
Hence F = MS(Regression) / MS(Residual) = 400 / 37.5 ≈ 10.7
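The sums of squares above can be verified with a few lines of Python (our own sketch of the same arithmetic, not part of the original notes):

```python
import numpy as np

y = np.array([115, 105, 95, 105, 95, 85], dtype=float)  # Sales, in the order of the table above
x = np.array([5, 5, 10, 10, 15, 15], dtype=float)       # Price

y_hat = 120 - 2 * x                       # fitted values from the simple regression
ss_residual = np.sum((y - y_hat) ** 2)    # 150
ss_total = np.sum((y - y.mean()) ** 2)    # 550
ss_regression = ss_total - ss_residual    # 400

f_stat = (ss_regression / 1) / (ss_residual / 4)  # df: 1 and n - 2 = 4
print(ss_regression, ss_residual, ss_total, round(f_stat, 2))  # 400, 150, 550, F about 10.67
```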
The F-statistic follows the F-distribution. As in the case of the t-distribution, we have to
specify the appropriate degrees of freedom. In fact, for the F-distribution we have to
specify both the numerator (df1, associated with the Regression) and the denominator
(df2, associated with the Residual) degrees of freedom. See below for an example of the
shape of the F-distribution and the rejection region.
[Figure: the F-distribution with df1 and df2 degrees of freedom; the rejection region lies to the right of the critical value F(df1, df2, 0.05).]
We reject the null at the 5% risk of a Type I error if the computed F is greater than the
critical value F(df1, df2, 0.05).
From page 804 of the text, F(1, 4, 0.05) = 7.71 (the critical value).
Since F = 10.7 > F-critical, we can reject the null.
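The critical value and the p-value can also be read off the F-distribution directly; here is a minimal sketch using scipy (assuming the scipy package is available) instead of the table in the text:

```python
from scipy import stats

F = 400 / 37.5              # MS(Regression) / MS(Residual), about 10.67
df1, df2 = 1, 4             # numerator (Regression) and denominator (Residual) df

f_crit = stats.f.ppf(0.95, df1, df2)  # 5% critical value, about 7.71
p_value = stats.f.sf(F, df1, df2)     # P(F > 10.67) under the null, about 0.031

print(round(F, 2), round(f_crit, 2), round(p_value, 3))
```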
Doing the F-test in Excel
Compare the detailed computations we did earlier with the ANOVA output below
produced by Excel. The F-statistic is 10.7. Excel also reports the p-value (the probability
of a Type I error if the null hypothesis is rejected) for the F-statistic under "Significance
F". The p-value is 0.03, which indicates that we can reject the null at the 5% risk of a
Type I error.
In practice, checking this p-value is all you need to do to determine whether the
regression is significant!
In the Excel ANOVA output, the columns report the degrees of freedom (df), the sum of
squares (SS), the mean squares (MS), the F-statistic, and the Significance of F (the p-value).

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.852803
R Square             0.727273
Adjusted R Square    0.659091
Standard Error       6.123724
Observations         6

ANOVA
             df    SS     MS     F              Significance F
Regression   1     400    400    10.66666667    0.030906
Residual     4     150    37.5
Total        5     550

             Coefficients   Standard Error   t Stat      P-value       Lower 95%   Upper 95%
Intercept    120            6.614378         18.14229    5.42796E-05   101.6355    138.3645
Price        -2             0.612372         -3.265986   0.030905835   -3.700222   -0.299778
A couple of interesting properties of simple regression (regression with one
independent variable):
1. The p-values of the t-test and the F-test are identical. Why?
2. The square of the t-statistic is equal to the F-statistic, i.e., (−3.266)² ≈ 10.67
These two properties, however, do not hold for multiple regression.
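Both the Excel numbers above and these two properties can be checked with a short statsmodels sketch (our own illustration, assuming the statsmodels package is available):

```python
import numpy as np
import statsmodels.api as sm

sales = np.array([115, 105, 105, 95, 95, 85], dtype=float)
price = np.array([5, 5, 10, 10, 15, 15], dtype=float)

fit = sm.OLS(sales, sm.add_constant(price)).fit()
print(fit.summary())   # R-square, adjusted R-square, F-statistic, p-values, coefficient table

# Property 1: the t-test and F-test p-values coincide
print(fit.pvalues[1], fit.f_pvalue)      # both about 0.031
# Property 2: the squared t-statistic of the slope equals the F-statistic
print(fit.tvalues[1] ** 2, fit.fvalue)   # both about 10.67
```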
4. Correlated Independent Variables: Tradeoff between Bias and Precision
In multiple regression, we use multiple independent variables (x1, x2, …, xk) to explain or
predict a dependent variable (y). Typically, the variables x1, x2, …, xk tend to be correlated.
Earlier, we discussed an example in which price and advertising were correlated. This
may occur for any number of reasons.
Correlated independent variables present the following two problems:
A. Omitted Variable Bias Problem:
Suppose the true model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, but we
estimate $y = \beta_0 + \beta_1 x_1 + \varepsilon$. Then $\hat{\beta}_1$ (the estimated
value of $\beta_1$ in the simple regression) will be biased. We saw this
in the example with Price and Advertising. These two variables had a correlation of 0.85
in that example. If we included both Price and Advertising in the regression, the
estimated equation was: Sales = 95 – 4 Price + 2 Advertising. But if we omitted
Advertising, the estimated equation was: Sales = 120 – 2 Price. In this case, the estimated
Price coefficient is biased (in the computations: -2 instead of -4) if we omit Advertising.
We discuss a systematic way to think about omitted variable bias in point 6 at the end of
these notes.
B. Multicollinearity Problem:
Suppose we include two very highly correlated variables $x_1$ and $x_2$ in the regression. Then,
estimating the equation $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ can lead to
estimates of both $\beta_1$ and $\beta_2$
being statistically insignificant. This can happen even though the equation as a whole is
statistically significant. Thus, even if the F-test result indicates that at least one slope is
truly different from zero, the standard errors of both slope coefficients may be so high as
to render each slope coefficient insignificant. This is called the “multicollinearity” or
“precision” problem. In the most extreme case, when the independent variables are
perfectly correlated, it is impossible to obtain a unique solution, meaning that the
standard errors of the slope coefficients are infinitely large.
Thus we are faced with a tradeoff between bias and precision for the estimates when we
deal with correlated variables. If two independent variables are correlated, we want to
include both in order to avoid bias in the slope coefficients. Yet when we include both, it
is possible that one or both slope coefficient estimates are insignificant.
The Solution:
There is no general solution to this problem. In practice, one should always estimate the
most complete model. In other words, if we have reason to believe that a candidate
independent variable is relevant, we should include that variable in the equation. We can
then use the t-test result for each slope coefficient to decide whether a variable should
remain in the equation. Even if independent variables are highly correlated, it may still be
possible to estimate all the slope coefficients with sufficiently high precision (low
standard errors).
If the correlation is extremely high, and the slope for at least one independent variable is
very imprecise (and hence the t-ratio is insignificant), we could drop one of the correlated
variables in order to solve the "precision" or multicollinearity problem. The argument is
that if two independent variables are highly correlated, it is reasonable to drop one of
them, because in that case it is effectively impossible to hold one variable "under
control" (i.e., constant) while the other is being changed. Thus it would not be possible
to isolate the true effect of each variable with sufficient precision.
5. Why include uncorrelated variables in multiple regression?
We have argued that if two independent variables (that both affect the dependent
variable) are correlated, not including one will bias the estimate of the other. Now
suppose x1 and x2 are uncorrelated, and both affect y. In that case, dropping one of the two
variables will not bias the other slope coefficients. In other words, if two independent
variables are uncorrelated, it is unnecessary to “hold one constant” for the estimation of
the other variable’s impact on y.
Suppose now that we care about the slope coefficient for only one independent variable.
Would it still be better to include both variables in the equation? Yes: if both
independent variables are relevant, including both will improve the precision of both
slope coefficients, so it is best to include both. The idea is that the standard error of the
slope coefficient is reduced when both (uncorrelated) independent variables are included.
We will illustrate this point in the beginning of the next lecture on Dummy Variables
with an example.
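As a preview of that illustration, the small simulation below (our own sketch, assuming the statsmodels package is available) generates two uncorrelated predictors that both affect y and compares the standard error of the x1 slope with and without x2 in the equation:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                   # generated independently of x1
y = 2 * x1 + 5 * x2 + rng.normal(size=n)

fit_without = sm.OLS(y, sm.add_constant(x1)).fit()
fit_with = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print("SE of x1 slope, x2 omitted: ", fit_without.bse[1].round(3))
print("SE of x1 slope, x2 included:", fit_with.bse[1].round(3))
# The x1 slope is roughly the same in both fits (no bias, since x1 and x2 are
# uncorrelated), but omitting x2 pushes x2's effect into the residual, so the
# x1 slope is estimated much less precisely.
```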
6. A Systematic Approach to Think about Omitted Variable Bias
We illustrate the idea of omitted variable bias using a couple of examples:
Example 1: Omitting advertising when estimating the effect of price on sales
Recall the omitted variable bias problem we discussed in part 4. If we included both Price
and Advertising in the regression, the estimated equation was: Sales = 95 – 4 Price + 2
Advertising. But if we omitted Advertising, the estimated equation was: Sales = 120 – 2
Price. In this case, the estimated Price coefficient is biased if we omit Advertising. As we
can see, the bias pulls the coefficient closer to zero: when advertising is included the
price coefficient is –4, but when advertising is omitted it moves closer to zero, to –2.
Why does this bias happen? First, recall that price and advertising were positively
correlated in the data: when prices increased, advertising increased as well.
However, the effects of price and advertising on the dependent variable (sales) are in
opposite directions. An increase in price reduces sales, while an increase in advertising
increases sales, so the two effects partly cancel each other out.
Thus price and advertising are positively correlated, but their effects offset each other.
When we omit advertising from the regression model, the measured effect of price
absorbs the effect of advertising, which tends to reduce it. By omitting advertising, we
therefore measure a smaller effect of price, and we may even conclude that the price
effect is not significantly different from zero, because we did not include the
canceling effect of advertising.
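The direction and size of this bias can be made precise with the standard omitted-variable-bias identity (a textbook result, not derived in these notes, stated here in the lecture's notation):

```latex
% Short regression of y on x1 only, versus the full regression on x1 and x2:
\hat{\beta}_1^{\text{short}} \;=\; \hat{\beta}_1 + \hat{\beta}_2 \,\hat{\delta}_1,
\qquad \text{where } \hat{\delta}_1 \text{ is the slope from regressing } x_2 \text{ on } x_1.
% In the Sales example, regressing Advertising on Price gives a slope of 100/100 = 1,
% so the short-regression Price slope is  -4 + 2 \times 1 = -2,  exactly as computed above.
```

The sign of the bias term is the sign of the omitted variable's effect times the sign of its correlation with the included variable, which is what the summary table at the end of these notes organizes.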
Example 2: Omitting job experience when estimating the effect of schooling on salary
If a person spends more years in school, they will tend to have fewer years of job
experience, relative to a person who spent fewer years in school. Thus years of job
experience would be negatively correlated with years of schooling.
Our intuition suggests that greater job experience raises salary, and that more years of
schooling also raise salary. Here both effects reinforce each other, but the two variables
are negatively correlated. Hence if we omitted job experience from the regression
equation and included only schooling, the omission would make it appear that schooling
does not have as much of a positive impact on salary, because people with less schooling
tend to have greater job experience (which also results in higher salary).
The following regression results illustrate this point:
First, the results of a regression with job experience omitted. As we can see here,
schooling does not have a significant positive effect.
            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   47334.97       3526.717         13.42182   1.84E-13   40098.75    54571.19
Schooling   311.0538       226.6091         1.372645   0.181158   -153.909    776.017
Now the results with job experience included. As we can see, Schooling is highly
significant: on average, an additional year of schooling increases salary by about $5,800.
             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept    -65798         29966.01         -2.19575   0.03723    -127394     -4201.91
Schooling    5793.49        1457.244         3.975647   0.000498   2798.079    8788.901
Experience   1836.442       484.1689         3.792978   0.0008     841.2179    2831.666
Thus, when we look for omitted variable bias, we consider two criteria:
1. Are the effects of the included variable and omitted variable in the same direction
or in opposite directions?
2. What is the nature of the correlation between the included and omitted variables?
We summarize when omitted variable bias leads to non-significant effects using the
following table.
                                      Correlation between Included and Omitted Variable
Effect of Included and Omitted        Positive                                Negative
Variable on the y variable
Same direction                        Bias away from zero                     Bias tends towards zero
(both positive, both negative)                                                (e.g., Salary, Education and Experience)
Opposite direction                    Bias tends towards zero                 Bias away from zero
(one positive, the other negative)    (e.g., Sales, Prices and Advertising)
From a practical point of view, when a variable such as price or schooling, about which
we have strong priors that it has a strong effect, turns out to be insignificant, it is useful
to think about what omitted variable might be masking the true effect. The two criteria
above should help in the search for potential omitted variables that might be masking the
effect and making the coefficient appear insignificant.