YALE School of Management EMBA
MGT511 - HYPOTHESIS TESTING AND REGRESSION
K. Sudhir

Lecture 5: Multiple Regression

1. Introduction to Multiple Regression

When we introduced the concept of regression, we used a small dataset of 6 observations on Sales and Prices. We used the data to illustrate simple linear regression. Suppose that the dataset actually also includes Advertising, as shown below. (Note that in the example below we have changed the units of sales to hundreds of lbs, but everything else is the same in the data.)

Sales (hundred lbs)   Price ($)   Advertising ($ thousands)
115                   5           20
105                   5           15
105                   10          25
95                    10          20
95                    15          30
85                    15          25

Previously, we estimated a regression with just Price as the explanatory variable, and we found that Sales = 120 - 2 Price. We now explore the problem of ignoring Advertising in this regression.

An inspection of the data on Prices and Advertising suggests that as Price increases, managers are also increasing the level of Advertising. Specifically, Price and Advertising have a correlation of +0.85. Thus, the data can be interpreted as showing that the sales reduction effect of an increase in price is at least partly offset by an increase in sales due to the accompanying increase in advertising. This means that the coefficient of -2 for Price from the simple regression understates the true effect of Price on Sales.

To see this, we run a multiple regression using Excel with Prices and Advertising as explanatory variables. The estimated equation is

Sales = 95 - 4 Price + 2 Advertising

As expected, the Price coefficient is now much larger in magnitude than -2: it is -4. Thus, by omitting a relevant variable such as Advertising, which is correlated with Price, we obtained a biased estimate for Price.

2. Interpreting the Slope Coefficients in Multiple Regression: Partial Slopes

For a simple regression between Sales and Price, we interpreted the slope coefficient of Price as follows: a dollar increase in Price is expected to result in a 200 lb (2 hundreds of pounds) decrease in Sales. In a multiple regression, the coefficient interpretation is as follows: controlling for the effects of advertising (or holding Advertising constant), a dollar increase in Price is expected to result in a 400 lb decrease in Sales.

This interpretation of the coefficient of Price as one where the effects of the other predictor variables, such as advertising, are accounted for gives rise to the name "partial slope" for the coefficients of multiple regression. Thus the effect of Price is the partial effect of Price, as in a partial derivative (i.e., holding other variables in the equation constant).

Many students and managers who are unfamiliar with the inner workings of regression models are surprised to find that regression coefficients change when other variables are added to or deleted from the model. Mathematically, there is nothing surprising about partial slopes changing when variables are added or deleted if these variables are correlated with variables already in the model. Conversely, if the variables are not correlated, adding or dropping one variable will have no effect on the other coefficients (however, the statistical uncertainty about the estimates will typically change).
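These calculations can also be checked outside Excel. Below is a minimal sketch in Python using NumPy (the language and library choice are ours, not part of the course materials; the six observations are exactly the data shown above). It fits the simple regression of Sales on Price and the multiple regression of Sales on Price and Advertising, and shows the Price coefficient moving from -2 to -4.

    import numpy as np

    # Data from the lecture: Sales (hundred lbs), Price ($), Advertising ($ thousands)
    sales = np.array([115, 105, 105, 95, 95, 85], dtype=float)
    price = np.array([5, 5, 10, 10, 15, 15], dtype=float)
    adv = np.array([20, 15, 25, 20, 30, 25], dtype=float)

    # Simple regression: Sales on Price (the column of ones gives the intercept)
    X1 = np.column_stack([np.ones(len(price)), price])
    b1, *_ = np.linalg.lstsq(X1, sales, rcond=None)
    print("Simple:   Sales = %.0f %+.0f Price" % (b1[0], b1[1]))              # 120 - 2 Price

    # Multiple regression: Sales on Price and Advertising
    X2 = np.column_stack([np.ones(len(price)), price, adv])
    b2, *_ = np.linalg.lstsq(X2, sales, rcond=None)
    print("Multiple: Sales = %.0f %+.0f Price %+.0f Adv" % (b2[0], b2[1], b2[2]))  # 95 - 4 Price + 2 Adv

    # The two explanatory variables are positively correlated,
    # which is why omitting Advertising biases the Price slope.
    print("corr(Price, Advertising) = %.2f" % np.corrcoef(price, adv)[0, 1])  # about 0.85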
3. Is the Regression Statistically Significant?

If we perform a multiple regression, the first question we need to address is whether the independent variables as a group explain a sufficient amount of variation in the y variable. In the language of hypothesis testing, we use the following null and alternative hypotheses:

Null:        H0: β1 = β2 = ... = βk = 0 (the regression equation truly has no explanatory power)
Alternative: Ha: at least one βi ≠ 0, i = 1, ..., k (the regression equation truly has explanatory power)

We first explain the logic of the F-test and illustrate the computations with an example. In practice, there is no need to do all of these calculations, as Excel provides the results of the statistical tests, including the F-test. We therefore discuss the Excel output and explain how to do the F-test using Excel output.

One simple idea to test the null hypothesis that all slopes are truly zero might be to just look at the significance of each of the slope coefficients of the regression, which Excel reports. The complication, however, is that the equation may have significant explanatory power and yet all slope coefficients may be insignificant. Why? Suppose, in the example above, advertising and price are perfectly correlated. Then it will not be possible to separate out the effect of advertising from the effect of price. Mathematically, it will not be possible to obtain the marginal effect of one variable, holding the other one constant. Equivalently, in a multiple regression with one intercept and two slopes, we have three equations and three unknowns, but one equation is redundant (linearly dependent). In that case, Excel will not provide a result. This is equivalent to stating that the standard error of one or more slope coefficients is infinitely large.

If the correlation between the predictor variables is high but not +1 or -1, the standard error of one or more slopes will be high. In that case the partial slope of advertising may not be significant after taking into account the effect of price. Similarly, the partial slope of price may not be significant after taking into account the effect of advertising. Hence both the price and advertising partial slope coefficients may be insignificant, even though the equation as a whole is significant. When explanatory variables are highly correlated (multicollinear), all of the coefficients can become insignificant, even though each variable on its own may have a significant effect (and the equation has significant explanatory power). This is the problem of "multicollinearity".
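To make this concrete, here is a small simulation sketch in Python with statsmodels (the data and all parameter values are invented for illustration, not taken from the lecture): two predictors that are almost perfectly correlated and both truly affect y typically produce a clearly significant overall F-test even though neither individual slope is significant.

    import numpy as np
    import statsmodels.api as sm

    # Invented example: x1 and x2 are nearly perfectly correlated, and both truly affect y
    rng = np.random.default_rng(0)
    n = 60
    x1 = rng.normal(size=n)
    x2 = 0.99 * x1 + np.sqrt(1 - 0.99**2) * rng.normal(size=n)   # corr(x1, x2) around 0.99
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(scale=3.0, size=n)

    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print("F-test p-value:  ", fit.f_pvalue)    # typically well below 0.05: the equation has explanatory power
    print("slope p-values:  ", fit.pvalues[1:]) # typically both above 0.05: each slope has a large standard error
    print("slope std errors:", fit.bse[1:])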
The F-test

To test whether the explanatory variables (x) together explain a significant amount of variation in y, we can use the proportion of variation in the y variable that is explained by the x variables. We introduced unadjusted and adjusted R-squares for this purpose. We now use this idea in a formal hypothesis test called the F-test.

To understand the intuition behind the F-test, consider the ANOVA table (ANOVA stands for Analysis of Variance). We split the total variation in the y variable into explained variation (due to the regression) and unexplained variation (residual). The ANOVA table shows the degrees of freedom (df), the sums of squares (SS) and the mean squares (MS, obtained by dividing the SS by the corresponding degrees of freedom). The F-statistic is the ratio

F = MS(Regression) / MS(Residual)

i.e., the explained variance divided by the residual variance. Under the null hypothesis (truly no explanatory power), the expected F-value is about 1; i.e., the mean square of the variation explained by the regression is about equal to the mean square of the variation for the residuals. The greater the calculated F-value based on the regression result, the more evidence we have that the regression equation has true explanatory power.

ANOVA (for simple linear regression)

Source       df    SS              MS                    F
Regression   1     Σ(ŷi - ȳ)²      Σ(ŷi - ȳ)² / 1        MS(Regression) / MS(Residual)
Residual     n-2   Σ(yi - ŷi)²     Σ(yi - ŷi)² / (n-2)
Total        n-1   Σ(yi - ȳ)²      Σ(yi - ȳ)² / (n-1)

A Detailed Illustration of the F-Test

We illustrate the F-test using the familiar regression example of Sales and Prices. Recall that in this simple regression, we use 2 degrees of freedom to compute a slope and an intercept, leaving (6 - 2) = 4 degrees of freedom for the calculation of the residual variance (denominator df = 4). For the numerator, we use 1 degree of freedom (only the slope provides explanatory power). The fitted equation is ŷ = 120 - 2x and the mean of Sales is ȳ = 100.

Sales (y)   Price (x)   ŷ = 120 - 2x   y - ŷ   (y - ŷ)²   y - ȳ   (y - ȳ)²
115         5           110            5       25         15      225
105         5           110            -5      25         5       25
95          10          100            -5      25         -5      25
105         10          100            5       25         5       25
95          15          90             5       25         -5      25
85          15          90             -5      25         -15     225
                                               Σ = 150            Σ = 550

SS(Residual) = 150
SS(Total) = 550
Therefore SS(Regression) = SS(Total) - SS(Residual) = 400
MS(Residual) = SS(Residual) / df(Residual) = 150/4 = 37.5
MS(Regression) = SS(Regression) / df(Regression) = 400/1 = 400
Hence F = MS(Regression) / MS(Residual) = 400/37.5 = 10.67

The F-statistic follows the F-distribution. As in the case of the t-distribution, we have to specify the appropriate degrees of freedom for the F-distribution. In fact, for the F-distribution we have to specify both the numerator (df1, associated with Regression) and the denominator (df2, associated with Residual) degrees of freedom.

[Figure: the F-distribution with df1 and df2 degrees of freedom; the rejection region is the area to the right of the critical value F(df1, df2, 0.05).]

We reject the null at the 5% risk of a Type I error if the computed F is greater than the critical value F(df1, df2, 0.05). From page 804 of the text, F(1, 4, 0.05) = 7.71 (F-critical). Since F = 10.67 > F-critical, we can reject the null.

Doing the F-test in Excel

Compare the detailed computations we did earlier with the output below produced by Excel. The F-statistic is 10.67. Excel also reports the p-value for the F-statistic (the probability of a Type I error if the null hypothesis is rejected) under "Significance F". The p-value is 0.03, which indicates we can reject the null at the 5% risk of a Type I error. In practice, checking this p-value is all you need to do to figure out whether the regression is significant! In the ANOVA block below, df is the degrees of freedom, SS the sum of squares, MS the mean squares, F the F-statistic, and Significance F the p-value.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.852803
R Square             0.727273
Adjusted R Square    0.659091
Standard Error       6.123724
Observations         6

ANOVA
             df   SS    MS     F             Significance F
Regression   1    400   400    10.66666667   0.030906
Residual     4    150   37.5
Total        5    550

            Coefficients   Standard Error   t Stat      P-value        Lower 95%   Upper 95%
Intercept   120            6.614378         18.14229    5.42796E-05    101.6355    138.3645
Price       -2             0.612372         -3.265986   0.030905835    -3.700222   -0.299778

A couple of interesting properties of the simple regression (regression with one independent variable):
1. The p-values of the t-test and the F-test are identical. Why?
2. The square of the t-statistic is equal to the F-statistic, i.e., (-3.266)² = 10.67.
These two properties do not hold for multiple regression.
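The ANOVA arithmetic and the table lookup can also be reproduced in Python with NumPy and SciPy (again, the library choice is ours and not part of the course; the data and the fitted equation ŷ = 120 - 2 Price are from the lecture):

    import numpy as np
    from scipy import stats

    sales = np.array([115, 105, 95, 105, 95, 85], dtype=float)
    price = np.array([5, 5, 10, 10, 15, 15], dtype=float)
    y_hat = 120 - 2 * price                          # fitted values from Sales = 120 - 2 Price

    ss_total = np.sum((sales - sales.mean()) ** 2)   # 550
    ss_resid = np.sum((sales - y_hat) ** 2)          # 150
    ss_regr = ss_total - ss_resid                    # 400

    ms_regr = ss_regr / 1                            # df1 = 1 (one slope)
    ms_resid = ss_resid / (len(sales) - 2)           # df2 = n - 2 = 4
    F = ms_regr / ms_resid                           # 10.67

    print("F          =", F)
    print("F-critical =", stats.f.ppf(0.95, 1, 4))   # about 7.71
    print("p-value    =", stats.f.sf(F, 1, 4))       # about 0.031, matching Excel's Significance F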
4. Correlated Independent Variables: Tradeoff between Bias and Precision

In multiple regression, we use multiple independent variables (x1, x2, ..., xn) to explain or predict a dependent variable (y). Typically, the variables x1, x2, ..., xn tend to be correlated. Earlier, we discussed an example in which price and advertising were correlated. This may occur for any number of reasons. Correlated independent variables present the following two problems.

A. Omitted Variable Bias Problem: Suppose the true model is y = β0 + β1x1 + β2x2, but we estimate y = β0 + β1x1. Then β̂1 (the estimated value of β1 in the simple regression) will be biased. We saw this in the example with Price and Advertising. These two variables had a correlation of 0.85 in that example. If we included both Price and Advertising in the regression, the estimated equation was Sales = 95 - 4 Price + 2 Advertising. But if we omitted Advertising, the estimated equation was Sales = 120 - 2 Price. In this case, the estimated Price coefficient is biased (-2 instead of -4) if we omit Advertising. We discuss a systematic way to think about omitted variable bias in point 6 at the end of these notes.

B. Multicollinearity Problem: Suppose we include two very highly correlated variables x1, x2 in the regression. Then estimating the equation y = β0 + β1x1 + β2x2 can lead to estimates of both β1 and β2 being statistically insignificant. This can happen even though the equation as a whole is statistically significant. Thus, even if the F-test result indicates that at least one slope is truly different from zero, the standard errors of both slope coefficients may be so high as to render each slope coefficient insignificant. This is called the "multicollinearity" or "precision" problem. In the most extreme case, when the independent variables are perfectly correlated, it is impossible to obtain a unique solution, meaning that the standard errors of the slope coefficients are infinitely large.

Thus we are faced with a tradeoff between bias and precision for the estimates when we deal with correlated variables. If two independent variables are correlated, we want to include both in order to avoid bias in the slope coefficients. Yet when we include both, it is possible that one or both slope coefficient estimates are insignificant.

The Solution: There is no general solution to this problem. In practice, one should always estimate the most complete model. In other words, if we have reason to believe that a candidate independent variable is relevant, we should include that variable in the equation. We can then use the t-test result for each slope coefficient to decide whether a variable should remain in the equation. Even if independent variables are highly correlated, it may still be possible to estimate all the slope coefficients with sufficiently high precision (low standard errors). If the correlation is extremely high, and the slope for at least one independent variable is very imprecise (and hence the t-ratio is insignificant), we could drop one of the correlated variables in order to solve the "precision" or multicollinearity problem. The argument is that it is reasonable to drop one of two highly correlated independent variables because, in that case, it is impossible to keep the other variable under "control" (constant) when one variable is being changed. Thus it would not be possible to isolate the true effect of each variable with sufficient precision.
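The sketch below illustrates this tradeoff in Python with statsmodels (the data are simulated with invented parameters, not the lecture data): with two highly correlated and relevant predictors, the full model gives an unbiased but imprecise slope for x1, while dropping x2 gives a precise but biased slope for x1.

    import numpy as np
    import statsmodels.api as sm

    # Invented example: x1 and x2 highly correlated, both with a true slope of 1
    rng = np.random.default_rng(1)
    n = 60
    x1 = rng.normal(size=n)
    x2 = 0.99 * x1 + np.sqrt(1 - 0.99**2) * rng.normal(size=n)
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(scale=3.0, size=n)

    full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()  # unbiased, but imprecise
    reduced = sm.OLS(y, sm.add_constant(x1)).fit()                      # precise, but x1 absorbs x2's effect

    print("full model:    x1 slope %.2f, std error %.2f" % (full.params[1], full.bse[1]))
    print("reduced model: x1 slope %.2f, std error %.2f" % (reduced.params[1], reduced.bse[1]))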
5. Why include uncorrelated variables in multiple regression?

We have argued that if two independent variables (that both affect the dependent variable) are correlated, not including one will bias the estimate of the other. Now suppose x1 and x2 are uncorrelated, and both affect y. In that case, dropping one of the two variables will not bias the other slope coefficient. In other words, if two independent variables are uncorrelated, it is unnecessary to "hold one constant" for the estimation of the other variable's impact on y.

Suppose now that we care about the slope coefficient for only one independent variable; would it still be better to include both variables in the equation? Yes: if both independent variables are relevant, including both will improve the precision of both slope coefficients, so it is best to include both variables. The idea is that the standard error of a slope coefficient is reduced when both (uncorrelated) independent variables are included, because the second variable explains variation that would otherwise remain in the residual. We will illustrate this point in the beginning of the next lecture on Dummy Variables with an example.
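In the meantime, here is a rough simulation sketch of the idea in Python with statsmodels (all numbers are invented): when x1 and x2 are uncorrelated and both matter, adding x2 leaves the x1 slope essentially unchanged but shrinks its standard error.

    import numpy as np
    import statsmodels.api as sm

    # Invented example: x1 and x2 are independent, and both truly affect y
    rng = np.random.default_rng(2)
    n = 100
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 2.0 * x1 + 3.0 * x2 + rng.normal(scale=2.0, size=n)

    both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    x1_only = sm.OLS(y, sm.add_constant(x1)).fit()

    # The x1 slope is about the same either way (no omitted variable bias),
    # but its standard error is smaller when x2 is also in the equation.
    print("x1 slope with x2 included: %.2f (std error %.2f)" % (both.params[1], both.bse[1]))
    print("x1 slope with x2 omitted:  %.2f (std error %.2f)" % (x1_only.params[1], x1_only.bse[1]))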
6. A Systematic Approach to Think about Omitted Variable Bias

We illustrate the idea of omitted variable bias using a couple of examples.

Example 1: Omitting advertising when estimating the effect of price on sales

Recall the omitted variable bias problem we discussed in part 4. If we included both Price and Advertising in the regression, the estimated equation was Sales = 95 - 4 Price + 2 Advertising. But if we omitted Advertising, the estimated equation was Sales = 120 - 2 Price. In this case, the estimated Price coefficient is biased if we omit Advertising. As we can see, the bias causes the coefficient to be closer to zero: when advertising is included the price coefficient is -4, but on omitting advertising it becomes closer to zero at -2.

Why does this bias happen? First, recall that price and advertising were positively correlated in the data; that is, when prices increased, advertising increased as well. However, the effects of price and advertising on the dependent variable (sales) are in opposite directions: an increase in price reduces sales, while an increase in advertising increases sales. Thus price and advertising are positively correlated, but their effects work against each other. Hence, when we omit advertising from the regression model, the effect of price now includes the effect of advertising, which tends to reduce the measured effect of price. By omitting advertising, we measure a smaller effect of price, and we may wrongly conclude that the price effect is not significantly different from zero, because we did not account for the canceling effect of advertising.

Example 2: Omitting job experience when estimating the effect of schooling on salary

If a person spends more years in school, they will tend to have fewer years of job experience relative to a person who spent fewer years in school. Thus years of job experience will be negatively correlated with years of schooling. Our intuition suggests that greater job experience raises salary, and a greater number of years of schooling also raises salary. Here both effects reinforce each other, but the two variables are negatively correlated. Hence, if we omitted job experience from the regression equation and included only schooling, the omitted variable of job experience would make it appear that schooling was not having as much of a positive impact on salary, because people with less schooling tend to have greater job experience (which also raises salary). The following regression results illustrate this point.

First, the results of a regression with job experience omitted. As we can see, schooling does not have a significant positive effect.

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   47334.97       3526.717         13.42182   1.84E-13   40098.75    54571.19
Schooling   311.0538       226.6091         1.372645   0.181158   -153.909    776.017

Now the results with job experience included. As we can see, Schooling is highly significant; on average, an additional year of schooling increases salary by about $5,800.

             Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept    -65798         29966.01         -2.19575    0.03723    -127394     -4201.91
Schooling    5793.49        1457.244         3.975647    0.000498   2798.079    8788.901
Experience   1836.442       484.1689         3.792978    0.0008     841.2179    2831.666

Thus, when we look for omitted variable bias, we look at two criteria:
1. Are the effects of the included variable and the omitted variable on the y variable in the same direction or in opposite directions?
2. What is the nature of the correlation between the included and omitted variables?

The following table summarizes when omitted variable bias leads to non-significant effects (bias toward zero).

                                                      Correlation between included and omitted variable
Effect of included and omitted variable on y          Positive                                 Negative
Same direction (both positive or both negative)       Bias away from zero                      Bias tends toward zero
                                                                                               (e.g., Salary, Education and Experience)
Opposite directions (one positive, one negative)      Bias tends toward zero                   Bias away from zero
                                                      (e.g., Sales, Prices and Advertising)

From a practical point of view, when a variable such as price or schooling, about which we have strong priors of a strong effect, turns out to be insignificant, it is useful to think about what omitted variable might be masking the true effect. The above two criteria should help in the search for potential omitted variables that might be masking the effect and making the coefficient appear insignificant.
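To close, the schooling/experience pattern in the table can be reproduced with a small simulation in Python with statsmodels (all coefficients, sample sizes and noise levels below are invented; they are not the data behind the regression output shown above): schooling and experience both raise salary but are negatively correlated, so omitting experience drags the schooling slope toward zero.

    import numpy as np
    import statsmodels.api as sm

    # Invented numbers: both variables raise salary, but they are negatively correlated
    rng = np.random.default_rng(3)
    n = 200
    schooling = rng.normal(14.0, 2.0, size=n)                      # years of schooling
    experience = 30.0 - schooling + rng.normal(scale=2.0, size=n)  # fewer school years -> more experience
    salary = 20000 + 5000 * schooling + 2000 * experience + rng.normal(scale=10000, size=n)

    both = sm.OLS(salary, sm.add_constant(np.column_stack([schooling, experience]))).fit()
    omitted = sm.OLS(salary, sm.add_constant(schooling)).fit()

    # With experience omitted, the schooling slope shrinks toward zero
    # (roughly 5000 - 2000 = 3000 with these invented parameters),
    # understating schooling's true effect.
    print("schooling slope, experience included: %.0f" % both.params[1])
    print("schooling slope, experience omitted:  %.0f" % omitted.params[1])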