Multiple Regression Analysis
Dr. Mohammed Alahmed
http://fac.ksu.edu.sa/alahmed
[email protected]
(011) 4674108

Introduction
• In simple linear regression we studied the relationship between one explanatory variable and one response variable.
• Now we look at situations where several explanatory variables work together to explain the response.
• Following our principles of data analysis, we look first at each variable separately, then at relationships among the variables.
• We look at the distribution of each variable to be used in multiple regression to determine if there are any unusual patterns that may be important in building our regression analysis.

Multiple Regression
• Example: In a study of the direct operating cost, Y, for 67 branch offices of a consumer finance chain, four independent variables were considered:
  • X1: average size of loan outstanding during the year,
  • X2: average number of loans outstanding,
  • X3: total number of new loan applications processed, and
  • X4: office salary scale index.
• The model for this example is
  Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

Formal Statement of the Model
• General regression model:
  Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
• Where:
  • β0, β1, ..., βk are parameters,
  • X1, X2, ..., Xk are known constants, and
  • ε, the error terms, are independent N(0, σ²).

Estimating the Parameters of the Model
• The values of the regression parameters βi are not known; we estimate them from data.
• As in the simple linear regression case, we use the least-squares method to fit a linear function to the data:
  ŷ = b0 + b1x1 + b2x2 + ... + bkxk
• The least-squares method chooses the b's that make the sum of squares of the residuals as small as possible; that is, the least-squares estimates are the values that minimize
  Σ_{i=1..n} (yi − ŷi)²
• Since the formulas for the least-squares estimates are complicated and hand calculation is out of the question, we are content to understand the least-squares principle and let software do the computations.
• The estimate of βi is bi, and it indicates the change in the mean response per unit increase in Xi when the rest of the independent variables in the model are held constant.
• The parameters βi are frequently called partial regression coefficients because they reflect the partial effect of one independent variable when the rest of the independent variables are included in the model and held constant.
• The observed variability of the responses about this fitted model is measured by the variance
  s² = (1/(n − k − 1)) Σ_{i=1..n} (yi − ŷi)²
  and the regression standard error s = √s².
• In the model, σ² and σ measure the variability of the responses about the population regression equation. It is natural to estimate σ² by s² and σ by s.

Analysis of Variance Table
• The basic ideas of the regression ANOVA table are the same in simple and multiple regression.
• The sum of squares decomposition and the associated degrees of freedom are:
  Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
  SST = SSR + SSE
  df: (n − 1) = k + (n − k − 1)

  Source      Sum of Squares  df       Mean Square        F-test
  Regression  SSR             k        MSR = SSR/k        MSR/MSE
  Error       SSE             n−k−1    MSE = SSE/(n−k−1)
  Total       SST             n−1
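These computations are routine in statistical software. The following Python sketch is an addition to these notes (not part of the original slides); it uses synthetic data with assumed parameter values to show how the quantities just defined are read off a statsmodels fit.

```python
# Minimal sketch (synthetic data): least-squares fit of a multiple regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 67, 4
X = rng.normal(size=(n, k))                    # explanatory variables x1..x4
beta = np.array([10.0, 2.0, -1.5, 0.5, 3.0])   # assumed true beta0..beta4
y = beta[0] + X @ beta[1:] + rng.normal(scale=2.0, size=n)  # errors ~ N(0, 2^2)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)               # b0, b1, ..., bk (least-squares estimates)
print(np.sqrt(fit.mse_resid))   # s, the regression standard error
# Careful with statsmodels' naming: ess = SSR (regression SS), ssr = SSE (error SS).
print(fit.ess, fit.ssr, fit.centered_tss)   # SSR, SSE, SST
print(fit.fvalue, fit.f_pvalue)             # overall F-test (next section)
print(fit.conf_int(alpha=0.05))             # b_i ± t(.025; n-k-1) s(b_i)
```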
F-test for the Overall Fit of the Model
• To test the statistical significance of the regression relation between the response variable y and the set of variables x1, ..., xk, i.e. to choose between the alternatives
  H0: β1 = β2 = ... = βk = 0
  Ha: not all βi (i = 1, ..., k) equal zero,
  we use the test statistic
  F = MSR/MSE
• The decision rule at significance level α is: reject H0 if
  F > F(α; k, n−k−1)
  where the critical value F(α; k, n−k−1) can be found from an F table.
• The existence of a regression relation by itself does not assure that useful predictions can be made by using it.
• Note that when k = 1, this test reduces to the F-test for testing in simple linear regression whether or not β1 = 0.

Interval Estimation of βi
• For our regression model,
  (bi − βi) / s(bi) has a t distribution with n − k − 1 degrees of freedom,
  where s(bi) is the estimated standard error of bi reported by the software (the square root of the ith diagonal element of MSE·(XᵀX)⁻¹; in simple regression this reduces to √(MSE / Σ(x − x̄)²)).
• Therefore, an interval estimate for βi with confidence coefficient 1 − α is
  bi ± t(α/2; n−k−1) s(bi)

Significance Tests for βi
• To test
  H0: βi = 0  versus  Ha: βi ≠ 0,
  we may use the test statistic
  t = bi / s(bi)
• Reject H0 if t > t(α/2; n−k−1) or t < −t(α/2; n−k−1).

Multiple Regression Model Building
• Often we have many explanatory variables, and our goal is to use these to explain the variation in the response variable.
• A model using just a few of the variables often predicts about as well as the model using all the explanatory variables.
• We may find that the reciprocal of a variable is a better choice than the variable itself, or that including the square of an explanatory variable improves prediction.
• We may find that the effect of one explanatory variable depends upon the value of another explanatory variable. We account for this situation by including interaction terms. The simplest way to construct an interaction term is to multiply the two explanatory variables together.
• How can we find a good model?

Selecting the Best Regression Equation
• After a lengthy list of potentially useful independent variables has been compiled, some of the independent variables can be screened out. An independent variable
  • may not be fundamental to the problem,
  • may be subject to large measurement error, or
  • may effectively duplicate another independent variable in the list.
• Once the investigator has tentatively decided upon the functional forms of the regression relations (linear, quadratic, etc.), the next step is to obtain a subset of the explanatory variables (x) that "best" explain the variability in the response variable y.
• An automatic search procedure that sequentially develops the subset of explanatory variables to be included in the regression model is called a stepwise procedure.
• It was developed to economize on computational effort, and it ends with the identification of a single regression model as "best". One simple variant, backward elimination by marginal t-tests, is sketched below.
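To make the marginal-t-test screening concrete, here is a minimal backward-elimination sketch. It is my addition, not from the original slides; the function name and the α threshold are assumptions. Applied to the cable-TV data of the next example, it should retain ADRATE and APIPOP, mirroring the hand elimination shown there.

```python
# Backward elimination by marginal t-test p-values (hedged sketch).
import statsmodels.api as sm

def backward_eliminate(y, X, alpha=0.05):
    """y: response Series; X: DataFrame of candidate explanatory variables."""
    cols = list(X.columns)
    fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
    while len(cols) > 1:
        pvals = fit.pvalues.drop("const")   # marginal t-test p-values
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:           # every remaining term significant
            break
        cols.remove(worst)                  # drop the least significant term
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
    return fit                              # note: last variable kept regardless
```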
Example: Sales Forecasting
• Multiple regression is a popular technique for predicting product sales with the help of other variables that are likely to have a bearing on sales.
• The growth of cable television has created vast new potential in the home entertainment business. The following table gives the values of several variables measured in a random sample of 20 local television stations which offer their programming to cable subscribers. A TV industry analyst wants to build a statistical model for predicting the number of subscribers that a cable station can expect.
• Y = number of cable subscribers (SUBSCRIB)
• X1 = advertising rate which the station charges local advertisers for one minute of prime time space (ADRATE)
• X2 = number of families living in the station's area of dominant influence (ADI), a geographical division of radio and TV audiences (APIPOP)
• X3 = number of competing stations in the ADI (COMPETE)
• X4 = kilowatt power of the station's non-cable signal (SIGNAL)
• The sample data are fitted by a multiple regression model using Excel. The marginal t-test provides a way of choosing the variables for inclusion in the equation. The fitted model is
  SUBSCRIB = β0 + β1(ADRATE) + β2(APIPOP) + β3(COMPETE) + β4(SIGNAL) + ε
• Excel summary output:

  Regression Statistics
  Multiple R         0.884267744
  R Square           0.781929444
  Adjusted R Square  0.723777295
  Standard Error     142.9354188
  Observations       20

  ANOVA       df  SS           MS           F            Significance F
  Regression   4  1098857.84   274714.4601  13.44626923  7.52E-05
  Residual    15  306458.0092  20430.53395
  Total       19  1405315.85

  Variable   Coefficients  Standard Error  t Stat        P-value      Lower 95%  Upper 95%
  Intercept  51.42007002   98.97458277     0.51952803    0.610973806  -159.539   262.3795
  AD_Rate    -0.267196347  0.081055107     -3.296477624  0.004894126  -0.43996   -0.09443
  Signal     -0.020105139  0.045184758     -0.444954014  0.662706578  -0.11641   0.076204
  APIPOP     0.440333955   0.135200486     3.256896248   0.005307766  0.152161   0.728507
  Compete    16.230071     26.47854322     0.61295181    0.549089662  -40.2076   72.66778

• Do we need all four variables in the model?
• Based on the partial t-test, the variables SIGNAL and COMPETE are the least significant variables in our model, so we drop the least significant variable, one at a time. Dropping SIGNAL first gives:

  Regression Statistics
  Multiple R         0.882638739
  R Square           0.779051144
  Adjusted R Square  0.737623233
  Standard Error     139.3069743
  Observations       20

  ANOVA       df  SS           MS          F            Significance F
  Regression   3  1094812.92   364937.64   18.80498277  1.69966E-05
  Residual    16  310502.9296  19406.4331
  Total       19  1405315.85

  Variable   Coefficients  Standard Error  t Stat       P-value      Lower 95%     Upper 95%
  Intercept  51.31610447   96.4618242      0.531983558  0.602046756  -153.1737817  255.806
  AD_Rate    -0.259538026  0.077195983     -3.36206646  0.003965102  -0.423186162  -0.09589
  APIPOP     0.433505145   0.130916687     3.311305499  0.004412929  0.15597423    0.711036
  Compete    13.92154404   25.30614013     0.550125146  0.589831583  -39.72506442  67.56815

• The variable COMPETE is the next variable to get rid of. Dropping it as well gives:
  Regression Statistics
  Multiple R         0.8802681
  R Square           0.774871928
  Adjusted R Square  0.748386273
  Standard Error     136.4197776
  Observations       20

  ANOVA       df  SS           MS           F           Significance F
  Regression   2  1088939.802  544469.901   29.2562866  3.13078E-06
  Residual    17  316376.0474  18610.35573
  Total       19  1405315.85

  Variable   Coefficients  Standard Error  t Stat        P-value      Lower 95%     Upper 95%
  Intercept  96.28121395   50.16415506     1.919322948   0.07188916   -9.556049653  202.1184776
  AD_Rate    -0.254280696  0.075014548     -3.389751739  0.003484198  -0.41254778   -0.096013612
  APIPOP     0.495481252   0.065306012     7.587069489   7.45293E-07  0.357697418   0.633265086

Final Model
• All the variables in the model are statistically significant, therefore our final model is:
  SUBSCRIB-hat = 96.28 − 0.25(ADRATE) + 0.495(APIPOP)

Interpreting the Final Model
• What is the interpretation of the estimated parameters?
• Is the association positive or negative?
• Does this make sense intuitively, based on what the data represent?
• What other variables could be confounders?
• Are there other analyses that you might consider doing? New questions raised?

Multicollinearity
• In multiple regression analysis, one is often concerned with the nature and significance of the relations between the explanatory variables and the response variable. Questions that are frequently asked are:
  1. What is the relative importance of the effects of the different independent variables?
  2. What is the magnitude of the effect of a given independent variable on the dependent variable?
  3. Can any independent variable be dropped from the model because it has little or no effect on the dependent variable?
  4. Should any independent variables not yet included in the model be considered for possible inclusion?
• Simple answers can be given to these questions if
  • the independent variables in the model are uncorrelated among themselves, and
  • they are uncorrelated with any other independent variables that are related to the dependent variable but omitted from the model.
• When the independent variables are correlated among themselves, multicollinearity (or collinearity) among them is said to exist.
• In many non-experimental situations in business, economics, and the social and biological sciences, the independent variables tend to be correlated among themselves.
• For example, in a regression of family food expenditures on the variables family income, family savings, and the age of the head of household, the explanatory variables will be correlated among themselves.
• Further, the explanatory variables will also be correlated with other socioeconomic variables not included in the model that do affect family food expenditures, such as family size.
• Some key problems that typically arise when the explanatory variables being considered for the regression model are highly correlated among themselves are (see the simulation sketch below):
  1. Adding or deleting an explanatory variable changes the regression coefficients.
  2. The estimated standard deviations of the regression coefficients become large when the explanatory variables in the regression model are highly correlated with each other.
  3. The estimated regression coefficients individually may not be statistically significant even though a definite statistical relation exists between the response variable and the set of explanatory variables.
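As referenced above, here is a small simulation (my addition, not from the slides) illustrating point 2: when two predictors are highly correlated, the standard errors of their coefficients inflate even though the overall fit is essentially unchanged.

```python
# Collinearity inflates coefficient standard errors (hedged sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
for r in (0.0, 0.99):                        # correlation between x1 and x2
    x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)
    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(f"corr={r}: se(b1)={fit.bse[1]:.3f}, se(b2)={fit.bse[2]:.3f}")
```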
Multicollinearity Diagnostics
• A formal method of detecting the presence of multicollinearity that is widely used is the variance inflation factor (VIF).
• It measures how much the variances of the estimated regression coefficients are inflated compared to when the independent variables are not linearly related:
  VIF_j = 1 / (1 − R²_j),  j = 1, 2, ..., k
• R²_j is the coefficient of determination from the regression of the jth independent variable on the remaining k − 1 independent variables.
• A VIF near 1 suggests that multicollinearity is not a problem for that independent variable: its estimated coefficient and associated t-value will not change much as the other independent variables are added to or deleted from the regression equation.
• A VIF much greater than 1 indicates the presence of multicollinearity: the estimated coefficient attached to the variable is unstable, and its associated t statistic may change considerably as the other independent variables are added or deleted. A maximum VIF in excess of 10 is often taken as an indication that multicollinearity may be unduly influencing the least-squares estimates.
• The simple correlation coefficients between all pairs of explanatory variables (i.e., X1, X2, ..., Xk) are helpful in selecting appropriate explanatory variables for a regression model and are also critical for examining multicollinearity.
• While it is true that a correlation very close to +1 or −1 does suggest multicollinearity, it is not valid (unless there are only two explanatory variables) to infer that multicollinearity does not exist when there are no high correlations between any pair of explanatory variables.

Example: Sales Forecasting
• Pearson correlation coefficients, N = 20, with Prob > |r| under H0: ρ = 0 in parentheses:

            SUBSCRIB           ADRATE             SIGNAL             APIPOP             COMPETE
  SUBSCRIB  1.00000            -0.02848 (0.9051)  0.44762 (0.0478)   0.90447 (<.0001)   0.79832 (<.0001)
  ADRATE                       1.00000            -0.01021 (0.9659)  0.32512 (0.1619)   0.34147 (0.1406)
  SIGNAL                                          1.00000            0.45303 (0.0449)   0.46895 (0.0370)
  APIPOP                                                             1.00000            0.87592 (<.0001)
  COMPETE                                                                               1.00000

• Comparing the three fitted equations shows how the coefficients shift as variables are added or deleted:
  SUBSCRIB-hat = 51.42 − 0.27(ADRATE) − 0.02(SIGNAL) + 0.44(APIPOP) + 16.23(COMPETE)
  SUBSCRIB-hat = 51.32 − 0.26(ADRATE) + 0.43(APIPOP) + 13.92(COMPETE)
  SUBSCRIB-hat = 96.28 − 0.25(ADRATE) + 0.495(APIPOP)

• VIF calculation: fit the model
  APIPOP = β0 + β1(SIGNAL) + β2(ADRATE) + β3(COMPETE)

  Regression Statistics
  Multiple R         0.878054
  R Square           0.770978
  Adjusted R Square  0.728036
  Standard Error     264.3027
  Observations       20

  ANOVA       df  SS       MS        F        Significance F
  Regression   3  3762601  1254200   17.9541  2.25472E-05
  Residual    16  1117695  69855.92
  Total       19  4880295

  Variable   Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
  Intercept  -472.685      139.7492        -3.38238  0.003799  -768.9402258  -176.43
  Compete    159.8413      28.29157        5.649786  3.62E-05  99.86587622   219.8168
  ADRATE     0.048173      0.149395        0.322455  0.751283  -0.268529713  0.364876
  Signal     0.037937      0.083011        0.457012  0.653806  -0.138038952  0.213913
• Fit the model COMPETE = β0 + β1(ADRATE) + β2(APIPOP) + β3(SIGNAL):

  Regression Statistics
  Multiple R         0.882936
  R Square           0.779575
  Adjusted R Square  0.738246
  Standard Error     1.34954
  Observations       20

  ANOVA       df  SS        MS        F         Significance F
  Regression   3  103.0599  34.35329  18.86239  1.66815E-05
  Residual    16  29.14013  1.821258
  Total       19  132.2

  Variable   Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
  Intercept  3.10416       0.520589        5.96278   1.99E-05  2.000559786   4.20776
  ADRATE     0.000491      0.000755        0.649331  0.525337  -0.001110874  0.002092
  Signal     0.000334      0.000418        0.799258  0.435846  -0.000552489  0.001221
  APIPOP     0.004167      0.000738        5.649786  3.62E-05  0.002603667   0.005731

• Fit the model SIGNAL = β0 + β1(ADRATE) + β2(APIPOP) + β3(COMPETE):

  Regression Statistics
  Multiple R         0.512244
  R Square           0.262394
  Adjusted R Square  0.124092
  Standard Error     790.8387
  Observations       20

  ANOVA       df  SS        MS        F         Significance F
  Regression   3  3559789   1186596   1.897261  0.170774675
  Residual    16  10006813  625425.8
  Total       19  13566602

  Variable   Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
  Intercept  5.171093      547.6089        0.009443  0.992582  -1155.707711  1166.05
  APIPOP     0.339655      0.743207        0.457012  0.653806  -1.235874129  1.915184
  Compete    114.8227      143.6617        0.799258  0.435846  -189.7263711  419.3718
  ADRATE     -0.38091      0.438238        -0.86919  0.397593  -1.309935875  0.548109

• Fit the model ADRATE = β0 + β1(SIGNAL) + β2(APIPOP) + β3(COMPETE):

  Regression Statistics
  Multiple R         0.399084
  R Square           0.159268
  Adjusted R Square  0.001631
  Standard Error     440.8588
  Observations       20

  ANOVA       df  SS        MS        F         Significance F
  Regression   3  589101.7  196367.2  1.010346  0.413876018
  Residual    16  3109703   194356.5
  Total       19  3698805

  Variable   Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
  Intercept  253.7304      298.6063        0.849716  0.408018  -379.2865355  886.7474
  Signal     -0.11837      0.136186        -0.86919  0.397593  -0.407073832  0.170329
  APIPOP     0.134029      0.415653        0.322455  0.751283  -0.747116077  1.015175
  Compete    52.3446       80.61309        0.649331  0.525337  -118.5474784  223.2367

• VIF calculation results:

  Variable  R-Squared  VIF
  ADRATE    0.159268   1.19
  COMPETE   0.779575   4.54
  SIGNAL    0.262394   1.36
  APIPOP    0.770978   4.36

• There is no significant multicollinearity: all VIF values are well below the rule-of-thumb cutoff of 10.
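Rather than fitting the four auxiliary regressions by hand, the VIFs can be computed directly. A hedged sketch (my addition; `df` is an assumed DataFrame holding the four explanatory variables from this example):

```python
# VIF computation with statsmodels' helper (hedged sketch).
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["ADRATE", "COMPETE", "SIGNAL", "APIPOP"]])
for j, name in enumerate(X.columns):
    if name != "const":                 # the intercept's VIF is not of interest
        print(name, variance_inflation_factor(X.values, j))
```

Including the constant column matters: it makes each auxiliary regression match the R²_j definition above, so the printed values should agree with the hand-computed table.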
Qualitative Independent Variables
• Many variables of interest in business, economics, and the social and biological sciences are not quantitative but qualitative.
• Examples of qualitative variables are gender (male, female), purchase status (purchase, no purchase), and type of firm.
• Qualitative variables can also be used in multiple regression.
• An economist wished to relate the speed with which a particular insurance innovation is adopted (y) to the size of the insurance firm (x1) and the type of firm. The dependent variable is measured by the number of months elapsed between the time the first firm adopted the innovation and the time the given firm adopted the innovation. The first independent variable, size of the firm, is quantitative and measured by the amount of total assets of the firm. The second independent variable, type of firm, is qualitative and is composed of two classes: stock companies and mutual companies.

Indicator Variables
• Indicator, or dummy, variables are used to determine the relationship between qualitative independent variables and a dependent variable. Indicator variables take on the values 0 and 1.
• For the insurance innovation example, where the qualitative variable has two classes, we might define the indicator variable x2 as follows:
  x2 = 1 if stock company, 0 otherwise
• A qualitative variable with c classes will be represented by c − 1 indicator variables.
• A regression function with an indicator variable with two levels (c = 2) will yield two estimated lines.

Interpretation of Regression Coefficients
• In our insurance innovation example, the regression model is
  y = β0 + β1x1 + β2x2 + ε
  where x1 = size of firm and x2 = 1 if stock company, 0 otherwise.
• To understand the meaning of the regression coefficients in this model, consider first the case of a mutual firm. For such a firm, x2 = 0, and the response function is
  ŷ = b0 + b1x1 + b2(0) = b0 + b1x1   (mutual firms)
• For a stock firm, x2 = 1, and the response function is
  ŷ = b0 + b1x1 + b2(1) = (b0 + b2) + b1x1   (stock firms)
• The response function for mutual firms is a straight line with y-intercept β0 and slope β1. For stock firms, this is also a straight line, with the same slope β1 but with y-intercept β0 + β2.
• With reference to the insurance innovation example, the mean time elapsed before the innovation is adopted is a linear function of the size of the firm (x1), with the same slope β1 for both types of firms.
• β2 indicates how much lower or higher the response function for stock firms is than the one for mutual firms; it measures the differential effect of type of firm.
• In general, β2 shows how much higher (lower) the mean response line is for the class coded 1 than the line for the class coded 0, for any level of x1.

Example: Insurance Innovation Adoption
• Here is the data set for the insurance innovation example:

  Months Elapsed  Size  x2  Type
  17              151   0   Mutual
  26               92   0   Mutual
  21              175   0   Mutual
  30               31   0   Mutual
  22              104   0   Mutual
   0              277   0   Mutual
  12              210   0   Mutual
  19              120   0   Mutual
   4              290   0   Mutual
  16              238   1   Stock
  28              164   1   Stock
  15              272   1   Stock
  11              295   1   Stock
  38               68   1   Stock
  31               85   1   Stock
  21              224   1   Stock
  20              166   1   Stock
  13              305   1   Stock
  30              124   1   Stock
  14              246   1   Stock

• Fitting the regression model y = β0 + β1x1 + β2x2 + ε gives the fitted response function
  ŷ = 33.87 − 0.1061x1 + 8.77x2

  Regression Statistics
  Multiple R         0.95993655
  R Square           0.92147818
  Adjusted R Square  0.91224031
  Standard Error     2.78630562
  Observations       20

  ANOVA       df  SS           MS        F         Significance F
  Regression   2  1548.820517  774.4103  99.75016  4.04966E-10
  Residual    17  131.979483   7.763499
  Total       19  1680.8

  Variable      Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
  Intercept     33.8698658    1.562588138     21.67549  8E-14     30.57308841   37.16664321
  Size          -0.10608882   0.007799653     -13.6017  1.45E-10  -0.122544675  -0.089632969
  Type of firm  8.76797549    1.286421264     6.815789  3.01E-06  6.053860079   11.4820909

• The fitted response function for stock firms is
  ŷ = (33.87 + 8.77) − 0.1061x1 = 42.64 − 0.1061x1
• The fitted response function for mutual firms is
  ŷ = 33.87 − 0.1061x1
• Interpretation?
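The indicator variable enters the fit like any other column. A runnable sketch (my addition, using the data from the table above; the printed estimates should be close to the (33.87, −0.1061, 8.77) shown in the output):

```python
# Dummy-variable regression for the insurance innovation example (sketch).
import numpy as np
import statsmodels.api as sm

months = np.array([17, 26, 21, 30, 22, 0, 12, 19, 4,
                   16, 28, 15, 11, 38, 31, 21, 20, 13, 30, 14])
size = np.array([151, 92, 175, 31, 104, 277, 210, 120, 290,
                 238, 164, 272, 295, 68, 85, 224, 166, 305, 124, 246])
stock = np.array([0] * 9 + [1] * 11)   # x2 = 1 for stock firms, 0 for mutual

fit = sm.OLS(months, sm.add_constant(np.column_stack([size, stock]))).fit()
print(fit.params)   # b0, b1 (size), b2 (differential effect of stock firms)
```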
Accounting for Seasonality in a Multiple Regression Model
• Seasonal patterns are not easily accounted for by the typical causal variables that we use in regression analysis.
• An indicator variable can be used effectively to account for seasonality in our time series data.
• The number of seasonal indicator variables to use depends on the data: if we have p periods per year in our data series, we can use no more than p − 1 seasonal indicator variables.

Example: Private Housing Starts (PHS)
• Housing starts in the United States, measured in thousands of units, plotted for 1990Q1 through 1999Q4.
• There are typically few housing starts during the first quarter of the year (January, February, March); there is usually a big increase in the second quarter (April, May, June), followed by some decline in the third quarter (July, August, September), and a further decline in the fourth quarter (October, November, December).

[Figure: Private Housing Starts (PHS) in thousands of units, 1990Q1–1999Q4; "1" marks the first quarter of each year.]

• To account for and measure this seasonality in a regression model, we will use three dummy variables: Q2 for the second quarter, Q3 for the third quarter, and Q4 for the fourth quarter, coded as follows (a pandas sketch of this coding appears after the data table below):
  • Q2 = 1 for all second quarters and zero otherwise.
  • Q3 = 1 for all third quarters and zero otherwise.
  • Q4 = 1 for all fourth quarters and zero otherwise.
• Data for private housing starts (PHS), the mortgage rate (MR), and these seasonal indicator variables are shown below (Book: Table 5-5, page 216). Examine the data carefully to verify your understanding of the coding for Q2, Q3, Q4.
• Since we have assigned dummy variables for the second, third, and fourth quarters, the first quarter is the base quarter for our regression model. Note that any quarter could be used as the base, with indicator variables to adjust for differences in the other quarters.

  PERIOD     PHS    MR       Q2 Q3 Q4    PERIOD     PHS    MR      Q2 Q3 Q4
  31-Mar-90  217.0  10.1202  0  0  0     31-Mar-95  214.2  8.8123  0  0  0
  30-Jun-90  271.3  10.3372  1  0  0     30-Jun-95  296.7  7.9470  1  0  0
  30-Sep-90  233.0  10.1033  0  1  0     30-Sep-95  308.2  7.7012  0  1  0
  31-Dec-90  173.6   9.9547  0  0  1     31-Dec-95  257.2  7.3508  0  0  1
  31-Mar-91  146.7   9.5008  0  0  0     31-Mar-96  240.0  7.2430  0  0  0
  30-Jun-91  254.1   9.5265  1  0  0     30-Jun-96  344.5  8.1050  1  0  0
  30-Sep-91  239.8   9.2755  0  1  0     30-Sep-96  324.0  8.1590  0  1  0
  31-Dec-91  199.8   8.6882  0  0  1     31-Dec-96  252.4  7.7102  0  0  1
  31-Mar-92  218.5   8.7098  0  0  0     31-Mar-97  237.8  7.7905  0  0  0
  30-Jun-92  296.4   8.6782  1  0  0     30-Jun-97  324.5  7.9255  1  0  0
  30-Sep-92  276.4   8.0085  0  1  0     30-Sep-97  314.6  7.4692  0  1  0
  31-Dec-92  238.8   8.2052  0  0  1     31-Dec-97  256.8  7.1980  0  0  1
  31-Mar-93  213.2   7.7332  0  0  0     31-Mar-98  258.4  7.0547  0  0  0
  30-Jun-93  323.7   7.4515  1  0  0     30-Jun-98  360.4  7.0938  1  0  0
  30-Sep-93  309.3   7.0778  0  1  0     30-Sep-98  348.0  6.8657  0  1  0
  31-Dec-93  279.4   7.0537  0  0  1     31-Dec-98  304.6  6.7633  0  0  1
  31-Mar-94  252.6   7.2958  0  0  0     31-Mar-99  294.1  6.8805  0  0  0
  30-Jun-94  354.2   8.4370  1  0  0     30-Jun-99  377.1  7.2037  1  0  0
  30-Sep-94  325.7   8.5882  0  1  0     30-Sep-99  355.6  7.7990  0  1  0
  31-Dec-94  265.9   9.0977  0  0  1     31-Dec-99  308.1  7.8338  0  0  1
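As referenced above, here is a minimal pandas sketch (my addition; the column names are mine) of building the Q2–Q4 indicator columns directly from quarterly dates, rather than coding them by hand:

```python
# Seasonal indicator variables from a quarterly date index (hedged sketch).
import pandas as pd

idx = pd.period_range("1990Q1", "1999Q4", freq="Q")
df = pd.DataFrame(index=idx)
for q in (2, 3, 4):                            # Q1 is the base quarter
    df[f"Q{q}"] = (idx.quarter == q).astype(int)
print(df.head(8))                              # verify the 0/1 coding pattern
```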
Example: Private Housing Starts (PHS)
• The regression model for private housing starts (PHS) is
  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + ε
• In this model we expect b1 to have a negative sign (higher mortgage rates should depress housing starts), and we would expect b2, b3, and b4 all to have positive signs (each of the other quarters starts from the low first-quarter base). Regression results for this model are:

  Regression Statistics
  Multiple R         0.885398221
  R Square           0.78393001
  Adjusted R Square  0.759236296
  Standard Error     26.4498851
  Observations       40

  ANOVA       df  SS           MS           F            Significance F
  Regression   4  88837.93624  22209.48406  31.74613731  3.33637E-11
  Residual    35  24485.87476  699.5964217
  Total       39  113323.811

  Variable   Coefficients  Standard Error  t Stat        P-value      Lower 95%     Upper 95%
  Intercept  473.0650749   35.54169837     13.31014264   2.93931E-15  400.9115031   545.2186467
  MR         -30.04838192  4.257226391     -7.058206249  3.21421E-08  -38.69102153  -21.40574231
  Q2         95.74106935   11.84748487     8.081130334   1.6292E-09   71.689367     119.7927717
  Q3         73.92904763   11.82881519     6.249911462   3.62313E-07  49.91524679   97.94284847
  Q4         20.54778131   11.84139803     1.73524961    0.091495355  -3.491564078  44.5871267

• The prediction equation is
  PHS-hat = 473.06 − 30.05(MR) + 95.74(Q2) + 73.93(Q3) + 20.55(Q4)
  and it can be used to make a forecast for each quarter, e.g. the fourth quarter of 1999.

[Figure: Private housing starts (PHS) with a simple regression forecast (PHSF1) and a multiple regression forecast (PHSF2), in thousands of units.]

Regression Diagnostics and Residual Analysis
• It is important to check the adequacy of the model before it becomes part of the decision-making process.
• Residual plots can be used to check the model assumptions.
• It is important to study outlying observations to decide whether they should be retained or eliminated and, if retained, whether their influence should be reduced in the fitting process or the regression function revised.

Time Series Data and the Problem of Serial Correlation
• In regression models we assume that the errors εi are independent.
• In business and economics, many regression applications involve time series data, and for such data the assumption of uncorrelated or independent error terms is often not appropriate.

Problems of Serial Correlation
• If the error terms in the regression model are autocorrelated, the use of ordinary least-squares procedures has a number of important consequences:
  • MSE underestimates the variance of the error terms.
  • The confidence intervals and tests using the t and F distributions are no longer strictly applicable.
  • The standard errors of the regression coefficients underestimate the variability of the estimated regression coefficients.

First-Order Serial Correlation
• The error term in the current period is directly related to the error term in the previous time period. Let the subscript t represent time; then the simple linear regression model is
  y_t = β0 + β1 x_t + ε_t,  with  ε_t = ρ ε_{t−1} + ν_t
  where
  • ε_t = error at time t,
  • ρ = the parameter that measures the correlation between adjacent error terms, and
  • ν_t = normally distributed independent error terms with mean zero and variance σ².
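A small simulation sketch (my addition, not from the slides) of this first-order error process, with assumed values of ρ and σ; it shows that the generated errors carry a lag-1 autocorrelation near ρ:

```python
# Simulating first-order (AR(1)) serially correlated errors (hedged sketch).
import numpy as np

rng = np.random.default_rng(2)
rho, sigma, n = 0.8, 1.0, 200
nu = rng.normal(scale=sigma, size=n)        # nu_t ~ N(0, sigma^2), independent
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + nu[t]       # eps_t = rho * eps_{t-1} + nu_t
y = 2.0 + 0.5 * np.arange(n) + eps          # y_t = beta0 + beta1 * x_t + eps_t
print(np.corrcoef(eps[1:], eps[:-1])[0, 1]) # lag-1 autocorrelation, near rho
```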
Example: Effects of Positive Serial Correlation
• In a simple linear regression model, positive serial correlation can produce:
  • misleading forecasts of future y values;
  • a standard error of the estimate, s_{y.x}, that underestimates the variability of the y's about the true regression line;
  • spurious relationships: strong autocorrelation can make two unrelated variables appear to be related.

Durbin-Watson Test for Serial Correlation
• Recall the first-order serial correlation model:
  y_t = β0 + β1 x_t + ε_t,  ε_t = ρ ε_{t−1} + ν_t
• The hypotheses to be tested are
  H0: ρ = 0  versus  Ha: ρ > 0
• The alternative hypothesis is ρ > 0 since business and economic time series tend to show positive correlation.
• The Durbin-Watson statistic is defined as
  DW = Σ_{t=2..n} (e_t − e_{t−1})² / Σ_{t=1..n} e_t²
  where e_t = y_t − ŷ_t is the residual for time period t and e_{t−1} = y_{t−1} − ŷ_{t−1} is the residual for time period t − 1.
• The autocorrelation coefficient ρ can be estimated by the lag-1 residual autocorrelation
  r1(e) = Σ_{t=2..n} e_t e_{t−1} / Σ_{t=1..n} e_t²
  and it can be shown that DW ≈ 2(1 − r1(e)).
• Since −1 < r1(e) < 1, it follows that 0 < DW < 4:
  • If r1(e) = 0, then DW = 2 (no correlation).
  • If r1(e) > 0, then DW < 2 (positive correlation).
  • If r1(e) < 0, then DW > 2 (negative correlation).
• Decision rule:
  • If DW > d_U, do not reject H0.
  • If DW < d_L, reject H0.
  • If d_L ≤ DW ≤ d_U, the test is inconclusive.
• The critical upper (d_U) and lower (d_L) bounds can be found in the Durbin-Watson table of your textbook. To use this table you need to know the significance level (α), the number of independent parameters in the model (k), and the sample size (n).

Example
• The Blaisdell Company wished to predict its sales using industry sales as a predictor variable.
• The following table gives seasonally adjusted quarterly data on company sales and industry sales ($ millions) for the period 1983–1987:

  Year  Quarter  t   Company sales (y)  Industry sales (x)
  1983  1         1  20.96              127.3
        2         2  21.40              130.0
        3         3  21.96              132.7
        4         4  21.52              129.4
  1984  1         5  22.39              135.0
        2         6  22.76              137.1
        3         7  23.48              141.2
        4         8  23.66              142.8
  1985  1         9  24.10              145.5
        2        10  24.01              145.3
        3        11  24.54              148.3
        4        12  24.30              146.4
  1986  1        13  25.00              150.2
        2        14  25.64              153.1
        3        15  26.36              157.3
        4        16  26.98              160.7
  1987  1        17  27.52              164.2
        2        18  27.78              165.6
        3        19  28.24              168.7
        4        20  28.78              171.7

[Figure: Scatter plot of company sales against industry sales, Blaisdell Company example.]

• The scatter plot suggests that a linear regression model is appropriate, so the least-squares method was used to fit a regression line to the data, and the residuals were plotted against the fitted values.
• The plot shows that the residuals are consistently above or below the fitted value for extended periods.

[Figure: Residuals plotted against fitted values, Blaisdell Company example.]
• To confirm this graphic diagnosis we use the Durbin-Watson test of
  H0: ρ = 0  versus  Ha: ρ > 0
  with test statistic DW = Σ_{t=2..n} (e_t − e_{t−1})² / Σ_{t=1..n} e_t².
• The residuals e_t from the fitted line and the quantities needed for DW are:

  t    e_t        e_t − e_{t−1}   (e_t − e_{t−1})²   e_t²
  1    -0.02605                                      0.000679
  2    -0.06202   -0.03596        0.001293           0.003846
  3    0.022021   0.084036        0.007062           0.000485
  4    0.163754   0.141733        0.020088           0.026815
  5    0.04657    -0.11718        0.013732           0.002169
  6    0.046377   -0.00019        3.76E-08           0.002151
  7    0.043617   -0.00276        7.61E-06           0.001902
  8    -0.05844   -0.10205        0.010415           0.003415
  9    -0.0944    -0.03596        0.001293           0.008911
  10   -0.14914   -0.05474        0.002997           0.022243
  11   -0.14799   0.001152        1.33E-06           0.021901
  12   -0.05305   0.094937        0.009013           0.002815
  13   -0.02293   0.030125        0.000908           0.000526
  14   0.105852   0.12878         0.016584           0.011205
  15   0.085464   -0.02039        0.000416           0.007304
  16   0.106102   0.020638        0.000426           0.011258
  17   0.029112   -0.07699        0.005927           0.000848
  18   0.042316   0.013204        0.000174           0.001791
  19   -0.04416   -0.08648        0.007478           0.00195
  20   -0.03301   0.011152        0.000124           0.00109
  Sum                             0.097941           0.133302

• Hence
  DW = 0.097941 / 0.133302 ≈ 0.735
• Using the Durbin-Watson table of your textbook, for k = 1, n = 20, and α = .01 we find d_U = 1.15 and d_L = 0.95.
• Since DW = 0.735 falls below d_L = 0.95, we reject H0 and conclude that the error terms are positively autocorrelated.
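The hand computation above can be checked in software. A minimal sketch (my addition) using statsmodels on the Blaisdell data from the table:

```python
# Durbin-Watson statistic for the Blaisdell regression (hedged sketch).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

y = np.array([20.96, 21.40, 21.96, 21.52, 22.39, 22.76, 23.48, 23.66,
              24.10, 24.01, 24.54, 24.30, 25.00, 25.64, 26.36, 26.98,
              27.52, 27.78, 28.24, 28.78])          # company sales
x = np.array([127.3, 130.0, 132.7, 129.4, 135.0, 137.1, 141.2, 142.8,
              145.5, 145.3, 148.3, 146.4, 150.2, 153.1, 157.3, 160.7,
              164.2, 165.6, 168.7, 171.7])          # industry sales

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
print(durbin_watson(resid))   # expected to be about 0.735, below d_L = 0.95
```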
Remedial Measures for Serial Correlation
• One major cause of autocorrelated error terms is the omission from the model of one or more key variables that have time-ordered effects on the dependent variable. Remedies include:
  • addition of one or more independent variables to the regression model;
  • use of transformed variables;
  • specifying the regression model in terms of changes rather than levels.

Extensions of the Multiple Regression Model
• In some situations, nonlinear terms may be needed as independent variables in a regression analysis.
• Business or economic logic may suggest that nonlinearity is expected, and a graphic display of the data may be helpful in determining whether nonlinearity is present.
• One common economic cause of nonlinearity is diminishing returns. For example, the effect of advertising on sales may diminish as increased advertising is used.
• Some common forms of nonlinear functions are:
  Y = β0 + β1X + β2X²
  Y = β0 + β1X + β2X² + β3X³
  Y = β0 + β1(1/X)
  Y = e^(β0 + β1X)
• To illustrate the use and interpretation of a nonlinear term, we return to the problem of developing a forecasting model for private housing starts (PHS). So far we have looked at the model
  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + ε
  where MR is the mortgage rate and Q2, Q3, and Q4 are indicator variables for quarters 2, 3, and 4.

Example: Private Housing Starts
• First we add real disposable personal income per capita (DPI) as an independent variable. Our new model for this data set is
  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + β5(DPI) + ε
• Regression results for this model:

  Regression Statistics
  Multiple R         0.943791346
  R Square           0.890742104
  Adjusted R Square  0.874187878
  Standard Error     19.05542121
  Observations       39

  ANOVA       df  SS           MS        F         Significance F
  Regression   5  97690.01942  19538     53.80753  6.51194E-15
  Residual    33  11982.59955  363.1091
  Total       38  109672.619

  Variable   Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
  Intercept  -31.06403714  105.1938477     -0.2953   0.769613  -245.0826992  182.9546249
  MR         -20.1992545   4.124906847     -4.8969   2.5E-05   -28.59144723  -11.80706176
  Q2         97.03478074   8.900711541     10.90191  1.78E-12  78.9261326    115.1434289
  Q3         75.40017073   8.827185877     8.541813  7.17E-10  57.44111179   93.35922967
  Q4         20.35306822   8.83373887      2.304015  0.027657  2.380677107   38.32545934
  DPI        0.022407799   0.004356973     5.142974  1.21E-05  0.013543464   0.031272134

• The prediction model is
  PHS-hat = −31.06 − 20.19(MR) + 97.03(Q2) + 75.40(Q3) + 20.35(Q4) + 0.02(DPI)
• In comparison with the previous model, R-squared has improved from 78% to 89%, and the standard error of the estimate has decreased from 26.49 to 19.05.
• However, the DW statistic has changed from 0.88 for the previous model to 0.78 for the new model. At the 5% level the critical bounds for the DW test, from the Durbin-Watson table, for k = 5 and n = 39 are d_L = 1.22 and d_U = 1.79. Since the DW statistic is smaller than d_L = 1.22, we reject the null hypothesis H0: ρ = 0.
• This implies that there is serial correlation in both models; the assumption of independence of the error terms is not valid.
• The plot of PHS against DPI shows a curvilinear relation.

[Figure: Private housing starts plotted against disposable personal income (DPI); the relation is curvilinear.]

• Next we introduce a nonlinear term into the regression: the square of disposable personal income per capita (DPI²) is included in the model.
• We also add the dependent variable, lagged one quarter (LPHS), as an independent variable, in order to help reduce serial correlation.
• The third model that we fit to our data set is therefore
  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + β5(DPI) + β6(DPI²) + β7(LPHS) + ε
• Regression results for this model:

  Regression Statistics
  Multiple R         0.97778626
  R Square           0.956065971
  Adjusted R Square  0.946145384
  Standard Error     12.46719572
  Observations       39

  ANOVA       df  SS           MS           F         Significance F
  Regression   7  104854.2589  14979.17985  96.37191  3.07085E-19
  Residual    31  4818.360042  155.4309691
  Total       38  109672.619

  Variable     Coefficients  Standard Error  t Stat        P-value   Lower 95%     Upper 95%
  Intercept    716.5926532   1017.664989     0.704153784   0.486593  -1358.949934  2792.13524
  MR           -13.65521724  3.093504134     -4.414158396  0.000114  -19.96446404  -7.345970448
  Q2           106.9813297   6.069780998     17.62523718   1.04E-17  94.60192287   119.3607366
  Q3           27.72122303   9.111432565     3.042465916   0.004748  9.138323433   46.30412262
  Q4           -13.37855186  7.653050858     -1.748133144  0.09034   -28.98706069  2.22995698
  DPI          -0.060399279  0.104412354     -0.578468704  0.567127  -0.273349798  0.15255124
  DPI SQUARED  0.000335974   0.000536397     0.626354647   0.535668  -0.000758014  0.001429963
  LPHS         0.655786939   0.097265424     6.742241114   1.51E-07  0.457412689   0.854161189
• The inclusion of DPI² and lagged PHS has increased R-squared to 96%, and the standard error of the estimate has decreased to 12.47.
• The value of the DW statistic has increased to 2.32, which is greater than d_U = 1.79; this rules out positive serial correlation.
• The third model therefore works best for this data set. The data set is given below.

  PERIOD     PHS    MR       LPHS   Q2 Q3 Q4  DPI    DPI SQUARED
  30-Jun-90  271.3  10.3372  217.0  1  0  0   18063  1,631,359.85
  30-Sep-90  233.0  10.1033  271.3  0  1  0   18031  1,625,584.81
  31-Dec-90  173.6  9.9547   233.0  0  0  1   17856  1,594,183.68
  31-Mar-91  146.7  9.5008   173.6  0  0  0   17748  1,574,957.52
  30-Jun-91  254.1  9.5265   146.7  1  0  0   17861  1,595,076.61
  30-Sep-91  239.8  9.2755   254.1  0  1  0   17816  1,587,049.28
  31-Dec-91  199.8  8.6882   239.8  0  0  1   17811  1,586,158.61
  31-Mar-92  218.5  8.7098   199.8  0  0  0   18000  1,620,000.00
  30-Jun-92  296.4  8.6782   218.5  1  0  0   18085  1,635,336.13
  30-Sep-92  276.4  8.0085   296.4  0  1  0   18036  1,626,486.48
  31-Dec-92  238.8  8.2052   276.4  0  0  1   18330  1,679,944.50
  31-Mar-93  213.2  7.7332   238.8  0  0  0   17975  1,615,503.13
  30-Jun-93  323.7  7.4515   213.2  1  0  0   18247  1,664,765.05
  30-Sep-93  309.3  7.0778   323.7  0  1  0   18246  1,664,582.58
  31-Dec-93  279.4  7.0537   309.3  0  0  1   18413  1,695,192.85
  31-Mar-94  252.6  7.2958   279.4  0  0  0   18154  1,647,838.58
  30-Jun-94  354.2  8.4370   252.6  1  0  0   18409  1,694,456.41
  30-Sep-94  325.7  8.5882   354.2  0  1  0   18493  1,709,955.25
  31-Dec-94  265.9  9.0977   325.7  0  0  1   18667  1,742,284.45
  31-Mar-95  214.2  8.8123   265.9  0  0  0   18834  1,773,597.78
  30-Jun-95  296.7  7.9470   214.2  1  0  0   18798  1,766,824.02
  30-Sep-95  308.2  7.7012   296.7  0  1  0   18871  1,780,573.21
  31-Dec-95  257.2  7.3508   308.2  0  0  1   18942  1,793,996.82
  31-Mar-96  240.0  7.2430   257.2  0  0  0   19071  1,818,515.21
  30-Jun-96  344.5  8.1050   240.0  1  0  0   19081  1,820,422.81
  30-Sep-96  324.0  8.1590   344.5  0  1  0   19161  1,835,719.61
  31-Dec-96  252.4  7.7102   324.0  0  0  1   19152  1,833,995.52
  31-Mar-97  237.8  7.7905   252.4  0  0  0   19331  1,868,437.81
  30-Jun-97  324.5  7.9255   237.8  1  0  0   19315  1,865,346.13
  30-Sep-97  314.6  7.4692   324.5  0  1  0   19385  1,878,891.13
  31-Dec-97  256.8  7.1980   314.6  0  0  1   19478  1,896,962.42
  31-Mar-98  258.4  7.0547   256.8  0  0  0   19632  1,927,077.12
  30-Jun-98  360.4  7.0938   258.4  1  0  0   19719  1,944,194.81
  30-Sep-98  348.0  6.8657   360.4  0  1  0   19905  1,980,963.41
  31-Dec-98  304.6  6.7633   348.0  0  0  1   20194  2,038,980.00
  31-Mar-99  294.1  6.8805   304.6  0  0  0   20377  2,076,010.87
  30-Jun-99  377.1  7.2037   294.1  1  0  0   20472  2,095,440.74
  30-Sep-99  355.6  7.7990   377.1  0  1  0   20756  2,153,982.23
  31-Dec-99  308.1  7.8338   355.6  0  0  1   21124  2,231,020.37
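A hedged sketch (my addition; it assumes a DataFrame `df` holding the table above with columns PHS, MR, Q2, Q3, Q4, DPI) of fitting the third model, with the squared term and the lag built in code. Note the table's DPI SQUARED column appears to be stored on a rescaled basis rather than as literal DPI²; any fixed rescaling of DPI² only changes the scale of b6, so R² and the DW statistic are unaffected.

```python
# Third PHS model: seasonal dummies, DPI, DPI^2, and lagged PHS (sketch).
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

df["DPI2"] = df["DPI"] ** 2        # nonlinear term (scaling of DPI^2 is immaterial)
df["LPHS"] = df["PHS"].shift(1)    # dependent variable lagged one quarter
data = df.dropna()                 # the first quarter has no lagged value

X = sm.add_constant(data[["MR", "Q2", "Q3", "Q4", "DPI", "DPI2", "LPHS"]])
fit = sm.OLS(data["PHS"], X).fit()
print(fit.rsquared, durbin_watson(fit.resid))  # expected near 0.96 and 2.32
```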