Statistics for Business and Economics
Module 2: Regression and time series analysis, Spring 2010
Lecture 5: Data Analysis with Multiple Regression and Inference for It
Priyantha Wijayatunga, Department of Statistics, Umeå University, [email protected]

These materials are altered versions of copyrighted lecture slides (© 2009 W.H. Freeman and Company) from the homepage of the book The Practice of Business Statistics: Using Data for Decisions, Second Edition, by Moore, McCabe, Duckworth and Alwan.

Data Analysis with Multiple Linear Regression and Inference for It
Reference to the book: Chapters 11.1 and 11.2

* Data for multiple regression
* Estimating multiple regression coefficients
* Regression residuals
* Regression standard error
* Multiple linear regression model
* Inference about the regression coefficients
* ANOVA table for multiple regression
* Squared multiple correlation R²
* Inference for a collection of regression coefficients

Data for multiple linear regression

Up to this point we have considered, in detail, the linear regression model in one explanatory variable x: ŷ = b0 + b1x. Usually more complex linear models are needed in practical situations. There are many problems in which knowledge of more than one explanatory variable is necessary in order to obtain a better understanding and a better prediction of a particular response. In general, we have data on n cases and p explanatory variables.

Multiple linear regression: two or more independent variables (X1, X2, ..., X8, say) and one dependent variable Y, where it is not known in advance which of the candidate variables are related to Y.

Estimating multiple regression coefficients

We can write the most general type of linear model in the independent variables x1, x2, ..., xp in the form

ŷ = b0 + b1x1 + b2x2 + ... + bpxp

This model is called a "first-order model" with p explanatory variables. Some explanatory variables can be functions of others, e.g. x2 = x1², x5 = x3x4, and so on. The method of least squares chooses the b's that make the sum of squares of the residuals as small as possible.

Example: Y = investment return, X1 = share price, X2 = interest, X3 = inflation.

Motivation
* More "realistic" models
* Possibly better predictions
* Separating the effects of different variables

The simple linear regression model allows for one independent variable x:

y = b0 + b1x + e

The multiple linear regression model allows for more than one independent variable, for example

y = b0 + b1x1 + b2x2 + e

Note how the fitted straight line in one explanatory variable becomes a fitted plane in two explanatory variables.

Required conditions for the error variable
* The random error e of the model is normally distributed.
* Its mean is equal to zero and its standard deviation is a constant (σ) for all values of x.
* The errors are independent.
* Note that these are the same requirements as in the case of simple linear regression.

Regression residuals and regression standard error

The residuals for multiple linear regression are ei = yi − ŷi. Always examine the distribution of the residuals. The variability of the responses about the multiple regression equation is measured by the regression standard error s, where s is the square root of

s² = Σ(yi − ŷi)² / (n − p − 1) = Σ ei² / (n − p − 1)

Multiple linear regression model

For p explanatory variables, we can express the population mean response μy as a linear equation:

μy = β0 + β1x1 + ... + βpxp

The statistical model for the n sample data points (i = 1, 2, ..., n) is then

Data = fit + residual:  yi = (β0 + β1x1i + ... + βpxpi) + εi

where the εi are independent and normally distributed N(0, σ).
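To make the least-squares fit, the residuals ei, and the regression standard error s concrete, here is a minimal Python sketch (the slides themselves use SPSS for this step). The tiny data set is made up purely for illustration.

```python
import numpy as np

# Illustrative only: a small made-up data set with n = 6 cases,
# p = 2 explanatory variables (x1, x2) and a response y.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares chooses b = (b0, b1, b2) minimizing the sum of squared residuals.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values, residuals e_i = y_i - yhat_i, and the regression standard error
# s = sqrt( sum(e_i^2) / (n - p - 1) ).
yhat = X @ b
e = y - yhat
n, p = len(y), X.shape[1] - 1
s = np.sqrt(np.sum(e**2) / (n - p - 1))

print("estimates b0, b1, b2:", b)
print("regression standard error s:", s)
```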
Therefore multiple linear regression assumes equal variance σ² of y. The parameters of the model are β0, β1, ..., βp.

We select a random sample of n individuals for which the p + 1 variables (x1, ..., xp, y) were measured. The least-squares regression method minimizes the sum of the squared deviations ei = yi − ŷi to express y as a linear function of the p explanatory variables:

ŷi = b0 + b1x1i + ... + bpxpi

■ As with simple linear regression, the constant b0 is the y intercept.
■ The regression coefficients (b1, ..., bp) reflect the unique association of each independent variable with the y variable. They are analogous to the slope in simple regression.

ŷ estimates μy, and b0, ..., bp are unbiased estimates of the population parameters β0, ..., βp.

We now estimate β0, β1, β2, ..., βp with the estimates b0, b1, b2, ..., bp using the least squares method. Use SPSS for the estimation to get the least squares regression equation

ŷ = b0 + b1x1 + b2x2 + ... + bpxp

In practice, before trying to interpret regression coefficients, we check how well the model fits.

Estimating Coefficients and Assessing the Model

The procedure used to perform regression analysis:
– Obtain the model coefficients and statistics using statistical software.
– Diagnose violations of the required conditions and try to remedy problems when identified.
– Assess the model fit using statistics obtained from the sample.
– If the model assessment indicates a good fit to the data, use the model to interpret the coefficients and generate predictions.

Example: Where to locate a new motor inn?

La Quinta Motor Inns is planning an expansion. Management wishes to predict which sites are likely to be profitable. Several areas in which predictors of profitability can be identified are: competition, market awareness, demand generators, demographics, and physical quality.

Response (profitability): Margin, the operating margin.
Explanatory variables:
* Competition: Rooms (Number), the number of hotel/motel rooms within 3 miles of the site
* Market awareness: Nearest, the distance to the nearest La Quinta inn
* Demand generators (customers): Office space and College enrollment
* Demographics (community): Income, the median household income
* Physical quality: Disttwn (Distance), the distance to downtown

Data were collected from 100 randomly selected inns that belong to La Quinta, and the following model was fitted:

MARGIN = β0 + β1 ROOMS + β2 NEAREST + β3 OFFICE_S + β4 ENROLLME + β5 INCOME + β6 DISTANCE + ε

Data (first six observations):

Margin  Number  Nearest  Office Space  Enrollment  Income  Distance
55,5    3203    4,2      549           8           37      2,7
33,8    2810    2,8      496           17,5        35      14,4
49      2890    2,4      254           20          35      2,6
31,9    3422    3,3      434           15,5        38      12,1
57,4    2687    0,9      678           15,5        42      6,9
49      3759    2,9      635           19          33      10,8

Regression Analysis, SPSS Output

MARGIN = 37.149 − 0.008 NUMBER + 1.591 NEAREST + 0.020 OFFICE_S + 0.196 ENROLLME + 0.421 INCOME − 0.004 DISTANCE

Coefficients (dependent variable: MARGIN):

Model        B        Std. Error  Beta    t       Sig.   95% CI Lower  95% CI Upper
(Constant)   37,149   7,044               5,274   ,000   23,162        51,136
NUMBER       -,008    ,001        -,447   -6,112  ,000   -,010         -,005
NEAREST      1,591    ,651        ,182    2,442   ,016   ,297          2,884
OFFICE_S     ,020     ,003        ,418    5,690   ,000   ,013          ,026
ENROLLME     ,196     ,134        ,107    1,461   ,147   -,070         ,463
INCOME       ,421     ,141        ,220    2,994   ,004   ,142          ,701
DISTANCE     -,004    ,156        -,002   -,028   ,978   -,314         ,305

Model Summary:

R      R Square  Adjusted R Square  Std. Error of the Estimate  R Square Change  F Change  df1  df2  Sig. F Change
,719   ,517      ,486               5,55895                     ,517             16,588    6    93   ,000

Predictors: (Constant), DISTANCE, INCOME, NUMBER, ENROLLME, OFFICE_S, NEAREST
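The SPSS output above can be reproduced with any least-squares software. As an illustration, the sketch below shows how the same model could be fitted in Python with statsmodels, assuming a hypothetical file laquinta.csv containing the full 100-inn data set with the column names used in the output; the file name and column layout are assumptions, not part of the original material.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file name and column names; the actual La Quinta data set is not
# reproduced here (only its first six rows appear on the slide).
data = pd.read_csv("laquinta.csv")  # columns assumed: MARGIN, NUMBER, NEAREST, OFFICE_S, ENROLLME, INCOME, DISTANCE

# Fit MARGIN = b0 + b1*NUMBER + ... + b6*DISTANCE by least squares.
model = smf.ols("MARGIN ~ NUMBER + NEAREST + OFFICE_S + ENROLLME + INCOME + DISTANCE",
                data=data).fit()

# The summary reports the same quantities as the SPSS output above: coefficients,
# standard errors, t statistics, p-values, and R-squared.
print(model.summary())

# Regression standard error s (square root of the residual mean square).
print("regression standard error s:", model.mse_resid ** 0.5)
```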
Regression standard error (standard error of estimate)

s should be small in order to have a good regression model; here s = 5,55895. The magnitude of s is judged by comparing it with the mean of the dependent variable (mean of MARGIN = 45.74). As in simple linear regression, the coefficient of determination R² is interpreted as the proportion of variation in the response explained by the model.

Confidence interval for βj

Estimating the regression parameters β0, ..., βj, ..., βp is a case of one-sample inference with unknown population variance. We rely on the t distribution, with n − p − 1 degrees of freedom. A level C confidence interval for βj is

bj ± t* SEbj

where SEbj is the standard error of bj (we rely on software to obtain it) and t* is the critical value of the t(n − p − 1) distribution with area C between −t* and +t*.

Significance test for βj

To test the hypothesis H0: βj = 0 versus a one- or two-sided alternative, we calculate the t statistic

t = bj / SEbj

which has the t(n − p − 1) distribution, and use it to find the p-value of the test. Note: software typically provides two-sided p-values.

F-test and many t-tests

Performing many t-tests increases the probability of a Type I error (there are p separate t-tests). The F-test combines all of them into one single test and does not increase the probability of a Type I error. The F-test also avoids the problem of multicollinearity among the explanatory variables.

ANOVA F-test for multiple regression

For a multiple linear relationship, the ANOVA tests the hypotheses

H0: β1 = β2 = ... = βp = 0  versus  Ha: H0 is not true

by computing the F statistic F = MSM / MSE. When H0 is true, F follows the F(p, n − p − 1) distribution. The p-value is P(F > f). A significant p-value does not mean that all p explanatory variables have a significant influence on y, only that at least one does.

ANOVA table for multiple regression

Source              Sum of squares SS    df         Mean square MS   F          P-value
Model (Regression)  SSM = Σ(ŷi − ȳ)²     p          MSM = SSM/DFM    MSM/MSE    Tail area above F
Error (Residual)    SSE = Σ(yi − ŷi)²    n − p − 1  MSE = SSE/DFE
Total               SST = Σ(yi − ȳ)²     n − 1

SST = SSM + SSE and DFT = DFM + DFE.

The regression standard error s for the n sample data points is calculated from the residuals ei = yi − ŷi:

s² = Σ ei² / (n − p − 1) = Σ(yi − ŷi)² / (n − p − 1) = SSE / DFE = MSE

s estimates the regression standard deviation σ.

Squared multiple correlation R²

Just as with simple linear regression, R², the squared multiple correlation, is the proportion of the variation in the response variable y that is explained by the model. In the particular case of multiple linear regression, the model is all p explanatory variables taken together:

R² = Σ(ŷi − ȳ)² / Σ(yi − ȳ)² = SSModel / SSTotal

Inference for a collection of regression coefficients

The null hypothesis that a collection of q explanatory variables all have coefficients equal to zero is tested by an F statistic with q and n − p − 1 degrees of freedom. This statistic can be computed from the squared multiple correlations for the model with all of the explanatory variables included (R1²) and the model with the q variables deleted (R2²):

F = ((n − p − 1)(R1² − R2²)) / (q(1 − R1²))
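As a worked check of the confidence-interval and t-test formulas, the sketch below recomputes the t statistic, two-sided p-value, and 95% confidence interval for the NEAREST coefficient using the values reported in the SPSS output above (b = 1.591, SE = 0.651, n = 100, p = 6); only the formulas are applied here, the numbers themselves come from the output.

```python
from scipy import stats

# Values taken from the SPSS output for the NEAREST coefficient
# in the La Quinta model (n = 100 inns, p = 6 explanatory variables).
b_j, se_bj = 1.591, 0.651
n, p = 100, 6
df = n - p - 1  # 93 degrees of freedom

# t statistic and two-sided p-value for H0: beta_j = 0.
t = b_j / se_bj
p_value = 2 * stats.t.sf(abs(t), df)

# 95% confidence interval b_j +/- t* SE_bj.
t_star = stats.t.ppf(0.975, df)
ci = (b_j - t_star * se_bj, b_j + t_star * se_bj)

print(t, p_value, ci)  # approx. 2.44, 0.016, (0.30, 2.88), matching the SPSS table
```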
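The F statistic for a collection of q coefficients can be computed directly from the two R² values. The sketch below implements the formula above; the reduced-model R² in the example call is made up for illustration, since the slides do not report one.

```python
from scipy import stats

def partial_f_test(r2_full, r2_reduced, n, p, q):
    """F statistic and p-value for H0: the q deleted coefficients are all zero.

    r2_full    : R^2 of the model with all p explanatory variables (R1^2)
    r2_reduced : R^2 of the model with the q variables removed (R2^2)
    """
    df1, df2 = q, n - p - 1
    f = (df2 * (r2_full - r2_reduced)) / (q * (1.0 - r2_full))
    p_value = stats.f.sf(f, df1, df2)
    return f, p_value

# Hypothetical illustration: testing whether ENROLLME and DISTANCE (q = 2) can
# both be dropped from the La Quinta model (p = 6, n = 100), assuming the
# reduced model had, say, R^2 = 0.51 (an assumed value, not from the slides).
print(partial_f_test(r2_full=0.5251, r2_reduced=0.51, n=100, p=6, q=2))
```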
Example: La Quinta (Excel output for the full model)

Multiple R          0,7246
R Square            0,5251
Adjusted R Square   0,4944
Standard Error      5,51
Observations        100

ANOVA

Source      df   SS       MS      F       Significance F
Regression  6    3123,8   520,6   17,14   0,0000
Residual    93   2825,6   30,4
Total       99   5949,5

              Coefficients   Standard Error   t Stat   P-value
Intercept     38,14          6,99             5,45     0,0000
Number        -0,0076        0,0013           -6,07    0,0000
Nearest       1,65           0,63             2,60     0,0108
Office Space  0,020          0,0034           5,80     0,0000
Enrollment    0,21           0,13             1,59     0,1159
Income        0,41           0,14             2,96     0,0039
Distance      -0,23          0,18             -1,26    0,2107

Example: predicting GPA

We have data on 224 first-year computer science majors at a large university in a given year. The data for each student include:
* Cumulative GPA after 2 semesters at the university (y, response variable)
* SAT math score (SATM, x1, explanatory variable)
* SAT verbal score (SATV, x2, explanatory variable)
* Average high school grade in math (HSM, x3, explanatory variable)
* Average high school grade in science (HSS, x4, explanatory variable)
* Average high school grade in English (HSE, x5, explanatory variable)

Summary statistics for these data are given by the software SAS (output not reproduced here).

The first step in multiple linear regression is to study all pair-wise relationships between the p + 1 variables. SAS provides all pair-wise correlation analyses (the value of r and the two-sided p-value for H0: ρ = 0). Scatterplots for all 15 pair-wise relationships are also necessary to understand the data.

For simplicity, let's first run a multiple linear regression using only the three high school grade averages. The overall P-value is very significant, but R² is fairly small (about 20%); HSM is significant, while HSS and HSE are not. The ANOVA for the multiple linear regression using only HSM, HSS, and HSE is very significant: at least one of the regression coefficients is significantly different from zero. But R² is fairly small (0.205): only about 20% of the variation in cumulative GPA can be explained by these high school scores. (Remember, a small p-value does not imply a large effect.)

The tests of hypotheses for each β within the multiple linear regression reach significance for HSM only. We found a significant correlation between HSS and GPA when they were analyzed by themselves, so why is bHSS not significant in the multiple regression equation? HSS and HSM are also significantly correlated with each other. When all three high school averages are used together in the multiple regression analysis, only HSM contributes significantly to our ability to predict GPA.

We now drop the least significant variable from the previous model: HSS. The P-value is still very significant, R² is still about 20%, and HSM is significant while HSE is not. The conclusions are about the same, but notice that the actual regression coefficients have changed:

predicted GPA = 0.590 + 0.169 HSM + 0.045 HSE + 0.034 HSS
predicted GPA = 0.624 + 0.183 HSM + 0.061 HSE

Next, let's run a multiple linear regression with the two SAT scores only. The P-value is very significant, but R² is very small (6%); SATM is significant and SATV is not. The ANOVA test for βSATM and βSATV is very significant: at least one is not zero. But R² is really small (0.06): only 6% of the variation in GPA is explained by these tests. When taken together, only SATM is a significant predictor of GPA (P = 0.0007).

We finally run a multiple regression model with all the variables together. The P-value is very significant but R² is still fairly small (21%), and only HSM is significant. The overall test is significant, but only the average high school math score (HSM) makes a significant contribution in this model to predicting the cumulative GPA. This conclusion applies to computer science majors at this large university.
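A possible way to reproduce the GPA model comparisons with statsmodels is sketched below. The file name gpa.csv and the column names are assumptions made for illustration; the actual data set is not included with these slides.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file; the real data set (224 computer science majors) is not
# distributed with these slides.
gpa = pd.read_csv("gpa.csv")  # columns assumed: GPA, SATM, SATV, HSM, HSS, HSE

# Model with the three high school grade averages only.
hs = smf.ols("GPA ~ HSM + HSS + HSE", data=gpa).fit()
print(hs.rsquared, hs.pvalues)    # on the original data: R^2 about 0.20, only HSM significant

# Drop the least significant variable (HSS) and refit.
hs2 = smf.ols("GPA ~ HSM + HSE", data=gpa).fit()
print(hs2.params)                 # the coefficients change slightly once HSS is removed

# SAT scores only, then all five explanatory variables together.
sat = smf.ols("GPA ~ SATM + SATV", data=gpa).fit()
full = smf.ols("GPA ~ SATM + SATV + HSM + HSS + HSE", data=gpa).fit()
print(sat.rsquared, full.rsquared)  # roughly 0.06 and 0.21 on the original data
```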