Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Statistics for Business and Economics Chapter 11 Multiple Regression and Model Building Learning Objectives 1. 2. 3. 4. 5. 6. 7. 8. Explain the Linear Multiple Regression Model Describe Inference About Individual Parameters Test Overall Significance Explain Estimation and Prediction Describe Various Types of Models Describe Model Building Explain Residual Analysis Describe Regression Pitfalls Types of Regression Models 1 Explanatory Variable Regression Models 2+ Explanatory Variables Multiple Simple Linear NonLinear Linear NonLinear Models With Two or More Quantitative Variables Types of Regression Models 1 Explanatory Variable Regression Models 2+ Explanatory Variables Multiple Simple Linear NonLinear Linear NonLinear Multiple Regression Model • General form: y 0 1 x1 2 x2 k xk • k independent variables • x1, x2, …, xk may be functions of variables — e.g. x2 = (x1)2 Regression Modeling Steps 1. Hypothesize deterministic component 2. Estimate unknown model parameters 3. Specify probability distribution of random error term • Estimate standard deviation of error 4. Evaluate model 5. Use model for prediction and estimation Probability Distribution of Random Error Regression Modeling Steps 1. Hypothesize deterministic component 2. Estimate unknown model parameters 3. Specify probability distribution of random error term • Estimate standard deviation of error 4. Evaluate model 5. Use model for prediction and estimation Assumptions for Probability Distribution of ε 1. 2. 3. 4. Mean is 0 Constant variance, σ2 Normally Distributed Errors are independent Linear Multiple Regression Model Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter2nd Order Action Order Model Model Model Dummy Variable Model Regression Modeling Steps 1. Hypothesize deterministic component 2. Estimate unknown model parameters 3. Specify probability distribution of random error term • Estimate standard deviation of error 4. Evaluate model 5. Use model for prediction and estimation First–Order Multiple Regression Model Relationship between 1 dependent and 2 or more independent variables is a linear function Population Y-intercept Population slopes y 0 1 x1 2 x2 Dependent (response) variable Random error k xk Independent (explanatory) variables First-Order Model With 2 Independent Variables • Relationship between 1 dependent and 2 independent variables is a linear function • Model E ( y) 0 1 x1 2 x2 • Assumes no interaction between x1 and x2 — Effect of x1 on E(y) is the same regardless of x2 values Population Multiple Regression Model Bivariate model: y Response Plane yi 0 1 x1i 2 x2i i (Observed y) 0 i x2 x1 (x1i , x2i) E ( y) 0 1 x1i 2 x2i Sample Multiple Regression Model Bivariate model: y Response Plane yi ˆ0 ˆ1 x1i ˆ2 x2i i (Observed y) ^0 ^ i x2 x1 (x1i , x2i) yˆi ˆ0 ˆ1 x1i ˆ2 x2i No Interaction E(y) = 1 + 2x1 + 3x2 E(y) E(y) = 1 + 2x1 + 3(3) = 10 + 2x1 12 E(y) = 1 + 2x1 + 3(2) = 7 + 2x1 8 E(y) = 1 + 2x1 + 3(1) = 4 + 2x1 4 E(y) = 1 + 2x1 + 3(0) = 1 + 2x1 0 x1 0 0.5 1 1.5 Effect (slope) of x1 on E(y) does not depend on x2 value Parameter Estimation Regression Modeling Steps 1. Hypothesize Deterministic Component 2. Estimate Unknown Model Parameters 3. Specify Probability Distribution of Random Error Term • Estimate Standard Deviation of Error 4. Evaluate Model 5. Use Model for Prediction & Estimation First-Order Model Worksheet Case, i yi x1i x2i 1 2 3 4 : 1 4 1 3 : 1 8 3 5 : 3 5 2 6 : Run regression with y, x1, x2 Multiple Linear Regression Equations Too complicated by hand! Ouch! Interpretation of Estimated Coefficients ^ 1. Slope (k) • Estimated y changes by ^k for each 1 unit increase in xk holding all other variables constant – Example: if ^ = 2, then sales (y) is expected to 1 increase by 2 for each 1 unit increase in advertising (x1) given the number of sales rep’s (x2) 2. Y-Intercept (^0) • Average value of y when xk = 0 1st Order Model Example You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) and newspaper circulation (000) on the number of ad responses (00). Estimate the unknown parameters. You’ve collected the following data: (y) (x1) (x2) Resp Size Circ 1 1 2 4 8 8 1 3 1 3 5 7 2 6 4 4 10 6 Parameter Estimation Computer Output ^0 Parameter Estimates Parameter Standard T for H0: Variable DF Estimate Error Param=0 Prob>|T| INTERCEP 1 0.0640 0.2599 0.246 0.8214 ADSIZE 1 0.2049 0.0588 3.656 0.0399 CIRC 1 0.2805 0.0686 4.089 0.0264 ^1 ^2 yˆ .0640 .2049x1 .2805x2 Interpretation of Coefficients Solution ^ 1. Slope (1) • Number of responses to ad is expected to increase by .2049 (20.49) for each 1 sq. in. increase in ad size holding circulation constant ^ 2. Slope (2) • Number of responses to ad is expected to increase by .2805 (28.05) for each 1 unit (1,000) increase in circulation holding ad size constant Estimation of σ2 Regression Modeling Steps 1. Hypothesize Deterministic Component 2. Estimate Unknown Model Parameters 3. Specify Probability Distribution of Random Error Term • Estimate Standard Deviation of Error 4. Evaluate Model 5. Use Model for Prediction & Estimation Estimation of σ2 For a model with k independent variables SSE s n (k 1) where SSE yi yˆi 2 SSE s s n (k 1) 2 2 Calculating s2 and s Example You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), x1, and newspaper circulation (000), x2, on the number of ad responses (00), y. Find SSE, s2, and s. Analysis of Variance Computer Output Analysis of Variance Source DF Regression 2 Residual Error 3 Total 5 SS 9.249736 .250264 9.5 MS 4.624868 .083421 SSE S2 .250264 s .083421 63 2 s .083421 .2888 F 55.44 P .0043 Evaluating the Model Regression Modeling Steps 1. Hypothesize Deterministic Component 2. Estimate Unknown Model Parameters 3. Specify Probability Distribution of Random Error Term • Estimate Standard Deviation of Error 4. Evaluate Model 5. Use Model for Prediction & Estimation Evaluating Multiple Regression Model Steps 1. Examine variation measures 2. Test parameter significance • • Individual coefficients Overall model 3. Do residual analysis Variation Measures Evaluating Multiple Regression Model Steps 1. Examine variation measures 2. Test parameter significance • • Individual coefficients Overall model 3. Do residual analysis Multiple Coefficient of Determination • Proportion of variation in y ‘explained’ by all x variables taken together Explained Variation SSE • R = = 1Total Variation SSyy 2 • Never decreases when new x variable is added to model — Only y values determine SSyy — Disadvantage when comparing models Adjusted Multiple Coefficient of Determination • • Takes into account n and number of parameters Similar interpretation to R2 R 2 a n-1 2 = 1- 1 R n-(k+1) Estimation of R2 and Ra2 Example You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), x1, and newspaper circulation (000), x2, on the number of ad responses (00), y. Find R2 and Ra2. Excel Computer Output Solution R2 Ra2 Testing Parameters Evaluating Multiple Regression Model Steps 1. Examine variation measures 2. Test parameter significance • • Individual coefficients Overall model 3. Do residual analysis Inference for an Individual β Parameter • Confidence Interval ˆi t 2 sˆ i • Hypothesis Test Ho: βi = 0 Ha: βi ≠ 0 (or < or > ) • Test Statistic ˆi t sˆ i df = n – (k + 1) Confidence Interval Example You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), x1, and newspaper circulation (000), x2, on the number of ad responses (00), y. Find a 95% confidence interval for β1. Excel Computer Output Solution ˆ1 sˆ 1 Confidence Interval Solution .204921 3.182(.058822) .0177 1 .3921 Hypothesis Test Example You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), x1, and newspaper circulation (000), x2, on the number of ad responses (00), y. Test the hypothesis that the mean ad response increases as circulation increases (ad size constant). Use α = .05. Hypothesis Test Solution • • • • • H0: 2 = 0 Ha: 2 > 0 .05 df 6 - 3 = 3 Critical Value(s): Test Statistic: Decision: Reject H0 .05 0 2.353 Conclusion: t Excel Computer Output Solution ˆ2 sˆ 2 Hypothesis Test Solution • • • • • H0: 2 = 0 Ha: 2 > 0 .05 df 6 - 3 = 3 Critical Value(s): Reject H0 .05 0 2.353 t Test Statistic: ˆ2 .280492 t 4.089 S ˆ .068602 2 Decision: Reject at = .05 Conclusion: There is evidence the mean ad response increases as circulation increases Excel Computer Output Solution ˆ2 t sˆ 2 P–Value Evaluating Multiple Regression Model Steps 1. Examine variation measures 2. Test parameter significance • • Individual coefficients Overall model 3. Do residual analysis Testing Overall Significance • Shows if there is a linear relationship between all x variables together and y • Hypotheses — H0: 1 = 2 = ... = k = 0 No linear relationship — Ha: At least one coefficient is not 0 At least one x variable affects y Testing Overall Significance • Test Statistic MS ( Model ) F MS ( Error ) • Degrees of Freedom 1 = k 2 = n – (k + 1) — k = Number of independent variables — n = Sample size Testing Overall Significance Example You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), x1, and newspaper circulation (000), x2, on the number of ad responses (00), y. Conduct the global F–test of model usefulness. Use α = .05. Testing Overall Significance Solution H0: β1 = β2 = 0 Ha: At least 1 not zero = .05 1 = 2 2 = 3 Critical Value(s): • • • • • Test Statistic: Decision: = .05 Conclusion: 0 9.55 F Testing Overall Significance Computer Output k Source DF Model 2 Error 3 C Total 5 Analysis of Variance Sum of Squares 9.2497 0.2503 9.5000 n – (k + 1) Mean Square 4.6249 0.0834 F Value 55.440 Prob>F 0.0043 MS(Model) MS(Error) Testing Overall Significance Solution H0: β1 = β2 = 0 Ha: At least 1 not zero = .05 1 = 2 2 = 3 Critical Value(s): • • • • • = .05 0 9.55 F Test Statistic: 4.6249 F 55.44 .0834 Decision: Reject at = .05 Conclusion: There is evidence at least 1 of the coefficients is not zero Testing Overall Significance Computer Output Solution Analysis of Variance Source DF Model 2 Error 3 C Total 5 Sum of Squares 9.2497 0.2503 9.5000 Mean Square 4.6249 0.0834 F Value 55.440 Prob>F 0.0043 MS(Model) MS(Error) P-Value Interaction Models Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter2nd Order Action Order Model Model Model Dummy Variable Model Interaction Model With 2 Independent Variables • Hypothesizes interaction between pairs of x variables — Response to one x variable varies at different levels of another x variable • Contains two-way cross product terms E ( y) 0 1 x1 2 x2 3 x1 x2 • Can be combined with other models — Example: dummy-variable model Effect of Interaction Given: E ( y) 0 1 x1 2 x2 3 x1 x2 • Without interaction term, effect of x1 on y is measured by 1 • With interaction term, effect of x1 on y is measured by 1 + 3x2 — Effect increases as x2 increases Interaction Model Relationships E(y) = 1 + 2x1 + 3x2 + 4x1x2 E(y) E(y) = 1 + 2x1 + 3(1) + 4x1(1) = 4 + 6x1 12 8 E(y) = 1 + 2x1 + 3(0) + 4x1(0) = 1 + 2x1 4 x1 0 0 0.5 1 1.5 Effect (slope) of x1 on E(y) depends on x2 value Interaction Model Worksheet Case, i yi x1i x2i 1 2 3 4 : 1 4 1 3 : 1 8 3 5 : 3 5 2 6 : x1i x2i 3 40 6 30 : Multiply x1 by x2 to get x1x2. Run regression with y, x1, x2 , x1x2 Interaction Example You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), x1, and newspaper circulation (000), x2, on the number of ad responses (00), y. Conduct a test for interaction. Use α = .05. Interaction Model Worksheet yi x1i x2i 1 4 1 3 2 4 1 8 3 5 6 10 2 8 1 7 4 6 x1i x2i 2 64 3 35 24 60 Multiply x1 by x2 to get x1x2. Run regression with y, x1, x2 , x1x2 Excel Computer Output Solution Global F–test indicates at least one parameter is not zero F P-Value Interaction Test Solution • • • • • H0: 3 = 0 Ha: 3 ≠ 0 .05 df 6 - 2 = 4 Critical Value(s): Reject H0 .025 -2.776 Test Statistic: Decision: Reject H0 .025 0 2.776 Conclusion: t Excel Computer Output Solution ˆ3 t sˆ 3 Interaction Test Solution • • • • • H0: 3 = 0 Ha: 3 ≠ 0 .05 df 6 - 2 = 4 Critical Value(s): Reject H0 .025 -2.776 Reject H0 .025 0 2.776 t Test Statistic: t = 1.8528 Decision: Do no reject at = .05 Conclusion: There is no evidence of interaction Second–Order Models Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter2nd Order Action Order Model Model Model Dummy Variable Model Second-Order Model With 1 Independent Variable • • • Relationship between 1 dependent and 1 independent variable is a quadratic function Useful 1st model if non-linear relationship suspected Curvilinear effect Model E ( y ) 0 1 x 2 x Linear effect 2 Second-Order Model Relationships y 2 > 0 y 2 > 0 x1 y 2 < 0 x1 y x1 2 < 0 x1 Second-Order Model Worksheet Case, i yi xi 2 xi 1 2 3 4 : 1 4 1 3 : 1 8 3 5 : 1 64 9 25 : Create x2 column. Run regression with y, x, x2. 2nd Order Model Example The data shows the number of weeks employed and the number of errors made per day for a sample of assembly line workers. Find a 2nd order model, conduct the global F–test, and test if β2 ≠ 0. Use α = .05 for all tests. Errors (y) Weeks (x) 20 1 18 1 16 2 10 4 8 4 4 5 3 6 1 8 2 10 1 11 0 12 1 12 Second-Order Model Worksheet yi xi 2 xi 20 1 1 18 1 1 16 2 4 10 4 16 : : : Create x2 column. Run regression with y, x, x2. Excel Computer Output Solution yˆ 23.728 4.784 x .242 x 2 Overall Model Test Solution Global F–test indicates at least one parameter is not zero F P-Value β2 Parameter Test Solution β2 test indicates curvilinear relationship exists t P-Value Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter2nd Order Action Order Model Model Model Dummy Variable Model Second-Order Model With 2 Independent Variables • • • Relationship between 1 dependent and 2 independent variables is a quadratic function Useful 1st model if non-linear relationship suspected Model E ( y) 0 1 x1i 2 x2i 3 x1i x2i x x 2 4 1i 2 5 2i Second-Order Model Relationships y 4 + 5 > 0 x2 x1 y 32 > 4 4 5 x2 x1 4 + 5 < 0 y x2 x1 E ( y ) 0 1 x1i 2 x2i 3 x1i x2i x 2 4 1i x 2 5 2i Second-Order Model Worksheet Case, i yi x1i x2i x1ix2i 2 x1i 1 2 3 4 : 1 4 1 3 : 1 8 3 5 : 3 5 2 6 : 3 40 6 30 : 1 64 9 25 : 2 x2i 9 25 4 36 : Multiply x1 by x2 to get x1x2; then create x12, x22. Run regression with y, x1, x2 , x1x2, x12, x22. Models With One Qualitative Independent Variable Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter2nd Order Action Order Model Model Model Dummy Variable Model Dummy-Variable Model • Involves categorical x variable with 2 levels — e.g., male-female; college-no college • • • Variable levels coded 0 and 1 Number of dummy variables is 1 less than number of levels of variable May be combined with quantitative variable (1st order or 2nd order model) Dummy-Variable Model Worksheet Case, i yi x1i x2i 1 2 3 4 : 1 4 1 3 : 1 8 3 5 : 1 0 1 1 : x2 levels: 0 = Group 1; 1 = Group 2. Run regression with y, x1, x2 Interpreting DummyVariable Model Equation Given: yˆi ˆ0 ˆ1 x1i ˆ2 x2i y = Starting salary of college graduates x1 = GPA 0 if Male x2 = 1 if Female Same slopes Male ( x2 = 0 ): yˆi ˆ0 ˆ1 x1i ˆ2 (0) ˆ0 ˆ1 x1i Female ( x2 = 1 ): yˆi ˆ0 ˆ1 x1i ˆ2 (1) ˆ0 ˆ2 ˆ1 x1i Dummy-Variable Model Example Computer Output: yˆi 3 5 x1i 7 x2i 0 if Male x2 = 1 if Female Same slopes Male ( x2 = 0 ): yˆi 3 51 x1i 7(0) 3 5 x1i Female ( x2 = 1 ): yˆi 3 5x1i 7(1) 3 7 5x1i Dummy-Variable Model Relationships y Same Slopes ^ 1 Female ^ + ^ 0 2 Male ^ 0 x1 0 0 Nested Models Comparing Nested Models • Contains a subset of terms in the complete (full) model • Tests the contribution of a set of x variables to the relationship with y • Null hypothesis H0: g+1 = ... = k = 0 — Variables in set do not improve significantly the model when all other variables are included • Used in selecting x variables or models • Part of most computer programs Selecting Variables in Model Building Selecting Variables in Model Building A butterfly flaps its wings in Japan, which causes it to rain in Nebraska. -- Anonymous Use Theory Only! Use Computer Search! Model Building with Computer Searches • Rule: Use as few x variables as possible • Stepwise Regression — Computer selects x variable most highly correlated with y — Continues to add or remove variables depending on SSE • Best subset approach — Computer examines all possible sets Residual Analysis Evaluating Multiple Regression Model Steps 1. Examine variation measures 2. Test parameter significance • • Individual coefficients Overall model 3. Do residual analysis Residual Analysis • Graphical analysis of residuals — Plot estimated errors versus xi values Difference between actual yi and predicted yi Estimated errors are called residuals — Plot histogram or stem-&-leaf of residuals • Purposes — Examine functional form (linear v. non-linear model) — Evaluate violations of assumptions Residual Plot for Functional Form Add x2 Term Correct Specification ^ e ^e x x Residual Plot for Equal Variance Unequal Variance Correct Specification ^ e ^ e x Fan-shaped. Standardized residuals used typically. x Residual Plot for Independence Not Independent Correct Specification ^ e ^ e x Plots reflect sequence data were collected. x Residual Analysis Computer Output Dep Var Predict Student Obs SALES Value Residual Residual -2-1-0 1 2 1 1.0000 0.6000 0.4000 1.044 | |** 2 1.0000 1.3000 -0.3000 -0.592 | *| 3 2.0000 2.0000 0 0.000 | | 4 2.0000 2.7000 -0.7000 -1.382 | **| 5 4.0000 3.4000 0.6000 1.567 | |*** Plot of standardized (student) residuals | | | | | Regression Pitfalls Regression Pitfalls • Parameter Estimability — Number of different x–values must be at least one more than order of model • Multicollinearity — Two or more x–variables in the model are correlated • Extrapolation — Predicting y–values outside sampled range • Correlated Errors Multicollinearity • • • • • High correlation between x variables Coefficients measure combined effect Leads to unstable coefficients depending on x variables in model Always exists – matter of degree Example: using both age and height as explanatory variables in same model Detecting Multicollinearity • • • Significant correlations between pairs of x variables are more than with y variable Non–significant t–tests for most of the individual parameters, but overall model test is significant Estimated parameters have wrong sign Solutions to Multicollinearity • • • Eliminate one or more of the correlated x variables Avoid inference on individual parameters Do not extrapolate Extrapolation y Interpolation Extrapolation Extrapolation Sampled Range x Conclusion 1. 2. 3. 4. 5. 6. 7. 8. Explained the Linear Multiple Regression Model Described Inference About Individual Parameters Tested Overall Significance Explained Estimation and Prediction Described Various Types of Models Described Model Building Explained Residual Analysis Described Regression Pitfalls