Applied Regression
By Nitiphong Songsrirote, Ph.D. (www.kaset51.com)

Simple Linear Regression

Learning Objectives
1. Describe the Linear Regression Model
2. State the Regression Modeling Steps
3. Explain Ordinary Least Squares
   Understand and Check Model Assumptions
4. Compute Regression Coefficients
5. Predict Response Variable
6. Interpret Computer Output

Models
1. Representation of Some Phenomenon
2. Mathematical Model Is a Mathematical Expression of Some Phenomenon
3. Often Describe Relationships Between Variables
4. Types
   Deterministic Models
   Probabilistic Models

Deterministic Models
1. Hypothesize Exact Relationships
2. Suitable When Prediction Error Is Negligible
3. Example: Force Is Exactly Mass Times Acceleration
   \( F = m \cdot a \)

Probabilistic Models
1. Hypothesize 2 Components
   Deterministic
   Random Error
2. Example: Sales Volume Is 10 Times Advertising Spending + Random Error
   \( Y = 10X + \varepsilon \)
   Random Error May Be Due to Factors Other Than Advertising

Types of Probabilistic Models
[Diagram: Probabilistic Models branch into Regression Models, Correlation Models, and Other Models]

Regression Models
1. Answer 'What Is the Relationship Between the Variables?'
2. Equation Used
   1 Numerical Dependent (Response) Variable: What Is to Be Predicted
   1 or More Numerical or Categorical Independent (Explanatory) Variables
3. Used Mainly for Prediction & Estimation

Regression Modeling Steps
1. Hypothesize Deterministic Component
2. Estimate Unknown Model Parameters
3. Specify Probability Distribution of Random Error Term
   Estimate Standard Deviation of Error
4. Evaluate Model
5. Use Model for Prediction & Estimation

Model Specification

Specifying the Model
1. Define Variables
2. Hypothesize Nature of Relationship
   Expected Effects (i.e., Coefficients' Signs)
   Functional Form (Linear or Non-Linear)
   Interactions

Model Specification Is Based on Theory
1. Theory of Field (e.g., Management)
2. Mathematical Theory
3. Previous Research
4. 'Common Sense'

Thinking Challenge: Which Is More Logical?
[Figure: four candidate plots of Sales against Advertising]

Types of Regression Models
[Diagram: Regression Models split into Simple (1 Explanatory Variable) and Multiple (2+ Explanatory Variables); each splits into Linear and Non-Linear]

Linear Regression Model

Linear Equations
\( Y = mX + b \)
m = Slope = Change in Y / Change in X
b = Y-intercept

Linear Regression Model
1. Relationship Between Variables Is a Linear Function
   \[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
   \( Y_i \) = Dependent (Response) Variable (e.g., income)
   \( X_i \) = Independent (Explanatory) Variable (e.g., education)
   \( \beta_0 \) = Population Y-Intercept
   \( \beta_1 \) = Population Slope
   \( \varepsilon_i \) = Random Error

Population & Sample Regression Models
Population (Unknown Relationship): \( Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \)
Random Sample: \( Y_i = \hat\beta_0 + \hat\beta_1 X_i + \hat\varepsilon_i \)

Population Linear Regression Model
\( Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \) (observed value)
\( \varepsilon_i \) = Random error
\( E(Y) = \beta_0 + \beta_1 X_i \)
[Figure: observed values scattered about the population regression line, Y against X]

Sample Linear Regression Model
\( Y_i = \hat\beta_0 + \hat\beta_1 X_i + \hat\varepsilon_i \) (observed value)
\( \hat\varepsilon_i \) = Random error
\( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_i \)
[Figure: observed values and one unsampled observation about the sample regression line, Y against X]

Estimating Parameters: Least Squares Method

Scatter Diagram
1. Plot of All (Xi, Yi) Pairs
2. Suggests How Well Model Will Fit
[Figure: scatter plot of Y against X, both axes from 0 to 60]

Thinking Challenge
How would you draw a line through the points? How do you determine which line 'fits best'?
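One way to make 'fits best' concrete is to score each candidate line by its residuals. The sketch below uses made-up points (they are not data from the slides) and illustrative function names; it compares two candidate lines by the plain sum of residuals and by the sum of squared residuals.

```python
# Hypothetical points, chosen only for illustration (not from the slides).
xs = [0, 20, 40, 60]
ys = [5, 25, 35, 55]


def residuals(intercept, slope):
    """Actual Y minus predicted Y for the line Y = intercept + slope * X."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_of_residuals(intercept, slope):
    """Plain sum of the differences; positive and negative values can cancel."""
    return sum(residuals(intercept, slope))


def sum_of_squared_residuals(intercept, slope):
    """Sum of squared differences; every miss counts, whatever its sign."""
    return sum(e ** 2 for e in residuals(intercept, slope))


# Candidate lines as (intercept, slope):
flat = (30.0, 0.0)    # horizontal line through the mean of Y
sloped = (5.0, 0.8)   # line that follows the upward trend

for name, (b0, b1) in [("flat", flat), ("sloped", sloped)]:
    print(name, sum_of_residuals(b0, b1), sum_of_squared_residuals(b0, b1))
```

For these points the flat line has residuals that sum to zero even though it ignores the trend, because the positive and negative differences cancel. The squared residuals separate the two candidates clearly, which is the motivation for the least squares criterion.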
Least Squares
1. 'Best Fit' Means Difference Between Actual Y Values & Predicted Y Values Are a Minimum
   But Positive Differences Off-Set Negative, So the Differences Are Squared
   \[ \sum_{i=1}^{n} \left( Y_i - \hat Y_i \right)^2 = \sum_{i=1}^{n} \hat\varepsilon_i^2 \]
2. LS Minimizes the Sum of the Squared Differences (SSE)

Least Squares Graphically
LS minimizes \( \sum_{i=1}^{n} \hat\varepsilon_i^2 = \hat\varepsilon_1^2 + \hat\varepsilon_2^2 + \hat\varepsilon_3^2 + \hat\varepsilon_4^2 \)
[Figure: four observed points with residuals \( \hat\varepsilon_1, \dots, \hat\varepsilon_4 \) measured vertically from the fitted line \( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_i \); for example \( Y_2 = \hat\beta_0 + \hat\beta_1 X_2 + \hat\varepsilon_2 \)]

Coefficient Equations
Prediction Equation
\[ \hat y_i = \hat\beta_0 + \hat\beta_1 x_i \]
Sample Slope
\[ \hat\beta_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2} \]
Sample Y-intercept
\[ \hat\beta_0 = \bar y - \hat\beta_1 \bar x \]

Computation Table
Xi       Yi       Xi^2      Yi^2      XiYi
X1       Y1       X1^2      Y1^2      X1Y1
X2       Y2       X2^2      Y2^2      X2Y2
:        :        :         :         :
Xn       Yn       Xn^2      Yn^2      XnYn
ΣXi      ΣYi      ΣXi^2     ΣYi^2     ΣXiYi

Interpretation of Coefficients
1. Slope (\( \hat\beta_1 \))
   Estimated Y Changes by \( \hat\beta_1 \) for Each 1 Unit Increase in X
   If \( \hat\beta_1 = 2 \), then Sales (Y) Is Expected to Increase by 2 for Each 1 Unit Increase in Advertising (X)
2. Y-Intercept (\( \hat\beta_0 \))
   Average Value of Y When X = 0
   If \( \hat\beta_0 = 4 \), then Average Sales (Y) Is Expected to Be 4 When Advertising (X) Is 0

Parameter Estimation Example
You're a marketing analyst for Hasbro Toys. You gather the following data:
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
What is the relationship between sales & advertising?

Scatter Diagram: Sales vs. Advertising
[Figure: scatter plot of Sales (0 to 4) against Advertising (0 to 5)]

Parameter Estimation Solution Table
         Xi     Yi     Xi^2    Yi^2    XiYi
         1      1      1       1       1
         2      1      4       1       2
         3      2      9       4       6
         4      2      16      4       8
         5      4      25      16      20
Total    15     10     55      26      37

Parameter Estimation Solution
\[ \hat\beta_1 = \frac{\sum X_i Y_i - \dfrac{\left(\sum X_i\right)\left(\sum Y_i\right)}{n}}{\sum X_i^2 - \dfrac{\left(\sum X_i\right)^2}{n}} = \frac{37 - \dfrac{(15)(10)}{5}}{55 - \dfrac{(15)^2}{5}} = 0.70 \]
\[ \hat\beta_0 = \bar Y - \hat\beta_1 \bar X = 2 - (0.70)(3) = -0.10 \]

Coefficient Interpretation Solution
1. Slope (\( \hat\beta_1 \))
   Sales Volume (Y) Is Expected to Increase by .7 Units for Each $1 Increase in Advertising (X)
2. Y-Intercept (\( \hat\beta_0 \))
   Average Value of Sales Volume (Y) Is -.10 Units When Advertising (X) Is 0
   Difficult to Explain to Marketing Manager
   Expect Some Sales Without Advertising

Parameter Estimation Computer Output
Parameter Estimates
Variable    DF    Parameter Estimate    Standard Error    T for H0: Param=0    Prob>|T|
INTERCEP    1     -0.1000               0.6350            -0.157               0.8849
ADVERT      1      0.7000               0.1914             3.656               0.0354
The Parameter Estimate column holds \( \hat\beta_k \): INTERCEP is \( \hat\beta_0 \) and ADVERT is \( \hat\beta_1 \).

Derivation of Parameter Equations
Goal: Minimize squared error
\[ 0 = \frac{\partial \sum \hat\varepsilon_i^2}{\partial \hat\beta_0} = \frac{\partial \sum \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right)^2}{\partial \hat\beta_0} = -2 \sum \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right) = -2 \left( n\bar y - n\hat\beta_0 - n\hat\beta_1 \bar x \right) \]
\[ \hat\beta_0 = \bar y - \hat\beta_1 \bar x \]

Derivation of Parameter Equations
\[ 0 = \frac{\partial \sum \hat\varepsilon_i^2}{\partial \hat\beta_1} = \frac{\partial \sum \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right)^2}{\partial \hat\beta_1} = -2 \sum x_i \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right) = -2 \sum x_i \left( y_i - \bar y + \hat\beta_1 \bar x - \hat\beta_1 x_i \right) \]
\[ \hat\beta_1 \sum x_i (x_i - \bar x) = \sum x_i (y_i - \bar y) \]
\[ \hat\beta_1 \sum (x_i - \bar x)(x_i - \bar x) = \sum (x_i - \bar x)(y_i - \bar y) \]
\[ \hat\beta_1 = \frac{SS_{xy}}{SS_{xx}} \]

Parameter Estimation Thinking Challenge
You're an economist for the county cooperative. You gather the following data:
Fertilizer (lb.)    Yield (lb.)
4                   3.0
6                   5.5
10                  6.5
12                  9.0
What is the relationship between fertilizer & crop yield?

Scatter Diagram: Crop Yield vs. Fertilizer*
[Figure: scatter plot of Yield (lb.), 0 to 10, against Fertilizer (lb.), 0 to 15]
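The coefficient equations \( \hat\beta_1 = SS_{xy} / SS_{xx} \) and \( \hat\beta_0 = \bar y - \hat\beta_1 \bar x \) can be checked with a few lines of code. This is a minimal sketch in plain Python; the function name `least_squares` is illustrative and not part of the slides. It is applied here to the Hasbro data, and the same call can be used on the fertilizer data to check a hand solution of the thinking challenge.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the fitted line y = b0 + b1 * x.

    b1 = SSxy / SSxx and b0 = y_bar - b1 * x_bar, as in the coefficient equations.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hasbro example from the slides: advertising ($) and sales (units).
ad = [1, 2, 3, 4, 5]
sales = [1, 1, 2, 2, 4]

b0, b1 = least_squares(ad, sales)
print(round(b0, 4), round(b1, 4))
```

Up to floating-point rounding, the result should agree with the slide solution (slope 0.70, intercept -0.10) and with the Parameter Estimates output.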
Parameter Estimation Solution Table*
         Xi     Yi      Xi^2    Yi^2      XiYi
         4      3.0     16      9.00      12
         6      5.5     36      30.25     33
         10     6.5     100     42.25     65
         12     9.0     144     81.00     108
Total    32     24.0    296     162.50    218

Parameter Estimation Solution*
\[ \hat\beta_1 = \frac{\sum X_i Y_i - \dfrac{\left(\sum X_i\right)\left(\sum Y_i\right)}{n}}{\sum X_i^2 - \dfrac{\left(\sum X_i\right)^2}{n}} = \frac{218 - \dfrac{(32)(24)}{4}}{296 - \dfrac{(32)^2}{4}} = 0.65 \]
\[ \hat\beta_0 = \bar Y - \hat\beta_1 \bar X = 6 - (0.65)(8) = 0.80 \]

Coefficient Interpretation Solution*
1. Slope (\( \hat\beta_1 \))
   Crop Yield (Y) Is Expected to Increase by .65 lb. for Each 1 lb. Increase in Fertilizer (X)
2. Y-Intercept (\( \hat\beta_0 \))
   Average Crop Yield (Y) Is Expected to Be 0.8 lb. When No Fertilizer (X) Is Used

Linear Regression Assumptions
1. Mean of Probability Distribution of Error Is 0
2. Probability Distribution of Error Has Constant Variance
3. Probability Distribution of Error Is Normal
4. Errors Are Independent

Error Probability Distribution
[Figure: error density \( f(\varepsilon) \) drawn about the regression line at X1 and X2]

Random Error Variation
1. Variation of Actual Y from Predicted Y
2. Measured by Standard Error of Regression Model
   Sample Standard Deviation of \( \hat\varepsilon \), \( s_{\hat\varepsilon} \)
3. Random Error Variation Affects Several Factors
   Parameter Significance
   Prediction Accuracy

Evaluating the Model: Testing for Significance

Test of Slope Coefficient
1. Shows If There Is a Linear Relationship Between X & Y
2. Involves Population Slope \( \beta_1 \)
3. Hypotheses
   \( H_0: \beta_1 = 0 \) (No Linear Relationship)
   \( H_a: \beta_1 \ne 0 \) (Linear Relationship)
4. Theoretical Basis Is Sampling Distribution of Slope

Sampling Distribution of Sample Slopes
[Figure: population line with Sample 1 line and Sample 2 line, Y against X; sampling distribution of \( \hat\beta_1 \) centered at \( \beta_1 \) with standard error \( S_{\hat\beta_1} \)]
All Possible Sample Slopes
Sample 1: 2.5
Sample 2: 1.6
Sample 3: 1.8
Sample 4: 2.1
:
Very large number of sample slopes

Slope Coefficient Test Statistic
\[ t_{n-2} = \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} \quad \text{where} \quad S_{\hat\beta_1} = \frac{S}{\sqrt{\sum X_i^2 - \dfrac{\left(\sum X_i\right)^2}{n}}} \]

Test of Slope Coefficient Example
You're a marketing analyst for Hasbro Toys. You find b0 = -.1, b1 = .7 & s = .60553.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Is the relationship significant at the .05 level?

Solution Table
         Xi     Yi     Xi^2    Yi^2    XiYi
         1      1      1       1       1
         2      1      4       1       2
         3      2      9       4       6
         4      2      16      4       8
         5      4      25      16      20
Total    15     10     55      26      37

Test of Slope Parameter Solution
\( H_0: \beta_1 = 0 \)
\( H_a: \beta_1 \ne 0 \)
\( \alpha = .05 \)
df = 5 - 2 = 3
Critical Value(s): t = ±3.1824 (rejection region of .025 in each tail)
Test Statistic: \( t = \dfrac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} = \dfrac{0.70 - 0}{0.1915} = 3.656 \)
Decision: Reject at \( \alpha = .05 \)
Conclusion: There is evidence of a relationship

Test Statistic Solution
\[ S_{\hat\beta_1} = \frac{S}{\sqrt{\sum X_i^2 - \dfrac{\left(\sum X_i\right)^2}{n}}} = \frac{0.60553}{\sqrt{55 - \dfrac{(15)^2}{5}}} = 0.1915 \]
\[ t_{n-2} = \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} = \frac{0.70 - 0}{0.1915} = 3.656 \]

Test of Slope Parameter Computer Output
Parameter Estimates
Variable    DF    Parameter Estimate    Standard Error    T for H0: Param=0    Prob>|T|
INTERCEP    1     -0.1000               0.6350            -0.157               0.8849
ADVERT      1      0.7000               0.1914             3.656               0.0354
Parameter Estimate is \( \hat\beta_k \), Standard Error is \( S_{\hat\beta_k} \), T is \( t = \hat\beta_k / S_{\hat\beta_k} \), and Prob>|T| is the P-Value.

Measures of Variation in Regression
1. Total Sum of Squares (SSyy)
   Measures Variation of Observed Yi Around the Mean \( \bar Y \)
2. Explained Variation (SSR)
   Variation Due to Relationship Between X & Y
3. Unexplained Variation (SSE)
   Variation Due to Other Factors

Variation Measures
Total sum of squares: \( (Y_i - \bar Y)^2 \)
Unexplained sum of squares: \( (Y_i - \hat Y_i)^2 \)
Explained sum of squares: \( (\hat Y_i - \bar Y)^2 \)
[Figure: observed \( Y_i \) at \( X_i \), the fitted line \( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_i \), and the mean \( \bar Y \), with the three deviations marked]

Coefficient of Determination
1. Proportion of Variation 'Explained' by Relationship Between X & Y
\[ 0 \le r^2 \le 1 \]
\[ r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{\sum_{i=1}^{n} (Y_i - \bar Y)^2 - \sum_{i=1}^{n} (Y_i - \hat Y_i)^2}{\sum_{i=1}^{n} (Y_i - \bar Y)^2} \]

Coefficient of Determination Examples
[Figure: four scatter plots of Y against X: r² = 1 with positive slope; r² = 1 with negative slope (r = -1); r² = .8; r² = 0]

Coefficient of Determination Example
You're a marketing analyst for Hasbro Toys. You find \( \hat\beta_0 = -0.1 \) & \( \hat\beta_1 = 0.7 \), using the same Ad $ and Sales data as in the Test of Slope Coefficient Example.
Interpret a coefficient of determination of 0.8167.

r² Computer Output
Root MSE    0.60553     R-square    0.8167
Dep Mean    2.00000     Adj R-sq    0.7556
C.V.        30.27650
Root MSE is S, R-square is r², and Adj R-sq is r² adjusted for number of explanatory variables & sample size.

Using the Model for Prediction & Estimation

Prediction With Regression Models
1. Types of Predictions
   Point Estimates
   Interval Estimates
2. What Is Predicted
   Population Mean Response E(Y) for Given X
      Point on Population Regression Line
   Individual Response (Yi) for Given X

What Is Predicted
[Figure: at \( X = X_P \), the individual Y, the mean Y on the population line \( E(Y) = \beta_0 + \beta_1 X \), and the prediction \( \hat Y \) on the sample line \( \hat Y_i = \hat\beta_0 + \hat\beta_1 X \)]

Confidence Interval Estimate of Mean Y
\[ \hat Y - t_{n-2,\,\alpha/2}\, S_{\hat Y} \le E(Y) \le \hat Y + t_{n-2,\,\alpha/2}\, S_{\hat Y} \]
\[ \text{where} \quad S_{\hat Y} = S \sqrt{\frac{1}{n} + \frac{(X_p - \bar X)^2}{\sum_{i=1}^{n} (X_i - \bar X)^2}} \]

Factors Affecting Interval Width
1. Level of Confidence (1 - α)
   Width Increases as Confidence Increases
2. Data Dispersion (s)
   Width Increases as Variation Increases
3. Sample Size
   Width Decreases as Sample Size Increases
4. Distance of Xp from Mean \( \bar X \)
   Width Increases as Distance Increases

Why Distance from Mean?
[Figure: Sample 1 line and Sample 2 line crossing near \( (\bar X, \bar Y) \); the predicted values show greater dispersion at X2, far from \( \bar X \), than at X1]

Confidence Interval Estimate Example
You're a marketing analyst for Hasbro Toys. You find b0 = -.1, b1 = .7 & s = .60553, using the same Ad $ and Sales data and Solution Table as in the Test of Slope Coefficient Example.
Estimate the mean sales when advertising is $4 at the .05 level.

Confidence Interval Estimate Solution
\[ \hat Y - t_{n-2,\,\alpha/2}\, S_{\hat Y} \le E(Y) \le \hat Y + t_{n-2,\,\alpha/2}\, S_{\hat Y} \]
\[ \hat Y = -0.1 + (0.7)(4) = 2.7 \quad (X \text{ to be predicted} = 4) \]
\[ S_{\hat Y} = .60553 \sqrt{\frac{1}{5} + \frac{(4 - 3)^2}{10}} = 0.3316 \]
\[ 2.7 - (3.1824)(0.3316) \le E(Y) \le 2.7 + (3.1824)(0.3316) \]
\[ 1.6445 \le E(Y) \le 3.7553 \]

Prediction Interval of Individual Response
\[ \hat Y - t_{n-2,\,\alpha/2}\, S_{(Y - \hat Y)} \le Y_P \le \hat Y + t_{n-2,\,\alpha/2}\, S_{(Y - \hat Y)} \]
\[ \text{where} \quad S_{(Y - \hat Y)} = S \sqrt{1 + \frac{1}{n} + \frac{(X_P - \bar X)^2}{\sum_{i=1}^{n} (X_i - \bar X)^2}} \]

Why the Extra 'S'?
[Figure: at \( X_P \), the Y we're trying to predict varies about the expected (mean) Y on \( E(Y) = \beta_0 + \beta_1 X \), which in turn differs from the prediction \( \hat Y \) on the sample line \( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_i \)]

Interval Estimate Computer Output
Obs    Dep Var SALES    Pred Value    Std Err Predict    Low95% Mean    Upp95% Mean    Low95% Predict    Upp95% Predict
1      1.000            0.600         0.469              -0.892         2.092          -1.837            3.037
2      1.000            1.300         0.332               0.244         2.355          -0.897            3.497
3      2.000            2.000         0.271               1.138         2.861          -0.111            4.111
4      2.000            2.700         0.332               1.644         3.755           0.502            4.897
5      4.000            3.400         0.469               1.907         4.892           0.962            5.837
Obs 4 gives the predicted Y when X = 4. Std Err Predict is \( S_{\hat Y} \); Low95% Mean and Upp95% Mean give the confidence interval; Low95% Predict and Upp95% Predict give the prediction interval.

Hyperbolic Interval Bands
[Figure: interval bands around the fitted line \( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_i \), narrowest at \( \bar X \) and wider at \( X_P \)]

Correlation Models

Correlation Models
1. Answer 'How Strong Is the Linear Relationship Between 2 Variables?'
2. Coefficient of Correlation Used
   Population Correlation Coefficient Denoted \( \rho \) (Rho)
   Values Range from -1 to +1
   Measures Degree of Association
3. Used Mainly for Understanding

Sample Coefficient of Correlation
1. Pearson Product Moment Coefficient of Correlation, r:
\[ r = \sqrt{\text{Coefficient of Determination}} = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar X)^2 \, \sum_{i=1}^{n} (Y_i - \bar Y)^2}} \]
   (the square root takes the sign of the slope)

Coefficient of Correlation Values
-1.0: Perfect Negative Correlation
0: No Correlation
+1.0: Perfect Positive Correlation
Degree of negative correlation increases from 0 toward -1.0; degree of positive correlation increases from 0 toward +1.0.

Coefficient of Correlation Examples
[Figure: four scatter plots of Y against X with r = 1, r = -1, r = .89, and r = 0]

Test of Coefficient of Correlation
1. Shows If There Is a Linear Relationship Between 2 Numerical Variables
2. Same Conclusion as Testing Population Slope \( \beta_1 \)
3. Hypotheses
   \( H_0: \rho = 0 \) (No Correlation)
   \( H_a: \rho \ne 0 \) (Correlation)

Conclusion
1. Described the Linear Regression Model
2. Stated the Regression Modeling Steps
3. Explained Ordinary Least Squares
4. Computed Regression Coefficients
5. Predicted Response Variable
6. Interpreted Computer Output

Multiple Regression and Model Building

Learning Objectives
1. Explain the Linear Multiple Regression Model
2. Test Overall Significance
3. Describe Various Types of Models
4. Evaluate Portions of a Regression Model
5. Interpret Linear Multiple Regression Computer Output
7. Explain Residual Analysis
8. Describe Regression Pitfalls

Linear Multiple Regression Model: Hypothesizing the Deterministic Component

Linear Multiple Regression Model
1. Relationship between 1 dependent & 2 or more independent variables is a linear function
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i \]
   \( Y_i \) = Dependent (response) variable
   \( X_{1i}, \dots, X_{ki} \) = Independent (explanatory) variables
   \( \beta_0 \) = Population Y-intercept
   \( \beta_1, \dots, \beta_k \) = Population slopes
   \( \varepsilon_i \) = Random error

Population Multiple Regression Model
Bivariate model
\( Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i \) (Observed Y)
\( E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} \)
[Figure: response plane over X1 and X2 with intercept \( \beta_0 \); the observed Y at \( (X_{1i}, X_{2i}) \) lies \( \varepsilon_i \) from the plane]

Sample Multiple Regression Model
Bivariate model
\( Y_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} + \hat\varepsilon_i \) (Observed Y)
\( \hat Y_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} \)
[Figure: fitted response plane over X1 and X2 with intercept \( \hat\beta_0 \); the observed Y at \( (X_{1i}, X_{2i}) \) lies \( \hat\varepsilon_i \) from the plane]

Parameter Estimation

Multiple Linear Regression Equations
Too complicated by hand! Ouch!

Interpretation of Estimated Coefficients
1. Slope (\( \hat\beta_k \))
   Estimated Y Changes by \( \hat\beta_k \) for Each 1 Unit Increase in Xk Holding All Other Variables Constant
   Example: If \( \hat\beta_1 = 2 \), then Sales (Y) Is Expected to Increase by 2 for Each 1 Unit Increase in Advertising (X1) Given the Number of Sales Rep's (X2)
2. Y-Intercept (\( \hat\beta_0 \))
   Average Value of Y When Every Xk = 0

Parameter Estimation Example
You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) & newspaper circulation (000) on the number of ad responses (00). You've collected the following data:
Resp    Size    Circ
1       1       2
4       8       8
1       3       1
3       5       7
2       6       4
4       10      6

Parameter Estimation Computer Output
Parameter Estimates
Variable    DF    Parameter Estimate    Standard Error    T for H0: Param=0    Prob>|T|
INTERCEP    1     0.0640                0.2599            0.246                0.8214
ADSIZE      1     0.2049                0.0588            3.656                0.0399
CIRC        1     0.2805                0.0686            4.089                0.0264
The Parameter Estimate column holds \( \hat\beta_P \): INTERCEP is \( \hat\beta_0 \), ADSIZE is \( \hat\beta_1 \), and CIRC is \( \hat\beta_2 \).

Interpretation of Coefficients Solution
1. Slope (\( \hat\beta_1 \))
   # Responses to Ad Is Expected to Increase by .2049 (20.49) for Each 1 Sq. In. Increase in Ad Size Holding Circulation Constant
2. Slope (\( \hat\beta_2 \))
   # Responses to Ad Is Expected to Increase by .2805 (28.05) for Each 1 Unit (1,000) Increase in Circulation Holding Ad Size Constant

Evaluating the Model

Evaluating Multiple Regression Model Steps
1. Examine Variation Measures
2. Do Residual Analysis
3. Test Parameter Significance
   Overall Model
   Individual Coefficients
4. Test for Multicollinearity

Variation Measures

Coefficient of Multiple Determination
Proportion of Variation in Y 'Explained' by All X Variables Taken Together
\[ R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}} \]

Check Your Understanding
If you add a variable to the model, how will that affect the R-squared value for the model?
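The Check Your Understanding question can be explored numerically with the New York Times data. The following is a minimal sketch in plain Python, not the method used to produce the slide output: it solves the two-predictor least squares problem through the centered 2x2 normal equations, and the helper names (`centered_sum`, `r_squared_one`, `fit_two`) are illustrative. It compares R² for a model with ad size only against R² for a model with ad size and circulation.

```python
def centered_sum(u, v):
    """Sum of (u_i - mean(u)) * (v_i - mean(v)); gives SSxx, SSxy, SSyy, etc."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def r_squared_one(y, x1):
    """R-squared for the simple regression of y on x1."""
    return centered_sum(x1, y) ** 2 / (centered_sum(x1, x1) * centered_sum(y, y))


def fit_two(y, x1, x2):
    """Least squares fit of y = b0 + b1*x1 + b2*x2; returns (b0, b1, b2, R-squared)."""
    s11 = centered_sum(x1, x1)
    s22 = centered_sum(x2, x2)
    s12 = centered_sum(x1, x2)
    s1y = centered_sum(x1, y)
    s2y = centered_sum(x2, y)
    syy = centered_sum(y, y)
    det = s11 * s22 - s12 ** 2            # nonzero unless x1 and x2 are collinear
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    n = len(y)
    b0 = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    r2 = (b1 * s1y + b2 * s2y) / syy      # explained variation / total variation
    return b0, b1, b2, r2


# New York Times example from the slides.
resp = [1, 4, 1, 3, 2, 4]      # ad responses (00)
size = [1, 8, 3, 5, 6, 10]     # ad size (sq. in.)
circ = [2, 8, 1, 7, 4, 6]      # circulation (000)

r2_size_only = r_squared_one(resp, size)
b0, b1, b2, r2_both = fit_two(resp, size, circ)
print(round(b0, 4), round(b1, 4), round(b2, 4))
print(round(r2_size_only, 4), round(r2_both, 4))
```

The coefficients should agree with the Parameter Estimates output above to the printed precision, and R² for the two-variable model is at least as large as R² for the one-variable model.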
Adjusted R2 R2 Never Decreases When New X Variable Is Added to Model Only Y Values Determine SSyy Disadvantage When Comparing Models Solution: Adjusted R2 Each additional variable reduces adjusted R2, unless SSE goes up enough to compensate n 1 SSE SSE 2 1 R SS n k 1 SSyy yy 2 Ra 1 Variance of Error Assuming model is correctly specified… Best (unbiased) estimator of 2 Var E i2 2 SSE ˆ i is 2 s n k 1 n k 1 Used in formula for computing Exact formula is too complicated to show But higher value for s leads to higher Individual Coefficients Parameter Estimation Computer Output ^P Parameter Estimates Parameter Standard T for H0: Variable DF Estimate Error Param=0 Prob>|T| INTERCEP 1 0.0640 0.2599 0.246 0.8214 ADSIZE 1 0.2049 0.0588 3.656 0.0399 CIRC 1 0.2805 0.0686 4.089 0.0264 ^ t 0 ^ 1 ^ 2 ˆi sˆ i Evaluating Multiple Regression Model Steps 1. Examine Variation Measures 2. Do Residual Analysis 3. Test Parameter Significance Overall Model Individual Coefficients 4. Test for Multicollinearity Testing Overall Significance 1. Shows If There Is a Linear Relationship Between All X Variables Together & Y 2. Uses F Test Statistic 3. Hypotheses H0: 1 = 2 = ... 
= k = 0 No Linear Relationship Ha: At Least One Coefficient Is Not 0 At Least One X Variable Affects Y Testing Overall Significance Computer Output Analysis of Variance Source DF Model 2 Error 3 C Total 5 k Sum of Squares 9.2497 0.2503 9.5000 Mean Square 4.6249 0.0834 n - k -1 n-1 F Value 55.440 Prob>F 0.0043 MS(Model) MS(Error) P-Value Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter- 2nd Order Action Order Model Model Model Dummy Variable Model Models With a Single Quantitative Variable Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter- 2nd Order Action Order Model Model Model Dummy Variable Model First-Order Model With 1 Independent Variable First-Order Model With 1 Independent Variable 1. Relationship Between 1 Dependent & 1 Independent Variable Is Linear E (Y ) 0 1X 1i First-Order Model With 1 Independent Variable 1. Relationship Between 1 Dependent & 1 Independent Variable Is Linear E (Y ) 0 1X 1i 2. Used When Expected Rate of Change in Y Per Unit Change in X Is Stable First-Order Model With 1 Independent Variable 1. Relationship Between 1 Dependent & 1 Independent Variable Is Linear E (Y ) 0 1X 1i 2. Used When Expected Rate of Change in Y Per Unit Change in X Is Stable 3. 
Used With Curvilinear Relationships If Relevant Range Is Linear First-Order Model Relationships E (Y ) 0 1X 1i Y 1 > 0 Y 1 < 0 X1 X1 First-Order Model Worksheet Case, i Yi X1i 2 X1i 1 2 3 4 : 1 4 1 3 : 1 8 3 5 : 1 64 9 25 : Run regression with Y, X1 Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter- 2nd Order Action Order Model Model Model Dummy Variable Model Second-Order Model With 1 Independent Variable 1. Relationship Between 1 Dependent & 1 Independent Variables Is a Quadratic Function 2. Useful 1St Model If Non-Linear Relationship Suspected Second-Order Model With 1 Independent Variable 1. Relationship Between 1 Dependent & 1 Independent Variables Is a Quadratic Function 2. Useful 1St Model If Non-Linear Relationship Suspected Curvilinear effect 3. Model 2 E (Y ) 0 1X 1i 2 X 1i Linear effect Second-Order Model Relationships Y 2 > 0 Y 2 > 0 X1 Y 2 < 0 X1 Y 2 < 0 X1 X1 Second-Order Model Worksheet Case, i Yi X1i 2 X1i 1 2 3 4 : 1 4 1 3 : 1 8 3 5 : 1 64 9 25 : Create X12 column. Run regression with Y, X1, X12. Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter- 2nd Order Action Order Model Model Model Dummy Variable Model Third-Order Model With 1 Independent Variable 1. Relationship Between 1 Dependent & 1 Independent Variable Has a ‘Wave’ 2. Used If 1 Reversal in Curvature Third-Order Model With 1 Independent Variable 1. Relationship Between 1 Dependent & 1 Independent Variable Has a ‘Wave’ 2. Used If 1 Reversal in Curvature 3. 
Model E (Y ) 0 1X 1i 2 X 12i 3 X 13i Linear effect Curvilinear effects Third-Order Model Relationships E (Y ) 0 1X 1i Y 3 > 0 2 2 X 1i Y X1 3 3 X 1i 3 < 0 X1 Third-Order Model Worksheet Case, i Yi X1i X1i2 X1i3 1 1 1 1 1 2 4 8 64 512 3 1 3 9 27 4 3 5 25 125 : : : : : Multiply X1 by X1 to get X12. Multiply X1 by X1 by X1 to get X13. Run regression with Y, X1, X12 , X13. Models With Two or More Quantitative Variables Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter- 2nd Order Action Order Model Model Model Dummy Variable Model First-Order Model With 2 Independent Variables 1. Relationship Between 1 Dependent & 2 Independent Variables Is a Linear Function 2. Assumes No Interaction Between X1 & X2 Effect of X1 on E(Y) Is the Same Regardless of X2 Values First-Order Model With 2 Independent Variables 1. Relationship Between 1 Dependent & 2 Independent Variables Is a Linear Function 2. Assumes No Interaction Between X1 & X2 Effect of X1 on E(Y) Is the Same Regardless of X2 Values 3. 
Model E (Y ) 0 1X 1i 2 X 2i No Interaction No Interaction E(Y) E(Y) = 1 + 2X1 + 3X2 12 8 4 0 X1 0 0.5 1 1.5 No Interaction E(Y) E(Y) = 1 + 2X1 + 3X2 12 8 4 E(Y) = 1 + 2X1 + 3(0) = 1 + 2X1 0 X1 0 0.5 1 1.5 No Interaction E(Y) E(Y) = 1 + 2X1 + 3X2 12 8 E(Y) = 1 + 2X1 + 3(1) = 4 + 2X1 4 E(Y) = 1 + 2X1 + 3(0) = 1 + 2X1 0 X1 0 0.5 1 1.5 No Interaction E(Y) E(Y) = 1 + 2X1 + 3X2 12 E(Y) = 1 + 2X1 + 3(2) = 7 + 2X1 8 E(Y) = 1 + 2X1 + 3(1) = 4 + 2X1 4 E(Y) = 1 + 2X1 + 3(0) = 1 + 2X1 0 X1 0 0.5 1 1.5 No Interaction E(Y) E(Y) = 1 + 2X1 + 3X2 E(Y) = 1 + 2X1 + 3(3) = 10 + 2X1 12 E(Y) = 1 + 2X1 + 3(2) = 7 + 2X1 8 E(Y) = 1 + 2X1 + 3(1) = 4 + 2X1 4 E(Y) = 1 + 2X1 + 3(0) = 1 + 2X1 0 X1 0 0.5 1 1.5 No Interaction E(Y) E(Y) = 1 + 2X1 + 3X2 E(Y) = 1 + 2X1 + 3(3) = 10 + 2X1 12 E(Y) = 1 + 2X1 + 3(2) = 7 + 2X1 8 E(Y) = 1 + 2X1 + 3(1) = 4 + 2X1 4 E(Y) = 1 + 2X1 + 3(0) = 1 + 2X1 0 X1 0 0.5 1 1.5 Effect (slope) of X1 on E(Y) does not depend on X2 value First-Order Model Relationships Y Response Surface X1 0 X2 First-Order Model Worksheet Case, i Yi X1i X2i 1 2 3 4 : 1 4 1 3 : 1 8 3 5 : 3 5 2 6 : Run regression with Y, X1, X2 Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter- 2nd Order Action Order Model Model Model Dummy Variable Model Interaction Model With 2 Independent Variables 1. Hypothesizes Interaction Between Pairs of X Variables Response to One X Variable Varies at Different Levels of Another X Variable Interaction Model With 2 Independent Variables 1. Hypothesizes Interaction Between Pairs of X Variables Response to One X Variable Varies at Different Levels of Another X Variable 2. Contains Two-Way Cross Product Terms E (Y ) 0 1X 1i 2 X 2i 3 X 1i X 2i Interaction Model With 2 Independent Variables 1. Hypothesizes Interaction Between Pairs of X Variables Response to One X Variable Varies at Different Levels of Another X Variable 2. 
Contains Two-Way Cross Product Terms (Y ) Be 0Combined 1X 1i With 2 X 2Other 3.ECan Models i 3X 1i X 2i Example: Dummy-Variable Model Effect of Interaction Effect of Interaction 1. Given: E (Y ) 0 1X 1i 2 X 2i 3 X 1i X 2i Effect of Interaction 1. Given: E (Y ) 0 1X 1i 2 X 2i 3 X 1i X 2i 2. Without Interaction Term, Effect of X1 on Y Is Measured by 1 Effect of Interaction 1. Given: E (Y ) 0 1X 1i 2 X 2i 3 X 1i X 2i 2. Without Interaction Term, Effect of X1 on Y Is Measured by 1 3. With Interaction Term, Effect of X1 on Y Is Measured by 1 + 3X2 Effect Increases As X2i Increases Interaction Model Relationships Interaction Model Relationships E(Y) E(Y) = 1 + 2X1 + 3X2 + 4X1X2 12 8 4 0 X1 0 0.5 1 1.5 Interaction Model Relationships E(Y) E(Y) = 1 + 2X1 + 3X2 + 4X1X2 12 8 E(Y) = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1 4 0 X1 0 0.5 1 1.5 Interaction Model Relationships E(Y) E(Y) = 1 + 2X1 + 3X2 + 4X1X2 E(Y) = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1 12 8 E(Y) = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1 4 0 X1 0 0.5 1 1.5 Interaction Model Relationships E(Y) E(Y) = 1 + 2X1 + 3X2 + 4X1X2 E(Y) = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1 12 8 E(Y) = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1 4 0 X1 0 0.5 1 1.5 Effect (slope) of X1 on E(Y) does depend on X2 value Interaction Model Worksheet Case, i Yi X1i X2i X1i X2i 1 2 3 4 : 1 4 1 3 : 1 8 3 5 : 3 5 2 6 : 3 40 6 30 : Multiply X1 by X2 to get X1X2. Run regression with Y, X1, X2 , X1X2 Types of Regression Models Explanatory Variable 1 Quantitative Variable 2 or More Quantitative Variables 1 Qualitative Variable 1st 2nd 3rd Order Order Order Model Model Model 1st Inter- 2nd Order Action Order Model Model Model Dummy Variable Model Second-Order Model With 2 Independent Variables 1. Relationship Between 1 Dependent & 2 or More Independent Variables Is a Quadratic Function 2. Useful 1St Model If Non-Linear Relationship Suspected Second-Order Model With 2 Independent Variables 1. 
Relationship Between 1 Dependent & 2 or More Independent Variables Is a Quadratic Function
2. Useful 1st Model If Non-Linear Relationship Suspected
3. Model
   $E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \beta_4 X_{1i}^2 + \beta_5 X_{2i}^2$

Second-Order Model Relationships
[Figure: three response surfaces of Y over X1 and X2, labeled $\beta_4 + \beta_5 > 0$, $\beta_4 + \beta_5 < 0$, and $\beta_3^2 > 4\beta_4\beta_5$]
   $E(Y) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \beta_4 X_{1i}^2 + \beta_5 X_{2i}^2$

Second-Order Model Worksheet
   Case, i   Yi   X1i   X2i   X1i X2i   X1i^2   X2i^2
   1         1    1     3     3         1       9
   2         4    8     5     40        64      25
   3         1    3     2     6         9       4
   4         3    5     6     30        25      36
   :         :    :     :     :         :       :
Multiply X1 by X2 to get X1X2; then compute X1^2 and X2^2. Run regression with Y, X1, X2, X1X2, X1^2, X2^2.

Models With One Qualitative Independent Variable

Dummy-Variable Model
1. Involves Categorical X Variable With 2 Levels
   e.g., Male-Female; College-No College
2. Variable Levels Coded 0 & 1
3. Number of Dummy Variables Is 1 Less Than Number of Levels of Variable
4. May Be Combined With Quantitative Variable (1st Order or 2nd Order Model)

Dummy-Variable Model Worksheet
   Case, i   Yi   X1i   X2i
   1         1    1     1
   2         4    8     0
   3         1    3     1
   4         3    5     1
   :         :    :     :
X2 levels: 0 = Group 1; 1 = Group 2.
Run regression with Y, X1, X2.

Interpreting Dummy-Variable Model Equation
Given: $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i}$
   Y = Starting salary of college grads
   X1 = GPA
   X2 = 0 if Male, 1 if Female
Males (X2 = 0): $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 (0) = \hat\beta_0 + \hat\beta_1 X_{1i}$
Females (X2 = 1): $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 (1) = (\hat\beta_0 + \hat\beta_2) + \hat\beta_1 X_{1i}$
Same slopes.

Dummy-Variable Model Relationships
[Figure: Y against X1, two parallel lines with the same slope $\hat\beta_1$. Females: intercept $\hat\beta_0 + \hat\beta_2$. Males: intercept $\hat\beta_0$.]

Dummy-Variable Model Example
Computer Output: $\hat{Y}_i = 3 + 5X_{1i} + 7X_{2i}$, with X2 = 0 if Male, 1 if Female
Males (X2 = 0): $\hat{Y}_i = 3 + 5X_{1i} + 7(0) = 3 + 5X_{1i}$
Females (X2 = 1): $\hat{Y}_i = 3 + 5X_{1i} + 7(1) = (3 + 7) + 5X_{1i}$
Same slopes.

Selecting Variables in Model Building
"A Butterfly Flaps its Wings in Japan, Which Causes It to Rain in Nebraska." -- Anonymous
Use Theory Only! Use Computer Search!

Model Building with Computer Searches
1. Rule: Use as Few X Variables As Possible
2. Stepwise Regression
   Computer Selects X Variable Most Highly Correlated With Y
   Continues to Add or Remove Variables Depending on SSE
3. Best Subset Approach
   Computer Examines All Possible Sets

Residual Analysis

Evaluating Multiple Regression Model Steps
1. Examine Variation Measures
2. Do Residual Analysis
3. Test Parameter Significance
   Overall Model
   Individual Coefficients
4. Test for Multicollinearity

Residual Analysis
1. Graphical Analysis of Residuals
   Plot Estimated Errors vs.
Xi Values
   Difference Between Actual Yi & Predicted Yi
   Estimated Errors Are Called Residuals
   Plot Histogram or Stem-&-Leaf of Residuals
2. Purposes
   Examine Functional Form (Linear vs. Non-Linear Model)
   Evaluate Violations of Assumptions

Linear Regression Assumptions
1. Mean of Probability Distribution of Error Is 0
2. Probability Distribution of Error Has Constant Variance
3. Probability Distribution of Error Is Normal
4. Errors Are Independent

Residual Plot for Functional Form
[Figure: residuals $\hat{e}$ against X, two panels. Curved pattern: Add X^2 Term. Random scatter: Correct Specification.]

Residual Plot for Equal Variance
[Figure: standardized residuals (SR) against X, two panels. Fan-shaped: Unequal Variance. Random scatter: Correct Specification.]
Standardized residuals are typically used (residual divided by standard error of prediction).

Residual Plot for Independence
[Figure: standardized residuals (SR) against X, two panels. Systematic pattern: Not Independent. Random scatter: Correct Specification.]

Residual Analysis Computer Output
   Obs   Dep Var SALES   Predict Value   Residual   Student Residual
   1     1.0000          0.6000           0.4000     1.044
   2     1.0000          1.3000          -0.3000    -0.592
   3     2.0000          2.0000           0          0.000
   4     2.0000          2.7000          -0.7000    -1.382
   5     4.0000          3.4000           0.6000     1.567
The last column of the output is a plot of the standardized (student) residuals on a -2 to 2 scale.

Multiple Regression Models
[Diagram. Linear: Linear, Dummy Variable, Interaction. Non-Linear: Polynomial, Square Root, Log, Reciprocal, Exponential.]

Polynomial (Curvilinear) Regression Model

Curvilinear Regression Model
Relationship between 1 response variable and 2 or more explanatory variables is a polynomial function
Useful when scatter diagram indicates non-linear relationship
Curvilinear model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$
The second explanatory variable is the square of the 1st.
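The curvilinear model above can be fitted with ordinary least squares once the squared term is added as a second column. The following is a minimal sketch, assuming NumPy is available; the function name `fit_quadratic` and the noise-free demo data are hypothetical and are not taken from the slides.

```python
import numpy as np

def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x^2; returns [b0, b1, b2]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: intercept, x, and x squared (the "second" explanatory variable).
    X = np.column_stack([np.ones_like(x), x, x ** 2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical, noise-free data generated from y = 1 + 2x + 0.5x^2.
x_demo = np.arange(10, dtype=float)
y_demo = 1 + 2 * x_demo + 0.5 * x_demo ** 2

b_quad = fit_quadratic(x_demo, y_demo)  # recovers roughly [1, 2, 0.5]
```

Because the squared column is just another regressor, the same worksheet approach shown for the second-order model applies: compute X^2, then run the regression with Y, X, X^2.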
Curvilinear Regression Model
Curvilinear models may be considered when scatter diagram takes on the following shapes:
[Figure: four panels of Y against X1, two with $\beta_2 > 0$ and two with $\beta_2 < 0$]
$\beta_2$ = the coefficient of the quadratic term

Testing for Significance: Curvilinear Model
Testing for Overall Relationship
   Similar to test for linear model
   F test statistic = $\dfrac{MSR}{MSE}$
Testing the Curvilinear Effect
   Compare curvilinear model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$
   with the linear model $Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i$

Testing for Significance: Curvilinear Model
May require testing a portion of the model (e.g. the linear and squared terms) when there are other variables in the model
   $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \beta_3 X_{2i} + \varepsilon_i$
Here we must test $H_0: \beta_1 = \beta_2 = 0$ to test for the significance of X1 - an F-test for these two "variables"

Inherently Linear Models
Non-linear models that can be expressed in linear form
   Can be estimated by LS in linear form
   Require data transformation
Multiplicative model example
   $Y_i = \beta_0 X_{1i}^{\beta_1} X_{2i}^{\beta_2} \varepsilon_i$
   $\ln Y_i = \ln \beta_0 + \beta_1 \ln X_{1i} + \beta_2 \ln X_{2i} + \ln \varepsilon_i$

Using Transformations
Requires Data Transformation
Either or Both Independent and Dependent Variables May Be Transformed
Can be based on theory, logic or scatter diagrams

Square Root Transformation
   $Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \beta_2 \sqrt{X_{2i}} + \varepsilon_i$
[Figure: Y against X1 for $\beta_1 > 0$ and $\beta_1 < 0$; similarly for X2]
Transforms one of the above models into one that appears linear. Often used to overcome heteroscedasticity.

Logarithmic Transformation
   $Y_i = \beta_0 + \beta_1 \ln(X_{1i}) + \beta_2 \ln(X_{2i}) + \varepsilon_i$
[Figure: Y against X1 for $\beta_1 > 0$ and $\beta_1 < 0$; similarly for X2]

Exponential Transformation
Original Model: $Y_i = e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}} \varepsilon_i$
[Figure: Y against X1 for $\beta_1 > 0$ and $\beta_1 < 0$; similarly for X2]
Transformed into: $\ln Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ln \varepsilon_i$

Interpretation of coefficients
The dependent variable is logged. The coefficient on the independent variable can be approximately interpreted as: a 1 unit change in X leads to a 100·b percent change in Y.
The independent variable is logged. The coefficient on the independent variable can be approximately interpreted as: a 1 percent change in X leads to a b/100 unit change in Y.
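The multiplicative model from the "Inherently Linear Models" slide can be estimated by taking logs of Y, X1 and X2 and running least squares on the transformed data. The sketch below assumes NumPy; the parameter values (3, 0.5, -1.2) and the regressor values are hypothetical, chosen only so the recovery can be checked on noise-free data.

```python
import numpy as np

# Hypothetical positive-valued regressors (logs require X > 0).
x1 = np.linspace(1.0, 10.0, 20)
x2 = 2.0 + np.sin(np.arange(20))

# Noise-free data from the multiplicative model Y = 3 * X1^0.5 * X2^(-1.2).
y = 3.0 * x1 ** 0.5 * x2 ** -1.2

# Linearized form: ln Y = ln(beta0) + beta1 * ln X1 + beta2 * ln X2.
X = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

beta0 = np.exp(coef[0])          # back-transform the intercept
beta1, beta2 = coef[1], coef[2]  # exponents of X1 and X2
```

In this double-log form the fitted `beta1` and `beta2` are the exponents of the original model, which is why they can be read as percentage-change effects.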
Interpretation of coefficients
Both dependent and independent variables are logged. The coefficient on the independent variable can be approximately interpreted as: a 1 percent change in X leads to a b percent change in Y. Therefore b is the elasticity of Y with respect to X.

Income and Experience
[Figures: a scatter plot, followed by fitted specifications: Linear Model; Log independent variable; Income logged, Log(Y); Double Log - Elasticity Model (Note: LFEXP is already logged in this example); Quadratic; Log(Y) + Quadratic; All specifications together]

Standardized and Unstandardized
Many disciplines report ONLY standardized coefficients
The usual coefficients are then referred to as "unstandardized coefficients"
The "standardized" coefficients are often referred to as "beta weights"
The t-tests for significance of the slopes are identical for either of these two.

Interpretation of coefficients
If both Y and X are measured in standardized form,
   $y_i = \dfrac{Y_i - \bar{Y}}{\sigma_Y}$ and $x_i = \dfrac{X_i - \bar{X}}{\sigma_X}$
then the b's are called standardized coefficients.
They indicate the number of standard deviations Y will change when X changes by one standard deviation.

BETA Coefficients Example

Comparison of coefficients
In general, we should NOT compare coefficients unless they are measured in the same units (e.g. dollars or inches)
Two "unit free" measures are sometimes used to compare coefficients:
   elasticities (percentage changes)
   standardized coefficients (standard deviation changes)

Violation of Assumptions

Omitted Variables
This problem occurs if a variable is omitted from the specification, either due to an error by the researcher or lack of data.
If the variable is uncorrelated with the included variables:
   The estimated slopes are inefficient (their variance is too large).
   The estimated slopes are unbiased.
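The omitted-variable cases can be illustrated with a small noise-free example: the uncorrelated case above and, for contrast, the correlated case covered on the next slide. This is only a sketch, assuming NumPy; the helper `ols` and all data values are hypothetical. With no noise, the short regression's slope equals the true slope plus the omitted coefficient times the slope of the omitted variable on the included one.

```python
import numpy as np

def ols(columns, y):
    """Least-squares coefficients for an intercept plus the given regressor columns."""
    X = np.column_stack([np.ones(len(y))] + [np.asarray(c, dtype=float) for c in columns])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

# Hypothetical regressors, both centered at zero.
x1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
x2_uncorrelated = np.array([2.0, -1.0, -2.0, -1.0, 2.0])  # orthogonal to x1
x2_correlated = x2_uncorrelated + 0.5 * x1                # correlated with x1

# True model in both cases: y = 1 + 2*x1 + 3*x2 (no noise).
y_a = 1 + 2 * x1 + 3 * x2_uncorrelated
y_b = 1 + 2 * x1 + 3 * x2_correlated

# Regress y on x1 only, i.e. with x2 omitted from the specification.
slope_a = ols([x1], y_a)[1]  # about 2.0: slope on x1 is still correct
slope_b = ols([x1], y_b)[1]  # about 3.5 = 2 + 3 * 0.5: slope on x1 is biased
```

The noise-free setup isolates bias only; the loss of efficiency in the uncorrelated case would need a simulation with random errors to show.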
Omitted Variables
If the variable is correlated with the included variables:
   The t-tests are biased (the estimated variance of the slopes is too small).
   The estimated slopes are biased.
This is a serious problem because it leads us to reject true null hypotheses too often.

Omitted Variables
This suggests that great care be taken in model building.
It is generally not good procedure to allow the sample to dictate the model.
It is better to include a variable that should not be there than to exclude a variable that should.

EXAMPLE
Effect of Omitted Variable

Measurement Error
In the dependent variable
   With random measurement error that is unrelated to the X variables, the slopes remain unbiased but their standard errors are larger - null hypotheses that are false are more difficult to reject.
   Measurement error makes it more difficult to reject null hypotheses.
In an independent variable
   Slope is biased toward zero.
   Slopes of other variables that are correlated with this variable can also be biased.
   Measurement error can lead to rejecting true nulls.

Measurement Error Implications
Your dependent variable is hard to measure: product satisfaction or quality of work.
If you do find results, they would be even stronger if you could measure the variable accurately.
A significant result with a variable that is difficult to measure should not be dismissed!

Measurement Error Implications
Your independent variable is hard to measure: product satisfaction or quality of work.
Same as dependent variable (a significant result would be even more significant).
HOWEVER, poor measurement can lead you to give MORE credit than is due to another variable.

Measurement Error Conclusions
Measure your variables as accurately as possible to improve the power of your tests.
If your independent variable is difficult to measure, you must worry about the results for other variables in the model.

Heteroscedasticity
Typically a problem in cross sectional data
Slopes are unbiased, but inefficient.
However, this is often an indication of an omitted variable problem, in which case the slopes are potentially biased.

Heteroscedasticity
Usually occurs due to a few outliers.
Possible cures:
   Drop the outliers.
   Use a transformation like a log transformation that eliminates the problem.
   Use advanced procedures to correct the problem (Weighted Least Squares; Generalized Least Squares)

Heteroscedasticity
Examples:
   Data on firms of different sizes - there is likely to be more heterogeneity in management for small firms
      Small firms -> big errors
      Large firms -> small errors
   Data on proportions gathered from groups of different sizes
      Large groups likely to give better estimates
      Example: College graduation rates

Autocorrelation
This occurs when the error in one observation is correlated with the error in another observation.
This is generally a time series problem.
This correlation can be quite simple, or very complicated.
If the correlation is with the previous observation error, this is called 1st order autocorrelation.

Example Plots of Residuals
[Figure: three residual plots: Positive Autocorrelation, Negative Autocorrelation, None]

The Durbin-Watson Statistic
• Used when data is collected over time to detect autocorrelation (residuals in one time period are related to residuals in another period)
• Measures violation of independence assumption
   $D = \dfrac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
• Approximately 0 = positive autocorrelation
• Approximately 2 = none
• Approximately 4 = negative autocorrelation
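The Durbin-Watson formula can be sketched in a few lines. The example below assumes NumPy and uses the weekly customers and sales figures from the Durbin-Watson example in this deck; the function name `durbin_watson` and the variable names are hypothetical. The slides report DW = .88 for this regression, and `rho_hat` applies the deck's r = 1 - d/2 approximation.

```python
import numpy as np

def durbin_watson(residuals):
    """D = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Weekly data from the Durbin-Watson example (weeks 1 to 15).
customers = np.array([794, 799, 837, 855, 845, 844, 863, 875,
                      880, 905, 886, 843, 904, 950, 841], dtype=float)
sales = np.array([9.33, 8.26, 7.48, 9.08, 9.83, 10.09, 11.01, 11.49,
                  12.07, 12.55, 11.92, 10.27, 11.80, 12.15, 9.64])

# Simple linear regression of sales on customers by least squares.
X = np.column_stack([np.ones_like(customers), customers])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
b0, b1 = coef

residuals = sales - X @ coef
dw = durbin_watson(residuals)
rho_hat = 1 - dw / 2  # rough estimate of first-order autocorrelation
```

The residuals must be kept in time order; shuffling the observations changes the statistic even though the fitted line stays the same.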
The Durbin-Watson Statistic
Durbin-Watson table (one-tailed critical values), $\alpha$ = .05
           P = 1            P = 2
   N     DL     DH        DL     DH
   15    1.08   1.36      0.95   1.54
   16    1.10   1.37      0.98   1.54
   17    1.13   1.38      1.02   1.54
   18    1.16   1.39      1.05   1.53
d > DH indicates ACCEPT NULL (do not reject)
d < DL indicates REJECT NULL
d > DL and d < DH is inconclusive
Test for NEGATIVE autocorrelation: USE 4-d
   Example: d = 3.5, n = 15, p = 2: use 4 - 3.5 = .5, which is below DL = 0.95, so reject null

Durbin-Watson Example
Relationship between sales and customers
   Regression Statistics
   Multiple R            0.810829997
   R Square              0.657445284
   Adjusted R Square     0.631094922
   Standard Error        0.936036681
   Observations          15

               Coefficients    Standard Error   t Stat          P-value
   Intercept   -16.0321936     5.310167093      -3.019150493    0.009868641
   Customers   0.030760228     0.006158189      4.995011683     0.000245105

   Week   Customers   Sales
   1      794         9.33
   2      799         8.26
   3      837         7.48
   4      855         9.08
   5      845         9.83
   6      844         10.09
   7      863         11.01
   8      875         11.49
   9      880         12.07
   10     905         12.55
   11     886         11.92
   12     843         10.27
   13     904         11.80
   14     950         12.15
   15     841         9.64

Durbin-Watson Example
DW = .88
p = 1 (# of variables), n = 15 (# of observations)
dl = 1.08, dh = 1.36
Conclusion: Reject null of no positive autocorrelation (DW < dl)

Problem and Cure
• Autocorrelation present
• t-tests are biased - estimated standard error too small
• Degree of autocorrelation known
(or estimated)
• Remove by differencing the data.
• Special case: correlation +1 -> first difference the data
   $Y_t^* = Y_t - Y_{t-1}$
   $X_t^* = X_t - X_{t-1}$
We then run the regression using Y* and X* instead of Y and X.

EXAMPLE: How is birth rate related to wars and women in the labor force?
WLF = labor force participation of women
Divorce = divorce rate
returnyr = 3 years following a war

   Obs   Birth Rate   WLF     UE     Divorce   waryear   returnyr
   1     2.03         25.40   9.90   17        1         0
   2     2.22         26.70   4.70   18        1         0
   3     2.27         29.10   1.90   23        1         0
   4     2.12         29.20   3.20   28        1         0
   5     2.04         29.20   1.90   30        1         0
   6     2.41         27.80   3.90   27        0         0
   7     2.66         27.40   3.90   24        0         1
   8     2.49         28.00   3.80   23        0         1
   9     2.45         28.30   5.90   25        0         1
   10    2.41         28.80   5.30   23        0         0
   11    2.49         29.30   3.30   24        0         0
   12    2.51         29.40   3.00   25        0         0
   13    2.50         29.20   2.90   25        0         0
   14    2.53         29.40   5.50   25        0         0
   15    2.50         30.20   4.40   25        0         1
   16    2.52         31.00   4.10   24        0         1
   17    2.53         31.20   4.30   25        0         1
   18    2.45         31.50   6.80   25        0         0
   19    2.40         31.70   5.50   26        0         0
   20    2.37         32.30   5.50   27        0         0
   21    2.33         32.60   6.70   26        0         0
   22    2.24         32.70   5.50   26        0         0
   23    2.17         33.20   5.70   26        0         0
   24    2.10         33.60   5.20   26        0         0
   25    1.94         34.00   4.50   27        1         0
   26    1.84         34.60   3.80   27        1         0
   27    1.78         35.10   3.80   27        1         0
   28    1.75         35.50   3.60   28        1         0
   29    1.78         36.30   3.50   30        1         0
   30    1.84         36.70   4.90   33        1         0

Results
   Regression Statistics
   Multiple R            0.859549368
   R Square              0.738825117
   Adjusted R Square     0.684413683
   Standard Error        0.152709042
   Observations          30

   ANOVA
                df    SS             MS             F              Significance F
   Regression   5     1.583255433    0.316651087    13.57849007    2.41579E-06
   Residual     24    0.559681234    0.023320051
   Total        29    2.142936667

               Coefficients    Standard Error   t Stat          P-value        Lower 95%       Upper 95%
   Intercept   4.374896708     0.381602139      11.46454976     3.1963E-11     3.587308764     5.162484651
   WLF         -0.04910246     0.014827648      -3.311547559    0.002928173    -0.079705215    -0.018499706
   UE          -0.045616967    0.023630251      -1.930447837    0.065449464    -0.094387398    0.003153464
   Divorce     -0.008953659    0.015631931      -0.572780128    0.572121264    -0.041216371    0.023309053
   waryear     -0.327043186    0.074959758      -4.362916791    0.0002099      -0.48175249     -0.172333881
   returnyr    0.009847529     0.089477525      0.110055896     0.913280166    -0.174824968    0.194520026

Residuals
[Figure: residuals plotted against observation number (0 to 35), vertical axis from -0.2 to 0.2]
DW = .55: reject null of no autocorrelation
Estimate rho as r = 1 - d/2 = 1 - .55/2 = .725

Differencing Data
Each variable is replaced by its value minus .725 times its value in the previous period, e.g. Birthrate* = 2.22 - .725(2.03) = 0.74825 for observation 2. The first observation is lost, leaving 29.

   Obs   Birthrate   WLF       UE        Divorce   waryear   returnyr
   2     0.74825     8.285     -2.4775   5.675     0.275     0
   3     0.6605      9.7425    -1.5075   9.95      0.275     0
   4     0.47425     8.1025    1.8225    11.325    0.275     0
   5     0.503       8.03      -0.42     9.7       0.275     0
   6     0.931       6.63      2.5225    5.25      -0.725    0
   7     0.91275     7.245     1.0725    4.425     0         1
   8     0.5615      8.135     0.9725    5.6       0         0.275
   9     0.64475     8         3.145     8.325     0         0.275
   10    0.63375     8.2825    1.0225    4.875     0         -0.725
   11    0.74275     8.42      -0.5425   7.325     0         0
   12    0.70475     8.1575    0.6075    7.6       0         0
   13    0.68025     7.885     0.725     6.875     0         0
   14    0.7175      8.23      3.3975    6.875     0         0
   15    0.66575     8.885     0.4125    6.875     0         1
   16    0.7075      9.105     0.91      5.875     0         0.275
   17    0.703       8.725     1.3275    7.6       0         0.275
   18    0.61575     8.88      3.6825    6.875     0         -0.725
   19    0.62375     8.8625    0.57      7.875     0         0
   20    0.63        9.3175    1.5125    8.15      0         0
   21    0.61175     9.1825    2.7125    6.425     0         0
   22    0.55075     9.065     0.6425    7.15      0         0
   23    0.546       9.4925    1.7125    7.15      0         0
   24    0.52675     9.53      1.0675    7.15      0         0
   25    0.4175      9.64      0.73      8.15      1         0
   26    0.4335      9.95      0.5375    7.425     0.275     0
   27    0.446       10.015    1.045     7.425     0.275     0
   28    0.4595      10.0525   0.845     8.425     0.275     0
   29    0.51125     10.5625   0.89      9.7       0.275     0
   30    0.5495      10.3825   2.3625    11.25     0.275     0

Results
   Regression Statistics
   Multiple R            0.812694974
   R Square              0.660473121
   Adjusted R Square     0.58666293
   Standard Error        0.083098593
   Observations          29

   ANOVA
                df    SS             MS             F              Significance F
   Regression   5     0.308955666    0.061791133    8.948264624    7.86547E-05
   Residual     23    0.158823652    0.006905376
   Total        28    0.467779317

               Coefficients    Standard Error   t Stat          P-value
   Intercept   1.411080007     0.15976456       8.832246672     7.53745E-09
   WLF         -0.0646903      0.019521702      -3.313763265    0.003028155
   UE          -0.014449922    0.013867407      -1.042006025    0.308238603
   Divorce     -0.019019342    0.010589345      -1.79608287     0.085630564
   waryear     -0.116753859    0.056676884      -2.059990798    0.050889678
   returnyr    0.028674254     0.050850908      0.563888726     0.578287347

Multicollinearity
Does not violate assumptions of least squares (unless it is perfectly collinear)
Estimates have low ability to reject false null hypotheses (low power).
A post hoc problem.
Little that can be done - eliminating a variable could cause omitted variable bias.

Multicollinearity
May require testing groups of variables instead of individual slopes.
Use F-test for a group of variables that are measuring a similar idea rather than testing the idea by looking at individual t-tests

Example of Collinearity
Model: MPG as a function of Type of Drive, Fuel Type, Fuel Capacity, Length, Wheelbase, Width, Turning Circle, Weight, Luggage Capacity, Front Leg Room, Front Head Room
How is MPG influenced by car characteristics?
Regression Results
   Regression Statistics
   Multiple R            0.916170699
   R Square              0.839368749
   Adjusted R Square     0.816421427
   Standard Error        1.943378831
   Observations          89

   ANOVA
                df    SS             MS             F              Significance F
   Regression   11    1519.596956    138.1451778    36.57807062    4.3792E-26
   Residual     77    290.8075387    3.776721282
   Total        88    1810.404494

                      Coefficients    Standard Error   t Stat          P-value
   Intercept          47.84639077     16.00522579      2.989423041     0.003750637
   Type of Drive      -0.990467814    0.804947293      -1.230475366    0.222264941
   Fuel Type          -0.448181315    0.669007482      -0.669919736    0.504913062
   Fuel Capacity      -0.033480953    0.205431049      -0.162979028    0.870961928
   Length             0.035032065     0.0609802        0.574482609     0.567315883
   Wheelbase          0.051601109     0.103653825      0.497821566     0.620028615
   Width              -0.005035985    0.163180417      -0.030861452    0.975459886
   Turning Circle     -0.194077967    0.160875083      -1.206389226    0.23136098
   Weight             -0.009083907    0.001760789      -5.158997869    1.87855E-06
   Luggage Capacity   0.085785172     0.09438463       0.908889218     0.366244683
   Front Leg Room     0.028830301     0.326150787      0.088395621     0.929791742
   Front Head Room    -0.226076981    0.256358435      -0.881878454    0.380587216

Correlations of Independent Variables
                      Type of Drive   Fuel Type       Fuel Capacity   Length         Wheelbase
   Type of Drive      1
   Fuel Type          0.268681117     1
   Fuel Capacity      -0.428951231    -0.473951823    1
   Length             -0.341685414    -0.222986645    0.804874478     1
   Wheelbase          -0.320081091    -0.289068849    0.76811138      0.90800949     1
   Width              -0.441548037    -0.181405593    0.669241997     0.87697456     0.775996773
   Turning Circle     -0.143483852    -0.10837901     0.572058456     0.789886571    0.663126584
   Weight             -0.520865176    -0.399893445    0.894722586     0.906826507    0.856277889
   Luggage Capacity   0.04879301      0.149137156     0.397945037     0.510775619    0.488358101
   Front Leg Room     -0.282796832    -0.319408961    0.642973059     0.539896618    0.519103235
   Front Head Room    0.150372499     0.496620764     -0.263388897    0.06588231     0.114062218

                      Width          Turning Circle   Weight          Luggage Capacity   Front Leg Room   Front Head Room
   Width              1
   Turning Circle     0.771254595    1
   Weight             0.855494824    0.719981675      1
   Luggage Capacity   0.443142211    0.317445699      0.429944841     1
   Front Leg Room     0.508740454    0.329045265      0.633639181     0.280372614        1
   Front Head Room    0.102180504    0.090580546      -0.123170049    0.113971709        -0.132931814     1
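The group F-test suggested on the multicollinearity slides compares the full model with a reduced model that drops the whole group of related variables. A minimal sketch follows, assuming NumPy; the function names and the simulated data are hypothetical and are not the car data above. Here x1 and x2 are built to be nearly collinear, so their individual t-tests may be weak even though the pair matters jointly.

```python
import numpy as np

def sse(X, y):
    """Sum of squared residuals from a least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def group_f_statistic(y, X_full, X_reduced):
    """F = [(SSE_reduced - SSE_full) / q] / [SSE_full / (n - k)],
    where q is the number of dropped columns and k counts all columns
    of the full design matrix, including the intercept."""
    n, k_full = X_full.shape
    q = k_full - X_reduced.shape[1]
    sse_full = sse(X_full, y)
    sse_reduced = sse(X_reduced, y)
    return ((sse_reduced - sse_full) / q) / (sse_full / (n - k_full))

# Hypothetical simulated data with two nearly collinear regressors.
rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # almost a copy of x1
x3 = rng.normal(size=n)
y = 1 + x1 + x2 + rng.normal(size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2, x3])
X_reduced = np.column_stack([ones, x3])  # x1 and x2 dropped as a group

f_group = group_f_statistic(y, X_full, X_reduced)
```

The resulting statistic would be compared with an F critical value on q and n - k degrees of freedom (here 2 and 36).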