Class 23 notes – Darden Faculty

Class 23
• The most over-rated statistic
• The four assumptions
• The most important hypothesis test yet
• Using yes/no variables in regressions
• Adjusted R-square
Hours: 2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08

Pg 9-12 Pfeifer note
Hours (descriptive statistics)

Mean                 7.900667
Standard Error       1.003487
Median               7.08
Mode                 7.17
Standard Deviation   3.886488
Sample Variance      15.10479
Kurtosis             -0.75506
Skewness             0.524811
Range                13.08
Minimum              2
Maximum              15.08
Sum                  118.51
Count                15
Our better method of forecasting hours would use a mean of 7.9 and a standard deviation of 3.89 (and the t-distribution with 14 dof).
The sample variance is

$$s^2 = \frac{\sum (Y - \bar{Y})^2}{n - 1} = 3.89^2 = 15.1$$

This is the variation in Hours that regression will try to explain.
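To make the slide's numbers concrete, here is a minimal Python sketch of this "no-regression" forecast method, with the Hours data typed in from the slide. The 95% interval at the end is an illustrative use of the t-distribution with 14 dof, not a number computed on the slide.

```python
import numpy as np
from scipy import stats

# Hours data from the slide (n = 15)
hours = np.array([2, 4.17, 4.42, 4.75, 4.83, 6.67, 7,
                  7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08])

n = len(hours)
mean = hours.mean()
s = hours.std(ddof=1)            # sample std dev (n - 1 divisor)

print(mean, s, s**2)             # ~7.90, ~3.89, ~15.1

# Probability forecast of hours: t-distribution with n - 1 = 14 dof,
# e.g. a central 95% interval
lo, hi = stats.t.interval(0.95, n - 1, loc=mean, scale=s)
print(lo, hi)
```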
Adjusted R-square
Pg 9-12 Pfeifer note
SUMMARY OUTPUT

MSF     Hours
26      2
34.2    4.17
29      4.42
34.3    4.75
85.9    4.83
143.2   6.67
85.5    7
140.6   7.08
140.6   7.17
40.4    7.17
101     10
239.7   12
179.3   12.5
126.5   13.67
140.8   15.08

Regression Statistics
Multiple R          0.7260033
R Square            0.5270808
Adjusted R Square   0.4907024
Standard Error      2.7735959
Observations        15
ANOVA
             df
Regression    1
Residual     13
Total        14

            Coefficients
Intercept   3.312316
MSF         0.0444895

Ŷ = 3.31 + 0.044 × 157.3 = 10.51

Our better method of forecasting hours for job A would use a mean of 10.51 and a standard deviation of 2.77 (and the t-distribution with 13 dof).

The squared standard error is

$$s_e^2 = \frac{\sum (Y - \hat{Y})^2}{n - 2} = 2.77^2 = 7.69$$

This is the variation in Hours that regression leaves unexplained.
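Here is a minimal Python sketch of the same fit, assuming (as the slide does) that job A has MSF = 157.3; scipy.stats.linregress stands in for Excel's regression tool:

```python
import numpy as np
from scipy import stats

msf = np.array([26, 34.2, 29, 34.3, 85.9, 143.2, 85.5,
                140.6, 140.6, 40.4, 101, 239.7, 179.3, 126.5, 140.8])
hours = np.array([2, 4.17, 4.42, 4.75, 4.83, 6.67, 7,
                  7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08])

fit = stats.linregress(msf, hours)
print(fit.intercept, fit.slope)            # ~3.31, ~0.0445

# Point forecast for job A (MSF = 157.3)
print(fit.intercept + fit.slope * 157.3)   # ~10.51

# Residual standard error: sqrt(SSE / (n - 2))
resid = hours - (fit.intercept + fit.slope * msf)
se = np.sqrt((resid ** 2).sum() / (len(hours) - 2))
print(se, se ** 2)                         # ~2.77, ~7.69
```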
Adjusted R-square (Pg 9-12 Pfeifer note)
• Adjusted R-square is the percentage of variation explained.
• The initial variation is s² = 15.1.
• The variation left unexplained (after using MSF in a regression) is (standard error)² = 7.69.
• Adjusted R-square = [s² − (standard error)²] / s²
• Adjusted R-square = (15.1 − 7.69)/15.1 = 0.49
• The regression using MSF explained 49% of the variation in hours.
• The "adjusting" already happened in the calculations of s and the standard error (each divides by its degrees of freedom).
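A two-line check of the slide's arithmetic, using the numbers above:

```python
s2 = 15.1     # total variation in Hours (sample variance)
se2 = 7.69    # variation left unexplained (squared standard error)

print((s2 - se2) / s2)   # ~0.49, matching Excel's "Adjusted R Square"
```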
From the Pfeifer note (figure): adjusted R-square runs on a scale from 1.0 (standard error = 0) through 0.5 down to 0.0 (standard error = s).
Why Pfeifer says R² is over-rated
• There is no standard for how large it should be.
  – In some situations an adjusted R² of 0.05 would be FANTASTIC. In others, an adjusted R² of 0.96 would be DISAPPOINTING.
• It has no real use.
  – Unlike the standard error, which is needed to make probability forecasts.
• It is usually redundant.
  – When comparing models, lower standard errors mean higher adjusted R².
  – The correlation coefficient (which shares the same sign as b) ≈ the square root of adjusted R².
The Coal Pile Example
• The firm needed a way to estimate the weight of a coal pile (based on its dimensions).

96% of the variation in W is explained by this regression. We just used MULTIPLE regression.

SUMMARY OUTPUT

W     D    h    d
56    20   10   15
93    25   10   20
161   30   12   24
31    15   12   10
70    20   14   13
76    20   14   13
375   40   16   32
34    15   14   8
45    20   8    16
58    20   10   15

Regression Statistics
Multiple R          0.986792416
R Square            0.973759272
Adjusted R Square   0.960638908
Standard Error      20.56622179
Observations        10

ANOVA
             df
Regression    3
Residual      6
Total         9

            Coefficients
Intercept   -294.6954733
D           -15.12016461
h           23.02366255
d           27.62139918
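A minimal Python sketch reproducing these multiple-regression coefficients; np.linalg.lstsq plays the role of Excel's regression tool:

```python
import numpy as np

# Coal pile data from the slide: weight W and dimensions D, h, d
W = np.array([56, 93, 161, 31, 70, 76, 375, 34, 45, 58], dtype=float)
D = np.array([20, 25, 30, 15, 20, 20, 40, 15, 20, 20], dtype=float)
h = np.array([10, 10, 12, 12, 14, 14, 16, 14, 8, 10], dtype=float)
d = np.array([15, 20, 24, 10, 13, 13, 32, 8, 16, 15], dtype=float)

# Design matrix: intercept column plus the three predictors
X = np.column_stack([np.ones_like(W), D, h, d])
coef, *_ = np.linalg.lstsq(X, W, rcond=None)
print(coef)   # ~[-294.70, -15.12, 23.02, 27.62]
```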
The Coal Pile Example
• Engineer Bob calculated the Volume of each pile and used simple regression…

Essentially 100% of the variation in W is explained by this regression. The standard error went from 20.6 to 2.8!!!

SUMMARY OUTPUT

W     Vol
56    2421.64
93    3992.44
161   6898.94
31    1492.26
70    3038.44
76    3038.44
375   16353.04
34    1499.06
45    2044.13
58    2421.64

Regression Statistics
Multiple R          0.999673782
R Square            0.99934767
Adjusted R Square   0.999266129
Standard Error      2.808218162
Observations        10

ANOVA
             df
Regression    1
Residual      8
Total         9

            Coefficients
Intercept   0.668821408
Vol         0.022970159
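The transcript doesn't say how Bob computed Volume, but the slide's Vol column matches the volume of a conical frustum (a cone with the top cut off) computed from D, d, and h, so here is a hedged reconstruction:

```python
import numpy as np

# Conical-frustum volume: V = pi * h * (D^2 + D*d + d^2) / 12,
# with bottom diameter D, top diameter d, and height h
D = np.array([20, 25, 30, 15, 20, 20, 40, 15, 20, 20], dtype=float)
h = np.array([10, 10, 12, 12, 14, 14, 16, 14, 8, 10], dtype=float)
d = np.array([15, 20, 24, 10, 13, 13, 32, 8, 16, 15], dtype=float)

vol = np.pi * h * (D**2 + D * d + d**2) / 12
print(np.round(vol, 2))   # 2421.64, 3992.44, 6898.94, ... as on the slide
```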
Sec 5 of Pfeifer note
Sec 12.4 of EMBS
The Four Assumptions
• Linearity
  – μ(Y│X) = a + b × X
• Independence
  – The n observations were sampled independently from the same population.
• Homoskedasticity
  – All Y's given X share a common σ.
• Normality
  – The probability distribution of Y│X is normal.
  – Errors are normal. Y's don't have to be.

[The slide illustrates these with three plots: a Y-vs-X scatter with the fitted line, a residual-vs-X plot, and a frequency histogram of the residuals.]
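A short sketch of how one might check these assumptions for the MSF/Hours regression, using matplotlib for plots in the spirit of the slide's (the plotting choices here are assumptions, not the slide's exact charts):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

msf = np.array([26, 34.2, 29, 34.3, 85.9, 143.2, 85.5,
                140.6, 140.6, 40.4, 101, 239.7, 179.3, 126.5, 140.8])
hours = np.array([2, 4.17, 4.42, 4.75, 4.83, 6.67, 7,
                  7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08])

fit = stats.linregress(msf, hours)
resid = hours - (fit.intercept + fit.slope * msf)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))

# Linearity / homoskedasticity: residuals vs X should show no curve
# and a roughly constant spread
ax1.scatter(msf, resid)
ax1.axhline(0, color="gray")
ax1.set(xlabel="MSF", ylabel="Residual")

# Normality: the residual histogram should look roughly bell-shaped
ax2.hist(resid, bins=7)
ax2.set(xlabel="Residual", ylabel="Frequency")

plt.tight_layout()
plt.show()
```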
Sec 5 of Pfeifer note
Sec 12.4 of EMBS
The four assumptions
The regression of Hours on MSF (same SUMMARY OUTPUT as above) shows where each assumption enters:

Ŷ = 3.31 + 0.044 × 157.3 = 10.51

Our better method of forecasting hours for job A would use a mean of 10.51 and a standard deviation of 2.77 (and the t-distribution with 13 dof). This forecast relies on:
• Independence (all 15 points count equally)
• Linearity
• Homoskedasticity
• Normality
Hypotheses
P 13 of Pfeifer note
Sec 12.5 of EMBS
• H0: P = 0.5 (LTT, wunderdog)
• H0: independence (supermarket job and response, treatment and heart attack, light and myopia, tosser and outcome)
• H0: μ = 100 (IQ)
• H0: μM = μF (heights, weights, batting averages)
• H0: μcompact = μmid = μlarge (displacement)
H0: b = 0
P 13 of Pfeifer note
Sec 12.5 of EMBS
• b = 0 means X and Y are independent.
  – In this way it's like the chi-squared independence test… for numerical variables.
• b = 0 means don't use X to forecast Y.
  – Don't put X in the regression equation.
• b = 0 means just use Ȳ to forecast Y.
• b = 0 means the "true" adjusted R-square is zero.
P 13 of Pfeifer note
Sec 12.5 of EMBS
Testing b=0 is EASY!!!

Recall the test of H0: μ = 100 (IQ):

$$t = \frac{\bar{Y} - 100}{s/\sqrt{n}}$$

with the p-value from the t-distribution with n − 1 dof.

Testing H0: b = 0 works the same way:

$$t = \frac{b - 0}{\text{se of coef}}$$

with the p-value from the t-distribution with n − 2 dof. Excel's regression output gives the standard error of each coefficient, the t-stat to test b = 0, and the 2-tailed p-value:

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   3.3123         1.4021           2.3624   0.0344    0.2832      6.3414
MSF         0.0445         0.0117           3.8064   0.0022    0.0192      0.0697
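A quick check of the MSF row in Python, with n = 15 as on the slide (the small difference in the t-stat comes from rounding the standard error):

```python
from scipy import stats

b = 0.0444895    # slope on MSF
se_b = 0.0117    # standard error of the coefficient (Excel output)
n = 15

t_stat = (b - 0) / se_b
p_two_tail = 2 * stats.t.sf(abs(t_stat), n - 2)   # two-tailed p-value
print(t_stat, p_two_tail)                         # ~3.80, ~0.0022
```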
Using Yes/No variables in Regression

n = 60 cars. Class and Fuel Type are categorical; Displacement and Hwy MPG are numerical:

Car   Class     Displacement   Fuel Type   Hwy MPG
1     Midsize   3.5            R           28
2     Midsize   3              R           26
3     Large     3              P           26
4     Large     3.5            P           25
.     .         .              .           .
58    Compact   6              P           20
59    Midsize   2.5            R           30
60    Midsize   2              R           32

Does MPG "depend" on fuel type?

Sec 8 of Pfeifer note
Sec 13.7 of EMBS
Fuel type (yes/no) and mpg (numerical)
• Un-stack the data so there are two columns of MPG data.
• Data Analysis, t-Test: Two-Sample.

H0: μP = μR, or H0: μP − μR = 0

t-Test: Two-Sample Assuming Equal Variances

                               P          R
Mean                           24.33333   27.70833
Variance                       12.4       9.519928
Observations                   36         24
Pooled Variance                11.2579
Hypothesized Mean Difference   0
df                             58
t Stat                         -3.81704
P(T<=t) one-tail               0.000165
t Critical one-tail            1.671553
P(T<=t) two-tail               0.000331
t Critical two-tail            2.001717

The same test can be reproduced from these summary statistics alone; see the sketch below.
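A sketch reproducing the equal-variances t-test from the summary statistics on the slide (the raw 60-car data set isn't in the transcript):

```python
import numpy as np
from scipy import stats

# Summary statistics from the Excel t-test output
mean_p, var_p, n_p = 24.33333, 12.4, 36
mean_r, var_r, n_r = 27.70833, 9.519928, 24

df = n_p + n_r - 2
pooled = ((n_p - 1) * var_p + (n_r - 1) * var_r) / df
t_stat = (mean_p - mean_r) / np.sqrt(pooled * (1 / n_p + 1 / n_r))
p_two_tail = 2 * stats.t.sf(abs(t_stat), df)

print(pooled, t_stat, p_two_tail)   # ~11.26, ~-3.82, ~0.00033
```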
Sec 8 of Pfeifer note
Sec 13.7 of EMBS
Using Yes/No variables in Regression
1. Convert the categorical variable into a 1/0 DUMMY variable.
   – Use an IF statement to do this.
   – It won't matter which category is assigned 1 and which 0.
   – It doesn't even matter which two numbers you assign to the two categories (regression will adjust).
2. Regress MPG (numerical) on DUMMY (1/0 numerical).
3. Test H0: b = 0 using the regression output. (A sketch of all three steps follows below.)
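A minimal Python sketch of the three steps on a tiny hypothetical sample (the full 60-car data set isn't reproduced in the transcript):

```python
import numpy as np
from scipy import stats

# Hypothetical mini-sample: fuel type and highway MPG
fuel = np.array(["R", "R", "P", "P", "P", "R"])
mpg = np.array([28.0, 26.0, 26.0, 25.0, 20.0, 30.0])

# Step 1: the IF statement -- Dprem = 1 for premium, 0 for regular
dprem = (fuel == "P").astype(float)

# Step 2: regress MPG on the dummy
fit = stats.linregress(dprem, mpg)

# Step 3: the p-value on the slope tests H0: b = 0
print(fit.intercept, fit.slope, fit.pvalue)

# The intercept is the mean of the D = 0 group; the slope is
# mean(P) - mean(R)
print(mpg[dprem == 0].mean(),
      mpg[dprem == 1].mean() - mpg[dprem == 0].mean())
```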
Sec 8 of Pfeifer note
Sec 13.7 of EMBS
Using Yes/No variables in Regression

Fuel Type   Dprem   Hwy MPG
R           0       28
R           0       26
P           1       26
P           1       25
.           .       .
P           1       21
P           1       25
P           1       20
R           0       30
R           0       32

SUMMARY OUTPUT

Regression Statistics
Adj R Square     0.1870
Standard Error   3.3553
Observations     60

ANOVA
             df   SS        MS        F        Sig F
Regression    1   164.025   164.025   14.570   3.306E-04
Residual     58   652.958   11.258
Total        59   816.983

            Coeff    Std Error   t Stat    P-value
Intercept   27.708   0.6849      40.4564   3.321E-44
Dprem       -3.375   0.8842      -3.8170   3.306E-04
Sec 8 of Pfeifer note
Sec 13.7 of EMBS
Regression with one Dummy variable

MPG-hat = 27.7 − 3.4 × Dprem   (that is, Ŷ = a + b × D)

For Regular, when D = 0: Ŷ = a, so MPG-hat = 27.7.
For Premium, when D = 1: Ŷ = a + b, so MPG-hat = 27.7 − 3.4 = 24.3.

Testing H0: μP = μR (or H0: μP − μR = 0) is the same as testing H0: b = 0.
What we learned today
• We learned about "adjusted R-square"
  – The most over-rated statistic of all time.
• We learned the four assumptions required to use regression to make a probability forecast of Y│X.
  – And how to check each of them.
• We learned how to test H0: b = 0.
  – And why this is such an important test.
• We learned how to use a yes/no variable in a regression.
  – Create a dummy variable.