Class 23 notes - Darden Faculty
Class 23
• The most over-rated statistic
• The four assumptions
• The most important hypothesis test yet
• Using yes/no variables in regressions

Adjusted R-square (Pg 9-12 of Pfeifer note)

Hours (n = 15): 2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08

Descriptive statistics for Hours
  Mean                 7.900667
  Standard Error       1.003487
  Median               7.08
  Mode                 7.17
  Standard Deviation   3.886488
  Sample Variance      15.10479
  Kurtosis             -0.75506
  Skewness             0.524811
  Range                13.08
  Minimum              2
  Maximum              15.08
  Sum                  118.51
  Count                15

Our better method of forecasting hours would use a mean of 7.9 and a standard deviation of 3.89 (and the t-distribution with 14 dof).

The sample variance is
$s^2 = \frac{\sum (Y - \bar{Y})^2}{n-1} = 3.89^2 = 15.1$
This is the variation in Hours that regression will try to explain.

Adjusted R-square (Pg 9-12 of Pfeifer note)

SUMMARY OUTPUT (Hours regressed on MSF)
  MSF:   26, 34.2, 29, 34.3, 85.9, 143.2, 85.5, 140.6, 140.6, 40.4, 101, 239.7, 179.3, 126.5, 140.8
  Hours: 2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08

Regression Statistics
  Multiple R           0.7260033
  R Square             0.5270808
  Adjusted R Square    0.4907024
  Standard Error       2.7735959
  Observations         15

ANOVA
              df
  Regression  1
  Residual    13
  Total       14

              Coefficients
  Intercept   3.312316
  MSF         0.0444895

$\hat{Y} = 3.31 + 0.044 \times 157.3 = 10.51$

Our better method of forecasting hours for job A would use a mean of 10.51 and a standard deviation of 2.77 (and the t-distribution with 13 dof).

The squared standard error is
$(\text{standard error})^2 = \frac{\sum (Y - \hat{Y})^2}{n-2} = 2.77^2 = 7.69$
This is the variation in Hours that regression leaves unexplained.

Adjusted R-square (Pg 9-12 of Pfeifer note)
• Adjusted R-square is the percentage of variation explained.
• The initial variation is $s^2 = 15.1$.
• The variation left unexplained (after using MSF in a regression) is $(\text{standard error})^2 = 7.69$.
• Adjusted R-square $= \frac{s^2 - (\text{standard error})^2}{s^2}$
• Adjusted R-square = (15.1 - 7.69)/15.1 = 0.49
• The regression using MSF explained 49% of the variation in hours.
• The "adjusted" happened in the calculation of s and standard error.
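The arithmetic on the adjusted R-square slides can be reproduced with a short Python sketch. It uses the MSF and Hours values listed on the slide; the function name `summarize_fit` and the dictionary keys are illustrative choices, not anything from the Pfeifer note, and the least-squares fit is done by hand with the standard library rather than with the spreadsheet tool used in class.

```python
from statistics import mean, variance

# MSF and Hours for the 15 jobs, copied from the slide's SUMMARY OUTPUT data.
msf = [26, 34.2, 29, 34.3, 85.9, 143.2, 85.5, 140.6, 140.6,
       40.4, 101, 239.7, 179.3, 126.5, 140.8]
hours = [2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08, 7.17, 7.17,
         10, 12, 12.5, 13.67, 15.08]


def summarize_fit(x, y):
    """Simple regression of y on x, plus the adjusted R-square pieces."""
    n = len(y)
    x_bar, y_bar = mean(x), mean(y)

    # Least-squares slope b and intercept a.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar

    # s^2: variation before regression (divides by n - 1).
    s2 = variance(y)

    # (standard error)^2: variation left unexplained (divides by n - 2).
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se2 = sse / (n - 2)

    return {"a": a, "b": b, "s2": s2, "se2": se2,
            "adj_r2": (s2 - se2) / s2}


fit = summarize_fit(msf, hours)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

The printed values should land close to the slide's figures: an initial variation near 15.1, an unexplained variation near 7.69, and an adjusted R-square near 0.49.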
[Figure from the Pfeifer note, with panels labeled: Adj R-square = 1.0 (standard error = 0); Adj R-square = 0.5; Adj R-square = 0.0 (standard error = s)]

Why Pfeifer says R2 is over-rated
• There is no standard for how large it should be.
  - In some situations an adjusted R2 of 0.05 would be FANTASTIC. In others, an adjusted R2 of 0.96 would be DISAPPOINTING.
• It has no real use.
  - Unlike "standard error", which is needed to make probability forecasts.
• It is usually redundant.
  - When comparing models, lower standard errors mean higher adj R2.
  - The correlation coefficient (which shares the same sign as b) ≈ the square root of adj R2.

The Coal Pile Example
• The firm needed a way to estimate the weight of a coal pile (based on its dimensions).

SUMMARY OUTPUT (W regressed on D, h, and d)
  W:  56, 93, 161, 31, 70, 76, 375, 34, 45, 58
  D:  20, 25, 30, 15, 20, 20, 40, 15, 20, 20
  h:  10, 10, 12, 12, 14, 14, 16, 14, 8, 10
  d:  15, 20, 24, 10, 13, 13, 32, 8, 16, 15

Regression Statistics
  Multiple R           0.986792416
  R Square             0.973759272
  Adjusted R Square    0.960638908
  Standard Error       20.56622179
  Observations         10

ANOVA
              df
  Regression  3
  Residual    6
  Total       9

              Coefficients
  Intercept   -294.6954733
  D           -15.12016461
  h           23.02366255
  d           27.62139918

We just used MULTIPLE regression.
96% of the variation in W is explained by this regression.

The Coal Pile Example
• Engineer Bob calculated the Volume of each pile and used simple regression...

SUMMARY OUTPUT (W regressed on Vol)
  W:    56, 93, 161, 31, 70, 76, 375, 34, 45, 58
  Vol:  2421.64, 3992.44, 6898.94, 1492.26, 3038.44, 3038.44, 16353.04, 1499.06, 2044.13, 2421.64

Regression Statistics
  Multiple R           0.999673782
  R Square             0.99934767
  Adjusted R Square    0.999266129
  Standard Error       2.808218162
  Observations         10

ANOVA
              df
  Regression  1
  Residual    8
  Total       9

              Coefficients
  Intercept   0.668821408
  Vol         0.022970159

100% of the variation in W is explained by this regression.
Standard error went from 20.6 to 2.8!!!

The Four Assumptions (Sec 5 of Pfeifer note, Sec 12.4 of EMBS)
• Linearity
• Independence
  - The n observations were sampled independently from the same population.
• Homoskedasticity
  - All Y's given X share a common $\sigma$.
• Normality
  - The probability distribution of Y|X is normal.
  - Errors are normal. Y's don't have to be.
For linearity, the model is $\hat{Y} = a + b \times X$.
[Figures: scatter plot of Y against X; plot of Residual against Fitted; histogram (Frequency) of residuals]

The four assumptions (Sec 5 of Pfeifer note, Sec 12.4 of EMBS)

SUMMARY OUTPUT (Hours regressed on MSF)
  MSF:   26, 34.2, 29, 34.3, 85.9, 143.2, 85.5, 140.6, 140.6, 40.4, 101, 239.7, 179.3, 126.5, 140.8
  Hours: 2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08

Regression Statistics
  Multiple R           0.7260033
  R Square             0.5270808
  Adjusted R Square    0.4907024
  Standard Error       2.7735959
  Observations         15

ANOVA
              df
  Regression  1
  Residual    13
  Total       14

              Coefficients
  Intercept   3.312316
  MSF         0.0444895

$\hat{Y} = 3.31 + 0.044 \times 157.3 = 10.51$

Our better method of forecasting hours for job A would use a mean of 10.51 and a standard deviation of 2.77 (and the t-distribution with 13 dof).

This forecast relies on:
• Independence (all 15 points count equally)
• Linearity
• Homoskedasticity
• Normality

Hypotheses (P 13 of Pfeifer note, Sec 12.5 of EMBS)
• H0: P = 0.5 (LTT, wunderdog)
• H0: Independence (supermarket job and response, treatment and heart attack, light and myopia, tosser and outcome)
• H0: $\mu = 100$ (IQ)
• H0: $\mu_M = \mu_F$ (heights, weights, batting average)
• H0: $\mu_{compact} = \mu_{mid} = \mu_{large}$ (displacement)

H0: b=0 (P 13 of Pfeifer note, Sec 12.5 of EMBS)
• b=0 means X and Y are independent.
  - In this way it's like the chi-squared independence test... for numerical variables.
• b=0 means don't use X to forecast Y.
  - Don't put X in the regression equation.
• b=0 means just use $\bar{Y}$ to forecast Y.
• b=0 means the "true" adj R-square is zero.

Testing b=0 is EASY!!! (P 13 of Pfeifer note, Sec 12.5 of EMBS)
Testing a mean:
• H0: $\mu = 100$
• $t = \frac{\bar{X} - 100}{s / \sqrt{n}}$
• P-value from the t.dist with n-1 dof

Testing a regression coefficient:
• H0: b=0
• $t = \frac{b - 0}{\text{se of coef}}$
• P-value from the t.dist using n-2 dof

  MSF:   26, 34.2, 29, 34.3, 85.9, 143.2, 85.5, 140.6, 140.6, 40.4, 101, 239.7, 179.3, 126.5, 140.8
  Hours: 2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08

              Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
  Intercept   3.3123        1.4021          2.3624  0.0344   0.2832     6.3414
  MSF         0.0445        0.0117          3.8064  0.0022   0.0192     0.0697

• "Standard Error" is the standard error of the coefficient.
• "t Stat" is the t-stat to test b=0.
• "P-value" is the 2-tailed p-value.

Using Yes/No variable in Regression

n = 60
  Car   Class          Displacement   Fuel Type      Hwy MPG
        (categorical)  (numerical)    (categorical)  (numerical)
  1     Midsize        3.5            R              28
  2     Midsize        3              R              26
  3     Large          3              P              26
  4     Large          3.5            P              25
  ...   ...            ...            ...            ...
  58    Compact        6              P              20
  59    Midsize        2.5            R              30
  60    Midsize        2              R              32

Does MPG "depend" on fuel type?

Fuel type (yes/no) and mpg (numerical) (Sec 8 of Pfeifer note, Sec 13.7 of EMBS)
• Un-stack the data so there are two columns of MPG data.
• Data Analysis, t-Test two sample.

H0: $\mu_P = \mu_R$, or equivalently H0: $\mu_P - \mu_R = 0$

t-Test: Two-Sample Assuming Equal Variances
                                  P          R
  Mean                            24.33333   27.70833
  Variance                        12.4       9.519928
  Observations                    36         24
  Pooled Variance                 11.2579
  Hypothesized Mean Difference    0
  df                              58
  t Stat                          -3.81704
  P(T<=t) one-tail                0.000165
  t Critical one-tail             1.671553
  P(T<=t) two-tail                0.000331
  t Critical two-tail             2.001717
(The slide also shows 0.999835, which equals 1 - 0.000165.)

Using Yes/No variables in Regression (Sec 8 of Pfeifer note, Sec 13.7 of EMBS)
1. Convert the categorical variable into a 1/0 DUMMY variable.
   - Use an if statement to do this.
   - It won't matter which is assigned 1, which is assigned 0.
   - It doesn't even matter what 2 numbers you assign to the two categories (regression will adjust).
2. Regress MPG (numerical) on DUMMY (1/0 numerical).
3. Test H0: b=0 using the regression output.

Using Yes/No variables in Regression (Sec 8 of Pfeifer note, Sec 13.7 of EMBS)
  Fuel Type:  R, R, P, P, ..., P, P, P, R, R
  Dprem:      0, 0, 1, 1, ..., 1, 1, 1, 0, 0
  Hwy MPG:    28, 26, 26, 25, ...
Hwy MPG (continued, last five cars): 21, 25, 20, 30, 32

SUMMARY OUTPUT (Hwy MPG regressed on Dprem)

Regression Statistics
  Adj R Square      0.1870
  Standard Error    3.3553
  Observations      60

ANOVA
              df   SS        MS        F        Sig F
  Regression  1    164.025   164.025   14.570   3.306E-04
  Residual    58   652.958   11.258
  Total       59   816.983

              Coeff    Std Error   t Stat    P-value
  Intercept   27.708   0.6849      40.4564   3.321E-44
  Dprem       -3.375   0.8842      -3.8170   3.306E-04

Regression with one Dummy variable (Sec 8 of Pfeifer note, Sec 13.7 of EMBS)

$\widehat{\text{MPG}} = 27.7 - 3.4 \times \text{Dprem}$
$\hat{Y} = a + b \times D$

For Regular, when D=0: $\hat{Y} = a$, so $\widehat{\text{MPG}} = 27.7$
For Premium, when D=1: $\hat{Y} = a + b$, so $\widehat{\text{MPG}} = 27.7 - 3.4 = 24.3$

H0: $\mu_P = \mu_R$, or H0: $\mu_P - \mu_R = 0$, or H0: b = 0

What we learned today
• We learned about "adjusted R square".
  - The most over-rated statistic of all time.
• We learned the four assumptions required to use regression to make a probability forecast of Y|X.
  - And how to check each of them.
• We learned how to test H0: b=0.
  - And why this is such an important test.
• We learned how to use a yes/no variable in a regression.
  - Create a dummy variable.
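The link between the two-sample t-test and the dummy-variable regression can be checked with a small Python sketch. Since the full 60-car data set is not reproduced in these notes, the sketch works only from the group summaries printed in the t-test output (means, variances, and counts for P and R). It relies on a standard result for regression on a single 0/1 dummy: the intercept is the mean of the D=0 group, the slope is the difference in group means, and the residual mean square is the pooled variance. The function name `dummy_regression_from_groups` is hypothetical.

```python
import math

# Group summaries copied from the slide's two-sample t-test output.
n_r, mean_r, var_r = 24, 27.70833, 9.519928   # Regular, Dprem = 0
n_p, mean_p, var_p = 36, 24.33333, 12.4       # Premium, Dprem = 1


def dummy_regression_from_groups(n0, mean0, var0, n1, mean1, var1):
    """Regression of Y on a 0/1 dummy, rebuilt from group summaries."""
    a = mean0                     # intercept: mean of the D = 0 group
    b = mean1 - mean0             # slope: difference in group means
    df = n0 + n1 - 2              # residual degrees of freedom (n - 2)

    # Residual mean square equals the pooled variance of the t-test.
    pooled = ((n0 - 1) * var0 + (n1 - 1) * var1) / df
    std_error = math.sqrt(pooled)                  # regression standard error
    se_b = math.sqrt(pooled * (1 / n0 + 1 / n1))   # standard error of b
    t_stat = b / se_b                              # t-stat to test H0: b = 0

    return {"a": a, "b": b, "df": df, "std_error": std_error,
            "se_b": se_b, "t_stat": t_stat}


result = dummy_regression_from_groups(n_r, mean_r, var_r, n_p, mean_p, var_p)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The t-stat computed this way should match both the t-test output (-3.81704) and the Dprem row of the regression output (-3.8170), which is why testing H0: b=0 and testing H0: $\mu_P = \mu_R$ give the same answer.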