Class 16: Thursday, Nov. 4

• Note: I will e-mail you some info on the final project this weekend and will discuss it in class on Tuesday.

Predicting Emergency Calls to the AAA Club

Response: Calls

Summary of Fit
RSquare                       0.692384
RSquare Adj                   0.584719
Root Mean Square Error        1735.151
Mean of Response              4318.75
Observations (or Sum Wgts)    28

Parameter Estimates
Term                   Estimate     Std Error   t Ratio   Prob>|t|
Intercept              3628.7902    2153.788     1.68     0.1076
Average Temperature    -35.63182    51.52383    -0.69     0.4972
Range                  133.30434    50.85675     2.62     0.0164
Rain forecast          429.70588    1211.933     0.35     0.7266
Snow forecast          548.80038    1342.27      0.41     0.6870
Weekday                -1603.1      876.7378    -1.83     0.0824
Sunday                 -1847.152    1212.612    -1.52     0.1433
Subzero                3857.6004    1489.803     2.59     0.0175

R-Squared

• From the Summary of Fit: RSquare = 0.692384, RSquare Adj = 0.584719.
• R-squared: As in simple linear regression, it measures the proportion of variability in Y explained by the regression of Y on these X's. It lies between 0 and 1; values nearer to 1 indicate more variability explained.
• Don't get excited that R-squared has increased when you add more variables to the model. Adding another explanatory variable can never decrease R-squared, and in practice it almost always increases it. The right question is not whether R-squared has increased when we add an explanatory variable, but whether it has increased by a useful amount. The t-statistic and the associated p-value of the t-test for each coefficient answer this question.

Overall F-test

Analysis of Variance
Source      DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model        7   135532366        19361767      6.4309    0.0005
Error       20   60214949         3010747.4
C. Total    27   195747315

• Test of whether any of the predictors are useful: $H_0: \beta_1 = \cdots = \beta_p = 0$ vs. $H_a$: at least one of $\beta_1, \ldots, \beta_p$ does not equal zero. This tests whether the model provides better predictions than the sample mean of Y.
• p-value for the test: Prob > F in the Analysis of Variance table.
• p-value = 0.0005: strong evidence that at least one of the predictors is useful for predicting ERS for the New York AAA club.

Assumptions of Multiple Linear Regression Model

1. Linearity: $E(Y \mid X_1, \ldots, X_p) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
2. Constant variance: The standard deviation of Y for the subpopulation of units with $X_1 = x_1, \ldots, X_p = x_p$ is the same for all subpopulations.
3. Normality: Y is normally distributed within each subpopulation of units with $X_1 = x_1, \ldots, X_p = x_p$.
4. Independence: The observations are independent.

Assumptions for linear regression and their importance to inferences

Inference: assumptions that are important
• Point prediction, point estimation: linearity, independence.
• Confidence interval for slope, hypothesis test for slope, confidence interval for mean response: linearity, constant variance, independence, normality (only if n < 30).
• Prediction interval: linearity, constant variance, independence, normality.

Checking Linearity

• Plot residuals versus each of the explanatory variables. Each of these plots should look like random scatter, with no pattern in the mean of the residuals.

[Figure: Bivariate Fit of Residual Calls By Average Temperature; Bivariate Fit of Residual Calls By Range]

• If the residual plots show a problem, we could try transforming the x-variable and/or the y-variable.

Residual Plots in JMP

• After Fit Model, click the red triangle next to Response, click Save Columns, and click Residuals.
• Use Fit Y by X with Y = Residuals and X = the explanatory variable of interest. Fit Line will draw a horizontal line with intercept zero. It is a property of the residuals from multiple linear regression that a least squares regression of the residuals on an explanatory variable has slope zero and intercept zero.
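The zero-slope, zero-intercept property of the residuals stated above can be checked numerically. The following is a minimal sketch in plain Python, not JMP output: the data values are hypothetical (invented for illustration, not the AAA data), and the helper names `centered_cross`, `fit_two_predictors`, and `fit_line` are assumptions introduced here.

```python
# Hypothetical data, invented for illustration only (not the AAA data).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]


def mean(values):
    return sum(values) / len(values)


def centered_cross(a, b):
    """Sum of (a_i - mean(a)) * (b_i - mean(b))."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))


def fit_two_predictors(u, v, resp):
    """Least squares fit of resp on u and v via the centered normal equations."""
    s_uu, s_vv, s_uv = centered_cross(u, u), centered_cross(v, v), centered_cross(u, v)
    s_uy, s_vy = centered_cross(u, resp), centered_cross(v, resp)
    det = s_uu * s_vv - s_uv * s_uv
    b_u = (s_uy * s_vv - s_uv * s_vy) / det
    b_v = (s_uu * s_vy - s_uv * s_uy) / det
    b_0 = mean(resp) - b_u * mean(u) - b_v * mean(v)
    return b_0, b_u, b_v


def fit_line(x, resp):
    """Simple least squares fit of resp on x; returns (intercept, slope)."""
    slope = centered_cross(x, resp) / centered_cross(x, x)
    return mean(resp) - slope * mean(x), slope


# Fit the multiple regression and compute its residuals.
b0, b1, b2 = fit_two_predictors(x1, x2, y)
residuals = [yi - (b0 + b1 * a + b2 * b) for a, b, yi in zip(x1, x2, y)]

# Regress the residuals on each explanatory variable in turn.
for name, x in (("x1", x1), ("x2", x2)):
    intercept, slope = fit_line(x, residuals)
    print(f"residuals on {name}: intercept={intercept:.2e}, slope={slope:.2e}")
```

Both fitted lines should come out as zero up to floating-point rounding, which is why the residual plots carry information only through curvature or changing spread, never through a linear trend.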
Residual by Predicted Plot

[Figure: Residual by Predicted Plot; Calls Residual versus Calls Predicted]

• Fit Model displays the Residual by Predicted Plot automatically in its output.
• The plot shows the residuals versus the predicted Y's, $\hat{Y}_i = \hat{E}(Y_i \mid X_1 = x_{i1}, \ldots, X_p = x_{ip})$. We can think of the predicted Y's as summarizing all the information in the X's. As usual, we would like this plot to show random scatter.
• Pattern in the mean of the residuals as the predicted Y's increase: indicates a problem with linearity. Look at the residual plots versus each explanatory variable to isolate the problem and consider transformations.
• Pattern in the spread of the residuals: indicates a problem with constant variance.

Checking Normality

• As with simple linear regression, make a histogram of the residuals and a normal quantile plot of the residuals.

[Figure: Distributions of Residual Calls; histogram and normal quantile plot]

• Normality appears to be violated: several points are outside the confidence bands. The distribution of the residuals is skewed to the right.

Transformations to Remedy Nonconstant Variance and Nonnormality

Nonconstant variance
• When the variance of $Y \mid \hat{Y}$ increases with $\hat{Y}$, try transforming Y to $\log Y$ or Y to $\sqrt{Y}$.
• When the variance of $Y \mid \hat{Y}$ decreases with $\hat{Y}$, try transforming Y to $1/Y$ or Y to $Y^2$.

Nonnormality
• When the distribution of the residuals is skewed to the right, try transforming Y to $\log Y$.
• When the distribution of the residuals is skewed to the left, try transforming Y to $Y^2$.

Influential Points, High Leverage Points, Outliers

• As in simple linear regression, we identify high leverage and high influence points by checking the leverages and Cook's distances (use Save Columns to save Cook's D Influence and Hats).
• High influence points: Cook's distance > 1.
• High leverage points: hat value greater than 3 × (number of explanatory variables + 1) / n.
• Use the same guidelines for dealing with influential observations as in simple linear regression.
• Point with an unusual Y given its explanatory variables: a point whose residual is more than 3 RMSEs away from zero.

Scatterplot Matrix

• Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each explanatory variable. These can suggest transformations of the explanatory variables that may be needed, as well as potential outliers and influential points.
• Scatterplot matrix in JMP: Click Analyze, Multivariate Methods, and Multivariate; then put the response variable first in the Y, Columns box, followed by the explanatory variables in the same box.

[Figure: Scatterplot Matrix of Calls, Average Temperature, Range, Rain forecast, Snow forecast, Weekday, Sunday]

• In order to evaluate the benefits of a proposed irrigation scheme in Egypt, the relation of the yield Y of wheat to rainfall is investigated over several years (see rainfall.JMP).
• How can regression analysis help?

Year   Yield (Bu./Acre), Y   Total Spring Rainfall, R   Average Spring Temperature, T
1963   60                     8                         56
1964   50                    10                         47
1965   70                    11                         53
1966   70                    10                         53
1967   80                     9                         56
1968   50                     9                         47
1969   60                    12                         44
1970   40                    11                         44

Simple Linear Regression of Yield on Rainfall

[Figure: Bivariate Fit of Yield By Total Spring Rainfall, with Linear Fit]

Linear Fit
Yield = 76.666667 - 1.6666667 Total Spring Rainfall

• Rainfall reduces yield!? Is irrigation a bad idea?
• Interpretation of the coefficient of rainfall: the change in the mean yield that is associated with a one inch increase in rainfall. Other important variables (lurking variables) are not held fixed and might tend to change as rainfall increases.
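As a check on the Linear Fit above, the simple regression of yield on rainfall can be reproduced by hand from the eight rows of the data table. This is a minimal sketch in plain Python using the closed-form least squares formulas; the variable names are chosen here for illustration.

```python
# Data from the yield table above (1963-1970).
rainfall = [8, 10, 11, 10, 9, 9, 12, 11]      # total spring rainfall, R
yield_bu = [60, 50, 70, 70, 80, 50, 60, 40]   # yield in bushels per acre, Y

n = len(rainfall)
mean_r = sum(rainfall) / n
mean_y = sum(yield_bu) / n

# Closed-form least squares estimates for a single explanatory variable:
# slope = S_RY / S_RR, intercept = mean(Y) - slope * mean(R).
s_ry = sum((r - mean_r) * (y - mean_y) for r, y in zip(rainfall, yield_bu))
s_rr = sum((r - mean_r) ** 2 for r in rainfall)
slope = s_ry / s_rr
intercept = mean_y - slope * mean_r

print(f"Yield = {intercept:.6f} + ({slope:.6f}) * Total Spring Rainfall")
```

The computed intercept and slope should agree with the JMP line, including the negative sign on rainfall that motivates the lurking-variable discussion.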
[Figure: Bivariate Fit of Average Spring Temperature By Total Spring Rainfall]

• Temperature tends to decrease as rainfall increases.

Controlling for Known Lurking Variables: Multiple Regression

• To evaluate the benefits of the irrigation scheme, we want to know how changes in rainfall are associated with changes in yield when all other important variables (lurking variables), such as temperature, are held fixed.
• Multiple regression provides this.
• Coefficient on rainfall in the multiple regression of yield on rainfall and temperature = change in the mean yield that is associated with a one inch increase in rainfall when temperature is held fixed.

Multiple Regression Analysis

Response: Yield

Summary of Fit
RSquare                       0.790476
RSquare Adj                   0.706667
Root Mean Square Error        7.091242
Mean of Response              60
Observations (or Sum Wgts)    8

Parameter Estimates
Term                          Estimate     Std Error   t Ratio   Prob>|t|
Intercept                     -144.7619    55.8499     -2.59     0.0487
Total Spring Rainfall         5.7142857    2.680238     2.13     0.0862
Average Spring Temperature    2.952381     0.692034     4.27     0.0080

• Rainfall is estimated to be beneficial once temperature is held fixed.
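The sign change on rainfall can be reproduced directly from the irrigation data. The sketch below, in plain Python, solves the centered normal equations for two explanatory variables and then computes R-squared and the root mean square error; it is an illustration of the calculation, not JMP's own algorithm, and the helper `centered_cross` and all variable names are assumptions introduced here.

```python
import math

# Data from the irrigation example (1963-1970).
rainfall = [8, 10, 11, 10, 9, 9, 12, 11]          # total spring rainfall, R
temperature = [56, 47, 53, 53, 56, 47, 44, 44]    # average spring temperature, T
yield_bu = [60, 50, 70, 70, 80, 50, 60, 40]       # yield in bushels per acre, Y


def mean(values):
    return sum(values) / len(values)


def centered_cross(a, b):
    """Sum of (a_i - mean(a)) * (b_i - mean(b))."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))


# Centered sums of squares and cross-products.
s_rr = centered_cross(rainfall, rainfall)
s_tt = centered_cross(temperature, temperature)
s_rt = centered_cross(rainfall, temperature)
s_ry = centered_cross(rainfall, yield_bu)
s_ty = centered_cross(temperature, yield_bu)

# Solve the 2x2 normal equations for the two slopes, then the intercept.
det = s_rr * s_tt - s_rt ** 2
b_rain = (s_ry * s_tt - s_rt * s_ty) / det
b_temp = (s_rr * s_ty - s_rt * s_ry) / det
b0 = mean(yield_bu) - b_rain * mean(rainfall) - b_temp * mean(temperature)

# Fit summaries: R-squared and RMSE with n - 3 error degrees of freedom.
n = len(yield_bu)
fitted = [b0 + b_rain * r + b_temp * t for r, t in zip(rainfall, temperature)]
sse = sum((y - f) ** 2 for y, f in zip(yield_bu, fitted))
sst = centered_cross(yield_bu, yield_bu)
r_squared = 1 - sse / sst
rmse = math.sqrt(sse / (n - 3))

print(f"Intercept={b0:.4f}, Rainfall={b_rain:.7f}, Temperature={b_temp:.6f}")
print(f"RSquare={r_squared:.6f}, RMSE={rmse:.6f}")
```

The negative value of `s_rt` is the numerical counterpart of the observation that temperature tends to fall as rainfall rises, which is what pulled the simple regression slope on rainfall below zero.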