Multiple Regression I
4/9/12
• Transformations
• The model
• Individual coefficients
• $R^2$
• ANOVA for regression
• Residual standard error
Section 9.4, 9.5
Professor Kari Lock Morgan
Duke University

To Do
• Project 2 Proposal (due Wednesday, 4/11)
• Homework 9 (due Monday, 4/16)
• Project 2 Presentation (Thursday, 4/19)
• Project 2 Paper (Wednesday, 4/25)

Non-Constant Variability

Non-Normal Residuals

Transformations
• If the conditions are not satisfied, there are some common transformations you can apply to the response variable
• You can take any function of y and use it as the response, but the most common are
  • log(y) (natural logarithm, ln)
  • $\sqrt{y}$ (square root)
  • $y^2$ (squared)
  • $e^y$ (exponential)

log(y)
[Figure panels: Original Response, y; Logged Response, log(y)]

$\sqrt{y}$
[Figure panels: Original Response, y; Square Root of Response, $\sqrt{y}$]

$y^2$
[Figure panels: Original Response, y; Squared Response, $y^2$]

$e^y$
[Figure panels: Original Response, y; Exponentiated Response, $e^y$]

Multiple Regression
• Multiple regression extends simple linear regression to include multiple explanatory variables:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon_i$

Grade on Final
• We’ll use your current grades to predict final exam scores, based on a model from last semester’s students
• Response: final exam score
• Explanatory: hw average, clicker average, exam 1, exam 2
$y = \beta_0 + \beta_1\,\text{hw} + \beta_2\,\text{clicker} + \beta_3\,\text{exam1} + \beta_4\,\text{exam2}$

Grade on Final
What variable is the most significant predictor of final exam score?
a) Homework average
b) Clicker average
c) Exam 1
d) Exam 2

Inference for Coefficients
The p-value for explanatory variable $x_i$ is associated with the hypotheses
$H_0: \beta_i = 0$
$H_a: \beta_i \neq 0$
For intervals and p-values of coefficients in multiple regression, use a t-distribution with $n - k - 1$ degrees of freedom, where k is the number of explanatory variables included in the model

Grade on Final
Estimate your score on the final exam. What type of interval do you want for this estimate?
a) Confidence interval
b) Prediction interval

Grade on Final
Estimate your score on the final exam.
(hw average is out of 10, clicker average is out of 2)

Grade on Final
Is the clicker coefficient really negative?!? Give a 95% confidence interval for the clicker coefficient (okay to use t* = 2).

Grade on Final
Is your score on exam 2 really not a significant predictor of your final exam score?!?

Coefficients
• The coefficient (and significance) for each explanatory variable depends on the other variables in the model!
• In predicting final exam scores, if you know someone’s score on Exam 1, knowing their score on Exam 2 doesn’t provide much additional information (these two explanatory variables are highly correlated)

Grade on Final
If you take Exam 1 out of the model… Now Exam 2 is significant!
Model with Exam 1:

Grade on Final
If you include Project 1 in the model…
Model without Project 1:

Grades

Multiple Regression
• The coefficient for each explanatory variable is the predicted change in y for a one-unit change in x, given the other explanatory variables in the model!
• The p-value for each coefficient indicates whether it is a significant predictor of y, given the other explanatory variables in the model!
• If explanatory variables are associated with each other, coefficients and p-values will change depending on what else is included in the model

Residuals
Are the conditions satisfied?
(a) Yes
(b) No

Evaluating a Model
• How do we evaluate the success of a model?
• How do we determine the overall significance of a model?
• How do we choose between two competing models?
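Looking back at the Coefficients slides, the claim that a coefficient depends on what else is in the model can be seen in a small sketch. The scores below are made up for illustration (they are not the course data), and `solve` and `fit` are hypothetical helpers written for this example. The fit solves the ordinary least-squares normal equations in plain Python.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(columns, y):
    """Return least-squares coefficients [intercept, b1, ..., bk]."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    return solve(xtx, xty)


# Hypothetical scores: exam1 and exam2 are highly correlated by construction.
exam1 = [70, 75, 80, 85, 90, 95, 65, 88]
exam2 = [72, 74, 83, 84, 92, 93, 68, 85]
final = [71, 77, 79, 86, 91, 96, 66, 89]

only_exam2 = fit([exam2], final)      # final ~ exam2
both = fit([exam1, exam2], final)     # final ~ exam1 + exam2

print("exam2 slope, exam2 alone:      ", round(only_exam2[1], 3))
print("exam2 slope, exam1 also in fit:", round(both[2], 3))
```

With these made-up scores, the Exam 2 slope is positive when Exam 2 is the only predictor and turns negative once Exam 1 is in the model, because the two exams carry nearly the same information.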
Variability
• One way to evaluate a model is to partition variability
Total Variability = Variability Explained by the Model + Error Variability
• A good model “explains” a lot of the variability in Y

Exam Scores
• Without knowing the explanatory variables, we can say that a person’s final exam score will probably be between 60 and 98 (the range of Y)
• Knowing hw average, clicker average, exam 1 and 2 grades, and project 1 grades, we can give a narrower prediction interval for final exam score
• We say that some of the variability in y is explained by the explanatory variables
• How do we quantify this?

Variability
How do we quantify variability in Y?
a) Standard deviation of Y
b) Sum of squared deviations from the mean of Y
c) (a) or (b)
d) None of the above

Sums of Squares
Total Variability = Variability Explained by the Model + Error Variability
$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
SST = SSM + SSE

Variability
Total Sum of Squares: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
Model Sum of Squares: $SSM = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
Error Sum of Squares: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• If SSM is much higher than SSE, then the model explains a lot of the variability in Y

$R^2$
$R^2 = \frac{SSM}{SST} = \frac{\text{Variability in Y explained by the model}}{\text{Total variability in Y}}$
• $R^2$ is the proportion of the variability in Y that is explained by the model

$R^2$
• For simple linear regression, $R^2$ is just the squared correlation between X and Y
• For multiple regression, $R^2$ is the squared correlation between the actual values and the predicted values

$R^2$
[Figure: two panels, one with $R^2 = 0.67$ and one with $R^2 = 0.09$]

Final Exam Grade

Is the model significant?
• If we want to test whether the model is significant (whether the model helps to predict y), we can test the hypotheses:
$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
$H_a$: At least one $\beta_i \neq 0$
• We do this with ANOVA!
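The sums-of-squares partition and the two descriptions of $R^2$ can be checked numerically. The following is a minimal sketch for simple linear regression, using a small made-up data set (the x and y values are assumptions for illustration, not data from the course).

```python
import math

# Hypothetical data, chosen only to illustrate the calculation.
x = [1, 2, 3, 4, 5, 6]
y = [3, 2, 6, 5, 9, 7]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for simple linear regression.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)                  # total
ssm = sum((yh - y_bar) ** 2 for yh in y_hat)              # explained by the model
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # error

r_squared = ssm / sst
r = sxy / math.sqrt(sxx * sst)  # sample correlation between x and y

print("SST:", sst, " SSM + SSE:", ssm + sse)
print("R^2:", r_squared, " r^2:", r ** 2)
```

Up to floating-point rounding, SST equals SSM + SSE, and SSM/SST equals the squared correlation between x and y.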
ANOVA for Regression
Source   df       Sum of Squares   Mean Square           F         p-value
Model    k        SSM              MSM = SSM/k           MSM/MSE   Use $F_{k,\,n-k-1}$
Error    n-k-1    SSE              MSE = SSE/(n-k-1)
Total    n-1      SST
k: number of explanatory variables
n: sample size

Final Exam Grade
For this model, do the explanatory variables significantly help to predict final exam score? (Calculate a p-value.)
(a) Yes
(b) No
n = 69
SSM = 3125.8
SSE = 1901.4

ANOVA for Regression
Source   df   Sum of Squares   Mean Square   F       p-value
Model    5    3125.8           625.16        20.71   0
Error    63   1901.4           30.18
Total    68   5027.2

Final Exam Grade

Simple Linear Regression
• For simple linear regression, the following tests will all give equivalent p-values:
  • t-test for non-zero correlation
  • t-test for non-zero slope
  • ANOVA for regression

Mean Square Error (MSE)
• Mean square error (MSE) measures the average variability in the errors (residuals)
• The square root of MSE gives the standard deviation of the residuals (giving a typical distance of points from the line)
• This number is also given in the R output as the residual standard error, and is known as $s_\epsilon$ in the textbook

Final Exam Grade

Simple Linear Model
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
$\epsilon_i \sim N(0, \sigma_\epsilon)$
Residual standard error $= \sqrt{MSE} = s_\epsilon$ estimates the standard deviation of the residuals (the spread of the normal distributions around the predicted values)

Residual Standard Error
• Use the fact that the residual standard error is 5.494 and your predicted final exam score to compute an approximate 95% prediction interval for your final exam score
$\hat{y} \pm 2 \times 5.494$
• NOTE: This calculation only takes into account errors around the line, not uncertainty in the line itself, so your true prediction interval will be slightly wider

To Come…
• How do we decide which explanatory variables to include in the model?
• How do we use categorical explanatory variables?
• What if the coefficient of one explanatory variable depends on the value of another explanatory variable?
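The ANOVA table for the final exam model, and the residual standard error used in the prediction interval, can be reproduced from the numbers given on the slides (n = 69, k = 5, SSM = 3125.8, SSE = 1901.4). The sketch below does that arithmetic in plain Python. The predicted score `y_hat` is a hypothetical placeholder, not a value from the slides, and the p-value is left out because it requires an F distribution with 5 and 63 degrees of freedom, which the standard library does not provide.

```python
import math

# Values taken from the slides.
n = 69          # sample size
k = 5           # number of explanatory variables
ssm = 3125.8    # model sum of squares
sse = 1901.4    # error sum of squares

sst = ssm + sse                 # total sum of squares
df_model = k
df_error = n - k - 1
df_total = n - 1

msm = ssm / df_model            # mean square for the model
mse = sse / df_error            # mean square error
f_stat = msm / mse              # compare to an F distribution with (k, n-k-1) df
r_squared = ssm / sst           # proportion of variability explained
s = math.sqrt(mse)              # residual standard error

print(f"df: model={df_model}, error={df_error}, total={df_total}")
print(f"MSM={msm:.2f}  MSE={mse:.2f}  F={f_stat:.2f}")
print(f"R^2={r_squared:.3f}  residual standard error={s:.3f}")

# Approximate 95% prediction interval: y_hat +/- 2 * s.
y_hat = 85.0    # hypothetical predicted final exam score (assumption)
lower, upper = y_hat - 2 * s, y_hat + 2 * s
print(f"approximate 95% prediction interval: ({lower:.1f}, {upper:.1f})")
```

The mean squares and F statistic agree with the table (625.16, 30.18, 20.71), and the square root of MSE matches the residual standard error of 5.494. As the slide notes, the interval ignores uncertainty in the fitted line, so it is slightly too narrow.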