April 19: Choosing model variables

1. The standard linear statistical model (lm in R)
   (a) y_i = β_0 + β_1 x_{1i} + ··· + β_k x_{ki} + e_i
   (b) The errors e_i have mean 0 and variance σ², and are independent.
   (c) The random variables e_i have normal distributions.
2. Problem: How many variables (of the k) should we keep? There is a tradeoff: more variables “explain” more variation but make for a more complicated model.
3. Some solutions suggested yesterday:
   (a) Eliminate any variable with a “large” p-value (in other words, the evidence in favor of β ≠ 0 is not strong).
   (b) Find the collection of variables with the largest adjusted R².
4. One approach.
   (a) Stepwise regression: add or subtract one variable at a time based on a criterion.
   (b) We want to minimize some (increasing) function of SSResid and k. (As k or SSResid increases, the model becomes less desirable.)
   (c) Candidates for the function to minimize (there are theoretical reasons for these):
       AIC = n ln(SSResid / n) + 2(k + 1)
       BIC = n ln(SSResid / n) + (k + 1) ln n
   (d) step in R does both AIC (default) and BIC stepwise regression.
   (e) Stepwise regression may not find the best model since it does not look at every subset of variables.

Homework

Read Devore and Farnum Section 11.6, pages 538–545.

1. The data page at the course website has a dataset that includes statistics on five years of major-league baseball. We want to take certain statistics computed on a per-game basis and use these to predict the number of runs per game (RG). The version of the dataset that is useful in this regard is the “per game” version. The variables are RG (runs), X1BG (singles), X2BG (doubles), X3BG (triples), HRG (home runs), SOG (strikeouts), SBG (stolen bases), CSG (caught stealing).
   (a) Use all of the other variables in a linear function to predict RG.
   (b) Based on the t-values in the preceding analysis, which variables are obvious candidates for removal from the model? Refit the model without these variables. Compute adjusted R² for the two models.
   (c) Employ stepwise regression in R (step). Does step give the same result as the analysis in part (b)?
   (d) Using the model from (c), what variable should be the next to remove? Refit the model without that variable and compare the adjusted R² of this model to those of the models in (b) and (c).
   (e) If you know something about baseball, do the coefficients in the model of (c) “make sense”?
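
As a rough illustration of the AIC and BIC criteria from item 4 of the notes, the following Python sketch fits least-squares models from scratch and drops one variable at a time for as long as the chosen criterion decreases. It is not a reimplementation of R's step, which is what the notes actually use. The function names (fit_ss_resid, aic, bic, backward_step), the variable names x1, x2, x3, and the simulated data are all assumptions made up for this example; they have nothing to do with the baseball dataset.

```python
import math
import random


def fit_ss_resid(X, y):
    """Least-squares fit with an intercept; returns SSResid."""
    n = len(y)
    # Design matrix with a leading column of ones for the intercept.
    A = [[1.0] + list(row) for row in X]
    p = len(A[0])
    # Augmented normal equations: (A'A | A'y).
    M = [[sum(A[r][i] * A[r][j] for r in range(n)) for j in range(p)]
         + [sum(A[r][i] * y[r] for r in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    beta = [M[i][p] / M[i][i] for i in range(p)]
    return sum((y[r] - sum(b * a for b, a in zip(beta, A[r]))) ** 2
               for r in range(n))


def aic(ss_resid, n, k):
    """AIC = n ln(SSResid / n) + 2(k + 1), with k predictors."""
    return n * math.log(ss_resid / n) + 2 * (k + 1)


def bic(ss_resid, n, k):
    """BIC = n ln(SSResid / n) + (k + 1) ln n, with k predictors."""
    return n * math.log(ss_resid / n) + (k + 1) * math.log(n)


def backward_step(X, y, names, criterion=aic):
    """Remove one variable at a time while the criterion keeps decreasing."""
    n = len(y)
    keep = list(range(len(names)))

    def score(cols):
        sub = [[row[c] for c in cols] for row in X]
        return criterion(fit_ss_resid(sub, y), n, len(cols))

    best = score(keep)
    while keep:
        # Score every model that omits exactly one remaining variable.
        trials = [(score([c for c in keep if c != d]), d) for d in keep]
        cand, drop = min(trials)
        if cand >= best:
            break  # no single removal improves the criterion
        best = cand
        keep.remove(drop)
    return [names[c] for c in keep], best


# Simulated (made-up) data: y depends on x1 and x2; x3 is pure noise.
random.seed(0)
n_obs = 60
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n_obs)]
y = [2 + 3 * r[0] - 1.5 * r[1] + random.gauss(0, 1) for r in X]
names = ["x1", "x2", "x3"]

selected_aic, score_aic = backward_step(X, y, names, aic)
selected_bic, score_bic = backward_step(X, y, names, bic)
print("AIC keeps:", selected_aic)
print("BIC keeps:", selected_bic)
```

Because ln n exceeds 2 once n is at least 8, BIC charges more per variable than AIC on any realistic sample, so it tends to choose models that are the same size or smaller. Like the stepwise procedure in the notes, this search looks only at single removals and may miss the best subset.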