
April 19 — Choosing model variables
1. The standard linear statistical model (lm in R)
(a) $y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + e_i$
(b) The errors $e_i$ have mean 0 and variance $\sigma^2$, and are independent.
(c) The random variables $e_i$ have normal distributions.
2. Problem: How many of the k variables should we keep? There is a tradeoff: more variables
“explain” more variation but make the model more complicated.
3. Some solutions suggested by yesterday's discussion:
(a) Eliminate any variable with a “large” p-value (in other words, the evidence in favor of
$\beta \neq 0$ is not strong).
(b) Find the collection of variables with the largest adjusted $R^2$.
4. One approach.
(a) Stepwise regression: add or remove one variable at a time based on a criterion.
(b) We want to minimize some (increasing) function of SSResid and k. (As k or SSResid increases,
the model becomes less desirable.)
(c) Candidates for the function to minimize (there are theoretical reasons for these):
$$\mathrm{AIC} = n \ln\left(\frac{\mathrm{SSResid}}{n}\right) + 2(k + 1)$$
$$\mathrm{BIC} = n \ln\left(\frac{\mathrm{SSResid}}{n}\right) + (k + 1) \ln n$$
(d) The R function step performs stepwise regression using either AIC (the default) or BIC.
(e) Stepwise regression may not find the best model since it does not look at every subset of
variables.
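
The steps in items 1 through 4 can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the course's own analysis: it uses R's built-in mtcars data as a stand-in, and the choice of mpg as the response and of the five predictors is arbitrary. It fits the full model with lm, runs backward stepwise selection with step under both criteria, and checks the formulas in 4(c) by hand against extractAIC, which is the quantity step minimizes for lm fits.

```r
# Illustrative data: R's built-in mtcars (an assumption for this sketch;
# it is not the course's baseball dataset).
full <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)

# Coefficient table (estimates, t-values, p-values) and adjusted R^2
# for the full model.
print(coef(summary(full)))
adj_r2_full <- summary(full)$adj.r.squared
print(adj_r2_full)

n <- nrow(mtcars)

# Backward stepwise selection. The penalty argument of step() is also
# named k: it multiplies the number of coefficients and is NOT the number
# of predictors k used in the notes.
by_aic <- step(full, direction = "backward", trace = 0)              # k = 2: AIC
by_bic <- step(full, direction = "backward", k = log(n), trace = 0)  # k = log(n): BIC

# Variables kept under each criterion.
print(formula(by_aic))
print(formula(by_bic))

# Hand computation of the formulas in 4(c) for the full model.
ss_resid <- sum(resid(full)^2)
n_coef   <- length(coef(full))   # k + 1 in the notes, counting the intercept
aic_hand <- n * log(ss_resid / n) + 2 * n_coef
bic_hand <- n * log(ss_resid / n) + n_coef * log(n)

# extractAIC() returns c(edf, criterion); compare the criterion to the
# hand-computed values.
print(all.equal(aic_hand, extractAIC(full)[2]))
print(all.equal(bic_hand, extractAIC(full, k = log(n))[2]))
```

The functions AIC() and BIC() report values that differ from these by an additive constant depending only on n, so on a fixed dataset they rank models the same way.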
Homework
Read Devore and Farnum Section 11.6, pages 538–545.
1. The data page at the course website has a dataset that includes statistics on five years of major-league baseball. We want to use certain statistics computed on a per-game basis
to predict the number of runs per game (RG). The version of the dataset that is useful for this
purpose is the “per game” version. The variables are RG (runs), X1BG (singles), X2BG (doubles),
X3BG (triples), HRG (home runs), SOG (strikeouts), SBG (stolen bases), CSG (caught stealing).
(a) Use all of the other variables in a linear function to predict RG.
(b) Based on the t-values in the preceding analysis, which variables are obvious candidates for
removal from the model? Refit the model without these variables. Compute adjusted $R^2$
for the two models.
(c) Employ stepwise regression in R (step). Does step give the same result as the analysis in
part (b)?
(d) Using the model from (c), what variable should be the next to remove? Refit the model
without that variable and compare adjusted $R^2$ of this model to the models of (b) and (c).
(e) If you know something about baseball, do the coefficients in the model of (c) “make sense”?