Variable Selection Insurance
Lawrence D. Brown
Statistics Department, Wharton School, University of Pennsylvania
[email protected]
Abstract
Among statisticians, variable selection is a common and very dangerous activity. This talk will survey the dangers and then propose two forms of insurance to guarantee against the damage from this activity.
Conventional statistical inference requires that a specific model of how the data were generated be specified before the data are examined and analyzed. Yet it is common in applications for a variety of variable selection procedures to be undertaken to determine a preferred model, followed by statistical tests and confidence intervals computed for this "final" model. Such practices are typically misguided. The parameters being estimated depend on this final model, and post-model-selection sampling distributions may have unexpected properties that are very different from what is conventionally assumed. Confidence intervals and statistical tests do not perform as they should.
We address this dilemma within a standard linear-model framework. There is a numerical response of interest (Y) and a suite of possible explanatory variables, X1, …, Xp, to be used in a multiple linear regression. The data are gathered, a multivariate linear model is constructed using a selected subset of the potential X variables, and inference (estimates, confidence intervals, tests) is performed for the selected slope parameters.
We propose two types of insurance to guarantee against the deleterious effects of this type of variable selection. The first provides valid confidence intervals and tests based on the design matrix of the observed variables. It does not require adherence to a pre-specified variable selection algorithm. This insurance may involve overly conservative procedures; on the other hand, no less conservative procedure of this type will provide the desired insurance.
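The first type of insurance can be illustrated numerically. The sketch below is not taken from the talk. It assumes that the protection comes from a single constant K that replaces the usual normal quantile and is chosen so that the t-statistics of every slope in every submodel of the design matrix are covered simultaneously. It further assumes a known error standard deviation of 1 and a small p, so that all submodels can be enumerated. The names adjusted_directions and simultaneous_constant are hypothetical.

```python
import itertools
import numpy as np


def adjusted_directions(X):
    """Unit vectors l such that the t-statistic of slope j in submodel M is l @ y
    (assumption: error standard deviation known and equal to 1)."""
    n, p = X.shape
    dirs = []
    for size in range(1, p + 1):
        for M in itertools.combinations(range(p), size):
            XM = X[:, M]
            # Row j of pinv(XM) maps y to the least-squares slope of column j in M;
            # normalizing the row turns that slope into its t-statistic.
            P = np.linalg.pinv(XM)
            for row in P:
                dirs.append(row / np.linalg.norm(row))
    return np.array(dirs)


def simultaneous_constant(X, alpha=0.05, n_draws=20000, seed=0):
    """Monte Carlo estimate of the (1 - alpha) quantile of the largest |t|
    over all slopes in all submodels, under pure-noise responses."""
    rng = np.random.default_rng(seed)
    L = adjusted_directions(X)                      # shape (p * 2**(p-1), n)
    Z = rng.standard_normal((n_draws, X.shape[0]))  # simulated noise vectors
    max_abs_t = np.abs(Z @ L.T).max(axis=1)
    return np.quantile(max_abs_t, 1 - alpha)


# Illustrative design matrix (simulated, not data from the talk).
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
K = simultaneous_constant(X)
print("simultaneous constant K:", K)
print("conventional two-sided 95% normal quantile: 1.96")
```

Because the maximum runs over all submodels, K depends only on the design matrix and not on the selection rule that is actually used, which is one reading of the statement that no adherence to a pre-specified algorithm is needed. An interval of the form estimate ± K × standard error is wider than the conventional one, which reflects the conservatism mentioned in the abstract. The enumeration involves p·2^(p-1) directions, so this brute-force sketch is practical only for small p.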
The second type of insurance is purchased through use of a properly specified split-sample bootstrap. These intervals may be less conservative, but are not always so, and part of their price lies in the split-sample scheme that effectively sacrifices a portion of the data. This is joint work with R. Berk, A. Buja, E. George, E. Pitkin, M. Traskin, K. Zhang and L. Zhao.
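The split-sample idea behind the second type of insurance can be sketched as follows. The abstract does not spell out the scheme, so everything here is an assumption: one random half of the data is used only to choose variables, and the other half is used only to fit the chosen model and to form percentile bootstrap intervals by resampling cases. The selection rule select_top_k and the function split_sample_bootstrap are hypothetical stand-ins, not the authors' procedure.

```python
import numpy as np


def select_top_k(X, y, k):
    """Hypothetical selection rule: keep the k predictors most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.sort(np.argsort(corr)[-k:])


def ols_slopes(X, y):
    """Least-squares fit with an intercept; returns the slopes only."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]


def split_sample_bootstrap(X, y, k=2, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    perm = rng.permutation(n)
    sel_idx, inf_idx = perm[: n // 2], perm[n // 2:]

    # Selection half: the only place where the choice of variables is made.
    chosen = select_top_k(X[sel_idx], y[sel_idx], k)

    # Inference half: untouched by selection, used for estimates and intervals.
    X_inf, y_inf = X[inf_idx][:, chosen], y[inf_idx]
    estimate = ols_slopes(X_inf, y_inf)

    m = len(y_inf)
    boot = np.empty((n_boot, len(chosen)))
    for b in range(n_boot):
        rows = rng.integers(0, m, size=m)   # resample cases with replacement
        boot[b] = ols_slopes(X_inf[rows], y_inf[rows])

    lower = np.quantile(boot, alpha / 2, axis=0)
    upper = np.quantile(boot, 1 - alpha / 2, axis=0)
    return chosen, estimate, lower, upper


# Simulated example (not data from the talk): only the first predictor matters.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 6))
y = 1.0 * X[:, 0] + rng.standard_normal(200)
chosen, estimate, lower, upper = split_sample_bootstrap(X, y)
for j, est, lo, hi in zip(chosen, estimate, lower, upper):
    print(f"X{j + 1}: estimate {est:.2f}, interval [{lo:.2f}, {hi:.2f}]")
```

Only half of the observations contribute to the intervals in this sketch, which illustrates the price described above: the selection half is effectively sacrificed for inference.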