Variable Selection Insurance
Lawrence D. Brown
Statistics Department
Wharton School, Univ of Pennsylvania
Abstract
Among statisticians variable selection is a common and very dangerous activity. This talk
will survey the dangers and then propose two forms of insurance to guarantee against the
damages from this activity.
Conventional statistical inference requires that a specific model of how the data were
generated be specified before the data are examined and analyzed. Yet it is common in
applications for a variety of variable selection procedures to be undertaken to determine a
preferred model followed by statistical tests and confidence intervals computed for this “final”
model. Such practices are typically misguided. The parameters being estimated depend on this
final model, and post-model-selection sampling distributions may have unexpected properties
that are very different from what is conventionally assumed. Confidence intervals and statistical
tests do not perform as they should.
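
To make the danger concrete, here is a minimal simulation sketch; it is not taken from the talk. It assumes a deliberately simple setup: the response is pure noise, several independent candidate predictors are generated, the one with the largest |t| statistic is "selected", and a conventional 95% interval is reported for its slope. The names (`slope_and_se`, `naive_coverage`) and the settings (n = 50, no intercept, Gaussian data) are illustrative assumptions.

```python
import math
import random

N = 50
T_CRIT = 2.0096  # 97.5% point of Student's t with N - 1 = 49 degrees of freedom


def slope_and_se(x, y):
    """No-intercept simple regression of y on x: slope and its standard error."""
    sxx = sum(u * u for u in x)
    b = sum(u * v for u, v in zip(x, y)) / sxx
    rss = sum((v - b * u) ** 2 for u, v in zip(x, y))
    se = math.sqrt(rss / (len(y) - 1) / sxx)
    return b, se


def naive_coverage(p=10, reps=1000, seed=0):
    """Fraction of naive 95% intervals that contain the true slope (zero)
    when the reported predictor is the one with the largest |t| among p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        y = [rng.gauss(0, 1) for _ in range(N)]  # response is pure noise
        fits = []
        for _ in range(p):
            x = [rng.gauss(0, 1) for _ in range(N)]
            fits.append(slope_and_se(x, y))
        # Data-driven selection: keep the most "significant" predictor.
        b, se = max(fits, key=lambda f: abs(f[0] / f[1]))
        # Naive interval b +/- T_CRIT * se contains 0 iff |b| <= T_CRIT * se.
        if abs(b) <= T_CRIT * se:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print("no selection (p=1): ", naive_coverage(p=1))
    print("best of ten (p=10): ", naive_coverage(p=10))
```

Every true slope is zero in this setup, so a valid 95% interval should contain zero about 95% of the time. With ten independent candidates, reporting only the largest |t| pushes the coverage of the naive interval down to about 0.95^10 ≈ 0.60.
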
We address this dilemma within a standard linear-model framework. There is a numerical
response of interest (Y) and a suite of possible explanatory variables, X1,…,Xp, to be used in a
multiple linear regression. The data are gathered, a multiple linear regression model is constructed using a
selected subset of the potential X variables, and inference (estimates, confidence intervals, tests)
is performed for the selected slope parameters.
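
One common way to write this setup down is sketched below. The notation is assumed here and does not come from the abstract: M is the index set of the selected variables, X_M holds the corresponding columns of the design matrix, and the errors are taken to have mean zero.

```latex
\[
  Y = \mu + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad
  X = (X_1, \dots, X_p) \in \mathbb{R}^{n \times p},
\]
\[
  \hat{\beta}_M = (X_M^{\top} X_M)^{-1} X_M^{\top} Y,
  \qquad
  \beta_M = (X_M^{\top} X_M)^{-1} X_M^{\top} \mu,
  \qquad M \subseteq \{1, \dots, p\}.
\]
```

Under this formulation the slope attached to a given X_j in \beta_M generally changes with M unless the predictors are orthogonal, which is one way to read the statement that the parameters being estimated depend on the final model.
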
We propose two types of insurance to guarantee against the deleterious effects of this type
of variable selection. The first provides valid confidence intervals and tests based on the design
matrix of the observed variables. It does not require adherence to a pre-specified variable selection
algorithm. This insurance may involve overly conservative procedures; but on the other hand, no
less conservative procedure of this type will provide the desired insurance. The second type of
insurance is purchased through use of a properly specified split-sample bootstrap. These intervals
may be less conservative, but are not always so, and part of their price lies in the split-sample
scheme that effectively sacrifices a portion of the data.
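
The sample-splitting idea behind the second form of insurance can be sketched as follows. This is a simplified illustration, not the authors' split-sample bootstrap, whose exact specification is not given in this abstract. It assumes that one predictor is selected on the first half of the data by its marginal association with Y, and that a pairs-bootstrap percentile interval for its no-intercept slope is then computed on the second half only. The function names and parameters (`split_sample_interval`, `n_boot`, `level`) are hypothetical.

```python
import random


def slope(x, y):
    """No-intercept least-squares slope of y on x."""
    return sum(u * v for u, v in zip(x, y)) / sum(u * u for u in x)


def split_sample_interval(columns, y, rng, n_boot=500, level=0.95):
    """Select one predictor on the first half of the rows, then return
    (index, lower, upper): a pairs-bootstrap percentile interval for its
    slope computed on the second half of the rows."""
    n = len(y)
    half = n // 2

    # Step 1: selection sees rows [0, half) only.
    def score(col):
        x, yy = col[:half], y[:half]
        sxy = sum(u * v for u, v in zip(x, yy))
        return abs(sxy) / sum(u * u for u in x) ** 0.5

    j = max(range(len(columns)), key=lambda k: score(columns[k]))

    # Step 2: inference sees rows [half, n) only, untouched by selection.
    x2, y2 = columns[j][half:], y[half:]
    m = len(y2)
    boots = []
    for _ in range(n_boot):
        idx = [rng.randrange(m) for _ in range(m)]  # resample (x, y) pairs
        boots.append(slope([x2[i] for i in idx], [y2[i] for i in idx]))
    boots.sort()
    k = int((1 - level) / 2 * n_boot)  # order statistics trimmed per tail
    return j, boots[k], boots[n_boot - 1 - k]


if __name__ == "__main__":
    rng = random.Random(1)
    n, p = 100, 5
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    # Illustrative data: only the first predictor matters, with slope 2.
    y = [2.0 * columns[0][i] + rng.gauss(0, 1) for i in range(n)]
    print(split_sample_interval(columns, y, rng))
```

The interval is built from only half of the observations, which is the price the abstract refers to: the selection half contributes nothing to the precision of the final interval.
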
This is joint work with R. Berk, A. Buja, E. George, E. Pitkin, M. Traskin, K. Zhang and L. Zhao.