SUPPLEMENTAL DIGITAL CONTENT 3.
Appendix B. Nonparametric Regression
Introduction
Nonparametric regression aims to uncover structure in data while making minimal
assumptions, thereby letting the data speak for itself. Typically this means flexibly
estimating the mean of an outcome Y given a vector of predictors X from some sample; that is, doing so without specifying how exactly the mean of Y relates to the predictors
X, and without specifying what the distribution of Y given X happens to be. Contrast this
with the standard regression approach, for example, which presupposes not only that Y
happens to follow one of the commonly studied distributions (e.g., normal or Poisson)
but also that there is a strictly linear relationship between the predictors and outcome,
i.e., g(E[Y|X=x]) = b0 + b1x1 + ... + bpxp, where g is a known link function. In practice
these assumptions are often made for the sake of convenience, rather than because
they actually have some justification; but misspecifying the distribution or the mean
structure can have dire consequences, including erroneous scientific conclusions and
missed opportunities for finding distinctive features in the data. For excellent and
thorough reviews of nonparametric regression, see the texts by Hastie et al. (2009) and
Wasserman (2006).
Generalized additive modeling
The generalized additive model (GAM), developed by Hastie and Tibshirani (1990), is based on linear regression, but rather than assuming a linear relationship between the predictors themselves and the outcome, it assumes a linear (additive) relationship between nonparametric functions f of the predictors and the outcome:
g(E[Y|X=x]) = f0 + f1(x1) + ... + fp(xp)
In practice the nonparametric functions f are either splines or local regression curves,
and they are estimated via an iterative "backfitting" algorithm. The backfitting algorithm
essentially works by fitting a spline (i.e., a nonlinear but smooth curve) for each predictor to the partial residuals calculated after excluding that predictor, and repeating until the fitted curves no longer change appreciably. The functions are chosen to give the best model fit without overfitting, by minimizing the sum of squared errors (for continuous outcomes) or maximizing the log-likelihood (for binary outcomes) after penalizing according to the number of parameters in the model.
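To make the idea concrete, the following is a minimal sketch of backfitting for a continuous outcome, using R's built-in smooth.spline function; it is an illustration of the general procedure rather than the exact algorithm used by any particular package, and the function and variable names are our own.
# Minimal backfitting sketch for a continuous outcome y and a data frame X of
# numeric predictors, using base R smoothing splines; names are illustrative.
backfit_sketch <- function(y, X, df = 4, n_iter = 20) {
  n <- nrow(X); p <- ncol(X)
  f <- matrix(0, n, p)              # current estimates of each function f_j(x_j)
  f0 <- mean(y)                     # intercept
  for (iter in 1:n_iter) {
    for (j in 1:p) {
      # partial residuals: subtract the intercept and all other fitted functions
      r <- y - f0 - rowSums(f[, -j, drop = FALSE])
      fit <- smooth.spline(X[, j], r, df = df)
      f[, j] <- predict(fit, X[, j])$y
      f[, j] <- f[, j] - mean(f[, j])   # center each function for identifiability
    }
  }
  list(intercept = f0, functions = f)
}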
The above estimation procedure has an easy implementation in R, so that analysts can
use GAMs without worrying about computing details. Once the "gam" package has
been installed and loaded in an R session, a GAM can be fit with the following code
(assuming "dataset" is the name of a data frame in R containing the outcome y and
covariates x1, x2, and x3):
gam.result <- gam(y ~ s(x1,4) + s(x2,4) + x3, data=dataset)
The above model uses splines with four degrees of freedom as the nonparametric functions for x1 and x2, but assumes a linear relationship between predictor x3 and y. The model is easily adaptable: in practice one could incorporate many more predictors, along with interaction terms and more or fewer degrees of freedom.
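Assuming the fit above, the estimated functions can then be examined with the package's summary and plotting methods, and a binary outcome can be handled by adding a family argument; this brief illustration is ours rather than part of the original analysis.
summary(gam.result)           # tests for the parametric and nonparametric terms
plot(gam.result, se = TRUE)   # plot each estimated function with standard-error bands
# For a binary outcome, the same call with a binomial family gives a logistic GAM:
gam.logit <- gam(y ~ s(x1, 4) + s(x2, 4) + x3, family = binomial, data = dataset)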
Boosting
Boosting was first developed by Freund and Schapire (1996), and unlike GAM is a so-called "ensemble" method or "meta-classifier" (like random forests). Ensemble methods work by fitting many models to the data and generating predictions by combining results across models. Specifically, boosting fits models to successively re-weighted and re-sampled versions of the data using weights that increase the importance of previously misclassified data points.
We used "stochastic gradient tree boosting," which reweights random subsamples of
the data with the negative gradient of an appropriate loss function across a large
number of iterations and generates predictions using sums of trees. The loss function
(e.g., squared error, absolute error, deviance) defines the distance between a prediction
and a true value, and for a given problem can be determined by, for example, whether
the outcome is continuous or categorical; the negative gradient of the loss function is just the derivative of the loss with respect to the prediction (with the sign flipped), and can be thought of as a generalized residual. Boosting yields predictions that are sums of trees, but the analyst has to specify
how large these individual trees should be; this determines what level of interaction
among predictors is allowed, as trees with a single split will be additive in the predictors
while trees with two splits will allow for pairwise interactions, and so on. The analyst also has to decide how many trees to use (typical values are at least in the hundreds), how much of the data to sample at each step, and the value of a shrinkage parameter (with more shrinkage requiring more trees). These choices can be guided by the data via cross-validation, or by suggested default values. It should be noted that boosting inherently
does variable selection through its use of simple trees.
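For intuition only, the following is a rough sketch of a single stochastic gradient boosting step with squared-error loss, for which the negative gradient is just the ordinary residual. It uses the rpart package to fit the small trees and is not the implementation used in the analysis; all names and defaults are illustrative.
# One stochastic gradient boosting step with squared-error loss, for intuition only;
# y is the outcome, Xdf a data frame of predictors, f the current fitted values.
library(rpart)
boost_step <- function(y, Xdf, f, shrinkage = 0.05, bag.fraction = 0.5, depth = 1) {
  n <- length(y)
  idx <- sample(n, size = floor(bag.fraction * n))     # random subsample of the data
  resid <- y[idx] - f[idx]                             # negative gradient of 0.5 * (y - f)^2
  d <- cbind(r = resid, Xdf[idx, , drop = FALSE])      # residuals become the working outcome
  tree <- rpart(r ~ ., data = d,
                control = rpart.control(maxdepth = depth))  # small tree; depth 1 is additive
  f + shrinkage * predict(tree, newdata = Xdf)         # shrunken update added to the running fit
}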
Gradient tree boosting can be easily implemented in R via the "gbm" package:
library(gbm)   # load the gbm package
gbm.result <- gbm(y ~ x1 + x2 + x3, data = dataset, n.trees = 1000,
                  interaction.depth = 3, bag.fraction = 0.5, shrinkage = 0.05)
This model allows 3-way interactions, uses 1000 trees fit to samples of half the data,
and has shrinkage parameter 0.05. Many more details and options are discussed in the
"gbm" package documentation.
References
Hastie T, Tibshirani R. Generalized additive models. 1st ed. London; New York:
Chapman and Hall; 1990.
Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining,
inference, and prediction. 2nd ed. New York, NY: Springer; 2009.
Freund Y, Schapire R. Experiments with a new boosting algorithm. Proceedings of the
Thirteenth International Conference on Machine Learning. San Francisco: Morgan
Kaufmann; 1996.
Wasserman L. All of nonparametric statistics. New York, NY: Springer; 2006.