SUPPLEMENTAL DIGITAL CONTENT 3. Appendix B. Nonparametric Regression

Introduction

Nonparametric regression aims to uncover structure in data while making minimal assumptions, thereby letting the data speak for themselves. Typically this means flexibly estimating the mean of an outcome Y given a vector of predictors X from some sample; that is, doing so without specifying exactly how the mean of Y relates to the predictors X, and without specifying the distribution of Y given X. Contrast this with the standard regression approach, which presupposes not only that Y follows one of the commonly studied distributions (e.g., normal or Poisson) but also that there is a strictly linear relationship between the predictors and the outcome, i.e., g(E[Y|X=x]) = b0 + b1x1 + ... + bpxp, where g is a known link function. In practice these assumptions are often made for convenience rather than because they have any real justification, but misspecifying the distribution or the mean structure can have dire consequences, including erroneous scientific conclusions and missed opportunities for finding distinctive features in the data. For excellent and thorough reviews of nonparametric regression, see the texts by Hastie et al. (2009) and Wasserman (2006).

Generalized additive modeling

The generalized additive model (GAM), developed by Hastie and Tibshirani (1990), is based on linear regression, but rather than assuming a linear relationship between the predictors themselves and the outcome, it assumes a linear relationship between nonparametric functions f of the predictors and the outcome:

g(E[Y|X=x]) = f0 + f1(x1) + ... + fp(xp)

In practice the nonparametric functions f are either splines or local regression curves, and they are estimated via an iterative "backfitting" algorithm.
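The backfitting idea can be sketched in a few lines of R. This is an illustrative toy implementation on simulated data using stats::smooth.spline, not the gam package's actual routine (which also handles weights, link functions, and general smoothers):

```r
# Toy backfitting for an additive model with two smooth terms.
set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)

f1 <- rep(0, n); f2 <- rep(0, n)   # current estimates of f1(x1), f2(x2)
alpha <- mean(y)                   # intercept f0
for (iter in 1:50) {
  f1.old <- f1; f2.old <- f2
  # Fit each smooth to the partial residuals that exclude its own term
  f1 <- predict(smooth.spline(x1, y - alpha - f2, df = 4), x1)$y
  f1 <- f1 - mean(f1)              # center each curve for identifiability
  f2 <- predict(smooth.spline(x2, y - alpha - f1, df = 4), x2)$y
  f2 <- f2 - mean(f2)
  # Stop once the curves no longer change much
  if (max(abs(f1 - f1.old), abs(f2 - f2.old)) < 1e-6) break
}
fit <- alpha + f1 + f2             # additive fitted values
```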
The backfitting algorithm essentially works by fitting a spline (i.e., a nonlinear but smooth curve) for each predictor to residual values calculated after excluding that predictor, and repeating until the new curves for each predictor no longer change much. The functions are chosen to give the best model fit without overfitting, by minimizing the sum of squared errors (for continuous outcomes) or maximizing the log-likelihood (for binary outcomes), after penalizing according to the number of parameters in the model.

The above estimation procedure has an easy implementation in R, so analysts can use GAMs without worrying about computing details. Once the "gam" package has been installed and loaded in an R session, a GAM can be fit with the following code (assuming "dataset" is the name of a data frame in R containing the outcome y and covariates x1, x2, and x3):

gam.result <- gam(y ~ s(x1,4) + s(x2,4) + x3, data=dataset)

The above model uses splines with four degrees of freedom as the nonparametric functions for x1 and x2, but assumes a linear relationship between predictor x3 and y. The model is easily adapted: in practice one could incorporate many more predictors, along with interaction terms and greater or fewer degrees of freedom.

Boosting

Boosting was first developed by Freund and Schapire (1996), and unlike the GAM it is a so-called "ensemble" method or "meta-classifier" (like random forests). Ensemble methods work by fitting many models to the data and generating predictions by combining results across models. Specifically, boosting fits models to successively re-weighted and resampled versions of the data, using weights that increase the importance of previously misclassified data points. We used "stochastic gradient tree boosting," which reweights random subsamples of the data with the negative gradient of an appropriate loss function across a large number of iterations and generates predictions using sums of trees.
The loss function (e.g., squared error, absolute error, deviance) defines the distance between a prediction and a true value, and for a given problem it can be determined by, for example, whether the outcome is continuous or categorical; the gradient of the loss function is just its derivative with respect to the prediction, and can be thought of as a simplified residual. Boosting yields predictions that are sums of trees, but the analyst has to specify how large the individual trees should be; this determines the level of interaction among predictors that is allowed, since trees with a single split are additive in the predictors, trees with two splits allow pairwise interactions, and so on. The analyst also has to decide how many trees to use (typical values are at least in the hundreds), how much data to sample at each step, and a shrinkage parameter (with more shrinkage requiring more trees). These choices can be guided by the data via cross-validation, or by suggested default values. It should be noted that boosting inherently does variable selection through its use of simple trees.

Gradient tree boosting can be easily implemented in R via the "gbm" package:

gbm.result <- gbm(y ~ x1+x2+x3, data=dataset, n.trees=1000, interaction.depth=3, bag.fraction=0.5, shrinkage=0.05)

This model allows 3-way interactions, uses 1000 trees fit to samples of half the data, and has shrinkage parameter 0.05. Many more details and options are discussed in the "gbm" package documentation.

References

Hastie T, Tibshirani R. Generalized additive models. 1st ed. London; New York: Chapman and Hall; 1990.

Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York, NY: Springer; 2009.

Freund Y, Schapire R. Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann; 1996.

Wasserman L. All of nonparametric statistics. New York, NY: Springer; 2006.
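Addendum: the gbm call shown above can be extended so that the number of trees is itself chosen by cross-validation, as suggested in the text. The following is an illustrative sketch only; the simulated data frame, the cv.folds argument, and distribution = "gaussian" (squared-error loss) are additions not specified above:

```r
library(gbm)

# Simulated stand-in for the "dataset" data frame from the example above
set.seed(2)
dataset <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
dataset$y <- dataset$x1^2 + dataset$x2 * dataset$x3 + rnorm(300, sd = 0.5)

gbm.result <- gbm(y ~ x1 + x2 + x3, data = dataset,
                  distribution = "gaussian",       # squared-error loss
                  n.trees = 1000, interaction.depth = 3,
                  bag.fraction = 0.5, shrinkage = 0.05,
                  cv.folds = 5)                    # enables CV error estimates

best.iter <- gbm.perf(gbm.result, method = "cv")   # CV-optimal number of trees
preds <- predict(gbm.result, newdata = dataset, n.trees = best.iter)
summary(gbm.result, n.trees = best.iter)           # relative influence of predictors
```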