Bayesian Classification and Regression Tree Analysis (CART)
Teresa Jacobson
Department of Applied Mathematics and Statistics, Jack Baskin School of Engineering, UC Santa Cruz
March 11, 2010

Outline
- Introduction
- Bayesian Model
  - Chipman, George, and McCulloch (CGM)
  - Wu, Tjelmeland, and West (WTW)
- Example
- Extensions and future work
- Bibliography

What is CART?
The general aim of classification and regression tree analysis: given a set of observations y_i and associated variables x_ij, i = 1:n and j = 1:p, find a way of using x to partition the observations into homogeneously distributed groups, then use the groups to predict y.
- Use binary trees to recursively split the observations with yes/no questions about the variables in x.
- Assume each end (terminal) node has a homogeneous distribution.

How do we do this?
The seminal work by Breiman et al. [1] was surprisingly Bayesian, involving the elicitation of priors and of risk/utility functions on misclassification. However, the actual tree-generation methods were still very ad hoc. After this work was published, a large number of different ad hoc methods appeared, as well as attempts to combine them to produce better inferential strategies. These methods are largely deterministic in nature and produce one tree per method.

Going Bayesian: The Problem
[Figure: "p = ?"; image courtesy of Diesel-stock, diesel-stock.deviantart.com.]

Notation
Notation follows that of Wu, Tjelmeland, and West (WTW) [7].
- Observations y_i and "regressors" x_i, with i ∈ I = {1:n} and j ∈ {1:k}. We wish to predict y ∈ Y based on the associated x ∈ X = X_1 × ··· × X_k.
- Nodes u, with the root node denoted node 0 and each non-terminal node u having children 2u + 1 (left) and 2u + 2 (right). Trees are then defined as appropriate subsets of the set N = {0, 1, 2, ...}. Write the number of nodes of a tree T as m(T).
- Splitting: for each non-terminal node u, choose a predictor variable index k_T(u) and a splitting threshold τ_T(u) ∈ X_{k_T(u)}. An observation is then assigned to the left child of u if x_{k_T(u)} ≤ τ_T(u).

Example tree
[Figure: example tree fit to the iris data, height = 4, log(p) = 134.866. Internal nodes split on Petal.Width and Sepal.Length; the five leaves contain 30, 17, 13, 11, and 19 observations.]

Likelihood
Each terminal node (leaf) is viewed as a random sample from some distribution with density φ(·|θ_u), where θ_u depends only on the leaf. Usually φ is either multinomial (categorical outcomes) or normal (continuous outcomes).
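To make the node indexing and splitting rules concrete, here is a minimal R sketch (not from the slides; the function and argument names assign_leaf, k_T, tau_T, and is_leaf are illustrative) that routes a single observation from the root to its leaf using the 2u + 1 / 2u + 2 child convention. Each leaf reached this way would then contribute a likelihood term φ(y | θ_u).

## Illustrative sketch: route an observation x to a leaf of a tree T.
## Encoding as in WTW: node 0 is the root, node u has children
## 2u + 1 (left) and 2u + 2 (right).
## k_T[[u]], tau_T[[u]]: splitting variable index and threshold at node u;
## is_leaf[[u]]: TRUE if node u is terminal. (All names are illustrative.)
assign_leaf <- function(x, k_T, tau_T, is_leaf) {
  u <- 0                                  # start at the root
  key <- function(u) as.character(u)      # lists are indexed by node label
  while (!is_leaf[[key(u)]]) {
    if (x[k_T[[key(u)]]] <= tau_T[[key(u)]]) {
      u <- 2 * u + 1                      # x_{k_T(u)} <= tau_T(u): go left
    } else {
      u <- 2 * u + 2                      # otherwise go right
    }
  }
  u                                       # label of the leaf holding x
}

## Toy tree: the root splits on variable 1 at 1.5; its left child is a
## leaf, and its right child splits on variable 2 at 6.2 into two leaves.
k_T     <- list(`0` = 1, `2` = 2)
tau_T   <- list(`0` = 1.5, `2` = 6.2)
is_leaf <- list(`0` = FALSE, `1` = TRUE, `2` = FALSE, `5` = TRUE, `6` = TRUE)
assign_leaf(c(1.0, 7.0), k_T, tau_T, is_leaf)   # returns 1 (left leaf)
assign_leaf(c(2.0, 7.0), k_T, tau_T, is_leaf)   # returns 6 (right-right leaf)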
Tree prior
Simplify by using a prior of the form p(Θ, T) = p(Θ|T) p(T). Chipman, George, and McCulloch (CGM) specify p(T) implicitly through a tree-generating process:
1. Begin by setting T to be the trivial one-node tree.
2. Split a node with probability p_split(u, T).
3. If a node splits, assign it a splitting rule τ_T(u) according to some distribution p(τ_T(u)|u, T). Update T to reflect the new tree, and repeat steps 2 and 3.

Tree prior (cont.)
Consider p_split(u, T) = α(1 + d_u)^(−β), with β ≥ 0 and 0 ≤ α ≤ 1, where d_u is the depth of node u. Consider finite sets of splitting values. Suggestion: choose k uniformly from the available predictors, then choose τ from the set of observed values if x_k is quantitative, or from the available subsets if it is qualitative. For Θ, use iid normal-inverse-gamma priors for Θ|T when constructing a regression tree and Dirichlet priors when constructing a classification tree. CGM suggest choosing the hyperparameters based on fitting a greedy tree model.

Fitting procedure
Fitting proceeds through MCMC; interest focuses on the steps for sampling the tree structure. CGM use a Metropolis-Hastings step with a transition kernel that chooses randomly among four moves:
- Grow: pick a terminal node and split it into two children,
- Prune: pick a parent of two terminal nodes and collapse it,
- Change: pick an internal node and reassign its splitting rule,
- Swap: pick a parent-child pair and swap their splitting rules, unless the other child of the parent has the same rule, in which case give both children the splitting rule of the parent.
All moves are reversible, so the Markov chain is reversible.

Limitations
- Relatively slow mixing: a tendency to stay in a local region of tree space.
- A tendency to get "stuck" in a local mode: CGM suggest repeated restarting, either from the trivial tree or from trees found by other methods such as bootstrap bumping.
- No single tree output, and no good way of picking one "good" tree from the sample.

Wu, Tjelmeland, and West (WTW)
WTW propose two significant improvements to CGM's method:
- an improved prior on the tree structure, the "pinball prior",
- a new Metropolis-Hastings move, the "tree restructure" move.
They also allow for infinitely many splitting values, via a prior on the space of splitting values; a prior with finite point masses recovers that of CGM as a special case.

Pinball prior
Idea: generate some number of terminal nodes m(T), then "cascade" these nodes down the tree, randomly splitting them left/right with some probability until the nodes define individual leaves (a simulation sketch follows below).
- Specify a prior density for the tree size, m(T) ∼ α(m(T)). A natural choice is Poisson: m(T) = 1 + Pois(λ) for some specified λ.
- Construct a prior density for splitting, β(m_{l(u)}(T) | m_u(T)), where m_{l(u)}(T) is the number sent to the left child out of the m_u(T) terminal nodes that have cascaded down to node u. There are a number of choices for β, e.g. uniform or binomial.
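To illustrate, here is a minimal R sketch (not WTW's code; the function name simulate_pinball_tree and the choice of λ are illustrative) that simulates a tree skeleton from a pinball-style prior, assuming a Poisson prior on the number of leaves and a binomial choice for β, with node labels following the 2u + 1 / 2u + 2 convention above.

## Illustrative pinball-prior simulation.
## Draw m(T) ~ 1 + Pois(lambda), then cascade the leaves down from the
## root: a node holding m > 1 leaves sends m_left of them to its left
## child and the rest to its right child, recursing until every node
## holds exactly one leaf.
simulate_pinball_tree <- function(lambda = 3) {
  m_T <- 1 + rpois(1, lambda)                   # prior on tree size
  leaves <- integer(0)
  cascade <- function(u, m) {
    if (m == 1) {                                # node u becomes a leaf
      leaves <<- c(leaves, u)
      return(invisible(NULL))
    }
    ## beta(m_left | m): binomial split, truncated so that both children
    ## receive at least one leaf (keeps every internal node binary).
    repeat {
      m_left <- rbinom(1, m, 0.5)
      if (m_left >= 1 && m_left <= m - 1) break
    }
    cascade(2 * u + 1, m_left)                   # left child
    cascade(2 * u + 2, m - m_left)               # right child
  }
  cascade(0, m_T)
  list(n_leaves = m_T, leaf_labels = sort(leaves))
}

set.seed(1)
simulate_pinball_tree(lambda = 3)   # a small random tree skeleton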
Tree restructure move
Idea: restructure the tree branches without changing the terminal categories.
- Begin at node 0.
- Recursively identify the possible splitting rules that leave the terminal categories unchanged.
- Choose one such splitting rule, and repeat until the terminal nodes are fully specified.
This move radically restructures the tree without affecting the categorization and eliminates the tendency to get stuck near local maxima: effective exploration of the posterior gives better mixing and better posterior inference.

Example
Iris data: we wish to use sepal length and petal width to predict petal length. Divide the data into two sets: 30 of each species for tree creation (training) and 20 of each species for evaluation (testing).

> iris.subsample.index <- c(sample(1:50, 30), sample(51:100, 30), sample(101:150, 30))
> iris.train <- iris[iris.subsample.index,]
> iris.test <- iris[-iris.subsample.index,]

[Figure: iris petal length plotted over Sepal.Length and Petal.Width for the training and testing sets.]

Example (cont.)
Using bcart in the tgp package:

> bcart.iris <- bcart(X = iris.train[,c(1,4)], XX = iris.test[,c(1,4)], Z = iris.train[,3], trace = TRUE, R = 5, BTE = c(2000, 10000, 2))

[Figure: fitted surface over Sepal.Length and Petal.Width: posterior predictive mean ("z mean") and quantile difference ("z quantile diff (error)").]

[Figure: sampled trees of heights 3, 4, and 5, with log(p) = 118.534, 134.866, and 104.631 respectively; each splits first on Petal.Width at 1.5 and then on Sepal.Length.]

[Figure: observed versus predicted petal length for the training and testing data, by species (setosa, versicolor, virginica).]
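The observed-versus-predicted plots above could be reproduced with a few lines of R such as the sketch below (not from the slides). It assumes the fitted "tgp"-class object stores the posterior predictive means for the XX (test) locations in a component named ZZ.mean; consult the tgp documentation if your version names this component differently.

## Hedged sketch: out-of-sample check for the bcart fit above.
## Assumes bcart.iris$ZZ.mean holds posterior predictive means at XX.
pred.mean <- bcart.iris$ZZ.mean
obs <- iris.test[, 3]                         # observed petal length
rmse <- sqrt(mean((obs - pred.mean)^2))       # root mean squared error
plot(pred.mean, obs,
     xlab = "Predicted petal length", ylab = "Observed petal length")
abline(0, 1, lty = 2)                         # perfect-prediction line
rmse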
Extensions and Future Work
- Implementation!
- Inference methods: tree averaging
- Beyond the Gaussian:
  - heavy-tailed distributions,
  - skew and count data
- Improved priors
- Improved sampling steps

Bibliography
[1] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth Statistics/Probability Series. Wadsworth International Group, 1984.
[2] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–960, September 1998.
[3] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Hierarchical priors for Bayesian CART shrinkage. Statistics and Computing, 10:17–24, 2000.
[4] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayesian treed models. Machine Learning, 48:299–320, 2002.
[5] David G. T. Denison, Bani K. Mallick, and Adrian F. M. Smith. A Bayesian CART algorithm. Biometrika, 85(2):363–377, June 1998.
[6] Wei-Yin Loh. Classification and regression tree methods. In Ruggeri, Kenett, and Faltin, editors, Encyclopedia of Statistics in Quality and Reliability, pages 315–323. Wiley, 2008.
[7] Yuhong Wu, Håkon Tjelmeland, and Mike West. Bayesian CART: prior specification and posterior simulation. January 2006.