Topic 10 – Ensemble Methods
Data Mining - Volinsky - 2011 - Columbia University

Ensemble Models – Motivation
• Remember this picture? [Figure: training error $S_{\text{train}}(\theta)$ and test error $S_{\text{test}}(\theta)$ as a function of model complexity.]
• We are always looking for a balance between low complexity ("good on average" but bad for prediction) and high complexity ("good for specific cases" but might overfit).
• By combining many different models, ensembles make it easier to hit the "sweet spot" of modelling.
• It is best for the models to draw on diverse, independent opinions – the Wisdom of Crowds.

Ensemble Methods – Motivation
• Models are just models.
  – They are usually not literally true! The truth is often much more complex than any single model can capture.
  – Combinations of simple models can be arbitrarily complex (e.g. spam/robot models, neural nets, splines).
• Notion: an average of several measurements is often more accurate and stable than a single measurement.
  – Accuracy: how well the model does for estimation and prediction.
  – Stability: small changes in inputs have little effect on outputs.

Ensemble Methods – How They Work
• The ensemble predicts a target value as an average or a vote of the predictions of several individual models.
  – Each model is fit independently of the others.
  – The final prediction is a combination of the independent predictions of all the models.
• For a continuous target, an ensemble averages the predictions – usually weighted.
• For a categorical target (classification), an ensemble may average the probabilities of the target values, or may use "voting".
  – Voting classifies a case into the class that was selected most often by the individual models.

Ensemble Models – Why They Work
• Voting example:
  – 5 independent classifiers, each with 70% accuracy.
  – Use voting: what is the probability that the ensemble model is correct?
  – Let's simulate it – see the sketch below.
  – What about 100 classifiers?
  – (Not a realistic example – why?)
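The voting probability can be computed exactly from the binomial distribution and checked by simulation. A minimal sketch in Python (the function names are mine, and it assumes the classifiers err independently – exactly the unrealistic part the question above hints at):

```python
import math
import random

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Exact P(majority vote is correct) for n independent classifiers,
    each correct with probability p."""
    k_needed = n_models // 2 + 1  # votes needed for a strict majority
    return sum(math.comb(n_models, k) * p**k * (1 - p)**(n_models - k)
               for k in range(k_needed, n_models + 1))

def simulate(n_models: int, p: float, trials: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo check of the exact calculation."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        n_correct = sum(rng.random() < p for _ in range(n_models))
        wins += n_correct >= n_models // 2 + 1
    return wins / trials

print(majority_vote_accuracy(5, 0.7))    # 0.83692
print(simulate(5, 0.7))                  # close to 0.837
print(majority_vote_accuracy(101, 0.7))  # ~0.9999 (odd count avoids ties)
```

Five so-so classifiers at 70% already give roughly 84% accuracy as a committee, and a hundred-odd of them push it near certainty. But this holds only under the independence assumption; real models trained on the same data make correlated errors.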
Ensemble Schemes
• The beauty is that you can average together models of any kind!
• You don't need fancy schemes – just average!
• But there are fancy schemes. Each has its own way of fitting many models to the same data and then voting or averaging:
  – Stacking (Wolpert 92): fit many leave-one-out models.
  – Bagging (Breiman 96): build models on many permutations of the original data.
  – Boosting (Freund & Schapire 96): iteratively re-model, re-weighting the data based on the errors of previous models.
  – Arcing (Breiman 98), Bumping (Tibshirani 97), Crumpling (Anderson & Elder 98), Born-Again (Breiman 98).
  – Bayesian Model Averaging – near to my heart…
• We'll explore BMA, bagging and boosting.

Ensemble Methods – Bayesian Model Averaging

Model Averaging
• Idea: account for the inherent variance of the model selection process.
• Posterior variance = within-model variance + between-model variance.
• Data-driven model selection is risky: "part of the evidence is spent to specify the model" (Leamer, 1978).
• Model-based inferences can be over-precise.

Model Averaging
• For some quantity of interest $\Delta$, average over all models $M_k$, given the data $D$:
$$\Pr(\Delta \mid D) = \sum_k \Pr(\Delta \mid M_k, D)\,\Pr(M_k \mid D)$$
• To calculate the first term properly, you need to integrate out the model parameters $\theta$:
$$\Pr(\Delta \mid M_k, D) = \int \Pr(\Delta \mid M_k, \theta, D)\,\Pr(\theta \mid M_k, D)\,d\theta \approx \Pr(\Delta \mid M_k, \hat\theta_k, D),$$
where $\hat\theta_k$ is the MLE.
• For the second term, note that $\Pr(M_k \mid D) \propto \Pr(D \mid M_k)\,\Pr(M_k)$, and use the BIC approximation
$$\log \Pr(D \mid M_k) \approx \log \Pr(D \mid \hat\theta_k, M_k) - \frac{d_k}{2}\log n = -\tfrac{1}{2}\,\mathrm{BIC}_k,$$
where $d_k$ is the number of parameters in $M_k$ and $n$ is the sample size.

Bayesian Model Averaging
• The approximations above let you calculate many posterior model probabilities quickly, and they give you the weights to use for averaging.
• But how do you know which models to average over?
  – Example: regression with p candidate predictors. Each subset of the p predictors is a "model" – 2^p possible models!
• Idea: average over a small set of good models instead of all of them.

Model Averaging
• But how do you find the best models without fitting all of them?
• Solution: the Leaps and Bounds algorithm finds the best models without fitting them all.
  – Goal: find the single best model of each model size.
  – [Figure: branch-and-bound search tree, annotated "Don't need to traverse this part of the tree, since there is no way it can beat AB".]

BMA – Example
[Figure: posterior model probabilities (PMP = Posterior Model Probability) for the best models; scored on holdout data, BMA wins.]

Ensemble Methods – Boosting

Boosting
• A different approach to model ensembles – mostly for classification.
• Observed: when model predictions are not highly correlated, combining them does well.
• Big idea: can we fit models specifically to the "difficult" parts of the data?

Boosting – Algorithm (AdaBoost.M1, from HTF p. 339)
1. Initialize the observation weights $w_i = 1/N$.
2. For $m = 1, \dots, M$:
   (a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.
   (b) Compute the weighted error $\mathrm{err}_m = \sum_i w_i\, I(y_i \ne G_m(x_i)) \,/\, \sum_i w_i$.
   (c) Compute $\alpha_m = \log\big((1 - \mathrm{err}_m)/\mathrm{err}_m\big)$.
   (d) Re-weight: $w_i \leftarrow w_i \exp\big(\alpha_m\, I(y_i \ne G_m(x_i))\big)$.
3. Output $G(x) = \mathrm{sign}\big(\sum_m \alpha_m G_m(x)\big)$.

Example
[Figures: boosting iterations on a toy classification problem – courtesy M. Littman.]

Boosting – Advantages
• Fast algorithms – AdaBoost.
• Flexible – can work with any classification algorithm.
• The individual models don't have to be good.
  – In fact, the method works best with bad models!
  – (Bad = slightly better than random guessing.)
  – The most common base model: "boosted stumps" (single-split trees).

Boosting Example
[Figure: boosting example from HTF p. 302.]

Ensemble Methods – Bagging / Stacking

Bagging for Combining Classifiers
• Bagging = Bootstrap aggregating.
• Big idea: to avoid overfitting a specific dataset, fit models to "bootstrapped" random sets of the data.
• Bootstrap: a random sample, with replacement, from the data set, with the size of the sample equal to the size of the data:
  – X = (1,2,3,4,5,6,7,8,9,10)
  – B1 = (1,2,3,3,4,5,6,6,7,8)
  – B2 = (1,1,1,1,2,2,2,5,6,8)
  – …
• Bootstrap samples have the same statistical properties as the original data.
• By creating similar datasets you can see how much stability there is in your data. If there is a lack of stability, averaging helps.

Bagging
• Training data set of size N.
• Generate B "bootstrap" sampled data sets, each of size N.
• Build B models (e.g., trees), one for each bootstrap sample.
  – The intuition is that the bootstrapping "perturbs" the data enough to make the models more resistant to true variability.
  – Note: only ~63% of the data points appear in any one bootstrap sample, so the rest can be used as an out-of-sample estimate!
• For prediction, combine the predictions from the B models by voting or averaging, with the "out-of-bag" samples providing an error estimate.
  – Plus: generally improves accuracy for models such as trees.
  – Minus: you lose interpretability.
• A sketch of the procedure follows.
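A minimal sketch of bagging with classification trees, assuming scikit-learn for the base learner; the function names (`bagged_trees`, `majority_vote`, `oob_error`) are mine, and integer class labels 0..K−1 are assumed:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, seed=0):
    """Fit B trees, each on a bootstrap sample: n rows drawn with
    replacement, so each sample is the same size as the data."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models, in_bag = [], []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # bootstrap indices
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
        in_bag.append(set(idx.tolist()))
    return models, in_bag

def majority_vote(models, X, n_classes):
    """Combine the B predictions by voting."""
    votes = np.zeros((len(X), n_classes))
    for m in models:
        votes[np.arange(len(X)), m.predict(X)] += 1
    return votes.argmax(axis=1)

def oob_error(models, in_bag, X, y, n_classes):
    """Out-of-bag error: each point is judged only by the trees whose
    bootstrap sample did not contain it (~37% of the trees per point)."""
    votes = np.zeros((len(X), n_classes))
    for m, bag in zip(models, in_bag):
        oob = np.array([i for i in range(len(X)) if i not in bag])
        if len(oob) == 0:
            continue
        votes[oob, m.predict(X[oob])] += 1
    scored = votes.sum(axis=1) > 0  # points left out by at least one tree
    return np.mean(votes.argmax(axis=1)[scored] != y[scored])
```

Usage: `models, bags = bagged_trees(X, y)` followed by `oob_error(models, bags, X, y, n_classes)`; the out-of-bag error plays the role of the free holdout estimate described above.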
Bagging Example
[Figure: HTF bagging example, p. 285.]

Ensemble Methods – Random Forests

Random Forests
• Trees are great, but:
  – As we've seen, they are "unstable".
  – Trees are also sensitive to the primary split, which can lead the tree in inappropriate directions.
  – One way to see this: fit a tree on a random sample, or on a bootstrapped sample, of the data.
[Figure: example of tree instability, from G. Ridgeway, 2003.]

Random Forests
• Solution: random forests – an ensemble of decision trees.
• Similar to bagging: inject randomness to overcome instability.
  – Each tree is built on a random subset of the training data (a bootstrapped version of the data).
  – At each split point, only a random subset of the predictors is considered.
  – Use the "out-of-bag" holdout sample to estimate the size of each tree.
  – The prediction is simply the majority vote of the trees (or the mean prediction of the trees).
• Randomizing the variables used is the key – it reduces the correlation between models!
• Has the advantages of trees, with more robustness and a smoother decision rule. (A minimal code sketch appears after the references.)
[Figure: HTF example, p. 589, and results from Breiman, Leo (2001), "Random Forests".]

Random Forests – How Big a Tree?
• Breiman's original algorithm said: "to keep bias low, trees are to be grown to maximum depth".
• However, empirical evidence typically shows that "stumps" do best.

Ensembles – Main Points
• Averaging models together has been shown to be effective for prediction.
• Many weird names: see the papers by Leo Breiman (e.g. "Bagging Predictors", "Arcing the Edge", and "Random Forests") for more detail.
• Key points:
  – Models average well if they are uncorrelated.
  – You can inject randomness to ensure uncorrelated models.
  – Averaging small models works better than averaging large ones.
• Ensembles can also give more insight into the variables than a single tree: variables that show up again and again must be good.

Visualizing Forests
[Figures: visualizations of a random forest fit to the Wisconsin Breast Cancer data – courtesy S. Urbanek.]

References
• Random Forests, from Leo Breiman himself.
• Breiman, Leo (2001). "Random Forests". Machine Learning 45(1), 5–32.
• Hastie, Tibshirani, Friedman (HTF), The Elements of Statistical Learning – Chapters 8, 10, 15, 16.
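Appendix (not part of the original slides): a minimal random-forest run sketched with scikit-learn, on the same Wisconsin Breast Cancer data used in the visualizations above. `max_features` is the knob that randomizes the predictors considered at each split, and `oob_score=True` reuses the out-of-bag rows as the free holdout estimate; the specific settings are illustrative, not the lecture's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Wisconsin Breast Cancer data (as in the "Visualizing Forests" slides)
X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=500,      # B trees, each grown on a bootstrap sample
    max_features="sqrt",   # random subset of predictors at each split: the key ingredient
    oob_score=True,        # score each point only with trees that never saw it
    random_state=0,
).fit(X, y)

print("Out-of-bag accuracy:", round(rf.oob_score_, 3))

# "Variables that show up again and again must be good":
names = load_breast_cancer().feature_names
top = sorted(zip(rf.feature_importances_, names), reverse=True)[:5]
for score, name in top:
    print(f"{name}: {score:.3f}")
```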