Random aggregated and bagged ensembles of SVMs: an empirical bias-variance analysis

Giorgio Valentini
e-mail: [email protected]
DSI – Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano
MCS 2004 - Multiple Classifier Systems, Cagliari, 9-11 June 2004

Goals
• Developing methods and procedures to estimate the bias-variance decomposition of the error in ensembles of learning machines.
• A quantitative evaluation of the variance reduction property in random aggregated and bagged ensembles (Breiman, 1996).
• A characterization of the bias-variance (BV) decomposition of the error in bagged and random aggregated ensembles of SVMs, comparing the results with the BV decomposition in single SVMs (Valentini and Dietterich, 2004).
• Getting insights into the reasons why the ensemble method Lobag (Valentini and Dietterich, 2003) works.
• Getting insights into the reasons why random subsampling techniques work with large data mining problems (Breiman, 1999; Chawla et al., 2002).

Random aggregated ensembles
Let $D = \{(x_j, t_j)\}$, $1 \leq j \leq m$, be a set of $m$ samples drawn identically and independently from a population $U$ according to $P$, where $P(x, t)$ is the joint distribution of the data points in $U$. Let $\mathcal{L}$ be a learning algorithm, and define $f_D = \mathcal{L}(D)$ as the predictor produced by $\mathcal{L}$ applied to a training set $D$. The model produces a prediction $f_D(x) = y$.
Suppose that a sequence of learning sets $\{D_k\}$ is given, each drawn i.i.d. from the same underlying distribution $P$. Breiman proposed to aggregate the $f_{D_k}$ trained with different samples drawn from $U$ to get a better predictor $f_A(x, P)$. For classification problems $t_j \in S \subset \mathbb{N}$, and
$$f_A(x, P) = \arg\max_j \, |\{k \mid f_{D_k}(x) = j\}|.$$
As the training sets $D_k$ are randomly drawn from $U$, we name the procedure to build $f_A$ random aggregating.

Random aggregation reduces variance
Considering regression problems, if $T$ and $X$ are random variables having joint distribution $P$, the expected squared loss $EL$ for the single predictor $f_D(X)$ is:
$$EL = E_D[E_{T,X}[(T - f_D(X))^2]]$$
while the expected squared loss $EL_A$ for the aggregated predictor is:
$$EL_A = E_{T,X}[(T - f_A(X))^2]$$
Breiman showed that $EL \geq EL_A$. This inequality depends on the instability of the predictions, that is, on how unequal the two sides of the following relation are:
$$E_D[f_D(X)]^2 \leq E_D[f_D^2(X)]$$
There is a strict relationship between the instability and the variance of the base predictor. Indeed, the variance $V(X)$ of the base predictor is:
$$V(X) = E_D[(f_D(X) - E_D[f_D(X)])^2] = E_D[f_D^2(X)] - E_D[f_D(X)]^2$$
Breiman also showed that in classification problems, as in regression, aggregating "good" predictors can lead to better performance, as long as the base predictors are good enough; aggregating poor predictors can, on the contrary, worsen it.
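To make the definition of $f_A$ concrete, here is a minimal sketch of random aggregating by majority voting. It is not the authors' NEURObjects C++ code: scikit-learn SVMs and a large synthetic pool standing in for the population U are illustrative assumptions.

```python
# A minimal sketch (assumptions: scikit-learn SVMs and a synthetic pool
# standing in for the population U; not the authors' NEURObjects code).
# It builds the random aggregated predictor f_A by majority vote over base
# SVMs, each trained on a fresh sample of size m drawn directly from U.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# "Population" U (a large labeled pool) and a separate test set.
X_pool, y_pool = make_classification(n_samples=20000, n_features=20, random_state=0)
X_test, y_test = X_pool[15000:], y_pool[15000:]
X_pool, y_pool = X_pool[:15000], y_pool[:15000]

rng = np.random.default_rng(0)
m, n_base = 100, 50                      # sample size and number of base learners
preds = np.empty((n_base, len(y_test)), dtype=int)
for k in range(n_base):
    idx = rng.choice(len(y_pool), size=m, replace=False)   # D_k drawn from U
    f_k = SVC(kernel="rbf", gamma="scale").fit(X_pool[idx], y_pool[idx])
    preds[k] = f_k.predict(X_test)

# f_A(x) = arg max_j |{k : f_Dk(x) = j}|  (majority vote, two classes here)
y_agg = (preds.mean(axis=0) >= 0.5).astype(int)
print("single SVM error:", np.mean(preds[0] != y_test))
print("random aggregated error:", np.mean(y_agg != y_test))
```

Replacing the draws from U with bootstrap draws from a single training set D (sampling the indices of D with replacement) turns the same scheme into bagging, which is the approximation discussed in the next slide.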
How much does the variance reduction property hold for bagging too?
Breiman theoretically showed that random aggregating reduces variance. Bagging is an approximation of random aggregating, for at least two reasons:
1. Bootstrap samples are not "real" data samples: they are drawn from a data set D, which is in turn a sample from the population U. On the contrary, $f_A$ uses samples drawn directly from U.
2. Bootstrap samples are drawn from D according to a uniform probability distribution, which is only an approximation of the unknown true distribution P.
Two questions follow:
1. Does the variance reduction property hold for bagging too?
2. Can we provide a quantitative estimate of variance reduction both in random aggregating and bagging?

A quantitative estimate of the bias-variance decomposition of the error in random aggregated (RA) and bagged ensembles of learning machines
• We developed procedures to quantitatively evaluate the bias-variance decomposition of the error according to Domingos' unified bias-variance theory (Domingos, 2000).
• We proposed three basic techniques (Valentini, 2003):
1. Out-of-bag or cross-validation estimates (when only small samples are available)
2. Hold-out techniques (when relatively large data sets are available)
• In order to get a reliable estimate of the error we applied the second technique, evaluating the bias-variance decomposition using quite large test sets.
• We summarize here the two main experimental steps to perform bias-variance analysis with resampling-based ensembles (a sketch of the decomposition step, for two-class problems, is given after the Lobag slide below):
1. Procedures to generate data for ensemble training
2. Bias-variance decomposition of the error on a separate test set

Procedure to generate training samples for bagged ensembles
Procedure to generate training samples for random aggregated ensembles
[The pseudocode of the two sampling procedures appeared as figures in the original slides.]

Procedure to estimate the bias-variance decomposition of the error in ensembles of learning machines
[The pseudocode of the estimation procedure appeared as a figure in the original slides.]

Comparison of the bias-variance decomposition of the error in random aggregated (RA) and bagged ensembles of SVMs on 7 two-class classification problems
[Plots for Gaussian and linear kernels appeared here in the original slides.]
• Results represent changes relative to single SVMs (e.g., zero change means no difference). Lines labeled with squares refer to random aggregated ensembles, lines labeled with triangles to bagged ensembles.
• In random aggregated ensembles the error decreases from 15 to 70% w.r.t. single SVMs, while in bagged ensembles the error decreases from 0 to 15%, depending on the data set.
• Variance is significantly reduced in RA ensembles (about 90%), while in bagging the variance reduction is quite limited compared to the RA decrement (between 0 and 35%). No substantial bias reduction is registered.

Characterization of the bias-variance decomposition of the error in random aggregated ensembles of SVMs (Gaussian kernel)
[Bias-variance curves appeared here in the original slides.]
• Lines labeled with crosses: single SVMs
• Lines labeled with triangles: RA SVM ensembles

Lobag works when unbiased variance is relatively high
• Lobag (low bias bagging) is a variant of bagging that uses low-bias base learners selected through bias-variance analysis procedures (Valentini and Dietterich, 2003).
• Our experiments with bagging show the reasons why Lobag works: bagging lowers variance, but the bias remains substantially unchanged. Hence, by selecting low-bias base learners, Lobag reduces both bias (through bias-variance analysis) and variance (through classical aggregation techniques).
• Valentini and Dietterich experimentally showed that Lobag is effective, with SVMs as base learners, when small samples are used, that is, when the variance due to the reduced cardinality of the available data is relatively high. But when we have relatively large data sets, we may expect that Lobag does not outperform bagging, because in this case, on average, the unbiased variance will be relatively low.
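Lobag relies on the bias-variance estimate described in the procedure slides above. Below is a minimal sketch of that hold-out estimate (Domingos, 2000; zero-one loss, two-class case) combined with the Lobag idea of selecting the lowest-bias base learner before bagging. The synthetic data, the hyperparameter grid, and the use of scikit-learn are illustrative assumptions, not the authors' NEURObjects implementation.

```python
# A minimal sketch (not the authors' NEURObjects code): Domingos-style
# bias-variance decomposition on a hold-out test set for two-class problems,
# used here to pick a low-bias SVM as in Lobag. Data set and hyperparameter
# grid are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

def bias_variance_zero_one(preds, t):
    """preds: (n_models, n_test) class labels; t: (n_test,) true labels.
    Returns (avg_error, bias, net_variance) under Domingos' decomposition
    for zero-one loss, two-class case."""
    # Main prediction: majority vote over models trained on resampled sets.
    votes = (preds == 1).mean(axis=0)
    y_main = np.where(votes >= 0.5, 1, 0)
    bias_x = (y_main != t).astype(float)            # 0/1 bias per test point
    var_x = (preds != y_main).mean(axis=0)          # variance per test point
    v_unbiased = np.mean(var_x * (bias_x == 0))     # variance on unbiased points
    v_biased = np.mean(var_x * (bias_x == 1))       # variance on biased points
    bias = bias_x.mean()
    avg_error = (preds != t).mean()
    net_variance = v_unbiased - v_biased            # avg_error = bias + net_variance
    return avg_error, bias, net_variance

# Illustrative data: a "small sample" for training plus a large separate test set.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

rng = np.random.default_rng(0)
n_models, best = 30, None
for gamma in [0.001, 0.01, 0.1, 1.0]:               # candidate base learners
    preds = np.empty((n_models, len(y_test)), dtype=int)
    for i in range(n_models):
        idx = rng.integers(0, len(y_train), len(y_train))   # bootstrap sample
        preds[i] = SVC(kernel="rbf", gamma=gamma).fit(X_train[idx], y_train[idx]).predict(X_test)
    err, bias, net_var = bias_variance_zero_one(preds, y_test)
    print(f"gamma={gamma}: error={err:.3f} bias={bias:.3f} net_var={net_var:.3f}")
    if best is None or bias < best[0]:
        best = (bias, gamma)
print("Lobag-style choice: lowest-bias gamma =", best[1])
```

In this two-class, zero-one loss setting the average error of the base learners equals the bias plus the net variance (unbiased minus biased variance), which is the decomposition reported in the comparison slides above.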
Why do random subsampling techniques work with large databases?
• Breiman proposed random subsampling techniques for classification in large databases, using decision trees as base learners (Breiman, 1999), and these techniques have also been successfully applied in distributed environments (Chawla et al., 2002).
• Random aggregating can also be interpreted as a technique that draws small subsamples from a large population to train the base learners and then aggregates them, e.g. by majority voting.
• Our experiments on random aggregated ensembles show that the variance component of the error is strongly reduced, while the bias remains unchanged or is lowered, giving insights into the reasons why random subsampling techniques work with large data mining problems. In particular, our experimental analysis suggests applying SVMs trained on small subsamples when large databases are available or when the data are fragmented across distributed systems (a small sketch of this scheme is given after the references below).

Conclusions
• We showed how to apply bias-variance decomposition techniques to the analysis of bagged and random aggregated ensembles of learning machines.
• These techniques have been applied to the analysis of bagged and random aggregated ensembles of SVMs, but they can be directly applied to a large set of ensemble methods.*
• The experimental analysis shows that random aggregated ensembles significantly reduce the variance component of the error w.r.t. single SVMs, but this property only partially holds for bagged ensembles.
• The empirical bias-variance analysis also gives insights into the reasons why Lobag works, highlighting on the other hand some limitations of the Lobag approach.
• The bias-variance analysis of random aggregated ensembles also highlights the reasons for their successful application to large-scale data mining problems.
* The C++ classes and applications to perform BV analysis are freely available at: http://homes.dsi.unimi.it/~valenti/sw/NEURObjects

References
• Breiman, L.: Bagging predictors. Machine Learning 24 (1996) 123-140
• Breiman, L.: Pasting Small Votes for Classification in Large Databases and On-Line. Machine Learning 36 (1999) 85-103
• Chawla, N., Hall, L., Bowyer, K., Moore, T., Kegelmeyer, W.: Distributed pasting of small votes. In: MCS 2002, Cagliari, Italy. Vol. 2364 of Lecture Notes in Computer Science, Springer-Verlag (2002) 52-61
• Domingos, P.: A Unified Bias-Variance Decomposition for Zero-One and Squared Loss. In: Proc. 17th National Conference on Artificial Intelligence, Austin, TX, AAAI Press (2000) 564-569
• Valentini, G., Dietterich, T.G.: Low Bias Bagged Support Vector Machines. In: ICML 2003, Washington D.C., USA, AAAI Press (2003) 752-759
• Valentini, G.: Ensemble methods based on bias-variance analysis. PhD thesis, DISI, Università di Genova, Italy (2003), ftp://ftp.disi.unige.it/person/ValentiniG/Tesi/finalversion/vale-th-2003-04.pdf
• Valentini, G., Dietterich, T.G.: Bias-variance analysis of Support Vector Machines for the development of SVM-based ensemble methods. Journal of Machine Learning Research (2004) (accepted for publication)
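As a follow-up to the random subsampling slide above, here is a minimal sketch of a Breiman-style "pasting small votes" scheme, in which linear SVMs are trained on small random bites of a large data set (or on separate partitions in a distributed setting) and combined by majority vote. The data set, the bite size, and the use of scikit-learn are illustrative assumptions, not taken from the original experiments.

```python
# A minimal sketch (illustrative assumptions: scikit-learn, synthetic data,
# bite size of 500): pasting small votes in the style of Breiman (1999).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

# A "large" database and a held-out test set.
X, y = make_classification(n_samples=100_000, n_features=30, random_state=0)
X_big, y_big, X_test, y_test = X[:90_000], y[:90_000], X[90_000:], y[90_000:]

rng = np.random.default_rng(0)
bite_size, n_bites = 500, 40
votes = np.zeros(len(y_test))
for _ in range(n_bites):
    # Each base SVM sees only a small random "bite" of the large data set;
    # in a distributed setting the bites would live on different nodes.
    idx = rng.choice(len(y_big), size=bite_size, replace=False)
    clf = LinearSVC().fit(X_big[idx], y_big[idx])
    votes += clf.predict(X_test)

y_vote = (votes / n_bites >= 0.5).astype(int)    # majority vote over the bites
print("error of the small-vote ensemble:", np.mean(y_vote != y_test))
```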