Random aggregated and bagged
ensembles of SVMs:
an empirical bias-variance analysis
Giorgio Valentini
e-mail: [email protected]
DSI – Dipartimento di Scienze dell'Informazione
Università degli Studi di Milano
MCS 2004 - Multiple Classifier Systems, Cagliari 9-11 June 2004
Goals
• Developing methods and procedures to estimate the bias-variance decomposition of the error in ensembles of learning machines.
• A quantitative evaluation of the variance reduction property in random aggregated and bagged ensembles (Breiman, 1996).
• A characterization of the bias-variance (BV) decomposition of the error in bagged and random aggregated ensembles of SVMs, comparing the results with the BV decomposition in single SVMs (Valentini and Dietterich, 2004).
• Getting insights into the reasons why the ensemble method Lobag (Valentini and Dietterich, 2003) works.
• Getting insights into the reasons why random subsampling techniques work with large data mining problems (Breiman, 1999; Chawla et al., 2002).
Random aggregated ensembles
Let D = {(x_j, t_j)}, 1 ≤ j ≤ m, be a set of m samples drawn independently and identically from a population U according to P, where P(x, t) is the joint distribution of the data points in U.
Let L be a learning algorithm, and define f_D = L(D) as the predictor produced by L applied to a training set D. The model produces a prediction f_D(x) = y.
Suppose that a sequence of learning sets {D_k} is given, each drawn i.i.d. from the same underlying distribution P.
Breiman proposed to aggregate the f_{D_k} trained with different samples drawn from U to get a better predictor f_A(x, P).
For classification problems t_j ∈ S ⊂ N, and f_A(x, P) = arg max_j |{k | f_{D_k}(x) = j}|.
As the training sets D_k are randomly drawn from U, we name the procedure to build f_A random aggregating.
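As an illustration only (not the author's NEURObjects implementation), here is a minimal Python sketch of random aggregating by majority voting; train_base and draw_sample are hypothetical callables standing for the learning algorithm L and for drawing a fresh i.i.d. sample from U.

```python
import numpy as np
from collections import Counter

def random_aggregate_predict(predictors, X_test):
    """Aggregate trained predictors by majority vote:
    f_A(x) = arg max_j |{k : f_{D_k}(x) = j}|."""
    votes = np.array([f(X_test) for f in predictors])      # shape (n_predictors, n_test)
    return np.array([Counter(col).most_common(1)[0][0]     # most voted label per test point
                     for col in votes.T])

def random_aggregate(train_base, draw_sample, n_predictors, X_test):
    """Train each base predictor on a fresh sample D_k drawn i.i.d. from U
    (train_base and draw_sample are hypothetical callables)."""
    predictors = [train_base(*draw_sample()) for _ in range(n_predictors)]
    return random_aggregate_predict(predictors, X_test)
```

Bagging uses the same aggregation step, but replaces draw_sample with bootstrap resampling from a single available data set D.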
Random aggregation reduces variance
Considering regression problems, if T and X are random variables having joint distribution P, the expected squared loss EL for the single predictor f_D(X) is:
EL = E_D[E_{T,X}[(T - f_D(X))^2]]
while the expected squared loss EL_A for the aggregated predictor is:
EL_A = E_{T,X}[(T - f_A(X))^2]
Breiman showed that EL ≥ EL_A. This inequality depends on the instability of the predictions, that is, on how unequal the two sides of the following equation are:
E_D[f_D(X)]^2 ≤ E_D[f_D^2(X)]
There is a strict relationship between the instability and the variance of the base predictor. Indeed, the variance V(X) of the base predictor is:
V(X) = E_D[(f_D(X) - E_D[f_D(X)])^2] = E_D[f_D^2(X)] - E_D[f_D(X)]^2
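Making this step explicit (a standard derivation, not spelled out on the slide): in regression the aggregated predictor is f_A(X) = E_D[f_D(X)], so expanding the squared losses gives
EL = E_{T,X}[T^2] - 2 E_{T,X}[T E_D[f_D(X)]] + E_X[E_D[f_D^2(X)]]
EL_A = E_{T,X}[T^2] - 2 E_{T,X}[T f_A(X)] + E_X[E_D[f_D(X)]^2]
The first two terms coincide, hence
EL - EL_A = E_X[E_D[f_D^2(X)] - E_D[f_D(X)]^2] = E_X[V(X)] ≥ 0,
so the loss reduction achieved by aggregating equals the average variance of the base predictor.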
Breiman also showed that, in classification as in regression, aggregating "good" predictors can lead to better performance, while aggregating poor predictors can worsen the results.
How much does the variance reduction property hold for bagging too?
Breiman theoretically showed that random aggregating reduces variance.
Bagging is an approximation of random aggregating, for at least two reasons:
1. Bootstrap samples are not "real" data samples: they are drawn from a data set D, which is in turn a sample from the population U. On the contrary, f_A uses samples drawn directly from U.
2. Bootstrap samples are drawn from D according to a uniform probability distribution, which is only an approximation of the unknown true distribution P.
This raises two questions:
1. Does the variance reduction property hold for bagging too?
2. Can we provide a quantitative estimate of the variance reduction both in random aggregating and bagging?
A quantitative estimate of the bias-variance decomposition of the error in random aggregated (RA) and bagged ensembles of learning machines
• We developed procedures to quantitatively evaluate the bias-variance decomposition of the error according to Domingos' unified bias-variance theory (Domingos, 2000).
• We proposed three basic techniques (Valentini, 2003):
1. Out-of-bag or cross-validation estimates (when only small samples are available)
2. Hold-out techniques (when relatively large data sets are available)
• In order to get a reliable estimate of the error, we applied the second technique, evaluating the bias-variance decomposition using quite large test sets.
• We summarize here the two main experimental steps to perform bias-variance analysis with resampling-based ensembles:
1. Procedures to generate data for ensemble training
2. Bias-variance decomposition of the error on a separate test set
Procedure to generate training samples
for bagged ensembles
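A minimal Python sketch of this step (illustrative names only, not the author's NEURObjects code): each training set is a bootstrap replicate drawn with replacement from the single available data set D, so it only approximates a fresh sample from U.

```python
import numpy as np

def bootstrap_training_sets(X, t, n_sets, seed=0):
    """Generate n_sets bootstrap replicates of the single data set D = (X, t):
    |D| examples drawn uniformly *with replacement* from D."""
    rng = np.random.default_rng(seed)
    m = len(t)
    for _ in range(n_sets):
        idx = rng.integers(0, m, size=m)   # uniform sampling with replacement from D
        yield X[idx], t[idx]
```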
Procedure to generate training samples
for random aggregated ensembles
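Correspondingly, a hedged sketch for random aggregated ensembles: each training set is a small subsample drawn directly from a large data set that plays the role of the population U (again, names are illustrative).

```python
import numpy as np

def subsampled_training_sets(X_pop, t_pop, n_sets, sample_size, seed=0):
    """Draw n_sets small training sets directly from the large 'population'
    data set (X_pop, t_pop), simulating i.i.d. samples from U
    (no replacement within each subsample)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_sets):
        idx = rng.choice(len(t_pop), size=sample_size, replace=False)
        yield X_pop[idx], t_pop[idx]
```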
Procedure to estimate the bias-variance decomposition of the error in ensembles of learning machines
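As a rough, simplified sketch of this estimation step (not the exact procedure used in the paper), the following computes Domingos-style bias, unbiased, biased and net variance for the 0/1 loss from the predictions of an ensemble of models, each trained on a different resampled training set, evaluated on a common test set.

```python
import numpy as np

def domingos_bias_variance_01(predictions, targets):
    """Estimate Domingos' bias-variance decomposition for the 0/1 loss
    (two-class case, label noise ignored).

    predictions : array (n_models, n_test) of predicted labels, one row per
                  model trained on a different resampled training set
    targets     : array (n_test,) of true labels on the separate test set
    """
    def mode(col):                                   # main prediction: most voted label
        vals, counts = np.unique(col, return_counts=True)
        return vals[np.argmax(counts)]

    main_pred = np.array([mode(col) for col in predictions.T])
    bias = (main_pred != targets).astype(float)          # per-point bias in {0, 1}
    variance = (predictions != main_pred).mean(axis=0)   # P(prediction != main prediction)

    avg_bias = bias.mean()
    unbiased_var = (variance * (1 - bias)).mean()    # variance where the main prediction is correct
    biased_var = (variance * bias).mean()            # variance where it is wrong (reduces the error)
    net_var = unbiased_var - biased_var
    avg_error = (predictions != targets).mean()      # equals avg_bias + net_var when noise is ignored
    return avg_bias, net_var, unbiased_var, biased_var, avg_error
```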
Comparison of bias-variance decomposition of the error in random aggregated
(RA) and bagged ensembles of SVMs on 7 two-class classification problems
(Plots not reproduced: results for Gaussian kernels and for linear kernels.)
• Results represent changes relative to single SVMs (e.g., zero change means no difference). Square-labeled lines refer to random aggregated ensembles, triangle-labeled lines to bagged ensembles.
• In random aggregated ensembles the error decreases from 15% to 70% w.r.t. single SVMs, while in bagged ensembles the error decreases from 0% to 15%, depending on the data set.
• Variance is significantly reduced in RA ensembles (by about 90%), while in bagging the variance reduction is quite limited compared to the RA decrement (between 0 and 35%). No substantial bias reduction is registered.
Characterization of the bias-variance decomposition of the error in random aggregated ensembles of SVMs (Gaussian kernel)
• Lines labeled with crosses: single SVMs
• Lines labeled with triangles: RA SVM ensembles
Lobag works when the unbiased variance is relatively high
• Lobag (Low bias bagging) is a variant of bagging that uses low-bias base learners selected through bias-variance analysis procedures (Valentini and Dietterich, 2003).
• Our experiments with bagging show why Lobag works: bagging lowers the variance, but the bias remains substantially unchanged. Hence, by selecting low-bias base learners, Lobag reduces both bias (through bias-variance analysis) and variance (through classical aggregation techniques); a sketch of the idea follows this list.
• Valentini and Dietterich experimentally showed that Lobag is effective, with SVMs as base learners, when small-sized samples are used, that is, when the variance due to the reduced cardinality of the available data is relatively high. But when we have relatively large data sets, we may expect that Lobag does not outperform bagging (because in this case, on average, the unbiased variance will be relatively low).
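A minimal sketch of the Lobag idea described above, with hypothetical callables for bias estimation and bagging; this is a simplified reading, not the exact procedure of Valentini and Dietterich (2003).

```python
def lobag(candidate_params, estimate_bias, train_bagged_ensemble, X, t):
    """Low bias bagging, as a sketch: choose the base-learner hyper-parameters
    with the lowest estimated bias, then bag base learners trained with them.

    estimate_bias(params, X, t)         -> estimated bias of the base learner
                                           (e.g. from an out-of-bag bias-variance analysis)
    train_bagged_ensemble(params, X, t) -> bagged ensemble built with those parameters
    """
    best = min(candidate_params, key=lambda p: estimate_bias(p, X, t))
    return train_bagged_ensemble(best, X, t)
```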
Why do random subsampling techniques work with large databases?
• Breiman proposed random subsampling techniques for classification in large databases, using decision trees as base learners (Breiman, 1999), and these techniques have also been successfully applied in distributed environments (Chawla et al., 2002).
• Random aggregating can also be interpreted as a technique to draw small subsamples from a large population to train the base learners and then aggregate them, e.g. by majority voting.
• Our experiments on random aggregated ensembles show that the variance component of the error is strongly reduced, while the bias remains unchanged or is lowered, giving insights into the reasons why random subsampling techniques work with large data mining problems. In particular, our experimental analysis suggests applying SVMs trained on small subsamples when large databases are available or when they are fragmented in distributed systems.
Conclusions
• We showed how to apply bias-variance decomposition techniques to the analysis of bagged and random aggregated ensembles of learning machines.
• These techniques have been applied to the analysis of bagged and random aggregated ensembles of SVMs, but they can be directly applied to a large set of ensemble methods.*
• The experimental analysis shows that random aggregated ensembles significantly reduce the variance component of the error w.r.t. single SVMs, but this property only partially holds for bagged ensembles.
• The empirical bias-variance analysis also gives insights into the reasons why Lobag works, highlighting on the other hand some limitations of the Lobag approach.
• The bias-variance analysis of random aggregated ensembles also highlights the reasons for their successful application to large-scale data mining problems.
* The C++ classes and applications to perform BV analysis are freely available at:
http://homes.dsi.unimi.it/~valenti/sw/NEURObjects
References
• Breiman, L.: Bagging predictors. Machine Learning 24 (1996) 123-140
• Breiman, L.: Pasting small votes for classification in large databases and on-line. Machine Learning 36 (1999) 85-103
• Chawla, N., Hall, L., Bowyer, K., Moore, T., Kegelmeyer, W.: Distributed pasting of small votes. In: MCS 2002, Cagliari, Italy. Vol. 2364 of Lecture Notes in Computer Science, Springer-Verlag (2002) 52-61
• Domingos, P.: A unified bias-variance decomposition for zero-one and squared loss. In: Proc. 17th National Conference on Artificial Intelligence, Austin, TX, AAAI Press (2000) 564-569
• Valentini, G., Dietterich, T.G.: Low bias bagged support vector machines. In: ICML 2003, Washington D.C., USA, AAAI Press (2003) 752-759
• Valentini, G.: Ensemble methods based on bias-variance analysis. PhD thesis, DISI, Università di Genova, Italy (2003), ftp://ftp.disi.unige.it/person/ValentiniG/Tesi/finalversion/vale-th-2003-04.pdf
• Valentini, G., Dietterich, T.G.: Bias-variance analysis of Support Vector Machines for the development of SVM-based ensemble methods. Journal of Machine Learning Research (2004) (accepted for publication)