Solving Complex Machine Learning Problems
with Ensemble Methods
ECML/PKDD 2013 Workshop
Prague, 27 September 2013
Edited by
Ioannis Katakis
Daniel Hernández-Lobato
Gonzalo Martínez-Muñoz
Ioannis Partalas
National and Kapodistrian University of Athens
Universidad Autónoma de Madrid
Université Joseph Fourier, Grenoble
COPEM
2013
Preface
Ensemble methods have experienced huge growth within the machine learning community as a consequence of their generalization performance and robustness. In particular, ensemble learning provides a straightforward way of generally improving the performance of single learners. Thus, multiple classifier systems represent a solution worth considering in applications where high predictive performance is strictly required. The emphasis of the COPEM workshop was on discussing ensemble strategies that are not limited to supervised classification but can be used to solve difficult and general machine learning problems. The workshop brought together members of the ensemble methods community as well as researchers from other fields who could benefit from using such techniques to address interesting research challenges. More precisely, the goals of the COPEM workshop that were successfully achieved include: a) the discussion of state-of-the-art approaches that exploit ensembles to solve complex machine learning problems and, b) bringing the community together to exchange views and opinions on future research lines and applications in ensemble learning, and to initiate new collaborations towards new challenges.
COPEM was held in Prague, Czech Republic, on September 27th 2013 as
a workshop of the European Conference on Machine Learning and Principles
and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013). A
total of 22 candidate papers was submitted for evaluation. Paper submissions went through an exhaustive peer-review process. All papers received at least two independent reviews, and some papers even three. From the total
number of submitted papers, 11 papers were accepted for presentation at the
workshop. In order to produce an interesting scientific program, the organizing
committee made the selection of accepted papers by taking into consideration
both the comments and the numerical scores provided by the reviewers, giving
special attention to borderline papers. The review process was handled by 17
members of the program committee and, additionally, 9 external reviewers were
recruited.
The program of the workshop was divided into four sessions of 1 hour and 30 minutes each. Each session contained at most 4 presentations, with 22 minutes allotted to each presentation. Furthermore, the last session included some time used to start a discussion about potential uses of ensemble methods to address difficult machine learning problems and to give the conclusions of the workshop. Lastly, the program featured an invited talk by Prof.
Pierre Dupont (Université catholique de Louvain). This talk focused on the use
of ensemble methods for the robust identification of biomarkers.
A topic analyzed with interest in the COPEM workshop was the use of ensemble methods to identify relevant features or attributes for prediction. The types of learning problems considered for this task included supervised, semi-supervised and multi-label learning. Additionally, traditional methods for identifying relevant features using ensembles were revisited to further
explore their statistical properties and their stability in the feature selection
process. The utility of ensemble methods to carry out other difficult learning
tasks such as software reliability prediction or image classification was also covered. Finally, different meta-learners were analyzed as potential solutions for combining several human expert opinions. These methods were shown to improve
the prediction performance over the best single human expert.
The second stream of papers covered many interesting topics and applications related to ensemble methods. An interesting application presented was
anomaly detection. In particular, multiple, weak detectors can be used in order
to meet the various requirements of this application (e.g. low training complexity, on-line training, detection accuracy). The workshop also introduced a new generalization of bagging, called Neighbourhood Balanced Bagging, in which the sampling probabilities of examples are modified according to the class distribution in their neighbourhoods. Prototype Support Vector Machines is a new approach that trains an ensemble of linear SVMs that are tuned to different regions of the feature space. Additionally, an empirical comparison of supervised ensemble learning approaches was presented, including well-known methods such as Boosting, Bagging and Random Forests. Finally, we had a presentation on clustering ensembles, which is probably the least discussed topic in the ensemble
literature.
Acknowledgements We would like to thank all the authors who submitted
papers to the workshop as well as all the participants. Additionally, we would
like to sincerely thank the invited speaker Prof. Pierre Dupont. We are also
grateful to the program committee members as well as to the external reviewers
for their indispensable help during the review period and for providing high-quality reviews in a very short period of time. We should also acknowledge the ECML/PKDD workshop chairs, Andrea Passerini and Niels Landwehr, for their help and excellent cooperation. Ioannis Katakis is funded by the EU INSIGHT project (FP7-ICT 318225). Daniel Hernández-Lobato and Gonzalo Martínez-Muñoz acknowledge financial support from the Spanish
Dirección General de Investigación, project TIN2010-21575-C02-02.
Prague, September 2013
Ioannis Katakis
Daniel Hernández-Lobato
Gonzalo Martínez-Muñoz
Ioannis Partalas
Workshop Organization
Workshop Organizers
Ioannis Katakis            National and Kapodistrian University of Athens (Greece)
Daniel Hernández-Lobato    Universidad Autónoma de Madrid (Spain)
Gonzalo Martínez-Muñoz     Universidad Autónoma de Madrid (Spain)
Ioannis Partalas           Université Joseph Fourier (France)
Program Committee
Massih-Reza Amini              University Joseph Fourier (France)
Alberto Suárez                 Universidad Autónoma de Madrid (Spain)
José Miguel Hernández-Lobato   University of Cambridge (United Kingdom)
Christian Steinruecken         University of Cambridge (United Kingdom)
Luis Fernando Lago             Universidad Autónoma de Madrid (Spain)
Jérôme Paul                    Université catholique de Louvain (Belgium)
Grigorios Tsoumakas            Aristotle University of Thessaloniki (Greece)
Eric Gaussier                  University Joseph Fourier (France)
Alexandre Aussem               University Claude Bernard Lyon 1 (France)
Lior Rokach                    Ben-Gurion University of the Negev (Israel)
Dimitrios Gunopulos            National and Kapodistrian University of Athens (Greece)
Ana M. González                Universidad Autónoma de Madrid (Spain)
Johannes Fürnkranz             TU Darmstadt (Germany)
Indre Zliobaite                Aalto University (Finland)
José Dorronsoro                Universidad Autónoma de Madrid (Spain)
Rohit Babbar                   University Joseph Fourier (France)
Jesse Read                     Universidad Carlos III de Madrid (Spain)
External Reviewers
Aris Kosmopoulos        NCSR “Demokritos” (Greece)
Antonia Saravanou       National and Kapodistrian University of Athens (Greece)
Bartosz Krawczyk        Wroclaw University of Technology (Poland)
Newton Spolaôr          Aristotle University of Thessaloniki (Greece)
Nikolas Zygouras        National and Kapodistrian University of Athens (Greece)
Dimitrios Kotsakos      National and Kapodistrian University of Athens (Greece)
George Tzanis           Aristotle University of Thessaloniki (Greece)
Dimitris Kotzias        National and Kapodistrian University of Athens (Greece)
Efi Papatheocharous     Swedish Institute of Computer Science (Sweden)
Sponsors
Contents

1  Invited talk: Robust biomarker identification with ensemble feature selection methods ... 9
   Pierre Dupont
2  Local Neighbourhood in Generalizing Bagging for Imbalanced Data ... 10
   Jerzy Błaszczyński, Jerzy Stefanowski and Marcin Szajek
3  Anomaly Detection by Bagging ... 25
   Tomáš Pevný
4  Efficient semi-supervised feature selection by an ensemble approach ... 41
   Mohammed Hindawi, Haytham Elghazel and Khalid Benabdeslem
5  Feature ranking for multi-label classification using predictive clustering trees ... 56
   Dragi Kocev, Ivica Slavkov and Sašo Džeroski
6  Identification of Statistically Significant Features from Random Forests ... 69
   Jérôme Paul, Michel Verleysen and Pierre Dupont
7  Prototype Support Vector Machines: Supervised Classification in Complex Datasets ... 81
   April Shen and Andrea Danyluk
8  Software Reliability prediction via two different implementations of Bayesian model averaging ... 95
   Alex Sarishvili and Gerrit Hanselmann
9  Multi-Space Learning for Image Classification Using AdaBoost and Markov Random Fields ... 110
   Wenrong Zeng, Xue-Wen Chen, Hong Cheng and Jing Hua
10 An Empirical Comparison of Supervised Ensemble Learning Approaches ... 123
   Mohamed Bibimoune, Haytham Elghazel and Alex Aussem
11 Clustering Ensemble on Reduced Search Spaces ... 139
   Sandro Vega-Pons and Paolo Avesani
12 An Ensemble Approach to Combining Expert Opinions ... 153
   Hua Zhang, Evgueni Smirnov, Nikolay Nikolaev, Georgi Nalbantov and Ralf Peeters
Robust Biomarker Identification with Ensemble
Feature Selection Methods
(Invited Talk)
Pierre Dupont
Université catholique de Louvain, Belgium
[email protected]
Abstract. Biomarker identification is an important topic in biomedical applications of computational biology, including applications such
as gene selection from high dimensional data produced by microarray or RNA-seq technologies. From a machine learning and statistical viewpoint, such identification is a feature selection problem, typically from
thousands of potentially relevant dimensions and only a few dozen
samples. In such a context, the lack of stability of the feature selection
process is often problematic as the list of selected biomarkers may largely
vary for only marginal fluctuations of the data samples.
We describe in this talk simple yet effective ways to increase the robustness of such selection through ensemble methods and various aggregation mechanisms to build a consensus list from an ensemble of lists. The
increased robustness is key for subsequent biological validation of the
selected markers. We first describe selection methods that are embedded
in the estimation of an ensemble of support vector machines (SVMs).
SVMs are powerful classification models that have shown state-of-the-art performance on several diagnosis and prognosis tasks on biological
data. Their feature selection extensions also offer good results for gene
selection tasks. We show that the robustness of SVMs for biomarker
discovery can be substantially increased by using ensemble feature selection techniques, while at the same time improving their classification
performance.
We also briefly discuss some alternative ensemble mechanisms extending
univariate methods. Those methods are simpler and less computationally
intensive than embedded approaches. When based on a simple statistical test, such as a paired t-test or a Wilcoxon rank test, they also offer
a convenient way to weight each candidate feature with its associated
p-value and to construct the consensus list accordingly.
We conclude our talk by stressing the well-known risk of selection bias
and how such risk can be limited through appropriate estimation procedures.
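As a concrete illustration of the aggregation mechanisms mentioned in this abstract, the sketch below builds a consensus feature ranking by averaging univariate t-test rankings over bootstrap resamples. It is only an illustrative sketch under our own assumptions (binary 0/1 class labels, a two-sample t-test, mean-rank aggregation), not the speaker's actual procedure.

```python
import numpy as np
from scipy.stats import ttest_ind

def consensus_ranking(X, y, n_bootstraps=50, seed=0):
    """Aggregate univariate t-test rankings over bootstrap resamples (Borda-style)."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    rank_sum = np.zeros(d)
    for _ in range(n_bootstraps):
        idx = rng.randint(0, n, n)                    # bootstrap resample of the samples
        Xb, yb = X[idx], y[idx]
        _, pval = ttest_ind(Xb[yb == 0], Xb[yb == 1], axis=0)
        pval = np.nan_to_num(pval, nan=1.0)           # degenerate features get the worst p-value
        rank_sum += np.argsort(np.argsort(pval))      # rank 0 = smallest p-value
    return np.argsort(rank_sum)                       # consensus list, most relevant first

# usage (hypothetical data): top10 = consensus_ranking(X, y)[:10]
```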
Local Neighbourhood in Generalizing Bagging
for Imbalanced Data
Jerzy Błaszczyński, Jerzy Stefanowski and Marcin Szajek
Institute of Computing Sciences, Poznań University of Technology,
60–965 Poznań, Poland
{jerzy.blaszczynski, jerzy.stefanowski, marcin.szajek}@cs.put.poznan.pl
Abstract. Bagging ensembles specialized for class imbalanced data are
considered. We show that difficult distributions of the minority class can
be handled by analyzing the content of the local neighbourhood of examples. First, we introduce a new generalization of bagging, called Neighbourhood Balanced Bagging, where sampling probabilities of examples
are modified according to the class distribution in their neighbourhoods.
Experiments show that it is competitive with other extensions of bagging.
Finally, we demonstrate that assessing types of the minority examples
based on the analysis of their neighbourhoods could help in explaining
why some ensembles work better for imbalanced data than others.
1
Introduction
One of the sources of difficulties while constructing accurate classifiers is class
imbalance in data. This difficulty manifests itself by the fact that one of the
target classes is significantly less numerous than the other classes. This problem
often occurs in many applications, and constitutes a difficulty for most learning
algorithms. As a result, many classifiers are biased toward the majority classes and
fail to recognize examples from the minority class.
The class imbalance problem has received a growing research interest in the
last decade and a number of specialized methods have been proposed, see their
review [7]. In general, they are categorized into data-level and algorithm-level approaches.
Methods from the first category are based on pre-processing, and they transform
the original data distribution into a more balanced one. The simplest methods are:
random over-sampling, which replicates examples from the minority class, and
random under-sampling, which randomly eliminates examples from the majority classes until a required degree of balance between classes is reached. The
more informative methods, e.g., SMOTE, introduce additional synthetic examples according to the internal characteristics of regions around examples from the
minority class [4]. The methods from the second category are classifier dependent
ones. They also include ensembles of classifiers. However, the standard methods
of ensemble construction are oriented toward improving the overall classification
accuracy, and they do not sufficiently improve the recognition of the minority class.
The new proposed ensembles usually include either integrating pre-processing
methods before learning component classifiers or embedding the cost-sensitive
framework in the ensemble learning process, see their review in [5].
Although several specialized ensembles have been presented as adequate for class imbalance, there is still a lack of general comparisons or discussion of their areas of competence. To the best of our knowledge, only two comprehensive
studies were carried out in different experimental frameworks [5, 10]. The main
conclusions from the comparative study [5] were that simpler versions of using
under-sampling or SMOTE inside ensembles worked better than more complex
solutions. Then, experiments with a few of the best boosting and bagging generalizations over noisy and imbalanced data sets showed that bagging outperformed boosting. In our previous study we experimentally compared the main bagging variants for class imbalance and also observed that under-sampling bagging
performed much better than variants with over-sampling [2]. In particular, the
Roughly Balanced Bagging [8] achieved the best results.
We keep our interest in bagging extensions and identify two tasks to be undertaken. The first task is to seek a hypothesis for why under-sampling bagging works better than over-sampling variants. The other task concerns an attempt to construct yet another ensemble, more similar to over-sampling the
minority class, leading to performance closer to the Roughly Balanced Bagging.
Our new view is to depart from the simple integration of pre-processing with an unchanged bagging sampling technique. Instead of using equal probabilities for each example in bootstrap sampling, we want to change the probabilities of drawing examples, focusing the sampling toward the minority class and, additionally, toward examples located in difficult sub-regions of the minority class.
While considering the probability of each example to be drawn we propose
to analyze class distributions in the local neighbourhood of the minority example [13]. Depending on the distribution of examples from the majority class
in this neighbourhood, we can evaluate whether the example is likely to be safe or unsafe (difficult) to learn. This approach is inspired by our earlier positive experience with studying the "nature" of imbalanced data, where such local characteristics were successfully modeled with the k-nearest neighbourhood [13].
To sum up, the main contributions of our study are the following. The first
aim is to introduce a new extension of bagging for class imbalance, where the
probability of selecting an example into the bootstrap sample is influenced by
the analysis of the class distribution in a local neighbourhood of the example.
The new proposal is compared against existing extensions over several data sets.
Then, the second aim is to use the same type of analysis to explain how contents
of bootstrap samples affect performance of the Roughly Balanced Bagging and
the proposed new extension.
2
Related Works
Due to space limits, we briefly discuss only the most related works. The reader is
referred to [5] for the most comprehensive review of current ensembles addressing
the class imbalance. Below we discuss extensions of bagging.
Recall that Breiman's original bagging [3] is based on bootstrap aggregation, where the training set for each classifier is constructed by uniform random
sampling (with replacement) instances from the original training set (usually
keeping the size of the original data). Then, T component classifiers are induced
by the same learning algorithm from these T bootstrap samples. Their predictions form the final decision with equal-weight majority voting. However,
bootstrap samples are still biased toward the majority class. Most proposals
overcome this drawback by using pre-processing techniques, which change the
balance between classes in bootstraps.
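For concreteness, a minimal sketch of the bootstrap-aggregation scheme just described, with a scikit-learn decision tree standing in for the C4.5/J4.8 learner used later in the paper (that substitution is our assumption, not the authors' setup):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagging(X, y, T=50, seed=0):
    """Breiman-style bagging: T trees, each fit on a uniform bootstrap sample."""
    rng = np.random.RandomState(seed)
    n = len(y)
    ensemble = []
    for _ in range(T):
        idx = rng.randint(0, n, n)                   # n draws with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def predict_bagging(ensemble, X):
    """Equal-weight majority voting (labels assumed to be non-negative integers)."""
    votes = np.stack([clf.predict(X) for clf in ensemble]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```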
In Underbagging approaches the number of the majority class examples in
each bootstrap sample is randomly reduced to the cardinality of the minority
class (Nmin). In the simplest proposal, Exactly Balanced Bagging (EBBag), while creating each training bootstrap sample, the entire minority class is simply copied
and combined with randomly chosen subsets of the majority class to exactly
balance cardinalities between classes. The base classifiers and their aggregation
are constructed as in the standard bagging.
The Roughly Balanced Bagging (RBBag) [8] results from the critique of the
EBBag. Instead of fixing the constant sample size, it equalizes the sampling
probability of each class. For each of T iterations the size of the majority class
in the final bootstrap sample (Smaj ) is determined probabilistically according to
the negative binomial distribution. Then, Nmin examples are drawn from the
minority class and Smaj examples are drawn from the entire majority class using
bootstrap sampling as in the standard bagging (with or without replacement).
The class distribution inside the bootstrap samples may be slightly imbalanced
and varies over iterations. According to [8] this approach is more consistent
with the nature of the original bagging and performs better than EBBag. In our
experiments on a larger collection of data [2], both RBBag and EBBag achieved
quite similar results for the sensitivity measure while RBBag was slightly better
than EBBag for G-mean and F-measure.
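A sketch of how one RBBag bootstrap sample can be formed; the negative binomial parameterization (Nmin required successes, probability 0.5) is our reading of [8], and minority examples are drawn with replacement here:

```python
import numpy as np

def rbbag_bootstrap(min_idx, maj_idx, rng):
    """Indices of one Roughly Balanced Bagging sample."""
    n_min = len(min_idx)
    # The majority sample size varies between iterations: negative binomial with
    # n_min required successes and p = 0.5, so both classes have equal expected weight.
    s_maj = rng.negative_binomial(n_min, 0.5)
    sample_min = rng.choice(min_idx, size=n_min, replace=True)
    sample_maj = rng.choice(maj_idx, size=s_maj, replace=True)
    return np.concatenate([sample_min, sample_maj])

# Exactly Balanced Bagging instead copies the whole minority class and draws
# exactly n_min majority examples for every bootstrap sample.
# usage: rng = np.random.default_rng(0); sample = rbbag_bootstrap(min_idx, maj_idx, rng)
```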
Another way to transform bootstrap samples includes over-sampling the minority class before training classifiers. In this way the number of minority examples is increased in each bootstrap sample while the majority class is not reduced
as in underbagging. This idea is realized with different over-sampling techniques.
We present two approaches further used in experiments.
Overbagging is the simplest version, which applies over-sampling to transform each bootstrap sample: minority class examples are sampled with replacement until their cardinality exactly balances that of the majority class, while majority examples are sampled with replacement as in the original bagging.
Another approach is used in SMOTEBagging to increase diversity of component classifiers. First, SMOTE is used instead of random over-sampling of the
minority class. Then, the SMOTE resampling rate (α) is changed stepwise in each iteration from small to high values (e.g., from 10% to 100%). This ratio defines the number of minority examples (α × Nmin) to be additionally re-sampled in each iteration. A quite similar trick is also used to construct bootstrap samples in the "from underbagging to overbagging" ensemble.
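A corresponding sketch for the over-sampling family (Overbagging), with the SMOTEBagging variant noted only in a comment; the details are a plausible reading of the description above, not the original implementations:

```python
import numpy as np

def overbagging_bootstrap(min_idx, maj_idx, rng):
    """Indices of one Overbagging sample: both classes end up with N_maj examples."""
    n_maj = len(maj_idx)
    sample_maj = rng.choice(maj_idx, size=n_maj, replace=True)   # as in plain bagging
    sample_min = rng.choice(min_idx, size=n_maj, replace=True)   # random over-sampling
    return np.concatenate([sample_min, sample_maj])

# SMOTEBagging replaces the random minority over-sampling with SMOTE and raises
# the resampling rate alpha stepwise over the iterations (e.g. 10% ... 100%) to
# increase the diversity of the component classifiers.
```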
3
Neighbourhood Balanced Bagging for Imbalanced Data
The proposed extension of bagging stems from the results of studying the sources of difficulties in learning from imbalanced classes. The high imbalance ratio between the cardinalities of the minority and majority class is not the only, and not even the main, reason for these difficulties. Other, as we call them, data factors, which
characterize class distributions, are also influential. The experimental studies, as
e.g. [9], demonstrate that the degradation of classification performance is linked
to the decomposition of the minority class into many sub-parts containing very
few examples. This means that the minority class does not form a homogeneous,
compact distribution of the target concept but it is scattered into many smaller
sub-clusters surrounded by majority examples (these correspond to small disjuncts, which are harder to learn and contribute more to classification errors than larger sub-concepts). Other factors related to the class distribution
(occurring together with class rarity) concern the effect of strong overlapping between the classes [6] or the presence of many isolated minority examples inside the majority class regions [12].
We follow studies such as [11, 12], where the data factors are linked to different types of examples making up the minority class distribution. The authors differentiate between safe and unsafe examples. Safe examples are ones located in the
homogeneous regions populated by examples from one class only. Other examples are unsafe and more difficult for learning. Unsafe examples are categorized
into borderline (placed close to the decision boundary between classes), rare
cases (isolated groups of few examples located deeper inside the opposite class),
or outliers. The appropriate treatment of these types of minority examples within
pre-processing methods should lead to improved classifiers, e.g., as has been done by Stefanowski inside the informed pre-processing method SPIDER [14].
The question is how to identify these types of examples. In [13], it is achieved
by analyzing the class distribution inside a local neighbourhood of the considered
example, which is modeled by k-nearest neighbour examples. The distance between the examples is calculated according to the HVDM metric (Heterogeneous
Value Difference Metric) [16]. Then, the number of neighbours from the opposite
class indicates how safe or unsafe the considered example is (see [13] for details).
Inspired by the positive results of [14, 13], we will exploit characteristics of
the local neighbourhood in a different quantitative way. The result is a new modification of bagging, which is called Neighbourhood Balanced Bagging (NBBag).
The idea behind NBBag is to focus the sampling process toward those minority examples which are hard to learn (i.e., unsafe ones), while at the same time decreasing the probability of selecting examples from the majority class. Recall that the idea of changing sampling probabilities has been considered in our
previous work with applying bagging to noisy data and improving the overall
accuracy [1]. Here, we postulate another strategy to change bootstrap samples.
It is carried out through a conjunction of sampling modifications at two levels: global and local.
At the first, global level, we attempt to increase the chance of drawing the minority examples with respect to the imbalance ratio in the original data. We implement this by changing the probability of sampling majority examples. More precisely, we first set the probability p1min of sampling each minority example to 1. Then, we downscale the probability p1maj of sampling a majority example to Nmin/Nmaj, where Nmin and Nmaj are the numbers of examples in the minority and majority class in the original data, respectively. Intuitively, this could refer to the situation
where minority and majority classes contain examples of the same type, e.g., safe
ones, and the class distributions are not affected by other data factors. Thus,
this simple modification of probabilities exploits information about the global
between-class imbalance. It should lead to bootstrap samples with approximately
globally balanced class cardinalities.
However, experimental studies [2, 5, 10] show that the global balancing used in overbagging (somewhat similar to our global level) is not competitive with other extensions of bagging. Moreover, most of the studied imbalanced data sets contain many unsafe minority examples while the majority classes comprise rather safe ones, see results in [12]. When focusing more closely on the local characteristics of the minority class, one should treat the types of unsafe examples differently; earlier successful experiments with pre-processing methods such as SPIDER [14] or generalizations of SMOTE such as Borderline-SMOTE (see its description in [7]) pointed out that safe minority examples could be over-sampled less than borderline or other unsafe ones.
The second, local level is intended to shift the sampling of minority examples toward those unsafe examples that are harder to learn, which is identified by analyzing their k-nearest neighbours. This level can be modeled in different ways, having in mind the following rule: the more unsafe an example is, the more its probability of being drawn is amplified. This is partly inspired by earlier successful experience with informed pre-processing methods. The modification rule can use either a linear or a non-linear function. In this study we use the formula L2min, defined as:

    L2min = (N'maj)^ψ / k ,    (1)

where N'maj is the number of examples in the neighbourhood which belong to the majority class; ψ is an exponential scaling factor, which in the default case of
a linear modification is set to 1. The value ψ may be increased if one wants to
strengthen the role of rare cases and outliers in bootstraps. This increase may
correspond to data sets where the minority class distribution in the original data
is scattered into many rare cases or outliers, and the number of safe examples
is significantly limited (see exemplary data, e.g. balance-scale, in the further
experiments – section 4).
The formula L2min requires re-scaling, as it may lead to a probability equal to 0 for completely safe examples, i.e., for N'maj = 0. We propose to re-formulate it as:

    β × (L2min + 1) ,    (2)
where β is a technical coefficient referring to drawing a completely safe example.
Intuitively, safe examples from both minority and majority classes should have
the same probability of being selected into bootstraps. Setting β to 0.5 keeps this intuition. Adding the number "1" corresponds to a normalization of sampling
probabilities inside the conjunctive combination, if one expects that pmin ∈ [0, 1].
Then, we hypothesize that examples from the majority class are, by default, not balanced at the second level, which is reflected by L2maj = 0. The intuition behind this hypothesis is that examples from the majority class are more likely to be safe. Even when this is false for some data, it is still quite apparent that amplifying rare or outlying majority examples at this level would increase the difficulty of learning classifiers in the minority class regions disrupted by them.
Finally, local and global levels are combined by a multiplication. This combination could correspond to the independence assumption, i.e. the distribution
of examples in the neighbourhood is independent from the global distribution
of examples in the whole data set. This leads us to the final formulations of the
probability of selecting minority and majority classes, respectively as:
    pmin = p1min × β(L2min + 1) = p1min × 0.5(L2min + 1) = 0.5(L2min + 1),    (3)

    pmaj = p1maj × β(L2maj + 1) = p1maj × 0.5 = (Nmin/Nmaj) × 0.5,    (4)

resulting from L2maj = 0 and the default β set to 0.5.
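The following sketch computes the NBBag sampling weights from Eqs. (1)-(4) as reconstructed above and draws a weighted bootstrap of the original size. Euclidean k-NN replaces the HVDM metric, and the use of normalized weights in a single weighted draw is our interpretation, not the authors' WEKA code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nbbag_weights(X, y, minority_label, k=7, psi=1.0):
    """Sampling weights of NBBag following Eqs. (1)-(4); Euclidean neighbours
    are used here instead of the HVDM metric of the paper."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                            # idx[:, 0] is the example itself
    is_min = (y == minority_label)
    n_min, n_maj = int(is_min.sum()), int((~is_min).sum())

    weights = np.empty(len(y))
    for i in range(len(y)):
        if is_min[i]:
            n_maj_local = int(np.sum(~is_min[idx[i, 1:]]))   # N'_maj among the k neighbours
            l2 = (n_maj_local ** psi) / k                    # Eq. (1)
            weights[i] = 0.5 * (l2 + 1.0)                    # Eq. (3), p1_min = 1, beta = 0.5
        else:
            weights[i] = 0.5 * n_min / n_maj                 # Eq. (4), L2_maj = 0
    return weights

def nbbag_bootstrap(weights, rng):
    """One bootstrap sample drawn with probabilities proportional to the weights."""
    p = weights / weights.sum()
    return rng.choice(len(weights), size=len(weights), replace=True, p=p)
```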
4
Experiments
The first part of the experiments is an evaluation of the newly proposed NBBag, while the others concern using local neighbourhood analysis to assess the types of examples and studying the contents of bootstrap samples.
4.1
Evaluation of Bagging Extensions
First, we compare performance of NBBag with existing extensions of bagging. As
a baseline for this comparison we use balanced bagging (BBag), i.e., a variant which attempts to globally balance the cardinalities of the majority and minority classes in bootstrap samples (this is achieved by using only the first, "global" level of NBBag, which decreases the probability of the majority examples according to the imbalance ratio). Following our earlier study [2], we chose Roughly Balanced
Bagging (RBBag) as the best under-sampling extension. As our approach is
more similar to over-sampling, we also consider: Overbagging (OverBag) and
SMOTEBagging (SMOBag).
All implementations are done within the WEKA framework. Component classifiers in all bagging variants are learned with the C4.5 tree learning algorithm (J4.8), using standard parameters except that pruning is disabled. For all bagging variants, we tested the following numbers T of component classifiers: 20, 50 and 100. Due to space limits, we present detailed results for T = 50 only. Results for
other T lead to similar general conclusions. In the case of SMOBag, we used 5 neighbours and the oversampling ratio α was changed stepwise in each sample starting from 10%. In NBBag we tested different sizes of the neighbourhood with k = 5, 7, 9 and 11. The best value depends on the particular data set; however, using 7 neighbours is best in the case of linear amplification (ψ = 1). This option is further denoted as 7NBBag. As the minority classes in some data sets are composed mainly of rare examples and outliers, we also considered the other variant of increasing the selection of examples with ψ = 2. For this variant the best results are obtained with a slightly smaller number of neighbours, equal to 5, so it will be denoted as 5NBBag2.
We conduct our analysis on 20 real-world data sets representing different
domains, sizes and imbalance ratios. Most of the data sets come from the UCI repository and have been used in other works on class imbalance. Two data sets, abdominal pain and scrotal pain, come from our medical applications. For data sets
with more than two classes, we chose the smallest one as a minority class and
combined the other classes into one majority class. Their characteristics are presented in Table 1, where IR is the imbalance ratio defined as Nmaj/Nmin.
Table 1. Data characteristics

Data set         # examples  # attributes  Minority class      IR
abdominal pain   723         13            positive            2.58
balance-scale    625         4             B                   11.76
breast-cancer    286         9             recurrence-events   2.36
breast-w         699         9             malignant           1.90
car              1728        6             good                24.04
cleveland        303         13            3                   7.66
cmc              1473        9             2                   3.42
credit-g         1000        20            bad                 2.33
ecoli            336         7             imU                 8.60
flags            194         29            white               10.41
haberman         306         4             2                   2.78
hepatitis        155         19            1                   3.84
ionosphere       351         34            b                   1.79
new-thyroid      215         5             2                   5.14
postoperative    90          8             S                   2.75
scrotal pain     201         13            positive            2.41
solar-flareF     1066        12            F                   23.79
transfusion      748         4             1                   3.20
vehicle          846         18            van                 3.25
yeast-ME2        1484        8             ME2                 28.10
The performance of bagging ensembles is measured using: the sensitivity of the minority class (the minority class accuracy), its specificity (the accuracy of recognizing the majority classes), their aggregation into the geometric mean (G-mean), and the F-measure (referring to the minority class, and used with equal weights "1" as-
signed to precision and recall). For their definitions see, e.g. [7]. These measures
are estimated with the stratified 10-fold cross-validation repeated several times
to reduce the variance. The average values of G-mean and sensitivity are presented in Tables 2 and 3, respectively. The differences between classifier average
results will also be analyzed using either the Friedman or the Wilcoxon statistical test (with a standard significance level of 0.05). In all these tables the last row contains average ranks calculated as in the Friedman test; the lower the average rank, the better the classifier.
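For reference, the evaluation measures used here can be computed as follows; this is a straightforward sketch for a binary minority/majority setting, not code from the paper.

```python
import numpy as np

def imbalance_metrics(y_true, y_pred, minority=1):
    """Sensitivity, specificity, G-mean and F-measure of the minority class."""
    pos = (y_true == minority)
    tp = np.sum(pos & (y_pred == minority))
    fn = np.sum(pos & (y_pred != minority))
    fp = np.sum(~pos & (y_pred == minority))
    tn = np.sum(~pos & (y_pred != minority))
    sensitivity = tp / (tp + fn)                      # accuracy on the minority class
    specificity = tn / (tn + fp)                      # accuracy on the majority classes
    g_mean = np.sqrt(sensitivity * specificity)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_measure = (2 * precision * sensitivity / (precision + sensitivity)
                 if (precision + sensitivity) else 0.0)
    return sensitivity, specificity, g_mean, f_measure
```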
Table 2. G-mean [%] of compared bagging ensembles

Data set         BBag   SMOBag  OverBag  RBBag  7NBBag  5NBBag2
abdominal pain   79.04  80.85   79.44    79.99  80.26   80.82
balance-scale    19.74  0.00    1.40     58.12  47.36   61.07
breast-cancer    60.60  52.57   56.17    58.62  59.32   56.53
breast-w         96.11  95.88   96.23    96.13  96.21   96.14
car              96.21  95.26   95.29    97.09  96.80   96.98
cleveland        51.02  25.03   22.77    72.14  58.06   65.75
cmc              61.12  57.74   59.95    64.86  62.81   64.33
credit-g         65.87  80.68   71.75    86.89  66.48   66.94
ecoli            84.41  58.38   51.42    72.91  86.52   86.74
flags            61.16  62.48   64.30    65.84  61.46   61.46
haberman         61.22  60.02   58.11    65.92  62.24   48.65
hepatitis        74.92  68.47   72.16    80.23  78.19   75.33
ionosphere       90.88  90.30   90.47    90.25  90.76   89.95
new-thyroid      95.35  95.18   95.36    97.15  96.73   97.02
pima             74.51  72.33   73.54    74.59  74.18   72.30
scrotal pain     73.41  70.42   72.01    74.17  72.29   71.42
solar-flareF     64.97  55.04   58.07    84.91  66.80   71.13
transfusion      67.33  63.96   64.83    67.39  64.98   39.56
vehicle          95.13  94.34   94.61    94.77  95.49   95.91
yeast-ME2        63.59  59.41   59.70    84.37  69.35   74.86
avg. rank        3.65   5.0     4.35     1.95   2.83    3.23
The results of the Friedman tests (with CD = 1.69) reveal that, for both G-mean and sensitivity, RBBag, 7NBBag, and 5NBBag2 are significantly better than the rest of the classifiers, with no significant differences among them. Still, we can give some more detailed observations.
For G-mean, RBBag is the best classifier according to the average ranks (see Table 2). It is also significantly better than several classifiers according to the Wilcoxon test, although its difference from 7NBBag is not significant. The worst classifier with respect to G-mean is SMOBag. Although it is a more complex approach using the informed SMOTE method, one can notice that the much simpler OverBag and BBag give better evaluation measures.
Analyzing the recognition of the minority examples, i.e., the sensitivity measure in Table 3, the best performing classifier is 5NBBag2 with respect to the average
Table 3. Sensitivity [%] of compared bagging ensembles

Data set         BBag   SMOBag  OverBag  RBBag   7NBBag  5NBBag2
abdominal pain   75.54  71.57   74.22    80.99   78.51   80.99
balance-scale    4.90   0.00    0.67     65.67   35.51   72.45
breast-cancer    54.12  34.35   44.91    59.81   59.88   66.71
breast-w         96.02  95.02   95.98    96.41   96.47   96.35
car              94.20  92.54   92.62    100.00  95.36   95.80
cleveland        29.14  17.22   16.11    77.22   39.71   54.57
cmc              50.93  40.05   46.47    66.80   57.12   66.61
credit-g         61.27  71.67   60.83    90.28   67.53   73.93
ecoli            77.14  55.00   66.67    78.33   82.00   84.29
flags            82.35  45.89   52.89    68.56   82.94   82.94
haberman         56.30  49.81   49.86    61.34   69.51   87.28
hepatitis        65.62  54.44   62.78    84.17   73.44   69.38
ionosphere       85.56  83.70   84.70    86.00   87.38   87.94
new-thyroid      92.57  92.22   93.06    97.50   95.43   96.00
pima             74.70  65.13   67.38    78.09   80.56   85.07
scrotal pain     69.83  58.56   65.89    75.78   70.34   71.86
solar-flareF     47.44  37.33   42.17    87.33   50.70   58.84
transfusion      61.46  51.53   56.54    69.83   72.64   92.08
vehicle          94.77  92.14   93.46    96.48   95.63   96.48
yeast-ME2        41.57  39.11   39.11    91.56   49.80   59.22
avg. rank        4.05   5.78    5.08     1.95    2.52    1.62
ranks. However, according to the Wilcoxon test, its difference from RBBag is not
significant. Again, the worst classifier in this comparison is SMOBag.
We do not show values of the F-measure, due to space limits. Nevertheless, these results indicate, similarly to the results for G-mean, that RBBag,
7NBBag, and 5NBBag2 are better than the other classifiers with no significant difference among them (e.g., the Wilcoxon test p-value is 0.3 when comparing the pair of best classifiers, RBBag and 7NBBag).
Looking more closely at the results in Tables 2 and 3, one can also notice that classifiers leading to high improvements in sensitivity also strongly deteriorate G-mean at the same time (this means that the recognition of the majority class is much worse). For example, see the transfusion data set, which contains many outliers (see Table 4): using ψ = 2 in the 5NBBag2 variant leads to the highest sensitivity (92.08%) and the worst G-mean (39.56%) among all compared classifiers. A similar trade-off also occurs for, e.g., the haberman data set and, in the case of RBBag, for, e.g., car. The linear amplification of the local probability of the minority class (ψ = 1) is the more conservative approach and could be used if one wants to improve the sensitivity while still keeping the accuracy on the majority classes at a sufficient level. However, tuning other intermediate ψ values between 1 and 2 could be a topic for further experiments.
Finally, notice that using the imbalance ratio to globally balance classes in bootstrap samples is not sufficient. Consider the results of BBag, which works sim-
ilarly to over-bagging. Taking into account information about the local neighbourhood of minority examples improves classification performance with respect
to all evaluation measures. To conclude, the introduction of local modifications of sampling probabilities inside the combination rule of NBBag may be the crucial element leading to the significantly better performance of this ensemble compared with all overbagging variants, as well as to making it competitive with RBBag.
4.2
Analyzing Data Characteristics and Bootstrap Samples
The aim of this part of the experiments is to learn more about the nature of the
best bagging extensions.
First, we want to study the class data characteristics of the considered data sets and to identify types of examples (recall their distinction in Section 3). Following the method introduced in [13], we propose to assign types of examples using
information about class labels in their k-nearest local neighbourhood.
In this analysis we will use k = 5, because k = 3 may poorly distinguish the
nature of examples, and k = 7 has led to quite similar decisions [13]. This choice
is also similar to the size of the neighbourhood used in NBBag. For the considered
example x and k = 5, the proportion of the number of neighbours from the
same class as x against neighbours from the opposite class can range from 5:0
(all neighbours are from the same class as the analyzed example x) to 0:5 (all
neighbours belong to the opposite class). Depending on this proportion, we assign
the type labels to the example x in the following way [13]: Proportions 5:0 or
4:1 inside the neighbourhood – the example x is labeled as a safe example (as it
is surrounded by examples from the same class); 3:2 or 2:3 – it is a borderline
example; 1:4 – it is interpreted as a rare case; 0:5 – it is an outlier. For higher
values of k such proportions could be interpreted in a similar way.
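The labeling rule for k = 5 can be written down directly; the sketch below uses Euclidean neighbours instead of HVDM and is meant only to make the thresholds explicit, not to reproduce the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def label_minority_types(X, y, minority_label, k=5):
    """Assign safe/borderline/rare/outlier labels to minority examples (k = 5)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                         # column 0 is the example itself
    types = {}
    for i in np.where(y == minority_label)[0]:
        same = int(np.sum(y[idx[i, 1:]] == minority_label))
        if same >= 4:
            types[i] = "safe"        # proportions 5:0 or 4:1
        elif same >= 2:
            types[i] = "borderline"  # 3:2 or 2:3
        elif same == 1:
            types[i] = "rare"        # 1:4
        else:
            types[i] = "outlier"     # 0:5
    return types
```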
The results of such labeling of the minority class examples are presented in
Table 4. The first observation is that many data sets contain rather a small
number of safe examples. The exceptions are three data sets composed of almost
only safe examples: breast-w, car, and flags. On the other hand, there are
data sets such as cleveland, balance-scale or solar-flareF, which do not
contain any safe examples. We carried out a similar neighbourhood analysis for the majority classes and made the contrary observation: nearly all data sets contain mainly safe majority examples (e.g. yeast-ME2: 98.5%, ecoli: 91.7%) and sometimes a limited number of borderline examples (e.g. balance-scale: 84.5% safe and 15.6% borderline examples). What is even more important, nearly all data sets contain no majority outliers and at most 2% rare examples. Thus, we can repeat conclusions similar to [13], saying that in most data
sets the minority class includes mainly difficult unsafe examples.
Then, one can observe that for safe data sets nearly all bagging extensions
achieve similarly high performance (see Tables 2-3 for breast-w, new-thyroid). A quite similar observation concerns data sets with a still high number of safe examples, limited borderline ones and no or nearly no rare cases or outliers; see, e.g., vehicle. On the other hand, strong differences between classifiers occur
for the most difficult data distributions with a limited number of safe minority
Table 4. Labeling minority examples expressed as a percentage of each type of examples occurring in this class

Data set         Safe    Border  Rare   Outlier
abdominal pain   61.39   23.76   6.93   7.92
balance-scale    0.00    0.00    8.16   91.84
breast-cancer    21.18   38.82   27.06  12.94
breast-w         91.29   7.88    0.00   0.83
car              47.83   47.83   0.00   4.35
cleveland        0.00    45.71   8.57   45.71
cmc              13.81   53.15   14.41  18.62
credit-g         15.67   61.33   12.33  10.67
ecoli            28.57   54.29   2.86   14.29
flags            100.00  0.00    0.00   0.00
haberman         4.94    61.73   18.52  14.81
hepatitis        18.75   62.50   6.25   12.50
ionosphere       44.44   30.95   11.90  12.70
new-thyroid      68.57   31.43   0.00   0.00
pima             29.85   56.34   5.22   8.58
scrotal pain     50.85   33.90   10.17  5.08
solar-flareF     2.33    41.86   16.28  39.53
transfusion      18.54   47.19   11.24  23.03
vehicle          74.37   24.62   0.00   1.01
yeast-ME2        5.88    47.06   7.84   39.22
examples. Furthermore, the best improvements of all evaluation measures for
RBBag or NBBag are observed for the unsafe data sets. For instance, consider
cleveland (no safe examples, nearly 50% outliers), where RBBag achieves a G-mean of 72% compared with 22.7% for overbagging. Similarly high improvements
occur for balance-scale (containing the highest number of outliers among all
data sets), where NBBag achieves 61.07% while OverBag only 1.4%. Similar situations
also occur for yeast-ME2, ecoli, haberman or solar-flare. We can conclude
that RBBag and NBBag strongly outperform other bagging extensions for the
most difficult data sets with large numbers of outliers or rare cases – sometimes
occurring with borderline examples.
In order to better understand these improvements achieved by RBBag and
NBBag, we perform the same neighbourhood analysis and label the types of minority examples inside their bootstrap samples. For each bootstrap sample we label the types of minority examples based on the class labels of the k-nearest neighbours. Then, we average the results over all bootstraps. The results of this labeling are presented in Table 5, with two rows per data set, one for each classifier. Due to space limits we present results for 7NBBag only, and we skip some safe data sets where there are no big changes in the distributions between the two variants of NBBag.
In our opinion these results reveal very interesting properties of both ensembles. Comparing Tables 4 and 5, notice that RBBag and NBBag strongly
change types of the minority class distributions into safer ones inside their bootstraps. For many data sets which originally contain high numbers of rare cases
Table 5. Distributions of types of the minority examples [in %] inside bootstrap samples for each classifier and data set

Data set         Classifier  Safe    Border  Rare   Outlier
abdominal pain   NBBag       65.14   21.5    7.41   5.96
                 RBBag       72.60   19.15   5.11   3.15
balance-scale    NBBag       59.23   20.52   5.63   14.62
                 RBBag       39.68   59.02   0.05   1.25
breast-cancer    NBBag       37.55   43.51   11.39  7.54
                 RBBag       35.56   52.82   7.52   4.10
breast-w         NBBag       93.44   6.04    0.40   0.12
                 RBBag       93.57   5.60    0.29   0.54
cleveland        NBBag       64.58   17.59   7.07   10.76
                 RBBag       42.86   53.33   0.44   3.37
cmc              NBBag       38.33   41.14   10.74  9.79
                 RBBag       42.47   50.96   3.13   3.44
credit-g         NBBag       36.07   49.21   7.47   7.26
                 RBBag       34.44   59.58   2.79   3.19
ecoli            NBBag       81.61   7.76    3.96   6.67
                 RBBag       85.33   11.75   0.00   2.92
flags            NBBag       100.00  0.00    0.00   0.00
                 RBBag       100.00  0.00    0.00   0.00
haberman         NBBag       33.37   46.82   9.63   10.19
                 RBBag       25.34   66.09   4.07   4.50
hepatitis        NBBag       65.89   23.78   4.38   5.95
                 RBBag       67.01   26.60   1.25   5.14
ionosphere       NBBag       94.96   3.62    0.49   0.93
                 RBBag       51.98   31.10   6.60   10.32
new-thyroid      NBBag       96.54   2.41    0.15   0.90
                 RBBag       90.83   9.17    0.00   0.00
scrotal pain     NBBag       62.92   25.24   6.57   5.27
                 RBBag       64.67   29.34   3.50   2.49
solar-flareF     NBBag       84.83   7.67    3.76   3.74
                 RBBag       70.52   21.37   2.69   5.43
transfusion      NBBag       35.71   46.43   9.44   8.42
                 RBBag       41.00   41.90   3.76   13.33
vehicle          NBBag       86.84   10.4    1.46   1.30
                 RBBag       89.80   10.20   0.00   0.00
yeast-ME2        NBBag       86.66   7.51    1.79   4.05
                 RBBag       64.31   34.29   0.22   1.18
or outliers, the transformed bootstrap samples now contain more safe examples.
For instance, consider the very difficult balance-scale data set (containing
originally 91.8% outliers), where RBBag creates bootstrap samples with at most
4% outliers and 7.5% rare cases while moving the rest of examples into safe and
borderline ones. A similar shift in example types can be observed for yeast-ME2 (originally 5% safe examples, now over 70%), solar-flareF, ecoli, ionosphere,
hepatitis, cleveland. Finally, one can notice that RBBag usually constructs
slightly safer data than NBBag.
Recall that the extensions of bagging known from the literature are based on the simple idea of balancing class distributions in bootstrap samples. However, our results indicate that transforming the distributions of examples into safer ones can be more influential. In the case of RBBag it could be connected with the strong filtering of majority class examples in each bootstrap sample. Notice that many data sets contain nearly 1000 examples with around 50 minority ones. For instance, the total number of examples in solar-flareF is 1066 while the minority class contains only 43 examples. The newly created bootstrap samples include only around 43 safe majority
examples and as a result most of the majority class examples (also reflecting
their original distribution) disappear. This can be interpreted as a kind of cleaning around the minority class examples, so that they become safer in their local neighbourhoods. Having such a transformed distribution in each sample can help
construct base classifiers, which are more biased toward the minority class. On
the other hand, the size of the learning set can be dramatically reduced. As a
result, some bootstrap samples may lead to weak classifiers, and this type of
ensemble may need more component classifiers than NBBag, which uses larger
bootstrap samples.
5
Discussion and Final Remarks
The difficulty of learning classifiers from imbalanced data comes from complex
distributions of the minority class. Besides the unequal class cardinalities, the
minority class is decomposed into smaller sub-parts, affected by strong overlapping, rare cases or outliers. In our study we attempt to capture these data
characteristics by analyzing the local neighbourhood of minority class examples.
Our main message is to show that this kind of local information can be useful both for proposing a new type of bagging and for explaining why some ensembles work better than others.
Our first contribution is the introduction of Neighbourhood Balanced Bagging, which is based on different principles than all known bagging extensions for class imbalance. First, instead of integrating bagging with pre-processing, we keep the standard bagging idea but radically change the probabilities of sampling examples, increasing the chance of drawing the more difficult minority examples. Furthermore, we amplify the role of difficult examples with respect to their local neighbourhood. The experimental results show that this proposal is significantly better than existing over-sampling generalizations of bagging and
it is competitive with Roughly Balanced Bagging (the best known under-sampling variant).
The other contribution is the use of local neighbourhood analysis to assess the types of examples in the data. It comes from the earlier research of Stefanowski and Napierała [13]; however, it is now applied in the context of ensembles, uncovering new characteristics of the studied ensembles.
First, the strongest differences between classifiers have been noticed for data
sets containing the most unsafe minority examples. Indeed, both RBBag and
NBBag ensembles have strongly outperformed all overbagging variants for such
data. Furthermore, the analysis of types of minority examples inside bootstrap
samples has clearly shown that RBBag and NBBag strongly change the data characteristics compared with the original data sets. Many examples from the minority class labeled as unsafe (in particular rare cases or outliers) are transformed into safer ones. This might be more influential for improving classification performance than the simple global class balancing previously considered in the literature and applied in many of the existing approaches to generalizing bagging.
References
1. Błaszczyński, J., Słowiński, R., Stefanowski, J.: Feature Set-based Consistency
Sampling in Bagging Ensembles. Proc. From Local Patterns To Global Models
(LEGO), ECML/PKDD Workshop, 19–35 (2009)
2. Błaszczyński, J., Stefanowski, J., Idkowiak, L.: Extending bagging for imbalanced
data. Proc. of the 8th CORES 2013, Springer Series on Advances in Intelligent
Systems and Computing 226, 269–278 (2013)
3. Breiman, L.: Bagging predictors. Machine Learning, 24 (2), 123–140 (1996)
4. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: Synthetic Minority
Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 341–378
(2002)
5. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H. Herrera, F.: A Review
on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part
C: Applications and Reviews, 99, 1–22 (2011)
6. Garcia, V., Sanchez, J.S., Mollineda, R.A.: An empirical study of the behaviour
of classifiers on imbalanced and overlapped data sets. In: Proc. of Progress in
Pattern Recognition, Image Analysis and Applications 2007, Springer, LNCS, vol.
4756, 397–406 (2007)
7. He, H., Garcia, E.: Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21 (9), 1263–1284 (2009)
8. Hido, S., Kashima, H.: Roughly balanced bagging for imbalanced data. Statistical
Analysis and Data Mining, 2 (5-6), 412–426 (2009)
9. Jo, T., Japkowicz, N.: Class Imbalances versus small disjuncts. ACM SIGKDD
Explorations Newsletter, 6 (1), 40–49 (2004)
10. Khoshgoftaar T., Van Hulse J., Napolitano A.: Comparing boosting and bagging
techniques with noisy and imbalanced data. IEEE Transactions on Systems, Man,
and Cybernetics–Part A, 41 (3), 552–568 (2011)
11. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proc. of Int. Conf. on Machine Learning ICML 97, 179–186 (1997)
12. Napierala, K., Stefanowski, J., Wilk, Sz.: Learning from Imbalanced Data in Presence of Noisy and Borderline Examples. In: Proc. of 7th Int. Conf. RSCTC 2010,
Springer, LNAI vol. 6086, pp. 158–167 (2010)
13. Napierała, K., Stefanowski, J.: The influence of minority class distribution on learning from imbalanced data. In: Proc. 7th Int. Conference HAIS 2012, Part II, LNAI
vol. 7209, Springer, pp. 139-150 (2012)
14. Stefanowski, J., Wilk, Sz.: Selective Pre-processing of Imbalanced Data for Improving Classification Performance. In: Proc. of 10th Int. Conference DaWaK 2008,
Springer Verlag, LNCS vol. 5182, 283-292 (2008)
15. Wang, S., Yao, X.: Diversity analysis on imbalanced data sets by using ensemble
models. In Proc. IEEE Symp. Comput. Intell. Data Mining, 324-331 (2009).
16. Wilson, D.R., Martinez, T.R.: Improved heterogeneous distance functions. Journal
of Artificial Intelligence Research, 6, 1–34 (1997)
Anomaly Detection by Bagging
Tomáš Pevný
[email protected]
Agent Technology Center, Department of Computers,
Czech Technical University in Prague
Abstract. Many contemporary domains, e.g. network intrusion detection, fraud detection, etc., call for an anomaly detector processing a
continuous stream of data. This need is driven by the high rate of their
acquisition, limited resources for storing them, or privacy issues. The
data can also be non-stationary, requiring the detector to continuously adapt to changes. A good detector for these domains should therefore have low training and classification complexity, an on-line training algorithm, and, of course, good detection accuracy.
This paper proposes a detector that tries to meet all these criteria. The detector consists of multiple weak detectors, each implemented as a one-dimensional histogram. The one-dimensional histogram was chosen because it can be efficiently created on-line and probability estimates can be efficiently retrieved from it. This construction gives the detector a training complexity that is linear in the input dimension, the number of samples, and the number of weak detectors. Similarly, the classification complexity is linear in the number of weak detectors and the input dimension.
The accuracy of the detector is compared with seven anomaly detectors from the prior art on a range of 36 classification problems from the UCI database. Results show that despite the detector's simplicity, its accuracy is competitive with that of more complex detectors with substantially higher computational complexity.
Keywords: anomaly detection, on-line learning, ensemble methods, large data
1
Introduction
The goal of an anomaly detector is to find samples which in some sense deviate
from the majority. It finds application in many important fields, such as network
intrusion detection, fraud detection, monitoring of health, environmental and
industrial processes, data-mining, etc. These domains frequently need a detector
with low complexity of training and classification, which can efficiently process a large number of samples, ideally in real time. These requirements imply that
the detector should be trained on-line, which is also important for domains,
where the data cannot be stored, or the data are non-stationary and the detector needs to be updated continuously.
This paper describes a detector (further called ADBag) which has a provably linear training complexity with respect to the number of training samples n and their dimension d. The classification has a low complexity which scales linearly with the dimension d. The detector consists of an ensemble of k weak detectors, each implemented as a one-dimensional histogram with b bins. The one-dimensional histogram was chosen because it can be created on-line in a single pass over the data, and probability estimates can be efficiently retrieved from it. Since each weak detector processes only one-dimensional data, the input space R^d is reduced to R by a projection onto a randomly generated vector w. The projection vectors w create the diversity among the weak detectors, which is important for the success of the ensemble approach. As will be explained in more detail in Section 3, ADBag can be related to a naïve Parzen window estimator [20], as each weak detector provides an estimate of the probability of a sample. According to [26], the Parzen window estimator frequently gives better results than more complex classifiers.¹
ADBag's accuracy is experimentally compared to selected prior-art algorithms on 36 problems downloaded from the UCI database [6], listed in the category of classification problems with numerical attributes. Although there is no single dominating algorithm, ADBag's performance is competitive with the others, but with significantly smaller computational and storage requirements. On a dataset with millions of features and samples it is demonstrated that ADBag can efficiently handle large-scale data.
The paper is organized as follows. The next section briefly reviews the relevant prior art, discusses its computational complexity, and points out issues related to on-line learning and classification. ADBag is presented in Section 3. In Section 4,
it is experimentally compared to the prior art and its efficiency is demonstrated
on a large-scale dataset [17]. Finally, Section 5 concludes the paper.
2 Related work
The survey on anomaly and outlier detection [3] covers a large number of methods. Below, those relevant to this work or otherwise important are reviewed. We remark that ADBag falls into the category of model-based detectors, since every weak detector essentially creates a very simple model.
2.1 Model-based detectors
Basic model-based anomaly detectors assume that the data follow a known distribution. For example, the detector based on the principal component transformation [25], with a training complexity of O(nd³), assumes a multivariate normal distribution.
¹ The Parzen window estimator is not suitable for real-time detection, since the complexity of obtaining an estimate of the pdf depends linearly on the number of observed samples n.
Although a multivariate normal distribution rarely fits real data, this detector, as will be seen later, frequently provides good results.
On the contrary, the One-Class Support Vector Machine [23] (OC-SVM) does not assume anything about the distribution of the data. It finds the smallest area where a 1 − ν fraction of the data is located (ν is a parameter of the method specifying the desired false-positive rate). This is achieved by projecting the data into a high-dimensional (feature) space and then finding the hyperplane that best separates the data from the origin. It has been noted that when OC-SVM is used with a linear kernel, it introduces a bias towards the origin [26]. This problem is removed by using a Gaussian kernel. The Support Vector Data Description algorithm [26] (SVDD) removes the bias of OC-SVM by replacing the separating hyperplane by a sphere encapsulating most of the data. It has been shown that if OC-SVM and SVDD are used with a Gaussian kernel, they are equivalent [23]. Due to this fact, SVDD is omitted from the comparison. The training complexity of both methods is super-linear with respect to the number of samples n, being O(n³d) in the worst case.
The recently proposed FRAC [19] aims to bridge the gap between supervised and unsupervised learning. FRAC is an ensemble of models, each estimating one feature on the basis of the others (for data of dimension d, FRAC uses d different models). The rationale behind this is that anomalous samples exhibit different dependencies among features, which can be detected from prediction errors modeled by histograms. FRAC's complexity depends on the algorithm used to implement the models, which can be large, considering that a search over possible hyper-parameters needs to be done. Because of this, ordinary linear least-squares regression is used, leading to a complexity of O(nd⁴). It is stated that FRAC is well suited for an on-line setting, but this might not be straightforward. For real-valued features, every update changes all models together with the distribution of their errors. Consequently, to update the histograms of these errors, all previously observed samples are required. This means that unless some simplifying assumptions are accepted, the method cannot be used in an on-line setting.
Generally, the complexity of the classification of model-based detectors is
negligible in comparison to the complexity of training. Yet for some methods this
might be difficult to control. An example is OC-SVM, where the classification
complexity depends on the number of support vectors, which is a linear function
of the number of training samples n.
The on-line training of all of the above detectors is generally difficult. Since there is no closed-form solution for on-line PCA, an on-line version of the PCA detector does not exist either. The on-line adaptation of OC-SVM is discussed in [11], but the solution is an approximation of the solution returned by the batch version. An exact on-line version of SVDD is described in [27], but the algorithm requires substantial bookkeeping, which degrades its usability in real-time applications. Moreover, the bookkeeping increases the storage requirements, which are no longer bounded.
2.2 Distance-based detectors
Distance-based detectors use all available data as the model. All data are usually presented in a single batch and the outliers are found within it. Thus, it is assumed that the majority of samples come from one (nominal) class. The lack of a training phase makes the adaptation to on-line settings easy: new samples are just added to the set of already observed samples. Notice that this increases the complexity of the classification phase, which is a linear function of the number of samples n.
The k-nearest neighbor detector [12] (KNN) is a popular method to identify outliers, inspired by the corresponding classification method. It ranks samples according to their distance to the k-th nearest neighbor. KNN has been criticized for not being able to detect outliers in data with clusters of different density [2]. The local outlier factor [2] (LOF) solves this problem by defining the outlier score as the ratio of a sample's distance to its k-th nearest neighbor to the average of the same distance over all its k nearest neighbors. True inliers have a score around one, while outliers have a much greater score. The prior art here is vast and it is impossible to list it all; we refer to [29] for more.
The complexity of the classification phase of nearest-neighbor-based detectors is driven by the nearest-neighbor search, which with a naïve implementation is an O(n) operation. To alleviate this, more efficient approaches have been adopted, based on bookkeeping [22], better search structures like KD-trees, or approximate search. Nevertheless, the complexity of all these methods depends in some way on the number of training samples n.
2.3 Ensembles in outlier detection
The ensemble approach has so far been little utilized in anomaly detection. A significant portion of the prior art focuses on a unification of scores [8,24], which is needed for diversification by using different algorithms [18]. Diversification by a random selection of sub-spaces has been utilized in [14].
ADBag's random projection can be considered a modification of the sub-space method. Yet the important difference is that the random projection relates all features together and not just some of them, as the sub-space method does. Also, all previous works use heavyweight detectors, while ADBag uses a very simple detector, which gives it its low complexity.
2.4 Random projections
Random projections have been utilized mainly in distance-based outlier detection schemes to speed up the search. De Vries et al. [5] use the property that random projections approximately preserve L2 distances among a set of points [10]. Thus, instead of performing the k-th-NN search of LOF in the high-dimensional space, the search is conducted in a space of reduced dimension but on a larger neighborhood, which is then refined by a search in the original dimension.
Similarly, Pham et al. [21] use random projections to estimate the distribution of angles between samples, which has been proposed in [13] as a good anomaly criterion.
Unlike both of the above schemes, ADBag is a model-based detector that does not rely on the notion of distances or angles, but on the notion of probability density. ADBag takes random projections to the extreme by projecting the input space onto a single dimension, which greatly simplifies all operations over it.
3 Algorithm description
ADBag is an ensemble of equal weak detectors. Every weak detector within the ensemble is a histogram over the one-dimensional space R, which is created by projecting the input space R^d onto a vector w. Projection vectors are generated randomly during the initialization of the weak histograms, before any data have been observed. Let h_i(x) = p̂_i(x^T w_i) denote the output of the i-th weak detector (p̂_i is the empirical probability distribution function (pdf) of the data projected onto w_i). ADBag's output is the average of the negative logarithms of the outputs of all weak detectors. Specifically,
    f(x) = -\frac{1}{k} \sum_{i=1}^{k} \log h_i(x).    (1)
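A minimal sketch of this aggregation, assuming the per-detector density estimates p̂_i are available as callables (all names are illustrative; the Matlab package referenced in Section 4 is the reference implementation):

    import numpy as np

    def adbag_score(x, projections, densities, eps=1e-10):
        """Anomaly score of Eq. (1): average negative log-density over the
        weak detectors. `projections` is a (k, d) array of vectors w_i and
        `densities[i]` estimates the one-dimensional pdf of x^T w_i."""
        scores = []
        for w, p_hat in zip(projections, densities):
            z = float(np.dot(x, w))                 # project the sample onto w_i
            scores.append(-np.log(max(p_hat(z), eps)))
        return float(np.mean(scores))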
The rest of this section describes ADBag in detail and explains the design choices. Subsection 3.1 describes strategies to generate the projection vectors w_i. The following Subsection 3.2 points out issues related to the on-line creation of histograms and presents a method adopted from the prior art. Finally, Subsection 3.3 explains the rationale behind the aggregation function and ADBag itself. The section finishes with a paragraph discussing ADBag's hyper-parameters and their effect on the computational complexity.
3.1 Projections
Projection vectors are generated randomly at the initialization of the weak detectors. In all experiments presented in Section 4, their elements were generated according to the normal distribution with zero mean and unit variance, which was chosen due to its use in the proof of the Johnson-Lindenstrauss (JL) lemma [10] (the JL lemma shows that L2 distances between points in the projected space approximate the same quantity in the input space). Other distributions for generating the random vectors w certainly exist. According to [15] it is possible to use sparse vectors w, which is interesting, as it would allow the detector to elegantly deal with missing variables.
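As an illustration, the projection vectors could be generated as follows (a sketch; the sparse variant is an assumption following [15] and is not part of the evaluated detector):

    import numpy as np

    def make_projections(k, d, sparse=False, density=1/3, rng=None):
        """Generate k random projection vectors in R^d.
        Dense case: i.i.d. N(0, 1) entries, as used in the experiments.
        Sparse case (illustrative): most entries are zeroed out, which would
        let the detector ignore missing variables."""
        rng = np.random.default_rng(rng)
        W = rng.standard_normal((k, d))
        if sparse:
            mask = rng.random((k, d)) < density   # keep roughly `density` of the entries
            W = W * mask
        return W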
3.2 Histogram
Recall that one of the most important requirements was that the detector should operate (learn and classify) over data streams, which means that the usual approaches,
Algorithm 1: Construction of an approximation of the probability distribution of the data {x_1, ..., x_n} projected onto the vector w.

    initialize H = {}, zmin = +∞, zmax = −∞, and w ∼ N(0, 1_d)
    for j ← 1 to n do
        z = x_j^T w
        zmin = min{zmin, z};  zmax = max{zmax, z}
        if ∃ (z_i, m_i) ∈ H with z_i == z then
            m_i = m_i + 1
            continue
        else
            H = H ∪ {(z, 1)}
        end
        if |H| > b then
            sort the pairs in H such that z_1 < z_2 < ... < z_{b+1}
            find i minimizing z_{i+1} − z_i
            replace the pairs (z_i, m_i), (z_{i+1}, m_{i+1}) by the single pair
                ( (z_i m_i + z_{i+1} m_{i+1}) / (m_i + m_{i+1}),  m_i + m_{i+1} )
        end
    end
    H = H ∪ {(zmin, 0), (zmax, 0)}
    sort the pairs in H such that zmin < z_1 < z_2 < ... < zmax
such as equal-area or equal-width histograms, cannot be used. The former requires the data to be available in a single batch, while the latter requires the bounds of the data to be known beforehand. To avoid these limitations and to have bounded resources, the algorithm proposed in [1] is adopted, even though it does not guarantee convergence to the true pdf. A reader interested in this problem should look to [16] and the references therein. In the experimental section, ADBag with an equal-area histogram created in batch training is compared to ADBag with the on-line histogram, with the conclusion that both offer the same performance.
The chosen on-line histogram approximates the distribution of the data by a set of pairs H = {(z_1, m_1), ..., (z_b, m_b)}, where z_i ∈ R and m_i ∈ N, and b is an upper bound on the number of histogram bins. The algorithm maintains the pairs (z_i, m_i) such that every point z_i is surrounded by m_i points, of which half lie to the left and half to the right of z_i. Consequently, the number of points in the interval [z_i, z_{i+1}] is equal to (m_i + m_{i+1})/2, and the probability of a point z ∈ (z_i, z_{i+1}) is estimated as a weighted average.
The construction of H is described in Algorithm 1. It starts with H = {}, an empty set. Upon receiving a sample z = x^T w, it checks whether there is a pair (z_i, m_i) in H such that z is equal to z_i. If so, the corresponding
Algorithm 2: Approximation of the probability density at the point x projected onto the vector w.

    H = H ∪ {(zmin, 0), (zmax, 0)}
    sort the pairs in H such that zmin < z_1 < z_2 < ... < zmax
    z = x^T w
    if ∃ i such that z_i < z ≤ z_{i+1} then
        return (z_i m_i + z_{i+1} m_{i+1}) / (2M (z_{i+1} − z_i))
    else
        return 10^{-10}
    end
count m_i is increased by one. If not, a new pair (z, 1) is added to H. If the size of H exceeds the maximal number of bins b, the algorithm finds the two closest pairs (z_i, m_i), (z_{i+1}, m_{i+1}) and replaces them with the interpolated pair ((z_i m_i + z_{i+1} m_{i+1}) / (m_i + m_{i+1}), m_i + m_{i+1}). Keeping the z_i sorted makes all of the above operations efficient.
The estimation of the probability density at a point z = x^T w is described in Algorithm 2. Assuming the pairs in H are sorted according to z_i (the sorting is stated explicitly, but as mentioned above, for efficiency H should be kept sorted all the time), an i such that z_i < z ≤ z_{i+1} is found first. If such an i exists, then the density at z is estimated as (z_i m_i + z_{i+1} m_{i+1}) / (2M (z_{i+1} − z_i)), where M = \sum_{i=1}^{b} m_i. Otherwise, it is assumed that z is outside the estimated region and 10^{-10} is returned.
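A compact sketch of the on-line histogram described in Algorithms 1 and 2 (a simplified illustration; the class and method names are not from the released package, and ties, sorting and edge cases are handled in the simplest way):

    import bisect
    import numpy as np

    class StreamingHistogram:
        """One-dimensional histogram with at most b bins, built in a single
        pass (after Ben-Haim & Tom-Tov [1]); bins are (z_i, m_i) pairs kept
        sorted by their centre z_i."""

        def __init__(self, b):
            self.b = b
            self.z, self.m = [], []                     # bin centres and counts
            self.zmin, self.zmax = np.inf, -np.inf

        def update(self, z):
            self.zmin, self.zmax = min(self.zmin, z), max(self.zmax, z)
            i = bisect.bisect_left(self.z, z)
            if i < len(self.z) and self.z[i] == z:      # exact hit: increment the count
                self.m[i] += 1
                return
            self.z.insert(i, z)
            self.m.insert(i, 1)
            if len(self.z) > self.b:                    # merge the two closest bins
                j = int(np.argmin(np.diff(self.z)))
                zm = (self.z[j] * self.m[j] + self.z[j + 1] * self.m[j + 1]) \
                     / (self.m[j] + self.m[j + 1])
                self.m[j] += self.m[j + 1]
                self.z[j] = zm
                del self.z[j + 1], self.m[j + 1]

        def pdf(self, z, floor=1e-10):
            """Density estimate of Algorithm 2, with zero-mass sentinel bins
            placed at zmin and zmax."""
            zs = [self.zmin] + self.z + [self.zmax]
            ms = [0] + self.m + [0]
            M = sum(self.m)
            for i in range(len(zs) - 1):
                if zs[i] < z <= zs[i + 1] and zs[i + 1] > zs[i]:
                    return (zs[i] * ms[i] + zs[i + 1] * ms[i + 1]) \
                           / (2 * M * (zs[i + 1] - zs[i]))
            return floor                                # z outside the estimated range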
3.3 Aggregation of weak detectors
ADBag's output on a sample x ∈ R^d can be expressed as

    f(x) = -\frac{1}{k} \sum_{i=1}^{k} \log \hat{p}_i(x w_i^T)
         = -\log \left( \prod_{i=1}^{k} \hat{p}_i(x w_i^T) \right)^{1/k}
         \sim -\log p(x w_1, x w_2, \ldots, x w_k),    (2)

where p̂_i denotes the empirical marginal pdf along the projection w_i, and p(x w_1, x w_2, ..., x w_k) denotes the joint pdf.
The equation shows that ADBag's output is inversely related to the joint probability of the sample under the assumption that the marginals on projection vectors w_i and w_j are independent for all i, j ∈ {1, ..., k}, i ≠ j (used in the last line of Equation (2)). Similarly, the output can be viewed as a negative log-likelihood of
the sample, meaning that the less likely the sample is, the higher the anomaly value it receives.
The independence of x w_i^T and x w_j^T for i ≠ j assumed in the last line of (2) is questionable and in reality it probably does not hold. Nevertheless, the very same assumption is made in the naïve Bayes classifier, which, despite this assumption being almost always violated, gives results competitive with more sophisticated classifiers. Zhang [28] explains this phenomenon from a theoretical point of view and gives conditions under which the effects of conditional dependencies cancel out, making naïve Bayes equal to the Bayes classifier. These conditions depend on the probability distribution of both classes and are difficult to verify in practice, since they require exact knowledge of the conditional dependencies among features. Nevertheless, due to ADBag's similarity to the Parzen window classifier, a similar argument might explain ADBag's performance.
Another line of thought relates ADBag to the PCA-based detector [25]. If the dimension d is sufficiently high, then projection vectors w_i and w_j, i ≠ j, are approximately orthogonal. Assuming again the independence of x w_i^T and x w_j^T, the projected data are orthogonal and uncorrelated, which are the most important properties of Principal Component Analysis (PCA).
3.4 Hyper-parameters
ADBag is controlled by two hyper-parameters: the number of weak detectors k and the number of histogram bins b within every detector. Both parameters influence the accuracy in a very predictable way.
Generally speaking, the higher the number of weak detectors, the better the accuracy. Nevertheless, after a certain threshold, adding more detectors does not significantly improve the accuracy. In all experiments in Section 4.2, we set k = 150. A subsequent investigation of the least k at which ADBag reaches an accuracy above 99% of the maximum found that this k was most of the time well below 100.
The number of histogram bins was set to b = [√n], where n is the number of samples. The rationale behind this choice is the following: if the number of samples n → +∞ and b = √n, then the equal-area histogram converges to the true probability distribution function. In practice, b should be set according to the available resources and the expected number of samples. Interestingly, an investigation on a dataset with millions of features and samples revealed that the effect of b on the accuracy is limited and small values of b are sufficient (see Section 4.3 for details).
Both hyper-parameters influence the computational complexity of ADBag’s
training and classification. It is easy to see that both complexities depend at
most linearly on both parameters.
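Putting the previous sketches together, a training pass could look as follows (illustrative only; both hyper-parameters enter the cost linearly, as stated above):

    import numpy as np

    def fit_adbag(X, k=150, b=None, rng=0):
        """Train k weak detectors in a single pass over X (shape n x d)."""
        n, d = X.shape
        b = b if b is not None else int(np.sqrt(n))   # b = [sqrt(n)], as in the experiments
        W = make_projections(k, d, rng=rng)           # sketch from Section 3.1
        hists = [StreamingHistogram(b) for _ in range(k)]
        for x in X:                                   # single pass over the data
            for w, h in zip(W, hists):
                h.update(float(x @ w))                # one histogram update per weak detector
        return W, [h.pdf for h in hists]              # plug into adbag_score(x, W, densities)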
4 Experiments
ADBag's accuracy was evaluated and compared to the state of the art on 36 problems from the UCI database [6], listed under the category “classification problems with numerical attributes without missing variables”.
The following algorithms were chosen due to their generality and acceptance by the community: the PCA-based anomaly detector [25], OC-SVM [23], FRAC [19], KNN [12], and LOF [2]. Although these algorithms were not designed for real-time on-line learning, they were selected because they provide a good benchmark.
For every problem, the class with the highest number of samples was used as the nominal class and all samples from the remaining classes were used as representatives of the anomalous class. In one repetition of the experiment, 75% of the nominal samples were used for training and the rest for testing. The samples from the anomalous class were not sub-sampled: all of them were always used. The data were always normalized to have zero mean and unit variance on the training part of the nominal class. The distance-based algorithms (KNN and LOF) used the training data as the reference set to which the distance of a classified sample was calculated. Every experiment was repeated 100 times.
To avoid problems with the trade-off between the false-positive and false-negative rates, the area under the ROC curve (AUC) is used as a measure of the quality of the detection. This measure is frequently used for comparisons of this kind.
The Matlab package used in the evaluation process is available at http://agents.fel.cvut.cz/~pevnak.
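For concreteness, one repetition of this protocol could look as follows (a Python sketch; the released Matlab package is the reference implementation and all names here are illustrative):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def one_repetition(X, y, score_fn, rng):
        """Largest class = nominal; 75% of it for training, the rest plus all
        anomalous samples for testing; AUC as the quality measure.
        `y` holds integer class labels; `score_fn(train, test)` returns one
        anomaly score per test sample (higher = more anomalous)."""
        nominal = np.bincount(y).argmax()
        Xn, Xa = X[y == nominal], X[y != nominal]
        idx = rng.permutation(len(Xn))
        n_train = int(0.75 * len(Xn))
        Xtr, Xte = Xn[idx[:n_train]], Xn[idx[n_train:]]
        mu, sd = Xtr.mean(axis=0), Xtr.std(axis=0) + 1e-12   # normalise on nominal training part
        Xtr, Xte, Xa = (Xtr - mu) / sd, (Xte - mu) / sd, (Xa - mu) / sd
        scores = score_fn(Xtr, np.vstack([Xte, Xa]))
        labels = np.r_[np.zeros(len(Xte)), np.ones(len(Xa))]
        return roc_auc_score(labels, scores)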
4.1 Settings of hyper-parameters
The LOF and KNN methods both used k = 10 (recall that in both methods, k denotes the number of samples determining the neighborhood), as recommended in [2].
OC-SVM with a Gaussian kernel is cumbersome to use in practice, since its two hyper-parameters (the width of the Gaussian kernel γ and the expected false-positive rate ν) have an unpredictable effect on the accuracy (note that anomalous samples are not available during training). Hence, the following heuristic has been adopted. The false-positive rate on the grid

    (ν, γ) ∈ { (0.01 · 2^i, 10^j / d) | i ∈ {0, ..., 5}, j ∈ {−3, ..., 3} }

has been estimated by five-fold cross-validation (d is the input dimension of the problem). Then, the lowest ν and γ (in this order) with an estimated false-positive rate below 1% have been used. If no such combination of parameters exists, the combination with the lowest false-positive rate has been used. The choice of a 1% false-positive rate is justified by training on samples from the nominal class only, where no outliers should be present. The reason for choosing the lowest ν and γ is that good generalization is expected. The choice of the parameters is probably not optimal for maximizing AUC, but it illustrates the difficulty of using algorithms with many hyper-parameters. The SVM implementation has been taken from the libSVM library [4].
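A sketch of this selection heuristic (using scikit-learn's OneClassSVM for illustration rather than the libSVM binding used in the paper; an assumption, not the authors' code):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import OneClassSVM

    def select_ocsvm_params(X_train, d):
        """Pick (nu, gamma) from the grid by the cross-validated false-positive
        rate on nominal data only, preferring the smallest values with FPR < 1%."""
        grid = [(0.01 * 2**i, 10.0**j / d) for i in range(6) for j in range(-3, 4)]
        results = []
        for nu, gamma in grid:
            fprs = []
            cv = KFold(n_splits=5, shuffle=True, random_state=0)
            for tr, va in cv.split(X_train):
                model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train[tr])
                fprs.append(np.mean(model.predict(X_train[va]) == -1))   # -1 = flagged outlier
            results.append((nu, gamma, np.mean(fprs)))
        ok = [r for r in results if r[2] < 0.01]
        best = min(ok, key=lambda r: (r[0], r[1])) if ok else min(results, key=lambda r: r[2])
        return best[0], best[1]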
The FRAC detector used ordinary linear least-squares estimators, which in this setting do not have any hyper-parameters. The PCA detector based on the principal component transformation used all components with eigenvalues greater than 0.01.
ADBag used k = 150 weak detectors and b = √n bins in the histogram (n is the number of samples from the nominal class). As will be discussed later, 150 weak detectors is unreasonably high for most problems, but it was used to avoid poor performance due to too few detectors. The same holds for b.
4.2 Experimental results
Table 1 shows the average AUCs of the compared detectors on all 36 tested problems. The first thing to notice is that there is no single detector dominating the others. Every detector excels in at least one problem and is inferior in others. Therefore, the last row in Table 1 shows the average rank of a given detector over all problems (calculated only for unsupervised detectors with batch learning). According to this measure, KNN and the proposed ADBag detector with an equal-area histogram provide overall the best performance, and they are on average equally good.
This result shows that an increased complexity of the detector does not necessarily lead to better performance, which can be caused by the more complicated setting of the parameters to be tuned.
The column captioned k_min in Table 1 shows a sufficient number of weak detectors, determined as the least k providing an AUC higher than 0.99 times the AUC obtained with 150 detectors. For all problems, the sufficient number of weak detectors is well below the maximum of 150. For many problems, k_min is higher than the dimension of the input space. This shows that a diverse set of random projections provides different views of the problem, leading to better accuracy.
For almost all problems, the accuracy increases with the number of projections. The only exception is the “musk-2” dataset, which is not unimodal, as the plot of the first two principal components reveals three clusters of data. In contrast, “spect-heart” is actually a difficult problem even for supervised classification, as the AdaBoost [7] algorithm achieves only 0.62 AUC.
An investigation of the problems on which ADBag is worse than the best detector shows that ADBag performs poorly in cases where the support of the probability distribution of the nominal class is not convex and encapsulates the support of the probability distribution of the anomalous class. We believe that these cases are rare in very high-dimensional problems, for which ADBag is designed.
Finally, comparing the accuracies of the batch and on-line versions (Table 1), it is obvious that the on-line version of ADBag is no worse than the batch version. This is important for efficient processing of large data and for application to non-stationary problems [9].
dataset                   FRAC  PCA   KNN   LOF   SVM   ADBag  ADBag     d      n  kmin
                                                        batch  online
abalone                   0.44  0.46  0.46  0.50  0.60  0.59   0.59      8   1146    10
blood-transfusion         0.41  0.53  0.56  0.46  0.53  0.56   0.55      4    428     2
breast-cancer-wisconsin   0.94  0.96  0.95  0.95  0.90  0.96   0.96     30    268    20
breast-tissue             0.87  0.90  0.91  0.92  0.94  0.94   0.94      9     16    18
cardiotocography          0.71  0.72  0.73  0.80  0.87  0.82   0.82     21   1241    41
ecoli                     0.93  0.97  0.98  0.98  0.98  0.98   0.98      7    107    12
gisette                   —     0.78  0.78  0.83  0.74  0.74   0.75   5000   2250    78
glass                     0.64  0.68  0.67  0.68  0.55  0.67   0.65      9     57    10
haberman                  0.67  0.61  0.67  0.66  0.33  0.64   0.64      3    169    28
ionosphere                0.96  0.96  0.97  0.95  0.93  0.96   0.96     34    169    39
iris                      1.00  1.00  1.00  1.00  1.00  1.00   1.00      4     38     4
isolet                    0.87  0.98  0.99  0.99  0.99  0.99   0.98    617    180    40
letter-recognition        0.99  0.99  0.98  0.98  0.79  0.96   0.95     16    610    53
libras                    0.66  0.88  0.72  0.75  0.72  0.82   0.81     90     18    79
madelon                   0.51  0.51  0.52  0.52  0.53  0.51   0.51    500    975    59
magic-telescope           0.83  0.80  0.83  0.84  0.69  0.73   0.73     10   9249    27
miniboone                 0.62  0.54  0.77  0.54  —     0.83   0.83     50  70174    19
multiple-features         0.77  1.00  0.99  1.00  0.99  0.99   0.99    649    150    12
musk-2                    0.83  0.79  0.79  0.84  0.18  0.53   0.42    166   4186     1
page-blocks               0.96  0.96  0.96  0.91  0.86  0.95   0.94     10   3685    12
parkinsons                0.71  0.68  0.61  0.67  0.85  0.74   0.73     22    110    73
pendigits                 1.00  1.00  0.99  1.00  0.99  0.99   0.99     16    858     6
pima-indians              0.72  0.72  0.75  0.68  0.64  0.73   0.73      8    375    32
sonar                     0.67  0.66  0.53  0.62  0.65  0.63   0.63     60     83    99
spect-heart               0.31  0.27  0.22  0.23  0.68  0.32   0.31     44    159     1
statlog-satimage          0.81  0.97  0.99  0.99  0.85  0.99   0.99     36   1150    21
statlog-segment           0.99  1.00  0.99  0.99  0.98  0.99   0.99     19    248    10
statlog-shuttle           0.91  0.98  1.00  1.00  0.65  0.92   0.92      8  34190     9
statlog-vehicle           0.98  0.98  0.89  0.94  0.60  0.85   0.86     18    164    69
synthetic-control-chart   0.97  1.00  1.00  1.00  1.00  1.00   1.00     60     75     9
vertebral-column          0.85  0.88  0.88  0.89  0.87  0.95   0.90      6    150    20
wall-following-robot      0.75  0.68  0.78  0.74  0.63  0.70   0.68     24   1654    59
waveform-1                0.89  0.90  0.90  0.89  0.93  0.91   0.92     21   1272    67
waveform-2                0.81  0.82  0.81  0.80  0.76  0.78   0.80     40   1269   111
wine                      0.93  0.95  0.95  0.93  0.91  0.93   0.93     13     53    63
yeast                     0.72  0.72  0.72  0.71  0.67  0.74   0.72      8    347    33
Average rank              4.0   3.2   3.1   3.2   4.3   3.1

Table 1: Average unnormalized area under the ROC curve calculated from 100 repetitions (higher is better); the best performance for a given problem is bold-faced in the original. The columns FRAC through ADBag (batch) are batch unsupervised methods; the ADBag (online) column is the on-line version. The last row shows the average rank of each batch algorithm over all 36 problems (lower is better).
4.3 URL dataset
The URL dataset [17] contains 2.4 million samples with 3.2 million features, hence it is a good dataset on which ADBag's power can be demonstrated. Each sample in the dataset contains sparse features extracted from a URL. The class of benign samples contains random URLs (obtained by visiting http://random.yahoo.com/bin/ryl). The class of malicious samples was obtained by extracting links from spam e-mails. The URLs were collected over 120 days. The original work used the dataset in a supervised binary classification scenario, obtaining an accuracy of around 98%. Here, the dataset is used in an anomaly detection scenario, utilizing samples from the benign class for training.
[Figure 1 appears here: two plots with AUC on the vertical axes; the left plot is over b and k, the right plot is over days 0-120.]
(a) ADBag with respect to k and b. (b) KNN and versions of ADBag.
Fig. 1: The left plot shows the AUC of updated ADBag on the URL dataset for different numbers of weak detectors k and histogram bins b. The right plot shows the AUC of the KNN detector in the reduced space, continuously updated ADBag (Continuous), ADBag trained on the first day only (Fixed), and ADBag trained every day (Updated), with respect to days.
ADBag was evaluated with different numbers of weak detectors k ∈ {50, 100, 150, ..., 500} and different numbers of bins b ∈ {16, 32, 64, 128, 256}. Three strategies to train ADBag were investigated.
Fixed ADBag was trained on benign samples from day zero only, which was never used for evaluation. Continuous ADBag was trained on all benign samples up to the day before the day the testing data came from; this means that if Continuous ADBag was evaluated on data from day l, the training used benign samples from days 0, 1, 2, ..., l − 1 (Continuous ADBag was trained in an on-line manner). Finally, Updated ADBag was trained on benign samples from the day preceding the data used for evaluation.
Note that the prior art used in the previous section cannot be used directly for benchmarking purposes, because it cannot handle these data. In order to have a method to which ADBag can be compared, a strategy from [5] has been adopted
and used with the KNN detector (the best according to the results in the previous subsection). Specifically, a random projection matrix W ∈ R^{3.2·10^6 × 500}, W_ij ∼ N(0, 1), has been created and all samples were projected into this new, lower-dimensional space. Note that due to the Johnson-Lindenstrauss lemma, the L2 distances between points should be approximately preserved, hence the KNN method should work without modification. The KNN method was executed with 10 and 20 nearest neighbors, the former being better.
Missing features were treated as zeros, which allows not-yet-seen variables to be handled efficiently by adding new row(s) to the projection matrix W. This strategy has been used for both the ADBag and KNN methods.
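A sketch of this projection step (the shapes follow the text; the lazy growth of W for not-yet-seen features is an illustrative choice, and storing a dense W of this size is memory-hungry):

    import numpy as np

    class GrowingProjection:
        """Gaussian projection R^d -> R^k for sparse samples; rows are appended
        lazily when previously unseen features show up (they act as zeros before)."""

        def __init__(self, d, k=500, rng=0):
            self.rng = np.random.default_rng(rng)
            self.k = k
            self.W = self.rng.standard_normal((d, k))

        def transform(self, X):                    # X: (n, d_new) array or scipy.sparse matrix
            d_new = X.shape[1]
            if d_new > self.W.shape[0]:            # new features appeared: add rows to W
                extra = self.rng.standard_normal((d_new - self.W.shape[0], self.k))
                self.W = np.vstack([self.W, extra])
            return X @ self.W[:d_new]              # dense (n, k) result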
Figure 1b shows the AUCs of KNN and the three variants of ADBag on every day of the data set. According to the plot, on some days the ADBag detectors were better, whereas on other days it was vice versa. The average difference between KNN and ADBag retrained every day was about 0.02, which is a negligible difference considering that Continuous ADBag was approximately 27 times faster. Interestingly, all three versions of ADBag provided similar performance, meaning that the benign samples were stationary. This is explained by the fact that the distribution of URLs on the internet did not change considerably during the 120 days over which the benign samples were acquired.
Figure 1a shows how the accuracy of Continuous ADBag changes with the number of weak detectors k and the number of histogram bins b. As expected, higher values yield higher accuracy, although competitive results can already be achieved with k = 200 and b = 32. This suggests that b can have a low value, which does not have to be proportional to the number of samples n.
Continuous ADBag's average time of update and classification per day for the most complex setting, with k = 500 and b = 256, was 5.86 s. The average classification time for the KNN detector with k = 500 and 10 nearest neighbors was 135.57 s. Both times exclude the projection of the data to the lower-dimensional space, which was done separately; this projection took 669.25 s for 20,000 samples. These numbers show that ADBag is well suited for efficient detection of anomalous events on large-scale data. Its accuracy is competitive with state-of-the-art methods, while its running times are orders of magnitude lower. Running times were measured on a MacBook Air equipped with a 2-core 2 GHz Intel Core i7 processor and 8 GB of memory.
5 Conclusion
This paper has proposed an anomaly detector with bounded computational and storage requirements. This type of detector is important for many contemporary applications requiring the processing of large data, which are beyond the capabilities of traditional algorithms such as the one-class support vector machine or nearest-neighbor-based methods.
The detector is built as an ensemble of weak detectors, where each weak detector is implemented as a histogram in one dimension. This one dimension is obtained by projecting the input space onto a randomly generated projection
vector. The random generation of the projection vectors simultaneously creates the needed diversity between the weak detectors.
The accuracy of the detector was compared to five detectors from the prior art on 36 classification problems from the UCI repository. According to the results, the proposed detector and the nearest-neighbor-based detector provide overall the best performance. It was also demonstrated that the detector can efficiently handle a dataset with millions of samples and features.
The fact that the proposed detector is competitive with established solutions is especially important if one takes its small computational and memory requirements into account. Moreover, the detector can be trained on-line on data streams, which opens the door to its application in non-stationary problems.
With respect to the experimental results, the proposed detector represents an interesting alternative to established solutions, especially if large data need to be handled efficiently. It would be interesting to investigate the impact of sparse random projections on the accuracy, as this would increase the efficiency and enable the detector to be applied to data with missing features.
6 Acknowledgments
This work was supported by the Grant Agency of Czech Republic under the
project P103/12/P514.
References
1. Y. Ben-Haim and E. Tom-Tov. A streaming parallel decision tree algorithm. The
Journal of Machine Learning Research, 11:849–872, 2010.
2. M. M. Breunig, H.P. Kriegel, R.T. Ng, and J. Sander. Lof: identifying density-based
local outliers. SIGMOD Rec., 29(2):93–104, 2000.
3. V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM
Computing Surveys (CSUR), 41(3):1–58, 2009.
4. C. C. Chang and C. J. Lin. LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
5. T. de Vries, S. Chawla, and M. E. Houle. Finding local anomalies in very high
dimensional space. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 128–137. IEEE, 2010.
6. A. Frank and A. Asuncion. UCI machine learning repository, 2010. http://
archive.ics.uci.edu/ml.
7. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37.
Springer, 1995.
8. J. Gao and P.-N. Tan. Converting output scores from outlier detection algorithms
into probability estimates. In Data Mining, 2006. ICDM’06. Sixth International
Conference on, pages 212–221. IEEE, 2006.
9. E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th Annual International Conference on Machine
Learning, pages 393–400. ACM, 2009.
10. W. B. Johnson and J. Lindenstrauss. Extensions of lipschitz mappings into a
hilbert space. Contemporary mathematics, 26(189-206):1, 1984.
11. J. Kivinen, A. J. Smola, and R.C. Williamson. Online Learning with Kernels.
IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.
12. E. Knorr and R. T. Ng. Finding intensional knowledge of distance-based outliers.
In Proceedings of the International Conference on Very Large Data Bases, pages
211–222, 1999.
13. H.-P. Kriegel and A. Zimek. Angle-based outlier detection in high-dimensional
data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 444–452. ACM, 2008.
14. A. Lazarevic and V. Kumar. Feature bagging for outlier detection. In Conference
on Knowledge Discovery in Data: Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, volume 21, pages
157–166, 2005.
15. P. Li. Very sparse stable random projections for dimension reduction in l_α (0 < α ≤ 2) norm. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '07, page 440, 2007.
16. X. Lin. Continuously maintaining order statistics over data streams. In Proceedings of the eighteenth conference on Australasian database-Volume 63, pages 7–10.
Australian Computer Society, Inc., 2007.
17. J. Ma, L. K Saul, S. Savage, and G. M. Voelker. Identifying suspicious urls: an
application of large-scale online learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 681–688. ACM, 2009.
18. H. V. Nguyen, H. H. Ang, and V. Gopalkrishnan. Mining outliers with ensemble of
heterogeneous detectors on random subspaces. In Database Systems for Advanced
Applications, pages 368–383. Springer, 2010.
19. K. Noto, C. Brodley, and D. Slonim. Frac: a feature-modeling approach for
semi-supervised and unsupervised anomaly detection. volume 25, pages 109–133.
Springer US, 2012.
20. E. Parzen. On estimation of a probability density function and mode. The annals
of mathematical statistics, 33(3):1065–1076, 1962.
21. N. Pham and R. Pagh. A near-linear time approximation algorithm for anglebased outlier detection in high-dimensional data. In Proceedings of the 18th ACM
SIGKDD international conference on Knowledge discovery and data mining, pages
877–885. ACM, 2012.
22. D. Pokrajac, A. Lazarevic, and J. L. Latecki. Incremental Local Outlier Detection
for Data Streams. In 2007 IEEE Symposium on Computational Intelligence and
Data Mining, pages 504–515. IEEE, 2007.
23. B. Schölkopf, J. C. Platt., J. Shawe-Taylor, A. J. Smola, and R. C. Williamson.
Estimating the support of a high-dimensional distribution. Neural Comput.,
13(7):1443–1471, 2001.
24. E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel. On evaluation of
outlier rankings and outlier scores. In Proceedings of the 12th SIAM International
Conference on Data Mining (SDM), Anaheim, CA, 2012, pages 1047–1058, 2012.
25. M. L. Shyu. A novel anomaly detection scheme based on principal component
classifier. Technical report, DTIC Document, 2003.
26. D. M. J. Tax and R. P. W. Duin. Support vector data description. Mach. Learn.,
54(1):45–66, 2004.
27. D.M.J. Tax and P. Laskov. Online svm learning: from classification to data description and back. In Neural Networks for Signal Processing, 2003. NNSP’03.
2003 IEEE 13th Workshop on, pages 499–508, 2003.
28. H. Zhang. The optimality of naive bayes. In V Barr and Z Markov, editors, Proceedings of the Seventeenth International Florida Artificial Intelligence Research
Society Conference. AAAI Press, 2004.
29. A. Zimek, E. Schubert, and H.-P. Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining,
5(5):363–387, 2012.
Efficient semi-supervised feature selection by an
ensemble approach
Mohammed Hindawi¹, Haytham Elghazel², Khalid Benabdeslem²
¹ INSA de Lyon, LIRIS, CNRS UMR 5205, F-69621, France
[email protected]
² University of Lyon 1, LIRIS, CNRS UMR 5205, F-69622, France
{haytham.elghazel, khalid.benabdeslem}@univ-lyon1.fr
Abstract. The Constrained Laplacian Score (CLS) is a recently proposed method for semi-supervised feature selection. It has shown performance superior to other methods in the state of the art, because CLS exploits both the unsupervised and the supervised parts of the data for selecting the most relevant features. However, the choice of the small amount of supervision information (represented by pairwise constraints) remains a critical issue: constraints may contain noise which can deteriorate the learning performance. In this paper we try to counter any negative effects of the constraint set by varying its sources. This is done by an ensemble technique using both a resampling of the data (bagging) and a random subspace strategy. The proposed approach generates a global ranking of features by aggregating multiple Constrained Laplacian Scores on different views of the available labeled and unlabeled data. We validate our approach by empirical experiments on high-dimensional datasets and compare it with other representative methods.
Key words: Feature selection, semi-supervised learning, constraint score,
ensemble methods.
1 Introduction
In today's machine learning applications, data acquisition tools have developed to the point where it is easy to continuously collect voluminous raw data. This huge quantity of data, in turn, dramatically deteriorates both the storage and the processing of the data with classical learning algorithms, due to the “curse of dimensionality”. To overcome this problem, feature selection has become one of the most important techniques for dimensionality reduction. Feature selection can be defined as the process of choosing the most relevant features of the data. The relevance of a feature may differ according to the learning context, which may be roughly divided into supervised, unsupervised and semi-supervised feature selection.
In supervised feature selection, where all data instances are labeled, the relevance of a feature is measured according to its correlation with the label information. A 'good' feature is then one on which instances with the same label record the same (or close) values, and vice-versa [16]. Unsupervised feature selection is considered a much harder problem due to the absence of labels; hence the relevance of a feature is measured according to its ability to preserve some data characteristics (e.g. variance) [11]. Supervised feature selection methods generally outperform unsupervised ones thanks to the presence of labels, which represent background knowledge about the data. However, label availability is not always guaranteed, because labels generally require expert intervention, which is costly. Given the aforementioned rapid development of data acquisition tools, a more frequent case in machine learning applications is that labeling information is provided for only a small part of the data; the data is then called 'semi-supervised', which in turn produces the so-called “small labeled sample problem” [34].
In [3] we proposed the Constrained Laplacian Score (CLS), a semi-supervised scoring method which profits from both the data structure and the label information (transformed into pairwise constraints). CLS achieved outstanding performance compared with other competitive methods. However, the method was sensitive to noise in the constraint set. To tackle this problem, we later proposed a Constraint Selection based Feature Selection framework (CSFS) [19] in which we enhanced the score function to make it more efficient. To overcome the problem of noisy constraints, CSFS exploits a constraint selection process based on a coherence measure (proposed in [8]), which considers two constraints incoherent if they represent two contradictory forces and coherent otherwise. Once constraint selection is done, the remaining constraints are fewer but more effective. CSFS outperformed its ancestor CLS, which can be explained by the improvement of the scoring function and the elimination of constraint noise. However, CSFS has two critical points: firstly, even if the selected constraints are effective, the selected constraint set is rather small, which in some cases dramatically limits the usefulness of the constraints. In addition, CSFS and CLS are based on the Euclidean distance between instances in the computation of feature scores, and the calculation of such a distance becomes less reliable when the data is of high dimensionality.
To overcome the two mentioned problems, we present an ensemble-based framework called EnsCLS (Ensemble Constrained Laplacian Score) for semi-supervised feature selection. EnsCLS combines a resampling of the data (bagging) with a random selection of features (random subspaces, RSM for short). The CLS score is then used to measure feature relevance on each replicate of the data, and the average score of each feature across all ensemble components is considered. Combining these two strategies (bagging and RSM) for producing the feature ranking leads to an exploration of distinct views of the inter-pattern relationships and allows us (i) to compute estimates of variable importance that are robust against small changes in the pairwise constraint set, and (ii) to mitigate the curse of dimensionality.
The rest of the paper is organized as follows: Section 2 reviews recent studies on semi-supervised feature selection and ensemble methods. Section 3 briefly recalls the Constrained Laplacian Score algorithm. We then discuss the details of the proposed EnsCLS algorithm in Section 4. Experiments on relevant high-dimensional benchmark and real datasets are presented in Section 5. Finally, we conclude the paper in Section 6.
2 Related works
In this section, we briefly present the semi-supervised feature selection and semi-supervised ensemble approaches that have recently appeared in the literature.
2.1 Feature selection
With the advent of semi-supervised feature selection, some unsupervised methods have been adapted to this context by ignoring the few available labels. The Laplacian score [18], for example, determines a feature's relevance according to the variance of the data along it. Variance is an important characteristic of the data; nevertheless, labeled data also carry valuable information and represent background knowledge about the domain. On the other hand, another score, called the constraint score [33], depends only on the little available labeling information, which is transformed into constraints. The constraint score showed that, using only a small number of constraints, it can perform competitively with fully labeled methods (such as the Fisher score [13]), which makes it better adapted to the small labeled sample problem. However, the constraint score ignores the “large” unlabeled part of the data, which carries the real data structure. In addition, its performance is severely influenced by the choice of the constraint set. To overcome this problem, the authors in [30] proposed a bagging approach (BS) to the constraint score in order to improve the overall classification accuracy. The main drawback of the method is, as mentioned, that it still ignores the unlabeled part of the data, which is generally far larger than the labeled one. In order to profit from both the labeled and unlabeled parts of the data, a score called C4 [23] proposed a simple multiplication of the Laplacian and constraint scores as a compromise between the two. However, this method is biased towards features with a good Laplacian score but a bad constraint score, and vice-versa.
2.2 Ensemble learning
Ensemble methods have been called the most influential development in data mining and machine learning of the last decade. They combine multiple models into one that is usually more accurate than the best of its components. This improvement in performance relies on the concept of diversity, which states that a good classifier ensemble is one in which the examples that are misclassified differ from one individual classifier to another. Dietterich [10] states that
“A necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are accurate and diverse”. Many methods have been proposed to generate accurate, yet diverse, sets of models. Bagging [5], boosting [14] and random subspaces [20] are the most popular examples of this methodology. While bagging obtains a bootstrap sample by uniformly sampling with replacement from the original training set, boosting resamples or reweights the training data by putting more emphasis on instances that were misclassified by previous classifiers. Like bagging, the random subspace method (RSM) is another excellent source of diversity; through feature-set manipulation it provides different views of the data and improves the quality of the classification solutions.
Recently, besides classification ensembles, clustering [29, 31] and semi-supervised learning [26, 32, 17] ensembles have also appeared, for which it has been shown that combining the strengths of a diverse set of clusterings or semi-supervised learners can often yield more accurate and robust solutions. Last but not least, considerable attention has been paid to exploiting the power of ensembles to identify and remove irrelevant features in supervised [6, 27], unsupervised [21, 22, 12] and semi-supervised [2] settings.
3 Constraint Laplacian Score
In this section we give a brief description of the CLS score [3] upon which our framework relies. CLS utilizes both parts of the data, labeled and unlabeled. The labeled part is transformed into pairwise constraints, which can be divided into two subsets, Ω_ML (a set of Must-Link constraints) and Ω_CL (a set of Cannot-Link constraints), defined as follows (a short construction sketch is given after the definitions):
– Must-Link constraint (M L): involving two instances xi and xj , specifies
that they have the same label.
– Cannot-Link constraint (CL): involving two instances xi and xj , specifies
that they have different labels.
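For concreteness, a minimal sketch of this construction (variable names are illustrative):

    from itertools import combinations

    def build_constraints(labels):
        """All L(L-1)/2 pairs over the labeled instances: same label -> Must-Link,
        different labels -> Cannot-Link. `labels` maps instance index -> class."""
        ML, CL = [], []
        for (i, yi), (j, yj) in combinations(labels.items(), 2):
            (ML if yi == yj else CL).append((i, j))
        return ML, CL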
Let X be a dataset of n instances characterized by p features. X consists of
two subsets: XL for labeled data and XU for unlabeled data.
Let r be a feature to evaluate. We define its vector by f_r = (f_{r1}, ..., f_{rn}). The CLS of r, which should be minimized, is computed by:

    CLS_r = \frac{\sum_{i,j} (f_{ri} - f_{rj})^2 S_{ij}}{\sum_i \sum_{j \mid \exists l, (x_l, x_j) \in \Omega_{CL}} (f_{ri} - \alpha_{rj}^{i})^2 D_{ii}}    (1)

where D is a diagonal matrix with D_{ii} = \sum_j S_{ij}, and S_{ij} is defined by the neighborhood relationship between the instances x_i (i = 1, ..., n) as follows:
    S_{ij} = \begin{cases} e^{-\frac{\lVert x_i - x_j \rVert^2}{\lambda}} & \text{if } ((x_i, x_j) \in X_U \text{ and } x_i, x_j \text{ are neighbors}) \text{ or } (x_i, x_j) \in \Omega_{ML} \\ 0 & \text{otherwise} \end{cases}    (2)
Algorithm 1 CLS
Require: A data set X (n × p), which consists of two subsets: X_L (L × p), the subset of labeled training instances, and X_U (U × p), the subset of unlabeled training instances; the input space F = {f_1, ..., f_p}; the constant λ and the neighborhood degree k.
1: Construct the constraint sets (Ω_ML and Ω_CL) from the labeled part X_L.
2: Calculate the dissimilarity matrix S and the diagonal matrix D.
3: for r = 1 to p do
4:    Calculate CLS_r according to eq. (1).
5: end for
where λ is a constant to be set, and “x_i, x_j are neighbors” means that x_i is among the k nearest neighbors of x_j.

    \alpha_{rj}^{i} = \begin{cases} f_{rj} & \text{if } (x_i, x_j) \in \Omega_{CL} \\ \mu_r & \text{if } i = j \text{ and } x_i \in X_U \\ f_{ri} & \text{otherwise} \end{cases}    (3)

where \mu_r = \frac{1}{n} \sum_i f_{ri} (the mean of the feature vector f_r).
CLS represents an enhanced version of both the Laplacian [18] and constraint-based [33] scores. In fact, the Laplacian score can be seen as a special case of CLS when there are no labels (X = X_U), and when X = X_L, CLS can be considered an adjusted version of the constraint score [33]. In CLS, we proposed a more efficient combination of both scores through a new score function, including the geometrical structure of the unlabeled data and the constraint-preserving ability of the labeled data. With CLS, on the one hand, a relevant feature should be one on which two instances that are neighbors or related by an ML constraint are close to each other. On the other hand, a relevant feature should be one with a larger variance, or one on which two instances related by a CL constraint are well separated. We present the whole procedure of CLS in Algorithm 1.
Note that this algorithm is computed in time O(p × max(n², log p)). To reduce this complexity, we proposed in our prior work [3] to apply a clustering to X_U. The idea was to substitute this huge part of the data with a smaller one, X_U' = (u_1, ..., u_K), that preserves the geometric structure of X_U, where K is the number of clusters. We proposed to use Self-Organizing Map (SOM) based clustering [24], for its ability to preserve the topological relationships of the data and thus the geometric structure of their distribution. With this strategy, we reduced the complexity to O(p × max(U, log p)), where U is the size of X_U.
4 Ensemble Constraint Laplacian Score
In this section we present our ensemble-based approach to the Constrained Laplacian Score for semi-supervised feature selection.
As discussed before, the most important condition for a successful ensemble learning method is to combine models which are different from each other. Thus, to maintain diversity between committee members, we have employed two strategies. Firstly, a well-known ensemble method, RSM [20], is employed to face the curse of dimensionality by constructing multiple learners, each one trained on a different subset of examples projected onto a smaller feature set RSM_i. Secondly, diversity is further maintained by applying the bootstrapping method [14].
The formal description of our approach is given in Algorithm 2. Given a set of labeled training examples X_L and a set of unlabeled training examples X_U, described over the input space F = {f_1, ..., f_p}, our approach constructs a committee according to the following steps. First, as described in steps 3 and 4 of Algorithm 2, the committee is constructed as follows: for each ensemble component i, a replicate X_L,b^i of the labeled data set is obtained by selecting instances from X_L with replacement and then projecting them onto RSM_i, a feature subspace with m randomly selected features (m < p). The unlabeled part X_U is also projected onto RSM_i to generate X_U^i. Once each ensemble component i is obtained, the CLS score in Algorithm 1 is used to measure feature relevance (step 6). A ranking of all features is finally obtained with respect to their average relevance over all ensemble members (steps 7 to 9).
A single learner is known to produce poor results when the learning algorithm breaks down on high-dimensional data. Ensemble learning paradigms train multiple component learners and then combine their outputs. Ensemble techniques are considered an effective solution to overcome the dimensionality problem and to improve the robustness and generalization ability of single learners.
By using bagging in tandem with random feature subspaces, our framework tries to deal with three different problems of the CLS score:
– High dimensionality: The major drawback of CLS was its application to high-dimensional data. This is because the Euclidean distances between examples (over all features) are an essential factor in the score function (S_ij in equation (1)), and the calculation of such distances becomes less reliable when dealing with very high-dimensional data, leading to bad feature scores. Motivated by this, we adopt a random manipulation strategy over the feature space (RSM). Hence, we create N random subspaces of the original features with a nearly equal appearance probability for all features. The dimension is then reduced in each subspace and the distances calculated in the reduced dimension are more reliable. Consequently, working on the projected random subspaces allows us to mitigate the curse of dimensionality and also helps enhance the diversity between ensemble components.
– Constraints: In CLS, instance-level constraints are generated directly from
labels. In a semi-supervised context such labels are scarce, so the number of
constraints (Ω = L(L − 1)/2, where L is the number of labeled instances)
is small as well. Moreover, the generated constraint set may contain
noisy constraints, which have been shown to have detrimental effects on the learning performance. In order to improve the positive effects of the pairwise
constraints, we propose the use of the bagging method on the labeled part of the
data in each random subspace (see the illustrative sketch after this list). The bagging is done by sampling with replacement. The reason for using bagging is to enforce diversity in the pairwise
constraints and thus to compute an estimation of the feature score that is robust against
small changes in the pairwise constraint set. Furthermore, the different bootstrap samples in the different random subspaces help in reducing the undesirable
effects of the noisy constraints.
– Unlabeled instance diversity: The computation of the CLS score required
the application of a clustering algorithm (SOM) to overcome the computational complexity of the score function, since the complexity of the CLS score
depends heavily on the unlabeled part of the data. Such clustering was shown to considerably reduce this complexity.
In this work, based on the random subspace approach, we keep the use of the
SOM algorithm in each subspace. In this way, not only is the computational
complexity reduced, but diversity is also gained from the different
clusterings obtained in the different subspaces.
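The illustrative sketch referred to in the second item simply enumerates the Ω = L(L − 1)/2 pairs of labeled instances and splits them into must-link and cannot-link constraints; the function name is hypothetical.

```python
from itertools import combinations

def pairwise_constraints(y_labeled):
    """Derive must-link (ML) and cannot-link (CL) constraints from the labels of
    the L labeled instances; in total L(L-1)/2 pairs are generated."""
    must_link, cannot_link = [], []
    for i, j in combinations(range(len(y_labeled)), 2):
        (must_link if y_labeled[i] == y_labeled[j] else cannot_link).append((i, j))
    return must_link, cannot_link
```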
Algorithm 2 The EnsCLS algorithm
Require: Set of labeled training examples (XL); set of unlabeled training examples (XU); input space (F = {f1, . . . , fp}); committee size (N)
1: Initialize the scores I(fr) to zero for each feature r
2: for i = 1 : N do
3:   RSM^i = randomly draw m features from F
4:   XL,b^i = bootstrap sample from XL projected onto RSM^i
5:   XU^i = the unlabeled sample XU projected onto RSM^i
6:   imp^i = CLS(XL,b^i, XU^i): compute the constraint Laplacian score of each feature in RSM^i using Algorithm 1
7:   for each feature r ∈ RSM^i do
8:     I(fr) = I(fr) + imp^i(fr)/N
9:   end for
10: end for
11: rank the features in F according to their scores I in ascending order
12: return F
5 Experimental results

In this section, we provide empirical results on several benchmark and real high-dimensional datasets and compare EnsCLS against other state-of-the-art semi-supervised feature ranking algorithms.
Table 1. The datasets used in the experiments
Dataset    # patterns  # features  # classes  Reference
BaseHock   1993        4862        2          [36]
Leukemia   73          7129        2          [15]
Lymphoma   96          4026        9          [1]
Madelon    2598        500         2          [4]
PcMac      1943        3289        2          [36]
PIE10P     210         2420        10         [36]
PIX10P     100         10000       10         [36]
Prostata   102         12533       2          [28]
Relathe    1427        4322        2          [36]
EnsCLS is compared with four other
feature selection methods: (1) the original CLS score [3], (2) the Constrained
Selection based Feature Selection framework (CSFS) [19], and two ensemble-based
feature evaluation algorithms, namely (3) the Bagging Constraint Score (BS)
[30] and (4) the wrapper-type Semi-Supervised Feature Importance approach
(SSFI) [2]. Nine benchmark and real labeled datasets were used to assess the
performance of the feature selection algorithms. They are described in Table 1. We
selected these datasets because they contain thousands of features and are thus good
candidates for feature selection. Most of these datasets have already been used in
various empirical studies [35, 2] and cover different application domains: biology,
image and text analysis.
5.1 Evaluation framework
To make fair comparisons, the same experimental settings as in [3] were adopted here
for the CLS and CSFS approaches, i.e., a neighborhood graph with a neighborhood
degree of 10 and a λ value set to 0.1. For BS, we set the ensemble size to
100, since around this value the quality of this method becomes insensitive to further
increases of the ensemble size (cf. [30]). EnsCLS and SSFI are tuned similarly.
The number of features per bag is m = √p, where p is the size of the input
space. The committee size N is computed using the following formula:

    N = 10 × ⌈ log(0.01) / log(1 − 1/√p) ⌉.    (4)
This formula ensures that each feature is drawn ten times at a confidence level
of 0.01. Furthermore, as suggested by the authors in [2], the number of iterations
maxiter and the sample size n in SSFI are set to 10 and 1, respectively.
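As a quick check of equation (4), the committee size can be computed directly; the snippet below is a transcription of the formula with hypothetical names, not part of the original code.

```python
import math

def committee_size(p, times_drawn=10, confidence=0.01):
    """Committee size N from equation (4): each feature is drawn roughly
    `times_drawn` times at the given confidence when m = sqrt(p) features
    are selected per subspace."""
    return times_drawn * math.ceil(math.log(confidence) / math.log(1 - 1 / math.sqrt(p)))

# For example, p = 10000 gives N = 10 * ceil(log(0.01)/log(0.99)) = 4590.
```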
For each dataset, experimental results are averaged over 10 runs. At each run,
the whole dataset is split (in a stratified way) into a training partition with
2/3 of the observations and a test partition with the remaining 1/3.
The training set is further split into labeled and unlabeled subsets. As in [35],
Algorithm 3 Feature Evaluation Framework
1: for each dataset X do
2:   build a randomly stratified partition (Tr, Te) from X, where |Tr| = (2/3)|X| and |Te| = (1/3)|X|
3:   generate the labeled data XL by randomly sampling from Tr 3 instances per class
4:   XU = Tr \ XL
5:   SF_CLS = apply CLS with XL ∪ XU
6:   SF_CSFS = apply CSFS with XL ∪ XU
7:   SF_BS = apply BS with XL ∪ XU
8:   SF_SSFI = apply SSFI with XL ∪ XU
9:   SF_EnsCLS = apply EnsCLS with XL ∪ XU
10:  for i = 1 to 20 do
11:    select the top i features from SF_CLS, SF_CSFS, SF_BS, SF_SSFI and SF_EnsCLS
12:    Tr_CLS = Π_SF_CLS(Tr)
13:    Tr_CSFS = Π_SF_CSFS(Tr)
14:    Tr_BS = Π_SF_BS(Tr)
15:    Tr_SSFI = Π_SF_SSFI(Tr)
16:    Tr_EnsCLS = Π_SF_EnsCLS(Tr)
17:    train the base learner using Tr_CLS, Tr_CSFS, Tr_BS, Tr_SSFI and Tr_EnsCLS and record the accuracy obtained on Te
18:  end for
19: end for
the labeled sample set XL consists of 3 randomly selected patterns per class,
and the remaining patterns are used as the unlabeled sample set XU. In order to
assess the quality of a feature subset obtained with the aforementioned semi-supervised procedures, we train an SVM classifier (using the LIBSVM package [7])
on the whole labeled training data and evaluate its accuracy on the test data. The
latter is taken as the score for the feature subset. The details of the evaluation
framework are shown in Algorithm 3. As mentioned above, the process specified
in Algorithm 3 is repeated 10 times. The obtained accuracy is averaged and
used for evaluating the quality of the feature subset selected by each
algorithm.
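Step 17 of Algorithm 3 can be sketched as follows; scikit-learn's SVC (which is built on LIBSVM) is used here as a stand-in for the LIBSVM package, and the variable names are illustrative.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def score_feature_subset(selected, X_train, y_train, X_test, y_test):
    """Train an SVM on the labeled training data restricted to the `selected`
    features and return its accuracy on the test partition."""
    clf = SVC()  # scikit-learn's SVC wraps LIBSVM
    clf.fit(X_train[:, selected], y_train)
    return accuracy_score(y_test, clf.predict(X_test[:, selected]))

# Usage: acc = score_feature_subset(ranking[:5], X_train, y_train, X_test, y_test)
```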
5.2 Results
In Figure 1, we plot the accuracies of the above feature selection approaches
against the 20 most important features. As may be observed, EnsCLS outperforms the other four methods by a noticeable margin. The major observations
from the analysis of these plots are three-fold:
– EnsCLS usually has better performance than CLS and CSFS. This firstly
validates the motivation behind our method EnsCLS, that the ensemble strategy
[Nine panels, one per dataset (BaseHock, Leukemia, Lymphoma, Madelon, PcMac, PIE10P, PIX10P, Prostata, Relathe), each plotting accuracy against the number of selected features for EnsCLS, CLS, CSFS, BS and SSFI.]
Fig. 1. Accuracy vs. different numbers of selected features. The number of labeled instances per class is set to 3.
has the potential to improve the quality and the stability of the CLS score
and also confirms the effectiveness of this ensemble strategy to rank the
features properly, compared to the powerful constraint selection method used
in CSFS.
– EnsCLS seems to combine more efficiently the labeled and unlabeled data
for feature evaluation and it shows promise for scaling to larger domains in a
semi-supervised way in view of the good performance on BaseHock, PcMac,
Madelon and Relathe datasets. This suggests the ability of the proposed
Table 2. Means and standard deviations of accuracy over the 20 most important
features. The bottom row presents the average rank of each method over the mean
accuracies, as used in the computation of the Friedman test.
Data       EnsCLS        CLS           CSFS          BS            SSFI
BaseHock   0.695±0.01    0.507±0.00    0.513±0.01    0.600±0.05    0.675±0.03
Leukemia   0.781±0.06    0.740±0.10    0.751±0.11    0.618±0.04    0.760±0.08
Lymphoma   0.702±0.03    0.480±0.03    0.490±0.03    0.647±0.04    0.680±0.06
Madelon    0.594±0.01    0.542±0.04    0.549±0.03    0.499±0.01    0.548±0.04
PcMac      0.703±0.01    0.515±0.01    0.517±0.01    0.543±0.03    0.638±0.02
PIE10P     0.734±0.07    0.535±0.04    0.535±0.04    0.696±0.11    0.701±0.07
PIX10P     0.907±0.03    0.837±0.05    0.837±0.05    0.882±0.03    0.902±0.03
Prostata   0.749±0.04    0.507±0.02    0.511±0.02    0.538±0.08    0.735±0.10
Relathe    0.660±0.01    0.550±0.00    0.560±0.00    0.562±0.02    0.553±0.00
Av Rank    1.0000        4.6667        3.6667        3.3333        2.3333
ensemble method of CLS to rank the relevant features accurately, compared
especially to the other ensemble semi-supervised feature selection approaches
(BS and SSFI), by efficiently exploiting the topological information from the
unlabeled data.
– A closer inspection of the plots reveals that the accuracy on the features
selected by EnsCLS generally increases swiftly at the beginning (the number
of selected features is small) and slows down afterwards. This suggests that
EnsCLS ranks the most relevant features first and that a classifier can achieve
a very good classification accuracy with the top 5 features while the other
methods require more features to achieve comparable results.
For the sake of completeness, we also averaged the accuracy over different numbers of selected features. The averaged accuracies of EnsCLS and the other methods over the top 20 features are reported in Table 2. In order to better assess
the results obtained for each algorithm, we adopt in this study the methodology
proposed by [9] for the comparison of several algorithms over multiple datasets.
In this methodology, the non-parametric Friedman test is firstly used to evaluate
the rejection of the hypothesis that all the classifiers perform equally well for
a given risk level. It ranks the algorithms for each dataset separately, the best
performing algorithm getting the rank of 1, the second best rank 2 etc. In case
of ties it assigns average ranks. Then, the Friedman test compares the average
ranks of the algorithms and calculates the Friedman statistic. If a statistically
significant difference in the performance is detected, we proceed with a post hoc
test. The Nemenyi test is used to compare all the classifiers to each other. In
this procedure, the performance of two classifiers is significantly different if their
average ranks differ more than some critical distance (CD). The critical distance
depends on the number of algorithms, the number of data sets and the critical
value (for a given significance level p) that is based on the Studentized range
statistic (see [9] for further details). In this study, based on the values in Table 2,
the Friedman test reveals statistically significant differences (p < 0.05) between
all compared approaches.
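A minimal sketch of this procedure, assuming the mean accuracies of Table 2 are stored in an array scores of shape (datasets × methods); the use of scipy and the constant q = 2.459 (the Studentized-range value for five methods at p = 0.1, which reproduces CD = 1.8336 for nine datasets) are assumptions of the sketch, not part of the original study.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_nemenyi(scores, q_alpha=2.459):
    """scores: (n_datasets, n_methods) mean accuracies (higher is better).
    Returns the Friedman p-value, the average rank of each method and the
    Nemenyi critical difference."""
    n_datasets, n_methods = scores.shape
    _, p_value = friedmanchisquare(*scores.T)              # one sample per method
    ranks = np.array([rankdata(-row) for row in scores])   # rank 1 = best, ties averaged
    avg_ranks = ranks.mean(axis=0)
    cd = q_alpha * np.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))
    return p_value, avg_ranks, cd
```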
Furthermore, we present the results of the Nemenyi post hoc test with average rank diagrams, as suggested by Demšar [9]. These are given in Figure 2.
The ranks are depicted on the axis, in such a manner that the best ranking
algorithms are at the rightmost side of the diagram. The algorithms that do not
differ significantly (at p = 0.1) are connected with a line. The critical difference
CD is shown above the graph (here CD=1.8336).
[Average ranks diagram: CD = 1.8336; ranks from best (rightmost) to worst: EnsCLS, SSFI, BS, CSFS, CLS.]
Fig. 2. Average ranks diagram comparing the feature selection algorithms in
terms of accuracy over different number of selected features.
Overall, EnsCLS performs best. However, its performance is not statistically distinguishable from that of SSFI. Another interesting observation from the average rank diagram and Table 2 is that, in
almost all cases, the ensemble methods, i.e. EnsCLS, SSFI and BS, achieve better performance than the single methods CLS and CSFS.
The statistical tests we use are conservative, and the differences in performance for the methods within the first group (EnsCLS and SSFI) are not significant.
To further support these rank comparisons, we compared, on each dataset and
for each pair of methods, the accuracy values in Table 2 using the paired t-test
(with p = 0.1). The results of these pairwise comparisons are depicted in Table
3 in terms of the “win-tie-loss” status of all pairs of methods; the three values
in each cell (i, j) respectively indicate how many times the approach i is significantly better than, not significantly different from, or significantly worse than the approach j.
Following [9], if the two algorithms are, as assumed under the null-hypothesis,
equivalent, each should win on approximately n/2 out of n data sets. The number of wins is distributed according to the binomial distribution and the critical
number of wins at p = 0.1 is equal to 7 in our case. Since tied matches support
the null-hypothesis we should not discount them but split them evenly between
the two classifiers when counting the number of wins; if there is an odd number
of them, we again ignore one.
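The critical number of wins quoted above can be reproduced from the binomial distribution; the following snippet is illustrative and uses scipy, which is not mentioned in the original study.

```python
from scipy.stats import binom

def critical_wins(n_datasets, alpha=0.1):
    """Smallest number of wins w such that winning on w out of n datasets is
    unlikely (probability <= alpha) under the null hypothesis of equivalence
    (wins ~ Binomial(n, 0.5)); ties are split evenly before counting."""
    for w in range(n_datasets + 1):
        if binom.sf(w - 1, n_datasets, 0.5) <= alpha:   # P(X >= w)
            return w
    return None

# critical_wins(9, alpha=0.1) -> 7, the value used in the comparison above.
```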
In Table 3, each pairwise comparison entry (i, j) for which the approach i
is significantly better than j is boldfaced. From this table, the analogous trend
between EnsCLS and other feature selection methods can be observed as in Table
Table 3. Pairwise t-test comparisons of the FS methods in terms of accuracy. A bold
cell (i, j) indicates that the approach i is significantly better than j according
to the sign test at p = 0.1.
           EnsCLS    CLS       CSFS      BS        SSFI
EnsCLS     −         8/1/0     8/1/0     8/1/0     5/4/0
CLS        0/1/8     −         2/3/4     2/2/5     0/3/6
CSFS       0/1/8     4/3/2     −         2/2/5     1/2/6
BS         0/1/8     5/2/2     5/2/2     −         0/2/7
SSFI       0/4/5     6/3/0     6/2/1     7/2/0     −
2 and Figure 2, i.e., EnsCLS and SSFI usually have better performance than
all other methods. In addition, it can be seen from Table 3 that EnsCLS
significantly outperforms SSFI.
6 Conclusion
The Constraint Laplacian Score (CLS), which uses pairwise constraints for feature
selection, has shown good performance in our previous work [3]. However, one
important problem of such an approach is how to best use the available constraints
and deal with low-quality ones that may deteriorate the learning performance.
Instead of making efforts to choose constraints for single feature selection, as
recently done in the CSFS approach [19], we address this important issue in this paper from another viewpoint. We propose a novel semi-supervised feature selection method called Ensemble Constraint Laplacian Score (EnsCLS for short),
which first combines data resampling (bagging) and random subspace
strategies for generating different views of the data. Once each ensemble component is obtained, the CLS score is used to measure feature relevance. A ranking
of all features is finally obtained with respect to their average relevances over all
ensemble members.
Extensive experiments on a series of benchmark and real datasets have verified the effectiveness of our approach compared to other state-of-the-art semi-supervised feature selection algorithms and confirm the ability of the used ensemble strategy to rank the relevant features accurately. They also show that the
proposed EnsCLS method can utilize labeled and unlabeled data in a more effective way than the Constraint Laplacian Score. Furthermore, they indicate that our
method, which injects some randomness when manipulating the available unlabeled
and labeled data (constraints), is superior to the recently proposed CSFS method,
which actively selects constraints to improve the quality of the CLS score.
Further substantiation, through more experiments on biological databases containing several thousands of variables and through evaluating the stability of the
feature selection method [25, 27] when small changes are made to the data, is
currently being undertaken. Moreover, comparisons using different numbers of
pairwise constraints will be reported in due course.
References
1. A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C.
Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore,
J. Hudson, L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C.
Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, R. Levy, W. Wilson,
M. R. Grever, J. C. Byrd, D. Botstein, P. O. Brown, and L. M. Staudt. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
Nature, 403(6769):503–511, 2000.
2. Hasna Barkia, Haytham Elghazel, and Alex Aussem. Semi-supervised feature importance evaluation with ensemble learning. In ICDM, pages 31–40, 2011.
3. K. Benabdeslem and M. Hindawi. Constrained laplacian score for semi-supervised
feature selection. In Proceedings of ECML-PKDD conference, pages 204–218, 2011.
4. C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.
5. L. Breiman. Bagging predictors. Machine Learning, 26(2):123–140, 1996.
6. Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
7. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector
machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27,
2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
8. I. Davidson, K. Wagstaff, and S. Basu. Measuring constraint-set utility for partitional clustering algorithms. In Proceedings of ECML/PKDD, 2006.
9. Janez Demsar. Statistical comparisons of classifiers over multiple data sets. Journal
of Machine Learning Research, 7:1–30, 2006.
10. T.G. Dietterich. Ensemble methods in machine learning. In First International
Workshop on Multiple Classifier Systems, pages 1–15, 2000.
11. J.G. Dy and C. E. Brodley. Feature selection for unsupervised learning. Journal
of Machine Learning Research, (5):845–889, 2004.
12. Haytham Elghazel and Alex Aussem. Unsupervised feature selection with ensemble
learning. Machine Learning, pages 1–24, 2013.
13. R. Fisher. The use of multiple measurements in taxonomic problems. Annals
Eugen, 7:179–188, 1936.
14. Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In 13th
International Conference on Machine Learning, pages 276–280, 1996.
15. T.R. Golub, Slonim, D.K., P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov,
and H. Coller. Molecular classification of cancer: Class discovery and class prediction
by gene expression monitoring. Science, 286:531–537, 1999.
16. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal
of Machine Learning Research, (3):1157–1182, 2003.
17. M. F. Abdel Hady and F. Schwenker. Combining committee-based semi-supervised
learning and active learning. Journal of Computer Science and Technology,
25(4):681–698, 2010.
18. X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. In Advances
in Neural Information Processing Systems, 17, 2005.
19. M. Hindawi, K. Allab, and K. Benabdeslem. Constraint selection based semi-supervised feature selection. In Proceedings of the International Conference on Data
Mining, pages 1080–1085, 2011.
20. Tin Kam Ho. The random subspace method for constructing decision forests. IEEE
Trans. Pattern Anal. Mach. Intell., 20(8):832–844, 1998.
21. Yi Hong, Sam Kwong, Yuchou Chang, and Qingsheng Ren. Consensus unsupervised feature ranking from multiple views. Pattern Recognition Letters, 29(5):595–
602, 2008.
22. Yi Hong, Sam Kwong, Yuchou Chang, and Qingsheng Ren. Unsupervised feature
selection using clustering ensembles and population based incremental learning
algorithm. Pattern Recognition, 41(9):2742–2756, 2008.
23. M. Kalakech, P. Biela, L. Macaire, and D. Hamad. Constraint scores for semisupervised feature selection: A comparative study. Pattern Recognition Letters,
32(5):656–665, 2011.
24. T. Kohonen. Self organizing Map. Springer Verlag, Berlin, 2001.
25. Ludmila I. Kuncheva. A stability index for feature selection. AIAP’07, pages
390–395, 2007.
26. M. Li and Z. H. Zhou. Improve computer-aided diagnosis with machine learning
techniques using undiagnosed samples. IEEE Transactions on Systems, Man, and
Cybernetics, 37(6):1088–1098, 2007.
27. Yvan Saeys, Thomas Abeel, and Yves Van de Peer. Robust feature selection using
ensemble feature selection techniques. In ECML/PKDD (2), pages 313–325, 2008.
28. Dinesh Singh, Phillip G. Febbo, Kenneth Ross, Donald G. Jackson, Judith Manola,
Christine Ladd, Pablo Tamayo, Andrew A. Renshaw, Anthony V. D’Amico, and
Jerome P. Richie. Gene expression correlates of clinical prostate cancer behavior.
Cancer Cell, 1(2):203–209, 2002.
29. A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for
combining multiple partitions. Journal of Machine Learning Research, 3:583–617,
2002.
30. Dan Sun and Daoqiang Zhang. Bagging constraint score for feature selection with
pairwise constraints. Pattern Recognition, 43:2106–2118, 2010.
31. A. Topchy, A.K. Jain, and W. Punch. Clustering ensembles: Models of consensus and weak partitions. IEEE Transaction on Pattern Analysis and Machine
Intelligence, 27(12):1866–1881, 2005.
32. Y. Yaslan and Z. Cataltepe. Co-training with relevant random subspaces. Neurocomputing, 73(10-12):1652–1661, 2010.
33. D. Zhang, S. Chen, and Z. Zhou. Constraint score: A new filter method for feature
selection with pairwise constraints. Pattern Recognition, 41(5):1440–1451, 2008.
34. Z. Zhao and H. Liu. Semi-supervised feature selection via spectral analysis. In
Proceedings of SIAM International Conference on Data Mining (SDM), pages 641–
646, 2007.
35. Z. Zhao and H. Liu. Semi-supervised feature selection via spectral analysis. In
SDM, pages 641–646, 2007.
36. Zheng Zhao, Fred Morstatter, Shashvata Sharma, Salem Alelyani, and Aneeth
Anand. Feature selection, 2011.
Feature ranking for multi-label classification
using predictive clustering trees
Dragi Kocev, Ivica Slavkov, and Sašo Džeroski
Department of Knowledge Technologies, Jožef Stefan Institute,
Jamova 39, 1000 Ljubljana, Slovenia
{Dragi.Kocev,Ivica.Slavkov,Saso.Dzeroski}@ijs.si
Abstract. In this work, we present a feature ranking method for multi-label data. The method is motivated by practically relevant multi-label applications, such as semantic annotation of images and videos,
functional genomics, and music and text categorization. We propose a
feature ranking method based on random forests. Considering the success of feature ranking using random forests on the tasks of classification and regression, we extend this method to multi-label classification.
We use predictive clustering trees for multi-label classification as base
predictive models for the random forest ensemble. We evaluate the proposed method on benchmark datasets for multi-label classification. The
evaluation shows that the method produces valid feature
rankings and that these can be successfully used for performing dimensionality
reduction.
Key words: multi-label classification, feature ranking, random forest,
predictive clustering trees
1 Introduction
The problem of single-label classification is concerned with learning from examples, where each example is associated with a single label λi from a finite set
of disjoint labels L = {λ1 , λ2 , ..., λQ }, i = 1..Q, Q > 1. For Q > 2, the learning
problem is referred to as multi-class classification. On the other hand, the task of
learning a mapping from an example x ∈ X (X denotes the domain of examples)
to a set of labels Y ⊆ L is referred to as a multi-label classification (MLC). In
contrast to multi-class classification, alternatives in multi-label classification are
not assumed to be mutually exclusive: multiple labels may be associated with
a single example, i.e., each example can be a member of more than one class.
Labels in the set Y are called relevant, while the labels in the set L \ Y are
irrelevant for a given example.
Many different methods have been developed for solving MLC problems.
Tsoumakas et al. [16] summarize them into two main categories: a) algorithm
adaptation methods, and b) problem transformation methods. Algorithm adaptation methods extend specific learning algorithms to handle multi-label data directly. Problem transformation methods, on the other hand, transform the MLC
problem into one or more single-label classification problems. The single-label
classification problems are solved with a commonly used single-label classification method and the output is transformed back into a multi-label representation.
The issue of learning from multi-label data has recently attracted significant
attention from many researchers, motivated by an increasing number of new applications. The latter include semantic annotation of images and videos (news
clips, movies clips), functional genomics (gene and protein function), music categorization into emotions, text classification (news articles, web pages, patents,
emails, bookmarks, ...), directed marketing and others.
Despite the popularity of the task of MLC, the tasks of feature ranking and
feature selection have not received much attention. The few available methods
are based on the problem transformation paradigm [14], thus they do not fully
exploit the possible label dependencies. More specifically, these methods use the
label powerset (LP) approach for MLC [16, 6] from the group of problem transformation methods. The basis of the LP methods is to combine entire label sets
into atomic (single) labels to form a single-label problem (i.e., single-class classification problem). For the single-label problem, the set of possible single labels
represents all distinct label subsets from the original multi-label representation.
In this way, LP based methods directly take into account the label correlations.
However, the space of possible label subsets can be very large. To resolve this
issue, Read [11] has developed a pruned problem transformation (PPT) method
that selects only the transformed labels that occur more than a predefined number of times. Tsoumakas et al. [16] use the LP-transformed dataset to calculate
a simple χ² statistic, thus producing a ranking of the features. Doquire and Verleysen [6] use the PPT-transformed dataset to calculate mutual information (MI)
for performing feature selection and they show that this method outperforms
the χ2 -based feature ranking.
Feature ranking for MLC with problem transformation has two major shortcomings. First, the label dependencies and interconnections are not fully exploited. Second, these methods do not scale to domains with a large number of labels because of the exponential growth of the number of possible label powersets.
Furthermore, the label powerset methods can yield a multi-class problem with
extremely skewed class distribution. To address these issues, we propose an algorithm adaptation method for performing feature ranking. We extend the random
forest feature ranking method [3] to the task of MLC. More specifically, we construct a random forest that employs predictive clustering trees (PCTs) for MLC
as base predictive models [1], [8]. PCTs can be considered a generalization of
decision trees that are able to make predictions for structured outputs.
This work is motivated by several factors. First, the number of possible application domains for MLC and the size of the problems are increasing. For example,
in image annotation the number of available images and possible labels is growing rapidly; in functional genomics the measurement techniques have improved
significantly and there are high-dimensional genomic data available for analysis. Second, in Madjarov et al. [10] we have shown that the random forests of
PCTs for MLC are among the best predictive models for the task of MLC. Next,
random forests as feature ranking algorithms are very successful on simple classification tasks [17, 19]. Finally, in Kocev et al. [8] we have shown that the random
forests of PCTs are among the most efficient methods for predicting structured
outputs. This is especially important, since many of the methods for MLC are
computationally expensive and thus are not able to produce a predictive model
for a given domain in a reasonable time (i.e., a few weeks) [10].
We evaluate the proposed method on 4 benchmark datasets for MLC using
7 different evaluation measures. We compare the feature ranking produced by
the proposed method to a random feature ranking. The random feature ranking
is the worst feature ranking thus if the proposed method is able to capture the
feature relevances then it should outperform the random ranking. We assess the
performance of the obtained rankings and the random rankings by using error
testing curves [12]. The goal of this study is to investigate whether random forests
of PCTs for MLC can produce good feature rankings for the task of multi-label
classification. Moreover, we want to check whether the produced rankings can
be used to reduce the dimensionality of the considered multi-label domains.
The remainder of this paper is organized as follows. Section 2 presents the predictive clustering trees. The method for feature ranking using random forests is
described in Section 3. Section 4 outlines the experimental design, while Section
5 presents the results from the experimental evaluation. Finally, the conclusions
and a summary are given in Section 6.
2 Predictive clustering trees for multi-label classification
Predictive clustering trees (PCTs) [1] generalize decision trees [4] and can be
used for a variety of learning tasks, including different types of prediction and
clustering. The PCT framework views a decision tree as a hierarchy of clusters:
the top-node of a PCT corresponds to one cluster containing all data, which is
recursively partitioned into smaller clusters while moving down the tree. The
leaves represent the clusters at the lowest level of the hierarchy and each leaf is
labeled with its cluster’s prototype (prediction).
PCTs can be induced with a standard top-down induction of decision trees
(TDIDT) algorithm [4]. The algorithm is presented in Table 1. It takes as input
a set of examples (E) and outputs a tree. The heuristic (h) that is used for
selecting the tests (t) is the reduction in variance caused by partitioning (P)
the instances (see line 4 of BestTest procedure in Table 1). By maximizing the
variance reduction the cluster homogeneity is maximized and it improves the
predictive performance. If no acceptable test can be found (see line 6), that is, if
the test does not significantly reduce the variance, then the algorithm creates
a leaf and computes the prototype of the instances belonging to that leaf.
The main difference between the algorithm for learning PCTs and other
algorithms for learning decision trees is that the former considers the variance
function and the prototype function (that computes a label for each leaf) as
parameters that can be instantiated for a given learning task. So far, PCTs have
Table 1. The top-down induction algorithm for PCTs.
procedure PCT(E) returns tree
1: (t*, h*, P*) = BestTest(E)
2: if t* ≠ none then
3:   for each Ei ∈ P* do
4:     tree_i = PCT(Ei)
5:   return node(t*, ∪i {tree_i})
6: else
7:   return leaf(Prototype(E))

procedure BestTest(E)
1: (t*, h*, P*) = (none, 0, ∅)
2: for each possible test t do
3:   P = partition induced by t on E
4:   h = Var(E) − Σ_{Ei ∈ P} (|Ei|/|E|) · Var(Ei)
5:   if (h > h*) ∧ Acceptable(t, P) then
6:     (t*, h*, P*) = (t, h, P)
7: return (t*, h*, P*)
been instantiated for the following tasks: prediction of multiple targets [8], [15],
prediction of time-series [13] and hierarchical multi-label classification [18].
One of the most important steps in the induction algorithm is the test selection procedure. For each node, a test is selected by using a heuristic function
computed on the training examples. The goal of the heuristic is to guide the
algorithm towards small trees with good predictive performance. The heuristic used in this algorithm for selecting the attribute tests in the internal nodes
is the reduction in variance caused by partitioning the instances. Maximizing
the variance reduction maximizes cluster homogeneity and improves predictive
performance.
In this work, we focus on the task of multi-label classification, which can be
considered as a special case of multi-target prediction. Namely, in the task of
multi-target prediction, the goal is to make predictions for multiple target variables. The multiple labels in MLC can be viewed as multiple binary variables,
each one specifying whether a given example is labelled with the corresponding label. Therefore, we compute the variance function same as for the task of
predicting multiple discrete variables [8], i.e., the variance function is computed
as the sum of the Gini indices [4] of the variables from the target tuple, i.e.,
Var(E) = Σ_{i=1}^{T} Gini(E, Yi), with Gini(E, Yi) = 1 − Σ_{j=1}^{Ci} p²_{cij}, where T is the number of target attributes, cij is the j-th class of target attribute Yi and Ci is the
number of classes of target attribute Yi. The prototype function returns a vector
of probabilities for the set of labels that indicate whether an example is labelled
with a given label. For a detailed description of PCTs for multi-target prediction
the reader is referred to [1, 8]. The PCT framework is implemented in the CLUS
system1 .
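To make the variance function concrete, a small illustrative computation over a binary label-indicator matrix could look as follows; this is a sketch of the formulas above with hypothetical names, not CLUS code.

```python
import numpy as np

def mlc_variance(Y):
    """Y: (n_examples, n_labels) binary indicator matrix. Returns the sum of the
    per-label Gini indices, i.e. Var(E) = sum_i (1 - sum_j p_ij^2)."""
    variance = 0.0
    for i in range(Y.shape[1]):
        _, counts = np.unique(Y[:, i], return_counts=True)
        p = counts / counts.sum()
        variance += 1.0 - np.sum(p ** 2)
    return variance

def variance_reduction(E, partition):
    """Heuristic h used to select tests: Var(E) - sum_k |E_k|/|E| * Var(E_k)."""
    n = len(E)
    return mlc_variance(E) - sum(len(Ek) / n * mlc_variance(Ek) for Ek in partition)
```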
3 Feature ranking via random forests
We construct the random forest using predictive clustering trees as base classifiers. We exploit the random forests mechanism [3] to calculate the variable importance, i.e., the feature ranking. In the following subsections, first we present
the random forest algorithm and then we describe how it can be used to estimate
the importance of the descriptive variables.
¹ CLUS is available for download at http://clus.sourceforge.net
3.1 Random Forests
An ensemble is a set of classifiers constructed with a given algorithm. Each new
example is classified by combining the predictions of every classifier from the
ensemble. These predictions can be combined by taking the average (for regression tasks) and the majority or probability distribution vote (for classification
tasks)[2], or by taking more complex combinations [9].
A necessary condition for an ensemble to be more accurate than any of its individual members, is that the classifiers are accurate and diverse [7]. An accurate
classifier does better than random guessing on new examples. Two classifiers are
diverse if they make different errors on new examples. There are several ways to
introduce diversity: by manipulating the training set (by changing the weight of
the examples [2] or by changing the attribute values of the examples [3]), or by
manipulating the learning algorithm itself [5].
A random forest [3] is an ensemble of trees, where diversity among the predictive models is obtained by using bootstrap replicates, and additionally by
changing the feature set during learning. More precisely, at each node in the
decision trees, a random subset of the input attributes is taken, and the best
feature is selected from this subset. The number of attributes that are retained
is given by a function f of the total number of input attributes x (e.g., f(x) = 1,
f(x) = ⌊√x + 1⌋, f(x) = ⌊log₂(x) + 1⌋, . . .). By setting f(x) = x, we obtain the
bagging procedure.
3.2 Feature ranking using random forests
Feature ranking of the descriptive variables can be obtained by exploiting the
mechanism of random forests. This method uses the internal out-of-bag estimates of the error and permutes (noises) the values of the descriptive variables. To create each tree
from the forest, the algorithm first creates a bootstrap replicate (line 4, from
the Induce RF procedure, Table 2). The samples that are not selected for the
bootstrap are called out-of-bag (OOB) samples (line 7, procedure Induce RF ).
These samples are used to evaluate the performance of each tree from the forest.
The complete algorithm is presented in Table 2.
Suppose that there are T target variables and D descriptive variables. The
trees in the forest are constructed using a randomized variant of the PCT construction algorithm (P CTrand ), i.e., in each node the split is selected from a
subset of the descriptive variables. After each tree from the forest is built, the
values of the descriptive attributes for the OOB samples are randomly permuted
one at a time, thus obtaining D permuted OOB samples (line 3, procedure Update_Imp).
The predictive performance of each tree is evaluated on the original OOB data
(Err(OOBi )) and the permuted versions of the OOB data (Erri (fd )). The performance is averaged across the T target variables. Then the importance of a
given variable (Ij ) is calculated as the relative increase of the mis-classification
error that is obtained when its values are randomly permuted. The importance
is at the end averaged over all trees in the forest. The variable importance is
calculated using the following formula:
Table 2. The algorithm for feature ranking via random forests. E is the set of the
training examples, k is the number of trees in the forest, and f (x) is the size of the
feature subset that is considered at each node during tree construction.
procedure Induce_RF(E, k, f(x)) returns Forest, Importances
1: F = ∅
2: I = ∅
3: for i = 1 to k do
4:   Ei = Bootstrap_sample(E)
5:   Tree_i = PCT_rand(Ei, f(x))
6:   F = F ∪ Tree_i
7:   E_OOB = E \ Ei
8:   Update_Imp(E_OOB, Tree_i, I)
9: I = Average(I, k)
10: return F, I

procedure Update_Imp(E_OOB, Tree, I)
1: Err_OOB = Evaluate(Tree, E_OOB)
2: for j = 1 to D do
3:   Ej = Randomize(E_OOB, j)
4:   Err_j = Evaluate(Tree, Ej)
5:   Ij = Ij + (Err_j − Err_OOB)/Err_OOB
6: return

procedure Average(I, k)
1: I^T = ∅
2: for l = 1 to size(I) do
3:   I^T_l = Il/k
4: return I^T

    Importance(fd) = (1/k) · Σ_{i=1}^{k} (Err_i(fd) − Err(OOB_i)) / Err(OOB_i)    (1)
where k is the number of bootstrap replicates and 0 < d ≤ D.
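The computation in equation (1) can be sketched as follows, assuming a list of (tree, out-of-bag indices) pairs and an error function averaged over the targets; both are hypothetical placeholders for the CLUS internals.

```python
import numpy as np

def permutation_importance(forest, X, Y, error, seed=0):
    """forest: list of (tree, oob_idx) pairs, where oob_idx indexes the out-of-bag
    rows of X for that tree; error(tree, X, Y) returns the mis-classification
    rate averaged over the T targets. Implements equation (1)."""
    rng = np.random.default_rng(seed)
    importance = np.zeros(X.shape[1])
    for tree, oob_idx in forest:
        X_oob, Y_oob = X[oob_idx], Y[oob_idx]
        err_oob = error(tree, X_oob, Y_oob)
        for d in range(X.shape[1]):
            X_perm = X_oob.copy()
            X_perm[:, d] = rng.permutation(X_perm[:, d])   # noise feature d only
            importance[d] += (error(tree, X_perm, Y_oob) - err_oob) / err_oob
    return importance / len(forest)
```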
4 Experimental design
In this section, we give the specific experimental design used to evaluate the performance of the proposed method. We begin by briefly summarizing the multilabel datasets used in this study. Next, we present the evaluation measures and
discuss the construction of the error curves. Finally, we give the specific parameter instantiation of the methods.
4.1 Data description
We use four multi-label classification benchmark problems. Parts of the selected
problems were used in various studies and evaluations of methods for multi-label
learning. Table 3 presents the basic statistics of the datasets. We can note that
the datasets vary in size: from 391 up to 4880 training examples, from 202 up
to 2515 testing examples, from 72 up to 1836 features, from 6 to 159 labels, and
from 1.25 to 3.38 average number of labels per example (i.e., label cardinality
[16]). From the literature, these datasets come pre-divided into training and
testing parts: Thus, in the experiments, we use them in their original format.
The training part usually comprises around 2/3 of the complete dataset, while
the testing part the remaining 1/3 of the dataset.
The datasets come from the domains of multimedia and text categorization.
Emotions is a dataset from the multimedia domain where each instance is a piece
Table 3. Description of the benchmark problems in terms of number of training
(#tr.e.) and test (#t.e.) examples, the number of features (D), the total number of labels (Q) and label cardinality (lc ). The problems are ordered by their overall complexity
roughly calculated as #tr.e. × D × Q.
Dataset    #tr.e.  #t.e.  D     Q    lc
emotions   391     202    72    6    1.87
medical    645     333    1449  45   1.25
enron      1123    579    1001  53   3.38
bibtex     4880    2515   1836  159  2.40
of music. Each piece of music can be labelled with six emotions: sad-lonely, angry-aggressive, amazed-surprised, relaxing-calm, quiet-still, and happy-pleased. The
domain of text categorization is represented with 3 datasets: medical, enron and
bibtex. Medical is a dataset used in the Medical Natural Language Processing
Challenge2 in 2007. Each instance is a document that contains brief free-text
summary of a patient symptom history. The goal is to annotate each document
with the probable diseases from the International Classification of Diseases. Enron is a dataset that contains the e-mails from 150 senior Enron officials categorized into several categories. The labels can be further grouped into four
categories: coarse genre, included/forwarded information, primary topics, and
messages with emotional tone. Bibtex contains metadata for bibtex items, such
as the title of the paper, the authors, book title, journal volume, publisher, etc.
These datasets are available for download at the web page of the Mulan project3 .
4.2 Experimental setup
We evaluate the proposed method using seven evaluation measures: accuracy,
micro precision, micro recall , micro F1 , macro precision, macro recall , and macro
F1 . These measures are typically used for evaluation of the performance of multilabel classification methods. The micro averaging implicitly includes information
about the label frequency, while macro averaging treats all the labels equally.
Due to the space limitations, we only show the results for micro F1 because the
F1 score unites the values for precision and recall. Moreover, the results and
the discussion are similar if the other measures were used. These measures are
discussed in detail in [10] and [16].
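For reference, micro- and macro-averaged F1 can be computed on the binary indicator representation, for example with scikit-learn; this is an illustration, not the evaluation code used in this study.

```python
from sklearn.metrics import f1_score

# Y_true and Y_pred are (n_examples, n_labels) binary indicator matrices.
def micro_macro_f1(Y_true, Y_pred):
    return (f1_score(Y_true, Y_pred, average="micro"),   # weights labels by frequency
            f1_score(Y_true, Y_pred, average="macro"))   # treats all labels equally
```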
We assess the performance of the proposed method using error curves [12].
The error curves are based on the idea that the ‘correctness’ of the feature rank
is related to predictive accuracy. A good ranking algorithm would put on top
of a list a feature that is most important, and at the bottom a feature that
is least important w.r.t. some target concept. All the other features would be
in-between, ordered by decreasing importance. By following this intuition, we
² http://www.computationalmedicine.org/challenge/
³ http://mulan.sourceforge.net/datasets.html
evaluate the ranking by performing a stepwise feature subset evaluation, which
is used for obtaining an error curve.
We generate two types of curves: forward feature addition (FFA) and reverse
feature addition (RFA) curve. Examples of these curves are shown in Figures 1,
2, 3, and 4. The FFA curve is constructed from the top-k ranked features, i.e.,
from the beginning of the ranking. In contrast, the RFA curve is constructed
from the bottom-k ranked features. For the FFA curves, we can expect that
as the number of features used to construct the predictive model increases, the
accuracy of the predictive models also increases. This can be interpreted as
follows: By adding more and more of the top-ranked features, the feature subsets
constructed contain more relevant features, reflected in the improvement of the
error measure.
On the other hand, for the RFA curves, we can expect a slight difference at
the beginning of the curve, which considering the previous discussion, is located
at the end of the x-axis. Namely, the accuracy of the models constructed with
the bottom ranked features is minimal, which means the ranking is correct in the
sense that it puts irrelevant features at the bottom of the ranking. As the number
of bottom-k ranked features used to construct the predictive model increases,
some relevant features get included and the accuracy of the models increases.
In summary, at each point k, the FFA curve gives us the error of the predictive
models constructed with the top-k ranked features, while the RFA curve gives
us the error of the bottom-k ranked features. The algorithm for constructing the
curves is given in Table 4.
Table 4. The algorithm for generating forward feature addition (FFA) and reverse
feature addition (RFA) curves. R = {Fr1, . . . , Frn} is the feature ranking and Ft is the
target feature.

procedure ConstructErrorCurve(R, Ft) returns error curve Err
  RS ⇐ ∅
  for i = 1 to n do
    RS ⇐ RS ∪ feature(R, i)
    Err[i] = Err(M(RS, Ft))
  return Err

for the FFA curve: feature(R, i) = {Fr_i}
for the RFA curve: feature(R, i) = {Fr_(n−i+1)}
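A compact version of this procedure, assuming a callable evaluate(features) that trains the predictive model on the given feature subset and returns its test error (a hypothetical placeholder):

```python
def error_curves(ranking, evaluate):
    """ranking: features ordered from most to least relevant.
    Returns the FFA curve (top-k subsets) and the RFA curve (bottom-k subsets)."""
    ffa = [evaluate(ranking[:k]) for k in range(1, len(ranking) + 1)]
    rfa = [evaluate(ranking[-k:]) for k in range(1, len(ranking) + 1)]
    return ffa, rfa
```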
We compare the performance of the proposed method to the performance of
a random ranking. We base this comparison on the idea that the random ranking
is the worst ranking possible [12]. This is similar to the notion of random predictive model in predictive modelling. If our algorithm indeed is able to capture
the variable importance correctly, then its error curves should be better than the
curves of a random ranking. In this work, we generate 100 random feature rankings for each dataset and we show the averaged error curves. We opted for the
comparison with the random rankings instead of with the methods presented in [6]
and [16] because of the unstable results produced by those rankings (especially
for the Emotions dataset). Moreover, the accuracies reported there for the emotions
dataset are in the range of 0.3 to 0.5, while in our experiments the accuracy for the
emotions dataset is in the range of 0.6 to 0.8.
4.3 Parameter instantiation
The feature ranking algorithm with random forests of PCTs for MLC takes as
input two parameters: the number of base predictive models in the forest and
the feature subset size. In these experiments, we constructed the feature rankings using a random forest with 500 PCTs for MLC. Each node in a PCT was
constructed by randomly selecting 10% of the features (as suggested in [8]).
For construction of the error curves, we selected random forests of PCTs for
MLC as predictive models. The random forests model in this case consists of 100
PCTs for MLC and each node was constructed using 10% of the features. Both
the predictive models and the feature rankings were constructed on the training
set, while the performance for the error curves is the one obtained on the testing
set.
5 Results and discussion
In this section, we present the results from the experimental evaluation of the
proposed method. We explain the results with respect to the variable importance
scores for the features, the FFA curves and the RFA curves. The FFA and RFA
curves are constructed using the micro F1 , however, the conclusions are still
valid if we consider the other evaluation measures. In the remainder, we discuss
the results for each of the datasets considered in this study.
The results for the Emotions dataset are given in Figure 1. They show that
the obtained ranking performs slightly better than the random ranking. Both
FFA curves increase at a similar rate and have a similar shape. However, over
most of its length the FFA curve of the obtained ranking lies above the curve of the random
ranking. The RFA curve shows that the obtained ranking places more non-relevant features at the bottom of the ranking.
This finding can be confirmed and explained with the variable scores (Figure 1(a)). Namely, the curve with the variable scores is somewhat parallel to the
x-axis. This means that the majority of features in this dataset are approximately
equally relevant for the target concept (i.e., the multiple labels). Moreover, this
could indicate that there are redundant features that are present in the dataset.
All in all, selecting randomly a feature subset with a reasonable size (e.g., 25-30
features) is good enough to produce a predictive model with satisfactory predictive performance (i.e., the dimensionality can be easily reduced without a
significant loss of information).
The results for the Bibtex (Figure 2) and Enron (Figure 3) datasets are
somewhat similar to each other, thus we discuss them together. We can see
from the figures that the obtained ranking is clearly better than the random
Fig. 1. The performance of random forests of PCTs for MLC feature ranking algorithm
on the Emotions dataset. (a) feature importances reported by the ranking algorithm,
(b) FFA curve and (c) RFA curve.
ranking. The FFA curve of the obtained ranking is always above the FFA curve
of the random ranking, and, conversely, the RFA curve of the obtained ranking
is always below the RFA curve of the random ranking. Hence, more relevant
features are placed at the top and more non-relevant features are placed at the
bottom.
Fig. 2. The performance of random forests of PCTs for MLC feature ranking algorithm
on the Bibtex dataset. (a) feature importances reported by the ranking algorithm, (b)
FFA curve and (c) RFA curve.
This is also confirmed with the variable importance scores. We can note
that the curve of the variable importances drops linearly, which means that
there are multiple features in the dataset that are more relevant for the target
concept than the remaining features. The dimensionality in these cases can be
significantly reduced. We will still obtain very good predictive performance if we
select the 500 top-ranked features (out of 1836) for the Bibtex dataset and the
50 top-ranked features (out of 1001) for the Enron dataset.
Fig. 3. The performance of random forests of PCTs for MLC feature ranking algorithm
on the Enron dataset. (a) feature importances reported by the ranking algorithm, (b)
FFA curve and (c) RFA curve.
Finally, we discuss the results for the Medical dataset (given in Figure 4).
We can note that the obtained ranking is significantly better than the random
ranking. The FFA and RFA curves of the obtained ranking exhibit very steep
increase and decrease, respectively. On the other hand, the FFA and RFA curves
of the random ranking have linear increase and decrease. This means that there
are a few features that are very relevant for the target concept and that these
features carry the majority of the information for the target concept. This is further confirmed with the curve of the variable importances: this curve descends
exponentially. Considering all of this, we can drastically reduce the dimensionality of this dataset. The good predictive performance will be preserved even if
we select the 35 top-ranked features (out of 1449).
Fig. 4. The performance of random forests of PCTs for MLC feature ranking algorithm
on the Medical dataset. (a) feature importances reported by the ranking algorithm, (b)
FFA curve and (c) RFA curve.
6 Conclusions
In this work, we presented and evaluated a feature ranking method for the task of
multi-label classification (MLC). The proposed method is based on the random
forest feature ranking mechanism. The random forests are already proven as a
good method for feature ranking on the simpler tasks of classification and regression. Here, we propose and extension of the method to the task of MLC. To this
end, we use predictive clustering trees (PCTs) for MLC as base predictive models. The random forests of PCTs have state-of-the-art predictive performance for
the task of MLC. Here, we investigate whether this method can be also successful
for the task of feature ranking for MLC.
We evaluated the method on 4 benchmark multi-label datasets using 7 evaluation measures. The quality of the feature ranking was assessed by using forward
feature addition and reverse feature addition curves. To investigate whether the
obtained feature ranking is valid, i.e., that it places the more relevant features
closer to the top of the ranking and the non-relevant features closer to the bottom
of the ranking, we compare it to the performance of a random feature ranking.
We summarize the results as follows. First, we show that in datasets where many of the features are relevant for the target concept, the produced ranking only slightly outperforms the random ranking. This is due to the fact that if several features are (randomly) selected, then the predictive model will have satisfactory predictive performance. Next, in the datasets where there are several relevant features for the target concept, the produced ranking clearly outperforms the random ranking. This means that the ranking algorithm is able to detect these features and place them at the top of the ranking. Furthermore, in the datasets where there are only a few features of high relevance for the target concept, the obtained ranking drastically outperforms the random ranking and satisfactory predictive performance can be obtained by using only 2-3% of the features. All
in all, the experimental evaluation demonstrates that the random forests feature
ranking method can be successfully applied to the task of MLC.
We plan to extend this work in the future along three major dimensions. First, we plan to include other measures of predictive performance in the ranking algorithm. In the current version, we use the misclassification rate. However, we will consider MLC-specific evaluation measures. Next, we will extend the proposed method to other structured output prediction tasks, such as multi-target
regression and hierarchical multi-label classification. Finally, we could estimate
the relevance of a feature by considering the reduction of the variance the feature
causes when selected for a test in a given node.
References
1. Blockeel, H.: Top-down induction of first order logical decision trees. Ph.D. thesis,
Katholieke Universiteit Leuven, Leuven, Belgium (1998)
2. Breiman, L.: Bagging Predictors. Machine Learning 24(2), 123–140 (1996)
3. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001)
4. Breiman, L., Friedman, J., Olshen, R., Stone, C.J.: Classification and Regression
Trees. Chapman & Hall/CRC (1984)
5. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: Proc. of the 1st
International Workshop on Multiple Classifier Systems - LNCS 1857. pp. 1–15.
Springer (2000)
6. Doquire, G., Verleysen, M.: Feature Selection for Multi-label Classification Problems. In: Advances in Computational Intelligence. pp. 9–16 (2011)
7. Hansen, L.K., Salamon, P.: Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(10), 993–1001 (1990)
8. Kocev, D., Vens, C., Struyf, J., Džeroski, S.: Tree ensembles for predicting structured outputs. Pattern Recognition 46(3), 817–833 (2013)
9. Kuncheva, L.: Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience (2004)
10. Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S.: An extensive experimental
comparison of methods for multi-label learning. Pattern Recognition 45(9), 3084–
3104 (2012)
11. Read, J., Pfahringer, B., Holmes, G.: Multi-label Classification Using Ensembles of
Pruned Sets. In: Proc. of the 8th IEEE International Conference on Data Mining.
pp. 995–1000 (2008)
12. Slavkov, I.: An Evaluation Method for Feature Rankings. Ph.D. thesis, IPS Jožef
Stefan, Ljubljana, Slovenia (2012)
13. Slavkov, I., Gjorgjioski, V., Struyf, J., Džeroski, S.: Finding explained groups of
time-course gene expression profiles with predictive clustering trees. Molecular
BioSystems 6(4), 729–740 (2010)
14. Spolaôr, N., Cherman, E.A., Monard, M.C., Lee, H.D.: A comparison of multi-label
feature selection methods using the problem transformation approach. Electronic
Notes in Theoretical Computer Science 292, 135 – 151 (2013)
15. Struyf, J., Džeroski, S.: Constraint Based Induction of Multi-Objective Regression
Trees. In: Proc. of the 4th International Workshop on Knowledge Discovery in
Inductive Databases KDID - LNCS 3933. pp. 222–233. Springer (2006)
16. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining Multi-label Data. In: Data Mining
and Knowledge Discovery Handbook, pp. 667–685. Springer Berlin / Heidelberg
(2010)
17. Čehovin, L., Bosnić, Z.: Empirical evaluation of feature selection methods in classification. Intelligent Data Analysis 14(3), 265–281 (2010)
18. Vens, C., Struyf, J., Schietgat, L., Džeroski, S., Blockeel, H.: Decision trees for
hierarchical multi-label classification. Machine Learning 73(2), 185–214 (2008)
19. Verikas, A., Gelzinis, A., Bacauskiene, M.: Mining data with random forests: A
survey and results of new tests. Pattern Recognition 44(2), 330–349 (2011)
Identification of Statistically Significant Features
from Random Forests
Jérôme Paul, Michel Verleysen, and Pierre Dupont
Université catholique de Louvain – ICTEAM/Machine Learning Group
Place Sainte Barbe 2, 1348 Louvain-la-Neuve – Belgium
{jerome.paul,michel.verleysen,pierre.dupont}@uclouvain.be
http://www.ucl.ac.be/mlg/
Abstract. Embedded feature selection can be performed by analyzing
the variables used in a Random Forest. Such a multivariate selection
takes into account the interactions between variables but is not easy
to interpret in a statistical sense. We propose a statistical procedure
to measure variable importance that tests if variables are significantly
useful in combination with others in a forest. We show experimentally
that this new importance index correctly identifies relevant variables.
The top of the variable ranking is, as expected, largely correlated with
Breiman’s importance index based on a permutation test. Our measure
has the additional benefit to produce p-values from the forest voting
process. Such p-values offer a very natural way to decide which features
are significantly relevant while controlling the false discovery rate.
1 Introduction
Feature selection aims at finding a subset of the most relevant variables for a prediction task. To this end, univariate filters, such as a t-test, are commonly used because they are fast to compute and their associated p-values are easy to interpret. However, such a univariate feature ranking does not take into account the possible interactions between variables. In contrast, a feature selection procedure embedded into the estimation of a multivariate predictive model typically captures those interactions.
A representative example of such an embedded variable importance measure has been proposed by Breiman with his Random Forest algorithm (RF) [1]. While this importance index is effective for ranking variables, it is difficult to decide
how many such variables should eventually be kept. This question could be addressed through an additional validation protocol at the expense of an increased
computational cost. In this work, we propose an alternative that avoids such
additional cost and offers a statistical interpretation of the selected variables.
The proposed multivariate RF feature importance index uses out-of-bag
(OOB) samples to measure changes in the distribution of class votes when permuting a particular variable. It results in p-values, corrected for multiple testing,
measuring how variables are useful in combination with other variables of the
model. Such p-values offer a very natural threshold for deciding which variables
are statistically relevant.
The remainder of this document is organised as follows. Section 2 presents
the notations and reviews Breiman's RF feature importance measure. Section
3 introduces the new feature importance index. Experiments are discussed in
Section 4. Finally, Section 5 concludes this document and proposes hints for
possible future work.
2 Context and Notations
Let $X \in \mathbb{R}^{n \times p}$ be the data matrix consisting of n data points in a p-dimensional space and y a vector of size n containing the corresponding class labels. An RF model [1] is made of an ensemble of trees, each of which is grown from a bootstrap sample of the n data points. For each tree, the selected samples form the bag ($B$), the remaining ones constitute the OOB ($\bar{B}$). Let $\mathcal{B}$ stand for the set of bags over the ensemble and $\bar{\mathcal{B}}$ be the set of corresponding OOBs. We have $|\mathcal{B}| = |\bar{\mathcal{B}}| = T$, the number of trees in the forest.
In order to compute feature importances, Breiman [1] proposes a permutation test procedure based on accuracy. For each variable $x_j$, there is one permutation test per tree in the forest. For an OOB sample $\bar{B}_k$ corresponding to the k-th tree of the ensemble, one considers the original values of the variable $x_j$ and a random permutation $\tilde{x}_j$ of its values on $\bar{B}_k$. The difference in prediction error using the permuted and original variable is recorded and averaged over all the OOBs in the forest. The higher this index, the more important the variable is, because it corresponds to a stronger increase of the classification error when permuting it. The importance measure $J_a$ of the variable $x_j$ is then defined as:


1 X 1 X
x̃j
Ja (xj ) =
I(hk (i) 6= yi ) − I(hk (i) 6= yi )
(1)
T
|B k |
B k ∈B
i∈B k
where yi is the true class label of the OOB example i, I is an indicator function,
hk (i) is the class label of the example i as predicted by the tree estimated on the
x̃
bag Bk , hkj (i) is the predicted class label from the same tree while the values of
the variable xj have been permuted on B k . Such a permutation does not change
the tree but potentially changes the prediction on the out-of-bag example since
its j-th dimension is modified after the permutation. Since the predictors with
x̃
the original variable hk and the permuted variable hkj are individual decision
trees, the sum over the various trees where this variable is present represents
the ensemble behaviour, respectively from the original variable values and its
various permutations.
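For concreteness, the following Python sketch (our illustration, not the authors' implementation) grows a small forest by hand so that the OOB indices of every tree are available, and then computes Ja as in Equation (1); the dataset, the number of trees and all names are assumptions made for the example.

# Illustrative sketch of Breiman's permutation importance J_a (Eq. 1).
# A small forest is grown by hand so that the out-of-bag (OOB) indices
# of every tree are available; this is not the authors' code.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def grow_forest(X, y, n_trees=100):
    """Return a list of (tree, oob_indices) pairs."""
    n = X.shape[0]
    forest = []
    for _ in range(n_trees):
        bag = rng.integers(0, n, size=n)              # bootstrap sample B_k
        oob = np.setdiff1d(np.arange(n), bag)         # out-of-bag indices
        tree = DecisionTreeClassifier(max_features="sqrt").fit(X[bag], y[bag])
        forest.append((tree, oob))
    return forest

def importance_Ja(forest, X, y):
    """Average increase of the OOB error when permuting each variable (Eq. 1)."""
    Ja = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        diffs = []
        for tree, oob in forest:
            if oob.size == 0:
                continue
            X_oob = X[oob]
            err = np.mean(tree.predict(X_oob) != y[oob])
            X_perm = X_oob.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # permute x_j on the OOB
            err_perm = np.mean(tree.predict(X_perm) != y[oob])
            diffs.append(err_perm - err)
        Ja[j] = np.mean(diffs)
    return Ja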
3 A Statistical Feature Importance Index from RF
While Ja is able to capture individual variable importances conditioned on the
other variables used in the forest, it is not easily interpretable. In particular,
it does not define a clear threshold to highlight statistically relevant variables.
In the following sections, we propose a statistical feature importance measure
closely related to Ja , and then compare it with an existing approach that aims
at providing statistical interpretation to feature importance scores.
3.1 Definition
In [2], the authors analyse the convergence properties of ensembles of predictors. Their statistical analysis allows us to determine the number of classifiers needed in an ensemble in order to make the same predictions as an ensemble of infinite size. To do so, they analyse the voting process and take a close look at the class vote distribution of such ensembles.
In the present work, we combine the idea behind Breiman's Ja, namely the use of a permutation test, with the analysis of the tree class vote distribution of the forest. We propose to perform a statistical test that assesses whether permuting a variable significantly influences that distribution. The hypothesis is that removing the signal of an important variable by permuting it should change individual tree predictions, hence the class vote distribution.
One can estimate this distribution using the OOB data. In a binary classification setting, for each data point in an OOB, the prediction of the corresponding
tree can fall into one of the four following cases: correct prediction of class 1
(TP), correct prediction of class 0 (TN), incorrect prediction of class 1 (FP) and
incorrect prediction of class 0 (FN). Summing the occurrences of those cases over
all the OOBs gives an estimation of the class vote distribution of the whole forest.
The same can be performed when permuting a particular feature xj . This gives
an estimation of the class vote distribution of the forest after perturbing this
variable. The various counts obtained can be arranged into a 4 × 2 contingency
table. The first variable that can take four different values is the class vote. The
second one is an indicator variable to represent whether xj has been permuted
or not. Formally a contingency table is defined as follows for each variable xj :
            x_j              x̃_j
  TN     s(0, 0)      s^{x̃_j}(0, 0)
  FP     s(0, 1)      s^{x̃_j}(0, 1)
  FN     s(1, 0)      s^{x̃_j}(1, 0)
  TP     s(1, 1)      s^{x̃_j}(1, 1)          (2)
where
$$
s(l_1, l_2) = \sum_{\bar{B}_k \in \bar{\mathcal{B}}} \sum_{i \in \bar{B}_k} I\big(y_i = l_1 \text{ and } h_k(i) = l_2\big)
\qquad (3)
$$
and $s^{\tilde{x}_j}(l_1, l_2)$ is defined the same way with $h_k^{\tilde{x}_j}(i)$ instead of $h_k(i)$.
In order to quantify whether the class vote distribution changes when permuting xj , one can use Pearson’s χ2 test of independence on the contingency
table defined above. This test measures whether the joint occurrences of two
variables are independent of each other. Rejecting the null hypothesis that they
are independent with a low p-value pχ2 (xj ) would mean that xj influences the
distribution and is therefore important. We note that, even on small datasets,
there is no need to consider a Fisher’s exact test instead of Pearson’s χ2 since
cell counts are generally sufficiently large: the sum of all counts is twice the sum
of all OOB sizes, which is influenced by the number of trees T .
If the importance of several variables has to be assessed e.g. to find out
which features are important, one should be careful and correct the obtained
p-values for multiple testing. Indeed, if 1000 dimensions are evaluated using the
commonly accepted 0.05 significance threshold, 50 variables are expected to be
falsely deemed important. To control that false discovery rate (FDR), p-values
can be rescaled e.g. using the Benjamini-Hochberg correction [3].
Let $p^{\mathrm{fdr}}_{\chi^2}(x_j)$ be the value of $p_{\chi^2}(x_j)$ after FDR correction, the new importance measure is defined as
$$
J_{\chi^2}(x_j) = p^{\mathrm{fdr}}_{\chi^2}(x_j)
\qquad (4)
$$
This statistical importance index is closely related to Breiman's Ja. The two terms inside the innermost sum of Equation (1) correspond to counts of FP and FN for the permuted and non-permuted variable xj. This is encoded by the second and third rows of the contingency table in Equation (2). However, there are some differences between the two approaches. First, the central term of Ja (eq. (1)) is normalized by each OOB size while the contingency table of Jχ2 (eq. (2)) considers global counts. This follows from the fact that Ja estimates an average decrease in accuracy on the OOB samples while Jχ2 estimates a distribution on those samples. More importantly, the very nature of those importance indices differs: Ja is an aggregate measure of prediction performance whereas Jχ2 (eq. (4)) is a p-value from a statistical test. The interpretation of this new index is therefore much easier from a statistical significance viewpoint. In particular, it allows one to decide if a variable is significantly important in the voting process of an RF. As a consequence, the lower Jχ2, the more important the corresponding feature, while it is the opposite for Ja.
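To make the procedure concrete, the sketch below computes Jχ2 from a forest represented, as in the earlier Ja sketch, by a list of (tree, OOB indices) pairs; Pearson's χ2 is taken from SciPy and the Benjamini-Hochberg correction is written out explicitly. This is only an illustration under our own naming, not the authors' code.

# Illustrative sketch of J_chi2 (Eq. 4): class-vote contingency table (Eq. 2-3),
# Pearson's chi-squared test, then Benjamini-Hochberg FDR correction.
# `forest` is a list of (tree, oob_indices) pairs, y takes values in {0, 1}.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

def vote_counts(forest, X, y, permute_j=None):
    """(TN, FP, FN, TP) counts summed over all OOB predictions; optionally the
    variable `permute_j` is permuted within each OOB sample before predicting."""
    counts = np.zeros(4, dtype=int)
    for tree, oob in forest:
        if oob.size == 0:
            continue
        X_oob = X[oob].copy()
        if permute_j is not None:
            X_oob[:, permute_j] = rng.permutation(X_oob[:, permute_j])
        pred = tree.predict(X_oob)
        for l1 in (0, 1):
            for l2 in (0, 1):
                counts[2 * l1 + l2] += np.sum((y[oob] == l1) & (pred == l2))
    return counts

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted p-values."""
    m, order = len(p), np.argsort(p)
    adjusted, running_min = np.empty(m), 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def importance_Jchi2(forest, X, y):
    baseline = vote_counts(forest, X, y)                    # column for x_j
    p_values = []
    for j in range(X.shape[1]):
        permuted = vote_counts(forest, X, y, permute_j=j)   # column for permuted x_j
        table = np.stack([baseline, permuted], axis=1)      # 4 x 2 table of Eq. (2)
        p_values.append(chi2_contingency(table)[1])
    return benjamini_hochberg(np.array(p_values))           # J_chi2 = corrected p-values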
3.2 Additional Related Work
In [4], the authors compare several ways to obtain a statistically interpretable
index from a feature relevance score. Their goal is to convert feature rankings to
statistical measures such as the false discovery rate, the family-wise error rate
or p-values. To do so, most of their proposed methods make use of an external
permutation procedure to compute some null distribution from which those metrics are estimated. The external permutation tests repeatedly compute feature
rankings on dataset variants where some features are randomly permuted.
A few differences with our proposed index can be highlighted. First, even if it
can be applied to convert Breiman’s Ja to a statistically interpretable measure,
the approach in [4] is conceptually more complex than ours: there is an additional resampling layer on top of the RF algorithm. This external resampling
encompasses the growing of many forests and should not be confused with the
internal bootstrap mechanism at the tree level, inside the forest. This external
resampling can introduce some meta-parameters such as the number of external
resamplings and the number of instances to be sampled. On the other hand, our
approach runs on a single RF. There is no need for additional meta-parameters
but it is less general: it is restricted to algorithms based on classifier ensembles.
The external resampling procedures in [4] imply that those methods are also
computationally more complex than Jχ2 . Indeed, they would multiply the cost
of computing a ranking with Ja by the number of external resamplings whereas
the time complexity of computing Jχ2 for p variables is exactly the same as with
Breiman’s Ja . If we assume that each tree node splits its instances into two sets of
equal sizes until having one point per leaf, then the depth of a tree is log n and the
time complexity of classifying an instance with one tree is O(log n). Hence, the
global time complexity of computing a ranking of p variables is O(T · p · n · log n).
Algorithm 1 details the time complexity analysis.
res ← initRes()                                      // Θ(p)
for xj ∈ Variables do                                // Θ(p)
    contTable ← init()                               // Θ(1)
    for B̄k ∈ B̄ do                                    // Θ(T)
        x̃j ← perm(xj, B̄k)                            // Θ(n)
        for i ∈ B̄k do                                // O(n)
            a ← hk(i)                                // Θ(depth)
            b ← hk^x̃j(i)                             // Θ(depth)
            contTable ← update(contTable, a, b, yi)  // Θ(1)
        end
    end
    res[xj] ← χ2(contTable)                          // Θ(1)
end
return res
Algorithm 1: Pseudo-code for computing the importance of all variables with a forest of T = |B̄| trees
4 Experiments
The following sections present experiments that highlight properties of the Jχ2
importance measure. They show that Jχ2 actually provides an interpretable importance index (Section 4.1), and that it is closely related to Ja both in terms of
variable rankings (Section 4.2) and predictive performances when used as feature
selection pre-filter (Section 4.3). The last experiments in Section 4.4 present some
predictive performances when restricting models to only statistically significant
variables.
4.1 Interpretability of Jχ2
The main goal of the new feature importance measure is to provide an interpretable index that allows retrieving variables that are significantly important in the prediction of the forest. In order to check that Jχ2 is able to identify those variables, first experiments are conducted on an artificial dataset with a linear decision boundary. This dataset is generated in the same way as in [4]. Labels y ∈ {−1, 1}^n are given by y = sign(Xw) where w ∈ R^p and X ∈ R^{n×p}. Data values come from a N(0, 1) distribution. The number p of variables is set to 120. The first 20 weights wi are randomly sampled from U(0, 1). The other 100 weights are set to 0 such that relevant variables only belong to the first 20 ones (but all these variables need not be relevant, e.g. whenever a weight is very small). The number of instances is n = 500 such that X ∈ R^{500×120}. In order to add some noise, 10% of the labels are randomly flipped.
To check that a feature selection technique is able to identify significant
variables, we report the observed False Discovery Rate (FDR) as in [4]:
$$
FDR = \frac{FD}{FD + TD}
\qquad (5)
$$
where FD is the number of false discoveries (i.e. variables which are flagged as significantly important by the feature importance index but that are actually not important) and TD the number of true discoveries. A good variable importance
index should yield a very low observed FDR.
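A minimal sketch of this setup (our own code, following the description above and not taken from the paper) generates the artificial dataset and evaluates the observed FDR at the 0.05 threshold; the first 20 variables are the potentially relevant ones.

# Sketch of the artificial linear dataset of [4] and the observed FDR (Eq. 5).
import numpy as np

rng = np.random.default_rng(0)
n, p, n_relevant = 500, 120, 20

w = np.zeros(p)
w[:n_relevant] = rng.uniform(0, 1, size=n_relevant)   # only the first 20 weights are non-zero
X = rng.standard_normal((n, p))
y = np.sign(X @ w).astype(int)                        # labels in {-1, 1}
flip = rng.random(n) < 0.10                           # 10% label noise
y[flip] = -y[flip]

def observed_fdr(p_fdr, n_relevant=20, alpha=0.05):
    """Fraction of variables flagged as significant that are not truly relevant."""
    selected = np.flatnonzero(p_fdr < alpha)
    if selected.size == 0:
        return 0.0
    return np.sum(selected >= n_relevant) / selected.size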
An RF, built on the full dataset, is used to rank the variables according to their
importance index. In order to decide if a variable is significantly important, we
fix the p-value threshold to the commonly accepted 0.05 value after correcting for
multiple testing. Figure 1 shows importance indices obtained by forests of various
sizes and different numbers m of variables randomly sampled as candidates in
each tree node. As we can see, the traditional (decreasing) Ja index does not
offer a clear threshold to decide which variables are relevant or not. Similarly to
the methods presented in [4], the (increasing) Jχ2 index appears to distinguish
more clearly between relevant and irrelevant variables. It however requires a
relatively large number of trees to gain confidence that a feature is relevant.
When computed on small forests (plots on the left), Jχ2 may fail to identify
variables as significantly important but they are still well ranked as shown by
the FDR values. Moreover, increasing the parameter m also tends to positively
impact the identification of those variables when the number of trees is low.
4.2 Concordance with Ja
As explained in Section 3.1, Jχ2 and Ja share a lot in their computations. Figure
2 compares the rankings of the two importance measures on one sampling of the
microarray DLBCL[5] dataset (p = 7129, class priors = 58/19). It shows that
variable ranks in the top 500 are highly correlated. Spearman’s rank correlation
coefficient is 0.97 for those variables. One of the main differences between the
rankings produced by Ja and Jχ2 is that the first one penalizes features whose
Fig. 1. Importance indices computed on an artificial dataset with a linear decision boundary (panels correspond to forests of T = 500 and T = 10000 trees and to m = 10 and m = 120 candidate variables per node; the horizontal axes show the variable rank). For the sake of visibility, Ja has been rescaled between 0 and 1. The horizontal line is set at 0.05. Jχ2(xj) below this line are deemed statistically relevant.
permuted versions would increase the prediction accuracy, while the second one would favour such a variable since it changes the class vote distribution. That explains why features at the end of Ja's ranking have a better rank with Jχ2. In particular, after rank 1250 on the horizontal axis, features have a negative Ja value because they somewhat lower the prediction performance of the forest. But, since they influence the class vote distribution, they are considered more important by Jχ2. Although those differences are quite interesting, the large ranks of those variables indicate that they most probably encode nothing but noise. Furthermore, only top-ranked features are generally interesting and selected based on their low corrected p-values.
Fig. 2. Rankings produced by Ja and Jχ2 on one external sampling of the DLBCL dataset.
4.3 Feature Selection Properties
As shown in section 4.2, Ja and Jχ2 provide quite correlated variable rankings.
The experiments described in this section go a little bit deeper and show that,
when used for feature selection, the properties of those two importance indices
are also very similar in terms of prediction performances and stability of the
feature selection.
In order to measure the predictive performances of a model, the Balanced
Classification Rate (BCR) is used. It can be seen as the mean of per-class accuracies and is preferred to accuracy when dealing with non-balanced classes.
It also generalizes to multi-class problems more easily than AUC. For two-class problems, it is defined as
$$
\mathrm{BCR} = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right)
\qquad (6)
$$
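As a concrete illustration (our own sketch, not the authors' code), the following function computes the BCR of Equation (6) and its multi-class generalization as the mean of per-class recalls.

# Illustrative computation of the Balanced Classification Rate (Eq. 6),
# generalized to the multi-class case as the mean of per-class recalls.
import numpy as np

def balanced_classification_rate(y_true, y_pred):
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))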
Feature selection stability indices aim at quantifying how much selected sets
of features vary when little changes are introduced in a dataset. The Kuncheva
index (KI) [6] measures to which extent K sets of s selected variables share
common elements.
$$
KI(\{S_1, \ldots, S_K\}) \;=\; \frac{2}{K(K-1)} \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \frac{|S_i \cap S_j| - \frac{s^2}{p}}{s - \frac{s^2}{p}}
\qquad (7)
$$
where p is the total number of features and s^2/p is a term correcting the chance to share common features at random. This index ranges from −1 to 1. The greater
the index, the greater the number of commonly selected features. A value of 0
is the expected stability for a selection performed uniformly at random.
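For reference, a direct implementation of Equation (7) (our own illustration) can be written as follows, assuming K feature sets of equal size s drawn from p features with s < p.

# Illustrative computation of the Kuncheva stability index (Eq. 7).
from itertools import combinations

def kuncheva_index(sets, p):
    sets = [set(S) for S in sets]
    K, s = len(sets), len(sets[0])         # K sets, all assumed of the same size s < p
    correction = s * s / p                 # expected overlap of two random size-s sets
    pairs = list(combinations(range(K), 2))
    total = sum((len(sets[i] & sets[j]) - correction) / (s - correction)
                for i, j in pairs)
    return total / len(pairs)              # equals (2 / (K (K - 1))) * sum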
In order to evaluate those performances and to mimic little changes in datasets,
an external resampling protocol is used. The following steps are repeated 200
times:
• randomly select a training set Tr made of 90% of the data; the remaining 10% form the test set Te
• train a forest of T trees to rank the variables on Tr
• for each number of selected features s:
    ∗ train a forest of 500 trees using only the first s features on Tr
    ∗ save the BCR computed on Te and the set of s features
The statistics recorded at each iteration are then aggregated to provide mean
BCR and KI.
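The following sketch outlines this protocol. It is only an illustration: a scikit-learn RandomForestClassifier and its Gini-based feature_importances_ stand in for the paper's forests and the Ja/Jχ2 rankings, the BCR helper sketched after Equation (6) is reused, and all names and parameter values are ours.

# Illustrative outline of the external resampling protocol (200 repetitions).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def resampling_protocol(X, y, sizes=(1, 5, 10, 50, 500), n_repeats=200, n_rank_trees=500):
    bcr = {s: [] for s in sizes}           # BCR values per number of selected features
    selections = {s: [] for s in sizes}    # selected feature sets, for the Kuncheva index
    for rep in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.1, stratify=y, random_state=rep)
        ranker = RandomForestClassifier(n_estimators=n_rank_trees,
                                        random_state=rep).fit(X_tr, y_tr)
        order = np.argsort(ranker.feature_importances_)[::-1]   # proxy feature ranking
        for s in sizes:
            feats = order[:s]
            clf = RandomForestClassifier(n_estimators=500,
                                         random_state=rep).fit(X_tr[:, feats], y_tr)
            y_hat = clf.predict(X_te[:, feats])
            bcr[s].append(balanced_classification_rate(y_te, y_hat))  # BCR sketch above
            selections[s].append(list(feats))
    return bcr, selections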
Figure 3 presents the measurements made over 200 resamplings from the
DLBCL dataset according to the number of features kept to train the classifier.
It shows that the two indices behave very similarly with respect to the number
of features and the number of trees used to rank the features. Increasing the
number of trees allows one to obtain a more stable feature selection in both cases. This
kind of behaviour has also been shown in [7].
4.4 Prediction from Significantly Important Variables
Experiments show that Jχ2 ranks features roughly the same way as Ja while
providing a statistically interpretable index. One can wonder if it is able to
highlight important variables on real-world datasets and furthermore if those
variables are good enough to make a good prediction by themselves. Table 1
briefly describes the main characteristics of 4 microarray datasets used in our
study.
Using the same protocol as in Section 4.3, experiments show that the number
of selected variables increases with the number of trees, which is consistent with
the results in Section 4.1. As we can see in Table 2, it is also very dataset-dependent, with almost no features selected on the DLBCL dataset. Similar results are
observed in [4]. When comparing the predictive performances of a model built
on only significant variables of Jχ2 and a model built using the 50 best ranked
Fig. 3. Average BCR and KI of Ja and Jχ2 over 200 resamplings of the DLBCL dataset according to the number of selected features, for various numbers T of trees (left panel: BCR, right panel: KI, as a function of the number s of selected features).
Table 1. Summary of the microarray datasets: class priors report the n values in each
class, p represents the total number of variables.
Name            Class priors    p
DLBCL [5]       58/19           7129
Lymphoma [8]    22/23           4026
Golub [9]       25/47           7129
Prostate [10]   52/50           6033
variables of Ja , a paired T-test shows significant differences in most of the cases.
However, except for the DLBCL dataset, when using 10000 trees, the average
predictive performances are quite similar to each other. This confirms that, provided the number of trees is large enough and depending on the dataset, Jχ2
is able to select important variables that can be used to build good predictive
models.
Table 2. Various statistics obtained over 200 resamplings when keeping only significantly relevant variables. T is the number of trees used to build the forest. avg(srel) (resp. max, min) is the average (resp. maximum, minimum) number of Jχ2 significantly important features used to make the prediction. BCR is the average BCR obtained on models for which there is at least one significant feature with Jχ2. BCR50 is the average BCR obtained when using the 50 Ja best ranked features in each iteration where Jχ2 outputted at least one significant feature.

                T    avg(srel)  min(srel)  max(srel)   BCR   BCR50
DLBCL        5000         0.04          0          1   0.52    0.67
            10000         0.99          0          5   0.69    0.83
golub        5000         5.96          3         10   0.93    0.97
            10000        10.82          8         14   0.96    0.97
lymphoma     5000         0.66          0          6   0.62    0.82
            10000         4.85          2          9   0.93    0.94
prostate     5000         4.95          2          8   0.93    0.94
            10000         7.92          6         11   0.93    0.94
5 Conclusion and Perspectives
This paper introduces a statistical feature importance index for the Random
Forest algorithm which combines easy interpretability with the multivariate aspect of embedded feature selection techniques. The experiments presented in
Section 4 show that it is able to correctly identify important features and that it
is closely related to Breiman’s importance measure (mean decrease in accuracy
after permutation). The two approaches yield similar feature rankings. In comparison to Breiman’s importance measure, the proposed index Jχ2 brings the
interpretability of a statistical test and allows us to decide which variables are
significantly important using a very natural threshold at the same computational
cost.
We show that growing forests with many trees increases the confidence that
some variables are statistically significant in the RF voting process. This observation may be related to [7] where it is shown that feature selection stability of
tree ensemble methods increases and stabilises with the number of trees. The
proposed importance measure may open ways to formally analyse this effect,
similarly to [2]. We have evaluated Jχ2 on binary classification tasks. Although
there is a straightforward way to adapt it to the multi-class setting, future work
should assess whether it is practically usable, in particular how many trees would
be needed when increasing the number of classes. Finally, one should also evaluate the possibility of applying this approach to other ensemble methods, possibly
with different kinds of randomization.
References
1. Breiman, L.: Random Forests. Machine Learning 45(1) (October 2001) 5–32
2. Hernández-Lobato, D., Martínez-Muñoz, G., Suárez, A.: How large should ensembles of classifiers be? Pattern Recognition 46(5) (2013) 1323–1336
3. Benjamini, Y., Hochberg, Y.: Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society.
Series B (Methodological) 57(1) (1995) 289–300
4. Huynh-Thu, V.A.A., Saeys, Y., Wehenkel, L., Geurts, P.: Statistical interpretation of machine learning-based feature importance scores for biomarker discovery.
Bioinformatics (Oxford, England) 28(13) (July 2012) 1766–1774
5. Shipp, M.A., Ross, K.N., Tamayo, P., Weng, A.P., Kutok, J.L., Aguiar, R.C.,
Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G.S., Ray, T.S., Koval, M.A.,
Last, K.W., Norton, A., Lister, T.A., Mesirov, J., Neuberg, D.S., Lander, E.S.,
Aster, J.C., Golub, T.R.: Diffuse large b-cell lymphoma outcome prediction by
gene-expression profiling and supervised machine learning. Nat Med 8(1) (January
2002) 68–74
6. Kuncheva, L.I.: A stability index for feature selection. In: AIAP’07: Proceedings
of the 25th IASTED International Multi-Conference, Anaheim, CA, USA, ACTA Press (2007) 390–395
7. Paul, J., Verleysen, M., Dupont, P.: The stability of feature selection and class
prediction from ensemble tree classifiers. In: ESANN 2012, 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine
Learning. (April 2012) 263–268
8. Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A.,
Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E.,
Moore, T., Hudson, J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan,
W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R.,
Wilson, W., Grever, M.R., Byrd, J.C., Botstein, D., Brown, P.O., Staudt, L.M.:
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403(6769) (February 2000) 503–511
9. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P.,
Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Science 286(5439) (1999) 531–537
10. Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo,
P., Renshaw, A.A., D’Amico, A.V., Richie, J.P., et al.: Gene expression correlates
of clinical prostate cancer behavior. Cancer cell 1(2) (2002) 203–209
Acknowledgements Computational resources have been provided by the supercomputing facilities of the Université catholique de Louvain (CISM/UCL)
and the Consortium des Equipements de Calcul Intensif en Fédération Wallonie
Bruxelles (CECI) funded by the Fond de la Recherche Scientifique de Belgique
(FRS-FNRS).
Prototype Support Vector Machines:
Supervised Classification in Complex Datasets
April Tuesday Shen and Andrea Pohoreckyj Danyluk
Williams College
Abstract. Classifier learning generally requires model selection, which
in practice is often an ad hoc and time-consuming process that depends
on assumptions about the structure of data. To avoid this difficulty,
especially in real-world data sets where the underlying model is both unknown and potentially complex, we introduce the ensemble of prototype
support vector machines (PSVMs). This algorithm trains an ensemble
of linear SVMs that are tuned to different regions of the feature space
and thus are able to separate the space arbitrarily, reducing the need
to decide what model to use for each dataset. We present experimental
results demonstrating the efficacy of PSVMs in both noiseless and noisy
datasets.
Keywords: Ensemble methods, Classification, Support Vector Machines
1 Introduction
The goal of classification is to accurately predict class labels for a set of data.
In machine learning, this is accomplished via algorithms that learn classification
models for particular types of class distributions from sets of labeled training
data. However, in real-world datasets, class distributions may be arbitrarily complex, and they are not generally known before learning takes place. Hence a data
mining practitioner must choose an algorithm and its associated model without
prior knowledge about the class distributions of the dataset in question. This
often requires testing multiple models to find one that works well [1]. The process of model selection, which is already arbitrary and time-consuming, becomes
even more problematic for datasets with the most difficult class distributions.
In this paper, we introduce the ensemble of prototype support vector machines (PSVMs)1 as a classification learning algorithm addressing the problem
of model selection in complex datasets. The PSVM algorithm learns a collection
of linear classifiers tuned to different regions of the space in order to separate
classes with arbitrarily complicated distributions. This algorithm is based on the
exemplar SVM (ESVM) approach [3], which trains a separate linear separator
specific to each instance in the training set. The PSVM algorithm trains an initial ensemble of ESVMs, but then iteratively improves boundaries to allow classifiers to capture groups of similar instances. Hence these new classifiers are tuned to more generalized prototypes rather than to specific exemplars. We present empirical evidence that PSVMs are capable of high classification accuracy in a variety of noiseless and noisy datasets with different class distributions.
1 Cheng et al. [2] refer to their profile support vector machines as PSVMs. In this paper, PSVM should be taken to unambiguously refer to our prototype support vector machines.
The remainder of this paper is organized as follows. In Section 2 we motivate the problem and introduce related work. In Section 3 we describe exemplar
SVMs in more detail than we have thus far. In Section 4 we describe our PSVM
algorithm. Section 5 details experiments comparing PSVMs with other classifier learning algorithms on a selection of datasets. Finally, we summarize our
conclusions and suggest future work.
2 Motivation and Related Work
In supervised machine learning, we can think of a learned classifier as a model
or function that partitions the feature space into different regions corresponding to the data points of different classes. Many learning algorithms and their
associated models are highly accurate fits to certain types of class distributions.
For datasets that are linearly separable, linear SVMs [4] are a good choice. Such
datasets include, for example, a number of text classification problems [5]. For
more complicated class distributions, such as ones where multiple noncontiguous
regions are mapped to the same class, C4.5 [6], a common decision tree learning
algorithm, is a better choice.
However, regardless of how aptly a given model captures a particular class
distribution, choosing an inappropriate model can still result in poor classification performance. Of course, every learning algorithm has some inductive bias
that limits the set of possible models it explores. It is unreasonable to expect an
algorithm to be agnostic towards the structure of the data in question. The point
is that choices about what learning algorithm to use, and hence what model to
learn, must happen prior to training and generally without knowledge of what
the data distribution looks like. In many cases, this forces the practitioner to
simply train and test multiple algorithms in order to discover which one performs
the best, a time-consuming and ad hoc process.
Model selection is even more difficult in datasets with complex class distributions – for instance, ones that are highly nonlinear or contain many small
disjuncts – since standard algorithms and models may not be sufficient in these
situations. There are many ways to attack the problem of complex class distributions directly. One approach is to reduce or reformulate the feature space,
since class boundaries may only seem complex when data is viewed in a particular space. Substantial research has been done in both feature selection (see [7]
for feature selection in supervised learning and [8] for feature selection in unsupervised learning) and feature extraction (e.g., principal component analysis,
multidimensional scaling, constructive induction [9], and more recently, manifold
learning [10]). Unfortunately, all of these approaches still make strong assumptions about the fundamental underlying classifier models.
Other research has been concerned with the class distributions themselves
[11]. Some of this work is focused on specific types of difficult distributions, such
as highly unbalanced class distributions [12] or distributions that include many
small disjuncts [13]. While extremely valuable, this focused work does not tackle
the wider array of possible class distributions that present challenges to different
models and classification algorithms.
Another approach to handling complex class distributions is to learn classifiers from different subsets of training examples. Such ensemble methods include,
for example, AdaBoost [14], which builds a series of models that increasingly focus on examples in the difficult-to-capture regions of the feature space.
Of particular relevance to us are approaches that attempt to approximate
complex decision surfaces with an ensemble of hyperplanes. These include exemplar SVMs [3], which we discuss in more detail in Section 3, and localized and
profile SVMs [2].
Localized and profile SVMs combine the benefits of SVMs with instance-based methods in order to learn local models of the example space. Localized
SVMs train a new SVM model for each test instance using that instance’s nearest
neighbors from the training set. As expected, this is very slow at test time.
Profile SVMs also defer SVM learning to test time, but take advantage of the
fact that multiple nearby test examples may require only a single SVM in that
region. Profile SVMs use a variation of k-means to cluster training examples
based on their relationship to the test examples, before learning a local SVM for
each cluster. While profile SVMs are demonstrably more efficient than localized
SVMs, they still defer training to test time, which may be unreasonable for many
applications. Our approach differs from the localized SVM framework in that it
is able to approximate a range of complex decision surfaces with ensembles of
linear SVMs without the reliance on test examples of transductive or quasi-transductive [2] approaches.
3 Exemplar Support Vector Machines
In this section we discuss exemplar SVMs (ESVMs) in greater detail, as they are
a foundation on which our algorithm is based. Exemplar SVMs were developed
for object recognition [3]. The ESVM algorithm trains a separate SVM for each
exemplar image from the training set, with that exemplar as the sole positive
instance and many instances of the other classes as negative instances. If one
of these exemplar SVMs positively classifies a novel instance, then this suggests
that the novel instance shares the class label of (i.e., depicts the same object as)
that model’s exemplar. Thus an ensemble of ESVMs can be used to classify new
data instances, either through voting or a more complicated procedure.
Object recognition tasks provide a good example of the types of complex data
distributions we seek to handle. Consider, for instance, four images containing
front and side views of bicycles and motorcycles, respectively. The two containing
side views are conceivably closer in pixel-value feature spaces than are the front
and side views of a motorcycle (or bicycle), yet it is the objects that we wish to
identify – i.e., motorcycle or bicycle – not the orientation.
The ESVM framework is well-motivated for object recognition and other
domains with complex class distributions for a number of reasons. Learning a
number of separate classifiers permits each classifier to focus on a different
feature subset. This can be useful since different regions of the example space
may be well-characterized by different combinations of features.
An approach based on ensembles and exemplars also provides comprehensive
coverage of the feature space, which is critical for classes whose instances could
potentially lie in a number of diverse regions of the space. The existence of at
least one exemplar in each of these regions can allow the ensemble as a whole
to recognize their presence, without the need to create a single overly general
classifier to accommodate them.
The ESVM algorithm is an appealing starting point for our work as it leverages the power of both SVMs and ensembles to accommodate complexity. At
the same time, the fact that it learns a separate SVM for each training instance
makes it unnecessarily time- and space-intensive for many datasets. Our approach begins with ESVMs but then learns a set of SVMs that are tuned to
general prototypes, rather than specific exemplars.
4 The Prototype SVM Algorithm
In this section we describe how we train an ensemble of prototype support vector
machines (PSVMs). The algorithm begins by training an ensemble of exemplar
support vector machines (ESVMs). It then iteratively improves and generalizes
the boundaries of the SVMs to achieve the final ensemble of PSVMs. There are
three major components of the main PSVM algorithm: initialization, shifting
of boundaries, and prediction. Algorithm 1 describes the high-level training of
PSVMs, showing how these three components are used.
4.1 Initialization
The algorithm first trains an ensemble of ESVMs, with one model for each
instance in the training set. This requires that the algorithm create a training
set of positive and negative examples specific to each ESVM. Initializing positive
sets is straightforward, as each set is a single example that serves as the exemplar
for the ESVM. In theory, negative sets could contain every instance of a different
class from the exemplar. However, we do not want the single positive instance
to be overwhelmed by negative instances. Furthermore, in spaces with highly
variable class distributions, distant regions of the space may have no influence on
how local regions should be partitioned, so negatives far from the exemplar might
be useless or even detrimental for learning good classifiers. Finally, in general,
classes will not be linearly separable, so in order to learn relatively high-quality
linear discriminants, the algorithm chooses a sample of the potential negatives
that is linearly separable from the exemplar.
Algorithm 1 Train(T): Train an ensemble of PSVMs.
Input: set of labeled training data, T
Parameters: number of iterations, s, and fraction of data to hold out for validation, v
 1: Split data into training and validation sets, D and V.
 2: P ← [[di] for di ∈ D]                     # List of positive sets, each initially just the exemplar.
 3: N ← [ChooseNegatives(D, di) for di ∈ D]   # List of negative sets.
 4: for j = 0, ..., s do
 5:     Ej ← [ ]                              # The ensemble being trained on this iteration.
 6:     for Pi ∈ P and Ni ∈ N do
 7:         Train a linear SVM, using Pi and Ni.
 8:         Add this SVM to Ej.
 9:     aj ← Test(V, Ej)                      # Accuracy of Ej on V.
10:     P, N ← Shift(D, Ej, P, N)
11: return ensemble Ej with highest accuracy on V
To accommodate these issues, we choose negatives in the manner described
in Algorithm 2. We first find the negative closest in Euclidean distance to the
exemplar. This defines a hyperplane passing through the exemplar and normal
to the vector between the exemplar and that negative. The candidate negatives
are then those that lie on the positive side of this hyperplane (i.e., the same
side of the hyperplane as the closest negative). Note that no margin is enforced,
and the hyperplane being considered is only a very rough approximation of the
hyperplane that will be learned by the SVM algorithm. From those candidate
negatives, the algorithm chooses only a small number, k, of them. The number
can be user-selected, although empirically about seven negatives seems to work
reasonably well. The k negatives chosen are those closest to the exemplar, so
that the training set for each model is kept mostly localized and the training
process is not “distracted” by instances in distant regions of the feature space.
Once a negative set for each exemplar has been initialized, the algorithm
trains an SVM for each exemplar, as in the ESVM algorithm.
4.2 Shifting
After training the initial ensemble of ESVMs, we shift the boundaries according
to Algorithm 3. (Note that the SVMs in Malisiewicz et al.’s original ESVM
approach are shifted and generalized as well, though by a different process and
not for the purpose of creating prototypes.) Shifting in our PSVM algorithm
accomplishes three main goals:
1. It generalizes classifiers from a single exemplar to a cluster of nearby instances.
2. It adjusts boundaries that misclassified negative instances.
3. It removes useless classifiers from the ensemble altogether.
Algorithm 2 ChooseNegatives(D, di): Initialize set of negatives for a given data instance.
Input: set of training data, D, and training instance, di
Parameters: number of negatives to return, k
 1: Ni ← [ ]
 2: Di ← all instances in D of a different class label from di
 3: Compute Euclidean distance from di to each element of Di.
 4: Sort Di in ascending order by distance.
 5: x ← closest negative in Di
 6: n ← (x − di) / ||x − di||   # Normal vector to the hyperplane passing through di.
 7: for dj ∈ Di do
 8:     if n · (dj − di) > 0 then
 9:         Add dj to Ni
10:         if |Ni| = k then
11:             return Ni
12: return Ni
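A compact Python rendering of Algorithm 2 (our own sketch, with illustrative names, not the authors' implementation) could look as follows.

# Illustrative sketch of Algorithm 2: pick k negatives close to the exemplar,
# restricted to the side of the hyperplane through the exemplar that contains
# the closest negative.
import numpy as np

def choose_negatives(X, y, i, k=7):
    exemplar = X[i]
    neg_idx = np.flatnonzero(y != y[i])                   # instances of other classes
    dists = np.linalg.norm(X[neg_idx] - exemplar, axis=1)
    neg_idx = neg_idx[np.argsort(dists)]                  # sorted by distance to the exemplar
    normal = X[neg_idx[0]] - exemplar
    normal = normal / np.linalg.norm(normal)              # normal of the hyperplane through d_i
    chosen = []
    for j in neg_idx:                                     # nearest negatives first
        if normal @ (X[j] - exemplar) > 0:                # same side as the closest negative
            chosen.append(j)
            if len(chosen) == k:
                break
    return chosen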
Algorithm 3 Shift(D, E, P, N): Adjust and drop models from the ensemble.
Input: set of training data, D; ensemble of models, E; positive and negative sets for each model, P and N
Parameters: probability to add to negative set, p
 1: C ← [[ ] for dj ∈ D]   # List of candidate models for each dj.
 2: for mi ∈ E and dj ∈ D do
 3:     if mi classifies dj positively then
 4:         if class(dj) = class(mi) then
 5:             Compute the distance of dj to mi's exemplar.
 6:             Add mi and its distance to the list of candidates Cj.
 7:         else   # dj is misclassified, i.e. a hard negative.
 8:             Add dj to mi's negative set Ni with some probability p.
 9: for dj ∈ D do
10:     Add dj to the positive set Pk, where mk is the closest model in Cj.
11: for mi ∈ E do
12:     if mi did not classify anything positively then
13:         Remove mi from the ensemble, by removing Pi from P and Ni from N.
14: return P, N
Generalization is at the core of what turns exemplar SVMs into prototype
SVMs. The algorithm generalizes by adding new instances to the positive sets
of models. For a given instance, we define a candidate model to be one that
classifies the instance positively and whose exemplar is of the same class as the
new instance. These candidate models are ones that could be improved by adding
this new instance to their positive sets. But it could be problematic to add each
instance to all of its candidate models. Because linear classifiers simply divide
the space into half-spaces, they run a serious risk of overgeneralizing by adding
too many positives that may have nothing to do with the original exemplar
and its cluster. To avoid this, if a particular instance is positively classified by
multiple models, the algorithm only adds it to the positive set of the model with
the closest exemplar. As in the initialization of negatives, this helps keep each
model tuned to a local region rather than attempting to capture wide swaths of
the feature space.
The algorithm also improves the models by adding misclassified negative
instances to their negative sets. This performs a kind of hard negative mining.
If we discover that a model classifies an instance as a positive example when it
should be a negative, we need to shift the linear separator to exclude the negative
example. To do this, we consider adding the negative instance explicitly to the
negative set for that model. However, again since we are dealing with complicated
class distributions and we want to keep models localized, some negatives actually
should be classified incorrectly by individual models, and there is no principled
way of identifying these in every possible dataset. Hence we only add to the
negative set with some probability, as set by the user. This has the additional
benefit of adding randomness and hence robustness to the algorithm.
The final step in the shifting algorithm is to remove models that do not classify any instances positively. Depending, in part, on the choice of regularization
parameter, some SVMs may default to classifying all examples as negative. These
are not useful for the overall ensemble and are therefore removed. Fortunately,
the use of an ensemble provides redundancy; since the algorithm begins with
a classifier for each instance of the training data, some models can be dropped
from the ensemble without detriment, unless the class distribution is extremely
difficult. Dropping models also provides some robustness to noise, as noisy exemplars may be more difficult to separate from their closest negatives.
We perform the shifting procedure some number of times as specified by
the user. After each shift, the algorithm trains a new ensemble with the new
positive and negative sets, then tests this ensemble on a held-out validation set.
The ensemble with the highest classification accuracy on the validation set is
retained. Empirically, we find that accuracy generally stabilizes fairly quickly –
between 10 and 20 iterations at most.
4.3 Prediction
For validation after each shift and at test time, the ensemble of PSVMs predicts
class labels for novel instances as follows. For each new instance, if a model
classifies it positively, this corresponds to a vote for that model’s positive class
(i.e., the class of its original exemplar). We also generate the probability that the
instance should be assigned the model’s positive class, as in [15], and sum these
probability values rather than the raw votes. The predicted class label assigned
by the entire ensemble is simply the class with the maximum sum of weighted
votes.
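As an illustration of this voting rule, the following sketch (under our own naming) assumes each ensemble member is a probability-calibrated binary classifier, e.g. a LIBSVM/scikit-learn SVC with probability estimates enabled, labelled so that class 1 is its positive side.

# Illustrative sketch of the PSVM ensemble prediction rule of Section 4.3.
# `ensemble` is a list of (svm, positive_class) pairs; each svm is a binary
# classifier with classes {0, 1}, where 1 means "matches this model's exemplar".
import numpy as np
from collections import defaultdict

def predict(ensemble, x):
    votes = defaultdict(float)
    x = np.asarray(x).reshape(1, -1)
    for svm, positive_class in ensemble:
        if svm.predict(x)[0] == 1:                          # positive classification -> a vote
            proba = svm.predict_proba(x)[0]
            p_pos = proba[list(svm.classes_).index(1)]      # probability of the positive side
            votes[positive_class] += p_pos                  # weighted vote for the exemplar's class
    return max(votes, key=votes.get) if votes else None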
5 Experiments and Results
In this section we discuss our experiments. Overall, our PSVM algorithm performs better than the other algorithms we tested on datasets with the most
complicated class distributions, and its performance is not significantly worse
than the other algorithms when applied to simpler data distributions. We also
demonstrate that our PSVM algorithm degrades gracefully in the presence of
noise.
5.1 Experimental Setup
We tested our PSVM algorithm against C4.5 [6], AdaBoost [14] with C4.5 as
the base classifier, linear SVMs [4], AdaBoost with linear SVMs as the base
classifier, SVMs with a quadratic kernel, and multilayer perceptrons trained
with backpropagation [16]. We chose these for their generally good performance
and their varied strengths. For each algorithm, we used the implementations
provided by Weka [17]. In particular, we used Weka’s wrapper of LIBSVM [15]
for the SVM experiments, as we also used LIBSVM for our PSVM algorithm.
We used Weka’s default parameters for our experiments, except we reduced
the training time of multilayer perceptrons from 500 to 200 iterations for reasons
of time. Weka’s default number of iterations for AdaBoost is quite low, namely
10, but we note that AdaBoosted linear SVMs finished in fewer than 10 iterations
in the synthetic datasets and glass, and performance on the other datasets with
as many as 100 iterations was statistically identical to performance with 10
iterations.
For PSVMs we used the following default parameters: v = 25% (percent of
data used for validation), s = 10 (iterations of shifting), k = 7 (number of initial
negatives), and p = 0.5% (hard negative mining probability).
There are three general categories of datasets we used in our experiments.
These are listed in Table 1. First are synthetic datasets (see Figure 1), specifically
designed to have unusual class distributions to provide a proof of concept of
the power of PSVMs. The spirals dataset is a standard benchmark for neural
networks originally from CMU [18]; we generated the other two. In isolated, each
cluster is normally distributed and the background data is uniform outside three
standard deviations of each cluster’s mean. In striated, each stripe is normally
distributed with greater standard deviation in one direction. Next are benchmark
datasets from UCI’s Machine Learning Repository [19]. The last dataset is a
natural language classification task from the Semantic Evaluation (SemEval)
workshop [20]. The data consists of raw Twitter messages, or tweets, and the task
is to classify them according to their sentiment as objective (i.e., no sentiment),
positive, neutral, or negative.
Table 1: The datasets used in the experiments. The top three are synthetic, with
the spirals dataset taken from CMU’s Neural Networks Benchmarks [18]. The
next four are benchmark datasets from UCI’s Machine Learning Repository [19].
The Twitter dataset is from SemEval [20].
Category     Dataset    Instances   Features (type)   Classes (instances per class)
Synthetic    isolated   800         2 (real)          2 (500 / 300)
Synthetic    striated   800         2 (real)          2 (400 / 400)
Synthetic    spirals    194         2 (real)          2 (97 / 97)
UCI          iris       150         4 (real)          3 (50 / 50 / 50)
UCI          glass      214         9 (real)          6 (70 / 76 / 17 / 13 / 9 / 29)
UCI          vehicle    846         18 (integer)      4 (212 / 217 / 218 / 199)
UCI          segment    2310        19 (real)         7 (330 each)
Real-world   twitter    600         2715 (integer)    4 (99 / 157 / 30 / 314)
Fig. 1: Synthetic two-dimensional datasets: isolated, striated, and spirals.
The Twitter dataset contained only raw tweets and sentiment labels, and
hence we preprocessed and featurized that dataset. Much research has gone into
good feature representations for natural language texts and tweets in particular
(see, for example, [21] or [22]), but as the focus of our work is not sentiment
analysis, we used a basic but reasonable set of features for this data, including
single words, links, usernames, hashtags, standard emoticons, and words from
the MPQA Subjectivity Lexicon [23].
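As an illustration of the kind of sparse features listed above, the following hedged sketch extracts word, link, username, hashtag, emoticon, and subjectivity-lexicon indicators from a tweet. The tokenisation rules and the small SUBJECTIVITY_WORDS set are illustrative stand-ins; the actual MPQA lexicon and feature definitions used in the paper are not reproduced here.

```python
import re

SUBJECTIVITY_WORDS = {"good", "bad", "terrible", "awesome"}   # placeholder for MPQA entries
EMOTICONS = {":)", ":(", ":D", ";)"}

def tweet_features(text):
    feats = {}
    for tok in text.split():
        low = tok.lower()
        if low.startswith("http"):
            feats["HAS_LINK"] = 1
        elif tok.startswith("@"):
            feats["HAS_USERNAME"] = 1
        elif tok.startswith("#"):
            feats["hashtag=" + low] = 1
        elif tok in EMOTICONS:
            feats["emoticon=" + tok] = 1
        else:
            word = re.sub(r"\W+", "", low)   # strip punctuation from the token
            if word:
                feats["word=" + word] = 1
                if word in SUBJECTIVITY_WORDS:
                    feats["MPQA_WORD"] = feats.get("MPQA_WORD", 0) + 1
    return feats

print(tweet_features("Awesome day at the beach :) http://t.co/xyz #summer"))
```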
5.2 Results on Datasets With No Noise Added
For each algorithm, we report the results of ten-fold cross validation on each
dataset. Table 2 compares our PSVMs with each of the other algorithms. Also see
Table 3 for counts of wins, losses, and ties of PSVMs over the other algorithms.
Overall, the PSVM algorithm performs about as well as the other algorithms
in all datasets, and has significantly higher accuracy in the datasets with the
Table 2: Experiment results. Shown are classification accuracy means with one
standard deviation. ◦ indicates statistically significant improvement of PSVMs
over the other algorithm, • indicates statistically significant degradation based
on a corrected paired T-test at the 0.05 level.
Dataset    PSVM         C4.5          Boosted C4.5   Linear SVM    Boosted Linear SVM   Polynomial SVM   Multilayer Perceptron
isolated   90.9(4.07)   98.1(1.79)•   98.5(1.46)•    62.5(0.0)◦    62.5(0.0)◦           81.9(4.38)◦      80.6(2.81)◦
striated   97.3(1.66)   69.4(13.4)◦   90.1(7.38)◦    48.5(3.94)◦   53.4(16.3)◦          72.8(4.18)◦      75.0(25.0)◦
spirals    21.8(18.9)   0.0(0.0)◦     0.0(0.0)◦      0.0(0.0)◦     5.63(8.97)◦          0.0(0.0)◦        20.6(28.9)
iris       96.0(4.42)   94.0(6.29)    94.0(5.54)     98.7(2.67)    97.3(3.27)           96.7(4.47)       97.3(4.42)
glass      52.8(8.72)   69.1(6.40)•   72.9(7.85)•    63.5(8.08)•   63.1(7.73)•          69.7(6.55)•      70.6(8.82)•
vehicle    79.4(4.49)   73.8(4.48)◦   75.7(3.56)     80.4(4.50)    80.4(4.23)           80.4(4.53)       79.7(4.61)
segment    95.2(1.18)   97.1(0.93)•   98.1(0.85)•    96.3(0.93)•   96.1(0.89)           95.8(1.38)       96.2(1.30)
twitter    55.8(5.12)   54.5(2.89)    60.5(7.99)     62.2(3.66)•   62.2(4.15)•          52.3(5.54)       52.3(5.54)
Table 3: The total number of wins, losses, and ties for PSVMs over other algorithms, in noiseless and noisy datasets.
           Noiseless       Noisy
Regular    16 - 13 - 19    14 - 7 - 21
most difficult class distributions. Its performance is especially impressive in the
synthetic datasets. This is no surprise; while those datasets were not designed
specifically for this algorithm, they were designed to exemplify particularly difficult class distributions. Although PSVMs do not perform as impressively in
the other domains, in general they do not perform significantly worse than the
other algorithms. The exceptions are glass and segment; we hypothesize that
this is because these are the datasets with the largest number of classes, and
the extension of SVMs to multiclass domains is not as natural as it is for the
other algorithms. The glass domain has an especially small number of instances
in certain classes, and since we hold out 25% of the training data for validation
after each shift, this may explain PSVM’s poor performance in this dataset.
The spirals dataset is especially difficult for all algorithms, though we would
expect good results from SVMs with a radial basis kernel. Of the algorithms
tested, multilayer perceptrons are the only other algorithm besides PSVMs that
perform relatively well on spirals. However, it is worth noting that multilayer
perceptrons took on the order of days to run the full suite of experiments, whereas
the other algorithms, including PSVMs, each took on the order of minutes or
hours.
5.3 Results on Datasets With Noise Added
In order to test the robustness of PSVMs to noise, we repeated the experiments
described above, on the same datasets but with noise injected into the class
labels. These results are reported in Table 4.
Table 4: Results for PSVMs in noisy datasets. Shown are classification accuracy
means with one standard deviation. ◦ indicates statistically significant improvement, • statistically significant degradation.
Dataset    PSVM         C4.5          Boosted C4.5   Linear SVM    Boosted Linear SVM   Polynomial SVM   Multilayer Perceptron
isolated   71.9(7.71)   84.8(4.36)•   83.6(4.65)•    59.3(3.67)◦   57.9(3.71)◦          75.5(4.91)       73.9(3.51)
striated   76.8(7.42)   60.3(7.24)◦   64.3(6.55)◦    52.0(2.86)◦   56.8(7.36)◦          66.8(3.12)◦      67.4(18.5)
spirals    29.0(15.4)   9.79(6.28)◦   9.79(6.28)◦    11.9(10.2)◦   16.9(17.1)           10.3(6.95)◦      17.1(13.4)
iris       88.7(5.21)   84.7(7.33)    83.3(7.45)     90.0(6.83)    90.0(6.15)           90.0(5.37)       91.3(6.70)
glass      50.0(5.50)   62.2(7.46)•   66.3(8.87)•    53.7(8.91)    57.5(7.90)•          61.6(8.95)•      61.7(4.14)•
vehicle    68.1(4.43)   63.1(1.89)◦   67.6(2.54)     71.5(3.57)    69.1(4.24)           57.4(5.12)◦      71.0(2.61)
segment    84.3(2.18)   85.5(2.30)    85.4(2.23)     84.1(1.82)    84.0(2.12)           67.8(3.95)◦      85.4(2.51)
For injecting noise we followed the same methodology as in [24]: we selected
10% of the instances uniformly and without replacement, then changed their
class labels to an incorrect one chosen uniformly. Note that we assume the Twitter dataset is already noisy; because we could not quantify the baseline noise
level, we did not inject noise into this dataset.
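A small sketch of this noise-injection step (Python, for illustration): 10% of the instances are drawn uniformly without replacement and each selected label is replaced by a different label chosen uniformly.

```python
import numpy as np

def inject_label_noise(y, classes, fraction=0.10, rng=None):
    """Flip the labels of a uniformly chosen fraction of instances to a wrong class."""
    rng = np.random.default_rng() if rng is None else rng
    y_noisy = np.array(y, dtype=object).copy()
    idx = rng.choice(len(y), size=int(round(fraction * len(y))), replace=False)
    for i in idx:
        wrong = [c for c in classes if c != y_noisy[i]]     # every label except the true one
        y_noisy[i] = wrong[rng.integers(len(wrong))]
    return y_noisy
```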
Every algorithm’s performance degrades in the presence of noise, as we would
expect. (The spirals dataset is an anomaly; the dataset is so small and its class
distribution so unusual that adding noise seems to make it easier to partition for
most algorithms.) In general, the PSVM ensemble is fairly robust to the presence
of noise. PSVMs retain their advantage in the datasets with the trickiest class
distributions, and they degrade gracefully on the benchmark datasets as well.
It is also worth noting that, for the most part, the sizes of the final PSVM
ensembles are not affected by the presence of noise (see Table 5). This suggests
that the algorithm is not retaining extra models to account for noisy instances.
In addition, note that the number of models in the PSVM ensemble for every
dataset is less than the number in the baseline ESVM ensemble. Because the
final ensemble output by our PSVM algorithm is the one from the best iteration
so far as determined by a validation set, it is clear that the unshifted ESVMs
(i.e., the first iteration in the PSVM learning process) are never found to be
best.
6 Summary and Future Work
In this paper we have described the ensemble of prototype support vector machines (PSVMs), an algorithm for performing supervised classification in datasets
with complex class distributions. This work is motivated primarily by the problem of model selection. When a data mining practitioner needs to choose a
classifier-learning algorithm for a dataset, he most likely has no a priori knowledge of the class distributions that the algorithm will need to model. The issues
involved in finding good models are exacerbated when distributions are especially
complicated.
In response to these challenges, the PSVM algorithm works by learning an
ensemble of linear classifiers tuned to different sets of instances in the training
Table 5: Average number of models in the final ensembles of PSVMs for noiseless
and noisy datasets, rounded to the nearest whole number. Also listed is the
number of models for the initial ESVM ensemble; these numbers are identical
for noiseless and noisy datasets.
Dataset    ESVM   Noiseless   Noisy
isolated   720    323         122
striated   720    149         170
spirals    175    47          56
iris       135    54          24
glass      193    35          37
vehicle    761    110         113
segment    2079   453         323
twitter    540    42          X
data. Such an ensemble is flexible enough to have high performance in datasets
with arbitrarily complex class boundaries, with minimal parameter tuning. It
accomplishes this by leveraging both the power of SVMs as effective linear classifiers and the power of ensembles to provide flexibility and improve accuracy
without the need to specify a particular kernel function. This algorithm is based
on the ensemble of exemplar SVMs for object recognition from [3]. The core of the
PSVM approach is an initial ensemble of exemplar SVMs, followed by a shifting
algorithm that refines linear models, drops unnecessary models, and generalizes
models from single exemplars to clusters of similar instances, or prototypes.
Our results demonstrate that PSVMs generally have the highest accuracy
among the algorithms we tested in the datasets with the more complex distributions, and good performance in standard benchmark datasets. In addition, the
results for noisy datasets provide evidence that PSVMs are more robust to noise
than other algorithms that seek to maximize flexibility.
The main goal of the PSVM algorithm is to reduce the need to make data-dependent algorithmic decisions before knowing about data distributions. While
model selection is no longer necessary with this algorithm, there are still parameters that must be set: the size of the validation set, the number of shifting
iterations, the size of the initial negative sets, and the probability of mining
hard negatives. Optimal settings for these may be dataset-specific and affect the
quality of the final ensemble. This remains a limitation of our approach.
There are several interesting directions for future work related to the PSVM
algorithm. Further experiments on synthetic, benchmark, and real-world datasets
would provide additional information on the capabilities of the algorithm. It
would also be worthwhile to explore a variety of modifications to the basic algorithm. For example, we might leverage the ability of ensembles to select feature
sets independently for each model. This can be useful since different regions of
the example space may be well-characterized by different combinations of features. One elegant way of doing this would be to use 1-norm SVMs to effectively
perform feature selection in tandem with learning the SVM [25]. Finally, it would
be interesting to investigate the connection of the PSVM algorithm with similar algorithms that might utilize an explicit clustering step during training. We
hypothesize that the shifting process of PSVMs enables the organic discovery
of clusters without needing to specify the number of centroids as in k-means. It
would be worth exploring to what extent this hypothesis is supported empirically
and theoretically.
References
1. Chatfield, C.: Model uncertainty, data mining and statistical inference. Journal of
the Royal Statistical Society. Series A (Statistics in Society) 158(3) (1995)
2. Cheng, H., Tan, P.N., Jin, R.: Efficient algorithm for localized support vector
machine. IEEE Trans. Knowl. Data Eng. 22(4) (2010)
3. Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMs for object
detection and beyond. In: Proceedings of the 2011 International Conference on
Computer Vision. ICCV ’11 (2011)
4. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3) (1995)
5. Joachims, T.: Text categorization with support vector machines: Learning with
many relevant features. In: Proceedings of the 10th European Conference on Machine Learning. ECML ’98 (1998)
6. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)
7. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. The
Journal of Machine Learning Research 3 (2003)
8. Ferreira, A., Figueiredo, M.: Unsupervised feature selection for sparse data. In: Proceedings of the 2011 European Symposium on Artificial Neural Networks. ESANN
’11 (2011)
9. Callan, J.P., Utgoff, P.E.: A transformational approach to constructive induction.
In: Proceedings of the 8th International Workshop on Machine Learning. ML ’91
(1991)
10. Xiao, R., Zhao, Q., Zhang, D., Shi, P.: Facial expression recognition on multiple
manifolds. Pattern Recognition 44(1) (2011)
11. Japkowicz, N.: Concept-learning in the presence of between-class and within-class
imbalances. In: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence.
AI ’01 (2001)
12. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the 1997 International Conference on Machine
Learning. ICML ’97 (1997)
13. Holte, R., Acker, L., Porter, B.: Concept learning and the problem of small disjuncts. In: Proceedings of the 1989 International Joint Conference on Artificial
Intelligence. IJCAI ’89 (1989)
14. Schapire, R.E.: A brief introduction to boosting. In: Proceedings of the 1999
International Joint Conference on Artificial Intelligence. IJCAI ’99 (1999)
15. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Trans. Intell. Syst. Technol. 2(3) (May 2011)
16. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations
by error propagation. In Rumelhart, D.E., McClelland, J.L., PDP Research Group,
C., eds.: Parallel distributed processing: explorations in the microstructure of cognition, vol. 1. MIT Press, Cambridge, MA, USA (1986) 318–362
17. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.:
The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1)
(November 2009) 10–18
18. White, M., Sejnowski, T., Rosenberg, C., Qian, N., Gorman, R.P., Wieland, A., Deterding, D., Niranjan, M., Robinson, T.: Bench: CMU neural networks benchmark
collection. http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/
neural/bench/cmu/0.html (1995)
19. Bache, K., Lichman, M.: UCI machine learning repository. http://archive.ics.
uci.edu/ml (2013)
20. Wilson, T., Kozareva, Z., Nakov, P., Ritter, A., Rosenthal, S., Stoyanov, V.:
Semeval-2013 task 2: Sentiment analysis in twitter. http://www.cs.york.ac.uk/
semeval-2013/task2/ (2013)
21. Barbosa, L., Feng, J.: Robust sentiment detection on twitter from biased and
noisy data. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters. COLING ’10, Stroudsburg, PA, USA, Association for
Computational Linguistics (2010)
22. Davidov, D., Tsur, O., Rappoport, A.: Enhanced sentiment learning using twitter
hashtags and smileys. In: Proceedings of the 23rd International Conference on
Computational Linguistics: Posters. COLING ’10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010)
23. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-level sentiment analysis. In: Proceedings of the Conference on Human Language
Technology and Empirical Methods in Natural Language Processing. HLT ’05
(2005) 347–354
24. Dietterich, T.: An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning
40(2) (2000)
25. Tan, M., Wang, L., Tsang, I.W.: Learning sparse SVM for feature selection on very
high dimensional datasets. In: Proceedings of the 2010 International Conference
on Machine Learning. ICML ’10 (2010)
Software reliability prediction
via two different implementations of
Bayesian model averaging
Alex Sarishvili¹ and Gerrit Hanselmann²
¹ Fraunhofer Institute for Industrial Mathematics ITWM, 67663 Kaiserslautern, Germany, [email protected]
² Siemens AG, Corporate Technology, 81739 München, Germany, [email protected]
Abstract. The problem of predicting software reliability is compounded
by the uncertainty of selecting the right model. While Bayesian Model
Averaging (BMA) provides a means to incorporate model uncertainty
in the prediction, research on the influence of parameter estimation and
the performance of BMA in different situations will further expedite the
benefits of using BMA in reliability prediction. Accordingly, two different
methods for calculating the posterior model weights, required for BMA,
are implemented and benchmarked considering different data situations.
The first is the Laplace method for integrals. The second is the Markov
Chain Monte Carlo (MCMC) method using Gibbs and Metropolis-within-Gibbs
sampling. For the latter, the explicit conditional probability density
functions for grouped failure data are provided for each of the model
parameters. With a number of different simulations of mixed grouped
failure data we can show the robustness and superior performance of
MCMC measured by mean squared error on long and short range predictions.
1 Introduction
There exists a large number of different reliability prediction models with a wide
variety of underlying assumptions. The problem of selecting the single right
model or combination of models is receiving considerable attention with the
growing need of improved software reliability predictions [1], [2], [3], [4], [5],
[6], [7]. A conceivably simple approach to integrate uncertainty about the right
model in the prediction is the equally weighted linear combination (ELC) of
models studied by [8]. ELC is defined by
\[
\hat{f}_t^{\mathrm{ELC}} = \frac{1}{k}\sum_{i=1}^{k}\hat{f}_i(t),
\]
where $\hat{f}_i(t)$ is the prediction of $f(t)$, the number of accumulated faults found
until time t, using model $M_i$, and k is the number of models. It basically treats
every prediction model as equally good.
Instead of giving equal trust to each and every model a more sophisticated
approach considers the model performance on the data to distribute trust on the
different models. This can be done using Bayesian Model Averaging (BMA) that
calculates posterior weights for every model and places trust amongst the models
accordingly. Several realistic simulation studies comparing the performance of
BMA [9, 10] showed that in general BMA has better performance. These studies
were performed for a variety of situations, e.g., linear regression [11], Log-Linear
models [12], logistic regression [13], and wavelets [14]. Other studies, especially
regarding BMA out-of-sample performance have shown quite consistent results,
namely BMA having better predictive performance than competing methods.
Theoretical results on the performance of BMA are given by [15].
This paper investigates and describes two different approaches for estimating
the posterior model probabilities for BMA in the case where grouped failure data
is given. One is the Laplace method for integrals. The other is the Markov
Chain Monte Carlo (MCMC) method in its most general form, which allows
implementation using Gibbs and Metropolis-within-Gibbs samplers.
Through the comparison of two different implementations of BMA on
grouped and small simulated data sets we show the significance of the presented
methods on the prediction performance of BMA. Furthermore we demonstrate
the superiority of BMA over single prediction models and over the ELC combination technique. We decided to simulate the data and not to use standard data
sets available online for the following reason: model performance estimated on
only a few data sets (each of them being one realization of a stochastic process) has
the drawback of missing generality and in most cases has limited informative
value.
For the combination four different non-homogeneous Poisson process
(NHPP) models for grouped failure data with different mean value functions [16]
have been used (Table 1). These models were selected because of their
popularity and their convenience in illustrating the evaluation results. Doubts
about the validity of the assumption that the software reliability behaviour follows an
NHPP are presented in [17].
The rest of this paper is organized as follows. Section 2 gives a short survey of
software reliability growth modeling. Section 3 gives an introduction to Bayesian
model averaging and describes the Laplace method and the MCMC method for
calculating the posterior model weights. The simulation setup and the results
are illustrated in Section 4. The conclusion and future work is given in Section
5. The models are introduced in [18], [19], [20], [21] respectively.
Table 1. Models under consideration
Model                       Mean value function
Delayed S-Shaped (DSS)      µ(t) = a(1 − [1 + βt] e^{−βt})
Goel-Okumoto (GO)           µ(t) = a(1 − e^{−bt})
Goel Generalized (GG)       µ(t) = a(1 − e^{−bt^γ})
Inflection S-Shaped (ISS)   µ(t) = a(1 − e^{−bt}) / (1 + β e^{−bt})
2 Software Reliability Growth Models
Software reliability is defined as the probability of failure-free operation for a
specified period of time under specified operating conditions [22]. Thereby, a
software failure is an inconsistent behavior with respect to the specified behavior
originating from a fault in the software code [23].
Software reliability modeling started in the early 70s. In contrast to hardware reliability, software reliability is concerned with design faults only. Software does not suffer from deterioration, nor does it fail accidentally when not in use.
Design faults occur deterministically until they have been removed. Notwithstanding, software’s failure behavior can be described using random models [24].
The reason is that even though the failures occur deterministically under the
same conditions, their occurrence during usage may be random. A failure will
no longer occur if its underlying fault has been removed. The process of finding
and removing faults can be described mathematically by using software reliability growth models (SRGM) and most existing reliability models draw on this
assumption of an improving reliability over time due to continuous testing and
repair. An overview of these models can be found in [25], [26], [27], [28], [29].
This paper uses models of the NHPP class [16]. NHPP models have been used
successfully in practical software reliability engineering. These models assume
that N (t), the number of observed failures up to time t, can be modeled as a
NHPP, as Poisson process with a time varying intensity function. A counting
process {N (t), t ≥ 0} is an NHPP if N (t) has a Poisson distribution with mean
value function µ(t) = E[N (t)], i. e.,
P (N (t) = n) =
µ(t)n −µ(t)
e
, n = 0, 1, 2, ... .
n!
The mean value function µ(t) is the expected cumulative number of failures in
[0, t). Different NHPP software reliability growth models have different forms of
µ(t) (see also Table 1).
3 Bayesian Model Averaging
A software reliability growth model uses failures found during testing and fault
removal to describe the failure behavior over time. Different models have different
assumptions concerning the failure behavior of software. Let M = {M1 , ..., Mk }
be the k NHPP models that predict the cumulated number of failures, fi (t), i =
1, ..., k , for each time t . The BMA model predicts the expected cumulated number of failures at time t, µ̂(t)bma , by averaging over predictions of the models
Mi , i = 1 ... k . Thereby the models are weighted using their posterior probabilities. This leads to the following BMA pdf
\[
p(f(t) \mid d(t)) = \sum_{i=1}^{k} p(f_i(t) \mid M_i, d(t))\, p(M_i \mid d(t)), \qquad (1)
\]
where p(fi (t)| Mi , d(t)) is the prediction pdf of fi (t) under model Mi and
p(Mi | d(t)) is the posterior probability of model Mi given data d(t)3 . The BMA
point prediction of the cumulated number of experienced failures is
\[
\hat{f}_{\mathrm{BMA}}(t) = \sum_{i=1}^{k} \hat{f}_i(t)\, p(M_i \mid d(t)), \qquad (2)
\]
where the posterior probability for model Mi at time t is given by
\[
p(M_i \mid d(t)) = \frac{p(d(t) \mid M_i)\, p(M_i)}{\sum_{i=1}^{k} p(d(t) \mid M_i)\, p(M_i)}, \qquad (3)
\]
with
\[
p(d(t) \mid M_i) = \int p(d(t) \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i, \qquad (4)
\]
being the integrated likelihood of model Mi . Thereby, θi is the vector of parameters of model Mi , p(θi |Mi ) is the prior density of θi under model Mi ,
p(d(t)| θi , Mi ) is the likelihood, and p(Mi ) is the prior probability that Mi is
the true model [10].
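The following sketch (Python, for illustration; the paper's implementation is in Matlab) shows how the point prediction in (2) follows from per-model evidence estimates under equal priors p(M_i) = 1/k: the log integrated likelihoods are normalized into posterior model weights as in (3) and used to average the model predictions. The variable names and numbers are illustrative; the evidence values would come from the Laplace or MCMC estimates of Sections 3.1 and 3.2.

```python
import numpy as np

def bma_point_prediction(pred, log_evidence):
    """Combine model predictions with weights proportional to their evidence (equal priors)."""
    log_w = np.asarray(log_evidence, dtype=float)
    log_w -= log_w.max()                       # stabilise before exponentiating
    w = np.exp(log_w)
    w /= w.sum()                               # posterior model probabilities p(M_i | d(t))
    return float(np.dot(w, pred)), w

# toy usage with four candidate models
pred = [40.0, 55.0, 52.0, 47.0]                # each model's predicted cumulative failures
log_evidence = [-120.3, -118.9, -119.4, -125.0]
print(bma_point_prediction(pred, log_evidence))
```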
3.1 Laplace method for integrals
In this paper two methods for implementing BMA have been examined. The
first is the approximation of the integral in (4) by the method of Laplace. In
regular statistical models (roughly speaking, those in which the maximum likelihood estimate (MLE) is consistent and asymptotically normal) the best way to
approximate the integral in (4) is usually using the Laplace method.
The integrated likelihood from (4) can be estimated in the following way.
For simplicity the conditional information on models has been omitted from the
equations. Let g(θi ) = log(p(d(t)| θi )p(θi )) and let θ̃i = arg maxθ∈Θi g(θ). After
Taylor series expansion truncated at the second term the following is obtained:
g(θi ) ≈ g(θ̃i ) + 1/2(θi − θ̃i )′ g ′′ (θ̃i )(θi − θ̃i ).
It follows
\[
p(d(t) \mid M_i) = \int e^{g(\theta_i)}\, d\theta_i
= e^{g(\tilde{\theta}_i)} \int e^{\frac{1}{2}(\theta_i - \tilde{\theta}_i)^{T} g''(\tilde{\theta}_i)(\theta_i - \tilde{\theta}_i)}\, d\theta_i. \qquad (5)
\]
By recognizing the integrand as proportional to the multivariate normal density
and using the Laplace method for integrals
\[
p(d(t) \mid M_i) = e^{g(\tilde{\theta}_i)}\, (2\pi)^{D_i/2}\, |A_i|^{-1/2}, \qquad (6)
\]
where Di is the number of parameters in the model Mi , and Ai = −g ′′ (θ̃i ).
³ Since the observed data changes with time, it is denoted by the time-dependent function d(t).
It can be shown that for large N, which is the number of data available,
θ̃i ≈ θ̂i , where θ̂i is the MLE, and Ai ≈ N Ii . Thereby Ii is the expected Fisher
information matrix. It is a Di × Di matrix whose (k, l)-th elements are given by:
\[
I_{kl} = -E\left[\frac{\partial^2 \log p(d(t) \mid \theta_i, M_i)}{\partial \theta_k\, \partial \theta_l}\right]_{\theta = \hat{\theta}_i}.
\]
Taking the logarithm of (6) leads to
\[
\log p(d(t) \mid M_i) = \log p(d(t) \mid \hat{\theta}_i) + \log p(\hat{\theta}_i)
+ \frac{D_i}{2}\log(2\pi) - \frac{D_i}{2}\log(N)
- \frac{1}{2}\log|I_i| + O(N^{-1/2}). \qquad (7)
\]
This is the basic Laplace approximation. In (7) the information matrix Ii is
estimated as the negative of the Hessian of the log-likelihood at the maximum
likelihood estimate. Furthermore the results can be compared with the Bayesian
information criterion (BIC) approximation. If the terms which are O(1) or
smaller for N → ∞ are absorbed into the error term in (7), the following can be derived:
\[
\log p(d(t) \mid M_i) = \log p(d(t) \mid \hat{\theta}_i) - \frac{D_i}{2}\log(N) + O(1). \qquad (8)
\]
This approximation is the well known BIC approximation. Its error O(1) does not
vanish for an infinite amount of data, but because the other terms on the right
hand side of (8) tend to infinity with the number of data, they will eventually
dominate the error term. This is the case when software testing has progressed
sufficiently and much failure data is available.
One choice of the prior probability distribution function of the parameters
in (7) is the multivariate normal distribution with mean θ̂i and the variance
equal to the inverse of the Fisher information matrix. This seems to be a reasonable representation of the common situation where there is only little prior
information. Under this prior using (7) the posterior approximation is:
\[
\log p(d(t) \mid M_i) = \log p(d(t) \mid \hat{\theta}_i) - \frac{D_i}{2}\log(N) + O(N^{-1/2}). \qquad (9)
\]
Thus under this prior the error is O(N^{-1/2}), which is much smaller for moderate to large sample sizes and which tends to zero as N tends to infinity. Equation (9)
was pointed out by [30]. This corresponds to a variant of the so-called non-informative priors.
Non-informative priors are useful when the analyst has no relevant experience to specify a prior and when subjective elicitation in multi-parameter
problems is impossible. The Laplace method with normal prior (9) is used for
the approximation of the posterior probabilities.
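A minimal sketch of this approximation (Python, for illustration): under the normal prior discussed above, the log integrated likelihood of a model is approximated by its maximized log-likelihood minus (D_i/2) log N, as in (9). The resulting values can be fed into the weighting of (3), for example via the bma_point_prediction sketch given earlier; the numbers in the usage lines are arbitrary.

```python
import numpy as np

def log_evidence_laplace(loglik_at_mle, n_params, n_obs):
    """Approximate log p(d(t) | M_i) as in (9): penalise the maximised log-likelihood
    by half the number of parameters times log of the sample size."""
    return loglik_at_mle - 0.5 * n_params * np.log(n_obs)

# toy usage: two candidate models fitted to N = 30 grouped failure counts
print(log_evidence_laplace(loglik_at_mle=-85.2, n_params=2, n_obs=30))
print(log_evidence_laplace(loglik_at_mle=-83.9, n_params=3, n_obs=30))
```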
3.2 Markov Chain Monte Carlo method
Another way of calculating the marginal likelihoods which are needed for estimating the posterior weights of the models is MCMC. MCMC generates samples
from the joint pdf of the model parameters. In this paper the Gibbs sampler is
used for the MCMC implementation. The transition distribution of this Markov
chain is the product of several conditional probability densities. The stationary
distribution of the chain is the desired posterior distribution [31]. After the samples from the parameter joint pdf for any model i are generated, the integral (4)
can be approximated by the sum
\[
p(d(t) \mid M_i) = \sum_{j=1}^{T} p(d(t) \mid \theta_i^{(j)}, M_i)\, p(\theta_i^{(j)} \mid M_i), \qquad (10)
\]
where T is the number of Gibbs sampler iterations. Below, the parameter conditional
probability density functions needed for the Gibbs sampler implementation are given. A similar MCMC implementation was described by
[32], but for non-grouped data. Because the likelihood function for grouped data
is different from that for non-grouped data, the conditional densities needed for
the Gibbs implementation are different as well.
The likelihood function of different NHPP models for interval failure count
data is
\[
p(d(t) \mid \theta_i, M_i) = \prod_{i=1}^{t} \frac{(\mu(i) - \mu(i-1))^{d_i}}{d_i!}\; e^{-\mu(t)}. \qquad (11)
\]
It is necessary to make an assumption about the prior distribution of the model
parameters. It is convenient to choose the Gamma distribution as prior, for it
supports the positivity of the parameters and is quite versatile to reflect densities
with increasing or decreasing failure rates.
Under this assumption the posterior density, for instance for the delayed
s-shaped model is
\[
p(a, \beta \mid d(t)) \propto p(d(t) \mid a, \beta, \mathrm{DSS})\, a^{\alpha_1 - 1} e^{-a\alpha_2}\, \beta^{\beta_1 - 1} e^{-\beta\beta_2}.
\]
Inserting the corresponding mean value function into (11) yields p(d(t) | a, β, DSS).
Subsections 3.2.1 to 3.2.4 describe the gamma prior distributions of the parameters of the considered models.
Since not all conditional densities have convenient forms, the Metropolis-within-Gibbs algorithm is used to approximate the joint pdf of the model parameters. One way to avoid this computationally intensive Metropolis-within-Gibbs
sampling is data augmentation [31]. This is however not considered in this paper.
The following subsections describe the conditional densities for the different
mean value functions of the NHPP models. These conditional densities can be
used in the Gibbs sampler to generate the desired joint parameter probability
distributions and therefore the corresponding posterior pdfs for each model.
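A generic Metropolis-within-Gibbs skeleton of the kind referred to above is sketched below (Python, for illustration; the paper's implementation is in Matlab). The list log_cond of log unnormalized full-conditional functions would be built from the densities given in Subsections 3.2.1 to 3.2.4; having them return -inf for non-positive proposals enforces the positivity implied by the Gamma priors.

```python
import numpy as np

def metropolis_within_gibbs(log_cond, theta0, n_iter=1000, step=0.1, rng=None):
    """Cycle through the coordinates; update each with a random-walk Metropolis step
    targeting its full conditional log_cond[j](value, theta)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    samples = np.empty((n_iter, theta.size))
    for it in range(n_iter):
        for j in range(theta.size):
            prop = theta.copy()
            prop[j] += step * rng.normal()                     # symmetric random-walk proposal
            log_ratio = log_cond[j](prop[j], prop) - log_cond[j](theta[j], theta)
            if np.log(rng.uniform()) < log_ratio:
                theta = prop                                   # accept the proposed value
        samples[it] = theta
    return samples
```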
3.2.1 Conditional densities for DSS model parameters
For the DSS-model with Gamma a-priori distributed parameters
a ∼ Γ (α1 , α2 ) and β ∼ Γ (β1 , β2 ), the following conditional densities are sampled
\[
p(\beta \mid a, d(t)) \propto e^{-\beta\sum_{i=1}^{t} i\,d(i) - \bar{\mu}(t) - \beta_2\beta}\; \beta^{\beta_1 - 1} \prod_{i=1}^{t} A^{d(i)},
\]
where A = −(1+βi)+eβ (1+βi−β) and µ̄(t) = −a(1+βt)e−βt is the second term
of the expectation function of the DSS model. The conditional density function
for a is
\[
a \mid \beta, d(t) \sim \Gamma\!\left(\sum_{i=1}^{t} d(i) + \alpha_1,\; 1 - (1+\beta t)e^{-\beta t} + \alpha_2\right).
\]
3.2.2 Conditional densities for GO model parameters
Considering the GO-model the conditional densities for the parameters a and b
are
\[
a \mid b, d(t) \sim \Gamma\!\left(\sum_{i=1}^{t} d(i) + a_1,\; 1 - e^{-bt} + a_2\right)
\]
and
\[
p(b \mid a, d(t)) \propto e^{-b\sum_{i=1}^{t} i\,d(i) - \bar{\mu}(t) - b_2 b}\; b^{b_1 - 1} A,
\quad \text{where } A = (e^{b} - 1)^{\sum_{i=1}^{t} d(i)},
\]
and µ̄(t) = −ae−bt is the second term of the expectation function of the GO
model.
3.2.3 Conditional densities for GG model parameters
For the GG-model with a Gamma prior distribution the conditional densities
are: For the parameter a ∼ Γ (a1 , a2 ):
\[
a \mid b, \gamma, d(t) \sim \Gamma\!\left(\sum_{i=1}^{t} d(i) + a_1,\; 1 - e^{-bt^{\gamma}} + a_2\right).
\]
For the parameter b ∼ Γ(b1, b2):
\[
p(b \mid a, \gamma, d(t)) \propto A\, e^{-\bar{\mu}(t) - b_2 b}\, b^{b_1 - 1},
\]
where
\[
A = \prod_{i=1}^{t}\left(-\frac{1}{e^{b i^{\gamma}}} + \frac{1}{e^{b (i-1)^{\gamma}}}\right)^{d(i)}.
\]
For the parameter γ ∼ Γ(γ1, γ2):
\[
p(\gamma \mid a, b, d(t)) \propto A\, e^{-\bar{\mu}(t) - \gamma_2\gamma}\, \gamma^{\gamma_1 - 1},
\]
where µ̄(t) = −a e^{−bt^γ} is the second term of the expectation function of the GG model.
3.2.4 Conditional densities for ISS model parameters
For the parameter a ∼ Γ (a1 , a2 ) of the ISS model the conditional density is
\[
a \mid b, \beta, d(t) \sim \Gamma\!\left(\sum_{i=1}^{t} d(i) + a_1,\; \frac{1 - e^{-bt}}{1 + \beta e^{-bt}} + a_2\right).
\]
For parameters b ∼ Γ(b1, b2) and β ∼ Γ(β1, β2) the conditional density is
\[
p(\beta \mid a, b, d(t)) \propto (\beta + 1)^{\sum_{i=1}^{t} d(i)}\, A\, \beta^{\beta_1 - 1} e^{-\beta_2\beta}
\]
and
\[
p(b \mid a, \beta, d(t)) \propto e^{b\left(\sum_{i=1}^{t} i\,d(i) - b_2\right)}\, (e^{b} - 1)^{\sum_{i=1}^{t} d(i)}\, A\, b^{b_1 - 1}
\]
respectively. Thereby,
\[
A = \prod_{i=1}^{t}\left[(e^{bi} + \beta)(e^{bi} + \beta e^{b})\right]^{-d(i)} e^{-\mu(t)},
\]
where µ(t) is the expectation function of the ISS model.
4 Evaluation
This paper examines two methods to calculate the posterior weights for the
model combination using BMA. The evaluation compares the prediction performance of different combinations:
– BMA using MCMC
– BMA using Laplace
– ELC.
Besides comparing the prediction performance of the combinations, BMA using
MCMC is also compared to the performance of individual models. The comparison of model performances is done by measuring the mean squared error on
long and short range software reliability predictions.
The experimental procedure begins with the simulation of mixed NHPP realizations via the thinning algorithm, which is described in the next section. The
obtained data are then used by the Gibbs sampler (Section 3.2) and the Laplace procedure (Section 3.1) to estimate the posterior pdfs of the selected NHPP models.
Finally, the performance comparison is made in Section 4.4. The following sections present the details of the evaluation.
4.1 Simulation
To achieve this evaluation different data sets have to be simulated. In detail 100
mixed NHPP processes with uniformly distributed random mixing coefficients,
and with random model parameters have been simulated. In [33] it was shown
that failure data can be much better described using several models instead
of only one. Therefore, simulating mixed NHPP processes is more realistic and is the approach
used in this paper. The following distributions and parameters were used in the
simulation:
– for DSS Model: a ∼ U[50,125] and β ∼ U[0.05,0.15]
– for GO Model: a ∼ U[50,125] and b ∼ U[0.05,0.15]
– for ISS Model: a ∼ U[50,125], b ∼ U[0.05,0.15] and β ∼ U[2.5,3.5]
– for GG Model: a ∼ U[50,125], b ∼ U[0.05,0.15] and γ ∼ U[2.5,3.5].
For the simulation of a single NHPP model an approach called thinning or
rejection method [34] was used. It is based on the following observation. Suppose there
exists a constant λ̄ such that λ(t) ≤ λ̄ for all t. Let T1*, T2*, ... be the successive
arrival times of a homogeneous Poisson process with intensity λ̄, and accept
the i-th arrival time with probability λ(Ti*)/λ̄; then the sequence T1, T2, ... of the
accepted arrival times consists of the arrival times of an NHPP with rate function λ(t)
[35].
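A compact sketch of this thinning procedure (Python, for illustration). The intensity used in the usage line is the Goel-Okumoto rate λ(t) = a·b·e^{-bt}, whose maximum a·b serves as the bound λ̄; the parameter values are arbitrary examples within the ranges listed above.

```python
import numpy as np

def simulate_nhpp(rate, rate_bound, t_max, rng=None):
    """Thinning / rejection sampling of an NHPP with intensity rate(t) <= rate_bound on [0, t_max]."""
    rng = np.random.default_rng() if rng is None else rng
    t, arrivals = 0.0, []
    while True:
        t += rng.exponential(1.0 / rate_bound)    # candidate from a homogeneous PP with rate rate_bound
        if t > t_max:
            break
        if rng.uniform() < rate(t) / rate_bound:  # accept with probability rate(t) / rate_bound
            arrivals.append(t)
    return np.array(arrivals)

# illustrative usage with a Goel-Okumoto intensity, a = 100, b = 0.1
a, b = 100.0, 0.1
times = simulate_nhpp(lambda t: a * b * np.exp(-b * t), rate_bound=a * b, t_max=25.0)
counts, _ = np.histogram(times, bins=np.arange(0, 26))   # grouped (interval) failure counts
```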
The generated data sets were divided into training and validation parts. The
validation parts were used to calculate the predictive performance. All algorithms
were implemented in the Matlab environment.
4.2 Performance measure
The performance measure used for the evaluation is the standard measure mean
squared error (MSE). The MSE for a specific model i can be expressed as
\[
MSE_i = \frac{1}{N}\sum_{t=1}^{N}\left(\hat{f}_i(t) - f_i(t)\right)^2,
\]
with fˆi (t) the predicted number of failures of model i at time t and fi (t) is the
actual number of observed (simulated) failures at time t.
The Bayes MSE is estimated by means of the Monte Carlo integration. Let
$\theta_i^{(k,m)}$ be the variates of the parameters of model i drawn in the k-th replication
and m-th iteration of the Gibbs sampler, and let $f_i(\theta_i^{(k,m)}, t)$ be the model i
output if we use the parameter vector $\theta_i^{(k,m)}$ estimated with the data up to time t;
then the $MSE_i$ is calculated as follows:
\[
\tilde{f}_i(t) = \frac{2}{KM}\sum_{k=1}^{K}\sum_{m=M/2+1}^{M} f_i(\theta_i^{(k,m)}, t), \qquad (12)
\]
\[
MSE_i = \frac{1}{N}\sum_{t=1}^{N}\left(\tilde{f}_i(t) - f_i(t)\right)^2
\]
where f˜i (t) is the Bayesian MCMC estimation of the accumulated number of
failures by model i with the data from the time interval [1, t], K is the number
of replications and M is the number of iterations of the Gibbs sampler and N
is the number of data.
4.3 Evaluation parameters
The a-priori densities of the parameters were chosen from the Gamma distribution function with following parameters:
– for DSS Model: a ∼ Γ(1, 0.001) and β ∼ Γ(1, 0.001)
– for GO Model: a ∼ Γ(1, 0.001) and b ∼ Γ(1, 0.001)
– for ISS Model: a ∼ Γ(1, 0.001), b ∼ Γ(1, 0.001) and β ∼ Γ(2, 1)
– for GG Model: a ∼ Γ(1, 0.001), b ∼ Γ(1, 0.001) and γ ∼ Γ(1, 0.001)
The number of MCMC iterations was M = 1000 and number of replications
K = 25, see formula 12.
The BMA point estimate is given as follows:
– For BMA with MCMC: Inserting the outcome of (10) together with the
equal model a-priori probabilities P(Mi) = 1/4 into (3) results in the BMA
weighting factors. Inserting the product of these factors and the outcome of
(12) into (2) results in the BMA MCMC point estimate of the system output.
– For BMA with Laplace: Inserting the exponential of the outcome of (9)
together with the equal model a-priori probabilities P(Mi) = 1/4 into (3)
results in the BMA weighting factors. Inserting the product of these factors
and the ML estimates of the model outputs into (2) results in the BMA
Laplace point estimate of the system output.
4.4 Evaluation results
4.4.1 Comparing model performance of MCMC BMA to single models
Figure 1 shows the performance comparison of the MCMC BMA Model vs. single
models using 75% of the simulated data for the model parameter estimation
and the remaining 25% of data for the model validation. In detail the Figure
shows on the x-axis the MSE of BMA MCMC and on the y-axis the MSE of
the respective single model. Therefore every point above the equal performance
border line indicates that the performance of MCMC BMA was better than that
of the respective single model. Figure 2 shows the same comparison but with only
50% of the data for the model parameter estimation and the remaining data for
the validation.
In Figure 1 the DSS model MSE is near the equal performance border line
for an MSE < 16, that is, for simulated data which had a high weighting
coefficient on the DSS model in the thinning algorithm. However, for the more interesting
case of longer-term prediction, Figure 2 shows that the BMA MCMC model
outperforms all single models by far.
[Scatter plot omitted. Title: Performance comparison on validation data (25% of the whole data set). x-axis: MSE's over 100 simulation runs, MCMC BMA (0-30); y-axis: MSE's over 100 simulation runs, single models (0-1400). Series: MCMC BMA vs GO, MCMC BMA vs. DSS, MCMC BMA vs. ISS, MCMC BMA vs. GG; equal performance border line shown.]
Fig. 1. Predictive performance comparison BMA MCMC vs single models, 25% of the
data were used for the validation
[Scatter plot omitted. Title: Performance comparison on validation data (50% of the whole data set). x-axis: MSE's over 100 simulation runs, MCMC BMA (0-35); y-axis: MSE's over 100 simulation runs, single models (0-2500). Series: MCMC BMA vs GO, MCMC BMA vs. DSS, MCMC BMA vs. ISS, MCMC BMA vs. GG; equal performance border line shown.]
Fig. 2. Predictive performance comparison BMA MCMC vs single models, 50% of the
data were used for the validation
4.4.2 Comparing model performance of MCMC BMA to Laplace BMA and ELC
Figures 3 and 4 show the comparison results for the different combinations. Considering long-term prediction, BMA MCMC has better performance
than ELC or BMA Laplace in almost every one of the 100 simulation runs (Figure 3).
However, if 75% of the data were used for model fitting, ELC had similar or better predictive
performance than BMA MCMC in 20% of the 100 simulation runs, and BMA Laplace in 23% (Figure 4). The BMA Laplace model was
better than ELC in 53 of 100 simulation cases on the small data set (50% of the
data used for the parameter estimation; see Figure 3). For the bigger data set
(75% of the data used for the parameter estimation) BMA Laplace had a smaller
MSE than ELC in 64 out of the 100 simulation runs. This trend shows the better approximative results of the Laplace method for large data sets. Comparing
the prediction performance of BMA MCMC and BMA Laplace revealed that
[Scatter plot omitted. Title: Performance comparison on validation data (50% of the whole data set). x-axis: MSE's over 100 simulation runs, MCMC BMA (0-35); y-axis: MSE's over 100 simulation runs, MCMC Laplace and ELC (0-40). Series: MCMC BMA vs ELC, MCMC BMA vs. Laplace BMA; equal performance border line shown.]
Fig. 3. Predictive performance comparison BMA MCMC vs combining models, 50%
of the data were used for the validation
[Scatter plot omitted. Title: Performance comparison on validation data (25% of the whole data set). x-axis: MSE's over 100 simulation runs, MCMC BMA (0-30); y-axis: MSE's over 100 simulation runs, MCMC Laplace and ELC (0-35). Series: MCMC BMA vs ELC, MCMC BMA vs. Laplace BMA; equal performance border line shown.]
Fig. 4. Predictive performance comparison BMA MCMC vs combining models, 25%
of the data were used for the validation
the BMA MCMC is clearly better suited for small data sets. If only 50% of the
data was used as training data set BMA MCMC had a much better prediction
performance than BMA Laplace. With more and more data the BMA Laplace
is improving. This trend can be observed when for instance 75% of the data was
used as the training data set. In this case, already 23% of the BMA Laplace predictions had similar
or better performance than BMA MCMC.
5 Conclusion
A review of the relevant literature reveals great agreement that there is no single
model that can be used for all cases. This paper addressed this issue and studied two ways of implementing Bayesian Model Averaging for Non-Homogeneous
Poisson Process models for grouped failure data. It could be shown that BMA
had better prediction performance than the single models. Also it could be shown
that BMA has better prediction performance than simpler combination approaches
such as ELC.
Considering the two ways of implementing BMA it was shown that the
MCMC approach is by far better than the Laplace method if the data set used
for parameter estimation is small. Laplace should not be used when the number
of detected failures is small and therefore the ratio between the error term and
the other terms on the right hand side of (9) is high. On the other hand, the
Laplace approximation does not require complicated computational procedures
like the Metropolis-within-Gibbs sampler. However, since the multivariate normal
distribution does not account for skewness, the accuracy of the approximation
is low in many cases.
The MCMC exploration of the support of a-posteriori probability density
function was pretty fast. The problem was the Metropolis within Gibbs variant
of the algorithm, which was time consuming. One possibility to avoid these
difficulties is the introduction of latent random variables for augmentation of
Gibbs conditional densities. The number of iterations in the Gibbs sampler was
determined by monitoring of convergence of averages [31]. We showed the high
predictive performance of the MCMC BMA in comparison with Laplace BMA
and ELC combination methods in the case where one wants to make long-range
predictions in software testing processes for moderate-sized software projects.
In this paper we concentrated on four NHPP based SRG Models. An extension of the presented implementation techniques of BMA to more models is
possible. For larger model spaces techniques for optimal model space reduction
like ”Occam’s window” or optimal model space exploration like MC³ [36] or
reversible jump MCMC could be of interest.
References
1. Abdel-Ghaly, A.A., Chan, P., Littlewood, B.: Evaluation of competing software
reliability predictions. IEEE Transactions on Software Engineering 12(9) (1986)
950–967
2. Nikora, A.P., Lyu, M.R.: Software Reliability Measurement Experience. In: Handbook of Software Reliability Engineering. McGraw–Hill (1995) 255–302
3. Brocklehurst, S., Littlewood, B.: Techniques for Prediction Analysis and Recalibration. In: Handbook of Software Reliability Engineering. McGraw–Hill (1995)
119–166
4. Almering, V., van Genuchten, M., Cloudt, G., Sonnemans, P.J.M.: Using software
reliability growth models in practice. Volume 11-12. (2007) 82–88
5. Singpurwalla, N.D., Wilson, S.P.: Statistical Methods in Software Engineering:
Reliability and Risk. Springer (1999)
6. Ravishanker, N., Liu, Z., Ray, B.K.: NHPP models with Markov switching for
software reliability. Computational statistics and data analysis 52 (2008) 3988–
3999
7. Dharmasena, L.S., Zeephongsekul, P.: Fitting software reliability growth curves
using nonparametric regression methods. Statistical Methodology 7 (2010) 109–
120
8. Lyu, M.R., Nikora, A.: Applying reliability models more effectively. IEEE Software
9(4) (1992) 43–52
9. Raftery, A.E., Madigan, D., Volinsky, C.T.: Accounting for model uncertainty in
survival analysis improves predictive performance. Technical report, Department
of Statistics, GN-22, University of Washington (1994)
10. Hoeting, J.A., Madigan, D., Raftery, A.E., Volinsky, C.T.: Bayesian model averaging: A tutorial. Statistical Science 14 (1999) 382–417
11. George, E.I., McCulloch, R.E.: Variable selection via gibbs sampling. Journal of
the American Statistical Association 14(Series B) (1993) 107–114
12. Clyde, M.A.: Bayesian model averaging and model search strategies. Bayesian
Statistics 6 (1999) 157–185
13. Viallefont, V., Raftery, A.E., Richardson, S.: Variable selection and Bayesian model
averaging in case-control studies. Statistics in Medicine 20 (2001) 3215–3230
14. Clyde, M.A., George, E.I.: Flexible empirical Bayes estimation for wavelets. Journal of Royal Statistical Society 62(Series B) (2000) 681–698
15. Raftery, A.E., Zheng, Y.: Long run performance of Bayesian model averaging.
Technical report, Department of Statistics University of Washington (2003)
16. Huang, C.Y., Lyu, M.R., Kuo, S.Y.: A unified scheme of some nonhomogeneous
Poisson process models for software reliability estimation. IEEE Transactions on
Software Engineering 29(3) (2003) 261–269
17. Cai, K.Y., Hu, D.B., Bai, C.G., Hu, H., Jing, T.: Does software reliability growth
behavior follow a non-homogeneous Poisson process. Information and Software
Technology 50 (2008) 1232–1247
18. Yamada, S., Ohba, M., Osaki, S.: S-shaped reliability growth modeling for software
error detection. IEEE Transactions on Reliability R-32 (1983) 475–478
19. Goel, A.L., Okumoto, K.: Time-dependent error-detection rate model for software reliability and other performance measures. IEEE Transactions on Reliability
28(3) (1979) 206–211
20. Goel, A.L.: Software reliability models: Assumptions, limitations, and applicability.
IEEE Transactions on Software Engineering 11(12) (1985) 1411–1423
21. Ohba, M.: Inflection S-shaped software reliability growth models. In: Stochastic
Models in Reliability Theory. Springer (1984) 144–162
22. ANSI/IEEE: Standard Glossary of Software Engineering Terminology. Std-729-1991 edn. (1991)
23. IEEE: IEEE Standard Glossary of Software Engineering Terminology. Institute of
Electrical & Electronics Enginee (2005)
24. Lyu, M.R.: Software Reliability Theory. In: Encyclopedia of Software Engineering.
John Wiley & Sons (2002)
25. Xie, M.: Software Reliability Modeling. World Scientific Publishing (1991)
26. Lyu, M.R.: Handbook of Software Reliability Engineering. McGraw–Hill (1995)
27. Musa, J.D., Iannino, A., Okumoto, K.: Software Reliability Measurement Prediction Application. McGraw–Hill (1987)
28. Lakey, P.B., Neufelder, A.M.: System and Software Reliability Assurance Notebook. Produced for Rome Laboratory by SoftRel (1997)
29. Pham, H.: Software Reliability. Springer (2000)
30. Kass, R.E., Wasserman, L.: Improving the laplace approximation using posterior
simulation. Technical report, Carnegie Mellon University, Dept. of Statistics (1992)
31. Casella, G. In: Monte Carlo Statistical Methods, Springer texts in statistics.
Springer (1999)
32. Kuo, L., Yang, T.Y.: Bayesian computation for nonhomogeneous Poisson processes
in software reliability. Journal of the American Statistical Association 91/434
(1996) 763–773
33. Gokhale, S., Lyu, M., Trivedi, K.: Software reliability analysis incorporating fault
detection and debugging activities. (1998) 202
34. Grandell, J.: Aspects of risk theory. Springer (1991)
35. Burnecki, K., Härdle, W., Weron, R.: An introduction to simulation of risk processes. Technical report, Hugo Steinhaus Center for Stochastic Methods (2003)
36. Madigan, D., York, J.: Bayesian graphical models for discrete data. International
Statistical Review 63 (1995) 215–232
Multi-Space Learning for Image Classification
Using AdaBoost and Markov Random Fields
W. Zeng^a, X. Chen^b, H. Cheng^c, J. Hua^b
^a Department of Electrical Engineering and Computer Science, The University of Kansas, 1520 West 15th Street, Lawrence, Kansas, USA
^b Computer Science Department, Wayne State University, 5057 Woodward Ave., 3010, Detroit, MI, USA
^c University of Electronic Science and Technology of China, Chengdu, China, 611731
[email protected], [email protected]
Abstract. In various applications (e.g., automatic image tagging), image classification is typically treated as a multi-label learning problem,
where each image can belong to multiple classes (labels). In this paper,
we describe a novel strategy for multi-label image classification: instead
of representing each image in one single feature space, we utilize a set
of labeled image blocks (each with a single label) to represent an image
in multi-feature spaces. This strategy is different from multi-instance
multi-label learning, in which we model the relationship between image
blocks and labels explicitly. Furthermore, instead of assigning labels to
image blocks, we apply multi-class AdaBoost to learn a probability of a
block belonging to a certain label. We then develop a Markov random
field-based model to integrate the block information for final multi-label
classification. To evaluate the performance, we compare the proposed
method to six state-of-the-art multi-label algorithms on a real-world data
set collected from the internet. The results show that our method outperforms the other methods on several evaluation measures, including Hamming
loss, ranking loss, macro-averaging F1, and micro-averaging F1.
Keywords: Image classification, Multi-label learning, Markov random
field
1 Introduction
With the rapid development of multimedia applications, the number of images
in personal collections, public data sets, and on the web is growing. It is estimated that
every minute around 3000 images are uploaded to Flickr. In 2010, the number of
images Flickr hosted had exceeded five billion. The increasingly growing number
of images presents significant challenges in organizing and indexing images. In
addition to scene analysis [3, 31, 34, 35, 37], image retrieval [4, 19, 26], content-sensitive image filtering [6], and image representation [18], extensive attention
has been drawn to automatic management of images. Automatic image tagging,
for example, is a process to assign multiple keywords to a digital image. It is
typically formulated as a multi-class or multi-label learning problem. In multi-class learning (MCL) [1, 2, 5, 8], an image is assigned one and only one label
from a set of predefined categories, while in multi-label learning (MLL), an image
is assigned one or more labels from a predefined label set. In this paper, we
focus on MLL in real-world applications.
One of the commonly-used MLL methods is called problem transformation,
which transforms a multi-label learning problem into multiple binary learning
problems using a strategy called binary relevance (BR) [15, 27]. A BR-based
learning model typically constructs a binary classifier for each label using regrouped data sets. While it is simple to implement, a BR-based method neglects
label dependency, which is crucial in image classification. For example, an image labeled with beach may also be labeled with sea, while an image labeled
with mountain is unlikely to be labeled with indoor. More sophisticated algorithms
have been proposed to address label dependency [11, 12, 29, 31]. However, in most
of the existing methods, an image with multiple labels is represented by one
feature vector, even though the labels arise from different sub-regions of the image. To resolve the ambiguity between image regions and labels, multi-instance learning (MIL) methods have been developed within MLL. In
multi-instance learning, including multi-instance multi-label learning (MIMLL),
an image is transformed into a bag of instances. The bag is positive if at least one
instance in the bag is positive, and negative otherwise. MIL [30, 32] attempts
to model the relationship between each sub-region of the image and an associated
label. To extract the sub-regions, image segmentation techniques are applied.
However, image segmentation is an open problem in image processing, which
makes MIL computationally expensive. In addition, the accuracy of the segmentation
affects the performance of MIL.
In this paper, we propose a multi-space learning (MSL) method using AdaBoost
and Markov random fields to transform MLL tasks into MCL problems.
We utilize the normalized real-valued outputs of a one-against-all multi-class
AdaBoost to represent the association between a block (instead of a bag or the
entire image) and a potential label. The normalized real-valued output also
represents the contribution of a block to labeling the entire image. This step
resolves the ambiguity between instances and labels, since the labels in multi-class
classification do not overlap when labeling examples, and thus resolves the
ambiguity of labeling an image. Then, we use Markov random field (MRF)
models to integrate the label sets for an image. Compared to MIMLL, MRF-based
integration is a more flexible way to combine the results from blocks than
the hard logic used in MIL. In image classification, the fact that different
labels describe different regions is the major source of labeling ambiguity, and
our framework exploits this characteristic to transform the MLL task into an
MCL problem. The key contributions of this paper are highlighted
as follows:
(1) We propose an algorithm for multi-space learning which explicitly models
the relationship between each image block and labels.
(2) We derive an MRF-based model to integrate the results from every block in
an image. Instead of predefining parameter values, the values of the MRF parameters
and the integration thresholds are estimated from the training images.
The rest of the paper is organized as follows. In Section 2, we will describe
the proposed method with multi-space learning and MRF-based integration.
In Section 3, we will describe our data set, including all the image descriptors
we have used and the final statistics. Section 4 presents the experimental results
of the performance comparison. Finally, we conclude and discuss future
work in Section 5.
2 Methodology
The system overview is shown in Figure 1. In this framework, we convert training
and testing images into multi-space representations with overlapping blocks of
fixed size. We utilize a set of single-labeled training blocks of the same fixed size
to train a one-against-all multi-class AdaBoost classifier. The classifier is used to
calculate real-valued outputs describing the association between blocks and labels
for both training and testing images. A Markov random field model is used to
integrate these real-valued outputs. Via thresholds estimated from the integration
results on the training images, testing images are predicted label by label. We will
discuss feature extraction in Section 3.

Fig. 1. The Framework of Our Algorithm

In the framework, we use a multi-space representation to extract blocks from every
image. The blocks are fixed-sized and overlapped. From training images, we can
create a set of training blocks. The set of training blocks is denoted as Btrn; it
contains image blocks, each labeled with one and only one label from a finite
label set L with q semantic labels plus a newly introduced label called background
used to filter out non-object blocks. The set of training images Itrn is labeled
with the same label set L, and Itst denotes the testing images. Therefore, for
labeled training blocks, we have q + 1 categories, i.e.,
Btrn = {(b1 , yb1 ), . . . , (bN , ybN )}, ∀bi , ybi ∈ L∗ ,
where L* = {L, background}. For training and testing images, we have the following
representations: Itrn = {(X_1, Y_1), ..., (X_n, Y_n)}, ∀i, Y_i ⊆ L, and Itst = {T_1, ..., T_m}.
In the multi-space representation, we have the following definitions. Let
X_i = {b_(1,1)^(i), ..., b_(j,k)^(i), ..., b_(r_i,c_i)^(i)}, where b_(j,k)^(i) denotes the image block in
the j-th row and k-th column of the i-th training image, and r_i and c_i denote the
numbers of block rows and columns contained in X_i. For testing images, we have the
same representation, T_i = {t_(1,1)^(i), ..., t_(j,k)^(i), ..., t_(r_i,c_i)^(i)}, where t_(j,k)^(i) denotes
the image block in the j-th row and k-th column of the i-th testing image, and r_i and
c_i denote the numbers of block rows and columns contained in T_i. It should be
noted that the set of extracted blocks Btrn is not necessarily contained in the union
of the multi-space representations of the training images, i.e., Btrn ⊄ ∪_i X_i.
Each training image is divided into blocks via the multi-space representation. The
blocks of a training image are fixed once the image is given. However, the training
blocks in Btrn are extracted from training images at random positions where an
object is located. Figure 1 shows a multi-space representation; the rectangle size is
75*100 pixels, with an overlap of 25 pixels along the x axis and 40 pixels along the
y axis. The blocks are extracted in sequence, so the representation efficiently records
the content and spatial information of an individual image. Features are extracted
from every block in the image. It should be noted that, given the size of the image,
the number and locations of the blocks can be calculated.
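As an illustration, the block extraction can be sketched as follows. This is a minimal Python/NumPy sketch under our own naming; the block size and overlaps follow the example above, and which pixel dimension corresponds to the block width is an assumption.

import numpy as np

def extract_blocks(image, block_h=100, block_w=75, overlap_y=40, overlap_x=25):
    """Cut an image (H x W x C array) into fixed-size, overlapping blocks,
    scanned row by row so that the spatial order of blocks is preserved."""
    step_y = block_h - overlap_y
    step_x = block_w - overlap_x
    blocks, positions = [], []
    h, w = image.shape[:2]
    for row, top in enumerate(range(0, h - block_h + 1, step_y)):
        for col, left in enumerate(range(0, w - block_w + 1, step_x)):
            blocks.append(image[top:top + block_h, left:left + block_w])
            positions.append((row, col))
    return blocks, positions

# toy usage: an 800 x 600 image yields a fixed, predictable grid of blocks
img = np.zeros((600, 800, 3), dtype=np.uint8)
blocks, positions = extract_blocks(img)
rows = max(r for r, _ in positions) + 1
cols = max(c for c, _ in positions) + 1
print(len(blocks), rows, cols)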
In our framework, we train a multi-class classifier mapping every block in
Btrn to a label in L*. In the experiments, we use multi-class AdaBoost [38] with a
one-against-all strategy and one-dimensional decision stumps as weak learners,
denoted h_l^i(b), i = 1, ..., M, l ∈ L*, where i indexes the weak learners, l indexes
the labels, and M is the number of AdaBoost iterations. Accordingly, the weight of
the corresponding weak learner is denoted α_l^i, i = 1, ..., M, l ∈ L*. The AdaBoost
we use follows Algorithm 1 in [38]; the only difference is that we record normalized
real-valued outputs for block b instead of hard labels. In the multi-class AdaBoost
step, for an assigned label l, all the image blocks in Btrn labeled with l are
considered positive examples, while the remaining image blocks in Btrn are
considered negative examples.
To describe the normalized real-valued outputs, we first introduce an indicator
operator [[π]] for a boolean statement π: if π is true, [[π]] = 1; otherwise, [[π]] = 0.
Then the normalized real-valued output f_l(b) for an image block b in Itrn or Itst,
given an assigned label l, is

f_l(b) = ( Σ_{i=1}^{M} α_l^i · [[h_l^i(b) = l]] ) / ( Σ_{k∈L*} Σ_{i'=1}^{M} α_k^{i'} · [[h_k^{i'}(b) = k]] ).    (1)
All these normalized outputs, viewed as block-label associations for the image
blocks in Itrn, are kept for parameter estimation of the MRF-based integration, which
will be discussed in the next part.
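A small sketch of Equation (1) follows. The data structures (weak learners as callables returning a label in L*, weights as parallel lists) and the function names are our own illustrative assumptions.

import numpy as np

def normalized_scores(block_features, weak_learners, alphas, labels):
    """Compute f_l(b) of Eq. (1): for each label l, the weighted vote of the
    one-against-all weak learners for l, normalized over all labels in L*."""
    raw = {}
    for l in labels:
        raw[l] = sum(a for a, h in zip(alphas[l], weak_learners[l])
                     if h(block_features) == l)
    total = sum(raw.values())
    return {l: (raw[l] / total if total > 0 else 0.0) for l in labels}

# toy usage with stub weak learners that always vote for their own label
labels = ["building", "car", "background"]
stump = lambda lab: (lambda x: lab)
weak = {l: [stump(l), stump(l)] for l in labels}
alph = {l: [0.7, 0.3] for l in labels}
print(normalized_scores(np.zeros(10), weak, alph, labels))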
Algorithm 1. Estimation of 2-node Potentials
 1: For l = 1 to |L|:
 2:   Initialize n_l = 0;
 3:   Initialize J_{H,l} and J_{V,l} as 10x10 zero matrices.
 4:   For each X_i ∈ Itrn with l ∈ Y_i:
 5:     n_l ← n_l + 1;
 6:     Initialize C_{H,l,Xi} and C_{V,l,Xi} as two 10x10 zero matrices;
 7:     For each horizontally adjacent pair (b_(j,k)^(i), b_(j,k+1)^(i)) ∈ X_i with k + 1 ≤ c_i:
 8:       x = Q(f_l(b_(j,k)^(i)));
 9:       y = Q(f_l(b_(j,k+1)^(i)));
10:       C_{H,l,Xi}(x, y) ← C_{H,l,Xi}(x, y) + 1.
11:     End
12:     J_{H,l,Xi} = C_{H,l,Xi} / (r_i · (c_i − 1));
13:     For each vertically adjacent pair (b_(j,k)^(i), b_(j+1,k)^(i)) ∈ X_i with j + 1 ≤ r_i:
14:       x = Q(f_l(b_(j,k)^(i)));
15:       y = Q(f_l(b_(j+1,k)^(i)));
16:       C_{V,l,Xi}(x, y) ← C_{V,l,Xi}(x, y) + 1.
17:     End
18:     J_{V,l,Xi} = C_{V,l,Xi} / ((r_i − 1) · c_i);
19:     J_{H,l} ← J_{H,l} + J_{H,l,Xi};
20:     J_{V,l} ← J_{V,l} + J_{V,l,Xi};
21:   End
22:   J_{H,l} ← J_{H,l} / n_l;
23:   J_{V,l} ← J_{V,l} / n_l.
24: End
Outputs: J_{H,l} and J_{V,l}.
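A compact Python sketch of Algorithm 1 for a single label l, assuming that the normalized scores f_l of the blocks of each positive training image are available as a 2-D array; the data layout and function names are our own assumptions.

import numpy as np

def quantize(f):
    """Q of Eq. (3): map f in [0, 1] to a level in 1..10."""
    return 10 if f >= 1.0 else int(np.floor(f * 10)) + 1

def estimate_joint_contributions(score_grids):
    """Algorithm 1 for one label l. score_grids: list of 2-D arrays (one per
    positive training image) with the normalized scores f_l of its blocks.
    Returns the averaged 10x10 joint contribution matrices (horizontal, vertical)."""
    JH = np.zeros((10, 10))
    JV = np.zeros((10, 10))
    for grid in score_grids:
        levels = np.vectorize(quantize)(grid)
        r, c = levels.shape
        CH, CV = np.zeros((10, 10)), np.zeros((10, 10))
        for j in range(r):                      # horizontally adjacent pairs
            for k in range(c - 1):
                CH[levels[j, k] - 1, levels[j, k + 1] - 1] += 1
        for j in range(r - 1):                  # vertically adjacent pairs
            for k in range(c):
                CV[levels[j, k] - 1, levels[j + 1, k] - 1] += 1
        JH += CH / (r * (c - 1))
        JV += CV / ((r - 1) * c)
    n = len(score_grids)
    return JH / n, JV / n

# toy usage with two 3x4 grids of random scores
rng = np.random.default_rng(0)
JH, JV = estimate_joint_contributions([rng.random((3, 4)) for _ in range(2)])
print(JH.sum(), JV.sum())   # each sums to 1 by construction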
Our method fully utilizes the normalized outputs of the multi-class AdaBoost
classifier to build Markov random field models for MLL. We derive an MRF model
for information integration as follows. For an assigned label l ∈ L, our goal is to
maximize the likelihood P(X_i | l), which is proportional to a Gibbs distribution
[14], P(X_i | l) ∝ e^{−U(X_i|l)}, where U(X_i | l) is called the energy function. The
energy function takes the following form [14]:

U(X_i | l) = Σ_{b_(j,k)^(i) ∈ X_i} ( V_l^1(b_(j,k)^(i)) + Σ_{b_(j',k')^(i) ∈ N(b_(j,k)^(i))} V_l^2(b_(j,k)^(i), b_(j',k')^(i)) ).    (2)

Note that V_l^1(b_(j,k)^(i)) and V_l^2(b_(j,k)^(i), b_(j',k')^(i)) are the potentials for one
block and for two adjacent blocks (horizontal or vertical) in the MRF, given a label
l. We introduce
the following definition: V_l^1(b_(j,k)^(i)) = −f_l(b_(j,k)^(i)), where f_l(b_(j,k)^(i))
represents the contribution of the block to a certain label. When the contribution
increases, the energy function decreases; this motivates the form of the one-block
potential in Formula (2).
To formulate the two-block potentials V_l^2(b_(j,k)^(i), b_(j',k')^(i)), we first quantize
the normalized real-valued outputs of the multi-class AdaBoost with a function Q,
where b denotes an image block. As 0 ≤ f_l(b) ≤ 1,

Q(f_l(b)) = p, if (p − 1)/10 ≤ f_l(b) < p/10;   Q(f_l(b)) = 10, if f_l(b) = 1.    (3)
After quantization, given a label l, we count the different combinations of the ten
levels for horizontally and vertically adjacent blocks within a training image. This
gives two count matrices, denoted C_{H,l,Xi} and C_{V,l,Xi}. The two matrices are
normalized by the numbers of horizontally and vertically adjacent block pairs;
J_{H,l,Xi} and J_{V,l,Xi} denote the normalized count matrices. Finally, the averages
of J_{H,l,Xi} and J_{V,l,Xi} over all positive images labeled with l are computed as
J_{H,l} and J_{V,l}, called the joint contribution matrices of two adjacent blocks,
horizontally and vertically.
After parameter estimation of the two-block potentials, J_{H,l} and J_{V,l} are output
as a codebook to define V_l^2(b_(j,k)^(i), b_(j',k')^(i)) as follows. Let
x = Q(f_l(b_(j,k)^(i))) and y = Q(f_l(b_(j',k')^(i))). If the two blocks are horizontally
adjacent,

V_l^2(b_(j,k)^(i), b_(j',k')^(i)) = − (λ / |N(b_(j,k)^(i))|) · J_{H,l}(x, y);

otherwise,

V_l^2(b_(j,k)^(i), b_(j',k')^(i)) = − (λ / |N(b_(j,k)^(i))|) · J_{V,l}(x, y),    (4)

where N(b_(j,k)^(i)) denotes the neighbors of block b_(j,k)^(i) and λ is a parameter that
weights the relative contribution of the one-block and two-block potentials.
The integration results are used to predict label sets for testing images via
thresholds. For the convenience of numerical calculation, we normalize the
integration by the block number n_i, since the number of blocks would otherwise
bias the integration results:

Intg_l(X_i) = ln(e^{−U(X_i|l)}) / n_i = −U(X_i|l) / n_i.

This formula is expanded as follows:

Intg_l^1(X_i) = Σ_{b_(j,k)^(i) ∈ X_i} f_l(b_(j,k)^(i)),

Intg_l^2(X_i) = Σ_{b_(j,k)^(i) ∈ X_i} Σ_{b_(j',k')^(i) ∈ N(b_(j,k)^(i))} λ · J_l(b_(j,k)^(i), b_(j',k')^(i)) / |N(b_(j,k)^(i))|,

Intg_l(X_i) = (Intg_l^1(X_i) + Intg_l^2(X_i)) / n_i,

where J_l(b_(j,k)^(i), b_(j',k')^(i)) denotes the joint contribution obtained by Algorithm 1.
Whether the horizontal or the vertical direction is used
depends on the locations of b_(j,k)^(i) and b_(j',k')^(i). With the integration results
obtained from Itrn, the threshold for predicting an assigned label l is estimated by
maximizing the F1 measure on Itrn. An image will be predicted as a positive image for l,
when the integration result is above the threshold; otherwise, the image is predicted
as a negative image for l. Integration results are used to predict label sets for testing
images via thresholding. Since we also want to evaluate the performance with
ranking-based criteria in multi-label classification, we use the integration result
minus the threshold for a given label l; the thresholds are thus treated as zero
baselines for prediction.
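A sketch of the normalized integration and thresholded prediction described above, assuming a 4-neighbourhood over the block grid and the joint contribution matrices of Algorithm 1; lambda and the threshold are assumed to be tuned as described, and the function names are ours.

import numpy as np

def quantize(f):
    return 10 if f >= 1.0 else int(f * 10) + 1

def integration_score(scores, JH, JV, lam=1.0):
    """Normalized integration Intg_l(X_i) = -U(X_i|l)/n_i for one image and one label.
    scores: 2-D array of f_l values over the block grid; JH, JV: 10x10 joint
    contribution matrices estimated by Algorithm 1; lam is the lambda weight."""
    r, c = scores.shape
    levels = np.vectorize(quantize)(scores)
    one_block = scores.sum()                     # Intg^1: sum of block contributions
    two_block = 0.0
    for j in range(r):
        for k in range(c):
            nb = [(j, k - 1), (j, k + 1), (j - 1, k), (j + 1, k)]
            nb = [(a, b) for a, b in nb if 0 <= a < r and 0 <= b < c]
            for a, b in nb:
                J = JH if a == j else JV         # horizontal vs. vertical neighbour
                two_block += lam * J[levels[j, k] - 1, levels[a, b] - 1] / len(nb)
    return (one_block + two_block) / (r * c)

def predict_label(scores, JH, JV, threshold, lam=1.0):
    """Positive prediction for label l when the integration exceeds the threshold
    estimated on the training images (by maximizing F1, as described above)."""
    return integration_score(scores, JH, JV, lam) > threshold

# toy usage with flat joint contribution matrices
rng = np.random.default_rng(1)
print(integration_score(rng.random((3, 4)), np.full((10, 10), 0.01), np.full((10, 10), 0.01)))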
3 Database
We collect 4100 images from the internet and label them with building, car, dog,
human, mountain, and water, according to their contents. The resolution of all
the images is at most 800*600.
For feature extraction, we use 13 different descriptors to represent the images.
They focus on different characteristics of the images, such as color, texture,
edges, contour and frequency information, and have complementary strengths in
describing local details or global appearance. Table 2 describes the feature sets
we have used. In the proposed algorithm, we extract features from every fixed-size
block; the dimensionality of the features of a block is 2684. In the experimental
comparison, six MLL algorithms are used, for which features for entire images are
extracted with the same 13 feature sets; the dimensionality for an entire image
is 2629.
Table 1. Sample Numbers per Label Set

Label  Train/Test    Label    Train/Test
b       250/125      b+h       167/83
c       250/125      c+h       167/83
d       250/125      d+h       167/83
h       250/125      m+w       167/83
m       250/125      b+c+h     133/67
w       250/125      b+m+w     133/67
b+c     167/83       d+h+w     133/67
Among the 4100 images, 2734 images are selected randomly for training and
the remaining 1366 images are used for testing (2/3 for training and 1/3 for
testing). In our algorithm, the training blocks are normalized according to the
sample mean and variance of every dimension. These sample means and variances
are recorded and used to normalize the image blocks in both training and test
images. A similar normalization strategy is used for the data set used in the
multi-label classification comparison experiments.
Label-Cardinality and Label-Density [27] are commonly used to characterize
multi-label data:

Label-Cardinality = (1/n) Σ_{i=1}^{n} |Y_i|,   Label-Density = (1/(n·q)) Σ_{i=1}^{n} |Y_i|,

where n denotes the sample size of the training set and q denotes the size of the
predefined label set. For the MLL problem in our database, the Label-Cardinality
is 1.5973 and the Label-Density is 0.2662.
Table 2. Feature Set Description

Features                           Description                                       Parameter(s) and value(s)
Block-wise color moment [24]       Mean, standard deviation and skewness of HSV      —
RGB histogram [23]                 64-bin normalized histogram of RGB                bin-num = 64
HSV histogram [23]                 64-bin normalized histogram of HSV                bin-num = 64
color correlogram [10]             Co-occurrence of pixels with a given distance     dist = 1 or 3, color-level = 64
                                   and color level
edge distribution histogram [13]   Global, local and semi-global edge distribution   global-bin = 1, local-bin = 25, horizon-bin = 5,
                                   with five filters                                 vertical-bin = 5, center-bin = 1
Gabor wavelet transformation [17]  Gabor wavelet transformation for texture          Uh = 0.4, Ul = 0.05, K = 6, S = 4
LBP [20, 36]                       Local descriptor of binary patterns               Default parameter values
LPQ [21]                           Local descriptor of phase quantization            Default parameter values
moment invariants [7]              Shape descriptor                                  —
Tamura texture feature [25]        Global descriptor of coarseness, contrast,        —
                                   and directionality
Haralick texture feature [9]       Please refer to [9]                               —
SIFT [16]                          Scale-invariant feature transform                 numspatialbins = 4, numorientbins = 12
GFD [28]                           Generic Fourier descriptor                        max-rad = 4, max-ang = 15
In Table 1, we use b, c, d, h, m, and w for short to represent building, car, dog,
human, mountain, and water. From the training images, we generate 3000 sample
blocks per label (including background) for training. These image blocks are used
to train the AdaBoost model.
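For illustration, the two data-set statistics above can be computed with a toy sketch such as the following (function name and example label sets are ours):

def label_stats(label_sets, q):
    """Label-Cardinality and Label-Density of a multi-label data set
    (label_sets: one set of labels per example, q: size of the label set L)."""
    n = len(label_sets)
    cardinality = sum(len(y) for y in label_sets) / n
    return cardinality, cardinality / q

# toy example with the six labels of our data set (q = 6)
print(label_stats([{"building", "human"}, {"water"}, {"mountain", "water"}], 6))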
4 Experiment
In this section, we evaluate the proposed multi-space learning on the database.
We compare our algorithm with six state-of-the-art MLL algorithms: AdaBoost.MH
[22], which combines MCL with label ranking; back-propagation (BP) for MLL
(BP-MLL) [33], which modifies the error term of traditional BP; instance
differentiation (INSDIF) [35], which converts MLL into MIML based on the
differences between an image and different label centroids; binary relevance SVM
with a linear kernel (LBRSVM); multi-label kNN (ML-KNN) [34], which combines
the MAP principle with kNN; and an SVM with a low-dimensional shared space,
named MLLS [11].
In the experiments, we set the maximum number of iterations to 100 for BP-MLL,
AdaBoost.MH, and the multi-class AdaBoost. Other parameters are obtained via
3-fold cross-validation, where the optimization criterion is the F1 measure. We
use three types of evaluation criteria, namely example-based, ranking-based and
label-based criteria, including Hamming loss, one-error, coverage, ranking loss,
average precision, together with micro-averaging and macro-averaging recall,
precision and F1.
Let H be a learned classifier, f denote the real-valued function associated
with H, and T = {t_1, t_2, ..., t_m} be the testing data set. Y_i is the true label set
for t_i. The definitions of the example-based and ranking-based criteria are as
follows:

hloss(H) = (1/(mq)) Σ_{i=1}^{m} |H(t_i) Δ Y_i|,   one-err(f) = (1/m) Σ_{i=1}^{m} [[argmax_{y∈L} f_y(t_i) ∉ Y_i]],    (5)

cov(f) = (1/m) Σ_{i=1}^{m} max_{y∈Y_i} rank_f(t_i, y) − 1,    (6)

rloss(f) = (1/m) Σ_{i=1}^{m} |S_i|,   S_i = {(y_1, y_2) | f_{y_1}(t_i) ≤ f_{y_2}(t_i), (y_1, y_2) ∈ Y_i × Ȳ_i},    (7)

avgprec(f) = (1/m) Σ_{i=1}^{m} (1/|Y_i|) Σ_{y∈Y_i} |S'_i| / rank_f(t_i, y),    (8)

S'_i = {y' ∈ Y_i | rank_f(t_i, y') ≤ rank_f(t_i, y)}.    (9)

TP_l, FN_l, and FP_l denote the numbers of true positives, false negatives, and false
positives for label l. Thus, micro-averaging and macro-averaging recall, precision,
and F1 are defined in the following way:

micro-rec = Σ_{l=1}^{q} TP_l / Σ_{l=1}^{q} (TP_l + FN_l),   micro-prec = Σ_{l=1}^{q} TP_l / Σ_{l=1}^{q} (TP_l + FP_l),    (10)

macro-rec = (1/q) Σ_{l=1}^{q} TP_l / (TP_l + FN_l),   macro-prec = (1/q) Σ_{l=1}^{q} TP_l / (TP_l + FP_l),    (11)

micro-F1 = 2 · (micro-rec × micro-prec) / (micro-rec + micro-prec),   macro-F1 = 2 · (macro-rec × macro-prec) / (macro-rec + macro-prec).    (12)
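A small sketch of the example-based and label-based criteria above (Hamming loss and micro-/macro-averaged F1), using Python sets for label sets; the zero-denominator guards are our own convention, not part of the definitions.

import numpy as np

def hamming_loss(Y_true, Y_pred, q):
    """Eq. (5): average size of the symmetric difference, normalized by q."""
    m = len(Y_true)
    return sum(len(yt ^ yp) for yt, yp in zip(Y_true, Y_pred)) / (m * q)

def micro_macro_f1(Y_true, Y_pred, labels):
    """Eqs. (10)-(12): micro- and macro-averaged F1 from per-label TP/FP/FN counts."""
    tp = {l: 0 for l in labels}
    fp = dict(tp)
    fn = dict(tp)
    for yt, yp in zip(Y_true, Y_pred):
        for l in labels:
            tp[l] += int(l in yt and l in yp)
            fp[l] += int(l not in yt and l in yp)
            fn[l] += int(l in yt and l not in yp)
    def f1(p, r): return 2 * p * r / (p + r) if p + r > 0 else 0.0
    micro_p = sum(tp.values()) / max(1, sum(tp.values()) + sum(fp.values()))
    micro_r = sum(tp.values()) / max(1, sum(tp.values()) + sum(fn.values()))
    macro_p = np.mean([tp[l] / max(1, tp[l] + fp[l]) for l in labels])
    macro_r = np.mean([tp[l] / max(1, tp[l] + fn[l]) for l in labels])
    return f1(micro_p, micro_r), f1(macro_p, macro_r)

# toy usage on two test images and four labels
Y_true = [{"building", "human"}, {"water"}]
Y_pred = [{"building"}, {"water", "mountain"}]
labels = ["building", "human", "water", "mountain"]
print(hamming_loss(Y_true, Y_pred, q=4), micro_macro_f1(Y_true, Y_pred, labels))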
Table 3 and Table 4 show the comparison results. (-) means the smaller the
value is, the better the performance is; while (+) means the opposite. As can
be seen, the proposed MSL algorithm outperforms the other six algorithms in
several important criteria, including hamming loss, one-error, coverage, ranking
loss, average precision, micro-averaging F1 and macro-averaging F1.
For multi-label classification, not only is prediction accuracy important, but the
ranking of the association between examples and labels is also vital, since, besides
thresholding, label ranking is another popular integration strategy for prediction
on multi-label data. The label-based criteria are borrowed from the field of
information retrieval and reflect classifier performance while excluding the
class-imbalance factor of the learning domain. The basic label-based criteria are
recall and precision; the F1 measure reflects the balance between them. F1 depends
on two factors: the absolute values of recall and precision, and the difference
between them; a high F1 means that both precision and recall are high. Multi-label
classification uses F1 as a crucial evaluation criterion via different averaging
strategies, among which micro-averaging and macro-averaging are the two most
common. The former gives equal weight to every example, while the latter gives
equal weight to every label when computing the F1 measure.
Table 3. Performance on Hamming loss, One-error, Coverage, Ranking loss, and Average precision

              Hamming Loss  One-Error  Coverage  Ranking Loss  Average Precision
                  (-)          (-)        (-)        (-)             (+)
AdaBoost.MH      21.18        33.82      1.49       16.46           75.98
BP-MLL           27.45        23.72      1.68       23.26           73.48
INSDIF           17.18        24.31      1.23       11.61           83.48
LBRSVM           18.95        26.13      1.31       12.95           81.78
ML-KNN           16.34        24.45      1.25       12.04           83.77
MLLS             24.91        22.41      1.23       11.38           84.48
MSL              13.68        10.83      1.04        7.16           91.03
In addition to the evaluation shown above in Table 3 and Table 4, we vary the
threshold values for prediction to draw precision-recall curves in Figure 2 and
Figure 3. The micro-averaging and macro-averaging precision-recall curves
essentially show how sensitive a classifier is to changes of the threshold. In
summary, the larger the area under the precision-recall curve (AUPRC) is, the
better the classifier is, where better here means more robust to threshold changes.
Overall, the MSL method yields superior performance compared to the other six
MLL algorithms.

Table 4. Performance on micro-averaging and macro-averaging recall, precision and F1

              macro-rec  macro-prec  macro-F1  micro-rec  micro-prec  micro-F1
                 (+)        (+)         (+)       (+)        (+)         (+)
AdaBoost.MH     57.01      75.18       64.84     60.61      60.16       60.38
BP-MLL          77.63      48.51       59.71     78.52      49.04       60.37
INSDIF          60.65      75.93       67.43     62.21      69.96       65.86
LBRSVM          62.84      73.56       67.78     64.32      64.46       64.39
ML-KNN          61.67      72.95       66.84     62.91      72.19       67.22
MLLS            90.41      53.01       66.82     89.61      51.87       65.71
MSL             77.01      72.26       74.56     77.65      72.81       75.15

Fig. 2. Micro-averaging precision-recall curve    Fig. 3. Macro-averaging precision-recall curve
5 Conclusion
In this paper, we presented a multi-space learning method that uses training
blocks extracted from training images and multi-space image representations, and
employs multi-class AdaBoost to train a multi-class classifier. In this sense, we
transform image classification from a multi-label learning problem into a
multi-class learning problem. In addition, rather than using a predefined logic to
integrate results from regions as in multi-instance learning, we derive a Markov
random field model to integrate the normalized real-valued outputs from AdaBoost.
MRF-based multi-space learning maintains the content and spatial information of
images; hence, MRF-based integration is a more flexible method to integrate
results from different regions. Our algorithm is experimentally evaluated on a
multi-label image database and proves highly effective for image classification.
6 Acknowledgement
This material is based upon work supported by the National Science Foundation
under Grant No. 0066126.
References
1. O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based
image classification. In CVPR’08, pages 1–8, June 2008.
2. A. Bosch, A. Zisserman, and X. Muñoz. Image classification using random forests
and ferns. In ICCV'07, pages 1–8, October 2007.
3. M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene
classification. Pattern Recognition, 37:1757–1771, September 2004.
4. G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos. Supervised learning
of semantic classes for image annotation and retrieval. IEEE Trans. on Pattern
Recognition and Machine Intelligence, 29(3):394–410, March 2007.
5. X. Chen, X. Zeng, and D. van Alphen. Multi-class feature selection for texture
classification. IEEE Trans. Pattern Analysis and Machine Intelligence, 27:1685–
1691, October 2006.
6. T. Deselaers, L. Pimenidis, and H. Ney. Bag-of-visual-words models for adult image
classification and filtering. In ICPR’08, pages 1–4, December 2008.
7. S. A. Dudani, K. J. Breeding, and R. B. McGhee. Aircraft identification by moment
invariants. IEEE trans. on Computers, 26:39–46, January 1977.
8. Y. Fu and T. S. Huang. Image classification using correlation tensor analysis. IEEE
trans. on Image Processing, 17:226–234, February 2008.
9. R. M. Haralick, K. Shanmugam, and I. Dinstein. Textural features for image
classification. IEEE trans. on Systems, Man and Cybernetics, 3:610–621, November
1973.
10. J. Huang, S. R. Kumar, M. Mitra, W. Zhu, and R. Zabih. Image indexing using
color correlograms. In CVPR’97, pages 762–768, June 1997.
11. S. Ji, L. Tang, S. Yu, and J. Ye. Extracting shared subspace for multi-label classification. In KDD’08, pages 381–389, August 2008.
12. F. Kang, R. Jin, and R. Sukthankar. Correlated label propagation with application
to multi-label learning. In CVPR’06, pages 1719–1726, October 2006.
13. D. K. Park, Y. S. Jeon, and C. S. Won. Efficient use of local edge histogram
descriptor. In Proceedings of the 2000 ACM Workshops on Multimedia, pages 51–54,
December 2000.
14. S. Z. Li. Markov Random Field Modeling in Image Analysis. Springer-Verlag New
York, Inc., Secaucus, New Jersey, 2001.
15. X. Lin and X. Chen. Mr. knn: Soft relevance for multi-label classification. In
CIKM’10, pages 349–358, October 2010.
16. D. G. Lowe. Object recognition from local scale-invariant features. In ICCV’99,
pages 1150–1157, September 1999.
17. B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of
image data. IEEE Trans. Pattern Analysis and Machine Intelligence, 18:837–842,
August 1996.
18. E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image
classification. In ECCV’06, pages 490–503, May 2006.
19. J. F. Nunes, P. M. Moreira, J. M. R. S. T. E. Nowak, F. Jurie, and B. Triggs. Shape
based image retrieval and classification. In Iberian Conference on Information
Systems and Technologies, pages 1–6, August 2010.
20. T. Ojala, M. Pietikäinen, and T. Mäenpää. A generalized local binary pattern
operator for multiresolution gray scale and rotation invariant texture classification.
In ICAPR’01, pages 397–406, March 2001.
21. V. Ojansivu, E. Rahtu, and J. Heikkilä. Rotation invariant local phase quantization
for blur insensitive texture analysis. In ICPR’08, pages 1–4, December 2008.
22. R. E. Schapire and Y. Singer. Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, May 2000.
23. L. G. Shapiro and G. Stockman. Computer Vision. Prentice Hall, Inc., Upper
Saddle River, New Jersey, 2001.
24. M. Stricker and M. Orengo. Similarity of color images. In Proceedings of Storage
and Retrieval for Image and Video Databases, pages 381–392, February 1995.
25. H. Tamura, S. Mori, and T. Yamawaki. Textural features corresponding to visual
perception. IEEE Trans. on Systems, Man and Cybernetics, 8:460–473, June 1978.
26. B. Tomasik, P. Thiha, and D. Turnbull. Tagging products using image classification. In SIGIR’09, pages 792–793, July 2009.
27. G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data
Mining and Knowledge Discovery Handbook, 2010.
28. A. Vijay and M. Bhattacharya. Content-based medical image retrieval using the
generic Fourier descriptor with brightness. In ICMV'09, pages 330–332, December 2009.
29. H. Wang, H. Huang, and C. Ding. Image annotation using multi-label correlated
green’s function. In ICCV’09, pages 2029–2034, October 2009.
30. C. Yang, M. Dong, and J. Hua. Region-based image annotation using asymmetrical support vector machine-based multiple-instance learning. In CVPR’06, pages
2057–2063, June 2006.
31. Z. Younes, F. Abdallah, and T. Denceux. Multi-label classification algorithm derived from k-nearest neighbor rule with label dependencies. In European Signal
Processing Conference, August 2008.
32. Z. Zha, X. Hua, T. Mei, J. Wang, G. Qi, and Z. Wang. Joint multi-label multi-instance learning for image classification. In CVPR'08, pages 1–8, June 2008.
33. M. Zhang and Z. Zhou. Multi-label neural networks with applications to functional genomics and text categorization. IEEE Trans. on Knowledge and Data
Engineering, 18:1338–1351, 2006.
34. M. Zhang and Z. Zhou. Ml-knn: A lazy learning approach to multi-label learning.
Pattern Recognition, 40:2038–2048, July 2007.
35. M. Zhang and Z. Zhou. Multi-label learning by instance differentiation. In
AAAI’07, pages 669–674, July 2007.
36. G. Zhao and M. Pietikäinen. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. on Pattern Analysis
and Machine Intelligence, 29(6):915–928, June 2007.
37. Z. Zhou and M. Zhang. Multi-instance multi-label learning with application to
scene classification. In NIPS’06, pages 1609–1616, December 2006.
38. J. Zhu, H. Zou, S. Rosset, and T. Hastie. Multi-class adaboost. Statistics and Its
Interface, 2:349–460, 2009.
An Empirical Comparison of Supervised
Ensemble Learning Approaches
Mohamed Bibimoune1,2 , Haytham Elghazel1 , Alex Aussem1
1
Université de Lyon, CNRS
Université Lyon 1, LIRIS UMR 5205, F-69622, France
[email protected], [email protected],
[email protected]
2
ProbaYes,
82 allée Galilée, F-38330 Montbonnot, France
Abstract. We present an extensive empirical comparison between twenty
prototypical supervised ensemble learning algorithms, including Boosting, Bagging, Random Forests, Rotation Forests, Arc-X4, Class-Switching
and their variants, as well as more recent techniques like Random Patches.
These algorithms were compared against each other in terms of threshold,
ranking/ordering and probability metrics over nineteen UCI benchmark
datasets with binary labels. We also examine the influence of two base
learners, CART and Extremely Randomized Trees, and the effect of calibrating the models via Isotonic Regression on each performance metric.
The selected datasets were already used in various empirical studies and
cover different application domains. The experimental analysis was restricted to the hundred most relevant features according to the SNR filter
method, with a view to dramatically reducing the computational burden
involved in the simulations. The source code and the detailed results of
our study are publicly available.
Key words: Ensemble learning, classifier ensembles, empirical performance comparison.
1 Introduction
The ubiquity of ensemble models in Machine Learning and Pattern Recognition
applications stems primarily from their potential to significantly increase prediction accuracy over individual classifier models [25]. In the last decade, there
has been a great deal of research focused on the problem of boosting their performance, either by placing more or less emphasis on the hard examples, by
constructing new features for each base classifier, or by encouraging individual
accuracy and/or diversity within the ensemble. While the actual performance of
any ensemble model on a particular problem is clearly dependent on the data
and the learner, there is still much room for improvement, and a comparison
between all the proposals provides valuable insight into their respective benefits
and differences.
There are few comprehensive empirical studies comparing ensemble learning
algorithms [1, 9]. The study performed by Caruana and Niculescu-Mizil [9] is
perhaps the best known study however it is restricted to small subset of well
established ensemble methods like random forests, boosted and bagged trees,
and more classical models (e.g., neural networks, SVMs, Naive Bayes). On the
other hand, many authors have compared their ensemble classifier proposal with
others. For instance, Zhang et al. compared in [29] RotBoost against Bagging,
AdaBoost, MultiBoost and Rotation Forest using decision tree-based estimators,
over 36 data sets from the UCI repository. In [23], Rodriguez et al. examined the
Rotation Forest ensemble on a selection of 33 data sets from the UCI repository
and compared it with Bagging, AdaBoost, and Random Forest with decision
trees as the base classifier. More recently, Louppe et al. investigated a very simple, yet effective, ensemble framework called Random Patches that builds each
individual model of the ensemble from a random patch of data obtained by
drawing random subsets of both instances and features from the whole dataset.
With respect to AdaBoost and Random Forest, these experiments on 16 data
sets showed that the proposed method provides on-par performance in terms of
accuracy while simultaneously lowering the memory needs, and attains significantly
better performance when memory is severely constrained. Despite these attempts
to enhance capability and efficiency, we believe an extensive empirical evaluation
of most of the proposed ensemble algorithms can shed some light on their strengths
and weaknesses.
We briefly review these algorithms and describe a large empirical study comparing
several ensemble method variants in conjunction with two types of unpruned
decision trees: the standard CART decision tree and a randomized variant called
Extremely Randomized Tree (ET), proposed by Geurts et al. in [13],
as base classifier, both using the Gini splitting criterion. As noted by Caruana et
al. [9], different performance metrics are appropriate for each domain. For example Precision/Recall measures are used in information retrieval; medicine prefers
ROC area; Lift is appropriate for some marketing tasks, etc. The different performance metrics measure different tradeoffs in the predictions made by a classifier.
One method may perform well on one metric and worse on another, hence the
importance of gauging performance on several metrics to get
a broader picture. We evaluate the performance of Boosting, Bagging, Random
Forests, Rotation Forests, and their variants including LogitBoost, VadaBoost,
RotBoost, and AdaBoost with stumps. For the sake of completeness, we added
more recent techniques like Random Patches and less conventional techniques
like Class-Switching and Arc-X4. All these voting algorithms can be divided into
two types: those that adaptively change the distribution of the training set based
on the performance of previous classifiers (as in boosting methods) and those
that do not (as in Bagging). Our purpose was not to cover all existing methods,
and we have restricted ourselves to well-performing methods that have been
presented in the literature, without claiming exhaustiveness, but trying to cover a
wide range of implementation ideas.
The data sets used in the experiments were all taken from the UCI Machine
Learning Repository. They represent a variety of problems but do not include
high-dimensional data sets owing to the computational expense of running Rotation Forests. The comparison is performed based on three performance metrics:
accuracy, ROC area and squared error. For each algorithm we examine common parameter values. Following [9] and [22], we also examine the effect that
calibrating the models via Isotonic Regression has on their performance.
The paper is organized as follows. In Section 2, we begin with basic notation and follow with a description of the base inducers that build classifiers.
We use two variants of decision tree inducers: unlimited depth, and extremely
randomized tree. We then describe three performance metrics and the Isotonic
calibration method that we use throughout the paper. In Section 3, we describe
our set of experiments with and without calibration and report the results. We
raise several issues and directions for future work in Section 4 and conclude with a summary
of our contributions.
2 Ensemble Learning Algorithms & Parameters
Before discussing the ensemble algorithms chosen in this comprehensive study,
we would like to mention that, contrary to [9] which attempted to explore the
space of parameters for each learning algorithm, we decided to fix the parameters
to their common values, except for a few data-dependent extra parameters that
have to be finely pre-tuned. The number of trees was fixed to 200, in accordance
with a recent empirical study [15] which tends to show that ensembles of size less
than or equal to 100 are too small to approximate the infinite ensemble prediction.
Although it has been shown that for some datasets the ensemble size should ideally
be larger than a few thousand, our choice of ensemble size tries to balance
performance and computation cost. We now summarize the parameters used for
each learning algorithm below.
Bagging (Bag) [4]: Practically, Bag has many advantages. It is fast, simple
and easy to program. It has no parameters to tune. Bag is sometimes proposed
with an optimization of the bootstrap sample size to perform better. However,
we fixed the default size equal to the size of the initial dataset.
Random Forests (RF) [7]: the number of features selected at each node for
building the trees was fixed to the square root of the total number of features.
Random Patches (RadP) [19]: this method was proposed very recently to
tackle the problem of insufficient memory w.r.t. the size of the data set. The idea
is to build each individual model of the ensemble from a random patch of data
obtained by drawing random subsets of both instances and features from the
whole dataset; ps and pf are hyper-parameters that control the number of samples and features in a patch. These parameters are tuned using an independent
validation dataset. It is worth mentioning that RadP was initially designed to
overcome some shortcomings of the existing ensemble techniques in the context
of huge data sets. As such, it was not meant to outperform the other methods
on small data sets or without a memory limitation. We chose, however, this
algorithm as an interesting alternative to Bag and RF.
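For illustration, the Random Patches idea can be sketched with scikit-learn's BaggingClassifier, which can subsample both instances and features; the reduced grid below stands in for the 0.1-1.0 search described later, the data set is synthetic, and this is not the authors' Matlab implementation.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best = None
for ps in (0.25, 0.5, 0.75, 1.0):          # fraction of instances per patch
    for pf in (0.25, 0.5, 0.75, 1.0):      # fraction of features per patch
        model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                                  max_samples=ps, max_features=pf,
                                  bootstrap=False, bootstrap_features=False,
                                  random_state=0, n_jobs=-1).fit(X_tr, y_tr)
        acc = model.score(X_val, y_val)    # validation accuracy drives the choice
        if best is None or acc > best[0]:
            best = (acc, ps, pf)
print(best)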
AdaBoost (Ad) [11]: we used the standard algorithm proposed by Freund
and Schapire.
AdaBoost Stump (AdSt): in this particular version of Ad, the base learner
is replaced by a stump. A stump is a decision tree with only one node. While
the base learner is highly biased, when combined with AdaBoost, it is believed
to compete with the best methods while providing a serious computational advantage.
VadaBoost (Vad) [26]: this is another ensemble method called Variance Penalizing AdaBoost that appeared recently in the literature. VadaBoost is similar
to AdaBoost except that the weighting function tries to minimize both empirical
risk and empirical variance. This modification is motivated by the recent empirical bound which relates the empirical variance to the true risk. Vad depends on
a hyper-parameter, λ, that will be tuned on a validation set.
Arc-X4 (ArcX4) [5]: the method belongs to the family of Arcing (Adaptive
Resampling and Combining) algorithms. It started out as a simple mechanism
for evaluating the effect of Ad.
LogitBoost (Logb) [12]: LogitBoost is a boosting algorithm formulated by
Friedman et al. Their original paper [12] casts the Ad algorithm into a statistical
framework. When regarded as a generalized additive model, the Logb algorithm
is derived by applying the cost functional of logistic regression. Note that there
is no final vote as each base classifier is not an independent classifier but rather
a correction for the whole model.
Rotation Forests (Rot) [23]: this method builds multiple classifiers on
randomized projections of the original dataset. The feature set is randomly split
into K subsets (K is a parameter of the algorithm) and PCA is applied to each
subset in order to create the training data for the base classifier. The idea of
the rotation approach is to encourage simultaneously individual accuracy and
diversity within the ensemble. The size of each subset of features was fixed to
3, as proposed by Rodriguez. The number of classes randomly selected for
the PCA was fixed to 1 as we focused on binary classification. The size of the
bootstrap sample over the selected class was fixed to 75% of its size.
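The following minimal sketch illustrates the rotation idea under the settings above (feature subsets of size 3, PCA per subset, a 75% bootstrap sample). It is only a Python/scikit-learn illustration under our own naming, not the authors' Matlab implementation; it assumes binary class labels in {0, 1} and omits the random class-subsampling step for brevity.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def fit_rotation_forest(X, y, n_trees=200, subset_size=3, sample_frac=0.75, seed=0):
    """For each tree: split the features into subsets, run PCA on a bootstrap
    sample for each subset, assemble the loadings into a block-diagonal rotation
    matrix R, and train an unpruned tree on the rotated data X @ R."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    forest = []
    for _ in range(n_trees):
        perm = rng.permutation(d)
        groups = [perm[i:i + subset_size] for i in range(0, d, subset_size)]
        R = np.zeros((d, d))
        for g in groups:
            idx = rng.choice(n, size=max(len(g), int(sample_frac * n)), replace=True)
            pca = PCA(n_components=len(g)).fit(X[np.ix_(idx, g)])
            R[np.ix_(g, g)] = pca.components_.T
        tree = DecisionTreeClassifier(random_state=0).fit(X @ R, y)
        forest.append((R, tree))
    return forest

def predict_rotation_forest(forest, X):
    votes = np.mean([t.predict_proba(X @ R)[:, 1] for R, t in forest], axis=0)
    return (votes >= 0.5).astype(int)

# toy usage
X = np.random.default_rng(1).random((60, 9))
y = (X[:, 0] > 0.5).astype(int)
forest = fit_rotation_forest(X, y, n_trees=10)
print(predict_rotation_forest(forest, X[:5]))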
RotBoost (Rotb) [29]: this method combines Rot and Ad. As the main
idea of Rot is to improve the global accuracy of the classifiers while keeping
the diversity through the projections, the idea here is to replace the decision
tree by Ad. This can be seen as an attempt to improve Rot by increasing the
base learner accuracy without affecting the diversity of the ensemble. The final
decision is the vote over every decision made by the internal Ad. The parameter
setup for Rotb is the same as for Rot. In order to be fair in term of ensemble size,
we construct an ensemble consisting of 40 Rotation Forests which are learned
by Ad during 5 iterations. Hence the total number of trees is 200. This ratio has
been shown to be approximatively the empirically best in [29].
Class-Switching (Swt) [6]: Swt is a variant of the output flipping ensembles proposed by Martinez-Munoz and Suarez in [21]. The idea is to randomly
switch the class labels at a certain user defined rate p. The decision of the final
classifier is again the plurality vote over these base classifiers. p will be tuned on
a validation set.
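As an illustration of the label-switching step (our own sketch, not the authors' code), a small helper might look as follows; an ensemble would train one tree on each switched copy of the training set and take the plurality vote.

import numpy as np

def switch_labels(y, p, classes, rng):
    """Randomly flip a fraction p of the training labels to a different class,
    as in the class-switching scheme described above."""
    y = np.asarray(y).copy()
    flip = rng.random(len(y)) < p
    for i in np.where(flip)[0]:
        others = [c for c in classes if c != y[i]]
        y[i] = rng.choice(others)
    return y

rng = np.random.default_rng(0)
print(switch_labels([0, 1, 0, 1, 1, 0], p=0.3, classes=[0, 1], rng=rng))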
Considering the four data dependent parameters mentioned above (i.e., ps ,
pf, p and λ), we randomly split each dataset into two parts, 80% for training and
20% for validation. The latter is used to search for the best hyper-parameters and
is not used afterwards for training or comparison purposes (it is discarded from
the whole data set). We then construct the ensemble on the training set, varying
each parameter from 0.1 to 1.0. The parameters yielding the best
accuracy on the validation set are retained. It is worth noting that the other two
performance metrics (i.e., mean square error and AUC) could also be applied for
parametrization. All the above methods were implemented in Matlab - except
the CART algorithm in the Matlab statistics toolbox and the ET algorithm
in the regression tree package [13] -, in order to make fair comparisons and
also because some algorithms are not publicly available (e.g., random patches,
output switching). To make sure our Matlab implementations were correct, we
did a sanity check against previous papers on ensemble algorithms.
2.1 The decision tree inducers
As mentioned above, we use two distinct decision tree inducers: a decision tree
(CART) and a so-called Extremely Randomized Tree (ET) proposed in [13]. In
[19], Louppe and Geurts found out that every sub-sampling (samples and/or
feature) ensemble method they experimented with was improved when ET was
used as base learner instead of a standard decision tree. ET is a variant of
decision tree which aims to reduce even more the variance of ensemble methods
by reducing the variance of the tree as base learner. At each node, instead of
cutting at the best threshold among every possible ones, the method selects an
attribute and a threshold at random. To avoid very bad cuts, the score-measure
of the selected cut must be higher than a user-defined threshold; otherwise it
has to be re-selected. This process is repeated until a convenient threshold is
found or until no attribute remains to pick (the algorithm uses one threshold per
attribute). According to the authors, the variance-reducing strength of this
algorithm arises from the fact that thresholds are selected totally at random,
contrary to preceding methods proposed by Kong and Dietterich in [18], which
select at random a threshold among the best ones, or by Ho in [16], which selects
the best one among a fixed number of thresholds. Therefore, we used both unpruned
DT and ET as base learners. For ET, we used the regression tree package proposed
in [13]. To distinguish ensembles with DT and ET, we added 'ET' at the end of the
algorithm names to indicate that extremely randomized trees are used.
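As an illustration of the DT versus ET choice (assuming scikit-learn, whose ExtraTreeClassifier is a close relative of the ET of [13] but not the exact regression tree package used here), one can compare bagged ensembles of the two base learners on a standard data set:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
for name, base in [("Bag (CART)", DecisionTreeClassifier()),
                   ("BagET (extremely randomized tree)", ExtraTreeClassifier())]:
    ens = BaggingClassifier(base, n_estimators=200, random_state=0, n_jobs=-1)
    print(name, cross_val_score(ens, X, y, cv=5).mean())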
2.2 Performance Metrics & Calibration
The performance metrics can be split into three groups: threshold metrics,
ordering/rank metrics and probability metrics [8]. For threshold metrics, like
accuracy (ACC), it makes no difference how close a prediction is to the threshold,
usually 0.5; what matters is whether it is above or below the threshold. In contrast,
the ordering/rank metrics, like the area under the ROC curve (AUC), depend
only on the ordering of the instances, not the actual predicted values, while the
probability metrics, like the squared error (RMS), interpret the predicted value
of each instance as the conditional probability of the output label being in the
positive class given the input.
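A minimal sketch of the three families of metrics, computed from the predicted probability of the positive class (function and variable names are ours):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(y_true, p_pos):
    """Threshold metric (ACC at 0.5), ranking metric (AUC) and probability
    metric (RMS, i.e. root mean squared error) for binary labels."""
    y_true = np.asarray(y_true)
    p_pos = np.asarray(p_pos)
    acc = accuracy_score(y_true, (p_pos >= 0.5).astype(int))
    auc = roc_auc_score(y_true, p_pos)
    rms = np.sqrt(np.mean((p_pos - y_true) ** 2))
    return acc, auc, rms

print(evaluate([0, 0, 1, 1], [0.2, 0.6, 0.4, 0.9]))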
In many applications it is important to predict well calibrated probabilities;
good accuracy or area under the ROC curve are not sufficient. Therefore, all
the algorithms were run twice, with and without post calibration, in order to
compare the effects of calibrating ensemble methods on the overall performance.
The idea is not new, Niculescu-Mizil and Caruana have investigated in [9] the
benefit of two well known calibration methods, namely Platt Scaling and Isotonic
Regression [28], on the performance of several classifiers. They concluded that
AdaBoost and good ranking algorithms in general are those which draw the most
benefits from calibration. As expected, these benefits are the most noticeable on
the root mean squared error metric. In this paper, we only focus on Isotonic
Regression because it was originally designed for decision tree models, although
Platt Scaling could also be applied to decision trees. For this purpose, we use the
pair-adjacent violators (PAV) algorithm described in [28, 9] that finds a piecewise
constant solution in linear time.
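For illustration, the calibration step can be sketched with scikit-learn's IsotonicRegression, which fits such a monotone piecewise-constant mapping; the split sizes and the toy score model below are our own assumptions, not the exact protocol of this study.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate_isotonic(scores_cal, y_cal, scores_test):
    """Fit a monotone mapping from raw ensemble scores to probabilities on a
    held-out calibration set and apply it to the test scores."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores_cal, y_cal)
    return iso.predict(scores_test)

# toy usage: biased scores get mapped back toward empirical frequencies
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
raw = np.clip(0.5 * y + 0.25 + rng.normal(0, 0.15, 200), 0, 1)
print(calibrate_isotonic(raw[:150], y[:150], raw[150:])[:5])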
2.3 Data sets
We compare the algorithms on nineteen binary classification problems of various sizes and dimensions. Table 1 summarizes the main characteristics of these
data sets utilized in our empirical study. This selection includes data sets with
different characteristics and from a variety of fields. Among them, we find some
data sets with thousands of features. As explained by Liu in [17], if Rot or Rotb
are applied to classify such datasets, a rotation matrix with thousands of dimensions is required for each tree, which entails a dramatic increase in computational
complexity. To keep the running time reasonable, we had no choice but to resort
to a dimension reduction technique for these problems; the same strategy was
adopted in several works [29, 23, 17]. Based on Liu’s comparison, we took the
best of the three proposed filter methods for rotation forest, the signal-to-noise
ratio (SNR) [27]. SNR was used to rank all the features; we kept the 100 most
relevant features and discarded the others. Of course this choice necessarily entails some
compromises as there will generally be some loss of information. So the reader
shall bear in mind that the actual size of the data sets is limited to 100 features
in the experiments.
3 Performance analysis
In this section, we report the results of the experimental evaluation. For each
test problem, we use 5-fold cross validation (CV) on 80% of the data (recall
that 20% of each data set is used to calibrate the models and to select the best
parameters). In order to get reliable statistics over the metrics, the experiments
were repeated 10 times, so the results are averaged over 50 iterations, which
allows us to apply statistical tests in order to discern significant differences
between the 20 methods.

Table 1. Characteristics of the nineteen problems used in this study

Data sets      #inst  #feat  #labels  Reference
Basehock        1993   4862     2       [30]
Breast-cancer    699      9     2       [3]
Cleve            303     13     2       [3]
Colon             62   2000     2       [2]
Ionosphere       351     34     2       [3]
Leukemia          73   7129     2       [14]
Madelon         2600    500     2       [3]
Musk             476    166     2       [3]
Ovarian           54   1536     2       [24]
Parkinson        195     22     2       [3]
PcMac           1943   3289     2       [30]
Pima             768      8     2       [3]
Promoters        106     57     2       [3]
Relathe         1427   4322     2       [30]
Smk-Can          187  19993     2       [30]
Spam            4601     57     2       [3]
Spect            267     22     2       [3]
Wdbc             569     30     2       [3]
Wpbc             194     33     2       [3]
Detailed average performances of the 20 methods for all 19 data sets using the
protocol described above are reported in Tables 1-6 of the supplementary material1 . For each evaluation metric, we present and discuss the critical diagrams
from the tests for statistical significance using all data sets.
Table 2 shows the normalized score for each algorithm on each of the three
metrics. Each entry in the table averages these scores across the fifty trials and
nineteen test problems. The table is divided into two blocks to separately illustrate the performances for both calibrated and uncalibrated models. The last
column per block, Mean, is the mean (only for illustration purposes, not for
statistical analysis) over the three metrics (ACC, AU C, 1 − RM S) and nineteen problems, and fifty trials. In the table, higher scores always indicate better
performance.
Considering all three metrics together, it appears that the strongest models
among the uncalibrated ones are Rotation Forest (Rot), Rotation Forest using
extremely randomized trees (RotET), Rotboost (Rotb) and its ET-based variant
(RotbET), and ArcX4ET.
1 http://perso.univ-lyon1.fr/haytham.elghazel/copem2013-supplementary.pdf
Table 2. Average normalized scores by metric for each learning algorithm obtained
over nineteen test problems. We give complete results over all evaluation metrics
in the supplementary material.

             Uncalibrated Models                 Calibrated Models
Approach     ACC     AUC     1-RMS   Mean        ACC     AUC     1-RMS   Mean
Rot          0,865   0,903   0,700   0,823       0,837   0,864   0,673   0,791
Bag          0,823*  0,875*  0,660*  0,786       0,820*  0,844   0,649*  0,771
Ad           0,857   0,893   0,668*  0,806       0,836   0,863   0,669   0,789
RF           0,864   0,896   0,689   0,816       0,835   0,857   0,669   0,787
Rotb         0,865   0,897   0,702   0,821       0,841   0,861   0,676   0,793
ArcX4        0,852*  0,892*  0,686   0,810       0,829   0,853   0,659*  0,780
AdSt         0,833*  0,874*  0,598*  0,769       0,817*  0,845   0,653   0,771
CART         0,811*  0,809*  0,617*  0,746       0,808*  0,806*  0,622*  0,745
Logb         0,845   0,884   0,635*  0,788       0,823   0,854   0,660   0,779
Swt          0,859   0,888   0,638*  0,795       0,829*  0,848*  0,660*  0,779
RadP         0,850   0,889   0,669*  0,803       0,836   0,851   0,662   0,783
Vad          0,858   0,894   0,684   0,812       0,839   0,864   0,671   0,791
RotET        0,871   0,901   0,698   0,823       0,843   0,858   0,675   0,792
BagET        0,836*  0,893   0,673*  0,800       0,833   0,852   0,663   0,783
AdET         0,862   0,898   0,667*  0,809       0,838   0,861   0,674   0,791
RotbET       0,866   0,900   0,704   0,824       0,844   0,859   0,678   0,794
ArcX4ET      0,868   0,901   0,693   0,821       0,842   0,859   0,673   0,791
SwtET        0,866   0,890   0,649*  0,802       0,841   0,850   0,673   0,788
RadPET       0,861   0,908   0,680   0,816       0,844   0,867   0,678   0,797
VadET        0,864   0,899   0,681   0,815       0,841   0,864   0,678   0,794
Among calibrated models, the best models overall are
Rotation Forest (Rot) and its ET-based variant (RotET), Rotboost (Rotb) and
its ET-based variant (RotbET), boosted extremely randomized trees (AdET),
ArcX4ET, Vadaboost (Vad) and its ET-based variant (VadET), and Random
Patch using extremely randomized tree (RadPET). With or without calibration, the poorest performing models are decision trees (CART), bagged trees
(Bag), and AdaBoost Stump (AdSt). Looking at individual metrics, calibration
generally slightly degrades the results on accuracy and AUC and is remarkably
effective at obtaining excellent performance on the RMS score (probability metric),
especially for boosting-based algorithms. Indeed, calibration improves the
performance (in terms of RMS) of boosted stumps (AdSt), LogitBoost (Logb),
Class-Switching with or without extremely randomized trees (Swt and SwtET),
and provides a small, but noticeable improvement for boosted trees with or without extremely randomized trees (Ad and AdET), and a single tree (CART). If we
consider only large data sets in Tables 1-6 of the supplementary materials (i.e.
Ovarian, Smk-Can, Leukemia), reported results show that RMS values decrease
with calibration when boosting-based approaches are used, while their AUC and
ACC are not affected.
Turning now to the performance of the ET-based variants of the algorithms,
across all three metrics, with or without calibration, it is observed that each
ensemble method with ET always outperforms ensembles of standard DT. This
observation confirms the results obtained in [19] and clearly suggests that using
random split thresholds, instead of optimized ones like in DT, pays off in terms
of generalization error, especially for small data sets.
In order to better assess the results obtained for each algorithm on each
metric, we adopt in this study the methodology proposed by [10] for the comparison of several algorithms over multiple datasets. In this methodology, the
non-parametric Friedman test is first used to evaluate the rejection of the hypothesis that all the classifiers perform equally well at a given risk level. It ranks
the algorithms for each data set separately, the best performing algorithm getting the rank of 1, the second best rank 2 etc. In case of ties it assigns average
ranks. Then, the Friedman test compares the average ranks of the algorithms
and calculates the Friedman statistic. If a statistically significant difference in
the performance is detected, we proceed with a post hoc test. The Nemenyi test
is used to compare all the classifiers to each other. In this procedure, the performance of two classifiers is significantly different if their average ranks differ
more than some critical distance (CD). The critical distance depends on the
number of algorithms, the number of data sets and the critical value (for a given
significance level p) that is based on the Studentized range statistic (see [10] for
further details). In this study, the Friedman test reveals statistically significant
differences (p < 0.05) for each metric with and without calibration. As seen in
Table 2, the algorithm performing best on each metric achieves the highest value
in the corresponding column. Algorithms performing significantly worse than the
best algorithm at p = 0.1 (CD = 6.3706) according to the Nemenyi post-hoc test
are marked with '*' next to them. Furthermore,
we present the result from the Nemenyi posthoc test with average rank diagrams
as suggested by Demsar [10]. These are given on Figure 1. The ranks are depicted on the axis, in such a manner that the best ranking algorithms are at the
rightmost side of the diagram. The algorithms that do not differ significantly (at
p = 0.1) are connected with a line. The critical difference CD is shown above
the graph.
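A sketch of this comparison methodology, assuming SciPy; the critical value q_alpha must be taken from a table such as the one in Demsar's paper, and the constant 2.569 used below is only the value we assume for four algorithms at alpha = 0.05.

import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def average_ranks(scores):
    """scores: (n_datasets x n_algorithms) array of a performance metric.
    Returns the average rank of each algorithm (rank 1 = best, ties averaged)."""
    ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)
    return ranks.mean(axis=0)

def nemenyi_cd(q_alpha, k, n):
    """Critical distance of the Nemenyi test for k algorithms over n data sets;
    q_alpha is the critical value of the Studentized range statistic / sqrt(2)."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))

rng = np.random.default_rng(0)
scores = rng.random((19, 4))                # 19 data sets, 4 toy algorithms
stat, p = friedmanchisquare(*scores.T)      # Friedman test over the columns
print(p, average_ranks(scores), nemenyi_cd(2.569, k=4, n=19))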
As may be observed in Figure 1, the ET-based variant of Rotboost (RotbET)
performs best in terms of accuracy. In the average ranks diagrams corresponding
to accuracy, two groups of algorithms can be separated. The first consists of
all algorithms that have seemingly similar performance to the best method
(i.e. RotbET). The second contains the methods that perform significantly worse
than RotbET, including Bagging (Bag) and its ET-based variant (BagET), ArcX4,
boosted stumps (AdSt) and the single tree (CART).
The statistical tests we use are conservative, and the differences in performance for methods within the first group are not significant. To further support these rank comparisons, we compared the 50 accuracy values obtained over each dataset split for each pair of methods in the first group using the paired t-test (with p = 0.05), as done in [19]. The results of these pairwise comparisons are depicted (see the supplementary material) in terms of "Win-Tie-Loss" statuses of all pairs of methods.
Fig. 1. Average ranks diagram comparing the 20 algorithms in terms of three metrics (Accuracy, AUC and RMS). Six panels show the average ranks diagrams of uncalibrated and calibrated models in terms of Accuracy, AUC and RMS (CD = 6.3706 in each panel).
The three values in each cell (i, j) indicate, respectively, how many times approach i is significantly better than, not significantly different from, or significantly worse than approach j. Following [10], if two algorithms are equivalent, as assumed under the null hypothesis, each should win on approximately N/2 out of N data sets. The number of wins is distributed according to the binomial distribution, and the critical number of wins at p = 0.1 is equal to 13 in our case. Since tied matches support the null hypothesis, we do not discount them but split them evenly between the two classifiers when counting the number of wins; if there is an odd number of them, we ignore one.
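As a small illustration of this counting scheme (our sketch, not the authors' code):

    # Sign-test criterion described above; illustrative only.
    from scipy.stats import binom

    def critical_wins(n_datasets, alpha=0.1):
        # Smallest w such that P(X >= w) <= alpha when X ~ Binomial(n_datasets, 0.5).
        for w in range(n_datasets + 1):
            if binom.sf(w - 1, n_datasets, 0.5) <= alpha:  # sf(w - 1) = P(X >= w)
                return w
        return n_datasets + 1

    def effective_wins(wins, ties):
        # Ties support the null hypothesis: split them evenly, dropping an odd leftover tie.
        return wins + ties // 2

    # With 19 data sets, critical_wins(19, alpha=0.1) returns 13, matching the critical
    # number of wins quoted above.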
In Table 7 (see the supplementary material), each pairwise comparison entry (i, j) for which approach i is significantly better than j is boldfaced. The analysis of this table reveals that the approaches never beaten by any other approach are all the Rotation Forest-based methods (Rot, Rotb, RotET and RotbET), AdET and ArcX4ET. Figure 1 and Table 8 (see the supplementary material) lead to the following observations for accuracy on calibrated models. First, calibration is beneficial to the Random Patch algorithms (RadP and RadPET) and to bagged trees (BagET) in terms of ranking. It hurts the ranking of boosted trees but does not affect the performance of the Rotation Forest-based methods and ArcX4ET. Overall, RotbET is ranked first, followed by Rotb, ArcX4ET and RadPET. According to Table 8 (see the supplementary material), the dominating approaches again include all Rotation Forest-based methods and ArcX4ET, as well as RadPET and VadET (cf. Table 3). Another interesting observation from the average rank diagrams is that ensembles of ET lie mostly on the right side of the plot compared to their DT counterparts, reflecting their superior performance.
As far as the AUC is concerned (cf. Figure 1), RadPET ranks first. However, its performance is not statistically distinguishable from that of five other algorithms: RotET, RotbET, Ad, AdET and VadET (cf. Table 9 in the supplementary material). In our experiments, ET improved the ranking of all ensemble approaches by at least 10% on average when compared to DT. This corroborates our previous finding, namely that ET should be preferred to DT in the ensembles. Figure 1 and Table 10 (see the supplementary material) indicate that calibration lowers the ranking of some approaches, especially VadET and RotET (among the best uncalibrated approaches in terms of AUC), but slightly improves the ranks of Rot and of the approaches that adaptively change the distribution (Logb, AdSt, Ad, Vad, Rotb, ArcX4). This explains why, after calibration, the differences with otherwise equally performing methods such as RadPET are judged insignificant (cf. Table 3).
Regarding the RMS results reported in Figure 1 and Table 11 (see the supplementary material), Rot, Rotb and RotbET significantly outperform the other approaches. Here again, the ET-based methods outperform the DT-based ones by a noticeable margin. We found calibration to be remarkably effective at improving the ranking of boosting-based algorithms in terms of RMS values, especially Ad, AdET, AdSt, Logb and VadET. This is the reason why algorithms
Table 3. List of dominating approaches per metric, with and without calibration

Metric  Without calibration               With calibration
ACC     AdET, ArcX4ET, Rot, Rotb,         ArcX4ET, Rot, Rotb, RotbET,
        RotET, RotbET                     RotET, RadPET, VadET
AUC     Ad, AdET, RotET, RotbET,          Ad, AdET, ArcX4ET, Logb, Rot, Rotb,
        RadPET, VadET                     RotbET, RadPET, Vad, VadET
RMS     Rot, Rotb, RotbET                 Ad, AdET, Logb, Rot, Rotb, RotET,
                                          RotbET, RadPET, Vad, VadET
that adaptively change the distribution have joined the list of dominating approaches (cf. Table 3).
3.1 Diversity-error diagrams
To achieve higher prediction accuracy than individual classifiers, it is crucial that the ensemble consists of highly accurate classifiers which, at the same time, disagree as much as possible. To illustrate the diversity-accuracy patterns of the ensembles, we use the kappa-error diagrams proposed in [20]. These are scatterplots with L × (L − 1)/2 points, where L is the committee size. Each point corresponds to a pair of classifiers: on the x-axis is a measure of diversity between the pair, κ, and on the y-axis is the averaged individual error of the classifiers in the pair, ei,j = (ei + ej)/2. Since small values of κ indicate better diversity and small values of ei,j indicate better performance, the diagram of an ideal ensemble would have its points concentrated in the bottom-left corner. Because we have a large number of algorithms to compare, and due to space limitations, we only plot, in Figure 2, the centroid of the point cloud of each of the 18 ensemble methods (Logb and CART are excluded), for the "Musk" and "Relathe" data sets only. The following is observed: (1) Rot-based algorithms outperform the others in terms of accuracy; (2) ArcX4, Bag and RF exhibit equivalent patterns: they are slightly more diverse but slightly less accurate than the Rot-based algorithms; (3) while boosting-based methods (AdSt, Ad, AdET) and switching are more diverse, their accuracies are lower than those of the others, except for SwtET, as ET is generally able to increase the individual accuracy; and (4) no clear picture emerges when one examines the Random Patch-based algorithms. Not surprisingly, as the classifiers become more diverse, they become less accurate, and vice versa. Furthermore, according to the results in the previous subsection, it seems that the more accurate the base classifiers are, the better the performance. This corroborates the conclusion drawn in [23], namely that individual accuracy is probably the more crucial component of the diversity-accuracy tandem, rather than the diversification strategy.
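As a concrete illustration of how such diagrams can be produced (our sketch, not the authors' code), the pairwise κ can be computed as Cohen's kappa between two members' predictions, the y-coordinate as their averaged error, and the centroid as the mean of the resulting points; the input names below are assumptions.

    # Sketch of the kappa-error diagram of [20]: one point per pair of ensemble members.
    # `member_preds` (L x n) and `y_true` (n,) are assumed, illustrative inputs.
    import numpy as np
    from itertools import combinations
    from sklearn.metrics import cohen_kappa_score

    def kappa_error_points(member_preds, y_true):
        preds = np.asarray(member_preds)
        y_true = np.asarray(y_true)
        errors = (preds != y_true).mean(axis=1)            # individual error rate of each member
        points = []
        for i, j in combinations(range(len(preds)), 2):
            kappa = cohen_kappa_score(preds[i], preds[j])  # low kappa = high diversity
            points.append((kappa, (errors[i] + errors[j]) / 2.0))
        return np.array(points)                            # L*(L-1)/2 rows of (kappa, e_ij)

    def kappa_error_centroid(member_preds, y_true):
        # One summary point per ensemble, as plotted in Figure 2 (assuming centroid = mean).
        return kappa_error_points(member_preds, y_true).mean(axis=0)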
Fig. 2. Centroids of κ-error diagrams of the different ensemble approaches for two data sets (Musk and Relathe). x-axis = κ, y-axis = ei,j (average error of the pair of classifiers). (01) Rot; (02) Bag; (03) Ad; (04) RF; (05) Rotb; (06) ArcX4; (07) AdSt; (08) Swt; (09) RadP; (10) Vad; (11) RotET; (12) BagET; (13) AdET; (14) RotbET; (15) ArcX4ET; (16) SwtET; (17) RadPET; (18) VadET.
The kappa-error relative movement diagrams in Figure 3 display, for each approach, the difference in κ and in error between the DT-based method and its ET-based variant. There are as many points as data sets. Points in the upper-right corner represent data sets for which the ET-based method outperformed the standard DT-based algorithm in terms of both diversity and accuracy; points in the upper-left corner indicate that the ET-based method improved the accuracy but degraded the diversity. We may notice that ET as a base learner tends to improve one criterion at the expense of the other. Furthermore, according to the win/tie/loss counts of each ET-based approach against its DT-based counterpart, summarized in Table 4, we find that the approaches for which the ET variant is significantly superior to the standard one are those for which the accuracy (i.e. Swt) or the diversity (i.e. Bag, ArcX4 and RadP) is significantly better.
Before we conclude, we would like to mention that some of the above findings
need to be regarded with caution. We list a few caveats and our comments on
these.
– The experimental analysis was restricted to the 100 most relevant features in order to dramatically reduce the computational burden of running the Rotation Forest-based methods. Thus, the results reported here are valid for data sets of small to moderate size; very large-scale data sets were not included in the experiments. Moreover, the complexity
Fig. 3. Centroids of κ-error relative movement diagrams of the different ensemble approaches. x-axis = ∆κ, y-axis = ∆ error. Legend (DT-based vs. ET-based pairs, numbered as in Fig. 2): 1: 1 vs. 11; 2: 2 vs. 12; 3: 3 vs. 13; 4: 5 vs. 14; 5: 6 vs. 15; 6: 8 vs. 16; 7: 9 vs. 17; 8: 10 vs. 18.
Table 4. The win/tie/loss results for ET-based ensembles vs. DT-based ensembles. Bold cells indicate significant differences at p = 0.1.

                 Uncalibrated Models          Calibrated Models
Approaches       ACC     AUC     RMS          ACC     AUC     RMS         In Total
RotET/Rot        8/8/3   11/2/6  7/6/6        6/11/2  7/8/4   8/7/4       47/42/25
BagET/Bag        11/6/2  13/4/2  13/3/3       13/5/1  12/5/2  12/6/1      74/29/11
AdET/Ad          7/10/2  7/10/2  11/4/4       6/11/2  4/8/7   6/12/1      41/55/18
RotbET/Rotb      3/12/4  6/10/3  5/11/3       3/13/3  3/11/5  4/10/5      24/67/23
ArcX4ET/ArcX4    14/5/0  13/2/4  13/1/5       10/9/0  9/7/3   14/4/1      73/28/13
SwtET/Swt        10/8/1  9/5/5   13/2/4       14/3/2  10/6/3  13/4/2      69/28/17
RadPET/RadP      9/10/0  10/7/2  14/1/4       10/7/2  12/4/3  13/4/2      68/33/13
VadET/Vad        10/7/2  9/9/1   9/5/5        6/9/4   3/11/5  7/9/3       44/50/20
issue should be addressed in order to balance the computational cost against the obtained performance in a real scenario.
– We used the same ensemble size L = 200 for all methods. It is known that bagging fares better for large L, whereas AdaBoost would benefit from tuning L. It is not clear what the outcome would be if L were treated as a hyperparameter and tuned for all the ensemble methods compared here. We acknowledge that a thorough experimental comparison of a set of methods requires tuning each method to its best for every data set and every performance metric. Interestingly, while VadaBoost, Class-Switching and Random Patches were slightly favored because we tuned some of their parameters on an independent validation set, these methods were not found to compare favorably with Rotation Forest and its variants.
– The comparison was performed on binary classification problems only. Multi-class and multi-label classification problems were not investigated. These can, however, be reduced to binary problems by a variety of strategies.
4 Discussion & Conclusion
We described an extensive empirical comparison between twenty prototypical
supervised ensemble learning algorithms over nineteen UCI benchmark datasets
with binary labels and examined the influence of two variants of decision tree
inducers (unlimited depth, and extremely randomized tree) with and without
calibration. The experiments presented here support the conclusion that the
Rotation Forest family of algorithms (Rotb, RotbET, Rot and RotET) outperforms all other ensemble methods with or without calibration by a noticeable
margin, which is much in line with the results obtained in [29]. It appears that the
success of this approach is closely tied to its ability to simultaneously encourage
diversity and individual accuracy via rotating the feature space and keeping all
principal components. Not surprisingly, the worst-performing models are single decision trees, bagged trees, and boosted stumps. Another conclusion we can draw from these observations is that building ensembles of extremely randomized trees is very competitive in terms of accuracy even for small-sized data sets. This
confirms the effectiveness of using random split thresholds, instead of optimized
ones like in decision trees. We found calibration to be remarkably effective at
lowering the RMS values of boosting-based methods.
References
1. Eric Bauer and Ron Kohavi. An empirical comparison of voting classification
algorithms: Bagging, boosting, and variants. In Machine Learning, pages 105–139,
1998.
2. Amir Ben-Dor, Laurakay Bruhn, Nir Friedman, Michèl Schummer, Iftach Nachman, and Zohar Yakhini. Tissue classification with gene expression profiles. Journal of Computational Biology, 7:559–584, 2000.
3. C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.
4. Leo Breiman. Bagging predictors. In Machine Learning, pages 123–140, 1996.
5. Leo Breiman. Bias, variance, and arcing classifiers. Technical report, 1996.
6. Leo Breiman. Randomizing outputs to increase prediction accuracy. Machine
Learning, 40(3):229–242, 2000.
7. Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
8. Rich Caruana and Alexandru Niculescu-Mizil. Data mining in metric space: an
empirical analysis of supervised learning performance criteria. In KDD, pages
69–78, 2004.
9. Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In ICML, pages 161–168, 2006.
10. Janez Demsar. Statistical comparisons of classifiers over multiple data sets. Journal
of Machine Learning Research, 7:1–30, 2006.
11. Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
12. Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000.
13. Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees.
Machine Learning, 63(1):3–42, 2006.
14. T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, and H. Coller. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
15. Daniel Hernández-Lobato, Gonzalo Martı́nez-Muñoz, and Alberto Suárez. How
large should ensembles of classifiers be? Pattern Recognition, 46(5):1323–1336,
2013.
16. Tin Kam Ho. The random subspace method for constructing decision forests. IEEE
Trans. Pattern Anal. Mach. Intell., 20(8):832–844, 1998.
17. Kun-Hong Liu and De-Shuang Huang. Cancer classification using rotation forest.
Comp. in Bio. and Med., 38(5):601–610, 2008.
18. Eun Bae Kong and Thomas G. Dietterich. Error-correcting output coding corrects
bias and variance. In ICML, pages 313–321, 1995.
19. Gilles Louppe and Pierre Geurts. Ensembles on random patches. In ECML/PKDD
(1), pages 346–361, 2012.
20. Dragos D. Margineantu and Thomas G. Dietterich. Pruning adaptive boosting. In
International Conference on Machine Learning (ICML), pages 211–218, 1997.
21. Gonzalo Martı́nez-Muñoz and Alberto Suárez. Switching class labels to generate
classification ensembles. Pattern Recognition, 38(10):1483–1494, 2005.
22. Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with
supervised learning. In ICML, pages 625–632, 2005.
23. Juan José Rodrı́guez, Ludmila I. Kuncheva, and Carlos J. Alonso. Rotation forest: A new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell.,
28(10):1619–1630, 2006.
24. M. Schummer, W. V. Ng, and R. E. Bumgarnerd. Comparative hybridization of
an array of 21,500 ovarian cdnas for the discovery of genes overexpressed in ovarian
carcinomas. Gene, 238(2):375–385, 1999.
25. Friedhelm Schwenker. Ensemble methods: Foundations and algorithms [book review]. IEEE Comp. Int. Mag., 8(1):77–79, 2013.
26. Pannagadatta K. Shivaswamy and Tony Jebara. Variance penalizing adaboost. In
NIPS, pages 1908–1916, 2011.
27. Donna K. Slonim, Pablo Tamayo, Jill P. Mesirov, Todd R. Golub, and Eric S. Lander. Class prediction and discovery using gene expression data. In RECOMB, pages 263–272, 2000.
28. Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates
from decision trees and naive bayesian classifiers. In ICML, pages 609–616, 2001.
29. Chun-Xia Zhang and Jiang-She Zhang. Rotboost: A technique for combining rotation forest and adaboost. Pattern Recognition Letters, 29(10):1524–1536, 2008.
30. Zheng Zhao, Fred Morstatter, Shashvata Sharma, Salem Alelyani, and Aneeth
Anand. Feature selection, 2011.
Clustering Ensemble on Reduced Search Spaces
Sandro Vega-Pons and Paolo Avesani
NeuroInformatics Laboratory (NILab), Fondazione Bruno Kessler, Trento, Italy
Centro Interdipartimentale Mente e Cervello (CIMeC), Università di Trento, Italy
{vega,avesani}@fbk.eu
Abstract. Clustering ensemble has become a very popular technique in the past few years due to its potential for improving clustering results. Roughly speaking, it consists of combining different partitions of the same set of objects in order to obtain a consensus one. A common way of defining the consensus partition is as the solution of the median partition problem; this way, the consensus partition is defined as the solution of a complex optimization problem. In this paper, we study possible prunes of the search space for this optimization problem. In particular, we introduce a new prune that allows a dramatic reduction of the search space. We also give a characterization of the dissimilarity measures that can be used to take advantage of this prune, and we prove that the lattice metric fits in this family. We carry out an experimental study comparing, under different conditions, the size of the original search space and the size after the proposed prune. Outstanding reductions are obtained, which can be very beneficial for the development of clustering ensemble algorithms.
Keywords: Clustering ensemble, partition lattice, median partition, search
space reduction, dissimilarity measure.
1 Introduction
Clustering ensemble has become a popular technique to deal with data clustering problems. When different clustering algorithms are applied to the same dataset, different clustering results can be obtained. Instead of trying to find the best one, the idea of combining these individual results in order to obtain a consensus has gained increasing interest in recent years. In practice, such a procedure can produce high-quality final clusterings.
In the past ten years, motivated by the success of the combination of supervised classifiers, several clustering ensemble algorithms have been proposed in the literature [1]. Different mathematical and computational tools have been used for the development of clustering ensemble algorithms. For example, there are methods based on the Co-Association Matrix [2], Voting procedures [3], Genetic Algorithms [4], Graph Theory [5], Kernel Methods [6], Information Theory [7] and Fuzzy Techniques [8], among others.
However, the consensus clustering, which is the final result of all clustering ensemble algorithms, is not always defined in the same way. For many methods, the consensus partition lacks a formal definition; it is only implicitly defined through the objective function of the particular algorithm. This makes the theoretical study of the consensus partition properties difficult. This is the case, for example, of Relabeling and Voting [3], Graph-based methods [5] and Co-association Matrix based methods [2]. On the other hand, there are some methods that use an explicit definition of the consensus partition concept. In this approach, the consensus partition is defined as the solution of an optimization problem: the problem of finding the median partition with respect to the clustering ensemble. Before defining this problem, we introduce the notation that will be used throughout this paper.
Let X = {x1, x2, ..., xn} be a set of n objects and P = {P1, P2, ..., Pm} be a set of m partitions of X. A partition P = {C1, C2, ..., Ck} of X is a set of k subsets of X (clusters) that satisfies the following properties:
(i) Ci ≠ ∅, for all i = 1, ..., k;
(ii) Ci ∩ Cj = ∅, for all i ≠ j;
(iii) ∪_{i=1}^{k} Ci = X.
Furthermore, PX is defined as the set of all possible partitions of X, P ⊆ PX, and the consensus partition is denoted by P*, P* ∈ PX.
Formally, given an ensemble P of m partitions, the median partition is defined as:

    P* = arg min_{P ∈ PX} Σ_{i=1}^{m} d(P, Pi)                    (1)

where d is a dissimilarity measure between partitions.¹
Although the median partition has been accepted in the clustering ensemble community, almost no studies of its theoretical properties have been carried out in this area. However, theoretical studies of the median partition problem were carried out by discrete mathematicians long before this problem gained interest in the machine learning community. Nevertheless, it has mainly been studied when it is defined using the symmetric difference distance (or Mirkin distance) [9]. One of the most important results is the proof that the problem of finding the median partition with this distance is NP-hard [10]. A proper analysis with other (dis)similarity measures has not been done.
Although the complexity of the problem depends on the (dis)similarity measure applied, it seems to be a hard problem for any meaningful measure [1]. An exhaustive search for the optimal solution would only be computationally feasible for very small problems. Therefore, several heuristic procedures have been applied to face this problem, for example simulated annealing [6, 11] and genetic algorithms [12].
Despite the good results obtained with these heuristics, they are still designed to find the optimal solution in the whole search space. An interesting approach is to study the properties of the problem in order to find a possible prune of the search space, thereby reducing the complexity. In some clustering ensemble algorithms,

¹ The problem can equivalently be defined by maximizing the similarity to all partitions, when d is a similarity measure.
an intuitive simplification of the problem, called fragment clusters [13, 14], has been used. The idea is that, if a subset of objects has been placed in the same cluster in all partitions to be combined, it is expected to appear in the consensus partition as well. Therefore, a representative object (fragment object) can be computed for each of those subsets. This way, the problem is reduced to working with the fragment objects. Once the consensus result is obtained, each fragment object is replaced back by the set of objects that it represents, in order to obtain the final consensus partition. This idea has also been used in the context of ensembles of image segmentations under the name of super-pixels [15, 16].
The reduction explained above requires objects to be placed in the same cluster in all partitions. As the number of partitions increases, or the partitions become more independent, or some noisy partitions are included in the ensemble, the probability of finding such subsets of objects with the same cluster label in all partitions decreases. Therefore, this prune of the search space can be useless in practice. Stronger prunes of the search space are needed for real applications.
In this paper, we introduce a new prune that leads to a dramatic reduction of the size of the search space. The paper is structured as follows. In Section 2 we present the basic concepts of lattice theory that are needed to introduce our results. In Section 3 a relation between the dissimilarity measure used to define the median partition problem and possible prunes of the search space is established. First, a formalization of the fragment-objects-based prune is given by introducing the properties to be fulfilled by the dissimilarity measure. Afterwards, we introduce a new prune of the search space and provide a family of dissimilarity measures for which this prune is possible. Moreover, we present a measure that fits in this family, which can be used in practice to take advantage of the reduction of the search space. In Section 4, both prunes of the search space are experimentally evaluated on synthetic data. The size of the reduced search spaces is compared with the size of the whole search space under different conditions. Finally, Section 5 concludes this study.
2 Partition Lattice
The cardinality of the set of all partitions PX is given by the |X|-th Bell number [17], which can be computed by the recursion formula B_{n+1} = Σ_{k=0}^{n} C(n, k) B_k, where C(n, k) denotes the binomial coefficient. The Bell number grows exponentially as the number of objects increases, e.g. B_3 = 5, B_10 = 115975 and B_100 ≈ 4.75 × 10^115 (much bigger than the estimated number of atoms in the observable universe, around 10^80).² Therefore, even for a relatively small number of objects, the set PX of all their partitions is huge.
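For illustration, the recursion can be implemented directly (our sketch, not from the paper):

    # Bell-number recursion B_{n+1} = sum_{k=0..n} C(n, k) * B_k.
    from math import comb

    def bell_numbers(n_max):
        # Return the list [B_0, B_1, ..., B_{n_max}].
        bell = [1]                                  # B_0 = 1
        for n in range(n_max):
            bell.append(sum(comb(n, k) * bell[k] for k in range(n + 1)))
        return bell

    # bell_numbers(10)[3] == 5 and bell_numbers(10)[10] == 115975, matching the values above.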
Over PX, a partial order relation³ (called refinement) can be defined. For all P, P′ ∈ PX, we say P ⪯ P′ if and only if, for every cluster C′ ∈ P′, there are clusters Ci1, Ci2, ..., Civ ∈ P such that C′ = ∪_{j=1}^{v} Cij. In this case, it is said that P is finer than P′ or, equivalently, that P′ is coarser than P.

² http://www.universetoday.com/36302/atoms-in-the-universe/
³ A binary relation which is reflexive (P ⪯ P), anti-symmetric (P ⪯ P′ and P′ ⪯ P implies P = P′) and transitive (P ⪯ P′ and P′ ⪯ P″ implies P ⪯ P″), for all P, P′, P″ ∈ PX.
The set of all partitions PX of a finite set X, endowed with the refinement order ⪯, is a lattice (see the example in Fig. 1). Therefore, for each pair of partitions P, P′ two binary operations are defined: the meet (P ∧ P′), which is the coarsest of all partitions finer than both P and P′, and the join (P ∨ P′), which is the finest of all partitions coarser than both P and P′.
For example, in Fig. 1, if P1 = {{a, b, c, d}}, P2 = {{a, b}, {c, d}}, P3 = {{a, b, c}, {d}}, P4 = {{a, b}, {c}, {d}}, then P4 ⪯ P2 ⪯ P1, P2 ∧ P3 = P4 and P2 ∨ P3 = P1.
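For concreteness, a small sketch of the two operations on partitions represented as collections of sets (ours, illustrative only):

    # Meet and join of two partitions given as lists of sets of objects.
    from itertools import product

    def meet(p, q):
        # Coarsest partition finer than both: non-empty pairwise intersections of clusters.
        return [c & d for c, d in product(p, q) if c & d]

    def join(p, q):
        # Finest partition coarser than both: merge clusters that share objects.
        merged = []
        for c in [set(c) for c in p] + [set(c) for c in q]:
            overlapping = [m for m in merged if m & c]
            for m in overlapping:
                merged.remove(m)
                c = c | m
            merged.append(c)
        return merged

    # With P2 = [{'a','b'}, {'c','d'}] and P3 = [{'a','b','c'}, {'d'}]:
    # meet(P2, P3) gives [{'a','b'}, {'c'}, {'d'}] and join(P2, P3) gives [{'a','b','c','d'}],
    # matching the example above.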
Fig. 1. Hasse diagram (graphical representation) of the lattice associated to the set of partitions PX of the set X = {a, b, c, d}.
Among other properties, the partition lattice (PX, ⪯) is an atomic lattice. The partitions Pxy, composed of one cluster containing only the objects x and y and remaining clusters containing a single object each, are called atoms. For example, in Figure 1, Pab = {{a, b}, {c}, {d}} and Pbc = {{a}, {b, c}, {d}} are two atoms. The partition lattice is atomic because every partition P is the join of the elementary partitions Pxy over all pairs of objects x, y that are in the same cluster in P.
An important concept needed to understand the results proposed in this paper is that of q-quota rules [18]. Given a real number q ∈ [0, 1], the q-quota rule c_q is defined in the following way:

    c_q(P) = ⋁ {Pxy : γ(xy, P) > q}                                  (2)

where γ(xy, P) = N(xy, P)/m and N(xy, P) is the number of times that the objects x, y ∈ X are in the same cluster in the partitions in P.
Two interesting cases of q-quota rules are the following:
– unanimity rule: u(P) = ⋁ {Pxy : γ(xy, P) = 1}
– majority rule: m(P) = ⋁ {Pxy : γ(xy, P) > 0.5}
Notice that the result of any q-quota rule is a partition in PX. For example, u(P) is the partition obtained as the join of all atoms Pxy such that the objects x and y are placed in the same cluster in all partitions in P. In the same way, m(P) is the join of all atoms Pxy such that x and y are in the same cluster in more than half of the partitions in P. Next we present a toy example.
Example 1. Let X = {1, 2, 3, 4, 5, 6} be a set of objects and P = {P1 , P2 , P3 , P4 , P5 }
be a set of partitions of X, such that:
P1 = {{1, 2}, {3, 4}, {5, 6}}, P2 = {{1, 2, 4}, {3, 5, 6}}, P3 = {{1, 2, 3}, {4, 5, 6}},
P4 = {{1, 3, 4}, {2, 5, 6}}, P5 = {{1, 3}, {2, 4}, {5, 6}}.
In this case u(P ) = {{1}, {2}, {3}, {4}, {5, 6}} since 5 and 6 are the only
elements that are grouped in the same cluster in all partitions. On the other
hand, m(P) = {{1, 2, 3}, {4}, {5, 6}} = P12 ∨ P13 ∨ P56 . Notice that objects 2 and
3 are in the same cluster in m(P) even though P23 is not a majority atom, i.e.
γ(23, P) = 1/5 < 0.5. This is a chaining effect of the fact that P12 and P13 are
majority atoms.
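A possible implementation of the q-quota rules is sketched below (illustrative only; it assumes each partition is encoded as a mapping from object to cluster label, e.g. the partitions of Example 1 as label dicts):

    # q-quota rule of Eq. (2): join all atoms P_xy whose co-association exceeds q,
    # then read off the connected components with a union-find over the objects.
    from itertools import combinations

    def quota_rule(partitions, objects, q):
        m = len(partitions)
        parent = {x: x for x in objects}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for x, y in combinations(objects, 2):
            gamma = sum(1 for p in partitions if p[x] == p[y]) / m
            if gamma > q or (q == 1 and gamma == 1):     # the unanimity rule uses gamma == 1
                parent[find(x)] = find(y)

        clusters = {}
        for x in objects:
            clusters.setdefault(find(x), set()).add(x)
        return list(clusters.values())

    # For the ensemble of Example 1, quota_rule(P, objects, 1) gives
    # [{1}, {2}, {3}, {4}, {5, 6}] and quota_rule(P, objects, 0.5) gives
    # [{1, 2, 3}, {4}, {5, 6}] (up to ordering), matching u(P) and m(P) above.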
3 Methods
The two rules defined in the previous section (the unanimity and majority rules) allow the definition of two different subsets of the partition space PX. Let us denote by UX ⊆ PX the set of all partitions coarser than u(P), i.e. UX = {P ∈ PX : u(P) ⪯ P}. Analogously, MX ⊆ PX is defined as the set of all partitions coarser than m(P), i.e. MX = {P ∈ PX : m(P) ⪯ P}. It is not difficult to verify that MX ⊆ UX ⊆ PX, because any atom Pxy satisfying Pxy ⪯ u(P) also satisfies Pxy ⪯ m(P).
In this section we describe conditions under which the median partition can be searched for just in the reduced spaces UX and MX. The median partition problem can have more than one solution, therefore equation (1) should be written more precisely as follows:
    MP = arg min_{P ∈ PX} Σ_{i=1}^{m} d(P, Pi)                      (3)

where MP is the set of all median partitions. If we only consider the reduced search space UX, the median partition problem is defined as:

    MU = arg min_{P ∈ UX} Σ_{i=1}^{m} d(P, Pi)                      (4)

In the same way, when only MX is considered as the search space, the median partition problem is given by:

    MM = arg min_{P ∈ MX} Σ_{i=1}^{m} d(P, Pi)                      (5)

Another concept that we will use is the sum-of-dissimilarities (SoD):
Definition 1. Given a set of partitions P ⊂ PX and a dissimilarity d : PX × PX → R, the sum-of-dissimilarities of a partition P to P, SoD(P), is defined as:

    SoD(P) = Σ_{i=1}^{m} d(P, Pi)

Notice that a median partition P* is an element of PX with a minimum SoD value, i.e. P* = arg min_{P ∈ PX} SoD(P).
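To make this definition concrete, the following sketch (ours, illustrative only) enumerates the whole space PX for a tiny X and returns the median partitions under a user-supplied dissimilarity; the pair-counting distance used as a default is just an example, and the brute force is only feasible for very small |X|, as discussed in Section 2.

    # Brute-force median partition: enumerate all partitions as restricted-growth label
    # vectors, compute SoD with dissimilarity d, and keep the minimisers.
    from itertools import combinations

    def all_partitions(objects):
        objects = list(objects)
        def rec(i, labels, next_label):
            if i == len(objects):
                yield dict(zip(objects, labels))
                return
            for lab in range(next_label + 1):        # restricted growth: labels 0..next_label
                yield from rec(i + 1, labels + [lab], max(next_label, lab + 1))
        yield from rec(0, [], 0)

    def pair_counting(p, q):
        # Illustrative dissimilarity: number of object pairs the two partitions disagree on.
        return sum((p[x] == p[y]) != (q[x] == q[y]) for x, y in combinations(list(p), 2))

    def median_partitions(ensemble, objects, d=pair_counting):
        best, best_sod = [], float("inf")
        for cand in all_partitions(objects):
            sod = sum(d(cand, pi) for pi in ensemble)  # sum-of-dissimilarities, Definition 1
            if sod < best_sod:
                best, best_sod = [cand], sod
            elif sod == best_sod:
                best.append(cand)
        return best, best_sod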
In Section 3.1 we present a family of dissimilarity functions for which MP =
MU and therefore the reduced search space UX can be used instead of PX . In
Section 3.2 a family of dissimilarity functions for which MP = MM is also presented. We also prove that the lattice metric belongs to this family of functions.
3.1 Prune of the search space based on the Unanimity Rule
Definition 2. A dissimilarity measure between partitions du : PX × PX → R is said to be u-atomic if, for every pair of partitions P, P′ ∈ PX and every atom Pxy such that Pxy ⋠ P and Pxy ⪯ P′, we have du(P ∨ Pxy, P′) < du(P, P′).
Proposition 1. Let P ⊂ PX be a set of partitions and du be a u-atomic dissimilarity function. For every partition P ∈ PX and every atom Pxy such that Pxy ⋠ P and γ(xy, P) = 1, we have:

    SoD(P ∨ Pxy) < SoD(P)

Proof. SoD(P) = du(P, P1) + ... + du(P, Pm) for all Pi ∈ P, with i = 1, ..., m. In the same way, SoD(P ∨ Pxy) = du(P ∨ Pxy, P1) + ... + du(P ∨ Pxy, Pm). As γ(xy, P) = 1, all partitions Pi ∈ P satisfy Pxy ⪯ Pi. For each pair of corresponding terms we have du(P ∨ Pxy, Pi) < du(P, Pi) by the definition of a u-atomic dissimilarity, and therefore SoD(P ∨ Pxy) < SoD(P). □
Proposition 2. Let P ⊂ PX be a set of partitions, u(P) be the unanimity rule and du be a u-atomic function. Every median partition P* ∈ MP satisfies u(P) ⪯ P*.

Proof. Let us assume that P* ∈ MP is a median partition and u(P) ⋠ P*. Then there is at least one atom Pxy ⪯ u(P) such that Pxy ⋠ P*. According to Proposition 1, P* would not be a median element because the partition P* ∨ Pxy would have a smaller SoD value. Therefore, the assumption is false and we conclude that u(P) ⪯ P*. □
Corollary 1. Let P ⊂ PX be a set of partitions and du be a u-atomic function; then problems (3) and (4) have the same set of solutions (MP = MU).

Proof. This is a direct consequence of Proposition 2 and equations (3) and (4). Since all solutions of equation (3) are coarser than u(P), they are in UX. Therefore, problems (3) and (4) are equivalent. □
In practice, there is a simple way to reduce the search space from PX to UX. First, u(P) is computed and, for each of its clusters, a representative element yi is defined. This way, a set of objects Y with |Y| ≤ |X| is obtained, and the corresponding set of partitions PY is equivalent to UX. This is exactly the idea of fragment clusters. As we have previously mentioned, this idea has been used intuitively before and has also been proven to be valid for some common dissimilarity measures [13], such as Mutual Information and the Mirkin distance. In this section, we have presented the notion of a u-atomic function and we have proven that for any u-atomic dissimilarity measure this prune of the search space can be used.
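A sketch of this reduction (our illustrative reading, with partitions encoded as object-to-label mappings): objects sharing the same label signature across all partitions form exactly the clusters of u(P), and one representative per group is kept.

    # Fragment-cluster reduction and the mapping back to the original objects.
    def fragment_reduction(partitions, objects):
        groups = {}
        for x in objects:
            signature = tuple(p[x] for p in partitions)   # same signature <=> together in all partitions
            groups.setdefault(signature, []).append(x)
        representatives = [members[0] for members in groups.values()]
        reduced = [{rep: p[rep] for rep in representatives} for p in partitions]
        expand = {members[0]: members for members in groups.values()}
        return reduced, representatives, expand

    def expand_consensus(consensus_on_reps, expand):
        # Replace each representative by the objects it stands for.
        return {x: consensus_on_reps[rep] for rep, members in expand.items() for x in members}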
3.2 Prune of the search space based on the Majority Rule
Definition 3. A dissimilarity measure between partitions dm : PX × PX → R is said to be m-atomic if, for every pair of partitions P, P′ ∈ PX and every atom Pxy such that Pxy ⋠ P, there is a constant real value c > 0 such that the following properties hold:
– (i) if Pxy ⪯ P′, then dm(P ∨ Pxy, P′) ≤ dm(P, P′) − c;
– (ii) if Pxy ⋠ P′, then dm(P ∨ Pxy, P′) ≤ dm(P, P′) + c.
Notice that, following this definition, any m-atomic function is also u-atomic.
Proposition 3. Let P ⊂ PX be a set of partitions and dm be an m-atomic function. For every partition P ∈ PX and every atom Pxy such that Pxy ⋠ P and γ(xy, P) > 0.5, we have:

    SoD(P ∨ Pxy) < SoD(P)

Proof. SoD(P) = dm(P, P1) + ... + dm(P, Pm) for all Pi ∈ P, with i = 1, ..., m. In the same way, SoD(P ∨ Pxy) = dm(P ∨ Pxy, P1) + ... + dm(P ∨ Pxy, Pm). As γ(xy, P) > 0.5, there are t > m/2 partitions Pi ∈ P such that Pxy ⪯ Pi, and therefore, according to Definition 3, dm(P ∨ Pxy, Pi) ≤ dm(P, Pi) − c. On the other hand, there are l < m/2 partitions Pj ∈ P such that Pxy ⋠ Pj, and dm(P ∨ Pxy, Pj) ≤ dm(P, Pj) + c.
Therefore, SoD(P ∨ Pxy) ≤ SoD(P) − t·c + l·c, and taking into account that t > l and c > 0, we have SoD(P ∨ Pxy) < SoD(P), and the proposition is proven. □
Proposition 4. Let P ⊂ PX be a set of partitions, m(P) be the majority rule and dm be an m-atomic function. Every median partition P* ∈ MP satisfies m(P) ⪯ P*.

Proof. The proof is analogous to that of Proposition 2. Let us assume that P* ∈ MP is a median partition and m(P) ⋠ P*. Then there is at least one majority atom Pxy (i.e. with γ(xy, P) > 0.5) such that Pxy ⋠ P*. According to Proposition 3, P* would not be a median element because the partition P* ∨ Pxy would have a smaller SoD value. Therefore, the assumption is false and we conclude that m(P) ⪯ P*. □
Corollary 2. Let P ⊂ PX be a set of partitions and dm be an m-atomic function; then problems (3) and (5) have the same set of solutions (MP = MM).

Proof. This is a direct consequence of Proposition 4 and equations (3) and (5). □
We have proven that if the median partition problem is defined with an m-atomic function, any solution of the problem will be found in the reduced search space MX. As in the case of the fragment-clusters prune, there is a simple way to reduce the search space from PX to MX. In this case, m(P) is first computed and, for each of its clusters, a representative element yi is defined. This way, a set of objects Y with |Y| ≤ |X| is obtained, and the corresponding set of partitions PY is equivalent to MX.
So far, we have presented the notion of an m-atomic function and we have proven that for any m-atomic dissimilarity measure this prune of the search space can be applied. Now, we present an existing distance between partitions and we prove that it is m-atomic.
Definition 4. (Lattice Metric [18]) The function δ : PX × PX → R defined as δ(P, P′) = |P| + |P′| − 2|P ∨ P′|, where |P| denotes the number of clusters in partition P, is called the lattice metric.
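A small sketch of this metric (ours, illustrative; partitions encoded as object-to-label mappings, with the join computed via union-find):

    # Lattice metric delta(P, P') = |P| + |P'| - 2|P v P'|.
    def n_clusters(p):
        return len(set(p.values()))

    def join_size(p, q):
        # Number of clusters of the join: components of "same cluster in P or in Q".
        parent = {x: x for x in p}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for part in (p, q):
            first_of = {}
            for x, lab in part.items():
                if lab in first_of:
                    parent[find(x)] = find(first_of[lab])
                else:
                    first_of[lab] = x
        return len({find(x) for x in p})

    def lattice_metric(p, q):
        return n_clusters(p) + n_clusters(q) - 2 * join_size(p, q)

    # For P2 and P4 of Fig. 1 (as label dicts), lattice_metric gives 2 + 3 - 2*2 = 1,
    # since P2 v P4 = P2.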
Proposition 5. The lattice metric δ : PX × PX → R is m-atomic.

Proof. Let P, Pxy, Pzt, P′ ∈ PX be four partitions such that Pxy and Pzt are two atoms with Pxy ⋠ P, Pxy ⪯ P′, and Pzt ⋠ P, Pzt ⋠ P′. We have to prove that there is a constant c such that:
(i) δ(P ∨ Pxy, P′) ≤ δ(P, P′) − c, and (ii) δ(P ∨ Pzt, P′) ≤ δ(P, P′) + c.
Working on (i), we need:

    |P ∨ Pxy| + |P′| − 2|P ∨ Pxy ∨ P′| ≤ |P| + |P′| − 2|P ∨ P′| − c

As Pxy ⪯ P′, we have P ∨ Pxy ∨ P′ = P ∨ P′. Therefore:

    |P ∨ Pxy| ≤ |P| − c

As Pxy ⋠ P, we have |P ∨ Pxy| = |P| − 1, because joining with Pxy merges two clusters of P. Thus we obtain c ≤ 1.
Now, working on (ii), we need:

    |P ∨ Pzt| + |P′| − 2|P ∨ Pzt ∨ P′| ≤ |P| + |P′| − 2|P ∨ P′| + c

Since Pzt ⋠ P, |P ∨ Pzt| = |P| − 1, and the inequality becomes:

    −1 − 2|P ∨ Pzt ∨ P′| ≤ −2|P ∨ P′| + c
    c ≥ 2|P ∨ P′| − 1 − 2|P ∨ Pzt ∨ P′|

The right-hand side of this inequality takes its largest value when Pzt ⋠ P ∨ P′, in which case |P ∨ Pzt ∨ P′| = |P ∨ P′| − 1. Therefore,

    c ≥ 2|P ∨ P′| − 1 − 2(|P ∨ P′| − 1)
    c ≥ 1

Taking both results into account, we conclude that δ is an m-atomic function with c = 1. □
This means that if the median partition problem is defined with the lattice
metric δ the search space of the problem is reduced to MX . Notice that this metric corresponds to the minimum path length metric in the neighboring graph of
the lattice (see Figure 1), when all edges have a weight equal to 1 [18]. Therefore, we could say that this is an informative measure to compare partitions that
takes into account the lattice structure of the partition space. Furthermore, it
allows a pruning of the search space for the median partition problem.
4 Experimental Results and Discussion
Let X = {x1, ..., xn} be a set of n objects, where xi ∈ R^d is a vector in a multidimensional space. We assume that the dimensions of the vector xi = (xi,1, ..., xi,d) are drawn from the uniform distribution, xi,j ∼ U(0, 1), and are mutually independent. We generated synthetic datasets for d = 3 and n = 1000 (= 10³), 3375 (= 15³), 8000 (= 20³), 15625 (= 25³), 27000 (= 30³). Objects in the datasets lie inside a 3-dimensional cube with edge length 1 starting at the origin of the Cartesian coordinate system.
In order to generate different partitions of each dataset, we use a simple clustering algorithm based on cuts of the cube by random hyperplanes. Furthermore, we model different degrees of dependency between the partitions in the ensemble by clustering the dataset using different subsets of the dimensions of the object representation.
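The exact generator is not specified in detail; the sketch below is one plausible, assumed reading (ours, not the authors' code): each partition is obtained by projecting the selected feature subspace onto a random direction and cutting it at k − 1 random thresholds, which yields up to k clusters per partition.

    # Assumed, illustrative stand-in for the synthetic ensemble generator described above.
    import numpy as np

    def synthetic_ensemble(n, dims_used, k, m, d=3, seed=0):
        rng = np.random.default_rng(seed)
        X = rng.uniform(0.0, 1.0, size=(n, d))               # points in the unit cube
        partitions = []
        for _ in range(m):
            direction = rng.normal(size=len(dims_used))
            projection = X[:, dims_used] @ direction
            thresholds = np.sort(rng.uniform(projection.min(), projection.max(), size=k - 1))
            labels = np.searchsorted(thresholds, projection)   # cluster = slab index, 0..k-1
            partitions.append(labels)
        return X, partitions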
We carry out three different kinds of experiments to illustrate the behavior of the proposed method for pruning the search space. In Section 4.1 we work with different dataset sizes (n) and three different levels of dependency between partitions. In Section 4.2 we vary the number of clusters (k) in the partitions to be combined and, finally, in Section 4.3 we carry out experiments with different numbers of partitions (m) in the cluster ensemble. For all experiments we show:
|X|: Number of objects in the dataset. |X| = n.
|PX |: Size of the original search space for the median partition problem.
|u(P)|: Number of clusters in the unanimity rule partition.
|UX |: Size of the search space after applying the unanimity rule based
prune (fragment clusters based prune).
|m(P)|: Number of clusters in the majority rule partition.
|MX |: Size of the search space after applying the majority rule based prune
(the proposed prune).
In all tables, the sizes of the search spaces are given in powers of 10. This way, the order-of-magnitude differences among the sizes of the different search spaces can be easily appreciated. In order to provide a uniform notation, even the small values are given in powers of ten, e.g. if |PX| = 203 we write 10². Results reported in all tables correspond to the median values of individual results after 5 repetitions.
4.1 Analysis by increasing the number of objects and varying the independence degree
In this section we compare the sizes of |PX|, |UX| and |MX| for different dataset sizes. We generated 10 partitions, where each partition has a number of clusters equal to a random number in the interval [2, n/2]. In Table 1, only the first feature dimension of the objects was used for the computation of the partitions. The idea is to generate partitions with a high degree of dependency, i.e. partitions with a similar distribution of objects over the clusters. In the case of Table 2, a medium degree of dependency is explored by using features d = 0, 1. Finally, in Table 3 all features d = 0, 1, 2 are used to analyze the behavior of the prunes in the case of ensembles with highly independent partitions.
Table 1. Comparison of |PX|, |UX| and |MX| for different dataset sizes |X|. Partitions were generated with a high degree of dependency (d = 0). We use m = 10, and for each partition k = random(2, n/2). Results are the average of 5 trials.

|X|      |PX|          |u(P)|   |UX|       |m(P)|   |MX|
1000     10^1928       10       10^6       6        10^2
3375     10^7981       15       10^10      9        10^4
8000     10^21465      20       10^14      10       10^5
15625    10^45847      25       10^19      10       10^5
27000    10^84822      30       10^24      14       10^8
Table 2. Comparison of |PX|, |UX| and |MX| for different dataset sizes |X|. Partitions were generated with a medium degree of dependency (d = 0, 1). We use m = 10, and for each partition k = random(2, n/2). Results are the average of 5 trials.

|X|      |PX|          |u(P)|   |UX|       |m(P)|   |MX|
1000     10^1928       98       10^113     6        10^2
3375     10^7981       195      10^268     7        10^3
8000     10^21465      345      10^539     14       10^9
15625    10^45847      558      10^963     15       10^10
27000    10^84822      689      10^1239    25       10^19
Table 3. Comparison of |PX|, |UX| and |MX| for different dataset sizes |X|. Partitions were generated with a low degree of dependency (d = 0, 1, 2). We use m = 10, and for each partition k = random(2, n/2). Results are the average of 5 trials.

|X|      |PX|          |u(P)|   |UX|       |m(P)|   |MX|
1000     10^1928       777      10^1429    12       10^7
3375     10^7981       2094     10^4589    35       10^30
8000     10^21465      5877     10^15096   43       10^39
15625    10^45847      7122     10^18801   50       10^48
27000    10^84822      9014     10^24587   71       10^75
From this experiment we can appreciate the following:
– The cardinality of MX is always considerably lower than the cardinality of UX, and the cardinality of UX is, in turn, always much lower than the cardinality of PX.
– As the number of elements in the dataset increases, the size of all search spaces also increases. However, |PX| grows faster than |UX| and, at the same time, |UX| grows faster than |MX|.
– As the independence of the partitions in the ensemble increases, the cardinality of the resulting search spaces after both prunes also increases. The higher the dependency between partitions, the higher the probability of finding groups of objects that are placed in the same cluster in all partitions, or in more than half of the partitions.
– Although the original search space PX is huge in all cases, the reduced search space after the majority-rule-based prune, MX, is sometimes very small. In this case, the exact solution of the median partition problem can even be found by an exhaustive search. On the other hand, the reduced search space after the unanimity-rule prune, UX, is often too big to be useful in practice.
4.2 Analysis increasing the number of clusters in the partitions
In this section we used a dataset of size 10×10×10 = 1000. The three dimensions
of each object are taken into account for the generation of the partitions in the
ensemble. We generated different ensembles of m = 10 partitions with different
number of clusters k = 5, 20, 50, 100, 200. The results of this experiment are
reported in Table 4.
Table 4. Comparison of |PX|, |UX| and |MX| when the number of clusters k in the partitions is increased. Partitions were generated by using the full representation of objects d = 1, 2, 3. The dataset size is 1000 and we use m = 10 with k clusters.

k      |X|     |PX|          |u(P)|   |UX|       |m(P)|   |MX|
5      1000    10^21465      41       10^37      1        10^0
20     1000    10^21465      161      10^211     3        10^1
50     1000    10^21465      523      10^891     10       10^6
100    1000    10^21465      647      10^1149    31       10^26
200    1000    10^21465      769      10^1412    115      10^139
From Table 4 we can see that the size of the reduced search spaces increases with the number of clusters in the partitions to be combined. This is expected, because the higher the number of clusters, the lower the probability of finding groups of objects that are placed in the same cluster in all partitions, or in more than half of the partitions. Furthermore, when a small number of clusters
(with respect to the number of objects) is used, the reduction of the search space after the majority-rule-based prune can be too drastic. The median partition may then have very few clusters, or even a single one. This is a consequence of the chaining effect illustrated in Example 1. Such medians could be useless in practical applications.
4.3 Analysis increasing the number of partitions
In this section we used a dataset of size 10×10×10 = 1000. The three dimensions
of each object are taken into account for the generation of the partitions in
the ensemble. We generated ensembles with different number of partitions m =
5, 10, 20, 50, 100, where each partition has a number of clusters equal to a random
number in the interval [2, n/2]. The results of this experiment are reported in
Table 5.
Table 5. Comparison of |PX|, |UX| and |MX| when the number of partitions m in the ensemble is increased. Partitions were generated by using the full representation of objects d = 1, 2, 3. The dataset size is 1000 and we generate m partitions, each one with k = random(2, n/2) clusters.

m      |X|     |PX|          |u(P)|   |UX|       |m(P)|   |MX|
5      1000    10^21465      601      10^1052    27       10^21
10     1000    10^21465      694      10^1249    14       10^9
20     1000    10^21465      789      10^1456    13       10^8
50     1000    10^21465      772      10^1418    6        10^2
100    1000    10^21465      746      10^1362    1        10^0
While the size of the search space after the unanimity-rule-based prune, |UX|, remains stable, the cardinality of MX decreases as the number of partitions in the ensemble increases. The higher the number of partitions in the ensemble, the higher the probability of finding groups of objects that are placed in the same cluster in more than half of the partitions. However, this reduction can sometimes be so strong that the resulting median partition has few clusters or just one, which could be inappropriate in practical applications.
5 Conclusions
We studied two possible reductions of the search space for the median partition problem. In the first case, we introduced a family of functions that allow the application of the fragment-clusters-based prune. This prune had previously been used in an intuitive manner, or with a few measures for which its suitability has been proven; here, a characterization of the measures that allow this prune was presented. Furthermore, we introduced a stronger prune of the search space
for the median partition problem. In this case, we also presented a family of
dissimilarity measures that allow the application of this prune and we proved
that the lattice metric fits in this family.
The proposed prune achieves a dramatic reduction of the search space. Even for a relatively large number of objects, for which the original search space is really huge, the reduced search space is often small enough that the median partition can be found by an exhaustive search. Even in the cases when the reduced search space is still big, any heuristic procedure can take advantage of the strong reduction with respect to the original size of the space.
Although this prune can be beneficial in several problems, sometimes the median partition defined with a function that allows this prune has a small number of clusters. In some extreme cases, it could even be a single cluster, making this kind of consensus useless. In practice, this limitation could be alleviated by generating an ensemble of partitions with a high number of clusters, which will be reduced in the consensus partition computation. This idea has been used before in the clustering ensemble context [19].
The advantages of the proposed prune from the computational point of view
are clear in our experiments with synthetic data. A further step would be to
analyze the quality of the median partition obtained by this method on real
datasets.
The two studied prunes correspond to two particular cases of the q-quota rules presented in Section 2: unanimity (q = 1) and majority (q = 0.5). The first one usually leads to a weak reduction of the search space, while the second one can sometimes be too strong. A good trade-off might be found with prunes associated with other quota rules, e.g. q = 2/3 or 3/4. A characterization of the dissimilarity measures between partitions that would allow this kind of prune is worth studying.
Acknowledgments. This research has been supported by the RESTATE Programme, co-funded by the European Union under the FP7 COFUND Marie
Curie Action - Grant agreement no. 267224.
References
1. Vega-Pons, S., Ruiz-Shulcloper, J.: A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence 25(3) (2011) 337–372
2. Fred, A.L., Jain, A.K.: Combining multiple clustering using evidence accumulation.
IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 835–
850
3. Ayad, H.G., Kamel, M.S.: On voting-based consensus of cluster ensembles. Pattern
Recognition 43 (2010) 1943 – 1953
4. Yoon, H.S., Ahn, S.Y., Lee, S.H., Cho, S.B., Kim, J.H.: Heterogeneous clustering
ensemble method for combining different cluster results. In: BioDM 2006. Volume
3916 of LNBI. (2006) 8292
5. Strehl, A., Ghosh, J.: Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3 (2002) 583–617
6. Vega-Pons, S., Correa-Morris, J., Ruiz-Shulcloper, J.: Weighted partition consensus
via kernels. Pattern Recognition 43(8) (2010) 2712–2724
7. Topchy, A.P., Jain, A.K., Punch, W.F.: Clustering ensembles: Models of consensus
and weak partitions. IEEE Trans. Pattern Anal. Mach. Intell. 27(12) (2005) 1866–
1881
8. Punera, K., Ghosh, J.: Consensus-based ensembles of soft clusterings. Applied
Artificial Intelligence 22(7&8) (2008) 780–810
9. Mirkin, B.G.: Mathematical Classification and Clustering. Kluwer Academic Press,
Dordrecht (1996)
10. Wakabayashi, Y.: Aggregation of Binary Relations: Algorithmic and Polyhedral
Investigations. PhD thesis, Universitat Augsburg (1986)
11. Filkov, V., Skiena, S.: Integrating microarray data by consensus clustering. International Journal on Artificial Intelligence Tools 13(4) (2004) 863 – 880
12. Luo, H., Jing, F., Xie, X.: Combining multiple clusterings using information theory
based genetic algorithm. In: IEEE International Conference on Computational
Intelligence and Security. Volume 1. (2006) 84 – 89
13. Wu, O., Hu, W., Maybank, S.J., Zhu, M., Li, B.: Efficient Clustering Aggregation
Based on Data Fragments. IEEE Transactions on Systems, Man, and Cybernetics,
Part B (Cybernetics) 42(3) (2012) 913–926
14. Chung, C.H., Dai, B.R.: A fragment-based iterative consensus clustering algorithm
with a robust similarity. Knowledge and Information Systems (2013) 1–19
15. Singh, V., Mukherjee, L., Peng, J., Xu, J.: Ensemble clustering using semidefinite
programming with applications. Machine Learning 79 (2010) 177–200
16. Vega-Pons, S., Jiang, X., Ruiz-Shulcloper, J.: Segmentation ensemble via kernels.
In: First Asian Conference on Pattern Recognition (ACPR), 2011. 686–690
17. Spivey, M.Z.: A generalized recurrence for bell numbers. Journal of Integer Sequences 11 (2008) 1 – 3
18. Leclerc, B.: The median procedure in the semilattice of orders. Discrete Applied
Mathematics 127 (2003) 285 – 302
19. Fred, A.: Finding consistent clusters in data partitions. In: 3rd. Int. Workshop on
Multiple Classifier Systems. (2001) 309–318
An Ensemble Approach
to Combining Expert Opinions
Hua Zhang¹, Evgueni Smirnov¹, Nikolay Nikolaev², Georgi Nalbantov³, and Ralf Peeters¹
¹ Department of Knowledge Engineering, Maastricht University, P.O.BOX 616, 6200 MD Maastricht, The Netherlands
{hua.zhang,smirnov,ralf.peeters}@maastrichtuniversity.nl
² Department of Computing, Goldsmiths College, University of London, London SE14 6NW, United Kingdom
[email protected]
³ Faculty of Health, Medicine and Life Sciences, Maastricht University, P.O.BOX 616, 6200 MD Maastricht, The Netherlands
[email protected]
Abstract. This paper introduces a new classification problem in the context of
human computation. Given training data annotated by m human experts s.t. for
each training instance the true class is provided, the task is to estimate the true
class of a new test instance. To solve the problem we propose to apply a well-known ensemble approach, namely the stacked-generalization approach. The key idea is to view each human expert as a base classifier and to learn a meta classifier that combines the votes of the experts into a final vote. We experimented with the stacked-generalization approach on a classification problem that involved 12 human experts. The experiments showed that the approach can significantly outperform the best expert and the majority vote of the experts in terms of classification accuracy.
1 Introduction
Human computation is an interdisciplinary field involving systems of humans and computers capable of solving problems that neither party can solve as well separately [4].
This paper introduces a new classification problem in the context of human computation and proposes an ensemble-related approach to that task.
The classification problem we define is essentially a single-label classification problem. Assume that we have m human experts that estimate the true class of instances coming from some unknown probability distribution. We collect these instances together with the experts' class estimates and label them with their true classes. The resulting collection of instances forms our training data. In this context our classification problem is to estimate the true class of a new test instance, given the training data and the class estimates provided by the m human experts for that instance.
To solve the problem we defined above we propose to apply a well-known ensemble
approach, namely the stacked-generalization approach [6]. The key idea is to view each
human expert as a base classifier and to learn a meta classifier that predicts the class
for a new instance given the class estimates provided by the m human experts for that instance. This implies that the meta classifier combines the votes of the experts into a final vote. The meta classifier is learned from the training data, which contains both the expert class estimates and the true classes. We experimented with the stacked-generalization approach on a classification problem that involved 12 human experts. The experiments showed that the approach can significantly outperform the best expert and the majority vote of the experts in terms of classification accuracy.
Our work can be compared with other work on classification considered in the context of human computation and crowdsourcing [5, 7]. In these two fields the main emphasis is on classification problems where the training data is labeled by experts only; i.e., the true instance classes are not provided. We note that our classification problem is conceptually simpler but, to the best of our knowledge, it has not been considered so far. There are many applications in medicine, finance, meteorology, etc., where our classification problem is central. Consider, for example, a set of meteorologists that predict whether it will rain the next day. The true class arrives in 24 hours. We can record the meteorologists' predictions and the true class over a time period to form our data. Stacked generalization can then be applied, and thus we will hopefully be able to predict better than the best meteorologist or the majority vote of the meteorologists.
The remainder of the paper is organized as follows. Section 2 formalizes our classification task and describes stacked generalization as an approach to that task. The experiments are given in Section 3. Finally, Section 4 concludes the paper.
2 Classification Problem and Stacked Generalization
Let X be an instance space, Y a class set, and p(x, y) an unknown probability distribution over the labeled space X × Y. We assume the existence of m human experts capable of estimating the true class of any instance (x, y) ∈ X × Y drawn according to p(x, y). We draw n labeled instances (x, y) ∈ X × Y from p(x, y). Any expert i ∈ 1..m provides an estimate y^(i) ∈ Y of the true class y of each instance x without observing y. This implies that the description x of any instance is effectively extended by the class estimates y^(1), ..., y^(m) ∈ Y given by the m experts. Thus, we consider any instance as an (m + 2)-tuple (x, y^(1), ..., y^(m), y). The set of the n instances formed in this way constitutes the training data D. In this context we define our classification problem: given the training data D, a test instance x ∈ X, and the class estimates y^(1), ..., y^(m) ∈ Y provided by the m experts for x, estimate the true class of the instance x according to the unknown probability distribution p(x, y).
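To make the representation concrete, the following minimal sketch (ours, not code from the paper) shows one way the (m + 2)-tuples of D could be assembled in Python; the toy sentences, votes, and class encoding are illustrative assumptions.

```python
import numpy as np

# Toy stand-ins (illustrative only): two instances, m = 3 expert estimates each.
instances = ["I recommend you to take a vacation.",
             "I recommend that you take a vacation."]
expert_votes = np.array([[0, 1, 0],   # y^(1), ..., y^(m) for the first instance
                         [1, 1, 1]])  # y^(1), ..., y^(m) for the second instance
true_classes = np.array([0, 1])       # true class y of each instance

# Each training example is an (m + 2)-tuple (x, y^(1), ..., y^(m), y).
D = [(x, *votes, y) for x, votes, y in zip(instances, expert_votes, true_classes)]
```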
Our solution to the classification problem defined above is to employ stacked generalization [6]. The key idea is to consider each human expert i ∈ 1..m as a base classifier (providing class estimates) and then to learn a meta classifier that combines the class estimates of the experts into a final class estimate. The meta classifier is a function that can take one of two forms: either h : X × Y^m → Y or h : Y^m → Y. The difference is whether the instance descriptions in X are considered. Once the form of the meta classifier is chosen, we build the classifier using the training data D. We note
that our use of stacked generalization does not impose any restrictions on the type of the meta classifier (as opposed to [1]).
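As a rough illustration of the h : Y^m → Y variant, the sketch below (our addition, not code from the paper) treats the experts' one-hot-encoded votes as the meta-level features and trains a meta classifier on them with scikit-learn. The simulated experts and the choice of logistic regression as the meta learner are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# expert_votes[i, j] is the class estimate of expert j for instance i;
# true_classes[i] is the true class y of instance i (toy simulated values).
rng = np.random.default_rng(0)
true_classes = rng.integers(0, 2, size=100)
expert_votes = np.column_stack([
    np.where(rng.random(100) < acc, true_classes, 1 - true_classes)
    for acc in (0.55, 0.45, 0.35)         # three simulated experts
])

# Meta-level representation h : Y^m -> Y: the experts' votes are the only features.
encoder = OneHotEncoder(handle_unknown="ignore")
meta_features = encoder.fit_transform(expert_votes)

meta_clf = LogisticRegression(max_iter=1000)
meta_clf.fit(meta_features, true_classes)

# Estimating the true class of a new instance from the experts' votes alone.
new_votes = np.array([[0, 1, 1]])
print(meta_clf.predict(encoder.transform(new_votes)))
```

For binary class sets the one-hot encoding is not strictly necessary, but it keeps the sketch applicable when Y contains more than two classes.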
3 Experiments
For our experiments we chose a difficult language-style classification problem.4 We had 317 sentences in English that were composed according to either a Chinese style or an American style. An example of two such sentences with the same meaning is given below:
– Chinese Style: “I recommend you to take a vacation.”
– American Style: “I recommend that you take a vacation.”
The sentences were labeled by 12 experts who did not know the true classes of those sentences. The language-style classification problem was to estimate the true class of any new sentence given the class estimates provided by the 12 experts for that sentence.
The language-style classification problem was indeed a difficult problem. The expert accuracy rates were in the interval [0.27, 0.54]. The mean accuracy rate was 0.39 and the standard deviation was 0.08. The accuracy rate of the majority vote of the experts was 0.71.
The sentences with the labels of the 12 experts and their true classes formed our training data. We trained meta classifiers predicting the true class of the sentences. We considered two types of meta classifiers: h : X × Y^12 → Y and h : Y^12 → Y. The input of the first type of meta classifiers consisted of the bag-of-words representation of the sentence to be classified and the classes provided by all 12 experts. The input of the second type consisted of the classes provided by the 12 experts only. The output of both types of meta classifiers was the class estimate for the instance to be classified.
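The two input representations could be built, for instance, as in the hedged sketch below (our illustration; CountVectorizer as the bag-of-words extractor and the toy sentences and votes are assumptions, not the authors' actual preprocessing).

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the 317 sentences and the 12 expert estimates.
sentences = ["I recommend you to take a vacation.",
             "I recommend that you take a vacation."]
expert_votes = np.array([[0] * 12, [1] * 12])   # shape (n_sentences, 12)

# Second input type, h : Y^12 -> Y: the expert votes only.
X_votes = csr_matrix(expert_votes)

# First input type, h : X x Y^12 -> Y: bag-of-words features plus the expert votes.
bow = CountVectorizer().fit_transform(sentences)
X_bow_votes = hstack([bow, X_votes])
```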
In addition, we experimented with the meta classifiers with and without the use of feature selection. The feature-selection procedure employed was the wrapper method based on greedy stepwise search [2].
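The exact wrapper configuration is not reproduced here; as a hedged sketch of the general idea, the following greedy forward-selection loop adds one expert (feature) at a time according to the cross-validated accuracy of the meta classifier. The function name, the Naive Bayes meta learner, the simulated data, and the stopping rule are our own illustrative choices.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def greedy_forward_selection(X, y, estimator, cv=10):
    """Greedy stepwise (forward) wrapper: repeatedly add the feature whose
    inclusion most improves cross-validated accuracy; stop when nothing helps."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        score, best_f = max(
            (cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean(), f)
            for f in remaining
        )
        if score <= best_score:
            break
        best_score = score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected, best_score

# Toy usage: pick a subset of simulated experts for a Naive Bayes meta classifier.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=120)
X = np.column_stack([np.where(rng.random(120) < a, y, 1 - y)
                     for a in (0.7, 0.55, 0.5, 0.45)])
print(greedy_forward_selection(X, y, MultinomialNB(), cv=10))
```

Backward elimination or a bidirectional search would fit the same wrapper template; only the candidate-generation step changes.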
The accuracy rates of the meta classifiers were estimated using 10-fold cross-validation. However, to decide whether these classifiers were good, we needed to determine
whether their accuracy rates were statistically greater than those of the best expert and
majority vote of the experts. We note that this is not a trivial problem, since the k-fold
cross-validation is not applicable to the human experts employed. Nevertheless, we performed a paired t-test that we designed as follows. We split the training data randomly into k folds. For each fold we obtained the class estimates provided by the meta classifiers, the class estimates of the best expert, and the class estimates of the majority vote of the experts for all the instances in the fold. Using this information we computed, for any fold j ∈ 1..k, the accuracy rate a^m_j of the meta classifiers, the accuracy rate a^be_j of the best expert, and the accuracy rate a^mv_j of the majority vote of the experts. Then we computed the paired difference d_j = a^m_j − a^be_j (respectively d_j = a^m_j − a^mv_j), and the point estimate d̄ = (Σ^k_{j=1} d_j)/k. Using these data, the t-statistic that we used was (d̄ − µ_d)/(S_d/√n), where µ_d is the true mean and S_d is the sample standard deviation of the paired differences.
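A minimal sketch of this fold-wise paired test is given below (our illustration, not the authors' code); the per-fold accuracy values are invented for the example, the number of folds plays the role of the sample size in the t-statistic, and the one-sided alternative tests whether the meta classifier is better.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies over k = 10 folds (illustrative numbers only).
acc_meta = np.array([0.78, 0.75, 0.81, 0.72, 0.78, 0.75, 0.78, 0.81, 0.72, 0.75])
acc_mv   = np.array([0.72, 0.69, 0.75, 0.66, 0.72, 0.69, 0.72, 0.75, 0.66, 0.69])

d = acc_meta - acc_mv                        # paired differences d_j
d_bar = d.mean()                             # point estimate of the mean difference
s_d = d.std(ddof=1)                          # sample standard deviation S_d
t_stat = d_bar / (s_d / np.sqrt(len(d)))     # t-statistic under H0: mu_d = 0
p_value = stats.t.sf(t_stat, df=len(d) - 1)  # one-sided p-value

# The same one-sided test via SciPy's paired t-test helper (SciPy >= 1.6).
t_scipy, p_scipy = stats.ttest_rel(acc_meta, acc_mv, alternative="greater")
print(t_stat, p_value, t_scipy, p_scipy)
```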
4 The data can be freely downloaded from: https://dke.maastrichtuniversity.nl/smirnov/hua.zip.
Table 1. Accuracy rates of meta classifiers for the language-style classification task. s (s̄) indicates (non-) presence of sentence representation in the input. w (w̄) indicates (non-) use of wrapper. The rates in bold are significantly greater than the accuracy rate 0.71 of the majority vote of the experts at the 0.05 significance level.

Classifier            s̄-w̄    s-w̄    s̄-w    s-w
AdaBoostM1            0.78    0.76    0.76    0.71
k-Nearest Neighbor    0.75    0.76    0.77    0.78
Logistic Regression   0.76    0.72    0.76    0.71
Naive Bayes           0.76    0.72    0.78    0.76
RandomForest          0.73    0.75    0.76    0.74
The accuracy rates of the meta classifiers are provided in Table 1. Since the majority vote of the human experts outperformed the best expert, the table shows the results of the paired t-test comparing the accuracy rates of the meta classifiers with the accuracy rate of the majority vote of the human experts at the 0.05 significance level.5

5 For the sake of completeness we trained classifiers h : X → Y as well. Their accuracy rates were in the interval [0.47, 0.53]; i.e., they were statistically worse than the experts' majority vote.
Two main observations can be derived from Table 1:
(O1) 18 out of 20 meta classifiers have an accuracy rate significantly greater than the accuracy rate 0.71 of the majority vote of the experts.
(O2) stacked generalization achieves its best classification accuracy rates when:
(O2a) the instances to be classified are represented by the expert estimates only, and
(O2b) feature selection is employed. In this case we achieved an average rate of 0.766.
During the experiments we recorded the running time of training the meta classifiers. The results are provided in Table 2. They show that:
(O3) wrapper-based meta classifiers require more time. Among them the most efficient are the meta classifiers that do not employ the sentence representation.
(O4) meta classifiers that do not use wrappers require less time. Among them the most efficient are the meta classifiers that do not employ the sentence representation.

Table 2. Time (ms) for building meta classifiers. s (s̄) indicates (non-) presence of sentence representation in the input. w (w̄) indicates (non-) use of wrapper.

Classifier            s̄-w̄    s-w̄    s̄-w     s-w
AdaBoostM1            0.03    2.21    14.85   257.63
k-Nearest Neighbor    0       0       26.49   296.84
Logistic Regression   0.02    0.05     4.95   133.47
Naive Bayes           0       0.1      0.31    53.11
RandomForest          0.03    1.33    23.51   219.64

4 Conclusion

This section analyzes observations (O1)-(O4) from Section 3. Based on the analysis it provides final conclusions.
We start with observation (O1). This observation allows us to conclude that stacked generalization can significantly outperform the best expert and the majority vote of the experts in terms of generalization performance. This implies that the classification problem we defined and the approach we proposed are indeed useful.
Observation (O2a) is a well-known fact in stacked generalization [1]. However, in the context of this paper it has an additional meaning. More precisely, we can state that for
our classification problem we only need to know the class estimates of the experts in order to obtain the best accuracy rates. The input from the application domain (in our case the English text) is less important. In addition, we note that according to observations (O3) and (O4), using the expert class estimates only also implies a lower computational cost.
Observation (O2b) is an expected result in the context of feature selection. However, it also has a practical implication for our classification problem, namely that it allows choosing a combination of the most adequate experts. In our experiments, for example, only half of the experts were chosen to maximize the accuracy. This means that we can reduce the number of human experts and thus the overall financial cost. Of course, this comes at a price: an increase in computational cost, according to observation (O3).
Future research will focus on the problem of the evolution of human experts. Indeed, in real life the experts change due to many factors (e.g., training, ageing, etc.). Solving this problem will have a high practical impact. For that purpose we plan to apply techniques from concept drift [8] and transfer learning [3].
References
1. S. Dzeroski and B. Zenko. Is combining classifiers better than selecting the best one. In
Proceedings of the Nineteenth International Conference on Machine Learning, pages 123–
130, 2002.
2. I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh. Feature Extraction, Foundations and Applications. Physica-Verlag, Springer, 2006.
3. Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Trans. Knowl. Data
Eng., 22(10):1345–1359, 2010.
4. A. Quinn and B. Bederson. Human computation: a survey and taxonomy of a growing field. In
Proceedings of the International Conference on Human Factors in Computing Systems, CHI
2011, pages 1403–1412. ACM, 2011.
5. V. Raykar, S. Yu, L. Zhao, G. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from
crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.
6. D. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
7. Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo Valadez, L. Bogoni, L. Moy, and
J. Dy. Modeling annotator expertise: Learning when everybody knows a bit of something.
Journal of Machine Learning Research, 9:932–939, 2010.
8. I. Zliobaite. Learning under Concept Drift: an Overview. Technical report, Faculty of Mathematics and Informatics, Vilnius University, Vilnius, Lithuania, 2009.