Solving Complex Machine Learning Problems with Ensemble Methods
ECML/PKDD 2013 Workshop, Prague, 27 September 2013

Edited by Ioannis Katakis, Daniel Hernández-Lobato, Gonzalo Martínez-Muñoz and Ioannis Partalas
National and Kapodistrian University of Athens / Universidad Autónoma de Madrid / Université Joseph Fourier, Grenoble

COPEM 2013

Preface

Ensemble methods have experienced huge growth within the machine learning community as a consequence of their generalization performance and robustness. In particular, ensemble learning provides a straightforward way of generally improving the performance of single learners. Thus, multiple classifier systems represent a solution worth considering in applications where high predictive performance is strictly required.

The emphasis of the COPEM workshop was to discuss ensemble strategies that not only focus on supervised classification, but can also be used to solve difficult and general machine learning problems. The workshop brought together members of the ensemble methods community as well as researchers from other fields who could benefit from using such techniques to address interesting research challenges. More precisely, the goals of the COPEM workshop that were successfully achieved include: a) the discussion of state-of-the-art approaches that exploit ensembles to solve complex machine learning problems, and b) bringing the community together to exchange views and opinions on future research lines and applications in ensemble learning, and to initiate new collaborations towards new challenges.

COPEM was held in Prague, Czech Republic, on September 27th, 2013 as a workshop of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013). A total of 22 candidate papers were submitted for evaluation. Paper submissions went through an exhaustive peer-review process: all papers received at least two independent reviews, and some even three.
From the total number of submitted papers, 11 were accepted for presentation at the workshop. In order to produce an interesting scientific program, the organizing committee selected the accepted papers by taking into consideration both the comments and the numerical scores provided by the reviewers, giving special attention to borderline papers. The review process was handled by 17 members of the program committee and, additionally, 9 external reviewers.

The program of the workshop was divided into four sessions of 1 hour and 30 minutes each. Each session comprised at most 4 presentations, with a total time of 22 minutes per presentation. Furthermore, the last session included time for a discussion about potential uses of ensemble methods to address difficult machine learning problems and for the conclusions of the workshop. Lastly, the program featured an invited talk by Prof. Pierre Dupont (Université catholique de Louvain), which focused on the use of ensemble methods for the robust identification of biomarkers.

A topic analyzed with interest in the COPEM workshop was the use of ensemble methods to identify relevant features or attributes for prediction. The learning problems considered for this task included supervised, semi-supervised and multi-label learning. Additionally, traditional methods for identifying relevant features using ensembles were revisited to further explore their statistical properties and their stability in the feature selection process. The utility of ensemble methods for other difficult learning tasks, such as software reliability prediction or image classification, was also covered. Finally, different meta-learners were analyzed as potential solutions for combining several human expert opinions. These methods were shown to improve prediction performance over the best single human expert.
The second stream of papers covered many interesting topics and applications related to ensemble methods. One interesting application presented was anomaly detection; in particular, multiple weak detectors can be combined to meet the various requirements of this application (e.g., low training complexity, on-line training, detection accuracy). A new generalization of bagging, called Neighbourhood Balanced Bagging, was introduced, in which the sampling probabilities of examples are modified according to the class distribution in their neighbourhoods. Prototype Support Vector Machines is a new approach that trains an ensemble of linear SVMs tuned to different regions of the feature space. Additionally, an empirical comparison of supervised ensemble learning approaches was presented, including well-known methods such as Boosting, Bagging and Random Forests. Finally, we had a presentation on clustering ensembles, which is probably the least discussed topic in the ensemble literature.

Acknowledgements

We would like to thank all the authors who submitted papers to the workshop as well as all the participants. Additionally, we would like to sincerely thank the invited speaker, Prof. Pierre Dupont. We are also grateful to the program committee members and the external reviewers for their indispensable help in providing high-quality reviews in a very short period of time. We also acknowledge the ECML/PKDD workshop chairs, Andrea Passerini and Niels Landwehr, for their help and excellent cooperation. Ioannis Katakis is funded by the EU INSIGHT project (FP7-ICT 318225). Daniel Hernández-Lobato and Gonzalo Martínez-Muñoz acknowledge financial support from the Spanish Dirección General de Investigación, project TIN2010-21575-C02-02.
Prague, September 2013

Ioannis Katakis, Daniel Hernández-Lobato, Gonzalo Martínez-Muñoz and Ioannis Partalas

Workshop Organization

Workshop Organizers

Ioannis Katakis, National and Kapodistrian University of Athens (Greece)
Daniel Hernández-Lobato, Universidad Autónoma de Madrid (Spain)
Gonzalo Martínez-Muñoz, Universidad Autónoma de Madrid (Spain)
Ioannis Partalas, Université Joseph Fourier (France)

Program Committee

Massih-Reza Amini, Université Joseph Fourier (France)
Alberto Suárez, Universidad Autónoma de Madrid (Spain)
José Miguel Hernández-Lobato, University of Cambridge (United Kingdom)
Christian Steinruecken, University of Cambridge (United Kingdom)
Luis Fernando Lago, Universidad Autónoma de Madrid (Spain)
Jérôme Paul, Université catholique de Louvain (Belgium)
Grigorios Tsoumakas, Aristotle University of Thessaloniki (Greece)
Eric Gaussier, Université Joseph Fourier (France)
Alexandre Aussem, University Claude Bernard Lyon 1 (France)
Lior Rokach, Ben-Gurion University of the Negev (Israel)
Dimitrios Gunopulos, National and Kapodistrian University of Athens (Greece)
Ana M. González, Universidad Autónoma de Madrid (Spain)
Johannes Fürnkranz, TU Darmstadt (Germany)
Indre Zliobaite, Aalto University (Finland)
José Dorronsoro, Universidad Autónoma de Madrid (Spain)
Rohit Babbar, Université Joseph Fourier (France)
Jesse Read, Universidad Carlos III de Madrid (Spain)

External Reviewers

Aris Kosmopoulos, NCSR "Demokritos" (Greece)
Antonia Saravanou, National and Kapodistrian University of Athens (Greece)
Bartosz Krawczyk, Wroclaw University of Technology (Poland)
Newton Spolaôr, Aristotle University of Thessaloniki (Greece)
Nikolas Zygouras, National and Kapodistrian University of Athens (Greece)
Dimitrios Kotsakos, National and Kapodistrian University of Athens (Greece)
George Tzanis, Aristotle University of Thessaloniki (Greece)
Dimitris Kotzias, National and Kapodistrian University of Athens (Greece)
Efi Papatheocharous, Swedish Institute of Computer Science (Sweden)

Sponsors

Contents
1 Invited talk: Robust biomarker identification with ensemble feature selection methods . . . 9
  Pierre Dupont
2 Local Neighbourhood in Generalizing Bagging for Imbalanced Data . . . 10
  Jerzy Błaszczyński, Jerzy Stefanowski and Marcin Szajek
3 Anomaly Detection by Bagging . . . 25
  Tomáš Pevný
4 Efficient semi-supervised feature selection by an ensemble approach . . . 41
  Mohammed Hindawi, Haytham Elghazel and Khalid Benabdeslem
5 Feature ranking for multi-label classification using predictive clustering trees . . . 56
  Dragi Kocev, Ivica Slavkov and Sašo Džeroski
6 Identification of Statistically Significant Features from Random Forests . . . 69
  Jérôme Paul, Michel Verleysen and Pierre Dupont
7 Prototype Support Vector Machines: Supervised Classification in Complex Datasets . . . 81
  April Shen and Andrea Danyluk
8 Software Reliability prediction via two different implementations of Bayesian model averaging . . . 95
  Alex Sarishvili and Gerrit Hanselmann
9 Multi-Space Learning for Image Classification Using AdaBoost and Markov Random Fields . . . 110
  Wenrong Zeng, Xue-Wen Chen, Hong Cheng and Jing Hua
10 An Empirical Comparison of Supervised Ensemble Learning Approaches . . . 123
  Mohamed Bibimoune, Haytham Elghazel and Alex Aussem
11 Clustering Ensemble on Reduced Search Spaces . . . 139
  Sandro Vega-Pons and Paolo Avesani
12 An Ensemble Approach to Combining Expert Opinions . . .
153
  Hua Zhang, Evgueni Smirnov, Nikolay Nikolaev, Georgi Nalbantov and Ralf Peeters

COPEM ECML-PKDD 2013 workshop proceedings, Solving Complex Machine Learning Problems with Ensemble Methods, Prague, Czech Republic, 27 September 2013.

Robust Biomarker Identification with Ensemble Feature Selection Methods (Invited Talk)

Pierre Dupont
Université catholique de Louvain, Belgium
[email protected]

Abstract. Biomarker identification is an important topic in biomedical applications of computational biology, including applications such as gene selection from high-dimensional data produced by microarray or RNA-seq technologies. From a machine learning and statistical viewpoint, such identification is a feature selection problem, typically with thousands of potentially relevant dimensions and only a few dozen samples. In such a context, the lack of stability of the feature selection process is often problematic, as the list of selected biomarkers may vary largely under only marginal fluctuations of the data samples. We describe in this talk simple yet effective ways to increase the robustness of such selection through ensemble methods and various aggregation mechanisms that build a consensus list from an ensemble of lists. The increased robustness is key for subsequent biological validation of the selected markers. We first describe selection methods that are embedded in the estimation of an ensemble of support vector machines (SVMs). SVMs are powerful classification models that have shown state-of-the-art performance on several diagnosis and prognosis tasks on biological data. Their feature selection extensions also offer good results for gene selection tasks. We show that the robustness of SVMs for biomarker discovery can be substantially increased by using ensemble feature selection techniques, while at the same time improving their classification performance. We also briefly discuss some alternative ensemble mechanisms extending univariate methods.
These methods are simpler and less computationally intensive than embedded approaches. When based on a simple statistical test, such as a paired t-test or a Wilcoxon rank test, they also offer a convenient way to weight each candidate feature by its associated p-value and to construct the consensus list accordingly. We conclude the talk by stressing the well-known risk of selection bias and how such risk can be limited through appropriate estimation procedures.

Local Neighbourhood in Generalizing Bagging for Imbalanced Data

Jerzy Błaszczyński, Jerzy Stefanowski and Marcin Szajek
Institute of Computing Sciences, Poznań University of Technology, 60-965 Poznań, Poland
{jerzy.blaszczynski, jerzy.stefanowski, marcin.szajek}@cs.put.poznan.pl

Abstract. Bagging ensembles specialized for class-imbalanced data are considered. We show that difficult distributions of the minority class can be handled by analyzing the content of the local neighbourhood of examples. First, we introduce a new generalization of bagging, called Neighbourhood Balanced Bagging, where the sampling probabilities of examples are modified according to the class distribution in their neighbourhoods. Experiments show that it is competitive with other extensions of bagging. Finally, we demonstrate that assessing the types of minority examples based on the analysis of their neighbourhoods can help explain why some ensembles work better for imbalanced data than others.

1 Introduction

One of the sources of difficulty in constructing accurate classifiers is class imbalance in the data. This difficulty manifests itself in the fact that one of the target classes is significantly less numerous than the other classes. This problem occurs in many applications and constitutes a difficulty for most learning algorithms.
As a result, many classifiers are biased toward the majority classes and fail to recognize examples from the minority class. The class imbalance problem has received growing research interest in the last decade and a number of specialized methods have been proposed; see the review in [7]. In general, they are categorized into data-level and algorithm-level ones. Methods from the first category are based on pre-processing and transform the original data distribution into a more balanced one. The simplest methods are random over-sampling, which replicates examples from the minority class, and random under-sampling, which randomly eliminates examples from the majority classes until a required degree of balance between classes is reached. More informed methods, e.g., SMOTE, introduce additional synthetic examples according to the internal characteristics of regions around examples from the minority class [4]. The methods from the second category are classifier-dependent. They also include ensembles of classifiers. However, standard methods of ensemble construction are oriented toward improving overall classification accuracy and do not sufficiently address recognition of the minority class. The newly proposed ensembles usually either integrate pre-processing methods before learning component classifiers or embed a cost-sensitive framework in the ensemble learning process; see the review in [5]. Although several specialized ensembles have been presented as adequate for class imbalance, there is still a lack of general comparison between them or discussion of their areas of competence. To the best of our knowledge, only two comprehensive studies have been carried out, in different experimental frameworks [5, 10].
The main conclusions from the comparative study [5] were that simpler versions using under-sampling or SMOTE inside ensembles worked better than more complex solutions. Moreover, experiments with a few of the best boosting and bagging generalizations over noisy and imbalanced data sets showed that bagging outperformed boosting. In our previous study we experimentally compared the main bagging variants for class imbalance and also observed that under-sampling bagging performed much better than variants with over-sampling [2]. In particular, Roughly Balanced Bagging [8] achieved the best results. We keep our interest in bagging extensions and identify two tasks to be undertaken. The first task is to look for a hypothesis as to why under-sampling bagging works better than over-sampling variants. The other task concerns an attempt to construct yet another ensemble, more similar to over-sampling the minority class, with performance closer to that of Roughly Balanced Bagging. Our new idea is to depart from the simple integration of pre-processing with the unchanged bagging sampling technique. Instead of using equal probabilities for each example in bootstrap sampling, we change the probabilities of drawing examples in order to focus the sampling on the minority class, and additionally on examples located in difficult sub-regions of the minority class. When considering the probability of each example being drawn, we propose to analyze class distributions in the local neighbourhood of the minority example [13]. Depending on the distribution of examples from the majority class in this neighbourhood, we can evaluate whether the example is safe or unsafe (difficult) to learn. This approach is inspired by our earlier positive experience with studying the "nature" of imbalanced data, where such local characteristics were successfully modeled with the k-nearest neighbourhood [13]. To sum up, the main contributions of our study are the following.
The first aim is to introduce a new extension of bagging for class imbalance, where the probability of selecting an example into a bootstrap sample is influenced by the analysis of the class distribution in the local neighbourhood of the example. The new proposal is compared against existing extensions over several data sets. The second aim is to use the same type of analysis to explain how the contents of bootstrap samples affect the performance of Roughly Balanced Bagging and the proposed new extension.

2 Related Works

Due to space limits, we briefly discuss only the most related works. The reader is referred to [5] for the most comprehensive review of current ensembles addressing class imbalance. Below we discuss extensions of bagging.

Recall that original Breiman's bagging [3] is based on bootstrap aggregation, where the training set for each classifier is constructed by uniform random sampling (with replacement) of instances from the original training set (usually keeping the size of the original data). Then, T component classifiers are induced by the same learning algorithm from these T bootstrap samples. Their predictions form the final decision via equal-weight majority voting. However, such bootstrap samples are still biased toward the majority class. Most proposals overcome this drawback by using pre-processing techniques, which change the balance between classes in the bootstraps. In Underbagging approaches the number of majority class examples in each bootstrap sample is randomly reduced to the cardinality of the minority class (Nmin).
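The two bootstrap schemes just described can be sketched as follows (a minimal Python illustration of the general idea, not the authors' implementation; `examples`, `minority` and `majority` stand for any lists of training instances):

```python
import random

def bagging_bootstrap(examples):
    """Standard bagging: uniform sampling with replacement,
    keeping the size of the original training set."""
    return [random.choice(examples) for _ in range(len(examples))]

def underbagging_bootstrap(minority, majority):
    """Underbagging: the majority part of each bootstrap sample is
    randomly reduced to the cardinality of the minority class (N_min)."""
    return list(minority) + random.sample(majority, len(minority))
```

Each of the T component classifiers is then trained on one such sample, and their predictions are combined by majority voting.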
In the simplest proposal, Exactly Balanced Bagging (EBBag), while creating each training bootstrap sample the entire minority class is simply copied and combined with a randomly chosen subset of the majority class to exactly balance the cardinalities of the classes. The base classifiers and their aggregation are constructed as in standard bagging. Roughly Balanced Bagging (RBBag) [8] results from a critique of EBBag. Instead of fixing a constant sample size, it equalizes the sampling probability of each class. For each of the T iterations, the size of the majority class part of the final bootstrap sample (Smaj) is determined probabilistically according to the negative binomial distribution. Then, Nmin examples are drawn from the minority class and Smaj examples are drawn from the entire majority class using bootstrap sampling as in standard bagging (with or without replacement). The class distribution inside the bootstrap samples may be slightly imbalanced and varies over iterations. According to [8] this approach is more consistent with the nature of the original bagging and performs better than EBBag. In our experiments on a larger collection of data [2], both RBBag and EBBag achieved quite similar results for the sensitivity measure, while RBBag was slightly better than EBBag for G-mean and the F-measure. Another way to transform bootstrap samples is to over-sample the minority class before training the classifiers. In this way the number of minority examples is increased in each bootstrap sample, while the majority class is not reduced as in underbagging. This idea is realized with different over-sampling techniques; we present the two approaches used further in the experiments. Overbagging is the simplest version, which applies over-sampling to transform each bootstrap sample: minority class examples are sampled with replacement until their number exactly balances the cardinality of the majority class.
Majority examples are sampled with replacement as in the original bagging. Another approach, used in SMOTEBagging, aims to increase the diversity of the component classifiers. First, SMOTE is used instead of random over-sampling of the minority class. Then, the SMOTE resampling rate (α) is changed stepwise in each iteration from small to high values (e.g., from 10% to 100%). This ratio defines the number of minority examples (α × Nmin) to be additionally re-sampled in each iteration. A quite similar trick is also used to construct bootstrap samples in the "from underbagging to overbagging" ensemble.

3 Neighbourhood Balanced Bagging for Imbalanced Data

The proposed extension of bagging stems from results of studying the sources of difficulty in learning from imbalanced classes. A high imbalance ratio between the cardinalities of the minority and majority classes is not the only, and not even the main, reason for these difficulties. Other, as we call them, data factors, which characterize class distributions, are also influential. Experimental studies, e.g. [9], demonstrate that the degradation of classification performance is linked to the decomposition of the minority class into many sub-parts containing very few examples. This means that the minority class does not form a homogeneous, compact distribution of the target concept but is scattered into many smaller sub-clusters surrounded by majority examples (these correspond to small disjuncts, which are harder to learn and contribute more to classification errors than larger sub-concepts). Other factors related to the class distribution (occurring together with class rarity) concern the effect of too strong overlapping between the classes [6] or the presence of too many single minority examples inside the majority class regions [12].
We follow studies such as [11, 12], where the data factors are linked to different types of examples forming the minority class distribution. The authors differentiate between safe and unsafe examples. Safe examples are located in homogeneous regions populated by examples from one class only. Other examples are unsafe and more difficult to learn. Unsafe examples are categorized into borderline examples (placed close to the decision boundary between classes), rare cases (isolated groups of a few examples located deeper inside the opposite class), or outliers. The appropriate treatment of these types of minority examples within pre-processing methods should lead to improved classifiers, e.g., as done by Stefanowski inside the informed pre-processing method SPIDER [14]. The question is how to identify these types of examples. In [13], this is achieved by analyzing the class distribution inside the local neighbourhood of the considered example, which is modeled by its k-nearest neighbours. The distance between examples is calculated according to the HVDM metric (Heterogeneous Value Difference Metric) [16]. Then, the number of neighbours from the opposite class indicates how safe or unsafe the considered example is (see [13] for details). Inspired by the positive results of [14, 13], we exploit the characteristics of the local neighbourhood in a different, quantitative way. The result is a new modification of bagging, called Neighbourhood Balanced Bagging (NBBag). The idea behind NBBag is to focus the sampling process on those minority examples which are hard to learn (i.e., unsafe ones), while at the same time decreasing the probability of selecting examples from the majority class. Recall that the idea of changing sampling probabilities was considered in our previous work on applying bagging to noisy data to improve overall accuracy [1]. Here, we postulate another strategy for changing bootstrap samples.
It is carried out through a conjunction of sampling modifications at two levels: a global and a local one.

At the first, global level, we attempt to increase the chance of drawing minority examples with respect to the imbalance ratio in the original data. We implement this by changing the probability of sampling majority examples. More precisely, we first set p1min, the sampling probability of each minority example, to 1. Then, we downscale p1maj, the sampling probability of a majority example, to Nmin/Nmaj, where Nmin and Nmaj are the numbers of examples in the minority and majority class in the original data, respectively. Intuitively, this corresponds to the situation where the minority and majority classes contain examples of the same type, e.g., safe ones, and the class distributions are not affected by other data factors. Thus, this simple modification of probabilities exploits information about the global between-class imbalance. It should lead to bootstrap samples with approximately globally balanced class cardinalities. However, the experimental studies [2, 5, 10] show that global balancing in overbagging (somewhat similar to our global level) is not competitive with other extensions of bagging. Moreover, most studied imbalanced data sets contain many unsafe minority examples, while the majority classes comprise rather safe ones; see the results in [12]. When focusing more on the local characteristics of the minority class, one should treat the types of unsafe examples differently, as earlier successful experiments with pre-processing methods such as SPIDER [14] or generalizations of SMOTE such as Borderline-SMOTE (see its description in [7]) pointed out that safe minority examples could be less over-sampled than borderline or other unsafe ones.
The second, local level is intended to shift the sampling of minority examples toward those unsafe examples that are harder to learn, as identified by analyzing their k-nearest neighbours. This level can be modeled in different ways, keeping in mind the following rule: the more unsafe the example, the more amplified the probability of drawing it. This is partly inspired by earlier successful experience with informed pre-processing methods. The modification rule could use either a linear or a non-linear function. In this study we use the formula L2min, defined as:

L2min = (N'maj)^ψ / k,   (1)

where N'maj is the number of examples in the neighbourhood which belong to the majority class, and ψ is an exponential scaling factor, which in the default case of a linear modification is set to 1. The value of ψ may be increased if one wants to strengthen the role of rare cases and outliers in the bootstraps. This increase may correspond to data sets where the minority class distribution in the original data is scattered into many rare cases or outliers and the number of safe examples is significantly limited (see exemplary data, e.g. balance-scale, in the further experiments in Section 4). The formula L2min requires re-scaling, as it may lead to a probability equal to 0 for completely safe examples, i.e., for N'maj = 0. We propose to re-formulate it as:

β × (L2min + 1),   (2)

where β is a technical coefficient referring to drawing a completely safe example. Intuitively, safe examples from both the minority and majority classes should have the same probability of being selected into bootstraps. Setting β to 0.5 keeps this intuition. Adding the number "1" corresponds to a normalization of the sampling probabilities inside the conjunctive combination, if one expects that pmin ∈ [0, 1].
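For concreteness, the local-level weight of formulas (1) and (2) can be computed as in the following sketch (our own illustration; the function name, and the assumption that the neighbourhood count is already available, are ours):

```python
def local_weight(n_maj_neighbours, k, psi=1.0, beta=0.5):
    """Local-level amplification: formula (1) rescaled by (2).

    n_maj_neighbours: number of majority-class examples among the k
    nearest neighbours of a minority example (N'maj); psi: exponential
    scaling factor (1 = linear modification); beta: coefficient for a
    completely safe example (default 0.5).
    """
    l2_min = (n_maj_neighbours ** psi) / k   # formula (1)
    return beta * (l2_min + 1)               # rescaling (2)
```

A completely safe minority example (no majority neighbours) thus gets weight 0.5, matching a safe majority example, while an outlier with all k neighbours from the majority class gets weight 1.0 for ψ = 1.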
Then, we hypothesize that examples from the majority class are, by default, not balanced at the second level, which is reflected by L2maj = 0. The intuition behind this hypothesis is that examples from the majority class are more likely to be safe. Even when this is false for some data, it is still quite apparent that amplifying majority rare or outlying examples at this level would increase the difficulty of learning classifiers from the minority class interiors disrupted by them. Finally, the local and global levels are combined by multiplication. This combination corresponds to an independence assumption, i.e., the distribution of examples in the neighbourhood is independent of the global distribution of examples in the whole data set. This leads us to the final formulations of the probabilities of selecting minority and majority examples, respectively:

pmin = p1min × β(L2min + 1) = p1min × 0.5(L2min + 1) = 0.5(L2min + 1),   (3)

pmaj = p1maj × β(L2maj + 1) = p1maj × 0.5 = (Nmin/Nmaj) × 0.5,   (4)

resulting from L2maj = 0 and the default β set to 0.5.

4 Experiments

The first part of the experiments is an evaluation of the newly proposed NBBag, while the others concern using local neighbourhood analysis to assess the types of examples and to study the contents of bootstrap samples.

4.1 Evaluation of Bagging Extensions

First, we compare the performance of NBBag with existing extensions of bagging. As a baseline for this comparison we use balanced bagging (BBag), i.e., a variant which attempts to globally balance the cardinalities of the majority and minority classes in bootstrap samples (this is achieved by using only the first, "global" level of NBBag, decreasing the probability of the majority examples according to the imbalance ratio). Following our earlier study [2], we chose Roughly Balanced Bagging (RBBag) as the best under-sampling extension. As our approach is more similar to over-sampling, we also consider Overbagging (OverBag) and SMOTEBagging (SMOBag).
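Combining both levels, drawing one bootstrap sample according to probabilities (3) and (4) can be sketched as below (our own simplified illustration, not the paper's code; it assumes the majority-neighbour counts for the minority examples have already been computed, e.g. with HVDM-based k-NN):

```python
import random

def nbbag_bootstrap(minority, majority, maj_neighbour_counts,
                    k=7, psi=1.0, beta=0.5):
    """Draw one Neighbourhood Balanced bootstrap sample.

    maj_neighbour_counts[i]: number of majority-class examples among
    the k nearest neighbours of minority example i. Minority weights
    follow formula (3); majority weights follow (4), i.e. L2maj = 0.
    """
    n_min, n_maj = len(minority), len(majority)
    w_min = [beta * ((m ** psi) / k + 1) for m in maj_neighbour_counts]  # (3)
    w_maj = [(n_min / n_maj) * beta] * n_maj                             # (4)
    population = list(minority) + list(majority)
    # random.choices draws with replacement, proportionally to weights
    return random.choices(population, weights=w_min + w_maj, k=n_min + n_maj)
```

Note that `random.choices` does not require the weights to sum to 1, so the unnormalized values of (3) and (4) can be used directly.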
All implementations were done in the WEKA framework. Component classifiers in all bagging variants are learned with the C4.5 tree learning algorithm (J4.8), using standard parameters except that pruning is disabled. For all bagging variants, we tested the following numbers T of component classifiers: 20, 50 and 100. Due to space limits, we present detailed results for T = 50 only; results for other T lead to similar general conclusions. For SMOBag, we used 5 neighbours and the oversampling ratio α was changed stepwise in each sample starting from 10%. In NBBag we tested different sizes of the neighbourhood, with k = 5, 7, 9 and 11. The best values depend on the particular data set; however, using 7 neighbours is best in the case of the linear amplification (ψ = 1). This option is further denoted as 7NBBag. As the minority classes in some data sets are composed mainly of rare examples and outliers, we also considered the other variant of increasing the selection of examples, with ψ = 2. For this variant the best results are obtained with a slightly smaller number of neighbours, equal to 5, so it will be denoted as 5NBBag2.

We conduct our analysis on 20 real-world data sets representing different domains, sizes and imbalance ratios. Most of the data sets come from the UCI repository and have been used in other works on class imbalance. Two data sets, abdominal pain and scrotal pain, come from our medical applications. For data sets with more than two classes, we chose the smallest one as the minority class and combined the other classes into one majority class. Their characteristics are presented in Table 1, where IR is the imbalance ratio defined as Nmaj/Nmin.

Table 1.
Data characteristics

Data set         # examples   # attributes   Minority class      IR
abdominal pain   723          13             positive            2.58
balance-scale    625          4              B                   11.76
breast-cancer    286          9              recurrence-events   2.36
breast-w         699          9              malignant           1.90
car              1728         6              good                24.04
cleveland        303          13             3                   7.66
cmc              1473         9              2                   3.42
credit-g         1000         20             bad                 2.33
ecoli            336          7              imU                 8.60
flags            194          29             white               10.41
haberman         306          4              2                   2.78
hepatitis        155          19             1                   3.84
ionosphere       351          34             b                   1.79
new-thyroid      215          5              2                   5.14
postoperative    90           8              S                   2.75
scrotal pain     201          13             positive            2.41
solar-flareF     1066         12             F                   23.79
transfusion      748          4              1                   3.20
vehicle          846          18             van                 3.25
yeast-ME2        1484         8              ME2                 28.10

The performance of bagging ensembles is measured using: sensitivity of the minority class (the minority class accuracy), its specificity (the accuracy of recognizing the majority classes), their aggregation into the geometric mean (G-mean), and the F-measure (referring to the minority class, with equal weights "1" assigned to precision and recall). For their definitions see, e.g., [7]. These measures are estimated with stratified 10-fold cross-validation repeated several times to reduce the variance. The average values of G-mean and sensitivity are presented in Tables 2 and 3, respectively. The differences between classifier average results are also analyzed using either the Friedman or the Wilcoxon statistical test (with a standard significance level of 0.05). In these tables the last row contains average ranks calculated as in the Friedman test – the lower the average rank, the better the classifier.

Table 2. G-mean [%] of compared bagging ensembles

Data set         BBag    SMOBag   OverBag   RBBag   7NBBag   5NBBag2
abdominal pain   79.04   80.85    79.44     79.99   80.26    80.82
balance-scale    19.74   0.00     1.40      58.12   47.36    61.07
breast-cancer    60.60   52.57    56.17     58.62   59.32    56.53
breast-w         96.11   95.88    96.23     96.13   96.21    96.14
car              96.21   95.26    95.29     97.09   96.80    96.98
cleveland        51.02   25.03    22.77     72.14   58.06    65.75
cmc              61.12   57.74    59.95     64.86   62.81    64.33
credit-g         65.87   80.68    71.75     86.89   66.48    66.94
ecoli            84.41   58.38    51.42     72.91   86.52    86.74
flags            61.16   62.48    64.30     65.84   61.46    61.46
haberman         61.22   60.02    58.11     65.92   62.24    48.65
hepatitis        74.92   68.47    72.16     80.23   78.19    75.33
ionosphere       90.88   90.30    90.47     90.25   90.76    89.95
new-thyroid      95.35   95.18    95.36     97.15   96.73    97.02
pima             74.51   72.33    73.54     74.59   74.18    72.30
scrotal pain     73.41   70.42    72.01     74.17   72.29    71.42
solar-flareF     64.97   55.04    58.07     84.91   66.80    71.13
transfusion      67.33   63.96    64.83     67.39   64.98    39.56
vehicle          95.13   94.34    94.61     94.77   95.49    95.91
yeast-ME2        63.59   59.41    59.70     84.37   69.35    74.86
avg. rank        3.65    5.0      4.35      1.95    2.83     3.23

The results of the Friedman tests (with CD = 1.69) reveal that, for both G-mean and sensitivity, RBBag, 7NBBag and 5NBBag2 are significantly better than the rest of the classifiers, with no significant difference among them. Still, we can give some more detailed observations. For G-mean, RBBag is the best classifier according to the average ranks (see Table 2). It is also significantly better than several classifiers according to the Wilcoxon test, although its difference from 7NBBag is not significant. The worst classifier with respect to G-mean is SMOBag. Although it is a more complex approach using the informed SMOTE method, one can notice that the much simpler overbagging or BBag give better evaluation measures.

Analyzing the recognition of the minority examples, i.e. the sensitivity measure in Table 3, the best performing classifier with respect to the average ranks is 5NBBag2.

Table 3. Sensitivity [%] of compared bagging ensembles

Data set         BBag    SMOBag   OverBag   RBBag    7NBBag   5NBBag2
abdominal pain   75.54   71.57    74.22     80.99    78.51    80.99
balance-scale    4.90    0.00     0.67      65.67    35.51    72.45
breast-cancer    54.12   34.35    44.91     59.81    59.88    66.71
breast-w         96.02   95.02    95.98     96.41    96.47    96.35
car              94.20   92.54    92.62     100.00   95.36    95.80
cleveland        29.14   17.22    16.11     77.22    39.71    54.57
cmc              50.93   40.05    46.47     66.80    57.12    66.61
credit-g         61.27   71.67    60.83     90.28    67.53    73.93
ecoli            77.14   55.00    66.67     78.33    82.00    84.29
flags            82.35   45.89    52.89     68.56    82.94    82.94
haberman         56.30   49.81    49.86     61.34    69.51    87.28
hepatitis        65.62   54.44    62.78     84.17    73.44    69.38
ionosphere       85.56   83.70    84.70     86.00    87.38    87.94
new-thyroid      92.57   92.22    93.06     97.50    95.43    96.00
pima             74.70   65.13    67.38     78.09    80.56    85.07
scrotal pain     69.83   58.56    65.89     75.78    70.34    71.86
solar-flareF     47.44   37.33    42.17     87.33    50.70    58.84
transfusion      61.46   51.53    56.54     69.83    72.64    92.08
vehicle          94.77   92.14    93.46     96.48    95.63    96.48
yeast-ME2        41.57   39.11    39.11     91.56    49.80    59.22
avg. rank        4.05    5.78     5.08      1.95     2.52     1.62

However, according to the Wilcoxon test, its difference from RBBag is not significant. Again, the worst classifier in this comparison is SMOBag. We do not show values of the F-measure, due to space limits. Nevertheless, these results indicate, similarly to the results for G-mean, that RBBag, 7NBBag and 5NBBag2 are better than the other classifiers, with no significant difference among them (e.g., the Wilcoxon test p-value is 0.3 when comparing the pair of best classifiers, RBBag and 7NBBag). Looking more closely at the results in Tables 2 and 3, one can also notice that classifiers leading to high improvements of sensitivity also strongly deteriorate G-mean at the same time (meaning that the recognition of the majority class is much worse).
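For reference, the measures used in these tables can be computed directly from the minority-class confusion counts. The sketch below is ours (function name and toy counts are illustrative); tp/fn refer to the minority class.

```python
import math

def imbalance_measures(tp, fn, tn, fp):
    """Sensitivity, specificity, G-mean and F-measure (F1) from
    confusion counts, with tp/fn referring to the minority class."""
    sensitivity = tp / (tp + fn)          # minority class accuracy (recall)
    specificity = tn / (tn + fp)          # accuracy on the majority classes
    g_mean = math.sqrt(sensitivity * specificity)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, g_mean, f1

# Toy example: 50 minority and 100 majority test examples.
sens, spec, g, f1 = imbalance_measures(tp=40, fn=10, tn=90, fp=10)
```

In the cross-validation above, these quantities would be averaged over folds and repetitions.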
For example, see the transfusion data set, which contains many outliers (see Table 4): using ψ = 2 in the variant 5NBBag2 leads to the highest sensitivity (92.08%) and the worst G-mean (39.56%) among all compared classifiers. A similar trade-off also occurs for, e.g., the haberman data set, and in the case of RBBag for, e.g., car. The linear amplification of the local probability of the minority class (ψ = 1) is the more conservative approach, and it could be used if one wants to improve sensitivity while still keeping the accuracy on the majority classes at a sufficient level. However, tuning other intermediate ψ values between 1 and 2 could be the topic of further experiments.

Finally, notice that using the imbalance ratio to globally balance classes in bootstrap samples is not sufficient. Consider the results of BBag, which works similarly to over-bagging. Taking into account information about the local neighbourhood of minority examples improves classification performance with respect to all evaluation measures. To conclude, the introduction of local modifications of sampling probabilities inside the combination rule of NBBag may be the crucial element leading to the significantly better performance of these ensembles compared to all overbagging variants, as well as to making NBBag competitive with RBBag.

4.2 Analyzing Data Characteristics and Bootstrap Samples

The aim of this part of the experiments is to learn more about the nature of the best bagging extensions. First, we want to study class data characteristics in the considered data sets and to identify types of examples (recall their distinction in Section 3). Following the method introduced in [13], we propose to assign types of examples using information about the class labels in their k-nearest local neighbourhood.
In this analysis we use k = 5, because k = 3 may poorly distinguish the nature of examples, while k = 7 has led to quite similar decisions [13]. This choice is also similar to the size of the neighbourhood used in NBBag. For a considered example x and k = 5, the proportion of the number of neighbours from the same class as x to the number of neighbours from the opposite class can range from 5:0 (all neighbours are from the same class as the analyzed example x) to 0:5 (all neighbours belong to the opposite class). Depending on this proportion, we assign a type label to the example x in the following way [13]: proportions 5:0 or 4:1 inside the neighbourhood – the example x is labeled as a safe example (as it is surrounded by examples from the same class); 3:2 or 2:3 – it is a borderline example; 1:4 – it is interpreted as a rare case; 0:5 – it is an outlier. For higher values of k such proportions can be interpreted in a similar way.

The results of such labeling of the minority class examples are presented in Table 4. The first observation is that many data sets contain a rather small number of safe examples. The exceptions are three data sets composed of almost only safe examples: breast-w, car, and flags. On the other hand, there are data sets such as cleveland, balance-scale or solar-flareF which do not contain any safe examples. We carried out a similar neighbourhood analysis for the majority classes and made the contrary observation – nearly all data sets contain mainly safe majority examples (e.g. yeast-ME2: 98.5%, ecoli: 91.7%) and sometimes a limited number of borderline examples (e.g. balance-scale: 84.5% safe and 15.6% borderline examples). What is even more important, nearly all data sets contain no majority outliers and at most 2% of rare examples. Thus, we can repeat conclusions similar to [13], saying that in most data sets the minority class consists mainly of difficult, unsafe examples.
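The labeling rule above can be sketched in a few lines. This is a simplified illustration: it uses plain Euclidean distance and a brute-force neighbour search (our choices), whereas mixed-attribute data would call for a heterogeneous metric such as HVDM [16].

```python
import numpy as np

def label_types(X, y, k=5):
    """Assign safe/borderline/rare/outlier labels to every example
    from the class labels of its k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    types = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                    # exclude the example itself
        nn = np.argsort(d)[:k]
        same = int(np.sum(y[nn] == y[i]))  # neighbours from x's own class
        if same >= k - 1:
            types.append("safe")         # proportions 5:0 or 4:1
        elif same >= k - 3:
            types.append("borderline")   # 3:2 or 2:3
        elif same == 1:
            types.append("rare")         # 1:4
        else:
            types.append("outlier")      # 0:5
    return types
```

Applying this to the minority class of each data set yields the percentages reported in Table 4.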
One can then observe that for safe data sets nearly all bagging extensions achieve similar, high performance (see Tables 2–3 for breast-w and new-thyroid). A quite similar observation concerns data sets with a still high number of safe examples, a limited number of borderline ones, and no or nearly no rare cases or outliers – see, e.g., vehicle. On the other hand, strong differences between classifiers occur for the most difficult data distributions with a limited number of safe minority examples.

Table 4. Labeling of minority examples, expressed as the percentage of each type of example occurring in this class

Data set         Safe     Border   Rare    Outlier
abdominal pain   61.39    23.76    6.93    7.92
balance-scale    0.00     0.00     8.16    91.84
breast-cancer    21.18    38.82    27.06   12.94
breast-w         91.29    7.88     0.00    0.83
car              47.83    47.83    0.00    4.35
cleveland        0.00     45.71    8.57    45.71
cmc              13.81    53.15    14.41   18.62
credit-g         15.67    61.33    12.33   10.67
ecoli            28.57    54.29    2.86    14.29
flags            100.00   0.00     0.00    0.00
haberman         4.94     61.73    18.52   14.81
hepatitis        18.75    62.50    6.25    12.50
ionosphere       44.44    30.95    11.90   12.70
new-thyroid      68.57    31.43    0.00    0.00
pima             29.85    56.34    5.22    8.58
scrotal pain     50.85    33.90    10.17   5.08
solar-flareF     2.33     41.86    16.28   39.53
transfusion      18.54    47.19    11.24   23.03
vehicle          74.37    24.62    0.00    1.01
yeast-ME2        5.88     47.06    7.84    39.22

Furthermore, the best improvements of all evaluation measures for RBBag or NBBag are observed for the unsafe data sets. For instance, consider cleveland (no safe examples, nearly 50% of outliers), where RBBag achieves 72% G-mean compared to overbagging with 22.7%. Similarly high improvements occur for balance-scale (containing the highest number of outliers among all data sets), where NBBag achieves 61.07% while OverBag achieves 1.4%. Similar situations also occur for yeast-ME2, ecoli, haberman and solar-flareF.
We can conclude that RBBag and NBBag strongly outperform the other bagging extensions on the most difficult data sets, with large numbers of outliers or rare cases – sometimes occurring together with borderline examples.

In order to better understand the improvements achieved by RBBag and NBBag, we perform the same neighbourhood analysis and label the types of minority examples inside their bootstrap samples. For each bootstrap sample we label the types of minority examples based on the class labels of their k-nearest neighbours. Then, we average the results over all bootstraps. The results of this labeling are presented in Table 5, with two rows per data set, one for each classifier. Due to space limits, we present results for 7NBBag only and skip some safe data sets where there are no big changes of distributions between both variants of NBBag. In our opinion these results reveal very interesting properties of both ensembles. Comparing Tables 4 and 5, notice that RBBag and NBBag strongly shift the types of the minority class distributions towards safer ones inside their bootstraps. For many data sets which originally contain high numbers of rare cases

Table 5.
Distributions of types of the minority examples [in %] inside bootstrap samples, for each classifier and data set

Data set         Classifier   Safe     Border   Rare    Outlier
abdominal pain   NBBag        65.14    21.5     7.41    5.96
                 RBBag        72.60    19.15    5.11    3.15
balance-scale    NBBag        59.23    20.52    5.63    14.62
                 RBBag        39.68    59.02    0.05    1.25
breast-cancer    NBBag        37.55    43.51    11.39   7.54
                 RBBag        35.56    52.82    7.52    4.10
breast-w         NBBag        93.44    6.04     0.40    0.12
                 RBBag        93.57    5.60     0.29    0.54
cleveland        NBBag        64.58    17.59    7.07    10.76
                 RBBag        42.86    53.33    0.44    3.37
cmc              NBBag        38.33    41.14    10.74   9.79
                 RBBag        42.47    50.96    3.13    3.44
credit-g         NBBag        36.07    49.21    7.47    7.26
                 RBBag        34.44    59.58    2.79    3.19
ecoli            NBBag        81.61    7.76     3.96    6.67
                 RBBag        85.33    11.75    0.00    2.92
flags            NBBag        100.00   0.00     0.00    0.00
                 RBBag        100.00   0.00     0.00    0.00
haberman         NBBag        33.37    46.82    9.63    10.19
                 RBBag        25.34    66.09    4.07    4.50
hepatitis        NBBag        65.89    23.78    4.38    5.95
                 RBBag        67.01    26.60    1.25    5.14
ionosphere       NBBag        94.96    3.62     0.49    0.93
                 RBBag        51.98    31.10    6.60    10.32
new-thyroid      NBBag        96.54    2.41     0.15    0.90
                 RBBag        90.83    9.17     0.00    0.00
scrotal pain     NBBag        62.92    25.24    6.57    5.27
                 RBBag        64.67    29.34    3.50    2.49
solar-flareF     NBBag        84.83    7.67     3.76    3.74
                 RBBag        70.52    21.37    2.69    5.43
transfusion      NBBag        35.71    46.43    9.44    8.42
                 RBBag        41.00    41.90    3.76    13.33
vehicle          NBBag        86.84    10.4     1.46    1.30
                 RBBag        89.80    10.20    0.00    0.00
yeast-ME2        NBBag        86.66    7.51     1.79    4.05
                 RBBag        64.31    34.29    0.22    1.18

or outliers, the transformed bootstrap samples now contain more safe examples. For instance, consider the very difficult balance-scale data set (originally containing 91.8% outliers), where RBBag creates bootstrap samples with at most 4% outliers and 7.5% rare cases, moving the rest of the examples into the safe and borderline categories. A similar shift of data types can be observed for: yeast-ME2 (originally 5% safe examples, now over 70%), solar-flareF, ecoli, ionosphere, hepatitis and cleveland.
Finally, one can notice that RBBag usually constructs slightly safer data than NBBag.

Recall that the extensions of bagging known from the literature are based on the simple idea of balancing class distributions in bootstrap samples. However, our results indicate that transforming the distributions of examples into safer ones can be more influential. In the case of RBBag this could be connected with the strong filtering of majority class examples in each bootstrap sample. Notice that many data sets contain nearly 1000 examples with only around 50 minority ones. For instance, the number of all examples in solar-flareF is 1066, while the minority class contains only 43 examples. The newly created bootstrap samples include only 43 safe majority examples, and as a result most of the majority class examples (also reflecting their original distribution) disappear. This can be interpreted as a kind of cleaning around the minority class examples, so that they become safer in their local neighbourhood. Having such a transformed distribution in each sample can help construct base classifiers which are more biased toward the minority class. On the other hand, the size of the learning set can be dramatically reduced. As a result, some bootstrap samples may lead to weak classifiers, and this type of ensemble may need more component classifiers than NBBag, which uses larger bootstrap samples.

5 Discussion and Final Remarks

The difficulty of learning classifiers from imbalanced data comes from complex distributions of the minority class. Besides the unequal class cardinalities, the minority class is decomposed into smaller sub-parts, affected by strong overlapping, rare cases or outliers. In our study we attempt to capture these data characteristics by analyzing the local neighbourhood of minority class examples. Our main message is to show that this kind of local information can be useful both for proposing a new type of bagging and for explaining why some ensembles work better than others.
Our first contribution is the introduction of Nearest Balanced Bagging, which is based on different principles than all known bagging extensions for class imbalance. First, instead of integrating bagging with pre-processing, we keep the standard bagging idea but radically change the probabilities of sampling examples, increasing the chance of drawing the more difficult minority examples. Furthermore, we amplify the role of difficult examples with respect to their local neighbourhood. The experimental results show that this proposal is significantly better than the existing over-sampling generalizations of bagging and is competitive with Roughly Balanced Bagging (the best known under-sampling variant).

The other contribution is the use of the local neighbourhood analysis to assess the types of examples in the data. It comes from the earlier research of Stefanowski and Napierala [13]; however, it is now applied in the context of ensembles, which uncovers new characteristics of the studied ensembles. First, the strongest differences between classifiers have been noticed for data sets containing the most unsafe minority examples. Indeed, both the RBBag and NBBag ensembles have strongly outperformed all overbagging variants for such data. Furthermore, the analysis of the types of minority examples inside bootstrap samples has clearly shown that RBBag and NBBag strongly change the data characteristics compared to the original data sets. Many examples from the minority class labeled as unsafe (in particular as rare cases or outliers) are transformed into safer ones. This might be more influential for improving classification performance than the simple global class balancing previously considered in the literature and applied in many of the existing approaches to generalizing bagging.

References

1.
Błaszczyński, J., Słowiński, R., Stefanowski, J.: Feature Set-based Consistency Sampling in Bagging Ensembles. In: Proc. From Local Patterns To Global Models (LEGO), ECML/PKDD Workshop, 19–35 (2009)
2. Błaszczyński, J., Stefanowski, J., Idkowiak, L.: Extending bagging for imbalanced data. In: Proc. of the 8th CORES 2013, Springer Series on Advances in Intelligent Systems and Computing 226, 269–278 (2013)
3. Breiman, L.: Bagging predictors. Machine Learning 24 (2), 123–140 (1996)
4. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 341–378 (2002)
5. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 99, 1–22 (2011)
6. Garcia, V., Sanchez, J.S., Mollineda, R.A.: An empirical study of the behaviour of classifiers on imbalanced and overlapped data sets. In: Proc. of Progress in Pattern Recognition, Image Analysis and Applications 2007, Springer, LNCS vol. 4756, 397–406 (2007)
7. He, H., Garcia, E.: Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21 (9), 1263–1284 (2009)
8. Hido, S., Kashima, H.: Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining 2 (5-6), 412–426 (2009)
9. Jo, T., Japkowicz, N.: Class Imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter 6 (1), 40–49 (2004)
10. Khoshgoftaar, T., Van Hulse, J., Napolitano, A.: Comparing boosting and bagging techniques with noisy and imbalanced data. IEEE Transactions on Systems, Man, and Cybernetics – Part A 41 (3), 552–568 (2011)
11.
Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proc. of Int. Conf. on Machine Learning ICML 97, 179–186 (1997)
12. Napierala, K., Stefanowski, J., Wilk, Sz.: Learning from Imbalanced Data in Presence of Noisy and Borderline Examples. In: Proc. of 7th Int. Conf. RSCTC 2010, Springer, LNAI vol. 6086, 158–167 (2010)
13. Napierała, K., Stefanowski, J.: The influence of minority class distribution on learning from imbalanced data. In: Proc. of 7th Int. Conference HAIS 2012, Part II, Springer, LNAI vol. 7209, 139–150 (2012)
14. Stefanowski, J., Wilk, Sz.: Selective Pre-processing of Imbalanced Data for Improving Classification Performance. In: Proc. of 10th Int. Conference DaWaK 2008, Springer, LNCS vol. 5182, 283–292 (2008)
15. Wang, S., Yao, X.: Diversity analysis on imbalanced data sets by using ensemble models. In: Proc. IEEE Symp. Comput. Intell. Data Mining, 324–331 (2009)
16. Wilson, D.R., Martinez, T.R.: Improved heterogeneous distance functions. Journal of Artificial Intelligence Research 6, 1–34 (1997)

Anomaly Detection by Bagging

Tomáš Pevný
[email protected]
Agent Technology Center, Department of Computers, Czech Technical University in Prague

Abstract. Many contemporary domains, e.g. network intrusion detection, fraud detection, etc., call for an anomaly detector that processes a continuous stream of data. This need is driven by the high rate of data acquisition, limited resources for storing the data, or privacy issues. The data can also be non-stationary, requiring the detector to continuously adapt to changes. A good detector for these domains should therefore have low training and classification complexity, an on-line training algorithm, and, of course, good detection accuracy. This paper proposes a detector trying to meet all these criteria.
The detector consists of multiple weak detectors, each implemented as a one-dimensional histogram. The one-dimensional histogram was chosen because it can be efficiently created on-line, and probability estimates can be efficiently retrieved from it. This construction gives the detector a training complexity that is linear with respect to the input dimension, the number of samples, and the number of weak detectors. Similarly, the classification complexity is linear with respect to the number of weak detectors and the input dimension. The accuracy of the detector is compared to seven anomaly detectors from the prior art on a range of 36 classification problems from the UCI database. The results show that despite the detector's simplicity, its accuracy is competitive with that of more complex detectors with substantially higher computational complexity.

Keywords: anomaly detection, on-line learning, ensemble methods, large data

1 Introduction

The goal of an anomaly detector is to find samples which in some sense deviate from the majority. It finds application in many important fields, such as network intrusion detection, fraud detection, monitoring of health, environmental and industrial processes, data mining, etc. These domains frequently need a detector with low training and classification complexity, which can efficiently process a large number of samples, ideally in real time. These requirements imply that the detector should be trained on-line, which is also important for domains where the data cannot be stored, or where the data are non-stationary and the detector needs to be updated continuously. This paper describes a detector (further called ADBag) which has a provably linear complexity of training with respect to the number of training samples n and their dimension d.
The classification also has a low complexity, scaling linearly with the dimension d. The detector consists of an ensemble of k weak detectors, each implemented as a one-dimensional histogram with b bins. The one-dimensional histogram was chosen because it can be created on-line in a single pass over the data, and probability estimates can be efficiently retrieved from it. Since each weak detector processes only one-dimensional data, the input space R^d is reduced to R by a projection onto a randomly generated vector w. The projection vectors w create the diversity among the weak detectors, which is important for the success of the ensemble approach. As will be explained in more detail in Section 3, ADBag can be related to a naïve Parzen window estimator [20], as each weak detector provides an estimate of the probability of a sample. According to [26], the Parzen window estimator frequently gives better results than more complex classifiers¹.

ADBag's accuracy is experimentally compared to selected prior-art algorithms on 36 problems downloaded from the UCI database [6], listed among the classification problems with numerical attributes. Although there is no single dominating algorithm, ADBag's performance is competitive with the others, but with significantly smaller computational and storage requirements. On a data set with millions of features and samples it is demonstrated that ADBag can efficiently handle large-scale data.

The paper is organized as follows. The next section briefly reviews the relevant prior art, shows its computational complexity, and discusses issues related to on-line learning and classification. ADBag is presented in Section 3. In Section 4, it is experimentally compared to the prior art and its efficiency is demonstrated on a large-scale dataset [17]. Finally, Section 5 concludes the paper.

2 Related work

A survey on anomaly and outlier detection [3] contains plenty of methods for anomaly and outlier detection.
Below, those relevant to us or otherwise important are reviewed. We remark that ADBag falls into the category of model-based detectors, since every weak detector essentially creates a very simple model.

2.1 Model-based detectors

Basic model-based anomaly detectors assume the data follow a known distribution. For example, the principal component transformation based detector [25], with a training complexity of O(nd³), assumes a multi-variate normal distribution.

¹ The Parzen window estimator is not suitable for real-time detection, since the complexity of obtaining an estimate of the pdf depends linearly on the number of observed samples n.

Although the multi-variate normal distribution rarely fits real data, as will be seen later, this detector frequently provides good results. In contrast, the One-Class Support Vector Machine [23] (OC-SVM) does not assume anything about the distribution of the data. It finds the smallest region where a 1 − ν fraction of the data is located (ν is a parameter of the method specifying the desired false-positive rate). This is achieved by projecting the data into a high-dimensional (feature) space, and then finding the hyperplane best separating the data from the origin. It has been noted that when OC-SVM is used with a linear kernel, it introduces a bias toward the origin [26]. This problem is removed by using a Gaussian kernel. The Support Vector Data Description algorithm [26] (SVDD) removes the bias in OC-SVM by replacing the separating hyperplane with a sphere encapsulating most of the data. It has been shown that if OC-SVM and SVDD are used with a Gaussian kernel, they are equivalent [23]. Due to this fact, SVDD is omitted from the comparison section. The complexity of training both methods is super-linear with respect to the number of samples n, being O(n³d) in the worst case.
The recently proposed FRAC [19] aims to bridge the gap between supervised and unsupervised learning. FRAC is an ensemble of models, each estimating one feature on the basis of the others (for data of dimension d, FRAC uses d different models). The rationale behind this is that anomalous samples exhibit different dependencies among features, which can be detected from prediction errors modeled by histograms. FRAC's complexity depends on the algorithm used to implement the models, which can be high, considering that a search over possible hyper-parameters needs to be done. Because of this, ordinary linear least-squares regression is used, leading to a complexity of O(nd⁴). It is stated that FRAC is well suited for an on-line setting, but this might not be straightforward. For real-valued features, every update changes all models together with the distributions of their errors. Consequently, to update the histograms of these errors, all previously observed samples are required. This means that unless some simplifying assumptions are accepted, the method cannot be used in an on-line setting.

Generally, the complexity of classification for model-based detectors is negligible in comparison to the complexity of training. Yet for some methods it might be difficult to control. An example is OC-SVM, where the classification complexity depends on the number of support vectors, which is a linear function of the number of training samples n.

The on-line training of all the above detectors is generally difficult. Since there is no closed-form solution for on-line PCA, an on-line version of the PCA detector does not exist either. The on-line adaptation of OC-SVM is discussed in [11], but the solution is an approximation of the solution returned by the batch version. An exact on-line version of SVDD is described in [27], but the algorithm requires substantial bookkeeping, which degrades its usability in real-time applications.
Moreover, the bookkeeping increases the storage requirements, which are no longer bounded.

2.2 Distance-based detectors

Distance-based detectors use all available data as a model. All data are usually presented in a single batch and the outliers are found within it. Thus, it is assumed that the majority of samples come from one (nominal) class. The lack of a training phase makes the adaptation to on-line settings easy: new samples are just added to the set of already observed samples. Notice that this increases the complexity of the classification phase, which is a linear function of the number of samples n.

k-nearest neighbors [12] (KNN) is a popular method for identifying outliers, inspired by the corresponding classification method. It ranks samples according to their distance to their k-th nearest neighbor. KNN has been criticized for not being able to detect outliers in data with clusters of different density [2]. The local outlier factor [2] (LOF) solves this problem by defining the outlier score as the ratio of a sample's distance to its k-th nearest neighbor and the average of the same distance over all its k nearest neighbors. True inliers have a score around one, while outliers have a much greater score. The prior art here is vast and it is impossible to list it all; we refer to [29] for more.

The complexity of the classification phase of nearest-neighbor based detectors is driven by the nearest-neighbor search, which with a naïve implementation is an O(n) operation. To alleviate this, more efficient approaches have been adopted, based on bookkeeping [22], better search structures like KD-trees, or approximate search. Nevertheless, the complexity of all these methods depends in some way on the number of training samples n.
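The KNN outlier score described above reduces to a few lines. This is a naive O(n²) sketch (brute-force search, function name ours) of the ranking criterion; the KD-tree and approximate-search speedups mentioned above are omitted.

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Rank samples by the distance to their k-th nearest neighbour
    (larger score = more outlying), using a brute-force search."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                  # ignore the sample itself
        scores[i] = np.sort(d)[k - 1]  # distance to the k-th neighbour
    return scores
```

An isolated sample far from the bulk of the data receives a large score, while samples inside a dense cluster receive small ones.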
2.3 Ensembles in outlier detection

The ensemble approach has so far been little utilized in anomaly detection. A significant portion of the prior art focuses on the unification of scores [8,24], which is needed for diversification by using different algorithms [18]. Diversification by a random selection of sub-spaces has been utilized in [14]. ADBag's random projection can be considered a modification of the sub-space method. Yet the important difference is that the random projection relates all features together, not just some of them as the sub-space method does. Also, all previous works use heavy-weight detectors, while ADBag uses a very simple detector, which gives it its low complexity.

2.4 Random projections

Random projections have been utilized mainly in distance-based outlier detection schemes to speed up the search. De Vries et al. [5] use the property that random projections approximately preserve L2 distances among a set of points [10]. Thus, instead of performing the k-th-NN search of the LOF in a high-dimensional space, the search is conducted in a space of reduced dimension but over a larger neighborhood, which is then refined by a search in the original dimension. Similarly, Pham et al. [21] use random projections to estimate the distribution of angles between samples, which was proposed in [13] as a good anomaly criterion. Unlike both of the above schemes, ADBag is a model-based detector that does not rely on the notion of distance or angles, but on the notion of probability density. ADBag takes random projections to the extreme by projecting the input space into a single dimension, which greatly simplifies all operations over it.

3 Algorithm description

ADBag is an ensemble of equal, weak detectors.
Every weak detector within the ensemble is a histogram over the one-dimensional space R, created by projecting the input space R^d onto a vector w. Projection vectors are generated randomly during the initialization of the weak histograms, before any data have been observed. Let h_i(x) = p̂_i(x^T w_i) denote the output of the i-th weak detector (p̂_i is the empirical probability density function (pdf) of the data projected on w_i). ADBag's output is the average of the negative logarithms of the outputs of all weak detectors. Specifically,

    f(x) = -(1/k) Σ_{i=1}^{k} log h_i(x).    (1)

The rest of this section describes ADBag in detail and explains the design choices. Subsection 3.1 describes strategies to generate the projection vectors w_i. Subsection 3.2 points to issues related to the on-line creation of histograms and presents a method adopted from the prior art. Finally, Subsection 3.3 explains the rationale behind the aggregation function and ADBag itself. The section finishes with a paragraph discussing ADBag's hyper-parameters and their effect on the computational complexity.

3.1 Projections

Projection vectors are generated randomly at the initialization of the weak detectors. In all experiments presented in Section 4, their elements were generated according to the normal distribution with zero mean and unit variance, chosen due to its use in the proof of the Johnson-Lindenstrauss (JL) lemma [10] (the JL lemma shows that L2 distances between points in the projected space approximate the same quantity in the input space). Other distributions to generate the random vectors w certainly exist. According to [15] it is possible to use sparse vectors w, which is interesting, as it would allow the detector to elegantly deal with missing variables.
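The ensemble defined by Equation (1) can be sketched compactly. The sketch below substitutes ordinary batch equal-width histograms (numpy's `np.histogram`) for the on-line histogram of Subsection 3.2, so it is an illustration of the scoring rule, not the paper's implementation:

```python
import numpy as np

def adbag_fit(X, k=150, bins=None):
    """Sketch of ADBag training: k random Gaussian projections to 1-D,
    each summarized by a batch equal-width histogram (a simplification
    of the paper's on-line histogram)."""
    n, d = X.shape
    bins = bins or max(4, int(np.sqrt(n)))
    W = np.random.default_rng(0).normal(size=(k, d))   # random projection vectors
    model = []
    for w in W:
        counts, edges = np.histogram(X @ w, bins=bins, density=True)
        model.append((w, counts, edges))
    return model

def adbag_score(model, x, eps=1e-10):
    """f(x) = -(1/k) * sum_i log p_hat_i(x^T w_i); larger = more anomalous."""
    logs = []
    for w, counts, edges in model:
        i = np.searchsorted(edges, x @ w) - 1          # histogram bin of z = x^T w
        p = counts[i] if 0 <= i < len(counts) else 0.0  # outside support -> 0
        logs.append(np.log(max(p, eps)))
    return -np.mean(logs)
```

A sample projected outside every histogram's support receives density eps in each detector, giving it the maximal score.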
3.2 Histogram

Recall that one of the most important requirements was that the detector should operate (learn and classify) over data streams, which means that the usual approaches, such as equal-area or equal-width histograms, cannot be used. The former requires the data to be available in a single batch, while the latter requires the bounds of the data to be known beforehand. To avoid these limitations and have bounded resources, an algorithm proposed in [1] is adopted, even though it does not guarantee convergence to the true pdf. A reader interested in this problem should look to [16] and the references therein. In the experimental section, ADBag with an equal-area histogram created in batch training is compared to ADBag with the on-line histogram, with the conclusion that both offer the same performance.

Algorithm 1: Construction of an approximation of the probability distribution of the data {x_1, ..., x_n} projected on the vector w.

    initialize H = {}, z_min = +∞, z_max = −∞, and w ~ N(0, 1_d)
    for j ← 1 to n do
        z = x_j^T w
        z_min = min{z_min, z};  z_max = max{z_max, z}
        if ∃ (z_i, m_i) ∈ H such that z_i = z then
            m_i = m_i + 1;  continue
        else
            H = H ∪ {(z, 1)}
        end
        if |H| > b then
            sort pairs in H such that z_1 < z_2 < ... < z_{b+1}
            find i minimizing z_{i+1} − z_i
            replace the pairs (z_i, m_i), (z_{i+1}, m_{i+1}) by the pair
                ( (z_i m_i + z_{i+1} m_{i+1}) / (m_i + m_{i+1}),  m_i + m_{i+1} )
        end
    end
    H = H ∪ {(z_min, 0), (z_max, 0)}
    sort pairs in H such that z_min ≤ z_1 < z_2 < ... ≤ z_max

The chosen on-line histogram approximates the distribution of the data by a set of pairs H = {(z_1, m_1), ..., (z_b, m_b)}, where z_i ∈ R and m_i ∈ N, and b is an upper bound on the number of histogram bins. The algorithm maintains the pairs (z_i, m_i) such that every point z_i is surrounded by m_i points, of which half are to the left and half to the right of z_i.
Consequently, the number of points in the interval [z_i, z_{i+1}] is equal to (m_i + m_{i+1})/2, and the probability of a point z ∈ (z_i, z_{i+1}) is estimated as a weighted average.

The construction of H is described in Algorithm 1. It starts with H = {} being an empty set. Upon receiving a sample, z = x^T w, it checks whether there is a pair (z_i, m_i) in H such that z is equal to z_i. If so, the corresponding count m_i is increased by one. If not, a new pair (z, 1) is added to H. If the size of H exceeds the maximal number of bins b, the algorithm finds the two closest pairs (z_i, m_i), (z_{i+1}, m_{i+1}), and replaces them with the interpolated pair ((z_i m_i + z_{i+1} m_{i+1})/(m_i + m_{i+1}), m_i + m_{i+1}). Keeping the z_i sorted makes all of the above operations efficient.

The estimation of the probability density at a point z = x^T w is described in Algorithm 2. Assuming the pairs in H are sorted according to z_i (the sorting is explicitly stated in the algorithm, but as mentioned above, for efficiency H should be kept sorted at all times), an i such that z_i < z ≤ z_{i+1} is found first. If such an i exists, then the density at z is estimated as (m_i + m_{i+1}) / (2M(z_{i+1} − z_i)), where M = Σ_{i=1}^{b} m_i. Otherwise, it is assumed that z is outside the estimated region and 10^{-10} is returned.

Algorithm 2: Approximation of the probability density at a point x projected on the vector w.

    H = H ∪ {(z_min, 0), (z_max, 0)}
    sort pairs in H such that z_min ≤ z_1 < z_2 < ... ≤ z_max
    z = x^T w
    if ∃ i such that z_i < z ≤ z_{i+1} then
        return (m_i + m_{i+1}) / (2M (z_{i+1} − z_i))
    else
        return 10^{-10}
    end

3.3 Aggregation of weak detectors

ADBag's output on a sample x ∈ R^d can be expressed as

    f(x) = -(1/k) Σ_{i=1}^{k} log p̂_i(x w_i^T)
         = -log ( Π_{i=1}^{k} p̂_i(x w_i^T) )^{1/k}
         ~ -log p(x w_1, x w_2, ..., x w_k),    (2)

where p̂_i denotes the empirical marginal pdf along the projection w_i, and p(x w_1, x w_2, ..., x w_k) denotes the joint pdf.
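The update step of Algorithm 1 (the merge rule of the streaming histogram from [1]) can be rendered compactly in Python. This sketch keeps H as a plain sorted list of (centre, count) pairs rather than an efficient search structure:

```python
def streaming_histogram_update(H, z, b):
    """One update of a bounded streaming histogram in the style of [1]:
    H is a sorted list of (centre, count) pairs, b the bin budget."""
    for idx, (zi, mi) in enumerate(H):
        if zi == z:                       # exact match: just bump the count
            H[idx] = (zi, mi + 1)
            return H
    H.append((z, 1))                      # otherwise open a new bin
    H.sort()
    if len(H) > b:
        # merge the two closest centres into their count-weighted mean
        i = min(range(len(H) - 1), key=lambda j: H[j + 1][0] - H[j][0])
        (z1, m1), (z2, m2) = H[i], H[i + 1]
        H[i:i + 2] = [((z1 * m1 + z2 * m2) / (m1 + m2), m1 + m2)]
    return H
```

The total count is preserved by every merge, which is what makes the density estimate of Algorithm 2 consistent with the number of observed samples.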
The equation shows that ADBag's output is inversely related to the joint probability of the sample, under the assumption that the marginals along projection vectors w_i and w_j are independent for all i, j ∈ {1, ..., k}, i ≠ j (used in the last line of Equation (2)). Similarly, the output can be viewed as the negative log-likelihood of the sample, meaning that the less likely the sample is, the higher anomaly value it gets.

The independence of x w_i^T and x w_j^T for i ≠ j assumed in the last line of (2) is questionable, and in reality it probably does not hold. Nevertheless, the very same assumption is made in the naïve Bayes classifier, which, despite the assumption being almost always violated, gives results competitive with more sophisticated classifiers. Zhang [28] explains this phenomenon from a theoretical point of view and gives conditions under which the effects of conditional dependencies cancel out, making naïve Bayes equal to the Bayes classifier. These conditions depend on the probability distributions of both classes and are difficult to verify in practice, since they require exact knowledge of the conditional dependencies among features. Nevertheless, due to ADBag's similarity to the Parzen window classifier [20], similar argumentation might explain ADBag's performance.

Another line of thought relates ADBag to a PCA-based detector [25]. If the dimension d is sufficiently high, then the projection vectors w_i and w_j, i ≠ j, are approximately orthogonal. Assuming again the independence of x w_i^T and x w_j^T, the projected data are orthogonal and uncorrelated, which are the most important properties of Principal Component Analysis (PCA).

3.4 Hyper-parameters

ADBag is controlled by two hyper-parameters: the number of weak detectors k and the number of histogram bins b within every detector.
Both parameters influence the accuracy in a predictable way. Generally speaking, the higher the number of weak detectors, the better the accuracy. Nevertheless, after a certain threshold, adding more detectors does not significantly improve the accuracy. In all experiments in Section 4.2, we set k = 150. A subsequent investigation of the least k at which ADBag reaches an accuracy above 99% of the maximum found that this k was most of the time well below 100.

The number of histogram bins was set to b = [√n], where n is the number of samples. The rationale behind this choice is the following. If the number of samples n → +∞ and b = √n, then the equal-area histogram converges to the true probability distribution function. In practice, b should be set according to the available resources and the expected number of samples. Interestingly, an investigation on a dataset with millions of features and samples revealed that the effect of b on the accuracy is limited and small values of b are sufficient (see Section 4.3 for details).

Both hyper-parameters influence the computational complexity of ADBag's training and classification. It is easy to see that both complexities depend at most linearly on both parameters.

4 Experiments

ADBag's accuracy was evaluated and compared to the state of the art on 36 problems from the UCI database [6] listed under the category "classification problems with numerical attributes without missing variables". The following algorithms were chosen due to their generality and acceptance by the community: the PCA-based anomaly detector [25], OC-SVM [23], FRAC [19], KNN [12], and LOF [2]. Although these algorithms were not designed for real-time on-line learning, they were selected because they provide a good benchmark.
For every problem, the class with the highest number of samples was used as the nominal class and all samples from the remaining classes were used as representatives of the anomalous class. In one repetition of the experiment, 75% of the nominal samples were used for training and the rest for testing. The samples from the anomalous class were not sub-sampled: all of them were always used. The data were always normalized to have zero mean and unit variance on the training part of the nominal class. The distance-based algorithms (KNN and LOF) used the training data as the data to which the distance of a classified sample was calculated. Every experiment was repeated 100 times.

To avoid problems with the trade-off between false positive and false negative rates, the area under the ROC curve (AUC) is used as the measure of detection quality. This measure is frequently used for comparisons of this kind. The Matlab package used in the evaluation is available at http://agents.fel.cvut.cz/~pevnak.

4.1 Settings of hyper-parameters

The LOF and KNN methods both used k = 10 (recall that in both methods, k denotes the number of samples determining the neighborhood), as recommended in [2]. OC-SVM with a Gaussian kernel is cumbersome to use in practice, since its two hyper-parameters (the width of the Gaussian kernel γ and the expected false positive rate ν) have an unpredictable effect on the accuracy (note that anomalous samples are not available during training). Hence, the following heuristic has been adopted. The false positive rate on the grid

    (ν, γ) ∈ { (0.01 · 2^i, 10^j / d) | i ∈ {0, ..., 5}, j ∈ {−3, ..., 3} }

has been estimated by five-fold cross-validation (d is the input dimension of the problem). Then the lowest ν and γ (in this order) with an estimated false positive rate below 1% have been used. If no such combination of parameters exists, the combination with the lowest false positive rate has been used.
The choice of a 1% false positive rate is justified by training on samples from the nominal class only, where no outliers should be present. The reason for choosing the lowest ν and γ is that good generalization is expected. This choice of parameters is probably not optimal for maximizing the AUC, but it shows the difficulty of using algorithms with many hyper-parameters. The SVM implementation has been taken from the libSVM library [4].

The FRAC detector used ordinary linear least-squares estimators, which in this setting do not have any hyper-parameters. The PCA detector based on the principal component transformation used all components with eigenvalues greater than 0.01. ADBag used k = 150 weak detectors and b = √n bins in the histogram (n is the number of samples from the nominal class). As will be discussed later, 150 weak detectors is unreasonably high for most problems, but it was used to avoid poor performance due to too low a number. The same holds for b.

4.2 Experimental results

Table 1 shows the average AUCs of the compared detectors on all 36 tested problems. The first thing to notice is that no single detector dominates the others. Every detector excels in at least one problem and is inferior in others. Therefore, the last row of Table 1 shows the average rank of a given detector over all problems (calculated only for unsupervised detectors with batch learning). According to this measure, KNN and the proposed ADBag detector with the equal-area histogram provide overall the best performance, and they are on average equally good. This result shows that increased complexity of the detector does not necessarily lead to better performance, which can be caused by the more complicated setting of the parameters to be tuned.
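The average-rank statistic reported in the last row of Table 1 is straightforward to reproduce from a matrix of per-problem AUCs; a small sketch (our own helper, with ties ignored for simplicity):

```python
import numpy as np

def average_ranks(auc_matrix):
    """auc_matrix: rows = problems, columns = detectors.
    Rank the detectors within each problem (1 = best AUC, ties ignored)
    and average the ranks over all problems."""
    ranks = (-auc_matrix).argsort(axis=1).argsort(axis=1) + 1
    return ranks.mean(axis=0)
```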
The column captioned k_min in Table 1 shows the sufficient number of weak detectors, determined as the least k providing an AUC higher than 0.99 times the AUC with 150 detectors. For all problems, the sufficient number of weak detectors is well below the maximum of 150. For many problems, k_min is higher than the dimension of the input space. This shows that a diverse set of random projections provides different views of the problem, leading to better accuracy. For almost all problems, the accuracy increases with the number of projections. The only exception is the "musk-2" dataset, which is not unimodal, as a plot of the first two principal components reveals three clusters of data. In contrast, "spect-heart" is actually a difficult problem even for supervised classification, as the AdaBoost [7] algorithm achieves only 0.62 AUC.

An investigation of the problems on which ADBag is worse than the best detector shows that ADBag performs poorly in cases where the support of the probability distribution of the nominal class is not convex and it encapsulates the support of the probability distribution of the anomalous class. We believe that these cases are rare in very high-dimensional problems, for which ADBag is designed.

Finally, comparing the accuracies of the batch and on-line versions (Table 1), it is obvious that the on-line version of ADBag is no worse than the batch version. This is important for efficient processing of large data and for applications in non-stationary problems [9].
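The AUC used throughout the evaluation can be computed without constructing the ROC curve explicitly, via the rank-sum (Mann-Whitney) identity. This is an illustrative helper, not the authors' Matlab package; ties are ignored for simplicity:

```python
import numpy as np

def auc(scores_nominal, scores_anomalous):
    """AUC via the Mann-Whitney identity: the probability that a randomly
    chosen anomalous sample scores higher than a random nominal one."""
    s = np.concatenate([scores_nominal, scores_anomalous])
    ranks = s.argsort().argsort() + 1.0       # 1-based ranks (ties ignored)
    r_anom = ranks[len(scores_nominal):].sum()
    n0, n1 = len(scores_nominal), len(scores_anomalous)
    return (r_anom - n1 * (n1 + 1) / 2) / (n0 * n1)
```

Perfect separation of the two score populations yields AUC = 1, random scoring about 0.5.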
                          FRAC  PCA   KNN   LOF   SVM   ADBag ADBag     d      n  kmin
dataset                                                 batch online
abalone                   0.44  0.46  0.46  0.50  0.60  0.59  0.59      8   1146    10
blood-transfusion         0.41  0.53  0.56  0.46  0.53  0.56  0.55      4    428     2
breast-cancer-wisconsin   0.94  0.96  0.95  0.95  0.90  0.96  0.96     30    268    20
breast-tissue             0.87  0.90  0.91  0.92  0.94  0.94  0.94      9     16    18
cardiotocography          0.71  0.72  0.73  0.80  0.87  0.82  0.82     21   1241    41
ecoli                     0.93  0.97  0.98  0.98  0.98  0.98  0.98      7    107    12
gisette                   —     0.78  0.78  0.83  0.74  0.74  0.75   5000   2250    78
glass                     0.64  0.68  0.67  0.68  0.55  0.67  0.65      9     57    10
haberman                  0.67  0.61  0.67  0.66  0.33  0.64  0.64      3    169    28
ionosphere                0.96  0.96  0.97  0.95  0.93  0.96  0.96     34    169    39
iris                      1.00  1.00  1.00  1.00  1.00  1.00  1.00      4     38     4
isolet                    0.87  0.98  0.99  0.99  0.99  0.99  0.98    617    180    40
letter-recognition        0.99  0.99  0.98  0.98  0.79  0.96  0.95     16    610    53
libras                    0.66  0.88  0.72  0.75  0.72  0.82  0.81     90     18    79
madelon                   0.51  0.51  0.52  0.52  0.53  0.51  0.51    500    975    59
magic-telescope           0.83  0.80  0.83  0.84  0.69  0.73  0.73     10   9249    27
miniboone                 0.62  0.54  0.77  0.54  —     0.83  0.83     50  70174    19
multiple-features         0.77  1.00  0.99  1.00  0.99  0.99  0.99    649    150    12
musk-2                    0.83  0.79  0.79  0.84  0.18  0.53  0.42    166   4186     1
page-blocks               0.96  0.96  0.96  0.91  0.86  0.95  0.94     10   3685    12
parkinsons                0.71  0.68  0.61  0.67  0.85  0.74  0.73     22    110    73
pendigits                 1.00  1.00  0.99  1.00  0.99  0.99  0.99     16    858     6
pima-indians              0.72  0.72  0.75  0.68  0.64  0.73  0.73      8    375    32
sonar                     0.67  0.66  0.53  0.62  0.65  0.63  0.63     60     83    99
spect-heart               0.31  0.27  0.22  0.23  0.68  0.32  0.31     44    159     1
statlog-satimage          0.81  0.97  0.99  0.99  0.85  0.99  0.99     36   1150    21
statlog-segment           0.99  1.00  0.99  0.99  0.98  0.99  0.99     19    248    10
statlog-shuttle           0.91  0.98  1.00  1.00  0.65  0.92  0.92      8  34190     9
statlog-vehicle           0.98  0.98  0.89  0.94  0.60  0.85  0.86     18    164    69
synthetic-control-chart   0.97  1.00  1.00  1.00  1.00  1.00  1.00     60     75     9
vertebral-column          0.85  0.88  0.88  0.89  0.87  0.95  0.90      6    150    20
wall-following-robot      0.75  0.68  0.78  0.74  0.63  0.70  0.68     24   1654    59
waveform-1                0.89  0.90  0.90  0.89  0.93  0.91  0.92     21   1272    67
waveform-2                0.81  0.82  0.81  0.80  0.76  0.78  0.80     40   1269   111
wine                      0.93  0.95  0.95  0.93  0.91  0.93  0.93     13     53    63
yeast                     0.72  0.72  0.72  0.71  0.67  0.74  0.72      8    347    33
Average rank              4.0   3.2   3.1   3.2   4.3   3.1

Table 1: Average unnormalized area under the ROC curve calculated from 100 repetitions (higher is better); the best performance for a given problem is bold-faced in the original. The last row shows the average rank of each batch unsupervised algorithm over all 36 problems (lower is better). Columns d and n give the input dimension and the number of nominal samples; k_min is the sufficient number of weak detectors (see Section 4.2).

4.3 URL dataset

The URL dataset [17] contains 2.4 million samples with 3.2 million features, hence it is a good dataset on which ADBag's power can be demonstrated. Each sample contains sparse features extracted from a URL. The class of benign samples contains random URLs (obtained by visiting http://random.yahoo.com/bin/ryl). The class of malicious samples was obtained by extracting links from spam e-mails. URLs were collected over 120 days. The original work used the dataset in a supervised binary classification scenario, obtaining an accuracy around 98%. Here, the dataset is used in the anomaly detection scenario, with samples from the benign class used for training.

[Fig. 1 (plots omitted in this extraction): (a) AUC of the updated ADBag on the URL dataset for different numbers of weak detectors k and histogram bins b; (b) AUC, per day, of the KNN detector in the reduced space, the continuously updated ADBag (Continuous), ADBag trained on the first day (Fixed), and ADBag retrained every day (Updated).]

ADBag was evaluated with different numbers of weak detectors k ∈ {50, 100, 150, ..., 500} and different numbers of bins b ∈ {16, 32, 64, 128, 256}. Three strategies to train ADBag were investigated.
The Fixed ADBag was trained on benign samples from day zero only, which was never used for evaluation. The Continuous ADBag was trained on all samples from the benign class up to the day before the testing data; this means that if Continuous ADBag was evaluated on data from day l, the training used benign samples from days 0, 1, 2, ..., l − 1 (Continuous ADBag was trained in the on-line manner). Finally, the Updated ADBag was trained on benign samples from the day preceding the day of evaluation.

Note that the prior art used in the previous section cannot be used directly for benchmarking purposes, because it cannot handle these data. In order to have a method that ADBag can be compared to, the strategy from [5] has been adopted and used with the KNN detector (the best according to the results in the previous subsection). Specifically, a random projection matrix W ∈ R^{3.2·10^6 × 500}, W_ij ~ N(0, 1), has been created and all samples were projected into this new 500-dimensional space. Note that due to the Johnson-Lindenstrauss lemma, L2 distances between points should be approximately preserved, hence the KNN method should work without modification. The KNN method was executed with 10 and 20 nearest neighbors, the former being found better.

Missing features were treated as zeros, which allows not-yet-seen features to be handled efficiently by adding new row(s) to the projection matrix W. This strategy has been used for both the ADBag and KNN methods.

Figure 1b shows the AUCs of KNN and the three variants of ADBag on every day in the data set. According to the plot, on some days the ADBag detectors were better, whereas on other days it was vice versa. The average difference between KNN and ADBag retrained every day was about 0.02, which is a negligible difference considering that Continuous ADBag was approximately 27 times faster.
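The treat-missing-as-zero trick means that a sparse sample only touches the rows of W for features it actually contains, so W can be grown lazily as new feature indices appear. A hedged sketch (the helper and its interface are ours, not the paper's code):

```python
import numpy as np

def project(W, x_sparse, rng):
    """Project a sparse sample {feature_index: value} through W.
    Rows of W are grown on demand for not-yet-seen feature indices;
    absent features contribute 0 to the projection."""
    max_idx = max(x_sparse)
    if max_idx >= W.shape[0]:
        # append fresh N(0,1) rows for the newly observed features
        extra = rng.normal(size=(max_idx + 1 - W.shape[0], W.shape[1]))
        W = np.vstack([W, extra])
    z = np.zeros(W.shape[1])
    for i, v in x_sparse.items():
        z += v * W[i]                 # only non-zero features touch W
    return W, z
```

Because already-generated rows are never changed, projections of old samples remain valid after W grows.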
Interestingly, all three versions of ADBag provided similar performance, meaning that the benign samples were stationary. This is explained by the fact that the distribution of URLs on the internet did not change considerably during the 120 days over which the benign samples were acquired.

Figure 1a shows how the accuracy of Continuous ADBag changes with the number of weak detectors k and the number of histogram bins b. As expected, higher values yield higher accuracy, although competitive results can already be achieved for k = 200 and b = 32. This suggests that b can have a low value, which does not have to be proportional to the number of samples n.

Continuous ADBag's average time of update and classification per day, for the most complex setting with k = 500 and b = 256, was 5.86 s. The average classification time for the KNN detector in the 500-dimensional projected space with 10 nearest neighbors was 135.57 s. Both times exclude the projection of the data to the lower-dimensional space, which was done separately and took 669.25 s for 20,000 samples. These numbers show that ADBag is well suited for efficient detection of anomalous events in large-scale data. Its accuracy is competitive with state-of-the-art methods, while its running times are an order of magnitude lower. Running times were measured on a MacBook Air equipped with a 2-core 2 GHz Intel Core i7 processor and 8 GB of memory.

5 Conclusion

This paper has proposed an anomaly detector with bounded computational and storage requirements. This type of detector is important for many contemporary applications that require processing large data beyond the capabilities of traditional algorithms, such as the one-class support vector machine or nearest-neighbor based methods. The detector is built as an ensemble of weak detectors, where each weak detector is implemented as a histogram in one dimension.
This one dimension is obtained by projecting the input space onto a randomly generated projection vector. Projection vectors are generated randomly, which simultaneously creates the needed diversity between the weak detectors.

The accuracy of the detector was compared to five detectors from the prior art on 36 classification problems from the UCI datasets. According to the results, the proposed detector and the nearest-neighbor based detector provide overall the best performance. It was also demonstrated that the detector can efficiently handle a dataset with millions of samples and features. The fact that the proposed detector is competitive with established solutions is especially important if one takes its small computational and memory requirements into account. Moreover, the detector can be trained on-line on data streams, which opens the door to its application in non-stationary problems.

With respect to the experimental results, the proposed detector represents an interesting alternative to established solutions, especially when large data need to be handled efficiently. It would be interesting to investigate the impact of sparse random projections on the accuracy, as this would further increase the efficiency and enable the detector to be applied to data with missing features.

6 Acknowledgments

This work was supported by the Grant Agency of the Czech Republic under the project P103/12/P514.

References

1. Y. Ben-Haim and E. Tom-Tov. A streaming parallel decision tree algorithm. The Journal of Machine Learning Research, 11:849–872, 2010.
2. M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. SIGMOD Rec., 29(2):93–104, 2000.
3. V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):1–58, 2009.
4. C.-C. Chang and C.-J. Lin.
LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
5. T. de Vries, S. Chawla, and M. E. Houle. Finding local anomalies in very high dimensional space. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 128–137. IEEE, 2010.
6. A. Frank and A. Asuncion. UCI machine learning repository, 2010. http://archive.ics.uci.edu/ml.
7. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory, pages 23–37. Springer, 1995.
8. J. Gao and P.-N. Tan. Converting output scores from outlier detection algorithms into probability estimates. In Data Mining, 2006. ICDM'06. Sixth International Conference on, pages 212–221. IEEE, 2006.
9. E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 393–400. ACM, 2009.
10. W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
11. J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.
12. E. Knorr and R. T. Ng. Finding intensional knowledge of distance-based outliers. In Proceedings of the International Conference on Very Large Data Bases, pages 211–222, 1999.
13. H.-P. Kriegel and A. Zimek. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 444–452. ACM, 2008.
14. A. Lazarevic and V. Kumar. Feature bagging for outlier detection.
In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 157–166, 2005.
15. P. Li. Very sparse stable random projections for dimension reduction in l_α (0 < α ≤ 2) norm. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), page 440, 2007.
16. X. Lin. Continuously maintaining order statistics over data streams. In Proceedings of the Eighteenth Conference on Australasian Database, Volume 63, pages 7–10. Australian Computer Society, Inc., 2007.
17. J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Identifying suspicious URLs: an application of large-scale online learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 681–688. ACM, 2009.
18. H. V. Nguyen, H. H. Ang, and V. Gopalkrishnan. Mining outliers with ensemble of heterogeneous detectors on random subspaces. In Database Systems for Advanced Applications, pages 368–383. Springer, 2010.
19. K. Noto, C. Brodley, and D. Slonim. FRaC: a feature-modeling approach for semi-supervised and unsupervised anomaly detection. Volume 25, pages 109–133. Springer US, 2012.
20. E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
21. N. Pham and R. Pagh. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 877–885. ACM, 2012.
22. D. Pokrajac, A. Lazarevic, and L. J. Latecki. Incremental local outlier detection for data streams. In 2007 IEEE Symposium on Computational Intelligence and Data Mining, pages 504–515. IEEE, 2007.
23. B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution.
Neural Computation, 13(7):1443–1471, 2001.
24. E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel. On evaluation of outlier rankings and outlier scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, pages 1047–1058, 2012.
25. M. L. Shyu. A novel anomaly detection scheme based on principal component classifier. Technical report, DTIC Document, 2003.
26. D. M. J. Tax and R. P. W. Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
27. D. M. J. Tax and P. Laskov. Online SVM learning: from classification to data description and back. In Neural Networks for Signal Processing, 2003. NNSP'03. 2003 IEEE 13th Workshop on, pages 499–508, 2003.
28. H. Zhang. The optimality of naive Bayes. In V. Barr and Z. Markov, editors, Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference. AAAI Press, 2004.
29. A. Zimek, E. Schubert, and H.-P. Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5):363–387, 2012.

Efficient semi-supervised feature selection by an ensemble approach

Mohammed Hindawi¹, Haytham Elghazel², Khalid Benabdeslem²
¹ INSA de Lyon, LIRIS, CNRS UMR 5205, F-69621, France, [email protected]
² University of Lyon 1, LIRIS, CNRS UMR 5205, F-69622, France, {haytham.elghazel, khalid.benabdeslem}@univ-lyon1.fr

Abstract. The Constrained Laplacian Score (CLS) is a recently proposed method for semi-supervised feature selection. It has outperformed other methods in the state of the art.
This is because CLS exploits both the unsupervised and the supervised parts of the data when selecting the most relevant features. However, the choice of the small amount of supervision (represented by pairwise constraints) remains a critical issue: constraint sets have been shown to contain noise that can deteriorate the learning performance. In this paper we counteract such negative effects of the constraint set by varying its sources. This is done with an ensemble technique that combines a resampling of the data (bagging) with a random subspace strategy. The proposed approach generates a global ranking of features by aggregating multiple Constrained Laplacian Scores computed on different views of the available labeled and unlabeled data. We validate our approach with empirical experiments on high-dimensional datasets and compare it with other representative methods.

Key words: Feature selection, semi-supervised learning, constraint score, ensemble methods.

1 Introduction

In today's machine learning applications, data acquisition tools have developed to the point where voluminous raw data arrive continuously. This huge quantity of data, in turn, severely affects both the storage and the processing of data by classical learning algorithms, owing to the "curse of dimensionality". To overcome this problem, feature selection has become one of the most important dimensionality reduction techniques. Feature selection can be defined as the process of choosing the most relevant features of the data. The relevance of a feature may differ according to the learning context, which can be roughly divided into supervised, unsupervised and semi-supervised feature selection.
In supervised feature selection, where all data instances are labeled, the relevance of a feature is measured by its correlation with the label information. A 'good' feature is then one for which instances with the same label take the same (or close) values, and vice versa [16]. Unsupervised feature selection is considered a much harder problem because of the absence of labels; the relevance of a feature is instead measured by its ability to preserve some characteristic of the data (e.g. its variance) [11]. In practice, supervised feature selection methods outperform unsupervised ones thanks to the labels, which represent background knowledge about the data. However, labels are not always available, since they generally require costly expert intervention. Combined with the rapid development of data acquisition tools mentioned above, a more frequent situation in machine learning applications is that labeling information is provided for only a small part of the data. The data are then called 'semi-supervised', which in turn gives rise to the so-called "small-labeled sample problem" [34].

In [3] we proposed Constrained Laplacian Score (CLS), a semi-supervised scoring method that profits from both the data structure and the label information (transformed into pairwise constraints). CLS achieved outstanding performance against other competitive methods, but it was sensitive to noise in the constraint set. To tackle this problem, we later proposed the Constrained Selection based Feature Selection framework (CSFS) [19], in which we enhanced the scoring function to make it more efficient. To overcome the problem of noisy constraints, CSFS applies a constraint selection process based on a coherence measure (proposed in [8]), which considers two constraints incoherent if they exert two contradictory forces and coherent otherwise.
Once constraint selection is done, the remaining constraints are fewer but more effective. CSFS outperformed its ancestor CLS, which can be explained by the improved scoring function and the elimination of constraint noise. However, CSFS had two weak points. First, even though the selected constraints are effective, the selected constraint set was rather small, which in some cases drastically limited the usefulness of the constraints. Second, CSFS and CLS rely on the Euclidean distance between instances to compute feature scores, and this distance becomes less reliable when the data are high-dimensional.

To overcome these two problems, we present an ensemble-based framework called EnsCLS (Ensemble Constrained Laplacian Score) for semi-supervised feature selection. EnsCLS combines a resampling of the data (bagging) with a random selection of features (random subspaces, or RSM for short). The CLS score is then used to measure feature relevance on each replicate of the data, and the average score of each feature across all ensemble components is taken. Combining these two strategies (bagging and RSM) to produce the feature ranking explores distinct views of the inter-pattern relationships and allows us (i) to compute estimates of variable importance that are robust against small changes in the pairwise constraint set, and (ii) to mitigate the curse of dimensionality.

The rest of the paper is organized as follows: Section 2 reviews recent studies on semi-supervised feature selection and ensemble methods. Section 3 briefly recalls the Constrained Laplacian Score algorithm. We then discuss the details of the proposed EnsCLS algorithm in Section 4.
Experiments on relevant high-dimensional benchmark and real datasets are presented in Section 5. Finally, we conclude the paper in Section 6.

2 Related works

In this section, we briefly review the semi-supervised feature selection and semi-supervised ensemble approaches that have appeared recently in the literature.

2.1 Feature selection

With the advent of semi-supervised feature selection, some unsupervised methods have been adapted to this context by ignoring the little label information available. The Laplacian score [18], for example, determines the relevance of a feature from the variance of the data along it. Variance is an important property of the data; nevertheless, labeled data also carry valuable information and represent background knowledge about the domain. On the other hand, the constraint score [33] depends only on the few available labels, which are transformed into constraints. The constraint score demonstrated that, using only a small number of constraints, it can perform competitively with fully supervised methods (such as the Fisher score [13]), which makes it well suited to the small-labeled sample problem. However, the constraint score ignores the "large" unlabeled part of the data, which carries the real data structure. In addition, its performance is severely influenced by the choice of the constraint set. To overcome this problem, the authors of [30] proposed a bagging approach (BS) to the constraint score in order to improve the overall classification accuracy. The main drawback of that method is, as mentioned, that it still ignores the unlabeled part of the data, which is generally far larger than the labeled one. To profit from both the labeled and the unlabeled parts of the data, the C4 score [23] simply multiplies the Laplacian and constraint scores as a compromise between the two.
However, that method is biased towards features with a good Laplacian score but a bad constraint score, and vice versa.

2.2 Ensemble learning

Ensemble methods have been called the most influential development in data mining and machine learning in the last decade. They combine multiple models into one that is usually more accurate than the best of its components. This improvement relies on the concept of diversity, which states that a good classifier ensemble is one in which the examples misclassified by one individual classifier differ from those misclassified by another. Dietterich [10] states that "a necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are accurate and diverse". Many methods have been proposed to generate accurate yet diverse sets of models; bagging [5], boosting [14] and random subspaces [20] are the most popular examples. While bagging obtains a bootstrap sample by uniformly sampling with replacement from the original training set, boosting resamples or reweighs the training data, placing more emphasis on instances misclassified by previous classifiers. Like bagging, the random subspace method (RSM) is another excellent source of diversity: it manipulates the feature set to provide different views of the data and thereby improves the quality of classification solutions. Recently, besides classification ensembles, clustering [29, 31] and semi-supervised learning [26, 32, 17] ensembles have also appeared, for which it has been shown that combining the strengths of a diverse set of clusterings or semi-supervised learners often yields more accurate and robust solutions.
Last but not least, considerable attention has been paid to exploiting the power of ensembles to identify and remove irrelevant features in supervised [6, 27], unsupervised [21, 22, 12] and semi-supervised [2] settings.

3 Constrained Laplacian Score

In this section we give a brief description of the CLS score [3], on which our framework relies. CLS utilizes both parts of the data, labeled and unlabeled. The labeled part is transformed into pairwise constraints, which fall into two subsets: Ω_ML (a set of must-link constraints) and Ω_CL (a set of cannot-link constraints):

– Must-link constraint (ML): involving two instances x_i and x_j, specifies that they have the same label.
– Cannot-link constraint (CL): involving two instances x_i and x_j, specifies that they have different labels.

Let X be a dataset of n instances characterized by p features. X consists of two subsets: X_L for labeled data and X_U for unlabeled data. Let r be a feature to evaluate, with feature vector f_r = (f_{r1}, ..., f_{rn}). The CLS of r, which should be minimized, is computed by:

    CLS_r = \frac{\sum_{i,j} (f_{ri} - f_{rj})^2 \, S_{ij}}{\sum_i \sum_{j \mid \exists l,\, (x_l, x_j) \in \Omega_{CL}} (f_{ri} - \alpha^i_{rj})^2 \, D_{ii}}    (1)

where D is the diagonal matrix with D_{ii} = \sum_j S_{ij}, and S_{ij} is defined from the neighborhood relationship between the instances x_i, i = 1, ..., n, as follows:

    S_{ij} = \begin{cases} e^{-\|x_i - x_j\|^2 / \lambda} & \text{if } ((x_i, x_j) \in X_U \text{ and } x_i, x_j \text{ are neighbors}) \text{ or } (x_i, x_j) \in \Omega_{ML} \\ 0 & \text{otherwise} \end{cases}    (2)

where λ is a constant to be set, and "x_i, x_j are neighbors" means that x_i is among the k nearest neighbors of x_j. The term α^i_{rj} is given by:

    \alpha^i_{rj} = \begin{cases} f_{rj} & \text{if } (x_i, x_j) \in \Omega_{CL} \\ \mu_r & \text{if } i = j \text{ and } x_i \in X_U \\ f_{ri} & \text{otherwise} \end{cases}    (3)

where \mu_r = \frac{1}{n} \sum_i f_{ri} (the mean of the feature vector f_r).

Algorithm 1 CLS
Require: A dataset X (n × p), which consists of two subsets: X_L (L × p), the subset of labeled training instances, and X_U (U × p), the subset of unlabeled training instances; the input space F = {f_1, ..., f_p}; the constant λ; and the neighborhood degree k.
1: Construct the constraint sets (Ω_ML and Ω_CL) from the labeled part X_L.
2: Calculate the similarity matrix S and the diagonal matrix D.
3: for r = 1 to p do
4:   Calculate CLS_r according to eq. (1).
5: end for

CLS is an enhanced version of both the Laplacian score [18] and the constraint-based score [33]. In fact, the Laplacian score can be seen as a special case of CLS when there are no labels (X = X_U), and when X = X_L, CLS can be considered an adjusted version of the constraint score [33]. In CLS we proposed a more efficient combination of both scores through a new score function that includes the geometric structure of the unlabeled data and the constraint-preserving ability of the labeled data. With CLS, on the one hand, a relevant feature should be one on which two instances that are neighbors or related by an ML constraint are close to each other. On the other hand, a relevant feature should be one with a larger variance, or on which two instances related by a CL constraint are well separated. The whole procedure is given in Algorithm 1 and runs in time O(p × max(n², log p)). To reduce this complexity, we proposed in prior work [3] to apply a clustering on X_U: the idea is to substitute this huge part of the data with a smaller one, X_U' = (u_1, ..., u_K), that preserves the geometric structure of X_U, where K is the number of clusters. We proposed to use Self-Organizing Map (SOM) based clustering [24] for its ability to preserve the topological relationships of the data, and thus the geometric structure of their distribution. With this strategy, the complexity is reduced to O(p × max(U, log p)), where U is the size of X_U.
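As an illustration, the score of Algorithm 1 can be sketched in Python on a toy dataset. This is a minimal sketch of our reading of eqs. (1)-(3), not the authors' implementation; the data values and helper names below are invented for illustration.

```python
import math
from itertools import combinations

# Toy semi-supervised data: the first L instances are labeled, the rest unlabeled.
X = [
    [1.00, 0.10, 5.0],   # labeled, class 0
    [1.10, 0.20, 1.0],   # labeled, class 0
    [3.00, 0.90, 5.1],   # labeled, class 1
    [1.05, 0.15, 3.0],   # unlabeled
    [2.90, 0.80, 0.0],   # unlabeled
    [3.10, 1.00, 4.0],   # unlabeled
]
y = [0, 0, 1]            # labels of the first L instances
L, n, p = len(y), len(X), len(X[0])
lam, k = 0.1, 2          # the constant lambda and the neighborhood degree k

# Pairwise constraints derived from the labels (stored symmetrically).
ML = {(i, j) for i, j in combinations(range(L), 2) if y[i] == y[j]}
CL = {(i, j) for i, j in combinations(range(L), 2) if y[i] != y[j]}
ML |= {(j, i) for i, j in ML}
CL |= {(j, i) for i, j in CL}

def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

unl = range(L, n)
def neighbors(i, j):
    """k-nearest-neighbor relation among the unlabeled instances only."""
    if i not in unl or j not in unl:
        return False
    ranked = sorted((o for o in unl if o != i), key=lambda o: dist2(X[i], X[o]))
    return j in ranked[:k]

# Similarity matrix S per eq. (2) and diagonal degree matrix D.
S = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j and (neighbors(i, j) or neighbors(j, i) or (i, j) in ML):
            S[i][j] = math.exp(-dist2(X[i], X[j]) / lam)
D = [sum(row) for row in S]

def cls_score(r):
    """CLS_r per eq. (1), with alpha per eq. (3); smaller = more relevant."""
    f = [X[i][r] for i in range(n)]
    mu = sum(f) / n
    num = sum((f[i] - f[j]) ** 2 * S[i][j] for i in range(n) for j in range(n))
    den = 0.0
    for i in range(n):
        for j in range(n):
            if any((l, j) in CL for l in range(n)):   # j involved in a cannot-link
                if (i, j) in CL:
                    a = f[j]
                elif i == j and i in unl:
                    a = mu
                else:
                    a = f[i]
                den += (f[i] - a) ** 2 * D[i]
    return num / den if den else float("inf")

scores = sorted(range(p), key=cls_score)   # ascending CLS = most relevant first
print("feature ranking (best first):", scores)
```

The quadratic double loop mirrors the O(p × n²) cost noted above; the SOM-based reduction of X_U would shrink the unlabeled block before this computation.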
4 Ensemble Constrained Laplacian Score

In this section we present our ensemble-based approach to the Constrained Laplacian Score for semi-supervised feature selection. As discussed before, the most important condition for a successful ensemble learning method is to combine models that differ from one another. To maintain diversity between committee members, we employ two strategies. First, the well-known RSM method [20] is used to face the curse of dimensionality by constructing multiple components, each trained on a different subset of examples projected onto a smaller feature set RSM_i. Second, diversity is further maintained by applying bootstrapping [14]. The formal description of our approach is given in Algorithm 2. Given a set of labeled training examples X_L and a set of unlabeled training examples X_U, described over the input space F = {f_1, ..., f_p}, our approach constructs a committee according to the following steps. First, as described in steps 3 and 4 of Algorithm 2, for each ensemble component i, a replicate X_L,b^i of the labeled dataset is obtained by selecting instances from X_L with replacement and then projecting them onto RSM_i, a feature subspace with m randomly selected features (m < p). The unlabeled part X_U is also projected onto RSM_i to generate X_U^i. Once each ensemble component i is obtained, the CLS score of Algorithm 1 is used to measure feature relevance (step 6). A ranking of all features is finally obtained with respect to their average relevance over all ensemble members (steps 7 to 9). A single learner is known to produce poor results in this setting, since learning algorithms break down on high-dimensional data.
Ensemble learning paradigms train multiple component learners and then combine their outputs. Ensemble techniques are considered an effective way to overcome the dimensionality problem and to improve the robustness and generalization ability of single learners. By using bagging in tandem with random feature subspaces, our framework addresses three different problems of the CLS score:

– High dimensionality: the major drawback of CLS was its application to high-dimensional data, because the Euclidean distances between examples (over all features) are an essential factor in the scoring function (S_ij in equation (1)). Such distances become less reliable in very high-dimensional data, leading to bad feature scores. Motivated by this, we adopt a random manipulation strategy over the feature space (RSM). We create N random subspaces of the original features, each feature having a nearly equal probability of appearing. The dimensionality is thus reduced in each subspace, and the distances calculated in the reduced dimension are more reliable. Consequently, working in the projected random subspaces allows us to mitigate the curse of dimensionality and also helps enhance the diversity between ensemble components.

– Constraints: in CLS, instance-level constraints are generated directly from labels. In the semi-supervised context such labels are few, so the number of constraints (Ω = L(L − 1)/2, where L is the number of labeled instances) is rather small as well. Moreover, the generated constraint set may contain noisy constraints, which have been shown to deteriorate the learning performance.
To reinforce the positive effects of the pairwise constraints, we propose to apply bagging (sampling with replacement) to the labeled part of the data in each random subspace. The reason for using bagging is to enforce diversity in the pairwise constraints, and thus to compute an estimate of the feature score that is robust against small changes in the pairwise constraint set. Furthermore, drawing different bootstrap samples in different random subspaces helps reduce the undesirable effects of noisy constraints.

– Unlabeled instance diversity: computing the CLS score involves a clustering algorithm (SOM) that keeps the computational complexity of the score function manageable, since this complexity depends heavily on the unlabeled part of the data. In this work we keep the SOM algorithm in each subspace. In this way, not only is the computational complexity reduced, but diversity is also gained through the different clusterings obtained in the different subspaces.

Algorithm 2 The EnsCLS algorithm
Require: Set of labeled training examples (X_L); set of unlabeled training examples (X_U); input space (F = {f_1, ..., f_p}); committee size (N)
1: Initialize the scores I(f_r) to zero for each feature r
2: for i = 1 to N do
3:   RSM_i = randomly draw m features from F
4:   X_L,b^i = bootstrap sample from X_L projected onto RSM_i
5:   X_U^i = the unlabeled sample X_U projected onto RSM_i
6:   imp_i = CLS(X_L,b^i, X_U^i): compute the Constrained Laplacian Score of each feature in RSM_i using Algorithm 1
7:   for each feature r ∈ RSM_i do
8:     I(f_r) = I(f_r) + imp_i(f_r)/N
9:   end for
10: end for
11: rank the features in F according to their scores I in ascending order
12: return F

5 Experimental results

In this section, we provide empirical results on several benchmark and real high-dimensional datasets and compare EnsCLS against other state-of-the-art semi-supervised feature ranking algorithms. EnsCLS is compared with four other feature selection methods: (1) the original CLS score [3]; (2) the Constrained Selection based Feature Selection framework (CSFS) [19]; and two ensemble-based feature evaluation algorithms, namely (3) the Bagging Constraint Score (BS) [30] and (4) the wrapper-type Semi-Supervised Feature Importance approach (SSFI) [2]. Nine benchmark and real labeled datasets were used to assess the performance of the feature selection algorithms; they are described in Table 1. We selected these datasets because they contain thousands of features and are thus good candidates for feature selection. Most of them have already been used in various empirical studies [35, 2] and cover different application domains: biology, image and text analysis.

Table 1. The datasets used in the experiments

Dataset   # patterns  # features  # classes  Reference
BaseHock  1993        4862        2          [36]
Leukemia  73          7129        2          [15]
Lymphoma  96          4026        9          [1]
Madelon   2598        500         2          [4]
PcMac     1943        3289        2          [36]
PIE10P    210         2420        10         [36]
PIX10P    100         10000       10         [36]
Prostata  102         12533       2          [28]
Relathe   1427        4322        2          [36]

5.1 Evaluation framework

To make the comparisons fair, the same experimental settings as in [3] were adopted for the CLS and CSFS approaches, i.e., a neighborhood graph with neighborhood degree 10 and λ set to 0.1. For BS, we set the ensemble size to 100, since around this value the quality of the method is insensitive to further increases of the ensemble size (cf. [30]). EnsCLS and SSFI are tuned similarly. The number of features per bag is m = √p, where p is the size of the input space.
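With m = √p features per bag, the committee of Algorithm 2 can be sketched as follows. To keep the sketch self-contained, the per-subspace CLS computation is replaced by a simple variance-based stand-in (`cls_stub`, our invention), and the data are synthetic; only the bagging/RSM skeleton reflects the algorithm.

```python
import random

random.seed(0)

# Toy data (synthetic): labeled part XL with labels yL, unlabeled part XU.
p = 16                                    # input dimensionality
XL = [[random.random() for _ in range(p)] for _ in range(12)]
yL = [i % 2 for i in range(12)]
XU = [[random.random() for _ in range(p)] for _ in range(40)]

m = int(round(p ** 0.5))                  # features per subspace, m = sqrt(p)
N = 25                                    # committee size, kept small for the sketch

def cls_stub(XL_b, yL_b, XU_v, feats):
    """Stand-in for the CLS score of Algorithm 1 (smaller = more relevant).
    Uses inverse variance over all instances purely so the sketch runs end
    to end; the labels yL_b are unused by this stand-in."""
    scores = {}
    for col, f in enumerate(feats):
        vals = [x[col] for x in XL_b + XU_v]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        scores[f] = 1.0 / (var + 1e-12)
    return scores

I = [0.0] * p                             # accumulated importance per feature
for _ in range(N):
    feats = random.sample(range(p), m)    # step 3: random subspace RSM_i
    idx = [random.randrange(len(XL)) for _ in range(len(XL))]  # step 4: bootstrap
    XL_b = [[XL[i][f] for f in feats] for i in idx]
    yL_b = [yL[i] for i in idx]
    XU_v = [[x[f] for f in feats] for x in XU]                 # step 5: project XU
    imp = cls_stub(XL_b, yL_b, XU_v, feats)                    # step 6: score features
    for f, s in imp.items():                                   # steps 7-9: average
        I[f] += s / N

# Step 11: ascending scores, i.e. the best features come first.  A feature never
# drawn would keep score 0; formula (4) for N makes that event unlikely.
ranking = sorted(range(p), key=lambda f: I[f])
print("top 5 features:", ranking[:5])
```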
The committee size N is computed using the following formula:

    N = 10 \times \left\lceil \frac{\log(0.01)}{\log(1 - 1/\sqrt{p})} \right\rceil    (4)

This formula ensures that each feature is drawn ten times at a confidence level of 0.01. Furthermore, as suggested by the authors in [2], the number of iterations maxiter and the sample size n in SSFI are set to 10 and 1, respectively.

For each dataset, experimental results are averaged over 10 runs. At each run, the whole dataset is split (in a stratified way) into a training partition with 2/3 of the observations and a test partition with the remaining 1/3. The training set is further split into labeled and unlabeled datasets. As in [35], the labeled sample set X_L consists of 3 randomly selected patterns per class, and the remaining patterns are used as the unlabeled sample set X_U.

Algorithm 3 Feature Evaluation Framework
1: for each dataset X do
2:   build a randomly stratified partition (Tr, Te) from X, where |Tr| = (2/3)·|X| and |Te| = (1/3)·|X|;
3:   generate labeled data X_L by randomly sampling 3 instances per class from Tr;
4:   X_U = Tr \ X_L;
5:   SF_CLS = apply CLS with X_L ∪ X_U;
6:   SF_CSFS = apply CSFS with X_L ∪ X_U;
7:   SF_BS = apply BS with X_L ∪ X_U;
8:   SF_SSFI = apply SSFI with X_L ∪ X_U;
9:   SF_EnsCLS = apply EnsCLS with X_L ∪ X_U;
10:  for i = 1 to 20 do
11:    select the top i features from SF_CLS, SF_CSFS, SF_BS, SF_SSFI and SF_EnsCLS;
12:    Tr_CLS = Π_SF_CLS(Tr); Tr_CSFS = Π_SF_CSFS(Tr); Tr_BS = Π_SF_BS(Tr); Tr_SSFI = Π_SF_SSFI(Tr); Tr_EnsCLS = Π_SF_EnsCLS(Tr);
13:    train the base learner on each projected training set and record the accuracy obtained on Te;
14:  end for
15: end for
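Formula (4) can be evaluated directly. The sketch below (a minimal illustration of the formula, not code from the paper) computes N for the dimensionalities of a few datasets from Table 1:

```python
import math

def committee_size(p):
    """Committee size per formula (4): with m = sqrt(p) features per bag, each
    feature appears in one bag with probability 1/sqrt(p); after
    ceil(log(0.01)/log(1 - 1/sqrt(p))) bags it has been missed with probability
    at most 0.01, and the factor 10 repeats that guarantee ten times."""
    return 10 * math.ceil(math.log(0.01) / math.log(1 - 1 / math.sqrt(p)))

for p in (500, 4862, 12533):   # Madelon, BaseHock, Prostata dimensionalities
    print(p, "->", committee_size(p))
```

Note that N grows roughly as 10·√p·ln(100), so the committees are large for the highest-dimensional datasets.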
To assess the quality of a feature subset obtained with the aforementioned semi-supervised procedures, we train an SVM classifier (using the LIBSVM package [7]) on the whole labeled training data and evaluate its accuracy on the test data; the latter is taken as the score of the feature subset. The details of the evaluation framework are shown in Algorithm 3. As mentioned above, the process specified in Algorithm 3 is repeated 10 times; the obtained accuracies are averaged and used to evaluate the quality of the feature subset selected by each algorithm.

5.2 Results

In Figure 1 we plot the accuracies of the above feature selection approaches against the 20 most important features. As may be observed, EnsCLS outperforms the other four methods by a noticeable margin. The major observations from these plots are three-fold:

– EnsCLS usually performs better than CLS and CSFS. This first of all validates the motivation behind EnsCLS, that the ensemble strategy
[Figure 1: nine panels, one per dataset (BaseHock, Leukemia, Lymphoma, Madelon, PcMac, PIE10P, PIX10P, Prostata, Relathe), each plotting accuracy against the number of selected features (1 to 20) for EnsCLS, CLS, CSFS, BS and SSFI.]

Fig. 1. Accuracy vs. different numbers of selected features. The number of labeled instances per class is set to 3.
has the potential to improve the quality and stability of the CLS score, and also confirms the effectiveness of this ensemble strategy for ranking the features properly, compared to the powerful constraint selection method used in CSFS.

– EnsCLS seems to combine the labeled and unlabeled data more efficiently for feature evaluation, and its good performance on BaseHock, PcMac, Madelon and Relathe shows promise for scaling to larger domains in a semi-supervised way. This suggests the ability of the proposed ensemble version of CLS to rank the relevant features accurately, in particular compared to the other ensemble semi-supervised feature selection approaches (BS and SSFI), by efficiently exploiting the topological information in the unlabeled data.

Table 2. Means and standard deviations of accuracy over the 20 most important features. The bottom row presents the average rank of the accuracy means, used in the computation of the Friedman test.

Data      EnsCLS      CLS         CSFS        BS          SSFI
BaseHock  0.695±0.01  0.507±0.00  0.513±0.01  0.600±0.05  0.675±0.03
Leukemia  0.781±0.06  0.740±0.10  0.751±0.11  0.618±0.04  0.760±0.08
Lymphoma  0.702±0.03  0.480±0.03  0.490±0.03  0.647±0.04  0.680±0.06
Madelon   0.594±0.01  0.542±0.04  0.549±0.03  0.499±0.01  0.548±0.04
PcMac     0.703±0.01  0.515±0.01  0.517±0.01  0.543±0.03  0.638±0.02
PIE10P    0.734±0.07  0.535±0.04  0.535±0.04  0.696±0.11  0.701±0.07
PIX10P    0.907±0.03  0.837±0.05  0.837±0.05  0.882±0.03  0.902±0.03
Prostata  0.749±0.04  0.507±0.02  0.511±0.02  0.538±0.08  0.735±0.10
Relathe   0.660±0.01  0.550±0.00  0.560±0.00  0.562±0.02  0.553±0.00
Av Rank   1.0000      4.6667      3.6667      3.3333      2.3333

– A closer inspection of the plots reveals that the accuracy on the features selected by EnsCLS generally increases swiftly at the beginning (when the number of selected features is small) and slows down afterwards.
This suggests that EnsCLS ranks the most relevant features first and that a classifier can achieve very good classification accuracy with the top 5 features, while the other methods require more features to achieve comparable results.

For the sake of completeness, we also averaged the accuracy over the different numbers of selected features. The averaged accuracies of EnsCLS and the other methods over the top 20 features are given in Table 2. To better assess the results obtained by each algorithm, we adopt the methodology proposed by [9] for the comparison of several algorithms over multiple datasets. In this methodology, the non-parametric Friedman test is first used to evaluate the rejection of the hypothesis that all the classifiers perform equally well at a given risk level. It ranks the algorithms on each dataset separately, the best-performing algorithm getting rank 1, the second best rank 2, etc.; in case of ties, average ranks are assigned. The Friedman test then compares the average ranks of the algorithms and calculates the Friedman statistic. If a statistically significant difference in performance is detected, we proceed with a post-hoc test. The Nemenyi test is used to compare all classifiers to each other: the performance of two classifiers is significantly different if their average ranks differ by more than some critical distance (CD). The critical distance depends on the number of algorithms, the number of datasets and a critical value (for a given significance level p) based on the Studentized range statistic (see [9] for further details). In this study, based on the values in Table 2, the Friedman test reveals statistically significant differences (p < 0.05) between the compared approaches.
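The average ranks in Table 2 and the Nemenyi critical distance can be reproduced from the accuracy means. The following verification sketch (ours, not from the paper) uses the tabulated Studentized-range value q_0.10 ≈ 2.459 for five classifiers, so the CD comes out close to the 1.8336 quoted in the text; the small difference stems from the precision of q:

```python
import math

# Accuracy means from Table 2 (columns: EnsCLS, CLS, CSFS, BS, SSFI).
acc = {
    "BaseHock": [0.695, 0.507, 0.513, 0.600, 0.675],
    "Leukemia": [0.781, 0.740, 0.751, 0.618, 0.760],
    "Lymphoma": [0.702, 0.480, 0.490, 0.647, 0.680],
    "Madelon":  [0.594, 0.542, 0.549, 0.499, 0.548],
    "PcMac":    [0.703, 0.515, 0.517, 0.543, 0.638],
    "PIE10P":   [0.734, 0.535, 0.535, 0.696, 0.701],
    "PIX10P":   [0.907, 0.837, 0.837, 0.882, 0.902],
    "Prostata": [0.749, 0.507, 0.511, 0.538, 0.735],
    "Relathe":  [0.660, 0.550, 0.560, 0.562, 0.553],
}
k, n = 5, len(acc)       # number of algorithms, number of datasets

def row_ranks(vals):
    """Rank within one dataset: highest accuracy gets rank 1, ties averaged."""
    order = sorted(range(len(vals)), key=lambda j: -vals[j])
    ranks = [0.0] * len(vals)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
            j += 1
        r = (i + j) / 2 + 1            # average rank of the tied block
        for t in range(i, j + 1):
            ranks[order[t]] = r
        i = j + 1
    return ranks

avg = [sum(row_ranks(v)[j] for v in acc.values()) / n for j in range(k)]
print("average ranks:", [round(r, 4) for r in avg])  # EnsCLS, CLS, CSFS, BS, SSFI

# Nemenyi critical distance at p = 0.1 for k = 5, n = 9.
cd = 2.459 * math.sqrt(k * (k + 1) / (6 * n))
print("CD =", round(cd, 4))
```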
Furthermore, we present the results of the Nemenyi post-hoc test with an average rank diagram, as suggested by Demšar [9]; it is given in Figure 2. The ranks are depicted on an axis such that the best-ranking algorithms are at the rightmost side of the diagram. Algorithms that do not differ significantly (at p = 0.1) are connected with a line. The critical difference is shown above the graph (here CD = 1.8336).

[Figure 2: average rank diagram; the axis runs from rank 5 (left) to rank 1 (right), with CLS, CSFS, BS, SSFI and EnsCLS placed at their average ranks, and algorithms whose average ranks differ by less than CD joined by lines.]

Fig. 2. Average rank diagram comparing the feature selection algorithms in terms of accuracy over different numbers of selected features.

Overall, EnsCLS performs best; however, its performance is not statistically distinguishable from that of SSFI. Another interesting observation from the average rank diagram and Table 2 is that in almost all cases the ensemble methods, i.e. EnsCLS, SSFI and BS, perform better than the single methods CLS and CSFS. The statistical tests we use are conservative, and the differences in performance between the methods of the first group (EnsCLS and SSFI) are not significant. To further support these rank comparisons, we compared, on each dataset and for each pair of methods, the accuracy values of Table 2 using the paired t-test (with p = 0.1). The results of these pairwise comparisons are given in Table 3 in terms of the "win-tie-loss" statuses of all pairs of methods; the three values in each cell (i, j) respectively indicate how many times approach i is significantly better than / not significantly different from / significantly worse than approach j. Following [9], if two algorithms are, as assumed under the null hypothesis, equivalent, each should win on approximately n/2 out of n datasets. The number of wins is distributed according to the binomial distribution, and the critical number of wins at p = 0.1 is equal to 7 in our case.
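The critical number of wins follows directly from the binomial tail. This short sketch (ours, for verification) computes it for n = 9 datasets and applies the tie-splitting rule described below to the EnsCLS-vs-SSFI comparison of Table 3:

```python
from math import comb

def min_wins(n, alpha=0.1):
    """Smallest w such that P(X >= w) <= alpha for X ~ Binomial(n, 1/2):
    the critical number of wins for the sign test over n datasets."""
    total = 2 ** n
    for w in range(n + 1):
        tail = sum(comb(n, x) for x in range(w, n + 1))
        if tail / total <= alpha:
            return w

critical = min_wins(9)
print("critical wins over 9 datasets at p = 0.1:", critical)

# EnsCLS vs SSFI in Table 3: 5 wins, 4 ties, 0 losses.  Splitting the 4 ties
# evenly (2 each) gives EnsCLS 5 + 2 = 7 wins, reaching the critical value.
wins = 5 + 4 // 2
print("EnsCLS wins after splitting ties:", wins)
```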
Since tied matches support the null hypothesis, we should not discount them but split them evenly between the two classifiers when counting the number of wins; if there is an odd number of ties, we ignore one. In Table 3, each pairwise comparison entry (i, j) for which approach i is significantly better than j according to the sign test at p = 0.1 is marked with an asterisk.

Table 3. Pairwise t-test comparisons of the FS methods in terms of accuracy. An entry (i, j) marked * indicates that approach i is significantly better than approach j according to the sign test at p = 0.1.

        EnsCLS  CLS     CSFS    BS      SSFI
EnsCLS  −       8/1/0*  8/1/0*  8/1/0*  5/4/0*
CLS     0/1/8   −       2/3/4   2/2/5   0/3/6
CSFS    0/1/8   4/3/2   −       2/2/5   1/2/6
BS      0/1/8   5/2/2   5/2/2   −       0/2/7
SSFI    0/4/5   6/3/0*  6/2/1*  7/2/0*  −

From this table, the same trend between EnsCLS and the other feature selection methods can be observed as in Table 2 and Figure 2, i.e., EnsCLS and SSFI usually perform better than all the other methods. It can also be seen from Table 3 that EnsCLS significantly outperforms SSFI.

6 Conclusion

The Constrained Laplacian Score (CLS), which uses pairwise constraints for feature selection, showed good performance in our previous work [3]. An important problem with such an approach, however, is how best to use the available constraints when low-quality ones may deteriorate the learning performance. Instead of putting effort into choosing constraints for a single feature selector, as recently done in the CSFS approach [19], we address this important issue from another angle. We propose a novel semi-supervised feature selection method called Ensemble Constrained Laplacian Score (EnsCLS for short), which first combines data resampling (bagging) and random subspace strategies to generate different views of the data.
Once each ensemble component is obtained, the CLS score is used to measure feature relevance. A ranking of all features is finally obtained with respect to their average relevance over all ensemble members. Extensive experiments on a series of benchmark and real datasets have verified the effectiveness of our approach compared to other state-of-the-art semi-supervised feature selection algorithms, and confirm the ability of the ensemble strategy to rank the relevant features accurately. They also show that the proposed EnsCLS method can utilize labeled and unlabeled data more effectively than the Constraint Laplacian Score. Furthermore, they indicate that our method, which injects some randomness when manipulating the available unlabeled and labeled data (constraints), is superior to the recently proposed CSFS method, which actively selects constraints to improve the quality of the CLS score. Further substantiation, through more experiments on biological databases containing several thousands of variables and through evaluating the stability of the feature selection method [25, 27] when small changes are made to the data, is currently being undertaken. Moreover, comparisons using different numbers of pairwise constraints will be reported in due course.

References

1. A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. Hudson, L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, R. Levy, W. Wilson, M. R. Grever, J. C. Byrd, D. Botstein, P. O. Brown, and L. M. Staudt. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769):503–511, 2000.
2.
Hasna Barkia, Haytham Elghazel, and Alex Aussem. Semi-supervised feature importance evaluation with ensemble learning. In ICDM, pages 31–40, 2011.
3. K. Benabdeslem and M. Hindawi. Constrained Laplacian score for semi-supervised feature selection. In Proceedings of the ECML-PKDD conference, pages 204–218, 2011.
4. C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.
5. L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
6. Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
7. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
8. I. Davidson, K. Wagstaff, and S. Basu. Measuring constraint-set utility for partitional clustering algorithms. In Proceedings of ECML/PKDD, 2006.
9. Janez Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
10. T. G. Dietterich. Ensemble methods in machine learning. In First International Workshop on Multiple Classifier Systems, pages 1–15, 2000.
11. J. G. Dy and C. E. Brodley. Feature selection for unsupervised learning. Journal of Machine Learning Research, 5:845–889, 2004.
12. Haytham Elghazel and Alex Aussem. Unsupervised feature selection with ensemble learning. Machine Learning, pages 1–24, 2013.
13. R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.
14. Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In 13th International Conference on Machine Learning, pages 276–280, 1996.
15. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, and H. Coller. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
16. I. Guyon and A. Elisseeff. An introduction to variable and feature selection.
Journal of Machine Learning Research, 3:1157–1182, 2003.
17. M. F. Abdel Hady and F. Schwenker. Combining committee-based semi-supervised learning and active learning. Journal of Computer Science and Technology, 25(4):681–698, 2010.
18. X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. In Advances in Neural Information Processing Systems 17, 2005.
19. M. Hindawi, K. Allab, and K. Benabdeslem. Constraint selection based semi-supervised feature selection. In Proceedings of the International Conference on Data Mining, pages 1080–1085, 2011.
20. Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell., 20(8):832–844, 1998.
21. Yi Hong, Sam Kwong, Yuchou Chang, and Qingsheng Ren. Consensus unsupervised feature ranking from multiple views. Pattern Recognition Letters, 29(5):595–602, 2008.
22. Yi Hong, Sam Kwong, Yuchou Chang, and Qingsheng Ren. Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm. Pattern Recognition, 41(9):2742–2756, 2008.
23. M. Kalakech, P. Biela, L. Macaire, and D. Hamad. Constraint scores for semi-supervised feature selection: A comparative study. Pattern Recognition Letters, 32(5):656–665, 2011.
24. T. Kohonen. Self-Organizing Maps. Springer Verlag, Berlin, 2001.
25. Ludmila I. Kuncheva. A stability index for feature selection. AIAP'07, pages 390–395, 2007.
26. M. Li and Z. H. Zhou. Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Transactions on Systems, Man, and Cybernetics, 37(6):1088–1098, 2007.
27. Yvan Saeys, Thomas Abeel, and Yves Van de Peer. Robust feature selection using ensemble feature selection techniques. In ECML/PKDD (2), pages 313–325, 2008.
28. Dinesh Singh, Phillip G. Febbo, Kenneth Ross, Donald G.
Jackson, Judith Manola, Christine Ladd, Pablo Tamayo, Andrew A. Renshaw, Anthony V. D'Amico, and Jerome P. Richie. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, 2002.
29. A. Strehl and J. Ghosh. Cluster ensembles – a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.
30. Dan Sun and Daoqiang Zhang. Bagging constraint score for feature selection with pairwise constraints. Pattern Recognition, 43:2106–2118, 2010.
31. A. Topchy, A. K. Jain, and W. Punch. Clustering ensembles: Models of consensus and weak partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1866–1881, 2005.
32. Y. Yaslan and Z. Cataltepe. Co-training with relevant random subspaces. Neurocomputing, 73(10-12):1652–1661, 2010.
33. D. Zhang, S. Chen, and Z. Zhou. Constraint score: A new filter method for feature selection with pairwise constraints. Pattern Recognition, 41(5):1440–1451, 2008.
34. Z. Zhao and H. Liu. Semi-supervised feature selection via spectral analysis. In Proceedings of the SIAM International Conference on Data Mining (SDM), pages 641–646, 2007.
35. Z. Zhao and H. Liu. Semi-supervised feature selection via spectral analysis. In SDM, pages 641–646, 2007.
36. Zheng Zhao, Fred Morstatter, Shashvata Sharma, Salem Alelyani, and Aneeth Anand. Feature selection, 2011.

Feature ranking for multi-label classification using predictive clustering trees

Dragi Kocev, Ivica Slavkov, and Sašo Džeroski

Department of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
{Dragi.Kocev,Ivica.Slavkov,Saso.Dzeroski}@ijs.si

Abstract. In this work, we present a feature ranking method for multi-label data.
The method is motivated by practically relevant multi-label applications, such as semantic annotation of images and videos, functional genomics, and music and text categorization. We propose a feature ranking method based on random forests. Considering the success of feature ranking using random forests in the tasks of classification and regression, we extend this method to multi-label classification. We use predictive clustering trees for multi-label classification as base predictive models for the random forest ensemble. We evaluate the proposed method on benchmark datasets for multi-label classification. The evaluation shows that the proposed method produces valid feature rankings and that it can be successfully used for performing dimensionality reduction.

Key words: multi-label classification, feature ranking, random forest, predictive clustering trees

1 Introduction

The problem of single-label classification is concerned with learning from examples, where each example is associated with a single label λi from a finite set of disjoint labels L = {λ1, λ2, ..., λQ}, i = 1..Q, Q > 1. For Q > 2, the learning problem is referred to as multi-class classification. On the other hand, the task of learning a mapping from an example x ∈ X (X denotes the domain of examples) to a set of labels Y ⊆ L is referred to as multi-label classification (MLC). In contrast to multi-class classification, the alternatives in multi-label classification are not assumed to be mutually exclusive: multiple labels may be associated with a single example, i.e., each example can be a member of more than one class. The labels in the set Y are called relevant, while the labels in the set L \ Y are irrelevant for a given example. Many different methods have been developed for solving MLC problems. Tsoumakas et al. [16] summarize them into two main categories: a) algorithm adaptation methods, and b) problem transformation methods.
Algorithm adaptation methods extend specific learning algorithms to handle multi-label data directly. Problem transformation methods, on the other hand, transform the MLC problem into one or more single-label classification problems. The single-label classification problems are solved with a commonly used single-label classification method, and the output is transformed back into a multi-label representation. The issue of learning from multi-label data has recently attracted significant attention from many researchers, motivated by an increasing number of new applications. The latter include semantic annotation of images and videos (news clips, movie clips), functional genomics (gene and protein function), music categorization into emotions, text classification (news articles, web pages, patents, emails, bookmarks, ...), directed marketing, and others. Despite the popularity of the task of MLC, the tasks of feature ranking and feature selection have not received much attention. The few available methods are based on the problem transformation paradigm [14]; thus, they do not fully exploit the possible label dependencies. More specifically, these methods use the label powerset (LP) approach to MLC [16, 6] from the group of problem transformation methods. The basis of the LP methods is to combine entire label sets into atomic (single) labels to form a single-label problem (i.e., a multi-class classification problem). For this single-label problem, the set of possible single labels represents all distinct label subsets from the original multi-label representation. In this way, LP-based methods directly take into account the label correlations. However, the space of possible label subsets can be very large.
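The LP transformation and the growth of the label space can be illustrated with a small sketch (the helper is hypothetical, not taken from any cited implementation):

```python
def label_powerset(Y):
    """Label powerset (LP) transformation: each distinct label *set*
    becomes one atomic class of a single-label (multi-class) problem."""
    classes = {}
    transformed = []
    for labels in Y:
        key = frozenset(labels)
        cls = classes.setdefault(key, len(classes))
        transformed.append(cls)
    return transformed, classes

# Toy multi-label data over L = {0, 1, 2}
Y = [{0}, {0, 1}, {1, 2}, {0, 1}, {2}]
y_single, classes = label_powerset(Y)
# Four distinct label sets -> four atomic classes; in general, with Q labels
# the number of possible classes can grow up to 2^Q - 1, which is what
# motivates pruning the transformation.
```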
To resolve this issue, Read [11] developed a pruned problem transformation (PPT) method, which selects only the transformed labels that occur more than a predefined number of times. Tsoumakas et al. [16] use the LP-transformed dataset to calculate a simple χ2 statistic, thus producing a ranking of the features. Doquire and Verleysen [6] use the PPT-transformed dataset to calculate mutual information (MI) for performing feature selection, and they show that this method outperforms the χ2-based feature ranking. Feature ranking for MLC with problem transformation has two major shortcomings. First, the label dependencies and interconnections are not fully exploited. Second, these methods do not scale to domains with a large number of labels, because of the exponential growth of the number of possible label powersets. Furthermore, the label powerset methods can yield a multi-class problem with an extremely skewed class distribution. To address these issues, we propose an algorithm adaptation method for performing feature ranking. We extend the random forest feature ranking method [3] to the task of MLC. More specifically, we construct a random forest that employs predictive clustering trees (PCTs) for MLC as base predictive models [1], [8]. PCTs can be considered a generalization of decision trees that are able to make predictions for structured outputs. This work is motivated by several factors. First, the number of possible application domains for MLC and the size of the problems are increasing. For example, in image annotation the number of available images and possible labels is growing rapidly; in functional genomics the measurement techniques have improved significantly and high-dimensional genomic data are available for analysis. Second, in Madjarov et al. [10] we have shown that the random forests of
PCTs for MLC are among the best predictive models for the task of MLC. Next, random forests as feature ranking algorithms are very successful on simple classification tasks [17, 19]. Finally, in Kocev et al. [8] we have shown that the random forests of PCTs are among the most efficient methods for predicting structured outputs. This is especially important, since many of the methods for MLC are computationally expensive and thus are not able to produce a predictive model for a given domain in a reasonable time (i.e., a few weeks) [10]. We evaluate the proposed method on 4 benchmark datasets for MLC using 7 different evaluation measures. We compare the feature ranking produced by the proposed method to a random feature ranking. The random feature ranking is the worst possible ranking; thus, if the proposed method is able to capture the feature relevances, it should outperform the random ranking. We assess the performance of the obtained rankings and the random rankings by using error testing curves [12]. The goal of this study is to investigate whether random forests of PCTs for MLC can produce good feature rankings for the task of multi-label classification. Moreover, we want to check whether the produced rankings can be used to reduce the dimensionality of the considered multi-label domains. The remainder of this paper is organized as follows. Section 2 presents predictive clustering trees. The method for feature ranking using random forests is described in Section 3. Section 4 outlines the experimental design, while Section 5 presents the results of the experimental evaluation. Finally, the conclusions and a summary are given in Section 6.

2 Predictive clustering trees for multi-label classification

Predictive clustering trees (PCTs) [1] generalize decision trees [4] and can be used for a variety of learning tasks, including different types of prediction and clustering.
The PCT framework views a decision tree as a hierarchy of clusters: the top node of a PCT corresponds to one cluster containing all the data, which is recursively partitioned into smaller clusters while moving down the tree. The leaves represent the clusters at the lowest level of the hierarchy, and each leaf is labeled with its cluster's prototype (prediction). PCTs can be induced with a standard top-down induction of decision trees (TDIDT) algorithm [4]. The algorithm is presented in Table 1. It takes as input a set of examples (E) and outputs a tree. The heuristic (h) used for selecting the tests (t) is the reduction in variance caused by partitioning (P) the instances (see line 4 of the BestTest procedure in Table 1). By maximizing the variance reduction, cluster homogeneity is maximized, which improves predictive performance. If no acceptable test can be found (see line 6), that is, if no test significantly reduces the variance, then the algorithm creates a leaf and computes the prototype of the instances belonging to that leaf. The main difference between the algorithm for learning PCTs and other algorithms for learning decision trees is that the former considers the variance function and the prototype function (which computes a label for each leaf) as parameters that can be instantiated for a given learning task. So far, PCTs have

Table 1. The top-down induction algorithm for PCTs.
procedure PCT(E) returns tree
 1: (t*, h*, P*) = BestTest(E)
 2: if t* ≠ none then
 3:   for each Ei ∈ P* do
 4:     tree_i = PCT(Ei)
 5:   return node(t*, ∪_i {tree_i})
 6: else
 7:   return leaf(Prototype(E))

procedure BestTest(E)
 1: (t*, h*, P*) = (none, 0, ∅)
 2: for each possible test t do
 3:   P = partition induced by t on E
 4:   h = Var(E) − Σ_{Ei ∈ P} (|Ei| / |E|) · Var(Ei)
 5:   if (h > h*) ∧ Acceptable(t, P) then
 6:     (t*, h*, P*) = (t, h, P)
 7: return (t*, h*, P*)

been instantiated for the following tasks: prediction of multiple targets [8], [15], prediction of time series [13], and hierarchical multi-label classification [18]. One of the most important steps in the induction algorithm is the test selection procedure. For each node, a test is selected by using a heuristic function computed on the training examples. The goal of the heuristic is to guide the algorithm towards small trees with good predictive performance. The heuristic used for selecting the attribute tests in the internal nodes is the reduction in variance caused by partitioning the instances: maximizing the variance reduction maximizes cluster homogeneity and improves predictive performance. In this work, we focus on the task of multi-label classification, which can be considered a special case of multi-target prediction. Namely, in the task of multi-target prediction, the goal is to make predictions for multiple target variables. The multiple labels in MLC can be viewed as multiple binary variables, each one specifying whether a given example is labelled with the corresponding label.
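This binary-variable view of a multi-label target can be illustrated as follows (a minimal sketch; the helper name is hypothetical, and the labels are borrowed from the Emotions dataset used later in the paper):

```python
def to_binary_targets(Y, label_space):
    """View each multi-label target Y ⊆ L as a tuple of binary
    variables, one per label in the label space."""
    return [[1 if lbl in labels else 0 for lbl in label_space]
            for labels in Y]

labels = ["sad-lonely", "relaxing-calm", "happy-pleased"]
Y = [{"sad-lonely"}, {"relaxing-calm", "happy-pleased"}]
binary = to_binary_targets(Y, labels)
# -> [[1, 0, 0], [0, 1, 1]]
```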
Therefore, we compute the variance function in the same way as for the task of predicting multiple discrete variables [8], i.e., as the sum of the Gini indices [4] of the variables from the target tuple:

Var(E) = Σ_{i=1}^{T} Gini(E, Yi),    Gini(E, Yi) = 1 − Σ_{j=1}^{Ci} (p_{cij})²,

where T is the number of target attributes, cij is the j-th class of target attribute Yi, p_{cij} is its relative frequency in E, and Ci is the number of classes of target attribute Yi. The prototype function returns a vector of probabilities for the set of labels, indicating whether an example is labelled with a given label. For a detailed description of PCTs for multi-target prediction, the reader is referred to [1, 8]. The PCT framework is implemented in the CLUS system (available for download at http://clus.sourceforge.net).

3 Feature ranking via random forests

We construct the random forest using predictive clustering trees as base classifiers. We exploit the random forest mechanism [3] to calculate the variable importance, i.e., the feature ranking. In the following subsections, we first present the random forest algorithm and then describe how it can be used to estimate the importance of the descriptive variables.

3.1 Random Forests

An ensemble is a set of classifiers constructed with a given algorithm. Each new example is classified by combining the predictions of every classifier in the ensemble. These predictions can be combined by taking the average (for regression tasks) and the majority or probability distribution vote (for classification tasks) [2], or by taking more complex combinations [9]. A necessary condition for an ensemble to be more accurate than any of its individual members is that the classifiers are accurate and diverse [7]. An accurate classifier does better than random guessing on new examples.
Two classifiers are diverse if they make different errors on new examples. There are several ways to introduce diversity: by manipulating the training set (by changing the weights of the examples [2] or by changing the attribute values of the examples [3]), or by manipulating the learning algorithm itself [5]. A random forest [3] is an ensemble of trees, where diversity among the predictive models is obtained by using bootstrap replicates and, additionally, by changing the feature set during learning. More precisely, at each node in the decision trees, a random subset of the input attributes is taken, and the best feature is selected from this subset. The number of attributes that are retained is given by a function f of the total number of input attributes x (e.g., f(x) = 1, f(x) = ⌊√x + 1⌋, f(x) = ⌊log₂(x) + 1⌋, ...). By setting f(x) = x, we obtain the bagging procedure.

3.2 Feature ranking using random forests

A feature ranking of the descriptive variables can be obtained by exploiting the mechanism of random forests. This method uses the internal out-of-bag estimates of the error together with noising (permuting) of the descriptive variables. To create each tree of the forest, the algorithm first creates a bootstrap replicate (line 4 of the Induce_RF procedure, Table 2). The samples that are not selected for the bootstrap are called out-of-bag (OOB) samples (line 7, procedure Induce_RF). These samples are used to evaluate the performance of each tree of the forest. The complete algorithm is presented in Table 2. Suppose that there are T target variables and D descriptive variables. The trees in the forest are constructed using a randomized variant of the PCT construction algorithm (PCT_rand), i.e., in each node the split is selected from a subset of the descriptive variables. After each tree of the forest is built, the values of the descriptive attributes of the OOB samples are randomly permuted one by one, thus obtaining D permuted OOB samples (line 3, procedure Update_Imp).
The predictive performance of each tree is evaluated on the original OOB data (Err(OOB_i)) and on the permuted versions of the OOB data (Err_i(f_d)). The performance is averaged across the T target variables. The importance of a given variable (I_j) is then calculated as the relative increase in misclassification error obtained when its values are randomly permuted, averaged over all trees in the forest:

Importance(f_d) = (1/k) Σ_{i=1}^{k} (Err_i(f_d) − Err(OOB_i)) / Err(OOB_i)    (1)

where k is the number of bootstrap replicates and 0 < d ≤ D.

Table 2. The algorithm for feature ranking via random forests. E is the set of training examples, k is the number of trees in the forest, and f(x) is the size of the feature subset considered at each node during tree construction.

procedure Induce_RF(E, k, f(x)) returns Forest, Importances
 1: F = ∅
 2: I = ∅
 3: for i = 1 to k do
 4:   Ei = Bootstrap_sample(E)
 5:   Tree_i = PCT_rand(Ei, f(x))
 6:   F = F ∪ {Tree_i}
 7:   E_OOB = E \ Ei
 8:   Update_Imp(E_OOB, Tree_i, I)
 9: I = Average(I, k)
10: return F, I

procedure Update_Imp(E_OOB, Tree, I)
 1: Err_OOB = Evaluate(Tree, E_OOB)
 2: for j = 1 to D do
 3:   Ej = Randomize(E_OOB, j)
 4:   Err_j = Evaluate(Tree, Ej)
 5:   I_j = I_j + (Err_j − Err_OOB) / Err_OOB
 6: return

procedure Average(I, k)
 1: I_avg = ∅
 2: for l = 1 to size(I) do
 3:   I_avg[l] = I_l / k
 4: return I_avg

4 Experimental design

In this section, we describe the experimental design used to evaluate the performance of the proposed method. We begin by briefly summarizing the multi-label datasets used in this study. Next, we present the evaluation measures and discuss the construction of the error curves. Finally, we give the specific parameter instantiation of the methods.
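The permutation-based importance of Section 3.2 (formula (1), accumulated per tree) can be sketched as follows. This is an illustrative sketch, not the CLUS implementation; `trees`, `oob_sets` and `error_fn` are assumed interfaces:

```python
import random

def permutation_importance(trees, oob_sets, error_fn, n_features, seed=0):
    """For each tree, permute each feature's values in its out-of-bag
    (OOB) sample and record the relative increase in error, averaged
    over the forest. error_fn(tree, X, y) returns an error rate."""
    rng = random.Random(seed)
    k = len(trees)
    importance = [0.0] * n_features
    for tree, (X_oob, y_oob) in zip(trees, oob_sets):
        err_oob = error_fn(tree, X_oob, y_oob)
        for d in range(n_features):
            # Permute column d of the OOB sample, leave the rest intact.
            X_perm = [row[:] for row in X_oob]
            col = [row[d] for row in X_perm]
            rng.shuffle(col)
            for row, v in zip(X_perm, col):
                row[d] = v
            err_d = error_fn(tree, X_perm, y_oob)
            importance[d] += (err_d - err_oob) / err_oob
    return [imp / k for imp in importance]
```

A constant (uninformative) feature receives importance exactly 0, since permuting it cannot change any prediction.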
4.1 Data description

We use four multi-label classification benchmark problems. The selected problems have been used in various studies and evaluations of methods for multi-label learning. Table 3 presents the basic statistics of the datasets. The datasets vary in size: from 391 to 4880 training examples, from 202 to 2515 testing examples, from 72 to 1836 features, from 6 to 159 labels, and from 1.25 to 3.38 average number of labels per example (i.e., label cardinality [16]). In the literature, these datasets come pre-divided into training and testing parts; thus, in the experiments, we use them in their original format. The training part usually comprises around 2/3 of the complete dataset, and the testing part the remaining 1/3.

Table 3. Description of the benchmark problems in terms of the number of training (#tr.e.) and test (#t.e.) examples, the number of features (D), the total number of labels (Q) and the label cardinality (lc). The problems are ordered by their overall complexity, roughly calculated as #tr.e. × D × Q.

           #tr.e.  #t.e.  D     Q    lc
emotions   391     202    72    6    1.87
medical    645     333    1449  45   1.25
enron      1123    579    1001  53   3.38
bibtex     4880    2515   1836  159  2.40

The datasets come from the domains of multimedia and text categorization. Emotions is a dataset from the multimedia domain where each instance is a piece of music. Each piece of music can be labelled with six emotions: sad-lonely, angry-aggressive, amazed-surprised, relaxing-calm, quiet-still, and happy-pleased. The domain of text categorization is represented by 3 datasets: medical, enron and bibtex. Medical is a dataset used in the Medical Natural Language Processing Challenge^2 in 2007. Each instance is a document that contains a brief free-text summary of a patient's symptom history.
The goal is to annotate each document with the probable diseases from the International Classification of Diseases. Enron is a dataset that contains e-mails from 150 senior Enron officials, categorized into several categories. The labels can be further grouped into four categories: coarse genre, included/forwarded information, primary topics, and messages with emotional tone. Bibtex contains metadata for bibtex items, such as the title of the paper, the authors, the book title, journal volume, publisher, etc. These datasets are available for download at the web page of the Mulan project^3.

4.2 Experimental setup

We evaluate the proposed method using seven evaluation measures: accuracy, micro precision, micro recall, micro F1, macro precision, macro recall, and macro F1. These measures are typically used to evaluate the performance of multi-label classification methods. Micro averaging implicitly includes information about the label frequency, while macro averaging treats all labels equally. Due to space limitations, we only show the results for micro F1, because the F1 score unites the values of precision and recall; the results and the discussion are similar if the other measures are used. These measures are discussed in detail in [10] and [16]. We assess the performance of the proposed method using error curves [12]. The error curves are based on the idea that the 'correctness' of a feature ranking is related to predictive accuracy. A good ranking algorithm puts a feature that is most important w.r.t. some target concept at the top of the list, and a feature that is least important at the bottom. All the other features are in between, ordered by decreasing importance.
^2 http://www.computationalmedicine.org/challenge/
^3 http://mulan.sourceforge.net/datasets.html

By following this intuition, we evaluate the ranking by performing a stepwise feature subset evaluation, which is used to obtain an error curve. We generate two types of curves: forward feature addition (FFA) and reverse feature addition (RFA) curves. Examples of these curves are shown in Figures 1, 2, 3, and 4. The FFA curve is constructed from the top-k ranked features, i.e., from the beginning of the ranking. In contrast, the RFA curve is constructed from the bottom-k ranked features. For the FFA curves, we can expect that as the number of features used to construct the predictive model increases, the accuracy of the predictive models also increases. This can be interpreted as follows: by adding more and more of the top-ranked features, the constructed feature subsets contain more relevant features, which is reflected in the improvement of the error measure. On the other hand, for the RFA curves, we can expect a slight difference at the beginning of the curve, which, considering the previous discussion, is located at the end of the x-axis. Namely, the accuracy of the models constructed with the bottom-ranked features is minimal, which means the ranking is correct in the sense that it puts irrelevant features at the bottom. As the number of bottom-k ranked features used to construct the predictive model increases, some relevant features are included and the accuracy of the models increases. In summary, at each point k, the FFA curve gives the error of the predictive models constructed with the top-k ranked features, while the RFA curve gives the error of the models constructed with the bottom-k ranked features. The algorithm for constructing the curves is given in Table 4. Table 4.
The algorithm for generating forward feature addition (FFA) and reverse feature addition (RFA) curves. R = {F_r1, ..., F_rn} is the feature ranking and F_t is the target feature.

procedure ConstructErrorCurve(R, F_t) returns Error Curve Err
 RS ⇐ ∅
 for i = 1 to n do
   RS ⇐ RS ∪ feature(R, i)
   Err[i] = Err(M(RS, F_t))
 return Err

for FFA curve: feature(R, i) = {F_ri}
for RFA curve: feature(R, i) = {F_r(n−i+1)}

We compare the performance of the proposed method to the performance of a random ranking. We base this comparison on the idea that the random ranking is the worst ranking possible [12]. This is similar to the notion of a random predictive model in predictive modelling. If our algorithm is indeed able to capture the variable importance correctly, then its error curves should be better than the curves of a random ranking. In this work, we generate 100 random feature rankings for each dataset and show the averaged error curves. We opted for the comparison with the random rankings instead of with the methods presented in [6] and [16] because of the unstable results produced by those rankings (especially for the Emotions dataset). Moreover, the reported accuracies for the Emotions dataset are in the range of 0.3 to 0.5, while in our experiments the accuracy for the Emotions dataset is in the range of 0.6 to 0.8.

4.3 Parameter instantiation

The feature ranking algorithm with random forests of PCTs for MLC takes two parameters as input: the number of base predictive models in the forest and the feature subset size. In these experiments, we constructed the feature rankings using a random forest of 500 PCTs for MLC. Each node in a PCT was constructed by randomly selecting 10% of the features (as suggested in [8]). For the construction of the error curves, we selected random forests of PCTs for MLC as predictive models.
The random forest model in this case consists of 100 PCTs for MLC, and each node was constructed using 10% of the features. Both the predictive models and the feature rankings were constructed on the training set, while the performance for the error curves is the one obtained on the testing set.

5 Results and discussion

In this section, we present the results from the experimental evaluation of the proposed method. We explain the results with respect to the variable importance scores for the features, the FFA curves and the RFA curves. The FFA and RFA curves are constructed using micro F1; however, the conclusions remain valid if we consider the other evaluation measures. In the remainder, we discuss the results for each of the datasets considered in this study.

The results for the Emotions dataset are given in Figure 1. They show that the obtained ranking performs slightly better than the random ranking. Both FFA curves increase at a similar rate and have a similar shape; however, over a large part the FFA curve of the obtained ranking is above the curve of the random ranking. The RFA curve shows that the obtained ranking places more non-relevant features at the bottom of the ranking. This finding can be confirmed and explained with the variable scores (Figure 1(a)): the curve of the variable scores is roughly parallel to the x-axis. This means that the majority of features in this dataset are approximately equally relevant for the target concept (i.e., the multiple labels). Moreover, this could indicate that redundant features are present in the dataset. All in all, randomly selecting a feature subset of reasonable size (e.g., 25-30 features) is good enough to produce a predictive model with satisfactory predictive performance (i.e., the dimensionality can easily be reduced without a significant loss of information).
The results for the Bibtex (Figure 2) and Enron (Figure 3) datasets are somewhat similar to each other, so we discuss them together. We can see from the figures that the obtained ranking is clearly better than the random ranking.

Fig. 1. The performance of the random forests of PCTs for MLC feature ranking algorithm on the Emotions dataset: (a) feature importances reported by the ranking algorithm, (b) FFA curve and (c) RFA curve.

The FFA curve of the obtained ranking is always above the FFA curve of the random ranking and, conversely, the RFA curve of the obtained ranking is always below the RFA curve of the random ranking. Hence, more relevant features are placed at the top and more non-relevant features are placed at the bottom.

Fig. 2. The performance of the random forests of PCTs for MLC feature ranking algorithm on the Bibtex dataset: (a) feature importances reported by the ranking algorithm, (b) FFA curve and (c) RFA curve.

This is also confirmed by the variable importance scores.
We can note that the curve of the variable importances drops linearly, which means that multiple features in the dataset are more relevant for the target concept than the remaining features. The dimensionality in these cases can be significantly reduced: we still obtain very good predictive performance if we select the 500 top-ranked features (out of 1836) for the Bibtex dataset and the 50 top-ranked features (out of 1001) for the Enron dataset.

Fig. 3. The performance of the random forests of PCTs for MLC feature ranking algorithm on the Enron dataset: (a) feature importances reported by the ranking algorithm, (b) FFA curve and (c) RFA curve.

Finally, we discuss the results for the Medical dataset (given in Figure 4). We can note that the obtained ranking is significantly better than the random ranking. The FFA and RFA curves of the obtained ranking exhibit a very steep increase and decrease, respectively, while the FFA and RFA curves of the random ranking increase and decrease linearly. This means that there are a few features that are very relevant for the target concept and that these features carry the majority of the information about it. This is further confirmed by the curve of the variable importances, which descends exponentially. Considering all of this, we can drastically reduce the dimensionality of this dataset.
The good predictive performance will be preserved even if we select only the 35 top-ranked features (out of 1449).

Fig. 4. The performance of the random forests of PCTs for MLC feature ranking algorithm on the Medical dataset: (a) feature importances reported by the ranking algorithm, (b) FFA curve and (c) RFA curve.

6 Conclusions

In this work, we presented and evaluated a feature ranking method for the task of multi-label classification (MLC). The proposed method is based on the random forest feature ranking mechanism. Random forests have already proven to be a good method for feature ranking on the simpler tasks of classification and regression. Here, we propose an extension of the method to the task of MLC. To this end, we use predictive clustering trees (PCTs) for MLC as base predictive models. Random forests of PCTs have state-of-the-art predictive performance for the task of MLC, and we investigate whether this method can also be successful for the task of feature ranking for MLC. We evaluated the method on 4 benchmark multi-label datasets using 7 evaluation measures. The quality of the feature rankings was assessed by using forward feature addition and reverse feature addition curves. To investigate whether the obtained feature ranking is valid, i.e., whether it places the more relevant features closer to the top of the ranking and the non-relevant features closer to the bottom, we compared it to the performance of a random feature ranking.
We summarize the results as follows. First, we show that on datasets where many of the features are relevant for the target concept, the produced ranking can slightly outperform the random ranking. This is due to the fact that if several features are (randomly) selected, the predictive model will have satisfactory predictive performance. Next, on datasets with several relevant features for the target concept, the produced ranking clearly outperforms the random ranking. This means that the ranking algorithm is able to detect these features and place them at the top of the ranking. Furthermore, on datasets with only a few features of high relevance for the target concept, the obtained ranking drastically outperforms the random ranking, and satisfactory predictive performance can be obtained by using only 2-3% of the features. All in all, the experimental evaluation demonstrates that the random forest feature ranking method can be successfully applied to the task of MLC.

We plan to extend this work in the future along three major dimensions. First, we plan to include other measures of predictive performance in the ranking algorithm: the current version uses the misclassification rate, but we will consider MLC-specific evaluation measures. Next, we will extend the proposed method to other structured output prediction tasks, such as multi-target regression and hierarchical multi-label classification. Finally, we could estimate the relevance of a feature by considering the reduction of variance the feature causes when selected for a test in a given node.

References

1. Blockeel, H.: Top-down induction of first order logical decision trees. Ph.D. thesis, Katholieke Universiteit Leuven, Leuven, Belgium (1998)
2. Breiman, L.: Bagging Predictors. Machine Learning 24(2), 123–140 (1996)
3. Breiman, L.: Random Forests.
Machine Learning 45(1), 5–32 (2001)
4. Breiman, L., Friedman, J., Olshen, R., Stone, C.J.: Classification and Regression Trees. Chapman & Hall/CRC (1984)
5. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: Proc. of the 1st International Workshop on Multiple Classifier Systems - LNCS 1857. pp. 1–15. Springer (2000)
6. Doquire, G., Verleysen, M.: Feature Selection for Multi-label Classification Problems. In: Advances in Computational Intelligence. pp. 9–16 (2011)
7. Hansen, L.K., Salamon, P.: Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(10), 993–1001 (1990)
8. Kocev, D., Vens, C., Struyf, J., Džeroski, S.: Tree ensembles for predicting structured outputs. Pattern Recognition 46(3), 817–833 (2013)
9. Kuncheva, L.: Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience (2004)
10. Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S.: An extensive experimental comparison of methods for multi-label learning. Pattern Recognition 45(9), 3084–3104 (2012)
11. Read, J., Pfahringer, B., Holmes, G.: Multi-label Classification Using Ensembles of Pruned Sets. In: Proc. of the 8th IEEE International Conference on Data Mining. pp. 995–1000 (2008)
12. Slavkov, I.: An Evaluation Method for Feature Rankings. Ph.D. thesis, IPS Jožef Stefan, Ljubljana, Slovenia (2012)
13. Slavkov, I., Gjorgjioski, V., Struyf, J., Džeroski, S.: Finding explained groups of time-course gene expression profiles with predictive clustering trees. Molecular BioSystems 6(4), 729–740 (2010)
14. Spolaôr, N., Cherman, E.A., Monard, M.C., Lee, H.D.: A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science 292, 135–151 (2013)
15.
Struyf, J., Džeroski, S.: Constraint Based Induction of Multi-Objective Regression Trees. In: Proc. of the 4th International Workshop on Knowledge Discovery in Inductive Databases KDID - LNCS 3933. pp. 222–233. Springer (2006)
16. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining Multi-label Data. In: Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer Berlin / Heidelberg (2010)
17. Čehovin, L., Bosnić, Z.: Empirical evaluation of feature selection methods in classification. Intelligent Data Analysis 14(3), 265–281 (2010)
18. Vens, C., Struyf, J., Schietgat, L., Džeroski, S., Blockeel, H.: Decision trees for hierarchical multi-label classification. Machine Learning 73(2), 185–214 (2008)
19. Verikas, A., Gelzinis, A., Bacauskiene, M.: Mining data with random forests: A survey and results of new tests. Pattern Recognition 44(2), 330–349 (2011)

Identification of Statistically Significant Features from Random Forests

Jérôme Paul, Michel Verleysen, and Pierre Dupont
Université catholique de Louvain – ICTEAM/Machine Learning Group
Place Sainte Barbe 2, 1348 Louvain-la-Neuve – Belgium
{jerome.paul,michel.verleysen,pierre.dupont}@uclouvain.be
http://www.ucl.ac.be/mlg/

Abstract. Embedded feature selection can be performed by analyzing the variables used in a Random Forest. Such a multivariate selection takes into account the interactions between variables but is not easy to interpret in a statistical sense. We propose a statistical procedure to measure variable importance that tests whether variables are significantly useful in combination with others in a forest. We show experimentally that this new importance index correctly identifies relevant variables. The top of the variable ranking is, as expected, largely correlated with Breiman's importance index based on a permutation test.
Our measure has the additional benefit of producing p-values from the forest voting process. Such p-values offer a very natural way to decide which features are significantly relevant while controlling the false discovery rate.

1 Introduction

Feature selection aims at finding a subset of the most relevant variables for a prediction task. To this end, univariate filters, such as a t-test, are commonly used because they are fast to compute and their associated p-values are easy to interpret. However, such a univariate feature ranking does not take into account the possible interactions between variables. In contrast, a feature selection procedure embedded into the estimation of a multivariate predictive model typically captures those interactions. A representative example of such an embedded variable importance measure has been proposed by Breiman with his Random Forest algorithm (RF) [1]. While this importance index is effective for ranking variables, it is difficult to decide how many of these variables should eventually be kept. This question could be addressed through an additional validation protocol at the expense of an increased computational cost. In this work, we propose an alternative that avoids such additional cost and offers a statistical interpretation of the selected variables. The proposed multivariate RF feature importance index uses out-of-bag (OOB) samples to measure changes in the distribution of class votes when permuting a particular variable. It results in p-values, corrected for multiple testing, measuring how variables are useful in combination with the other variables of the model. Such p-values offer a very natural threshold for deciding which variables are statistically relevant. The remainder of this document is organised as follows.
Section 2 presents the notations and recalls Breiman's RF feature importance measure. Section 3 introduces the new feature importance index. Experiments are discussed in Section 4. Finally, Section 5 concludes this document and proposes hints for possible future work.

2 Context and Notations

Let $X \in \mathbb{R}^{n \times p}$ be the data matrix consisting of n data points in a p-dimensional space and y a vector of size n containing the corresponding class labels. A RF model [1] is made of an ensemble of trees, each of which is grown from a bootstrap sample of the n data points. For each tree, the selected samples form the bag ($B$); the remaining ones constitute the out-of-bag set ($\bar{B}$). Let $\mathcal{B}$ stand for the set of bags over the ensemble and $\bar{\mathcal{B}}$ be the set of corresponding OOBs. We have $|\mathcal{B}| = |\bar{\mathcal{B}}| = T$, the number of trees in the forest.

In order to compute feature importances, Breiman [1] proposes a permutation test procedure based on accuracy. For each variable $x_j$, there is one permutation test per tree in the forest. For an OOB sample $\bar{B}^k$ corresponding to the k-th tree of the ensemble, one considers the original values of the variable $x_j$ and a random permutation $\tilde{x}_j$ of its values on $\bar{B}^k$. The difference in prediction error using the permuted and original variable is recorded and averaged over all the OOBs in the forest. The higher this index, the more important the variable, because a high value corresponds to a stronger increase of the classification error when permuting it. The importance measure $J_a$ of the variable $x_j$ is then defined as:

$$J_a(x_j) = \frac{1}{T} \sum_{\bar{B}^k \in \bar{\mathcal{B}}} \frac{1}{|\bar{B}^k|} \sum_{i \in \bar{B}^k} \left[ I\big(h_k^{\tilde{x}_j}(i) \neq y_i\big) - I\big(h_k(i) \neq y_i\big) \right] \quad (1)$$

where $y_i$ is the true class label of the OOB example i, I is an indicator function, $h_k(i)$ is the class label of the example i as predicted by the tree estimated on the bag $B^k$, and $h_k^{\tilde{x}_j}(i)$ is the predicted class label from the same tree while the values of the variable $x_j$ have been permuted on $\bar{B}^k$.
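For illustration, Equation (1) can be sketched in Python, assuming the per-tree OOB predictions (with and without the permutation of the variable) have already been collected; the data structures and names below are hypothetical, not part of the original implementation:

```python
def ja_importance(oob_indices, y, pred, pred_perm):
    """Breiman's permutation importance J_a (Eq. 1) for one variable.

    oob_indices -- list of T lists: OOB example indices for each tree
    y           -- true class labels, indexed by example
    pred        -- pred[k][i]: label tree k predicts for OOB example i
    pred_perm   -- same predictions after permuting the variable on the OOB
    """
    T = len(oob_indices)
    total = 0.0
    for k, oob in enumerate(oob_indices):
        # per-OOB error counts, permuted minus original (Eq. 1 bracket)
        err_perm = sum(pred_perm[k][i] != y[i] for i in oob)
        err_orig = sum(pred[k][i] != y[i] for i in oob)
        total += (err_perm - err_orig) / len(oob)
    return total / T

# Toy forest of two trees: permuting the variable breaks one prediction
# per OOB sample, so J_a is positive (the variable matters).
y = {0: 1, 1: 0, 2: 1, 3: 0}
oob = [[0, 1], [2, 3]]
pred      = [{0: 1, 1: 0}, {2: 1, 3: 0}]  # all correct
pred_perm = [{0: 0, 1: 0}, {2: 0, 3: 0}]  # one error per OOB
print(ja_importance(oob, y, pred, pred_perm))
```

A variable whose permutation leaves all predictions unchanged gets an importance of exactly zero, matching the interpretation of Eq. (1).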
Such a permutation does not change the tree but potentially changes the prediction on the out-of-bag example, since its j-th dimension is modified after the permutation. Since the predictors with the original variable, $h_k$, and with the permuted variable, $h_k^{\tilde{x}_j}$, are individual decision trees, the sum over the various trees where this variable is present represents the ensemble behaviour, for the original variable values and its various permutations respectively.

3 A Statistical Feature Importance Index from RF

While $J_a$ is able to capture individual variable importances conditioned on the other variables used in the forest, it is not easily interpretable. In particular, it does not define a clear threshold to highlight statistically relevant variables. In the following sections, we propose a statistical feature importance measure closely related to $J_a$, and then compare it with an existing approach that aims at providing a statistical interpretation of feature importance scores.

3.1 Definition

In [2], the authors analyse the convergence properties of ensembles of predictors. Their statistical analysis allows one to determine the number of classifiers needed in an ensemble in order to make the same predictions as an ensemble of infinite size. To do so, they analyse the voting process and take a close look at the class vote distribution of such ensembles. In the present work, we combine the idea of Breiman's $J_a$ to use a permutation test with the analysis of the tree class vote distribution of the forest. We propose to perform a statistical test that assesses whether permuting a variable significantly influences that distribution.
The hypothesis is that removing the signal of an important variable by permuting it should change individual tree predictions, and hence the class vote distribution. One can estimate this distribution using the OOB data. In a binary classification setting, for each data point in an OOB, the prediction of the corresponding tree falls into one of the four following cases: correct prediction of class 1 (TP), correct prediction of class 0 (TN), incorrect prediction of class 1 (FP) and incorrect prediction of class 0 (FN). Summing the occurrences of those cases over all the OOBs gives an estimation of the class vote distribution of the whole forest. The same can be done when permuting a particular feature $x_j$, which gives an estimation of the class vote distribution of the forest after perturbing this variable. The various counts obtained can be arranged into a 4 × 2 contingency table. The first variable, which can take four different values, is the class vote. The second one is an indicator variable representing whether $x_j$ has been permuted or not. Formally, a contingency table is defined as follows for each variable $x_j$:

          $x_j$          $\tilde{x}_j$
  TN    $s(0,0)$    $s^{\tilde{x}_j}(0,0)$
  FP    $s(0,1)$    $s^{\tilde{x}_j}(0,1)$
  FN    $s(1,0)$    $s^{\tilde{x}_j}(1,0)$
  TP    $s(1,1)$    $s^{\tilde{x}_j}(1,1)$        (2)

where

$$s(l_1, l_2) = \sum_{\bar{B}^k \in \bar{\mathcal{B}}} \sum_{i \in \bar{B}^k} I\big(y_i = l_1 \text{ and } h_k(i) = l_2\big) \quad (3)$$

and $s^{\tilde{x}_j}(l_1, l_2)$ is defined the same way with $h_k^{\tilde{x}_j}(i)$ instead of $h_k(i)$. In order to quantify whether the class vote distribution changes when permuting $x_j$, one can use Pearson's χ² test of independence on the contingency table defined above. This test measures whether the joint occurrences of two variables are independent of each other. Rejecting the null hypothesis that they
are independent with a low p-value $p_{\chi^2}(x_j)$ would mean that $x_j$ influences the distribution and is therefore important. We note that, even on small datasets, there is no need to consider Fisher's exact test instead of Pearson's χ², since the cell counts are generally sufficiently large: the sum of all counts is twice the sum of all OOB sizes, which is influenced by the number of trees T.

If the importance of several variables has to be assessed, e.g. to find out which features are important, one should be careful to correct the obtained p-values for multiple testing. Indeed, if 1000 dimensions are evaluated using the commonly accepted 0.05 significance threshold, 50 variables are expected to be falsely deemed important. To control the false discovery rate (FDR), the p-values can be rescaled, e.g. using the Benjamini-Hochberg correction [3]. Let $p^{fdr}_{\chi^2}(x_j)$ be the value of $p_{\chi^2}(x_j)$ after FDR correction; the new importance measure is defined as

$$J_{\chi^2}(x_j) = p^{fdr}_{\chi^2}(x_j) \quad (4)$$

This statistical importance index is closely related to Breiman's $J_a$. The two terms inside the innermost sum of Equation (1) correspond to counts of FP and FN for the permuted and non-permuted variable $x_j$, which is encoded by the second and third rows of the contingency table in Equation (2). However, there are some differences between the two approaches. First, the central term of $J_a$ (Eq. (1)) is normalized by each OOB size, while the contingency table of $J_{\chi^2}$ (Eq. (2)) considers global counts. This follows from the fact that $J_a$ estimates an average decrease in accuracy on the OOB samples while $J_{\chi^2}$ estimates a distribution on those samples. More importantly, the very nature of those importance indices differs: $J_a$ is an aggregate measure of prediction performance, whereas $J_{\chi^2}$ (Eq. (4)) is a p-value from a statistical test. The interpretation of this new index is therefore much easier from a statistical significance viewpoint.
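To make the procedure concrete, here is a minimal sketch of the $J_{\chi^2}$ pipeline: Pearson's χ² statistic on a 4 × 2 table (3 degrees of freedom, for which the survival function has a closed form, so no statistics library is needed), followed by a Benjamini-Hochberg correction. The example counts are invented for illustration:

```python
from math import erfc, exp, pi, sqrt

def chi2_sf_df3(x):
    """Survival function of the chi-squared distribution with 3 degrees
    of freedom (closed form for odd df, avoiding a SciPy dependency)."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

def pearson_chi2(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_sums = [sum(r) for r in table]
    col_sums = [sum(c) for c in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

def benjamini_hochberg(pvals):
    """BH step-up correction; returns FDR-adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):   # walk from largest to smallest p-value
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adjusted[i] = prev
    return adjusted

# One 4x2 table per variable: rows TN/FP/FN/TP, columns original/permuted
# counts (Eq. 2). Permutation shifts mass from TP/TN to FP/FN for the first
# (relevant) variable; the second table barely changes.
tables = [
    [[400, 300], [50, 150], [50, 150], [400, 300]],  # relevant variable
    [[400, 395], [50, 55], [50, 55], [400, 395]],    # irrelevant variable
]
pvals = [chi2_sf_df3(pearson_chi2(t)) for t in tables]
j_chi2 = benjamini_hochberg(pvals)   # Eq. (4): the importance index
```

With these counts, the first variable gets a corrected p-value far below 0.05 and the second one does not, which is exactly the thresholding behaviour the index is designed to provide.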
In particular, it allows one to decide whether a variable is significantly important in the voting process of a RF. As a consequence, the lower $J_{\chi^2}$, the more important the corresponding feature, while it is the opposite for $J_a$.

3.2 Additional Related Work

In [4], the authors compare several ways to obtain a statistically interpretable index from a feature relevance score. Their goal is to convert feature rankings into statistical measures such as the false discovery rate, the family-wise error rate or p-values. To do so, most of their proposed methods make use of an external permutation procedure to compute a null distribution from which those metrics are estimated. The external permutation tests repeatedly compute feature rankings on dataset variants where some features are randomly permuted. A few differences with our proposed index can be highlighted. First, even if it can be applied to convert Breiman's $J_a$ into a statistically interpretable measure, the approach in [4] is conceptually more complex than ours: there is an additional resampling layer on top of the RF algorithm. This external resampling encompasses the growing of many forests and should not be confused with the internal bootstrap mechanism at the tree level, inside the forest. This external resampling can introduce some meta-parameters, such as the number of external resamplings and the number of instances to be sampled. On the other hand, our approach runs on a single RF. There is no need for additional meta-parameters, but it is less general: it is restricted to algorithms based on classifier ensembles. The external resampling procedures in [4] also imply that those methods are computationally more complex than $J_{\chi^2}$.
Indeed, they would multiply the cost of computing a ranking with $J_a$ by the number of external resamplings, whereas the time complexity of computing $J_{\chi^2}$ for p variables is exactly the same as with Breiman's $J_a$. If we assume that each tree node splits its instances into two sets of equal size until having one point per leaf, then the depth of a tree is log n and the time complexity of classifying an instance with one tree is O(log n). Hence, the global time complexity of computing a ranking of p variables is O(T · p · n · log n). Algorithm 1 details the time complexity analysis.

Algorithm 1: Pseudo-code for computing the importance of all variables with a forest of $T = |\mathcal{B}|$ trees

  res ← initRes()                                      // Θ(p)
  for x_j ∈ Variables do                               // Θ(p)
      contTable ← init()                               // Θ(1)
      for B̄^k ∈ B̄ do                                   // Θ(T)
          x̃_j ← perm(x_j, B̄^k)                         // Θ(n)
          for i ∈ B̄^k do                               // O(n)
              a ← h_k(i)                               // Θ(depth)
              b ← h_k^{x̃_j}(i)                          // Θ(depth)
              contTable ← update(contTable, a, b, y_i) // Θ(1)
          end
      end
      res[x_j] ← χ²(contTable)                         // Θ(1)
  end
  return res

4 Experiments

The following sections present experiments that highlight properties of the $J_{\chi^2}$ importance measure. They show that $J_{\chi^2}$ actually provides an interpretable importance index (Section 4.1), and that it is closely related to $J_a$ both in terms of variable rankings (Section 4.2) and of predictive performance when used as a feature selection pre-filter (Section 4.3). The last experiments, in Section 4.4, present predictive performances when restricting models to only statistically significant variables.

4.1 Interpretability of $J_{\chi^2}$

The main goal of the new feature importance measure is to provide an interpretable index allowing one to retrieve variables that are significantly important in the prediction of the forest.
In order to check that $J_{\chi^2}$ is able to identify those variables, first experiments are conducted on an artificial dataset with a linear decision boundary. This dataset is generated in the same way as in [4]. Labels $y \in \{-1, 1\}^n$ are given by $y = \mathrm{sign}(Xw)$, where $w \in \mathbb{R}^p$ and $X \in \mathbb{R}^{n \times p}$. Data values come from a N(0, 1) distribution. The number p of variables is set to 120. The first 20 weights $w_i$ are randomly sampled from U(0, 1). The other 100 weights are set to 0, such that relevant variables only belong to the first 20 (but not all of these variables need be relevant, e.g. whenever a weight is very small). The number of instances is n = 500, such that $X \in \mathbb{R}^{500 \times 120}$. In order to add some noise, 10% of the labels are randomly flipped.

To check that a feature selection technique is able to identify significant variables, we report the observed False Discovery Rate (FDR) as in [4]:

$$FDR = \frac{FD}{FD + TD} \quad (5)$$

where FD is the number of false discoveries (i.e. variables that are flagged as significantly important by the feature importance index but are actually not important) and TD is the number of true discoveries. A good variable importance index should yield a very low observed FDR.

A RF, built on the full dataset, is used to rank the variables according to their importance index. In order to decide whether a variable is significantly important, we fix the p-value threshold to the commonly accepted 0.05 value after correcting for multiple testing. Figure 1 shows the importance indices obtained by forests of various sizes and different numbers m of variables randomly sampled as candidates in each tree node. As we can see, the traditional (decreasing) $J_a$ index does not offer a clear threshold to decide which variables are relevant or not. Similarly to the methods presented in [4], the (increasing) $J_{\chi^2}$ index appears to distinguish more clearly between relevant and irrelevant variables. It however requires a relatively large number of trees to gain confidence that a feature is relevant.
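The artificial dataset described above can be generated in a few lines; the sketch below (with an arbitrary random seed and illustrative helper names) also includes the observed FDR of Equation (5):

```python
import numpy as np

rng = np.random.default_rng(42)

# Artificial dataset with a linear decision boundary, generated as in [4]:
# 20 (potentially) relevant variables, 100 pure-noise ones.
n, p, p_rel = 500, 120, 20
X = rng.standard_normal((n, p))
w = np.zeros(p)
w[:p_rel] = rng.uniform(0.0, 1.0, p_rel)
y = np.sign(X @ w)

# Flip 10% of the labels to add noise.
flip = rng.choice(n, size=n // 10, replace=False)
y[flip] = -y[flip]

def observed_fdr(selected, relevant):
    """Observed FDR (Eq. 5): false discoveries over all discoveries."""
    fd = len(set(selected) - set(relevant))
    td = len(set(selected) & set(relevant))
    return fd / (fd + td) if fd + td else 0.0

# E.g. a selector flagging variables {0, 1, 2, 100} makes one false
# discovery among four, for an observed FDR of 0.25.
```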
When computed on small forests (plots on the left), $J_{\chi^2}$ may fail to identify variables as significantly important, but they are still well ranked, as shown by the FDR values. Moreover, increasing the parameter m also tends to positively impact the identification of those variables when the number of trees is low.

Fig. 1. Importance indices computed on an artificial dataset with a linear decision boundary, shown for forests of T = 500 and T = 10000 trees and for m = 10 and m = 120 candidate variables per node (x-axis: rank). For the sake of visibility, $J_a$ has been rescaled between 0 and 1. The horizontal line is set at 0.05; variables with $J_{\chi^2}(x_j)$ below this line are deemed statistically relevant.

4.2 Concordance with $J_a$

As explained in Section 3.1, $J_{\chi^2}$ and $J_a$ share a lot in their computations. Figure 2 compares the rankings of the two importance measures on one sampling of the microarray DLBCL [5] dataset (p = 7129, class priors = 58/19). It shows that variable ranks in the top 500 are highly correlated: Spearman's rank correlation coefficient is 0.97 for those variables. One of the main differences between the rankings produced by $J_a$ and $J_{\chi^2}$ is that the first one penalizes features whose permuted versions would increase the prediction accuracy, while the second one favours such a variable, since it changes the class vote distribution. That explains why features at the end of $J_a$'s ranking have a better rank with $J_{\chi^2}$.
In particular, after rank 1250 on the horizontal axis, features have a negative $J_a$ value, as they somewhat lower the prediction performance of the forest. But, since they influence the class vote distribution, they are considered more important by $J_{\chi^2}$. Although those differences are quite interesting, the large ranks of those variables indicate that they most probably encode nothing but noise. Furthermore, only the top-ranked features are generally interesting and selected based on their low corrected p-values.

Fig. 2. Rankings produced by $J_a$ and $J_{\chi^2}$ on one external sampling of the DLBCL dataset.

4.3 Feature Selection Properties

As shown in Section 4.2, $J_a$ and $J_{\chi^2}$ provide quite correlated variable rankings. The experiments described in this section go a little deeper and show that, when used for feature selection, the properties of those two importance indices are also very similar in terms of prediction performance and stability of the feature selection.

In order to measure the predictive performance of a model, the Balanced Classification Rate (BCR) is used. It can be seen as the mean of per-class accuracies and is preferred to accuracy when dealing with non-balanced classes. It also generalizes to multi-class problems more easily than AUC. For two-class problems, it is defined as

$$BCR = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right) \quad (6)$$

Stability indices for feature selection aim at quantifying how much the selected sets of features vary when small changes are introduced in a dataset. The Kuncheva index (KI) [6] measures to what extent K sets of s selected variables share common elements.
    KI({S1, ..., SK}) = 2 / (K(K−1)) · Σ_{i=1}^{K−1} Σ_{j=i+1}^{K} (|Si ∩ Sj| − s²/p) / (s − s²/p)    (7)

where p is the total number of features and s²/p is a term correcting for the chance of sharing common features at random. This index ranges from −1 to 1. The greater the index, the greater the number of commonly selected features. A value of 0 is the expected stability for a selection performed uniformly at random.

In order to evaluate those performances and to mimic small changes in datasets, an external resampling protocol is used. The following steps are repeated 200 times:
• randomly select a training set Tr made of 90% of the data; the remaining 10% form the test set Te
• train a forest of T trees to rank the variables on Tr
• for each number of selected features s:
  ∗ train a forest of 500 trees using only the first s features on Tr
  ∗ save the BCR computed on Te and the set of s features

The statistics recorded at each iteration are then aggregated to provide the mean BCR and KI. Figure 3 presents the measurements made over the 200 resamplings of the DLBCL dataset according to the number of features kept to train the classifier. It shows that the two indices behave very similarly with respect to the number of features and the number of trees used to rank the features. Increasing the number of trees yields a more stable feature selection in both cases. This kind of behaviour has also been shown in [7].

4.4 Prediction from Significantly Important Variables

Experiments show that Jχ2 ranks features roughly the same way as Ja while providing a statistically interpretable index. One can wonder whether it is able to highlight important variables on real-world datasets and, furthermore, whether those variables are good enough to make a good prediction by themselves. Table 1 briefly describes the main characteristics of the 4 microarray datasets used in our study.
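For concreteness, the two evaluation measures defined in Section 4.3, the BCR of Eq. (6) and the Kuncheva index of Eq. (7), can be computed as follows (a minimal Python sketch of our own; the function names are not from the paper):

```python
from itertools import combinations

def bcr(tp, fn, tn, fp):
    """Balanced Classification Rate, Eq. (6): mean of per-class accuracies."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def kuncheva_index(subsets, p):
    """Kuncheva stability index, Eq. (7), for K equal-size feature subsets
    selected out of p features in total."""
    s = len(subsets[0])
    chance = s * s / p  # expected overlap of two random size-s subsets
    pairs = list(combinations(subsets, 2))
    return sum((len(set(a) & set(b)) - chance) / (s - chance)
               for a, b in pairs) / len(pairs)
```

Two identical subsets give an index of 1, while disjoint subsets give a negative value, i.e. less overlap than expected by chance.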
Using the same protocol as in Section 4.3, experiments show that the number of selected variables increases with the number of trees, which is consistent with the results in Section 4.1. As can be seen in Table 2, it is also very dataset-dependent, with almost no features selected on the DLBCL dataset. Similar results are observed in [4]. When comparing the predictive performances of a model built on only the significant variables of Jχ2 and a model built using the 50 best ranked variables of Ja, a paired T-test shows significant differences in most of the cases. However, except for the DLBCL dataset, when using 10000 trees, the average predictive performances are quite similar to each other. This confirms that, provided the number of trees is large enough and depending on the dataset, Jχ2 is able to select important variables that can be used to build good predictive models.

Fig. 3. Average BCR and KI of Ja and Jχ2 over 200 resamplings of the DLBCL dataset according to the number of selected features, for various numbers T of trees.

Table 1. Summary of the microarray datasets: class priors report the n values in each class, p represents the total number of variables.

Name           Class priors   p
DLBCL [5]      58/19          7129
Lymphoma [8]   22/23          4026
Golub [9]      25/47          7129
Prostate [10]  52/50          6033

Table 2. Various statistics obtained over 200 resamplings when keeping only significantly relevant variables. T is the number of trees used to build the forest. avg(srel ) (resp.
max, min) is the average (resp. maximum, minimum) number of Jχ2-significantly important features used to make the prediction. BCR is the average BCR obtained on models for which there is at least one significant feature with Jχ2. BCR50 is the average BCR obtained when using the 50 best Ja-ranked features in each iteration where Jχ2 outputted at least one significant feature.

           T      avg(srel)  min(srel)  max(srel)  BCR   BCR50
DLBCL      5000   0.04       0          1          0.52  0.67
           10000  0.99       0          5          0.69  0.83
golub      5000   5.96       3          10         0.93  0.97
           10000  10.82      8          14         0.96  0.97
lymphoma   5000   0.66       0          6          0.62  0.82
           10000  4.85       2          9          0.93  0.94
prostate   5000   4.95       2          8          0.93  0.94
           10000  7.92       6          11         0.93  0.94

5 Conclusion and Perspectives

This paper introduces a statistical feature importance index for the Random Forest algorithm which combines easy interpretability with the multivariate character of embedded feature selection techniques. The experiments presented in Section 4 show that it correctly identifies important features and that it is closely related to Breiman’s importance measure (mean decrease in accuracy after permutation): the two approaches yield similar feature rankings. In comparison to Breiman’s importance measure, the proposed index Jχ2 brings the interpretability of a statistical test and, at the same computational cost, allows us to decide which variables are significantly important using a very natural threshold. We show that growing forests with many trees increases the confidence that some variables are statistically significant in the RF voting process. This observation may be related to [7], where it is shown that the feature selection stability of tree ensemble methods increases and stabilises with the number of trees. The proposed importance measure may open ways to formally analyse this effect, similarly to [2]. We have evaluated Jχ2 on binary classification tasks.
Although there is a straightforward way to adapt it to the multi-class setting, future work should assess whether it is practically usable, in particular how many trees would be needed as the number of classes increases. Finally, one should also evaluate the possibility of applying this approach to other ensemble methods, possibly with different kinds of randomization.

References

1. Breiman, L.: Random Forests. Machine Learning 45(1) (October 2001) 5–32
2. Hernández-Lobato, D., Martínez-Muñoz, G., Suárez, A.: How large should ensembles of classifiers be? Pattern Recognition 46(5) (2013) 1323–1336
3. Benjamini, Y., Hochberg, Y.: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57(1) (1995) 289–300
4. Huynh-Thu, V.A., Saeys, Y., Wehenkel, L., Geurts, P.: Statistical interpretation of machine learning-based feature importance scores for biomarker discovery. Bioinformatics 28(13) (July 2012) 1766–1774
5. Shipp, M.A., Ross, K.N., Tamayo, P., Weng, A.P., Kutok, J.L., Aguiar, R.C., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G.S., Ray, T.S., Koval, M.A., Last, K.W., Norton, A., Lister, T.A., Mesirov, J., Neuberg, D.S., Lander, E.S., Aster, J.C., Golub, T.R.: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 8(1) (January 2002) 68–74
6. Kuncheva, L.I.: A stability index for feature selection. In: AIAP’07: Proceedings of the 25th IASTED International Multi-Conference on Artificial Intelligence and Applications, Anaheim, CA, USA, ACTA Press (2007) 390–395
7.
Paul, J., Verleysen, M., Dupont, P.: The stability of feature selection and class prediction from ensemble tree classifiers. In: ESANN 2012, 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (April 2012) 263–268
8. Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R., Wilson, W., Grever, M.R., Byrd, J.C., Botstein, D., Brown, P.O., Staudt, L.M.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403(6769) (February 2000) 503–511
9. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439) (1999) 531–537
10. Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D’Amico, A.V., Richie, J.P., et al.: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2) (2002) 203–209

Acknowledgements

Computational resources have been provided by the supercomputing facilities of the Université catholique de Louvain (CISM/UCL) and the Consortium des Equipements de Calcul Intensif en Fédération Wallonie Bruxelles (CECI), funded by the Fonds de la Recherche Scientifique de Belgique (FRS-FNRS).

Prototype Support Vector Machines: Supervised Classification in Complex Datasets

April Tuesday Shen and Andrea Pohoreckyj Danyluk
Williams College

Abstract.
Classifier learning generally requires model selection, which in practice is often an ad hoc and time-consuming process that depends on assumptions about the structure of the data. To avoid this difficulty, especially in real-world datasets where the underlying model is both unknown and potentially complex, we introduce the ensemble of prototype support vector machines (PSVMs). This algorithm trains an ensemble of linear SVMs that are tuned to different regions of the feature space and thus are able to separate the space arbitrarily, reducing the need to decide what model to use for each dataset. We present experimental results demonstrating the efficacy of PSVMs on both noiseless and noisy datasets.

Keywords: Ensemble methods, Classification, Support Vector Machines

1 Introduction

The goal of classification is to accurately predict class labels for a set of data. In machine learning, this is accomplished via algorithms that learn classification models for particular types of class distributions from sets of labeled training data. However, in real-world datasets, class distributions may be arbitrarily complex, and they are not generally known before learning takes place. Hence a data mining practitioner must choose an algorithm and its associated model without prior knowledge about the class distributions of the dataset in question. This often requires testing multiple models to find one that works well [1]. The process of model selection, which is already arbitrary and time-consuming, becomes even more problematic for datasets with the most difficult class distributions.

In this paper, we introduce the ensemble of prototype support vector machines (PSVMs)1 as a classification learning algorithm addressing the problem of model selection in complex datasets. The PSVM algorithm learns a collection of linear classifiers tuned to different regions of the space in order to separate classes with arbitrarily complicated distributions.
This algorithm is based on the exemplar SVM (ESVM) approach [3], which trains a separate linear separator specific to each instance in the training set. The PSVM algorithm trains an initial ensemble of ESVMs, but then iteratively improves the boundaries to allow classifiers to capture groups of similar instances. Hence these new classifiers are tuned to more generalized prototypes rather than to specific exemplars. We present empirical evidence that PSVMs are capable of high classification accuracy on a variety of noiseless and noisy datasets with different class distributions.

1 Cheng et al. [2] refer to their profile support vector machines as PSVMs. In this paper, PSVM should be taken to unambiguously refer to our prototype support vector machines.

The remainder of this paper is organized as follows. In Section 2 we motivate the problem and introduce related work. In Section 3 we describe exemplar SVMs in more detail than we have thus far. In Section 4 we describe our PSVM algorithm. Section 5 details experiments comparing PSVMs with other classifier learning algorithms on a selection of datasets. Finally, we summarize our conclusions and suggest future work.

2 Motivation and Related Work

In supervised machine learning, we can think of a learned classifier as a model or function that partitions the feature space into different regions corresponding to the data points of different classes. Many learning algorithms and their associated models are highly accurate fits to certain types of class distributions. For datasets that are linearly separable, linear SVMs [4] are a good choice. Such datasets include, for example, a number of text classification problems [5].
For more complicated class distributions, such as ones where multiple noncontiguous regions are mapped to the same class, C4.5 [6], a common decision tree learning algorithm, is a better choice. However, regardless of how aptly a given model captures a particular class distribution, choosing an inappropriate model can still result in poor classification performance. Of course, every learning algorithm has some inductive bias that limits the set of possible models it explores. It is unreasonable to expect an algorithm to be agnostic towards the structure of the data in question. The point is that choices about what learning algorithm to use, and hence what model to learn, must happen prior to training and generally without knowledge of what the data distribution looks like. In many cases, this forces the practitioner to simply train and test multiple algorithms in order to discover which one performs best, a time-consuming and ad hoc process. Model selection is even more difficult in datasets with complex class distributions – for instance, ones that are highly nonlinear or contain many small disjuncts – since standard algorithms and models may not be sufficient in these situations.

There are many ways to attack the problem of complex class distributions directly. One approach is to reduce or reformulate the feature space, since class boundaries may only seem complex when the data is viewed in a particular space. Substantial research has been done in both feature selection (see [7] for feature selection in supervised learning and [8] for feature selection in unsupervised learning) and feature extraction (e.g., principal component analysis, multidimensional scaling, constructive induction [9], and more recently, manifold learning [10]). Unfortunately, all of these approaches still make strong assumptions about the fundamental underlying classifier models.
Other research has been concerned with the class distributions themselves [11]. Some of this work is focused on specific types of difficult distributions, such as highly unbalanced class distributions [12] or distributions that include many small disjuncts [13]. While extremely valuable, this focused work does not tackle the wider array of possible class distributions that present challenges to different models and classification algorithms.

Another approach to handling complex class distributions is to learn classifiers from different subsets of training examples. Such ensemble methods include, for example, AdaBoost [14], which builds a series of models that increasingly focus on examples in the difficult-to-capture regions of the feature space. Of particular relevance to us are approaches that attempt to approximate complex decision surfaces with an ensemble of hyperplanes. These include exemplar SVMs [3], which we discuss in more detail in Section 3, and localized and profile SVMs [2]. Localized and profile SVMs combine the benefits of SVMs with instance-based methods in order to learn local models of the example space. Localized SVMs train a new SVM model for each test instance using that instance’s nearest neighbors from the training set. As expected, this is very slow at test time. Profile SVMs also defer SVM learning to test time, but take advantage of the fact that multiple nearby test examples may require only a single SVM in that region. Profile SVMs use a variation of k-means to cluster training examples based on their relationship to the test examples, before learning a local SVM for each cluster. While profile SVMs are demonstrably more efficient than localized SVMs, they still defer training to test time, which may be unreasonable for many applications.
Our approach differs from the localized SVM framework in that it is able to approximate a range of complex decision surfaces with ensembles of linear SVMs without the reliance on test examples of transductive or quasi-transductive [2] approaches.

3 Exemplar Support Vector Machines

In this section we discuss exemplar SVMs (ESVMs) in greater detail, as they are the foundation on which our algorithm is based. Exemplar SVMs were developed for object recognition [3]. The ESVM algorithm trains a separate SVM for each exemplar image from the training set, with that exemplar as the sole positive instance and many instances of the other classes as negative instances. If one of these exemplar SVMs positively classifies a novel instance, then this suggests that the novel instance shares the class label of (i.e., depicts the same object as) that model’s exemplar. Thus an ensemble of ESVMs can be used to classify new data instances, either through voting or a more complicated procedure.

Object recognition tasks provide a good example of the types of complex data distributions we seek to handle. Consider, for instance, four images containing front and side views of bicycles and motorcycles, respectively. The two containing side views are conceivably closer in pixel-value feature spaces than are the front and side views of a motorcycle (or bicycle), yet it is the objects that we wish to identify – i.e., motorcycle or bicycle – not the orientation.

The ESVM framework is well-motivated for object recognition and other domains with complex class distributions for a number of reasons. Learning a number of separate classifiers permits each classifier to focus on a different feature subset. This can be useful since different regions of the example space may be well-characterized by different combinations of features.
An approach based on ensembles and exemplars also provides comprehensive coverage of the feature space, which is critical for classes whose instances could potentially lie in a number of diverse regions of the space. The existence of at least one exemplar in each of these regions can allow the ensemble as a whole to recognize their presence, without the need to create a single overly general classifier to accommodate them.

The ESVM algorithm is an appealing starting point for our work as it leverages the power of both SVMs and ensembles to accommodate complexity. At the same time, the fact that it learns a separate SVM for each training instance makes it unnecessarily time- and space-intensive for many datasets. Our approach begins with ESVMs but then learns a set of SVMs that are tuned to general prototypes, rather than specific exemplars.

4 The Prototype SVM Algorithm

In this section we describe how we train an ensemble of prototype support vector machines (PSVMs). The algorithm begins by training an ensemble of exemplar support vector machines (ESVMs). It then iteratively improves and generalizes the boundaries of the SVMs to achieve the final ensemble of PSVMs. There are three major components of the main PSVM algorithm: initialization, shifting of boundaries, and prediction. Algorithm 1 describes the high-level training of PSVMs, showing how these three components are used.

4.1 Initialization

The algorithm first trains an ensemble of ESVMs, with one model for each instance in the training set. This requires that the algorithm create a training set of positive and negative examples specific to each ESVM. Initializing the positive sets is straightforward, as each set is a single example that serves as the exemplar for the ESVM. In theory, the negative sets could contain every instance of a different class from the exemplar. However, we do not want the single positive instance to be overwhelmed by negative instances.
Furthermore, in spaces with highly variable class distributions, distant regions of the space may have no influence on how local regions should be partitioned, so negatives far from the exemplar might be useless or even detrimental for learning good classifiers. Finally, in general, classes will not be linearly separable, so in order to learn relatively high-quality linear discriminants, the algorithm chooses a sample of the potential negatives that is linearly separable from the exemplar.

Algorithm 1 Train(T): Train an ensemble of PSVMs.
Input: set of labeled training data, T
Parameters: number of iterations, s, and fraction of data to hold out for validation, v
1: Split the data into training and validation sets, D and V.
2: P ← [[di] for di ∈ D]  # List of positive sets, each initially just the exemplar.
3: N ← [ChooseNegatives(D, di) for di ∈ D]  # List of negative sets.
4: for j = 0, ..., s do
5:   Ej ← [ ]  # The ensemble being trained on this iteration.
6:   for Pi ∈ P and Ni ∈ N do
7:     Train a linear SVM, using Pi and Ni.
8:     Add this SVM to Ej.
9:   aj ← Test(V, Ej)  # Accuracy of Ej on V.
10:  P, N ← Shift(D, Ej, P, N)
11: return the ensemble Ej with the highest accuracy on V

To accommodate these issues, we choose negatives in the manner described in Algorithm 2. We first find the negative closest in Euclidean distance to the exemplar. This defines a hyperplane passing through the exemplar and normal to the vector between the exemplar and that negative. The candidate negatives are then those that lie on the positive side of this hyperplane (i.e., the same side of the hyperplane as the closest negative). Note that no margin is enforced, and the hyperplane being considered is only a very rough approximation of the hyperplane that will be learned by the SVM algorithm.
From those candidate negatives, the algorithm chooses only a small number, k, of them. This number can be user-selected, although empirically about seven negatives seems to work reasonably well. The k negatives chosen are those closest to the exemplar, so that the training set for each model is kept mostly localized and the training process is not “distracted” by instances in distant regions of the feature space. Once a negative set for each exemplar has been initialized, the algorithm trains an SVM for each exemplar, as in the ESVM algorithm.

4.2 Shifting

After training the initial ensemble of ESVMs, we shift the boundaries according to Algorithm 3. (Note that the SVMs in Malisiewicz et al.’s original ESVM approach are shifted and generalized as well, though by a different process and not for the purpose of creating prototypes.) Shifting in our PSVM algorithm accomplishes three main goals:
1. It generalizes classifiers from a single exemplar to a cluster of nearby instances.
2. It adjusts boundaries that misclassified negative instances.
3. It removes useless classifiers from the ensemble altogether.

Algorithm 2 ChooseNegatives(D, di): Initialize the set of negatives for a given data instance.
Input: set of training data, D, and training instance, di
Parameters: number of negatives to return, k
1: Ni ← [ ]
2: Di ← all instances in D of a different class label from di
3: Compute the Euclidean distance from di to each element of Di.
4: Sort Di in ascending order by distance.
5: x ← closest negative in Di
6: n ← (x − di)/||x − di||  # Normal vector to the hyperplane passing through di.
7: for dj ∈ Di do
8:   if n · (dj − di) > 0 then
9:     Add dj to Ni
10:    if |Ni| = k then
11:      return Ni
12: return Ni

Algorithm 3 Shift(D, E, P, N): Adjust and drop models from the ensemble.
Input: set of training data, D; ensemble of models, E; positive and negative sets for each model, P and N
Parameters: probability to add to negative set, p
1: C ← [[ ] for dj ∈ D]  # List of candidate models for each dj.
2: for mi ∈ E and dj ∈ D do
3:   if mi classifies dj positively then
4:     if class(dj) = class(mi) then
5:       Compute the distance of dj to mi’s exemplar.
6:       Add mi and its distance to the list of candidates Cj.
7:     else  # dj is misclassified, i.e. a hard negative.
8:       Add dj to mi’s negative set Ni with some probability p.
9: for dj ∈ D do
10:  Add dj to the positive set Pk, where mk is the closest model in Cj.
11: for mi ∈ E do
12:   if mi did not classify anything positively then
13:     Remove mi from the ensemble, by removing Pi from P and Ni from N.
14: return P, N

Generalization is at the core of what turns exemplar SVMs into prototype SVMs. The algorithm generalizes by adding new instances to the positive sets of models. For a given instance, we define a candidate model to be one that classifies the instance positively and whose exemplar is of the same class as the new instance. These candidate models are ones that could be improved by adding this new instance to their positive sets. But it could be problematic to add each instance to all of its candidate models. Because linear classifiers simply divide the space into half-spaces, they run a serious risk of overgeneralizing by adding too many positives that may have nothing to do with the original exemplar and its cluster. To avoid this, if a particular instance is positively classified by multiple models, the algorithm only adds it to the positive set of the model with the closest exemplar. As in the initialization of negatives, this helps keep each model tuned to a local region rather than attempting to capture wide swaths of the feature space.
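The closest-exemplar assignment used in this generalization step can be sketched as follows (a toy Python illustration of our own; representing a model as a dictionary with an "exemplar" key is our assumption, not the paper's data structure):

```python
import math

def closest_candidate(instance, candidate_models):
    """Among the candidate models (those that classify `instance` positively
    and whose exemplar shares its class), return the one whose exemplar is
    nearest to the instance in Euclidean distance."""
    return min(candidate_models,
               key=lambda m: math.dist(instance, m["exemplar"]))
```

The instance is then added only to the winning model's positive set, keeping each model local.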
The algorithm also improves the models by adding misclassified negative instances to their negative sets. This performs a kind of hard negative mining. If we discover that a model classifies an instance as a positive example when it should be a negative, we need to shift the linear separator to exclude the negative example. To do this, we consider adding the negative instance explicitly to the negative set for that model. However, again, since we are dealing with complicated class distributions and we want to keep models localized, some negatives actually should be classified incorrectly by individual models, and there is no principled way of identifying these in every possible dataset. Hence we only add to the negative set with some probability, as set by the user. This has the additional benefit of adding randomness and hence robustness to the algorithm.

The final step in the shifting algorithm is to remove models that do not classify any instances positively. Depending, in part, on the choice of regularization parameter, some SVMs may default to classifying all examples as negative. These are not useful for the overall ensemble and are therefore removed. Fortunately, the use of an ensemble provides redundancy; since the algorithm begins with a classifier for each instance of the training data, some models can be dropped from the ensemble without detriment, unless the class distribution is extremely difficult. Dropping models also provides some robustness to noise, as noisy exemplars may be more difficult to separate from their closest negatives.

We perform the shifting procedure some number of times, as specified by the user. After each shift, the algorithm trains a new ensemble with the new positive and negative sets, then tests this ensemble on a held-out validation set. The ensemble with the highest classification accuracy on the validation set is retained.
Empirically, we find that accuracy generally stabilizes fairly quickly – between 10 and 20 iterations at most.

4.3 Prediction

For validation after each shift and at test time, the ensemble of PSVMs predicts class labels for novel instances as follows. For each new instance, if a model classifies it positively, this corresponds to a vote for that model’s positive class (i.e., the class of its original exemplar). We also generate the probability that the instance should be assigned the model’s positive class, as in [15], and sum these probability values rather than the raw votes. The predicted class label assigned by the entire ensemble is simply the class with the maximum sum of weighted votes.

5 Experiments and Results

In this section we discuss our experiments. Overall, our PSVM algorithm performs better than the other algorithms we tested on the datasets with the most complicated class distributions, and its performance is not significantly worse than the other algorithms when applied to simpler data distributions. We also demonstrate that our PSVM algorithm degrades gracefully in the presence of noise.

5.1 Experimental Setup

We tested our PSVM algorithm against C4.5 [6], AdaBoost [14] with C4.5 as the base classifier, linear SVMs [4], AdaBoost with linear SVMs as the base classifier, SVMs with a quadratic kernel, and multilayer perceptrons trained with backpropagation [16]. We chose these for their generally good performance and their varied strengths. For each algorithm, we used the implementations provided by Weka [17]. In particular, we used Weka’s wrapper of LIBSVM [15] for the SVM experiments, as we also used LIBSVM for our PSVM algorithm. We used Weka’s default parameters in our experiments, except that we reduced the training time of the multilayer perceptrons from 500 to 200 iterations for reasons of time.
Weka’s default number of iterations for AdaBoost is quite low, namely 10, but we note that AdaBoosted linear SVMs finished in fewer than 10 iterations on the synthetic datasets and glass, and performance on the other datasets with as many as 100 iterations was statistically identical to performance with 10 iterations. For PSVMs we used the following default parameters: v = 25% (percent of data used for validation), s = 10 (iterations of shifting), k = 7 (number of initial negatives), and p = 0.5% (hard negative mining probability).

There are three general categories of datasets we used in our experiments. These are listed in Table 1. First are synthetic datasets (see Figure 1), specifically designed to have unusual class distributions to provide a proof of concept of the power of PSVMs. The spirals dataset is a standard benchmark for neural networks originally from CMU [18]; we generated the other two. In isolated, each cluster is normally distributed and the background data is uniform outside three standard deviations of each cluster’s mean. In striated, each stripe is normally distributed with greater standard deviation in one direction. Next are benchmark datasets from UCI’s Machine Learning Repository [19]. The last dataset is a natural language classification task from the Semantic Evaluation (SemEval) workshop [20]. The data consists of raw Twitter messages, or tweets, and the task is to classify them according to their sentiment as objective (i.e., no sentiment), positive, neutral, or negative.

Table 1: The datasets used in the experiments. The top three are synthetic, with the spirals dataset taken from CMU’s Neural Networks Benchmarks [18]. The next four are benchmark datasets from UCI’s Machine Learning Repository [19]. The Twitter dataset is from SemEval [20].
Dataset    Category     Instances  Features (type)  Classes (instances per class)
isolated   Synthetic    800        2 (real)         2 (500 / 300)
striated   Synthetic    800        2 (real)         2 (400 / 400)
spirals    Synthetic    194        2 (real)         2 (97 / 97)
iris       UCI          150        4 (real)         3 (50 / 50 / 50)
glass      UCI          214        9 (real)         6 (70 / 76 / 17 / 13 / 9 / 29)
vehicle    UCI          846        18 (integer)     4 (212 / 217 / 218 / 199)
segment    UCI          2310       19 (real)        7 (330 each)
twitter    Real-world   600        2715 (integer)   4 (99 / 157 / 30 / 314)

Fig. 1: Synthetic two-dimensional datasets: isolated, striated, and spirals.

The Twitter dataset contained only raw tweets and sentiment labels, and hence we preprocessed and featurized that dataset ourselves. Much research has gone into good feature representations for natural language texts, and tweets in particular (see, for example, [21] or [22]), but as the focus of our work is not sentiment analysis, we used a basic but reasonable set of features for this data, including single words, links, usernames, hashtags, standard emoticons, and words from the MPQA Subjectivity Lexicon [23].

5.2 Results on Datasets With No Noise Added

For each algorithm, we report the results of ten-fold cross-validation on each dataset. Table 2 compares our PSVMs with each of the other algorithms. Also see Table 3 for counts of wins, losses, and ties of PSVMs against the other algorithms.

Table 2: Experiment results. Shown are classification accuracy means with one standard deviation. ◦ indicates statistically significant improvement of PSVMs over the other algorithm, • indicates statistically significant degradation, based on a corrected paired t-test at the 0.05 level.

Dataset   PSVM        C4.5         Boosted C4.5  Linear SVM   Boosted Lin. SVM  Polynomial SVM  Multilayer Perceptron
isolated  90.9(4.07)  98.1(1.79)•  98.5(1.46)•   62.5(0.0)◦   62.5(0.0)◦        81.9(4.38)◦     80.6(2.81)◦
striated  97.3(1.66)  69.4(13.4)◦  90.1(7.38)◦   48.5(3.94)◦  53.4(16.3)◦       72.8(4.18)◦     75.0(25.0)◦
spirals   21.8(18.9)  0.0(0.0)◦    0.0(0.0)◦     0.0(0.0)◦    5.63(8.97)◦       0.0(0.0)◦       20.6(28.9)
iris      96.0(4.42)  94.0(6.29)   94.0(5.54)    98.7(2.67)   97.3(3.27)        96.7(4.47)      97.3(4.42)
glass     52.8(8.72)  69.1(6.40)•  72.9(7.85)•   63.5(8.08)•  63.1(7.73)•       69.7(6.55)•     70.6(8.82)•
vehicle   79.4(4.49)  73.8(4.48)◦  75.7(3.56)    80.4(4.50)   80.4(4.23)        80.4(4.53)      79.7(4.61)
segment   95.2(1.18)  97.1(0.93)•  98.1(0.85)•   96.3(0.93)•  96.1(0.89)        95.8(1.38)      96.2(1.30)
twitter   55.8(5.12)  54.5(2.89)   60.5(7.99)    62.2(3.66)•  62.2(4.15)•       52.3(5.54)      52.3(5.54)

Table 3: The total number of wins, losses, and ties for PSVMs over the other algorithms, in noiseless and noisy datasets.

           Wins - Losses - Ties
Noiseless  16 - 13 - 19
Noisy      14 - 7 - 21

Overall, the PSVM algorithm performs about as well as the other algorithms in all datasets, and has significantly higher accuracy in the datasets with the most difficult class distributions. Its performance is especially impressive in the synthetic datasets. This is no surprise; while those datasets were not designed specifically for this algorithm, they were designed to exemplify particularly difficult class distributions. Although PSVMs do not perform as impressively in the other domains, in general they do not perform significantly worse than the other algorithms. The exceptions are glass and segment; we hypothesize that this is because these are the datasets with the largest number of classes, and the extension of SVMs to multiclass domains is not as natural as it is for the other algorithms. The glass domain has an especially small number of instances in certain classes, and since we hold out 25% of the training data for validation after each shift, this may explain PSVMs' poor performance in this dataset. The spirals dataset is especially difficult for all algorithms, though we would expect good results from SVMs with a radial basis kernel.
Of the algorithms tested, the multilayer perceptron is the only algorithm besides PSVMs that performs relatively well on spirals. However, it is worth noting that the multilayer perceptrons took on the order of days to run the full suite of experiments, whereas the other algorithms, including PSVMs, each took on the order of minutes or hours.

5.3 Results on Datasets With Noise Added

In order to test the robustness of PSVMs to noise, we repeated the experiments described above on the same datasets, but with noise injected into the class labels. These results are reported in Table 4.

Table 4: Results for PSVMs in noisy datasets. Shown are classification accuracy means with one standard deviation. ◦ indicates statistically significant improvement, • statistically significant degradation.

Dataset   PSVM        C4.5         Boosted C4.5  Linear SVM   Boosted Lin. SVM  Polynomial SVM  Multilayer Perceptron
isolated  71.9(7.71)  84.8(4.36)•  83.6(4.65)•   59.3(3.67)◦  57.9(3.71)◦       75.5(4.91)      73.9(3.51)
striated  76.8(7.42)  60.3(7.24)◦  64.3(6.55)◦   52.0(2.86)◦  56.8(7.36)◦       66.8(3.12)◦     67.4(18.5)
spirals   29.0(15.4)  9.79(6.28)◦  9.79(6.28)◦   11.9(10.2)◦  16.9(17.1)        10.3(6.95)◦     17.1(13.4)
iris      88.7(5.21)  84.7(7.33)   83.3(7.45)    90.0(6.83)   90.0(6.15)        90.0(5.37)      91.3(6.70)
glass     50.0(5.50)  62.2(7.46)•  66.3(8.87)•   53.7(8.91)   57.5(7.90)•       61.6(8.95)•     61.7(4.14)•
vehicle   68.1(4.43)  63.1(1.89)◦  67.6(2.54)    71.5(3.57)   69.1(4.24)        57.4(5.12)◦     71.0(2.61)
segment   84.3(2.18)  85.5(2.30)   85.4(2.23)    84.1(1.82)   84.0(2.12)        67.8(3.95)◦     85.4(2.51)

For injecting noise we followed the same methodology as in [24]: we selected 10% of the instances uniformly and without replacement, then changed their class labels to an incorrect one chosen uniformly.
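A minimal sketch of this noise-injection step, following the methodology just described (the function and parameter names are ours, not the authors'):

```python
import random

def inject_label_noise(labels, classes, fraction=0.10, rng=None):
    """Flip the labels of `fraction` of the instances, chosen uniformly
    without replacement, to an incorrect class chosen uniformly (as in [24])."""
    rng = rng or random.Random()
    noisy = list(labels)
    n_flip = int(round(fraction * len(noisy)))
    # choose which instances to corrupt, without replacement
    for i in rng.sample(range(len(noisy)), n_flip):
        wrong = [c for c in classes if c != noisy[i]]  # exclude the true label
        noisy[i] = rng.choice(wrong)
    return noisy
```

With two classes, every selected instance is flipped to the other class, so exactly 10% of the labels change.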
Note that we assume the Twitter dataset is already noisy; because we could not quantify the baseline noise level, we did not inject noise into this dataset. Every algorithm's performance degrades in the presence of noise, as we would expect. (The spirals dataset is an anomaly; the dataset is so small and its class distribution so unusual that adding noise seems to make it easier to partition for most algorithms.) In general, the PSVM ensemble is fairly robust to the presence of noise. PSVMs retain their advantage in the datasets with the trickiest class distributions, and they degrade gracefully on the benchmark datasets as well. It is also worth noting that, for the most part, the sizes of the final PSVM ensembles are not affected by the presence of noise (see Table 5). This suggests that the algorithm is not retaining extra models to account for noisy instances. In addition, note that the number of models in the PSVM ensemble for every dataset is less than the number in the baseline ESVM ensemble. Because the final ensemble output by our PSVM algorithm is the one from the best iteration so far, as determined by a validation set, it is clear that the unshifted ESVMs (i.e., the first iteration in the PSVM learning process) are never found to be best.

6 Summary and Future Work

In this paper we have described the ensemble of prototype support vector machines (PSVMs), an algorithm for performing supervised classification in datasets with complex class distributions. This work is motivated primarily by the problem of model selection. When a data mining practitioner needs to choose a classifier-learning algorithm for a dataset, they most likely have no a priori knowledge of the class distributions that the algorithm will need to model. The issues involved in finding good models are exacerbated when distributions are especially complicated.
In response to these challenges, the PSVM algorithm works by learning an ensemble of linear classifiers tuned to different sets of instances in the training data. Such an ensemble is flexible enough to have high performance in datasets with arbitrarily complex class boundaries, with minimal parameter tuning. It accomplishes this by leveraging both the power of SVMs as effective linear classifiers and the power of ensembles to provide flexibility and improve accuracy, without the need to specify a particular kernel function. This algorithm is based on the ensemble of exemplar SVMs for object recognition from [3]. The core of the PSVM approach is an initial ensemble of exemplar SVMs, followed by a shifting algorithm that refines linear models, drops unnecessary models, and generalizes models from single exemplars to clusters of similar instances, or prototypes.

Table 5: Average number of models in the final ensembles of PSVMs for noiseless and noisy datasets, rounded to the nearest whole number. Also listed is the number of models in the initial ESVM ensemble; these numbers are identical for noiseless and noisy datasets.

Dataset   ESVM  Noiseless  Noisy
isolated  720   323        122
striated  720   149        170
spirals   175   47         56
iris      135   54         24
glass     193   35         37
vehicle   761   110        113
segment   2079  453        323
twitter   540   42         X

Our results demonstrate that PSVMs generally have the highest accuracy among the algorithms we tested in the datasets with the more complex distributions, and good performance in standard benchmark datasets. In addition, the results for noisy datasets provide evidence that PSVMs are more robust to noise than other algorithms that seek to maximize flexibility. The main goal of the PSVM algorithm is to reduce the need to make data-dependent algorithmic decisions before knowing about data distributions.
While model selection is no longer necessary with this algorithm, there are still parameters that must be set: the size of the validation set, the number of shifting iterations, the size of the initial negative sets, and the probability of mining hard negatives. Optimal settings for these may be dataset-specific and affect the quality of the final ensemble. This remains a limitation of our approach. There are several interesting directions for future work related to the PSVM algorithm. Further experiments on synthetic, benchmark, and real-world datasets would provide additional information on the capabilities of the algorithm. It would also be worthwhile to explore a variety of modifications to the basic algorithm. For example, we might leverage the ability of ensembles to select feature sets independently for each model. This can be useful since different regions of the example space may be well characterized by different combinations of features. One elegant way of doing this would be to use 1-norm SVMs to effectively perform feature selection in tandem with learning the SVM [25]. Finally, it would be interesting to investigate the connection of the PSVM algorithm with similar algorithms that might utilize an explicit clustering step during training. We hypothesize that the shifting process of PSVMs enables the organic discovery of clusters without needing to specify the number of centroids, as in k-means. It would be worth exploring to what extent this hypothesis is supported empirically and theoretically.

References

1. Chatfield, C.: Model uncertainty, data mining and statistical inference. Journal of the Royal Statistical Society, Series A (Statistics in Society) 158(3) (1995)
2. Cheng, H., Tan, P.N., Jin, R.: Efficient algorithm for localized support vector machine. IEEE Trans. Knowl. Data Eng.
22(4) (2010)
3. Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMs for object detection and beyond. In: Proceedings of the 2011 International Conference on Computer Vision. ICCV '11 (2011)
4. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3) (1995)
5. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning. ECML '98 (1998)
6. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)
7. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. The Journal of Machine Learning Research 3 (2003)
8. Ferreira, A., Figueiredo, M.: Unsupervised feature selection for sparse data. In: Proceedings of the 2011 European Symposium on Artificial Neural Networks. ESANN '11 (2011)
9. Callan, J.P., Utgoff, P.E.: A transformational approach to constructive induction. In: Proceedings of the 8th International Workshop on Machine Learning. ML '91 (1991)
10. Xiao, R., Zhao, Q., Zhang, D., Shi, P.: Facial expression recognition on multiple manifolds. Pattern Recognition 44(1) (2011)
11. Japkowicz, N.: Concept-learning in the presence of between-class and within-class imbalances. In: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence. AI '01 (2001)
12. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the 1997 International Conference on Machine Learning. ICML '97 (1997)
13. Holte, R., Acker, L., Porter, B.: Concept learning and the problem of small disjuncts. In: Proceedings of the 1989 International Joint Conference on Artificial Intelligence. IJCAI '89 (1989)
14. Schapire, R.E.: A brief introduction to boosting. In: Proceedings of the 1999 International Joint Conference on Artificial Intelligence. IJCAI '99 (1999)
15.
Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3) (May 2011)
16. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L., PDP Research Group, eds.: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. MIT Press, Cambridge, MA, USA (1986) 318-362
17. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1) (November 2009) 10-18
18. White, M., Sejnowski, T., Rosenberg, C., Qian, N., Gorman, R.P., Wieland, A., Deterding, D., Niranjan, M., Robinson, T.: Bench: CMU neural networks benchmark collection. http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/neural/bench/cmu/0.html (1995)
19. Bache, K., Lichman, M.: UCI machine learning repository. http://archive.ics.uci.edu/ml (2013)
20. Wilson, T., Kozareva, Z., Nakov, P., Ritter, A., Rosenthal, S., Stoyanov, V.: SemEval-2013 Task 2: Sentiment analysis in Twitter. http://www.cs.york.ac.uk/semeval-2013/task2/ (2013)
21. Barbosa, L., Feng, J.: Robust sentiment detection on Twitter from biased and noisy data. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters. COLING '10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010)
22. Davidov, D., Tsur, O., Rappoport, A.: Enhanced sentiment learning using Twitter hashtags and smileys. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters. COLING '10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010)
23. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-level sentiment analysis.
In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. HLT '05 (2005) 347-354
24. Dietterich, T.: An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning 40(2) (2000)
25. Tan, M., Wang, L., Tsang, I.W.: Learning sparse SVM for feature selection on very high dimensional datasets. In: Proceedings of the 2010 International Conference on Machine Learning. ICML '10 (2010)

Software reliability prediction via two different implementations of Bayesian model averaging

Alex Sarishvili (1) and Gerrit Hanselmann (2)

(1) Fraunhofer Institute for Industrial Mathematics ITWM, 67663 Kaiserslautern, Germany, [email protected]
(2) Siemens AG, Corporate Technology, 81739 München, Germany, [email protected]

Abstract. The problem of predicting software reliability is compounded by the uncertainty of selecting the right model. While Bayesian Model Averaging (BMA) provides a means to incorporate model uncertainty into the prediction, research on the influence of parameter estimation and on the performance of BMA in different situations will further expedite the benefits of using BMA in reliability prediction. Accordingly, two different methods for calculating the posterior model weights required for BMA are implemented and benchmarked under different data situations. The first is the Laplace method for integrals. The second is the Markov Chain Monte Carlo (MCMC) method using Gibbs and Metropolis-within-Gibbs sampling. For the latter, the explicit conditional probability density functions for grouped failure data are provided for each of the model parameters.
With a number of different simulations of mixed grouped failure data we show the robustness and superior performance of MCMC, measured by mean squared error on long- and short-range predictions.

1 Introduction

There exists a large number of different reliability prediction models with a wide variety of underlying assumptions. The problem of selecting the single right model, or a combination of models, is receiving considerable attention with the growing need for improved software reliability predictions [1], [2], [3], [4], [5], [6], [7]. A conceivably simple approach to integrating uncertainty about the right model into the prediction is the equally weighted linear combination (ELC) of models studied by [8]. ELC is defined by

f̂_t^{ELC} = (1/k) Σ_{i=1}^{k} f̂_i(t),

where f̂_i(t) is the prediction of f(t), the number of accumulated faults found until time t, using model M_i, and k is the number of models. It basically treats every prediction model as being as good as any other.

Instead of giving equal trust to each and every model, a more sophisticated approach considers the model performance on the data to distribute trust among the different models. This can be done using Bayesian Model Averaging (BMA), which calculates posterior weights for every model and places trust amongst the models accordingly. Several realistic simulation studies comparing the performance of BMA [9, 10] showed that in general BMA has better performance. These studies were performed for a variety of situations, e.g., linear regression [11], log-linear models [12], logistic regression [13], and wavelets [14]. Other studies, especially regarding BMA out-of-sample performance, have shown quite consistent results, namely BMA having better predictive performance than competing methods. Theoretical results on the performance of BMA are given by [15].
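The ELC baseline amounts to a simple average of the model predictions at a fixed time t. A one-line Python sketch (the paper's own implementations were in Matlab):

```python
def elc_prediction(model_predictions):
    """Equally weighted linear combination (ELC): average the k model
    predictions f_i(t) of the accumulated fault count at a fixed time t."""
    k = len(model_predictions)
    return sum(model_predictions) / k
```

For example, three models predicting 10, 20, and 30 accumulated faults combine to an ELC prediction of 20.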
This paper investigates and describes two different approaches to estimating the posterior model probabilities for BMA in the case where grouped failure data are given. One is the Laplace method for integrals. The other is the Markov Chain Monte Carlo (MCMC) method in its most general form, which allows implementation using Gibbs and Metropolis-within-Gibbs samplers. Through the comparison of two different implementations of BMA on grouped and small simulated data sets we show the significance of the presented methods for the prediction performance of BMA. Furthermore, we demonstrate the superiority of BMA over single prediction models and over the ELC combination technique. We decided to simulate the data and not to use standard data sets available online for the following reason: model performance estimated on only a few data sets (each of them being one realization of a stochastic process) has the drawback of missing generality and in most cases has limited informative value. For the combination, four different non-homogeneous Poisson process (NHPP) models for grouped failure data with different mean value functions [16] have been used (Table 1). These models were selected because of their popularity and their convenience for the illustration of evaluation results. Doubt about the validity of the assumption that software reliability behaviour follows an NHPP is presented in [17]. The rest of this paper is organized as follows. Section 2 gives a short survey of software reliability growth modeling. Section 3 gives an introduction to Bayesian model averaging and describes the Laplace method and the MCMC method for calculating the posterior model weights. The simulation setup and the results are illustrated in Section 4. The conclusion and future work are given in Section 5. The models are introduced in [18], [19], [20], [21], respectively.

Table 1.
Models under consideration

Model                      Mean value function
Delayed S-Shaped (DSS)     µ(t) = a(1 − (1 + βt)e^{−βt})
Goel-Okumoto (GO)          µ(t) = a(1 − e^{−bt})
Goel Generalized (GG)      µ(t) = a(1 − e^{−bt^γ})
Inflection S-Shaped (ISS)  µ(t) = a(1 − e^{−bt}) / (1 + βe^{−bt})

2 Software Reliability Growth Models

Software reliability is defined as the probability of failure-free operation for a specified period of time under specified operating conditions [22]. Thereby, a software failure is behavior inconsistent with the specified behavior, originating from a fault in the software code [23]. Software reliability modeling started in the early 70s. In contrast to hardware reliability, software reliability is concerned with design faults only. Software neither suffers from deterioration nor fails accidentally when not in use. Design faults occur deterministically until they have been removed. Notwithstanding, software's failure behavior can be described using random models [24]. The reason is that even though the failures occur deterministically under the same conditions, their occurrence during usage may be random. A failure will no longer occur once its underlying fault has been removed. The process of finding and removing faults can be described mathematically by using software reliability growth models (SRGM), and most existing reliability models draw on this assumption of an improving reliability over time due to continuous testing and repair. An overview of these models can be found in [25], [26], [27], [28], [29]. This paper uses models of the NHPP class [16]. NHPP models have been used successfully in practical software reliability engineering. These models assume that N(t), the number of observed failures up to time t, can be modeled as an NHPP, i.e., a Poisson process with a time-varying intensity function.
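The four mean value functions of Table 1 translate directly into code. A minimal transcription (in Python; the paper's implementation was in Matlab):

```python
import math

# Mean value functions of the four NHPP models from Table 1.
# a, b, beta, gamma are the model parameters described in the text.

def mu_dss(t, a, beta):
    """Delayed S-Shaped: a(1 - (1 + beta*t) e^{-beta*t})."""
    return a * (1.0 - (1.0 + beta * t) * math.exp(-beta * t))

def mu_go(t, a, b):
    """Goel-Okumoto: a(1 - e^{-b*t})."""
    return a * (1.0 - math.exp(-b * t))

def mu_gg(t, a, b, gamma):
    """Goel Generalized: a(1 - e^{-b*t^gamma})."""
    return a * (1.0 - math.exp(-b * t ** gamma))

def mu_iss(t, a, b, beta):
    """Inflection S-Shaped: a(1 - e^{-b*t}) / (1 + beta e^{-b*t})."""
    return a * (1.0 - math.exp(-b * t)) / (1.0 + beta * math.exp(-b * t))
```

All four start at µ(0) = 0 and approach the expected total fault count a as t grows.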
A counting process {N(t), t ≥ 0} is an NHPP if N(t) has a Poisson distribution with mean value function µ(t) = E[N(t)], i.e.,

P(N(t) = n) = (µ(t)^n / n!) e^{−µ(t)},  n = 0, 1, 2, ...

The mean value function µ(t) is the expected cumulative number of failures in [0, t). Different NHPP software reliability growth models have different forms of µ(t) (see also Table 1).

3 Bayesian Model Averaging

A software reliability growth model uses failures found during testing and fault removal to describe the failure behavior over time. Different models have different assumptions concerning the failure behavior of software. Let M = {M_1, ..., M_k} be the k NHPP models that predict the cumulated number of failures, f_i(t), i = 1, ..., k, for each time t. The BMA model predicts the expected cumulated number of failures at time t, µ̂(t)_{bma}, by averaging over the predictions of the models M_i, i = 1, ..., k. Thereby the models are weighted using their posterior probabilities. This leads to the following BMA pdf

p(f(t) | d(t)) = Σ_{i=1}^{k} p(f_i(t) | M_i, d(t)) p(M_i | d(t)),   (1)

where p(f_i(t) | M_i, d(t)) is the prediction pdf of f_i(t) under model M_i and p(M_i | d(t)) is the posterior probability of model M_i given data d(t). The BMA point prediction of the cumulated number of experienced failures is

f̂_{BMA}(t) = Σ_{i=1}^{k} f̂_i(t) p(M_i | d(t)),   (2)

where the posterior probability for model M_i at time t is given by

p(M_i | d(t)) = p(d(t) | M_i) p(M_i) / Σ_{i=1}^{k} p(d(t) | M_i) p(M_i),   (3)

with

p(d(t) | M_i) = ∫ p(d(t) | θ_i, M_i) p(θ_i | M_i) dθ_i   (4)

being the integrated likelihood of model M_i. Thereby, θ_i is the vector of parameters of model M_i, p(θ_i | M_i) is the prior density of θ_i under model M_i, p(d(t) | θ_i, M_i) is the likelihood, and p(M_i) is the prior probability that M_i is the true model [10].
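Equations (2) and (3) can be sketched directly in code, given the per-model predictions and integrated likelihoods as inputs (names are ours; a uniform model prior is assumed as a default):

```python
def bma_point_prediction(preds, marginal_liks, priors=None):
    """BMA point prediction (eqs. (2)-(3)): weight each model's prediction
    f_i(t) by its posterior probability p(M_i | d(t)), which is proportional
    to the integrated likelihood p(d(t) | M_i) times the prior p(M_i)."""
    k = len(preds)
    priors = priors or [1.0 / k] * k  # uniform model prior by default
    unnorm = [l * p for l, p in zip(marginal_liks, priors)]
    z = sum(unnorm)
    weights = [w / z for w in unnorm]  # posterior model probabilities, eq. (3)
    prediction = sum(f * w for f, w in zip(preds, weights))  # eq. (2)
    return prediction, weights
```

With predictions [10, 20] and integrated likelihoods [1, 3], the posterior weights are [0.25, 0.75] and the BMA prediction is 17.5.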
3.1 Laplace method for integrals

In this paper two methods for implementing BMA have been examined. The first is the approximation of the integral in (4) by the method of Laplace. In regular statistical models (roughly speaking, those in which the maximum likelihood estimate (MLE) is consistent and asymptotically normal) the best way to approximate the integral in (4) is usually the Laplace method. The integrated likelihood from (4) can be estimated in the following way. (Since the observed data change with time, they are denoted by the time-dependent function d(t).) For simplicity, the conditioning on the model has been omitted from the equations. Let g(θ_i) = log(p(d(t) | θ_i) p(θ_i)) and let θ̃_i = argmax_{θ ∈ Θ_i} g(θ). A Taylor series expansion truncated at the second term yields

g(θ_i) ≈ g(θ̃_i) + (1/2)(θ_i − θ̃_i)^T g''(θ̃_i)(θ_i − θ̃_i).

It follows that

p(d(t) | M_i) = ∫ e^{g(θ_i)} dθ_i = e^{g(θ̃_i)} ∫ e^{(1/2)(θ_i − θ̃_i)^T g''(θ̃_i)(θ_i − θ̃_i)} dθ_i.   (5)

By recognizing the integrand as proportional to a multivariate normal density and using the Laplace method for integrals,

p(d(t) | M_i) = e^{g(θ̃_i)} (2π)^{D_i/2} |A_i|^{−1/2},   (6)

where D_i is the number of parameters in the model M_i and A_i = −g''(θ̃_i). It can be shown that for large N, the number of available data points, θ̃_i ≈ θ̂_i, where θ̂_i is the MLE, and A_i ≈ N I_i. Thereby I_i is the expected Fisher information matrix. It is a D_i × D_i matrix whose (k, l)-th element is given by

I_{kl} = −E[ ∂² log p(d(t) | θ_i, M_i) / ∂θ_k ∂θ_l ] evaluated at θ = θ̂_i.

Taking the logarithm of (6) leads to

log p(d(t) | M_i) = log p(d(t) | θ̂_i) + log p(θ̂_i) + (D_i/2) log(2π) − (D_i/2) log(N) − (1/2) log|I_i| + O(N^{−1/2}).   (7)

This is the basic Laplace approximation.
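Given the quantities on the right-hand side of (7), the approximation itself is a one-line formula. A sketch (the inputs are assumed to be precomputed for the model at hand; the function name is ours):

```python
import math

def laplace_log_marginal(loglik_mle, logprior_mle, logdet_fisher, d, n):
    """Evaluate the Laplace approximation (7) to the log integrated likelihood:
    log p(d|theta_hat) + log p(theta_hat) + (D/2) log 2*pi
        - (D/2) log N - (1/2) log |I|.
    d is the number of model parameters D_i, n the sample size N, and
    logdet_fisher is log |I_i| for the expected Fisher information matrix."""
    return (loglik_mle + logprior_mle
            + 0.5 * d * math.log(2.0 * math.pi)
            - 0.5 * d * math.log(n)
            - 0.5 * logdet_fisher)
```

Exponentiating these values for each model and normalizing, as in (3), yields the posterior model weights.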
In (7) the information matrix I_i is estimated as the negative of the Hessian of the log-likelihood at the maximum likelihood estimate. Furthermore, the results can be compared with the Bayesian information criterion (BIC) approximation. If the terms which are O(1) or smaller for N → ∞ are dropped from (7), the following is obtained:

log p(d(t) | M_i) = log p(d(t) | θ̂_i) − (D_i/2) log(N) + O(1).   (8)

This is the well-known BIC approximation. Its error O(1) does not vanish for an infinite amount of data, but because the other terms on the right-hand side of (8) tend to infinity with the number of data, they eventually dominate the error term. This is the case when software testing has progressed sufficiently and much failure data is available. One choice of the prior probability distribution of the parameters in (7) is the multivariate normal distribution with mean θ̂_i and variance equal to the inverse of the Fisher information matrix. This seems a reasonable representation of the common situation in which there is only little prior information. Under this prior, using (7), the posterior approximation is

log p(d(t) | M_i) = log p(d(t) | θ̂_i) − (D_i/2) log(N) + O(N^{−1/2}).   (9)

Thus under this prior the error is O(N^{−1/2}), which is much smaller for moderate to large sample sizes and which tends to zero as N tends to infinity. (9) was pointed out by [30]. This is a variant of the so-called non-informative priors. Non-informative priors are useful when the analyst has no relevant experience with which to specify a prior and when subjective elicitation in multi-parameter problems is impossible. The Laplace method with normal prior (9) is used for the approximation of the posterior probabilities.

3.2 Markov Chain Monte Carlo method

Another way of calculating the marginal likelihoods needed for estimating the posterior weights of the models is MCMC. MCMC generates samples from the joint pdf of the model parameters.
In this paper the Gibbs sampler is used for the MCMC implementation. The transition distribution of this Markov chain is the product of several conditional probability densities. The stationary distribution of the chain is the desired posterior distribution [31]. After the samples from the parameter joint pdf for any model i are generated, the integral (4) can be approximated by the sum

p(d(t) | M_i) = Σ_{j=1}^{T} p(d(t) | θ_i^{(j)}, M_i) p(θ_i^{(j)} | M_i),   (10)

where T is the number of Gibbs sampler iterations. Below, the parameter conditional probability density functions needed for the Gibbs sampler implementation are given. A similar MCMC implementation was described by [32], but for non-grouped data. Because the likelihood function for grouped data is different than that for non-grouped data, the conditional densities needed for the Gibbs implementation are different as well. The likelihood function of the different NHPP models for interval failure count data is

p(d(t) | θ_i, M_i) = [ Π_{i=1}^{t} (µ(i) − µ(i−1))^{d_i} / d_i! ] e^{−µ(t)}.   (11)

It is necessary to make an assumption about the prior distribution of the model parameters. It is convenient to choose the Gamma distribution as prior, for it enforces the positivity of the parameters and is quite versatile in reflecting densities with increasing or decreasing failure rates. Under this assumption the posterior density, for instance for the delayed S-shaped model, is

p(a, β | d(t)) ∝ p(d(t) | a, β, DSS) a^{α_1 − 1} e^{−a α_2} β^{β_1 − 1} e^{−β β_2}.

Inserting the corresponding mean value function into (11) yields p(d(t) | a, β, DSS). Subsections 3.2.1 to 3.2.4 describe the Gamma prior distributions of the parameters of the considered models. Since not all conditional densities have convenient forms, the Metropolis-within-Gibbs algorithm is used to approximate the joint pdf of the model parameters.
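Since the full conditionals in the subsections below are model-specific, a generic Metropolis-within-Gibbs skeleton may clarify the overall scheme: each parameter is updated in turn with a random-walk Metropolis step targeting its full conditional. This is a schematic sketch under our own naming conventions, not the authors' (Matlab) implementation:

```python
import math
import random

def metropolis_within_gibbs(log_cond, init, n_iter=1000, step=0.1, rng=None):
    """Generic Metropolis-within-Gibbs sampler sketch. For each parameter j
    in turn, propose a Gaussian random-walk move and accept it with the
    Metropolis ratio of the full conditional density log_cond(j, theta)
    (any function proportional to the joint log density works, since the
    full conditional is proportional to the joint)."""
    rng = rng or random.Random()
    theta = list(init)
    samples = []
    for _ in range(n_iter):
        for j in range(len(theta)):
            proposal = theta[j] + rng.gauss(0.0, step)
            candidate = theta[:j] + [proposal] + theta[j + 1:]
            log_ratio = log_cond(j, candidate) - log_cond(j, theta)
            if math.log(rng.random()) < log_ratio:  # accept/reject step
                theta[j] = proposal
        samples.append(list(theta))
    return samples
```

For conditionals with a closed Gamma form, such as the densities for a below, the Metropolis step would simply be replaced by a direct Gamma draw.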
One way to avoid this computationally intensive Metropolis-within-Gibbs sampling is data augmentation [31]. This is, however, not considered in this paper. The following subsections describe the conditional densities for the different mean value functions of the NHPP models. These conditional densities can be used in the Gibbs sampler to generate the desired joint parameter probability distributions and therefore the corresponding posterior pdfs for each model.

3.2.1 Conditional densities for DSS model parameters

For the DSS model with Gamma a priori distributed parameters a ∼ Γ(α_1, α_2) and β ∼ Γ(β_1, β_2), the following conditional densities are sampled:

p(β | a, d(t)) ∝ e^{−β Σ_{i=1}^{t} i d(i) − µ̄(t) − β_2 β} β^{β_1 − 1} Π_{i=1}^{t} A^{d(i)},

where A = −(1 + βi) + e^{β}(1 + βi − β) and µ̄(t) = −a(1 + βt)e^{−βt} is the second term of the expectation function of the DSS model. The conditional density function for a is

a | β, d(t) ∼ Γ( Σ_{i=1}^{t} d(i) + α_1, 1 − (1 + βt)e^{−βt} + α_2 ).

3.2.2 Conditional densities for GO model parameters

Considering the GO model, the conditional densities for the parameters a and b are

a | b, d(t) ∼ Γ( Σ_{i=1}^{t} d(i) + a_1, 1 − e^{−bt} + a_2 )

and

p(b | a, d(t)) ∝ e^{−b Σ_{i=1}^{t} i d(i) − µ̄(t) − b_2 b} b^{b_1 − 1} A,

where A = (e^b − 1)^{Σ_{i=1}^{t} d(i)} and µ̄(t) = −a e^{−bt} is the second term of the expectation function of the GO model.

3.2.3 Conditional densities for GG model parameters

For the GG model with Gamma prior distributions, the conditional densities are as follows. For the parameter a ∼ Γ(a_1, a_2):

a | b, γ, d(t) ∼ Γ( Σ_{i=1}^{t} d(i) + a_1, 1 − e^{−bt^γ} + a_2 ).

For the parameter b ∼ Γ(b_1, b_2):

p(b | a, γ, d(t)) ∝ A e^{−µ̄(t) − b_2 b} b^{b_1 − 1},

where

A = Π_{i=1}^{t} ( e^{−b(i−1)^γ} − e^{−bi^γ} )^{d(i)}.

For the parameter γ ∼ Γ(γ_1, γ_2):

p(γ | a, b, d(t)) ∝ A e^{−µ̄(t) − γ_2 γ} γ^{γ_1 − 1},

where µ̄(t) = −a e^{−bt^γ} is the second term of the expectation function of the GG model.
3.2.4 Conditional densities for ISS model parameters

For the parameter a ∼ Γ(a_1, a_2) of the ISS model the conditional density is

a \mid b, \beta, d(t) \sim \Gamma\Big(\sum_{i=1}^{t} d(i) + a_1,\; \frac{1 - e^{-bt}}{1 + \beta e^{-bt}} + a_2\Big).

For the parameters b ∼ Γ(b_1, b_2) and β ∼ Γ(β_1, β_2) the conditional densities are

p(\beta \mid a, b, d(t)) \propto (\beta + 1)^{\sum_{i=1}^{t} d(i)}\, A\, \beta^{\beta_1 - 1} e^{-\beta_2 \beta}

and

p(b \mid a, \beta, d(t)) \propto e^{b\left(\sum_{i=1}^{t} i\,d(i) - b_2\right)} (e^b - 1)^{\sum_{i=1}^{t} d(i)}\, A\, b^{b_1 - 1},

respectively. Thereby,

A = \prod_{i=1}^{t} \left[(e^{bi} + \beta)(e^{bi} + \beta e^{b})\right]^{-d(i)} e^{-\mu(t)},

where μ(t) is the expectation function of the ISS model.

4 Evaluation

This paper examines two methods to calculate the posterior weights for the model combination using BMA. The evaluation compares the prediction performance of different combinations:

– BMA using MCMC
– BMA using Laplace
– ELC.

Besides comparing the prediction performance of the combinations, BMA using MCMC is also compared to the performance of the individual models. The comparison of model performances is done by measuring the mean squared error on long- and short-range software reliability predictions. The experimental procedure begins with the simulation of mixed NHPP realizations via the thinning algorithm, which is described in the next section. The obtained data are then used by the Gibbs sampler (Section 3.2) and the Laplace procedure (Section 3.1) to estimate the posterior pdfs of the selected NHPP models. Finally, the performance comparison is made in Section 4.4. The following sections present the details of the evaluation.

4.1 Simulation

For this evaluation different data sets have to be simulated. In detail, 100 mixed NHPP processes with uniformly distributed random mixing coefficients,
and with random model parameters, have been simulated. In [33] it was shown that failure data can be described much better using several models instead of only one. Therefore, simulating mixed NHPP processes is more realistic, and this is done in this paper. The following distributions and parameters were used in the simulation:

– for the DSS model: a ∼ U[50,125] and β ∼ U[0.05,0.15]
– for the GO model: a ∼ U[50,125] and b ∼ U[0.05,0.15]
– for the ISS model: a ∼ U[50,125], b ∼ U[0.05,0.15] and β ∼ U[2.5,3.5]
– for the GG model: a ∼ U[50,125], b ∼ U[0.05,0.15] and γ ∼ U[2.5,3.5].

For the simulation of a single NHPP model an approach called thinning, or the rejection method [34], was used. It is based on the following observation: suppose there exists a constant λ̄ such that λ(t) ≤ λ̄ for all t. Let T_1^*, T_2^*, ... be the successive arrival times of a homogeneous Poisson process with intensity λ̄, and accept the i-th arrival time with probability λ(T_i^*)/λ̄; then the sequence T_1, T_2, ... of accepted arrival times consists of the arrival times of an NHPP with rate function λ(t) [35]. The generated data sets were divided into training and validation parts. The validation parts were used to calculate the predictive performance. All algorithms were implemented in the Matlab environment.

4.2 Performance measure

The performance measure used for the evaluation is the standard mean squared error (MSE). The MSE for a specific model i can be expressed as

MSE_i = \frac{1}{N} \sum_{t=1}^{N} \left(\hat f_i(t) - f_i(t)\right)^2,

with \hat f_i(t) the predicted number of failures of model i at time t and f_i(t) the actual number of observed (simulated) failures at time t. The Bayes MSE is estimated by means of Monte Carlo integration.
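Before detailing that estimate, the thinning simulation of Section 4.1 can be sketched as follows. This is a generic illustration in Python (the paper's code was Matlab); the function and parameter names are hypothetical, and the GO intensity in the usage note is just one of the four models above.

```python
import math
import random

def thin_nhpp(rate, rate_bound, horizon, rng=random):
    """Simulate the arrival times of an NHPP with intensity rate(t) <= rate_bound
    on [0, horizon] by thinning [34, 35]: generate homogeneous Poisson arrivals
    at rate rate_bound and accept time s with probability rate(s) / rate_bound."""
    arrivals, s = [], 0.0
    while True:
        s += rng.expovariate(rate_bound)      # next candidate arrival
        if s > horizon:
            return arrivals
        if rng.random() < rate(s) / rate_bound:
            arrivals.append(s)                # accepted NHPP event
```

For example, the GO model has intensity λ(t) = a·b·e^{−bt}, which is decreasing, so λ̄ = a·b is a valid bound; a mixed process can be obtained by superposing several such simulations weighted by the random mixing coefficients.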
Let θ_i^{(k,m)} be the variates of the parameters of model i drawn in the k-th replication and m-th iteration of the Gibbs sampler, and let f_i(θ_i^{(k,m)}, t) be the output of model i when using the parameter vector θ_i^{(k,m)} estimated with the data up to time t. Then MSE_i is calculated as follows:

\tilde f_i(t) = \frac{2}{KM} \sum_{k=1}^{K} \sum_{m=M/2+1}^{M} f_i(\theta_i^{(k,m)}, t),   (12)

MSE_i = \frac{1}{N} \sum_{t=1}^{N} \left(\tilde f_i(t) - f_i(t)\right)^2,

where \tilde f_i(t) is the Bayesian MCMC estimate of the accumulated number of failures by model i with the data from the time interval [1, t], K is the number of replications, M is the number of iterations of the Gibbs sampler, and N is the number of data points.

4.3 Evaluation parameters

The a-priori densities of the parameters were chosen from the Gamma distribution with the following parameters:

– for the DSS model: a ∼ Γ(1, 0.001) and β ∼ Γ(1, 0.001)
– for the GO model: a ∼ Γ(1, 0.001) and b ∼ Γ(1, 0.001)
– for the ISS model: a ∼ Γ(1, 0.001), b ∼ Γ(1, 0.001) and β ∼ Γ(2, 1)
– for the GG model: a ∼ Γ(1, 0.001), b ∼ Γ(1, 0.001) and γ ∼ Γ(1, 0.001).

The number of MCMC iterations was M = 1000 and the number of replications was K = 25; see formula (12). The BMA point estimate is obtained as follows:

– For BMA with MCMC: inserting the outcome of (10), together with the equal model a-priori probabilities P(M_i) = 1/4, into (3) yields the BMA weighting factors. Inserting the product of these factors and the outcome of (12) into (2) yields the BMA MCMC point estimate of the system output.
– For BMA with Laplace: inserting the exponential of the outcome of (9), together with the equal model a-priori probabilities P(M_i) = 1/4, into (3) yields the BMA weighting factors. Inserting the product of these factors and the ML estimates of the model outputs into (2) yields the BMA Laplace point estimate of the system output.
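The weighting step common to both variants can be sketched as follows: per-model log marginal likelihoods (from (10) or from the Laplace approximation (9)) are combined with the equal priors P(M_i) = 1/4 into posterior model weights as in (3). This is an illustrative Python sketch, not the paper's implementation; the log-sum-exp trick is added here as an assumption for numerical stability.

```python
import math

def bma_weights(log_evidence, log_prior=None):
    """BMA weighting factors  P(M_i | d) = p(d | M_i) P(M_i) / sum_j p(d | M_j) P(M_j),
    computed from log marginal likelihoods with the log-sum-exp trick."""
    n = len(log_evidence)
    if log_prior is None:
        log_prior = [math.log(1.0 / n)] * n    # equal a-priori model probabilities
    scores = [le + lp for le, lp in zip(log_evidence, log_prior)]
    m = max(scores)                            # subtract the max to avoid underflow
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Multiplying each weight by the corresponding model-output estimate and summing, as in (2), then gives the BMA point estimate of the system output.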
4.4 Evaluation results

4.4.1 Comparing model performance of MCMC BMA to single models

Figure 1 shows the performance comparison of the MCMC BMA model vs. the single models, using 75% of the simulated data for the model parameter estimation and the remaining 25% of the data for the model validation. In detail, the figure shows on the x-axis the MSE of BMA MCMC and on the y-axis the MSE of the respective single model. Therefore, every point above the equal-performance border line indicates that the performance of MCMC BMA was better than that of the respective single model. Figure 2 shows the same comparison, but with only 50% of the data for the model parameter estimation and the remaining data for the validation. In Figure 1 the DSS model MSE is near the equal-performance border line for MSE < 16, that is, for simulated data which had a high weighting coefficient on the DSS model in the thinning algorithm. However, for the more interesting case of a longer-term prediction, Figure 2 shows that the BMA MCMC model outperforms all single models by far.

[Figure 1: scatter plot of the MSEs over 100 simulation runs, MCMC BMA (x-axis) vs. the single models GO, DSS, ISS and GG (y-axis), with an equal-performance border line.]

Fig. 1. Predictive performance comparison BMA MCMC vs single models, 25% of the data were used for the validation

[Figure 2: the same scatter plot for the 50% validation split.]

Fig. 2.
Predictive performance comparison BMA MCMC vs single models, 50% of the data were used for the validation

4.4.2 Comparing model performance of MCMC BMA to Laplace BMA and ELC

Figures 3 and 4 show the comparison results for the different combinations. For long-term prediction, BMA MCMC performs better than ELC or BMA Laplace in almost every one of the 100 simulation runs (Figure 3). However, if 75% of the data were used for model fitting, ELC had similar or better predictive performance than BMA MCMC in 20%, and BMA Laplace in 23%, of the 100 simulation runs (Figure 4). The BMA Laplace model was better than ELC in 53 of the 100 simulation cases on the small data set (50% of the data used for the parameter estimation; see Figure 3). For the bigger data set (75% of the data used for the parameter estimation), BMA Laplace had a smaller MSE than ELC in 64 out of the 100 simulation runs. This trend shows the better approximation results of the Laplace method for large data sets. Comparing the prediction performance of BMA MCMC and BMA Laplace revealed that

[Figure 3: scatter plot of the MSEs over 100 simulation runs, MCMC BMA (x-axis) vs. ELC and Laplace BMA (y-axis), with an equal-performance border line, for the 50% validation split.]

Fig. 3. Predictive performance comparison BMA MCMC vs combining models, 50% of the data were used for the validation

[Figure 4: the same scatter plot for the 25% validation split.]

Fig. 4.
Predictive performance comparison BMA MCMC vs combining models, 25% of the data were used for the validation

the BMA MCMC is clearly better suited for small data sets. If only 50% of the data was used as the training data set, BMA MCMC had a much better prediction performance than BMA Laplace. With more data, BMA Laplace improves. This trend can be observed when, for instance, 75% of the data was used as the training data set: in this case, already 23% of its predictions had similar or better performance than BMA MCMC.

5 Conclusion

A review of the relevant literature reveals broad agreement that there is no single model that can be used for all cases. This paper addressed this issue and studied two ways of implementing Bayesian Model Averaging for Non-Homogeneous Poisson Process models for grouped failure data. It was shown that BMA had better prediction performance than the single models. It was also shown that BMA has better prediction performance than simpler combination approaches like ELC. Considering the two ways of implementing BMA, it was shown that the MCMC approach is by far better than the Laplace method if the data set used for parameter estimation is small. Laplace should not be used when the number of detected failures is small and therefore the ratio between the error term and the other terms on the right-hand side of (9) is high. On the other hand, the Laplace approximation does not require complicated computational procedures like the Metropolis-within-Gibbs sampler. However, since the multivariate normal distribution does not account for skewness, the accuracy of the approximation is low in many cases. The MCMC exploration of the support of the a-posteriori probability density function was fast.
The problem was the Metropolis-within-Gibbs variant of the algorithm, which was time-consuming. One possibility to avoid these difficulties is the introduction of latent random variables to augment the Gibbs conditional densities. The number of iterations in the Gibbs sampler was determined by monitoring the convergence of averages [31]. We showed the high predictive performance of MCMC BMA in comparison with the Laplace BMA and ELC combination methods when one wants to make long-range predictions in the software testing process of moderately sized software projects. In this paper we concentrated on four NHPP-based SRG models. An extension of the presented implementation techniques of BMA to more models is possible. For larger model spaces, techniques for optimal model space reduction like "Occam's window", or optimal model space exploration like MC^3 [36] or reversible jump MCMC, could be of interest.

References

1. Abdel-Ghaly, A.A., Chan, P., Littlewood, B.: Evaluation of competing software reliability predictions. IEEE Transactions on Software Engineering 12(9) (1986) 950–967
2. Nikora, A.P., Lyu, M.R.: Software Reliability Measurement Experience. In: Handbook of Software Reliability Engineering. McGraw–Hill (1995) 255–302
3. Brocklehurst, S., Littlewood, B.: Techniques for Prediction Analysis and Recalibration. In: Handbook of Software Reliability Engineering. McGraw–Hill (1995) 119–166
4. Almering, V., van Genuchten, M., Cloudt, G., Sonnemans, P.J.M.: Using software reliability growth models in practice. Volume 11-12. (2007) 82–88
5. Singpurwalla, N.D., Wilson, S.P.: Statistical Methods in Software Engineering: Reliability and Risk. Springer (1999)
6. Ravishanker, N., Liu, Z., Ray, B.K.: NHPP models with Markov switching for software reliability.
Computational Statistics and Data Analysis 52 (2008) 3988–3999
7. Dharmasena, L.S., Zeephongsekul, P.: Fitting software reliability growth curves using nonparametric regression methods. Statistical Methodology 7 (2010) 109–120
8. Lyu, M.R., Nikora, A.: Applying reliability models more effectively. IEEE Software 9(4) (1992) 43–52
9. Raftery, A.E., Madigan, D., Volinsky, C.T.: Accounting for model uncertainty in survival analysis improves predictive performance. Technical report, Department of Statistics, GN-22, University of Washington (1994)
10. Hoeting, J.A., Madigan, D., Raftery, A.E., Volinsky, C.T.: Bayesian model averaging: A tutorial. Statistical Science 14 (1999) 382–417
11. George, E.I., McCulloch, R.E.: Variable selection via Gibbs sampling. Journal of the American Statistical Association 14(Series B) (1993) 107–114
12. Clyde, M.A.: Bayesian model averaging and model search strategies. Bayesian Statistics 6 (1999) 157–185
13. Viallefont, V., Raftery, A.E., Richardson, S.: Variable selection and Bayesian model averaging in case-control studies. Statistics in Medicine 20 (2001) 3215–3230
14. Clyde, M.A., George, E.I.: Flexible empirical Bayes estimation for wavelets. Journal of the Royal Statistical Society 62(Series B) (2000) 681–698
15. Raftery, A.E., Zheng, Y.: Long run performance of Bayesian model averaging. Technical report, Department of Statistics, University of Washington (2003)
16. Huang, C.Y., Lyu, M.R., Kuo, S.Y.: A unified scheme of some nonhomogeneous Poisson process models for software reliability estimation. IEEE Transactions on Software Engineering 29(3) (2003) 261–269
17. Cai, K.Y., Hu, D.B., Bai, C.G., Hu, H., Jing, T.: Does software reliability growth behavior follow a non-homogeneous Poisson process. Information and Software Technology 50 (2008) 1232–1247
18. Yamada, S., Ohba, M., Osaki, S.: S-shaped reliability growth modeling for software error detection. IEEE Transactions on Reliability R-32 (1983) 475–478
19.
Goel, A.L., Okumoto, K.: Time-dependent error-detection rate model for software reliability and other performance measures. IEEE Transactions on Reliability 28(3) (1979) 206–211
20. Goel, A.L.: Software reliability models: Assumptions, limitations, and applicability. IEEE Transactions on Software Engineering 11(12) (1985) 1411–1423
21. Ohba, M.: Inflection S-shaped software reliability growth models. In: Stochastic Models in Reliability Theory. Springer (1984) 144–162
22. ANSI/IEEE: Standard Glossary of Software Engineering Terminology. Std-729-1991 edn. (1991)
23. IEEE: IEEE Standard Glossary of Software Engineering Terminology. Institute of Electrical & Electronics Engineers (2005)
24. Lyu, M.R.: Software Reliability Theory. In: Encyclopedia of Software Engineering. John Wiley & Sons (2002)
25. Xie, M.: Software Reliability Modeling. World Scientific Publishing (1991)
26. Lyu, M.R.: Handbook of Software Reliability Engineering. McGraw–Hill (1995)
27. Musa, J.D., Iannino, A., Okumoto, K.: Software Reliability: Measurement, Prediction, Application. McGraw–Hill (1987)
28. Lakey, P.B., Neufelder, A.M.: System and Software Reliability Assurance Notebook. Produced for Rome Laboratory by SoftRel (1997)
29. Pham, H.: Software Reliability. Springer (2000)
30. Kass, R.E., Wasserman, L.: Improving the Laplace approximation using posterior simulation. Technical report, Carnegie Mellon University, Dept. of Statistics (1992)
31. Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer (1999)
32. Kuo, L., Yang, T.Y.: Bayesian computation for nonhomogeneous Poisson processes in software reliability. Journal of the American Statistical Association 91(434) (1996) 763–773
33. Gokhale, S., Lyu, M., Trivedi, K.: Software reliability analysis incorporating fault detection and debugging activities.
(1998) 202
34. Grandell, J.: Aspects of Risk Theory. Springer (1991)
35. Burnecki, K., Härdle, W., Weron, R.: An introduction to simulation of risk processes. Technical report, Hugo Steinhaus Center for Stochastic Methods (2003)
36. Madigan, D., York, J.: Bayesian graphical models for discrete data. International Statistical Review 63 (1995) 215–232

Multi-Space Learning for Image Classification Using AdaBoost and Markov Random Fields

W. Zeng^a, X. Chen^b, H. Cheng^c, J. Hua^b

a Department of Electrical Engineering and Computer Science, The University of Kansas, 1520 West 15th Street, Lawrence, Kansas, USA
b Computer Science Department, Wayne State University, 5057 Woodward Ave., 3010, Detroit, MI, USA
c University of Electronic Science and Technology of China, Chengdu, China, 611731
[email protected], [email protected]

Abstract. In various applications (e.g., automatic image tagging), image classification is typically treated as a multi-label learning problem, where each image can belong to multiple classes (labels). In this paper, we describe a novel strategy for multi-label image classification: instead of representing each image in a single feature space, we utilize a set of labeled image blocks (each with a single label) to represent an image in multiple feature spaces. This strategy differs from multi-instance multi-label learning in that we model the relationship between image blocks and labels explicitly. Furthermore, instead of assigning labels to image blocks, we apply multi-class AdaBoost to learn the probability of a block belonging to a certain label. We then develop a Markov random field-based model to integrate the block information for the final multi-label classification.
To evaluate the performance, we compare the proposed method to six state-of-the-art multi-label algorithms on a real-world data set collected from the internet. The results show that our method outperforms the other methods on several evaluation indicators, including Hamming loss, ranking loss, macro-averaging F1 and micro-averaging F1.

Keywords: Image classification, Multi-label learning, Markov random field

1 Introduction

With the rapid development of multimedia applications, the number of images in personal collections, public data sets, and on the web keeps growing. It is estimated that every minute around 3000 images are uploaded to Flickr. In 2010, the number of images Flickr hosted exceeded five billion. This rapidly growing number of images presents significant challenges in organizing and indexing them. In addition to scene analysis [3, 31, 34, 35, 37], image retrieval [4, 19, 26], content-sensitive image filtering [6], and image representation [18], extensive attention has been drawn to the automatic management of images. Automatic image tagging, for example, is the process of assigning multiple keywords to a digital image. It is typically formulated as a multi-class or multi-label learning problem. In multi-class learning (MCL) [1, 2, 5, 8], an image is assigned one and only one label from a set of predefined categories, while in multi-label learning (MLL), an image is assigned one or more labels from a predefined label set. In this paper, we focus on MLL in a real-world application. One of the commonly used MLL methods is problem transformation, which transforms a multi-label learning problem into multiple binary learning problems using a strategy called binary relevance (BR) [15, 27].
A BR-based learning model typically constructs a binary classifier for each label using regrouped data sets. While it is simple to implement, a BR-based method neglects label dependency, which is crucial in image classification. For example, an image labeled with beach may also be labeled with sea, while an image labeled with mountain is unlikely to be labeled with indoor. More sophisticated algorithms have been proposed to address label dependency [11, 12, 29, 31]. However, in most existing methods an image with multiple labels is represented by one feature vector, although those labels stem from different sub-regions of the image. To resolve the ambiguity between image regions and labels, multi-instance learning (MIL) methods have been developed within MLL. In multi-instance learning, including multi-instance multi-label learning (MIMLL), an image is transformed into a bag of instances. The bag is positive if at least one instance in the bag is positive, and negative otherwise. MIL [30, 32] attempts to model the relationship between each sub-region of an image and an associated label. To extract the sub-regions, image segmentation techniques are applied. However, image segmentation is an open problem in image processing, which makes MIL computationally expensive; the accuracy of the segmentation also affects the performance of MIL. In this paper, we propose a multi-space learning (MSL) method using AdaBoost and Markov random fields to transform MLL tasks into MCL problems. We utilize normalized real-valued outputs from one-against-all multi-class AdaBoost to represent the association between a block (instead of a bag or the entire image) and a potential label. The normalized real-valued output also represents the contribution of a block within an entire image to a label. This step resolves the ambiguity between instances and labels, since the labels in multi-class classification share no intersection in the labeled examples.
This resolves the ambiguity of labeling an image. Then, we use Markov random field (MRF) models to integrate the label sets for an image. Compared to MIMLL, MRF-based integration is a more principled way to combine the results from the blocks than the hard logic provided by MIL. In image classification, the fact that different labels describe different regions is the major reason for labeling ambiguity. In our framework, we exploit this characteristic of image classification to transform the MLL task into an MCL problem. The key contributions of this paper are highlighted as follows: (1) We propose an algorithm for multi-space learning which explicitly models the relationship between each image block and the labels. (2) We derive an MRF-based model to integrate the results from every block in an image. Instead of predefining parameter values, the values of the MRF parameters and the integration thresholds are obtained from the training images.

The rest of the paper is organized as follows. In Section 2, we describe the proposed method with multi-space learning and MRF-based integration. In Section 3, we discuss our data set, including all the image descriptors we have used and the final statistics. This is followed by the experimental results of the performance comparison. Finally, we conclude and discuss future work in Section 5.

2 Methodology

The system overview is shown in Figure 1. In this framework, we convert training images and testing images into multi-space representations with overlapping blocks of fixed size. We utilize a set of single-labeled training blocks of the same fixed size to train a one-against-all multi-class AdaBoost classifier.
The classifier is used to calculate the real-valued outputs describing the association between blocks and labels for the training and testing images. A Markov random field model is used to integrate these real-valued outputs. Via thresholds estimated from the integration results of the training images, labels are predicted for the testing images one label at a time. We will discuss feature extraction in Section 3.

[Figure 1: overview diagram of the proposed framework.]

Fig. 1. The Framework of Our Algorithm

In the framework, we use a multi-space representation to extract blocks from every image. The blocks are of fixed size and overlap. From the training images, we create a set of training blocks, denoted B_trn, which contains image blocks labeled with one and only one label from a finite label set L of q semantic labels plus a newly introduced label called background, used to filter out non-object blocks. The set of training images I_trn is labeled with the same label set L, and I_tst denotes the testing images. Therefore, for the labeled training blocks we have q + 1 categories, i.e.,

B_trn = {(b_1, y_{b_1}), ..., (b_N, y_{b_N})}, ∀ b_i, y_{b_i} ∈ L*,

where L* = {L, background}. For the training and testing images, we have the following representations:

I_trn = {(X_1, Y_1), ..., (X_n, Y_n)}, ∀ i, Y_i ⊆ L,
I_tst = {T_1, ..., T_m}.

In the multi-space representation, we have the following definitions. Let X_i = {b^{(i)}_{(1,1)}, ..., b^{(i)}_{(j,k)}, ..., b^{(i)}_{(r_i,c_i)}}, where b^{(i)}_{(j,k)} denotes the image block in the j-th row and k-th column of the i-th training image, and r_i and c_i denote the numbers of rows and columns of blocks contained in X_i. For the testing images, we have the same representation, T_i = {t^{(i)}_{(1,1)}, ..., t^{(i)}_{(j,k)}, ...
, t^{(i)}_{(r_i,c_i)}}, where t^{(i)}_{(j,k)} denotes the image block in the j-th row and k-th column of the i-th testing image, and r_i and c_i denote the numbers of rows and columns of blocks contained in T_i. It should be noted that the extracted blocks B_trn are not necessarily contained in the union of the multi-space representations of the training images, i.e., B_trn ⊄ ∪_i X_i. Each training image is partitioned into blocks via the multi-space representation; the blocks of a training image are fixed once the image is given. However, the training blocks in B_trn are extracted from the training images at random positions where an object is located. Figure 1 shows a multi-space representation; the block size is 75*100 pixels. The overlap along the x-axis is 25 pixels, and 40 pixels along the y-axis. The blocks are extracted in sequence. Thus, the representation efficiently records the content and spatial information of an individual image. Features are extracted for every block in the image. It should be noted that, given the size of the image, the number and the locations of the blocks can be calculated. In our framework, we train a multi-class classifier mapping every block in B_trn to a label in L*. In the experiments, we use multi-class AdaBoost [38] with a one-against-all strategy and one-dimensional decision stumps as weak learners, denoted h_l^i(b), i = 1, ..., M; l ∈ L*, where i is the index of the weak learner, l is the index of the label, and M is the number of AdaBoost iterations. Accordingly, the weight of the corresponding weak learner is denoted α_l^i, i = 1, ..., M; l ∈ L*. The AdaBoost we use follows Algorithm 1 in [38]. The only difference is that we record normalized real-valued outputs instead of direct labels for a block b. In the multi-class AdaBoost step, for an assigned label l, all image blocks in B_trn labeled with l are considered positive examples, while the remaining image blocks in B_trn are considered negative examples.
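The claim above that the number and locations of the blocks follow from the image size can be sketched as below. This is an illustrative Python sketch under stated assumptions: the text gives a 75*100 block with 25 px overlap along x and 40 px along y, but which of 75/100 is the width is not stated, so the defaults here (width 100, height 75) are a guess, and the border handling (dropping blocks that would overrun the image) is also an assumption.

```python
def block_grid(img_w, img_h, blk_w=100, blk_h=75, ov_x=25, ov_y=40):
    """Top-left corners (x, y) of the fixed-size overlapping blocks of a
    multi-space representation.  The stride is block size minus overlap;
    blocks that would extend past the image border are dropped."""
    step_x, step_y = blk_w - ov_x, blk_h - ov_y
    return [(x, y)
            for y in range(0, img_h - blk_h + 1, step_y)
            for x in range(0, img_w - blk_w + 1, step_x)]
```

For an 800*600 image these defaults give a regular r_i × c_i grid of block positions, from which features can be extracted block by block.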
To describe the normalized real-valued outputs, we first introduce a boolean expression operator ⟦π⟧ for a boolean statement π: if π is true, ⟦π⟧ = 1; otherwise ⟦π⟧ = 0. Then the normalized real-valued output f_l(b) for an image block b in I_trn or I_tst, given an assigned label l, is

f_l(b) = \frac{\sum_{i=1}^{M} \alpha_l^i \cdot ⟦h_l^i(b) = l⟧}{\sum_{k \in L^*} \sum_{i'=1}^{M} \alpha_k^{i'} \cdot ⟦h_k^{i'}(b) = k⟧}.   (1)

All these normalized outputs, viewed as block-label associations for the image blocks in I_trn, are kept for the parameter estimation of the MRF-based integration, which is discussed in the next part.

Algorithm 1. Estimation of 2-node Potentials
1:  For l = 1 to |L|:
2:    Initialize n_l = 0;
3:    Initialize J_{H,l} and J_{V,l} with 10*10 zero matrices.
4:    For X_i ∈ I_trn and l ∈ Y_i
5:      n_l ← n_l + 1;
6:      Initialize C_{H,l,X_i} and C_{V,l,X_i} with two 10*10 zero matrices;
7:      For (b^{(i)}_{(j,k)}, b^{(i)}_{(j,k+1)}) ∈ X_i and k + 1 ≤ c_i
8:        x = Q(f_l(b^{(i)}_{(j,k)}));
9:        y = Q(f_l(b^{(i)}_{(j,k+1)}));
10:       C_{H,l,X_i}(x, y) ← C_{H,l,X_i}(x, y) + 1.
11:     End
12:     J_{H,l,X_i} = C_{H,l,X_i} / (r_i · (c_i − 1));
13:     For (b^{(i)}_{(j,k)}, b^{(i)}_{(j+1,k)}) ∈ X_i and j + 1 ≤ r_i
14:       x = Q(f_l(b^{(i)}_{(j,k)}));
15:       y = Q(f_l(b^{(i)}_{(j+1,k)}));
16:       C_{V,l,X_i}(x, y) ← C_{V,l,X_i}(x, y) + 1.
17:     End
18:     J_{V,l,X_i} = C_{V,l,X_i} / ((r_i − 1) · c_i);
19:     J_{H,l} ← J_{H,l} + J_{H,l,X_i};
20:     J_{V,l} ← J_{V,l} + J_{V,l,X_i};
21:   End
22:   J_{H,l} ← J_{H,l} / n_l;
23:   J_{V,l} ← J_{V,l} / n_l.
24: End
Outputs: J_{H,l} and J_{V,l}.

Our method fully utilizes the normalized outputs from the multi-class AdaBoost classifier to build Markov random field models for MLL. We derive an MRF model for information integration as follows.
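Before the derivation, Eq. (1) above can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the `ensembles` data structure (per-label lists of weight/weak-learner pairs from the one-against-all AdaBoost classifiers) is a hypothetical representation chosen for the example.

```python
def normalized_outputs(ensembles, block):
    """Eq. (1): normalized real-valued outputs f_l(block) for every label l.
    `ensembles[l]` is the list of (alpha, h) pairs of the one-against-all
    AdaBoost classifier for label l, where a weak learner h(block) returns
    the label it votes for."""
    raw = {l: sum(alpha for alpha, h in pairs if h(block) == l)
           for l, pairs in ensembles.items()}
    z = sum(raw.values())                  # total weighted vote over all labels
    return {l: (v / z if z > 0 else 0.0) for l, v in raw.items()}
```

Because the denominator sums the weighted votes over all labels in L*, the outputs for one block form a distribution over labels, which is what makes them usable as block-label contributions in the MRF below.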
For an assigned label l ∈ L, our goal is to maximize the likelihood P(X_i | l), which is proportional to a Gibbs distribution [14],

P(X_i | l) ∝ e^{−U(X_i | l)},

where U(X_i | l) is called the energy function. The energy function takes the following form [14]:

U(X_i \mid l) = \sum_{b^{(i)}_{(j,k)}} \Big( V_l^1(b^{(i)}_{(j,k)}) + \sum_{b^{(i)}_{(j',k')} \in N(b^{(i)}_{(j,k)})} V_l^2(b^{(i)}_{(j,k)}, b^{(i)}_{(j',k')}) \Big).   (2)

Note that V_l^1(b^{(i)}_{(j,k)}) and V_l^2(b^{(i)}_{(j,k)}, b^{(i)}_{(j',k')}) are the potentials for one block and for two horizontally or vertically adjacent blocks in the MRF, given a label l. We introduce the following definition: V_l^1(b^{(i)}_{(j,k)}) = −f_l(b^{(i)}_{(j,k)}), where f_l(b^{(i)}_{(j,k)}) represents the contribution of the block to a certain label. When the contribution increases, the energy function decreases. Thus, we introduce this form of the one-block potential in Formula (2).

To formulate the two-block potentials V_l^2(b^{(i)}_{(j,k)}, b^{(i)}_{(j',k')}), we first quantize the normalized real-valued outputs from the multi-class AdaBoost with a function Q, where b denotes an image block. As 0 ≤ f_l(b) ≤ 1,

Q(f_l(b)) = p, if (p − 1)/10 ≤ f_l(b) < p/10;
Q(f_l(b)) = 10, if f_l(b) = 1.   (3)

After quantization, given a label l, we count the different combinations of the ten levels horizontally and vertically over a training image. Thus, we get two count matrices, denoted C_{H,l,X_i} and C_{V,l,X_i}. The two matrices are normalized by the numbers of two-adjacent blocks horizontally and vertically; J_{H,l,X_i} and J_{V,l,X_i} denote the normalized count matrices. Finally, the averages of J_{H,l,X_i} and J_{V,l,X_i} over all positive images labeled with l are formed as J_{H,l} and J_{V,l}, called the joint contribution matrices of two adjacent blocks, horizontal and vertical.
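The quantization (3) and the per-image counting of Algorithm 1 can be sketched as follows. This is an illustrative Python sketch under the assumptions that levels are stored as a row-major grid and that the vertical counts are obtained by applying the same pairing to the grid's columns; the function names are hypothetical.

```python
def quantize(f):
    """Eq. (3): map a normalized output f in [0, 1] to a level p in 1..10,
    i.e. p satisfies (p - 1)/10 <= f < p/10, with f = 1 mapped to 10."""
    return 10 if f >= 1.0 else int(f * 10) + 1

def pair_counts(levels):
    """10x10 count matrix C_H of horizontally adjacent level pairs for one
    image, where levels[j][k] = Q(f_l(b_(j,k))).  Dividing by the number of
    pairs, r*(c-1), gives J_H as in Algorithm 1; applying the function to the
    transposed grid gives the vertical counts C_V."""
    C = [[0] * 10 for _ in range(10)]
    for row in levels:
        for a, b in zip(row, row[1:]):
            C[a - 1][b - 1] += 1
    return C
```

Averaging the normalized matrices over all positive training images for a label then yields the joint contribution matrices J_{H,l} and J_{V,l}.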
After parameter estimation for the two-block potentials, J_{H,l} and J_{V,l} are output as a codebook defining V^2_l(b^{(i)}_{(j,k)}, b^{(i)}_{(j',k')}) as follows. Let x = Q(f_l(b^{(i)}_{(j,k)})) and y = Q(f_l(b^{(i)}_{(j',k')})). If the two blocks are horizontally adjacent,

    V^2_l(b^{(i)}_{(j,k)}, b^{(i)}_{(j',k')}) = -\frac{\lambda}{|N(b^{(i)}_{(j,k)})|} J_{H,l}(x, y);

otherwise,

    V^2_l(b^{(i)}_{(j,k)}, b^{(i)}_{(j',k')}) = -\frac{\lambda}{|N(b^{(i)}_{(j,k)})|} J_{V,l}(x, y),    (4)

where N(b^{(i)}_{(j,k)}) denotes the neighbors of block b^{(i)}_{(j,k)} and \lambda is a parameter weighting the relative contributions of the one-block and two-block potentials.

The integration results are used to predict label sets for testing images via thresholds. For the convenience of numerical calculation, we normalize the integration by the number of blocks n_i, since otherwise the number of blocks would bias the integration results:

    Intg_l(X_i) = \frac{\ln(e^{-U(X_i | l)})}{n_i} = \frac{-U(X_i | l)}{n_i}.

This formula expands into the following terms:

    Intg^1_l(X_i) = \sum_{b^{(i)}_{(j,k)} \in X_i} f_l(b^{(i)}_{(j,k)}),

    Intg^2_l(X_i) = \sum_{b^{(i)}_{(j,k)} \in X_i} \sum_{b^{(i)}_{(j',k')} \in N(b^{(i)}_{(j,k)})} \frac{\lambda \cdot J_l(b^{(i)}_{(j,k)}, b^{(i)}_{(j',k')})}{|N(b^{(i)}_{(j,k)})|},

    Intg_l(X_i) = \frac{Intg^1_l(X_i) + Intg^2_l(X_i)}{n_i},

where J_l(b^{(i)}_{(j,k)}, b^{(i)}_{(j',k')}) denotes the joint contribution obtained by Algorithm 1; whether the horizontal or the vertical matrix is used depends on the locations of b^{(i)}_{(j,k)} and b^{(i)}_{(j',k')}. With the integration results obtained from I_trn, the threshold for predicting an assigned label l is estimated by maximizing the F1 measure on I_trn. An image is predicted as positive for l when its integration result is above the threshold; otherwise, it is predicted as negative for l.
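Putting the pieces together, the normalized integration score can be sketched as below. This is a hedged illustration, assuming a grid of f_l scores, the quantizer Q, precomputed J_H and J_V matrices, and a 4-neighbourhood of horizontally and vertically adjacent blocks; the names are ours, not the paper's.

```python
def integration_score(grid, J_H, J_V, Q, lam=1.0):
    """Intg_l(X_i) = (Intg^1 + Intg^2) / n_i: the sum of one-block
    contributions f_l(b) plus pairwise contributions lam*J(x,y)/|N(b)|,
    normalized by the number of blocks n_i."""
    r, c = len(grid), len(grid[0])
    one_block = sum(sum(row) for row in grid)
    two_block = 0.0
    for j in range(r):
        for k in range(c):
            nbrs = [(a, b) for a, b in ((j, k - 1), (j, k + 1),
                                        (j - 1, k), (j + 1, k))
                    if 0 <= a < r and 0 <= b < c]
            x = Q(grid[j][k]) - 1
            for a, b in nbrs:
                y = Q(grid[a][b]) - 1
                J = J_H if a == j else J_V    # horizontal vs vertical pair
                two_block += lam * J[x][y] / len(nbrs)
    return (one_block + two_block) / (r * c)
```

Subtracting the per-label threshold from this score then yields the zero-baselined values used for both thresholded prediction and ranking-based evaluation.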
Integration results are used to predict label sets for testing images via thresholding. Since we also want to evaluate performance under ranking-based criteria in multi-label classification, we use the integration result minus the threshold for each label l; the thresholds thus serve as zero-baselines for prediction.

3 Database

We collected 4100 images from the internet and labeled them with building, car, dog, human, mountain, and water, according to their contents. The resolution of all images is kept under 800x600. For feature extraction, we use 13 different descriptors to represent the images. They focus on different characteristics of the images, such as color, texture, edges, contour, and frequency information, and show different strengths in describing local details or global appearance. Table 2 describes the feature sets we used. In the proposed algorithm, we extract features on every fixed-size block; the dimensionality of the features on a block is 2684. For the comparison experiments, six MLL algorithms are used, with features extracted on entire images using the same 13 feature sets; the dimensionality for an entire image is 2629.

Table 1. Sample Numbers per Label Set

Label  Train/Test    Label    Train/Test
b      250/125       b+h      167/83
c      250/125       c+h      167/83
d      250/125       d+h      167/83
h      250/125       m+w      167/83
m      250/125       b+c+h    133/67
w      250/125       b+m+w    133/67
b+c    167/83        d+h+w    133/67

Among the 4100 images, 2734 are selected randomly for training and the remaining 1366 are used for testing (2/3 for training and 1/3 for testing). In our algorithm, the training blocks are normalized according to the sample mean and variance of every dimension; these sample means and variances are recorded and reused to normalize the image blocks in both training and test images. The same normalization strategy is used for the data set in the multi-label classification comparison experiments.
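The block normalization described above can be sketched as follows; a minimal stdlib-only illustration in which the function and variable names are ours, not the paper's.

```python
import statistics

def fit_normalizer(train_blocks):
    """Record per-dimension sample mean and standard deviation from the
    training blocks (lists of equal-length feature vectors); the same
    recorded statistics are then reused on both training and test blocks."""
    dims = list(zip(*train_blocks))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) or 1.0 for d in dims]  # guard zero-variance dims
    return means, stds

def normalize(block, means, stds):
    return [(v - m) / s for v, m, s in zip(block, means, stds)]
```

Fitting on training blocks only, and applying the frozen statistics to test blocks, avoids leaking test-set information into the features.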
Label-Cardinality and Label-Density [27] are commonly used to characterize multi-label data:

    \text{Label-Cardinality} = \frac{1}{n}\sum_{i=1}^{n} |Y_i|,    \text{Label-Density} = \frac{1}{n \cdot q}\sum_{i=1}^{n} |Y_i|,

where n denotes the sample size of the training set and q denotes the size of the predefined label set. For the MLL problem in our database, Label-Cardinality is 1.5973 and Label-Density is 0.2662.

Table 2. Feature Set Description

Features                           Description                                       Parameter(s) and Value
Block-wise color moment [24]       Mean, standard deviation, and skewness of HSV     ---
RGB histogram [23]                 64-bin normalized histogram of RGB                bin-num = 64
HSV histogram [23]                 64-bin normalized histogram of HSV                bin-num = 64
Color correlogram [10]             Co-occurrence of pixels with a given distance     dist = 1 or 3, color-level = 64
                                   and color level
Edge distribution histogram [13]   Global, local, and semi-global edge               global-bin = 1, local-bin = 25,
                                   distribution with five filters                    horizon-bin = 5, vertical-bin = 5,
                                                                                     center-bin = 1
Gabor wavelet transformation [17]  Gabor wavelet transformation for texture          Uh = 0.4, Ul = 0.05, K = 6, S = 4
LBP [20, 36]                       Local descriptor of binary patterns               Default parameter values
LPQ [21]                           Local descriptor of phase quantization            Default parameter values
Moment invariants [7]              Shape descriptor                                  ---
Tamura texture feature [25]        Global descriptor of coarseness, contrast,        ---
                                   and directionality
Haralick texture feature [9]       Please refer to [9]                               ---
SIFT [16]                          Scale-invariant feature transform                 numspatialbins = 4, numorientbins = 12
GFD [28]                           Generic Fourier descriptor                        max-rad = 4, max-ang = 15

In Table 1, we use b, c, d, h, m, and w
for short to represent building, car, dog, human, mountain, and water. From the training images, we generate 3000 sample blocks per label (including background) for training; these image blocks are used to train the AdaBoost model.

4 Experiment

In this section, we evaluate the proposed multi-space learning (MSL) on the database. We compare our algorithm with six state-of-the-art MLL algorithms: AdaBoost.MH [22], which combines MCL with label ranking; back-propagation for MLL (BP-MLL) [33], which modifies the error term of traditional BP; instance differentiation (INSDIF) [35], which converts MLL into MIML based on the differences between an image and the different label centroids; binary relevance SVM with linear kernel (LBRSVM); multi-label kNN (ML-KNN) [34], which combines the MAP principle with kNN; and an SVM with a low-dimensional shared space, named MLLS [11].

In the experiments, we set 100 as the maximum number of iterations for BP-MLL, AdaBoost.MH, and also multi-class AdaBoost. Other parameters are obtained via 3-fold cross-validation, with F1-measure as the optimization criterion. We evaluate with three groups of criteria, namely example-based, ranking-based, and label-based criteria, including hamming loss, one-error, coverage, ranking loss, and average precision, together with micro-averaging and macro-averaging recall, precision, and F1. Let H be a learned classifier, f the real-valued function associated with H, and T = {t_1, t_2, ..., t_m} the testing data set; Y_i is the true label set for t_i.
The example-based and ranking-based criteria are defined as follows:

    hloss(H) = \frac{1}{mq}\sum_{i=1}^{m} |H(t_i) \,\Delta\, Y_i|,    one\text{-}err(f) = \frac{1}{m}\sum_{i=1}^{m} \llbracket \arg\max_{y \in L} f_y(t_i) \notin Y_i \rrbracket,    (5)

    cov(f) = \frac{1}{m}\sum_{i=1}^{m} \max_{y \in Y_i} rank_f(t_i, y) - 1,    (6)

    rloss(f) = \frac{1}{m}\sum_{i=1}^{m} |S_i|,    S_i = \{(y_1, y_2) \mid f_{y_1}(t_i) \le f_{y_2}(t_i),\ (y_1, y_2) \in Y_i \times \bar{Y}_i\},    (7)

    avgprec(f) = \frac{1}{m}\sum_{i=1}^{m} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{|S'_i|}{rank_f(t_i, y)},    (8)

    S'_i = \{y' \in Y_i \mid rank_f(t_i, y') \le rank_f(t_i, y)\}.    (9)

TP_l, FN_l, and FP_l denote the numbers of true positives, false negatives, and false positives for label l. Micro-averaging and macro-averaging recall, precision, and F1 are then defined as

    micro\text{-}rec = \frac{\sum_{l=1}^{q} TP_l}{\sum_{l=1}^{q} (TP_l + FN_l)},    micro\text{-}prec = \frac{\sum_{l=1}^{q} TP_l}{\sum_{l=1}^{q} (TP_l + FP_l)},    (10)

    macro\text{-}rec = \frac{1}{q}\sum_{l=1}^{q} \frac{TP_l}{TP_l + FN_l},    macro\text{-}prec = \frac{1}{q}\sum_{l=1}^{q} \frac{TP_l}{TP_l + FP_l},    (11)

    micro\text{-}F1 = 2 \cdot \frac{micro\text{-}rec \times micro\text{-}prec}{micro\text{-}rec + micro\text{-}prec},    macro\text{-}F1 = 2 \cdot \frac{macro\text{-}rec \times macro\text{-}prec}{macro\text{-}rec + macro\text{-}prec}.    (12)

Table 3 and Table 4 show the comparison results; (-) means the smaller the value, the better the performance, while (+) means the opposite. As can be seen, the proposed MSL algorithm outperforms the other six algorithms on several important criteria, including hamming loss, one-error, coverage, ranking loss, average precision, micro-averaging F1, and macro-averaging F1. For multi-label classification, not only accurate prediction is important: the ranking of the association between examples and labels is also vital, since besides thresholding, label ranking is another popular integration strategy for prediction on multi-label data.
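For concreteness, the example-based criteria in Formula (5) can be sketched as follows; a hedged illustration in which the data structures (label sets and per-example score dictionaries) are our own choice.

```python
def hamming_loss(pred_sets, true_sets, q):
    """Average size of the symmetric difference between predicted and
    true label sets, normalized by the label-space size q."""
    m = len(true_sets)
    return sum(len(p ^ t) for p, t in zip(pred_sets, true_sets)) / (m * q)

def one_error(scores, true_sets):
    """Fraction of examples whose top-ranked label is not a true label.
    `scores` is a list of dicts mapping label -> real-valued score."""
    m = len(true_sets)
    return sum(max(s, key=s.get) not in t
               for s, t in zip(scores, true_sets)) / m
```

The ranking-based criteria (6)-(9) follow the same pattern, replacing the top-1 check with sorting the labels by score for each example.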
Label-based criteria are borrowed from the field of information retrieval and reflect classifier performance while excluding the imbalance factor of the learning domain. The label-based criteria are recall and precision; the F1 measure captures the balance between them. F1 depends on two factors: the absolute values of recall and precision, and the difference between them. A high F1 means that precision and recall are both high. Multi-label classification uses F1 as a crucial evaluation criterion via different averaging strategies, among which micro-averaging and macro-averaging are the two most common. The former gives equal weight to every example, while the latter gives equal weight to every label when computing the F1 measure.

Table 3. Performance on Hamming Loss, One-Error, Coverage, Ranking Loss, and Average Precision

              Hamming Loss  One-Error  Coverage  Ranking Loss  Average Precision
              (-)           (-)        (-)       (-)           (+)
AdaBoost.MH   21.18         33.82      1.49      16.46         75.98
BP-MLL        27.45         23.72      1.68      23.26         73.48
INSDIF        17.18         24.31      1.23      11.61         83.48
LBRSVM        18.95         26.13      1.31      12.95         81.78
ML-KNN        16.34         24.45      1.25      12.04         83.77
MLLS          24.91         22.41      1.23      11.38         84.48
MSL           13.68         10.83      1.04      7.16          91.03

In addition to the evaluation shown in Table 3 and Table 4, we vary the threshold values for prediction to draw precision-recall curves in Figure 2 and Figure 3. The micro-averaging and macro-averaging precision-recall curves show how sensitive a classifier is to threshold changes. In summary, the larger the area under the precision-recall curve (AUPRC) is, the

Table 4.
Performance on Micro-Averaging and Macro-Averaging Recall, Precision, and F1

              macro-rec  macro-prec  macro-F1  micro-rec  micro-prec  micro-F1
              (+)        (+)         (+)       (+)        (+)         (+)
AdaBoost.MH   57.01      75.18       64.84     60.61      60.16       60.38
BP-MLL        77.63      48.51       59.71     78.52      49.04       60.37
INSDIF        60.65      75.93       67.43     62.21      69.96       65.86
LBRSVM        62.84      73.56       67.78     64.32      64.46       64.39
ML-KNN        61.67      72.95       66.84     62.91      72.19       67.22
MLLS          90.41      53.01       66.82     89.61      51.87       65.71
MSL           77.01      72.26       74.56     77.65      72.81       75.15

better the classifier is; "better" here means more robust to threshold change. Overall, the MSL method yields superior performance compared to the other six MLL algorithms.

Fig. 2. Micro-averaging precision-recall curve.   Fig. 3. Macro-averaging precision-recall curve.

5 Conclusion

In this paper, we present an algorithm that uses training blocks extracted from training images and image multi-space representations to form a multi-space learning method, which employs multi-class AdaBoost to train a multi-class classifier. In this sense, we transform image classification from a multi-label learning problem into a multi-class learning problem. In addition, rather than using a predefined logic to integrate results from regions as in multi-instance learning, we derive a Markov random field model to integrate the normalized real-valued outputs from AdaBoost. MRF-based multi-space learning maintains the content and spatial information in images; hence, MRF-based integration is a more advanced method to integrate results from different regions. Our algorithm is experimentally evaluated on a multi-label image database and proves highly effective for image classification.

6 Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 0066126.
References

1. O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR'08, pages 1-8, June 2008.
2. A. Bosch, A. Zisserman, and X. Muñoz. Image classification using random forests and ferns. In ICCV'07, pages 1-8, October 2007.
3. M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37:1757-1771, September 2004.
4. G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos. Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans. on Pattern Analysis and Machine Intelligence, 29(3):394-410, March 2007.
5. X. Chen, X. Zeng, and D. van Alphen. Multi-class feature selection for texture classification. IEEE Trans. Pattern Analysis and Machine Intelligence, 27:1685-1691, October 2006.
6. T. Deselaers, L. Pimenidis, and H. Ney. Bag-of-visual-words models for adult image classification and filtering. In ICPR'08, pages 1-4, December 2008.
7. S. A. Dudani, K. J. Breeding, and R. B. McGhee. Aircraft identification by moment invariants. IEEE Trans. on Computers, 26:39-46, January 1977.
8. Y. Fu and T. S. Huang. Image classification using correlation tensor analysis. IEEE Trans. on Image Processing, 17:226-234, February 2008.
9. R. M. Haralick, K. Shanmugam, and I. Dinstein. Textural features for image classification. IEEE Trans. on Systems, Man and Cybernetics, 3:610-621, November 1973.
10. J. Huang, S. R. Kumar, M. Mitra, W. Zhu, and R. Zabih. Image indexing using color correlograms. In CVPR'97, pages 762-768, June 1997.
11. S. Ji, L. Tang, S. Yu, and J. Ye. Extracting shared subspace for multi-label classification. In KDD'08, pages 381-389, August 2008.
12. F. Kang, R. Jin, and R. Sukthankar. Correlated label propagation with application to multi-label learning. In CVPR'06, pages 1719-1726, October 2006.
13. D. K. Park, Y. S. Jeon, and C. S. Won. Efficient use of local edge histogram descriptor.
In Proceedings of the 2000 ACM Workshops on Multimedia, pages 51-54, December 2000.
14. S. Z. Li. Markov Random Field Modeling in Image Analysis. Springer-Verlag New York, Inc., Secaucus, New Jersey, 2001.
15. X. Lin and X. Chen. Mr.KNN: Soft relevance for multi-label classification. In CIKM'10, pages 349-358, October 2010.
16. D. G. Lowe. Object recognition from local scale-invariant features. In ICCV'99, pages 1150-1157, September 1999.
17. B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Analysis and Machine Intelligence, 18:837-842, August 1996.
18. E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image classification. In ECCV'06, pages 490-503, May 2006.
19. J. F. Nunes, P. M. Moreira, and J. M. R. S. Tavares. Shape based image retrieval and classification. In Iberian Conference on Information Systems and Technologies, pages 1-6, August 2010.
20. T. Ojala, M. Pietikäinen, and T. Mäenpää. A generalized local binary pattern operator for multiresolution gray scale and rotation invariant texture classification. In ICAPR'01, pages 397-406, March 2001.
21. V. Ojansivu, E. Rahtu, and J. Heikkilä. Rotation invariant local phase quantization for blur insensitive texture analysis. In ICPR'08, pages 1-4, December 2008.
22. R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, May 2000.
23. L. G. Shapiro and G. Stockman. Computer Vision. Prentice Hall, Inc., Upper Saddle River, New Jersey, 2001.
24. M. Stricker and M. Orengo. Similarity of color images. In Proceedings of Storage and Retrieval for Image and Video Databases, pages 381-392, February 1995.
25. H. Tamura, S.
Mori, and T. Yamawaki. Textural features corresponding to visual perception. IEEE Trans. on Systems, Man and Cybernetics, 8:460-473, June 1978.
26. B. Tomasik, P. Thiha, and D. Turnbull. Tagging products using image classification. In SIGIR'09, pages 792-793, July 2009.
27. G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, 2010.
28. A. Vijay and M. Bhattacharya. Content-based medical image retrieval using the generic Fourier descriptor with brightness. In ICMV'09, pages 330-332, December 2009.
29. H. Wang, H. Huang, and C. Ding. Image annotation using multi-label correlated Green's function. In ICCV'09, pages 2029-2034, October 2009.
30. C. Yang, M. Dong, and J. Hua. Region-based image annotation using asymmetrical support vector machine-based multiple-instance learning. In CVPR'06, pages 2057-2063, June 2006.
31. Z. Younes, F. Abdallah, and T. Denœux. Multi-label classification algorithm derived from k-nearest neighbor rule with label dependencies. In European Signal Processing Conference, August 2008.
32. Z. Zha, X. Hua, T. Mei, J. Wang, G. Qi, and Z. Wang. Joint multi-label multi-instance learning for image classification. In CVPR'08, pages 1-8, June 2008.
33. M. Zhang and Z. Zhou. Multi-label neural networks with applications to functional genomics and text categorization. IEEE Trans. on Knowledge and Data Engineering, 18:1338-1351, 2006.
34. M. Zhang and Z. Zhou. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40:2038-2048, July 2007.
35. M. Zhang and Z. Zhou. Multi-label learning by instance differentiation. In AAAI'07, pages 669-674, July 2007.
36. G. Zhao and M. Pietikäinen. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 29(6):915-928, June 2007.
37. Z. Zhou and M. Zhang. Multi-instance multi-label learning with application to scene classification.
In NIPS'06, pages 1609-1616, December 2006.
38. J. Zhu, H. Zou, S. Rosset, and T. Hastie. Multi-class AdaBoost. Statistics and Its Interface, 2:349-360, 2009.

An Empirical Comparison of Supervised Ensemble Learning Approaches

Mohamed Bibimoune(1,2), Haytham Elghazel(1), Alex Aussem(1)
(1) Université de Lyon, CNRS, Université Lyon 1, LIRIS UMR 5205, F-69622, France
[email protected], [email protected], [email protected]
(2) ProbaYes, 82 allée Galilée, F-38330 Montbonnot, France

Abstract. We present an extensive empirical comparison between twenty prototypical supervised ensemble learning algorithms, including Boosting, Bagging, Random Forests, Rotation Forests, Arc-X4, Class-Switching and their variants, as well as more recent techniques like Random Patches. These algorithms were compared against each other in terms of threshold, ranking/ordering and probability metrics over nineteen UCI benchmark datasets with binary labels. We also examine the influence of two base learners, CART and Extremely Randomized Trees, and the effect of calibrating the models via Isotonic Regression on each performance metric. The selected datasets were already used in various empirical studies and cover different application domains. The experimental analysis was restricted to the hundred most relevant features according to the SNR filter method, with a view to dramatically reducing the computational burden involved in the simulation. The source code and the detailed results of our study are publicly available.

Key words: Ensemble learning, classifier ensembles, empirical performance comparison.

1 Introduction

The ubiquity of ensemble models in Machine Learning and Pattern Recognition applications stems primarily from their potential to significantly increase prediction accuracy over individual classifier models [25].
In the last decade, there has been a great deal of research focused on boosting their performance, either by placing more or less emphasis on the hard examples, by constructing new features for each base classifier, or by encouraging individual accuracy and/or diversity within the ensemble. While the actual performance of any ensemble model on a particular problem is clearly dependent on the data and the learner, there is still much room for improvement, and a comparison between all the proposals provides valuable insight into their respective benefits and differences.

There are few comprehensive empirical studies comparing ensemble learning algorithms [1, 9]. The study performed by Caruana and Niculescu-Mizil [9] is perhaps the best known; however, it is restricted to a small subset of well-established ensemble methods like random forests, boosted and bagged trees, and more classical models (e.g., neural networks, SVMs, Naive Bayes). On the other hand, many authors have compared their own ensemble classifier proposals with others. For instance, Zhang et al. compared in [29] RotBoost against Bagging, AdaBoost, MultiBoost and Rotation Forest using decision tree-based estimators, over 36 data sets from the UCI repository. In [23], Rodriguez et al. examined the Rotation Forest ensemble on a selection of 33 data sets from the UCI repository and compared it with Bagging, AdaBoost, and Random Forest with decision trees as the base classifier. More recently, Louppe et al. investigated a very simple, yet effective, ensemble framework called Random Patches that builds each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset.
With respect to AdaBoost and Random Forest, these experiments on 16 data sets showed that the proposed method provides on-par performance in terms of accuracy while simultaneously lowering the memory needs, and attains significantly better performance when memory is severely constrained. Despite these attempts to enhance capability and efficiency, we believe an extensive empirical evaluation of most of the proposed ensemble algorithms can shed some light on their strengths and weaknesses. We briefly review these algorithms and describe a large empirical study comparing several ensemble method variants in conjunction with two types of unpruned decision trees: the standard CART decision tree and a randomized variant called Extremely Randomized Tree (ET), proposed by Geurts et al. in [13], as base classifiers, both using the Gini splitting criterion.

As noted by Caruana et al. [9], different performance metrics are appropriate for each domain. For example, Precision/Recall measures are used in information retrieval; medicine prefers ROC area; Lift is appropriate for some marketing tasks, etc. The different performance metrics measure different tradeoffs in the predictions made by a classifier. One method may perform well on one metric and worse on another, hence the importance of gauging performance on several metrics to get a broader picture. We evaluate the performance of Boosting, Bagging, Random Forests, Rotation Forests, and their variants including LogitBoost, VadaBoost, RotBoost, and AdaBoost with stumps. For the sake of completeness, we added more recent techniques like Random Patches and less conventional techniques like Class-Switching and Arc-X4. All these voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classifiers (as in boosting methods) and those that do not (as in Bagging).
Our purpose was not to cover all existing methods; we restricted ourselves to well-performing methods that have been presented in the literature, without claiming exhaustiveness, but trying to cover a wide range of implementation ideas.

The data sets used in the experiments were all taken from the UCI Machine Learning Repository. They represent a variety of problems but do not include high-dimensional data sets, owing to the computational expense of running Rotation Forests. The comparison is performed based on three performance metrics: accuracy, ROC area, and squared error. For each algorithm we examine common parameter values. Following [9] and [22], we also examine the effect that calibrating the models via Isotonic Regression has on their performance.

The paper is organized as follows. In Section 2, we begin with basic notation and follow with a description of the base inducers that build classifiers; we use two variants of decision tree inducers: unlimited-depth trees and extremely randomized trees. We then describe the three performance metrics and the isotonic calibration method used throughout the paper. In Section 3, we describe our set of experiments with and without calibration and report the results. We raise several issues for future work in Section 4 and conclude with a summary of our contributions.

2 Ensemble Learning Algorithms & Parameters

Before discussing the ensemble algorithms chosen in this comprehensive study, we would like to mention that, contrary to [9], which attempted to explore the space of parameters for each learning algorithm, we decided to fix the parameters to their common values, except for a few data-dependent extra parameters that have to be finely pretuned.
The number of trees was fixed to 200 in accordance with a recent empirical study [15] which tends to show that ensembles of size less than or equal to 100 are too small for approximating the infinite ensemble prediction. Although it is shown that for some datasets the ensemble size should ideally be larger than a few thousand, our choice of ensemble size tries to balance performance and computation cost. We now summarize the parameters used for each learning algorithm.

Bagging (Bag) [4]: Practically, Bag has many advantages. It is fast, simple and easy to program, and it has no parameters to tune. Bag is sometimes proposed with an optimization of the bootstrap sample size to perform better; however, we fixed the default size equal to the size of the initial dataset.

Random Forests (RF) [7]: the number of features selected at each node for building the trees was fixed to the square root of the total number of features.

Random Patches (RadP) [19]: this method was proposed very recently to tackle the problem of insufficient memory w.r.t. the size of the data set. The idea is to build each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset; ps and pf are hyper-parameters that control the number of samples and features in a patch. These parameters are tuned using an independent validation dataset. It is worth mentioning that RadP was initially designed to overcome some shortcomings of the existing ensemble techniques in the context of huge data sets. As such, it was not meant to outperform the other methods on small data sets or without a memory limitation. We chose, however, this algorithm as an interesting alternative to Bag and RF.
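The patch-drawing step of RadP can be sketched as follows; an illustrative stdlib sketch, not Louppe et al.'s implementation, with names of our own choosing.

```python
import random

def random_patch(X, y, ps, pf, rng=random):
    """Draw one patch: a random subset of instances (fraction ps) and
    of features (fraction pf).  Each ensemble member is then trained on
    its own patch; the selected columns are returned so that test data
    can be projected the same way."""
    n, d = len(X), len(X[0])
    rows = rng.sample(range(n), max(1, int(ps * n)))
    cols = rng.sample(range(d), max(1, int(pf * d)))
    Xp = [[X[i][j] for j in cols] for i in rows]
    yp = [y[i] for i in rows]
    return Xp, yp, cols
```

With ps = 1 and pf = 1 this degenerates to training on the full data; ps and pf are exactly the hyper-parameters tuned on the validation set.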
AdaBoost (Ad) [11]: we used the standard algorithm proposed by Freund and Schapire.

AdaBoost Stump (AdSt): in this version of Ad, the base learner is replaced by a stump, a decision tree with only one node. While the base learner is highly biased, when combined with AdaBoost it is believed to compete with the best methods while providing a serious computational advantage.

VadaBoost (Vad) [26]: this is another ensemble method, called Variance-Penalizing AdaBoost, that appeared recently in the literature. VadaBoost is similar to AdaBoost except that the weighting function tries to minimize both the empirical risk and the empirical variance. This modification is motivated by a recent empirical bound which relates the empirical variance to the true risk. Vad depends on a hyper-parameter, λ, that is tuned on a validation set.

Arc-X4 (ArcX4) [5]: the method belongs to the family of Arcing (Adaptive Resampling and Combining) algorithms. It started out as a simple mechanism for evaluating the effect of Ad.

LogitBoost (Logb) [12]: LogitBoost is a boosting algorithm formulated by Friedman et al. Their original paper [12] casts the Ad algorithm into a statistical framework. When regarded as a generalized additive model, the Logb algorithm is derived by applying the cost functional of logistic regression. Note that there is no final vote, as each base classifier is not an independent classifier but rather a correction for the whole model.

Rotation Forests (Rot) [23]: this method builds multiple classifiers on randomized projections of the original dataset. The feature set is randomly split into K subsets (K is a parameter of the algorithm) and PCA is applied to each subset in order to create the training data for the base classifier. The idea of the rotation approach is to encourage simultaneously individual accuracy and diversity within the ensemble. The size of each feature subset was fixed to 3, as proposed by Rodriguez.
The number of sub-classes randomly selected for the PCA was fixed to 1, as we focused on binary classification. The size of the bootstrap sample over the selected class was fixed to 75% of its size.

RotBoost (Rotb) [29]: this method combines Rot and Ad. As the main idea of Rot is to improve the global accuracy of the classifiers while keeping the diversity through the projections, the idea here is to replace the decision tree by Ad. This can be seen as an attempt to improve Rot by increasing the base learner accuracy without affecting the diversity of the ensemble. The final decision is the vote over every decision made by the internal Ad. The parameter setup for Rotb is the same as for Rot. In order to be fair in terms of ensemble size, we construct an ensemble consisting of 40 Rotation Forests, each learned by Ad during 5 iterations; hence the total number of trees is 200. This ratio has been shown to be approximately the empirical best in [29].

Class-Switching (Swt) [6]: Swt is a variant of the output-flipping ensembles proposed by Martinez-Munoz and Suarez in [21]. The idea is to randomly switch the class labels at a certain user-defined rate p. The decision of the final classifier is again the plurality vote over these base classifiers. p is tuned on a validation set.

Considering the four data-dependent parameters mentioned above (i.e., ps, pf, p and λ), we randomly split each dataset into two parts, 80% for training and 20% for validation. The latter is used to search for the best hyper-parameters and is not used afterwards for training or comparison purposes (it is discarded from the whole data set). We then construct the ensemble on the training set while increasing each parameter from 0.1 to 1.0. The parameters yielding the best accuracy on the validation set are retained.
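The tuning protocol just described can be sketched as follows; a hedged sketch in which `fit` and `accuracy` are placeholder callables standing in for the actual ensemble training and evaluation, not the paper's Matlab code.

```python
import random

def tune_parameter(fit, accuracy, data, rng=random):
    """Hold out 20% of `data` for validation, sweep the hyper-parameter
    over 0.1, 0.2, ..., 1.0, and keep the value with the best
    validation accuracy."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(0.8 * len(idx))
    train = [data[i] for i in idx[:cut]]
    valid = [data[i] for i in idx[cut:]]
    best_p, best_acc = None, float("-inf")
    for k in range(1, 11):
        p = round(0.1 * k, 1)
        model = fit(train, p)
        acc = accuracy(model, valid)
        if acc > best_acc:
            best_p, best_acc = p, acc
    return best_p
```

Discarding the validation split afterwards, as the paper does, keeps the tuning data out of both training and the final comparison.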
It is worth noting that the other two performance metrics (i.e., mean squared error and AUC) could also be used for this parametrization. All the above methods were implemented in Matlab - except the CART algorithm, taken from the Matlab statistics toolbox, and the ET algorithm, taken from the regression tree package of [13] - in order to make fair comparisons and also because some algorithms are not publicly available (e.g., random patches, output switching). To make sure our Matlab implementations were correct, we ran a sanity check against previous papers on ensemble algorithms. 2.1 The decision tree inducers As mentioned above, we use two distinct decision tree inducers: a standard decision tree (CART) and the so-called Extremely Randomized Tree (ET) proposed in [13]. In [19], Louppe and Geurts found that every sub-sampling (samples and/or features) ensemble method they experimented with was improved when ET was used as base learner instead of a standard decision tree. ET is a variant of the decision tree that aims to further reduce the variance of ensemble methods by reducing the variance of the tree used as base learner. At each node, instead of cutting at the best threshold among all possible ones, the method selects an attribute and a threshold at random. To avoid very bad cuts, the score of the selected cut must be higher than a user-defined threshold; otherwise a new cut is drawn. This process is repeated until a suitable cut is found or until no attribute remains to pick (the algorithm draws one threshold per attribute). According to the authors, the variance-reducing strength of this algorithm arises from the fact that thresholds are selected totally at random, contrary to the preceding methods proposed by Kong and Dietterich in [18], which select at random a threshold among the best ones, and by Ho in [16], which selects the best one among a fixed number of thresholds. We therefore used both unpruned DT and ET as base learners.
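The ET split rule described above can be sketched as follows: pick an attribute and a threshold uniformly at random, and accept the cut only if its score exceeds a user-defined minimum, drawing at most one threshold per attribute. This is our illustrative sketch (using Gini gain as the score), not the package of [13].

```python
import random

def gini_gain(y, left_mask):
    """Gini impurity reduction of a binary split over 0/1 labels."""
    def gini(lab):
        if not lab:
            return 0.0
        p = sum(lab) / len(lab)
        return 2 * p * (1 - p)
    left = [l for l, m in zip(y, left_mask) if m]
    right = [l for l, m in zip(y, left_mask) if not m]
    n = len(y)
    return gini(y) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

def random_split(X, y, min_score=0.0, rng=None):
    rng = rng or random.Random(0)
    attrs = list(range(len(X[0])))
    rng.shuffle(attrs)
    for j in attrs:                      # one random threshold per attribute
        vals = [row[j] for row in X]
        t = rng.uniform(min(vals), max(vals))
        mask = [row[j] <= t for row in X]
        if gini_gain(y, mask) > min_score:
            return j, t
    return None                          # no acceptable cut found
```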
For ET, we used the regression tree package proposed in [13]. To distinguish ensembles with DT from those with ET, we append 'ET' to the algorithm names when extremely randomized trees are used. 2.2 Performance Metrics & Calibration Performance metrics can be split into three groups: threshold metrics, ordering/rank metrics and probability metrics [8]. For threshold metrics, such as accuracy (ACC), it makes no difference how close a prediction is to the threshold, usually 0.5; what matters is whether it is above or below it. In contrast, ordering/rank metrics, such as the area under the ROC curve (AUC), depend only on the ordering of the instances, not on the actual predicted values, while probability metrics, such as the root mean squared error (RMS), interpret the predicted value of each instance as the conditional probability of the output label being in the positive class given the input. In many applications it is important to predict well-calibrated probabilities; good accuracy or area under the ROC curve is not sufficient. Therefore, all the algorithms were run twice, with and without post-calibration, in order to compare the effects of calibrating ensemble methods on the overall performance. The idea is not new: Niculescu-Mizil and Caruana investigated in [9] the benefit of two well-known calibration methods, namely Platt Scaling and Isotonic Regression [28], on the performance of several classifiers. They concluded that AdaBoost, and good ranking algorithms in general, are those that benefit most from calibration. As expected, these benefits are most noticeable on the root mean squared error metric. In this paper, we focus only on Isotonic Regression because it was originally designed for decision tree models, although Platt Scaling could also be applied to decision trees.
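The three metric groups can be illustrated with one representative each, computed from predicted positive-class scores; these are standard textbook definitions, sketched by us rather than taken from the paper's code.

```python
def accuracy(y, scores, thr=0.5):
    """Threshold metric: only which side of thr a score falls on matters."""
    return sum((s > thr) == bool(t) for t, s in zip(y, scores)) / len(y)

def auc(y, scores):
    """Rank metric: probability a random positive outranks a random negative."""
    pos = [s for t, s in zip(y, scores) if t]
    neg = [s for t, s in zip(y, scores) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rms(y, scores):
    """Probability metric: scores are read as P(label = 1 | input)."""
    return (sum((t - s) ** 2 for t, s in zip(y, scores)) / len(y)) ** 0.5
```

Note that a monotone transformation of the scores leaves `auc` unchanged but can change `rms` arbitrarily, which is exactly why calibration matters for the probability metric.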
For this purpose, we use the pair-adjacent violators (PAV) algorithm described in [28, 9], which finds a piecewise constant solution in linear time. 2.3 Data sets We compare the algorithms on nineteen binary classification problems of various sizes and dimensions. Table 1 summarizes the main characteristics of the data sets used in our empirical study. This selection includes data sets with different characteristics and from a variety of fields; among them are data sets with thousands of features. As explained by Liu in [17], if Rot or Rotb are applied to such data sets, a rotation matrix with thousands of dimensions is required for each tree, which entails a dramatic increase in computational complexity. To keep the running time reasonable, we had no choice but to resort to a dimension reduction technique for these problems; the same strategy was adopted in several works [29, 23, 17]. Based on Liu's comparison, we took the best of the three filter methods proposed for rotation forest, the signal-to-noise ratio (SNR) [27]. SNR was used to rank all the features; we kept the 100 most relevant features and discarded the others. Of course this choice entails some compromise, as there will generally be some loss of information, so the reader should bear in mind that the actual size of the data sets is limited to 100 features in the experiments.

Table 1. Characteristics of the nineteen problems used in this study

Data sets      #inst  #feat  #labels  Reference
Basehock        1993   4862        2  [30]
Breast-cancer    699      9        2  [3]
Cleve            303     13        2  [3]
Colon             62   2000        2  [2]
Ionosphere       351     34        2  [3]
Leukemia          73   7129        2  [14]
Madelon         2600    500        2  [3]
Musk             476    166        2  [3]
Ovarian           54   1536        2  [24]
Parkinson        195     22        2  [3]
PcMac           1943   3289        2  [30]
Pima             768      8        2  [3]
Promoters        106     57        2  [3]
Relathe         1427   4322        2  [30]
Smk-Can          187  19993        2  [30]
Spam            4601     57        2  [3]
Spect            267     22        2  [3]
Wdbc             569     30        2  [3]
Wpbc             194     33        2  [3]

3 Performance analysis In this section, we report the results of the experimental evaluation. For each test problem, we use 5-fold cross validation (CV) on 80% of the data (recall that 20% of each data set is used to calibrate the models and to select the best parameters). In order to obtain reliable statistics for the metrics, the experiments were repeated 10 times, so the reported results are averaged over 50 iterations, which allows us to apply statistical tests to discern significant differences between the 20 methods. Detailed average performances of the 20 methods on all 19 data sets using the protocol described above are reported in Tables 1-6 of the supplementary material1. For each evaluation metric, we present and discuss the critical diagrams from the tests for statistical significance using all data sets. Table 2 shows the normalized score of each algorithm on each of the three metrics. Each entry in the table averages these scores across the fifty trials and nineteen test problems. The table is divided into two blocks to separately illustrate the performances of calibrated and uncalibrated models. The last column of each block, Mean, is the mean (for illustration purposes only, not for statistical analysis) over the three metrics (ACC, AUC, 1-RMS), nineteen problems, and fifty trials. In the table, higher scores always indicate better performance.
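Returning to the calibration step of Section 2.2, the pair-adjacent violators algorithm can be sketched as follows: given the labels sorted by classifier score, adjacent blocks are pooled until the fitted values are non-decreasing. This is a minimal illustration assuming unit integer weights, not the implementation used in the experiments.

```python
def pav(values, weights=None):
    """Isotonic (non-decreasing) fit by pooling adjacent violators."""
    weights = weights or [1.0] * len(values)
    out = []                                 # stack of [block mean, block weight]
    for v, w in zip(values, weights):
        out.append([v, w])
        # merge while the last two blocks violate monotonicity
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            v2, w2 = out.pop()
            v1, w1 = out.pop()
            out.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
    # expand blocks back to one fitted value per input
    fitted = []
    for v, w in out:
        fitted.extend([v] * round(w))        # assumes unit integer weights
    return fitted
```

Applied to 0/1 labels ordered by the ensemble's score, the fitted values are the calibrated probabilities.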
Considering all three metrics together, it appears that the strongest models among the uncalibrated ones are Rotation Forest (Rot), Rotation Forest using extremely randomized trees (RotET), RotBoost (Rotb) and its ET-based variant

1 http://perso.univ-lyon1.fr/haytham.elghazel/copem2013-supplementary.pdf

Table 2. Average normalized scores by metric for each learning algorithm obtained over the nineteen test problems. Complete results over all evaluation metrics are given in the supplementary material.

               Uncalibrated Models               Calibrated Models
Approach   ACC     AUC     1-RMS   Mean     ACC     AUC     1-RMS   Mean
Rot        0,865   0,903   0,700   0,823    0,837   0,864   0,673   0,791
Bag        0,823?  0,875?  0,660?  0,786    0,820?  0,844   0,649?  0,771
Ad         0,857   0,893   0,668?  0,806    0,836   0,863   0,669   0,789
RF         0,864   0,896   0,689   0,816    0,835   0,857   0,669   0,787
Rotb       0,865   0,897   0,702   0,821    0,841   0,861   0,676   0,793
ArcX4      0,852?  0,892?  0,686   0,810    0,829   0,853   0,659?  0,780
AdSt       0,833?  0,874?  0,598?  0,769    0,817?  0,845   0,653   0,771
CART       0,811?  0,809?  0,617?  0,746    0,808?  0,806?  0,622?  0,745
Logb       0,845   0,884   0,635?  0,788    0,823   0,854   0,660   0,779
Swt        0,859?  0,888   0,638?  0,795    0,829?  0,848?  0,660?  0,779
RadP       0,850   0,889   0,669?  0,803    0,836   0,851   0,662   0,783
Vad        0,858   0,894   0,684   0,812    0,839   0,864   0,671   0,791
RotET      0,871   0,901   0,698   0,823    0,843   0,858   0,675   0,792
BagET      0,836?  0,893   0,673?  0,800    0,833   0,852   0,663   0,783
AdET       0,862   0,898   0,667?  0,809    0,838   0,861   0,674   0,791
RotbET     0,866   0,900   0,704   0,824    0,844   0,859   0,678   0,794
ArcX4ET    0,868   0,901   0,693   0,821    0,842   0,859   0,673   0,791
SwtET      0,866   0,890   0,649?  0,802    0,841   0,850   0,673   0,788
RadPET     0,861   0,908   0,680   0,816    0,844   0,867   0,678   0,797
VadET      0,864   0,899   0,681   0,815    0,841   0,864   0,678   0,794

(RotbET), and ArcX4ET.
Among calibrated models, the best models overall are Rotation Forest (Rot) and its ET-based variant (RotET), RotBoost (Rotb) and its ET-based variant (RotbET), boosted extremely randomized trees (AdET), ArcX4ET, VadaBoost (Vad) and its ET-based variant (VadET), and Random Patches using extremely randomized trees (RadPET). With or without calibration, the poorest performing models are decision trees (CART), bagged trees (Bag), and AdaBoost Stump (AdSt). Looking at individual metrics, calibration generally degrades the results slightly on accuracy and AUC, and is remarkably effective at obtaining excellent performance on the RMS score (probability metric), especially for boosting-based algorithms. Indeed, calibration improves the performance (in terms of RMS) of boosted stumps (AdSt), LogitBoost (Logb), and Class-Switching with or without extremely randomized trees (Swt and SwtET), and provides a small but noticeable improvement for boosted trees with or without extremely randomized trees (Ad and AdET) and for a single tree (CART). If we consider only the high-dimensional data sets in Tables 1-6 of the supplementary material (i.e., Ovarian, Smk-Can, Leukemia), the reported results show that RMS values decrease with calibration when boosting-based approaches are used, while their AUC and ACC are not affected. Regarding now the performance of the ET-based variants, across all three metrics, with or without calibration, we observe that each ensemble method with ET always outperforms the corresponding ensemble of standard DT. This observation confirms the results obtained in [19] and clearly suggests that using random split thresholds, instead of optimized ones as in DT, pays off in terms of generalization error, especially for small data sets.
In order to better assess the results obtained by each algorithm on each metric, we adopt the methodology proposed in [10] for the comparison of several algorithms over multiple data sets. In this methodology, the non-parametric Friedman test is first used to evaluate the rejection of the hypothesis that all the classifiers perform equally well at a given risk level. It ranks the algorithms for each data set separately, the best performing algorithm getting rank 1, the second best rank 2, and so on; in case of ties, average ranks are assigned. The Friedman test then compares the average ranks of the algorithms and calculates the Friedman statistic. If a statistically significant difference in performance is detected, we proceed with a post-hoc test. The Nemenyi test is used to compare all the classifiers to each other; in this procedure, the performance of two classifiers is significantly different if their average ranks differ by more than some critical distance (CD). The critical distance depends on the number of algorithms, the number of data sets and the critical value (for a given significance level p) based on the Studentized range statistic (see [10] for further details). In this study, the Friedman test reveals statistically significant differences (p < 0.05) for each metric, with and without calibration. In Table 2, the algorithm performing best on each metric is boldfaced, and algorithms performing significantly worse than the best algorithm at p = 0.1 (CD = 6.3706) according to the Nemenyi post-hoc test are marked with '?'. Furthermore, we present the results of the Nemenyi post-hoc test with average rank diagrams, as suggested by Demšar [10]. These are given in Figure 1. The ranks are depicted on the axis in such a manner that the best ranking algorithms are at the rightmost side of the diagram. Algorithms that do not differ significantly (at p = 0.1) are connected with a line.
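The rank computation behind this methodology can be sketched as follows: rank the algorithms on each data set (rank 1 is best, ties receive average ranks), average the ranks across data sets, and compare rank gaps to the critical distance CD = q_alpha * sqrt(k(k+1)/(6N)). This is our illustration of the procedure in [10]; the q_alpha value must be looked up for the chosen significance level.

```python
from math import sqrt

def average_ranks(scores):
    """scores[d][a] = performance of algorithm a on data set d (higher = better)."""
    n_data, n_alg = len(scores), len(scores[0])
    totals = [0.0] * n_alg
    for row in scores:
        order = sorted(range(n_alg), key=lambda a: -row[a])
        rank = [0.0] * n_alg
        i = 0
        while i < n_alg:
            j = i
            while j + 1 < n_alg and row[order[j + 1]] == row[order[i]]:
                j += 1                      # extend the block of tied scores
            avg = (i + j) / 2 + 1           # average rank for the tied block
            for k in range(i, j + 1):
                rank[order[k]] = avg
            i = j + 1
        for a in range(n_alg):
            totals[a] += rank[a]
    return [t / n_data for t in totals]

def critical_distance(q_alpha, n_alg, n_data):
    """Nemenyi critical distance between average ranks."""
    return q_alpha * sqrt(n_alg * (n_alg + 1) / (6 * n_data))
```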
The critical difference CD is shown above the graph. As may be observed in Figure 1, the ET-based variant of RotBoost (RotbET) performs best in terms of accuracy. In the average rank diagrams corresponding to accuracy, two groups of algorithms can be separated. The first consists of all the algorithms whose performance is seemingly similar to that of the best method (i.e., RotbET). The second contains the methods that perform significantly worse than RotbET, including Bagging (Bag) and its ET-based variant (BagET), ArcX4, boosted stumps (AdSt) and the single tree (CART). The statistical tests we use are conservative, and the differences in performance between methods within the first group are not significant. To further support these rank comparisons, we compared the 50 accuracy values obtained over each data set split for each pair of methods in the first group using the paired t-test (with p = 0.05), as done in [19]. The results of these pairwise comparisons are depicted (see the supplementary material) in terms of "Win-Tie-Loss" statuses of all pairs of methods; the three values in each cell (i, j) respectively indicate how many times approach i is significantly better than, not significantly different from, or significantly worse than approach j.

Fig. 1. Average rank diagrams comparing the 20 algorithms in terms of the three metrics (accuracy, AUC and RMS). The six panels show, for each metric, the diagrams of the uncalibrated and the calibrated models (CD = 6.3706 in all cases).

Following [10], if two algorithms are, as assumed under the null hypothesis, equivalent, each should win on approximately N/2 out of N data sets.
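The per-cell status in these "Win-Tie-Loss" tables can be sketched as a paired t-test over the 50 per-split accuracy values of two methods. This is our illustration; the two-sided critical value 2.0096 (p = 0.05, df = 49) is hard-coded as an assumption rather than looked up from a table at run time.

```python
from math import sqrt

def paired_t(a, b):
    """Paired t statistic for two matched samples of equal length."""
    n = len(a)
    d = [x - y for x, y in zip(a, b)]
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    if var == 0:
        return 0.0
    return mean / sqrt(var / n)

def win_tie_loss(a, b, t_crit=2.0096):
    """Status of method a against method b: win / tie / loss."""
    t = paired_t(a, b)
    if t > t_crit:
        return "win"     # a significantly better than b
    if t < -t_crit:
        return "loss"
    return "tie"
```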
The number of wins is distributed according to the binomial distribution, and the critical number of wins at p = 0.1 is 13 in our case. Since tied matches support the null hypothesis, we should not discount them but split them evenly between the two classifiers when counting wins; if their number is odd, we ignore one. In Table 7 (see the supplementary material), each pairwise comparison entry (i, j) for which approach i is significantly better than j is boldfaced. The analysis of this table reveals that the approaches that are never beaten by any other approach are all the Rotation Forest-based methods (Rot, Rotb, RotET and RotbET), AdET and ArcX4ET. We may also notice the following from Figure 1 and Table 8 (see the supplementary material) for accuracy on calibrated models. First, calibration is beneficial to the Random Patches algorithms (RadP and RadPET) and bagged extremely randomized trees (BagET) in terms of ranking. It hurts the ranking of boosted trees but does not affect the performance of the Rotation Forest-based methods and ArcX4ET. Overall, RotbET is ranked first, followed by Rotb, ArcX4ET and RadPET. Looking at Table 8 (see the supplementary material), the dominating approaches again include all the Rotation Forest-based methods and ArcX4ET, as well as RadPET and VadET (cf. Table 3). Another interesting observation from the average rank diagrams is that ensembles of ET mostly lie on the right side of the plot compared to their DT counterparts, reflecting their superior performance. As far as AUC is concerned (cf. Figure 1), RadPET ranks first; however, its performance is not statistically distinguishable from that of five other algorithms: RotET, RotbET, Ad, AdET and VadET (cf. Table 9 in the supplementary material). In our experiments, ET improved the ranking of all ensemble approaches by at least 10% on average when compared to DT.
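The sign-test threshold quoted above can be reproduced with an exact binomial tail: the critical number of wins is the smallest w over N data sets whose one-sided tail probability P(X >= w | p = 0.5) does not exceed the significance level. The sketch below is ours, not the paper's code.

```python
from math import comb

def critical_wins(n, alpha=0.1):
    """Smallest number of wins out of n whose one-sided binomial tail <= alpha."""
    for w in range(n + 1):
        tail = sum(comb(n, i) for i in range(w, n + 1)) / 2 ** n
        if tail <= alpha:
            return w
    return n + 1
```

With N = 19 data sets, the one-sided tail at 13 wins is 43796/524288 ≈ 0.084, which is the first value below 0.1, consistent with the critical count of 13 stated above.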
This corroborates our previous finding, namely that ET should be preferred to DT within the ensembles. Figure 1 and Table 10 (see the supplementary material) indicate that calibration reduces the ranking of some approaches, especially VadET and RotET (among the best uncalibrated approaches in terms of AUC), but slightly improves the ranks of the approaches that adaptively change the distribution (Logb, AdSt, Ad, Vad, Rotb, ArcX4) and of Rot. This explains why equally performing methods such as RadPET are, after calibration, no longer judged significantly different (cf. Table 3). Regarding the RMS results reported in Figure 1 and Table 11 (see the supplementary material), Rot, Rotb and RotbET significantly outperform the other approaches. Here again, the ET-based methods outperform the DT-based ones by a noticeable margin. We found calibration to be remarkably effective at improving the ranking of boosting-based algorithms in terms of RMS, especially Ad, AdET, AdSt, Logb and VadET. This is why the algorithms that adaptively change the distribution have entered the list of dominating approaches (cf. Table 3).

Table 3. List of dominating approaches per metric, with and without calibration

Metric  Without calibration                      With calibration
ACC     AdET, ArcX4ET, Rot, Rotb, RotET,         ArcX4ET, Rot, Rotb, RotbET, RotET,
        RotbET                                   RadPET, VadET
AUC     Ad, AdET, RotET, RotbET, RadPET,         Ad, AdET, ArcX4ET, Logb, Rot, Rotb,
        VadET                                    RotbET, RadPET, Vad, VadET
RMS     Rot, Rotb, RotbET                        Ad, AdET, Logb, Rot, Rotb, RotET,
                                                 RotbET, RadPET, Vad, VadET

3.1 Diversity-error diagrams To achieve higher prediction accuracy than individual classifiers, it is crucial that the ensemble consist of highly accurate classifiers which at the same time disagree as much as possible.
To illustrate the diversity-accuracy patterns of the ensembles, we use the kappa-error diagrams proposed in [20]. These are scatterplots with L × (L − 1)/2 points, where L is the committee size; each point corresponds to a pair of classifiers. On the x-axis is a measure of diversity between the pair, κ; on the y-axis is the averaged individual error of the classifiers in the pair, ei,j = (ei + ej)/2. Since small values of κ indicate better diversity and small values of ei,j indicate better performance, the diagram of an ideal ensemble would be filled with points in the bottom left corner. Since we have a large number of algorithms to compare, and due to space limitations, we only plot the corresponding centroids in Figure 2 for the 18 ensemble methods (Logb and CART are excluded), and only for the "Musk" and "Relathe" data sets. The following is observed: (1) Rot-based algorithms outperform the others in terms of accuracy; (2) ArcX4, Bag and RF exhibit equivalent patterns: they are slightly more diverse but slightly less accurate than Rot-based algorithms; (3) while boosting-based methods (AdSt, Ad, AdET) and switching are more diverse, their accuracies are lower than the others, except for SwtET, as ET is generally able to increase the individual accuracy; and (4) no clear picture emerges for the Random Patches-based algorithms. Not surprisingly, as the classifiers become more diverse, they become less accurate, and vice versa. Furthermore, according to the results in the previous subsection, it seems that the more accurate the base classifiers are, the better the performance. This corroborates the conclusion drawn in [23], namely that individual accuracy is probably the more crucial component of the diversity-accuracy tandem, rather than the diversifying strategy. Fig. 2.
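The pairwise κ used on the x-axis of these diagrams can be sketched as chance-corrected agreement between the predictions of two classifiers; the sketch below shows one common form of this statistic and is our illustration, not the paper's code.

```python
def kappa(pred_i, pred_j, labels=(0, 1)):
    """Chance-corrected agreement between two classifiers' prediction lists.

    kappa = 1 for identical predictions, ~0 for chance-level agreement.
    """
    n = len(pred_i)
    observed = sum(a == b for a, b in zip(pred_i, pred_j)) / n
    # chance agreement from each classifier's marginal label frequencies
    expected = sum(
        (pred_i.count(c) / n) * (pred_j.count(c) / n) for c in labels
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Plotting (kappa, ei,j) for every pair of ensemble members, and then the centroid per method, reproduces the construction of Figure 2.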
Centroids of κ-error diagrams of the different ensemble approaches for two data sets ("Musk" and "Relathe"); x-axis: κ, y-axis: ei,j (average error of a pair of classifiers). Legend: (01) Rot; (02) Bag; (03) Ad; (04) RF; (05) Rotb; (06) ArcX4; (07) AdSt; (08) Swt; (09) RadP; (10) Vad; (11) RotET; (12) BagET; (13) AdET; (14) RotbET; (15) ArcX4ET; (16) SwtET; (17) RadPET; (18) VadET.

The kappa-error relative movement diagrams in Figure 3 display the differences between the κ and accuracy of each DT-based method and its ET-based counterpart; there are as many points as data sets. Points in the upper-right corner represent data sets for which the ET-based method outperformed the standard DT-based algorithm in terms of both diversity and accuracy, while points in the upper-left corner indicate that the ET-based method improved the accuracy but degraded the diversity. We notice that ET as a base learner improves one criterion at the expense of the other. Furthermore, according to the resulting win/tie/loss counts for each ET-based approach against its DT-based counterpart, summarized in Table 4, we find that the approaches for which the ET variant is significantly superior to the standard one are those for which the accuracy (i.e., Swt) or the diversity (i.e., Bag, ArcX4 and RadP) is significantly better. Before we conclude, we would like to mention that some of the above findings need to be regarded with caution. We list a few caveats and our comments on them. – The experimental analysis was restricted to the 100 most relevant features with a view to dramatically reducing the computational burden required to run Rotation Forest-based methods. Thus, the results reported here are valid for data sets of small to moderate size; the data sets used in the experiments did not include very large-scale data sets. Moreover, the complexity
issue should be addressed, balancing the computation cost against the obtained performance in a real scenario.

Fig. 3. Centroids of κ-error relative movement diagrams (∆κ vs. ∆error) of the ET-based approaches against their DT-based counterparts. Legend: 1: 1 vs. 11; 2: 2 vs. 12; 3: 3 vs. 13; 4: 5 vs. 14; 5: 6 vs. 15; 6: 8 vs. 16; 7: 9 vs. 17; 8: 10 vs. 18.

Table 4. The win/tie/loss results for ET-based ensembles vs. DT-based ensembles. Bold cells indicate significant differences at p = 0.1

                 Uncalibrated Models        Calibrated Models
Approaches       ACC      AUC      RMS      ACC      AUC      RMS      In Total
RotET/Rot        8/8/3    11/2/6   7/6/6    6/11/2   7/8/4    8/7/4    47/42/25
BagET/Bag        11/6/2   13/4/2   13/3/3   13/5/1   12/5/2   12/6/1   74/29/11
AdET/Ad          7/10/2   7/10/2   11/4/4   6/11/2   4/8/7    6/12/1   41/55/18
RotbET/Rotb      3/12/4   6/10/3   5/11/3   3/13/3   3/11/5   4/10/5   24/67/23
ArcX4ET/ArcX4    14/5/0   13/2/4   13/1/5   10/9/0   9/7/3    14/4/1   73/28/13
SwtET/Swt        10/8/1   9/5/5    13/2/4   14/3/2   10/6/3   13/4/2   69/28/17
RadPET/RadP      9/10/0   10/7/2   14/1/4   10/7/2   12/4/3   13/4/2   68/33/13
VadET/Vad        10/7/2   9/9/1    9/5/5    6/9/4    3/11/5   7/9/3    44/50/20

– We used the same ensemble size L = 200 for all methods. It is known that bagging fares better for large L; on the other hand, AdaBoost would benefit from tuning L. It is not clear what the outcome would be if L were treated as a hyper-parameter and tuned for every ensemble method compared here. We acknowledge that a thorough experimental comparison of a set of methods requires tuning each method to its best for every data set and every performance metric. Interestingly, although VadaBoost, Class-Switching and Random Patches were slightly favored, since some of their parameters were tuned on an independent validation set, these methods were not found to compare favorably with Rotation Forest and its variants.
– The comparison was performed solely on binary classification problems; multi-class and multi-label classification problems were not investigated. These can, however, be reduced to binary problems by a variety of strategies. 4 Discussion & Conclusion We described an extensive empirical comparison of twenty prototypical supervised ensemble learning algorithms over nineteen UCI benchmark data sets with binary labels, and examined the influence of two variants of decision tree inducers (unpruned decision trees and extremely randomized trees), with and without calibration. The experiments presented here support the conclusion that the Rotation Forest family of algorithms (Rotb, RotbET, Rot and RotET) outperforms all other ensemble methods, with or without calibration, by a noticeable margin, which is much in line with the results obtained in [29]. It appears that the success of this approach is closely tied to its ability to simultaneously encourage diversity and individual accuracy by rotating the feature space and keeping all principal components. Not surprisingly, the worst performing models are single decision trees, bagged trees, and AdaBoost Stump. Another conclusion we can draw from these observations is that building ensembles of extremely randomized trees is very competitive in terms of accuracy, even for small data sets. This confirms the effectiveness of using random split thresholds instead of optimized ones, as in decision trees. We found calibration to be remarkably effective at lowering the RMS values of boosting-based methods. References 1. Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, pages 105–139, 1998. 2.
Amir Ben-Dor, Laurakay Bruhn, Nir Friedman, Michèl Schummer, Iftach Nachman, and Zohar Yakhini. Tissue classification with gene expression profiles. Journal of Computational Biology, 7:559–584, 2000. 3. C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. 4. Leo Breiman. Bagging predictors. Machine Learning, pages 123–140, 1996. 5. Leo Breiman. Bias, variance, and arcing classifiers. Technical report, 1996. 6. Leo Breiman. Randomizing outputs to increase prediction accuracy. Machine Learning, 40(3):229–242, 2000. 7. Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. 8. Rich Caruana and Alexandru Niculescu-Mizil. Data mining in metric space: an empirical analysis of supervised learning performance criteria. In KDD, pages 69–78, 2004. 9. Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In ICML, pages 161–168, 2006. 10. Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006. 11. Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997. 12. Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000. 13. Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006. 14. T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, and H. Coller. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999. 15. Daniel Hernández-Lobato, Gonzalo Martínez-Muñoz, and Alberto Suárez.
How large should ensembles of classifiers be? Pattern Recognition, 46(5):1323–1336, 2013. 16. Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell., 20(8):832–844, 1998. 17. Kun-Hong Liu and De-Shuang Huang. Cancer classification using rotation forest. Computers in Biology and Medicine, 38(5):601–610, 2008. 18. Eun Bae Kong and Thomas G. Dietterich. Error-correcting output coding corrects bias and variance. In ICML, pages 313–321, 1995. 19. Gilles Louppe and Pierre Geurts. Ensembles on random patches. In ECML/PKDD (1), pages 346–361, 2012. 20. Dragos D. Margineantu and Thomas G. Dietterich. Pruning adaptive boosting. In ICML, pages 211–218, 1997. 21. Gonzalo Martínez-Muñoz and Alberto Suárez. Switching class labels to generate classification ensembles. Pattern Recognition, 38(10):1483–1494, 2005. 22. Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, pages 625–632, 2005. 23. Juan José Rodríguez, Ludmila I. Kuncheva, and Carlos J. Alonso. Rotation forest: A new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell., 28(10):1619–1630, 2006. 24. M. Schummer, W. V. Ng, and R. E. Bumgarner. Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene, 238(2):375–385, 1999. 25. Friedhelm Schwenker. Ensemble methods: Foundations and algorithms [book review]. IEEE Comp. Int. Mag., 8(1):77–79, 2013. 26. Pannagadatta K. Shivaswamy and Tony Jebara. Variance penalizing AdaBoost. In NIPS, pages 1908–1916, 2011. 27. Donna K. Slonim, Pablo Tamayo, Jill P. Mesirov, Todd R. Golub, and Eric S. Lander. Class prediction and discovery using gene expression data. In RECOMB, pages 263–272, 2000. 28. Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers.
In ICML, pages 609–616, 2001. 29. Chun-Xia Zhang and Jiang-She Zhang. RotBoost: A technique for combining rotation forest and AdaBoost. Pattern Recognition Letters, 29(10):1524–1536, 2008. 30. Zheng Zhao, Fred Morstatter, Shashvata Sharma, Salem Alelyani, and Aneeth Anand. Feature selection, 2011.

Clustering Ensemble on Reduced Search Spaces Sandro Vega-Pons and Paolo Avesani NeuroInformatics Laboratory (NILab), Fondazione Bruno Kessler, Trento, Italy Centro Interdipartimentale Mente e Cervello (CIMeC), Università di Trento, Italy {vega,avesani}@fbk.eu Abstract. Clustering ensemble has become a very popular technique in the past few years due to its potential for improving clustering results. Roughly speaking, it consists of combining different partitions of the same set of objects in order to obtain a consensus one. A common way of defining the consensus partition is as the solution of the median partition problem; this way, the consensus partition is defined as the solution of a complex optimization problem. In this paper, we study possible prunes of the search space for this optimization problem. In particular, we introduce a new prune that allows a dramatic reduction of the search space. We also give a characterization of the dissimilarity measures that can take advantage of this prune, and we prove that the lattice metric belongs to this family. We carry out an experimental study comparing, under different circumstances, the size of the original search space with its size after the proposed prune. Outstanding reductions are obtained, which can be very beneficial for the development of clustering ensemble algorithms. Keywords: Clustering ensemble, partition lattice, median partition, search space reduction, dissimilarity measure.
1 Introduction

Clustering ensemble has become a popular technique for data clustering problems. When different clustering algorithms are applied to the same dataset, different clustering results can be obtained. Instead of trying to pick the best one, the idea of combining the individual results to obtain a consensus has attracted increasing interest in recent years; in practice, such a procedure can produce high-quality final clusterings. Over the past ten years, motivated by the success of combining supervised classifiers, several clustering ensemble algorithms have been proposed in the literature [1]. Different mathematical and computational tools have been used in their development; for example, there are methods based on the co-association matrix [2], voting procedures [3], genetic algorithms [4], graph theory [5], kernel methods [6], information theory [7], and fuzzy techniques [8], among others.

However, the consensus clustering, which is the final result of every clustering ensemble algorithm, is not always defined in the same way. In many methods the consensus partition lacks a formal definition: it is only implicitly defined through the objective function of the particular algorithm, which makes the theoretical study of its properties difficult. This is the case, for example, of relabeling-and-voting [3], graph-based methods [5] and co-association matrix based methods [2]. On the other hand, some methods use an explicit definition of the consensus partition: it is defined as the solution of an optimization problem, namely the problem of finding the median partition with respect to the clustering ensemble.
Before defining this problem, we introduce the notation used throughout the paper. Let X = {x1, x2, …, xn} be a set of n objects and P = {P1, P2, …, Pm} a set of m partitions of X. A partition P = {C1, C2, …, Ck} of X is a set of k subsets of X (clusters) satisfying: (i) Ci ≠ ∅ for all i = 1, …, k; (ii) Ci ∩ Cj = ∅ for all i ≠ j; (iii) C1 ∪ C2 ∪ … ∪ Ck = X. Furthermore, PX denotes the set of all possible partitions of X, so that P ⊆ PX, and the consensus partition is denoted by P* ∈ PX. Formally, given an ensemble P of m partitions, the median partition is defined as:

P* = arg min_{P ∈ PX} Σ_{i=1}^{m} d(P, Pi)    (1)

where d is a dissimilarity measure between partitions. (The problem can equivalently be defined by maximizing the sum of similarities to all partitions, in the case that d is a similarity measure.)

Although the median partition formulation is well accepted in the clustering ensemble community, almost no studies of its theoretical properties have been carried out in this area. Theoretical studies of the median partition problem were, however, carried out by discrete mathematicians long before the problem gained interest in the machine learning community; nevertheless, it has mainly been studied for the symmetric difference distance (or Mirkin distance) [9]. One of the most important results is the proof that finding the median partition under this distance is NP-hard [10]. A proper analysis for other (dis)similarity measures has not been done. Although the complexity of the problem depends on the chosen (dis)similarity measure, it appears to be hard for any meaningful measure [1]. An exhaustive search for the optimal solution is only computationally feasible for very small problems. Therefore, several heuristic procedures have been applied, for example simulated annealing [6, 11] and genetic algorithms [12].
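Equation (1) can be made concrete by brute force: enumerate every partition in PX and keep one minimizing the sum of dissimilarities. The sketch below (the helper names are ours) uses the Mirkin distance mentioned above as d; it is only viable for toy sizes, which is exactly the point.

```python
from itertools import combinations

def all_partitions(items):
    """Enumerate every partition of `items`; their count is the Bell number."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in all_partitions(rest):
        for i in range(len(smaller)):          # put `first` into an existing cluster
            yield smaller[:i] + [smaller[i] + [first]] + smaller[i + 1:]
        yield [[first]] + smaller              # or into a new singleton cluster

def mirkin(p, q, universe):
    """Symmetric-difference (Mirkin) distance: number of object pairs on
    which the two partitions disagree about co-membership."""
    def together(part, x, y):
        return any(x in c and y in c for c in part)
    return sum(together(p, x, y) != together(q, x, y)
               for x, y in combinations(universe, 2))

def median_partition(ensemble, universe):
    """Equation (1) by brute force: argmin of the sum of distances over all of P_X."""
    return min(all_partitions(list(universe)),
               key=lambda p: sum(mirkin(p, pi, universe) for pi in ensemble))

ensemble = [[[1, 2], [3, 4]], [[1, 2], [3], [4]], [[1], [2], [3, 4]]]
consensus = median_partition(ensemble, [1, 2, 3, 4])
print(consensus)  # the partition {{1,2},{3,4}}, in some cluster order
```

With n = 4 there are only B4 = 15 candidates, but already around 5 × 10^13 at n = 20, so exhaustive enumeration collapses almost immediately and the heuristics above become necessary.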
Despite the good results obtained with these heuristics, they are still designed to search for the optimum in the whole search space. An interesting alternative is to study the properties of the problem in order to find a prune of the search space that reduces the complexity. In some clustering ensemble algorithms, an intuitive simplification of the problem, called fragment clusters [13, 14], has been used. The idea is that if a subset of objects is placed in the same cluster in all partitions to be combined, it is expected to appear together in the consensus partition. Therefore, a representative object (fragment object) can be computed for each such subset, and the problem reduces to working with the fragment objects. Once the consensus result is obtained, each fragment object is replaced by the set of objects it represents, yielding the final consensus partition. This idea has also been used in the context of ensembles of image segmentations under the name of super-pixels [15, 16].

The reduction just described requires objects to be placed in the same cluster in all partitions. As the number of partitions increases, as the partitions become more independent, or as noisy partitions are included in the ensemble, the probability of finding subsets of objects with the same cluster label in all partitions decreases. Therefore, this prune of the search space can be useless in practice, and stronger prunes are needed for real applications. In this paper, we introduce a new prune that leads to a dramatic reduction of the size of the search space.
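The fragment-clusters idea above can be sketched directly: objects with identical cluster labels in every partition collapse into one fragment. The label-vector encoding (one cluster id per partition, stored as a dict) and the function name are our own choices.

```python
from collections import defaultdict

def fragments(labelings):
    """Fragment clusters: objects whose cluster membership agrees in *all*
    partitions of the ensemble collapse into a single fragment.
    Each partition is a dict mapping object -> cluster id, so two objects
    share a fragment iff their label vectors across partitions coincide."""
    groups = defaultdict(list)
    for obj in labelings[0]:
        key = tuple(lab[obj] for lab in labelings)
        groups[key].append(obj)
    return sorted(groups.values())

# Five partitions of {1,...,6}; only objects 5 and 6 always share a cluster.
ensemble = [
    {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2},
    {1: 0, 2: 0, 3: 1, 4: 0, 5: 1, 6: 1},
    {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1},
    {1: 0, 2: 1, 3: 0, 4: 0, 5: 1, 6: 1},
    {1: 0, 2: 1, 3: 0, 4: 1, 5: 2, 6: 2},
]
print(fragments(ensemble))  # → [[1], [2], [3], [4], [5, 6]]
```

Here the search space shrinks from partitions of 6 objects to partitions of 5 fragments; with many or noisy partitions the groups quickly degenerate to singletons, which is exactly the weakness discussed above.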
The paper is structured as follows. In Section 2 we present the basic concepts of lattice theory needed to introduce our results. In Section 3 a relation between the dissimilarity measure used to define the median partition problem and possible prunes of the search space is established. First, a formalization of the fragment-objects-based prune is given by stating the properties to be fulfilled by the dissimilarity measure. Afterwards, we introduce a new prune of the search space and provide a family of dissimilarity measures for which this prune is possible; moreover, we present a measure that belongs to this family and can be used in practice to take advantage of the reduction. In Section 4, both prunes are experimentally evaluated on synthetic data, comparing the size of the reduced search spaces with the size of the whole search space under different conditions. Finally, Section 5 concludes this study.

2 Partition Lattice

The cardinality of the set of all partitions PX is given by the |X|-th Bell number, which can be computed by the recursion [17]:

B_{n+1} = Σ_{k=0}^{n} C(n, k) · B_k

where C(n, k) is the binomial coefficient. The Bell number grows exponentially with the number of objects, e.g. B3 = 5, B10 = 115975 and B100 ≈ 4.75 × 10^115 (much larger than the estimated number of atoms in the observable universe, around 10^80; see http://www.universetoday.com/36302/atoms-in-the-universe/). Therefore, even for a relatively small number of objects, the set PX of all their partitions is huge.

Over PX, a partial order relation ⪯ (called refinement) can be defined; it is a binary relation that is reflexive (P ⪯ P), anti-symmetric (P ⪯ P′ and P′ ⪯ P implies P = P′) and transitive (P ⪯ P′ and P′ ⪯ P″ implies P ⪯ P″) for all P, P′, P″ ∈ PX. For all P, P′ ∈ PX, we say P ⪯ P′ if and only if, for every cluster C′ ∈ P′ there are
clusters C_{i1}, C_{i2}, …, C_{iv} ∈ P such that C′ = C_{i1} ∪ C_{i2} ∪ … ∪ C_{iv}. In this case, it is said that P is finer than P′ or, equivalently, that P′ is coarser than P. The set of all partitions PX of a finite set X, endowed with the refinement order ⪯, is a lattice (see the example in Fig. 1). Therefore, for each pair of partitions P, P′ two binary operations are defined: the meet (P ∧ P′), which is the coarsest of all partitions finer than both P and P′, and the join (P ∨ P′), which is the finest of all partitions coarser than both P and P′. For example, in Fig. 1, if P1 = {{a, b, c, d}}, P2 = {{a, b}, {c, d}}, P3 = {{a, b, c}, {d}}, P4 = {{a, b}, {c}, {d}}, then P4 ⪯ P2 ⪯ P1, P2 ∧ P3 = P4 and P2 ∨ P3 = P1.

Fig. 1. Hasse diagram (graphical representation) of the lattice associated to the set of partitions PX of the set X = {a, b, c, d}.

Among several properties, the partition lattice (PX, ⪯) is an atomic lattice. The partitions Pxy, composed of one cluster containing only the objects x and y, with every remaining object in a singleton cluster, are called atoms. For example, in Figure 1, Pab = {{a, b}, {c}, {d}} and Pbc = {{a}, {b, c}, {d}} are two atoms. The partition lattice is atomic because every partition P is the join of the atoms Pxy over all pairs of objects x, y that lie in the same cluster of P. An important concept needed to state the results of this paper is that of q-quota rules [18].
Given a real number q ∈ [0, 1], the q-quota rule c_q is defined in the following way:

c_q(P) = ⋁ {Pxy : γ(xy, P) > q}    (2)

where γ(xy, P) = N(xy, P)/m and N(xy, P) is the number of partitions in P in which the objects x, y ∈ X appear in the same cluster. Two interesting cases of q-quota rules are the following:

– unanimity rule: u(P) = ⋁ {Pxy : γ(xy, P) = 1}
– majority rule: m(P) = ⋁ {Pxy : γ(xy, P) > 0.5}

Notice that the result of any q-quota rule is a partition in PX. For example, u(P) is the partition obtained as the join of all atoms Pxy such that the objects x and y are placed in the same cluster in all partitions in P. In the same way, m(P) is the join of all atoms Pxy such that x and y are in the same cluster in more than half of the partitions in P. Next we present a toy example.

Example 1. Let X = {1, 2, 3, 4, 5, 6} be a set of objects and P = {P1, P2, P3, P4, P5} a set of partitions of X, with P1 = {{1, 2}, {3, 4}, {5, 6}}, P2 = {{1, 2, 4}, {3, 5, 6}}, P3 = {{1, 2, 3}, {4, 5, 6}}, P4 = {{1, 3, 4}, {2, 5, 6}}, P5 = {{1, 3}, {2, 4}, {5, 6}}. In this case u(P) = {{1}, {2}, {3}, {4}, {5, 6}}, since 5 and 6 are the only elements grouped in the same cluster in all partitions. On the other hand, m(P) = {{1, 2, 3}, {4}, {5, 6}} = P12 ∨ P13 ∨ P56. Notice that objects 2 and 3 are in the same cluster in m(P) even though P23 is not a majority atom, i.e. γ(23, P) = 1/5 < 0.5. This is a chaining effect of the fact that P12 and P13 are majority atoms.

3 Methods

The two rules defined in the previous section (unanimity and majority) allow the definition of two different subsets of the partition space PX. Let UX ⊆ PX be the set of all partitions coarser than u(P), i.e. UX = {P ∈ PX : u(P) ⪯ P}.
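A q-quota rule can be computed without building the lattice explicitly: select the atoms Pxy with γ(xy, P) > q and take their join as connected components over the selected pairs (a union-find pass). Below is a sketch with our own helper names, replayed on Example 1; note the chaining effect that puts 2 and 3 together under the majority rule.

```python
from itertools import combinations

def quota_rule(labelings, objects, q):
    """c_q(P): join of all atoms P_xy with gamma(xy, P) > q, computed as
    connected components (union-find) over the selected pairs of objects."""
    parent = {x: x for x in objects}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    m = len(labelings)
    for x, y in combinations(objects, 2):
        gamma = sum(lab[x] == lab[y] for lab in labelings) / m
        if gamma > q:
            parent[find(x)] = find(y)      # join the atom P_xy

    clusters = {}
    for x in objects:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(c) for c in clusters.values())

# Example 1, with each partition encoded as a dict object -> cluster id.
P = [
    {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2},   # {{1,2},{3,4},{5,6}}
    {1: 0, 2: 0, 3: 1, 4: 0, 5: 1, 6: 1},   # {{1,2,4},{3,5,6}}
    {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1},   # {{1,2,3},{4,5,6}}
    {1: 0, 2: 1, 3: 0, 4: 0, 5: 1, 6: 1},   # {{1,3,4},{2,5,6}}
    {1: 0, 2: 1, 3: 0, 4: 1, 5: 2, 6: 2},   # {{1,3},{2,4},{5,6}}
]
objs = [1, 2, 3, 4, 5, 6]
u = quota_rule(P, objs, q=1 - 1e-9)   # unanimity: gamma must equal 1
mj = quota_rule(P, objs, q=0.5)       # majority
print(u)    # → [[1], [2], [3], [4], [5, 6]]
print(mj)   # → [[1, 2, 3], [4], [5, 6]]
```

Objects 2 and 3 end up together although γ(23, P) = 1/5: the selected atoms P12 and P13 chain them through object 1, exactly as in Example 1.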
Analogously, MX ⊆ PX is defined as the set of all partitions coarser than m(P), i.e. MX = {P ∈ PX : m(P) ⪯ P}. It is not difficult to verify that MX ⊆ UX ⊆ PX, because any atom Pxy satisfying Pxy ⪯ u(P) also satisfies Pxy ⪯ m(P). In this section we describe conditions under which the median partition can be searched for in the reduced spaces UX and MX.

The median partition problem can have more than one solution; therefore equation (1) should be written more precisely as:

M_P = arg min_{P ∈ PX} Σ_{i=1}^{m} d(P, Pi)    (3)

where M_P is the set of all median partitions. If we only consider the reduced search space UX, the median partition problem is defined as:

M_U = arg min_{P ∈ UX} Σ_{i=1}^{m} d(P, Pi)    (4)

In the same way, when only MX is considered as search space, the median partition problem is given by:

M_M = arg min_{P ∈ MX} Σ_{i=1}^{m} d(P, Pi)    (5)

Another concept we will use is the sum-of-dissimilarities (SoD):

Definition 1. Given a set of partitions P ⊂ PX and a dissimilarity d : PX × PX → R, the sum-of-dissimilarities of a partition P to P is defined as SoD(P) = Σ_{i=1}^{m} d(P, Pi).

Notice that a median partition P* is an element of PX with minimum SoD value, i.e. P* = arg min_{P ∈ PX} SoD(P). In Section 3.1 we present a family of dissimilarity functions for which M_P = M_U, so that the reduced search space UX can be used instead of PX. In Section 3.2 a family of dissimilarity functions for which M_P = M_M is also presented; we prove that the lattice metric belongs to this family.

3.1 Prune of the search space based on the Unanimity Rule

Definition 2.
A dissimilarity measure between partitions du : PX × PX → R is said to be u-atomic if, for every pair of partitions P, P′ ∈ PX and every atom Pxy such that Pxy ⋠ P and Pxy ⪯ P′, it holds that du(P ∨ Pxy, P′) < du(P, P′).

Proposition 1. Let P ⊂ PX be a set of partitions and du be a u-atomic dissimilarity function. For every partition P ∈ PX and every atom Pxy such that Pxy ⋠ P and γ(xy, P) = 1, we have SoD(P ∨ Pxy) < SoD(P).

Proof. SoD(P) = du(P, P1) + … + du(P, Pm) for Pi ∈ P, i = 1, …, m, and likewise SoD(P ∨ Pxy) = du(P ∨ Pxy, P1) + … + du(P ∨ Pxy, Pm). As γ(xy, P) = 1, every partition Pi ∈ P satisfies Pxy ⪯ Pi. Hence, by the definition of u-atomic dissimilarity, du(P ∨ Pxy, Pi) < du(P, Pi) for each term, and therefore SoD(P ∨ Pxy) < SoD(P). ∎

Proposition 2. Let P ⊂ PX be a set of partitions, u(P) be the unanimity rule and du be a u-atomic function. Every median partition P* ∈ M_P satisfies u(P) ⪯ P*.

Proof. Assume that P* ∈ M_P is a median partition and u(P) ⋠ P*. Then there is at least one atom Pxy ⪯ u(P) such that Pxy ⋠ P*. By Proposition 1, P* would not be a median element, because the partition P* ∨ Pxy would have a smaller SoD value. Therefore the assumption is false and u(P) ⪯ P*. ∎

Corollary 1. Let P ⊂ PX be a set of partitions and du be a u-atomic function. Then problems (3) and (4) have the same set of solutions (M_P = M_U).

Proof. This is a direct consequence of Proposition 2 and equations (3) and (4): since all solutions of (3) are coarser than u(P), they belong to UX, and therefore problems (3) and (4) are equivalent. ∎

In practice, there is a simple way to reduce the search space from PX to UX.
First, u(P) is computed and a representative element yi is chosen for each of its clusters. In this way a set of objects Y with |Y| ≤ |X| is obtained, and the corresponding set of partitions PY is equivalent to UX. This is exactly the idea of fragment clusters. As mentioned above, this idea had been used intuitively before, and it has been proven valid for some common dissimilarity measures [13], such as mutual information and the Mirkin distance. In this section we have introduced the notion of u-atomic function and proven that this prune of the search space can be used with any u-atomic dissimilarity measure.

3.2 Prune of the search space based on the Majority Rule

Definition 3. A dissimilarity measure between partitions dm : PX × PX → R is said to be m-atomic if, for every pair of partitions P, P′ ∈ PX and every atom Pxy such that Pxy ⋠ P, there is a constant real value c > 0 such that the following properties hold:

– (i) if Pxy ⪯ P′, then dm(P ∨ Pxy, P′) ≤ dm(P, P′) − c
– (ii) if Pxy ⋠ P′, then dm(P ∨ Pxy, P′) ≤ dm(P, P′) + c

Notice that, by this definition, any m-atomic function is also u-atomic.

Proposition 3. Let P ⊂ PX be a set of partitions and dm be an m-atomic function. For every partition P ∈ PX and every atom Pxy such that Pxy ⋠ P and γ(xy, P) > 0.5, we have SoD(P ∨ Pxy) < SoD(P).

Proof. SoD(P) = dm(P, P1) + … + dm(P, Pm) for Pi ∈ P, i = 1, …, m, and likewise SoD(P ∨ Pxy) = dm(P ∨ Pxy, P1) + … + dm(P ∨ Pxy, Pm). As γ(xy, P) > 0.5, there are t > m/2 partitions Pi ∈ P such that Pxy ⪯ Pi, for which, according to Definition 3, dm(P ∨ Pxy, Pi) ≤ dm(P, Pi) − c. On the other hand, there are l < m/2 partitions Pj ∈ P such that Pxy ⋠ Pj, for which dm(P ∨ Pxy, Pj) ≤ dm(P, Pj) + c. Therefore SoD(P ∨ Pxy) ≤ SoD(P) − t·c + l·c, and since t > l and c > 0 we obtain SoD(P ∨ Pxy) < SoD(P). ∎

Proposition 4.
Let P ⊂ PX be a set of partitions, m(P) be the majority rule and dm be an m-atomic function. Every median partition P* ∈ M_P satisfies m(P) ⪯ P*.

Proof. The proof is analogous to that of Proposition 2. Assume that P* ∈ M_P is a median partition and m(P) ⋠ P*. Then there is at least one atom Pxy ⪯ m(P) such that Pxy ⋠ P*. By Proposition 3, P* would not be a median element, because the partition P* ∨ Pxy would have a smaller SoD value. Therefore the assumption is false and m(P) ⪯ P*. ∎

Corollary 2. Let P ⊂ PX be a set of partitions and dm be an m-atomic function. Then problems (3) and (5) have the same set of solutions (M_P = M_M).

Proof. This is a direct consequence of Proposition 4 and equations (3) and (5). ∎

We have thus proven that, if the median partition problem is defined with an m-atomic function, any solution of the problem will be found in the reduced search space MX. As in the case of the fragment-clusters prune, there is a simple way to reduce the search space from PX to MX: m(P) is first computed and a representative element yi is chosen for each of its clusters. In this way a set of objects Y with |Y| ≤ |X| is obtained, and the corresponding set of partitions PY is equivalent to MX. So far we have introduced the notion of m-atomic function and proven that this prune of the search space can be applied with any m-atomic dissimilarity measure. We now present an existing distance between partitions and prove that it is m-atomic.

Definition 4 (Lattice Metric [18]). The function δ : PX × PX → R defined as δ(P, P′) = |P| + |P′| − 2|P ∨ P′|, where |P| denotes the number of clusters in partition P, is called the lattice metric.

Proposition 5. The lattice metric δ : PX × PX → R is m-atomic.

Proof.
Let P, Pxy, Pzt, P′ ∈ PX be four partitions such that Pxy and Pzt are two atoms satisfying Pxy ⋠ P, Pxy ⪯ P′, and Pzt ⋠ P, Pzt ⋠ P′. We have to prove that there is a constant c such that:

(i) δ(P ∨ Pxy, P′) ≤ δ(P, P′) − c, and
(ii) δ(P ∨ Pzt, P′) ≤ δ(P, P′) + c

Working on (i), the inequality reads:

|P ∨ Pxy| + |P′| − 2|P ∨ Pxy ∨ P′| ≤ |P| + |P′| − 2|P ∨ P′| − c

As Pxy ⪯ P′, we have P ∨ Pxy ∨ P′ = P ∨ P′, so the inequality reduces to |P ∨ Pxy| ≤ |P| − c. As Pxy ⋠ P, we have |P ∨ Pxy| = |P| − 1, because joining P with the atom merges two clusters of P. Thus (i) holds whenever c ≤ 1.

Now, working on (ii):

|P ∨ Pzt| + |P′| − 2|P ∨ Pzt ∨ P′| ≤ |P| + |P′| − 2|P ∨ P′| + c

Since |P ∨ Pzt| = |P| − 1, this reduces to −1 − 2|P ∨ Pzt ∨ P′| ≤ −2|P ∨ P′| + c, i.e. c ≥ 2|P ∨ P′| − 1 − 2|P ∨ Pzt ∨ P′|. The right-hand side takes its largest value when Pzt ⋠ P ∨ P′, in which case |P ∨ Pzt ∨ P′| = |P ∨ P′| − 1. Therefore c ≥ 2|P ∨ P′| − 1 − 2(|P ∨ P′| − 1), i.e. c ≥ 1.

Taking both results into account, we conclude that δ is an m-atomic function with c = 1. ∎

This means that if the median partition problem is defined with the lattice metric δ, the search space of the problem reduces to MX. Notice that this metric corresponds to the minimum path-length metric in the neighboring graph of the lattice (see Figure 1) when all edges have weight 1 [18]. Therefore, it is an informative measure for comparing partitions that takes the lattice structure of the partition space into account, and it allows a pruning of the search space for the median partition problem.

4 Experimental Results and Discussion

Let X = {x1, …, xn} be a set of n objects, where each xi ∈ R^d is a vector in a multidimensional space. We assume that each component of the vector xi = (x_{i,1}, …, x_{i,d}) is drawn independently from the uniform distribution x_{i,j} ∼ U(0, 1). We generated synthetic datasets for d = 3 and n = 1000 (= 10^3), 3375 (= 15^3), 8000 (= 20^3), 15625 (= 25^3), 27000 (= 30^3). The objects lie inside a 3-dimensional unit cube with a corner at the origin of the Cartesian coordinate system. To generate different partitions of each dataset, we use a simple clustering algorithm based on cuts of the cube by random hyperplanes. Furthermore, we model different degrees of dependency between the partitions in the ensemble by clustering the dataset with different subsets of the dimensions of the object representation.

We carry out three kinds of experiments to illustrate the behavior of the proposed search space pruning. In Section 4.1 we work with different dataset sizes (n) and three levels of dependency between partitions. In Section 4.2 we vary the number of clusters (k) in the partitions to be combined, and in Section 4.3 we run experiments with different numbers of partitions (m) in the cluster ensemble. For all experiments we report:

|X|: number of objects in the dataset (|X| = n).
|PX|: size of the original search space for the median partition problem.
|u(P)|: number of clusters in the unanimity-rule partition.
|UX|: size of the search space after the unanimity-rule-based prune (fragment-clusters prune).
|m(P)|: number of clusters in the majority-rule partition.
|MX|: size of the search space after the majority-rule-based prune (the proposed prune).

In all tables the sizes of the search spaces are given as powers of 10, so that the order-of-magnitude differences between the search spaces can be easily appreciated. For uniformity, even small values are given as powers of ten, e.g. if |PX| = 203 we write 10^2.
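The setup above can be replayed end-to-end at a reduced scale. The sketch below is our own simplified replica: n = 100 instead of the paper's dataset sizes, and axis-aligned random cuts standing in for general random hyperplanes; the Bell-triangle routine gives the exact |PX| = B_n, whose order of magnitude is what the tables report.

```python
import random
from collections import defaultdict

def bell(n):
    """Exact Bell number B_n via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new = [row[-1]]
        for x in row:
            new.append(new[-1] + x)
        row = new
    return row[0]

def magnitude(n):
    """floor(log10(B_n)): the exponent reported in the tables."""
    return len(str(bell(n))) - 1

def random_cut_partition(points, n_cuts, dims, rng):
    """Label each point by its side of n_cuts random axis-aligned cuts,
    drawn only in the feature subset `dims` (controls dependency)."""
    cuts = [(rng.choice(dims), rng.random()) for _ in range(n_cuts)]
    return {i: tuple(p[j] > t for j, t in cuts) for i, p in enumerate(points)}

def unanimity_size(labelings):
    """|u(P)|: number of groups of objects co-clustered in every partition."""
    groups = defaultdict(list)
    for obj in labelings[0]:
        groups[tuple(lab[obj] for lab in labelings)].append(obj)
    return len(groups)

rng = random.Random(0)
n, d, m = 100, 3, 10
points = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
# Highly dependent partitions: all cuts use only the first feature.
ensemble = [random_cut_partition(points, 4, dims=[0], rng=rng) for _ in range(m)]
r = unanimity_size(ensemble)
print(f"|P_X| ~ 10^{magnitude(n)}, |u(P)| = {r}, |U_X| ~ 10^{magnitude(r)}")
```

Re-running with dims=[0, 1, 2] makes the partitions more independent, and the unanimity groups fragment much faster; that is the effect the three dependency levels below are meant to probe.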
Results reported in all tables correspond to the median of the individual results over 5 repetitions.

4.1 Analysis by increasing the number of objects and varying the independence degree

In this section we compare |PX|, |UX| and |MX| for different dataset sizes. We generated 10 partitions, each with a number of clusters drawn at random from the interval [2, n/2]. In Table 1, only the first feature dimension of the objects was used to compute the partitions; the idea is to generate partitions with a high degree of dependency, i.e. partitions with similar distributions of objects over clusters. In Table 2, a medium degree of dependency is explored by using features d = 0, 1. Finally, in Table 3, all features d = 0, 1, 2 are used to analyze the behavior of the prunes for ensembles of highly independent partitions.

Table 1. Comparison of |PX|, |UX| and |MX| for different dataset sizes |X|. Partitions were generated with a high degree of dependency (d = 0). We use m = 10 and, for each partition, k = random(2, n/2). Results are the average of 5 trials.

|X|     |PX|        |u(P)|  |UX|      |m(P)|  |MX|
1000    10^1928     10      10^6      6       10^2
3375    10^7981     15      10^10     9       10^4
8000    10^21465    20      10^14     10      10^5
15625   10^45847    25      10^19     10      10^5
27000   10^84822    30      10^24     14      10^8

Table 2. Comparison of |PX|, |UX| and |MX| for different dataset sizes |X|. Partitions were generated with a medium degree of dependency (d = 0, 1). We use m = 10 and, for each partition, k = random(2, n/2). Results are the average of 5 trials.

|X|     |PX|        |u(P)|  |UX|      |m(P)|  |MX|
1000    10^1928     98      10^113    6       10^2
3375    10^7981     195     10^268    7       10^3
8000    10^21465    345     10^539    14      10^9
15625   10^45847    558     10^963    15      10^10
27000   10^84822    689     10^1239   25      10^19

Table 3.
Comparison of |PX|, |UX| and |MX| for different dataset sizes |X|. Partitions were generated with a low degree of dependency (d = 0, 1, 2). We use m = 10 and, for each partition, k = random(2, n/2). Results are the average of 5 trials.

|X|     |PX|        |u(P)|  |UX|       |m(P)|  |MX|
1000    10^1928     777     10^1429    12      10^7
3375    10^7981     2094    10^4589    35      10^30
8000    10^21465    5877    10^15096   43      10^39
15625   10^45847    7122    10^18801   50      10^48
27000   10^84822    9014    10^24587   71      10^75

From this experiment we can make the following observations:

– The cardinality of MX is always considerably lower than that of UX, and the cardinality of UX is in turn always much lower than that of PX.
– As the number of elements in the dataset increases, the sizes of all search spaces also increase. However, |PX| grows faster than |UX|, and |UX| in turn grows faster than |MX|.
– Increasing the independence of the partitions in the ensemble increases the cardinality of the search spaces resulting from both prunes. The higher the dependency between partitions, the higher the probability of finding groups of objects placed in the same cluster in all partitions, or in more than half of them.
– Although the original search space PX is huge in all cases, the reduced search space MX after the majority-rule-based prune is sometimes very small; in such cases the exact solution of the median partition problem could even be found by exhaustive search. On the other hand, the reduced search space UX after the unanimity-rule prune is often still too big to be useful in practice.

4.2 Analysis increasing the number of clusters in the partitions

In this section we used a dataset of size 10 × 10 × 10 = 1000.
All three dimensions of each object are taken into account when generating the partitions of the ensemble. We generated ensembles of m = 10 partitions with different numbers of clusters, k = 5, 20, 50, 100, 200. The results of this experiment are reported in Table 4.

Table 4. Comparison of |PX|, |UX| and |MX| when the number of clusters k in the partitions is increased. Partitions were generated by using the full representation of the objects (d = 1, 2, 3). The dataset size is 1000 and we use m = 10 partitions with k clusters each.

k     |X|    |PX|        |u(P)|  |UX|       |m(P)|  |MX|
5     1000   10^21465    41      10^37      1       10^0
20    1000   10^21465    161     10^211     3       10^1
50    1000   10^21465    523     10^891     10      10^6
100   1000   10^21465    647     10^1149    31      10^26
200   1000   10^21465    769     10^1412    115     10^139

From Table 4 we can see that the sizes of the reduced search spaces increase with the number of clusters in the partitions to be combined. This is expected: the higher the number of clusters, the lower the probability of finding groups of objects placed in the same cluster in all partitions, or in more than half of them. Furthermore, when a small number of clusters (with respect to the number of objects) is used, the reduction produced by the majority-rule-based prune is too strong: the median partition can have very few clusters, or even a single one. This is a consequence of the chaining effect illustrated in Example 1, and such medians can be useless in practical applications.

4.3 Analysis increasing the number of partitions

In this section we used a dataset of size 10 × 10 × 10 = 1000. All three dimensions of each object are taken into account when generating the partitions of the ensemble.
We generated ensembles with different numbers of partitions, m = 5, 10, 20, 50, 100, where each partition has a number of clusters drawn at random from the interval [2, n/2]. The results of this experiment are reported in Table 5.

Table 5. Comparison of |PX|, |UX| and |MX| when the number of partitions m in the ensemble is increased. Partitions were generated by using the full representation of the objects (d = 1, 2, 3). The dataset size is 1000 and we generate m partitions, each with k = random(2, n/2) clusters.

m     |X|    |PX|        |u(P)|  |UX|       |m(P)|  |MX|
5     1000   10^21465    601     10^1052    27      10^21
10    1000   10^21465    694     10^1249    14      10^9
20    1000   10^21465    789     10^1456    13      10^8
50    1000   10^21465    772     10^1418    6       10^2
100   1000   10^21465    746     10^1362    1       10^0

While the size of the search space after the unanimity-rule-based prune, |UX|, remains stable, the cardinality of MX decreases as the number of partitions in the ensemble increases: the more partitions in the ensemble, the higher the probability of finding groups of objects placed in the same cluster in more than half of the partitions. However, this reduction can sometimes be so strong that the resulting median partition has few clusters, or just one, which can be inappropriate in practical applications.

5 Conclusions

We studied two possible reductions of the search space for the median partition problem. For the first one, we introduced a family of functions that allow the application of the fragment-clusters-based prune. This prune had previously been used in an intuitive manner, or with a few measures for which its suitability had been proven; here, a characterization of the measures that allow it is presented. Furthermore, we introduced a stronger prune of the search space for the median partition problem.
For this case, we also presented a family of dissimilarity measures that allow the application of the prune, and we proved that the lattice metric belongs to this family. The proposed prune achieves a dramatic reduction of the search space: even for relatively large numbers of objects, for which the original search space is truly huge, the reduced search space is often small enough that the median partition can be found by exhaustive search. Even when the reduced search space is still big, any heuristic procedure can take advantage of the strong reduction with respect to the original size of the space.

Although this prune can be beneficial in several problems, the median partition defined with a function that allows it sometimes has a small number of clusters; in extreme cases it can even consist of a single cluster, making this kind of consensus useless. In practice, this limitation can be mitigated by generating an ensemble of partitions with a high number of clusters, which will be reduced during the computation of the consensus partition. This idea has been used before in the clustering ensemble context [19].

The computational advantages of the proposed prune are clear in our experiments with synthetic data. A further step would be to analyze the quality of the median partition obtained by this method on real datasets. The two studied prunes correspond to two particular cases of the q-quota rules presented in Section 2: unanimity (q = 1) and majority (q = 0.5). The first leads to a commonly weak reduction of the search space, while the second can sometimes be too strong. A good trade-off might be found in prunes associated with other quota rules, e.g. q = 2/3 or 3/4. A characterization of the dissimilarity measures between partitions that would allow such prunes is worth studying.

Acknowledgments.
This research has been supported by the RESTATE Programme, co-funded by the European Union under the FP7 COFUND Marie Curie Action - Grant agreement no. 267224.

References

1. Vega-Pons, S., Ruiz-Shulcloper, J.: A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence 25(3) (2011) 337–372
2. Fred, A.L., Jain, A.K.: Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 835–850
3. Ayad, H.G., Kamel, M.S.: On voting-based consensus of cluster ensembles. Pattern Recognition 43 (2010) 1943–1953
4. Yoon, H.S., Ahn, S.Y., Lee, S.H., Cho, S.B., Kim, J.H.: Heterogeneous clustering ensemble method for combining different cluster results. In: BioDM 2006. Volume 3916 of LNBI. (2006) 82–92

S. Vega-Pons and P. Avesani

5. Strehl, A., Ghosh, J.: Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3 (2002) 583–617
6. Vega-Pons, S., Correa-Morris, J., Ruiz-Shulcloper, J.: Weighted partition consensus via kernels. Pattern Recognition 43(8) (2010) 2712–2724
7. Topchy, A.P., Jain, A.K., Punch, W.F.: Clustering ensembles: Models of consensus and weak partitions. IEEE Trans. Pattern Anal. Mach. Intell. 27(12) (2005) 1866–1881
8. Punera, K., Ghosh, J.: Consensus-based ensembles of soft clusterings. Applied Artificial Intelligence 22(7&8) (2008) 780–810
9. Mirkin, B.G.: Mathematical Classification and Clustering. Kluwer Academic Press, Dordrecht (1996)
10. Wakabayashi, Y.: Aggregation of Binary Relations: Algorithmic and Polyhedral Investigations. PhD thesis, Universität Augsburg (1986)
11. Filkov, V., Skiena, S.: Integrating microarray data by consensus clustering.
International Journal on Artificial Intelligence Tools 13(4) (2004) 863–880
12. Luo, H., Jing, F., Xie, X.: Combining multiple clusterings using information theory based genetic algorithm. In: IEEE International Conference on Computational Intelligence and Security. Volume 1. (2006) 84–89
13. Wu, O., Hu, W., Maybank, S.J., Zhu, M., Li, B.: Efficient Clustering Aggregation Based on Data Fragments. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42(3) (2012) 913–926
14. Chung, C.H., Dai, B.R.: A fragment-based iterative consensus clustering algorithm with a robust similarity. Knowledge and Information Systems (2013) 1–19
15. Singh, V., Mukherjee, L., Peng, J., Xu, J.: Ensemble clustering using semidefinite programming with applications. Machine Learning 79 (2010) 177–200
16. Vega-Pons, S., Jiang, X., Ruiz-Shulcloper, J.: Segmentation ensemble via kernels. In: First Asian Conference on Pattern Recognition (ACPR). (2011) 686–690
17. Spivey, M.Z.: A generalized recurrence for Bell numbers. Journal of Integer Sequences 11 (2008) 1–3
18. Leclerc, B.: The median procedure in the semilattice of orders. Discrete Applied Mathematics 127 (2003) 285–302
19. Fred, A.: Finding consistent clusters in data partitions. In: 3rd Int. Workshop on Multiple Classifier Systems. (2001) 309–318
An Ensemble Approach to Combining Expert Opinions

Hua Zhang1, Evgueni Smirnov1, Nikolay Nikolaev2, Georgi Nalbantov3, and Ralf Peeters1

1 Department of Knowledge Engineering, Maastricht University, P.O. Box 616, 6200 MD Maastricht, The Netherlands
{hua.zhang,smirnov,ralf.peeters}@maastrichtuniversity.nl
2 Department of Computing, Goldsmiths College, University of London, London SE14 6NW, United Kingdom
[email protected]
3 Faculty of Health, Medicine and Life Sciences, Maastricht University, P.O. Box 616, 6200 MD Maastricht, The Netherlands
[email protected]

Abstract. This paper introduces a new classification problem in the context of human computation. Given training data annotated by m human experts such that for each training instance the true class is provided, the task is to estimate the true class of a new test instance. To solve the problem we propose to apply a well-known ensemble approach, namely stacked generalization. The key idea is to view each human expert as a base classifier and to learn a meta classifier that combines the votes of the experts into a final vote. We experimented with the stacked-generalization approach on a classification problem that involved 12 human experts. The experiments showed that the approach can significantly outperform the best expert and the majority vote of the experts in terms of classification accuracy.

1 Introduction

Human computation is an interdisciplinary field involving systems of humans and computers capable of solving problems that neither party can solve better separately [4]. This paper introduces a new classification problem in the context of human computation and proposes an ensemble-based approach to that problem. The classification problem we define is essentially a single-label classification problem. Assume that we have m human experts who estimate the true class of instances coming from some unknown probability distribution.
We collect these instances together with the experts' class estimates and label them with their true classes. The resulting collection of instances forms our training data. In this context, our classification problem is to estimate the true class of a new test instance, given the training data and the class estimates provided by the m human experts for that instance. To solve this problem we propose to apply a well-known ensemble approach, namely stacked generalization [6]. The key idea is to view each human expert as a base classifier and to learn a meta classifier that predicts the class for a new instance given the class estimates provided by the m human experts for that instance. This implies that the meta classifier combines the votes of the experts into a final vote. The meta classifier is learned from the training data, which contains both the expert class estimates and the true classes. We experimented with the stacked-generalization approach on a classification problem that involved 12 human experts. The experiments showed that the approach can significantly outperform the best expert and the majority vote of the experts in terms of classification accuracy.

Our work can be compared with other work on classification in the context of human computation and crowdsourcing [5, 7]. In these two fields the main emphasis is on classification problems where the training data is labeled by experts only; i.e., the true instance classes are not provided. We note that our classification problem is conceptually simpler, but somehow it has not been considered so far. There are many applications in medicine, finance, meteorology, etc., where our classification problem is central. Consider for example a set of meteorologists who predict whether it will rain the next day. The true class arrives in 24 hours.
We can record the meteorologists' predictions and the true class over a time period to form our data. Then stacked generalization can be applied, and thus we will hopefully be able to predict better than the best meteorologist or the majority vote of the meteorologists. The remainder of the paper is organized as follows. Section 2 formalizes our classification task and describes stacked generalization as an approach to that task. The experiments are given in Section 3. Finally, Section 4 concludes the paper.

2 Classification Problem and Stacked Generalization

Let X be an instance space, Y a class set, and p(x, y) an unknown probability distribution over the labeled space X × Y. We assume the existence of m human experts capable of estimating the true class of any instance (x, y) ∈ X × Y drawn according to p(x, y). We draw n labeled instances (x, y) ∈ X × Y from p(x, y). Any expert i ∈ 1..m provides an estimate y^(i) ∈ Y of the true class y of each instance x without observing y. This implies that the description x of any instance is effectively extended by the class estimates y^(1), ..., y^(m) ∈ Y given by the m experts. Thus, we consider any instance as an (m + 2)-tuple (x, y^(1), ..., y^(m), y). The set of the n instances formed in this way results in training data D. In this context we define our classification problem: given the training data D, a test instance x ∈ X, and the class estimates y^(1), ..., y^(m) ∈ Y provided by the m experts for x, estimate the true class of the instance x according to the unknown probability distribution p(x, y).

Our solution to the classification problem defined above is to employ stacked generalization [6]. The key idea is to consider each human expert i ∈ 1..m as a base classifier (providing class estimates) and then to learn a meta classifier that combines the class estimates of the experts into a final class estimate.
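The setup above can be sketched with synthetic data. The simulated expert accuracies, the naive-Bayes-style weighted vote used here as the meta classifier h : Y^m → Y, and all names below are our own illustrative assumptions, not the implementation used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 12                       # labeled instances, human experts
y = rng.integers(0, 2, size=n)        # true binary classes

# Simulate expert votes: expert i agrees with the true class with
# probability accs[i] (some experts are below chance, as in Section 3).
accs = rng.uniform(0.3, 0.6, size=m)
votes = np.where(rng.random((n, m)) < accs, y[:, None], 1 - y[:, None])

# Stacking needs meta-level training data: hold out half the instances.
tr, te = slice(0, n // 2), slice(n // 2, n)

# Meta classifier h : Y^m -> Y as a weighted vote: each expert's vote is
# weighted by the log-odds of that expert being correct, estimated on the
# training half. Below-chance experts get negative weights, i.e. their
# votes are inverted -- something a plain majority vote cannot do.
agree = (votes[tr] == y[tr, None]).mean(axis=0)   # per-expert accuracy
w = np.log(agree / (1.0 - agree))
signed = np.where(votes[te] == 1, 1.0, -1.0)
meta_pred = (signed @ w > 0).astype(int)
meta_acc = float((meta_pred == y[te]).mean())

# Unweighted majority vote of the experts, for comparison.
maj_pred = (votes[te].mean(axis=1) > 0.5).astype(int)
maj_acc = float((maj_pred == y[te]).mean())
print(meta_acc, maj_acc)
```

In the paper the meta classifier is learned with standard classifiers (Section 3 uses, e.g., logistic regression and random forests); the weighted vote above merely illustrates how the expert votes become the meta-level features on which such a classifier is trained.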
The meta classifier is a function that can have one of two possible forms: either h : X × Y^m → Y or h : Y^m → Y. The difference is whether the instance descriptions in X are taken into account. Once the form of the meta classifier is chosen, we build it using the training data D. We note that our use of stacked generalization does not impose any restrictions on the type of the meta classifier (as opposed to [1]).

3 Experiments

For our experiments we chose a difficult language-style classification problem^4. We had 317 sentences in English that were composed according to either a Chinese style or an American style. An example of two such sentences with the same meaning is given below:

– Chinese style: "I recommend you to take a vacation."
– American style: "I recommend that you take a vacation."

The sentences were labeled by 12 experts who did not know the true classes of those sentences. The language-style classification problem was to estimate the true class of any new sentence given the class estimates provided by the 12 experts for that sentence. It was indeed a difficult problem: the expert accuracy rates were in the interval [0.27, 0.54], with a mean of 0.39 and a standard deviation of 0.08. The accuracy rate of the majority vote of the experts was 0.71. The sentences with the labels of the 12 experts and their true classes formed our training data. We trained meta classifiers predicting the true class of the sentences. We considered two types of meta classifiers, h : X × Y^12 → Y and h : Y^12 → Y. The input of the first type consisted of a bag-of-words representation of the sentence to be classified and the classes provided by all 12 experts.
The input of the second type consisted of the classes provided by the 12 experts only. The output of both types of meta classifiers was the class estimate for the instance to be classified. In addition, we experimented with the meta classifiers both with and without feature selection. The feature-selection procedure employed was the wrapper method based on greedy stepwise search [2]. The accuracy rates of the meta classifiers were estimated using 10-fold cross-validation. However, to decide whether these classifiers were good, we needed to determine whether their accuracy rates were statistically greater than those of the best expert and the majority vote of the experts. We note that this is not a trivial problem, since k-fold cross-validation is not applicable to the human experts employed. Nevertheless, we performed a paired t-test that we designed as follows. We split the training data randomly into k folds. For any fold we received the class estimates provided by the meta classifiers, the class estimates of the best expert, and the class estimates of the majority vote of the experts for all the instances in the fold. Using this information we computed, for any fold j ∈ 1..k, the accuracy rate a_j^m of the meta classifier, the accuracy rate a_j^be of the best expert, and the accuracy rate a_j^mv of the majority vote of the experts. Then we computed the paired difference d_j = a_j^m − a_j^be (respectively d_j = a_j^m − a_j^mv), and the point estimate d̄ = (Σ_{j=1}^k d_j)/k. The t-statistic we used was (d̄ − μ_d)/(S_d/√k), where μ_d is the true mean and S_d is the sample standard deviation of the differences.

^4 The data can be freely downloaded from: https://dke.maastrichtuniversity.nl/smirnov/hua.zip.

Table 1. Accuracy rates of meta classifiers for the language-style classification task.
s (s̄) indicates presence (absence) of the sentence representation in the input. w (w̄) indicates use (non-use) of the wrapper. The rates in bold are significantly greater than the accuracy rate 0.71 of the majority vote of the experts at the 0.05 significance level.

Classifier            s̄-w̄   s-w̄   s̄-w   s-w
AdaBoostM1            0.78   0.76   0.76   0.71
k-Nearest Neighbor    0.75   0.76   0.77   0.78
Logistic Regression   0.76   0.72   0.76   0.71
Naive Bayes           0.76   0.72   0.78   0.76
RandomForest          0.73   0.75   0.76   0.74

The accuracy rates of the meta classifiers are provided in Table 1. Since the majority vote of the human experts outperformed the best expert, the table shows the results of the statistical paired t-test comparing the accuracy rates of the meta classifiers and the majority vote of the human experts at the 0.05 significance level^5. Two main observations can be derived from Table 1:

(O1) 18 out of 20 meta classifiers have an accuracy rate significantly greater than the accuracy rate 0.71 of the majority vote of the experts.
(O2) Stacked generalization achieves the best classification accuracy rates when: (O2a) the instances to be classified are represented by the expert estimates only, and (O2b) feature selection is employed. In this case we achieved an average rate of 0.766.

During the experiments we recorded the running time of training the meta classifiers. The results are provided in Table 2. They show that:

(O3) Wrapper-based meta classifiers require more time. Among them, the most efficient are the meta classifiers that do not employ the sentence representation.
(O4) Meta classifiers that do not use wrappers require less time. Among them, the most efficient are the meta classifiers that do not employ the sentence representation.

4 Conclusion

This section analyzes observations (O1)–(O4) from Section 3. Based on the analysis it provides final conclusions. We start with observation (O1).
This observation allows us to conclude that stacked generalization can significantly outperform the best expert and the majority vote of the experts in terms of generalization performance. This implies that the classification problem we defined and the approach we proposed are indeed useful. Observation (O2a) is a well-known fact in stacked generalization [1]. However, in the context of this paper it has an additional meaning. More precisely, we can state that for our classification problem we only need to know the class estimates of the experts in order to obtain the best accuracy rates. The input from the application domain (in our case English text) is less important. In addition, we note that according to observations (O3) and (O4), using the expert class estimates only also implies less computational cost. Observation (O2b) is an expected result in the context of feature selection. However, it also has a practical implication for our classification problem, namely that it allows choosing a combination of the most adequate experts. In our experiments, for example, only half of the experts were chosen to maximize the accuracy. This means that we can reduce the number of human experts and thus the overall financial cost.

^5 For the sake of completeness we trained classifiers h : X → Y as well. Their accuracy rates were in the interval [0.47, 0.53]; i.e., they were statistically worse than the experts' majority vote.

Table 2. Time (ms) for building meta classifiers. s (s̄) indicates presence (absence) of the sentence representation in the input. w (w̄) indicates use (non-use) of the wrapper.

Classifier            s̄-w̄   s-w̄    s̄-w    s-w
AdaBoostM1            0.03   2.21   14.85  257.63
k-Nearest Neighbor    0      0      26.49  296.84
Logistic Regression   0.02   0.05    4.95  133.47
Naive Bayes           0      0.1     0.31   53.11
RandomForest          0.03   1.33   23.51  219.64
Of course, this has a price: an increase in computational cost, according to observation (O3). Future research will focus on the problem of the evolution of human experts. Indeed, in real life the experts change due to many factors (e.g., training, ageing). Solving this problem will have a high practical impact. For that purpose we plan to apply techniques from concept drift [8] and transfer learning [3].

References

1. S. Dzeroski and B. Zenko. Is combining classifiers better than selecting the best one? In Proceedings of the Nineteenth International Conference on Machine Learning, pages 123–130, 2002.
2. I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh. Feature Extraction, Foundations and Applications. Physica-Verlag, Springer, 2006.
3. S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Trans. Knowl. Data Eng., 22(10):1345–1359, 2010.
4. A. Quinn and B. Bederson. Human computation: a survey and taxonomy of a growing field. In Proceedings of the International Conference on Human Factors in Computing Systems, CHI 2011, pages 1403–1412. ACM, 2011.
5. V. Raykar, S. Yu, L. Zhao, G. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.
6. D. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
7. Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo Valadez, L. Bogoni, L. Moy, and J. Dy. Modeling annotator expertise: Learning when everybody knows a bit of something. Journal of Machine Learning Research, 9:932–939, 2010.
8. I. Zliobaite. Learning under Concept Drift: an Overview. Technical Report, Faculty of Mathematics and Informatics, Vilnius University, Vilnius, Lithuania, 2009.