A Multi-clustering Fusion Algorithm
Dimitrios Frossyniotis¹, Minas Pertselakis¹, and Andreas Stafylopatis²
National Technical University of Athens
Department of Electrical and Computer Engineering
Zographou 157 73, Athens, Greece
¹ {dfros, mper}@cslab.ntua.gr
² [email protected]
Abstract. A multi-clustering fusion method is presented that combines several runs of a clustering algorithm into a common partition. More specifically, the results of several independent runs of the same clustering algorithm are appropriately combined to obtain a partition of the data which is not affected by initialization and overcomes the instabilities of clustering methods. A fusion procedure then starts with the clusters produced by the combining part and finds the optimal number of clusters in the data set according to some predefined criteria. The unsupervised multi-clustering method implemented in this work is quite general: it can be implemented and tested with any existing clustering algorithm that has unstable results. Experiments using both simulated and real data sets indicate that the multi-clustering fusion algorithm is able to partition a set of data points into the optimal number of clusters, which are not constrained to be hyper-spherically shaped.
1 Introduction
Unsupervised classification, also known as data clustering, is a generic label for
a variety of procedures designed to find natural groupings or clusters in multidimensional data, based on measured similarities among the patterns [1]. Clustering is a very difficult problem because data can reveal clusters with different
shapes and sizes. Additionally, the number of clusters in the data often depends
on the resolution with which the data are viewed. As a consequence, different
clustering algorithms have been proposed in the literature and new clustering
algorithms continue to appear.
Moreover, the majority of these algorithms are based on the following four
most popular clustering methods: iterative square-error partitional clustering,
hierarchical clustering, grid-based clustering and density-based clustering [2,3].
Partitional methods can be further classified into two groups. In the first
group, each sample is assigned to one and only one cluster, contrary to the
second group of methods where each sample can be associated (in some sense)
with several clusters. The most commonly used partitional clustering algorithm
is K-means, which is based on the square-error criterion. This algorithm is computationally efficient and yields good results if the clusters are compact, hyper-spherical in shape and well separated in the feature space. Numerous attempts
have been made to improve the performance of the simple K-means by using the
Mahalanobis distance to detect hyper-ellipsoidal shaped clusters [4] or by incorporating a fuzzy criterion function resulting in a fuzzy C-means algorithm [5]. A
different partitional clustering approach is based on probability density function
(pdf) estimation using Gaussian mixtures. The specification of the parameters
of the mixture is based on the expectation-maximization (EM) algorithm [6]. A
recently proposed greedy-EM algorithm [7] is an incremental scheme that has
been found to provide better results than the conventional EM algorithm.
Hierarchical clustering methods organize data in a nested sequence of groups
which can be displayed in the form of a dendrogram or a tree [8]. These methods
can be either agglomerative or divisive. An agglomerative hierarchical method
places each sample in its own cluster and gradually merges these clusters into
larger clusters until all samples are ultimately in a single cluster (the root node).
A divisive hierarchical method starts with a single cluster containing all the data
and recursively splits parent clusters into daughters.
Grid-based clustering algorithms are mainly proposed for spatial data mining.
Their main characteristic is that they quantise the space into a finite number of
cells and then perform all operations on the quantised space. On the other hand,
density-based clustering algorithms adopt the key idea of grouping neighbouring
objects of a data set into clusters based on density conditions.
However, many of the above clustering methods require additional user-specified parameters, such as the optimal number and shapes of clusters, similarity thresholds and stopping criteria. Moreover, different clustering algorithms
and even multiple replications of the same algorithm result in different solutions
due to random initializations, so there is no clear indication for the best partition
result. Consequently, two of the main challenges in cluster analysis are, first, to select an appropriate measure of similarity to define clusters, which in general is cluster-shape dependent, and, second, to specify the optimal number of clusters in the data set. In this direction, clustering strategies have been developed
which prove to perform very satisfactorily in clustering and finding the number
of clusters [9,10,11,12,13]. The present work, following an analogous approach,
proposes a clustering algorithm which tackles these two important problems and
is able to partition a data set in a shape independent manner and to find the
optimal number of clusters existing in the data set.
The paper is organized as follows: Section 2 describes the multi-clustering
fusion method, while experimental results for the evaluation of the proposed
method are presented in Section 3 and, finally, conclusions are presented in
Section 4.
2 Description of the Algorithm
The multi-clustering fusion algorithm consists of two procedures that take place
sequentially: the Partitioning procedure, which is used to partition the data points of a set into clusters, and the Fusion procedure, which determines the true structure of the data.
In the primary stage, the initial number of clusters and the number of iterations are defined for the Partitioning procedure, wherein a clustering algorithm
and a voting scheme are implemented, in order to produce a distinct partition
of the data set. During the Fusion procedure, this partition is processed and
neighbour clusters are merged, resulting in an optimal number of clusters for
the given data set, according to some specified criteria.
2.1 Partitioning Procedure
The Partitioning procedure applies the same basic clustering algorithm for a number of iterations, Iter, so as to accomplish a distinct partitioning of N data points into a predefined number C of clusters. The experimental study of our work is based on two implementations of the proposed multi-clustering fusion method using different basic clustering algorithms: the K-means and the greedy-EM algorithm.
More specifically, the K-means clustering aims to optimise an objective function that is described by the equation
J = \sum_{i=1}^{C} \sum_{x \in \mu_i} d(x, v_i)    (1)
where v_i is the center of cluster µ_i and d(x, v_i) is the Euclidean distance between a point x and v_i. Thus, the criterion function J attempts to minimize the distance
of every point from the center of the cluster to which the point belongs. Starting
from arbitrary initial positions for cluster centers and by iteratively updating
cluster centers, the algorithm moves the cluster centers to sensible locations
within the data set.
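As a concrete illustration of Eq. (1), the following minimal Python sketch evaluates J for a given hard partitioning; the array names X, centers and labels are ours, and any K-means implementation could supply them.

```python
import numpy as np

def kmeans_objective(X, centers, labels):
    """Criterion J of Eq. (1): total Euclidean distance of every point
    to the center of the cluster it belongs to."""
    # X: (N, d) data, centers: (C, d) cluster centers v_i, labels: (N,) cluster index per point
    return sum(np.linalg.norm(X[labels == i] - v, axis=1).sum()
               for i, v in enumerate(centers))
```

Evaluating this quantity after each update of the centers would show J decreasing as standard K-means iterates.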
As far as the greedy-EM algorithm [7] is concerned, the data are assumed
to be generated by several parameterized Gaussian distributions, so the data
points are assigned to different clusters based on their posterior probabilities
of having been generated by a specific Gaussian distribution. A multivariate
Gaussian mixture is defined as the weighted sum:
p(x) = \sum_{j=1}^{C} \pi_j f(x; \phi_j)    (2)

where \pi_j are the mixing weights satisfying \sum_{j} \pi_j = 1, \pi_j \ge 0, and f(x; \phi_j) is the l-dimensional Gaussian density

f(x; \phi_j) = (2\pi)^{-l/2} |S_j|^{-1/2} \exp[-0.5 (x - m_j)^{\top} S_j^{-1} (x - m_j)]    (3)
parameterized by the mean m_j and the covariance matrix S_j, collectively denoted by the parameter vector \phi_j. Usually, for a given number C of kernels, the specification of the parameters of the mixture is based on the expectation-maximization (EM) algorithm [6] for maximization of the data log-likelihood:

L = \frac{1}{N} \sum_{i=1}^{N} \log p(x_i)    (4)
The algorithm starts with one kernel and adds kernels dynamically one at a time
so as to estimate the true number of components of the mixture (therefore the
true number of clusters, if we consider that each kernel corresponds to a group
of patterns) as follows. The algorithm is run for a large value of C, and, for the
solution obtained for each intermediate value of C, a model selection criterion is
applied, e.g., cross-validation using a set of test points, a coding scheme based
on minimum description length, etc. Finally, the value of C that corresponds to the best value of the model selection criterion is selected. In this work, we have used, as the criterion for the specification of C, the log-likelihood value on a validation set of points that have not been used for training.
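As a sketch of this model-selection criterion (not the authors' code), the average log-likelihood of Eq. (4) can be evaluated on held-out points once the mixture parameters have been fitted by any EM variant; weights, means and covs are our names for the π_j, m_j and S_j.

```python
import numpy as np
from scipy.stats import multivariate_normal

def validation_log_likelihood(X_val, weights, means, covs):
    """Average log-likelihood (Eq. 4) of validation points under the
    Gaussian mixture of Eqs. (2)-(3)."""
    # component_densities[j, n] = pi_j * f(x_n; phi_j)
    component_densities = np.array([
        w * multivariate_normal(mean=m, cov=S).pdf(X_val)
        for w, m, S in zip(weights, means, covs)
    ])
    return np.mean(np.log(component_densities.sum(axis=0)))
```

Among the intermediate mixtures produced while kernels are added, the one with the largest validation value would be retained.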
The above procedure is carried out when applying the greedy-EM algorithm
as a stand-alone clustering method. When using the greedy-EM as a basic clustering algorithm within the multi-clustering fusion approach we consider only the
predefined value of C and no intermediate values, so as to obtain a partitioning
to C clusters at each iteration step.
As concerns the Partitioning procedure, the basic clustering algorithm partitions the data set in a different way at each iteration, creating the problem of deciding which cluster of one run corresponds to which cluster of another run. The algorithm tackles this problem using the similarity between the clusters produced during successive runs. By determining the percentage of points of a cluster in the t-th run belonging to each cluster of the (t − 1)-th run, every cluster of the new run is assigned to a cluster of the previous run, resulting in a cluster renumbering
process.
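A minimal sketch of one possible renumbering step, assuming each run returns a hard label vector: each cluster of the current run is mapped to the previous-run cluster holding the largest share of its points. This greedy mapping follows our reading of the description above and is not guaranteed to be one-to-one, which a full implementation would have to enforce.

```python
import numpy as np

def renumber_clusters(prev_labels, new_labels, C):
    """Relabel the clusters of the new run so that each one takes the index of
    the previous-run cluster containing most of its points."""
    relabelled = np.empty_like(new_labels)
    for q in range(C):
        members = (new_labels == q)
        if members.any():
            # share of cluster q's points falling into each previous-run cluster
            overlap = np.bincount(prev_labels[members], minlength=C)
            relabelled[members] = overlap.argmax()
    return relabelled
```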
After renumbering, if pattern i is assigned to cluster q, then a positive vote is
given to cluster q and a negative one to all other clusters. This process defines a
voting scheme, during which a voting table VT (of dimension N × C) is updated, so that VT(i, j) denotes the membership degree of pattern i to cluster j, where i = 1, . . . , N and j = 1, . . . , C.
At the end of the runs, each pattern i is considered to belong to the cluster C^{i}_{max}, where

C^{i}_{max} = \arg\max_{j} VT(i, j), \quad j = 1, \ldots, C    (5)
The procedure thus results in a distinct partitioning of the data set, assigning
each data point to one cluster.
Using the VT table and the relation between the data points of one cluster
with all the remaining clusters, a table NRT (of dimension C × C) can be produced, so that NRT(i, j) represents the neighbourhood relation between clusters
i and j:
NRT(i, j) = \sum_{p=1}^{N} VT(p, j)\, I(C^{p}_{max} = i), \quad i = 1, \ldots, C, \; j = 1, \ldots, C, \; j \neq i    (6)
where I(z) is an indicator function, i.e. I(z) = 1 if z is true and I(z) = 0 otherwise.
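Equation (6) translates directly into a few lines of numpy; VT is the N × C voting table and the winning cluster of Eq. (5) is recovered with argmax (the variable names are ours).

```python
import numpy as np

def neighbourhood_relation(VT):
    """Build the C x C neighbourhood relation table of Eq. (6)."""
    N, C = VT.shape
    c_max = VT.argmax(axis=1)              # Eq. (5): winning cluster of each pattern
    NRT = np.zeros((C, C))
    for i in range(C):
        members = (c_max == i)             # patterns assigned to cluster i
        NRT[i] = VT[members].sum(axis=0)   # how strongly they also voted for every cluster j
    np.fill_diagonal(NRT, 0.0)             # the j = i entries are excluded in Eq. (6)
    return NRT
```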
2.2 Fusion Procedure
Given the neighbourhood relation among clusters, a Fusion procedure is developed. This procedure starts with the predefined number C of clusters and (after
removing the clusters with zero data points) merges the ones which are closest
to each other.
More specifically, the procedure searches the neighbourhood relation table
(C × C table) for the two clusters (with indexes C1 and C2) that fulfill the
following conditions: first, both clusters are the closest to each other and, second,
these two clusters are the closest of all clusters. The next step is to merge these
clusters into one and to reconfigure the voting table accordingly, by adding the
votes of the second cluster to the first one as follows:
VT(i, C_1) = VT(i, C_1) + VT(i, C_2), \quad i = 1, \ldots, N    (7)

where C_1 = min(C1, C2) and C_2 = max(C1, C2). The new neighbourhood relation table is created with one cluster less, by removing cluster C_2, and the
procedure starts again until some stopping criterion is met.
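A sketch of a single fusion step under our reading of these conditions: the pair with the strongest mutual relation in NRT is taken as the closest, the votes of the higher-indexed cluster are added to the lower-indexed one (Eq. 7), and the merged column is dropped. The names are ours, and NRT would afterwards be rebuilt from the reduced voting table (e.g. with the neighbourhood_relation sketch of Sect. 2.1).

```python
import numpy as np

def fuse_once(VT, NRT):
    """Merge the two closest clusters according to NRT and update VT (Eq. 7)."""
    closeness = NRT + NRT.T                      # symmetric closeness of every cluster pair
    c1, c2 = np.unravel_index(closeness.argmax(), closeness.shape)
    c1, c2 = min(c1, c2), max(c1, c2)
    VT = VT.copy()
    VT[:, c1] += VT[:, c2]                       # Eq. (7): add the votes of C2 to C1
    return np.delete(VT, c2, axis=1)             # one cluster less; NRT is then recomputed
```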
The criterion that derives directly from this procedure is that merging will stop when all clusters end up having an average ‘sureness’ of 100% (the average ‘sureness’ of a cluster is defined as the sum of the membership degrees of the points assigned to it divided by their total number). This means that, in the voting table, every data point is assigned to exactly one cluster with degree 100%. Since in practice this condition cannot always be realized, due for example to overlapping clusters, we decided to use methods suitable for quantitative evaluation of the clustering results, which determine the number of clusters that best fits a data set.
The cluster validity methods used in our study are the Root-mean-square
standard deviation (RMSSTD) and the R-squared (RS) described in [3]. More
specifically, RMSSTD and RS have to be taken into account simultaneously in
order to find the correct number of clusters. The optimal number of clusters is the one at which a significant local change in the values of RS and RMSSTD occurs. It should be noted, however, that since these methods only give an indication of the quality of the resulting partitioning, they should be considered as a tool at the disposal of the experts for evaluating the clustering results.
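For reference, a common formulation of these two indices (following the definitions in [3]; the exact normalisation used by the authors is not spelled out here, so this is a hedged sketch): RMSSTD is the pooled within-cluster standard deviation over all attributes, and RS is the proportion of the total sum of squares explained by the partitioning.

```python
import numpy as np

def rmsstd_and_rs(X, labels):
    """Root-mean-square standard deviation and R-squared of a hard partitioning."""
    ss_total = ((X - X.mean(axis=0)) ** 2).sum()     # total sum of squares
    ss_within = 0.0
    degrees = 0
    for k in np.unique(labels):
        Xk = X[labels == k]
        ss_within += ((Xk - Xk.mean(axis=0)) ** 2).sum()
        degrees += (len(Xk) - 1) * X.shape[1]
    rmsstd = np.sqrt(ss_within / max(degrees, 1))    # pooled within-cluster std
    rs = (ss_total - ss_within) / ss_total           # fraction of variance explained
    return rmsstd, rs
```

In the spirit of the text above, one would track both values while the Fusion procedure merges clusters and look for the number of clusters at which they change sharply.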
2.3 Pseudo-Algorithm
- Define number of clusters, C
- Define number of iterations, Iter
Procedure 1: Partitioning
– i = 1:
  Run the basic clustering algorithm to partition the data set into C clusters
  If sample p (p = 1, . . . , N) belongs to cluster q then
    VT(p, q) = 1
    VT(p, j) = 0, j = 1, . . . , C, j ≠ q
– For i = 2 to Iter
  - Run the basic clustering algorithm to partition the data set into C clusters
  - Renumber clusters
  - Voting scheme (see the sketch after this pseudo-algorithm):
    If sample p (p = 1, . . . , N) belongs to cluster q then
      VT(p, q) = ((i − 1)/i) VT(p, q) + 1/i
      VT(p, j) = ((i − 1)/i) VT(p, j), j = 1, . . . , C, j ≠ q
– Create neighbourhood relation table NRT (C × C)
Procedure 2: Fusion
– Remove clusters with zero data points
– Repeat until stopping criterion is met
- From neighbourhood relation table find the two closest clusters
- Merge pairs, sum the votes
- Recompute NRT with C = C − 1 clusters
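The voting update of Procedure 1 can be written compactly; the sketch below folds the i-th run (i ≥ 2) into the running table, assuming the labels have already been renumbered (e.g. with the renumbering sketch of Sect. 2.1). After Iter runs, VT(p, j) is the fraction of runs in which pattern p fell into cluster j.

```python
import numpy as np

def update_voting_table(VT, labels, i):
    """Fold the (renumbered) result of run i into the voting table:
    VT(p, q) <- ((i-1)/i) VT(p, q) + 1/i for the winning cluster q of pattern p,
    VT(p, j) <- ((i-1)/i) VT(p, j) for every other cluster j."""
    N, _ = VT.shape
    VT = VT * (i - 1) / i                  # scale down all previous votes
    VT[np.arange(N), labels] += 1.0 / i    # positive vote for the cluster of each pattern
    return VT
```

Starting from the 0/1 table of the first run and applying this update for i = 2, …, Iter reproduces the scheme in the pseudo-algorithm.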
The proposed algorithm consists of two procedures that take place sequentially; thus, the total complexity is the sum of the respective complexities. The Partitioning procedure has the complexity of the basic clustering algorithm, i.e., if the basic clustering algorithm is K-means then the time complexity is O(n), where n is the number of points in the data set. The time complexity of the Fusion procedure is O(C³), where C is the number of clusters produced by the Partitioning procedure.
3 Experimental Results
In this section we present a comparative experimental evaluation of the proposed
methodology using different basic clustering algorithms, namely the K-means
and the greedy-EM algorithm. The resulting multi-clustering fusion method with
K-means as the basic clustering algorithm will be hereafter referred to as multi-fusion-k-means. Similarly, using the greedy-EM as the basic clustering algorithm
will be referred to as multi-fusion-greedy-EM.
The proposed multi-clustering fusion method has been tested on several data
sets. The basic idea for choosing the initial number of clusters is to set C to a large value, say √N, N being the number of patterns in the data set. We
used this formula, because partitioning a small data set into a large number of
clusters (compared to the actual number of clusters) usually produces clusters of
few points or empty clusters. The experiments presented here consist of Iter =
100 runs of the basic clustering algorithm in the Partitioning procedure with a
number C of clusters. The voting table VT and the neighbourhood relation table
NRT are computed between successive runs and the Fusion procedure follows
according to the final results of the partition. The optimal values of the number
of clusters are those for which a significant local change in values of RS and
RMSSTD occurs.
Finally, for comparison purposes, we also present clustering results from running the greedy-EM algorithm as a stand-alone clustering method. In this case,
we have applied the procedure described in the previous section for selecting the
optimal number of clusters, using a validation set of points that have not been used
for training.
[Fig. 1. Lith data set after the Partitioning procedure (multi-fusion-k-means).]
[Fig. 2. Lith data set after the Fusion procedure (multi-fusion-k-means).]

3.1 The Lith Data
This is a 2-dimensional data set consisting of 2000 data points. The data are uniformly distributed along two sausage-shaped regions and superimposed with normal noise of standard deviation 1 in all directions. We have considered
C = 45 clusters in the Partitioning procedure. The multi-fusion-k-means partitioned the data points correctly into two clusters (Fig. 1 and 2). The validity
indices (RMSSTD and RS) select the clustering scheme of two clusters while
we reached an average ‘sureness’ of the clusters greater than 99%. Similarly, the
multi-fusion-greedy-EM method partitioned the data points into two well separated clusters reaching an average ‘sureness’ of the clusters greater than 99%.
For the stand-alone greedy-EM algorithm, we used 1000 data points for training and 1000 for validation. We ran the algorithm for C = 45 clusters and the
optimal solution obtained was 6 clusters (Fig. 7) with average ‘sureness’ of the
clusters 91.8%.
[Fig. 3. Banana data set after the Partitioning procedure (multi-fusion-k-means).]
[Fig. 4. Banana data set after the Fusion procedure (multi-fusion-k-means).]

3.2 The Banana Data
The Banana data set is also a 2-dimensional one consisting of 2000 data points
that belong to two banana shaped clusters. We have considered C = 45 clusters
in the Partitioning procedure. The multi-fusion-k-means partitioned the data
points correctly into two clusters (Fig. 3 and 4). The validity indices (RMSSTD
and RS) select the clustering scheme of two clusters while we reached an average
‘sureness’ of the clusters greater than 99%. Similarly, the multi-fusion-greedy-EM
method partitioned the data points into two well separated clusters reaching an
average ‘sureness’ of the clusters greater than 99%. For the stand-alone greedy-EM, we used 1000 data points for training and 1000 for validation. We ran the
algorithm for C = 45 clusters and the optimal solution obtained was 10 clusters
(Fig. 8) with average ‘sureness’ of the clusters 86.8%.
[Fig. 5. Clouds data set after the Partitioning procedure (multi-fusion-k-means).]
[Fig. 6. Clouds data set after the Fusion procedure (multi-fusion-k-means).]

3.3 The Clouds Data
The Clouds artificial data from the ELENA project [14] are two-dimensional, produced by three different Gaussian distributions. There are 5000 samples in the data set, belonging to three clusters which overlap considerably. We have considered C = 70 clusters in the Partitioning procedure. The multi-fusion-k-means correctly identified the true number of clusters (three) (Fig. 5
and 6). The validity indices (RMSSTD and RS) select the clustering scheme
of three clusters while we reached an average ‘sureness’ of the clusters greater
than 98.5%. Similarly, the multi-fusion-greedy-EM method partitioned the data
points into three clusters reaching an average ‘sureness’ of the clusters greater
than 95.5%. For the stand-alone greedy-EM algorithm, we used 3000 data points
for training and 2000 for validation. We ran the algorithm for C = 70 clusters
and the optimal solution obtained was 4 clusters (Fig. 9) with average ‘sureness’
of the clusters 94.1%. The average ‘sureness’ of the clusters is less than that of
the previous examples for the proposed method. Indeed, the Lith and Banana
data sets have a simple and clear structure, but, unfortunately, in the case of
overlapping clusters (especially in real-world data sets) it is very difficult to find
a ‘very sure’ partitioning.
3.4 The Pima Indians Data
The Diabetes set from the UCI data set repository [15] contains 8-dimensional
data. It is based on personal data from 768 Pima Indians obtained by the National Institute of Diabetes and Digestive and Kidney Diseases. We have considered C = 28 clusters in the Partitioning procedure. The multi-fusion-k-means
yielded four clusters. The validity indices (RMSSTD and RS) select the clustering scheme of four clusters, while we reached an average ‘sureness’ of the
clusters greater than 99%. Similarly, the multi-fusion-greedy-EM method partitioned the data points into four clusters reaching an average ‘sureness’ of the
clusters greater than 96.5%. For the stand-alone greedy-EM algorithm, we used
500 data points for training and 268 for validation. We ran the algorithm for
C = 28 clusters and the optimal solution obtained was 5 clusters with average
‘sureness’ of the clusters 95%.
3.5 Discussion
An important conclusion that can be drawn from the experimental evaluation
is that the proposed multi-clustering fusion method results in a partitioning
scheme that optimally fits the specific data set according to criteria such as ‘sureness’, RMSSTD and RS. We used two different basic clustering algorithms and came up with similar clustering results. It can be claimed that the multi-clustering fusion methodology, independently of the basic clustering algorithm
used, finds the ‘optimal’ number and shape of clusters that fit the data, thus dealing with the problem of initialization dependency and selection of the number and shape of clusters.

[Fig. 7. Means and variances of the kernels using the stand-alone Greedy-EM for the Lith data set.]
[Fig. 8. Means and variances of the kernels using the stand-alone Greedy-EM for the Banana data set.]
[Fig. 9. Means and variances of the kernels using the stand-alone Greedy-EM for the Clouds data set.]
Another interesting observation is that the proposed multi-clustering fusion
method almost always exhibits better clustering performance than the greedy-EM algorithm, according to the adopted cluster validity methods and the ‘sureness’ measure. However, this comparison should be considered as rather indicative.
4 Conclusions
This paper proposed a general unsupervised learning scheme for combining clustering results produced by several iterations of a basic clustering algorithm. A
fusion procedure takes the resulting partition and finds the optimal number of
clusters in the data set according to some cluster validity methods. Although the
general scheme has been explored here within the framework of K-means and
greedy-EM clustering, the data points are typically not uniquely assigned by the
fusion procedure to one cluster, so we can also consider ‘fuzzy’ partitioning.
We have shown that the clustering algorithm implemented in this work can
handle the problem of initialization dependency and selection of the number of
clusters. Moreover, as illustrated by the experimental results, the algorithm can
partition a data set into clusters which are shape independent.
Concluding, the proposed multi-clustering fusion algorithm does not require
additional user-specified parameters, since the only parameter that needs to be defined is the initial number of clusters. It must be noted, however, that a good
value for this parameter was found experimentally depending on the size of the
problem. Ongoing work includes the adoption of other basic clustering algorithms
and experimentation with different fusion techniques, as well as comparison of
the proposed method with other AI clustering methods for selecting the optimal
number of clusters. Finally, this multi-clustering methodology can be used for
improving the performance of a multi-net classification system, which is based
on supervised and unsupervised learning [16].
References
1. A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Englewood Cliffs, N.
J.: Prentice Hall, 1988.
2. A.K. Jain, R.P.W. Duin, and J. Mao. Statistical pattern recognition: A review.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 2000.
3. M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Clustering algorithms and validity measures. In Proceedings of the SSDBM Conference, Virginia, USA, July 2001.
4. J.C. Bezdek and S.K. Pal. Fuzzy Models for Pattern Recognition: Methods that
Search for Structures in Data. IEEE CS Press, 1992.
5. J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms.
Plenum Press, New York, 1981.
6. A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39:1–38, 1977.
7. N. Vlassis and A. Likas. A greedy-EM algorithm for Gaussian mixture learning. Technical report, Computer Science Institute, University of Amsterdam, The Netherlands, May 2000.
8. E. Boundaillier and G. Hebrail. Interactive interpretation of hierarchical clustering.
Intell. Data Anal., 2(3), 1998.
9. A. Fred. Finding Consistent Clusters in Data Partitions. In Proceedings of the
Second International Workshop on Multiple Classifier Systems (MCS 2001), LNCS
2096, pages 309–318, Cambridge, UK, July 2-4 2001. Springer.
10. E. Dimitriadou, A. Weingessel, and K. Hornik. A voting-merging clustering algorithm. Working Paper 31, SFB ‘Adaptive Information Systems and Modeling in
Economics and Management Science’, April 1999.
11. P. Smyth. Clustering Using Monte Carlo Cross-Validation. In Proceedings Knowledge Discovery and Data Mining, pages 126–133, 1996.
12. P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining.
AAAI Press/MIT Press, 1996.
13. D.H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine
Learning, 2:139–172, 1987.
14. ESPRIT Basic Research Project ELENA (no. 6891).
[ftp://ftp.dice.ucl.ac.be/pub/neural-nets/ELENA/databases], 1995.
15. UCI Machine Learning Databases Repository, University of California-Irvine, Department of Information and Computer Science. [ftp://ftp.ics.edu/pub/machinelearning-databases].
16. D.S. Frossyniotis and A. Stafylopatis. A Multi-SVM Classification System. In
Proceedings of the Second International Workshop on Multiple Classifier Systems
(MCS 2001), LNCS 2096, pages 198–207, Cambridge, UK, July 2-4 2001. Springer.