Analysis and Approach: K-Means and K-Medoids Data Mining Algorithms

Dr. Aishwarya Batra, Asst. Professor, L. J. Institute of Computer Applications, Ahmedabad, India. E-mail: [email protected]

Abstract
Clustering is similar to classification in that data are grouped. A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters. A large number of clustering algorithms exist in the literature; the choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. Clustering analysis is one of the main analytical methods in data mining. K-means is the most popular partition-based clustering algorithm, but it is computationally expensive, and the quality of the resulting clusters depends heavily on the selection of the initial centroids and on the dimension of the data. Several methods have been proposed in the literature for improving the performance of the k-means clustering algorithm. In this research, the most representative algorithms, K-Means and K-Medoids, were examined and analyzed based on their basic approach, and the better algorithm in each category was identified based on performance. The input data points are generated in two ways: one using a normal distribution and the other using a uniform distribution.

Keywords: K-Means, K-Medoids, Clustering, Partitional Algorithm

Introduction
Clustering techniques have wide use and importance nowadays, and this importance tends to increase as the amount of data grows and the processing power of computers increases. Clustering applications are used extensively in various fields such as artificial intelligence, pattern recognition, economics, ecology, psychiatry and marketing. Data clustering is under vigorous development; contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been studied extensively for many years, focusing mainly on distance-based cluster analysis. The main purpose of clustering techniques is to partition a set of entities into different groups, called clusters. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages and systems, such as S-Plus, SPSS, and SAS.

Categorization of Major Clustering Methods
In general, the major clustering methods can be classified into the following categories:
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods

Classical Partitioning Methods: K-Means & K-Medoids
The most well-known and commonly used partitioning methods are k-means, k-medoids, and their variations. Partitional clustering techniques create a one-level partitioning of the data points. There are a number of such techniques, but we shall describe only two approaches in this section: K-means and K-medoids. Both techniques are based on the idea that a centre point can represent a cluster. For K-means we use the notion of a centroid, which is the mean or median point of a group of points; note that a centroid almost never corresponds to an actual data point. For K-medoids we use the notion of a medoid, which is the most representative (central) point of a group of points and is always an actual data point.
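To make the centroid/medoid distinction concrete, here is a minimal Python sketch (illustrative only, not from the paper; the sample points are arbitrary). It computes both representatives for a small 2-D point set, using the Manhattan distance for the medoid, as in the worked examples later in the paper.

```python
# Sketch: centroid vs. medoid of a 2-D point set (illustrative values).

def manhattan(p, q):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def centroid(points):
    """Mean point of the group; usually not an actual data point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """The actual data point whose total distance to all others is smallest."""
    return min(points, key=lambda p: sum(manhattan(p, q) for q in points))

points = [(2, 10), (2, 5), (1, 2), (4, 9)]  # hypothetical sample
print(centroid(points))  # (2.25, 6.5) -- not one of the input points
print(medoid(points))    # (2, 5) -- always one of the input points
```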
Partitional techniques create a one-level (un-nested) partitioning of the data points: if K is the desired number of clusters, partitional approaches typically find all K clusters at once.
• k-means (MacQueen '67): each cluster is represented by the centre of the cluster.
• k-medoids or PAM, Partitioning Around Medoids (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster.

Centroid-Based Technique: The K-Means Method

Basic Algorithm
The K-means clustering technique is very simple, so we immediately begin with a description of the basic algorithm and elaborate in the following sections.

Basic K-means algorithm for finding K clusters:
1. Select K points as the initial centroids.
2. Assign all points to the closest centroid.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until the centroids don't change.

The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high while the inter-cluster similarity is low. The square-error criterion is

E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and m_i is the mean of cluster C_i (both p and m_i are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster centre is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible. In the k-means algorithm (MacQueen, 1967), the prototype, called the centre, is the mean value of all objects belonging to a cluster.

[Figure: Flow chart of the K-means algorithm]

For example, if we consider the following data set, the K-Means algorithm works like this. The initial cluster centres (means) are (2, 10), (5, 8) and (1, 2), chosen randomly. We calculate the distance from the first point (2, 10) to each of the three means using the Manhattan distance function ρ(a, b) = |x2 − x1| + |y2 − y1|:

ρ(point, mean1) = |2 − 2| + |10 − 10| = 0 + 0 = 0
ρ(point, mean2) = |5 − 2| + |8 − 10| = 3 + 2 = 5
ρ(point, mean3) = |1 − 2| + |2 − 10| = 1 + 8 = 9

so (2, 10) is assigned to cluster 1. Repeating this for every point gives:

Point | Coordinates | Dist. to mean 1 (2,10) | Dist. to mean 2 (5,8) | Dist. to mean 3 (1,2) | Cluster
A1 | (2, 10) | 0  | 5  | 9  | 1
A2 | (2, 5)  | 5  | 6  | 4  | 3
A3 | (8, 4)  | 12 | 7  | 9  | 2
A4 | (5, 8)  | 5  | 0  | 10 | 2
A5 | (7, 5)  | 10 | 5  | 9  | 2
A6 | (6, 4)  | 10 | 5  | 7  | 2
A7 | (1, 2)  | 9  | 10 | 0  | 3
A8 | (4, 9)  | 3  | 2  | 10 | 2

Next, we need to re-compute the new cluster centres; we do so by taking the mean of all points in each cluster.

The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data; such objects are often found in data relating to classified images. Furthermore, it requires several passes over the entire dataset, which can make it very expensive for large datasets such as the dataset in our application. The k-medoids approach is more robust in this respect.
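The assignment-and-update loop can be written compactly. The following is a minimal sketch, assuming the Manhattan distance of the worked example above (the paper reports only hand calculations; this code is illustrative). Its first assignment pass over A1-A8 with the initial means (2, 10), (5, 8), (1, 2) reproduces the Cluster column of the table.

```python
# Illustrative K-means with Manhattan distance (as in the worked example).

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def kmeans(points, means, max_iter=100):
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in means]
        for p in points:
            i = min(range(len(means)), key=lambda j: manhattan(p, means[j]))
            clusters[i].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_means = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else m
                     for cl, m in zip(clusters, means)]
        # Step 4: stop once the centroids no longer change.
        if new_means == means:
            return means, clusters
        means = new_means
    return means, clusters

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
means, clusters = kmeans(points, [(2, 10), (5, 8), (1, 2)])
print(means)     # converged cluster centres
print(clusters)  # final one-level partitioning of the points
```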
K-Medoid Algorithm
K-medoid, or the PAM algorithm (Partitioning Around Medoids; Kaufmann and Rousseeuw, 1990), was one of the first k-medoids algorithms introduced. It attempts to determine k partitions for n objects. After an initial random selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids. Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

The PAM algorithm is based on the search for k representative objects, or medoids, among the objects of the dataset. These objects should represent the structure of the data. After finding a set of k medoids, k clusters are constructed by assigning each object to the nearest medoid. The goal is to find k representative objects which minimize the sum of the dissimilarities of the objects to their closest representative object:

E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - O_j|

where E is the sum of the distances for all objects in the data set; p is the point in space representing a given object; and O_j is the representative object (medoid) of cluster C_j (both p and O_j are multidimensional).

Basic K-medoid algorithm for finding K clusters:
1. Select K initial points. These points are the candidate medoids and are intended to be the most central points of their clusters.
2. Consider the effect of replacing one of the selected objects (medoids) with one of the non-selected objects. Conceptually, this is done in the following way: the distance of each non-selected point from the closest candidate medoid is calculated, and this distance is summed over all points. This sum represents the "cost" of the current configuration. All possible swaps of a non-selected point for a selected one are considered, and the cost of each configuration is calculated.
3. Select the configuration with the lowest cost. If this is a new configuration, repeat step 2.

Numerical Experiments
Cluster the given data set of ten objects into two clusters, i.e. k = 2. Consider a data set of ten objects as follows:

Object | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10
x      | 2  | 3  | 3  | 4  | 6  | 6  | 7  | 7  | 8  | 7
y      | 6  | 4  | 8  | 7  | 2  | 4  | 3  | 4  | 5  | 6

[Figure: Distribution of the ten data objects]

Step 1. Initialise the k centres: let us assume c1 = (3, 4) and c2 = (7, 4), so c1 and c2 are selected as medoids. Calculate the distances so as to associate each data object with its nearest medoid. The cost is calculated using the Manhattan (Minkowski, order 1) distance metric

cost(x, c) = \sum_{i=1}^{d} |x_i - c_i|

where x is any data object, c is the medoid, and d is the dimension of the objects, which in this case is 2.

Data object | Cost to c1 (3,4) | Cost to c2 (7,4) | Nearest medoid
(2, 6) | 3 | 7 | c1
(3, 8) | 4 | 8 | c1
(4, 7) | 4 | 6 | c1
(6, 2) | 5 | 3 | c2
(6, 4) | 3 | 1 | c2
(7, 3) | 5 | 1 | c2
(8, 5) | 6 | 2 | c2
(7, 6) | 6 | 2 | c2

Then the clusters become:
Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

Since the points (2,6), (3,8) and (4,7) are closer to c1, they form one cluster, whilst the remaining points form the other cluster. The total cost involved is 20.
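The configuration cost just described can be checked with a small Python sketch (illustrative, not the authors' code). It assigns each non-medoid object to its nearest medoid under Manhattan distance and sums the distances, reproducing the hand-computed costs 20 (step 1) and 22 (the swap tried in step 2 below).

```python
# Illustrative PAM-style configuration cost under Manhattan distance.

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    """Sum, over all non-medoid objects, of the distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
        (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

print(total_cost(data, [(3, 4), (7, 4)]))  # 20: initial medoids c1 and c2
print(total_cost(data, [(3, 4), (7, 3)]))  # 22: after swapping c2 for O' = (7, 3)
```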
The total cost is the sum of the costs from each data object to the medoid of its own cluster, so here:

Total cost = { cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7)) }
           + { cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6)) }
           = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20

Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

Step 2. Select a non-medoid O′ at random. Let us assume O′ = (7,3), so the medoids are now c1 = (3,4) and O′ = (7,3). Taking c1 and O′ as the new medoids, the total cost computed with the formula from step 1 is

Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

so the cost of swapping medoid c2 for O′ is

S = current total cost − past total cost = 22 − 20 = 2 > 0.

Since S > 0, moving to O′ would be a bad idea; the previous choice was good, and the algorithm terminates here (i.e. there is no change in the medoids). Note that some data points may shift from one cluster to another depending on their closeness to the medoids.

[Figure: Typical K-medoids algorithm (PAM)]

Artificial Data for Comparison & Handling Outliers
In order to evaluate the performance of the clusterings, artificial data are generated and clustered using K-means clustering and PAM. We generate 120 objects with 2 variables for each of the three classes shown in Fig. 1. For convenience, we call the first group, marked by squares, class A; the second group, marked by circles, class B; and the third group, marked by triangles, class C.

[Fig. 1: Artificial data for comparison]

The data are generated from multivariate normal distributions whose mean vectors and variances (the variance of each variable is assumed equal, and the covariance is zero) are given in Table 1. In order to compare performance when some outliers are present among the objects, we add outliers to class B. The outliers are generated from a multivariate normal distribution with the same mean as class B but larger variance, as shown in Table 1.

Table 1 – Mean and variance used when generating objects

Class              | Mean vector | Variance of each variable
Class A            | (0, 0)      | 1.5^2
Class B            | (6, 2)      | 0.5^2
Class C            | (6, −1)     | 0.5^2
Outliers (Class B) | (6, 2)      | 2^2

The adjusted Rand index, proposed by Hubert and Arabie (1985) and popularly used for comparing clustering results, is used as the performance measure. It is calculated as

ARI = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)}

where
a = the number of pairs of objects placed in the same cluster in both the correct clustering solution and the compared clustering solution;
b = the number of pairs placed in the same cluster in the correct solution but in different clusters in the compared solution;
c = the number of pairs placed in different clusters in the correct solution but in the same cluster in the compared solution;
d = the number of pairs placed in different clusters in both the correct and the compared clustering solutions.
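The index is easy to compute directly from the pair counts just defined. The following sketch is illustrative (not the authors' code); for real use, sklearn.metrics.adjusted_rand_score computes the same quantity.

```python
from itertools import combinations

def adjusted_rand_index(truth, pred):
    """Adjusted Rand index from the pair counts a, b, c, d defined above."""
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            a += 1   # same cluster in both solutions
        elif same_truth:
            b += 1   # together in the correct solution, split in the compared one
        elif same_pred:
            c += 1   # split in the correct solution, together in the compared one
        else:
            d += 1   # split in both solutions
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

# Toy check: a perfect clustering scores 1; the label names do not matter.
print(adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # 1.0
```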
Features of the K-Medoid Algorithm
• It operates on the dissimilarity matrix of the given data set; when it is presented with an n×p data matrix, the algorithm first computes a dissimilarity matrix.
• It is more robust, because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.
• It provides a novel graphical display, the silhouette plot, which allows the user to select the optimal number of clusters.
• However, PAM lacks scalability for very large databases and presents high time and space complexity.

Unsupervised classification, or clustering as it is more often called, is a data mining activity that aims to differentiate groups (classes or clusters) inside a given set of objects, and it is considered the most important unsupervised learning problem. The resulting subsets or groups, distinct and non-empty, are to be built so that the objects within each cluster are more closely related to one another than to objects assigned to different clusters. Central to the clustering process is the notion of degree of similarity (or dissimilarity) between the objects. Let O = {O1, O2, ..., On} be the set of objects to be clustered. The measure used for discriminating objects can be any metric or semi-metric function; the distance expresses the dissimilarity between objects.

The partitioning process is iterative and heuristic; it stops when a "good" partitioning is achieved. Finding a "good" partitioning coincides with optimizing a criterion function defined either locally (on a subset of the objects) or globally (over all of the objects, as in k-means). These algorithms try to minimize certain criteria (a squared-error function in K-Means); the squared-error criterion tends to work well with isolated and compact clusters. In the k-medoids or PAM (Partitioning Around Medoids) algorithm, each cluster is represented by one of the objects in the cluster. It finds representative objects, called medoids, in clusters. The algorithm starts with k initial representative objects (medoids), then iteratively recalculates the clusters (each object is assigned to the closest medoid) and their medoids until convergence is achieved: at a given step, a medoid of a cluster is replaced with a non-medoid if this improves the total distance of the resulting clustering, as sketched in the code below.

It is understood that the average time for the normal distribution is greater than the average time for the uniform distribution; this is true for both algorithms, K-Means and K-Medoids. If the number of data points is small, the K-Means algorithm takes less execution time, but when the number of data points is increased to the maximum, K-Means takes the maximum time and the K-Medoids algorithm performs considerably better. The characteristic feature of this algorithm is that it computes the distance between every pair of objects only once and uses these distances at every stage of the iteration.
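A compact sketch of the PAM replace-if-it-improves loop described above follows (illustrative only; the helpers repeat those from the earlier cost sketch). Unlike the single randomly chosen swap tested in the worked example, it considers every swap of a medoid for a non-medoid, as step 2 of the algorithm requires.

```python
import itertools

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

def pam(points, initial_medoids):
    """Greedy PAM: swap a medoid for a non-medoid while the total cost improves."""
    medoids = list(initial_medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        # Consider every swap of a selected object for a non-selected one.
        for i, candidate in itertools.product(range(len(medoids)), points):
            if candidate in medoids:
                continue
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best:  # keep the configuration with the lower cost
                medoids, best = trial, cost
                improved = True
    return medoids, best

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
        (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
print(pam(data, [(3, 4), (7, 4)]))  # locally optimal medoids and their cost
```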
Conclusion
Since measuring similarity between data objects is simpler than mapping data objects to points in a feature space, these pairwise-similarity-based clustering algorithms can greatly reduce the difficulty of developing clustering-based pattern recognition applications. The advantage of the K-means algorithm is its favourable execution time; its drawback is that the user has to know in advance how many clusters are to be searched for. It is observed that the K-means algorithm is efficient for smaller data sets, while K-medoids seems to perform better for large data sets.

Bibliography
[1] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann, 2006, http://www.cs.sfu.ca/~han/dmbook
[2] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann, 2005, http://www.cs.waikato.ac.nz/~ml/weka/book.html
[3] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, First Edition, Addison Wesley, May 2, 2005.
[4] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003, http://lyle.smu.edu/~mhd/book
[5] Hae-Sang Park, Jong-Seok Lee, Chi-Hyuck Jun, A K-means-like Algorithm for K-Medoids Clustering and Its Performance, Department of Industrial and Management Engineering, POSTECH, South Korea, [email protected], [email protected], [email protected]
[6] T. Velmurugan, T. Santhanam, Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points, Department of Computer Science, D.G. Vaishnav College, Chennai, India.
[7] Best of Both: A Hybridized Centroid-Medoid Clustering Heuristic, [email protected], Michael [email protected], National Institute of Informatics, Japan.
[8] Douglas Roberts, K-Medoids: CUDA Implementation, May 21, 2009.
[9] International Journal of Database Management Systems (IJDMS), Vol. 3, No. 1, February 2011.
[10] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Top 10 Algorithms in Data Mining.
[11] T. Soni Madhulatha, Comparison between K-Means and K-Medoids Clustering Algorithms, International Journal of Advanced Computing, April 2011.
[12] Yixin Chen, Clustering Parallel Data Streams, Department of Computer Science, Washington University, St. Louis, Missouri.
[13] Gabriela Czibula, Grigoreta Sofia Cojocar, Istvan Gergely Czibula, A Partitional Clustering Algorithm for Crosscutting Concerns Identification.
[14] Data Mining and Soft Computing, Research Group on Soft Computing and Information Intelligent Systems (SCI2S), Dept. of Computer Science and A.I., University of Granada, Spain.
[15] Por-Shen Lai, Hsin-Chia Fu, Variance Enhanced K-medoid Clustering, Department of Computer Science, National Chiao-Tung University, Hsin-Chu 300, Taiwan, ROC.