International Journal of Computer, Mathematical Sciences and Applications, Vol. 5, No. 1-2, January-June 2011, pp. 39-47. © Serials Publications. ISSN: 0973-6786

An Evaluation of Two Clustering Algorithms in Data Mining - a Case Study

P.K. SRIMANI1 AND YETHIRAJ N.G.2
1 Director, R and D, Bangalore. E-mail: [email protected]
2 Department of Computer Science, Maharani's Science College for Women, Bangalore. E-mail: [email protected]

Abstract: Data Mining is fundamentally an applied discipline, and this paper aims at integrating mathematical and computer science concepts through a case study. The paper discusses the testing of two clustering algorithms (K-means and Expectation Maximization) on two datasets. Data Mining is the process of extracting potential information, patterns and trends from large quantities of data, possibly stored in databases. Clustering is alternatively referred to as unsupervised learning or segmentation and can be accomplished by many algorithms. From the experiments it is concluded that EM is the algorithm best suited to the given datasets, since EM calculates the log likelihood for each of the attributes in the datasets. Moreover, the maximum log likelihood gives the best quality of clustering, which predicts the number of burst instances present in the datasets. Experimental results are presented through charts and tables, and the results of the comparison are discussed in detail.

Keywords: Clustering, Algorithms, K-means, EM (Expectation Maximization), Comparison, Datasets.

1. INTRODUCTION

It is interesting to note that progress in digital data acquisition and storage technology has resulted in the growth of huge databases. This has occurred in almost all areas of human endeavor, from the mundane to the more exotic. The science of extracting useful or potential information from large data sets or databases is known as Data Mining. Data Mining is not simply a process but an ongoing voyage of discovery, interpretation and re-investigation. It is a new and exciting discipline lying at the intersection of Statistics, Machine Learning, Data Management, Databases, Pattern Recognition, Artificial Intelligence and other areas [1], [6]. In fact, Data Mining is an interdisciplinary exercise involving Mathematics, Statistics, Database Technology, Machine Learning, Pattern Recognition, Artificial Intelligence and Visualization, and it is difficult to draw a sharp boundary between any two of these disciplines. Therefore, understanding the complexities of Data Mining requires a good grasp of both the mathematical-modeling view and the computational, algorithmic view. The requirement to master two different areas of expertise presents a challenge for students, instructors and researchers. A lack of attention to the statistical, mathematical and algorithmic concepts results in a very poor understanding of Data Mining, and the topic of regression is probably the most mathematically challenging one. Data Mining is fundamentally an applied discipline, and this paper aims at integrating mathematical and computer science concepts through a case study. Data Mining is the process of posing various queries and extracting useful information and hidden patterns from large quantities of data, possibly stored in databases.
Essentially, the goals of data mining for many organizations include improving marketing capabilities, detecting abnormal patterns and predicting the future on the basis of past experience and current trends. Data Mining is an integration of multiple technologies, including data management (such as database management and data warehousing), statistics, machine learning, decision support and others such as visualization and parallel computing. Data mining research is being carried out in various disciplines, and there are various steps in data mining [7]. Traditional database queries access a database using a well-defined query stated in a language such as SQL; the output of the query consists of the data from the database that satisfies the query. Data mining access of a database differs from this traditional access in several ways. Data mining involves many different algorithms to accomplish different tasks. All of these algorithms attempt to fit a model to the data: the algorithms examine the data and determine a model that is closest to the characteristics of the data being examined. Data mining algorithms can be characterized as consisting of three parts, viz. model, preference and search. Some of the basic data mining functions are (i) classification, (ii) regression, (iii) time series analysis, (iv) prediction, (v) clustering, (vi) summarization, (vii) association rules and (viii) sequence discovery.

Clustering: In this study, our attention is focused on two clustering algorithms. Clustering is alternatively referred to as unsupervised learning or segmentation [1]. It can be thought of as partitioning or segmenting the data into groups that might or might not be disjoint. Clustering is usually accomplished by determining the similarities among the data on predefined attributes or parameters; the most similar data are grouped into clusters. In high-dimensional data, clusters can exist in subspaces that hide them from traditional clustering methods. A number of algorithms have been proposed to identify such projected clusters, but most of them rely on user parameters to guide the clustering process. There are many clustering algorithms, and they can be applied in many fields, for instance marketing, biology, libraries, insurance, city planning, earthquake studies and the WWW. Among them, the K-means and EM clustering algorithms are considered here. K-means is an iterative clustering algorithm in which items are moved among sets of clusters until the desired set is reached [2]. As such, it may be viewed as a type of squared-error algorithm [8] (a minimal sketch of the squared-error criterion is given below), although the convergence criterion need not be defined in terms of the squared error. A high degree of similarity among elements within a cluster is obtained, while a high degree of dissimilarity among elements in different clusters is achieved simultaneously. The K-means clustering algorithm can also be implemented in a disk-based manner; this disk-based approach is designed to work inside a relational database management system and can cluster large data sets of very high dimensionality. As discussed earlier, these algorithms are compared on different data sets. The comparison is based on two categories, viz. time and cluster quality, and the algorithm best suited to the given dataset is identified.
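To illustrate the squared-error view of K-means mentioned above, the following is a minimal Python sketch, added here for clarity and not taken from the original paper; the data, labels and function name are hypothetical.

```python
import numpy as np

def within_cluster_sse(points, labels, centroids):
    """Within-cluster sum of squared errors: total squared distance
    from each point to the mean of the cluster it is assigned to."""
    sse = 0.0
    for j, centroid in enumerate(centroids):
        members = points[labels == j]          # points assigned to cluster j
        if len(members) > 0:
            sse += np.sum((members - centroid) ** 2)
    return sse

# Tiny illustrative example: two well-separated groups in 2-D.
points = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([points[labels == j].mean(axis=0) for j in (0, 1)])
print(within_cluster_sse(points, labels, centroids))   # small value => compact clusters
```

K-means repeatedly reassigns points and recomputes the means so that this quantity decreases; a smaller value indicates more compact clusters.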
The main requirements that a clustering algorithm should satisfy are: (1) scalability, (2) the ability to deal with different types of attributes, (3) the discovery of clusters with arbitrary shape, (4) minimal requirements for domain knowledge to determine input parameters, (5) the ability to deal with noise and outliers, (6) insensitivity to the order of input records, (7) the ability to handle high dimensionality and (8) interpretability and usability.

2. STATEMENT OF THE PROBLEM

The assumption here is that the number of clusters to be created is an input value k. The actual content of each cluster Kj, 1 ≤ j ≤ k, is determined as a result of the function definition. Without loss of generality, the result of solving a clustering problem is that a set of clusters K = {K1, K2, . . ., Kk} is created. Given a database D = {t1, t2, . . ., tn} of tuples and an integer value k, the clustering problem is to define a mapping f: D → {1, . . ., k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k. A cluster Kj contains precisely those tuples mapped to it, that is, Kj = {ti | f(ti) = j, 1 ≤ i ≤ n, and ti ∈ D}.

3. CLASSIFICATION OF CLUSTERING ALGORITHMS

Clustering algorithms may be classified as listed below:
• Exclusive Clustering
• Overlapping Clustering
• Hierarchical Clustering
• Probabilistic Clustering

3.1 Maximum Likelihood Estimate (MLE)

Likelihood can be defined as a value proportional to the actual probability of obtaining the given sample under a specific distribution, so the sample provides an estimate of a parameter of the distribution. The higher the likelihood value, the more likely it is that the underlying distribution will produce the results observed. Given a sample set of values X = {x1, . . ., xn} from a known distribution function f(xi | Θ), the MLE can estimate the parameters of the population from which the sample is drawn [5]. The approach obtains parameter estimates that maximize the probability that the sample data occur for the specific model. It considers the joint probability of observing the sample data by multiplying the individual probabilities. The likelihood function L is thus defined as

L(Θ | x1, . . ., xn) = ∏_{i=1}^{n} f(xi | Θ),

where Θ is the parameter to be estimated.

3.2 Expectation Maximization Algorithm (EM)

This is one of the clustering algorithms that solves the estimation problem with incomplete data. The EM algorithm finds an MLE for a parameter (such as a mean) using a two-step process: expectation and maximization. The basic EM algorithm is shown below. An initial set of estimates for the parameters is obtained. Given these estimates and the training data as input, the algorithm then calculates a value for the missing data; for example, it might use the estimated mean to predict a missing value. These data are then used to determine an estimate for the mean that maximizes the likelihood. These steps are applied iteratively until successive parameter estimates converge. Any approach can be used to find the initial parameter estimates. In the algorithm it is assumed that the input database contains actual observed values Xobs = {x1, . . ., xk} as well as values that are missing, Xmiss = {xk+1, . . ., xn}, so that the entire database is X = Xobs ∪ Xmiss. The parameters to be estimated are Θ = {θ1, . . ., θp}. The likelihood function is defined by

L(Θ | X) = ∏_{i=1}^{n} f(xi | Θ).

We are looking for the Θ that maximizes L.
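To make the maximization concrete, the following brief worked note (added here for clarity; it does not appear in the original paper) shows why the log-likelihood, rather than the likelihood itself, is usually maximized; the Gaussian example is purely illustrative.

```latex
% Taking logarithms turns the product of densities into a sum,
% which is easier to differentiate and numerically more stable:
\begin{align}
  \ln L(\Theta \mid X) = \ln \prod_{i=1}^{n} f(x_i \mid \Theta)
                       = \sum_{i=1}^{n} \ln f(x_i \mid \Theta).
\end{align}
% Illustrative example (not from the paper): for a normal density with
% unknown mean \mu and known variance \sigma^2, setting the derivative
% of the log-likelihood to zero gives the sample mean as the MLE:
\begin{align}
  \frac{\partial \ln L}{\partial \mu}
    = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0
  \quad\Longrightarrow\quad
  \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i.
\end{align}
```

Because the logarithm is monotonic, maximizing ln L is equivalent to maximizing L, which leads directly to the condition stated next.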
The MLEs of Θ are the estimates that satisfy ∂ ln L(Θ | X) / ∂θi = 0. The expectation part of the algorithm estimates the missing values using the current estimates of Θ; this can initially be done by finding a weighted average of the observed data. The maximization step then finds the new estimates for the Θ parameters that maximize the likelihood by using those estimates of the missing data. Before invoking the EM model, the following three options are presented: (i) number of clusters, (ii) maximum iterations and (iii) minimum allowable standard deviation. The first option selects the number of clusters to be created for the data. The second option determines the maximum number of times to loop through the algorithm; in general, decreasing the maximum number of iterations results in a less precise clustering with a reduction in time, while increasing it yields a more precise clustering at the cost of more time. The third option determines the minimum standard deviation for each attribute in each cluster; generally, increasing this number yields clusters that encompass more volume in the data set, and decreasing it yields clusters that encompass less volume.

4. ALGORITHM FOR EXPECTATION MAXIMIZATION (EM) AND K-MEANS

EM Algorithm
Input:
    Θ = {θ1, . . ., θp}         // Parameters to be estimated
    Xobs = {x1, . . ., xk}      // Observed input database values
    Xmiss = {xk+1, . . ., xn}   // Missing input database values
Output:
    Θ̂                          // Estimates for Θ [4]
EM algorithm:
    i = 0;
    Obtain initial parameter MLE estimates Θ^i;
    Repeat
        Estimate the missing data, X^i_miss;
        i++;
        Obtain the next parameter estimate Θ^i so as to maximize the likelihood;
    Until the estimates converge;

EM assigns a probability distribution to each instance, which indicates its probability of belonging to each of the clusters. EM can decide how many clusters to create by cross-validation, or the number of clusters to be generated may be specified a priori.

5. K-MEANS CLUSTERING

K-means is an iterative clustering algorithm in which items are moved among sets of clusters until the desired set is reached [3].

Input:
    D = {t1, t2, . . ., tm}     // Set of elements
    k                           // Number of desired clusters
Output:
    K                           // Set of clusters
K-means algorithm:
    Assign initial values for the means m1, . . ., mk;
    Repeat
        Assign each item ti to the cluster which has the closest mean;
        Calculate the new mean for each cluster [4];
    Until the convergence criterion is met;

6. DATA SETS

In this section, two data sets are presented, viz. the Australian and German credit card data sets. The Australian credit data set consists of credit card applications. This dataset is a good mix of attributes: continuous, nominal with a small number of values, and nominal with a larger number of values. There are also a few missing values. The number of instances is 690 and the number of attributes is 14 plus the class attribute; of these, 6 attributes are numerical and 8 are categorical. All the attribute values are numerical. The German credit dataset consists of German credit card details. It has 1000 instances and 20 attributes, of which 7 are numerical and 13 are categorical; some of the attribute values are numerical and some are characters.

7. OUTPUTS AND RESULTS

In this section, the results of the present study are presented in Tables 1 to 5. The data used are the banking (credit card) data sets described above. The two algorithms, viz. K-means and EM, were applied to each of the data sets.
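As an illustration of how such an experiment can be set up, the following is a minimal Python sketch assuming scikit-learn; the library choice, the placeholder matrix X and all variable names are assumptions made for illustration, not the tooling used in the original study (the paper reports only the clustering options and the seed value).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(100)           # seed chosen to mirror the paper's "Seed: 100"
X = rng.normal(size=(1000, 14))            # placeholder for the real attribute matrix

# K-means with k = 2 clusters.
km = KMeans(n_clusters=2, random_state=100, n_init=10).fit(X)
km_sizes = np.bincount(km.labels_)

# EM clustering via a Gaussian mixture with 2 components.
em = GaussianMixture(n_components=2, random_state=100).fit(X)
em_labels = em.predict(X)
em_sizes = np.bincount(em_labels)
avg_log_likelihood = em.score(X)           # mean per-sample log-likelihood

print("K-means cluster sizes:", km_sizes)
print("EM cluster sizes:", em_sizes)
print("EM average log-likelihood:", avg_log_likelihood)
```

With the real attribute matrices in place of the placeholder X, the printed cluster proportions and average log-likelihood correspond to the kinds of quantities tabulated below.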
The summarized output is shown below.

Table 1: Outputs of the EM algorithm on the Australian Data Set (No. of Clusters: 2, Seed: 100)

Sl.  No. of      Prior Probabilities     Clustered Instances     Log-likelihood   Time to build
No.  instances   Clust 0    Clust 1      Clust 0    Clust 1                       the model (s)
1    1000        0.5706     0.4292       57%        43%          -25.33392        01
2    2000        0.5703     0.4297       57%        43%          -25.32674        01
3    3000        0.5703     0.4297       57%        43%          -25.32707        02
4    4000        0.5704     0.4296       57%        43%          -25.32724        01
5    5000        0.5705     0.4295       57%        43%          -25.3313         02
6    6000        0.5705     0.4295       57%        43%          -25.33076        02
7    7000        0.5705     0.4295       57%        43%          -25.33121        03
8    8000        0.5706     0.4294       57%        43%          -25.33224        03
9    9000        0.5706     0.4294       57%        43%          -25.33337        03
10   10000       0.5707     0.4293       57%        43%          -25.33397        05

Table 2: Outputs of the EM algorithm on the German Data Set (No. of Clusters: 2, Seed: 100)

Sl.  No. of      Prior Probabilities     Clustered Instances     Log-likelihood   Time to build
No.  instances   Clust 0    Clust 1      Clust 0    Clust 1                       the model (s)
1    1000        0.3003     0.6997       30%        70%          -21.81065        01
2    2000        0.3003     0.6997       30%        70%          -21.8097         01
3    3000        0.3005     0.6995       30%        70%          -21.80852        01
4    4000        0.3005     0.6995       30%        70%          -21.8087         01
5    5000        0.3004     0.6995       30%        70%          -21.8087         01
6    6000        0.3004     0.6990       30%        70%          -21.80889        02
7    7000        0.3004     0.6996       30%        70%          -21.80895        03
8    8000        0.3004     0.6996       30%        70%          -21.809          03
9    9000        0.3004     0.6996       30%        70%          -21.80903        04
10   10000       0.3004     0.6996       30%        70%          -21.80906        04

Table 3: Outputs of the K-means algorithm on the Australian Data Set (No. of Clusters: 2, Seed: 100)

Sl.  No. of      Clustered instances          Time to build
No.  instances   Cluster 0    Cluster 1      the model (s)
1    1000        44%          56%            00
2    2000        56%          44%            00
3    3000        44%          56%            01
4    4000        44%          56%            01
5    5000        44%          56%            01
6    6000        44%          56%            01
7    7000        32%          69%            02
8    8000        53%          47%            03
9    9000        56%          44%            04
10   10000       56%          44%            04

Table 4: Outputs of the K-means algorithm on the German Data Set

Sl.  No. of      Clustered instances          Time to build
No.  instances   Cluster 0    Cluster 1      the model (s)
1    1000        53%          47%            00
2    2000        53%          47%            00
3    3000        60%          40%            01
4    4000        52%          48%            01
5    5000        62%          38%            01
6    6000        49%          51%            01
7    7000        24%          76%            01
8    8000        50%          50%            02
9    9000        25%          75%            02
10   10000       61%          39%            03

Table 5: Relative comparison of the EM and K-means algorithms

                  Log-likelihood (EM)    EM clustered instances        K-means clustered instances
Sl.  No. of                              Aus          Ger              Aus          Ger
No.  instances    Aus       Ger          C0%   C1%    C0%   C1%        C0%   C1%    C0%   C1%
1    1000         -25.33    -21.81       57    43     30    70         44    56     53    47
2    2000         -25.34    -21.80       57    43     30    70         56    44     53    47
3    3000         -25.37    -21.808      57    43     30    70         44    56     60    40
4    4000         -25.32    -21.808      57    43     30    70         44    56     52    48
5    5000         -25.33    -21.808      57    43     30    70         44    56     62    38
6    6000         -25.33    -21.809      57    43     30    70         44    56     49    51
7    7000         -25.33    -21.80895    57    43     30    70         32    69     24    76
8    8000         -25.24    -21.809      57    43     30    70         53    47     50    50
9    9000         -25.37    -21.81       57    43     30    70         56    44     25    75
10   10000        -25.97    -21.81       57    43     30    70         56    44     61    39

EM calculates the log likelihood, which gives an indication of cluster quality: the higher the log likelihood, the better the clustering. The log likelihood can be defined as a value proportional to the actual probability of obtaining the given sample under a specific distribution.
So the sample gives an estimate of a parameter of the distribution; the higher the likelihood value, the more likely it is that the underlying distribution will produce the results observed. From the graph it is observed that, as the number of instances increases, the log likelihood for the Australian data set remains essentially constant. From the table it is clear that the EM algorithm works better on the German Data Set.

8. COMPARISON OF K-MEANS AND EM

Comparing the clustered instances produced by K-means and EM, it is found that EM is the algorithm best suited to the given data sets, since it relies on probability distributions, with each distribution representing a cluster. EM calculates the log likelihood, which gives an indication of cluster quality; the maximum log likelihood represents the goodness of the clustering. The likelihood computation is simply the multiplication, over the instances, of the sum of the cluster probabilities for each instance. In the case of K-means, the results depend entirely on the value of k, i.e. the number of clusters, and there is no way of knowing in advance how many clusters exist: the same algorithm applied to the same dataset can produce two or three clusters. There is no general theoretical solution for finding the optimal number of clusters for a given data set. A simple approach is to compare the results of multiple runs with different values of k; however, increasing k yields a smaller error function but also an increasing risk of overfitting. By comparing the results, it is found that the EM algorithm works better on these data sets than K-means. Hence, for the clustering of such data, it is preferable to apply the EM algorithm.

REFERENCES

[1] Han J., and Kamber M., (2001). "Data Mining: Concepts and Techniques", Morgan Kaufmann, San Francisco.
[2] Jain A.K., and Dubes R.C., (1998). "Algorithms for Clustering Data", Prentice Hall Advanced Reference Series, Upper Saddle River, N.J.
[3] Ordonez C., and Omiecinski E., (2004). "Efficient Disk-Based K-means Clustering for Relational Databases", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, August 2004, pp. 909-921.
[4] Yip K.Y., and Cheung D.W., (2004). "HARP: A Practical Projected Clustering Algorithm", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, November 2004, pp. 1387-1397.
[5] Dempster A.P., Laird N.M., and Rubin D.B., (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Vol. 39, pp. 1-38.
[6] Cooley R., Mobasher B., and Srivastava J., (1999). "Data Preparation for Mining World Wide Web Browsing Patterns", Knowledge and Information Systems, Vol. 1, pp. 5-32.
[7] Wang J., and NetLibrary Inc., (2006). "Encyclopedia of Data Warehousing and Mining", Idea Group Reference.
[8] Arthur D., and Vassilvitskii S., (2006). "How Slow is the K-means Method?", Proceedings of the 2006 Symposium on Computational Geometry (SoCG).