International Journal of Electronic Business Management, Vol. 3, No. 3, pp. 202-208 (2005)

A TWO-STAGED CLUSTERING ALGORITHM FOR MULTIPLE SCALES

Chien-Lung Chan* and Rung-Ting Chien
Department of Information Management, Yuan Ze University, Chung-Li (320), Taiwan
* Corresponding author: [email protected]

ABSTRACT

Cluster analysis is a data mining technique used to identify hidden patterns within data. Most clustering algorithms treat different fields of data with equal weights and calculate the "distance" using the same method. They ignore the fact that different fields of data have different scales; therefore, the "distance" should be calculated differently. This study incorporated a traditional clustering algorithm with expert subjective judgment, and used different methods to calculate the degree of similarity for four different scales: nominal, ordinal, interval and ratio. This study proposes a two-staged clustering algorithm to improve the process. In the first stage, training data was used to determine the parameters that improved clustering quality. In the second stage, different methods were used to calculate the degree of similarity for the four scales of data, and different fields were treated with unequal weights. To evaluate the outcomes of the proposed clustering method, four standard data sets were used for testing: the Wisconsin Breast Cancer Data, Contraceptive Method Choice Data, Iris Education Data, and Balance Scale Weight & Distance Data. The results were positive: the algorithm using multiple scales produced better-quality clusters, and the algorithm incorporating expert subjective weighting clustered with better accuracy.

Keywords: Data Mining, Clustering Algorithm, Multi-scales Analysis, Expert Weight

1. INTRODUCTION

Clustering is a method of grouping objects into clusters according to their similarity [2]. A cluster is a set of like objects; objects from different clusters are not alike. This method helps to discover important attributes within the same cluster in large datasets. Objects can be separated according to their attributes, and objects in the same cluster share common attributes. However, previous research has noted the difficulty of interpreting the outcome of clustering [7,8], and researchers have applied the same clustering algorithm to datasets for quite different purposes.

This study evaluated how expert subjective judgment affects the quality of clustering, and investigated whether expert subjective weightings influence clustering quality. Experiments were designed to compare the quality of clustering between equally weighted and differently weighted attributes. Most clustering algorithms use the same method to calculate the "distance" between objects regardless of their scales (nominal, ordinal, interval and ratio). In this study, the values of different scales were treated differently, and experiments were designed to compare the quality of clustering between traditional clustering algorithms and the algorithms used in this study.

2. CLUSTERING ALGORITHM

Knowledge discovery is a process that extracts potentially interesting and previously unknown information from large amounts of data [3,4]; the data in a large database may contain hidden patterns. According to Fayyad's (1996) studies, there are six steps in the process of knowledge discovery:
1. Learning the application domain
2. Creating a target data set
3. Data cleaning and preprocessing
4. Data mining
5. Result interpretation
6. Applying the discovered knowledge
Data mining is one step in knowledge discovery, and clustering is one method of data mining. It is used to segment data into different clusters so that objects within the same cluster have similar attributes. Clustering is an unsupervised classification method for grouping similar objects into the same cluster; the objective is to infer common characteristics for the objects in each cluster. A good clustering method produces quality clusters, meaning high intra-class similarity and low inter-class similarity. The quality of a clustering method is also measured by its ability to discover hidden patterns [1].

There are two kinds of clustering methods: hierarchical and partitioning. This study used the k-means method (one of the popular partitioning methods) as the experimental method, because it produces higher-quality clusters than hierarchical methods. The k-means clustering method is an algorithm for clustering n data points into k subsets so as to minimize a cost function (usually expressed as the sum-of-squares error, SSE). It comprises a simple re-estimation procedure as follows [9]:
1. Initialize by assigning the data points at random to the k sets.
2. Compute the centroid for each set.
3. Reassign each data point to the set with the nearest centroid.
Iterate steps 2 and 3 until a stopping criterion is met; the objective is to minimize the cost function

    F = \sum_{i=1}^{k} \sum_{j=1}^{n} U_{ij} \, d_{ij}    (1)

    U_{ij} = \begin{cases} 1, & \text{if } \|x_j - m_i\| \le \|x_j - m_l\| \text{ for all } l \ne i \\ 0, & \text{otherwise} \end{cases}    (2)

where
    d_{ij}: the distance between object j and the centroid of group i
    x_j: object j
    m_i: the center of cluster i
    U_{ij}: the indicator that object j belongs to cluster i

In equation (1), d_{ij} is the distance between object j and the centroid of group i; it is the most important factor for clustering. The similarity between two objects is a measure of how closely they resemble each other, and dissimilarity is the opposite concept, measured by the distance between two objects. The most popular distance is the Euclidean distance, which was used in this study to calculate similarity. The weakness of the k-means method is that it treats all kinds of scales the same and uses the same algorithm to calculate the distance between any two objects. Consequently, this study improved the k-means algorithm by treating different scales differently and calculating the distance with different methods.
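For concreteness, the baseline procedure above can be sketched as follows. This is a minimal illustration assuming purely numeric attributes and Euclidean distance, not the study's Matlab implementation; all names are illustrative.

```python
import math
import random

def euclidean(a, b):
    # d_ij in equation (1): Euclidean distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iter=100):
    # Step 1: initialize by assigning the data points at random to the k sets
    labels = [random.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the centroid of each set (mean of its members)
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            centroids.append([sum(dim) / len(members) for dim in zip(*members)]
                             if members else list(random.choice(points)))
        # Step 3: reassign each point to its nearest centroid (equation (2))
        new_labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                      for p in points]
        # Stop when no assignment changes, i.e. the cost F no longer decreases
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids
```

On standardized numeric data, `kmeans(data, 3)` would reproduce the baseline behavior against which the multi-scale and weighted variants described next are compared.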
3. TWO-STAGED CLUSTERING

A traditional k-means clustering algorithm is designed to find clusters under the assumption that all data attributes are numeric, so that numeric distances can be calculated. Researchers have tried to relax this assumption to come closer to real data. Instead of calculating a numeric distance, Huang (1998) counted the total mismatches when clustering categorical data [5,6], and Ralambondrainy (1995) transformed categorical data into binary codes to calculate the distance [9].

This study used a two-staged clustering algorithm and treated multiple-scale data differently. In the first stage, training data was drawn randomly from the database to find the cluster parameters. Distances for different scales were calculated using different methods, and the training data was clustered with equal weights. The domain expert then reviewed the outcome of the clustering, and discriminant analysis was used to determine the weights. In the second stage, the parameters derived from the first stage were used to cluster all of the data. The two-staged clustering algorithm is illustrated in Figure 1.

[Figure 1: Two-staged clustering algorithm]

In this study, we tried four different methods to cluster the standard data sets. The first method is traditional k-means: the distance between two objects was calculated from the numeric values with equal weights. The second method is k-means with different weights: the distance between two objects was calculated from the numeric values with unequal weights. The third method calculates the distance between two objects by treating different types of scale differently. The fourth method uses unequal weights and the multi-scale calculation simultaneously.

When the multi-scale method is applied to cluster data, the scale of each attribute must first be identified. For interval and ratio scales, the distance can be calculated directly from the numeric values. For the nominal scale, the concept of "similarity" was applied to calculate distance: if two objects have the same value on a nominal attribute, their distance is 0; otherwise, their distance is 1. For the ordinal scale, the original value is first transformed to a new value, (value - min value) / (max value - min value), which represents the object's location on the scale between 0 and 1; the distance between two objects is then calculated from these transformed values.
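To illustrate the per-scale rules just described, here is a minimal sketch of a weighted mixed-scale distance. The function name, the weight vector (in the study, weights would come from the first-stage discriminant analysis), and the example attribute ranges are assumptions for illustration only.

```python
def mixed_distance(a, b, scales, weights, ranges):
    """Weighted distance between two records with mixed attribute scales.

    scales:  per-attribute scale, one of 'nominal', 'ordinal', 'numeric'
    weights: per-attribute weights (hypothetically from discriminant analysis)
    ranges:  per-attribute (min, max), used to map ordinal values into [0, 1]
    """
    total = 0.0
    for x, y, scale, w, (lo, hi) in zip(a, b, scales, weights, ranges):
        if scale == 'nominal':
            d = 0.0 if x == y else 1.0                 # match/mismatch rule
        elif scale == 'ordinal':
            # transform each rank into [0, 1] before differencing
            d = abs((x - lo) / (hi - lo) - (y - lo) / (hi - lo))
        else:                                          # interval or ratio scale
            d = abs(x - y)
        total += w * d ** 2
    return total ** 0.5

# Hypothetical records: (wife's age, wife's education rank, religion)
r1, r2 = (24, 2, 1), (45, 3, 0)
print(mixed_distance(r1, r2, ('numeric', 'ordinal', 'nominal'),
                     (1.0, 1.0, 1.0), ((16, 49), (1, 4), (0, 1))))
```

Note that with equal weights the unstandardized numeric attribute dominates the distance, which is consistent with the study's choice to standardize numeric attributes before clustering (Section 5.2).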
In this study, the expert's role is to confirm the accuracy of the clustering results in the first stage, a concept very similar to "training" in data mining. In this stage, the expert makes a subjective judgment as to whether each object has been assigned to the cluster it should belong to; if not, the expert can exclude the "misclassified" object. After the expert's confirmation, discriminant analysis is applied to find the weights of the attributes. This is why we call it the expert's subjective weighting process. The expert's involvement is very important to the proposed two-staged clustering method; however, it is difficult to find domain experts for our data sets. Therefore, instead of recruiting real domain experts, this study chose a satisfactory alternative by referencing the standard data sets and their associated journal papers. In this way, the clustering result of the first stage can be checked against the standard data sets, which record the real values of every attribute, including the real cluster each object belongs to. After eliminating the "misclassified" cases by comparison with the standard data sets, we calculated the weights of the attributes that determine the clustering. For example, the second data set records the contraceptive method each subject finally chose, so we can confirm the result of clustering without consulting real domain experts.

4. EXPERIMENT DESIGN

To verify the effectiveness of the proposed algorithm, the results of the two-staged algorithm were compared with traditional k-means on four standard data sets: the Wisconsin Breast Cancer Data, Contraceptive Method Choice Data, Iris Education Data, and Balance Scale Weight & Distance Data. The two-staged clustering algorithm was programmed in Matlab 6.0, and the test environment was Windows XP on a PC (Pentium III 866 MHz with 256 MB SDRAM). The details of the four data sets are presented in the appendix.

To compare the quality of the three algorithms on the four data sets, experiments were designed as illustrated in Table 1.

Table 1: A comparison of three algorithms with four data sets

  Methods \ Data                            | Data Set 1 | Data Set 2 | Data Set 3 | Data Set 4
  ------------------------------------------|------------|------------|------------|-----------
  K-means                                   |     X      |     X      |     X      |     X
  Two-staged algorithm with multi-scales    |     X      |     X      |     X      |     X
  Two-staged algorithm with different weights |   X      |     X      |     X      |     X

The detailed descriptions of the four data sets are as follows:

Wisconsin breast cancer data
This data set is a breast cancer database obtained from the University of Wisconsin Hospital in Madison, Wisconsin. Dr. William H. Wolberg (1992) constructed the database; samples arrived periodically as Dr. Wolberg reported his clinical cases, so the database reflects this chronological grouping of the data. Every record has numeric attributes (Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, and Mitoses) and a class attribute (benign or malignant).

Contraceptive method choice data
This data set is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know whether they were at the time of the interview. The problem is to predict a woman's current contraceptive method choice (no use, long-term methods, or short-term methods) based on her demographic and socio-economic characteristics. There are 1473 instances with multi-type attributes (numeric, nominal and ordinal), including the wife's age, wife's education, husband's education, number of children born, wife's religion, wife's working status, husband's occupation, standard-of-living index, and media exposure.

Iris education data
This data set concerns educational transitions for a sample of 500 Irish schoolchildren aged 11 in 1967. The data were collected by Greaney and Kelleghan (1984) and reanalyzed by Raftery and Hout (1985, 1993). There are 441 instances with multi-type attributes (numeric, nominal and ordinal), including sex, DVRT (Drumcondra Verbal Reasoning Test score), educational level attained, Leaving Certificate, prestige score for father's occupation, and a class attribute (type of school).

Balance scale weight & distance data
This data set was generated to model the psychological experiments reported by Siegler (1976). Each example is classified as having the balance scale tip to the right, tip to the left, or remain balanced. There are 625 examples with four numeric attributes (the left weight, the left distance, the right weight, and the right distance).

There were three kinds of scales across the four data sets: only numeric attributes (data set 1), multi-scale attributes (data sets 2 and 3), and only non-numeric attributes (data set 4). The four data sets were clustered using four algorithms (k-means, k-means with weights, multi-scale clustering, and multi-scale clustering with weights) to compare the influence of multiple scales and weights. The following four criteria evaluated the quality of the clustering algorithms.
(1) Accuracy of grouping:

    F = \frac{1}{n} \sum_{i=1}^{n} U_i, \quad U_i = \begin{cases} 1, & \text{if } f_i = r_i \\ 0, & \text{otherwise} \end{cases}    (3)

    f_i: the cluster of object i determined by the algorithm
    r_i: the real class of object i

(2) The difference between groups. Total difference between groups:

    T = \sum_{i=1}^{n} \sum_{j=i+1}^{n} P(i,j)\,Q(i,j)    (4)

Average difference between groups:

    \Gamma = \frac{2T}{n(n-1)}    (5)

    P(i,j): the distance between objects i and j
    Q(i,j): the distance between clusters C_i and C_j, where object i belongs to cluster C_i and object j belongs to cluster C_j; when objects i and j belong to the same cluster, Q(i,j) = 0.

The total difference between groups is the sum of the products P(i,j)Q(i,j). When object i is very different from object j, a better-quality clustering algorithm makes both P(i,j) and Q(i,j) larger, so the total difference between groups increases. Therefore \Gamma can be a criterion to evaluate the quality of a clustering algorithm.

(3) The difference within groups. Total difference within groups:

    D_{within} = \sum_{i=1}^{n} \sum_{j=1}^{k} \|X_i - C_j\|\,U_{ij}, \quad U_{ij} = \begin{cases} 1, & \text{if object } i \text{ belongs to cluster } j \\ 0, & \text{otherwise} \end{cases}    (6)

Average difference within groups:

    \bar{D}_{within} = \frac{D_{within}}{n}    (7)

    X_i: data point i
    C_j: the centroid of cluster j

(4) The distance between groups' centers. Total distance between groups' centers:

    D_c = \sum_{i=1}^{k} \sum_{j=i+1}^{k} \|C_i - C_j\|    (8)

Average distance between groups' centers:

    D_{c,avg} = \frac{2 D_c}{k(k-1)}    (9)

A better-quality clustering algorithm will increase the total distance between groups' centers (D_c).
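These criteria are straightforward to compute. The sketch below assumes cluster labels have already been matched to the real classes (in practice each cluster would first be mapped to, say, its majority class) and uses Euclidean distance throughout; all names are illustrative.

```python
import math
from itertools import combinations

def accuracy(assigned, real):
    # F in equation (3): fraction of objects whose cluster matches the real class
    return sum(f == r for f, r in zip(assigned, real)) / len(real)

def avg_within_group(points, labels, centroids):
    # Equations (6)-(7): mean distance from each object to its own cluster centroid
    total = sum(math.dist(p, centroids[l]) for p, l in zip(points, labels))
    return total / len(points)

def avg_center_distance(centroids):
    # Equations (8)-(9): mean pairwise distance between the k cluster centers
    pairs = list(combinations(centroids, 2))
    return sum(math.dist(c1, c2) for c1, c2 in pairs) / len(pairs)
```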
5. RESULTS

5.1 Data Set with All Numeric Attributes

For the data set that had only numeric attributes, traditional k-means created closer clusters, because k-means with equal weights does not change the inherent structure of the data. K-means with unequal weights, however, increases the importance of some attributes. For example, suppose there are three objects A (5, 2), B (5, 3) and C (7, 2) in the data set. The k-means algorithm groups A and B into the same cluster because of their similarity, and places C into another cluster; but after weighting the second attribute, A and C are placed into the same cluster. In this case, the average difference within groups and the average difference between groups for k-means with unequal weights were greater than those for k-means with equal weights. Therefore, k-means with unequal weights changes the similarity of objects within the same cluster and between groups. As far as the accuracy of classification is concerned, k-means with unequal weights is more accurate than k-means with equal weights (Table 2), because weighting strengthens the influence of the significant attributes.

Table 2: A comparison of clustering quality between traditional k-means and k-means with expert's weight for Wisconsin Breast Cancer Data (n=699)

  Run     | Criterion              | K-means | K-means with expert's weight
  --------|------------------------|---------|-----------------------------
  1       | H-BOV*                 | 1.1201  | 1.1032
          | M-BOV&                 | 1.1337  | 1.1521
          | Accuracy of clustering | 94.13%  | 94.85%
          | Center distance        | 4.1510  | 4.1046
  2       | H-BOV                  | 1.1201  | 1.0894
          | M-BOV                  | 1.1337  | 1.1635
          | Accuracy of clustering | 94.13%  | 94.71%
          | Center distance        | 4.1510  | 4.0768
  3       | H-BOV                  | 1.1201  | 1.0706
          | M-BOV                  | 1.1337  | 1.1815
          | Accuracy of clustering | 94.13%  | 94.28%
          | Center distance        | 4.1510  | 4.0433
  Average | H-BOV                  | 1.1201  | 1.0877
          | M-BOV                  | 1.1337  | 1.1657
          | Accuracy of clustering | 94.13%  | 94.61%
          | Center distance        | 4.151   | 4.0749

* H-BOV is the average difference between groups (\Gamma, equation (5)); the larger the better.
& M-BOV is the average difference within groups (\bar{D}_{within}, equation (7)); the smaller the better.
Accuracy of clustering is the clustering accuracy F (equation (3)); center distance is the average distance between groups' centers D_{c,avg} (equation (9)).

5.2 Data Sets with Multiple Scales

Data set 2 was the Contraceptive Method Choice Data. To be consistent, all numeric attributes were preprocessed and standardized before clustering these data. In data set 3, not only were the numeric attributes standardized, but dummy variables were also used for the nominal attributes.

Table 3: A comparison of clustering quality between traditional k-means and the multi-scale clustering method for Contraceptive Method Choice Data (n=1473) and Iris Education Data (n=441)

  Run     | Criterion | CMC: K* | CMC: MSC& | Iris: K | Iris: MSC
  --------|-----------|---------|-----------|---------|----------
  1       | H-BOV     | 4.4679  | 5.3816    | 1.7291  | 2.3339
          | M-BOV     | 3.0658  | 2.4035    | 1.8816  | 1.6298
  2       | H-BOV     | 4.4576  | 4.9506    | 1.8304  | 2.4090
          | M-BOV     | 3.0113  | 2.6766    | 2.0620  | 1.8204
  3       | H-BOV     | 4.4846  | 5.9258    | 1.8087  | 2.3928
          | M-BOV     | 3.0008  | 2.5033    | 1.9415  | 1.6962
  Average | H-BOV     | 4.4700  | 5.4193    | 1.7894  | 2.3786
          | M-BOV     | 3.0259  | 2.5278    | 1.9617  | 1.7155

* K is the k-means algorithm; & MSC is the multi-scale clustering method.
H-BOV is the average difference between groups (\Gamma); M-BOV is the average difference within groups (\bar{D}_{within}).

Table 4: A comparison of clustering quality between k-means with weights and the multi-scale clustering method with weights for Contraceptive Method Choice Data (n=1473) and Iris Education Data (n=441)

  Run     | Criterion              | CMC: KW* | CMC: MSCW& | Iris: KW | Iris: MSCW
  --------|------------------------|----------|------------|----------|-----------
  1       | Accuracy of clustering | 39.44%   | 40.19%     | 65.31%   | 70.75%
          | Center distance        | 3.0019   | 3.013      | 2.8922   | 4.1068
  2       | Accuracy of clustering | 38.09%   | 40.60%     | 65.53%   | 67.35%
          | Center distance        | 3.6976   | 3.4158     | 3.4781   | 1.5864
  3       | Accuracy of clustering | 43.65%   | 43.25%     | 71.66%   | 71.66%
          | Center distance        | 3.9246   | 4.8624     | 3.708    | 4.6272
  Average | Accuracy of clustering | 40.39%   | 41.34%     | 67.35%   | 70.07%
          | Center distance        | 3.5414   | 3.7637     | 3.3594   | 3.4401

* KW is k-means with weights; & MSCW is the multi-scale clustering method with weights.
Accuracy of clustering is the clustering accuracy F (equation (3)); center distance is the average distance between groups' centers D_{c,avg} (equation (9)).

After analyzing the data, it was found that k-means with the multi-scale calculation causes similar objects to group together and dissimilar objects to separate, because different methods were used to calculate distances for different scales. For the nominal scale, match and mismatch were used to calculate similarity: when the two values of a nominal attribute from two records were the same, the distance was 0; otherwise, the distance was 1. The mode was used to find the center of the cluster. For the ordinal scale, the attribute was transformed into a value between 0 and 1, and the center of the cluster was represented by the median rather than the mean. The results show that the multi-scale calculation had a smaller average difference within groups and a larger average difference between groups. For the accuracy of clustering, the multi-scale calculation with weights was more accurate than k-means with weights, and its center distance was also larger.
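The scale-specific center rules used above (mode for nominal attributes, median for ordinal attributes, mean for interval and ratio attributes) could be sketched as follows; the scale labels and example records are hypothetical.

```python
from collections import Counter
from statistics import median, mean

def cluster_center(members, scales):
    # Compute one center value per attribute, respecting that attribute's scale
    center = []
    for values, scale in zip(zip(*members), scales):
        if scale == 'nominal':
            center.append(Counter(values).most_common(1)[0][0])  # mode
        elif scale == 'ordinal':
            center.append(median(values))   # robust to extreme ranks
        else:
            center.append(mean(values))     # interval/ratio scale
    return tuple(center)

# Hypothetical cluster of three records: (religion, education rank, age)
print(cluster_center([(1, 2, 24), (1, 3, 30), (0, 2, 27)],
                     ('nominal', 'ordinal', 'numeric')))  # -> (1, 2, 27)
```

Because the median never extrapolates beyond the observed ranks, median-based centers tend to lie closer together than mean-based ones, which is relevant to the center-distance comparison in the next section.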
5.3 Data Set with All Ordinal Scales

There were four ordinal attributes in the Balance Scale Weight & Distance Data. Traditional clustering algorithms apply a numeric calculation to this kind of data, but in fact ordinal data should use its own method to calculate distances. So k-means and the multi-scale method were compared, just as for the Contraceptive Method Choice Data, and the results (Table 5) follow the same pattern for the same reason.

Table 5: A comparison of clustering quality between traditional k-means and the multi-scale clustering method for Balance Scale Weight & Distance Data (n=625)

  Run     | Criterion | K*     | MSC&
  --------|-----------|--------|--------
  1       | H-BOV     | 9.4262 | 10.0080
          | M-BOV     | 3.9278 | 3.9408
  2       | H-BOV     | 8.1604 | 9.4743
          | M-BOV     | 3.9205 | 3.7904
  3       | H-BOV     | 6.7353 | 8.2132
          | M-BOV     | 4.1727 | 4.0704
  Average | H-BOV     | 8.1073 | 9.2318
          | M-BOV     | 4.007  | 3.9338

* K is the k-means algorithm; & MSC is the multi-scale clustering method.
H-BOV is the average difference between groups (\Gamma); M-BOV is the average difference within groups (\bar{D}_{within}).

In this data set, however, k-means had a larger cluster-center distance than MSC (Table 6). This is because the median was used to determine the cluster center for the ordinal attributes, and the median rarely produces an extreme value; hence the center distance of the multi-scale algorithm is shorter than that of k-means.

Table 6: A comparison of clustering quality between k-means with weights and the multi-scale clustering method with weights for Balance Scale Weight & Distance Data (n=625)

  Run     | Criterion              | KW*    | MSCW&
  --------|------------------------|--------|-------
  1       | Accuracy of clustering | 55.04% | 60.32%
          | Center distance        | 4.4127 | 4
  2       | Accuracy of clustering | 43.04% | 52.80%
          | Center distance        | 4.3425 | 4
  3       | Accuracy of clustering | 44.64% | 57.76%
          | Center distance        | 4.2987 | 2
  Average | Accuracy of clustering | 47.52% | 56.96%
          | Center distance        | 4.3513 | 3.3333

* KW is k-means with weights; & MSCW is the multi-scale clustering method with weights.
Accuracy of clustering is the clustering accuracy F (equation (3)); center distance is the average distance between groups' centers D_{c,avg} (equation (9)).

6. CONCLUSION

A k-means clustering algorithm uses the same method to calculate distances for all kinds of scales. It is a simple and quick way to cluster objects; however, it ignores the inherent meaning of each kind of scale, and the results of clustering are therefore difficult to interpret. In this study, a two-staged clustering algorithm that takes multiple scales into account has been provided. In the designed experiments, the results of the clustering algorithm with multiple scales were more interpretable, and the quality of the clustering, measured by the average difference between groups and the average difference within groups, was better. Furthermore, the clustering algorithm using unequal weights improved the accuracy of clustering and the average distance between groups' centers.

The limitations of this study are as follows:
1. Only four standard data sets were used to compare the performance of the algorithms; more data sets are needed to confirm the findings.
2. Instead of getting insight from domain experts, this study applied the knowledge from published journal papers on each data set as expert weights. In practice, domain experts should be involved during the clustering process.

The future work for this study is as follows:
1. Apply the proposed algorithm to more data sets to verify that the findings are consistent.
2. Combine the algorithm with optimization methods to improve efficiency.
3. Involve domain experts in the weighting process to see whether the quality of clustering can be improved even more.

REFERENCES

1. Berry, M. J. A. and Linoff, G., 1997, Data Mining Techniques for Marketing, Sales and Customer Support, Wiley.
2. Biswas, G., Weinberg, J. and Fisher, D. H., 1992, "ITERATE: A conceptual clustering method for knowledge discovery in databases," Artificial Intelligence in the Petroleum Industry, B. Braunschweig and R. Day, eds.
3. Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P., 1996, "The KDD process for extracting useful knowledge from volumes of data," Communications of the ACM, Vol. 39, No. 11, pp. 27-34.
4. Guape, F. H. and Owrang, M. M., 1995, "Database mining: Discovering new knowledge and competitive advantage," Information Systems Management, Vol. 12, pp. 26-31.
5. Huang, Z., 1997, "A fast clustering algorithm to cluster very large categorical data sets in data mining," Research Issues on Data Mining and Knowledge Discovery.
6. Huang, Z., 1998, "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Mining and Knowledge Discovery, Vol. 2, pp. 283-304.
7. Jain, A. K. and Dubes, R. C., 1988, Algorithms for Clustering Data, Prentice Hall Advanced Reference Series.
8. Kaufman, L. and Rousseeuw, P. J., 1990, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley-Interscience.
9. Ralambondrainy, H., 1995, "A conceptual version of the k-means algorithm," Pattern Recognition Letters, Vol. 16, pp. 1147-1157.

ABOUT THE AUTHORS

Chien-Lung Chan is an Associate Professor and Chairman of the Department of Information Management at Yuan Ze University (YZU), Taiwan, R.O.C. He received his Ph.D. in Industrial Engineering from the University of Wisconsin-Madison in 1995. His current research and teaching interests are in the areas of decision science, decision support systems and healthcare informatics.

Rung-Ting Chien received his master's degree from the Department of Information Management, Yuan Ze University (YZU). His research interests are data mining and decision support.

(Received August 2004, revised October 2004, accepted November 2004)