Advances in Natural and Applied Sciences, 9(6) Special 2015, Pages: 518-524
AENSI Journals, Advances in Natural and Applied Sciences
ISSN: 1995-0772, EISSN: 1998-1090
Journal home page: www.aensiweb.com/ANAS

Modified Rock (MROCK) Algorithm for Clustering Categorical Data

1Dr. K. Saruladha and 2P.V. Likhitha
1Assistant Professor, Pondicherry Engineering College, Puducherry, India.
2Post Graduate Student, Pondicherry Engineering College, Puducherry, India.
Corresponding Author: P.V. Likhitha, Post Graduate Student, Pondicherry Engineering College, Puducherry, India.

ARTICLE INFO
Article history:
Received 12 October 2014
Received in revised form 26 December 2014
Accepted 1 January 2015
Available online 25 February 2015

Keywords: categorical data, clustering, similarity measures.

ABSTRACT
The ROCK algorithm is an agglomerative hierarchical clustering algorithm for clustering categorical data. A new algorithm, MROCK, is proposed which computes the clusters by determining the similarity between the entities of the data in an efficient manner. This leads to a very efficient method of obtaining the clusters and reduces the computing time of the ROCK algorithm. The MROCK algorithm uses a new similarity measure and a new goodness measure for the formation of the clusters.

© 2015 AENSI Publisher. All rights reserved.

To Cite This Article: Dr. K. Saruladha and P.V. Likhitha, Modified Rock (MROCK) Algorithm for Clustering Categorical Data. Adv. in Nat. Appl. Sci., 9(6): 518-524, 2015.

INTRODUCTION

Data mining is a technique for processing data from different perspectives and summarizing it into useful information. The goal of the data mining process is to extract information from a data set and transform it into an understandable format for further use. In data mining, the data is mined using two learning methodologies: supervised learning and unsupervised learning.

Supervised Learning:
Supervised learning is the task of deriving a function from labeled training data. The training data is a set of training examples, each a pair consisting of an input object and an output value. A supervised learning algorithm takes the training data and produces an inferred function that can be used to map new examples. In the optimal scenario, the algorithm correctly determines the class labels for unseen instances. Classification and regression are the approaches under supervised learning.

Unsupervised Learning:
In machine learning, unsupervised learning tries to find hidden structure in unlabeled data. The examples given to the learner are unlabeled; this is what distinguishes unsupervised learning from supervised learning. Unsupervised learning is related to the density estimation problem in statistics, but it also covers many other techniques that aim to summarize and explain key features of the data. Many unsupervised learning methods are based on data mining methods used to preprocess data. Clustering is one of the approaches under unsupervised learning.

1. Related Work:
Clustering algorithms for mining large databases have recently been proposed in (Raymond et al., 1994; Martin Ester et al., 1995; Tian Zhang et al., 1996; Sudipto Guha et al., 1998; Martin Ester et al., 1996). Most of these are variants of either partition clustering (Raymond et al., 1994) or centroid-based hierarchical clustering (Tian Zhang et al., 1996; Sudipto Guha et al., 1998). These algorithms are designed to cluster numeric data and are not suitable for clustering categorical data.
For example, CLARANS (Raymond et al., 1994) incorporates a randomized search to find the k best cluster medoids. BIRCH, proposed in (Tian Zhang et al., 1996), finds the clusters in two steps: first the data is pre-clustered, and then a centroid-based hierarchical algorithm is used to cluster the partial clusters. The CURE algorithm proposed in (Sudipto Guha et al., 1998) combines random sampling and partition clustering to handle large databases; this hierarchical clustering algorithm represents each cluster by a certain number of well-scattered entities and then shrinks them toward the cluster centroid by a specified fraction. DBSCAN, proposed in (Martin Ester et al., 1996), grows clusters by including the dense neighborhoods of points already in the cluster; this approach may be prone to errors if clusters are not well separated. The ROCK clustering algorithm, proposed in (S. Guha et al., 1999), addresses the clustering of categorical data. The classification of all the clustering algorithms is given in Fig 1.

Fig. 1: Classification of Clustering Algorithms. [Tree diagram: data mining methodologies divide into supervised learning (classification, regression) and unsupervised learning (clustering); clustering divides into hierarchical (BIRCH, CURE, Chameleon), partition (K-Means, ROCK, MROCK [proposed]), density based (DBSCAN, DENCLUE, OPTICS), and grid based (STING, CLIQUE).]

2. Clustering:
Clustering is the process of grouping entities with similar properties. Any cluster should exhibit two main properties: low inter-class similarity and high intra-class similarity. Clustering is unsupervised learning, i.e., it learns from observations rather than from examples; no predefined class labels exist for the data points. Cluster analysis is used in many applications such as data analysis and image processing. Clustering algorithms can be divided into the categories illustrated in Fig 1. The ROCK algorithm comes under the partition clustering algorithms, where the complete data set is partitioned into several subsets. Fig 2 represents the original entities of a data set and the entities after partition clustering is performed.

Fig. 2: Representation of original points and clusters after partitioning.

This section outlined the data mining learning approaches, the different types of clustering algorithms, and the algorithms that come under each type. The next section briefly explains the existing ROCK algorithm.

3. Rock Clustering Algorithm:
The ROCK clustering algorithm is used for clustering categorical data. It is mainly based on the following concepts: neighbors, links, and the goodness measure.

Neighbors:
An entity's neighbors are those entities that are considerably similar to it. The similarity is calculated using the Jaccard coefficient:

sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|    Eq (1)

where C1 and C2 are two entities in the data. The Jaccard coefficient depends on the attributes common to the two entities for which the neighbors are calculated. If the similarity value is greater than the fixed threshold, then one entity is considered a neighbor of the other.
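To make the neighbor computation concrete, here is a minimal Python sketch, assuming each entity is represented as the set of its categorical attribute values. The function names and the sample records are illustrative assumptions, not taken from the paper.

# Minimal sketch of the neighbor computation, assuming each entity
# is a set of categorical attribute values (illustrative only).

def jaccard(c1: set, c2: set) -> float:
    """Jaccard coefficient of Eq (1): |C1 ∩ C2| / |C1 ∪ C2|."""
    if not c1 and not c2:
        return 0.0
    return len(c1 & c2) / len(c1 | c2)

def neighbors(entities: list, theta: float) -> list:
    """For each entity, collect the indices of entities whose
    similarity meets or exceeds the fixed threshold theta."""
    nbrs = [[] for _ in entities]
    for i in range(len(entities)):
        for j in range(len(entities)):
            if i != j and jaccard(entities[i], entities[j]) >= theta:
                nbrs[i].append(j)
    return nbrs

# Hypothetical sample records.
records = [{"a", "b", "c"}, {"b", "c", "d"}, {"x", "y"}]
print(jaccard(records[0], records[1]))   # 2 common / 4 total = 0.5
print(neighbors(records, theta=0.5))     # [[1], [0], []]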
Links:
The quantity link[i, j] is defined here in terms of the common attributes between the entities i and j. If the value of link[i, j] is large, it is more probable that entities i and j belong to the same cluster.

Goodness measure:
The goodness measure is used to decide whether to merge two clusters. It is calculated based on the number of links that exist between the two clusters. The goodness measure used in ROCK is

g(Ci, Cj) = link[Ci, Cj] / [ (ni + nj)^(1+2f(Ө)) − ni^(1+2f(Ө)) − nj^(1+2f(Ө)) ]    Eq (2)

where link[Ci, Cj] is the number of cross links, i.e., the number of common attributes between Ci and Cj; ni and nj are the sizes of the clusters Ci and Cj; and f(Ө) = (1 − Ө)/(1 + Ө).

This section explained the existing ROCK clustering algorithm in detail. The next section discusses the proposed MROCK clustering algorithm and its comparison with the existing ROCK algorithm.

4. Proposed MROCK Clustering Algorithm:
The proposed MROCK clustering algorithm is based on the ROCK algorithm (S. Guha et al., 1999), which is extensively used for clustering categorical data. The proposed algorithm proceeds in three steps, as follows.

Step 1: Computation of the intra-cluster similarity
In this step, the similarity between every entity pair of the data set is calculated. The candidate similarity measures for calculating the similarity between two entities are listed below, and the best of them is adopted for the MROCK algorithm:

Modified Sorensen-Dice coefficient: sim(C1, C2) = Eq (3)
Modified Tversky: sim(C1, C2) = Eq (4)
Modified Second Kulczynski: sim(C1, C2) = Eq (5)

where C1 and C2 are two entities in the data. For a sample data set consisting of 20 entities and 5 attributes, the similarity between every entity pair is calculated, and the values are compared with the human judgment (HJ) values, as illustrated in Fig 3.

Fig. 3: Graph representing the similarity measure values using the different similarity measures and HJ values.

The correlation coefficient with the HJ values is calculated for the different similarity measures and depicted in Fig 4. In this comparison, the Modified Tversky measure is found to correlate best with human judgments, with a correlation coefficient of 0.77.

Fig. 4: Graph representing the correlation coefficient for different similarity measures.

Step 2: Computation of the neighbor list
The neighbor list of an entity consists of the entities related to that particular entity. After calculating the intra similarity, if the similarity is greater than or equal to the fixed threshold, the corresponding entity is considered a neighbor. Based on the experimental results, the threshold (Ө) is fixed at 0.555.

Step 3: Calculating the goodness measure
The goodness measure is calculated between two clusters to decide whether to merge them. Initially, every entity is taken as its own cluster, and the clusters are subsequently merged based on the calculated goodness measure. The merging process continues until the links between the entities disappear or until a specified number k of clusters remain.
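The link and goodness-measure computations can be sketched as follows, reusing the neighbors function from the sketch above. This follows the classic ROCK definitions (Guha et al., 1999), in which link[i, j] counts the common neighbors of i and j and f(Ө) = (1 − Ө)/(1 + Ө); since the paper describes links in terms of common attributes and its Eqs (3)-(6) did not survive extraction, treat this as an approximate sketch rather than the exact MROCK measure.

# Sketch of link counting and the goodness measure of Eq (2).
# Assumes classic ROCK semantics: link[i, j] = number of common neighbors.

def link_count(nbrs: list, i: int, j: int) -> int:
    """link[i, j]: number of neighbors shared by entities i and j."""
    return len(set(nbrs[i]) & set(nbrs[j]))

def goodness(links_ij: int, ni: int, nj: int, theta: float) -> float:
    """Goodness measure of Eq (2) for clusters of sizes ni and nj,
    with f(theta) = (1 - theta) / (1 + theta) as in Guha et al. (1999)."""
    f = (1.0 - theta) / (1.0 + theta)
    e = 1.0 + 2.0 * f
    expected = (ni + nj) ** e - ni ** e - nj ** e
    return links_ij / expected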
The proposed goodness measure for merging the two clusters is given below.

g(Ci, Cj) = Eq (6)

where link[Ci, Cj] is the number of cross links between Ci and Cj, ni and nj are the sizes of the clusters Ci and Cj, and f(Ө) is a function of the similarity threshold Ө.

The complete pseudo code for the proposed MROCK clustering algorithm is given in Fig 5.

Algorithm MROCK (D, k, Ө)
Input: set of entities D, number k of clusters to be found, similarity threshold Ө
Output: set of clusters
Begin
  Step 1: Compute the similarity sim(i, j) for each i, j ∈ D using Eq (4)
  Step 2: Compute the neighbor list for each i ∈ D
          // if sim(i, j) ≥ Ө then j is considered a neighbor of i
  Step 3: Compute the links link[i, j] for each i, j ∈ D
          // the number of links between i and j is the number of common attributes of i and j
  Step 4: Compute the goodness measure g(Ci, Cj) using Eq (6)
          // store the goodness measure values in descending order of magnitude
  Step 5: Form the clusters: for each Ci and Cj, merge the clusters
          // until the links disappear or k clusters remain
End

Fig. 5: Pseudo code for the proposed clustering algorithm.

In this section, a detailed description of the proposed MROCK clustering algorithm has been given; a runnable sketch of the overall loop is given below. In the next section, the proposed MROCK clustering algorithm is compared with the existing ROCK algorithm.
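Because Eq (4) and Eq (6) are not reproduced in the text, the following sketch of the Fig 5 loop substitutes the Jaccard similarity and the ROCK goodness measure from the sketches above; it illustrates the control flow of the algorithm, not the exact MROCK formulas. The sample records are hypothetical.

# Agglomerative loop of Fig 5, reusing jaccard, neighbors, link_count and
# goodness from the sketches above. Starts from singleton clusters and merges
# the pair with the highest goodness until k clusters remain or links vanish.

def mrock(entities: list, k: int, theta: float) -> list:
    nbrs = neighbors(entities, theta)
    clusters = [[i] for i in range(len(entities))]
    while len(clusters) > k:
        best, best_pair = 0.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cross links: sum of entity-pair links across the two clusters.
                links = sum(link_count(nbrs, i, j)
                            for i in clusters[a] for j in clusters[b])
                if links == 0:
                    continue
                g = goodness(links, len(clusters[a]), len(clusters[b]), theta)
                if g > best:
                    best, best_pair = g, (a, b)
        if best_pair is None:        # links between clusters have disappeared
            break
        a, b = best_pair
        clusters[a].extend(clusters.pop(b))
    return clusters

# Hypothetical data: three overlapping records and one outlier.
sample = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y", "z"}]
print(mrock(sample, k=2, theta=0.5))   # [[0, 1, 2], [3]]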
4.1 Comparison of the MROCK Algorithm with the ROCK Algorithm:
The comparison between the proposed MROCK clustering algorithm and the existing ROCK clustering algorithm is made by taking entities from a sample data set. Fig 6 and Fig 7 depict the formation of the clusters using ROCK and MROCK respectively.

Fig. 6: Cluster formation using the ROCK algorithm for the synthetic dataset. [Diagram: entities C1-C10 grouped into the clusters produced by ROCK, labeled Cluster 1, Cluster 2, and Cluster 3.]

Fig. 7: Cluster formation using the MROCK algorithm for the synthetic dataset. [Diagram: entities C1-C10 grouped into the two clusters produced by MROCK, labeled Cluster 1 and Cluster 2.]

In this section, the proposed MROCK clustering algorithm and its comparison with the existing ROCK algorithm were discussed. The next section concludes with the results obtained using the synthetic data set and a sample of the Mushroom data set.

5. Conclusion:
The proposed Modified ROCK algorithm was compared with the existing ROCK clustering technique. For the synthetic dataset used for experimental purposes, ROCK forms 4 clusters while MROCK forms 2. For the sample mushroom dataset, ROCK forms 5 clusters while MROCK forms 3. The contributions of this work are:
A new intra similarity measure is proposed for the computation of the intra similarity between the entities in categorical data.
A new goodness measure is proposed to merge the clusters in an effective manner.
Using the above intra similarity measure and goodness measure, the MROCK algorithm is given, which performs better than ROCK.

REFERENCES

Alexander Hinneburg and Daniel A. Keim, 1998. An Efficient Approach to Clustering in Large Multimedia Databases with Noise, American Association for Artificial Intelligence.
George Karypis, Eui-Hong (Sam) Han and Vipin Kumar, 1999. Chameleon: Hierarchical Clustering Using Dynamic Modeling, IEEE Computer, August.
Martin Ester, Hans-Peter Kriegel and Xiaowei Xu, 1995. A Database Interface for Clustering in Large Spatial Databases, in Proceedings of the Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August.
Martin Ester, Hans-Peter Kriegel, Jorg Sander and Xiaowei Xu, 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, in Proceedings of the Conference on Knowledge Discovery in Databases and Data Mining (KDD-96), Portland, Oregon, August.
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jorg Sander, 1999. OPTICS: Ordering Points to Identify the Clustering Structure, in Proceedings of the ACM SIGMOD'99 International Conference on Management of Data, Philadelphia, PA.
Raghunath Kar and Susant Kumar Dash, 2011. A Study on High Dimensional Clustering by Using CLIQUE, International Journal of Computer Science & Informatics, Volume I, Issue II.
Raymond T. Ng and Jiawei Han, 1994. Efficient and Effective Clustering Methods for Spatial Data Mining, in Proceedings of the 20th VLDB Conference, Santiago, Chile, pp: 144-155.
Guha, S., R. Rastogi and K. Shim, 1999. ROCK: A Robust Clustering Algorithm for Categorical Attributes, in Proceedings of the 15th International Conference on Data Engineering, pp: 512-521.
Sudipto Guha, Rajeev Rastogi and Kyuseok Shim, 1998. CURE: A Clustering Algorithm for Large Databases, in Proceedings of the ACM SIGMOD Conference on Management of Data, May.
Tian Zhang, Raghu Ramakrishnan and Miron Livny, 1996. BIRCH: An Efficient Data Clustering Method for Very Large Databases, in Proceedings of the ACM SIGMOD Conference on Management of Data, Montreal, Canada, pp: 103-114.
Wei Wang, Jiong Yang and Richard Muntz, 1997. STING: A Statistical Information Grid Approach to Spatial Data Mining, in Proceedings of the 23rd VLDB Conference, Athens, Greece.