Volume 4, Issue 3, May-June 2014. Available online at www.gpublication.com/jcer. ISSN No.: 2250-2637. ©Genxcellence Publication 2011, All Rights Reserved.

RESEARCH PAPER

Comparative Study of Clustering Techniques

Archana1, Jitendra Kumar2
1,2 Department of Computer Science & Engineering, SET, IFTM University, Moradabad, India
1 [email protected], [email protected]

Abstract
Data clustering is the grouping of data with similar qualities into clusters. In other words, clustering is a division of data into groups, called clusters, such that data in the same group are similar to one another and dissimilar to data in other groups. Cluster analysis aims to categorize a set of patterns into clusters based on similarity. In this paper, various types of clustering techniques are analyzed on a dataset; WEKA is used to compare the clustering algorithms on a set of parameters over sample data.

Keywords
Clustering, Data mining, WEKA

I. INTRODUCTION
Data mining is one of the important approaches for extracting a great deal of information from large amounts of data; it is the search for valuable information in large volumes of data [3]. Clustering is an important data mining technique that places similar data into one cluster and dissimilar data into other clusters [9]. Clustering is an unsupervised learning technique. Data clusters are created to meet specific requirements that cannot be met using any of the predefined categorical levels; one can combine data objects into a temporary group to obtain a data cluster. Clustering techniques can be divided into several categories: partitioning algorithms, hierarchical algorithms, density-based algorithms, farthest-first clustering, filtered clustering, etc.
These clustering algorithms are defined and compared in this paper, with WEKA used as the data mining tool; it provides a better interface compared to other data mining tools. This study uses the different clustering techniques available in the WEKA data mining tool. To compare the clustering algorithms, a dermatology dataset containing 35 attributes and 366 instances is used. The study covers five data mining clustering algorithms: K-Means, hierarchical clustering, density-based clustering, farthest-first clustering, and filtered clustering. These algorithms are run on the dermatology dataset in WEKA and the results are analyzed.

Figure 1: Clustering process (raw data → clustering algorithm → clusters of data)

II. CLUSTERING
Data clustering is considered a remarkable approach for finding similarities in data and grouping similar data into clusters. A cluster is a set of objects which are similar to one another and dissimilar to the objects belonging to other clusters. Cluster analysis is used in a wide variety of fields: business analysis, psychology, social science, biology, statistics, information retrieval, machine learning, data mining [4], etc. Data mining adds to clustering the complications of very large datasets with many attributes of different types [1]. A variety of algorithms have recently emerged that meet these requirements and have been successfully applied to real-life data mining problems.

Please cite this article as: Archana et al., Journal of Current Engineering Research, 4(3), May-June 2014, 7-10.

III. WEKA
WEKA is a product of the University of Waikato (New Zealand) and was first implemented in its modern form in 1997 [5]. WEKA stands for Waikato Environment for Knowledge Analysis. It is one of the most widely used tools for data mining. WEKA is an open-source workbench developed in Java that aims to provide practitioners and researchers with an interface in which various algorithms are incorporated. Users can quickly adapt to the interface due to its simplicity and ease of use [6]. WEKA provides a graphical user interface with many facilities. The GUI Chooser in WEKA consists of four buttons:
1. Explorer: an environment for exploring data with WEKA.
2. Experimenter: an environment for performing experiments and conducting statistical tests between learning schemes.
3. Knowledge Flow: supports essentially the same functions as the Explorer, but with a drag-and-drop interface; one advantage is that it supports incremental learning.
4. Simple CLI: provides a simple command-line interface that allows direct execution of WEKA commands, for operating systems that do not provide their own command line interface [7].

IV. DATA CLUSTERING TECHNIQUES
Clustering is a main task of data mining. The most commonly used clustering algorithms are partitioning, hierarchical, farthest-first, filtered, and density-based algorithms. Partitioning algorithms are of two types, K-Means and K-Medoids; K-Means is the most popular technique for clustering because it exhibits lower complexity in comparison to other algorithms. Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Hierarchical algorithms divide or merge a dataset into a sequence of nested partitions [13], building clusters gradually: every cluster node contains child clusters, and sibling clusters partition the points covered by their common parent. Such an approach allows exploring data at different levels of granularity. Hierarchical clustering is of two types, agglomerative (bottom-up) and divisive (top-down).

A. K-Means Clustering
K-Means is a partitioning method and one of the simplest unsupervised learning methods among all partitioning-based clustering algorithms. The K-Means algorithm organizes objects into k partitions, where each partition represents a cluster [8].
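As an illustration of the partition-and-update idea behind K-Means, the algorithm can be sketched in pure Python. This is an illustrative sketch only, not the WEKA implementation used in this study; the sample points and the value of k below are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Partition 2-D points into k clusters by iterative centroid refinement."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # arbitrarily choose k initial centers
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated hypothetical groups of points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(data, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]: each cluster captures one group
```

Note that, as in the algorithm steps described in the paper, the loop terminates as soon as a full pass produces no change in the centroids.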
In K-Means, each cluster's center is represented by the mean value of the objects in the cluster: a centroid is defined for each cluster, and every data object is placed in the cluster whose centroid is nearest to that object [10].

Algorithm
Input: k, the number of clusters to be partitioned; n, the number of objects.
Output: a set of k clusters based on a given similarity function.
Steps:
i) Arbitrarily choose k objects as the initial cluster centers;
ii) Repeat:
   a. (Re)assign each object to the cluster to which the object is most similar, based on the given similarity function;
   b. Update the centroids (cluster means), i.e., calculate the mean value of the objects in each cluster;
iii) Until no change.

Fig. 3: Results of the K-Means algorithm

B. Hierarchical Clustering
i. Agglomerative (bottom-up): Agglomerative hierarchical clustering is a bottom-up method in which clusters have sub-clusters, which in turn have sub-clusters, and so on. It starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all objects are in a single cluster or a certain termination condition is satisfied. The single cluster becomes the hierarchy's root. In each merging step, the two clusters that are closest to each other are found and combined into one cluster [8].
ii. Divisive (top-down): A top-down clustering method, less commonly used. It works similarly to agglomerative clustering but in the opposite direction: it starts with a single cluster containing all objects and then successively splits the resulting clusters until only clusters of individual objects remain [8].

Fig. 4: Results of hierarchical clustering

C. Density-Based Clustering
In density-based clustering, clusters are defined as areas of higher density than the rest of the data set. Objects in the sparse areas that are required to separate clusters are considered to be noise or border points [11]. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold [13]. Representative algorithms include DBSCAN, GDBSCAN, and OPTICS.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
This algorithm was proposed by Ester in 1996. In DBSCAN, a cluster is defined by the set of all points connected to their neighbors. Eps and MinPts are the two parameters of DBSCAN. The basic idea of the DBSCAN algorithm is that, for each object of a cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of objects (MinPts) [11].

Procedure of the DBSCAN algorithm:
1. Arbitrarily select a point r.
2. Retrieve all points density-reachable from r w.r.t. Eps and MinPts.
3. If r is a core point, a cluster is formed.
4. If r is a border point, no points are density-reachable from r, and DBSCAN visits the next point of the database.
5. Continue the process until all points have been processed [12].

Fig. 5: Results of density-based clustering

D. Farthest-First Clustering
Farthest-first is a variant of K-Means that places each cluster center in turn at the point furthest from the existing cluster centers [14]. This point must lie within the data region. This greatly speeds up clustering in most cases, since less reassignment and adjustment is needed.

Fig. 6: Results of farthest-first clustering

E. Filtered Clustering
The filtered algorithm is used for filtering information or patterns. The user supplies a sample set of appropriate information; as new information arrives, it is compared against the filtering profile, and the information matching the keywords is presented to the user.

Fig. 7: Results of filtered clustering
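The DBSCAN procedure described above can be sketched in pure Python. This is an illustrative implementation, not WEKA's; the sample points and the eps and min_pts values are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    labels = {p: None for p in points}
    cluster_id = 0
    for r in points:
        if labels[r] is not None:
            continue
        neighbors = [q for q in points if math.dist(r, q) <= eps]
        if len(neighbors) < min_pts:  # not a core point: tentatively noise
            labels[r] = -1
            continue
        # r is a core point: grow a new cluster from its neighborhood.
        labels[r] = cluster_id
        frontier = [q for q in neighbors if q != r]
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:  # former noise reclassified as border point
                labels[q] = cluster_id
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neighbors = [s for s in points if math.dist(q, s) <= eps]
            if len(q_neighbors) >= min_pts:  # q is also core: expand further
                frontier.extend(s for s in q_neighbors if labels[s] is None)
        cluster_id += 1
    return labels

# Two dense hypothetical groups plus one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9), (9, 10), (10, 9), (10, 10), (5, 5)]
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels[(5, 5)])  # -1: the isolated point has too few neighbors, so it is noise
```

Border points (reachable from a core point but not core themselves) join the cluster without being expanded, which matches step 4 of the procedure above.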
V. COMPARISON
The preceding section introduced each of the five techniques studied. Each was run in the WEKA clustering tool on a dermatology dataset consisting of 35 attributes and 366 records; a comparison of these clustering algorithms using WEKA is given in Table 1.

VI. CONCLUSION
In this work we studied various clustering algorithms on the dermatology data. Clustering is the process of grouping data into clusters; objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity, where dissimilarity is based on the attribute values describing the objects. We performed five experiments on the dermatology database. From the results of the clustering algorithms we find that the performance of the K-Means algorithm is better than that of the hierarchical clustering algorithm. From Table 1 it is observed that, using the WEKA tool, K-Means and the density-based algorithm give similar results, while the hierarchical algorithm takes more time than K-Means and density-based clustering. Farthest-first clustering gives the best results in the least time compared to the other algorithms used in our study. As future work, one can take other algorithms, or even devise new algorithms, and compare the results with the existing ones.
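The "within-cluster sum of squared errors" reported in Table 1 is the sum, over all points, of the squared distance from each point to its cluster's centroid. A minimal sketch of this metric (with hypothetical data, not WEKA's code):

```python
import math

def within_cluster_sse(clusters, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(
        math.dist(p, c) ** 2
        for cluster, c in zip(clusters, centroids)
        for p in cluster
    )

# Hypothetical clustering: two clusters of two 2-D points each.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
centroids = [(1.0, 0.0), (11.0, 0.0)]
print(within_cluster_sse(clusters, centroids))  # 4.0: each point is 1 away from its centroid
```

A lower value indicates tighter clusters, which is why the metric is useful for comparing partitioning algorithms such as K-Means and filtered clustering in Table 1.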
Table 1: Comparison of clustering algorithms

Algorithm        Clusters  Cluster instances (0 / 1)  Iterations  Log likelihood  Within-cluster SSE   Time to build model
K-Means          2         252 (69%) / 114 (31%)      4           -               3519.6616534260493   0.11 seconds
Hierarchical     2         365 (100%) / 1 (0%)        -           -               -                    0.53 seconds
Density-based    2         254 (69%) / 112 (31%)      4           -28.94463       3519.6616534260493   0.11 seconds
Farthest-first   2         277 (76%) / 89 (24%)       -           -               -                    0.02 seconds
Filtered         2         252 (69%) / 114 (31%)      4           -               3519.6616534260493   0.05 seconds

REFERENCES
[1] Pavel Berkhin, "Survey of clustering data mining techniques", technical report, 2002.
[2] Ankit Bhardwaj, Arvind Sharma, V.K. Shrivastava, "Data Mining Techniques and Their Implementation in Blood Bank Sector - A Review", IJERA, 2012.
[3] Sang Jun Lee and Keng Siau, "A review of data mining techniques", Industrial Management & Data Systems, 2001.
[4] Arpit Gupta, Ankit Gupta, Amit Mishra, "Research Paper on Cluster Techniques of Data Variations", IJATER, 2011.
[5] Narendra Sharma, Aman Bajpai, Ratnesh Litoriya, "Comparison the various clustering algorithms of weka tools", IJETAE, Vol. 2, Issue 5, May 2012.
[6] Namita Bhan, Deepti Mehrotra, "Comparative study of EM and K-means clustering techniques in weka interface", IJATER, 2013.
[7] Bharat Chandhari, Manan Parikh, "A Comparative study of clustering algorithms using weka tools", IJAIEM, Vol. 1, Issue 2, 2012.
[8] Aastha Joshi, Rajneet Kaur, "A Review: Comparative Study of Various Clustering Techniques in Data Mining", IJARCSSE, 2013.
[9] Manish Verma, Mauly Srivastava, Neha Chack, Atul, Nidhi Gupta, "A comparative study of various clustering algorithms in data mining", IJERA, 2012.
[10] Shalini S. Singh and N. C. Chauhan, "K-Means v/s K-Medoids: A Comparative Study", National Conference on Recent Trends in Engineering & Technology, 2011.
[11] Amandeep Kaur Mann and Navneet Kaur, "Survey paper on clustering techniques", IJSETR, 2013.
[12] Prabhdip Kaur, Shruti Aggrwal, "Comparative Study of Clustering Techniques", IJARET, Vol. 1, Issue III, April 2013.
[13] Shraddha K. Popat and Emmanuel M., "Review and Comparative Study of Clustering Techniques", IJCSIT, 2014.
[14] Vishal Shrivastava and Prem Narayan Arya, "Performance Analysis on Retail Sales Data from Various Clustering Algorithms Using Weka Tool", IJCITB, 2013.