International Journal of Conceptions on Computing and Information Technology Vol. 1, Issue. 1, November 2013; ISSN: 2345 – 9808

Recent advances in clustering algorithms: a review

Manivara Kumar Parsha, Dept. of Electronics & Communication Engineering, Narayana Engineering College, Nellore, AP, India. [email protected]

Sreenivasulu Pacha, Dept. of Electronics & Communication Engineering, PBR Viswodaya Institute of Technology & Science, Kavali, AP, India. [email protected]

Abstract— Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups called clusters. The clustering problem has been addressed in many contexts and by researchers in many disciplines, which reflects its broad applicability in data analysis. This paper presents an overview of clustering algorithms, from fundamental concepts to present-day advancements, with the goal of providing useful advice and references so that the broad community of clustering practitioners can benefit from this review.

Keywords— Clustering algorithms; Machine learning; Gene expression; Information retrieval

I. INTRODUCTION

Clustering is the process of grouping data; the groups are called clusters. Grouping is accomplished by finding similar characteristics in the data itself, with no labels involved, so clustering is a form of unsupervised learning, and a cluster can be defined as a group of like elements.

Clustering has been used in many application domains, such as biology, medicine, anthropology, marketing, psychiatry, psychology, archeology, geology, geography and economics. In any domain, clustering means grouping a given collection of unlabeled data, patterns or feature vectors into meaningful clusters.

II. CLUSTERING ACTIVITY

The typical pattern clustering activity involves the following steps [1]:
(i) Pattern representation; this stage also involves feature extraction and/or selection.
(ii) Definition of a pattern proximity measure appropriate to the data domain.
(iii) Clustering or grouping.
(iv) Data abstraction.
(v) Assessment of output.
The last two steps are optional in several applications. Clustering methods are used in pattern recognition, image processing and information retrieval, and also in unsupervised learning, vector quantization and learning by observation.

III. CLASSIFICATION

Clustering algorithms may be broadly classified as follows [2]:
a. Exclusive clustering: data are grouped in an exclusive way, so that each datum belongs to exactly one cluster. K-means clustering is one example of an exclusive clustering algorithm.
b. Overlapping clustering: uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership.
c. Hierarchical clustering: connectivity-based clustering that seeks to build a hierarchy of clusters. Agglomerative clustering and divisive clustering come under this heading.
d. Probabilistic clustering: uses a completely probabilistic approach; the mixture-of-Gaussians algorithm is an example.
e. Partitional clustering: non-hierarchical (partitioning) clustering creates the clusters in one step as opposed to several steps. MST, squared-error, K-means, nearest-neighbor and PAM algorithms come under the partitional algorithms.
f. Clustering large databases: algorithms associated with dynamic databases, such as BIRCH, DBSCAN and CURE.
g. Clustering large categorical data: algorithms that address the problems that arise when clustering categorical data; the K-modes algorithm is an example.
h. Centroid-based clustering: clusters are represented by a central vector, which may not necessarily be a member of the data set. K-means is an example.
i. Density-based clustering: clusters are defined as areas of higher density than the remainder of the data set. DBSCAN is the most popular example.

IV. ADVANCES IN THE CLUSTERING ALGORITHMS

A. Density Based Algorithm (DBSCAN)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm because it finds clusters starting from the estimated density distribution of the corresponding nodes. DBSCAN is one of the most common clustering algorithms and among the most cited in the scientific literature. An interesting experiment in parallel clustering with DBSCAN was conducted by Domenica Arlia and Massimo Coppola [3], using a master module that performs cluster assignment and slave modules that answer the neighborhood queries. As a result, unlike in the sequential scan, points already labeled are returned again and again by the slaves. The results were equivalent to those of sequential DBSCAN, and the work provided a methodology that helps to re-engineer existing code. Other density-based algorithms, such as GDBSCAN, PDBSCAN and DENCLUE, are surveyed in [4]. More recently, in 2010, Tepwankul et al. studied the problem of clustering uncertain objects; they proposed a new deviation function that approximates the underlying uncertainty model of the objects, and a new density-based clustering algorithm, U-DBSCAN [5], that utilizes the proposed deviation.

B. BIRCH algorithm

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is designed for very large data sets.
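The core-point expansion performed by DBSCAN (section A above) can be sketched in a few lines. This is a brute-force illustration, not the paper's implementation: the names `eps` and `min_pts` mirror the Eps and MinPts parameters of the original algorithm, and a real implementation would use a spatial index rather than a linear scan for the neighborhood queries.

```python
import math

def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (brute force)."""
    return [j for j, q in enumerate(points)
            if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point (-1 = noise)."""
    labels = [None] * len(points)       # None = unvisited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1              # noise (may later become a border point)
            continue
        cluster += 1                    # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point: expand
    return labels
```

Density-reachability falls out of the seed list: only core points (at least `min_pts` neighbors within `eps`) propagate the cluster, while border points are absorbed without expanding it.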
Incremental and dynamic clustering of incoming objects can be done with BIRCH: only one scan of the data is necessary, and there is no need to have the whole data set in advance. It is a data clustering method built on the clustering feature (CF) tree data structure, and the method can be easily parallelized. Two algorithms are used in BIRCH clustering, CF-tree insertion and CF-tree rebuilding; both were explained by Tian Zhang and Raghu Ramakrishnan in their paper [6]. Recently, a novel method of efficient spam mail classification using BIRCH clustering techniques was presented by M. Basavaraju et al. [7]; it was implemented in C using open source technology. The BIRCH algorithm works well for larger data sets.

C. Clustering Using REpresentatives (CURE) Algorithm

The drawbacks of traditional clustering algorithms are discussed in [8]: they generally favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. Sudipto Guha, Rajeev Rastogi and Kyuseok Shim proposed the CURE algorithm [8], which is robust to outliers and identifies clusters with non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps to dampen the effects of outliers. An improved CURE algorithm was developed to analyze business information [9]; in this technique the weight value can be adjusted dynamically, which increases the flexibility of clustering. Theory analysis and experimental results showed that it is a valid method for dynamic identification analysis of competitors for an enterprise.
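CURE's representative-point construction just described (select well-scattered points, then shrink them toward the centroid) can be sketched as follows. The function name, the greedy farthest-point selection and the default shrink fraction `alpha` are illustrative choices for this sketch, not the paper's exact procedure.

```python
def cure_representatives(cluster, num_rep=4, alpha=0.5):
    """Pick well-scattered points from a cluster and shrink them toward
    the centroid by the fraction alpha (the key idea of CURE)."""
    dim = len(cluster[0])
    centroid = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Greedy farthest-point selection: start with the point farthest from
    # the centroid, then repeatedly add the point farthest from the
    # representatives chosen so far ("well scattered").
    reps = [max(cluster, key=lambda p: dist2(p, centroid))]
    while len(reps) < min(num_rep, len(cluster)):
        reps.append(max(cluster, key=lambda p: min(dist2(p, r) for r in reps)))

    # Shrink each representative toward the centroid by alpha, which
    # dampens the influence of outliers.
    return [tuple(r[d] + alpha * (centroid[d] - r[d]) for d in range(dim))
            for r in reps]
```

Cluster-to-cluster distance in CURE is then measured between representative sets rather than between single centroids, which is what lets it follow non-spherical shapes.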
D. K-Means Algorithm

A simple, efficient implementation of Lloyd's K-means clustering algorithm was given by Kanungo et al. [10]. They presented the filtering algorithm, which is based on storing the multidimensional data points in a kd-tree, and established its practical efficiency in two ways: through its running time, and through studies on synthetically generated data and on real data sets from applications in color quantization, data compression and image segmentation. Spam mail classification was also done using the K-means algorithm [7], which fits best for smaller data sets.

E. PAM (Partitioning Around Medoids)

Compared to the K-means approach, PAM has the following features [11]: (i) it also accepts a dissimilarity matrix; (ii) it is more robust because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances; (iii) it provides a novel graphical display, the silhouette plot, which also helps in selecting the number of clusters. The PAM algorithm is based on the search for k representative objects, or medoids, among the objects of the data set; these objects should represent the structure of the data. After finding a set of k medoids, k clusters are constructed by assigning each object to the nearest medoid. The goal is to find k representative objects which minimize the sum of the dissimilarities of the objects to their closest representative object. The algorithm first looks for a good initial set of medoids (the BUILD phase). Then it finds a local minimum for the objective function, that is, a solution such that no single switch of an object with a medoid will decrease the objective (the SWAP phase). Advances in a feature-level fusion system of face and palmprint traits were recently carried out [12]: the PAM algorithm is used to partition the set of n invariant feature points of the face and palmprint images into k clusters.
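Returning to PAM itself, the BUILD-then-SWAP search described above can be sketched as follows. This is a brute-force sketch operating on a precomputed dissimilarity matrix; a practical PAM implementation uses much cheaper incremental cost updates than recomputing the full objective for every candidate swap.

```python
def pam(dist, k):
    """Sketch of PAM on a precomputed dissimilarity matrix `dist`
    (a list of lists). Returns the sorted list of k medoid indices."""
    n = len(dist)

    def cost(medoids):
        # Total dissimilarity of every object to its nearest medoid.
        return sum(min(dist[i][m] for m in medoids) for i in range(n))

    # BUILD phase: greedily add the object that most reduces the cost.
    medoids = []
    while len(medoids) < k:
        best = min((i for i in range(n) if i not in medoids),
                   key=lambda i: cost(medoids + [i]))
        medoids.append(best)

    # SWAP phase: while any single medoid/non-medoid swap lowers the
    # cost, perform it and search again (local search to a minimum).
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = [x for x in medoids if x != m] + [h]
                if cost(candidate) < cost(medoids):
                    medoids, improved = candidate, True
                    break
            if improved:
                break
    return sorted(medoids)
```

Because only the dissimilarity matrix is consulted, the same code works for any data for which pairwise dissimilarities can be defined, which is feature (i) above.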
By partitioning the face and palmprint images with scale-invariant feature (SIFT) points, a number of clusters are formed on both images; the results showed better performance compared to existing techniques.

F. CLARANS Algorithm

CLARANS (Clustering Large Applications based on RANdomized Search) is an efficient and effective clustering method, especially in spatial data mining, and is applicable to locating objects with polygonal shapes. Data mining usually involves discrete data, and the traditional approach is to convert discrete data into numeric values; it is then hard to obtain a meaningful result, since the ordering of the discrete problem domain is damaged. CLARANS can avoid this efficiently. The improved, parallel CLARANS [13] raises clustering performance further: it obtains better clustering results and also shortens the search time. It was applied in a 110 alarm system, and the results were satisfying.

CLARA (Clustering LARge Applications) relies on sampling and handles large data sets. Instead of finding representative objects for the entire data set, CLARA draws a sample of the data set, applies PAM on the sample, and finds the medoids of the sample. The point is that, if the sample is drawn in a sufficiently random way, the medoids of the sample approximate the medoids of the entire data set. The detailed algorithm is given in [14].

G. CHAMELEON Algorithm

Existing clustering algorithms, such as K-means, PAM, CLARANS, DBSCAN, CURE and ROCK, are designed to find clusters that fit some static model. These algorithms can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of the clusters.
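Stepping back to CLARA (section F above), its sampling scheme can be sketched as follows. The helper `k_medoids` is a simplified swap-based stand-in for a full PAM run, and the parameter defaults are illustrative; the key CLARA idea shown is that the search runs on each sample but the winning medoid set is judged against the entire data set.

```python
import random

def clara(points, k, num_samples=5, sample_size=40):
    """Sketch of CLARA: run a PAM-like search on several random samples
    and keep the medoid set that best serves the FULL data set."""

    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cost(medoids, data):
        return sum(min(d(p, m) for m in medoids) for p in data)

    def k_medoids(sample):
        # Simplified stand-in for PAM, run on the sample only.
        medoids = list(sample[:k])
        improved = True
        while improved:
            improved = False
            for i in range(k):
                for p in sample:
                    if p in medoids:
                        continue
                    trial = medoids[:i] + [p] + medoids[i + 1:]
                    if cost(trial, sample) < cost(medoids, sample):
                        medoids, improved = trial, True
        return medoids

    best, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = random.sample(points, min(sample_size, len(points)))
        medoids = k_medoids(sample)
        # Judge each sample's medoids against the entire data set.
        c = cost(medoids, points)
        if c < best_cost:
            best, best_cost = medoids, c
    return best
```

Each k-medoids run costs only on the order of the sample size squared per sweep, which is what makes the scheme tractable for data sets too large for PAM itself.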
Furthermore, most of these algorithms break down when the data consists of clusters of diverse shapes, densities and sizes. The CHAMELEON algorithm [15] measures the similarity of two clusters based on a dynamic model: in the clustering process, two clusters are merged only if the inter-connectivity and closeness (proximity) between them are high relative to the internal inter-connectivity of the clusters and the closeness of items within them. The merging process using this dynamic model facilitates the discovery of natural and homogeneous clusters, and the methodology is applicable to all types of data as long as a similarity matrix can be constructed. Of the five data sets the authors used, CURE was able to find clusters for only two, while CHAMELEON found clusters for the remaining data sets as well; moreover, CHAMELEON can discover clusters of different shapes and sizes.

H. FCDC Algorithm

The FCDC (Frequent Concepts based Document Clustering) algorithm [16] treats text documents as sets of related words, instead of bags of words. Different words that share the same meaning are known as synonyms, and a set of such words is known as a concept; whether documents share the same frequent concepts is used as the measure of their closeness. This algorithm is therefore able to group documents into the same cluster even if they contain no common words. The authors constructed feature vectors based on concepts, applied an Apriori paradigm for discovering frequent concepts, and then used the frequent concepts to create clusters. FCDC was found to be more efficient and accurate than other clustering algorithms in this application.

I. Cluster Ensemble Approach

Clustering is one of the most widely and frequently used techniques in gene expression analysis. A comparative study on analyzing ES cell gene expression data with different algorithms, such as K-means, PAM and SOM, is given by Gengxin Chen et al. [17]. Their study provides a guideline on how to select a suitable clustering algorithm for the extraction of meaningful biological information from microarray expression data. Later advancements proved that the cluster ensemble approach [18] yields better results than the single best clustering algorithm for analyzing gene expression data sets.

J. ROCK Algorithm

RObust Clustering using linKs (ROCK) is targeted at both Boolean and categorical data. A pair of items are said to be neighbors if their similarity exceeds some threshold, and the number of links between two items is defined as the number of common neighbors they have. The objective of the clustering algorithm is to group together points that have many links. A recent development in genetics used an improved version of this algorithm called GE-ROCK [19], which combines the techniques of clustering and genetic optimization. In GE-ROCK the similarity function is used throughout the iterative clustering process, while in the conventional ROCK algorithm the similarity function is used only for the initial calculation. The analysis showed that GE-ROCK achieves superior performance, in both better clustering quality and shorter computing time, compared to the ROCK algorithm commonly used in the literature.

V. CONCLUSION

We have presented an overview of recent advances in clustering algorithms, along with the basic clustering activity and a broad classification, to give a wider view of the field. Improvements in different algorithms were mentioned, including advancements in CLARANS, FCDC, cluster ensemble approaches and ROCK. Recent advances in different application areas, such as gene expression analysis, genetic optimization, text document clustering, image clustering, business information, spam mail classification and the clustering of uncertain objects, are included, giving a broad spectrum of recent advances in clustering.

REFERENCES

[1] J. R. Prasad, R. S. Prasad, U. V. Kulkarni, "Impact of Feature Selection Methods in Hierarchical Clustering Technique: A Review," Proceedings of IMECS 2008, Vol. I, pp. 19-21, 2008.
[2] M. H. Dunham, Data Mining: Introductory and Advanced Topics, Pearson Education, Inc., 2003.
[3] D. Arlia, M. Coppola, "Experiments in Parallel Clustering with DBSCAN," Euro-Par 2001, LNCS 2150, pp. 326-331, Springer, Heidelberg, 2001.
[4] S. B. Kotsiantis, P. E. Pintelas, "Recent Advances in Clustering: A Brief Survey," Technical Report, 2004.
[5] A. Tepwankul, S. Maneewongwattana, "U-DBSCAN: A Density-Based Clustering Algorithm for Uncertain Objects," 26th IEEE International Conference on Data Engineering Workshops (ICDEW), pp. 136-143, 2010.
[6] T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications," Kluwer Academic Publishers, pp. 1-40, 1997.
[7] M. Basavaraju, R. Prabhakar, "A Novel Method of Spam Mail Detection Using Text Based Clustering Approach," IJCA (0975-8887), Vol. 5, No. 4, August 2010.
[8] S. Guha, R. Rastogi, K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," SIGMOD, Seattle, WA, USA, 1998.
[9] Y. Zhao, "Research of an Improved CURE Algorithm Used in Enterprise Competitive Intelligence to Dynamically Identify Analysis," IEEE Youth Conference on Information Computing and Telecommunications (YCICT), pp. 299-302, Nov. 2010.
[10] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, A. Y. Wu, "An Efficient K-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, July 2002.
[11] United Nations Educational, Scientific and Cultural Organization, http://www.unesco.org/webworld/idams/advguide/Chapt7_1_1.htm
[12] D. R. Kisku, P. Gupta, J. K. Sing, "Multibiometrics Feature Level Fusion by Graph Clustering," International Journal of Security and Its Applications, Vol. 5, No. 2, April 2011.
[13] J.-S. Xue, "Parallel CLARANS: Improvement and Application of the CLARANS Algorithm," International Conference on Computer and Communication Technologies in Agriculture Engineering (CCTAE), pp. 248-251, June 2010.
[14] R. T. Ng, J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[15] G. Karypis, E.-H. Han, V. Kumar, "Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling," Technical Report.
[16] B. Rekha, D. Renu, "A Frequent Concepts Based Document Clustering Algorithm," IJCA (0975-8887), Vol. 4, No. 5, July 2010.
[17] G. Chen, S. A. Jaradat, N. Banerjee, "Evaluation and Comparison of Clustering Algorithms in Analyzing ES Cell Gene Expression Data."
[18] X. Hu, I. Yoo, "Cluster Ensemble and Its Applications in Gene Expression Analysis," 2nd Asia-Pacific Bioinformatics Conference (APBC), Dunedin, New Zealand, 2004.
[19] Z. Qiongbing, D. Lixin, Z. Shanshan, "A Genetic Evolutionary ROCK Algorithm," International Conference on Computer Application and System Modeling (ICCASM), pp. 347-351, Oct. 2010.