ISSN: 2249-5789 — International Journal of Computer Science & Communication Networks, Vol 6(3), 138-142

A Review on Various Clustering Techniques in Data Mining

Mamta Mor
[email protected]

Abstract

This paper presents a review of various clustering techniques used in data mining. Data mining is the task of retrieving useful and hidden knowledge from data sets [1][2]. Clustering is one of the important tasks of data mining. Clustering is an unsupervised learning problem which is used to determine the intrinsic grouping in a set of unlabeled data [3]. Objects are grouped on the principle of maximizing the intra-cluster similarity and minimizing the inter-cluster similarity, in such a way that the objects in the same group/cluster share similar properties/traits [4].

1. Introduction

Clustering techniques are useful in various real-world applications, including data/text mining, voice mining, image processing, and web mining. Clustering is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics [3][5]. Clustering is the technique of partitioning the data being mined into several clusters of data objects, in such a way that:
a) the objects in a cluster resemble each other to a great extent; and
b) the objects of a cluster differ considerably from the objects in other clusters.

A cluster should exhibit the properties of external isolation and internal cohesion. External isolation requires that objects/instances in one cluster be separated from objects/instances in another cluster by fairly empty areas of space. Internal cohesion requires that objects/instances within the same cluster be similar to each other, at least within the local metric [6]. It can also be stated that a good clustering algorithm always maximizes the intra-cluster similarity and minimizes the inter-cluster similarity [1][2].

Clustering is often based on some:
a) similarity measure, or
b) distance measure.

The notion of similarity is always problem dependent. The dissimilarity (or similarity) between objects is typically computed from the distance between each pair of objects. On the basis of similarity or dissimilarity, clustering can be classified into two types:
a) distance-based clustering
b) conceptual clustering

In distance-based clustering, objects/instances are put into clusters on the basis of a distance criterion: two or more objects belong to the same cluster if they are "closer" to the centroid of that particular cluster. The basic idea behind distance-based clustering is to minimize the intra-cluster distance and maximize the inter-cluster distance, as shown in Figure 1 (a code sketch is given at the end of this section).

Figure 1: Distance-based clustering

In conceptual clustering, two or more objects belong to the same cluster if they are conceptually the same or similar. In other words, clusters are formed according to descriptive concepts, not according to a distance measure, as shown in Figure 2.

Figure 2: Conceptual clustering
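As a concrete illustration of the distance criterion above, the following is a minimal Python sketch, not part of the original paper, of assigning each object to the cluster whose centroid is nearest under Euclidean distance; the data points and centroids are hypothetical.

```python
import math

def euclidean(a, b):
    # Euclidean distance between two points of equal dimension
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_to_nearest_centroid(points, centroids):
    # Distance-based clustering rule: each object joins the cluster
    # whose centroid is closest to it.
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: euclidean(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

# Hypothetical 2-D data and two centroids (illustrative only)
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
centroids = [(1.0, 1.5), (8.5, 8.0)]
print(assign_to_nearest_centroid(points, centroids))
```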
2. Clustering Techniques/Methods

The various clustering methods used in the field of data mining can be categorized as:
a) Hierarchical methods
b) Partitioning methods
c) Density-based methods
d) Grid-based methods
e) Model-based methods

2.1 Hierarchical Clustering

Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a technique of cluster analysis which seeks to build a hierarchy of clusters, also known as a dendrogram. Hierarchical clustering creates clusters that have a predetermined ordering from top to bottom. It is based on the core idea that objects are more related to nearby objects than to objects farther away. These algorithms therefore connect "objects" to form "clusters" on the basis of distance-based clustering [5]. Hierarchical clustering can be further divided into two subtypes, as shown in Fig. 3:

Figure 3: Subtypes of hierarchical clustering (agglomerative and divisive methods)

• Agglomerative Method

This is a bottom-up approach which starts by assigning each data instance to its own cluster and then iteratively merges the two most similar clusters. It builds the hierarchy from the individual objects by progressively merging clusters, as shown in Figure 4. The general algorithm for agglomerative clustering is as follows [5][6] (a code sketch is given at the end of this subsection):
1. Find the two closest objects and merge them into a cluster.
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2.

• Divisive Method

This is a top-down approach which starts by assigning all the objects to one cluster and then iteratively divides it into smaller and smaller clusters, as shown in Figure 4. The general algorithm for divisive clustering is as follows [5] (assume there are n objects):
1. Find the two farthest objects and split them into two clusters.
2. Find and split the next two farthest points, where a point is either an individual object or a cluster of objects.
3. Repeat until there are n clusters.

Figure 4: Hierarchical clustering (agglomerative and divisive)

Advantages of hierarchical clustering include:
a) Embedded flexibility regarding the level of granularity.
b) Ease of handling any form of similarity or distance.
c) Consequently, applicability to any attribute type.

Disadvantages of hierarchical clustering are related to:
a) Vagueness of the termination criteria.
b) The fact that most hierarchical algorithms do not revisit (intermediate) clusters once constructed, with the purpose of improving them.
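To make the agglomerative procedure concrete, here is a minimal illustrative Python sketch, not from the paper, of bottom-up merging with single-linkage distance, i.e. the distance between two clusters is taken as the distance between their closest members; the data are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    # Distance between two clusters = distance between their closest members
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points, n_clusters):
    # Step 1: start with each object in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Steps 2-3: find the pair of closest clusters ...
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        # ... and merge them, then repeat
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Hypothetical data (illustrative only)
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(points, 2))
```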
2.2 Partitioning Clustering

Partitioning methods generally produce a set of M clusters, with each object belonging to one cluster. Each cluster may be represented by a centroid [5]. Partitioning can be further divided into subtypes, as shown in Fig. 5:

Figure 5: Subtypes of iterative partitioning (overlapping and non-overlapping methods)

Perhaps the most popular class of clustering algorithms is the combinatorial optimization algorithms, also known as iterative relocation algorithms. These algorithms minimize a given clustering criterion by iteratively relocating data points between clusters until a (locally) optimal partition is attained. In the overlapping method, a data instance can belong to one or more clusters at the same time, whereas in the non-overlapping method a data instance can belong to only one cluster at a time.

Among the partition-based clustering algorithms, k-means clustering is one of the most popular and simplest unsupervised learning techniques; it belongs to the non-overlapping class.

• K-means Clustering

K-means partitions the data points into k groups or clusters, where the parameter k is a positive integer specified by the user. K-means starts with k centroids and iteratively performs the following steps [3][8] (a code sketch is given at the end of this subsection):
a) Assign each data instance to the cluster whose centroid is nearest to it.
b) Compute the new centroid of each cluster.
These steps are repeated until no data instance moves from one cluster to another. A formal description of k-means is given below.

Let X = (x1, x2, ..., xn) be a set of n objects, where each object has p attributes. K-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, ..., Sk} so as to minimize the intra-cluster distance and maximize the inter-cluster distance. In other words, its objective is to find [4][9]:

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - m_i \rVert^2

where x is a point representing a data object, S_i is the i-th cluster, and m_i is the mean of the i-th cluster.

The k-means clustering algorithm is fast, robust, relatively efficient, and easy to understand. Advantages of k-means clustering include:
1) If the number of data objects is large, k-means is usually computationally faster than hierarchical clustering, provided the value of k is kept small.
2) K-means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.

Disadvantages of k-means clustering include:
1) It is difficult to predict the value of k.
2) Different initial partitions can result in different final clusters.
3) It often converges to a sub-optimal solution because the clustering space is large.
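The following minimal Python sketch, an illustration rather than the paper's implementation, shows the two-step k-means loop described above on hypothetical 2-D data, stopping when the centroids no longer change (i.e. no data instance moves between clusters).

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iter=100):
    # Start with k centroids chosen at random from the data
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Step a) assign each instance to the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[i].append(p)
        # Step b) recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no point moved: converged
            break
        centroids = new_centroids
    return clusters, centroids

# Hypothetical 2-D data drawn from two blobs (illustrative only)
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20)] + \
         [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(20)]
clusters, centroids = kmeans(points, 2)
print(centroids)
```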
2.3 Density-based Clustering

Density-based clustering groups together data objects/points that are tightly packed (points with many nearby neighbours), marking as outliers those points that lie alone in low-density regions (whose nearest neighbours are too far away) [5]. Clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that are required to separate clusters are usually considered to be noise or border points. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the most common and popular density-based clustering algorithms [6].

Density reachability: a point x is said to be density reachable from a point y if x is within distance ε of y and y has a sufficient number of points in its ε-neighbourhood [5].

Density connectivity: points x and y are said to be density connected if there exists a point z which has a sufficient number of points in its ε-neighbourhood and both x and y are within distance ε of it. This is a chaining process: if y is a neighbour of z, z is a neighbour of s, and s is a neighbour of t, which in turn is a neighbour of x, then y is connected to x through this chain [5].

The algorithm for DBSCAN clustering is as follows (a code sketch is given at the end of this subsection). Let X = {x1, x2, x3, ..., xn} be the set of n data points. DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a cluster (minPts) [10].
1) Start with an arbitrary point that has not been visited.
2) Extract the neighbourhood of this point using ε (all points within distance ε form the neighbourhood).
3) If there are sufficient points in this neighbourhood, the clustering process starts and the point is marked as visited; otherwise the point is labelled as noise (later it may still become part of a cluster).
4) If a point is found to be part of a cluster, its ε-neighbourhood is also part of the cluster, and the procedure from step 2 is repeated for all points in that neighbourhood. This is repeated until all points in the cluster are determined.
5) A new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise.
6) This process continues until all points are marked as visited.

Advantages
1) It does not require a priori specification of the number of clusters, in contrast to k-means.
2) It can identify noise while clustering, so it is more robust in nature.
3) It can find clusters of arbitrary size and shape.

Disadvantages
1) DBSCAN fails in the case of clusters of varying density.
2) It fails in the case of neck-type data sets.
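Below is a minimal illustrative Python sketch of the DBSCAN procedure described above; it is an assumption-laden illustration, not the reference implementation of [10], and the point set, eps, and minPts values are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def region_query(points, p, eps):
    # Step 2: indices of all points within eps of point p (its eps-neighbourhood)
    return [q for q in range(len(points))
            if euclidean(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] is not UNVISITED:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:
            labels[p] = NOISE           # step 3: too few neighbours -> noise (for now)
            continue
        labels[p] = cluster_id          # start a new cluster
        seeds = list(neighbours)
        while seeds:                    # step 4: grow the cluster
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id  # a noise point can become a border point
            if labels[q] is not UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts:
                seeds.extend(q_neighbours)  # q is a core point: expand through it
        cluster_id += 1                 # steps 5-6: move on to the next cluster
    return labels

# Hypothetical data: two dense groups and one isolated (noise) point
points = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.1, 7.9), (8.2, 8.1), (15, 0)]
print(dbscan(points, eps=1.0, min_pts=2))
```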
2.4 Grid-based Clustering

Grid-based clustering is popular for mining clusters in a large multi-dimensional space. Its main benefit is reduced computational complexity, especially when the data sets are very large. This approach is concerned not with the data points themselves but with the value space associated with the data points.

• Algorithm

A typical grid-based clustering algorithm consists of the following basic steps [11]:
1) Create a grid structure, i.e. partition the data space into a finite number of grid cells.
2) Assign data objects to the appropriate grid cells and calculate the density of each cell.
3) Sort the cells according to their density and eliminate cells whose density is below a certain threshold t.
4) Identify cluster centres.
5) Traverse neighbour cells.

Grid-based approaches include STING (STatistical INformation Grid) and CLIQUE. STING is discussed briefly below.

2.4.1 STING

The STING clustering method was proposed by Wang et al. (1997) to cluster spatial databases. It can be used to answer different kinds of spatial queries [12].

• Pseudo-code
1) Divide the spatial area into rectangular cells representing a hierarchical structure.
2) Cells at a higher level are divided into a number of smaller cells at the next (lower) level.
3) Statistical information for each cell is calculated and used to answer spatial queries.
4) Parameters of higher-level cells can easily be calculated from the parameters of lower-level cells.
5) Use a top-down approach to answer spatial data queries.
6) Start from a pre-selected layer, typically one with a small number of cells, and proceed downward until the bottom layer is reached, as follows:
7) For each cell in the current level, compute the confidence interval indicating the cell's relevance to a given query;
a. If it is relevant, include the cell in a cluster.
b. If it is irrelevant, remove the cell from further consideration; otherwise, look for relevant cells at the next lower layer.
c. Combine relevant cells into relevant regions (based on grid neighbourhood) and return the resulting clusters as the answer.

Advantages:
• Query-independent, easy to parallelize, and supports incremental updates.
• Complexity of O(K), where K is the number of grid cells at the lowest level.

Disadvantages:
• All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.

3. Application Areas of Clustering

Clustering algorithms can be applied in many fields, for instance:
• Marketing: finding groups of customers with similar interests and behaviour, given a large database of customer data containing their properties and past buying records.
• Medicine: IMRT segmentation, analysis of antimicrobial activity, medical imaging.
• Finance: forecasting stock markets, currency exchange rates, and bank bankruptcies; understanding and managing financial risk; trading futures; credit rating.
• Computer science: software evolution, image segmentation, anomaly detection.
• Biology: classification of plants and animals given their features, human genetic clustering, transcriptomics.
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds.
• City planning: identifying groups of houses according to their house type, value, and geographical location.
• Earthquake studies: clustering observed earthquake epicentres to identify dangerous zones.
• WWW: document classification; clustering web log data to discover groups of similar access patterns.

4. Conclusion

Several clustering algorithms have been proposed for the task of data mining. The clustering algorithm to be used in a particular case depends on the type and nature of the data set. Clustering has a large area of application. It is a descriptive technique: the solution is not unique, and it strongly depends on the analyst's choices.

5. References

[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
[2] A. A. Freitas, "A survey of evolutionary algorithms for data mining and knowledge discovery," in Advances in Evolutionary Computing, Springer, 2003, pp. 819-845.
[3] M. Mor, P. Gupta, and P. Sharma, "A Genetic Algorithm Approach for Clustering," 2014, pp. 6442-6447.
[4] P. Sharma, "Comparative Analysis of Various Clustering Algorithms," 2015, pp. 107-112.
[5] "Hierarchical clustering," Wikipedia, the free encyclopedia, 09-Feb-2015.
[6] G. W. Milligan, "An examination of the effect of six types of error perturbation on fifteen clustering algorithms."
[7] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review."
[8] J. A. Hartigan, Clustering Algorithms. John Wiley & Sons, 1975.
[9] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A K-Means Clustering Algorithm," Journal of the Royal Statistical Society, Series C, 28(1): 100-108, 1979.
[10] J. Sander, M. Ester, H.-P. Kriegel, and X. Xu, "Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications," Data Mining and Knowledge Discovery, 2(2): 169-194, 1998.
[11] "Grid based clustering," Wikipedia, the free encyclopedia.
[12] T. Soni Madhulatha, "An overview on clustering methods," IOSR Journal of Engineering, vol. 2(4), pp. 719-72, Apr. 2012.