Subject Name: Data Warehousing and Data Mining
Subject Code: 10MCA542 & IS74
Prepared By: V. Srikanth & Harkiranpreet
Department: MCA & IS
Date: 31/10/2014

UNIT 7: CLUSTERING TECHNIQUES
• Clustering techniques overview
• Features of cluster analysis
• Types of data and computing distance
• Types of cluster analysis methods
• Partitional methods
• Density-based methods
• Quality and validity of cluster analysis

CLUSTERING TECHNIQUES OVERVIEW
• Clustering means finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• A clustering is a set of clusters.
• An important distinction is between hierarchical and partitional sets of clusters:
– Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
– Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

FEATURES OF CLUSTER ANALYSIS
• Given the large number of cluster analysis methods on offer, we list the desirable features that an ideal cluster analysis method should have:
1. Scalability: Data mining problems can be large, so a cluster analysis method should be able to deal with small as well as large problems gracefully.
2. Only one scan of the data set: For large problems the data must be stored on disk, and the cost of disk I/O can become significant in solving the problem; ideally the method reads the data only once.
3. Ability to stop and resume: When the data set is very large, cluster analysis may require considerable processor time to complete the task. In such cases it is desirable that the task can be stopped and resumed when convenient.
4. Minimal input parameters: The cluster analysis method should not expect too much guidance from the user.
5. Robustness: Most data obtained from a variety of sources has errors.
It is therefore desirable that a cluster analysis method be able to deal with noise, outliers and missing values gracefully.
6. Ability to discover different cluster shapes: Clusters come in different shapes and not all clusters are spherical. It is therefore desirable that a cluster analysis method be able to discover cluster shapes other than spherical.
7. Different data types: Many problems have a mixture of data types, for example numerical, categorical and even textual.
8. Result independent of data input order: It is desirable that a cluster analysis method not be sensitive to data input order; whatever the order, the result of clustering the same data should be the same.

TYPES OF DATA AND COMPUTING DISTANCE
• Data sets come in a number of different forms. The data may be quantitative, binary, nominal or ordinal.
1. Quantitative data is quite common, for example weight, marks, height, price, salary and count.
2. Examples of binary data are gender, marital status, etc.
3. Qualitative nominal data is similar to binary data but may take more than two values and has no natural order, for example religion, foods or colours.
4. Qualitative ordinal data is similar to nominal data except that the data has an order associated with it, for example grades A, B, C, D.

COMPUTING DISTANCE
• Let the distance between two points x and y be D(x, y). We now define a number of distance measures.
1. Euclidean distance: the most commonly used measure. The formula is D(x, y) = (∑(xi − yi)²)^(1/2)
2. Manhattan distance: D(x, y) = ∑|xi − yi|
3. Chebychev distance: D(x, y) = Max|xi − yi|
4. Categorical data distance: D(x, y) = (number of attributes in which xi and yi differ)/N, where N is the number of attributes.

TYPES OF CLUSTER ANALYSIS METHODS
• The cluster analysis methods may be divided into the following categories:
1. Partitional methods
2. Hierarchical methods
3. Density-based methods
4. Grid-based methods
5. Model-based methods
1. Partitional methods: These obtain a single-level partition of the objects. They are usually based on greedy heuristics applied iteratively to obtain a locally optimal solution. Given n objects, these methods make k < n clusters of the data and use an iterative relocation method.
2. Hierarchical methods: These obtain a nested partition of the objects, resulting in a tree of clusters.
3. Density-based methods: In this class of methods, typically for each data point in a cluster at least a minimum number of points must exist within a given radius.
4. Grid-based methods: In this class of methods, the object space, rather than the data, is divided into a grid. Grid-based methods are not affected by data ordering.
5. Model-based methods: A model is assumed, perhaps based on a probability distribution, and similarity measurement is based on the mean values.

PARTITIONAL METHODS
• Partitional clustering involves two methods:
1. The K-means method, and
2. The expectation maximization (EM) method.
• K-means clustering:
– A partitional clustering approach.
– Each cluster is associated with a centroid (centre point).
– Each point is assigned to the cluster with the closest centroid.
– The number of clusters, K, must be specified.
– The basic algorithm is very simple.

K-MEANS CLUSTERING ALGORITHM
• Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to 'until relatively few points change clusters'.
• Complexity is O(n * K * I * d).
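The K-means loop described above can be sketched in plain Python. This is a minimal illustration, not the slides' own code: the function and variable names are my own, and for simplicity the first K points seed the centroids rather than a random choice.

```python
# Minimal K-means sketch: assign each point to its nearest centroid,
# then recompute each centroid as the mean of its points, until stable.
import math

def euclidean(x, y):
    # Euclidean distance D(x, y) = (sum((xi - yi)^2))^0.5
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, k, max_iters=100):
    # Illustrative simplification: seed centroids with the first k points
    # (real implementations choose initial centroids randomly or more carefully).
    centroids = [list(p) for p in points[:k]]
    for _ in range(max_iters):
        # Assignment step: each point goes to the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid becomes the mean of the points in its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dims = len(cluster[0])
                new_centroids.append(
                    [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)])
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters
```

Each pass over the data costs O(n * K * d), so I iterations give the O(n * K * I * d) total complexity quoted above.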
Here n = number of points, K = number of clusters, I = number of iterations and d = number of attributes.

IMPORTANCE OF CHOOSING INITIAL CENTROIDS
[Figure: scatter plot of a K-means run at iteration 6, points plotted in the x-y plane]

SOLUTIONS TO THE INITIAL CENTROIDS PROBLEM
• Multiple runs
– Helps, but probability is not on your side.
• Sample and use hierarchical clustering to determine initial centroids.
• Select more than k initial centroids and then select among these initial centroids.
– Select the most widely separated.
• Post-processing.
• Bisecting K-means
– Not as susceptible to initialization issues.

BISECTING K-MEANS ALGORITHM
[Slide figure not reproduced in the transcript]

2. EXPECTATION MAXIMIZATION METHOD
• In contrast to the K-means method, the expectation maximization (EM) method is based on the assumption that the objects in the data set have attributes whose values are distributed according to some unknown linear combination (mixture) of simple probability distributions.
• While the K-means method assigns objects to clusters to minimize within-group variation, the EM method assigns objects to different clusters with certain probabilities, in an attempt to maximize the expectation of the assignment.
• The EM method involves two steps: the first, called the estimation step, estimates the probability distribution of the clusters given the data; the second, called the maximization step, finds the model parameters that maximize the likelihood of the solution.

HIERARCHICAL METHODS
• Hierarchical clustering involves two approaches:
1. The agglomerative method, and
2. The divisive method.
1. AGGLOMERATIVE: the more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat:
4. Merge the two closest clusters.
5. Update the proximity matrix.
6. Until only a single cluster remains.
• The key operation is the computation of the proximity of two clusters.
– Different approaches to defining the distance between clusters distinguish the different algorithms.

2. DIVISIVE METHOD
[Slide figure not reproduced in the transcript]

DENSITY-BASED METHODS
• The density-based methods discussed here are based on DBSCAN, a density-based algorithm.
– Density = number of points within a specified radius (Eps).
– A point is a core point if it has more than a specified number of points (MinPts) within Eps.
• These are points in the interior of a cluster.
– A border point has fewer than MinPts within Eps, but is in the neighbourhood of a core point.
– A noise point is any point that is neither a core point nor a border point.

DBSCAN ALGORITHM
• Eliminate noise points.
• Perform clustering on the remaining points.

QUALITY AND VALIDITY OF CLUSTER ANALYSIS METHODS
• The quality of a method involves a number of criteria:
1. Efficiency of the method.
2. Ability of the method to deal with noisy and missing data.
3. Ability of the method to deal with large problems.
4. Ability of the method to deal with a variety of attribute types and magnitudes.
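The core/border/noise classification used by DBSCAN, as described in the density-based methods section above, can be sketched as follows. This is a minimal illustration with my own function names; it uses the common convention that a point counts as its own neighbour and that a core point needs at least MinPts neighbours within Eps.

```python
# Classify each point as 'core', 'border' or 'noise' per the DBSCAN
# definitions: a core point has a dense Eps-neighbourhood; a border point
# is within Eps of a core point; everything else is noise.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def classify_points(points, eps, min_pts):
    # Neighbourhoods: indices of all points within eps of each point
    # (each point counts as its own neighbour here).
    neighbours = {
        i: [j for j, q in enumerate(points) if euclidean(p, q) <= eps]
        for i, p in enumerate(points)
    }
    core = {i for i, nbrs in neighbours.items() if len(nbrs) >= min_pts}
    labels = {}
    for i in range(len(points)):
        if i in core:
            labels[i] = "core"
        elif any(j in core for j in neighbours[i]):
            labels[i] = "border"  # not dense itself, but near a core point
        else:
            labels[i] = "noise"
    return labels
```

The full DBSCAN algorithm then discards the noise points and connects core points that lie within Eps of each other into clusters, attaching border points to a nearby core point's cluster.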