Data Mining Cluster Analysis: Advanced Concepts and Algorithms
(ref. Chapter 9 of Introduction to Data Mining by Tan, Steinbach, Kumar)

Outline
Prototype-based
– Fuzzy c-means
– Mixture model clustering
Density-based
– Grid-based clustering
– Subspace clustering
Graph-based
– Chameleon
Scalable clustering algorithms
– CURE and BIRCH
Characteristics of clustering algorithms

Hard (Crisp) vs Soft (Fuzzy) Clustering
– Generalize the K-means objective function over all N points and k clusters:
  SSE = \sum_{j=1}^{k} \sum_{i=1}^{N} w_{ij} (x_i - c_j)^2, subject to \sum_{j=1}^{k} w_{ij} = 1
  where w_{ij} is the weight with which object x_i belongs to cluster C_j
– To minimize SSE, repeat the following steps:
  Fix c_j and determine w_ij (cluster assignment)
  Fix w_ij and recompute c_j
– Hard clustering: w_{ij} ∈ {0, 1}

Hard (Crisp) vs Soft (Fuzzy) Clustering
(Figure: a point x at 2 between two centroids, c1 at 1 and c2 at 5.)
SSE(x) is minimized when w_{x1} = 1, w_{x2} = 0.

Fuzzy C-means
Objective function (p: fuzzifier, p > 1):
  SSE = \sum_{j=1}^{k} \sum_{i=1}^{N} w_{ij}^p (x_i - c_j)^2, subject to \sum_{j=1}^{k} w_{ij} = 1
  where w_{ij} is the weight with which object x_i belongs to cluster C_j
– To minimize the objective function, repeat the following steps:
  Fix c_j and determine w_ij
  Fix w_ij and recompute c_j
– Fuzzy clustering: w_{ij} ∈ [0, 1]

Fuzzy C-means
(Same figure: x at 2, c1 at 1, c2 at 5.)
SSE(x) is minimized when w_{x1} = 0.9, w_{x2} = 0.1.

Fuzzy C-means
Objective function (p: fuzzifier, p > 1):
  SSE = \sum_{j=1}^{k} \sum_{i=1}^{N} w_{ij}^p (x_i - c_j)^2, subject to \sum_{j=1}^{k} w_{ij} = 1
Initialization: choose the weights w_{ij} randomly
Repeat:
– Update the centroids
– Update the weights
(A sketch of these update steps is given after this section.)

Fuzzy K-means Applied to Sample Data
(Figure.)

Hard (Crisp) vs Soft (Probabilistic) Clustering
The idea is to model the set of data points as arising from a mixture of distributions
– Typically, normal (Gaussian) distributions are used
– But other distributions have been used very profitably
Clusters are found by estimating the parameters of the statistical distributions
– Can use a K-means-like algorithm, called the EM algorithm, to estimate these parameters
  Actually, K-means is a special case of this approach
– Provides a compact representation of clusters
– The probabilities with which a point belongs to each cluster provide a functionality similar to fuzzy clustering

Probabilistic Clustering: Example
Informal example: consider modeling the points that generate the following histogram.
(Figure: histogram that looks like a combination of two normal distributions.)
Suppose we can estimate the mean and standard deviation of each normal distribution.
– This completely describes the two clusters
– We can compute the probabilities with which each point belongs to each cluster
– We can assign each point to the cluster (distribution) in which it is most probable

Probabilistic Clustering: EM Algorithm
Initialize the parameters
Repeat
  For each point, compute its probability under each distribution
  Using these probabilities, update the parameters of each distribution
Until there is no change
Very similar to K-means
– Consists of assignment and update steps
– Can use random initialization (problem of local minima)
– For normal distributions, typically use K-means to initialize
If normal distributions are used, elliptical as well as spherical cluster shapes can be found.
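The "update centroids / update weights" steps referenced above appear on the original slides only as formulas that did not survive extraction. The sketch below, in Python, uses the standard fuzzy c-means updates (centroids as w^p-weighted means, weights proportional to inverse squared distances raised to the power 1/(p-1)); treat it as an illustrative reconstruction under those assumptions, not the slides' own code, and all names are ours.

```python
import numpy as np

def fuzzy_c_means(X, k, p=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means sketch: alternate the weight and centroid updates."""
    rng = np.random.default_rng(seed)
    n_points = X.shape[0]
    # Initialization: choose the membership weights w_ij randomly, rows summing to 1.
    W = rng.random((n_points, k))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Update centroids: w^p-weighted mean of the points for each cluster.
        Wp = W ** p
        C = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
        # Update weights: proportional to inverse squared distance raised to 1/(p-1),
        # then normalized so that each point's weights sum to 1.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = d2 ** (-1.0 / (p - 1.0))
        W = inv / inv.sum(axis=1, keepdims=True)
    return C, W
```

With p close to 1 the weights approach hard (crisp) assignments; larger p makes the memberships fuzzier, which is why the slides require p > 1.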
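The EM loop described on the slide above (compute each point's probability under each distribution, then update the distribution parameters) might look roughly as follows for a mixture of one-dimensional Gaussians. This is a hedged illustration with made-up function and variable names, not an implementation referenced by the slides.

```python
import numpy as np

def em_gaussian_mixture_1d(x, k, n_iter=100, seed=0):
    """Sketch of EM for a mixture of k one-dimensional Gaussians."""
    rng = np.random.default_rng(seed)
    # Initialize: means from k random points, unit variances, uniform mixing weights.
    mu = rng.choice(x, size=k, replace=False).astype(float)
    sigma = np.ones(k)
    mix = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # Estimation step: probability that each point was generated by each Gaussian.
        dens = mix * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (
            sigma * np.sqrt(2.0 * np.pi)
        )
        resp = dens / dens.sum(axis=1, keepdims=True)   # responsibilities, shape (N, k)
        # Maximization step: update mixing weights, means, and standard deviations.
        nk = resp.sum(axis=0)
        mix = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return mix, mu, sigma
```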
Probabilistic Clustering: EM Algorithm
Choose K seeds as the means of the Gaussian distributions
Estimation: calculate the probability that each point belongs to each cluster, based on its distance
Maximization: move the mean of each Gaussian to the centroid of the data set, weighted by the contribution of each point
Repeat until the means no longer move

Probabilistic Clustering Applied to Sample Data
(Figure.)

Grid-based Clustering
A type of density-based clustering
(Figure.)

Grid-based Clustering
Issues
– how to discretize the dimensions: equal-width vs. equal-frequency discretization
– the density of cells containing points close to the border of a cluster can be very low, and these cells are discarded. A possible solution is to reduce the size of the cells, but this may introduce additional problems

Subspace Clustering
Until now, we found clusters by considering all of the attributes
Some clusters may involve only a subset of the attributes, i.e., subspaces of the data
– Example: in a document collection, documents can be represented as vectors whose dimensions correspond to terms. When k-means is used to find document clusters, the resulting clusters can typically be characterized by 10 or so terms

Example
Three clear clusters.
– The circle points are not a cluster in three dimensions
– If the dimensions are discretized (equal width), these points fall into low-density cells
(Figure.)

Histograms to Determine Density
Equal-width, contiguous intervals over the discretized space of the points to be clustered; density threshold = 6%.
(Figure.)

Example
(Figures on three slides.)

Example: Remarks
The circles do not form a cluster in the three dimensions, but they may form a cluster in some subspaces
A cluster in the three dimensions is part of a cluster (maybe a larger one) in the subspaces

CLIQUE Algorithm: Overview
A grid-based clustering algorithm that methodically finds subspace clusters
– Partitions the data space into rectangular units of equal volume
– Measures the density of each unit as the fraction of points it contains
– A unit is dense if the fraction of all points it contains is above a user-specified threshold, τ
– A cluster is a collection of contiguous (touching) dense units

CLIQUE Algorithm
It is impractical to check each subspace to see if it is dense, because there are exponentially many of them
– 2^n subspaces, if n is the number of dimensions
Monotone property of density-based clusters:
– If a set of points forms a density-based cluster in k dimensions, then the same set of points is also part of a density-based cluster in all possible subsets of those dimensions
Very similar to the Apriori algorithm for frequent itemset mining
Can find overlapping clusters

CLIQUE Algorithm
(Figure.)

Limitations of CLIQUE
Time complexity is exponential in the number of dimensions
– Especially if "too many" dense units are generated at the lower stages
May fail if clusters have widely differing densities, since the threshold is fixed
– Determining an appropriate threshold and unit interval length can be challenging
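To make the grid-based and CLIQUE ideas above concrete, here is a minimal Python sketch (names of our own choosing) of the core step they share: discretize each dimension into equal-width intervals, keep the cells whose fraction of points exceeds a density threshold τ (6% by default, echoing the histogram slide), and join contiguous dense cells into clusters. It works only in the full-dimensional grid; CLIQUE's Apriori-style bottom-up search over subspaces is not shown.

```python
import numpy as np
from collections import deque

def grid_density_clusters(X, bins=10, tau=0.06):
    """Equal-width grid; dense cells (fraction of points > tau) joined if contiguous."""
    n_points, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Map each point to its cell (equal-width discretization per dimension).
    cell = np.minimum(((X - lo) / (hi - lo + 1e-12) * bins).astype(int), bins - 1)
    counts = {}
    for c in map(tuple, cell):
        counts[c] = counts.get(c, 0) + 1
    dense = {c for c, n in counts.items() if n / n_points > tau}
    # Group contiguous (touching) dense cells with a breadth-first search.
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:
            c = queue.popleft()
            group.append(c)
            for dim in range(d):
                for step in (-1, 1):
                    nb = c[:dim] + (c[dim] + step,) + c[dim + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(group)
    return clusters
```

Shrinking the cell size (a larger `bins` value) makes the border-cell problem mentioned above more visible, since dense regions get split into many sparsely populated cells that fall below the threshold.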
Graph-Based Clustering: General Concepts
Graph-based clustering uses the proximity graph
– Start with the proximity matrix
– Consider each point as a node in a graph
– Each edge between two nodes has a weight, the proximity between the two points
– Initially the proximity graph is fully connected
– MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph
In the simplest case, clusters are the connected components of the graph.

Graph-Based Clustering: Chameleon
Based on several key ideas
– Sparsification of the proximity graph
– Partitioning the data into clusters that are relatively pure subclusters of the "true" clusters
– Merging based on preserving characteristics of clusters

Graph-Based Clustering: Sparsification
The amount of data that needs to be processed is drastically reduced, which makes the algorithm more scalable
– Sparsification can eliminate more than 99% of the entries in a proximity matrix
– The amount of time required to cluster the data is drastically reduced
– The size of the problems that can be handled is increased

Graph-Based Clustering: Sparsification (cont.)
Clustering may work better
– Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.
– The nearest neighbors of a point tend to belong to the same class as the point itself.
– This reduces the impact of noise and outliers and sharpens the distinction between clusters.
Sparsification facilitates the use of graph partitioning algorithms (or of algorithms based on graph partitioning)
– Chameleon and hypergraph-based clustering

Sparsification in the Clustering Process
(Figure.)

Limitations of Current Merging Schemes
Existing merging schemes in hierarchical clustering algorithms are static in nature
– MIN or CURE: merge two clusters based on their closeness (minimum distance)
– GROUP AVERAGE: merge two clusters based on their average connectivity

Limitations of Current Merging Schemes
(Figure with panels (a)–(d).)
Closeness schemes will merge (a) and (b)
Average-connectivity schemes will merge (c) and (d)

Chameleon: Clustering Using Dynamic Modeling
Adapts to the characteristics of the data set to find the natural clusters
Uses a dynamic model to measure the similarity between clusters
– The main properties are the relative closeness and relative interconnectivity of the clusters
– Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
– The merging scheme preserves self-similarity

Experimental Results: CHAMELEON
(Figures on three slides.)

CURE: a Scalable Algorithm
Agglomerative hierarchical clustering algorithms vary in how the proximity of two clusters is computed
– MIN (single link): susceptible to noise and outliers
– MAX (complete link), GROUP AVERAGE, centroid: may not work well with non-globular clusters
The CURE (Clustering Using REpresentatives) algorithm tries to handle both problems
– It is a graph-based algorithm
– It starts with a proximity matrix/proximity graph

CURE Algorithm
Represents a cluster using multiple representative points
– Goal: scalability, by choosing points that capture the geometry and shape of clusters
– Representative points are found by selecting a constant number of points from a cluster:
  The first representative point is chosen to be the point farthest from the center of the cluster
  The remaining representative points are chosen so that they are farthest from all previously chosen points
(A sketch of this selection, together with the shrinking step described on the next slide, follows below.)
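A minimal Python sketch of the representative-point selection just described, plus the shrinking step covered on the next slide. Function and parameter names are illustrative, not CURE's original code. Following the limiting cases stated on the slides (α = 0 behaving like centroid-based, α = 1 somewhat like single-link), α is interpreted here as the fraction of each representative's offset from the cluster center that is kept after shrinking; other presentations use the opposite convention.

```python
import numpy as np

def cure_representatives(points, num_rep=5, alpha=0.5):
    """Pick well-scattered representative points for one cluster, then shrink them."""
    center = points.mean(axis=0)
    # First representative: the point farthest from the cluster center.
    reps = [points[np.argmax(np.linalg.norm(points - center, axis=1))]]
    # Remaining representatives: each is farthest from all previously chosen ones.
    while len(reps) < min(num_rep, len(points)):
        dist_to_reps = np.min(
            [np.linalg.norm(points - r, axis=1) for r in reps], axis=0
        )
        reps.append(points[np.argmax(dist_to_reps)])
    # Shrink toward the center: keep a fraction alpha of each offset, so that
    # alpha = 0 collapses all representatives onto the centroid and alpha = 1
    # leaves them untouched, matching the limiting cases noted on the slides.
    return [center + alpha * (r - center) for r in reps]
```

Cluster proximity would then be the distance between the closest pair of (shrunken) representatives drawn from two different clusters, as the next slide states.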
CURE Algorithm
"Shrink" the representative points toward the center of the cluster by a factor α
(Figure.)
Shrinking the representative points toward the center helps avoid problems with noise and outliers
– shrinking factor: α
Cluster similarity is the similarity of the closest pair of representative points (MIN) from different clusters

CURE Algorithm
Uses an agglomerative hierarchical scheme to perform the clustering
– α = 0: similar to centroid-based
– α = 1: somewhat similar to single-link (MIN)
CURE is better able to handle clusters of arbitrary shapes and sizes

Experimental Results: CURE (10 clusters)
(Figure.)

Experimental Results: CURE (9 clusters)
(Figure.)

Experimental Results: CURE
(Picture from CURE, Guha, Rastogi, Shim.)

Experimental Results: CURE (centroid, single link)
(Picture from CURE, Guha, Rastogi, Shim.)

CURE Cannot Handle Differing Densities
(Figure: original points vs. CURE result.)

BIRCH: a Scalable Algorithm
Balanced Iterative Reducing and Clustering using Hierarchies
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
Weakness: handles only numeric data (Euclidean space) and is sensitive to the order of the data records

BIRCH
Clustering Feature (CF): (N, LS, SS)
– number of points, linear sum of the points, sum of squares of the points
– the CF is incrementally updated and is used to compute centroids and variance (used for measuring the diameter of the cluster)
– it is also used to compute distances between clusters
(A small CF sketch appears at the end of these notes.)

BIRCH
The CF is a compact storage for the data on the points in a cluster
It has enough information to calculate the intra-cluster distances
The additivity theorem allows us to merge subclusters:
– C3 = C1 ∪ C2
– CF_{C3} = (N_{C1} + N_{C2}, LS_{C1} + LS_{C2}, SS_{C1} + SS_{C2})

BIRCH
Basic steps of BIRCH
– Load the data into memory by creating a CF tree that "summarizes" the data (see the following slide)
– Perform global clustering. This produces a better clustering than the initial step; an agglomerative, hierarchical technique was selected.
– Redistribute the data points using the centroids of the clusters discovered in the global clustering phase, and thus discover a new (and hopefully better) set of clusters.

BIRCH
BIRCH maintains a balanced CF tree
– Branching factor B: maximum number of entries in a non-leaf node
– Maximum leaf size L: maximum number of entries in a leaf node
– Threshold T: the diameter of each leaf entry must be < T
(Figure: a CF tree with a root node, non-leaf nodes, and doubly linked leaf nodes; in the example B = 7 and L = 6.)

Characteristics of Data
High dimensionality
Size of the data set
Sparsity of attribute values
Noise and outliers
Types of attributes and type of data set
Differences in attribute scale
Properties of the data space
– Can you define a meaningful centroid?

Characteristics of Clusters
Data distribution
Shape
Differing sizes
Differing densities
Poor separation
Relationships among clusters
Subspace clusters

Characteristics of Clustering Algorithms
Order dependence
Non-determinism
Parameter selection
Scalability
Underlying model
Optimization-based approach
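As a small illustration of the clustering feature described in the BIRCH slides, the Python sketch below keeps (N, LS, SS) for one subcluster, updates it incrementally as points arrive, derives the centroid and a radius-style spread measure from it, and merges two subclusters using the additivity theorem. Class and method names are our own; this is not BIRCH's full CF-tree machinery.

```python
import numpy as np

class ClusteringFeature:
    """CF = (N, LS, SS): point count, linear sum, and sum of squared norms of the points."""

    def __init__(self, dim):
        self.n = 0
        self.ls = np.zeros(dim)   # linear sum of the points
        self.ss = 0.0             # sum of squared norms of the points

    def add_point(self, x):
        # Incremental update when a point is absorbed into the subcluster.
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of the points from the centroid,
        # computable from (N, LS, SS) alone.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

    def merge(self, other):
        # Additivity theorem: CF(C1 ∪ C2) = (N1 + N2, LS1 + LS2, SS1 + SS2).
        merged = ClusteringFeature(len(self.ls))
        merged.n = self.n + other.n
        merged.ls = self.ls + other.ls
        merged.ss = self.ss + other.ss
        return merged
```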