4. Clustering Methods

Concepts
• Partitional (k-Means, k-Medoids)
• Hierarchical (Agglomerative & Divisive, COBWEB)
• Density-based (DBSCAN, CLIQUE)
• Large-scale data (STING, BIRCH, CURE)

The Clustering Problem
• The clustering problem is about grouping a set of data tuples into a number of clusters, so that data in the same cluster are highly similar to each other and data in different clusters are highly different from each other.
• About clusters
  – Inter-cluster distance maximization
  – Intra-cluster distance minimization
• Clustering vs. classification
  – Which one is more difficult? Why?
  – There are various possible ways of clustering; which way is the best?

Different Ways of Representing Clusters
• Division with boundaries
• Venn diagram or spheres
• Probabilistic, e.g. a table of cluster membership probabilities:

        1    2    3
  I1   0.5  0.2  0.3
  I2    …    …    …
  In    …    …    …

• Dendrograms
• Trees
• Rules

Major Categories of Algorithms
• Partitioning: divide the data into k partitions (k fixed); regroup to get a better clustering.
• Hierarchical: divide into different numbers of partitions, layer by layer, merging (bottom-up) or dividing (top-down).
• Density-based: continue to grow a cluster as long as the density of the cluster exceeds a threshold.
• Grid-based: first divide the space into grid cells, then perform clustering on the cells.

k-Means
• Algorithm
  1. Given k
  2. Randomly pick k instances as the initial centers
  3. Assign each remaining instance to the closest of the k clusters
  4. Recalculate the mean of each cluster
  5. Repeat steps 3 and 4 until the means no longer change
• How good are the clusters?
  – Initial vs. final clusters
  – Within-cluster variation: the sum of diff(x, mean)^2 over all instances x
  – Why don't we consider inter-cluster distance?

Example
• For simplicity, one-dimensional objects and k = 2.
• Objects: 1, 2, 5, 6, 7
• k-Means:
  – Randomly select 5 and 6 as the initial centroids;
  – => two clusters {1, 2, 5} and {6, 7}; meanC1 = 8/3, meanC2 = 6.5
  – => {1, 2} and {5, 6, 7}; meanC1 = 1.5, meanC2 = 6
  – => no change.
  – Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
• (A Python sketch replaying this run appears at the end of the k-Means/k-Medoids discussion below.)

Discussions
• Limitations:
  – Means cannot be defined for categorical attributes;
  – The choice of k;
  – Sensitivity to outliers;
  – Crisp clustering (each instance belongs to exactly one cluster).
• Variants of k-Means exist:
  – Using modes to deal with categorical attributes
• What about other distance measures?
• Is it similar to or different from k-NN?
  – With and without learning

k-Medoids
• The k-Means algorithm is sensitive to outliers
  – Is this true? How can we show it?
• Medoid: the most centrally located point in a cluster, used as a representative point of the cluster.
• In contrast, a centroid is not necessarily one of the cluster's points.
• An example (figure: a set of points with the initial medoids marked)

Partition Around Medoids (PAM)
1. Given k
2. Randomly pick k instances as the initial medoids
3. Assign each instance to the nearest medoid
4. Calculate the objective function
   • the sum of dissimilarities of all instances to their nearest medoids
5. Randomly select a non-medoid instance y
6. Swap a medoid x with y if the swap reduces the objective function
7. Repeat steps 3-6 until no change (a sketch of this swap phase also appears below)

k-Means and k-Medoids
• The key difference lies in how they update the means or medoids
• Both require distance calculations and reassignment of instances
• Time complexity
  – Which one is more costly?
• Dealing with outliers (figure: an outlier 100 units away drags a cluster mean far more than a medoid)
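To make steps 1-5 concrete, here is a minimal k-Means sketch in Python that replays the one-dimensional example above. The function and variable names are illustrative (the lecture does not prescribe an implementation), and the sketch assumes no cluster ever becomes empty.

```python
# Minimal 1-D k-Means sketch; assumes no cluster ever becomes empty.
def kmeans(points, centers):
    while True:
        # Step 3: assign each instance to the closest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Step 4: recalculate the mean of each cluster.
        new_centers = [sum(c) / len(c) for c in clusters]
        # Step 5: stop once the means no longer change.
        if new_centers == centers:
            return clusters, centers
        centers = new_centers

clusters, means = kmeans([1, 2, 5, 6, 7], centers=[5, 6])
print(clusters, means)  # [[1, 2], [5, 6, 7]] [1.5, 6.0]
# Within-cluster variation (aggregate dissimilarity), as on the slide:
print(sum((p - m) ** 2 for c, m in zip(clusters, means) for p in c))  # 2.5
```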
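In the same spirit, a minimal sketch of PAM's swap phase. For brevity it tries all swaps exhaustively instead of drawing y at random as in step 5, so it is a deterministic stand-in for the randomized procedure; the names and data are again illustrative.

```python
def cost(points, medoids):
    # Objective function (step 4): the sum of dissimilarities of all
    # instances to their nearest medoids.
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, medoids):
    # Steps 5-7: swap a medoid x with a non-medoid y whenever the swap
    # reduces the objective function; stop when no swap helps.
    while True:
        best = min(
            (medoids[:i] + [y] + medoids[i + 1:]
             for y in points if y not in medoids
             for i in range(len(medoids))),
            key=lambda m: cost(points, m))
        if cost(points, best) >= cost(points, medoids):
            return medoids
        medoids = best

print(pam([1, 2, 5, 6, 7], medoids=[5, 6]))  # [1, 6], with objective 3
```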
EM (Expectation Maximization)
• Moves away from the crisp clusters of k-Means by allowing an instance to belong to several clusters
• Finite mixtures: a statistical clustering model
  – A mixture is a set of k probability distributions, representing k clusters
  – The simplest finite mixture: one feature, with one Gaussian per cluster
  – When k = 2, we need to estimate 5 parameters: two means (μA, μB), two standard deviations (σA, σB), and pA, where pB = 1 - pA
• EM
  – E-step: estimate each instance's cluster memberships from the current parameters
  – M-step: re-estimate the parameters to maximize the overall likelihood that the data came from this mixture

Agglomerative
• Each object starts as its own cluster (bottom-up).
• Repeat until the number of clusters is small enough:
  – Choose the closest pair of clusters
  – Merge the two into one
• Defining "closest": centroid (cluster mean) distance, (average) sum of pairwise distances, …
  – Refer to the Evaluation part
• A dendrogram is a tree that shows the clustering process.

Dendrogram
• Cluster 1, 2, 4, 5, 6, 7 into two clusters (centroid distance)
• (figure: dendrogram over the points 1, 2, 4, 5, 6, 7)

An Example Showing Different Links
• Distance matrix:

      A  B  C  D  E
  A   0  1  2  2  3
  B   1  0  2  4  3
  C   2  2  0  1  5
  D   2  4  1  0  3
  E   3  3  5  3  0

• Single link
  – Merge the nearest clusters, where the distance between two clusters is the shortest edge between them
  – (((A B) (C D)) E)
• Complete link
  – Merge the nearest clusters, where the distance between two clusters is the longest edge between them
  – (((A B) E) (C D))
• Average link
  – Merge the nearest clusters, where the distance between two clusters is the average edge length between them
  – (((A B) (C D)) E)

Divisive
• All instances start in one cluster (top-down).
• Finding an optimal division at each layer (especially the top one) is computationally prohibitive.
• One heuristic method is based on the Minimum Spanning Tree (MST):
  – Connect all instances with an MST (O(N^2))
  – Repeatedly cut the longest remaining edge, until some stopping criterion is met or until one instance remains in each cluster.
• (Short Python sketches of EM, agglomerative linkage, and the MST heuristic follow.)
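First, a sketch of EM for the simplest finite mixture described above: one feature, two Gaussians, five parameters. The initialization, the synthetic data, and all names are illustrative assumptions, not part of the lecture; note how each instance keeps a soft membership w rather than a crisp label.

```python
import math, random

def pdf(x, mu, sigma):
    # Gaussian density, used to compute cluster responsibilities.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em(xs, iters=50):
    # The five parameters for k = 2: muA, muB, sigmaA, sigmaB, pA (pB = 1 - pA).
    muA, muB, sA, sB, pA = min(xs), max(xs), 1.0, 1.0, 0.5
    for _ in range(iters):
        # E-step: probability that each instance came from cluster A.
        w = [pA * pdf(x, muA, sA) /
             (pA * pdf(x, muA, sA) + (1 - pA) * pdf(x, muB, sB)) for x in xs]
        # M-step: re-estimate the parameters to maximize the likelihood.
        nA = sum(w)
        muA = sum(wi * x for wi, x in zip(w, xs)) / nA
        muB = sum((1 - wi) * x for wi, x in zip(w, xs)) / (len(xs) - nA)
        sA = math.sqrt(sum(wi * (x - muA) ** 2
                           for wi, x in zip(w, xs)) / nA)
        sB = math.sqrt(sum((1 - wi) * (x - muB) ** 2
                           for wi, x in zip(w, xs)) / (len(xs) - nA))
        pA = nA / len(xs)
    return muA, muB, sA, sB, pA

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(100)]
print(em(data))  # roughly (0, 5, 1, 1, 2/3)
```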
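Next, an agglomerative sketch that replays the link example on the distance matrix above: passing min as the linkage gives single link, and max gives complete link. The code itself is an illustrative assumption; the nested tuples it prints match the slide's groupings up to ordering.

```python
# Distance matrix from the slide.
D = {"A": {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3},
     "B": {"A": 1, "B": 0, "C": 2, "D": 4, "E": 3},
     "C": {"A": 2, "B": 2, "C": 0, "D": 1, "E": 5},
     "D": {"A": 2, "B": 4, "C": 1, "D": 0, "E": 3},
     "E": {"A": 3, "B": 3, "C": 5, "D": 3, "E": 0}}

def leaves(c):
    # Flatten a nested cluster tuple into its member objects.
    return [c] if isinstance(c, str) else [x for sub in c for x in leaves(sub)]

def agglomerate(items, linkage):
    clusters = list(items)
    while len(clusters) > 1:
        # Merge the nearest pair, where cluster distance is the shortest
        # (linkage=min) or longest (linkage=max) edge between the two.
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(D[a][b]
                                          for a in leaves(clusters[ij[0]])
                                          for b in leaves(clusters[ij[1]])))
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]

print(agglomerate("ABCDE", min))  # ('E', (('A', 'B'), ('C', 'D'))) ~ (((A B)(C D)) E)
print(agglomerate("ABCDE", max))  # (('C', 'D'), ('E', ('A', 'B'))) ~ (((A B) E)(C D))
```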
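Finally, the MST heuristic for divisive clustering, sketched on the earlier one-dimensional objects: Prim's algorithm builds the tree in O(N^2), and cutting the k - 1 longest tree edges leaves k connected components. Data and names are again illustrative.

```python
def mst_divide(points, k):
    # Connect all instances with a minimum spanning tree (Prim, O(N^2)).
    edges, in_tree, rest = [], {points[0]}, set(points[1:])
    while rest:
        d, u, v = min((abs(u - v), u, v) for u in in_tree for v in rest)
        edges.append((d, u, v))
        in_tree.add(v)
        rest.remove(v)
    # Cut the k - 1 longest edges; the remaining edges form k components.
    kept = sorted(edges)[:len(edges) - (k - 1)]
    comp = {p: {p} for p in points}
    for _, u, v in kept:
        merged = comp[u] | comp[v]
        for p in merged:
            comp[p] = merged
    return {frozenset(c) for c in comp.values()}

print(mst_divide([1, 2, 5, 6, 7], k=2))  # {frozenset({1, 2}), frozenset({5, 6, 7})}
```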
COBWEB
• Builds a conceptual hierarchy incrementally
• Each cluster has a probabilistic description
• Category Utility: CU = Σk Σi Σj P(fi = vij) · P(fi = vij | ck) · P(ck | fi = vij)
  – summed over all categories ck, all features fi, and all feature values vij
• It attempts to maximize both the probability that two objects in the same category have values in common and the probability that objects in different categories have different property values

(figure: a tree of clusters produced by COBWEB)

• COBWEB processes one instance at a time, choosing the best option among:
  – Placing the instance in the best existing category
  – Adding a new category containing only the instance
  – Merging two existing categories into a new one and adding the instance to it
  – Splitting an existing category into two and placing the instance in the best resulting category
• (figure: the merge and split operations, shown on a subtree with a grandparent, a parent, and two children)

Cobweb Demo
• http://kiew.cs.uni-dortmund.de:8001/mlnet/instances/81d91eaae317b2bebb

Density-Based: DBSCAN
• DBSCAN: Density-Based Spatial Clustering of Applications with Noise
• It grows regions of sufficiently high density into clusters, and can discover clusters of arbitrary shape in spatial databases with noise.
  – Many existing clustering algorithms find only spherical clusters
• DBSCAN defines a cluster as a maximal set of density-connected points.

Defining Density and Connection
• ε-neighborhood of an object x; x is a core object if its ε-neighborhood is dense enough (M, P, Q in the figure)
• MinPts: the minimum number of objects required within an ε-neighborhood (say, 3)
• Directly density-reachable (Q from M, M from P)
• Density-reachable (Q from P, but P not from Q) [asymmetric]
• Density-connected (O, R, S) [symmetric] (covers border points)
• What is the relationship between density-reachability (DR) and density-connectivity (DC)?
• (figure: sample points M, P, Q, R, S, O illustrating the definitions)

Clustering with DBSCAN
• Search for clusters by checking the ε-neighborhood of each instance x
• If the ε-neighborhood of x contains at least MinPts objects, create a new cluster with x as a core object
• Iteratively collect directly density-reachable objects from these core objects, merging density-reachable clusters
• Terminate when no new point can be added to any cluster
• DBSCAN is sensitive to the density thresholds, but it is many times faster than CLARANS
• Time complexity: O(N log N) if a spatial index is used, O(N^2) otherwise
• (A small Python sketch of the procedure follows.)
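A compact sketch of the procedure just outlined, for one-dimensional points (ε is written eps; the data, parameter values, and names are illustrative assumptions). Points left unlabeled at the end are noise.

```python
def neighborhood(points, x, eps):
    # The eps-neighborhood of x (x itself included).
    return [p for p in points if abs(p - x) <= eps]

def dbscan(points, eps, min_pts):
    labels, cid = {}, 0
    for x in points:
        if x in labels:
            continue
        if len(neighborhood(points, x, eps)) < min_pts:
            continue                      # x is not a core object here
        cid += 1                          # new cluster seeded by core object x
        queue = [x]
        while queue:                      # collect density-reachable objects
            q = queue.pop()
            if q in labels:
                continue
            labels[q] = cid
            q_nbrs = neighborhood(points, q, eps)
            if len(q_nbrs) >= min_pts:    # q is core: expand through it;
                queue.extend(q_nbrs)      # border points are labeled, not expanded
    return labels

print(dbscan([1, 2, 3, 10, 11, 12, 50], eps=1.5, min_pts=2))
# {1: 1, 2: 1, 3: 1, 10: 2, 11: 2, 12: 2}; 50 stays unlabeled (noise)
```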
Dealing with Large Data
• Key ideas
  – Reduce the number of instances to be maintained, while still preserving the distribution
  – Identify the relevant subspaces where clusters may exist
  – Use summarized information to avoid repeated data access
• Sampling
  – CLARA (Clustering LARge Applications): works on samples instead of the whole data set
  – CLARANS (Clustering Large Applications based on RANdomized Search)
• Grid: STING (STatistical INformation Grid)
  – Statistical parameters of higher-level cells can easily be computed from those of lower-level cells
    • Attribute-independent: count
    • Attribute-dependent: mean, standard deviation, min, max
    • Type of distribution: normal, uniform, exponential, or unknown
  – Irrelevant cells can be removed

Representatives
• BIRCH uses Clustering Features (CF) and a CF tree
  – A clustering feature is a triplet summarizing a sub-cluster of instances: (N, LS, SS)
    • N: the number of instances; LS: their linear sum; SS: their square sum
  – Two thresholds: the branching factor (the maximum number of children per non-leaf node) and a diameter threshold for the sub-clusters at the leaves
  – Two phases:
    1. Build an initial in-memory CF tree
    2. Apply a clustering algorithm to cluster the leaf nodes of the CF tree
• CURE (Clustering Using REpresentatives) is another example

CF Tree
• (figure: a CF tree with branching factor B = 7 and leaf diameter threshold L = 6; the root and non-leaf nodes hold CF entries CF1, CF2, … with child pointers, and the leaf nodes hold CF entries and are chained together by prev/next pointers)

CLIQUE
• Takes advantage of a property of density:
  – If the data are dense in a higher-dimensional subspace, they must be dense in some lower-dimensional subspaces
  – With high-dimensional data, there are many void subspaces
  – Using this property, we can start from the dense lower-dimensional subspaces
• CLIQUE (CLustering In QUEst) is a density-based method that can automatically find the subspaces of highest dimensionality in which high-density clusters exist

Drawbacks of Distance-Based Methods
• Drawbacks of square-error based clustering methods:
  – They consider only one point as the representative of a cluster
  – They are good only for convex-shaped clusters of similar size and density, and only if k can be reasonably estimated

Chameleon
• A hierarchical clustering algorithm using dynamic modeling
  – Motivated by observations of the weaknesses of purely distance-based methods
• Basic steps:
  – Build a k-nearest-neighbor graph
  – Partition the graph
  – Merge the "strongly connected" partitions, judged by the strength of the connections between partitions

Summary
• There are many clustering algorithms.
• Good clustering algorithms maximize inter-cluster dissimilarity and intra-cluster similarity.
• Without prior knowledge, it is difficult to choose the best clustering algorithm.
• Clustering is an important tool for outlier analysis.