Chapter 10. Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Outlier Analysis
• Summary

Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree.
• Can be visualized as a dendrogram
  – A tree-like diagram
  – Records the sequence of merges or splits

Dendrograms
A dendrogram is a tree data structure that illustrates a hierarchical clustering; each level of the tree shows the clusters at that level.
• Leaf – individual clusters
• Root – one cluster
• A cluster at level i is the union of its child clusters at level i+1.
• Each level is typically associated with a distance threshold: the subclusters making up the clusters at that level were combined because the distance between them was less than that threshold.

[Figure: example dendrograms]

Inter-cluster distance (similarity) metrics
• Many hierarchical clustering algorithms require that the distance (or inter-cluster similarity) between clusters be determined.
• Let Ki and Kj be two clusters, with i ∈ Ki and j ∈ Kj. A variety of distance metrics can be used to calculate the distance between clusters:
  – Single link: smallest distance between two points, one in Ki and the other in Kj: dist(Ki, Kj) = min dist(i, j).
  – Complete link: largest distance between two points, one in Ki and the other in Kj: dist(Ki, Kj) = max dist(i, j).
  – Average link: average distance between two points, one in Ki and the other in Kj: dist(Ki, Kj) = avg dist(i, j).
  – Centroid: distance between the centroids: dist(Ki, Kj) = dist(CKi, CKj).
  – Medoid: distance between the medoids: dist(Ki, Kj) = dist(MKi, MKj).

Inter-cluster distance metrics (cont.)
• Single link: calculate the smallest distance between an element in one cluster and an element in the other cluster.
• Complete link (farthest neighbor): calculate the largest distance between an element in one cluster and an element in the other cluster.
• Average link: calculate the average distance between each element in one cluster and all elements in the other cluster.
• Centroid: calculate the distance between the cluster centroids.
• Medoid: calculate the distance between the cluster medoids.
[Figure: single link, complete link, centroid, and medoid distances between two clusters]
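To make the point-based metrics above concrete, here is a minimal NumPy sketch of the single, complete, average, and centroid link distances between two small clusters. The function names and the toy points are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def pairwise_dists(Ki, Kj):
    """Euclidean distances between every point in Ki and every point in Kj."""
    Ki, Kj = np.asarray(Ki, float), np.asarray(Kj, float)
    return np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)

def single_link(Ki, Kj):    # smallest pairwise distance
    return pairwise_dists(Ki, Kj).min()

def complete_link(Ki, Kj):  # largest pairwise distance
    return pairwise_dists(Ki, Kj).max()

def average_link(Ki, Kj):   # mean of all pairwise distances
    return pairwise_dists(Ki, Kj).mean()

def centroid_link(Ki, Kj):  # distance between the cluster centroids
    return np.linalg.norm(np.mean(Ki, axis=0) - np.mean(Kj, axis=0))

if __name__ == "__main__":
    Ki = [(0.0, 0.0), (1.0, 0.0)]   # toy cluster 1
    Kj = [(4.0, 0.0), (5.0, 1.0)]   # toy cluster 2
    print(single_link(Ki, Kj), complete_link(Ki, Kj),
          average_link(Ki, Kj), centroid_link(Ki, Kj))
```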
Hierarchical Clustering
Two main types of algorithms:
– Agglomerative (bottom-up merging)
  • Start with each point as an individual cluster
  • Merge clusters until only one is left
– Divisive (top-down splitting)
  • Start with all the points in one cluster
  • Split clusters until only singleton clusters remain
– Agglomerative is the more popular approach
• Traditional hierarchical algorithms use a similarity or distance matrix, and merge or split one cluster at a time.

Agglomerative Clustering (Bottom-Up Merging)
• Bottom-up merging techniques are called agglomerative because they agglomerate (merge) smaller clusters into bigger ones.
• Typically, two clusters are merged if the distance metric between them is less than a certain threshold.
• The threshold increases at each stage of merging.
• A variety of distance metrics can be used to calculate the distance (inter-cluster similarity) between clusters.

Agglomerative Clustering (Bottom-Up Merging)
The sequence of graphs below shows how a bottom-up merging technique may proceed. Each example starts off in its own cluster. At each stage, we use the distance threshold to decide which clusters to merge, until eventually we have only one super-cluster. (We may also stop at some other termination condition, e.g. when fewer than a certain number of super-clusters remain.)
[Figure: points A–E merged in stages – Stage 1: 5 clusters; Stage 2: 3 clusters (merge); Stage 3: 1 cluster (merge). Source: Alan Abrahams]

Divisive Clustering (Top-Down Splitting)
The sequence of graphs below shows how a top-down splitting technique may proceed. We start off with one super-cluster. At each stage, we decide where to split, until eventually we have reasonable sub-clusters.
[Figure: points A–E split in stages – Stage 1: 1 cluster; Stage 2: 2 clusters (split); Stage 3: 3 clusters (split). Source: Alan Abrahams]

Agglomerative Algorithms
Agglomerative algorithms start with each individual item in its own cluster and iteratively merge clusters until all items belong to one cluster. The key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms:
• Single link
• Complete link (farthest neighbor)
• Average link
• Centroid
• Medoid

Agglomerative Algorithms
In the algorithm given on the next slide:
• d: the threshold distance
• k: the number of clusters
• K: the set of clusters
The procedure NewClusters determines how to form the next-level clusters from the previous level. This is where the different types of agglomerative algorithms differ: different approaches to defining the distance between clusters (single link, complete link, average link, centroid, medoid) result in different algorithms.

Agglomerative Algorithm
[Figure: generic agglomerative algorithm pseudocode. © Prentice Hall]

Basic agglomerative algorithm using the single-link (min) technique
• View all items as a graph with links (distances) between them.
• Find the maximal connected components in this graph. (A connected component is a subgraph in which there exists a path between every pair of vertices.)
• Two clusters are merged if there is at least one edge connecting them whose distance is less than or equal to the threshold distance at that level; a sketch of this step follows below.
• The similarity of two clusters is based on the two closest points (nearest neighbours) in the different clusters, i.e. the single-link distance metric.
• It is also called nearest-neighbour clustering.
• Can handle non-elliptical shapes.
• Sensitive to noise and outliers.
• Could be agglomerative or divisive.
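A minimal sketch of the threshold/connected-components step just described, assuming Euclidean points. The helper names, the union-find bookkeeping, and the toy coordinates are illustrative choices rather than part of the slides.

```python
import numpy as np

def single_link_level(points, threshold):
    """Clusters at one dendrogram level: connected components of the graph
    that links every pair of points whose distance is <= threshold."""
    pts = np.asarray(points, float)
    n = len(pts)
    parent = list(range(n))           # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pts[i] - pts[j]) <= threshold:
                union(i, j)           # edge within threshold: merge components

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Raising the threshold merges clusters, as in the staged example above.
pts = [(0, 0), (1, 0), (5, 0), (6, 0), (20, 0)]
print(single_link_level(pts, 1.5))   # e.g. [[0, 1], [2, 3], [4]]
print(single_link_level(pts, 5.0))   # fewer, larger clusters
```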
Agglomerative Example
Distance matrix:
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0
[Figure: single-link dendrogram for A–E at thresholds 1 through 5]
(This example is reproduced in the SciPy sketch after the MST example below.)

Basic agglomerative algorithms
Agglomerative algorithm using the complete-link (max) technique:
• Two clusters are merged if the maximum distance between them is less than or equal to the distance threshold.
• Clusters found using complete link tend to be more compact than those found with the single-link technique.
• Tends to break large clusters.
• Less susceptible to noise and outliers.
Agglomerative algorithm using the average-link technique:
• Two clusters are merged if the average distance between them is less than or equal to the distance threshold.
• A compromise between single and complete link.
• Need to use average connectivity for scalability, since total connectivity favors large clusters.
• Less susceptible to noise and outliers.

Agglomerative Clustering
[Figure: agglomerative clustering illustration]

MST agglomerative algorithm for single-link clustering
• A Minimum Spanning Tree (MST) is a tree that contains all vertices of the graph (every vertex has an adjacent edge) and whose sum of edge weights is minimal.
• In clustering, an edge weight represents the distance between two points.
1. Start with the clustering C0 = {(x1), (x2), …, (xn)}. Find the MST of the graph G.
2. Repeat steps 3 and 4 until all objects are in a single cluster.
3. Merge the two clusters connected by the MST edge with the smallest weight. Update the clustering.
4. Set the weight of the edge selected in step 3 to ∞ so that it is not selected again.
Source: Songrit Maneewongvatana

MST for single link: example
[Figure: MST over points x1–x5 with edge weights (e.g. 4.2, 2.6, 1.9, 1.7), the sequence of agglomerative merges and divisive splits, and the corresponding proximity dendrograms]
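As a quick check of the A–E example above, the same single-link merges can be reproduced with SciPy's hierarchical-clustering routines. SciPy is not used in the slides; this is just an illustrative sketch.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Distance matrix from the A-E example above.
labels = ["A", "B", "C", "D", "E"]
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

# Single-link (min) agglomerative clustering on the condensed matrix.
Z = linkage(squareform(D), method="single")

# Cutting the dendrogram at increasing thresholds reproduces the merges:
# {A,B} and {C,D} at 1, {A,B,C,D} at 2, all five points at 3.
for t in (1, 2, 3):
    flat = fcluster(Z, t=t, criterion="distance")
    print(f"threshold {t}:", dict(zip(labels, flat)))
```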
Hierarchical Clustering (AGNES & DIANA)
• Uses a distance matrix as the clustering criterion.
• Does not require the number of clusters k as an input, but needs a termination condition.
[Figure: AGNES merges a, b, c, d, e step by step (ab, de, cde, abcde) from Step 0 to Step 4; DIANA splits abcde back into singletons in the reverse order]

AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., S-Plus
• Uses the single-link method and the dissimilarity matrix
• Merges the nodes that have the least dissimilarity
• Continues in a non-descending fashion
• Eventually all nodes belong to the same cluster
[Figure: three scatter plots showing successive AGNES merges]

DIANA (Divisive Analysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., S-Plus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
[Figure: three scatter plots showing successive DIANA splits]

More on Hierarchical Clustering Methods
Major weaknesses of agglomerative clustering methods:
• Do not scale well: time complexity of at least O(n²), where n is the total number of objects.
• Can never undo what was done previously.
Integration of hierarchical with distance-based clustering:
• BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
• CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.
• CHAMELEON (1999): hierarchical clustering using dynamic modeling.

BIRCH (1996)
• BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD'96).
• Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering.
  – Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure).
  – Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree.
• Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans.
• Weakness: handles only numeric data, and is sensitive to the order of the data records.

BIRCH (Cont.)
• The Clustering Feature tree (CF tree) is the main concept of the BIRCH algorithm.
• A clustering feature is a triple summarizing the information about a cluster. The elements of the triple are:
  1. The number of data points in the cluster.
  2. The sum of all data point vectors.
  3. The sum of all squared data point vectors.
• E.g., if the cluster has n data points {x1, …, xn}, its clustering feature is CF = (n, Σ_{i=1..n} xi, Σ_{i=1..n} xi²).
• In other words, the triple consists of the zeroth, first, and second moments of the point set.

BIRCH: CF tree
• The CF tree is a height-balanced tree.
• The branching factor B and the cluster threshold T are adjustable.
• Each internal node contains at most B entries of the form [CFi, childi], where childi is a pointer to the ith child node and CFi is the CF of the cluster associated with that child. The CF of the internal node is the sum of all CFi of its children.
• Each leaf node contains at most L entries of the form [CFi]. Like an internal node, a leaf node represents the cluster made up of the subclusters represented by all CFi in the node.
• The disk page size determines the values of L and B.
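The CF triple above is easy to compute and, importantly, additive: merging two subclusters just adds their CFs component-wise. A minimal sketch follows; the helper names are illustrative, and the five points are the ones used in the Clustering Feature Vector example further below.

```python
import numpy as np

def cf(points):
    """Clustering Feature of a set of d-dimensional points: (N, LS, SS)."""
    X = np.asarray(points, float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def cf_merge(cf1, cf2):
    """CFs are additive: merging two subclusters just adds their CFs."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

def centroid(entry):
    n, ls, _ = entry
    return ls / n

# The five points used in the Clustering Feature Vector slide below.
pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(pts))                        # (5, array([16., 30.]), array([54., 190.]))
print(centroid(cf(pts)))              # cluster centroid from the CF alone

# Additivity: CF of the first two points plus CF of the rest equals CF of all.
merged = cf_merge(cf(pts[:2]), cf(pts[2:]))
print(merged)                         # same totals as cf(pts)
```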
BIRCH: CF tree (cont.)
• All entries in a leaf node must satisfy the threshold requirement: the radius or diameter measure of the cluster (depending on the algorithm) must be less than T.
• The threshold T governs the tree size: the larger T is, the smaller the tree, because leaf-level clusters can absorb more points.
• The CF tree is built by inserting data points one by one. Initially, there is only the root node.
• Each time a new data point is inserted into the tree, the algorithm finds the most appropriate path by choosing the closest child node according to the desired distance metric.
• Once it reaches a leaf node, it finds the closest cluster entry and tests whether the new data point can be absorbed by that cluster without violating the threshold requirement (a sketch of this test appears after the BIRCH slides below).
Source: Songrit Maneewongvatana

BIRCH: CF tree (cont.)
• If the new data point can be assigned to the cluster, only the CF of that cluster is updated.
• If not, a new entry is allocated in the leaf and the data point is assigned to the newly created cluster.
  – If there is no space left in the leaf node for the new entry, the leaf node must be split.
• If a leaf node requires a split, the two clusters in the leaf that are farthest apart are chosen as seeds: one entry is put in the first newly created leaf and the other in the second. All other clusters are assigned to one of the two leaves depending on their distance to the seed clusters.
• The node split may propagate up the tree if there is no available entry at the parent of the leaf node.

BIRCH: CF tree (cont.)
• The CFs along the path from the leaf to the root need to be updated to reflect the newly inserted data.
• BIRCH also provides a method to enhance the space utilization of each node: at the internal node where the propagation of a node split terminates, the algorithm tries to merge the two closest entries if possible.
• Since each node contains only a limited number of clusters, the clustering structure in the CF tree may not reflect the actual clustering.
• Also, the same data point may end up in different clusters depending on the insertion order.

Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
• N: number of data points
• LS: linear sum of the N data points, Σ_{i=1..N} Xi
• SS: square sum of the N data points, Σ_{i=1..N} Xi²
Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)).
[Figure: the five points plotted in the plane]

CF Tree
[Figure: CF tree with branching factor B = 7 and leaf capacity L = 6; the root and non-leaf nodes hold [CFi, childi] entries, and leaf nodes hold CF entries chained by prev/next pointers]

BIRCH: algorithm
• Phase 1: Build the CF tree. If memory runs out before the data set is finished, BIRCH builds another CF tree with a larger T value, reinserting the leaves from the old tree and continuing with the remaining data.
• Phase 2 (optional): Condense the tree by scanning the leaf entries of the tree built in Phase 1; remove outliers and group subclusters that are close together into larger ones.
• Phase 3: Global clustering; any clustering algorithm can be used, e.g., treating each leaf-level subcluster as a single data point.
• Phase 4 (optional): Reassign each point to the closest cluster. This requires one or more passes over the full data.

BIRCH (cont.)
Advantages
• It uses a compact representation (CF).
• Its adjustable parameter T controls the size of the CF tree, so the tree can be made to fit in main memory.
• One data scan is sufficient to build the clustering, and extra passes give a higher-quality clustering.
Disadvantages
• It depends on the order in which the data are inserted.
• The data type is limited to numeric data in a metric space.
• Treating subclusters as points may not give the best result during global clustering.
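A minimal sketch of the leaf-entry absorption test described above, using the radius variant of the threshold requirement. The function names, the example entry, and the threshold value are illustrative assumptions.

```python
import numpy as np

def radius_from_cf(n, ls, ss):
    """Radius of a subcluster from its CF: root-mean-square distance of the
    members from the centroid, computable from (N, LS, SS) alone."""
    centroid = ls / n
    mean_sq = ss.sum() / n - np.dot(centroid, centroid)
    return float(np.sqrt(max(mean_sq, 0.0)))

def try_absorb(entry, x, T):
    """Leaf-entry test from the slides: absorb x into the closest entry
    only if the updated cluster still satisfies radius <= T."""
    n, ls, ss = entry
    x = np.asarray(x, float)
    new = (n + 1, ls + x, ss + x ** 2)          # CF after adding x
    if radius_from_cf(*new) <= T:
        return True, new                        # absorbed: update the CF
    return False, entry                         # rejected: start a new entry

# Example entry summarizing the points (2,4) and (3,6).
entry = (2, np.array([5.0, 10.0]), np.array([13.0, 52.0]))
print(try_absorb(entry, (3.0, 5.0), T=1.5))   # nearby point: absorbed
print(try_absorb(entry, (9.0, 1.0), T=1.5))   # too far: new leaf entry
```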
CURE (Clustering Using REpresentatives)
• CURE: proposed by Guha, Rastogi & Shim, 1998.
• Stops the creation of a cluster hierarchy when a level consists of k clusters.
• Uses multiple representative points to evaluate the distance between clusters; adjusts well to arbitrarily shaped clusters and avoids the single-link effect.

Drawbacks of Distance-Based Methods
Drawbacks of square-error based clustering methods:
• They consider only one point as the representative of a cluster.
• They work well only for convex-shaped clusters of similar size and density, and only if k can be reasonably estimated.

CURE: The Algorithm
• Draw a random sample s.
• Partition the sample into p partitions, each of size s/p.
• Partially cluster each partition into s/(pq) clusters.
• Eliminate outliers: first by random sampling, then by eliminating clusters that grow too slowly.
• Cluster the partial clusters.
• Label the data on disk.

Data Partitioning and Clustering
Example parameters: s = 50, p = 2, s/p = 25, s/(pq) = 5.
[Figure: the sample is split into two partitions, each partition is partially clustered, and the partial clusters are then clustered together]

CURE: Shrinking Representative Points
• Shrink the multiple representative points towards the gravity center by a specified fraction α.
• Multiple representatives capture the shape of the cluster.
[Figure: representative points before and after shrinking towards the cluster center]

Clustering Categorical Data: ROCK
• ROCK: RObust Clustering using linKs, by S. Guha, R. Rastogi, and K. Shim (ICDE'99).
• Suited for categorical attributes.
• Uses links to measure similarity/proximity; not distance based.
• Computational complexity: O(n² + n·m_m·m_a + n²·log n).
• Basic idea (similarity function and neighbors): Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|.
  Example: let T1 = {1,2,3} and T2 = {3,4,5}. Then Sim(T1, T2) = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2.

ROCK: Algorithm
• Links: the number of common neighbours of two points.
  Example: among the ten 3-subsets of {1,…,5} ({1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}), the points {1,2,3} and {1,2,4} have 3 links (common neighbours); see the sketch at the end of this section.
• Algorithm:
  – Draw a random sample.
  – Cluster with links.
  – Label the data on disk.

CHAMELEON
• CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E. H. Han, and V. Kumar (1999).
• Measures similarity based on a dynamic model: two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.
• A two-phase algorithm:
  1. Use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters.
  2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters.

Overall Framework of CHAMELEON
[Figure: data set → construct a sparse graph → partition the graph into small clusters → merge partitions → final clusters]
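To close, here is a minimal sketch of the ROCK neighbour and link computation referenced above. The similarity threshold theta = 0.5 and the helper names are illustrative assumptions, not values given on the slides.

```python
def jaccard(t1, t2):
    """ROCK similarity for categorical transactions: |T1 ∩ T2| / |T1 ∪ T2|."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

def neighbors(data, theta):
    """Two points are neighbours if their similarity is at least theta."""
    return {i: {j for j in range(len(data))
                if j != i and jaccard(data[i], data[j]) >= theta}
            for i in range(len(data))}

def links(data, theta, i, j):
    """link(pi, pj): number of common neighbours of the two points."""
    nb = neighbors(data, theta)
    return len(nb[i] & nb[j])

# The worked example from the ROCK slides: Sim({1,2,3}, {3,4,5}) = 1/5.
print(jaccard({1, 2, 3}, {3, 4, 5}))   # 0.2

# Link count between {1,2,3} and {1,2,4} over the ten 3-subsets of {1..5},
# with an assumed similarity threshold theta = 0.5.
data = [frozenset(s) for s in
        [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5},
         {1, 4, 5}, {2, 3, 4}, {2, 3, 5}, {2, 4, 5}, {3, 4, 5}]]
print(links(data, 0.5, 0, 1))          # 3 common neighbours, as on the slide
```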