Clustering
COMP 790-090 Research Seminar, BCB 713 Module, Spring 2011
Wei Wang
The University of North Carolina at Chapel Hill
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

Hierarchical Clustering
- Group data objects into a tree of clusters
- [Figure: objects a, b, c, d, e combined step by step (steps 0 to 4) into ab, de, cde, abcde; read bottom-up the process is agglomerative (AGNES), read top-down it is divisive (DIANA).]

AGNES (Agglomerative Nesting)
- Initially, each object is a cluster
- Step-by-step cluster merging, until all objects form one cluster
- Single-link approach:
  - Each cluster is represented by all of the objects in the cluster
  - The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters

Dendrogram
- Shows how clusters are merged hierarchically
- Decomposes the data objects into a multi-level nested partitioning (a tree of clusters)
- A clustering of the data objects: cut the dendrogram at the desired level; each connected component forms a cluster

DIANA (Divisive Analysis)
- Initially, all objects are in one cluster
- Step-by-step splitting of clusters until each cluster contains only one object
- [Figure: three scatter plots showing one cluster split progressively into smaller clusters.]

Distance Measures
- Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$
- Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$
- Mean distance: $d_{\mathrm{mean}}(C_i, C_j) = d(m_i, m_j)$
- Average distance: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$
- Here $m_i$ is the mean of cluster $C_i$ and $n_i$ is the number of objects in $C_i$

Challenges of Hierarchical Clustering Methods
- Hard to choose merge/split points
- Merging/splitting can never be undone, so these decisions are critical
- Does not scale well: O(n^2)
- What is the bottleneck when the data cannot fit in memory?
- Remedy: integrate hierarchical clustering with other techniques (BIRCH, CURE, CHAMELEON, ROCK)

BIRCH
- Balanced Iterative Reducing and Clustering using Hierarchies
- CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
- Clustering the objects reduces to clustering the leaf nodes of the CF tree

Clustering Feature Vector
- Clustering Feature: CF = (N, LS, SS)
  - N: number of data points
  - LS: $\sum_{i=1}^{N} X_i$ (linear sum)
  - SS: $\sum_{i=1}^{N} X_i^2$ (square sum)
- Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))

CF-tree in BIRCH
- A clustering feature summarizes the statistics of a subcluster: its 0th, 1st, and 2nd moments
- It registers the crucial measurements for computing clusters and utilizes storage efficiently
- A CF tree is a height-balanced tree storing the clustering features for a hierarchical clustering
  - A nonleaf node in the tree has descendants, or "children"
  - Nonleaf nodes store the sums of the CFs of their children

CF Tree
- [Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1-CF6, each with a child pointer; nonleaf nodes hold CF entries with child pointers; leaf nodes hold CF entries and are chained together by prev/next pointers.]

Parameters of a CF-tree
- Branching factor: the maximum number of children per node
- Threshold: the maximum diameter of the subclusters stored at the leaf nodes

BIRCH Clustering
- Phase 1: scan the database to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree

Pros & Cons of BIRCH
- Linear scalability
- Good clustering with a single scan; quality can be further improved by a few additional scans
- Can handle only numeric data
- Sensitive to the order of the data records

Drawbacks of Square-Error-Based Methods
- One representative per cluster: good only for convex-shaped clusters of similar size and density
- The number of clusters k is a parameter: good only if k can be reasonably estimated

CURE: the Ideas
- Each cluster has c representatives
  - Choose c well-scattered points in the cluster
  - Shrink them toward the mean of the cluster by a fraction α
  - The representatives capture the physical shape and geometry of the cluster
- Merge the closest two clusters
  - Distance of two clusters: the distance between their two closest representatives
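The AGNES procedure described above (start with singleton clusters, repeatedly merge the closest pair of clusters under the single-link rule) can be sketched as follows. This is a minimal illustration, not the course's reference implementation; the point set and Euclidean `dist` helper are assumptions for the example.

```python
from itertools import combinations
import math

def dist(p, q):
    # Euclidean distance between two points
    return math.dist(p, q)

def single_link(ci, cj):
    # Single-link: similarity of two clusters is measured by the
    # closest pair of points belonging to the different clusters
    return min(dist(p, q) for p in ci for q in cj)

def agnes(points, k=1):
    """Agglomerative nesting: initially each object is a cluster;
    merge the two closest clusters until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agnes([(0, 0), (0, 1), (5, 5), (5, 6)], k=2))
# → [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```

Cutting the dendrogram at a desired level corresponds here to stopping the merge loop once `k` clusters remain.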
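The four inter-cluster distance measures follow directly from their definitions. A minimal sketch, assuming Euclidean point-to-point distance and clusters represented as lists of tuples (all names illustrative):

```python
import math

d = math.dist  # Euclidean point-to-point distance

def d_min(ci, cj):
    # minimum distance: closest pair across the two clusters
    return min(d(p, q) for p in ci for q in cj)

def d_max(ci, cj):
    # maximum distance: farthest pair across the two clusters
    return max(d(p, q) for p in ci for q in cj)

def mean(c):
    # component-wise mean of a cluster
    return tuple(sum(xs) / len(c) for xs in zip(*c))

def d_mean(ci, cj):
    # mean distance: distance between the two cluster means
    return d(mean(ci), mean(cj))

def d_avg(ci, cj):
    # average distance: mean over all n_i * n_j cross-cluster pairs
    return sum(d(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

a, b = [(0, 0), (0, 2)], [(4, 0), (4, 2)]
print(d_min(a, b), d_max(a, b), d_mean(a, b), d_avg(a, b))
```

Note that `d_min` is exactly the single-link measure used by AGNES, while `d_max` gives complete-link behavior.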
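The worked example on the Clustering Feature Vector slide can be checked directly: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), computing N, the per-dimension linear sum LS, and the per-dimension square sum SS reproduces CF = (5, (16,30), (54,190)). CFs are also additive, which is what lets a nonleaf node store the sum of its children's CFs. A minimal sketch (the helper names are illustrative):

```python
def cf(points):
    """Clustering Feature CF = (N, LS, SS) of a set of d-dimensional points."""
    n = len(points)
    ls = tuple(sum(xs) for xs in zip(*points))                  # linear sum
    ss = tuple(sum(x * x for x in xs) for xs in zip(*points))   # square sum
    return n, ls, ss

def cf_merge(cf1, cf2):
    # CF additivity: the CF of a merged subcluster is the
    # component-wise sum of the children's CFs
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(pts))  # → (5, (16, 30), (54, 190)), as on the slide
```

Because these 0th, 1st, and 2nd moments determine a subcluster's centroid, radius, and diameter, a CF entry is all the threshold test at a leaf needs, without storing the points themselves.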
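The CURE representative construction described above (choose c well-scattered points, then shrink them toward the cluster mean by a fraction α) can be sketched as follows. The greedy farthest-point selection and the default α here are illustrative assumptions, not details fixed by the slides:

```python
import math

def representatives(cluster, c=4, alpha=0.5):
    """Pick c well-scattered points (greedy farthest-point selection),
    then shrink each toward the cluster mean by fraction alpha."""
    m = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    # start from the point farthest from the mean, then repeatedly add
    # the point farthest from the representatives chosen so far
    scattered = [max(cluster, key=lambda p: math.dist(p, m))]
    while len(scattered) < min(c, len(cluster)):
        scattered.append(max((p for p in cluster if p not in scattered),
                             key=lambda p: min(math.dist(p, s) for s in scattered)))
    # shrinking toward the mean keeps the cluster's shape while
    # damping the influence of outliers
    return [tuple(x + alpha * (mx - x) for x, mx in zip(p, m))
            for p in scattered]

def cluster_distance(reps_a, reps_b):
    # distance of two clusters: distance between the two closest representatives
    return min(math.dist(p, q) for p in reps_a for q in reps_b)
```

With multiple shrunken representatives per cluster, `cluster_distance` can follow non-convex cluster shapes that a single centroid per cluster cannot capture.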