Clustering
CS 685: Special Topics in Data Mining
Jinze Liu
The University of Kentucky
CS685 : Special Topics in Data Mining, UKY

Outline
• What is clustering
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based clustering methods
• Outlier analysis

Hierarchical Clustering
• Group data objects into a tree of clusters
[Figure: agglomerative clustering (AGNES) merges a, b, c, d, e step by step into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e}; divisive clustering (DIANA) runs the same steps in reverse]

AGNES (Agglomerative Nesting)
• Initially, each object is a cluster
• Step-by-step cluster merging, until all objects form one cluster
– Single-link approach
– Each cluster is represented by all of the objects in the cluster
– The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters

Dendrogram
• Shows how clusters are merged hierarchically
• Decomposes the data objects into a multi-level nested partitioning (a tree of clusters)
• A clustering of the data objects: cut the dendrogram at the desired level
– Each connected component forms a cluster

DIANA (Divisive Analysis)
• Initially, all objects are in one cluster
• Step-by-step splitting of clusters until each cluster contains only one object
[Figure: three scatter plots showing one cluster being split into progressively smaller clusters]

Distance Measures
• Minimum distance: d_min(C_i, C_j) = min_{p ∈ C_i, q ∈ C_j} d(p, q)
• Maximum distance: d_max(C_i, C_j) = max_{p ∈ C_i, q ∈ C_j} d(p, q)
• Mean distance: d_mean(C_i, C_j) = d(m_i, m_j)
• Average distance: d_avg(C_i, C_j) = (1 / (n_i * n_j)) Σ_{p ∈ C_i} Σ_{q ∈ C_j} d(p, q)
where m is the mean of a cluster, C is a cluster, and n is the number of objects in a cluster
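The four inter-cluster distance measures above can be computed directly. A minimal sketch (function names are my own, not from the slides); with single-link, AGNES merges whichever pair of clusters minimizes d_min:

```python
import math
from itertools import product

def d_min(ci, cj):
    # single-link: distance of the closest pair of points across the two clusters
    return min(math.dist(p, q) for p, q in product(ci, cj))

def d_max(ci, cj):
    # complete-link: distance of the farthest pair of points across the two clusters
    return max(math.dist(p, q) for p, q in product(ci, cj))

def centroid(c):
    # per-dimension mean m of a cluster
    return tuple(sum(p[k] for p in c) / len(c) for k in range(len(c[0])))

def d_mean(ci, cj):
    # distance between the two cluster means m_i and m_j
    return math.dist(centroid(ci), centroid(cj))

def d_avg(ci, cj):
    # average of all pairwise distances: (1 / (n_i * n_j)) * sum of d(p, q)
    return sum(math.dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))
```

Note that d_min and d_avg inspect every cross-cluster pair, which is what makes naive hierarchical clustering quadratic in the number of objects.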
Challenges of Hierarchical Clustering Methods
• Hard to choose merge/split points
– Merging/splitting can never be undone
– Merging/splitting decisions are critical
• Does not scale well: O(n^2)
• What is the bottleneck when the data cannot fit in memory?
• Integrating hierarchical clustering with other techniques
– BIRCH, CURE, CHAMELEON, ROCK

BIRCH
• Balanced Iterative Reducing and Clustering using Hierarchies
• CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
– Clustering the objects reduces to clustering the leaf nodes of the CF tree

Clustering Feature Vector
• Clustering Feature: CF = (N, LS, SS)
– N: number of data points
– LS = Σ_{i=1}^{N} X_i (linear sum of the points)
– SS = Σ_{i=1}^{N} X_i^2 (square sum of the points)
• Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190))

CF-tree in BIRCH
• Clustering feature:
– Summarizes the statistics for a subcluster: the 0th, 1st, and 2nd moments of the subcluster
– Registers crucial measurements for computing clusters and utilizes storage efficiently
• A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
– A nonleaf node in the tree has descendants or "children"
– A nonleaf node stores the sums of the CFs of its children

CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6; the root and nonleaf nodes hold entries CF1 … CF6 with child pointers, and the leaf nodes hold CF entries linked by prev/next pointers]

Parameters of a CF-tree
• Branching factor: the maximum number of children per node
• Threshold: the maximum diameter of the subclusters stored at the leaf nodes

CF Tree Insertion
• Identify the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric
• Modify the leaf: test whether the leaf can
absorb the new entry without violating the threshold; if there is no room, split the node
• Modify the path: update the CF information up the path

Example of the BIRCH Algorithm
[Figure: subclusters sc1–sc8 distributed over leaf nodes LN1–LN3 under the root; a new subcluster arrives and is routed to the closest leaf]
(BIRCH example figures adapted from slides by Vladimir Jelić, [email protected])

Merge Operation in BIRCH
• If the branching factor of a leaf node is limited to 3, then LN1 is split
[Figure: LN1 is split into LN1' and LN1'', attached under the root alongside LN2 and LN3]
• If the branching factor of a non-leaf node is limited to 3, then the root is split and the height of the CF tree increases by one
[Figure: the root is split into non-leaf nodes NLN1 and NLN2, with LN1', LN1'', LN2, and LN3 redistributed beneath them]
• Assume that the subclusters are numbered according to the order of formation
[Figure: subclusters sc1–sc6 placed in leaf nodes LN1 and LN2 under the root]
• If the branching factor of a leaf node is limited to 3, then LN2 is split into LN2' and LN2''
[Figure: LN2 is split into LN2' and LN2'' under the root]
• LN2' and LN1 will be merged, and the newly formed node will be split immediately
[Figure: the merged node is split into LN3' and LN3'', which replace LN1 and LN2' under the root]

Birch Clustering Algorithm (1)
• Phase 1: Scan all data and build an initial in-memory CF tree
• Phase 2: Condense the tree to a desirable size by building a smaller CF tree
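The CF arithmetic and the phase-1 absorption test above can be sketched compactly. This is a deliberate simplification, assuming a flat list of leaf subclusters instead of a real height-balanced tree, and using the subcluster radius (computable from the CF alone) as the threshold test; all names are mine:

```python
import math

def cf_of(points):
    # CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum
    n = len(points)
    dims = range(len(points[0]))
    ls = tuple(sum(p[k] for p in points) for k in dims)
    ss = tuple(sum(p[k] ** 2 for p in points) for k in dims)
    return (n, ls, ss)

def cf_radius(n, ls, ss):
    # RMS distance of the subcluster's points from its centroid, from the CF alone
    return math.sqrt(max(sum(ss) / n - sum((c / n) ** 2 for c in ls), 0.0))

def cf_add_point(cf, p):
    # CFs are additive: absorbing a point just updates N, LS, and SS
    n, ls, ss = cf
    return (n + 1,
            tuple(a + b for a, b in zip(ls, p)),
            tuple(a + b * b for a, b in zip(ss, p)))

def insert_point(leaf_entries, p, threshold):
    # Phase-1 absorption: merge p into the closest subcluster unless that
    # would push its radius past the threshold; otherwise start a new one.
    if leaf_entries:
        best = min(leaf_entries,
                   key=lambda cf: math.dist([c / cf[0] for c in cf[1]], p))
        merged = cf_add_point(best, p)
        if cf_radius(*merged) <= threshold:
            leaf_entries[leaf_entries.index(best)] = merged
            return
    leaf_entries.append((1, tuple(p), tuple(x * x for x in p)))
```

The additivity of CFs is what makes the tree cheap to maintain: nonleaf entries are just componentwise sums of their children's CFs, so an insert only touches one root-to-leaf path.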
• Phase 3: Global clustering
• Phase 4: Cluster refining (optional; requires more passes over the data to refine the results)

Birch Clustering Algorithm (2)
[Figure: flow diagram of the four BIRCH phases]

Birch – Phase 1
• Start with an initial threshold and insert points into the tree
• If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the values from the old tree, followed by the remaining values
• A good initial threshold is important but hard to figure out
• Outlier removal: when rebuilding the tree, remove outliers

Birch – Phase 2
• Optional
• Phase 3 sometimes has a minimum input size at which it performs well, so phase 2 prepares the tree for phase 3
• BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones

Birch – Phase 3
• Problems after phase 1:
– Input order affects results
– Splitting is triggered by node size
• Phase 3:
– Cluster all leaf nodes on their CF values using an existing algorithm
– Algorithm used here: agglomerative hierarchical clustering

Birch – Phase 4
• Optional
• Do additional passes over the dataset and reassign data points to the closest centroid from phase 3
• Recalculate the centroids and redistribute the items
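Phase 4's reassign-and-recompute step is essentially one k-means-style pass over the data. A hedged sketch of a single pass (function name is mine, not BIRCH's):

```python
import math

def refine_pass(points, centroids):
    # Assign each point to its nearest phase-3 centroid, then recompute
    # each centroid as the mean of the points assigned to it.
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        groups[nearest].append(p)
    new_centroids = []
    for old, pts in zip(centroids, groups):
        if pts:
            dims = range(len(old))
            new_centroids.append(
                tuple(sum(q[k] for q in pts) / len(pts) for k in dims))
        else:
            new_centroids.append(old)  # keep an empty centroid unchanged
    return new_centroids
```

Repeating the pass until the centroids stop moving is what the "always converges" claim on the next slide refers to: each pass can only decrease the total squared assignment error.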
• Always converges (no matter how many times phase 4 is repeated)

Conclusions (1)
• BIRCH performs faster than existing algorithms (CLARANS and k-means) on large datasets
• Scans the whole data only once
• Handles outliers better
• Superior to other algorithms in stability and scalability

Conclusions (2)
• Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster
• Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster

BIRCH Clustering
• Phase 1: Scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
• Phase 2: Use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree

Pros & Cons of BIRCH
• Linear scalability
– Good clustering with a single scan
– Quality can be further improved by a few additional scans
• Can handle only numeric data
• Sensitive to the order of the data records

Drawbacks of Square-Error-Based Methods
• One representative per cluster
– Good only for convex-shaped clusters of similar size and density
• The number of clusters is a parameter k
– Good only if k can be reasonably estimated

Drawbacks of Distance-Based Methods
• Hard to find clusters with irregular shapes
• Hard to specify the number of clusters
• Heuristic: a cluster must be dense

Directly Density-Reachable
• Parameters (example: MinPts = 3, Eps = 1 cm)
– Eps: maximum radius of the neighborhood
– MinPts: minimum number of points in an Eps-neighborhood of that point
– N_Eps(p) = {q | dist(p, q) ≤ Eps}
• Core object p: |N_Eps(p)| ≥ MinPts
• Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object

Density-Based Clustering: Background (II)
• Density-reachable
– If p2 is directly density-reachable from p1, p3 from p2, …, and pn from pn-1, then pn is density-reachable from p1
• Density-connected
– If points p and q are both density-reachable from some point o, then p and q are density-connected

DBSCAN
• A cluster: a maximal set of density-connected points
– Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points illustrated with Eps = 1 cm and MinPts = 5]

DBSCAN: the Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed

Problems of DBSCAN
• Different clusters may have very different densities
• Clusters may be in hierarchies
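The DBSCAN steps above can be sketched as follows. This is a minimal version assuming a linear-scan region query (a real implementation would use a spatial index such as an R-tree); cluster ids start at 0 and -1 marks noise:

```python
import math

def region_query(points, p, eps):
    # N_Eps(p): all points within distance eps of p (p itself included)
    return [q for q in points if math.dist(p, q) <= eps]

def dbscan(points, eps, min_pts):
    labels = {}  # point -> cluster id; -1 marks noise
    cid = 0
    for p in points:
        if p in labels:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = -1  # noise for now; may become a border point later
            continue
        # p is a core point: grow a new cluster from it
        labels[p] = cid
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:
                labels[q] = cid  # noise reached from a core point: border point
            if q in labels:
                continue
            labels[q] = cid
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:  # q is also a core point: expand
                queue.extend(q_neighbors)
        cid += 1
    return labels
```

Border points get labeled but never expanded, which is exactly the "no points are density-reachable from a border point" case above; points that are neither core nor border keep the -1 noise label.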