BIRCH: A New Data Clustering Algorithm and Its Applications
• Tian Zhang
• Raghu Ramakrishnan
• Miron Livny
• Presented by: Peter Vile
• Data Clustering:
Problem of dividing N data points into K
groups so as to minimize an intra-group
difference metric.
Many algorithms already exist.
Problem:
Due to the abundance of local minima, there is
no way to find a globally minimal solution
without trying all possible partitions.
Other Methods
• Probability-Based Clustering
• COBWEB
– use probabilistic measurements for making
decisions (Category Utility)
– Hierarchical, Iterative
• Disadvantage
– Category Utility takes time and memory
– tends to overfit
COBWEB’s Limits
- assumes the probability distributions of attributes
are independent
- can only handle discrete values; approximates
continuous data with discretization
- storing and manipulating large sets of data
becomes infeasible
The Competition
• Distance-Based Clustering
• KMEANS, CLARANS
• CLARANS
– like KMEANS
– node - a set of medoids
– starts at a random node and moves to the closest
neighbor
BIRCH
• Good
- does not assume attributes are independent
- minimizes memory usage
- scans the data once from disk
- can handle very large data sets (uses the
concept of summarization)
- exploits the observation that not every data
point is important for clustering purposes
Limitations of BIRCH
• Handles only metric data
Definitions
• Centroid
– avg value
• Radius
– std dev
• Diameter
– avg. pairwise distance
within a cluster

Centroid:

  \vec{X}_0 = \frac{\sum_{i=1}^{N} \vec{X}_i}{N}

Radius (root-mean-square distance of member points from the centroid):

  R = \left( \frac{\sum_{i=1}^{N} (\vec{X}_i - \vec{X}_0)^2}{N} \right)^{1/2}

Diameter (root-mean-square pairwise distance within the cluster):

  D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X}_i - \vec{X}_j)^2}{N(N-1)} \right)^{1/2}
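A minimal NumPy sketch of these three statistics (the function names are mine, not from the paper), computed directly from a cluster's points:

```python
import numpy as np

def centroid(X):
    """X0: mean of the N data points (the rows of X)."""
    return X.mean(axis=0)

def radius(X):
    """R: root-mean-square distance of the points from the centroid."""
    d2 = ((X - centroid(X)) ** 2).sum(axis=1)
    return np.sqrt(d2.mean())

def diameter(X):
    """D: root-mean-square pairwise distance within the cluster."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]      # all pairwise differences
    d2 = (diff ** 2).sum(axis=-1).sum()       # summed over every (i, j) pair
    return np.sqrt(d2 / (n * (n - 1)))

X = np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 4.0]])
print(centroid(X), radius(X), diameter(X))
```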
How Far Away
• Given the centroids of two clusters:
• centroid Euclidean distance D0

  D0 = \left( (\vec{X}_{01} - \vec{X}_{02})^2 \right)^{1/2}

• centroid Manhattan distance D1

  D1 = |\vec{X}_{01} - \vec{X}_{02}| = \sum_{i=1}^{d} |X_{01}^{(i)} - X_{02}^{(i)}|

  – d is the dimension of the data
More Distances
Given cluster 1 containing points \vec{X}_1, ..., \vec{X}_{N_1} and cluster 2 containing
points \vec{X}_{N_1+1}, ..., \vec{X}_{N_1+N_2}:

• Average Inter-cluster Distance (D2)

  D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{N_1 N_2} \right)^{1/2}

• Average Intra-cluster Distance (D3)

  D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}

• Variance Increase Distance (D4)

  D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{X}_k - \frac{\sum_{l=1}^{N_1+N_2} \vec{X}_l}{N_1+N_2} \right)^2
       - \sum_{i=1}^{N_1} \left( \vec{X}_i - \frac{\sum_{l=1}^{N_1} \vec{X}_l}{N_1} \right)^2
       - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{X}_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{X}_l}{N_2} \right)^2
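A hedged NumPy sketch of D2 and D4 computed directly from the two point sets (function names are my own; BIRCH itself derives these quantities from CF entries rather than raw points):

```python
import numpy as np

def d2_inter(X1, X2):
    """Average inter-cluster distance D2 between point sets X1 and X2."""
    diff = X1[:, None, :] - X2[None, :, :]        # every cross-cluster pair (i, j)
    d2 = (diff ** 2).sum(axis=-1).sum()
    return np.sqrt(d2 / (len(X1) * len(X2)))

def d4_variance_increase(X1, X2):
    """Variance increase D4: squared error of the merged cluster minus that of the parts."""
    def sse(X):
        return ((X - X.mean(axis=0)) ** 2).sum()
    return sse(np.vstack([X1, X2])) - sse(X1) - sse(X2)

X1 = np.array([[0.0, 0.0], [1.0, 0.0]])
X2 = np.array([[5.0, 5.0], [6.0, 5.0]])
print(d2_inter(X1, X2), d4_variance_increase(X1, X2))
```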
Clustering Feature (CF)
• “A Clustering Feature (CF) entry is a
triple summarizing the information that we
maintain about a sub-cluster of data
points.”

• CF Definition: CF = (N, \vec{LS}, SS)
  – N : number of data points in the cluster
  – \vec{LS} : linear sum of the N data points, \sum_{i=1}^{N} \vec{X}_i
  – SS : square sum of the N data points, \sum_{i=1}^{N} \vec{X}_i^2
• CF Representativity Theorem: Given the CF
entries for all sub-clusters, all of the
measurements defined above (centroid, radius,
diameter, and D0–D4) can be computed accurately
• CF Additivity Theorem: Assume that CF1
and CF2 are the CF entries of two disjoint
sub-clusters. Then the CF entry of the
sub-cluster formed by merging the two
disjoint sub-clusters is:


  CF_1 = (N_1, \vec{LS}_1, SS_1),   CF_2 = (N_2, \vec{LS}_2, SS_2)

  CF_1 + CF_2 = (N_1 + N_2, \vec{LS}_1 + \vec{LS}_2, SS_1 + SS_2)
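A minimal sketch of a CF entry as a Python class (class and method names are mine), showing the additivity theorem and how the centroid and radius fall straight out of (N, LS, SS):

```python
import numpy as np

class CF:
    """Clustering Feature: (N, linear sum LS, square sum SS) of a sub-cluster."""
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def from_point(cls, x):
        x = np.asarray(x, dtype=float)
        return cls(1, x, float((x ** 2).sum()))

    def __add__(self, other):
        # Additivity theorem: merging disjoint sub-clusters adds the components.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, obtained by expanding sum((Xi - X0)^2) / N
        return np.sqrt(max(self.ss / self.n - (self.centroid() ** 2).sum(), 0.0))

cf = CF.from_point([1.0, 2.0]) + CF.from_point([3.0, 2.0])
print(cf.n, cf.centroid(), cf.radius())   # 2 [2. 2.] 1.0
```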
CF-tree Features
• Has two parameters
– 1. Branching Factor
• B - non-leaf node ( [CF i, child i], i = 1..B)
– child i - pointer to its i-th child node
• L - leaf node ( [CF j, prev, next], j = 1..L)
– 2. Threshold (T)
• limits the size of each leaf entry
– diameter (D) of each leaf entry < T
– or radius (R) of each leaf entry < T
CF-tree Features (Continued)
• Tree size is a function of T
– tree size = f(T)
– as T increases, tree size decreases
• Page Size (P)
– A node is required to fit in a page of size P
– P can be varied for performance tuning
• CF tree will be built dynamically as new
data objects are inserted
CF-tree
• Two Algorithms used to build a CF-tree
– 1. Insertion Algorithm
• Purpose: Building a CF-Tree
– 2. Rebuilding Algorithm
• Purpose:
– Rebuild the whole tree with larger T (smaller size)
– this happens when CF-tree size limit is exceeded
Insertion Algorithm
• 1. Identify the appropriate leaf
– at each non-leaf node
– use the distance metric to choose the closest branch
• 2. Insert the entry into the leaf node
– merge with the closest leaf entry [CF i, prev, next]
– if T is violated, make a new leaf entry [CF i+1, prev, next]
– if L is violated, split into two leaf nodes
• choose the two leaf entries that are farthest apart as seeds
• put the rest in the leaf node with the closer seed
Insertion Algorithm
• 3. Update the tree path
– if B is violated, split the node
– each node's CF should be the sum of its children's CFs
• 4. A Merging Refinement
– try to merge the two closest children of the node
that did not split
– might free space
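A highly simplified sketch of the insertion walk (the data structures are my own, reusing the CF class above; node splits and the merging refinement are only noted, not implemented):

```python
import numpy as np

class Node:
    """CF-tree node: leaves hold CF entries, non-leaves hold (CF, child) pairs."""
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.entries = []

def closest(cfs, x):
    """Index of the CF entry whose centroid is nearest to x (centroid distance D0)."""
    return int(np.argmin([np.linalg.norm(cf.centroid() - x) for cf in cfs]))

def insert_point(node, x, T, L):
    """Insert point x into the subtree rooted at node."""
    x = np.asarray(x, dtype=float)
    if not node.leaf:
        # 1. Identify the appropriate leaf: follow the closest branch.
        i = closest([cf for cf, _ in node.entries], x)
        cf, child = node.entries[i]
        insert_point(child, x, T, L)
        node.entries[i] = (cf + CF.from_point(x), child)   # 3. update CFs on the path
        return
    # 2. Insert into the leaf node: merge with the closest entry if T still holds.
    if node.entries:
        i = closest(node.entries, x)
        merged = node.entries[i] + CF.from_point(x)
        if merged.radius() <= T:
            node.entries[i] = merged
            return
    node.entries.append(CF.from_point(x))
    if len(node.entries) > L:
        pass  # a real implementation splits the leaf and propagates the split upward
```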
Rebuilding Algorithm
• When to do it?
– If the CF-tree size limit is exceeded
• What does it do?
– Creates a new tree with a larger T (diameter/radius)
• larger T -> smaller tree size
• How
– Deletes path from old tree and adds path to new
tree.
Rebuilding Algorithm
• Procedure
– ‘OldCurrentPath’ starts at leftmost branch
– 1. Create the ‘NewCurrentPath’ in new tree
• copy nodes from OldCurrentPath into
NewCurrentPath
– 2. Insert leaf entries in ‘OldCurrentPath’ into
the new tree
• if leaf does not go into NewCurrentPath remove it
from the NewCurrentPath
Rebuilding Algorithm
– 3. Free space in ‘OldCurrentPath’ and
‘NewCurrentPath’
• delete nodes in ‘OldCurrentPath’
• if nodes in ‘NewCurrentPath’ are empty, delete them
– 4. Process the next path in the old tree
– only needs enough pages to cover
‘OldCurrentPath’
• usually the height of the tree
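As a simplification of the path-by-path procedure above, a rebuild can be sketched as reinserting every leaf CF entry into a fresh tree built with the larger threshold. This version ignores the OldCurrentPath/NewCurrentPath page reuse, so it does not achieve the "height of the tree" memory bound; it reuses the Node/CF/closest helpers from the insertion sketch:

```python
def iter_leaf_entries(node):
    """Yield every leaf CF entry in the tree, left to right."""
    if node.leaf:
        yield from node.entries
    else:
        for _, child in node.entries:
            yield from iter_leaf_entries(child)

def insert_cf(node, entry, T, L):
    """CF-entry analogue of insert_point(): absorb into the closest leaf entry if the
    merged radius stays under T, otherwise start a new leaf entry (splits omitted)."""
    while not node.leaf:
        i = closest([cf for cf, _ in node.entries], entry.centroid())
        cf, child = node.entries[i]
        node.entries[i] = (cf + entry, child)
        node = child
    if node.entries:
        i = closest(node.entries, entry.centroid())
        merged = node.entries[i] + entry
        if merged.radius() <= T:
            node.entries[i] = merged
            return
    node.entries.append(entry)

def rebuild(old_root, new_T, L):
    """Simplified rebuild: a new tree with larger T absorbs the old tree's leaf entries."""
    new_root = Node(leaf=True)
    for entry in iter_leaf_entries(old_root):
        insert_cf(new_root, entry, new_T, L)
    return new_root
```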
Potential Problems
• Anomaly 1: Natural cluster is split across
two leaf nodes, or two distant clusters are
placed in the same node
• Anomaly 2: A sub-cluster ends up in the
wrong leaf node
Reducibility Theorem: Assume we rebuild
CF-tree t_{i+1} of threshold T_{i+1} from CF-tree t_i
of threshold T_i by the above algorithm, and let S_i
and S_{i+1} be the sizes of t_i and t_{i+1} respectively.
If T_{i+1} ≥ T_i, then S_{i+1} ≤ S_i, and the
transformation from t_i to t_{i+1} needs at most h
extra pages of memory, where h is the height of t_i
BIRCH Clustering Algorithm
• Four Phases
– 1. Loading
• scan all data and build an initial in-memory CF-tree
– 2. Optional Condensing
• rebuild CF-tree to make it smaller
• faster analysis, but reduced accuracy
– 3. Global Clustering
• run a clustering algorithm (e.g. KMEANS) on the CFs
• handles anomaly 1
BIRCH Clustering Algorithm
– 4. Optional Refining
• Use the centroids of the clusters as seeds
• Scan data again and assign points to closest seed
• Handles anomaly 2
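To see the phases end to end, scikit-learn ships a BIRCH implementation; a usage sketch (the parameter values here are arbitrary, and scikit-learn's global-clustering step is its own choice of algorithm):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 5, 10)])

# threshold plays the role of T, branching_factor of B;
# n_clusters drives the global-clustering phase on the leaf CFs.
brc = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = brc.fit_predict(X)

print(len(brc.subcluster_centers_), "leaf sub-clusters ->", len(set(labels)), "final clusters")
```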
BIRCH Phase 1
• Building CF-tree
– Heuristic Threshold (default T = 0)
• when rebuilding, a new T (diameter/radius) is needed
• use the average distance between the closest pairs of leaf entries in the same node
• should reduce the size of the tree by about half
BIRCH Phase 1
– Outlier Handling Option
• CFs with small N (# of data points) are saved to disk
• try to reinsert them when
– memory runs out
– all the data has been read
• if the data is noisy, this improves runtime and accuracy
– Delay Split Option
• used when about to run out of memory
• CFs that would cause the tree to split are saved to disk
How does this compare to other
clustering methods?
Ran against KMEANS and CLARANS
Results (comparison figures not included in the transcript)
Runtime (figure not included in the transcript)
Conclusions
• BIRCH vs CLARANS and KMEANS
– runs faster (fewer scans)
– less order sensitive
– less memory
Where can I use this?
• Interactive and Iterative Pixel Classification
– MVI (Multi-band Vegetation Imager)
– BIRCH helps classify pixels through clustering
• Codebook Generation in Image Compression
– compressing visual data to save space
– code book - vector code words for image
blocks
– BIRCH assigns nearest code word to vector
Main limitations of BIRCH?
• Ability to only handle metric data.
Name the two algorithms in BIRCH
clustering?
1. Inserting
2. Rebuilding
What is the purpose of having phase 4 in
the BIRCH clustering algorithm?
- All copies of a given data point go to
the same cluster.
- It provides the option to discard outliers.
- It can converge to a minimum.