BIRCH: A New Data Clustering Algorithm and Its Applications
• Tian Zhang
• Raghu Ramakrishnan
• Miron Livny
• Presented by: Peter Vile

Data Clustering
• Problem: divide N data points into K groups so as to minimize an intra-group difference metric.
• Many algorithms already exist.
• Problem: due to the abundance of local minima, there is no way to find a globally minimal solution without trying all possible partitions.

Other Methods
• Probability-Based Clustering
• COBWEB
– uses a probabilistic measurement (Category Utility) for making decisions
– hierarchical, iterative
• Disadvantages
– Category Utility takes time and memory
– tends to overfit

COBWEB's Limits
• assumes the probability distributions of attributes are independent
• can only handle discrete values; approximates continuous data by discretization
• storing and manipulating large data sets becomes infeasible

The Competition
• Distance-Based Clustering
• KMEANS, CLARANS
• CLARANS
– like KMEANS
– a node is a set of medoids
– starts at a random node and moves to a lower-cost neighboring node

BIRCH
• Strengths
– does not assume attributes are independent
– minimizes memory usage
– scans the data once from disk
– can handle very large data sets (uses the concept of summarization)
– exploits the observation that not every data point is equally important for clustering purposes

Limitations of BIRCH
• Handles only metric data

Definitions
• Centroid (average value): $\vec{X_0} = \frac{\sum_{i=1}^{N} \vec{X_i}}{N}$
• Radius (average distance from the points to the centroid): $R = \left( \frac{\sum_{i=1}^{N} (\vec{X_i} - \vec{X_0})^2}{N} \right)^{1/2}$
• Diameter (average pairwise distance within a cluster): $D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X_i} - \vec{X_j})^2}{N(N-1)} \right)^{1/2}$

How Far Away
• Given the centroids $\vec{X_{01}}$ and $\vec{X_{02}}$ of two clusters:
• centroid Euclidean distance: $D0 = \left( (\vec{X_{01}} - \vec{X_{02}})^2 \right)^{1/2}$
• centroid Manhattan distance: $D1 = |\vec{X_{01}} - \vec{X_{02}}| = \sum_{i=1}^{d} |X_{01}^{(i)} - X_{02}^{(i)}|$, where d is the dimension

More Distances
• Average inter-cluster distance: $D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{X_i} - \vec{X_j})^2}{N_1 N_2} \right)^{1/2}$
• Average intra-cluster distance: $D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{X_i} - \vec{X_j})^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$
• Variance increase distance: $D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{X_k} - \frac{\sum_{l=1}^{N_1+N_2} \vec{X_l}}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( \vec{X_i} - \frac{\sum_{l=1}^{N_1} \vec{X_l}}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{X_j} - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{X_l}}{N_2} \right)^2$

Clustering Feature (CF)
• "A Clustering Feature (CF) entry is a triple summarizing the information that we maintain about a sub-cluster of data points."
• CF definition: CF = (N, LS, SS)
– N: number of data points in the cluster
– LS: linear sum of the N data points, $\sum_{i=1}^{N} \vec{X_i}$
– SS: square sum of the N data points, $\sum_{i=1}^{N} \vec{X_i}^2$
• CF Representativity Theorem: given the CF entries for all sub-clusters, all the measurements defined above (the centroid, radius, and diameter, and the distances D0 through D4) can be computed accurately.
• CF Additivity Theorem: if CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging them is
CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

CF-tree Features
• Has two parameters:
– 1. Branching factor
• B: a non-leaf node holds at most B entries [CF_i, child_i], i = 1..B, where child_i is a pointer to the i-th child node
• L: a leaf node holds at most L entries [CF_j], j = 1..L, plus prev and next pointers chaining the leaves
– 2. Threshold T
• specifies the maximum size of each leaf entry
• diameter (D) of each leaf entry < T, or radius (R) of each leaf entry < T

CF-tree Features (continued)
• Tree size is a function of T
– tree size = f(T); as T increases, tree size decreases
• Page size P
– a node is required to fit in a page of size P
– P can be varied for performance tuning
• The CF-tree is built dynamically as new data objects are inserted
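To make the CF summary concrete, here is a minimal Python sketch (illustrative, not from the paper; the class and method names are my own). It shows that (N, LS, SS) obeys the Additivity Theorem under merging and, per the Representativity Theorem, is enough to recover the centroid, radius, and diameter without keeping the raw points:

```python
import numpy as np

class CF:
    """Clustering Feature: the (N, LS, SS) summary of a sub-cluster."""

    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def from_points(cls, points):
        pts = np.atleast_2d(np.asarray(points, dtype=float))
        return cls(len(pts), pts.sum(axis=0), float((pts ** 2).sum()))

    def merge(self, other):
        # CF Additivity Theorem: merging two disjoint sub-clusters is
        # component-wise addition of their CF entries.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    # CF Representativity: the measurements below are recovered from
    # (N, LS, SS) alone; the raw points are never needed.
    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||X0||^2, obtained by expanding sum(||Xi - X0||^2)/N
        return np.sqrt(max(self.ss / self.n - (self.centroid() ** 2).sum(), 0.0))

    def diameter(self):
        # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)),
        # obtained by expanding the pairwise sum of ||Xi - Xj||^2
        if self.n < 2:
            return 0.0
        num = 2 * self.n * self.ss - 2 * (self.ls ** 2).sum()
        return np.sqrt(max(num / (self.n * (self.n - 1)), 0.0))


# Merging two sub-clusters yields the same summary as building one from scratch.
left = CF.from_points([[0.0, 0.0], [2.0, 0.0]])
right = CF.from_points([[1.0, 3.0]])
both = left.merge(right)
print(both.centroid(), both.radius(), both.diameter())
```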
CF-tree
• Two algorithms are used to build a CF-tree:
– 1. Insertion Algorithm
• purpose: building the CF-tree
– 2. Rebuilding Algorithm
• purpose: rebuild the whole tree with a larger T (and hence a smaller size)
• this happens when the CF-tree size limit is exceeded

Insertion Algorithm
• 1. Identify the appropriate leaf
– descend from the root, using the distance metric to choose the closest branch at each non-leaf node
• 2. Insert the entry into the leaf node
– merge with the closest leaf entry [CF_i, prev, next]
– if T is violated, make a new leaf entry [CF_{i+1}, prev, next]
– if L is violated, split the leaf node into two:
• choose the two entries that are farthest apart as seeds
• put each remaining entry in the node with the closer seed

Insertion Algorithm (continued)
• 3. Update the tree path
– if B is violated, split the non-leaf node
– each node's CF must be the sum of its children's CFs
• 4. A merging refinement
– try to combine the two closest children of the node that did not split
– might free space

Rebuilding Algorithm
• When to do it?
– when the CF-tree size limit is exceeded
• What does it do?
– creates a new tree with a larger T (diameter/radius threshold)
• larger T -> smaller tree size
• How?
– deletes a path from the old tree and adds the corresponding path to the new tree

Rebuilding Algorithm (procedure)
• 'OldCurrentPath' starts at the leftmost path of the old tree
• 1. Create 'NewCurrentPath' in the new tree
– copy the nodes of OldCurrentPath into NewCurrentPath
• 2. Insert the leaf entries of OldCurrentPath into the new tree
– if a leaf entry does not go into NewCurrentPath, remove it from NewCurrentPath
• 3. Free space in OldCurrentPath and NewCurrentPath
– delete the nodes of OldCurrentPath
– if NewCurrentPath is empty, delete its nodes too
• 4. Process the next path in the old tree
• Memory: only needs enough extra pages to cover OldCurrentPath (usually the height of the tree)

Potential Problems
• Anomaly 1: a natural cluster is split across two leaf nodes, or two distant clusters are placed in the same node
• Anomaly 2: a sub-cluster ends up in the wrong leaf node

Reducibility Theorem
• Assume we rebuild CF-tree t_{i+1} of threshold T_{i+1} from t_i by the above algorithm, and let S_i and S_{i+1} be the sizes of t_i and t_{i+1}, respectively. If T_{i+1} ≥ T_i, then S_{i+1} ≤ S_i, and the transformation from t_i to t_{i+1} needs at most h extra pages of memory, where h is the height of t_i.

BIRCH Clustering Algorithm
• Four phases:
– 1. Loading
• scan all data and build an initial in-memory CF-tree
– 2. Optional condensing
• rebuild the CF-tree to make it smaller
• faster analysis, but reduced accuracy
– 3. Global clustering
• run a clustering algorithm (e.g., KMEANS) on the leaf CF entries
• handles Anomaly 1

BIRCH Clustering Algorithm (continued)
– 4. Optional refining
• use the centroids of the clusters as seeds
• scan the data again and assign each point to the closest seed
• handles Anomaly 2

BIRCH Phase 1
• Building the CF-tree
– Heuristic threshold (T defaults to 0)
• when rebuilding, a new T (diameter/radius) is needed
• use the average distance between the closest pairs of leaf entries in the same node
• should reduce the size of the tree by about half

BIRCH Phase 1 (continued)
– Outlier-handling option
• a CF entry with a small N (few data points) is saved to disk
• reinsertion is attempted when BIRCH runs out of memory or finishes reading the data
• if the data is noisy, this improves both runtime and accuracy
– Delay-split option
• when about to run out of memory, CF entries that would cause the tree to split are saved to disk

How does BIRCH compare to other clustering methods?
• Ran against KMEANS and CLARANS
[Result and runtime comparison charts omitted]

Conclusions
• BIRCH vs CLARANS and KMEANS
– runs faster (fewer scans)
– less order-sensitive
– uses less memory
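Not part of the original presentation: for a quick experiment, scikit-learn ships a BIRCH estimator whose parameters map roughly onto the phases above. A minimal sketch, assuming scikit-learn is installed and using synthetic blob data in place of a large disk-resident data set:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic data standing in for a large data set.
X, _ = make_blobs(n_samples=10_000, centers=5, cluster_std=0.8, random_state=0)

# threshold        ~ the leaf-entry size bound T (Phase 1)
# branching_factor ~ the node capacity of the CF-tree (B and L)
# n_clusters       ~ the global clustering step over leaf CF entries (Phase 3);
#                    note scikit-learn uses agglomerative clustering here,
#                    where this presentation's Phase 3 uses KMEANS.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)

print(len(model.subcluster_centers_), "CF sub-clusters ->",
      len(set(labels)), "final clusters")
```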
Where can I use this?
• Interactive and iterative pixel classification
– MVI (Multi-band Vegetation Imager)
– BIRCH helps classify pixels through clustering
• Codebook generation in image compression
– compressing visual data to save space
– code book: vector code words for image blocks
– BIRCH assigns the nearest code word to each vector

What are the main limitations of BIRCH?
• It can only handle metric data.

Name the two algorithms used in BIRCH clustering.
1. Insertion
2. Rebuilding

What is the purpose of Phase 4 in the BIRCH clustering algorithm?
• All copies of a given data point go to the same cluster.
• It gives the option to discard outliers.
• It can converge to a minimum.
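As an illustration of the image-compression application (not from the paper; the codebook and block values below are made up), the nearest-code-word assignment is a simple vector-quantization step. In the application, the code words would come from clustering the block vectors, e.g., with BIRCH, and taking the cluster centroids:

```python
import numpy as np

# Hypothetical codebook: one code word per cluster of image blocks.
codebook = np.array([[0.0, 0.0, 0.0, 0.0],   # dark block
                     [1.0, 1.0, 1.0, 1.0],   # bright block
                     [0.0, 1.0, 0.0, 1.0]])  # striped block

# Image blocks flattened to vectors.
blocks = np.array([[0.1, 0.0, 0.2, 0.1],
                   [0.9, 1.0, 0.8, 1.0],
                   [0.1, 0.9, 0.0, 1.0]])

# Squared Euclidean distance from every block to every code word.
d2 = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
indices = d2.argmin(axis=1)        # the compressed representation
reconstructed = codebook[indices]  # decompression: look up code words

print(indices)         # e.g. [0 1 2]
print(reconstructed)
```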