Clustering by impact
Daniel Barbará
George Mason University, ISE Dept.
http://www.ise.gmu.edu/~dbarbara
(joint work with P. Chen, J. Couto, and Y. Li)
4/8/2002
Copyright Daniel Barbara

Problem
• Organizations are constantly acquiring and storing new data (data streams).
• The need to quickly extract knowledge from the newly arrived data (and compare it with the old) is pressing.
• Applications: intrusion detection, tuning, intelligence analysis.

Outline
• Clustering data streams
• Our method
  - Continuous data: Fractal Clustering
  - Categorical (nominal) data: Entropy-based
• Tracking clusters
• Future work

Clustering and data streams
• To cluster continuously arriving data streams, a clustering algorithm should behave incrementally: it makes each decision based on the newly arrived point and a concise description of the clusters encountered so far.
• Concise means a bounded amount of RAM to describe the clusters, independent of the number of data points processed so far.

Problem (cont.)
Most algorithms in the literature do not have that property:
• They look at the entire set of points at once (e.g., K-means).
• They cannot make decisions point by point.
• The description of the clusters is usually the set of points in them.
• Some of the algorithms have high complexity.

Some inroads
• Paper by U. Fayyad, P. Bradley and C. Reina: "Scaling Clustering Algorithms to Large Databases" (KDD'98). Main idea: keep descriptions of centroids plus descriptions of the sets that are likely and unlikely to change given a new data point.
• Papers by Motwani, et al.: incrementally updating centroids while receiving a data stream. The goal is an approximation to "min squares" whose performance is bounded.

Our proposal
• Find functions that naturally define clusters and that can be easily computed given a new point and a concise representation of the current clusters.
• Place a new point in the cluster for which the evaluated function shows a minimum (or a maximum), that is, the cluster on which the point has the least impact.

"Impact" functions
• Numerical data points: fractal dimension. It measures the self-similarity of points. The lower the change in the fractal dimension when the point is included, the more self-similar the point is with respect to the cluster.
• Categorical data points: entropy. It also measures similarity: lower entropy means more similar points.

Fractal Clustering
• The fractal dimension is a (not necessarily integer) number that characterizes the number of dimensions "filled" by the object represented by the dataset.
• The object in the upper right corner of the slide, called the Menger sponge, has (when complete) a fractal dimension of 2.73, less than the dimension of its embedding space, which is 3.
• Conjecture: if part of a dataset brings about a change in the overall fractal dimension of the set, then this part is "anomalous" (exhibits different behavior) with respect to the rest of the dataset.

Fractal dimension
\[
D_q =
\begin{cases}
\dfrac{\partial \sum_i p_i \log p_i}{\partial \log r} & \text{for } q = 1 \\[8pt]
\dfrac{1}{q-1}\,\dfrac{\partial \log \sum_i p_i^q}{\partial \log r} & \text{otherwise}
\end{cases}
\]
where p_i is the probability distribution over the grid cells and r is the grid size.

Box counting
[Figure: Cantor dust set.]

Box counting (cont.)
[Figure: population vs. grid size (log-log) for grid sizes r0, r0/3, and r0/9.]
\[
D_1 = -\lim_{n \to \infty} \frac{\log 2^n}{\log (1/3)^n} \approx 0.63
\]

Initialization Algorithm
1. Take an unlabelled point in the sample and start a cluster.
2. Find close neighbors and add them to the cluster.
3. Find close neighbors to the points in the cluster, and keep going.
4. If you can't find any more, go back to the first step.

Space management
• Space in RAM is not proportional to the size of the dataset, but rather to the size of the grid and the number of grid levels kept.
• These vary with dimensionality and with accuracy (odd-shaped clusters may require more levels).
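The box-counting estimate for the Cantor dust set described above can be reproduced with a short sketch. It is only an illustration under stated assumptions, not the Fractal Clustering implementation: the function names (`cantor_points`, `box_counts`, `box_counting_dimension`) are hypothetical, the points are scaled to integers so that boxes of side r0/3^k can be counted exactly, and the dimension is taken as the least-squares slope of log(occupied boxes) against log(1/r).

```python
import math
from itertools import product


def cantor_points(depth):
    """Left endpoints of the 2**depth Cantor intervals, scaled to integers in [0, 3**depth)."""
    # A point belongs to the set when every base-3 digit is 0 or 2.
    return [
        sum(digit * 3**i for i, digit in enumerate(digits))
        for digits in product((0, 2), repeat=depth)
    ]


def box_counts(points, depth):
    """Return (level, occupied boxes) pairs for boxes of side 3**(depth - level)."""
    counts = []
    for level in range(1, depth + 1):
        side = 3 ** (depth - level)
        occupied = {p // side for p in points}  # one entry per non-empty box
        counts.append((level, len(occupied)))
    return counts


def box_counting_dimension(points, depth):
    """Least-squares slope of log(occupied boxes) against log(1 / r)."""
    pairs = box_counts(points, depth)
    xs = [level * math.log(3) for level, _ in pairs]  # log(1/r) with r = r0 / 3**level, r0 = 1
    ys = [math.log(n) for _, n in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


DEPTH = 8  # assumed recursion depth for the illustration
dimension = box_counting_dimension(cantor_points(DEPTH), DEPTH)
print(f"estimated dimension: {dimension:.4f}")
```

Each refinement of the grid by a factor of 3 doubles the number of occupied boxes, so the slope comes out as log 2 / log 3, in agreement with the 0.63 quoted on the slide. Only the per-level box counts are needed for the estimate, which is the kind of concise summary the space-management slide refers to.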
Experiments
[Figure: Dataset1.]

Scalability results with Dataset1
[Figure: execution time (seconds) vs. number of points, from 30,000 to 30,000,000.]

Quality of clusters (Dataset1)
[Figure: percentage of points clustered right for clusters C1, C2, and C3 vs. dataset size, from 30,000 to 30,000,000.]

High dimensional set
10 dimensions, 2 clusters.
[Figure: percentage of points clustered right for clusters C1 and C2; bar labels 94.3 and 100.]

Results with the noisy dataset
92% of the noise gets filtered out.
[Figure: percentage of points clustered right for clusters C1, C2, and C3; bar labels 99.57, 100, and 83.62.]

Memory usage vs. dimensions
[Figure: memory used (Kb) vs. number of dimensions, 1 to 10; labeled values 64, 500, and 2,000 Kb.]

Memory reduction
Space taken by the boxes is small, but it grows with the number of dimensions.
Memory reduction techniques:
1. Use boxes with # points > epsilon.
2. Cache boxes.
3. Have only smallest granularity boxes and derive the rest.
None of them causes a significant degradation of quality. (2 and 3 have an impact on running time.)

Memory reduction
[Figure: memory reduction (%) by technique; labeled values 75, 55, 19, and 25.]

Comparison with other algorithms
[Figure: comparison of quality between FC and CURE: percentage of points clustered right and outliers for clusters C1 and C2.]

Entropy-based Clustering (COOLCAT)
• For categorical data.
• Place the new point where it minimizes some function of the entropies of the individual clusters (e.g., min(max(entropy C_i))).
• Heuristic (the problem is NP-hard).
• Entropy of each cluster:
\[
E(C_k) = -\sum_{i=1}^{d} \sum_{j=1}^{j_i} P(V_{ij} \mid C_k) \log P(V_{ij} \mid C_k)
\]
• Minimize expected entropy.

Initialization
Need to seed k clusters:
1. Select a sample.
2. Find the 2 points that are the most dissimilar (their joint entropy is the highest) and place them in 2 different clusters.
3. Find another point that is the most dissimilar (pairwise) to the ones already selected, and start another cluster.

Incremental phase
For a given point and k current clusters:
• Compute the expected entropy as the new point is placed in each cluster.
• Choose the cluster that minimizes the expected entropy.
• After finishing with a batch of points, reprocess m% of them (take the worst fits out and re-cluster them); this helps with the issue of order dependency.

Conciseness
Notice that the current cluster description is concise: counts of V_ij for every i = 1, ..., d (number of attributes) and for every j (domain of each attribute).

COOLCAT and the MDL
• MDL = minimum description length.
• Widely used to argue about how good a classifier is: how many bits does it take to send to a receiver the description of your classifier plus the exceptions (misclassifications)?

MDL (cont.)
\[
K(h, D) = K(h) + K(D \text{ using } h)
\]
where h is the model and D is the data.
\[
K(h) = k \log |D|
\]
K(h, D) corresponds to E(C), where C is the clustering.

Experimental results
• Real and synthetic datasets.
• Evaluate quality and performance.
• Quality:
  - Category utility function (how much "better" the probability distribution in the individual clusters is with respect to the original distribution).
  - External entropy: take an attribute not used in the clustering and compute the entropy of each cluster with respect to it, then the expected external entropy.

Experimental results
Archaeological data set

Alg.       m    CU      Ext. E   Exp. E
Coolcat    0    0.7626  0        4.8599
Coolcat    10   0.7626  0        4.8599
Coolcat    20   0.7626  0        4.8599
Brute F.   -    0.7626  0        4.8599
ROCK       -    0.3312  0.96     -

KDD99 Cup data (intrusion detection)
[Figure: CU, external entropy (Ext E), and expected entropy (Exp E) vs. k.]

Performance (synthetic data)
[Figure: time T (sec.) vs. number of points N (x 1000).]

Tracking clusters
Clustering data streams as they come:
• Consider the random variable X = 0 if the new point is an outlier, 1 otherwise.
• Using Chernoff bounds, we must see s "successes" (non-outliers) in a window of w points:
\[
s = \frac{3(1+\epsilon)}{\epsilon^2} \ln\frac{2}{\delta},
\qquad
w = \frac{3(1+\epsilon)}{(1-\epsilon)\,\epsilon^2\, p} \ln\frac{2}{\delta}
\]
  where p is the probability of a success, ε the desired accuracy, and δ the confidence parameter.
• If you don't, it is time for new clusters.

FC, COOLCAT and Tracking
Find a good definition of outlier:
• FC: the minimum change in FD exceeds a threshold.
• COOLCAT: mutual information of the new point with respect to the clusters.

One tracking experiment with FC
[Figure: tracking experiment with FC.]

One tracking experiment with COOLCAT (intrusion detection)
[Figure labels: attacks, density, no attacks, mutual information.]

Future work
• Hierarchical clustering
• More tracking experiments
• Hybrid data: numeric and categorical
• Indexing based on clustering
• …

Bibliography
• D. Barbara, P. Chen. "Using the Fractal Dimension to Cluster Datasets," Proceedings of the ACM-SIGKDD International Conference on Knowledge and Data Mining, Boston, August 2000.
• D. Barbara, P. Chen. "Tracking Clusters in Evolving Data Sets," Proceedings of FLAIRS'2001, Special Track on Knowledge Discovery and Data Mining, Key West, FL, May 2001.
• D. Menasce, V. Almeida, D. Barbara, B. Abrahao, and F. Ribeiro. "Fractal Characterization of Web Workloads," Proceedings of the 11th International World Wide Web Conference, May 2002.
• D. Barbara, P. Chen. "Using Self-Similarity to Cluster Large Data Sets," to appear in Journal of Data Mining and Knowledge Discovery, Kluwer Academic pub.
• D. Barbara. "Requirements for Clustering Data Streams," SIGKDD Explorations (Special Issue on Online, Interactive, and Anytime Data Mining), Vol. 3, No. 2, Jan 2002.
• D. Barbara, J. Couto, Y. Li. "COOLCAT: An Entropy-Based Algorithm for Categorical Clustering," submitted for publication.