Techniques of Classification and Clustering

Problem Description
- Assume A = {A1, A2, ..., Ad}: (ordered or unordered) domains
- S = A1 x A2 x ... x Ad: a d-dimensional (numerical or non-numerical) space
- Input V = {v1, v2, ..., vm}: d-dimensional points, where vi = (vi1, vi2, ..., vid); the jth component of vi is drawn from domain Aj
- Output G = {g1, g2, ..., gk}: a set of groups of V, each with a label vL, where gi is a subset of V

Classification
- Supervised classification (discriminant analysis, or simply classification): a collection of labeled (pre-classified) patterns is provided; the aim is to label newly encountered, as yet unlabeled (test) patterns
- Unsupervised classification (clustering): aims to group a given collection of unlabeled patterns into meaningful clusters; category labels are data driven

Methods for Classification
- Neural nets: classification functions are obtained by making multiple passes over the training set; poor generation efficiency; non-numerical data are not handled efficiently
- Decision trees: if E contains only objects of one group, the decision tree is just a leaf labeled with that group; construct a DT that correctly classifies the objects in the training data set, then test it by classifying the unseen objects in the test data set

Decision Trees (Ex: Credit Analysis)

  salary   education      label
  10000    high school    reject
  40000    undergraduate  accept
  15000    undergraduate  reject
  75000    graduate       accept
  18000    graduate       accept

- Resulting tree: test "salary < 20000"; if no, accept; if yes, test "education in graduate": yes leads to accept, no leads to reject

Decision Trees
- Pros: fast execution time; generated rules are easy for humans to interpret; scale well to large data sets; can handle high-dimensional data
- Cons: cannot capture correlations among attributes; consider only axis-parallel cuts

Decision Tree Algorithms
- Classifiers from the machine learning community:
  - ID3 [J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1, 1986]
  - C4.5 [J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993]
  - CART [L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984]
- Classifiers for large databases:
  - SLIQ [MAR96]
  - SPRINT [John Shafer, Rakesh Agrawal, Manish Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, Proc. of VLDB Conf., Bombay, India, 1996]
  - SONAR [Takeshi Fukuda, Yasuhiko Morimoto, Shinichi Morishita, Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules, Proc. of VLDB Conf., Bombay, India, 1996]
  - RainForest [J. Gehrke, R. Ramakrishnan, V. Ganti, RainForest - A Framework for Fast Decision Tree Construction of Large Datasets, Proc. of VLDB Conf., 1998]
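As a concrete illustration of the credit-analysis tree above, here is a minimal Python sketch (attribute and function names are mine) that hard-codes the two axis-parallel splits and checks them against the five training records.

```python
# Minimal sketch: the credit-analysis decision tree from the example above,
# hard-coded as nested rules (attribute names are illustrative).
records = [
    {"salary": 10000, "education": "high school",   "label": "reject"},
    {"salary": 40000, "education": "undergraduate", "label": "accept"},
    {"salary": 15000, "education": "undergraduate", "label": "reject"},
    {"salary": 75000, "education": "graduate",      "label": "accept"},
    {"salary": 18000, "education": "graduate",      "label": "accept"},
]

def classify(r):
    # Root split: salary < 20000 (an axis-parallel cut on one attribute)
    if r["salary"] < 20000:
        # Second split: education in {graduate}
        return "accept" if r["education"] == "graduate" else "reject"
    return "accept"

# The tree reproduces every training label.
assert all(classify(r) == r["label"] for r in records)
```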
Decision Tree Algorithms
- A building phase followed by a pruning phase
- Building phase: recursively split nodes using the best splitting attribute for the node
- Pruning phase: a smaller, imperfect decision tree generally achieves better accuracy; prune leaf nodes recursively to prevent over-fitting

Preliminaries
- Theoretic background: entropy, similarity measures, advanced terms

Information Theory Concepts
- Entropy of a random variable X with probability distribution p(x): H(p) = -\sum_x p(x) \log p(x)
- The Kullback-Leibler (KL) divergence, or "relative entropy", between two probability distributions p and q: KL(p, q) = \sum_x p(x) \log ( p(x) / q(x) )
- Mutual information between random variables X and Y: I(X, Y) = \sum_x \sum_y p(x, y) \log [ p(x, y) / ( p(x) p(y) ) ]

What is Entropy?
- S is a sample of the training data set; entropy measures the impurity of S:
  H(X) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - ... - p_m \log_2 p_m = -\sum_{j=1}^m p_j \log_2 p_j, and for an object set, E(S) = -\sum_k (e_k/e) \log (e_k/e)
- If H(X) = 0, X takes a single value; as H(X) increases, the values of X become more heterogeneous
- For the same number of X values: "high entropy" means X comes from a roughly uniform (boring) distribution, so a histogram of the frequency distribution of X would be flat and values sampled from it would be all over the place; "low entropy" means X comes from a varied (peaks-and-valleys) distribution, so the histogram would have many lows and one or two highs and sampled values would be more predictable

Entropy-Based Data Segmentation [T. Fukuda, Y. Morimoto, S. Morishita, T. Tokuyama, Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules, Proc. of VLDB Conf., 1996]
- The attribute has three categories over 100 objects: 40 C1, 30 C2, 30 C3
- E(S) = -\sum_k (e_k/e) \log (e_k/e) = -(40/100)\log(40/100) - (30/100)\log(30/100) - (30/100)\log(30/100) \approx 1.09 (the numerical values here use the natural logarithm)
- Splitting: E(S_i; S_j) = (n_i/n) E(S_i) + (n_j/n) E(S_j)
- Split 1: S1 = 60 objects (40 C1, 10 C2, 10 C3), S2 = 40 objects (0 C1, 20 C2, 20 C3): E(S1; S2) \approx 0.80
- Split 2: S3 = 60 objects (20 C1, 20 C2, 20 C3), S4 = 40 objects (20 C1, 10 C2, 10 C3): E(S3; S4) \approx 1.075
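A small Python sketch (helper names are mine) that reproduces the segmentation numbers above; note that the values 1.09, 0.80 and 1.075 come out only with the natural logarithm, even though the slide writes log2.

```python
import math

def entropy(counts):
    """E(S) = -sum_k (e_k/e) * ln(e_k/e); the slide's numbers use the natural log."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def split_entropy(*groups):
    """E(S_i; S_j) = sum over groups of (n_i/n) * E(S_i)."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * entropy(g) for g in groups)

S = [40, 30, 30]                                             # 40 C1, 30 C2, 30 C3
print(round(entropy(S), 2))                                  # 1.09
print(round(split_entropy([40, 10, 10], [0, 20, 20]), 2))    # ~0.80  (S1; S2)
print(round(split_entropy([20, 20, 20], [20, 10, 10]), 3))   # 1.075  (S3; S4)
```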
Information Theoretic Measure [R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A. Swami, An Interval Classifier for Database Mining Applications, Proc. of VLDB Conf., 1992]
- Information gain by branching on A_i: gain(A_i) = E - E_i
- The entropy E of an object set containing e_k objects of group G_k: E = -\sum_k (e_k/e) \log (e_k/e)
- The expected entropy for the tree with A_i as the root: E_i = \sum_j (e_{ij}/e) E_{ij}, where E_{ij} is the expected entropy of the subtree for the object set with the jth value of A_i
- Information content of the values of A_i: I(A_i) = -\sum_j (e_{ij}/e) \log_2 (e_{ij}/e)

Example
- S: 100 objects (40 C1, 30 C2, 30 C3); E(S) = -(40/100)\log(40/100) - (30/100)\log(30/100) - (30/100)\log(30/100) \approx 1.09
- Split 1 into S1 = 60 objects (20 C1, 20 C2, 20 C3) and S2 = 40 objects (20 C1, 10 C2, 10 C3): E_1(S1; S2) = (60/100) E_1(S1) + (40/100) E_1(S2) \approx 1.075
- Split 2 into S3 = (40 C1, 0, 0), S4 = (0, 30 C2, 0), S5 = (0, 0, 30 C3): E_2(S3; S4; S5) = 0
- Gains: gain_1 = E - E_1 = 0.015; gain_2 = E - E_2 = 1.09

Distributional Similarity Measures
- Cosine; Jaccard coefficient; Dice coefficient; Overlap coefficient; L1 distance (city-block distance); Euclidean distance (L2 distance); Hellinger distance; Information radius (Jensen-Shannon divergence); Skew divergence; Confusion probability; Lin's similarity measure

Similarity Measures
- Minkowski distance: d_p(x_i, x_j) = ( \sum_{k=1}^d |x_{i,k} - x_{j,k}|^p )^{1/p} = ||x_i - x_j||_p
- p = 2, Euclidean distance: ( (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + ... + (x_{id} - x_{jd})^2 )^{1/2}
- p = 1, Manhattan distance: |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{id} - x_{jd}|
- Mahalanobis distance: d_M(x_i, x_j) = (x_i - x_j) \Sigma^{-1} (x_i - x_j)^T, where \Sigma is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process
- Normalization may be required because of weighting schemes

Information-Theoretic Similarity (general form)
- IT-Sim(A, B) = I(common(A, B)) / I(description(A, B))
- I(common(A, B)): information content associated with the statement describing what A and B have in common
- I(description(A, B)): information content associated with the statement describing A and B
- \pi(s): probability of the statement within the world of the objects in question, i.e., the fraction of objects exhibiting feature s
- IT-Sim(A, B) = ( 2 \sum_{s \in A \cap B} \log \pi(s) ) / ( \sum_{s \in A} \log \pi(s) + \sum_{s \in B} \log \pi(s) )

Similarity Measures: the Set/Bag Model
- Let X and Y be two collections of XML documents
- Jaccard's coefficient: sim_Jacc(X, Y) = |X \cap Y| / |X \cup Y|
- Dice's coefficient: sim_Dice(X, Y) = 2 |X \cap Y| / ( |X| + |Y| )

Similarity Measures: the Vector-Space Model
- Cosine-similarity measure (CSM): Sim(d_j, d_k) = (d_j . d_k) / ( |d_j| |d_k| ) = \sum_{i=1}^n w_{ij} w_{ik} / ( \sqrt{\sum_{i=1}^n w_{ij}^2} \sqrt{\sum_{i=1}^n w_{ik}^2} )
- Sim_Cos(X, Y) = X . Y / ( |X| |Y| )
(Figure: two document vectors d_1, d_2 at angle theta in the term space spanned by t_1, t_2, t_3.)

Query Processing: a single cosine
- For every term i and each document j, store the term frequency tf_{ij}; there are tradeoffs on whether to store the term count, the term weight, or the weight scaled by idf_i
- At query time, accumulate the component-wise sum sim(d_j, d_k) = \sum_{i=1}^m w_{i,j} w_{i,k}
- If you are indexing 5 billion documents (web search), an array of accumulators is infeasible. Ideas?

Similarity Measures (2)
- The Generalized Cosine-Similarity Measure (GCSM), hierarchical model: let X = \sum_i x_i l_i and Y = \sum_i y_i l_i
- sim_GCSM(X, Y) = <X, Y> / ( \sqrt{<X, X>} \sqrt{<Y, Y>} ), where <X, Y> = \sum_{i=1}^n \sum_{j=1}^n x_i y_j <l_i, l_j>
- <l_i, l_j> = 2 depth( LCA(l_i, l_j) ) / ( depth(l_i) + depth(l_j) ). Why only depth?
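A minimal Python sketch of a few of the measures listed above (cosine, Minkowski, Jaccard, Dice); the function names and toy inputs are mine.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def minkowski(x, y, p):
    # p = 2 gives the Euclidean distance, p = 1 the Manhattan distance
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def jaccard(X, Y):
    X, Y = set(X), set(Y)
    return len(X & Y) / len(X | Y)

def dice(X, Y):
    X, Y = set(X), set(Y)
    return 2 * len(X & Y) / (len(X) + len(Y))

print(jaccard({1, 2, 3}, {3, 4, 5}))           # 0.2, the Sim(T1, T2) value used later
print(dice({1, 2, 3}, {3, 4, 5}))              # ~0.333
print(cosine([1, 0, 1], [1, 1, 0]))            # 0.5
print(round(minkowski([0, 0], [3, 4], 2), 1))  # 5.0
```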
Dim Similarities
- Term weights: w_{d,t} = f_{d,t} \log( N / n(t) ) / \sqrt{ \sum_t ( f_{d,t} \log( N / n(t) ) )^2 }
- Cosine measure: sim(d_1, d_2) = \sum_t w_{d_1,t} w_{d_2,t}
- Hellinger measure: sim(d_1, d_2) = \sum_t \sqrt{ w_{d_1,t} w_{d_2,t} }
- Tanimoto measure: sim(d_1, d_2) = \sum_t w_{d_1,t} w_{d_2,t} / ( \sum_t w_{d_1,t}^2 + \sum_t w_{d_2,t}^2 - \sum_t w_{d_1,t} w_{d_2,t} )
- Clarity measure: sim(d_1, d_2) is built from the divergences KL( w_{.,d_1} || w_{.,d_2} ) and KL( w_{.,d_1} || GE ), and symmetrically KL( w_{.,d_2} || w_{.,d_1} ) and KL( w_{.,d_2} || GE ), where GE denotes the general (background) term distribution

Advanced Terms
- Conditional entropy
- Information gain

Specific Conditional Entropy H(Y|X=v)
- Suppose I am trying to predict output Y and I have input X; X = college major, Y = likes "Gladiator"
- Assume this data reflects the true probabilities:

  X        Y
  Math     Yes
  History  No
  CS       Yes
  Math     No
  Math     No
  CS       Yes
  History  No
  Math     Yes

- From this data we estimate: P(LikeG = Yes) = 0.5; P(Major = Math & LikeG = No) = 0.25; P(Major = Math) = 0.5; P(LikeG = Yes | Major = History) = 0
- Note: H(X) = 1.5; H(Y) = 1; H(Y | X = Math) = 1; H(Y | X = History) = 0; H(Y | X = CS) = 0

Conditional Entropy
- H(Y|X) is the average specific conditional entropy of Y: if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X?
- Equivalently, the expected number of bits needed to transmit Y if both sides know the value of X
- H(Y|X) = \sum_{j=1}^m Prob(X = v_j) H(Y | X = v_j)

  v_j      Prob(X=v_j)  H(Y|X=v_j)
  Math     0.5          1
  History  0.25         0
  CS       0.25         0

- H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5

Information Gain
- IG(Y|X): I must transmit Y; how many bits on average would it save me if both ends of the line knew X?
- IG(Y|X) = H(Y) - H(Y|X)
- For the table above: H(Y) = 1 and H(Y|X) = 0.5, so IG(Y|X) = 1 - 0.5 = 0.5

Relative Information Gain
- RIG(Y|X): I must transmit Y; what fraction of the bits on average would it save me if both ends of the line knew X?
- RIG(Y|X) = [ H(Y) - H(Y|X) ] / H(Y) = (1 - 0.5) / 1 = 0.5

What is Information Gain used for?
- Suppose you are trying to predict whether someone is going to live past 80 years; from historical data you might find:
  IG(LongLife | HairColor) = 0.01; IG(LongLife | Smoker) = 0.2; IG(LongLife | Gender) = 0.25; IG(LongLife | LastDigitOfSSN) = 0.00001
- IG tells you how interesting a 2-d contingency table is going to be
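A small Python sketch (function names are mine) that recomputes H(Y), H(Y|X), IG and RIG for the Major/Gladiator table above.

```python
import math
from collections import Counter

data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def H(values):
    """Entropy in bits of the empirical distribution over `values`."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def H_cond(pairs):
    """H(Y|X) = sum_v P(X=v) * H(Y | X=v)."""
    n = len(pairs)
    by_x = {}
    for x, y in pairs:
        by_x.setdefault(x, []).append(y)
    return sum(len(ys) / n * H(ys) for ys in by_x.values())

HY, HYX = H([y for _, y in data]), H_cond(data)
print(HY, HYX, HY - HYX, (HY - HYX) / HY)   # 1.0  0.5  0.5 (IG)  0.5 (RIG)
```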
Clustering
- Given: data points and the number of desired clusters K
- Group the data points into K clusters so that data points within a cluster are more similar than across clusters
- Sample applications: customer segmentation; market basket customer analysis; attached mailing in direct marketing; clustering companies with similar growth

A Clustering Example
(Figure: four example customer clusters, e.g., Income: High / Children: 1 / Car: Luxury; Income: Low / Children: 0 / Car: Compact; Income: Medium / Children: 2 / Car: Sedan; Income: Medium / Children: 3 / Car: Truck.)

Different Ways of Representing Clusters
(Figure: (a) clusters drawn around groups of points, (b) a Venn diagram, (c) a table of per-cluster membership probabilities for each item, (d) a dendrogram.)

Clustering Methods
- Partitioning: given a set of objects and a clustering criterion, partitional clustering obtains a partition of the objects into clusters such that the objects in a cluster are more similar to each other than to objects in different clusters; K-means and K-medoid methods determine K cluster representatives and assign each object to the cluster whose representative is closest to it, so that the sum of the squared distances between objects and their representatives is minimized
- Hierarchical: a nested sequence of partitions
  - Agglomerative: starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters until all objects are in a single cluster
  - Divisive: starts with all objects in one cluster and subdivides it into smaller pieces

Algorithms
- k-Means; Fuzzy C-Means clustering; hierarchical clustering; probabilistic clustering

Similarity Measures (2)
- Mutual Neighbor Distance (MND): MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i), where NN(x_i, x_j) is the neighbor number of x_j with respect to x_i
- Distance under context: s(x_i, x_j) = f(x_i, x_j, e), where e is the context
(Figure: two point configurations, A-C and A-F, illustrating how added context points change similarity.)

K-Means Clustering Algorithm
1. Choose k cluster centers to coincide with k randomly chosen patterns.
2. Assign each pattern to its closest cluster center.
3. Recompute the cluster centers using the current cluster memberships.
4. If a convergence criterion is not met, go to step 2.
- Typical convergence criteria: no (or minimal) reassignment of patterns to new cluster centers, or a minimal decrease in squared error

Objective Function
- The k-Means algorithm aims at minimizing the following objective (square error) function: J = \sum_{j=1}^k \sum_{i=1}^n || x_i^{(j)} - c_j ||^2

K-Means Algorithm (Ex)
(Figure: points A-J grouped into clusters.)

Distortion
- Given a clustering w, denote by w(x) the centroid this clustering associates with an arbitrary point x
- A measure of quality for w: Distortion = \sum_x d^2( x, w(x) ) / R, where R is the total number of points and x ranges over all input points
- An improved (model-selection) criterion: Distortion + (number of parameters) \log R = Distortion + m k \log R

Remarks
- How to initialize the means is a problem; one popular way is to randomly choose k of the samples
- The results produced depend on the initial values of the means
- It can happen that the set of samples closest to m_i is empty, so m_i cannot be updated
- The results depend on the metric used to measure || x - m_i ||
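A minimal Python sketch of the four-step k-means loop above; the convergence test and the toy points are my own choices.

```python
import random
import math

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means following steps 1-4 above; points are tuples of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                  # 1. pick k random patterns
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                             # 2. assign to closest center
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        new_centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)         # 3. recompute centroids
        ]
        if new_centers == centers:                   # 4. stop when centers settle
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
centers, clusters = kmeans(pts, k=2)
print(centers)
```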
Related Work [Clustering]
- Graph-based clustering: for an XML document collection C, the s-Graph sg(C) = (N, E) is a directed graph such that N is the set of all the elements and attributes in the documents in C, and (a, b) is in E if and only if a is a parent element of b in some document in C (b can be an element or an attribute)
- For two sets C1 and C2 of XML documents, the distance between them is dist(C1, C2) = 1 - |sg(C1) \cap sg(C2)| / max{ |sg(C1)|, |sg(C2)| }, where |sg(C_i)| is the number of edges

Fuzzy C-Means Clustering
- FCM is a clustering method that allows one piece of data to belong to two or more clusters: J_m = \sum_{i=1}^N \sum_{j=1}^C u_{ij}^m || x_i - c_j ||^2
- Fuzzy partitioning is carried out through an iterative optimization of the objective function above, with the membership u_{ij} and the cluster center c_j updated by:
  u_{ij} = 1 / \sum_{k=1}^C ( || x_i - c_j || / || x_i - c_k || )^{2/(m-1)},   c_j = \sum_{i=1}^N u_{ij}^m x_i / \sum_{i=1}^N u_{ij}^m
(Figure: a membership matrix u_{ij}, rows indexed by data item i and columns by cluster j, with entries such as 0.3, 0.2, 0.5, 0.8.)
- The iteration stops when \max_{ij} | u_{ij}^{(k+1)} - u_{ij}^{(k)} | < \epsilon, where \epsilon is a termination criterion between 0 and 1 and k indexes the iteration steps; this procedure converges to a local minimum or a saddle point of J_m

Fuzzy Clustering
(Figure: hard vs. fuzzy membership functions over clusters C1 and C2.)
- Properties: u_{ij} in [0, 1] for all i, j; \sum_{j=1}^C u_{ij} = 1 for all i; 0 < \sum_{i=1}^N u_{ij} < N for all j

Speculations
- Correlation between m and \epsilon: more iterations k are needed for a smaller \epsilon
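A minimal Python sketch of one fuzzy c-means iteration using the membership and center updates above; the fuzzifier m = 2 and the toy data are assumptions.

```python
import math

def fcm_step(points, centers, m=2.0):
    """One fuzzy c-means iteration: update memberships u_ij, then centers c_j."""
    C, eps = len(centers), 1e-12
    # u_ij = 1 / sum_k (||x_i - c_j|| / ||x_i - c_k||)^(2/(m-1))
    U = []
    for x in points:
        d = [max(math.dist(x, c), eps) for c in centers]
        U.append([1.0 / sum((d[j] / d[k]) ** (2 / (m - 1)) for k in range(C))
                  for j in range(C)])
    # c_j = sum_i u_ij^m * x_i / sum_i u_ij^m
    new_centers = []
    for j in range(C):
        w = [U[i][j] ** m for i in range(len(points))]
        new_centers.append(tuple(sum(wi * x[t] for wi, x in zip(w, points)) / sum(w)
                                 for t in range(len(points[0]))))
    return U, new_centers

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
U, centers = fcm_step(pts, centers=[(0.0, 0.5), (5.5, 5.0)])
print([[round(u, 2) for u in row] for row in U])   # each row of memberships sums to 1
```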
Hierarchical Clustering: Basic Process
1. Start by assigning each item to its own cluster, so that N items give N clusters. (Let the distances between the clusters equal the distances between the items they contain.)
2. Find the closest (most similar) pair of clusters and merge them into a single cluster.
3. Compute the distances between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Hierarchical Clustering (Ex)
(Figure: points A-G and the resulting dendrogram, with similarity on the vertical axis.)

Hierarchical Clustering Algorithms
- Single-linkage clustering: the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second)
- Complete-linkage clustering: the distance between two clusters is the maximum of the distances between all pairs of patterns drawn from the two clusters
- Average-linkage clustering; minimum-variance algorithm

Single-/Complete-Link Clustering
(Figure: the same two-cluster point set clustered by single link and by complete link.)

Single-Linkage Hierarchical Clustering Steps
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number, m = m + 1; merge clusters (r) and (s) into a single cluster to form clustering m; set L(m) = d[(r),(s)].
4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column for the newly formed cluster; the proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min( d[(k),(r)], d[(k),(s)] ).
5. If all objects are in one cluster, stop; else go to step 2.
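A minimal Python sketch of single-linkage agglomerative clustering following steps 1-5 above; instead of maintaining the proximity matrix explicitly, it recomputes the pairwise-minimum linkage, which is equivalent for single link.

```python
import math
from itertools import combinations

def single_linkage(points):
    """Single-linkage agglomerative clustering; returns the merge history
    as (level L(m), merged cluster) pairs."""
    clusters = [frozenset([i]) for i in range(len(points))]          # step 1
    d = {frozenset([a, b]): math.dist(points[a], points[b])
         for a, b in combinations(range(len(points)), 2)}
    dist = lambda r, s: min(d[frozenset([i, j])] for i in r for j in s)
    history = []
    while len(clusters) > 1:                                          # step 5
        r, s = min(combinations(clusters, 2), key=lambda p: dist(*p)) # step 2
        history.append((dist(r, s), r | s))                           # L(m) = d[(r),(s)]
        clusters = [c for c in clusters if c not in (r, s)] + [r | s] # steps 3-4
    return history

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for level, cluster in single_linkage(pts):
    print(round(level, 2), sorted(cluster))
```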
Ex: Single-Linkage
(Figure: a single-linkage dendrogram over cities/states.)

Agglomerative Hierarchical Clustering (over a bitmap index)
- Bit-vector distance: d(b_i, b_j) = max( b_i XOR b_j )

ALGORITHM Agglomerative Hierarchical Clustering
INPUT: bit-vectors B in bitmap index BI
OUTPUT: a tree T
METHOD:
(1) Place each bit-vector B_i in its own cluster (singleton), creating the list of clusters L (initially the leaves of T): L = B_1, B_2, ..., B_n.
(2) Compute a merging cost function d(B_i, B_j) = min_{b_i in B_i, b_j in B_j} d(b_i, b_j) between every pair of elements in L to find the two closest clusters {B_i, B_j}, which will be the cheapest couple to merge.
(3) Remove B_i and B_j from L.
(4) Merge B_i and B_j to create a new internal node B_ij in T, which will be the parent of B_i and B_j in the result tree.
(5) Repeat from (2) until there is only one set remaining.

Graph-Theoretic Clustering
- Construct the minimal spanning tree (MST), then delete the MST edges with the largest lengths
(Figure: points A-G connected by an MST with edge lengths such as 0.5, 1.5, 1.7, 3.5, 6.5.)

Improving k-Means [D. Pelleg and A. Moore, Accelerating Exact k-means Algorithms with Geometric Reasoning, Proc. of ACM KDD Conf., 1999]
- Definitions: centers of clusters; center of a rectangle h and owner(h); c_1 dominates c_2 w.r.t. h if h lies on the same side as c_1 with respect to c_2 (pp. 7, 9)
- Updating centroids: if for all other centers c', c dominates c' w.r.t. h (so c = owner(h), p. 10), insert h into owner(h); otherwise split h
- Blacklist version: c_1 dominates c_2 w.r.t. any h' contained in h (p. 11)

Clustering Categorical Data: ROCK [S. Guha, R. Rastogi, K. Shim, ROCK: Robust Clustering using linKs, Proc. of IEEE Conf. on Data Engineering, 1999]
- Uses links to measure similarity/proximity; not distance based
- Computational complexity: O( n^2 + n m_m m_a + n^2 \log n )
- Basic idea, similarity function and neighbors: let T1 = {1,2,3}, T2 = {3,4,5}; Sim(T1, T2) = |T1 \cap T2| / |T1 \cup T2| = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2

Using the Jaccard Coefficient
- Cluster 1 (subsets of <1,2,3,4,5>): {1,2,3} {1,4,5} {1,2,4} {2,3,4} {1,2,5} {2,3,5} {1,3,4} {2,4,5} {1,3,5} {3,4,5}
- Cluster 2 (subsets of <1,2,6,7>): {1,2,6} {1,2,7} {1,6,7} {2,6,7}
- According to the Jaccard coefficient, the distance between {1,2,3} and {1,2,6} is the same as the one between {1,2,3} and {1,2,4}, although the former pair comes from two different clusters

ROCK
- Inducing links: the main problem is that only local properties involving the two points are considered
  - Neighbor: if two points are similar enough to each other, they are neighbors
  - Link: the link for a pair of points is the number of common neighbors

Rock: Algorithm
- Links: the number of common neighbors of the two points; e.g., among {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}, the pair {1,2,3} and {1,2,4} has link 3
- Algorithm: draw a random sample; cluster with links; label the data on disk
- Criterion function, to maximize links within the k clusters: E_l = \sum_{i=1}^k \sum_{P_q, P_r \in C_i} link(P_q, P_r), where C_i denotes cluster i of size n_i
- For the similarity threshold 0.5: link({1,2,6}, {1,2,7}) = 4; link({1,2,6}, {1,2,3}) = 3; link({1,6,7}, {1,2,3}) = 2; link({1,2,3}, {1,4,5}) = 3

More on Hierarchical Clustering Methods
- Major weaknesses of agglomerative clustering methods: they do not scale well (time complexity of at least O(n^2), where n is the number of total objects), and they can never undo what was done previously
- Integration of hierarchical with distance-based clustering:
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  - CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

BIRCH [Zhang, Ramakrishnan, Livny, BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, Proc. of ACM SIGMOD Conf., 1996]
- Pre-clusters data points using a CF-tree; for each point, the CF-tree is traversed to find the closest cluster; if the threshold criterion is satisfied, the point is absorbed into the cluster, otherwise it forms a new cluster
- Requires only a single scan of the data
- Cluster summaries stored in the CF-tree are then given to a main-memory hierarchical clustering algorithm

Initialization of BIRCH
- The CF of a cluster of n d-dimensional vectors V_1, ..., V_n is defined as (n, LS, SS): n is the number of vectors, LS is the sum of the vectors, SS is the sum of squares of the vectors
- CF_1 + CF_2 = (n_1 + n_2, LS_1 + LS_2, SS_1 + SS_2); this additivity is what allows cluster features to be maintained incrementally
- The distance between two clusters CF_1 and CF_2 is defined as the distance between their centroids

Clustering Feature Vector
- Clustering feature: CF = (N, LS, SS); N: number of data points; LS: linear sum of the N data points, \sum_{i=1}^N X_i; SS: square sum of the N data points, \sum_{i=1}^N X_i^2
- Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190))

Notations
- Given N d-dimensional data points {X_i} in a cluster: centroid X0, radius R, diameter D, centroid Euclidean distance D0, centroid Manhattan distance D1:
  X0 = \sum_{i=1}^N X_i / N
  R = ( \sum_{i=1}^N (X_i - X0)^2 / N )^{1/2}
  D = ( \sum_{i=1}^N \sum_{j=1}^N (X_i - X_j)^2 / ( N(N-1) ) )^{1/2}
  D0 = ( (X0_1 - X0_2)^2 )^{1/2},   D1 = | X0_1 - X0_2 | = \sum_{i=1}^d | X0_1^{(i)} - X0_2^{(i)} |

Notations (2)
- Average inter-cluster distance D2, average intra-cluster distance D3, and variance increase distance D4 between two clusters of N_1 and N_2 points:
  D2 = ( \sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (X_i - X_j)^2 / ( N_1 N_2 ) )^{1/2}
  D3 = ( \sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (X_i - X_j)^2 / ( (N_1+N_2)(N_1+N_2-1) ) )^{1/2}
  D4 = \sum_{k=1}^{N_1+N_2} ( X_k - \frac{1}{N_1+N_2}\sum_{l=1}^{N_1+N_2} X_l )^2 - \sum_{i=1}^{N_1} ( X_i - \frac{1}{N_1}\sum_{l=1}^{N_1} X_l )^2 - \sum_{j=N_1+1}^{N_1+N_2} ( X_j - \frac{1}{N_2}\sum_{l=N_1+1}^{N_1+N_2} X_l )^2
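A minimal Python sketch of BIRCH clustering features, using the five points from the CF example above to check the additivity CF1 + CF2.

```python
# Minimal sketch of BIRCH clustering features; CF = (N, LS, SS), and CFs add
# componentwise, which is what makes incremental maintenance possible.
def cf(points):
    dims = range(len(points[0]))
    n = len(points)
    ls = [sum(p[i] for p in points) for i in dims]
    ss = [sum(p[i] ** 2 for p in points) for i in dims]
    return n, ls, ss

def cf_add(a, b):
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])])

def centroid(c):
    return [x / c[0] for x in c[1]]

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]        # the five points from the slide
print(cf(pts))                                         # (5, [16, 30], [54, 190])
print(cf_add(cf(pts[:2]), cf(pts[2:])) == cf(pts))     # True: CF1 + CF2 = CF of the union
print(centroid(cf(pts)))                               # [3.2, 6.0]
```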
CF Tree
- A CF tree with branching factor B = 7 and leaf capacity L = 6: the root and non-leaf nodes hold entries (CF_i, child_i); leaf nodes hold CF entries and are chained with prev/next pointers
(Figure: a CF tree with a root holding CF1..CF6, non-leaf nodes holding CF1..CF5, and chained leaf nodes.)

Example
- Given threshold T (= 2?) and branching factor B = 3:
  - Insert "3", "6": a single entry (2,(9,45)); after "8" and "1": two entries {(2,(4,10)), (2,(14,100))}
  - Insert "2" as (1,(2,4)): root {(3,(6,14)), (2,(14,100))}, where (3,(6,14)) has children {(2,(3,5)), (1,(3,9))}
  - Insert "5" as (1,(5,25)): root {(3,(6,14)), (3,(19,125))}, where (3,(19,125)) has children {(2,(11,61)), (1,(8,64))}
  - Insert "7" as (1,(7,49)): root {(3,(6,14)), (4,(26,174))}, where (4,(26,174)) has children {(2,(11,61)), (2,(15,113))}

Evaluation of BIRCH
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weaknesses: handles only numeric data and is sensitive to the order of the data records

Data Summarization
- Goal: compress the data into suitable representative objects (OPTICS; Data Bubbles)
- Finding clusters from a hierarchical clustering depends on the "resolution"
(Figure: a dendrogram over items A-G with a similarity axis.)

OPTICS [M. Ankerst, M. Breunig, H. Kriegel, J. Sander, OPTICS: Ordering Points to Identify the Clustering Structure, Proc. of ACM SIGMOD Conf., 1999]
- N_eps(q): the subset of D contained in the eps-neighborhood of q (eps is a radius)
- Definition 1 (directly density-reachable): object p is directly density-reachable from object q w.r.t. eps and MinPts in a set of objects D if (1) p is in N_eps(q), and (2) Card(N_eps(q)) >= MinPts, where Card(N) denotes the cardinality of the set N
- Further definitions: density-reachable (the transitive closure of directly density-reachable); density-connected (p -> o <- q); core-distance_{eps,MinPts}(p) = MinPts-distance(p); reachability-distance_{eps,MinPts}(p, o) w.r.t. o = max( core-distance(o), dist(o, p) )
- Output: a cluster ordering of the points together with their reachability values

Data Bubbles [M. Breunig, H. Kriegel, P. Kroger, J. Sander, Data Bubbles: Quality Preserving Performance Boosting for Hierarchical Clustering, Proc. of ACM SIGMOD Conf., 2001]
- eps-neighborhood of P: N_eps(P) = { X in D | d(P, X) <= eps }
- k-distance of P: the distance d(P, O) such that for at least k objects O' in D it holds that d(P, O') <= d(P, O), and for at most k-1 objects O' in D it holds that d(P, O') < d(P, O); the k-nearest neighbors of P are N_{k-dist(P)}(P)
- MinPts-dist(P): a distance within which there are at least MinPts objects in the neighborhood of P

Data Bubbles (2)
- Structural distortion (Figure 11 of the paper)
- A Data Bubble is B = (n, rep, extent, nnDist): n is the number of objects in X; rep is a representative object for X; extent is an estimate of the radius of X; nnDist is a partial function estimating the k-nearest-neighbor distances within X
- Distance between bubbles B and C: dist(B.rep, C.rep) - [B.extent + C.extent] + [B.nnDist(1) + C.nnDist(1)], and at least max[ B.nnDist(1), C.nnDist(1) ]
K-Means in SQL [C. Ordonez, Integrating K-Means Clustering with a Relational DBMS Using SQL, IEEE TKDE, 2006]
- Dataset: Y = {y_1, y_2, ..., y_n}, a d x n matrix, where y_i is a d x 1 column vector
- K-Means: find k clusters by minimizing the squared error from the centers (squared distance, Eq. (1); objective function, Eq. (2))
- Output matrices: W, the k weights (fractions of n); C, the k means (centroids), a d x k matrix; R, the k variances (squared distances)
- Per-cluster sufficient statistics: M_j contains the d sums of point dimension values in cluster j (d x k overall); Q_j contains the d sums of squared dimension values in cluster j (d x k overall); N_j contains the number of points in cluster j (k x 1)
- M_j = \sum_{y_i \in X_j} y_i;  Q_j = \sum_{y_i \in X_j} y_i y_i^T;  W_j = N_j / \sum_{j=1}^k N_j;  C_j = M_j / N_j;  R_j = Q_j / N_j - C_j C_j^T
- Intermediate tables: YH, YV, YD, YNN, NMQ, WCR

SQL computation (reconstructed sketch of the statements on the slide):
- YH holds the points horizontally (i, Y1, ..., Yd); YV holds them vertically (i, l, val); C holds the current centroids
- Build C from CH:
    INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1; ... INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k;
- Compute squared distances to each centroid:
    INSERT INTO YD SELECT i, SUM((YV.val - C.C1)**2) AS d1, ..., SUM((YV.val - C.Ck)**2) AS dk FROM YV, C WHERE YV.l = C.l GROUP BY i;
- Assign each point to its nearest centroid:
    INSERT INTO YNN SELECT i, CASE WHEN d1 <= d2 AND ... AND d1 <= dk THEN 1 WHEN d2 <= d3 ... THEN 2 ... ELSE k END FROM YD;
- Update the sufficient statistics:
    INSERT INTO NMQ SELECT l, j, SUM(1.0) AS N, SUM(YV.val) AS M, SUM(YV.val * YV.val) AS Q FROM YV, YNN WHERE YV.i = YNN.i GROUP BY l, j;
- WCR (weights, centroids, variances) is then derived from NMQ
(Figure: example contents of Y, YH, YV, C, YD, YNN, NMQ and WCR for a small two-cluster data set.)

Incremental Data Summarization [S. Nassar, J. Sander, C. Cheng, Incremental and Effective Data Summarization for Dynamic Hierarchical Clustering, Proc. of ACM SIGMOD Conf., 2004]
- For D = {X_i}, 1 <= i <= N, and a set of data bubbles, the data index of a bubble is rho_i = n/N
- For D with mean and standard deviation of the index, by Chebyshev's bound P( |X - \mu_X| >= k \sigma_X ) <= 1/k^2; a bubble is "good" iff its index lies in [\mu - \delta, \mu + \delta], "under-filled" iff it is < \mu - \delta, and "over-filled" iff it is > \mu + \delta

Research Issues
- Reduction; dimensions; approximation

Cure: The Algorithm [Guha, Rastogi & Shim, CURE: An Efficient Clustering Algorithm for Large Databases, Proc. of ACM SIGMOD Conf., 1998]
- Draw a random sample s
- Partition the sample into p partitions of size s/p
- Partially cluster each partition into s/(pq) clusters
- Eliminate outliers: by random sampling, and if a cluster grows too slowly, eliminate it
- Cluster the partial clusters
- Label the data on disk

Data Partitioning and Clustering
- Example: s = 50, p = 2, s/p = 25, s/(pq) = 5
(Figure: a sample of points partitioned into two groups and partially clustered.)

Cure: Shrinking Representative Points
- Shrink the multiple representative points towards the gravity center by a fraction alpha
- Multiple representatives capture the shape of the cluster
(Figure: representative points of two elongated clusters before and after shrinking.)
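A minimal Python sketch of CURE's shrinking step: each representative point is pulled toward the cluster centroid by a fraction alpha (the points and alpha = 0.3 are illustrative).

```python
# Minimal sketch of CURE's representative-point shrinking: each of the
# well-scattered points is moved toward the cluster centroid by a fraction alpha.
def shrink(representatives, alpha=0.3):
    dims = range(len(representatives[0]))
    centroid = [sum(p[d] for p in representatives) / len(representatives) for d in dims]
    return [tuple(p[d] + alpha * (centroid[d] - p[d]) for d in dims)
            for p in representatives]

reps = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(shrink(reps, alpha=0.3))   # each point pulled 30% of the way toward (2.0, 2.0)
```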
Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features: discovers clusters of arbitrary shape; handles noise; one scan; needs density parameters as a termination condition
- Several interesting studies: DBSCAN (Ester et al., KDD'96); OPTICS (Ankerst et al., SIGMOD'99); DENCLUE (Hinneburg & Keim, KDD'98); CLIQUE (Agrawal et al., SIGMOD'98)

CLIQUE (Clustering In QUEst) [Agrawal, Gehrke, Gunopulos, Raghavan, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Proc. of ACM SIGMOD Conf., 1998]
- Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
- CLIQUE can be considered both density-based and grid-based:
  - It partitions each dimension into the same number of equal-length intervals, and thereby partitions a d-dimensional data space into non-overlapping rectangular units
  - A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter (the density threshold)
  - A cluster is a maximal set of connected dense units within a subspace
(Figure: salary and vacation plotted against age in 10-unit intervals, with dense regions highlighted, e.g., the slice Vacation = 3 for ages 30-50.)

CLIQUE: The Major Steps
- Partition the data space and find the number of points that lie inside each cell of the partition
- Identify the subspaces that contain clusters using the Apriori principle
- Identify clusters: determine dense units in all subspaces of interest, then determine connected dense units in all subspaces of interest
- Generate a minimal description for the clusters: determine maximal regions that cover a cluster of connected dense units for each cluster, then determine a minimal cover for each cluster

Strengths and Weaknesses of CLIQUE
- Strengths: automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces; insensitive to the order of records in the input and does not presume any canonical data distribution; scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
- Weakness: the accuracy of the clustering result may be degraded at the expense of the simplicity of the method
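A toy Python sketch of CLIQUE's counting step in the full space: partition each dimension into equal-length intervals and keep the units whose fraction of points exceeds the density threshold. The real algorithm proceeds bottom-up over subspaces with the Apriori principle; the grid bounds and threshold here are assumptions.

```python
from collections import Counter

def dense_units(points, intervals=4, lo=0.0, hi=8.0, tau=0.2):
    """Partition each dimension of [lo, hi) into `intervals` equal-length
    intervals and return the grid cells whose point fraction exceeds tau."""
    width = (hi - lo) / intervals
    cell = lambda p: tuple(min(int((x - lo) / width), intervals - 1) for x in p)
    counts = Counter(cell(p) for p in points)
    return {u for u, c in counts.items() if c / len(points) > tau}

pts = [(1, 1), (1.2, 0.9), (0.8, 1.4), (6, 6), (6.5, 6.2), (3, 7)]
print(dense_units(pts))   # {(0, 0), (3, 3)} for a 4x4 grid over [0, 8)^2
```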
Model-Based Clustering
- Assume the data are generated from K probability distributions, typically Gaussian
- A soft, or probabilistic, version of K-means clustering
- Need to find the distribution parameters: the EM algorithm

EM Algorithm
- Initialize K cluster centers
- Iterate between two steps:
  - Expectation step: assign points to clusters
    P(d_i \in c_k) = w_k Pr(d_i | c_k) / \sum_j w_j Pr(d_i | c_j),   w_k = \sum_i P(d_i \in c_k) / N
  - Maximization step: estimate the model parameters
    \mu_k = (1/m) \sum_{i=1}^m d_i P(d_i \in c_k) / \sum_j P(d_i \in c_j)

CURE (Clustering Using REpresentatives) [Guha, Rastogi & Shim, CURE: An Efficient Clustering Algorithm for Large Databases, Proc. of ACM SIGMOD Conf., 1998]
- Stops the creation of a cluster hierarchy if a level consists of k clusters
- Uses multiple representative points to evaluate the distance between clusters; adjusts well to arbitrarily shaped clusters and avoids the single-link effect

Drawbacks of Distance-Based Methods
- Drawbacks of square-error-based clustering methods: they consider only one point as the representative of a cluster, and they are good only for convex-shaped clusters of similar size and density, and only if k can be reasonably estimated

BIRCH: Drawbacks
- Dependent on the order of insertions; works for convex, isotropic clusters of uniform size
- Labeling problem (centroid approach): even with correct centers, we cannot label points correctly

Jensen-Shannon Divergence
- Jensen-Shannon (JS) divergence between two probability distributions:
  JS_\pi(p_1, p_2) = \pi_1 KL(p_1, \pi_1 p_1 + \pi_2 p_2) + \pi_2 KL(p_2, \pi_1 p_1 + \pi_2 p_2) = H(\pi_1 p_1 + \pi_2 p_2) - \pi_1 H(p_1) - \pi_2 H(p_2), where \pi_1, \pi_2 >= 0 and \pi_1 + \pi_2 = 1
- JS divergence between a finite number of probability distributions:
  JS_\pi({p_1, ..., p_n}) = \sum_i \pi_i KL(p_i, \pi_1 p_1 + ... + \pi_n p_n) = H( \sum_i \pi_i p_i ) - \sum_i \pi_i H(p_i)

Information-Theoretic Clustering (preserving mutual information)
- (Lemma) The loss in mutual information equals:
  I(X, Y) - I(X, \hat{Y}) = \sum_{j=1}^k \pi(\hat{y}_j) JS*( { p(x | y_t) : y_t \in \hat{y}_j } )
- Interpretation: the quality of each cluster is measured by the Jensen-Shannon divergence between the individual distributions in the cluster
- Can be rewritten as: I(X, Y) - I(X, \hat{Y}) = \sum_{j=1}^k \sum_{y_t \in \hat{y}_j} \pi_t KL( p(x | y_t), p(x | \hat{y}_j) )
- Goal: find a clustering that minimizes this loss
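A small Python sketch of the Jensen-Shannon divergence as defined above, with equal weights pi1 = pi2 = 0.5.

```python
import math

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p1, p2, pi1=0.5, pi2=0.5):
    """JS(p1, p2) = pi1*KL(p1 || m) + pi2*KL(p2 || m), where m = pi1*p1 + pi2*p2."""
    m = [pi1 * a + pi2 * b for a, b in zip(p1, p2)]
    return pi1 * kl(p1, m) + pi2 * kl(p2, m)

p1, p2 = [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]
print(js(p1, p2))   # 0.5 bits
print(js(p1, p1))   # 0.0: identical distributions
```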
Information-Theoretic Co-Clustering (preserving mutual information)
- (Lemma) The loss in mutual information equals:
  I(X, Y) - I(\hat{X}, \hat{Y}) = KL( p(x, y) || q(x, y) ) = H(\hat{X}, \hat{Y}) + H(X | \hat{X}) + H(Y | \hat{Y}) - H(X, Y),
  where q(x, y) = p(\hat{x}, \hat{y}) p(x | \hat{x}) p(y | \hat{y}), with x \in \hat{x}, y \in \hat{y}
- It can be shown that q(x, y) is a "maximum entropy" approximation to p(x, y)
- q(x, y) preserves the marginals: q(x) = p(x) and q(y) = p(y)

Example
(Figure: a 6x6 joint distribution p(x, y), the conditionals p(x | \hat{x}) and p(y | \hat{y}), the co-cluster joint p(\hat{x}, \hat{y}), and the resulting approximation q(x, y).)
- The number of parameters that determine q is (m - k) + (kl - 1) + (n - l)

Preserving Mutual Information
- Lemma: KL( p(x, y) || q(x, y) ) = \sum_{\hat{x}} \sum_{x \in \hat{x}} p(x) KL( p(y | x) || q(y | \hat{x}) ), where q(y | \hat{x}) = p(y | \hat{y}) p(\hat{y} | \hat{x})
- Note that q(y | \hat{x}) may be thought of as the "prototype" of row cluster \hat{x} (the usual "centroid" of the cluster is \sum_{x \in \hat{x}} p(y | x) p(x | \hat{x}))
- Similarly, KL( p(x, y) || q(x, y) ) = \sum_{\hat{y}} \sum_{y \in \hat{y}} p(y) KL( p(x | y) || q(x | \hat{y}) )

Example - Continued
(Figure: the row-cluster prototypes q(y | \hat{x}), the co-cluster joint p(\hat{x}, \hat{y}), and the column-cluster prototypes q(x | \hat{y}) for the running example.)

Co-Clustering Algorithm
(Figure: the alternating co-clustering algorithm.)

Properties of the Co-Clustering Algorithm
- Theorem: the co-clustering algorithm monotonically decreases the loss in mutual information (the objective function value)
- The marginals p(x) and p(y) are preserved at every step (q(x) = p(x) and q(y) = p(y))
- Can be generalized to higher dimensions

Applications -- Text Classification
- Assigning class labels to text documents; training and testing phases
(Figure: a document collection grouped into classes 1..m trains a classifier, which assigns a class to each new document.)

Dimensionality Reduction
- Feature selection: map each document's bag of words to a vector over words 1..k; select the "best" words and throw away the rest (frequency-based pruning, information-criterion-based pruning)
- Feature clustering: do not throw away words; cluster words instead, and use the clusters 1..k as features

Experiments
- Data sets:
  - 20 Newsgroups data: 20 classes, 20000 documents
  - Classic3 data set: 3 classes (cisi, med and cran), 3893 documents
  - Dmoz Science HTML data: 49 leaves in the hierarchy, 5000 documents with 14538 words; available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
- Implementation details: Bow, for indexing, co-clustering, clustering and classifying

Naive Bayes with Word Clusters
- Naive Bayes classifier: assign document d to the class with the highest score
  c*(d) = arg max_i ( \log p(c_i) + \sum_{t=1}^v p(w_t | d) \log p(w_t | c_i) )
- Relation to KL divergence: c*(d) = arg min_i ( KL( p(W | d), p(W | c_i) ) - \log p(c_i) )
- Using word clusters instead of words:
  c*(d) = arg max_i ( \log p(c_i) + \sum_{s=1}^k p(\hat{x}_s | d) \log p(\hat{x}_s | c_i) ),
  where the parameters for the clusters are estimated according to the joint statistics
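A minimal Python sketch of the cluster-based Naive Bayes score above; the class priors, cluster names and conditional probabilities are hypothetical placeholders, not values from the experiments.

```python
import math

def nb_score(doc_cluster_counts, class_priors, cluster_given_class):
    """Return the class maximizing log p(c) + sum_s p(xhat_s | d) log p(xhat_s | c),
    where xhat_s are word clusters (all probabilities assumed precomputed)."""
    total = sum(doc_cluster_counts.values())
    p_s_given_d = {s: n / total for s, n in doc_cluster_counts.items()}
    def score(c):
        return math.log(class_priors[c]) + sum(
            p * math.log(cluster_given_class[c][s]) for s, p in p_s_given_d.items())
    return max(class_priors, key=score)

# Hypothetical two-class, two-word-cluster model.
priors = {"med": 0.5, "cran": 0.5}
p_cluster = {"med":  {"body": 0.8, "flight": 0.2},
             "cran": {"body": 0.1, "flight": 0.9}}
print(nb_score({"body": 7, "flight": 1}, priors, p_cluster))   # 'med'
```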
Selecting Correlated Attributes [T. Fukuda, Y. Morimoto, S. Morishita, T. Tokuyama, Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules, Proc. of VLDB Conf., 1996]
- To decide whether A and A' are strongly correlated, the entropy reduction of the optimized region, E(S) - E( S(R_opt); S(\bar{R}_opt) ), is compared against \theta [ E(S) - min{ E( S(I); S(\bar{I}) ), E( S(I'); S(\bar{I'}) ) } ], where \theta >= 1 is a threshold

MDL-Based Decision Tree Pruning [M. Mehta, J. Rissanen, R. Agrawal, MDL-based Decision Tree Pruning, Proc. of KDD Conf., 1995]
- Two steps for the induction of decision trees: (1) construct a DT using the training data; (2) reduce the DT by pruning to prevent "overfitting"
- Possible approaches: cost-complexity pruning (uses a separate set of samples for pruning); DT pruning using the same training data set for testing; MDL-based pruning, using the Minimum Description Length (MDL) principle

Pruning Using the MDL Principle
- View the decision tree as a means of efficiently encoding the classes of the records in the training set
- MDL principle: the best tree is the one that can encode the records using the fewest bits
- The cost of encoding the tree includes: 1 bit for encoding the type of each node (e.g., leaf or internal); C_split, the cost of encoding the attribute and value for each split; n*E, the cost of encoding the n records in each leaf (E is the entropy)

Pruning Using the MDL Principle (2)
- Problem: compute the minimum-cost subtree at the root of the built tree
- Suppose minC_N is the cost of encoding the minimum-cost subtree rooted at N; prune the children of a node N if minC_N = n*E + 1
- Compute minC_N as follows: if N is a leaf, minC_N = n*E + 1; if N has children N1 and N2, minC_N = min{ n*E + 1, C_split + 1 + minC_{N1} + minC_{N2} }
- Prune the tree in a bottom-up fashion

MDL Pruning - Example [R. Rastogi, K. Shim, PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, Proc. of VLDB Conf., 1998]
(Figure: a node N, reached via "education in graduate", holding records with salaries 10, 18, 40 and labels reject, reject, accept; N splits on "salary < 40K" into leaves N1 and N2, each with cost 1, while N itself has cost 3.8.)
- Cost of encoding the records in N: n*E + 1 = 3.8
- C_split = 2.6
- minC_N = min{ 3.8, 2.6 + 1 + 1 + 1 } = 3.8
- Since minC_N = n*E + 1, N1 and N2 are pruned

PUBLIC [R. Rastogi, K. Shim, PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, Proc. of VLDB Conf., 1998]
- Prune the tree during (not after) the building phase: execute the pruning algorithm (periodically) on the partial tree
- Problem: how to compute minC_N for a "yet to be expanded" leaf N in a partial tree
- Solution: compute a lower bound on the subtree cost at N and use this as minC_N when pruning; minC_N is thus a "lower bound" on the cost of the subtree rooted at N
- Prune the children of a node N if minC_N = n*E + 1
- Guaranteed to generate a tree identical to that generated by SPRINT
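A small Python sketch of the bottom-up minC_N recursion above; the class counts at N are illustrative and merely chosen so that n*E + 1 comes out near the slide's 3.8.

```python
import math

def entropy_cost(class_counts):
    """n*E + 1: bits to encode the records in a leaf plus one bit for the node type."""
    n = sum(class_counts)
    e = -sum(c / n * math.log2(c / n) for c in class_counts if c > 0)
    return n * e + 1

def min_cost(node):
    """minC_N: cost of the cheapest subtree rooted at node; prune children when
    encoding the node as a leaf is no more expensive than keeping the split."""
    leaf = entropy_cost(node["counts"])
    if "children" not in node:
        return leaf, node
    kept = node["csplit"] + 1 + sum(min_cost(c)[0] for c in node["children"])
    if leaf <= kept:                      # prune: the leaf encoding wins (or ties)
        return leaf, {"counts": node["counts"]}
    return kept, node

# Illustrative counts: 3 records at N (1 reject, 2 accept), Csplit = 2.6,
# and two pure child leaves, as in the example above.
tree = {"counts": [1, 2], "csplit": 2.6,
        "children": [{"counts": [1]}, {"counts": [2]}]}
cost, pruned = min_cost(tree)
print(round(cost, 1), "children" in pruned)   # 3.8 False -> N1 and N2 pruned
```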
PUBLIC(1)
- Training data: (10K, high-school, reject), (40K, under, accept), (15K, under, reject), (75K, grad, accept), (18K, grad, accept)
- Node N splits on "education in graduate" into leaves N1 and N2
- Simple lower bound for a subtree: 1
- Cost of encoding the records in N: n*E + 1 = 5.8
- C_split = 4
- minC_N = min{ 5.8, 4 + 1 + 1 + 1 } = 5.8
- Since minC_N = n*E + 1, N1 and N2 are pruned

PUBLIC(S)
- Theorem: the cost of any subtree with s splits and k classes rooted at node N is at least 2s + 1 + s \log a + \sum_{i=s+2}^k n_i, where a is the number of attributes, k is the number of classes, and n_i (>= n_{i+1}) is the number of records belonging to class i
- The lower bound on the subtree cost at N is thus the minimum of n*E + 1 (the cost with zero splits) and 2s + 1 + s \log a + \sum_{i=s+2}^k n_i

What is Clustering?
- Clustering is a kind of unsupervised learning
- Clustering is a method of grouping data that share similar trends and patterns
- Clustering of data is a method by which a large set of data is grouped into clusters of smaller sets of similar data
- Thus, clustering means grouping data, or dividing a large data set into smaller data sets that share some similarity

Partitional Algorithms
- Enumerate K partitions optimizing some criterion
- Example: the square-error criterion e^2 = \sum_{j=1}^k \sum_{i=1}^{n_j} || x_i^{(j)} - c_j ||^2, where x_i^{(j)} is the ith pattern belonging to the jth cluster and c_j is the centroid of the jth cluster

Squared-Error Clustering Method
1. Select an initial partition of the patterns with a fixed number of clusters and cluster centers.
2. Assign each pattern to its closest cluster center and compute the new cluster centers as the centroids of the clusters. Repeat this step until convergence is achieved, i.e., until the cluster membership is stable.
3. Merge and split clusters based on some heuristic information, optionally repeating step 2.

Agglomerative Clustering Algorithm
1. Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.
2. Step through the sorted list of distances, forming for each distinct dissimilarity value d_k a graph on the patterns where pairs of patterns closer than d_k are connected by a graph edge. If all the patterns are members of a connected graph, stop. Otherwise, repeat this step.
3. The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level, forming a partition identified by the simply connected components in the corresponding graph.
Agglomerative Hierarchical Clustering
- The most widely used hierarchical clustering algorithm
- Initially each point is a distinct cluster; repeatedly merge the closest clusters until the number of clusters becomes K
- "Closest": d_mean(C_i, C_j) = || m_i - m_j ||; d_min(C_i, C_j) = min_{p \in C_i, q \in C_j} || p - q ||; likewise d_ave(C_i, C_j) and d_max(C_i, C_j)

Clustering: Summary of Drawbacks of Traditional Methods
- Partitional algorithms split large clusters
- Centroid-based methods split large and non-hyperspherical clusters; centers of subclusters can be far apart
- The minimum spanning tree algorithm is sensitive to outliers and to slight changes in position, and exhibits a chaining effect on strings of outliers
- Cannot scale up for large databases

Model-Based Clustering: Mixture of Gaussians
- Each data point is drawn from a Gaussian component N(\mu_i, \sigma^2 I) with mixing probability P(i)
- Consider data points x_1, x_2, ..., x_N and mixing probabilities P(1), ..., P(k); the likelihood function is
  L = P(data | \mu_1, \mu_2, ..., \mu_k) = \prod_{i=1}^N \sum_j P(j) p(x_i | \mu_j, \sigma^2)
- Maximize the likelihood function by setting \partial L / \partial \mu_i = 0

Overview of EM Clustering
- Extensions and generalizations: the EM (expectation maximization) algorithm extends the k-means clustering technique in two important ways:
  1. Instead of assigning cases or observations to clusters to maximize the differences in means for continuous variables, the EM clustering algorithm computes probabilities of cluster memberships based on one or more probability distributions; the goal of the clustering algorithm then is to maximize the overall probability or likelihood of the data, given the (final) clusters.
  2. Unlike the classic implementation of k-means clustering, the general EM algorithm can be applied to both continuous and categorical variables (note that the classic k-means algorithm can also be modified to accommodate categorical variables).

EM Algorithm
- The EM algorithm for clustering is described in detail in Witten and Frank (2001); the basic approach and logic of this clustering method is as follows
- Suppose you measure a single continuous variable in a large sample of observations; further, suppose that the sample consists of two clusters of observations with different means (and perhaps different standard deviations), and that within each cluster the distribution of values for the continuous variable follows the normal distribution
(Figure: the resulting bimodal distribution of values in the population.)

EM vs. k-Means
- Classification probabilities instead of classifications: the results of EM clustering are different from those computed by k-means clustering
- The latter assigns observations to clusters to maximize the distances between clusters; the EM algorithm does not compute actual assignments of observations to clusters, but classification probabilities, so each observation belongs to each cluster with a certain probability
- Of course, as a final result you can usually review an actual assignment of observations to clusters, based on the (largest) classification probability
Finding k
- V-fold cross-validation: this type of cross-validation is useful when no test sample is available and the learning sample is too small to have the test sample taken from it
- A specified V value for V-fold cross-validation determines the number of random subsamples, as equal in size as possible, that are formed from the learning sample; the classification tree of the specified size is computed V times, each time leaving out one of the subsamples from the computations and using that subsample as a test sample for cross-validation, so that each subsample is used V - 1 times in the learning sample and just once as the test sample
- The CV costs computed for each of the V test samples are then averaged to give the V-fold estimate of the CV costs

Expectation Maximization (worked example)
- A mixture of outcomes: x1 = 30 with P(x1) = 1/2; x2 = 18 with P(x2) = u; x3 = 0 with P(x3) = 2u; x4 = 23 with P(x4) = 1/2 - 3u
- Likelihood for a students observing x1, b observing x2, c observing x3 and d observing x4:
  L = P(a, b, c, d | u) = (1/2)^a u^b (2u)^c (1/2 - 3u)^d
- To maximize L, take the log-likelihood:
  \log L = a \log(1/2) + b \log u + c \log(2u) + d \log(1/2 - 3u),
  \partial \log L / \partial u = (b + c)/u - 3d/(1/2 - 3u) = 0  =>  u = (b + c) / ( 6(b + c + d) )
- Supposing a = 14, b = 6, c = 9, d = 10, then u = 1/10
- If only the total h = a + b of the x1 and x2 observations is known, the expectation step splits it as a = h / (2u + 1) and b = 2uh / (2u + 1)

Gaussian (Normal) pdf
- p(x | \mu, \sigma) = ( 1 / ( \sigma \sqrt{2\pi} ) ) e^{ -(x - \mu)^2 / (2\sigma^2) }
- The Gaussian function with mean \mu and standard deviation \sigma; properties:
  - Symmetric about the mean; attains its maximum value at the mean and its minimum value at plus and minus infinity; the distribution is often referred to as "bell shaped"
  - At one standard deviation from the mean the function has dropped to about 2/3 of its maximum value; at two standard deviations it has fallen to about 1/7
  - The area under the function within one standard deviation of the mean is about 0.682; within two standard deviations it is 0.9545, and within three it is 0.9973; the total area under the curve is 1

Gaussian (2)
- The cumulative distribution F_{\mu,\sigma^2}(x) = \int_{-\infty}^x p(z | \mu, \sigma^2) dz
- If x ~ Uniform(0, 1), then F^{-1}_{\mu,\sigma^2}(x) ~ p(x | \mu, \sigma^2) (inverse-transform sampling)

Multi-Variate Density Estimation: Mixture of Gaussians
- p(x | \theta) = \sum_{j=1}^k p_j p(x | \mu_j, \sigma_j^2), with \theta = ( p_1, ..., p_k, \mu_1, ..., \mu_k, \sigma_1^2, ..., \sigma_k^2 )
- \theta contains all the parameters of the mixture model; the {p_j} are known as mixing proportions or coefficients
- Generic mixture: p(x | \theta) = \sum_{j=1,2} P(y = j) p(x | y = j)
(Figure: a two-component Gaussian mixture with components P(x | y = 1) and P(x | y = 2) weighted by P(y).)

Mixture Density
- If we are given just x, we do not know which mixture component this example came from: p(x | \theta) = \sum_{j=1}^k p_j p(x | \mu_j, \sigma_j^2)
- We can evaluate the posterior probability that an observed x was generated from the first mixture component:
  P(y = 1 | x, \theta) = P(y=1) p(x | y=1) / \sum_{j=1,2} P(y=j) p(x | y=j) = p_1 p(x | \mu_1, \sigma_1^2) / \sum_{j=1,2} p_j p(x | \mu_j, \sigma_j^2)
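A small Python sketch of the posterior (responsibility) computation above for a two-component 1-D mixture; the component parameters are illustrative.

```python
import math

def gauss(x, mu, sigma):
    """p(x | mu, sigma) for a 1-D Gaussian."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def posterior(x, weights, mus, sigmas):
    """P(y = j | x): responsibility of each mixture component for observation x."""
    joint = [w * gauss(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    z = sum(joint)
    return [j / z for j in joint]

# Two-component mixture: p1 = p2 = 0.5, means 0 and 4, unit variance.
print([round(r, 3) for r in posterior(1.0, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])])
# -> roughly [0.982, 0.018]: x = 1 almost certainly came from the first component
```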