Clustering
Instructor: Qiang Yang
Hong Kong University of Science and Technology
[email protected]
Thanks: J.W. Han, I. Witten, E. Frank
1
Essentials

• Terminology:
  • Objects = rows = records
  • Variables = attributes = features
• A good clustering method: high on intra-class similarity and low on inter-class similarity
• What is similarity?
  • Based on computation of distance:
    • Between two numerical attributes
    • Between two nominal attributes
    • Between mixed attributes
2
The database

The data set is an n × p matrix, in which object i is the i-th row:

\[
\begin{pmatrix}
x_{11} & \cdots & \cdots & x_{1p} \\
\vdots &        &        & \vdots \\
x_{i1} & \cdots & \cdots & x_{ip} \\
\vdots &        &        & \vdots \\
x_{n1} & \cdots & \cdots & x_{np}
\end{pmatrix}
\]
3
Major clustering methods

• Partition-based (k-means)
  • Produces sphere-like clusters
  • Good when the number of clusters is known and the database is small or medium sized
• Hierarchical methods (agglomerative or divisive)
  • Produce trees of clusters
  • Fast
• Density-based (DBSCAN)
  • Produces arbitrarily shaped clusters
  • Good when dealing with spatial clusters (maps)
• Grid-based
  • Produces clusters based on grids
  • Fast for large, multidimensional databases
• Model-based
  • Based on statistical models
  • Allows objects to belong to several clusters
4
The K-Means Clustering Method: for numerical attributes

Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k non-empty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when no new assignments are made (see the sketch below)
5
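As a rough illustration (not code from the slides), here is a minimal NumPy sketch of these steps; it initializes by arbitrarily choosing k objects as the centers, as in the worked example that follows, and stops when the centers no longer move:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch. X is an (n, p) array of numerical attributes."""
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute seed points as the centroids (mean points) of the clusters
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when reassignment no longer changes the centers
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```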
The mean point

[Figure: scatter plot of a small (X, Y) data set with its mean point, approximately (2.5, 2.75), marked.]

The mean point can be a virtual point
6
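As a tiny sketch (the coordinates below are hypothetical, chosen only to be consistent with the mean marked on the slide), the mean of a set of objects need not coincide with any object:

```python
import numpy as np

# Hypothetical (X, Y) objects; their mean is a "virtual" point
points = np.array([[1, 2], [2, 4], [3, 3], [4, 2]])
print(points.mean(axis=0))  # [2.5  2.75] -- not one of the data points
```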
The K-Means Clustering Method

Example (K = 2):
  1. Arbitrarily choose K objects as the initial cluster centers
  2. Assign each object to the most similar center
  3. Update the cluster means
  4. Reassign the objects and update the cluster means again; repeat until the assignments stop changing

[Figure: a sequence of scatter plots showing the assignments and cluster centers after each assign/update step.]
7
Comments on the K-Means Method

• Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
• Comment: often terminates at a local optimum.
• Weaknesses:
  • Applicable only when the mean is defined; what about categorical data?
  • Need to specify k, the number of clusters, in advance
  • Unable to handle noisy data and outliers too well
  • Not suitable for discovering clusters with non-convex shapes
8
Robustness

[Figure: the same small (X, Y) data set, but with one Y value replaced by an extreme outlier (400); the mean of Y jumps from about 2.75 to about 101.5, plotted on a logarithmic scale. A single extreme value can drag the mean far away from the rest of the data.]
9
Variations of the K-Means Method

• A few variants of k-means differ in:
  • Selection of the initial k means
  • Dissimilarity calculations
  • Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang '98)
  • Replacing means of clusters with modes
  • Using new dissimilarity measures to deal with categorical objects
  • Using a frequency-based method to update modes of clusters (see the sketch below)
• A mixture of categorical and numerical data: the k-prototypes method
10
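To make the k-modes ingredients concrete, here is a minimal sketch of a simple-matching dissimilarity and a frequency-based mode update (illustrative helpers of my own, not Huang's original code):

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Number of attributes on which two categorical objects disagree."""
    return sum(a != b for a, b in zip(x, y))

def mode_of(cluster):
    """Frequency-based mode: the most frequent value of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))

# Tiny example
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_of(cluster))                                             # ('red', 'small')
print(matching_dissimilarity(("red", "small"), ("blue", "large")))  # 2
```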
K-Modes: See J. X. Huang’s paper online
(Data Mining and Knowledge Discovery Journal, Springer)
11
Formalization of K-Means
12
K-Means: Cont.
13
K-Modes: See J. X. Huang’s paper online
(Data Mining and Knowledge Discovery Journal, Springer)
14
K-Modes (Cont.)
15
K-Modes
16
K-Modes: Cost Function
17
Finding K-Modes
18
Mixed Types: K-Prototypes
19
K-Modes: Evaluation Data
20
K-Modes: Evaluation
21
Some Experiments
22
What is the problem of the k-means method?

• The k-means algorithm is sensitive to outliers!
  • An object with an extremely large value may substantially distort the distribution of the data.
• K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

[Figure: two scatter plots contrasting a cluster reference point computed as the mean (pulled toward an outlier) with a medoid (an actual, centrally located object).]
23
The K-Medoids Clustering Method

• Find representative objects, called medoids, in clusters
• Medoids are located in the center of the clusters
• Given data points, how to find the medoid? (See the sketch below.)

[Figure: scatter plot with the medoid highlighted among the data points.]
24
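One simple way to find it (a sketch of my own, assuming Euclidean distance): the medoid is the object whose total distance to all other objects in the cluster is smallest.

```python
import numpy as np

def medoid(points):
    """Most centrally located object: minimizes the sum of distances to all others."""
    X = np.asarray(points, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    return X[dists.sum(axis=1).argmin()]

# Unlike the mean, the medoid is an actual data point and is not dragged away by the outlier
print(medoid([[1, 2], [2, 4], [3, 3], [100, 100]]))  # [3. 3.]
```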
K-Medoids: most centrally located objects
25
CLARA
26
CLASA: Simulated Annealing
27
Sampling based method: MCMRS
28
KMedoids: Evaluation
29
Density-Based Clustering Methods

• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as a termination condition
• Several interesting studies:
  • DBSCAN: Ester, et al. (KDD'96)
  • OPTICS: Ankerst, et al. (SIGMOD'99)
  • DENCLUE: Hinneburg & D. Keim (KDD'98)
  • CLIQUE: Agrawal, et al. (SIGMOD'98)
30
Density-Based Clustering

• Clustering based on density (a local cluster criterion), such as density-connected points
• Each cluster has a considerably higher density of points than the region outside of the cluster
31
Density-Based Clustering: Background

• Two parameters:
  • Eps: maximum radius of the neighbourhood
  • MinPts: minimum number of points in an Eps-neighbourhood of that point
• N_Eps(p) = {q belongs to D | dist(p, q) <= Eps}
• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  • 1) p belongs to N_Eps(q), and
  • 2) the core point condition holds: |N_Eps(q)| >= MinPts

[Figure: point q with p inside its Eps-neighbourhood; MinPts = 5, Eps = 1 cm.]
32
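A small sketch of these two definitions (my own illustration, assuming Euclidean distance over a point array D):

```python
import numpy as np

def eps_neighbourhood(D, p, eps):
    """N_eps(p): indices of all points q in D with dist(p, q) <= eps."""
    dists = np.linalg.norm(D - D[p], axis=1)
    return np.where(dists <= eps)[0]

def directly_density_reachable(D, p, q, eps, min_pts):
    """True if point p is directly density-reachable from point q w.r.t. eps, min_pts."""
    n_q = eps_neighbourhood(D, q, eps)
    # 1) p belongs to N_eps(q), and 2) q satisfies the core point condition
    return p in n_q and len(n_q) >= min_pts
```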
Density-Based Clustering: Background (II)

• Density-reachable:
  • A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi

[Figure: a chain of points leading from q through p1 to p.]

• Density-connected:
  • A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

[Figure: p and q both density-reachable from a common point o.]
33
DBSCAN: Density-Based Spatial Clustering of Applications with Noise

• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5.]
34
DBSCAN: The Algorithm

• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed (see the sketch below)
35
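Here is a compact sketch of the algorithm above (my own illustration, not the original DBSCAN implementation); it labels each point with a cluster index, or -1 for noise:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns one label per point: 0, 1, 2, ... or -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.where(dists[i] <= eps)[0] for i in range(n)]  # Eps-neighbourhoods
    labels = np.full(n, -1)          # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbours[p]) < min_pts:
            continue                 # p is not a core point: leave it as noise (or a border point)
        # p is a core point: grow a new cluster from its neighbourhood
        labels[p] = cluster
        queue = list(neighbours[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # border or core point joins the current cluster
            if not visited[q]:
                visited[q] = True
                if len(neighbours[q]) >= min_pts:
                    queue.extend(neighbours[q])  # q is also a core point: expand further
        cluster += 1
    return labels
```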
DBSCAN Properties

• Generally takes O(n log n) time
• Still requires the user to supply MinPts and Eps
• Advantages:
  • Can find clusters of arbitrary shape
  • Requires only a minimal number (2) of parameters
36
Model-Based Clustering Methods

• Attempt to optimize the fit between the data and some mathematical model
• Statistical and AI approaches:
  • Conceptual clustering
    • A form of clustering in machine learning
    • Produces a classification scheme for a set of unlabeled objects
    • Finds a characteristic description for each concept (class)
  • COBWEB (Fisher '87)
    • A popular and simple method of incremental conceptual learning
    • Creates a hierarchical clustering in the form of a classification tree
    • Each node refers to a concept and contains a probabilistic description of that concept
37
The COBWEB Conceptual Clustering Algorithm (8.8.1)

• The COBWEB algorithm was developed by D. Fisher (1987) for clustering objects in an object-attribute data set.
  • Fisher, Douglas H. (1987). Knowledge Acquisition Via Incremental Conceptual Clustering.
• The COBWEB algorithm yields a classification tree that characterizes each cluster with a probabilistic description
  • Probabilistic description of a node: (fish, prob = 0.92)
• Properties:
  • Incremental clustering algorithm, based on probabilistic categorization trees
  • The search for a good clustering is guided by a quality measure for partitions of data
• COBWEB only supports nominal attributes; CLASSIT is the version that works with nominal and numerical attributes
38
The Classification Tree Generated
by the COBWEB Algorithm
39
Input: A set of data like before

• Can automatically guess the class attribute
  • That is, after clustering, each cluster more or less corresponds to one of the Play = Yes/No categories
  • Example: applied to the vote data set, COBWEB can correctly guess the party of a senator based on the past 14 votes!
40
Clustering: COBWEB

• In the beginning, the tree consists of an empty node
• Instances are added one by one, and the tree is updated appropriately at each stage
• Updating involves finding the right leaf for an instance (possibly restructuring the tree)
• Updating decisions are based on the partition utility and category utility measures
41
Clustering: COBWEB

Notation: $A_i = V_{ij}$ denotes an attribute-value pair; $C_k$ denotes a class.

Intra-class similarity: $P(A_i = V_{ij} \mid C_k)$

• The larger this probability, the greater the proportion of class members sharing the value $V_{ij}$, and the more predictable the value is of class members.
42
Clustering: COBWEB

Inter-class similarity: $P(C_k \mid A_i = V_{ij})$

• The larger this probability, the fewer the objects in other classes that share this value $V_{ij}$, and the more predictive the value is of class $C_k$.
43
Clustering: COBWEB

Partition utility:

\[
PU(C_1, C_2, \ldots, C_n) = \sum_{k=1}^{n} \sum_{i} \sum_{j} P(A_i = V_{ij}) \, P(C_k \mid A_i = V_{ij}) \, P(A_i = V_{ij} \mid C_k)
\]

• The formula is a trade-off between intra-class similarity and inter-class dissimilarity, summed across all classes (k), attributes (i), and values (j).
44
Clustering: COBWEB

Rewriting the equation using Bayes' rule:

\[
P(A_i = V_{ij}) \, P(C_k \mid A_i = V_{ij})
  = P(A_i = V_{ij}) \, \frac{P(C_k, A_i = V_{ij})}{P(A_i = V_{ij})}
  = P(C_k, A_i = V_{ij})
  = P(C_k) \, P(A_i = V_{ij} \mid C_k)
\]

Partition utility can therefore be rewritten as:

\[
PU(C_1, C_2, \ldots, C_n) = \sum_{k=1}^{n} P(C_k) \sum_{i} \sum_{j} P(A_i = V_{ij} \mid C_k)^2
\]
45
Clustering: COBWEB

Category utility:

\[
CU(C_1, C_2, \ldots, C_n) = \sum_{k=1}^{n} P(C_k) \sum_{i} \sum_{j} \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]
\]

• The first squared term is the expected number of attribute values that can be correctly guessed given knowledge of the cluster (posterior probability); the second is the expected number of correct guesses given no such knowledge (prior probability). CU measures the increase.
46
The Category Utility Function

• The COBWEB algorithm operates based on the so-called category utility function (CU), which measures clustering quality.
• If we partition a set of objects into m clusters, then the CU of this particular partition is

\[
CU = \frac{1}{m} \sum_{k=1}^{m} P(C_k) \left[ \sum_{i} \sum_{j} P(A_i = V_{ij} \mid C_k)^2 - \sum_{i} \sum_{j} P(A_i = V_{ij})^2 \right]
\]

• Question: Why divide by m?
  • Hint: if m = #objects, CU is maximal!
47
Insights of the CU Function

• For a given object in cluster C_k, if we guess its attribute values according to their probabilities of occurring, then the expected number of attribute values that we can correctly guess is

\[
\sum_{i} \sum_{j} P(A_i = V_{ij} \mid C_k)^2
\]
48
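To make the CU computation concrete, here is a small sketch (the data layout, a list of clusters of attribute dictionaries, and the helper names are my own illustration, not from the slides):

```python
from collections import Counter

def expected_correct_guesses(objects):
    """sum_i sum_j P(A_i = V_ij)^2, estimated from the given objects."""
    total = 0.0
    for attribute in objects[0]:
        counts = Counter(obj[attribute] for obj in objects)
        total += sum((c / len(objects)) ** 2 for c in counts.values())
    return total

def category_utility(partition):
    """CU of a partition: a list of clusters, each a list of attribute dicts."""
    all_objects = [obj for cluster in partition for obj in cluster]
    n, m = len(all_objects), len(partition)
    baseline = expected_correct_guesses(all_objects)     # prior-only guesses
    cu = 0.0
    for cluster in partition:
        p_k = len(cluster) / n                           # P(C_k)
        cu += p_k * (expected_correct_guesses(cluster) - baseline)
    return cu / m                                        # divide by the number of clusters

# Tiny example with two clusters of nominal objects
clusters = [
    [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"}],
    [{"outlook": "rainy", "windy": "yes"}, {"outlook": "rainy", "windy": "yes"}],
]
print(category_utility(clusters))
```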
Finite mixtures

• Probabilistic clustering algorithms model the data using a mixture of distributions
• Each cluster is represented by one distribution
  • The distribution governs the probabilities of attribute values in the corresponding cluster
• They are called finite mixtures because there is only a finite number of clusters being represented
• Usually the individual distributions are normal distributions
• Distributions are combined using cluster weights
49
A two-class mixture model

data (class, value):
  A 51   A 43   B 62   B 64   A 45   A 42   A 46   A 45   A 45
  B 62   A 47   A 52   B 64   A 51   B 65   A 48   A 49   A 46
  B 64   A 51   A 52   B 62   A 49   A 48   B 62   A 43   A 40
  A 48   B 64   A 51   B 63   A 43   B 65   B 66   B 65   A 46
  A 39   B 62   B 64   A 52   B 63   B 64   A 48   B 64   A 48
  A 51   A 48   B 64   A 42   A 48   A 41

model:
  μ_A = 50, σ_A = 5, p_A = 0.6
  μ_B = 65, σ_B = 2, p_B = 0.4
50
Using the mixture model

• The probability of an instance x belonging to cluster A is:

\[
\Pr[A \mid x] = \frac{\Pr[x \mid A]\,\Pr[A]}{\Pr[x]} = \frac{f(x; \mu_A, \sigma_A)\, p_A}{\Pr[x]}
\]

with

\[
f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\]

• The likelihood of an instance given the clusters is:

\[
\Pr[x \mid \text{the distributions}] = \sum_{i} \Pr[x \mid \text{cluster}_i]\,\Pr[\text{cluster}_i]
\]
51
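A small numeric sketch of these formulas, using the parameters of the two-class model from the data slide (the helper name is my own):

```python
import math

def normal_density(x, mu, sigma):
    """f(x; mu, sigma): normal density for one cluster."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Model parameters: A = N(50, 5) with weight 0.6, B = N(65, 2) with weight 0.4
mu_a, sigma_a, p_a = 50.0, 5.0, 0.6
mu_b, sigma_b, p_b = 65.0, 2.0, 0.4

x = 55.0
pr_x = p_a * normal_density(x, mu_a, sigma_a) + p_b * normal_density(x, mu_b, sigma_b)  # Pr[x]
pr_a_given_x = p_a * normal_density(x, mu_a, sigma_a) / pr_x                            # Pr[A | x]
print(pr_a_given_x)
```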
Learning the clusters

• Assume we know that there are k clusters
• To learn the clusters we need to determine their parameters
  • I.e., their means and standard deviations
• We actually have a performance criterion: the likelihood of the training data given the clusters
• Fortunately, there exists an algorithm that finds a local maximum of the likelihood
52
The EM algorithm

• EM algorithm: expectation-maximization algorithm
  • A generalization of k-means to the probabilistic setting
• Similar iterative procedure:
  1. Calculate the cluster probability for each instance (expectation step)
  2. Estimate the distribution parameters based on the cluster probabilities (maximization step)
• Cluster probabilities are stored as instance weights
53
More on EM

• Estimating parameters from weighted instances:

\[
\mu_A = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}
\]

\[
\sigma_A^2 = \frac{w_1 (x_1 - \mu)^2 + w_2 (x_2 - \mu)^2 + \cdots + w_n (x_n - \mu)^2}{w_1 + w_2 + \cdots + w_n}
\]

• The procedure stops when the log-likelihood saturates
• Log-likelihood (increases with each iteration; we wish it to be as large as possible):

\[
\sum_{i} \log\bigl(p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B]\bigr)
\]
54
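Putting the expectation step and these weighted estimates together, here is a minimal EM sketch for the two-component, one-dimensional Gaussian mixture used above (the initialization and stopping threshold are my own choices):

```python
import numpy as np

def em_two_gaussians(x, max_iter=100, tol=1e-6):
    """Minimal EM sketch for a two-component, one-dimensional Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    # Arbitrary initial parameters: means, standard deviations, and cluster weights
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    p = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: cluster probabilities for each instance, stored as instance weights
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        joint = p * dens                                  # p_k * Pr[x_i | cluster k]
        w = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weighted means, standard deviations, and cluster weights
        totals = w.sum(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / totals
        sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / totals)
        p = totals / len(x)
        # Stop when the log-likelihood saturates
        ll = np.log(joint.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, sigma, p
```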