Chapter 9
UNSUPERVISED LEARNING: Clustering, Part 1
Cios / Pedrycz / Swiniarski / Kurgan, © 2007

Outline
• What is Clustering?
  – Categories of clustering methods
  – Similarity measures
• Partition-Based Clustering
• Hierarchical Clustering
• Model-Based (mixture of probabilities) Clustering
• Scalable Clustering
• Grid-Based Clustering
• Cluster Validity
• Clustering of Large Datasets

What is Clustering?
How do we understand data? We look for structure in the data by revealing groups/clusters. Clusters are abstractions of the data. The structure is formed based on similarities between patterns (data points).

How hard is clustering?
Consider N data points to be split into c groups (clusters). The number of possible splits (partitions) is

$$\frac{1}{c!}\sum_{i=1}^{c}(-1)^{c-i}\binom{c}{i}\, i^{N}$$

Even for a small problem of N = 100 and c = 5 we end up with about $10^{67}$ partitions.
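To make that count concrete, here is a minimal sketch (plain Python with exact integer arithmetic; not part of the original slides — the function name is ours) that evaluates the formula above, which is the Stirling number of the second kind:

```python
from math import comb, factorial

def num_partitions(N: int, c: int) -> int:
    """Number of ways to partition N points into exactly c
    nonempty clusters (Stirling number of the second kind)."""
    return sum((-1) ** (c - i) * comb(c, i) * i ** N
               for i in range(1, c + 1)) // factorial(c)

print(num_partitions(4, 2))    # 7 -- small enough to verify by hand
print(num_partitions(100, 5))  # ~6.57e67, the "10^67" quoted above
```

The integer division is exact because the alternating sum is always divisible by c!.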
Clustering Challenges – from Bezdek
[Figure: examples of clustering challenges, after Bezdek]

Categories of Clustering
We distinguish three main categories (classes) of clustering methods:
• Partition-based
• Hierarchical
• Model-based (mixture of probabilities)

Categories of Clustering
• Partitioning approach:
  – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  – Examples: k-means, k-medoids, CLARANS
• Hierarchical approach:
  – Create a hierarchical decomposition of the data using some criterion
  – Examples: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach:
  – Based on connectivity and density functions
  – Examples: DBSCAN, OPTICS, DENCLUE
• Grid-based approach:
  – Based on a multiple-level granularity structure
  – Examples: STING, WaveCluster, CLIQUE

Categories of Clustering
• Model-based:
  – A model is hypothesized for each of the clusters and its best fit is found
  – Examples: EM (expectation maximization), SOM, COBWEB
• Frequent pattern-based:
  – Based on analysis of frequent patterns
  – Example: p-Cluster
• User-guided or constraint-based:
  – Clustering by user-specified or application-specific constraints
  – Examples: COD (obstacles), constrained clustering
• Link-based clustering:
  – Objects are linked together in various ways
  – Examples: SimRank, LinkClus

Partition-Based Clustering
Also referred to as objective-function clustering; it relies on the minimization of a certain objective function (performance index). The result of the minimization is a partition matrix and a set of prototypes (centers).

Partition-Based Clustering
• Partitioning method: partition data D of n objects into a set of k clusters such that the sum of squared distances

$$E=\sum_{i=1}^{k}\sum_{p\in C_i}(p-c_i)^2$$

is minimized, where $c_i$ is the centroid or medoid of cluster $C_i$.
• Given k, find a partition that optimizes the chosen partitioning criterion
  – Global (optimal): exhaustively enumerate all partitions – not feasible!
  – Heuristic: k-means and k-medoids algorithms
• k-means: each cluster is represented by the center of the cluster
• k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster

Model-Based Clustering
In MBC we assume a certain probabilistic model of the data and estimate its parameters, such as the mean, covariance matrix, etc.
Mixture density model: we assume that the data are the result of a mixture of c sources generating data, and each source is treated as a separate cluster.
Maximum Likelihood Estimation: a method for estimating the parameters of such a model.

Similarity Measures
The similarity measure is the most fundamental component of every clustering algorithm; it is used to quantify similarity (or dissimilarity) between data points. The data points with the highest similarity (e.g., with the shortest distance) are candidates for forming a cluster.

Distance Functions for Continuous Data
Euclidean distance (p = 2 in Minkowski):
$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$
Hamming (city-block) distance (p = 1 in Minkowski):
$$d(x,y)=\sum_{i=1}^{n}|x_i-y_i|$$
Tchebyschev distance (p = ∞ in Minkowski):
$$d(x,y)=\max_{i=1,2,\ldots,n}|x_i-y_i|$$

Distance Functions for Continuous Data
Minkowski distance (from Wikipedia):
$$d(x,y)=\left(\sum_{i=1}^{n}|x_i-y_i|^{p}\right)^{1/p},\quad p>0$$

Distance Functions for Continuous Data
Canberra distance ($x_i$ and $y_i$ positive):
$$d(x,y)=\sum_{i=1}^{n}\frac{|x_i-y_i|}{x_i+y_i}$$
Example: v1 = (1,1,1), v2 = (1,1,0), v3 = (10,5,0), v4 = (1,2,3), v5 = (2,4,6); d12 = 1, d13 = 2.485, d45 = 1
Angular separation:
$$d(x,y)=\frac{\sum_{i=1}^{n}x_i y_i}{\left[\sum_{i=1}^{n}x_i^2\;\sum_{i=1}^{n}y_i^2\right]^{1/2}}$$
Example: v1 = (7,6,3,−1), v2 = (0,3,4,5); d12 = 0.363
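The following sketch (numpy assumed; the helper names are ours, not from the slides) implements the continuous-data distances above and reproduces the example values just quoted:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance; p=1 city-block, p=2 Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def tchebyschev(x, y):
    return np.max(np.abs(x - y))            # limit of Minkowski as p -> inf

def canberra(x, y):
    return np.sum(np.abs(x - y) / (x + y))  # coordinates assumed positive

def angular_separation(x, y):
    return np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))

v1, v2 = np.array([1, 1, 1]), np.array([1, 1, 0])
v3, v4, v5 = np.array([10, 5, 0]), np.array([1, 2, 3]), np.array([2, 4, 6])
print(canberra(v1, v2))   # 1.0
print(canberra(v1, v3))   # ~2.485
print(canberra(v4, v5))   # 1.0
print(angular_separation(np.array([7, 6, 3, -1]),
                         np.array([0, 3, 4, 5])))  # ~0.363
```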
Distance Functions for Discrete Data
Binary data x = [x1 x2 … xn], y = [y1 y2 … yn]:
a – number of positions where both xi and yi are 1
d – number of positions where both xi and yi are 0
b, c – number of positions where xi and yi differ (0–1 and 1–0)

         y = 1   y = 0
x = 1      a       c
x = 0      b       d

Distance Functions for Discrete Data
Matching index: (a + d) / (a + b + c + d)
Russell & Rao: a / (a + b + c + d)
Jaccard index: a / (a + b + c)
Czekanowski: 2a / (2a + b + c)

Hierarchical Clustering
HC provides a graphical illustration of relationships between the data in the form of a dendrogram (binary tree). There are two approaches to HC:
• Bottom-up / agglomerative
• Top-down / divisive

Hierarchical Clustering
The agglomerative / bottom-up method starts with each object in the data forming its own cluster, and then successively merges clusters until one large cluster is formed, which encompasses the entire dataset.
The divisive / top-down method starts by considering the entire data as one cluster and then splits up the cluster(s) until each object forms its own cluster.

[Figure: dendrogram over objects a–h; read top-down it is divisive, read bottom-up it is agglomerative. Cutting it at the level shown yields the clusters {a}, {b,c,d,e}, {f,g,h}.]

Hierarchical Agglomerative Clustering
[Figure: the same dendrogram cut at different levels gives different numbers of clusters, e.g., 4, 3, or 2.]

Hierarchical Agglomerative Clustering
Given: a dataset and the distance function
1. start with N clusters by assigning each pattern to a separate cluster
2. proceed with this initial configuration of the clusters and merge the clusters that are the closest; in other words, if S and T are the two clusters recognized as the closest, form a single cluster {S, T} and reduce the number of clusters by one
3. repeat step 2 until a minimal number of clusters has been reached
Result: clusters of data (a partition)

Distance Between Clusters
Single linkage:
$$d(S,T)=\min_{x\in T,\,y\in S}\|x-y\|$$
Complete linkage:
$$d(S,T)=\max_{x\in T,\,y\in S}\|x-y\|$$
Average linkage:
$$d(S,T)=\frac{1}{\mathrm{card}(S)\,\mathrm{card}(T)}\sum_{x\in T}\sum_{y\in S}\|x-y\|$$

Single Linkage
Similarity between S and T is calculated based on the minimal distance between the elements belonging to the corresponding clusters.

Complete Linkage
We rely on the maximal distance between the patterns in the analyzed clusters.

Average Linkage
We combine two clusters based upon the averaged distance between the patterns in the clusters.

Hausdorff Distance Function
$$d(A,B)=\max\Big\{\max_{x\in A}\min_{y\in B}d(x,y),\;\max_{y\in B}\min_{x\in A}d(x,y)\Big\}$$
(from Wikipedia) Two sets are close if every point of either set is close to some point of the other set; the Hausdorff distance is the greatest of all the distances from a point in one set to the closest point in the other set.

Lance–Williams Updating Formula
$$d_{A\cup B,C}=\alpha_A d_{A,C}+\alpha_B d_{B,C}+\beta\, d_{A,B}+\gamma\,|d_{A,C}-d_{B,C}|$$

Clustering method   alpha_A (alpha_B)                beta                      gamma
Single link         1/2                              0                         -1/2
Complete link       1/2                              0                         +1/2
Centroid            n_A/(n_A+n_B)  (n_B/(n_A+n_B))   -n_A n_B/(n_A+n_B)^2      0
Median              1/2                              -1/4                      0
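As a sketch of how agglomerative merging and the Lance–Williams update fit together, here is a naive implementation (numpy assumed; function names are ours, and the 5-object dissimilarity matrix is the same one reused in the divisive example below). It is illustrative, not efficient:

```python
import numpy as np

# Lance-Williams coefficients (alpha_A, alpha_B, beta, gamma) per method
LW = {"single":   (0.5, 0.5, 0.0, -0.5),
      "complete": (0.5, 0.5, 0.0, +0.5)}

def agglomerative(D, method="single"):
    """Agglomerative clustering on a dissimilarity matrix D.
    Returns the merge history [(cluster_i, cluster_j, distance), ...],
    where the merged cluster keeps index i."""
    D = D.astype(float).copy()
    aA, aB, beta, gamma = LW[method]
    active = list(range(len(D)))
    merges = []
    while len(active) > 1:
        # find the closest pair among the live clusters
        i, j = min(((i, j) for i in active for j in active if i < j),
                   key=lambda ij: D[ij])
        d_ij = D[i, j]
        merges.append((i, j, d_ij))
        # Lance-Williams: distance from the merged {i,j} to every other cluster
        for k in active:
            if k not in (i, j):
                D[i, k] = D[k, i] = (aA * D[i, k] + aB * D[j, k]
                                     + beta * d_ij
                                     + gamma * abs(D[i, k] - D[j, k]))
        active.remove(j)
    return merges

D = np.array([[0, 2, 6, 10, 9], [2, 0, 5, 9, 8], [6, 5, 0, 4, 5],
              [10, 9, 4, 0, 3], [9, 8, 5, 3, 0]], dtype=float)
print(agglomerative(D, "single"))
# [(0, 1, 2.0), (3, 4, 3.0), (2, 3, 4.0), (0, 2, 5.0)]
```

For single link the update 1/2·d(i,k) + 1/2·d(j,k) − 1/2·|d(i,k) − d(j,k)| is exactly min(d(i,k), d(j,k)), and with +1/2 it is exactly max, which is why the table above recovers both linkages.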
Hierarchical Divisive Method
An HD algorithm starts by considering all divisions of the data into two nonempty subsets, which amounts to a huge number of possible divisions:
$$2^{N-1}-1$$
However, it is possible to construct divisive methods that do not consider all divisions, most of which may be incorrect anyway. One such algorithm is by MacNaughton-Smith (1964) – see below.

Hierarchical Divisive Method
At first A := C and B := ∅.
1. Move one object at a time from A to B. For each object i ∈ A we compute the average dissimilarity to all other objects of A:
$$a(i)=\frac{1}{|A|-1}\sum_{j\in A,\,j\neq i}d(i,j)$$
The object m of A for which a(m) is the largest is moved to B:
$$A:=A\setminus\{m\},\qquad B:=\{m\}$$

Hierarchical Divisive Method
2. Move other objects from A to B (called the "splinter group"). If |A| = 1, stop. Otherwise compute a(i) for all i ∈ A, and the average dissimilarity of i to all objects of B:
$$d(i,B)=\frac{1}{|B|}\sum_{k\in B}d(i,k)$$

Hierarchical Divisive Method
3. Select the object h ∈ A for which
$$a(h)-d(h,B)=\max_{i\in A}\big(a(i)-d(i,B)\big)$$
If a(h) − d(h,B) > 0, move h from A to B and go to 2.
If a(h) − d(h,B) ≤ 0, the process stops; the division of C into clusters A and B is complete.

Hierarchical Divisive Method
Dissimilarity matrix:

       a      b      c      d      e
a     0.0    2.0    6.0   10.0    9.0
b     2.0    0.0    5.0    9.0    8.0
c     6.0    5.0    0.0    4.0    5.0
d    10.0    9.0    4.0    0.0    3.0
e     9.0    8.0    5.0    3.0    0.0

Each object's average dissimilarity to the other objects:
a: (2.0 + 6.0 + 10.0 + 9.0)/4 = 6.75
b: (2.0 + 5.0 + 9.0 + 8.0)/4 = 6.00
c: (6.0 + 5.0 + 4.0 + 5.0)/4 = 5.00
d: (10.0 + 9.0 + 4.0 + 3.0)/4 = 6.50
e: (9.0 + 8.0 + 5.0 + 3.0)/4 = 5.25
In this example object a, having the largest average dissimilarity, initiates the splinter group. At this stage we have clusters A = {b,c,d,e} and B = {a}.

Hierarchical Divisive Method

Object   Avg dissimilarity to        Avg dissimilarity to      Difference
         remaining objects of A      splinter group B
b        (5.0+9.0+8.0)/3 = 7.33       2.00                       5.33
c        (5.0+4.0+5.0)/3 = 4.67       6.00                      -1.33
d        (9.0+4.0+3.0)/3 = 5.33      10.00                      -4.67
e        (8.0+5.0+3.0)/3 = 5.33       9.00                      -3.67

Object b has a positive difference and therefore changes sides, so the new splinter group is B = {a, b} and the remaining group becomes A = {c, d, e}.

Object   Avg dissimilarity to        Avg dissimilarity to      Difference
         remaining objects of A      splinter group B
c        (4.0+5.0)/2 = 4.50          (6.0+5.0)/2 = 5.50         -1.00
d        (4.0+3.0)/2 = 3.50          (10.0+9.0)/2 = 9.50        -6.00
e        (5.0+3.0)/2 = 4.00          (9.0+8.0)/2 = 8.50         -4.50

All differences are now negative, so the process stops with A = {c, d, e} and B = {a, b}.
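A minimal sketch of one MacNaughton-Smith split (numpy assumed; function names are ours). On the dissimilarity matrix above it reproduces the worked result A = {c,d,e}, B = {a,b}:

```python
import numpy as np

def macnaughton_smith_split(D, labels):
    """One divisive split: move objects from A to the splinter group B
    while doing so decreases their average dissimilarity."""
    A, B = list(range(len(D))), []
    # 1. seed B with the object of largest average dissimilarity
    avg = [D[i, [j for j in A if j != i]].mean() for i in A]
    m = A[int(np.argmax(avg))]
    A.remove(m); B.append(m)
    # 2./3. keep moving the object h maximising a(h) - d(h,B) while > 0
    while len(A) > 1:
        diffs = [(D[i, [j for j in A if j != i]].mean()
                  - D[i, B].mean(), i) for i in A]
        best, h = max(diffs)
        if best <= 0:
            break
        A.remove(h); B.append(h)
    return [labels[i] for i in A], [labels[i] for i in B]

D = np.array([[0, 2, 6, 10, 9], [2, 0, 5, 9, 8], [6, 5, 0, 4, 5],
              [10, 9, 4, 0, 3], [9, 8, 5, 3, 0]], dtype=float)
print(macnaughton_smith_split(D, "abcde"))  # (['c','d','e'], ['a','b'])
```

A full divisive hierarchy would apply this split recursively to the resulting groups.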
Partition / Objective Function Clustering
Develop and optimize a partition matrix so that a certain performance index is optimized (minimized):
objective function → minimization → structure

Partition / Objective Function Clustering
It depends on the minimization of a performance index Q:
$$Q=\sum_{i=1}^{c}\sum_{k=1}^{N}u_{ik}^{m}\|x_k-v_i\|^2=\sum_{i=1}^{c}\sum_{k=1}^{N}u_{ik}^{m}d_{ik}^{2}$$
c – number of clusters, U – partition matrix

Clustering: Representation
How do we represent clusters? With a partition matrix over N data points and c clusters. For example, with N = 8 and c = 3:

      data point:  1  2  3  4  5  6  7  8
U =   [ 1  0  0  1  1  0  0  1 ]
      [ 0  1  1  0  0  0  0  0 ]
      [ 0  0  0  0  0  1  1  0 ]

Partition Matrix
The matrix above encodes
cluster 1: {1, 4, 5, 8}
cluster 2: {2, 3}
cluster 3: {6, 7}
and belongs to the family
$$\mathcal{U}=\Big\{U\ \Big|\ 0<\sum_{k=1}^{N}u_{ik}<N,\ \ \sum_{i=1}^{c}u_{ik}=1\Big\}$$

Partition / Objective Function Clustering
Clustering is guided by the minimization of some objective function Q. The structure is represented by:
• the partition matrix U = [u_ik], i = 1,…,c; k = 1,…,N, with
$$0<\sum_{k=1}^{N}u_{ik}<N\ \ \text{for}\ i=1,\ldots,c,\qquad \sum_{i=1}^{c}u_{ik}=1\ \ \text{for}\ k=1,\ldots,N$$
• the prototypes v_i, i = 1,…,c

Partition / Objective Function Clustering
Given: the (guessed!) number of clusters c and the chosen similarity function (plus the value of the power factor m, for fuzzy clustering only)
Compute the prototypes (v) and update the partition matrix (U) based on the conditions of the minimized objective function
Result: partition matrix and prototypes

K-Means
Proposed about 60 years ago, independently, by a few researchers:
– Steinhaus, H., 1956. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. IV (C1.III), 801–804
– Lloyd, S., 1982. Least squares quantization in PCM. IEEE Trans. Inform. Theory 28, 129–137. Originally an unpublished Bell Laboratories Technical Note (1957)
– Ball, G., Hall, D., 1965. ISODATA, a novel method of data analysis and pattern classification. Technical report NTIS AD 699616, Stanford Research Institute, Stanford, CA
– MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley Symposium on Mathematics, Statistics and Probability. University of California Press, pp. 281–297

K-Means
It is the simplest and most frequently used clustering algorithm, because of its ease of implementation, its efficiency, and its many empirical successes.

K-Means: Inputs and Outputs
Input:
  c – number of clusters
  d – distance function
  m – power (fuzziness) factor (not used in hard K-means)
  t – termination criterion (amount of movement between iterations)
  v – initial cluster centers (randomly chosen each run)
Output:
  U – partition matrix
  V – cluster centers

K-Means
Given: the number of clusters k and a dataset X with n points
1. select the initial k means as the first k points of the data (or user-provided, or chosen randomly)
2. calculate the distances between all the points and the k means
3. allocate each point to the cluster whose mean is nearest to it
4. recalculate the means of the k clusters
5. repeat steps 2–4 until a termination criterion is satisfied
Result: a vector of the k means (prototypes) of the clusters; a partition matrix P of size k × n

K-Means: Termination Criteria
Stop when the summed difference between the old and new partition matrices is small enough:
• Hard: U_new − U_old == 0
• Fuzzy: ||U_new − U_old|| < a user-chosen threshold (e.g., 0.0001)

K-Means: Objective Function
$$Q=\sum_{i=1}^{c}\sum_{k=1}^{N}u_{ik}\|x_k-v_i\|^2$$
Minimize Q, the sum of squared errors of N samples over c clusters, subject to:
$$u_{ik}\in\{0,1\}\ \ \text{(sample k belongs to cluster i or not)}$$
$$0<\sum_{k=1}^{N}u_{ik}<N\ \ \text{for}\ i=1,\ldots,c,\qquad \sum_{i=1}^{c}u_{ik}=1\ \ \text{for}\ k=1,\ldots,N$$

K-Means: Example
Anil Jain, Data Clustering: 50 Years Beyond K-Means (http://biometrics.cse.msu.edu/Publications/Clustering/JainClustering_PRL10.pdf)

K-Means: Strengths and Weaknesses
Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
• By comparison: PAM: O(k(n−k)^2), CLARA: O(ks^2 + k(n−k))
Weaknesses:
– Finds only a local optimum; the global optimum may be found using deterministic annealing or genetic algorithms
– Applicable only to numeric data (so that a mean can be calculated)
– The user needs to specify k, the number of clusters
– Unable to handle noisy data and/or outliers
– Cannot discover clusters with non-convex shapes
(From Data Mining: Concepts and Techniques)

K-Means: Selecting K
• The K-means problem is NP-hard, and as a greedy algorithm K-means converges to a local minimum
• To mitigate the local-minima problem, run the algorithm with different starting points for a given K and pick the result with the smallest squared error
• Different values of K lead to different clustering results:
  – Run the K-means clustering algorithm with different values of K and pick the result that is the most meaningful to the domain expert (!)
  – Perform silhouette analysis

Silhouette Plot Analysis
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0006937
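A compact sketch of hard K-means following the steps above (numpy assumed; random initial means, termination by the amount of center movement t, and the toy data are made up for illustration):

```python
import numpy as np

def kmeans(X, k, t=1e-9, max_iter=100, seed=0):
    """Hard k-means sketch: returns (centers V, partition matrix U)."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=k, replace=False)]  # random initial means
    for _ in range(max_iter):
        # step 2/3: assign every point to its nearest center
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        U = np.zeros((k, len(X)))
        U[np.argmin(d, axis=1), np.arange(len(X))] = 1
        # step 4: recompute the means; keep the old center if a cluster emptied
        V_new = np.array([X[U[i] == 1].mean(axis=0) if U[i].any() else V[i]
                          for i in range(k)])
        if np.abs(V_new - V).sum() < t:               # termination criterion t
            break
        V = V_new
    return V, U

X = np.array([[1.0, 1], [1.5, 2], [3, 4], [5, 7],
              [3.5, 5], [4.5, 5], [3.5, 4.5]])
V, U = kmeans(X, k=2)
print(V)   # the two cluster means
print(U)   # 2 x 7 hard partition matrix; each column sums to 1
```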
Fuzzy C-Means
How do we deal with (quantify) data that are in between clusters? Consider partial membership in clusters – the emergence of fuzzy sets (elements with partial membership).

Fuzzy C-Means
Allows for partial membership in clusters → a fuzzy partition matrix U. Objective function:
$$Q=\sum_{i=1}^{c}\sum_{k=1}^{N}u_{ik}^{m}\|x_k-v_i\|^2$$
U = [u_ik]; u_ik – degree of membership of the k-th data point in the i-th cluster
m – fuzzification coefficient, m > 1
||·|| – distance function

Fuzzy C-Means: Optimization
$$\min_{\text{prototypes},\;U\in\mathcal{U}}\ Q=\sum_{i=1}^{c}\sum_{k=1}^{N}u_{ik}^{m}\|x_k-v_i\|^2$$
Minimize Q with respect to the prototypes:
$$\frac{\partial Q}{\partial v_{ij}}=0,\quad i=1,\ldots,c;\ j=1,\ldots,n$$
Minimize Q with respect to the partition matrix:
$$\frac{\partial Q}{\partial u_{ik}}=0,\quad i=1,\ldots,c;\ k=1,\ldots,N$$
Constraint: U is a partition matrix!

Fuzzy C-Means: Optimization
$$\min_{\text{prototypes}}\ \sum_{i=1}^{c}\sum_{k=1}^{N}u_{ik}^{m}\,(x_k-v_i)^{T}(x_k-v_i)$$
Setting the gradient with respect to v_i to zero:
$$\sum_{k=1}^{N}u_{ik}^{m}(x_k-v_i)=0\quad\Rightarrow\quad v_i=\frac{\sum_{k=1}^{N}u_{ik}^{m}\,x_k}{\sum_{k=1}^{N}u_{ik}^{m}}$$

Fuzzy C-Means
Initialize: select the number of clusters (c), the stopping value (ε), and the fuzzification coefficient (m). The distance function is Euclidean or weighted Euclidean. The initial partition matrix consists of random entries.
Repeat
  update the prototypes:
$$v_i=\frac{\sum_{k=1}^{N}u_{ik}^{m}\,x_k}{\sum_{k=1}^{N}u_{ik}^{m}}$$
  update the partition matrix:
$$u_{ik}=\frac{1}{\displaystyle\sum_{j=1}^{c}\left(\frac{\|x_k-v_i\|}{\|x_k-v_j\|}\right)^{2/(m-1)}}$$
until a stopping criterion has been satisfied.

Fuzzy C-Means: Design Aspects
Stopping criterion – terminate the iterations when
$$\max_{i,k}\,|u_{ik}(\text{iter}+1)-u_{ik}(\text{iter})|<\varepsilon$$
Fuzzification coefficient (m > 1) – controls the shape of the membership functions:
m = 2.0 – typical value
m close to 1 – set-like (nearly crisp) membership functions
m higher than 2.0 – spike-like membership functions
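A sketch of FCM implementing the two update formulas above (numpy assumed; names are ours and the toy data are made up):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Fuzzy C-Means sketch following the update formulas above."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                      # random initial partition matrix
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)          # prototypes
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2).T  # c x N
        d = np.fmax(d, 1e-12)               # avoid division by zero
        # u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        U_new = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2 / (m - 1))).sum(axis=1)
        if np.abs(U_new - U).max() < eps:   # max_ik |u'_ik - u_ik| < eps
            return V, U_new
        U = U_new
    return V, U

X = np.array([[1.0, 1], [1.2, 0.8], [0.9, 1.1],
              [5, 5], [5.2, 4.8], [4.9, 5.3]])
V, U = fcm(X, c=2)
print(np.round(V, 2))   # prototypes near (1,1) and (5,5)
print(np.round(U, 2))   # memberships close to 0/1 for well-separated data
```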
Kernel-Based Clustering
Idea: the original data points in the n-dimensional space are transformed, through some mapping φ, into elements of an m-dimensional space, where m > n. Objective function in the new space:
$$Q=\sum_{i=1}^{c}\sum_{k=1}^{N}u_{ik}^{m}\,\|\varphi(x_k)-\varphi(v_i)\|^2$$

Kernel-Based Clustering
Given the dimensionality of the new space, m > n, we calculate in the new space a kernel function K(x,v) as a dot product:
$$K(x,v)=\varphi^{T}(x)\,\varphi(v)$$
Gaussian kernel:
$$K(x,v)=\exp\!\big(-\|x-v\|^2/\sigma^2\big)$$
Since K(x,x) = 1 for the Gaussian kernel,
$$\|\varphi(x_k)-\varphi(v_i)\|^2=2\big(1-K(x_k,v_i)\big)$$

Kernel K-Means
K-Means cannot separate clusters that are not linearly separable. To solve this problem the kernel K-Means algorithm was designed: before clustering, all points are mapped into a higher-dimensional space using some nonlinear function; then the algorithm partitions (clusters) the points in the new space. The major difference is that distances in kernel K-Means are calculated by the kernel method and not, for instance, by the simple Euclidean distance.

Kernel K-Means
$$Q=\sum_{i=1}^{c}\sum_{k=1}^{N}u_{ik}^{m}\,\|\varphi(x_k)-\varphi(v_i)\|^2,\qquad Q=\sum_{j=1}^{c}\sum_{k=1}^{N}w(x_k)\,\|\varphi(x_k)-m_j\|^2$$
To calculate the distances between the points in the new space and the cluster centers m_j we use a kernel function, specified in the kernel matrix K.

Kernel K-Means
Input: K – kernel function, k – number of clusters
1. Initialize the k clusters: C1(0), …, Ck(0)
2. Set t = 0
3. For each point x, find its new cluster: J*(x) = argmin_j ||φ(x) − m_j||^2
4. Compute the updated clusters: Cj(t+1) = {x : J*(x) = j}
5. If not converged, set t = t + 1 and go to step 3; otherwise, stop
Result: a partition into clusters C1, …, Ck
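A sketch of kernel K-Means computed entirely from the kernel matrix (numpy assumed; names are ours, and the two-rings demo is made up). The key point is that the distance to an implicit center m_j expands into kernel evaluations only: ||φ(x) − m_j||² = K(x,x) − 2·mean K(x, C_j) + mean K(C_j, C_j). With a suitable σ this typically separates the two rings, which plain K-means cannot:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma ** 2)         # K(x,v) = exp(-||x-v||^2 / s^2)

def kernel_kmeans(K, k, max_iter=100, seed=0):
    """Kernel K-Means sketch; distances come from the kernel matrix only."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(K))  # random initial clusters
    for _ in range(max_iter):
        d = np.empty((len(K), k))
        for j in range(k):
            idx = np.where(labels == j)[0]
            if len(idx) == 0:                 # re-seed an emptied cluster
                idx = rng.integers(0, len(K), size=1)
            d[:, j] = (np.diag(K) - 2 * K[:, idx].mean(axis=1)
                       + K[np.ix_(idx, idx)].mean())
        new_labels = d.argmin(axis=1)         # step 3: J*(x)
        if (new_labels == labels).all():      # step 5: converged
            break
        labels = new_labels
    return labels

# two concentric rings: a classic non-linearly-separable example
t = np.linspace(0, 2 * np.pi, 60, endpoint=False)
X = np.vstack([np.c_[np.cos(t), np.sin(t)],
               np.c_[4 * np.cos(t), 4 * np.sin(t)]])
print(kernel_kmeans(gaussian_kernel_matrix(X, sigma=1.0), k=2))
```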
K-Medoids Clustering
To enhance the robustness of clustering we use medoids instead of mean prototypes. In the one-dimensional case the medoid is the median.
Consider an ordered collection of data points x1 ≤ x2 ≤ … ≤ xN. The median is the central point of the sequence (if N is odd) or the average of the two points in the middle (if N is even).
[Figure: with an outlier present, the median stays put while the mean is dragged toward the outlier.]

Median as a Robust Estimator
The median is the solution to the minimization problem
$$\min_{m}\ \sum_{k=1}^{N}|x_k-m|$$
We increase robustness by using the objective function
$$Q=\sum_{i=1}^{c}\sum_{k=1}^{N}\sum_{j=1}^{n}u_{ik}\,|x_{kj}-v_{ij}|$$
Advantage: one of the original points becomes the cluster center.

K-Means vs. K-Medoids
K-Means is sensitive to outliers, because a point with an extremely large value can substantially distort the distribution of the data.
K-Medoids: instead of the means we use the medoids – the most centrally located points in the clusters.
[Figure: the same dataset clustered around a mean vs. around a medoid.]

K-Medoids
K-Medoids clustering: find representative objects (medoids) in clusters.
– PAM (Partitioning Around Medoids):
• Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering
• PAM works effectively for small datasets but does not scale up well due to its computational complexity: O(k(n−k)^2) per iteration, where n is the number of data points and k the number of clusters
Efficiency improvements on PAM:
– CLARA (Clustering LARge Applications): PAM on sub-samples of the data
– CLARANS: randomized re-sampling

PAM
Medoids – a family of the most centrally positioned data points. PAM represents the structure in the data by a collection of medoids; each data point is grouped around the medoid to which its distance is the shortest. PAM starts with an arbitrary collection of elements treated as medoids. At each step of the optimization we swap a certain data point with one of the medoids, provided the swap improves the quality of the clustering.
Limitation – the size of the dataset: PAM works well for small datasets with a small number of clusters (around 100 data points and 5 clusters).

PAM
1. Select k representative objects arbitrarily
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
3. For each pair of i and h:
• if TC_ih < 0, i is replaced by h
• then assign each non-selected object to the most similar representative object
Repeat steps 2–3 until there is no change

PAM: K-Medoids Algorithm (K = 2)
[Figure: arbitrarily choose k objects as the initial medoids → assign each remaining object to the nearest medoid → randomly select a non-medoid object O_random → compute the total cost of swapping → swap O and O_random if the quality is improved → repeat until no change; in the illustration the total cost drops from 26 to 20.]

PAM: Finding the Best Cluster Center – Four Cases
Case 1: p currently belongs to the cluster represented by O_j, and d(p,O_i) < d(p,O_random). If O_j is replaced by O_random, p will belong to the cluster represented by O_i. Swap cost: C = d(p,O_i) − d(p,O_j)
Case 2: p currently belongs to the cluster represented by O_j, and this time d(p,O_i) > d(p,O_random). If O_j is replaced by O_random, p will belong to the cluster represented by O_random. Swap cost: C = d(p,O_random) − d(p,O_j)
Case 3: p currently belongs to the cluster represented by O_i (not O_j), and d(p,O_i) < d(p,O_random). If O_j is replaced by O_random, p will still belong to the cluster represented by O_i. Swap cost: C = 0
Case 4: p currently belongs to the cluster represented by O_i, but d(p,O_i) > d(p,O_random). If we replace O_j with O_random, p will belong to O_random. Swap cost: C = d(p,O_random) − d(p,O_i)

CLARA
CLARA (Kaufman and Rousseeuw, 1990) draws multiple samples of the dataset, applies PAM on each sample, and returns the best clustering as the output.
Strength: deals with larger datasets than PAM.
Weaknesses:
– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily represent a good clustering of the whole dataset if the samples are biased

CLARA
Five samples of size 40 + 2K (a small number):
1. For i = 1 to 5, repeat the following steps:
2. Draw a sample of 40 + 2K objects randomly from the entire dataset and call PAM to find the K medoids of the sample
3. For each object O_j in the entire dataset, determine which of the K medoids is the most similar to O_j
4. Calculate the average dissimilarity of the clustering obtained in the previous step. If this value is less than the current minimum, use it as the current minimum and retain the K medoids found in step 2 as the best set of medoids obtained so far
5. Return to step 1 to start the next iteration
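A sketch of the PAM swap loop (numpy assumed; names and toy data, including the remote point, are ours). For clarity it recomputes the full cost for every candidate swap rather than the four-case incremental cost above, so it is even slower than PAM proper, but the accepted swaps are the same:

```python
import numpy as np

def total_cost(D, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    """PAM sketch: swap a medoid i with a non-medoid h
    whenever the swap lowers the total cost (TC_ih < 0)."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in range(len(D)):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids, improved = candidate, True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.array([[1.0, 1], [1.5, 2], [3, 4], [5, 7],
              [3.5, 5], [4.5, 5], [100, 100]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
# the medoids remain actual data points; the remote point ends up
# isolated instead of dragging a mean across the plane
print(pam(D, k=2))
```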
Model-Based Clustering
A mixture of sources is the underlying model of the data. Each component is described by a conditional probability density function with parameters θ_i:
$$p(x\mid\theta_1,\theta_2,\ldots,\theta_c)=\sum_{i=1}^{c}p_i\,p(x\mid\theta_i)$$
Clustering amounts to estimating the parameters of this mixture.

Mixture of Data Model
Maximum likelihood estimation: given data x1, x2, …, xN, choose the parameters θ that maximize
$$P(X\mid\theta)=\prod_{k=1}^{N}p(x_k\mid\theta)$$

Density-Based Clustering
Clustering based on density (a local cluster criterion), such as density-connected points. Characteristics:
– Discovers clusters of arbitrary shape
– Handles noise
– One scan
– Needs density parameters to be specified
Example algorithms:
– DBSCAN: Ester et al. (KDD'96)
– OPTICS: Ankerst et al. (SIGMOD'99)
– DENCLUE: Hinneburg & Keim (KDD'98)
– CLIQUE: Agrawal et al. (SIGMOD'98)

Density-Based Clustering
Two parameters are used:
– Eps: maximum radius of the neighborhood
– MinPts: minimum number of points in the Eps-neighborhood of a point
• N_Eps(p) = {q ∈ D | dist(p,q) ≤ Eps}
• Directly density-reachable: a point p is directly density-reachable from a point q (w.r.t. Eps, MinPts) if
– p belongs to N_Eps(q), and
– q satisfies the core point condition |N_Eps(q)| ≥ MinPts
(e.g., MinPts = 5, Eps = 1 cm)

Density-Reachable and Density-Connected
• Density-reachable: a point p is density-reachable from a point q (w.r.t. Eps, MinPts) if there is a chain of points p1, …, pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i
• Density-connected: a point p is density-connected to a point q (w.r.t. Eps, MinPts) if there is a point o such that both p and q are density-reachable from o

DBSCAN: Pseudocode
• Arbitrarily select a point p
• Retrieve all points density-reachable from p (w.r.t. Eps and MinPts)
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the data
• Continue the process until all points have been processed
Complexity: O(n log n)

DBSCAN: Density-Based Spatial Clustering with Noise
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5.]

DBSCAN: Example
http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
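A sketch of DBSCAN following the pseudocode above (numpy assumed; names and toy data are ours). It uses a full distance matrix, so it runs in O(n²) rather than the O(n log n) achievable with a spatial index:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """DBSCAN sketch. Returns labels: -1 = noise, 0..c-1 = cluster ids."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(D[p] <= eps)[0] for p in range(n)]  # N_eps(p)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:   # border point or noise: move on
            continue
        # p is a core point: grow a cluster from everything density-reachable
        labels[p] = cluster
        seeds = list(neighbors[p])
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster       # reachable -> same cluster
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:  # q is core: expand further
                    seeds.extend(neighbors[q])
        cluster += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),   # dense blob
               rng.normal(5, 0.3, (30, 2)),   # second blob
               rng.uniform(-2, 7, (5, 2))])   # sparse points, mostly noise
print(dbscan(X, eps=0.8, min_pts=5))
```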
OPTICS
OPTICS: Ordering Points To Identify the Clustering Structure (Ankerst et al., SIGMOD'99)
– Produces an ordering of the points with respect to the density-based clustering structure
– This cluster ordering contains information equivalent to the density-based clusterings
Complexity: O(n log n)

OPTICS: Extension of DBSCAN
Core distance (CD) of an object p is the smallest ε′ that makes p a core object; if p is not a core object, the CD is undefined.
Reachability distance (RD) of an object o with respect to p is the maximum of the CD of p and the Euclidean distance d(o,p):
RD(o,p) = max(CD(p), d(o,p))
Example (MinPts = 5, ε = 3 cm): RD(p1,p) = 2.8 cm, RD(p2,p) = 4 cm

OPTICS Reachability Chart
Points are ordered according to their density reachability: points with smaller reachability distances are processed first. For each point its RD is calculated, and a graph is formed that visualizes the clustering structure of the data.

Grid-Based Clustering
Describes structure in the data in the language of generic geometric constructs – hyperboxes and their combinations: a collection of clusters of different geometry. Clusters are formed by merging adjacent hyperboxes of the grid.

Grid-Based Clustering
The hyperboxes {B1, B2, …, Bp} must satisfy three requirements:
a) each Bi is nonempty, in the sense that it includes some data points;
b) the hyperboxes are disjoint, that is, Bi ∩ Bj = ∅ for i ≠ j;
c) the union of all hyperboxes covers all the data, that is,
$$\bigcup_{i=1}^{p}B_i\supseteq X$$
where X = {x1, x2, …, xN}. It is also required that such hyperboxes "cover" some maximal number (say b_max) of data points.

Grid-Based Clustering: Steps
1. Formation of the grid structure
2. Insertion of the data into the grid structure
3. Computation of the density index of each hyperbox of the grid structure
4. Sorting of the hyperboxes with respect to the values of their density index
5. Identification of cluster centers (viz. the hyperboxes of the highest density)
6. Traversal of neighboring hyperboxes and merging
Choice of the grid: too rough a grid may not capture the details of the structure in the data; too detailed a grid produces a significant computational overhead.

Grid-Based Clustering
• Clustering high-dimensional data: clusters may exist only in some subspaces
• Methods – subspace clustering: find clusters in all possible subspaces
• Example: CLIQUE

CLIQUE
CLIQUE can be considered both density-based and grid-based:
– It partitions each dimension into the same number of equal-length intervals / rectangular units
– A unit is dense if the fraction of the total data points contained in the unit exceeds a user-specified input parameter
– A cluster is formed as a maximal set of connected dense units within a subspace
• Uses the monotonicity property: if a collection of points S forms a cluster in a k-dimensional space, then S is also part of a cluster in any (k−1)-dimensional projection of this space
• Pruning is used to eliminate outliers that are not dense enough; the threshold is called the "optimal cut point"
(From Data Mining: Concepts and Techniques)
[Figure: CLIQUE example – dense units in the (age, salary (10,000)) and (age, vacation (week)) subspaces are intersected to find a cluster in the (age, salary, vacation) space.]

References
– Jiawei Han (ed.), Data Mining: Concepts and Techniques
– http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletFCM.html
– Yaling Pei, http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
– http://www.cs.uiuc.edu/~yyz/teaching/InfoVis-s10/gps-clustering.pdf
– Yizhou Yu, www.cs.ndsu.nodak.edu/~adenton/datamining/DENCLUE.PPT
– Li Cheng, webdocs.cs.ualberta.ca/~zaiane/courses/cmput695-00/.../LiClique.pdf