Data Mining: Concepts and Techniques — Chapter 6 — Clustering (May 22, 2017)

What is Cluster Analysis?
• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Cluster analysis
  – Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes

Clustering: Applications
• Pattern recognition
• Image processing
• Economic science (especially market research)
• WWW
  – Document classification
  – Clustering weblog data to discover groups of similar access patterns

Examples of Clustering Applications
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location

Quality: What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
  – high intra-class similarity
  – low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering
• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
• There is a separate “quality” function that measures the “goodness” of a cluster
• The definitions of distance functions are usually very different for different kinds of variables
• It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective
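As a small illustration of these ideas, the sketch below (plain Python with made-up 2-D points) uses Euclidean distance as the metric d(i, j) and compares average intra-cluster distance with average inter-cluster distance; a good grouping keeps the former small and the latter large.

```python
import math

def euclidean(x, y):
    """Distance function d(i, j) between two numeric data objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Toy data: two visually separated groups of 2-D points (made up).
cluster_a = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
cluster_b = [(8.0, 8.0), (8.3, 7.9), (7.7, 8.2)]

def avg_pairwise_distance(group_1, group_2):
    """Average distance over all distinct pairs drawn from the two groups."""
    pairs = [(p, q) for p in group_1 for q in group_2 if p != q]
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)

# High intra-class similarity = small within-cluster distances;
# low inter-class similarity = large between-cluster distances.
print("intra-cluster:", round(avg_pairwise_distance(cluster_a, cluster_a), 2))
print("inter-cluster:", round(avg_pairwise_distance(cluster_a, cluster_b), 2))
```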
Requirements of Clustering in Machine Learning
• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• High dimensionality
• Interpretability and usability

Similarity and Dissimilarity Between Objects
• There are several methods to measure similarity (distance) between objects (see the KNN slides)

Dissimilarity between Binary Variables
• A contingency table for binary data:

                 Object j
                   1      0     sum
   Object i   1    a      b     a+b
              0    c      d     c+d
            sum   a+c    b+d     p

• Distance measure for symmetric binary variables:
  d(i, j) = (b + c) / (a + b + c + d)
• Distance measure for asymmetric binary variables:
  d(i, j) = (b + c) / (a + b + c)

Dissimilarity between Binary Variables: Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack    M       Y      N      P       N       N       N
  Mary    F       Y      N      P       N       P       N
  Jim     M       Y      P      N       N       N       N

– Gender is a symmetric attribute; the remaining attributes are asymmetric binary
– Using the asymmetric measure (Y and P count as 1, N as 0):
  d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Nominal Variables
• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: simple matching
  – m: # of matches, p: total # of variables
  – d(i, j) = (p − m) / p
• Method 2: use a large number of binary variables
  – create a new binary variable for each of the M nominal states

Major Clustering Approaches (I)
• Partitioning approach:
  – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  – Typical methods: graph partitioning, k-means, k-medoids
• Hierarchical approach:
  – Create a hierarchical decomposition of the set of data (or objects) using some criterion
  – Typical methods: DIANA, AGNES, BIRCH, ROCK
• Density-based approach:
  – Based on connectivity and density functions
  – Typical methods: DBSCAN, OPTICS

Major Clustering Approaches (II)
• Grid-based approach:
  – Based on a multiple-level granularity structure
  – Typical methods: STING, WaveCluster, CLIQUE
• Model-based:
  – A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
  – Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
  – Based on the analysis of frequent patterns
  – Typical methods: pCluster

An Example: Graph Partitioning (1)
• Sessions:
  S1: a, c, b, d, c
  S2: b, d, e, a
  S3: e, c
  [...]
  Sn: b, e, d
• Build a co-occurrence matrix from the sessions, then create an undirected graph based on the matrix M

An Example: Graph Partitioning (2)
• From the co-occurrence matrix and its undirected graph we obtain the clusters:
  C1: a, b, e
  C2: c, f, g, i
  C3: d
• Small clusters are filtered out (MinClusterSize)

An Example: Graph Partitioning (3)
[Figure]

Distance (Similarity) Matrix for Graph-based Clustering
• Similarity (distance) matrix
  – Based on the distance or similarity measure we can construct a symmetric matrix of distance (or similarity) values
  – The (i, j) entry in the matrix is the distance (similarity) between items i and j: dij = similarity (or distance) of Di to Dj
  – Note that dij = dji, i.e., the matrix is symmetric, so we only need the lower-triangle part of the matrix
  – The diagonal is all 1’s (similarity) or all 0’s (distance)
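A minimal sketch (Python; the encoding and helper names are mine) that ties the two ideas together: it computes the asymmetric binary dissimilarity from the Jack/Mary/Jim example and assembles the values into a symmetric dissimilarity matrix with zeros on the diagonal.

```python
from fractions import Fraction

# Asymmetric binary attributes Fever, Cough, Test-1..Test-4,
# encoded as 1 for Y/P and 0 for N (Gender, being symmetric, is left out).
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

def asym_binary_dissim(x, y):
    """d(i, j) = (b + c) / (a + b + c); negative matches (d) are ignored."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return Fraction(b + c, a + b + c)

# Symmetric dissimilarity matrix: dij == dji and the diagonal is all 0's,
# so in practice only the lower triangle needs to be stored.
names = list(patients)
for i in names:
    row = [float(asym_binary_dissim(patients[i], patients[j])) for j in names]
    print(i, [round(v, 2) for v in row])
# d(Jack, Mary) = 0.33, d(Jack, Jim) = 0.67, d(Jim, Mary) = 0.75, as on the slide
```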
Example: Term Similarities in Documents
• Suppose we want to cluster terms that appear in a collection of documents with different frequencies
• Each term can be viewed as a vector of term frequencies (weights):

        Doc1  Doc2  Doc3  Doc4  Doc5
  T1      0     3     3     0     2
  T2      4     1     0     1     2
  T3      0     4     0     0     2
  T4      0     3     0     3     3
  T5      0     1     3     0     1
  T6      2     2     0     0     4
  T7      1     0     3     2     0
  T8      3     1     0     0     2

• We need to compute a term-term similarity matrix
  – For simplicity we use the dot product as the similarity measure (note that this is the non-normalized version of cosine similarity):
    sim(Ti, Tj) = Σ_{k=1..N} wik · wjk
    where N = total number of dimensions (in this case documents) and wik = weight of term i in document k
  – Example: sim(T1, T2) = <0,3,3,0,2> · <4,1,0,1,2> = 0×4 + 3×1 + 3×0 + 0×1 + 2×2 = 7

Example: Term Similarities in Documents (continued)
• Applying this measure to every pair of terms gives the term-term similarity matrix (lower triangle; the matrix is symmetric):

        T1   T2   T3   T4   T5   T6   T7
  T2     7
  T3    16    8
  T4    15   12   18
  T5    14    3    6    6
  T6    14   18   16   18    6
  T7     9    6    0    6    9    2
  T8     7   17    8    9    3   16    3

Similarity (Distance) Thresholds
– A similarity (distance) threshold may be used to mark pairs that are “sufficiently” similar
– Using a threshold value of 10 on the matrix above (1 = similarity of at least 10):

        T1   T2   T3   T4   T5   T6   T7
  T2     0
  T3     1    0
  T4     1    1    1
  T5     1    0    0    0
  T6     1    1    1    1    0
  T7     0    0    0    0    0    0
  T8     0    1    0    0    0    1    0

Graph Representation
• The similarity matrix can be visualized as an undirected graph
  – each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix above)
• If no threshold is used, then the matrix can be represented as a weighted graph
[Figure: the threshold graph on the nodes T1–T8]

Simple Clustering Algorithms
• If we are interested only in the threshold (and not in the degree of similarity or distance), we can use the graph directly for clustering
• Clique Method (complete link)
  – all items within a cluster must be within the similarity threshold of all other items in that cluster
  – clusters may overlap
  – generally produces small but very tight clusters
• Single Link Method
  – any item in a cluster must be within the similarity threshold of at least one other item in that cluster
  – produces larger but weaker clusters

Simple Clustering Algorithms: Clique Method
• a clique is a completely connected subgraph of a graph
• in the clique method, each maximal clique in the graph becomes a cluster
• Maximal cliques (and therefore the clusters) in the previous example are:
  {T1, T3, T4, T6}
  {T2, T4, T6}
  {T2, T6, T8}
  {T1, T5}
  {T7}
• Note that, for example, {T1, T3, T4} is also a clique, but it is not maximal

Simple Clustering Algorithms: Single Link Method
– select an item not yet in a cluster and place it in a new cluster
– place all other items that are within the similarity threshold of it in that cluster
– repeat step 2 for each item added to the cluster until nothing more can be added
– repeat steps 1-3 for each item that remains unclustered
• In this case the single link method produces only two clusters:
  {T1, T3, T4, T5, T6, T2, T8}
  {T7}
• Note that the single link method does not allow overlapping clusters, thus partitioning the set of items
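A minimal sketch (Python; the helper names are mine) of the single link method on the example above: it rebuilds the term-term similarities with the dot product, applies the threshold of 10 to get the graph, and takes the connected components of that graph as the single-link clusters.

```python
# Term-document weights from the slides (rows T1..T8, columns Doc1..Doc5).
terms = {
    "T1": [0, 3, 3, 0, 2], "T2": [4, 1, 0, 1, 2],
    "T3": [0, 4, 0, 0, 2], "T4": [0, 3, 0, 3, 3],
    "T5": [0, 1, 3, 0, 1], "T6": [2, 2, 0, 0, 4],
    "T7": [1, 0, 3, 2, 0], "T8": [3, 1, 0, 0, 2],
}

def dot(u, v):
    """Non-normalized cosine: sim(Ti, Tj) = sum over documents of wik * wjk."""
    return sum(a * b for a, b in zip(u, v))

names = list(terms)
THRESHOLD = 10

# Adjacency sets of the threshold graph: an edge means sim >= THRESHOLD.
edges = {t: set() for t in names}
for i, ti in enumerate(names):
    for tj in names[i + 1:]:
        if dot(terms[ti], terms[tj]) >= THRESHOLD:
            edges[ti].add(tj)
            edges[tj].add(ti)

# Single-link clusters are the connected components of that graph.
unclustered = set(names)
clusters = []
while unclustered:
    seed = unclustered.pop()
    cluster, frontier = {seed}, [seed]
    while frontier:
        node = frontier.pop()
        for nbr in edges[node] & unclustered:
            unclustered.discard(nbr)
            cluster.add(nbr)
            frontier.append(nbr)
    clusters.append(sorted(cluster))

print(clusters)
# Expected: one cluster {T1, T2, T3, T4, T5, T6, T8} and the singleton {T7}
```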
K-Means (A Partitioning Clustering Method)
• In machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean
• Centroid: the “middle” of a cluster: Cm = (Σ_{i=1..N} tim) / N, the mean of the N points tim assigned to cluster m
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters that minimizes the sum of squared distances: E = Σ_{m=1..k} Σ_{tmi ∈ Km} (Cm − tmi)²
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  – Heuristic methods: the k-means and k-medoids algorithms
  – k-means: each cluster is represented by the center of the cluster
  – k-medoids: each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
  – Partition objects into k nonempty subsets
  – Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  – Assign each object to the cluster with the nearest seed point
  – Go back to step 2; stop when there are no more new assignments

The K-Means Clustering Method
[Figure: with K = 2, arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign, repeating until the assignments no longer change]

The K-Means Clustering Method: An Example (Document Clustering)
• Initial set with k = 3: C1 = {D1, D2}, C2 = {D3, D4}, C3 = {D5, D6}
• Document-term matrix and the resulting cluster means:

       D1  D2  D3  D4  D5  D6  D7  D8 |  C1   C2   C3
  T1    0   4   0   0   0   2   1   3 | 4/2  0/2  2/2
  T2    3   1   4   3   1   2   0   1 | 4/2  7/2  3/2
  T3    3   0   0   0   3   0   3   0 | 3/2  0/2  3/2
  T4    0   1   0   3   0   0   2   0 | 1/2  3/2  0/2
  T5    2   2   2   3   1   4   0   2 | 4/2  5/2  5/2

Example: K-Means
• Now compute the similarity (or distance) of each document to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure):

        C1    C2    C3
  D1  29/2  31/2  28/2
  D2  29/2  20/2  21/2
  D3  24/2  38/2  22/2
  D4  27/2  45/2  24/2
  D5  17/2  12/2  17/2
  D6  32/2  34/2  30/2
  D7  15/2   6/2  11/2
  D8  24/2  17/2  19/2

• For each document, reallocate the document to the cluster to which it has the highest similarity (the largest value in each row of the table above). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been reallocated from their original assignment:
  C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}
• This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation…

Example: K-Means (continued)
• Now compute the new cluster centroids using the original document-term matrix, with C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}:

       D1  D2  D3  D4  D5  D6  D7  D8 |  C1    C2    C3
  T1    0   4   0   0   0   2   1   3 | 8/3   2/4   0/1
  T2    3   1   4   3   1   2   0   1 | 2/3  12/4   1/1
  T3    3   0   0   0   3   0   3   0 | 3/3   3/4   3/1
  T4    0   1   0   3   0   0   2   0 | 3/3   3/4   0/1
  T5    2   2   2   3   1   4   0   2 | 4/3  11/4   1/1

Comments on the K-Means Method
• Weaknesses
  – Applicable only when a mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers

Variations of the K-Means Method
• A few variants of k-means differ in
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies to calculate cluster means
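A small Python sketch of this procedure on the document example above (dot-product similarity, centroids as per-term means; the loop simply repeats the reallocation until nothing changes, whereas the slides show only the first pass; on this data no cluster ever becomes empty):

```python
# Document vectors over T1..T5, read off the columns of the table above.
docs = {
    "D1": [0, 3, 3, 0, 2], "D2": [4, 1, 0, 1, 2],
    "D3": [0, 4, 0, 0, 2], "D4": [0, 3, 0, 3, 3],
    "D5": [0, 1, 3, 0, 1], "D6": [2, 2, 0, 0, 4],
    "D7": [1, 0, 3, 2, 0], "D8": [3, 1, 0, 0, 2],
}

def centroid(members):
    """Mean vector of the documents currently assigned to a cluster."""
    vecs = [docs[d] for d in members]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Initial partition with k = 3, as on the slide.
clusters = {"C1": ["D1", "D2"], "C2": ["D3", "D4"], "C3": ["D5", "D6"]}

for _ in range(10):                       # a few iterations suffice here
    centroids = {c: centroid(m) for c, m in clusters.items()}
    new = {c: [] for c in clusters}
    for d, vec in docs.items():
        # Reallocate each document to the cluster whose centroid it is
        # most similar to (dot product, as in the example).
        best = max(centroids, key=lambda c: dot(centroids[c], vec))
        new[best].append(d)
    if new == clusters:                   # no reassignment: converged
        break
    clusters = new

print(clusters)
# After the first reallocation the slides obtain
# C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}.
```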
What Is the Problem of the K-Means Method?
• The k-means algorithm is sensitive to outliers!
  – An object with an extremely large value may substantially distort the distribution of the data
• K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in a cluster
[Figure]

The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids) algorithm:
  1. Initialize: randomly select k of the n data points as the medoids
  2. Associate each data point with the closest medoid (“closest” here is defined using any valid distance metric, most commonly Euclidean, Manhattan or Minkowski distance)
  3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration
  4. Select the configuration with the lowest cost
  5. Repeat steps 2 to 5 until there is no change in the medoids

Hierarchical Clustering
• Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition
[Figure: on objects a, b, c, d, e, agglomerative clustering (AGNES, Agglomerative Nesting) merges bottom-up through steps 0-4 into ab, de, cde and finally abcde, while divisive clustering (DIANA, Divisive Analysis) splits top-down in the reverse order]

Agglomerative hierarchical clustering
• Groups data objects in a bottom-up fashion
• Initially each data object is in its own cluster
• Then we merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied
• A user can specify the desired number of clusters as a termination condition

Divisive hierarchical clustering
• Groups data objects in a top-down fashion
• Initially all data objects are in one cluster
• We then subdivide the cluster into smaller and smaller clusters, until each object forms a cluster on its own or certain termination conditions are satisfied, such as a desired number of clusters being obtained

AGNES (Agglomerative Nesting)
[Figure: three scatter plots showing clusters being merged step by step]

Dendrogram: Shows How the Clusters are Merged
[Figure]

Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  – Discover clusters of arbitrary shape
  – Handle noise
  – One scan
  – Need density parameters as a termination condition

Density-Based Clustering Algorithms
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure)

Grid-Based Clustering Methods
• This is the approach in which we quantize the space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed
• Assume that we have a set of records and we want to cluster them with respect to two attributes; then we divide the related space (plane) into a grid structure and find the clusters
[Figure: our “space” is the plane of Age (20-60) versus Salary (0-8, in units of 10,000), overlaid with a grid]

STING: A Statistical Information Grid Approach
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of resolution

Different Grid Levels during Query Processing
[Figure]
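A minimal sketch of the grid idea (Python; the records, cell sizes and field names are made up): quantize the Age/Salary plane into bottom-level cells, keep simple statistics per cell, and note that a higher-level cell can be summarized directly from its child cells, which is what a STING-style hierarchy relies on.

```python
from collections import defaultdict

# Hypothetical (age, salary-in-10k) records, only to illustrate the grid idea.
records = [(23, 3.1), (25, 2.9), (27, 3.4), (44, 6.8), (46, 7.1),
           (48, 6.5), (52, 1.2), (31, 4.0), (58, 6.9), (24, 3.0)]

AGE_STEP, SALARY_STEP = 10, 2.0          # assumed bottom-level cell size

def cell(age, salary):
    """Map a record to the (row, col) index of its bottom-level grid cell."""
    return (int(age // AGE_STEP), int(salary // SALARY_STEP))

# Pre-computed statistics per bottom-level cell: count and sum of salaries
# (from which the mean follows); STING stores such summaries in advance.
bottom = defaultdict(lambda: [0, 0.0])
for age, salary in records:
    stats = bottom[cell(age, salary)]
    stats[0] += 1
    stats[1] += salary

for key in sorted(bottom):
    count, total = bottom[key]
    print("cell", key, "count =", count, "mean salary =", round(total / count, 2))

# A higher-level cell covering 2 x 2 bottom-level cells is summarized purely
# from its children (counts and sums add up), so queries can be answered
# top-down without rescanning the raw data.
parents = defaultdict(lambda: [0, 0.0])
for (row, col), (count, total) in bottom.items():
    parents[(row // 2, col // 2)][0] += count
    parents[(row // 2, col // 2)][1] += total
```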
The STING Clustering Method
– Each cell at a high level is partitioned into a number of smaller cells at the next lower level
– The statistical info of each cell is calculated and stored beforehand and is used to answer queries
– Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells
  • count, mean, standard deviation, min, max
  • type of distribution: normal, uniform, etc.
– Use a top-down approach to answer spatial data queries
– Start from a pre-selected layer, typically one with a small number of cells
– For each cell in the current level, compute the confidence interval

Comments on STING
• Remove the irrelevant cells from further consideration
• When finished examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
  – Query-independent, easy to parallelize, incremental update
• Disadvantages:
  – All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected

Model-Based Clustering
• What is model-based clustering?
  – An attempt to optimize the fit between the given data and some mathematical model
  – Based on the assumption that the data are generated by a mixture of underlying probability distributions
• Typical methods
  – Statistical approach: EM (Expectation Maximization), AutoClass
  – Machine learning approach: COBWEB, CLASSIT
  – Neural network approach: SOM (Self-Organizing Feature Map)

Conceptual Clustering
• Conceptual clustering
  – A form of clustering in machine learning
  – Produces a classification scheme for a set of unlabeled objects
  – Finds a characteristic description for each concept (class)
• COBWEB
  – A popular and simple method of incremental conceptual learning
  – Creates a hierarchical clustering in the form of a classification tree
  – Each node refers to a concept and contains a probabilistic description of that concept

COBWEB Clustering Method
[Figure: a classification tree]

What Is Outlier Discovery?
• What are outliers?
  – Objects that are considerably dissimilar from the remainder of the data
• Problem: define and find outliers in large data sets
• Applications:
  – Credit card fraud detection
  – Customer segmentation
  – Medical analysis
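The slides do not commit to a particular outlier definition, so the sketch below (Python, with made-up claim amounts) uses one very simple statistical notion as an illustration only: flag values that lie more than two standard deviations from the mean.

```python
import math

# Hypothetical claim amounts; one value is far from the rest.
claims = [12.0, 14.5, 13.1, 11.8, 12.9, 13.4, 95.0, 12.2]

mean = sum(claims) / len(claims)
std = math.sqrt(sum((v - mean) ** 2 for v in claims) / len(claims))

# Objects considerably dissimilar from the remainder of the data:
# here, anything more than 2 standard deviations from the mean.
outliers = [v for v in claims if abs(v - mean) > 2 * std]
print(outliers)  # -> [95.0]
```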