Data Mining Summer School
CLUSTER ANALYSIS
Assoc. Prof. NGUYEN Tri Thanh
University of Engineering and Technology
August 11, 2016

Outline
1. Introduction
2. Clustering applications
3. A categorization of major clustering methods
4. Data representation
5. Flat clustering
6. Hierarchical clustering
7. Cluster evaluation
8. Summary
9. Discussion

Introduction
- Cluster analysis (clustering): given a set of objects, group similar objects into clusters according to their characteristics
- Cluster: a collection of data objects that are
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- Key questions: How do we measure similarity? How do we choose the number of clusters?

Introduction (cont'd)
- Clustering is natural to humans:
  - grouping animals and plants
  - grouping students
  - grouping customers
  - Facebook groups, interest groups, ...

Clustering vs. Classification
- Clustering
  - No predefined classes (the number of clusters is unknown)
  - Unlabeled data objects
  - Unsupervised learning
- Classification
  - Predefined classes
  - Labeled data objects
  - Predicts/identifies the class of an unlabeled object
  - Supervised learning

Clustering applications
- Image processing and pattern recognition
- Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters or support other spatial mining tasks
- Economic science (especially market research)
- WWW
  - Document classification
  - Clustering Weblog data to discover groups of similar access patterns

Clustering applications (cont'd)
- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Categorization of Clustering Approaches
- Partitioning approach:
  - Construct various partitions and evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (II)
- Model-based:
  - A model is hypothesized for each cluster, and the method finds the best fit of the model to the data
  - Typical methods: EM, SOM, COBWEB
- Frequent-pattern-based:
  - Based on the analysis of frequent patterns
  - Typical methods: p-Cluster
- User-guided or constraint-based:
  - Clustering that considers user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
- Link-based clustering:
  - Objects are often linked together in various ways; massive links can be used to cluster objects
  - Typical methods: SimRank, LinkClus

Flat clustering

Data object representation
- An object is a set/vector of features/attributes
  - Person: name, age, sex, job, ...
  - Text: set of distinct words
- Similarity
  - Distance: the smaller, the more similar:
    dis(d_1, d_2) = \sqrt{\sum_{i=1}^{n} (d_{1i} - d_{2i})^2}
  - Similarity: the bigger, the more similar:
    sim(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1| |d_2|} = \frac{\sum_{i=1}^{n} d_{1i} d_{2i}}{\sqrt{\sum_{i=1}^{n} d_{1i}^2} \sqrt{\sum_{i=1}^{n} d_{2i}^2}}

K-means Clustering
- The number of clusters (k) is known in advance
- Each cluster is represented by the centroid of the documents that belong to it:
    c = \frac{1}{|S|} \sum_{d \in S} d
- Cluster membership is determined by the most similar cluster centroid
- Algorithm:
  1. Select k documents from S to be used as cluster centroids; this is usually done at random.
  2. Assign documents to clusters according to their similarity to the cluster centroids, i.e., for each document find the most similar centroid and assign the document to the corresponding cluster.
  3. For each cluster, recompute the centroid using the newly computed cluster members.
  4. Go to step 2 until the process converges, i.e., the same documents are assigned to each cluster in two consecutive iterations, or the cluster centroids remain the same.

The K-Means Clustering Method: Example
[Figure: with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and update until the assignments stabilize]

K-means Clustering Discussion
- In step 2 documents are moved between clusters in order to maximize the intra-cluster similarity
- The clustering maximizes the criterion function (a measure of clustering quality)
- In distance-based k-means clustering the criterion function is the sum of squared errors (based on Euclidean distance and means)
- For k-means clustering of documents, a function based on the k centroids and similarity is used:
    J = \sum_{i=1}^{k} \sum_{d_j \in D_i} sim(c_i, d_j)
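The four-step procedure above can be sketched in a few lines of code. This is a minimal illustration (not from the lecture materials), using the cosine similarity and centroid formulas given earlier; all function and variable names are illustrative:

```python
# Sketch of the k-means procedure described above, using cosine
# similarity and centroid vectors (names are illustrative).
import math
import random

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def k_means(docs, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(docs, k)            # step 1: random initial centroids
    assignment = None
    for _ in range(max_iter):
        # step 2: assign each document to the most similar centroid
        new_assignment = [max(range(k), key=lambda i: cosine_sim(centers[i], d))
                          for d in docs]
        if new_assignment == assignment:     # step 4: convergence check
            break
        assignment = new_assignment
        # step 3: recompute centroids from the new cluster members
        for i in range(k):
            members = [d for d, a in zip(docs, assignment) if a == i]
            if members:
                centers[i] = centroid(members)
    return assignment
```

For example, on the four vectors [1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9] with k = 2, the first two vectors end up in one cluster and the last two in the other.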
K-means Clustering Discussion (cont'd)
- A clustering that maximizes this function is called a minimum variance clustering
- The k-means algorithm produces a minimum variance clustering, but it does not guarantee that it finds the global maximum of the criterion function
- After each iteration the value of J increases, but it may converge to a local maximum
- The result depends greatly on the initial choice of cluster centroids

Sample data
[Figure: sample data]

K-means Clustering Example (result)
[Figure: clustering of CCSU Departments data with 6 TFIDF attributes (k = 2)]

Hierarchical clustering

Hierarchical Partitioning
- Produces a nested sequence of partitions of the set of data points
- Can be displayed as a tree (called a dendrogram) with a single cluster containing all points at the root and singleton clusters (individual points) at the leaves
- Example: hierarchical partitioning of the set of numbers {1, 2, 4, 5, 8, 10}; the similarity measure used in this example is (10 - d)/10, where d is the distance between data points or cluster centers

Approaches to Hierarchical Partitioning
- Agglomerative
  - Starts with the individual data points and at each step merges the two closest (most similar) points (or, at later steps, clusters) until a single cluster remains
- Divisive
  - Starts with the original set of points and at each step splits a cluster until only individual points remain
  - To implement this approach we need to decide which cluster to split and how to perform the split

Approaches to Hierarchical Partitioning (cont'd)
- The agglomerative approach is more popular, as it needs only the definition of a distance or similarity function on clusters/points
- For data points in Euclidean space, the Euclidean distance is the best choice
- For documents represented as TF-IDF vectors, the preferred measure is the cosine similarity:
    sim(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1| |d_2|} = \frac{\sum_{i=1}^{n} d_{1i} d_{2i}}{\sqrt{\sum_{i=1}^{n} d_{1i}^2} \sqrt{\sum_{i=1}^{n} d_{2i}^2}}

Agglomerative Hierarchical Clustering
- There are several versions of this approach, depending on how the similarity sim(S_1, S_2) between clusters is defined (S_1, S_2 are sets of documents):
  - Similarity between cluster centroids, i.e., sim(S_1, S_2) = sim(c_1, c_2), where the centroid c of cluster S is
      c = \frac{1}{|S|} \sum_{d \in S} d
  - Maximum similarity between documents from each cluster (nearest neighbor clustering):
      sim(S_1, S_2) = \max_{d_1 \in S_1, d_2 \in S_2} sim(d_1, d_2)
  - Minimum similarity between documents from each cluster (farthest neighbor clustering):
      sim(S_1, S_2) = \min_{d_1 \in S_1, d_2 \in S_2} sim(d_1, d_2)
  - Average similarity between documents from each cluster:
      sim(S_1, S_2) = \frac{1}{|S_1| |S_2|} \sum_{d_1 \in S_1, d_2 \in S_2} sim(d_1, d_2)

Agglomerative Clustering Algorithm
- S is the initial set of documents and G is the clustering tree
- k and q are control parameters that stop merging
  - when the desired number of clusters (k) is reached, or
  - when the similarity between the clusters to be merged falls below a specified threshold (q)

1. G <- {{d} | d in S} (initialize G with singleton clusters, each containing one document from S)
2. If |G| <= k then exit (stop if the desired number of clusters is reached)
3. Find S_i, S_j in G such that (i, j) = argmax_{(i,j)} sim(S_i, S_j) (find the two closest clusters)
4. If sim(S_i, S_j) < q then exit (stop if the similarity of the closest clusters is less than q)
5. Remove S_i and S_j from G
6. G <- G ∪ {S_i ∪ S_j} (merge S_i and S_j, and add the new cluster to the hierarchy)
7. Go to step 2
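The loop above can be sketched as follows. This is a minimal illustration (not from the lecture materials), assuming nearest-neighbor (maximum) cosine similarity between clusters; the exhaustive pair search is deliberately simple rather than optimized, and all names are illustrative:

```python
# Sketch of the agglomerative clustering algorithm above, merging by
# maximum (nearest-neighbor) similarity between clusters.
import itertools
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_sim(s1, s2):
    # nearest-neighbor similarity: maximum over document pairs
    return max(cosine_sim(d1, d2) for d1 in s1 for d2 in s2)

def agglomerate(docs, k=1, q=0.0):
    g = [[d] for d in docs]                     # step 1: singleton clusters
    while len(g) > k:                           # step 2: stop at k clusters
        # step 3: find the two closest clusters
        (i, j) = max(itertools.combinations(range(len(g)), 2),
                     key=lambda p: cluster_sim(g[p[0]], g[p[1]]))
        if cluster_sim(g[i], g[j]) < q:         # step 4: similarity threshold
            break
        merged = g[i] + g[j]                    # steps 5-6: merge the pair
        g = [s for idx, s in enumerate(g) if idx not in (i, j)] + [merged]
    return g                                    # step 7 is the loop itself
```

On the vectors [1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9] with k = 2, the two near-parallel pairs are merged first, giving two clusters of two documents each.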
For n documents, both the time and space complexity of the algorithm are O(n^2).

Agglomerative Clustering Example
[Figure: agglomerative clustering example]

DIANA Clustering Algorithm
0. G <- {{d_1, d_2, ..., d_n}} (initialize G with a single cluster containing all documents); g <- G
1. Find the document d that is most distinct from the others; S <- {d}
2.1. For every d_i not in S, calculate
    l_i = avg_{d \notin S} |d_i - d| - avg_{d \in S} |d_i - d|
2.2. Find the d_h with maximum l_h; if l_h > 0, then S <- S ∪ {d_h}
3. Repeat step 2 until there is no l_h > 0; update G
4. Select the cluster g of m documents with the biggest diameter,
    diam(g) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} (d_i - d_j)^2}{m(m-1)},
   and go to step 1
5. Stop when all clusters contain a single document

Clustering evaluation

Evaluating Clustering
- Similarity-based criterion functions
- Probabilistic criterion functions
- MDL-based model and feature evaluation
- Classes-to-clusters evaluation
- Precision, recall and F-measure
- Entropy

Similarity-Based Criterion Functions (distance)
- Basic idea: the cluster center c_i (centroid, or mean in the case of numeric data) best represents cluster D_i if it minimizes the sum of the squared lengths of the "error" vectors x - c_i for all x in D_i:
    J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - c_i\|^2, where c_i = \frac{1}{|D_i|} \sum_{x \in D_i} x
- Alternative formulation based on the pairwise distance between cluster members:
    J_e = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{x_j, x_l \in D_i} \|x_j - x_l\|^2

Similarity-Based Criterion Functions (cosine similarity)
- For document clustering the (centroid) cosine similarity is used:
    J_s = \sum_{i=1}^{k} \sum_{d_j \in D_i} sim(c_i, d_j),
  where c_i = \frac{1}{|D_i|} \sum_{d_j \in D_i} d_j and sim(c_i, d_j) = \frac{c_i \cdot d_j}{|c_i| |d_j|}
- Equivalent form based on pairwise similarity:
    J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_k \in D_i} sim(d_j, d_k)
- Another formulation is based on the average pairwise intracluster similarity (used to control the merging of clusters in hierarchical
agglomerative clustering):
    J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_l \in D_i} sim(d_j, d_l) = \frac{1}{2} \sum_{i=1}^{k} |D_i| \, sim(D_i),
  where sim(D_i) is the average pairwise similarity within cluster D_i

Classes to Clusters Evaluation
- Assume the classification of the documents in a sample is known, i.e., each document has a class label
- Cluster the sample without using the class labels
- Assign to each cluster the class label of the majority of the documents in it
- Compute the error as the proportion of documents whose class and cluster labels differ
- Or compute the accuracy as the proportion of documents whose class and cluster labels agree

Classes to Clusters Evaluation (Example)
[Figure: classes-to-clusters evaluation example]

Confusion Matrix (contingency table)
- TP (true positive), FN (false negative), FP (false positive), TN (true negative)
    Accuracy = \frac{TP + TN}{TP + FP + TN + FN}
    Error = \frac{FP + FN}{TP + FP + TN + FN}

Precision and Recall
    Precision = \frac{TP}{TP + FP}
    Recall = \frac{TP}{TP + FN}

F-Measure
- Generalized confusion matrix for m classes and k clusters: n_{ij} is the number of documents of class i in cluster j
- Combining precision and recall:
    P(i, j) = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}}
    R(i, j) = \frac{n_{ij}}{\sum_{j=1}^{k} n_{ij}}
    F(i, j) = \frac{2 P(i, j) R(i, j)}{P(i, j) + R(i, j)}
- Evaluating the whole clustering:
    F = \sum_{i=1}^{m} \frac{n_i}{n} \max_{j=1,...,k} F(i, j),
  where n_i = \sum_{j=1}^{k} n_{ij} and n = \sum_{i=1}^{m} \sum_{j=1}^{k} n_{ij} is the total number of documents

F-Measure (Example)
[Figure: F-measure example]

Entropy
- Consider the class label as a random event and evaluate its probability distribution in each cluster
- The probability of class i in cluster j is estimated by the proportion of occurrences of class label i in cluster j:
    p_{ij} = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}}
- The entropy is a measure of "impurity" and accounts for the average information in an arbitrary message about the class label:
    H_j = -\sum_{i=1}^{m} p_{ij} \log p_{ij}

Entropy (cont'd)
- To evaluate the whole clustering, we sum the entropies of the individual clusters, weighted by the proportion of documents in each:
    H = \sum_{j=1}^{k} \frac{n_j}{n} H_j
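The evaluation measures above (classes-to-clusters accuracy, the F-measure, and entropy) can all be computed from the m-by-k contingency table. The following is a minimal sketch, not from the lecture materials, assuming the table is a list of rows, one per class, with one column per cluster; function names are illustrative:

```python
# Sketch of the clustering-evaluation formulas: accuracy via majority
# labels, overall F-measure, and entropy of a clustering, all computed
# from a contingency table n[i][j] = count of class i in cluster j.
import math

def accuracy(n):
    # assign each cluster its majority class, then count matches
    total = sum(map(sum, n))
    correct = sum(max(col) for col in zip(*n))   # best class per cluster
    return correct / total

def f_measure(n):
    total = sum(map(sum, n))
    class_sizes = [sum(row) for row in n]        # n_i
    cluster_sizes = [sum(col) for col in zip(*n)]  # n_j
    f = 0.0
    for i, row in enumerate(n):
        best = 0.0
        for j, nij in enumerate(row):
            if nij:
                p = nij / cluster_sizes[j]       # P(i, j)
                r = nij / class_sizes[i]         # R(i, j)
                best = max(best, 2 * p * r / (p + r))
        f += class_sizes[i] / total * best       # weighted max over clusters
    return f

def entropy(n):
    total = sum(map(sum, n))
    h = 0.0
    for col in zip(*n):                          # one column per cluster
        nj = sum(col)
        hj = -sum(c / nj * math.log2(c / nj) for c in col if c)
        h += nj / total * hj                     # weighted cluster entropy
    return h
```

For the two-cluster table with class counts 8/11 and 3/11 in one cluster and 3/9 and 6/9 in the other, i.e. entropy([[8, 3], [3, 6]]), the result is about 0.8782, matching the H(offers) value in the entropy example slides.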
Entropy (Examples)
- A "pure" cluster, in which all documents have a single class label, has entropy 0
- The highest entropy is achieved when all class labels have the same probability
- For example, for a two-class problem the 50-50 situation has the highest entropy: -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1

Entropy (Examples) (cont'd)
- Compare the entropies of the previously discussed clusterings for the attributes offers and students:
    H(offers) = \frac{11}{20} \left( -\frac{8}{11} \log_2 \frac{8}{11} - \frac{3}{11} \log_2 \frac{3}{11} \right) + \frac{9}{20} \left( -\frac{3}{9} \log_2 \frac{3}{9} - \frac{6}{9} \log_2 \frac{6}{9} \right) = 0.878176
    H(students) = \frac{15}{20} \left( -\frac{10}{15} \log_2 \frac{10}{15} - \frac{5}{15} \log_2 \frac{5}{15} \right) + \frac{5}{20} \left( -\frac{1}{5} \log_2 \frac{1}{5} - \frac{4}{5} \log_2 \frac{4}{5} \right) = 0.869204

Summary
- Cluster analysis groups objects based on their similarity and has wide applications
- Measures of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
- There are still many open research issues in cluster analysis

References
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006
- Z. Markov and D. T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, John Wiley & Sons, 2007
- Nguyễn Hà Nam, Nguyễn Trí Thành, Hà Quang Thụy, Giáo trình khai phá dữ liệu, NXB ĐHQGHN, 2014

Discussion