Survey

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Survey

Document related concepts

Transcript

CLUSTERING PARTITIONING METHODS Elsayed Hemayed Data Mining Course Outline 2 Introduction Clustering Methods K-Means Clustering Selecting K Acknowledgement: some of the material in these slides are from [Max Bramer, “Principles of Data Mining”, SpringerVerlag London Limited 2007] Partitioning Methods Introduction 3 Extracting information from unlabelled data. Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters. Examples: In an economics application we might be interested in finding countries whose economies are similar. In a financial application we might wish to find clusters of companies that have similar financial performance. In a marketing application we might wish to find clusters of customers with similar buying behaviour. Partitioning Methods Clustering Examples 4 In a medical application we might wish to find clusters of patients with similar symptoms. In a document retrieval application we might wish to find clusters of documents with related content. In a crime analysis application we might look for clusters of high volume crimes such as burglaries or try to cluster together much rarer (but possibly related) crimes such as murders. Partitioning Methods Clustering Example 5 Partitioning Methods Clustering: Application 1 6 Market Segmentation: Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters. Partitioning Methods Clustering: Application 2 7 Document Clustering: Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents. Partitioning Methods Clustering Methods 8 Partitioning methods K-Means Hierarchical methods Agglomerative Hierarchical Clustering Divisive hierarchical clustering Density-based methods DBSCAN: a Density-Based Spatial Clustering of Applications with Noise Grid-based methods STING: A Statistical Information Grid Approach to Spatial Data Mining High Dimensional Data Clustering CLIQUE: A Dimension-Growth Subspace Clustering Method Partitioning Methods 9 Partitioning Methods K-Means Clustering Partitioning Methods K-Means Clustering 10 k-means clustering is an exclusive clustering algorithm. Each object is assigned to precisely one of a set of clusters. (There are other methods that allow objects to be in more than one cluster.) For this method of clustering we start by deciding how many clusters k we would like to form from our data. The value of k is generally a small integer, such as 2, 3, 4 or 5, but may be larger. Partitioning Methods The k-Means Clustering Algorithm 11 1. Choose a value of k. 2. Select k objects in an arbitrary fashion. Use these as the initial set of k centroids. 3. Assign each of the objects to the cluster for which it is nearest to the centroid. 4. Recalculate the centroids of the k clusters. 5. Repeat steps 3 and 4 until the centroids no longer move. Partitioning Methods Example (k=3) 12 Partitioning Methods Initial Clusters 13 Partitioning Methods Revised Cluster 14 Partitioning Methods Third Set of Clusters 15 These are the same clusters as before. Their centroids will be the same as those from which the clusters were generated. Hence the termination condition of the k-means algorithm has been met and these are the final clusters produced by the algorithm for the initial choice of centroids made. Partitioning Methods Finding the Best Set of Clusters 16 It can be proved that the k-means algorithm will always terminate, but it does not necessarily find the best set of clusters, corresponding to minimising the value of the objective function. The initial selection of centroids can significantly affect the result. Solution: Try different initial selection and take the best But what should be k Try different k But which one to choose Partitioning Methods Selecting K 17 The table shown value of Objective Function For Different Values of k These results suggest that the best value of k is probably 3. The value of the function for k = 3 is much less than for k = 2, but only a little better than for k = 4. We normally prefer to find a fairly small number of Partitioning Methods clusters as far as possible. K-Selection Strategy 18 We are not trying to find the value of k with the smallest value of the objective function. That will occur when the value of k is the same as the number of objects, i.e. each object forms its own cluster of one. The objective function will then be zero, but the clusters will be worthless. We usually want a fairly small number of clusters and accept that the objects in a cluster will be spread around the centroid (but ideally not too far away). Partitioning Methods Summary 19 Introduction Clustering Methods K-Means Clustering Selecting K Partitioning Methods