K-Means Cluster Analysis
(Chapter 3, PPDM class; slides adapted from Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004.)

What is Cluster Analysis?
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
  – Intra-cluster distances are minimized; inter-cluster distances are maximized.

Applications of Cluster Analysis
• Understanding
  – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
  – Example of discovered clusters of stocks, with their industry groups:
      1. Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
      2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
      3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
      4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
• Summarization
  – Reduce the size of large data sets, e.g., clustering precipitation in Australia.

What is not Cluster Analysis?
• Supervised classification: has class label information.
• Simple segmentation: dividing students into different registration groups alphabetically, by last name.
• Results of a query: groupings are the result of an external specification.
• Graph partitioning: some mutual relevance and synergy, but the areas are not identical.

Notion of a Cluster Can Be Ambiguous
[Figure: the same set of points shown as two clusters, four clusters, or six clusters; how many clusters there are is ambiguous.]

Types of Clusterings
• A clustering is a set of clusters.
• There is an important distinction between hierarchical and partitional sets of clusters.
• Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

Partitional Clustering
[Figure: original points and a partitional clustering of them.]

Hierarchical Clustering
[Figure: a traditional hierarchical clustering of points p1–p4 with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram.]
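To make the partitional/hierarchical distinction concrete, here is a minimal sketch (an illustration of mine, not from the slides; the toy coordinates are made up) that clusters the same points both ways, using scikit-learn for a flat partitional clustering and SciPy's agglomerative linkage for a hierarchical one:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

# Toy 2-D points (made up for illustration).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

# Partitional: each point lands in exactly one of K non-overlapping clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("partitional labels:", km.labels_)

# Hierarchical: a tree of nested clusters (the linkage matrix Z encodes the
# dendrogram); cutting the tree afterwards yields a flat clustering.
Z = linkage(X, method="single")
print("tree cut into 2 clusters:", fcluster(Z, t=2, criterion="maxclust"))
```

Cutting the linkage tree with fcluster shows how a hierarchical clustering can be turned into a partitional one after the fact, which is exactly the nesting relationship the slide describes.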
Other Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive
  – In non-exclusive clusterings, points may belong to multiple clusters.
  – Non-exclusive clusters can represent multiple classes or 'border' points.
• Fuzzy versus non-fuzzy
  – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1.
  – The weights for each point must sum to 1.
  – Probabilistic clustering has similar characteristics.
• Partial versus complete
  – In some cases, we only want to cluster some of the data.
• Heterogeneous versus homogeneous
  – Clusters of widely different sizes, shapes, and densities.

Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or conceptual clusters
• Clusters described by an objective function

Types of Clusters: Well-Separated
• A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
[Figure: 3 well-separated clusters.]

Types of Clusters: Center-Based
• A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster.
• The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of the cluster.
[Figure: 4 center-based clusters.]

Types of Clusters: Contiguity-Based
• Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
[Figure: 8 contiguous clusters.]

Types of Clusters: Density-Based
• A cluster is a dense region of points, separated from other regions of high density by low-density regions.
• Used when the clusters are irregular or intertwined, and when noise and outliers are present.
[Figure: 6 density-based clusters.]

Types of Clusters: Conceptual Clusters
• Shared property or conceptual clusters: find clusters that share some common property or represent a particular concept.
[Figure: 2 overlapping circles.]

Types of Clusters: Objective Function
• Clusters defined by an objective function: find the clusters that minimize or maximize an objective function.
• In principle, enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters using the given objective function. (This is NP-hard.)
• Objectives can be global or local.
  – Hierarchical clustering algorithms typically have local objectives.
  – Partitional algorithms typically have global objectives.
• A variation of the global-objective approach is to fit the data to a parameterized model, with the parameters determined from the data. Mixture models, for example, assume that the data is a 'mixture' of a number of statistical distributions.
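To ground the 'enumerate and score' view, here is a tiny brute-force sketch (my own illustration; the points are made up, and using SSE as the objective is an assumption here, though SSE is the objective these slides adopt later). It scores every assignment of n points to K clusters, which is feasible only for toy inputs since the number of labelings grows as K^n:

```python
import itertools
import numpy as np

def partition_sse(X, labels, K):
    """Score a candidate clustering: sum of squared distances to cluster means."""
    total = 0.0
    for k in range(K):
        members = X[labels == k]
        if len(members):
            total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [4.0, 4.0], [4.1, 3.9], [3.9, 4.2]])
K = 2

# Score all K**n assignments of points to clusters -- 2**6 = 64 here,
# but hopeless for realistic n, which is why K-means searches heuristically.
best = min((np.array(a) for a in itertools.product(range(K), repeat=len(X))),
           key=lambda a: partition_sse(X, a, K))
print("best labels:", best, "SSE:", round(partition_sse(X, best, K), 3))
```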
Types of Clusters: Objective Function (continued)
• Map the clustering problem to a different domain and solve a related problem in that domain.
  – The proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points.
  – Clustering is then equivalent to breaking the graph into connected components, one for each cluster.
  – We want to minimize the edge weight between clusters and maximize the edge weight within clusters.

Characteristics of the Input Data Are Important
• Type of proximity or density measure
  – This is a derived measure, but it is central to clustering.
• Sparseness
  – Dictates the type of similarity and adds to efficiency.
• Attribute type
  – Dictates the type of similarity.
• Type of data
  – Dictates the type of similarity.
  – Other characteristics matter as well, e.g., autocorrelation.
• Dimensionality
• Noise and outliers
• Type of distribution

Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Density-based clustering

K-means Clustering
• Partitional clustering approach.
• Each cluster is associated with a centroid (center point).
• Each point is assigned to the cluster with the closest centroid.
• The number of clusters, K, must be specified.
• The basic algorithm is very simple (a code sketch follows the next slide):
  1. Select K points as the initial centroids.
  2. Form K clusters by assigning each point to its closest centroid.
  3. Recompute the centroid of each cluster.
  4. Repeat steps 2–3 until the centroids do not change.

K-means Clustering: Details
• Initial centroids are often chosen randomly.
  – The clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
  – Most of the convergence happens in the first few iterations, so the stopping condition is often changed to 'until relatively few points change clusters'.
• Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
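Here is the promised sketch of the basic algorithm in NumPy, under the assumptions from the slides above (random initial centroids, Euclidean distance, stop when assignments no longer change). It is an illustration, not production code; empty clusters are simply left in place, and a later slide discusses better strategies:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Basic K-means sketch: random initial centroids, Euclidean distance,
    stop when no point changes cluster."""
    rng = np.random.default_rng(seed)
    # 1. Select K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # 2. Assign every point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stable, so the centroids won't move: converged
        labels = new_labels
        # 3. Recompute each centroid as the mean of its assigned points.
        for k in range(K):
            if np.any(labels == k):  # leave an empty cluster's centroid as-is
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids
```

Each iteration performs one n-by-K distance computation over d attributes, which is where the O(n * K * I * d) complexity on the slide comes from.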
Two Different K-means Clusterings
[Figure: the same original points clustered two ways: an optimal clustering and a sub-optimal clustering.]

Importance of Choosing Initial Centroids
[Figure: iterations 1–6 of K-means for one choice of initial centroids.]

Evaluating K-means Clusters
• The most common measure is the Sum of Squared Error (SSE).
  – For each point, the error is its distance to the nearest cluster center; to get the SSE, we square these errors and sum them:

    SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

  – Here x is a data point in cluster C_i and m_i is the representative point for cluster C_i; one can show that m_i corresponds to the center (mean) of the cluster.
• Given two clusterings, we can choose the one with the smallest error.
• One easy way to reduce the SSE is to increase K, the number of clusters.
  – Still, a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K.

Importance of Choosing Initial Centroids (continued)
[Figure: iterations 1–5 of K-means for a different choice of initial centroids.]

Problems with Selecting Initial Points
• If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
  – The chance is relatively small when K is large.
  – If the clusters are all the same size, n, then

    P = \frac{\text{ways to pick one centroid per cluster}}{\text{ways to pick } K \text{ centroids}} = \frac{K! \, n^K}{(Kn)^K} = \frac{K!}{K^K}

  – For example, if K = 10, then the probability is 10!/10^10 = 0.00036.
  – Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't.
  – Consider the following example with five pairs of clusters.

10 Clusters Example
[Figure: iterations 1–4, starting with two initial centroids in one cluster of each pair of clusters.]

10 Clusters Example (continued)
[Figure: iterations 1–4, starting with some pairs of clusters having three initial centroids while others have only one.]
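As a companion to the SSE definition above, here is a short sketch (my own illustration) that computes the SSE of a clustering and uses it to implement the 'multiple runs' remedy that the next slide proposes; it assumes the kmeans() sketch defined earlier is in scope:

```python
import numpy as np

def sse(X, labels, centroids):
    """SSE = sum over clusters i of sum over x in C_i of dist^2(m_i, x)."""
    diffs = X - centroids[labels]  # each point minus its own cluster's centroid
    return float((diffs ** 2).sum())

def best_of_runs(X, K, runs=10):
    """'Multiple runs' remedy: rerun K-means from different random
    initializations and keep the clustering with the smallest SSE."""
    results = [kmeans(X, K, seed=s) for s in range(runs)]
    return min(results, key=lambda r: sse(X, r[0], r[1]))
```

With K = 10, the 0.00036 figure above says a perfect draw of initial centroids is rare in any single run; ten restarts only raise the odds to roughly ten times that, which is why the next slide calls multiple runs helpful but notes that probability is not on your side.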
Solutions to the Initial Centroids Problem
• Multiple runs (as in the best_of_runs sketch above).
  – Helps, but probability is not on your side.
• Sample the data and use hierarchical clustering to determine initial centroids.
• Select more than K initial centroids and then select among these initial centroids.
  – Select the most widely separated ones.
• Postprocessing.
• Bisecting K-means.
  – Not as susceptible to initialization issues (see the sketch at the end of this section).

Handling Empty Clusters
• The basic K-means algorithm can yield empty clusters.
• Several strategies for choosing a replacement centroid:
  – Choose the point that contributes most to the SSE.
  – Choose a point from the cluster with the highest SSE.
  – If there are several empty clusters, the above can be repeated several times.

Updating Centers Incrementally
• In the basic K-means algorithm, centroids are updated after all points have been assigned to a centroid.
• An alternative is to update the centroids after each assignment (the incremental approach):
  – Each assignment updates zero or two centroids.
  – More expensive.
  – Introduces an order dependency.
  – Never produces an empty cluster.
  – Can use "weights" to change the impact of a point.

Pre-processing and Post-processing
• Pre-processing
  – Normalize the data.
  – Eliminate outliers.
• Post-processing
  – Eliminate small clusters that may represent outliers.
  – Split 'loose' clusters, i.e., clusters with relatively high SSE.
  – Merge clusters that are 'close' and that have relatively low SSE.
  – These steps can also be used during the clustering process (as in ISODATA).

Bisecting K-means
• A variant of K-means that can produce a partitional or a hierarchical clustering.
[Figure: a bisecting K-means example.]

Limitations of K-means
• K-means has problems when clusters have differing
  – sizes,
  – densities, or
  – non-globular shapes.
• K-means also has problems when the data contains outliers.

Limitations of K-means: Differing Sizes
[Figure: original points versus the K-means result with 3 clusters.]

Limitations of K-means: Differing Density
[Figure: original points versus the K-means result with 3 clusters.]

Limitations of K-means: Non-globular Shapes
[Figure: original points versus the K-means result with 2 clusters.]

Overcoming K-means Limitations
• One solution is to use many clusters: K-means then finds parts of the natural clusters, which must be put together afterwards.
[Figures: original points versus K-means clusters for the differing-sizes, differing-density, and non-globular cases, each recovered in pieces using many clusters.]
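The slides leave bisecting K-means as a one-liner, so here is a hedged sketch of the usual scheme: start with all points in one cluster and repeatedly bisect one cluster with 2-means until K clusters exist. Choosing the largest-SSE cluster to split is a common convention, not something the slide specifies; the code reuses the kmeans() and sse() sketches from above:

```python
import numpy as np

def bisecting_kmeans(X, K, trials=5):
    """Bisecting K-means sketch: repeatedly split one cluster in two with
    2-means until K clusters exist. Assumes kmeans() and sse() from the
    earlier sketches are in scope."""
    def cluster_sse(idx):
        pts = X[idx]
        return ((pts - pts.mean(axis=0)) ** 2).sum()

    clusters = [np.arange(len(X))]  # one array of point indices per cluster
    while len(clusters) < K:
        # Pick the cluster to bisect: here, the one with the highest SSE
        # (a common convention; the slide does not fix the criterion).
        worst = max(range(len(clusters)), key=lambda i: cluster_sse(clusters[i]))
        idx = clusters.pop(worst)
        # Bisect it with several 2-means trials; keep the lowest-SSE split.
        labels, _ = min((kmeans(X[idx], 2, seed=s) for s in range(trials)),
                        key=lambda r: sse(X[idx], r[0], r[1]))
        for side in (0, 1):
            part = idx[labels == side]
            if len(part):  # guard against an empty side of the split
                clusters.append(part)
    return clusters
```

Because each split is a local 2-means problem restarted several times, initialization matters less than for a single K-way run, and recording the sequence of splits yields the hierarchical clustering the slide mentions.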