Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Prof. Sumanta Guha Slide Sources: Han & Kamber “Data Mining: Concepts and Techniques” book, slides by Han,  Han & Kamber, adapted and supplemented by Guha Chapter 7: Cluster Analysis What is Cluster Analysis?   Cluster: a collection of data objects  Similar to one another within the same cluster  Dissimilar to the objects in other clusters Cluster analysis  Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters  Unsupervised learning: no predefined classes  Typical applications  As a stand-alone tool to get insight into data distribution  As a preprocessing step for other algorithms Clustering: Rich Applications and Multidisciplinary Efforts   Pattern Recognition Spatial Data Mining     Detect spatial clusters or for other spatial mining tasks Image Processing Economic Science   Create thematic maps in GIS by clustering feature spaces Market research WWW   Document classification Cluster Weblog data to discover groups of similar access patterns Examples of Clustering Applications  Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs  Land use: Identification of areas of similar land use in an earth observation database  Insurance: Identifying groups of motor insurance policy holders with a high average claim cost. Fraud detection – outliers !  City-planning: Identifying groups of houses according to their house type, value, and geographical location  Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults Quality: What Is Good Clustering?  A good clustering method will produce high quality clusters with   high intra-class similarity  low inter-class similarity The quality of a clustering result depends on both the similarity measure used by the method and its implementation  The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns Measure the Quality of Clustering     Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically metric: d(i, j) There is a separate “quality” function that measures the “goodness” of a cluster. The definitions of distance functions are usually very different for numeric, boolean, categorical and ordinal variables.  Numeric: income, temperature, price, etc.  Boolean: Yes/no, e.g, student? citizen?  Categorical: color (red, blue, green, …), nationality, etc.  Ordinal: Excellent/Very good…, High/medium/low (i.e., with order) It is hard to define “similar enough” or “good enough”  the answer is typically highly subjective. Requirements of Clustering in Data Mining  Scalability  Ability to deal with different types of attributes  Ability to handle dynamic data  Discovery of clusters with arbitrary shape  Minimal requirements for domain knowledge to determine input parameters  Able to deal with noise and outliers  Insensitive to order of input records  High dimensionality  Incorporation of user-specified constraints  Interpretability and usability Major Clustering Approaches  Partitioning approach:  Given n objects in the database, a partitioning approach splits it into k groups.   Typical methods: k-means, k-medoids, CLARANS Hierarchical approach:  Create a hierarchical decomposition of the set of data (or objects) using one of two methods:  Agglomerative (bottom-up): start with each object as a separate group; successively, merge groups that are close until a termination condition holds.  Divisive (top-down): start with all objects in one group; successively split groups that are not “tight” until a termination condition holds.  Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON Partitioning Algorithms: Basic Concept  Partitioning method: Construct a partition of a database D of n objects into a set of k clusters Km, so as to minimize the sum of squared errors SSE C =  tmi Km (Cm  tmi ) k m1 2 If there was no restriction on the number k of clusters, how could we minimize SSE?! m is the cluster leader or representative (which itself may or where may not belong to the database D).  Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion  Global optimal: exhaustively enumerate all partitions  Heuristic methods: k-means and k-medoids algorithms  k-means (MacQueen’67): Each cluster is represented by the center of the cluster  k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster 1. The K-Means Clustering Method Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster. Input:  k: the number of clusters,  D: a data set containing n objects. Output: A set of k clusters. Method: (1) arbitrarily choose k objects from D as the initial cluster centers; (2) repeat (3) assign each object to the cluster whose center is closest to the object (4) update the cluster centers as the mean value of the objects in each cluster; (5) until no change; The K-Means Clustering Method  Example 10 10 9 9 8 8 7 7 6 6 5 5 10 9 8 7 6 5 4 4 3 2 1 0 0 1 2 3 4 5 6 7 8 K=2 Arbitrarily choose K object as initial cluster center 9 10 Assign each objects to most similar center 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10 Update the cluster means 4 3 2 1 0 0 1 2 3 4 5 6 reassign 10 9 9 8 8 7 7 6 6 5 5 4 3 2 1 0 1 2 3 4 5 6 7 8 8 9 10 reassign 10 0 7 9 10 Update the cluster means 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10 K-Means Clustering Method: Example with 8 points on a line 1 2 2.5 3 3.6 4 5.3 Randomly Cluster Compute to new nearest choose cluster 2cluster objects leaders leader. as=cluster cluster No leaders. change means.= Exit! 8 9 9.5 10 11 Comments on the K-Means Method  Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.    Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k)) Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing Weakness  Applicable only when mean is defined, then what about categorical data?  Need to specify k, the number of clusters, in advance  Unable to handle noisy data and outliers  Not suitable to discover clusters with non-convex shapes How to choose k, the number of clusters? (thanks Daniel Martin, Quora)  Rule of thumb: k ≈√(n/2) Elbow method: Plot SSE vs. number of clusters. Choose k where the SSE starts leveling off, i.e., from where there is not much gain in adding another cluster. E.g., below the choice would be k = 6, SSE =  Variations of the K-Means Method   A few variants of the k-means which differ in  Selection of the initial k means  Dissimilarity calculations  Strategies to calculate cluster means Handling categorical data: k-modes (Huang’98)  Replacing means of clusters with modes  Using new dissimilarity measures to deal with categorical objects  Using a frequency-based method to update modes of clusters  A mixture of categorical and numerical data: k-prototype method What Is the Problem with the K-Means Method?  The k-means algorithm is sensitive to outliers !  Since an object with an extremely large value may substantially distort the distribution of the data.  K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster. 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 More Partitioning Clustering Algorithms 2. K-medoids (PAM = Partitioning Around Medoids) 3. CLARA (Custering LARge Applications) 4. CLARANS (Clustering Large Applications based on RANdomized Search) Read above three clustering methods from the paper Efficient and Effective Clustering Methods for Spatial Data Mining, by Ng and Han, Intnl. Conf. on Very Large Data Bases (VLDB’94), 1994, which proposes CLARANS, but has a good presentation of PAM and CLARA as well. Hierarchical Clustering Algorithms 5. ROCK ROCK: A Robust Clustering Algorithm for Categorical Data, by (Sudipto) Guha, Rastogi and Shim, Information Systems, 2000. Main slides: ROCK slides by the authors Related slides: ROCK slides by Olusegun et al Hierarchical Clustering Algorithms 6. DBSCAN A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, by Ester, Kriegel, Sander and Xu, Intnl. Conf. Knowledge Discovery And Data Mining (KDD’96), 1996. DBSCAN Example 1 What are the clusters if Use Manhattan distance! 1. Eps = 2, MinPts = 7? 2. Eps = 2, MinPts = 5? DBSCAN Example 2 What are the clusters if Eps = 1, MinPts = 4? Use Manhattan distance! Note: The middle point must belong to both clusters if we follow the definition. So clusters can overlap! Hierarchical Clustering Algorithms 7. CLIQUE Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, by Agrawal, Gehrke, Gunopulos and Raghavan, ACM- SIGMOD Intnl. Conf. on Management of Data (SIGMOD’98), 1998.