Clustering II
Data Mining Overview

Data Mining
 Data warehouses and OLAP (On-Line Analytical Processing)
 Association Rules Mining
 Clustering: Hierarchical and Partitional approaches
 Classification: Decision Trees and Bayesian classifiers
 Sequential Patterns Mining
 Advanced topics: outlier detection, web mining
Major Clustering Approaches

Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion

Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion

Density-based algorithms: based on connectivity and density functions

Model-based: A model is hypothesized for each of the clusters and the
idea is to find the best fit of the data to the given model
Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters

Given k, find a partition into k clusters that optimizes the chosen
partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms


k-means (MacQueen’67): Each cluster is represented by the center of the
cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in the
cluster
Optimization problem


The goal is to optimize a score function
The most commonly used is the square error criterion:
$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$

where $m_i$ is the mean of cluster $C_i$.
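For concreteness, a minimal sketch of this criterion in Python (assuming NumPy arrays for the points and the cluster means, and a label array giving each point's cluster):

```python
import numpy as np

def square_error(points, means, labels):
    """E = sum over clusters i, and points p in C_i, of ||p - m_i||^2."""
    diffs = points - means[labels]      # offset of every point from its cluster mean
    return float(np.sum(diffs ** 2))
```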
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4
steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of
the current partition. The centroid is the center (mean
point) of the cluster.
 Assign each object to the cluster with the nearest seed
point.
 Go back to Step 2; stop when no new assignments are made.
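A minimal NumPy sketch of these four steps (illustrative only: the initial partition is random and empty clusters are not handled):

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(points))    # Step 1: random initial partition
    for _ in range(max_iter):
        # Step 2: seed points = centroids (mean points) of the current clusters
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: assign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # Step 4: stop when nothing changes
            break
        labels = new_labels
    return centroids, labels
```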
The K-Means Clustering Method
[Figure: a sequence of 2-D scatter plots (axes 0-10) illustrating successive k-means iterations on a small example.]
Comments on the K-Means Method


Strength
 Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
 Often terminates at a local optimum. The global optimum may
be found using techniques such as deterministic annealing and
genetic algorithms
Weakness
 Applicable only when the mean is defined; what about
categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids, 1987)

starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it improves
the total distance of the resulting clustering

PAM works effectively for small data sets, but does not scale
well for large data sets

CLARA (Kaufmann & Rousseeuw, 1990)

CLARANS (Ng & Han, 1994): Randomized sampling

Focusing + spatial data structure (Ester et al., 1995)
PAM (Partitioning Around Medoids)

PAM (Kaufman and Rousseeuw, 1987), built into S-Plus

Uses real objects to represent the clusters

Select k representative objects (medoids) arbitrarily

For each pair of a non-selected object h and a selected object i,
calculate the total swapping cost TCih

For each pair of i and h:
 If TCih < 0, i is replaced by h
 Then assign each non-selected object to the most
similar representative object

Repeat steps 2-3 until there is no change
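A rough, self-contained sketch of this swap-based search (an illustrative Python/NumPy version using Euclidean distance; the brute-force evaluation of every (i, h) pair is what keeps PAM practical only for small data sets):

```python
import numpy as np

def pam(points, k, max_iter=50, seed=0):
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)  # pairwise distances
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))    # select k medoids arbitrarily

    def total_cost(meds):
        # total distance of each object to its most similar medoid
        return dist[:, meds].min(axis=1).sum()

    for _ in range(max_iter):
        best_swap, best_delta = None, 0.0
        for i in medoids:                                    # selected object i
            for h in range(n):                               # non-selected object h
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                delta = total_cost(candidate) - total_cost(medoids)   # TC_ih
                if delta < best_delta:                       # TC_ih < 0: swap improves the clustering
                    best_swap, best_delta = (i, h), delta
        if best_swap is None:                                # no improving swap: stop
            break
        i, h = best_swap
        medoids = [h if m == i else m for m in medoids]
    labels = dist[:, medoids].argmin(axis=1)                 # assign objects to the nearest medoid
    return medoids, labels
```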
PAM Clustering: Total swapping cost $TC_{ih} = \sum_j C_{jih}$

[Figure: four 2-D examples (axes 0-10) showing how the cost contribution C_jih of an object j depends on where j ends up when medoid i is swapped with non-medoid h; t denotes another current medoid.]

 If j moves from medoid i to the new medoid h: C_jih = d(j, h) - d(j, i)
 If j stays with another medoid t: C_jih = 0
 If j moves from medoid i to another medoid t: C_jih = d(j, t) - d(j, i)
 If j moves from medoid t to the new medoid h: C_jih = d(j, h) - d(j, t)
CLARA (Clustering Large Applications) (1990)

CLARA (Kaufmann and Rousseeuw in 1990)

Built into statistical analysis packages, such as S-Plus

It draws multiple samples of the data set, applies PAM on each
sample, and gives the best clustering as the output

Strength: deals with larger data sets than PAM

Weakness:

Efficiency depends on the sample size

A good clustering based on samples will not necessarily represent a
good clustering of the whole data set if the sample is biased
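A compact, self-contained sketch of the CLARA idea (assumptions: Euclidean distance, illustrative parameter values, and a simple greedy medoid selection on each sample standing in for a full PAM run):

```python
import numpy as np

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

    def cost(medoids):
        # total distance of every object in the WHOLE data set to its nearest medoid
        return dist[:, medoids].min(axis=1).sum()

    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):                       # draw multiple samples
        sample = rng.choice(len(points), size=min(sample_size, len(points)), replace=False)
        # stand-in for running PAM on the sample: greedily pick medoids
        # from the sample that most reduce the cost over the sample
        medoids = [int(sample[0])]
        while len(medoids) < k:
            candidates = [int(c) for c in sample if int(c) not in medoids]
            gains = [dist[sample][:, medoids + [c]].min(axis=1).sum() for c in candidates]
            medoids.append(candidates[int(np.argmin(gains))])
        if cost(medoids) < best_cost:                # keep the best clustering over all samples
            best_medoids, best_cost = medoids, cost(medoids)
    return best_medoids, best_cost
```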
CLARANS (“Randomized” CLARA)

CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)

CLARANS draws a sample of neighbors dynamically

The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids

If a local optimum is found, CLARANS starts with a new
randomly selected node in search of a new local optimum

It is more efficient and scalable than both PAM and
CLARA
Hierarchical Clustering

Uses a distance matrix as the clustering criterion. This method does not
require the number of clusters k as an input, but needs a termination
condition
[Figure: objects a-e are merged step by step (a, b → ab; d, e → de; c, de → cde; ab, cde → abcde). Read left to right (Step 0 → Step 4) for agglomerative clustering (AGNES); read right to left (Step 4 → Step 0) for divisive clustering (DIANA).]
AGNES (Agglomerative Nesting)

Agglomerative, Bottom-up approach

Merge nodes that have the least dissimilarity

Go on in a non-descending fashion

Eventually all nodes belong to the same cluster
[Figure: three 2-D scatter plots (axes 0-10) showing nearby objects being merged into progressively larger clusters.]
A Dendrogram Shows How the
Clusters are Merged Hierarchically
Decompose data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster.
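As an illustration, a small SciPy sketch that builds the merge tree and then cuts it at a chosen level (single linkage and three clusters are arbitrary choices here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy 2-D objects
points = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

Z = linkage(points, method='single')      # agglomerative merge tree (the dendrogram)

# cut the dendrogram so that at most 3 clusters remain;
# each connected component below the cut becomes one cluster
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)                             # e.g. [1 1 2 2 3]
```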
DIANA (Divisive Analysis)

Top-down approach

Inverse order of AGNES

Eventually each node forms a cluster on its own
[Figure: three 2-D scatter plots (axes 0-10) showing one cluster being split into progressively smaller clusters.]
More on Hierarchical Clustering Methods


Major weaknesses of vanilla agglomerative clustering
methods
 do not scale well: time complexity of at least O(n²), where n
is the total number of objects
 can never undo what was done previously
Integration of hierarchical with distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts the
quality of sub-clusters
 CURE (1998): selects well-scattered points from the cluster
and then shrinks them towards the center of the cluster by a
specified fraction
BIRCH

Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)

Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering

Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve the
inherent clustering structure of the data)

Phase 2: use an arbitrary clustering algorithm to cluster the
leaf nodes of the CF-tree
BIRCH

Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans

Weakness: handles only numeric data, and is sensitive to
the order of the data records.
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points, $\sum_{i=1}^{N} X_i$
SS: square sum of the N points, $\sum_{i=1}^{N} X_i^2$

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give
CF = (5, (16,30), (54,190))

[Figure: the five points plotted on a 2-D grid (axes 0-10).]
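A tiny NumPy check of this example (SS is kept per dimension here, which is what the (54, 190) in the example corresponds to):

```python
import numpy as np

points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N  = len(points)                 # number of data points
LS = points.sum(axis=0)          # linear sum per dimension  -> [16 30]
SS = (points ** 2).sum(axis=0)   # square sum per dimension  -> [54 190]

print(N, LS, SS)                 # 5 [16 30] [54 190], i.e. CF = (5, (16,30), (54,190))
```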
CF Tree
B = 7 (branching factor), L = 6 (maximum number of entries in a leaf node)

[Figure: a CF tree. The root holds entries CF1 ... CF6, each with a child pointer; a non-leaf node holds entries CF1 ... CF5 with child pointers; leaf nodes hold CF entries (e.g. CF1 ... CF6) and are chained together with prev/next pointers.]
CURE (Clustering Using REpresentatives)

CURE: proposed by Guha, Rastogi & Shim, 1998

Stops the creation of a cluster hierarchy if a level consists of
k clusters

Uses multiple representative points to evaluate the distance
between clusters; adjusts well to arbitrarily shaped clusters
and avoids the single-link effect
Drawbacks of Distance-Based Method

Drawbacks of square-error based clustering method

Considers only one point as the representative of a cluster

Good only for clusters of convex shape, similar size and density, and
only if k can be reasonably estimated
Cure: The Algorithm

Draw a random sample s

Partition the sample into p partitions, each of size s/p

Partially cluster each partition into s/(pq) clusters

Eliminate outliers

By random sampling

If a cluster grows too slowly, eliminate it

Cluster the partial clusters

Label the data on disk
Data Partitioning and Clustering
 s = 50
 p = 2
 s/p = 25
 s/pq = 5

[Figure: 2-D (x, y) scatter plots showing the sampled points being partitioned and each partition being partially clustered.]
Cure: Shrinking Representative Points
[Figure: a cluster's representative points before and after shrinking.]

Shrink the multiple representative points towards the
gravity center by a fraction α

Multiple representatives capture the shape of the cluster
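A small sketch of the shrinking step (illustrative; the mean of the representative points stands in for the cluster's gravity center, and alpha is the shrinking fraction α):

```python
import numpy as np

def shrink_representatives(reps, alpha=0.3):
    """Move each representative point toward the gravity center by fraction alpha."""
    reps = np.asarray(reps, dtype=float)
    center = reps.mean(axis=0)                 # gravity center of the representatives
    return reps + alpha * (center - reps)      # each point moves alpha of the way to the center

# example: well-scattered representatives of one cluster
reps = [[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]]
print(shrink_representatives(reps, alpha=0.3))
```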
Clustering Categorical Data: ROCK


ROCK: Robust Clustering using linKs,
by S. Guha, R. Rastogi, K. Shim (ICDE’99).
 Use links to measure similarity/proximity
 Not distance based
 Computational complexity: $O(n^2 + n\,m_m m_a + n^2 \log n)$, where
$m_a$ and $m_m$ are the average and maximum number of neighbors of a point
Basic ideas:
 Similarity function and neighbors:
$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$

Let T1 = {1,2,3}, T2 = {3,4,5}:

$Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2$
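A minimal sketch of this set-based similarity:

```python
def sim(t1, t2):
    """Jaccard similarity between two transactions (sets of items)."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

print(sim({1, 2, 3}, {3, 4, 5}))   # |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
```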
Rock: Algorithm


Links: the number of common neighbors of two points.

Example: among the transactions {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5},
{1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}, with a similarity threshold of 0.5,
link({1,2,3}, {1,2,4}) = 3: the two transactions share the three neighbors
{1,2,5}, {1,3,4} and {2,3,4}.
Algorithm
 Draw random sample
 Cluster with links
 Label the data on disk
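A short sketch reproducing the link count in the example above (the 0.5 similarity threshold is the assumption noted there):

```python
from itertools import combinations

def sim(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

transactions = [frozenset(c) for c in combinations({1, 2, 3, 4, 5}, 3)]
theta = 0.5

def neighbors(t):
    # neighbors of t: all other transactions with similarity >= theta
    return {u for u in transactions if u != t and sim(t, u) >= theta}

def link(t1, t2):
    # number of common neighbors of t1 and t2 (excluding t1 and t2 themselves)
    return len((neighbors(t1) & neighbors(t2)) - {t1, t2})

print(link(frozenset({1, 2, 3}), frozenset({1, 2, 4})))   # -> 3
```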
Density-Based Clustering Methods



Clustering based on density (local cluster criterion), such as
density-connected points
Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al (SIGMOD’99).
 DENCLUE: Hinneburg & D. Keim (KDD’98)
 CLIQUE: Agrawal, et al. (SIGMOD’98)
CLIQUE (Clustering In QUEst)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).

Automatically identifies subspaces of a high-dimensional data space
that allow better clustering than the original space

CLIQUE can be considered as both density-based and grid-based

It partitions each dimension into the same number of equal-length intervals

It partitions an m-dimensional data space into non-overlapping rectangular
units

A unit is dense if the fraction of total data points contained in the unit
exceeds an input model parameter (the density threshold)

A cluster is a maximal set of connected dense units within a subspace
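A toy sketch of the dense-unit step in one (sub)space (the number of intervals and the density threshold are illustrative parameters, not values from the paper):

```python
import numpy as np
from collections import Counter

def dense_units(points, n_intervals=10, threshold=0.05):
    """Partition each dimension into equal-length intervals and return the grid
    cells whose fraction of all points exceeds the density threshold."""
    points = np.asarray(points, dtype=float)
    mins, maxs = points.min(axis=0), points.max(axis=0)
    widths = (maxs - mins) / n_intervals
    # map each point to the index of its grid cell (clipped so max values stay in range)
    cells = np.minimum(((points - mins) / widths).astype(int), n_intervals - 1)
    counts = Counter(map(tuple, cells))
    return [cell for cell, c in counts.items() if c / len(points) > threshold]

rng = np.random.default_rng(0)
data = np.vstack([rng.normal([2, 2], 0.3, (100, 2)), rng.normal([7, 7], 0.3, (100, 2))])
print(dense_units(data, n_intervals=10, threshold=0.05))
```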
CLIQUE: The Major Steps

Partition the data space and find the number of points that
lie inside each cell of the partition.

Identify the subspaces that contain clusters using the
Apriori principle

Identify clusters:
 Determine dense units in all subspaces of interest
 Determine connected dense units in all subspaces of interest

Generate a minimal description for the clusters:
 Determine maximal regions that cover each cluster of connected
dense units
 Determine the minimal cover for each cluster
=3
30
40
Vacation
20
50
Salary
(10,000)
0 1 2 3 4 5 6 7
30
Vacation
(week)
0 1 2 3 4 5 6 7
age
60
20
30
40
50
age
50
age
60
Strength and Weakness of CLIQUE


Strength
 It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in those
subspaces
 It is insensitive to the order of records in input and does not
presume some canonical data distribution
 It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
Weakness
The accuracy of the clustering result may be degraded, as a
trade-off for the simplicity of the method
Model based clustering
 Assume the data are generated from K probability distributions,
typically Gaussian
 A soft (probabilistic) version of k-means clustering
 Need to find the distribution parameters
 Use the EM Algorithm
EM Algorithm


Initialize K cluster centers
Iterate between two steps

Expectation step: assign points to clusters
$P(d_i \in c_k) = \frac{w_k \Pr(d_i \mid c_k)}{\sum_j w_j \Pr(d_i \mid c_j)}$, with cluster weights $w_k = \frac{\sum_i P(d_i \in c_k)}{N}$

 Maximization step: estimate the model parameters

$\mu_k = \frac{1}{m} \sum_{i=1}^{m} \frac{d_i \, P(d_i \in c_k)}{\sum_j P(d_i \in c_j)}$
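A compact sketch of these two steps for a one-dimensional Gaussian mixture (illustrative assumptions: fixed unit variances, and the usual responsibility-normalized mean update):

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(data, K=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    means = rng.choice(data, size=K, replace=False)      # initialize K cluster centers
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # Expectation step: P(d_i in c_k) proportional to w_k * Pr(d_i | c_k)
        resp = weights * norm.pdf(data[:, None], loc=means, scale=1.0)
        resp /= resp.sum(axis=1, keepdims=True)
        # Maximization step: re-estimate weights and means from the soft assignments
        weights = resp.mean(axis=0)                      # w_k = sum_i P(d_i in c_k) / N
        means = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
    return weights, means

data = np.concatenate([np.random.default_rng(1).normal(0, 1, 200),
                       np.random.default_rng(2).normal(6, 1, 200)])
print(em_gaussian_mixture(data, K=2))
```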