
Chapter 8: Clustering

Searching for groups
- Clustering is unsupervised, or undirected.
- Unlike classification, clustering has no preclassified data.
- Search for groups, or clusters, of data points (records) that are similar to one another.
- Similar points may mean, for example, similar customers or products that will behave in similar ways.

Group similar points together
- Group points into classes using some distance measure:
  - within-cluster distance and between-cluster distance
- Applications:
  - as a stand-alone tool to get insight into the data distribution
  - as a preprocessing step for other algorithms

An Illustration
(Figure omitted.)

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Insurance: identify groups of motor insurance policy holders with some interesting characteristics.
- City planning: identify groups of houses according to their house type, value, and geographical location.

Concepts of Clustering
- Clusters
- Different ways of representing clusters:
  - division with boundaries
  - spheres
  - probabilistic
  - dendrograms
  - …

Probabilistic representation (rows are instances I1 to In, columns are clusters 1 to 3):

        1     2     3
  I1    0.5   0.2   0.3
  I2
  …
  In

Clustering
- Clustering quality:
  - inter-cluster distance: maximized
  - intra-cluster distance: minimized
- The quality of a clustering result depends on both the similarity measure used by the method and its application.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
- Clustering vs. classification: which one is more difficult? Why?
- There are a huge number of clustering techniques.

Dissimilarity/Distance Measure
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: $d(i, j)$.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define “similar enough” or “good enough”. The answer is typically highly subjective.

Types of data in clustering analysis
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

Interval-valued variables
- Continuous measurements on a roughly linear scale, e.g., weight, height, temperature.
- Standardize the data (depending on the application):
  - Calculate the mean absolute deviation

    $$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

    where

    $$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

  - Calculate the standardized measurement (z-score)

    $$z_{if} = \frac{x_{if} - m_f}{s_f}$$

Similarity Between Objects
- Distance: measures the similarity or dissimilarity between two data objects.
- Some popular ones include the Minkowski distance:

  $$d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$

  where $(x_{i1}, x_{i2}, \ldots, x_{ip})$ and $(x_{j1}, x_{j2}, \ldots, x_{jp})$ are two $p$-dimensional data objects, and $q$ is a positive integer.
- If $q = 1$, $d$ is the Manhattan distance:

  $$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

Similarity Between Objects (Cont.)
- If $q = 2$, $d$ is the Euclidean distance:

  $$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

- Properties:
  - $d(i, j) \ge 0$
  - $d(i, i) = 0$
  - $d(i, j) = d(j, i)$
  - $d(i, j) \le d(i, k) + d(k, j)$
- Also, one can use weighted distance, and many other similarity/distance measures.
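The standardization and Minkowski formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `standardize` and `minkowski` and the sample values are assumptions made for the example, not part of the slides.

```python
def standardize(values):
    """Return z-scores of one variable, using the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    # Assumes s_f > 0, i.e. the values are not all identical.
    return [(x - m_f) / s_f for x in values]


def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional objects for a positive integer q."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical measurements of a single variable (e.g. heights in cm).
heights = [150.0, 160.0, 170.0, 180.0]
z_scores = standardize(heights)

# Two hypothetical 2-dimensional objects.
manhattan = minkowski((1, 2), (4, 6), q=1)   # |1-4| + |2-6|
euclidean = minkowski((1, 2), (4, 6), q=2)   # sqrt(3^2 + 4^2)
```

With these sample values the mean is 165 and the mean absolute deviation is 10, so the z-scores come out as -1.5, -0.5, 0.5, and 1.5; the Manhattan and Euclidean distances are 7 and 5.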
Binary Variables
- A contingency table for binary data:

                      Object j
                      1       0       sum
  Object i    1       a       b       a + b
              0       c       d       c + d
              sum     a + c   b + d   p

- Simple matching coefficient (invariant, if the binary variable is symmetric):

  $$d(i, j) = \frac{b + c}{a + b + c + d}$$

- Jaccard coefficient (noninvariant if the binary variable is asymmetric):

  $$d(i, j) = \frac{b + c}{a + b + c}$$

Dissimilarity of Binary Variables
- Example:

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

- Gender is a symmetric attribute (not used below).
- The remaining attributes are asymmetric attributes.
- Let the values Y and P be set to 1, and the value N be set to 0.

  $$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

  $$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

  $$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$

Nominal Variables
- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
- Method 1: simple matching, with $m$: # of matches and $p$: total # of variables:

  $$d(i, j) = \frac{p - m}{p}$$

- Method 2: use a large number of binary variables, creating a new binary variable for each of the $M$ nominal states.

Ordinal Variables
- An ordinal variable can be discrete or continuous.
- Order is important, e.g., rank.
- Can be treated like interval-scaled ($f$ is a variable):
  - replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  - map the range of each variable onto $[0, 1]$ by replacing the $i$-th object in the $f$-th variable by

    $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

  - compute the dissimilarity using methods for interval-scaled variables

Ratio-Scaled Variables
- Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$, e.g., growth of a bacteria population.
- Methods:
  - treat them like interval-scaled variables: not a good idea!
    (Why? The scale can be distorted.)
  - apply a logarithmic transformation $y_{if} = \log(x_{if})$
  - treat them as continuous ordinal data and then treat their ranks as interval-scaled

Variables of Mixed Types
- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.
- One may use a weighted formula to combine their effects:

  $$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

  where $\delta_{ij}^{(f)}$ is the weight of variable $f$ for the pair $(i, j)$.
  - $f$ is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  - $f$ is interval-based: use the normalized distance
  - $f$ is ordinal or ratio-scaled: compute the ranks $r_{if}$ and

    $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

    and treat $z_{if}$ as interval-scaled

Major Clustering Techniques
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Density-based: based on connectivity and density functions.
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model.

Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion:
  - Global optimum: exhaustively enumerate all partitions.
  - Heuristic methods: the k-means and k-medoids algorithms.
    - k-means: each cluster is represented by the center of the cluster.
    - k-medoids or PAM (Partition Around Medoids): each cluster is represented by one of the objects in the cluster.

The K-Means Clustering
Given k, the k-means algorithm is as follows:
1) Choose k cluster centers to coincide with k randomly chosen points.
2) Assign each data point to the closest cluster center.
3) Recompute the cluster centers using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
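The four steps above can be sketched for 1-dimensional data as follows. This is a minimal illustration under stated assumptions: the function name `kmeans_1d` is hypothetical, the initial centers are passed in explicitly instead of being drawn at random (so the run is reproducible), and the stopping rule used is "no reassignment of data points".

```python
def kmeans_1d(points, initial_centers, max_iter=100):
    """Sketch of k-means on 1-D data; k is the number of initial centers."""
    centers = list(initial_centers)   # step 1 (here supplied by the caller)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the index of the closest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            for p in points
        ]
        # Step 4: stop when no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its current members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:               # an empty cluster keeps its old center
                centers[c] = sum(members) / len(members)
    # Squared error of the final clustering.
    squared_error = sum((p - centers[a]) ** 2 for p, a in zip(points, assignment))
    return centers, assignment, squared_error


# Illustrative data: five 1-D points, k = 2, initial centers 5 and 6.
centers, assignment, squared_error = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
```

On this data the sketch stops with centers 1.5 and 6.0, clusters {1, 2} and {5, 6, 7}, and a squared error of 2.5. To follow step 1 literally, the initial centers could instead be drawn with `random.sample(points, k)`.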
Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or minimal decrease in squared error

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

where $p$ is a point and $m_i$ is the mean of cluster $C_i$.

Example
- For simplicity, 1-dimensional data and k = 2.
- Data: 1, 2, 5, 6, 7
- K-means:
  - Randomly select 5 and 6 as initial centroids;
  - => two clusters {1,2,5} and {6,7}; meanC1 = 8/3, meanC2 = 6.5
  - => {1,2}, {5,6,7}; meanC1 = 1.5, meanC2 = 6
  - => no change.
  - Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 1^2 = 2.5

Comments on K-Means
- Strength: efficient, $O(tkn)$, where n is # data points, k is # clusters, and t is # iterations. Normally, $k, t \ll n$.
- Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Weaknesses:
  - Applicable only when the mean is defined; difficult for categorical data.
  - Need to specify k, the number of clusters, in advance.
  - Sensitive to noisy data and outliers.
  - Not suitable for discovering clusters with non-convex shapes.
  - Sensitive to initial seeds.

Variations of the K-Means Method
- A few variants of k-means differ in:
  - selection of the initial k seeds
  - dissimilarity measures
  - strategies to calculate cluster means
- Handling categorical data: k-modes
  - replacing means of clusters with modes
  - using new dissimilarity measures to deal with categorical objects
  - using a frequency-based method to update modes of clusters

k-Medoids clustering method
- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
- Medoid: the most centrally located point in a cluster, used as a representative point of the cluster.
- In contrast, a centroid is not necessarily inside a cluster.
(Figure: an example showing initial medoids.)

Partition Around Medoids
PAM:
1. Given k.
2. Randomly pick k instances as initial medoids.
3. Assign each data point to the nearest medoid x.
4. Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion).
5. Randomly select a point y.
6. Swap x by y if the swap reduces the objective function.
7. Repeat (3-6) until no change.

Comments on PAM
- PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean (why?).
  (Figure: a cluster with an outlier 100 units away.)
- PAM works well for small data sets but does not scale well for large data sets:
  - $O(k(n-k)^2)$ for each change, where n is # of data points and k is # of clusters.

CLARA: Clustering Large Applications
- CLARA: built into statistical analysis packages, such as S+.
- It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output.
- Strength: deals with larger data sets than PAM.
- Weaknesses:
  - Efficiency depends on the sample size.
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
- There are other scale-up methods, e.g., CLARANS.

Hierarchical Clustering
- Use a distance matrix for clustering. This method does not require the number of clusters k as an input, but needs a termination condition.
(Figure: over Steps 0 to 4, agglomerative clustering merges the objects a, b, c, d, e into ab, de, cde, and finally abcde; divisive clustering runs the same steps in reverse, from Step 4 back to Step 0.)

Agglomerative Clustering
- At the beginning, each data point forms a cluster (also called a node).
- Merge nodes/clusters that have the least dissimilarity.
- Go on merging.
- Eventually all nodes belong to the same cluster.
(Figure: three scatter plots, both axes running from 0 to 10, showing successive merges.)

A Dendrogram Shows How the Clusters are Merged Hierarchically
- Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

Divisive Clustering
- Inverse order of agglomerative clustering.
- Eventually each node forms a cluster on its own.
(Figure: three scatter plots, both axes running from 0 to 10, showing successive splits.)

More on Hierarchical Methods
- Major weaknesses of agglomerative clustering methods:
  - do not scale well: time complexity is at least $O(n^2)$, where n is the total number of objects
  - can never undo what was done previously
- Integration of hierarchical with distance-based clustering to scale up these clustering methods:
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  - CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

Summary
- Cluster analysis groups objects based on their similarity and has wide applications.
- Measures of similarity can be computed for various types of data.
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc.
- Clustering can also be used for outlier detection, which is useful for fraud detection.
- What is the best clustering algorithm?
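As a closing illustration of the agglomerative procedure covered in this chapter, here is a minimal sketch for 1-dimensional data. Several choices are assumptions made for the example only: the function name `agglomerative_1d` is hypothetical, the dissimilarity between two clusters is taken to be the smallest distance between their members (single linkage, which the slides do not specify), and the termination condition is a target number of clusters.

```python
def agglomerative_1d(points, num_clusters):
    """Merge the two least-dissimilar clusters until num_clusters remain."""
    clusters = [[p] for p in points]   # at the beginning, each point is a cluster
    merges = []                        # record of (cluster_a, cluster_b, distance)
    while len(clusters) > num_clusters:
        best = None
        # Find the pair of clusters with the least dissimilarity
        # (single linkage is an assumption of this sketch).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters by their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Illustrative data: five 1-D points, stopping when two clusters remain.
clusters, merges = agglomerative_1d([1, 2, 5, 6, 7], num_clusters=2)
```

The `merges` list plays the role of a dendrogram: each entry records which two clusters were joined and at what distance, so stopping at a different `num_clusters` corresponds to cutting the tree at a different level. The pairwise search makes this sketch far too slow for large data sets, in line with the scaling weakness noted above.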
