Cluster Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 8, Introduction to Data Mining, by Tan, Steinbach, Kumar

What is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups:
– intra-cluster distances are minimized
– inter-cluster distances are maximized

Examples of Clustering Applications
– Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
– Land use: identification of areas of similar land use in an earth observation database
– Insurance: identifying groups of motor insurance policy holders with a high average claim cost
– City planning: identifying groups of houses according to their house type, value, and geographical location
– Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Applications of Cluster Analysis
– Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations. For example, clustering daily stock-price movements recovers recognizable industry groups:

  Cluster 1 (Industry group: Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
  Cluster 2 (Industry group: Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
  Cluster 3 (Industry group: Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
  Cluster 4 (Industry group: Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

– Summarization: reduce the size of large data sets (e.g., clustering precipitation measurements in Australia)

Requirements of Clustering in Data Mining
– Scalability
– Ability to deal with different types of attributes
– Ability to handle dynamic data
– Discovery of clusters with arbitrary shape
– Minimal requirements for domain knowledge to determine input parameters
– Ability to deal with noise and outliers
– Insensitivity to the order of input records
– Ability to handle high dimensionality
– Incorporation of user-specified constraints
– Interpretability and usability

Measure the Quality of Clustering
– Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
– There is a separate "quality" function that measures the "goodness" of a cluster
– The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
– Weights should be associated with different variables based on applications and data semantics
– It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective

Data Structures
– Data matrix (two modes): n objects described by p variables,
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
– Dissimilarity matrix (one mode): pairwise distances,
$$\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

Type of Data in Clustering Analysis
– Interval-scaled variables
– Binary variables
– Nominal, ordinal, and ratio variables
– Variables of mixed types

Interval-valued Variables
Standardize the data:
– Calculate the mean absolute deviation
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right),$$
where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$.
– Calculate the standardized measurement (z-score)
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
– Using the mean absolute deviation is more robust than using the standard deviation

Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects. A popular choice is the Minkowski distance:
$$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q},$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects and q is a positive integer.
– If q = 1, d is the Manhattan distance:
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
– If q = 2, d is the Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
– Properties:
  d(i, j) ≥ 0
  d(i, i) = 0
  d(i, j) = d(j, i)
  d(i, j) ≤ d(i, k) + d(k, j)
– One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures
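As a rough illustration (not part of the lecture notes), here is a minimal Python sketch of the standardization and Minkowski-distance formulas above; the function names and the toy data are my own.

```python
# A minimal sketch of z-score standardization (using the mean absolute
# deviation) and the Minkowski distance; names and data are illustrative.
import numpy as np

def standardize(X):
    """z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation."""
    m = X.mean(axis=0)                 # column means m_f
    s = np.abs(X - m).mean(axis=0)     # mean absolute deviations s_f
    return (X - m) / s

def minkowski(xi, xj, q=2):
    """Minkowski distance: q = 1 is Manhattan, q = 2 is Euclidean."""
    return (np.abs(xi - xj) ** q).sum() ** (1.0 / q)

X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
Z = standardize(X)
print(minkowski(Z[0], Z[1], q=1))  # Manhattan distance, standardized space
print(minkowski(Z[0], Z[1], q=2))  # Euclidean distance, standardized space
```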
Binary Variables
A contingency table for binary data, counting attribute agreement between objects i and j:

              Object j
               1     0    sum
  Object i 1   a     b    a+b
           0   c     d    c+d
         sum  a+c   b+d    p

– Distance measure for symmetric binary variables:
$$d(i,j) = \frac{b + c}{a + b + c + d}$$
– Distance measure for asymmetric binary variables:
$$d(i,j) = \frac{b + c}{a + b + c}$$
– Jaccard coefficient (a similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i,j) = \frac{a}{a + b + c}$$

Dissimilarity Between Binary Variables
Example:

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack    M       Y      N      P       N       N       N
  Mary    F       Y      N      P       N       P       N
  Jim     M       Y      P      N       N       N       N

– Gender is a symmetric attribute; the remaining attributes are asymmetric binary
– Let the values Y and P be set to 1, and the value N be set to 0
$$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
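To make the worked example concrete, here is a small Python check (my own, not from the slides) of the asymmetric binary dissimilarity; the encoding Y/P as 1 and N as 0 follows the slide.

```python
# Verifies the three dissimilarities computed above; negative matches (d)
# are ignored, as the asymmetric binary distance requires.
def asym_binary_d(u, v):
    """(b + c) / (a + b + c) over two 0/1 vectors."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P = 1 and N = 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(asym_binary_d(jack, mary))  # ~0.33, matches the slide
print(asym_binary_d(jack, jim))   # ~0.67
print(asym_binary_d(jim, mary))   # 0.75
```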
Nominal Variables
A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green.
– Method 1: simple matching, where m is the number of matches and p is the total number of variables:
$$d(i,j) = \frac{p - m}{p}$$
– Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states

Ordinal Variables
An ordinal variable can be discrete or continuous; order is important (e.g., rank). It can be treated like an interval-scaled variable:
– replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
– compute the dissimilarity using methods for interval-scaled variables

Ratio-Scaled Variables
A ratio-scaled variable is a positive measurement on a nonlinear scale, approximately exponential, such as $Ae^{Bt}$ or $Ae^{-Bt}$. Methods:
– treat them like interval-scaled variables (not a good choice: the scale can be distorted)
– apply a logarithmic transformation, $y_{if} = \log(x_{if})$
– treat them as continuous ordinal data and treat their rank as interval-scaled

Variables of Mixed Types
A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio. One may use a weighted formula to combine their effects:
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
– if f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise
– if f is interval-based: use the normalized distance
– if f is ordinal or ratio-scaled: compute the ranks $r_{if}$, set $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled

Vector Objects
– Vector objects: keywords in documents, gene features in micro-arrays, etc.
– Broad applications: information retrieval, biological taxonomy, etc.
– Cosine measure: $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
– A variant, the Tanimoto coefficient: $T(d_1, d_2) = \frac{d_1 \cdot d_2}{d_1 \cdot d_1 + d_2 \cdot d_2 - d_1 \cdot d_2}$
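A minimal sketch of the cosine and Tanimoto measures, assuming dense, nonzero term-frequency vectors; the vectors here are made-up examples, not from the notes.

```python
# Cosine similarity and the Tanimoto variant for two document vectors.
import numpy as np

def cosine_sim(d1, d2):
    """Dot product normalized by the vector lengths."""
    return d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))

def tanimoto(d1, d2):
    """Tanimoto coefficient: d1.d2 / (d1.d1 + d2.d2 - d1.d2)."""
    return d1 @ d2 / (d1 @ d1 + d2 @ d2 - d1 @ d2)

d1 = np.array([5.0, 0.0, 3.0, 0.0, 2.0])  # term frequencies, document 1
d2 = np.array([3.0, 0.0, 2.0, 0.0, 1.0])  # term frequencies, document 2
print(cosine_sim(d1, d2))
print(tanimoto(d1, d2))
```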
Types of Clusterings
A clustering is a set of clusters. An important distinction is between hierarchical and partitional sets of clusters:
– Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
– Hierarchical clustering: a set of nested clusters organized as a hierarchical tree

Partitional Clustering
[Figure: original points and a partitional clustering of them.]

Hierarchical Clustering
[Figure: a traditional hierarchical clustering of points p1-p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram.]

Types of Clusters
– Well-separated clusters
– Center-based clusters
– Contiguous clusters
– Density-based clusters
– Property or conceptual clusters
– Clusters described by an objective function

Types of Clusters: Well-Separated
– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
[Figure: 3 well-separated clusters.]

Types of Clusters: Center-Based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster.
– The center of a cluster is often a centroid (the average of all the points in the cluster) or a medoid (the most "representative" point of the cluster).
[Figure: 4 center-based clusters.]

Types of Clusters: Contiguity-Based
– Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
[Figure: 8 contiguous clusters.]

Types of Clusters: Density-Based
– A cluster is a dense region of points that is separated from other regions of high density by low-density regions.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.
[Figure: 6 density-based clusters.]

Types of Clusters: Conceptual Clusters
– Shared property or conceptual clusters: finds clusters that share some common property or represent a particular concept.
[Figure: 2 overlapping circles.]

Clustering Algorithms
– K-means and its variants
– Hierarchical clustering
– Density-based clustering

K-means Clustering
– Partitional clustering approach
– Each cluster is associated with a centroid (center point)
– Each point is assigned to the cluster with the closest centroid
– The number of clusters, K, must be specified
– The basic algorithm is very simple

Comments on the K-Means Method
– Strength: relatively efficient, O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. Compare PAM, O(k(n − k)²), and CLARA, O(ks² + k(n − k)).
– Comment: it often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
– Weaknesses:
  – applicable only when a mean is defined (what about categorical data?)
  – the number of clusters, k, must be specified in advance
  – unable to handle noisy data and outliers
  – not suitable for discovering clusters with non-convex shapes

Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE): for each point, the error is the distance to the nearest cluster center; to get the SSE, we square these errors and sum them:
$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$$
– x is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$; one can show that $m_i$ corresponds to the center (mean) of the cluster
– Given two clusterings, we can choose the one with the smaller error
– One easy way to reduce the SSE is to increase K, the number of clusters; yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K
A sketch of the basic algorithm and the SSE follows below.
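As referenced above, here is a minimal sketch of the basic K-means loop and the SSE measure; the random initialization, iteration cap, and empty-cluster handling are my choices, not prescribed by the notes.

```python
# Basic K-means: assign points to the closest centroid, recompute
# centroids as cluster means, repeat until the centroids stop moving.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # distances from every point to every centroid, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid; keep the old one if its cluster is empty
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

def sse(X, centroids, labels):
    """Sum of squared distances of points to their cluster centers."""
    return sum(((X[labels == i] - c) ** 2).sum()
               for i, c in enumerate(centroids))
```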
Limitations of K-means
K-means has problems when clusters are of differing sizes, differing densities, or non-globular shapes, and when the data contains outliers.
[Figures: original points vs. K-means (3 clusters) for differing sizes; original points vs. K-means (3 clusters) for differing density; original points vs. K-means (2 clusters) for non-globular shapes.]

Hierarchical Clustering
– Produces a set of nested clusters organized as a hierarchical tree
– Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits
[Figure: a clustering of six points and the corresponding dendrogram, with merge heights between 0.05 and 0.2.]

Strengths of Hierarchical Clustering
– No particular number of clusters has to be assumed: any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level
– The clusters may correspond to meaningful taxonomies, e.g., in the biological sciences (animal kingdom, phylogeny reconstruction, ...)

Hierarchical Clustering: Two Main Types
– Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) remains
– Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters)
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time. The method does not require the number of clusters k as an input, but it needs a termination condition.
[Figure: agglomerative (AGNES) merges clusters a, b, c, d, e step by step into abcde; divisive (DIANA) runs the same steps in reverse.]

Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms.

How to Define Inter-Cluster Similarity
Given the proximity matrix over points p1, ..., p5, the similarity between two clusters can be defined as:
– MIN (single link)
– MAX (complete link)
– group average
– distance between centroids
– other methods driven by an objective function (e.g., Ward's method uses squared error)
A single-link sketch follows below.
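As referenced above, a naive sketch of the agglomerative algorithm using MIN (single link) as the inter-cluster proximity; this is an O(N³)-style illustration of the idea, not an optimized implementation.

```python
# Agglomerative clustering with single-link (MIN) proximity: repeatedly
# merge the two clusters whose closest pair of points is nearest.
import numpy as np

def single_link(D, k=1):
    """Merge clusters until k remain, given an N x N distance matrix D."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # MIN: distance between the closest pair of points
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)   # merge the two closest clusters
    return clusters

D = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]
print(single_link(D, k=2))  # [[0, 1], [2, 3]]
```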
Hierarchical Clustering: Group Average
A compromise between single and complete link.
– Strength: less susceptible to noise and outliers
– Limitation: biased towards globular clusters

Hierarchical Clustering: Time and Space Requirements
– O(N²) space, since it uses the proximity matrix (N is the number of points)
– O(N³) time in many cases: there are N steps, and at each step the proximity matrix, of size O(N²), must be updated and searched
– The complexity can be reduced to O(N² log N) time for some approaches

Hierarchical Clustering: Problems and Limitations
– Once a decision is made to combine two clusters, it cannot be undone
– No objective function is directly minimized
– Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling clusters of different sizes and convex shapes; breaking large clusters

Cluster Validity
– For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall
– For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters
– But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
  – to avoid finding patterns in noise
  – to compare clustering algorithms
  – to compare two sets of clusters
  – to compare two clusters

Quality: What Is Good Clustering?
– A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity
– The quality of a clustering result depends on both the similarity measure used by the method and its implementation
– The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Internal Measures: Cohesion and Separation
– Cluster cohesion measures how closely related the objects in a cluster are (example: SSE)
– Cluster separation measures how distinct or well-separated a cluster is from the other clusters
Example, using squared error:
– Cohesion is measured by the within-cluster sum of squares (SSE):
$$WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$$
– Separation is measured by the between-cluster sum of squares:
$$BSS = \sum_i |C_i| \, (m - m_i)^2$$
where $|C_i|$ is the size of cluster i, $m_i$ is its mean, and m is the overall mean of the data.
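A small sketch of WSS and BSS as defined above; as a sanity check (a standard fact for squared Euclidean error, not stated in the notes), WSS + BSS equals the total sum of squares about the overall mean.

```python
# Cohesion (WSS) and separation (BSS) for a labeled data set.
import numpy as np

def wss_bss(X, labels):
    m = X.mean(axis=0)                          # overall mean m
    wss = bss = 0.0
    for i in np.unique(labels):
        Ci = X[labels == i]
        mi = Ci.mean(axis=0)                    # cluster mean m_i
        wss += ((Ci - mi) ** 2).sum()           # within-cluster squares
        bss += len(Ci) * ((m - mi) ** 2).sum()  # |C_i| * dist^2(m, m_i)
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])
labels = np.array([0, 0, 1, 1])
wss, bss = wss_bss(X, labels)
print(wss, bss, wss + bss)  # 1.0 + 9.0 = 10.0, the total SS about the mean
```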
Internal Measures: Cohesion and Separation (Graph-Based View)
A proximity-graph-based approach can also be used for cohesion and separation:
– cluster cohesion is the sum of the weights of all links within a cluster
– cluster separation is the sum of the weights of the links between nodes in the cluster and nodes outside the cluster
[Figure: a small graph illustrating cohesion (intra-cluster links) and separation (inter-cluster links).]
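A sketch of this graph-based view, assuming a symmetric weight matrix W; halving the intra-cluster sum so that each link is counted once is my convention, and the graph is a made-up example.

```python
# Cohesion and separation of one cluster in a weighted proximity graph.
import numpy as np

def graph_cohesion_separation(W, labels, cluster):
    """W: symmetric N x N link-weight matrix (assumed); labels: cluster ids."""
    inside = labels == cluster
    cohesion = W[np.ix_(inside, inside)].sum() / 2   # each intra link once
    separation = W[np.ix_(inside, ~inside)].sum()    # links crossing out
    return cohesion, separation

# 4-node example: nodes 0, 1 in cluster 0; nodes 2, 3 in cluster 1.
W = np.array([[0, 5, 1, 0],
              [5, 0, 0, 2],
              [1, 0, 0, 4],
              [0, 2, 4, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])
print(graph_cohesion_separation(W, labels, 0))  # cohesion 5.0, separation 3.0
```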