Hierarchical Clustering
Lecture 9, 10/19/16, Marina Santini
Acknowledgements: slides borrowed and adapted from Data Mining by I. H. Witten, E. Frank and M. A. Hall

Lecture 9: Required Reading
• Witten et al. (2011: 273–284)

Outline
• Dendrogram
• Agglomerative clustering
• Distance measures
• Cut-off

Hierarchical clustering
• Recursively splitting clusters produces a hierarchy that can be represented as a dendrogram
  ♦ Could also be represented as a Venn diagram of sets and subsets (without intersections)
  ♦ The height of each node in the dendrogram can be made proportional to the dissimilarity between its children

Agglomerative clustering
• Bottom-up approach
• Simple algorithm
  ♦ Requires a distance/similarity measure
  ♦ Start by considering each instance to be a cluster
  ♦ Find the two closest clusters and merge them
  ♦ Continue merging until only one cluster is left
  ♦ The record of mergings forms a hierarchical clustering structure: a binary dendrogram

Distance measures
• Single-linkage
  ♦ Minimum distance between the two clusters
  ♦ Distance between the clusters' two closest members
  ♦ Can be sensitive to outliers
• Complete-linkage
  ♦ Maximum distance between the two clusters
  ♦ Two clusters are considered close only if all instances in their union are relatively similar
  ♦ Also sensitive to outliers
  ♦ Seeks compact clusters
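The agglomerative procedure and the single- and complete-linkage measures above can be sketched in code. This is a minimal sketch assuming scipy and invented toy points; it is not the tooling behind the lecture's own figures:

```python
# A sketch of agglomerative clustering with single- vs complete-linkage.
# scipy and the toy points are assumptions, not the lecture's own tooling.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two compact groups plus a "bridge" point lying between them
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # group 1
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],   # group 2
    [0.5, 0.5],                           # bridge point between the groups
])

# Each row of the linkage matrix records one merge:
# (cluster index i, cluster index j, merge distance, size of merged cluster)
Z_single = linkage(X, method="single")      # distance = closest pair of members
Z_complete = linkage(X, method="complete")  # distance = farthest pair of members

# Ask each hierarchy for two flat clusters
labels_single = fcluster(Z_single, t=2, criterion="maxclust")
labels_complete = fcluster(Z_complete, t=2, criterion="maxclust")
print(labels_single, labels_complete)
```

Either linkage matrix can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the binary dendrogram described above; single-linkage scores a candidate merge by the closest pair of members, complete-linkage by the farthest, which is why the latter tends toward compact clusters.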
Distance measures (cont.)
• Compromise between the extremes of minimum and maximum distance:
  ♦ Represent clusters by their centroid, and use the distance between centroids: centroid linkage
    – Works well for instances in multidimensional Euclidean space
    – Not so good if all we have is pairwise similarity between instances
  ♦ Calculate the average distance between each pair of members of the two clusters: average-linkage
• Technical deficiency of both: results depend on the numerical scale on which distances are measured

More distance measures
• Group-average clustering
  ♦ Uses the average distance between all members of the merged cluster
  ♦ Differs from average-linkage because it includes pairs from the same original cluster
• Ward's clustering method
  ♦ Calculates the increase in the sum of squares of the distances of the instances from the centroid before and after fusing two clusters
  ♦ Minimizes the increase in this squared distance at each clustering step
• All measures will produce the same result if the clusters are compact and well separated

Example hierarchical clustering
• 50 examples of different creatures from the zoo data
• Complete-linkage
[Figures: dendrogram and polar plot of the complete-linkage clustering]

Example hierarchical clustering 2
• Single-linkage
[Figures: dendrogram and polar plot of the single-linkage clustering of the same zoo data]

Incremental clustering
• Heuristic approach (COBWEB/CLASSIT)
• Forms a hierarchy of clusters incrementally
• Start:
  ♦ tree consists of an empty root node
• Then:
  ♦ add instances one by one
  ♦ update the tree appropriately at each stage
  ♦ to update, find the right leaf for an instance
  ♦ may involve restructuring the tree
• Base update decisions on category utility

Clustering weather data

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

[Figures: steps 1–3 of incrementally adding the first instances to the tree]

Clustering weather data (cont.)
• Same weather data; steps 4–5:
  ♦ Merge best host and runner-up
  ♦ Consider splitting the best host if merging doesn't help

Final hierarchy
[Figure: the final COBWEB hierarchy for the weather data]

Example: the iris data (subset)
[Figure: hierarchy for a subset of the iris data]

Clustering with cutoff
[Figure: the same hierarchy cut off at a fixed level to give a flat clustering]

The end
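The cut-off on the final slide turns the hierarchy into a flat clustering by slicing the dendrogram at a chosen height. A minimal sketch, assuming scipy and synthetic blob data (the slide's own figure uses the iris subset instead):

```python
# A sketch of the cut-off idea: build a dendrogram with Ward's method, then
# slice it at a fixed dissimilarity to get a flat clustering. scipy and the
# synthetic blobs are assumptions; the slides use the zoo/iris data instead.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three well-separated blobs of 10 points each
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.1, size=(10, 2)),
    rng.normal(loc=(5.0, 0.0), scale=0.1, size=(10, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.1, size=(10, 2)),
])

# Ward's method: each merge minimizes the increase in the within-cluster
# sum of squared distances to the centroid
Z = linkage(X, method="ward")

# Cut the tree at height 2.0: every merge whose dissimilarity exceeds the
# cut-off is undone, leaving one flat cluster per remaining subtree
labels = fcluster(Z, t=2.0, criterion="distance")
print(len(set(labels.tolist())))  # 3: the cut separates the three blobs
```

Lowering the cut-off height splits the data into more, tighter clusters; raising it merges them, until at the root everything is one cluster.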