Hierarchical Clustering
Lecture 9 (10/19/16)
Marina Santini

Acknowledgements
Slides borrowed and adapted from: Data Mining by I. H. Witten, E. Frank and M. A. Hall
Lecture 9: Required Reading
Witten et al. (2011: 273-284)
Outline
• Dendrogram
• Agglomerative clustering
• Distance measures
• Cut-off
Hierarchical clustering
• Recursively splitting clusters produces a hierarchy that can be represented as a dendrogram
  ♦ Could also be represented as a Venn diagram of sets and subsets (without intersections)
  ♦ The height of each node in the dendrogram can be made proportional to the dissimilarity between its children
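As an illustration (not on the original slides), here is a minimal sketch of building and plotting such a dendrogram with SciPy; the data points are invented for the example:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy 2-D instances (hypothetical data)
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

# Build the merge history; the height of each internal node in the
# resulting dendrogram is the dissimilarity at which its children merged
Z = linkage(X, method="complete")

dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.ylabel("dissimilarity")
plt.show()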
Agglomerative clustering
• Bottom-up approach
• Simple algorithm
  ♦ Requires a distance/similarity measure
  ♦ Start by considering each instance to be a cluster
  ♦ Find the two closest clusters and merge them
  ♦ Continue merging until only one cluster is left
  ♦ The record of mergings forms a hierarchical clustering structure – a binary dendrogram
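A from-scratch sketch of this merging loop (illustrative only; using single-linkage with Euclidean distance is my assumption, since the slides do not fix a measure here):

import numpy as np

def agglomerate(X):
    """Merge the two closest clusters until one is left; return the merge record."""
    clusters = {i: [i] for i in range(len(X))}   # start: each instance is a cluster
    merges = []
    while len(clusters) > 1:
        # find the two closest clusters (single linkage: closest pair of members)
        a, b = min(
            ((p, q) for p in clusters for q in clusters if p < q),
            key=lambda pq: min(np.linalg.norm(X[i] - X[j])
                               for i in clusters[pq[0]] for j in clusters[pq[1]]),
        )
        clusters[a] += clusters.pop(b)           # merge b into a
        merges.append((a, b))                    # the record forms the dendrogram
    return merges

X = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.0], [4.2, 3.9]])
print(agglomerate(X))   # [(0, 1), (2, 3), (0, 2)]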
Distance measures
• Single-linkage
  ♦ Minimum distance between the two clusters
  ♦ Distance between the clusters' two closest members
  ♦ Can be sensitive to outliers
• Complete-linkage
  ♦ Maximum distance between the two clusters
  ♦ Two clusters are considered close only if all instances in their union are relatively similar
  ♦ Also sensitive to outliers
  ♦ Seeks compact clusters
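To make the two criteria concrete, a small sketch (the data and function names are my own, for illustration):

import numpy as np

def single_linkage(A, B):
    # distance between the two clusters' closest members
    return min(np.linalg.norm(a - b) for a in A for b in B)

def complete_linkage(A, B):
    # distance between the two clusters' farthest members
    return max(np.linalg.norm(a - b) for a in A for b in B)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [9.0, 0.0]])   # 9.0 acts as an outlier
print(single_linkage(A, B))    # 2.0 -- one close pair makes the clusters "close"
print(complete_linkage(A, B))  # 9.0 -- one far outlier keeps them "far apart"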
Distance measures cont.
• Compromise between the extremes of minimum and maximum distance
• Represent clusters by their centroid, and use the distance between centroids – centroid linkage
  ♦ Works well for instances in multidimensional Euclidean space
  ♦ Not so good if all we have is pairwise similarity between instances
• Calculate the average distance between each pair of members of the two clusters – average-linkage
• Technical deficiency of both: results depend on the numerical scale on which distances are measured
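A sketch contrasting the two (illustrative data, my own helper names):

import numpy as np

def centroid_linkage(A, B):
    # distance between the cluster centroids (needs points in Euclidean space)
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def average_linkage(A, B):
    # average distance over every cross-cluster pair (needs only pairwise distances)
    return np.mean([np.linalg.norm(a - b) for a in A for b in B])

A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [3.0, 2.0]])
print(centroid_linkage(A, B))  # 3.0
print(average_linkage(A, B))   # ~3.30 -- the two criteria generally differ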
More distance measures
• Group-average clustering
  ♦ Uses the average distance between all members of the merged cluster
  ♦ Differs from average-linkage because it includes pairs from the same original cluster
• Ward's clustering method
  ♦ Calculates the increase in the sum of squares of the distances of the instances from the centroid before and after fusing two clusters
  ♦ Minimizes the increase in this squared distance at each clustering step
• All measures will produce the same result if the clusters are compact and well separated
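A sketch of the Ward criterion just described (helper names are my own):

import numpy as np

def sum_of_squares(C):
    # sum of squared distances of the instances from their centroid
    return np.sum((C - C.mean(axis=0)) ** 2)

def ward_increase(A, B):
    # increase in the sum of squares caused by fusing the two clusters;
    # Ward's method merges the pair that minimizes this quantity
    return sum_of_squares(np.vstack([A, B])) - sum_of_squares(A) - sum_of_squares(B)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[5.0, 0.0], [6.0, 0.0]])
print(ward_increase(A, B))   # 25.0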
Example hierarchical clustering
• 50 examples of different creatures from the zoo data
• Complete-linkage
[Figures omitted: dendrogram and polar plot of the clustering]

Example hierarchical clustering 2
• Single-linkage
[Figure omitted]
Incremental clustering
• Heuristic approach (COBWEB/CLASSIT)
• Form a hierarchy of clusters incrementally
• Start:
  ♦ tree consists of an empty root node
• Then:
  ♦ add instances one by one
  ♦ update the tree appropriately at each stage
  ♦ to update, find the right leaf for an instance
  ♦ may involve restructuring the tree
• Base update decisions on category utility
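Category utility is not spelled out on these slides; following Witten et al.'s formulation for nominal attributes, here is a rough sketch (the data layout is my own assumption):

from collections import Counter

def category_utility(clusters):
    """clusters: list of clusters, each a list of {attribute: value} dicts."""
    n = sum(len(c) for c in clusters)
    # squared baseline probabilities P(attribute = value)^2 over all instances
    global_counts = Counter((a, v) for c in clusters for inst in c for a, v in inst.items())
    base = sum((cnt / n) ** 2 for cnt in global_counts.values())
    cu = 0.0
    for c in clusters:
        counts = Counter((a, v) for inst in c for a, v in inst.items())
        within = sum((cnt / len(c)) ** 2 for cnt in counts.values())
        cu += (len(c) / n) * (within - base)    # weighted gain in predictability
    return cu / len(clusters)                   # normalized by the number of clusters

clusters = [
    [{"Outlook": "Sunny", "Windy": "False"}, {"Outlook": "Sunny", "Windy": "True"}],
    [{"Outlook": "Rainy", "Windy": "False"}],
]
print(category_utility(clusters))   # higher values favor this partition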
Clustering weather data

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

[Figure omitted: clustering trees after stages 1, 2 and 3]
Clustering weather data (cont.)
[Figure omitted: clustering trees after stages 4 and 5]
• Stage 4: merge the best host and the runner-up
• Stage 5: consider splitting the best host if merging doesn't help
Final hierarchy
[Figure omitted: final cluster hierarchy for the weather data]
Example: the iris data (subset)
[Figure omitted]
Clustering with cutoff
[Figure omitted: dendrogram cut at a fixed height to yield a flat clustering]
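A sketch of applying such a cut-off with SciPy (the threshold value is arbitrary here):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.3, 0.1], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0]])
Z = linkage(X, method="complete")

# cut the dendrogram at a fixed dissimilarity; every subtree that falls
# below the cut becomes one flat cluster
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)   # e.g. [1 1 2 2 3]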
The end