Clustering
COMP 790-090 Research Seminar
BCB 713 Module
Spring 2011
Wei Wang
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Hierarchical Clustering
Group data objects into a tree of clusters

[Figure: objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds over steps 0-4, merging a+b into ab, d+e into de, then c+de into cde, then ab+cde into abcde. Divisive clustering (DIANA) traverses the same hierarchy in reverse, from step 4 back to step 0.]
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
AGNES (Agglomerative Nesting)
Initially, each object is a cluster
Step-by-step cluster merging, until all objects form a cluster
Single-link approach
Each cluster is represented by all of the objects in the cluster
The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
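The single-link merging rule can be sketched in a few lines of Python; the points and names below are illustrative, not from the slides:

```python
from math import dist

# Illustrative 2-D points; initially each is its own cluster (AGNES).
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.0, 5.3)]
clusters = [[p] for p in points]

def single_link(a, b):
    """Single-link distance: distance of the closest pair across two clusters."""
    return min(dist(p, q) for p in a for q in b)

merges = []  # record the distance at which each merge happens
while len(clusters) > 1:
    # Find the closest pair of clusters and merge them.
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    merges.append(single_link(clusters[i], clusters[j]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print(merges)  # single-link merge distances are non-decreasing
```

The quadratic pairwise search here is the O(n²) bottleneck mentioned later in the slides.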
Dendrogram
Shows how to merge clusters hierarchically
Decomposes data objects into a multi-level nested partitioning (a tree of clusters)
A clustering of the data objects: cutting the dendrogram at the desired level
Each connected component forms a cluster
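For single-link clustering, cutting the dendrogram at a given height is equivalent to taking the connected components of the graph that joins points closer than that height. A minimal sketch with made-up points:

```python
from math import dist

points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2)]
threshold = 1.0  # cut level: pairs closer than this end up in one cluster

# Union-find over point indices to track connected components.
parent = list(range(len(points)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

for i in range(len(points)):
    for j in range(i + 1, len(points)):
        if dist(points[i], points[j]) < threshold:
            parent[find(i)] = find(j)  # merge the two components

labels = [find(i) for i in range(len(points))]
n_clusters = len(set(labels))
print(n_clusters)  # → 2
```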
DIANA (DIvisive ANAlysis)
Initially, all objects are in one cluster
Step-by-step splitting of clusters until each cluster contains only one object
[Figure: three scatter plots on 0-10 axes showing DIANA successively splitting the single starting cluster into smaller clusters.]
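One DIANA split step can be sketched as follows: the object with the largest average dissimilarity to the rest seeds a "splinter" group, and objects keep defecting to it while they are, on average, closer to the splinter than to what remains. This is a simplified single split, not the complete DIANA algorithm, and the data is illustrative:

```python
from math import dist
from statistics import mean

# Illustrative points: two visible groups inside one starting cluster.
cluster = [(0.0, 0.0), (0.3, 0.2), (0.1, 0.4), (6.0, 6.0), (6.2, 5.9)]

def avg_dist(p, group):
    return mean(dist(p, q) for q in group)

# Seed the splinter group with the most "dissatisfied" object.
seed = max(cluster, key=lambda p: avg_dist(p, [q for q in cluster if q != p]))
splinter, rest = [seed], [p for p in cluster if p != seed]

# Move objects while they are closer (on average) to the splinter group.
moved = True
while moved and len(rest) > 1:
    moved = False
    for p in list(rest):
        others = [q for q in rest if q != p]
        if avg_dist(p, splinter) < avg_dist(p, others):
            rest.remove(p)
            splinter.append(p)
            moved = True

print(sorted(splinter), sorted(rest))
```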
Distance Measures
Minimum distance: $d_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$
Maximum distance: $d_{max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$
Mean distance: $d_{mean}(C_i, C_j) = d(m_i, m_j)$
Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$
m: mean of a cluster
C: a cluster
n: the number of objects in a cluster
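The four measures can be checked on a small example (the two clusters below are illustrative):

```python
from math import dist
from statistics import mean

# Two small illustrative clusters on a line, for easy hand-checking.
Ci = [(0.0, 0.0), (1.0, 0.0)]
Cj = [(4.0, 0.0), (6.0, 0.0)]

pairs = [dist(p, q) for p in Ci for q in Cj]
d_min = min(pairs)                        # closest pair: (1,0)-(4,0) = 3
d_max = max(pairs)                        # farthest pair: (0,0)-(6,0) = 6
d_avg = sum(pairs) / (len(Ci) * len(Cj))  # mean over all n_i * n_j pairs

mi = tuple(map(mean, zip(*Ci)))           # cluster mean (0.5, 0.0)
mj = tuple(map(mean, zip(*Cj)))           # cluster mean (5.0, 0.0)
d_mean = dist(mi, mj)                     # distance between the two means

print(d_min, d_max, d_mean, d_avg)
```

Here d_mean and d_avg happen to coincide; in general they differ.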
Challenges of Hierarchical Clustering Methods
Hard to choose merge/split points
Never undo merging/splitting
Merging/splitting decisions are critical
Do not scale well: O(n²)
What is the bottleneck when the data can't fit in memory?
Integrating hierarchical clustering with other techniques
BIRCH, CURE, CHAMELEON, ROCK
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
CF (Clustering Feature) tree: a hierarchical data structure summarizing object info
Clustering objects → clustering leaf nodes of the CF tree
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: $\sum_{i=1}^{N} X_i$
SS: $\sum_{i=1}^{N} X_i^2$
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))

[Figure: scatter plot of the five points on 0-10 axes.]
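The slide's example CF can be reproduced directly from the definitions:

```python
# Points from the slide's example.
points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]

N = len(points)
LS = tuple(sum(p[d] for p in points) for d in range(2))       # linear sum per dimension
SS = tuple(sum(p[d] ** 2 for p in points) for d in range(2))  # square sum per dimension

CF = (N, LS, SS)
print(CF)  # → (5, (16, 30), (54, 190)), matching the slide
```

Note the slide keeps SS per dimension; some presentations of BIRCH instead store a single scalar sum of squared norms.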
CF-tree in BIRCH
Clustering feature:
Summarize the statistics for a subcluster: the 0th, 1st and 2nd moments of the subcluster
Register crucial measurements for computing clusters and utilize storage efficiently
A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
A nonleaf node in a tree has descendants or "children"
The nonleaf nodes store sums of the CFs of children
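The "sums of the CFs of children" property (CF additivity) is what makes the tree cheap to maintain: merging two subclusters is a componentwise sum, with no need to revisit the raw points. A sketch, reusing the five example points from the earlier slide:

```python
# CF additivity: a parent entry is the componentwise sum of its children's CFs.
def merge_cf(cf1, cf2):
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (
        n1 + n2,
        tuple(a + b for a, b in zip(ls1, ls2)),
        tuple(a + b for a, b in zip(ss1, ss2)),
    )

cf_a = (2, (5, 10), (13, 52))    # CF of points (3,4) and (2,6)
cf_b = (3, (11, 20), (41, 138))  # CF of points (4,5), (4,7), (3,8)
print(merge_cf(cf_a, cf_b))  # → (5, (16, 30), (54, 190))
```

The centroid of any subcluster is also recoverable from its CF alone, as LS / N.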
CF Tree

[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries CF1 ... CF6, each with a child pointer; leaf nodes hold CF entries and are chained together by prev/next pointers.]
Parameters of a CF-tree
Branching factor: the maximum number of children
Threshold: max diameter of sub-clusters stored at the leaf nodes
BIRCH Clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Pros & Cons of BIRCH
Linear scalability
Good clustering with a single scan
Quality can be further improved by a few additional scans
Can handle only numeric data
Sensitive to the order of the data records
Drawbacks of Square Error Based Methods
One representative per cluster
Good only for convex-shaped clusters having similar size and density
A number-of-clusters parameter k
Good only if k can be reasonably estimated
CURE: the Ideas
Each cluster has c representatives
Choose c well-scattered points in the cluster
Shrink them towards the mean of the cluster by a fraction α
The representatives capture the physical shape and geometry of the cluster
Merge the closest two clusters
Distance of two clusters: the distance between the two closest representatives
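The shrink-toward-the-mean step can be sketched as follows; the cluster, the choice of representatives, and the value of α are all illustrative:

```python
from statistics import mean

# Illustrative cluster and c = 3 well-scattered representatives chosen from it.
cluster = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0), (2.0, 1.0)]
reps = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
alpha = 0.5  # shrink fraction

m = tuple(map(mean, zip(*cluster)))  # cluster mean, here (2.0, 1.0)

# Move each representative a fraction alpha of the way towards the mean.
shrunk = [tuple(r_d + alpha * (m_d - r_d) for r_d, m_d in zip(r, m)) for r in reps]
print(shrunk)  # → [(1.0, 0.5), (3.0, 0.5), (2.0, 2.0)]
```

Shrinking pulls outlying representatives inward, which dampens the effect of outliers while the scattered points still trace the cluster's shape.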