Chapter 10. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree.
• Can be visualized as a dendrogram
  – Tree-like diagram
  – Records the sequences of merges or splits
Dendrograms
• Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
• Each level of the tree shows the clusters for that level.
  – Leaf: individual clusters
  – Root: one cluster
• A cluster at level i is the union of its children clusters at level i+1.
• Each level is typically associated with a distance threshold: sub-clusters of the clusters at that level were combined because the distance between them was less than the distance threshold.
Dendrograms
[Figure: example dendrograms]
Inter-cluster distance (similarity) metrics
• Many hierarchical clustering algorithms require that the distance (or inter-cluster similarity) between clusters be determined.
• Let Ki, Kj be two clusters, i ∈ Ki and j ∈ Kj. There is a variety of distance metrics to calculate the distance between clusters:
  – Single link: smallest distance between two points, one in Ki and the other in Kj: dist(Ki,Kj) = min(dist(i, j)).
  – Complete link: largest distance between two points, one in Ki and the other in Kj: dist(Ki,Kj) = max(dist(i, j)).
  – Average link: average distance between two points, one in Ki and the other in Kj: dist(Ki,Kj) = avg(dist(i, j)).
  – Centroid: distance between the centroids: dist(Ki,Kj) = dist(CKi, CKj).
  – Medoid: distance between the medoids: dist(Ki,Kj) = dist(MKi, MKj).
Inter-cluster distance metrics (cont.)
• Single link: Calculate the smallest distance between an element in one cluster and an element in the other cluster.
• Complete link (farthest neighbor): Calculate the largest distance between an element in one cluster and an element in the other cluster.
• Average link: Calculate the average distance between each element in one cluster and all elements in the other cluster.
• Centroid: Calculate the distance between cluster centroids.
• Medoid: Calculate the distance between cluster medoids.
[Figure: illustrations of the single-link, complete-link, centroid, and medoid distances between two clusters]
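
The linkage rules above map directly onto code. The sketch below (not from the slides; it assumes Euclidean distance and clusters given as NumPy point arrays) computes each of the five inter-cluster distances:

import numpy as np

def pairwise_dists(A, B):
    # All Euclidean distances between points of cluster A (n x d) and cluster B (m x d)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):
    return pairwise_dists(A, B).min()    # smallest pairwise distance

def complete_link(A, B):
    return pairwise_dists(A, B).max()    # largest pairwise distance

def average_link(A, B):
    return pairwise_dists(A, B).mean()   # average pairwise distance

def centroid_dist(A, B):
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def medoid_dist(A, B):
    # medoid: the cluster point with the smallest total distance to the rest of its cluster
    def medoid(X):
        return X[np.linalg.norm(X[:, None] - X[None, :], axis=2).sum(axis=1).argmin()]
    return np.linalg.norm(medoid(A) - medoid(B))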
Hierarchical Clustering
Two main types of algorithms:
– Agglomerative (bottom-up merging)
• Start with the points as individual clusters
• Merge clusters until only one is left
– Divisive (top-down splitting)
• Start with all the points as one cluster
• Split clusters until only singleton clusters remain
– Agglomerative is more popular
• Traditional hierarchical algorithms use a similarity or distance matrix, and merge or split one cluster at a time
Agglomerative Clustering (Bottom-Up Merging)
• Bottom-up merging techniques are called agglomerative as they agglomerate (merge) smaller clusters into bigger ones.
• Typically, clusters are merged if the distance metric between the two clusters is less than a certain threshold.
• The threshold increases at each stage of merging.
• A variety of distance metrics can be used to calculate the distance (inter-cluster similarity) between clusters.
Agglomerative Clustering (Bottom-Up Merging)
The sequence of figures below shows how a bottom-up merging technique may proceed. Each example starts off in its own cluster. At each stage, we use the distance threshold to decide which clusters to merge, until eventually we have only one super-cluster. (We may also stop at some other termination condition, e.g. when we have fewer than a certain number of super-clusters.)
[Figure: five examples A–E merged in stages — Stage 1: 5 clusters; Stage 2: 3 clusters; Stage 3: 1 cluster]
Source: Alan Abrahams
Divisive Clustering (Top-down Splitting)
The sequence of figures below shows how a top-down splitting technique may proceed. We start off with one super-cluster. At each stage, we decide where to split, until eventually we have reasonable sub-clusters.
[Figure: one super-cluster of examples A–E split in stages — Stage 1: 1 cluster; Stage 2: 2 clusters; Stage 3: 3 clusters]
Source: Alan Abrahams
Agglomerative Algorithms
• Agglomerative algorithms start with each individual item in its own cluster and iteratively merge clusters until all items belong to one cluster.
• The key operation is the computation of the proximity of two clusters.
• Different approaches to defining the distance between clusters distinguish the different algorithms:
  – Single link
  – Complete link (farthest neighbor)
  – Average link
  – Centroid
  – Medoid
Agglomerative Algorithms
• In the algorithm given on the next slide:
  – d: the threshold distance
  – k: the number of clusters
  – K: the set of clusters
• The procedure NewClusters determines how to form the next-level clusters from the previous level. This is where the different types of agglomerative algorithms differ.
• Different approaches to defining the distance between clusters result in the different algorithms:
  – Single link
  – Complete link (farthest neighbor)
  – Average link
  – Centroid
  – Medoid
Agglomerative Algorithm
[Figure: agglomerative algorithm pseudocode, © Prentice Hall]
Basic agglomerative algorithm using the single-link (min) technique
• View all items with links (distances) between them.
• Find maximal connected components in this graph. (A connected component is a graph in which there exists a path between every pair of vertices.)
• Two clusters are merged if there is at least one edge which connects them, and the edge distance is less than or equal to the threshold distance at that level (see the sketch after this list).
• The similarity of two clusters is based on the two closest points (nearest neighbours) in the different clusters (the single-link distance metric).
• It is also called nearest-neighbour clustering.
• Can handle non-elliptical shapes.
• Sensitive to noise and outliers.
• Could be agglomerative or divisive.
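
The connected-component view above can be written as a short sketch (not from the slides): for each threshold level, a union-find pass over all edges whose distance is within the threshold yields the connected components, i.e. the single-link clusters at that level.

import numpy as np

def single_link_levels(D, thresholds):
    # D: symmetric distance matrix; returns, per threshold, a cluster label for each item
    n = len(D)
    levels = {}
    for t in thresholds:
        parent = list(range(n))                 # union-find over the threshold graph
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path compression
                i = parent[i]
            return i
        for i in range(n):
            for j in range(i + 1, n):
                if D[i][j] <= t:                # edge within the threshold: merge components
                    parent[find(i)] = find(j)
        levels[t] = [find(i) for i in range(n)]
    return levels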
Agglomerative Example
Distance matrix for five items A–E:

      A  B  C  D  E
  A   0  1  2  2  3
  B   1  0  2  4  3
  C   2  2  0  1  5
  D   2  4  1  0  3
  E   3  3  5  3  0

[Figure: graph of A–E with the above edge distances, and the resulting dendrogram with merge thresholds 1 to 5]
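
Running a standard single-link implementation on this matrix reproduces the merges of the dendrogram above. The sketch below uses SciPy (not part of the slides) and reads off the clusters at thresholds 1, 2, and 3:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

Z = linkage(squareform(D), method="single")      # single-link merge tree
print(fcluster(Z, t=1, criterion="distance"))    # threshold 1: {A,B}, {C,D}, {E}
print(fcluster(Z, t=2, criterion="distance"))    # threshold 2: {A,B,C,D}, {E}
print(fcluster(Z, t=3, criterion="distance"))    # threshold 3: one cluster {A,B,C,D,E}

scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding dendrogram.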
Basic agglomerative algorithms
• Agglomerative algorithm using the complete-link (max) technique
  – Merged if the maximum distance is less than or equal to the distance threshold.
  – Clusters found using the complete link tend to be more compact than with the single-link technique.
  – Tends to break large clusters.
  – Less susceptible to noise and outliers.
• Agglomerative algorithm using the average-link technique
  – Merged if the average distance is less than or equal to the distance threshold.
  – Compromise between single and complete link.
  – Need to use average connectivity for scalability, since total connectivity favors large clusters.
  – Less susceptible to noise and outliers.
Agglomerative Clustering
[Figure: agglomerative clustering illustration]
MST agglomerative algorithm for single-link clustering
• The Minimum Spanning Tree (MST) is the tree that contains all vertices of the graph (every vertex has an adjacent edge in the tree) and whose sum of edge weights is minimal.
• In clustering, an edge weight represents the distance between two points.
1. Start with the clustering C0 = { (x1), (x2), …, (xn) }. Find the MST on G(1).
2. Repeat steps 3–4 until all objects are in a single cluster.
3. Merge the two clusters connected by the MST edge with the smallest weight. Update the clustering.
4. Update the weight of the edge selected in step 3 to ∞, so that it is not selected again.
Source: Songrit Maneewongvatana
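
A minimal sketch of this idea (using SciPy's minimum spanning tree routine; not the slides' own code): merging the MST edges in increasing weight order reproduces the single-link merge sequence.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_single_link(D):
    # D: symmetric distance matrix (nonzero off-diagonal entries)
    mst = minimum_spanning_tree(np.asarray(D, dtype=float)).tocoo()
    edges = sorted(zip(mst.data, mst.row, mst.col))   # MST edges, smallest weight first
    label = list(range(len(D)))                       # current cluster id of each point
    merges = []
    for w, i, j in edges:
        ci, cj = label[i], label[j]
        merges.append((ci, cj, w))                    # merge the two clusters joined by this edge
        label = [cj if c == ci else c for c in label]
    return merges                                     # sequence of (cluster, cluster, distance) merges

Applied to the A–E distance matrix of the earlier example, mst_single_link(D) should give the same merges as the single-link dendrogram (up to ties).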
MST for single link: example
[Figure: five points x1–x5 with MST edge weights 4.2, 2.6, 1.7 and 1.9; merging the MST edges in increasing weight order gives the agglomerative sequence, and cutting them in decreasing order gives the divisive sequence. Proximity dendrograms are shown with a distance axis from 0 to 8.]
Hierarchical Clustering (AGNES & DIANA)
• Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
[Figure: five objects a–e. AGNES (agglomerative) proceeds left to right through steps 0–4, merging a and b into ab, d and e into de, then cde, and finally abcde. DIANA (divisive) follows the same steps in reverse order, from step 4 back to step 0.]
AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., Splus
• Uses the single-link method and the dissimilarity matrix
• Merges nodes that have the least dissimilarity
• Goes on in a non-descending fashion
• Eventually all nodes belong to the same cluster
[Figure: AGNES applied to a 2-D data set, shown at three successive merge stages on 10×10 axes]
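
For reference (not part of the slides), scikit-learn's AgglomerativeClustering performs the same AGNES-style bottom-up merging; linkage="single" selects the single-link rule:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.0, 5.5], [9.0, 9.0]])  # toy 2-D points
model = AgglomerativeClustering(n_clusters=2, linkage="single")             # stop when 2 clusters remain
print(model.fit_predict(X))   # cluster label for each point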
DIANA (Divisive Analysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
[Figure: DIANA applied to a 2-D data set, shown at three successive split stages on 10×10 axes]
More on Hierarchical Clustering Methods
• Major weaknesses of agglomerative clustering methods:
  – Do not scale well: time complexity of at least O(n²), where n is the number of total objects
  – Can never undo what was done previously
• Integration of hierarchical with distance-based clustering:
  – BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  – CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
  – CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
• BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD'96)
• Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  – Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  – Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
• Weakness: handles only numeric data, and is sensitive to the order of the data records
BIRCH (Cont.)
• The Clustering Feature tree (CF tree) is the main concept of the BIRCH algorithm.
• A clustering feature is a triple containing summarized information about a cluster. The elements of the triple are:
  1. The number of data points in the cluster.
  2. The sum of all data point vectors.
  3. The sum of all data point vectors squared.
• E.g., if the cluster has n data points {x1, …, xn}, its clustering feature is
  CF = ( n, Σ_{i=1..n} xi, Σ_{i=1..n} xi² )
• In other words, the triple consists of the zeroth, first, and second moments of the point set.
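
A small sketch (not from the slides) of the CF triple and its additivity property, checked against the worked example on the "Clustering Feature Vector" slide below:

import numpy as np

def clustering_feature(points):
    # CF = (n, LS, SS): count, componentwise sum, componentwise sum of squares
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def merge_cf(cf1, cf2):
    # CFs are additive: the CF of a merged cluster is the elementwise sum of the two CFs
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

n, LS, SS = clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(n, LS, SS)   # 5 [16. 30.] [ 54. 190.]  ->  CF = (5, (16,30), (54,190))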
BIRCH: CF tree
• The CF tree is a height-balanced tree.
• The branching factor B and the cluster threshold T are adjustable.
• Each internal node contains at most B entries, each of the form [CFi, childi], where childi is a pointer to the i-th child node and CFi is the CF of the cluster associated with the i-th child node. The CF of the internal node is the sum of all the CFi of its children.
• Each leaf node contains at most L entries, each of the form [CFi]. Like an internal node, a leaf node represents the cluster made up of the subclusters represented by all the CFi in the node.
• The page size on disk determines the values of L and B.
BIRCH: CF tree (cont.)
• All entries in a leaf node must satisfy the threshold requirement, that is, the radius or diameter (depending on the algorithm) measure of the cluster must be less than T.
• The threshold T governs the tree size: the larger T is, the smaller the tree, because leaf-level clusters can absorb more points.
• The CF tree is built by inserting data points one by one. Initially, there is only the root node.
• Each time a new data point is inserted into the tree, the algorithm finds the most appropriate path by choosing the closest child node according to the desired distance metric.
• On reaching a leaf node, it finds the closest cluster entry and tests whether the new data point can be absorbed by that cluster without violating the threshold requirement.
Source: Songrit Maneewongvatana
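
The threshold test can be carried out from the CF alone, since the radius of a cluster is computable from (n, LS, SS). A hypothetical sketch (whether radius or diameter is used depends on the algorithm variant):

import numpy as np

def cf_radius(n, LS, SS):
    # root-mean-square distance of the cluster's points to its centroid, from CF = (n, LS, SS)
    return np.sqrt((SS.sum() - LS @ LS / n) / n)

def try_absorb(cf, point, T):
    # Tentatively add the point to the leaf entry's CF; accept only if the radius stays below T
    n, LS, SS = cf
    point = np.asarray(point, dtype=float)
    n2, LS2, SS2 = n + 1, LS + point, SS + point ** 2
    if cf_radius(n2, LS2, SS2) < T:
        return (n2, LS2, SS2)    # absorbed: return the updated CF
    return None                  # would violate the threshold: create a new entry instead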
BIRCH: CF tree (cont.)
• If the new data point can be assigned to the cluster, just update the CF of the cluster.
• If not, allocate a new entry in the leaf and assign the data point to the newly created cluster.
  – If there is no space left in the leaf node for the new entry, the leaf node must be split.
• If a leaf node requires a split, take the two clusters in the leaf that are farthest apart as seeds, put one entry in the first newly created leaf and the other in the second leaf. Assign all other clusters to either leaf depending on their distance to the seed clusters.
• The node split may propagate up the tree if there is no available entry at the parent of the leaf node.
BIRCH: CF tree (cont.)
• The CFs along the path from the leaf to the root need to be updated to reflect the newly inserted data.
• BIRCH also provides a method to enhance the space utilization of each node: at the internal node at which the propagation of the node split terminates, the algorithm tries to merge the two closest entries if possible.
• Since each node contains only a limited number of clusters, the clustering structure in the CF tree may not reflect the actual clustering.
• Also, the same data point may end up in different clusters depending on the insertion order.
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
  N: number of data points
  LS: Σ_{i=1..N} Xi (linear sum of the points)
  SS: Σ_{i=1..N} Xi² (sum of the squared points)
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
  CF = (5, (16,30), (54,190))
[Figure: the five points plotted on 10×10 axes]
CF Tree
[Figure: example CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries [CFi, childi]; the leaf nodes hold CF entries and are chained together with prev/next pointers.]
BIRCH: algorithm
• Phase 1: Build the CF tree.
  – If the algorithm runs out of memory before finishing the data set, it builds another CF tree with a larger T value, reinserting the leaves from the old tree and then continuing with the remaining data.
• Phase 2: (Optional) Condense the tree by scanning the leaf entries of the tree built in Phase 1. Remove outliers and group subclusters that are close together into larger ones.
• Phase 3: Global clustering; any clustering algorithm can be used, e.g. we can treat each subcluster at the leaf level as a data point.
• Phase 4: (Optional) Reassign each point to the closest cluster. This requires one or more passes over the full data.
BIRCH (cont.)
Advantages
• It uses a compact representation (the CF).
• It provides an adjustable parameter T which, in effect, adjusts the size of the CF tree so that it can fit into main memory.
• One data scan is sufficient to build the clustering; extra passes will give a higher-quality clustering.
Disadvantages
• It is dependent on the order in which the data are inserted.
• The data type is limited to numeric data in a metric space.
• Treating subclusters as points may not provide the best result during global clustering.
CURE (Clustering Using REpresentatives)
• CURE: proposed by Guha, Rastogi & Shim, 1998
• Stops the creation of a cluster hierarchy if a level consists of k clusters
• Uses multiple representative points to evaluate the distance between clusters; adjusts well to arbitrarily shaped clusters and avoids the single-link effect
Drawbacks of Distance-Based Method
• Drawbacks of square-error-based clustering methods:
  – Consider only one point as representative of a cluster
  – Good only for clusters that are convex-shaped and of similar size and density, and only if k can be reasonably estimated
Cure: The Algorithm
• Draw a random sample s.
• Partition the sample into p partitions, each of size s/p.
• Partially cluster each partition into s/pq clusters.
• Eliminate outliers:
  – by random sampling
  – if a cluster grows too slowly, eliminate it.
• Cluster the partial clusters.
• Label the data on disk.
Data Partitioning and Clustering
• s = 50
• p = 2
• s/p = 25
• s/pq = 5
[Figure: 2-D scatter plots (y vs. x) showing the sample split into two partitions, each partially clustered, and the partial clusters then clustered together]
Cure: Shrinking Representative Points
[Figure: a cluster's representative points before and after shrinking, in the x–y plane]
• Shrink the multiple representative points towards the gravity center by a fraction α.
• Multiple representatives capture the shape of the cluster.
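
A rough sketch of this step (assumptions not in the slides: Euclidean distance and a farthest-point heuristic for picking the well-scattered representatives):

import numpy as np

def shrink_representatives(points, n_rep=4, alpha=0.3):
    # points: (n, d) array for one cluster; returns the shrunken representative points
    centroid = points.mean(axis=0)
    reps = [points[np.linalg.norm(points - centroid, axis=1).argmax()]]  # start from the point farthest from the centroid
    while len(reps) < min(n_rep, len(points)):
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[d.argmax()])          # next: the point farthest from the chosen representatives
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)      # shrink toward the gravity center by fraction alpha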
Clustering Categorical Data: ROCK
• ROCK: Robust Clustering using linKs, by S. Guha, R. Rastogi, and K. Shim (ICDE'99).
  – Suited for categorical attributes
  – Uses links to measure similarity/proximity
  – Not distance based
  – Computational complexity: O(n² + n·m_m·m_a + n² log n)
• Basic ideas:
  – Similarity function and neighbours: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
  – Example: let T1 = {1,2,3} and T2 = {3,4,5}. Then
    Sim(T1, T2) = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
Rock: Algorithm
• Links: the number of common neighbours of two points.
  – Example: with the point set {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5},
    link({1,2,3}, {1,2,4}) = 3 (see the sketch after this list).
• Algorithm:
  – Draw a random sample
  – Cluster with links
  – Label the data on disk
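
A small sketch of the similarity and link computations (the neighbour threshold θ = 0.5 is an assumption used only for this example; with it, the code reproduces link({1,2,3}, {1,2,4}) = 3 for the point set above):

from itertools import combinations

def jaccard(t1, t2):
    # Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
    return len(t1 & t2) / len(t1 | t2)

def link_counts(points, theta=0.5):
    # p and q are neighbours if Sim(p, q) >= theta; link(p, q) = number of common neighbours
    nbrs = {p: {q for q in points if q != p and jaccard(p, q) >= theta} for p in points}
    return {(p, q): len(nbrs[p] & nbrs[q]) for p, q in combinations(points, 2)}

pts = [frozenset(s) for s in [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5},
                              {1, 4, 5}, {2, 3, 4}, {2, 3, 5}, {2, 4, 5}, {3, 4, 5}]]
links = link_counts(pts, theta=0.5)
print(links[(pts[0], pts[1])])   # 3 common neighbours: {1,2,5}, {1,3,4}, {2,3,4}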
CHAMELEON
• CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E.H. Han, and V. Kumar (1999)
• Measures the similarity based on a dynamic model:
  – Two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
• A two-phase algorithm:
  1. Use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
  2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
Overall Framework of CHAMELEON
[Figure: Data Set → Construct Sparse Graph → Partition the Graph into small clusters → Merge Partitions → Final Clusters]