Data Mining:
Concepts and Techniques
(3rd ed.)
Chapter 10. Cluster Analysis
1
Outline
- CLUSTER ANALYSIS: BASIC CONCEPTS
- PARTITIONING ALGORITHMS
  - K-MEANS
  - K-MEDOIDS
- HIERARCHICAL CLUSTERING
- DENSITY-BASED CLUSTERING
- EVALUATION OF CLUSTERING
- SUMMARY
2
CLUSTER ANALYSIS: BASIC CONCEPTS
3
What is Cluster Analysis?
- Cluster: a collection of data objects
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- Cluster analysis (or clustering, data segmentation, …)
  - Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms
4
Clustering for Data Understanding and Applications
- Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus, and species
- Information retrieval: document clustering
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data
- Economic science: market research
5
Example (1/2)
- Clustering the website visitors based on age, income, and recency
6
Example (2/2)
- Clustering customers based on their profile
7
Clustering as a Preprocessing Tool (Utility)
- Summarization:
  - Preprocessing for regression, PCA, classification, and association analysis
- Compression:
  - Image processing: vector quantization
- Finding K-nearest neighbors:
  - Localizing search to one or a small number of clusters
- Outlier detection:
  - Outliers are often viewed as those "far away" from any cluster
8
Quality: What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity: cohesive within clusters
  - low inter-class similarity: distinctive between clusters
- The quality of a clustering method depends on
  - the similarity measure used by the method,
  - its implementation, and
  - its ability to discover some or all of the hidden patterns
9
Measure the Quality of Clustering
- Dissimilarity/similarity metric
  - Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
  - The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  - Weights should be associated with different variables based on applications and data semantics
10
How to
- Grouping the following customers based on their profiles
- How to: define a distance function? Assign weights to attributes? (a distance-function sketch follows the table)

No | age   | income | student | credit_rating | buys_computer
1  | <=30  | high   | no      | fair          | no
2  | <=30  | high   | no      | excellent     | no
3  | 31…40 | high   | no      | fair          | yes
4  | >40   | medium | no      | fair          | yes
5  | >40   | low    | yes     | fair          | yes
6  | >40   | low    | yes     | excellent     | no
7  | 31…40 | low    | yes     | excellent     | yes
8  | <=30  | medium | no      | fair          | no
9  | <=30  | low    | yes     | fair          | yes
10 | >40   | medium | yes     | fair          | yes
11 | <=30  | medium | yes     | excellent     | yes
12 | 31…40 | medium | no      | excellent     | yes
13 | 31…40 | high   | yes     | fair          | yes
14 | >40   | medium | yes     | excellent     | no
11
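One way to make the questions on this slide concrete is to actually write a distance function down. The sketch below is purely illustrative and not from the textbook: the ordinal encodings for age and income, the mismatch scores for the categorical attributes, and the weights are all assumptions one would tune for the application.

```python
# Illustrative mixed-attribute distance for the customer table above.
# The orderings for age/income and the weights are assumptions for the example.
AGE_ORDER = {"<=30": 0, "31…40": 1, ">40": 2}
INCOME_ORDER = {"low": 0, "medium": 1, "high": 2}

def customer_distance(a, b, weights=(0.4, 0.3, 0.15, 0.15)):
    """a, b: dicts with keys age, income, student, credit_rating."""
    w_age, w_income, w_student, w_credit = weights
    # ordinal attributes: normalized rank difference in [0, 1]
    d_age = abs(AGE_ORDER[a["age"]] - AGE_ORDER[b["age"]]) / 2
    d_income = abs(INCOME_ORDER[a["income"]] - INCOME_ORDER[b["income"]]) / 2
    # categorical attributes: 0 if equal, 1 otherwise
    d_student = 0 if a["student"] == b["student"] else 1
    d_credit = 0 if a["credit_rating"] == b["credit_rating"] else 1
    return (w_age * d_age + w_income * d_income
            + w_student * d_student + w_credit * d_credit)

c1 = {"age": "<=30", "income": "high", "student": "no", "credit_rating": "fair"}
c5 = {"age": ">40", "income": "low", "student": "yes", "credit_rating": "fair"}
print(customer_distance(c1, c5))   # customers 1 and 5 from the table
```

With such a function in hand, any of the partitioning or hierarchical methods later in the chapter can be run on the customer table; changing the weights changes which customers end up grouped together.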
Measure the Quality of Clustering
- Quality of clustering:
  - There is usually a separate "quality" function that measures the "goodness" of a cluster
  - It is hard to define "similar enough" or "good enough"
  - The answer is typically highly subjective
12
Major Clustering Approaches (I)
- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE
15
Major Clustering Approaches (II)
- Model-based:
  - A model is hypothesized for each of the clusters, and the aim is to find the best fit of the data to the given model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based:
  - Based on the analysis of frequent patterns
  - Typical methods: p-Cluster
- User-guided or constraint-based:
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
- Link-based clustering:
  - Objects are often linked together in various ways
  - Massive links can be used to cluster objects: SimRank, LinkClus
16
PARTITIONING ALGORITHMS
17
Partitioning Algorithms: Basic Concept
- Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i); a small computational sketch of this objective follows the slide:

  E = \sum_{i=1}^{k} \sum_{p \in C_i} \| p - c_i \|^2

- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
    - k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
    - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
18
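To make the partitioning criterion concrete, here is a minimal Python/NumPy sketch (my own illustration, not textbook code) of how the sum-of-squared-errors objective E can be evaluated for a given assignment of points to clusters; the function name `sse` and the data layout are assumptions.

```python
import numpy as np

def sse(points, labels, centers):
    """Sum of squared distances of each point to the center of its cluster."""
    total = 0.0
    for i, c in enumerate(centers):
        members = points[labels == i]          # points assigned to cluster i
        total += np.sum((members - c) ** 2)    # ||p - c_i||^2 summed over members
    return total

# Tiny usage example with two obvious clusters
points = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([points[labels == 0].mean(axis=0),
                    points[labels == 1].mean(axis=0)])
print(sse(points, labels, centers))  # smaller E means a tighter partition
```

Both k-means and k-medoids can be read as heuristics that try to drive this value down without enumerating all partitions.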
K-MEANS
19
The K-Means Clustering Method
- Given k, the k-means algorithm is implemented in four steps:
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when the assignment does not change
20
An Example of K-Means Clustering (K = 2)
(Figure: the initial data set is arbitrarily partitioned into k groups; the cluster centroids are updated, objects are reassigned, and the loop repeats until no change.)
- Partition objects into k nonempty subsets
- Repeat
  - Compute the centroid (i.e., mean point) for each partition
  - Assign each object to the cluster of its nearest centroid
- Until no change
21
K-Means Algorithm
22
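The algorithm figure for this slide does not survive in the transcript. As a stand-in, here is a minimal NumPy sketch of the four steps listed on the previous slide (arbitrary initial partition, centroid update, reassignment, repeat until stable); names are illustrative and the empty-cluster corner case is ignored for brevity.

```python
import numpy as np

def kmeans(points, k, max_iter=100, rng=np.random.default_rng(0)):
    """Basic k-means sketch: returns (labels, centroids)."""
    n = len(points)
    # Step 1: arbitrary initial partition into k nonempty subsets
    labels = rng.permutation(np.arange(n) % k)
    for _ in range(max_iter):
        # Step 2: centroids are the mean points of the current clusters
        # (a production version would also handle clusters that become empty)
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: assign each object to the nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the assignment no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Example: the four "medicine" objects used later in this chapter
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
print(kmeans(X, k=2))
```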
Demo
- K-means Algorithm Demo - YouTube
  - https://www.youtube.com/watch?v=zHbxbb2ye3E
- K-means - Interactive demos
  - http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
  - http://www.math.le.ac.uk/people/ag153/homepage/KmeansKmedoids/Kmeans_Kmedoids.html
23
Example
- Data: 4 types of medicines, each with two attributes (pH index and weight index)
- Goal: group these objects into K = 2 groups of medicine

Medicine | Weight | pH-Index
A        | 1      | 1
B        | 2      | 1
C        | 4      | 3
D        | 5      | 4

(Figure: plot of A, B, C, D in the weight/pH plane.)
Example
- Step 1: Use initial seed points for partitioning: c_1 = A, c_2 = B
- Euclidean distance, e.g. for D = (5, 4):

  d(D, c_1) = \sqrt{(5-1)^2 + (4-1)^2} = 5
  d(D, c_2) = \sqrt{(5-2)^2 + (4-1)^2} \approx 4.24

- Assign each object to the cluster with the nearest seed point
Example
- Step 2: Compute new centroids of the current partition
- Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships:

  c_1 = (1, 1)
  c_2 = \left( \frac{2+4+5}{3}, \frac{1+3+4}{3} \right) = \left( \frac{11}{3}, \frac{8}{3} \right)
Example
- Step 2: Renew membership based on new centroids
  - Compute the distance of all objects to the new centroids
  - Assign the membership to objects
Example
- Step 3: Repeat the first two steps until convergence
- Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships:

  c_1 = \left( \frac{1+2}{2}, \frac{1+1}{2} \right) = (1.5, 1)
  c_2 = \left( \frac{4+5}{2}, \frac{3+4}{2} \right) = (4.5, 3.5)
Example
- Step 3 (continued): Compute the distance of all objects to the new centroids
- Stop due to no new assignment: the membership in each cluster no longer changes (a short verification sketch follows below)
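For readers who want to check the arithmetic of this worked example, the following short script (my own illustration, not part of the slides) reruns the iterations on the four medicine objects starting from seeds A and B; the final centroids should come out as (1.5, 1) and (4.5, 3.5), with A, B in one cluster and C, D in the other. Swapping the distance computation for a Manhattan distance gives a starting point for the exercise on the next slide.

```python
import numpy as np

# A, B, C, D with (weight, pH-index) as in the example
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
centroids = X[[0, 1]].copy()          # initial seeds: c1 = A, c2 = B

while True:
    # assign each object to the nearest centroid (Euclidean distance)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # recompute centroids as the mean of each cluster
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centroids, centroids):
        break                          # centroids stable -> assignments stable
    centroids = new_centroids

print(labels)     # expected: A, B in one cluster; C, D in the other
print(centroids)  # expected: [[1.5, 1.0], [4.5, 3.5]]
```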
Exercise
For the medicine data set, use K-means with the Manhattan distance metric for clustering analysis by setting K = 2 and initialising seeds as C1 = A and C2 = C. Answer three questions as follows:
1. How many steps are required for convergence?
2. What are the memberships of the two clusters after convergence?
3. What are the centroids of the two clusters after convergence?

Medicine | Weight | pH-Index
A        | 1      | 1
B        | 2      | 1
C        | 4      | 3
D        | 5      | 4

(Figure: plot of A, B, C, D.)
30
Comments on the K-Means Method
- Strength: efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally, k, t << n
  - Comparison: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
- Comment: often terminates at a local optimum
- Weaknesses
  - Applicable only to objects in a continuous n-dimensional space
    - Use the k-modes method for categorical data
    - In comparison, k-medoids can be applied to a wide range of data
  - Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009, and the elbow sketch below)
  - Sensitive to noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes
31
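As one concrete way to choose k (the "elbow" heuristic, which this slide alludes to but does not spell out), here is a small sketch assuming scikit-learn is available: it prints the within-cluster sum of squares (inertia) for several values of k, and one looks for the value after which the curve flattens. The synthetic three-blob data set is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data: three blobs, so the "elbow" should appear around k = 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of samples to their nearest centroid
    print(k, round(km.inertia_, 1))
# The drop in inertia slows sharply after k = 3, suggesting k = 3 for this data.
```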
K-MEDOIDS
33
What Is the Problem of the K-Means Method?
- The k-means algorithm is sensitive to outliers!
  - An object with an extremely large value may substantially distort the distribution of the data
- K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster
(Figure: two plots on a 0-10 grid illustrating this.)
34
PAM: A Typical K-Medoids Algorithm (K = 2)
(Figure: a worked example on a 0-10 grid.)
- Arbitrarily choose k objects as initial medoids
- Assign each remaining object to the nearest medoid (Total Cost = 20)
- Do loop, until no change:
  - Randomly select a nonmedoid object, O_random
  - Compute the total cost of swapping a medoid O with O_random (Total Cost = 26)
  - Swap O and O_random if the quality is improved
35
The K-Medoid Clustering Method
- K-medoids clustering: find representative objects (medoids) in clusters
- PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
  - Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
- Efficiency improvements on PAM
  - CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
  - CLARANS (Ng & Han, 1994): randomized re-sampling
36
K-Medoid Algorithm
37
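The algorithm figure for this slide is not preserved in the transcript. The following Python sketch (my own simplified illustration of the PAM idea, not the textbook pseudocode) greedily swaps a medoid with a non-medoid whenever the swap lowers the total distance, in the spirit of the previous slides.

```python
import numpy as np
from itertools import product

def total_cost(X, medoid_idx):
    """Sum over all objects of the distance to the nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, rng=np.random.default_rng(0)):
    medoids = list(rng.choice(len(X), size=k, replace=False))  # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        best = total_cost(X, medoids)
        # consider swapping each medoid with each non-medoid object
        for m_pos, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[m_pos] = o
            cost = total_cost(X, candidate)
            if cost < best:                      # keep the swap only if quality improves
                medoids, best, improved = candidate, cost, True
    labels = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2).argmin(axis=1)
    return medoids, labels

X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
print(pam(X, k=2))
```

Because every candidate swap requires a full cost evaluation, the work per pass grows roughly like k(n-k) distance sums, which is why PAM does not scale to large data sets and sampling-based variants such as CLARA and CLARANS exist.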
HIERARCHICAL CLUSTERING
38
Hierarchical Clustering
- Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
(Figure: objects a, b, c, d, e are merged step by step into ab, de, cde and finally abcde when proceeding agglomeratively (AGNES, steps 0 to 4), and split in the reverse order when proceeding divisively (DIANA, steps 4 to 0).)
39
Agglomerative versus Divisive
- A hierarchical clustering method can be either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion.
40
Distance between clusters
- Proximity matrix: contains the distance between each pair of clusters
41
Distance between Clusters
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(C_i, C_j) = min(t_ip, t_jq)
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(C_i, C_j) = max(t_ip, t_jq)
- Average: average distance between an element in one cluster and an element in the other, i.e., dist(C_i, C_j) = avg(t_ip, t_jq)
- Centroid: distance between the centroids of two clusters, i.e., dist(C_i, C_j) = dist(c_i, c_j)
- Medoid: distance between the medoids of two clusters, i.e., dist(C_i, C_j) = dist(m_i, m_j)
  - Medoid: a chosen, centrally located object in the cluster
42
Distance between Clusters
(Figure: illustrations of single link, complete link, and average distance between two clusters.)
43
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
- Centroid: the "middle" of a cluster

  C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}

- Radius: square root of the average distance from any point of the cluster to its centroid

  R_m = \sqrt{ \frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N} }

- Diameter: square root of the average mean squared distance between all pairs of points in the cluster

  D_m = \sqrt{ \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)} }

(A short numerical sketch follows below.)
44
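Here is a short sketch (my own, for concreteness) that evaluates these three quantities with NumPy for a small cluster; note that the per-attribute formulas above are applied here with squared Euclidean distances, which is the natural multi-dimensional reading and an assumption on my part.

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])  # a tiny example cluster
N = len(cluster)

centroid = cluster.mean(axis=0)                                    # C_m
radius = np.sqrt(((cluster - centroid) ** 2).sum(axis=1).mean())   # R_m
# squared distances over all ordered pairs (i, j); diagonal terms are zero
sq = ((cluster[:, None, :] - cluster[None, :, :]) ** 2).sum(axis=2)
diameter = np.sqrt(sq.sum() / (N * (N - 1)))                       # D_m

print(centroid, radius, diameter)
```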
AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical packages, e.g., Splus
- Use the single-link method and the dissimilarity matrix
- Merge nodes that have the least dissimilarity
- Go on in a non-descending fashion
- Eventually all nodes belong to the same cluster
(Figure: three scatter plots on a 0-10 grid showing the clusters being merged step by step.)
45
Single-Linkage Clustering
- The algorithm is an agglomerative scheme that erases rows and columns in the proximity matrix as old clusters are merged into new ones.
- The N×N proximity matrix is D = [d(i, j)]. The clusterings are assigned sequence numbers 0, 1, …, (n-1), and L(k) is the level of the k-th clustering. A cluster with sequence number m is denoted (m), and the proximity between clusters (r) and (s) is denoted d[(r),(s)].
46
AGNES algorithm
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to
   d[(r),(s)] = min d[(i),(j)]
   where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as
   d[(k), (r,s)] = min { d[(k),(r)], d[(k),(s)] }
5. If all objects are in one cluster, stop. Else, go to step 2.
(A code sketch of these steps follows below.)
47
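Here is a compact Python sketch of steps 1-5, written for this transcript as an illustration (the function name and data layout are mine, not the textbook's). It takes a labelled symmetric distance matrix and prints each merge with its level L(m); feeding it the Italian-cities matrix of the following slides reproduces the merge levels 138, 219, 255, 268 and 295, as shown after the worked example.

```python
def single_link_agnes(labels, D):
    """Agglomerative single-link clustering on a symmetric distance matrix.

    labels: list of object names; D: dict-of-dicts distance matrix D[a][b].
    Prints each merge and its level L(m)."""
    clusters = {name: frozenset([name]) for name in labels}
    dist = {(a, b): D[a][b] for a in labels for b in labels if a != b}

    m = 0
    while len(clusters) > 1:
        # step 2: least dissimilar pair of current clusters
        (r, s), level = min(dist.items(), key=lambda kv: kv[1])
        m += 1
        merged = r + "/" + s                       # step 3: merge and record L(m)
        print(f"m={m}: merge {r} and {s} at level {level}")
        new_members = clusters.pop(r) | clusters.pop(s)
        # step 4: drop rows/columns of (r) and (s), add the merged cluster with
        # d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)])
        new_dist = {}
        for (a, b), d in dist.items():
            if a in (r, s) or b in (r, s):
                continue
            new_dist[(a, b)] = d
        for k in clusters:
            d_new = min(dist[(k, r)], dist[(k, s)])
            new_dist[(k, merged)] = d_new
            new_dist[(merged, k)] = d_new
        clusters[merged] = new_members
        dist = new_dist
```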
Example
- Hierarchical clustering of distances in kilometers between the following Italian cities:
  - TO - Torino
  - MI - Milan
  - FI - Firenze
  - RM - Rome
  - NA - Napoli
  - BA - Bari
48
Example (single-linkage)
Input distance matrix (L = 0 for all the clusters):

   | BA  | FI  | MI  | NA  | RM  | TO
BA | 0   | 662 | 877 | 255 | 412 | 996
FI | 662 | 0   | 295 | 468 | 268 | 400
MI | 877 | 295 | 0   | 754 | 564 | 138
NA | 255 | 468 | 754 | 0   | 219 | 869
RM | 412 | 268 | 564 | 219 | 0   | 669
TO | 996 | 400 | 138 | 869 | 669 | 0

The nearest pair of cities is MI and TO, at distance 138.
These are merged into a single cluster called "MI/TO".
The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.
49
Example (single-linkage)
Compute the distance from this new compound object to all other objects. In single-link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object.
After merging MI with TO we obtain the following matrix:

      | BA  | FI  | MI/TO | NA  | RM
BA    | 0   | 662 | 877   | 255 | 412
FI    | 662 | 0   | 295   | 468 | 268
MI/TO | 877 | 295 | 0     | 754 | 564
NA    | 255 | 468 | 754   | 0   | 219
RM    | 412 | 268 | 564   | 219 | 0

min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219, m = 2
50
Example (single-linkage)

      | BA  | FI  | MI/TO | NA/RM
BA    | 0   | 662 | 877   | 255
FI    | 662 | 0   | 295   | 268
MI/TO | 877 | 295 | 0     | 564
NA/RM | 255 | 268 | 564   | 0

min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255, m = 3
51
Example (single-linkage)

         | BA/NA/RM | FI  | MI/TO
BA/NA/RM | 0        | 268 | 564
FI       | 268      | 0   | 295
MI/TO    | 564      | 295 | 0

min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268, m = 4
52
Example (single-linkage)
Finally, we merge the last two clusters at level 295.

            | BA/FI/NA/RM | MI/TO
BA/FI/NA/RM | 0           | 295
MI/TO       | 295         | 0
53
Example (single-linkage)
- The process is summarized by the following hierarchical tree (dendrogram), with merge levels 138, 219, 255, 268 and 295 (a runnable check follows below):
(Figure: dendrogram of the five merges.)
54
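To tie the worked example back to the single_link_agnes sketch given after the AGNES algorithm slide, the following call (again illustrative code, not from the slides) feeds it the Italian-cities matrix; the printed merges come out at levels 138, 219, 255, 268 and 295, matching the tree above.

```python
cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
rows = [
    [0,   662, 877, 255, 412, 996],   # BA
    [662, 0,   295, 468, 268, 400],   # FI
    [877, 295, 0,   754, 564, 138],   # MI
    [255, 468, 754, 0,   219, 869],   # NA
    [412, 268, 564, 219, 0,   669],   # RM
    [996, 400, 138, 869, 669, 0],     # TO
]
D = {a: dict(zip(cities, row)) for a, row in zip(cities, rows)}

single_link_agnes(cities, D)   # prints merges at levels 138, 219, 255, 268, 295
```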
Dendrogram: Shows How Clusters are Merged
- Decomposes data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster
55
Dendrogram: Shows How Clusters are Merged
(Figure: cutting the dendrogram at a level that leaves K = 3 clusters; a SciPy snippet for this follows below.)
56
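As a practical illustration of "cutting the dendrogram", and assuming SciPy is available (this is not part of the slides), the snippet below builds a single-link hierarchy over a small hypothetical data set and extracts a flat clustering with exactly K = 3 clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical 2-D points; any numeric data set would do
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4], [9, 9], [9, 8]], dtype=float)

Z = linkage(pdist(X), method="single")           # single-link merge history (the dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut so that K = 3 clusters remain
print(labels)
```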
DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., Splus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own
(Figure: three scatter plots on a 0-10 grid showing one cluster being split step by step.)
57
DENSITY-BASED CLUSTERING
59
Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features:
  - Discover clusters of arbitrary shape
  - Handle noise
  - One scan
  - Need density parameters as a termination condition
(Figure caption: convex regions inaccurately identified, noise or outliers included in the clusters.)
60
Density-Based Clustering Methods
- Several interesting studies:
  - DBSCAN: Ester, et al. (KDD'96)
  - OPTICS: Ankerst, et al. (SIGMOD'99)
  - DENCLUE: Hinneburg & D. Keim (KDD'98)
  - CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
61
Density-Based Clustering: Basic Concepts
- Two parameters:
  - Eps: maximum radius of the neighbourhood
  - MinPts: minimum number of points in an Eps-neighbourhood of that point
- N_Eps(p): {q belongs to D | dist(p, q) ≤ Eps}
- Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  - p belongs to N_Eps(q)
  - core point condition: |N_Eps(q)| ≥ MinPts
(Figure: p in the Eps-neighbourhood of core point q, with MinPts = 5 and Eps = 1 cm.)
62
Density-Reachable and Density-Connected
- Density-reachable:
  - A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p_1, …, p_n, with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i
    (Figure: a chain q = p_1 -> … -> p.)
- Density-connected:
  - A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
    (Figure: p and q both density-reachable from o.)
63
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
(Figure: core, border and outlier points for Eps = 1 cm and MinPts = 5.)
64
DBSCAN: The Algorithm
- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed
(A code sketch of this procedure follows below.)
65
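For concreteness, here is a compact Python sketch of the procedure above (my own simplified illustration, not the original DBSCAN pseudocode): it finds core points, grows a cluster from each unassigned core point through density-reachability, and leaves everything else labelled as noise.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Return an array of cluster ids; -1 marks noise (outliers)."""
    n = len(X)
    # pairwise Euclidean distances and Eps-neighbourhoods (each point includes itself)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)          # -1 = noise / unvisited
    cluster_id = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbours[p]) < min_pts:
            continue                  # already assigned, or p is not a core point
        # p is an unassigned core point: grow a new cluster from it
        labels[p] = cluster_id
        frontier = list(neighbours[p])
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:
                labels[q] = cluster_id
                if len(neighbours[q]) >= min_pts:      # q is also a core point:
                    frontier.extend(neighbours[q])     # expand through its neighbourhood
        cluster_id += 1
    return labels

# Toy usage: two dense groups plus one far-away outlier
X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.0], [8, 8], [8.1, 8.2], [7.9, 8.0], [20, 20]])
print(dbscan(X, eps=0.5, min_pts=3))   # e.g. [0 0 0 1 1 1 -1]
```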
DBSCAN: Sensitive to Parameters
66
SUMMARY
74
Summary
- Cluster analysis groups objects based on their similarity and has wide applications
- Measures of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- K-means and k-medoids are popular partitioning-based clustering algorithms
- AGNES and DIANA are hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
- DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
- Quality of clustering results can be evaluated in various ways
75