Chapter 9
UNSUPERVISED LEARNING:
Clustering
Part 1
Cios / Pedrycz / Swiniarski / Kurgan
Outline
• What is Clustering?
- Categories of clustering methods
- Similarity measures
• Partition-Based Clustering
• Hierarchical Clustering
• Model-Based (mixture of probabilities) Clustering
• Scalable Clustering
• Grid-Based Clustering
• Cluster Validity
• Clustering of Large Datasets
What is Clustering?
How do we understand data?
We look for structure in the data by revealing groups (clusters).
Clusters are abstractions of the data.
The structure is formed on the basis of similarities between patterns (data points).
How hard is clustering?
Consider N data points to be split into c groups (clusters). The number of possible splits (partitions) is given by
\[
\frac{1}{c!}\sum_{i=1}^{c}(-1)^{c-i}\binom{c}{i}\, i^{N}
\]
Even for a small problem of N = 100 and c = 5 we end up with about 10^67 partitions.
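To see how quickly this count grows, here is a short Python sketch (an addition, not part of the slides) that evaluates the formula above exactly with integer arithmetic for the slide's values N = 100 and c = 5.

```python
# Number of ways to split N points into c nonempty clusters
# (the formula above, i.e., a Stirling number of the second kind).
from math import comb, factorial

def num_partitions(N, c):
    return sum((-1) ** (c - i) * comb(c, i) * i ** N for i in range(1, c + 1)) // factorial(c)

print(num_partitions(100, 5))   # roughly 6.6 * 10^67
```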
Clustering Challenges – from Bezdek
[Two illustrative figures from Bezdek omitted.]
Categories of Clustering
We distinguish between three main categories
(classes) of clustering methods
• Partition-based
• Hierarchical
• Model-based (mixture of probabilities)
Categories of Clustering
• Partitioning approach:
– Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
– Examples: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the data using some
criterion
– Examples: Diana, Agnes, BIRCH, CAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Examples: DBSCAN, OPTICS, DenClue
• Grid-based approach:
– Based on a multiple-level granularity structure
– Examples: STING, WaveCluster, CLIQUE
Categories of Clustering
• Model-based:
– A model is hypothesized for each of the clusters and its best
fit is found
– Examples: EM (expectation maximization), SOM, COBWEB
• Frequent pattern-based:
– Based on analysis of frequent patterns
– Example: p-Cluster
• User-guided or constraint-based:
– Clustering by user-specified or application-specific
constraints
– Examples: COD (obstacles), constrained clustering
• Link-based clustering:
– Objects are linked together in various ways
– Examples: SimRank, LinkClus
Partition-Based Clustering
It is also referred to as objective-function clustering; it relies on
the minimization of a certain objective function (performance index).
The result of the minimization is a partition matrix and a set of
prototypes (centers).
Partition-Based Clustering
• Partitioning method: Partition data D of n objects into a set of k
clusters, such that the sum of squared distances is minimized
(where ci is the centroid or medoid of cluster Ci)
E  ik1 pCi ( p  ci )2
• Given k, find a partition that optimizes the chosen partitioning
criterion
– Global (optimal): exhaustively enumerate all partitions - not
feasible!
– Heuristic: k-means and k-medoids algorithms
• k-means: Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition Around Medoids): Each cluster is
represented by one of the objects in the cluster
Model-Based Clustering
In MBC we assume a certain probabilistic model of
data and estimate its parameters, such as mean,
covariance matrix, etc.
Mixture density model: we assume that data are a
result of a mixture of “c” sources generating data
and each source is treated as a separate cluster.
Maximum Likelihood Estimation: method for estimating
parameters of a model.
Similarity measures
The similarity measure is the most fundamental component of every
clustering algorithm; it is used to quantify similarity (or
dissimilarity) between data points.
The data points with the highest similarity (e.g.,
with the shortest distance) are candidates for
forming a cluster.
Distance Functions for Continuous Data
Euclidean distance (p = 2 in Minkowski)
\[ d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
Hamming distance (p = 1 in Minkowski)
\[ d(x,y) = \sum_{i=1}^{n} |x_i - y_i| \]
Tchebyschev distance (p = ∞ in Minkowski)
\[ d(x,y) = \max_{i=1,2,\dots,n} |x_i - y_i| \]
Distance Functions for Continuous Data
Minkowski distance
\[ d(x,y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}, \quad p > 0 \]
[Illustration from Wikipedia omitted.]
Distance Functions for Continuous Data
Canberra distance
\[ d(x,y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{x_i + y_i}, \quad x_i \text{ and } y_i \text{ positive} \]
Example: v1 = (1,1,1), v2 = (1,1,0), v3 = (10,5,0), v4 = (1,2,3), v5 = (2,4,6)
d12 = 1, d13 = 2.485, d45 = 1

Angular separation
\[ d(x,y) = \frac{\sum_{i=1}^{n} x_i y_i}{\left[ \sum_{i=1}^{n} x_i^2 \; \sum_{i=1}^{n} y_i^2 \right]^{1/2}} \]
Example: v1 = (7,6,3,-1), v2 = (0,3,4,5)
d12 = 0.363
Distance Functions for Discrete Data
Binary data x = [x1 x2 …xn]
y = [y1 y2 …yn]
a – number of positions where both xi and yi are 1
d – number of positions where both xi and yi are 0
b, c – number of positions where xi and yi differ (one is 1, the other 0)

            xi = 1   xi = 0
  yi = 1      a         c
  yi = 0      b         d
Distance Functions for Discrete Data
Matching index: (a + d) / (a + b + c + d)
Russell & Rao: a / (a + b + c + d)
Jaccard index: a / (a + b + c)
Czekanowski: 2a / (2a + b + c)
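A small Python sketch (an illustration, not from the slides; it assumes two 0/1 vectors of equal length) that computes the four indices from the counts a, b, c, d:

```python
# Binary similarity indices built from the 2x2 contingency counts a, b, c, d.
def binary_counts(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return a, b, c, d

def binary_indices(x, y):
    a, b, c, d = binary_counts(x, y)
    n = a + b + c + d
    return {
        "matching":    (a + d) / n,
        "russell_rao": a / n,
        "jaccard":     a / (a + b + c),
        "czekanowski": 2 * a / (2 * a + b + c),
    }

print(binary_indices([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))
```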
Hierarchical Clustering
HC provides a graphical illustration of the relationships
among the data in the form of a dendrogram
(binary tree).
There are two approaches to HC:
• Bottom – up / agglomerative
• Top-down / divisive
Hierarchical Clustering
Agglomerative / bottom-up method starts with each
object in the data forming its own cluster, and then
successively merges the clusters until one large
cluster is formed, which encompasses the entire
dataset.
Divisive / top-down method starts by considering the
entire data as one cluster and then splits up the
cluster(s) until each object forms its own cluster.
Hierarchical Clustering
[Dendrogram over objects a, b, c, d, e, f, g, h: the divisive (top-down) direction splits the data downward, while the agglomerative (bottom-up) direction merges upward; one level of the tree corresponds to the clusters {a}, {b,c,d,e}, and {f,g,h}.]
Hierarchical Agglomerative Clustering
[Dendrogram figure: cutting the tree at different levels yields different numbers of clusters, e.g., 4, 3, or 2.]
Hierarchical Agglomerative Clustering
Given: a data set and a distance function
1. Start with N clusters by assigning each pattern to a separate cluster.
2. Proceed with this initial configuration of the clusters and merge the two clusters that are the closest. In other words, if S and T are the two clusters recognized as the closest, form a single cluster {S, T} and reduce the number of clusters by one.
3. Repeat step 2 until a minimal number of clusters has been reached.
Result: clusters of data (a partition)
(A code sketch of this procedure follows the linkage definitions below.)
Distance Between Clusters
Single linkage: \( \|T - S\| = \min_{x \in T,\, y \in S} \|x - y\| \)
Complete linkage: \( \|T - S\| = \max_{x \in T,\, y \in S} \|x - y\| \)
Average linkage: \( \|T - S\| = \dfrac{1}{\mathrm{card}(S)\,\mathrm{card}(T)} \sum_{x \in T} \sum_{y \in S} \|x - y\| \)
Single Linkage
Similarity between S and T is calculated based on the
minimal distance between the elements belonging to the
corresponding clusters.
Complete Linkage
We rely on the maximal distance between
the patterns in the analyzed clusters.
Average Linkage
We combine two clusters based upon the average
distance between the patterns in the two clusters.
Hausdorff Distance Function
\[ d(A,B) = \max\Big\{ \max_{x \in A} \min_{y \in B} d(x,y),\; \max_{y \in B} \min_{x \in A} d(x,y) \Big\} \]
(from Wikipedia) Two sets are close if every point of either set is close to some point of the other
set. The Hausdorff distance is the greatest of all the distances from a point in one set to the closest
point in the other set; for infinite sets the max/min are replaced by sup/inf.
Lance-Williams updating formula
d AB,C  α A d A,C  α Bd B,C  βdA,B  γ | d A,C  d B,C |
Clustering
method
Single link
Complete link
centroid
A (B)


1/2
1/2
0
0
-1/2
1/2
0
nA
nA  nB
median
1/2

nAnB
(n A  n B ) 2
-1/4
0
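A minimal sketch of the Lance-Williams update using the coefficients from the table above (the helper names and example distances are my own):

```python
# Distance from the merged cluster A∪B to another cluster C, computed from the
# pairwise distances d_AC, d_BC, d_AB and the cluster sizes n_A, n_B.
COEFFS = {
    # method: (alpha(nA, nB), beta(nA, nB), gamma)
    "single":   (lambda nA, nB: 0.5,            lambda nA, nB: 0.0,                       -0.5),
    "complete": (lambda nA, nB: 0.5,            lambda nA, nB: 0.0,                        0.5),
    "centroid": (lambda nA, nB: nA / (nA + nB), lambda nA, nB: -nA * nB / (nA + nB) ** 2,  0.0),
    "median":   (lambda nA, nB: 0.5,            lambda nA, nB: -0.25,                      0.0),
}

def lw_update(dAC, dBC, dAB, nA, nB, method="single"):
    alpha, beta, gamma = COEFFS[method]
    return (alpha(nA, nB) * dAC + alpha(nB, nA) * dBC
            + beta(nA, nB) * dAB + gamma * abs(dAC - dBC))

print(lw_update(2.0, 5.0, 3.0, 1, 1, "single"))    # 2.0, i.e., min(d_AC, d_BC)
```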
Hierarchical Divisive Method
The HD algorithm starts by considering all divisions of the data into two
nonempty subsets, which amounts to a huge number of possible divisions:
\[ 2^{\,n-1} - 1 \]
However, it is possible to construct divisive methods that do not
consider all divisions, most of which would not be useful anyway.
One such algorithm is by MacNaughton-Smith (1964) -> see the
next slides
Hierarchical Divisive Method
At first A := C and B := ∅.
1. Move one object at a time from A to B.
For each object i ∈ A we compute the average dissimilarity to all
other objects of A:
\[ a(i) = \frac{1}{|A| - 1} \sum_{j \in A,\, j \neq i} d(i,j) \]
The object m ∈ A for which a(m) is the largest is moved to B:
\[ A := A \setminus \{m\}, \qquad B := \{m\} \]
Hierarchical Divisive Method
2. Move other objects from A to B (called the “splinter group”).
If |A| = 1, stop. Otherwise compute a(i) for all i ∈ A, and the average
dissimilarity of i to all objects of B, denoted d(i,B):
\[ a(i) - d(i,B) = \frac{1}{|A|-1} \sum_{j \in A,\, j \neq i} d(i,j) \;-\; \frac{1}{|B|} \sum_{k \in B} d(i,k) \]
Hierarchical Divisive Method
Select the object h ∈ A for which
\[ a(h) - d(h,B) = \max_{i \in A} \big( a(i) - d(i,B) \big) \]
If a(h) - d(h,B) > 0, move h from A to B and go to step 2.
If a(h) - d(h,B) ≤ 0, the process stops.
The division of C into clusters A and B is complete.
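A Python sketch of one MacNaughton-Smith split (an illustration under my own assumptions, not the authors' code); it operates on a precomputed dissimilarity matrix and reuses the matrix of the worked example that follows.

```python
# Splinter-group division of a set of objects given a dissimilarity dict d[i][j].
def splinter_split(d, objects):
    A, B = list(objects), []
    # step 1: move the object with the largest average dissimilarity to B
    a = {i: sum(d[i][j] for j in A if j != i) / (len(A) - 1) for i in A}
    m = max(A, key=a.get)
    A.remove(m); B.append(m)
    # steps 2-3: keep moving objects while the difference stays positive
    while len(A) > 1:
        diff = {}
        for i in A:
            a_i = sum(d[i][j] for j in A if j != i) / (len(A) - 1)
            d_iB = sum(d[i][k] for k in B) / len(B)
            diff[i] = a_i - d_iB
        h = max(A, key=diff.get)
        if diff[h] <= 0:
            break
        A.remove(h); B.append(h)
    return A, B

names = ["a", "b", "c", "d", "e"]
D = [[0, 2, 6, 10, 9], [2, 0, 5, 9, 8], [6, 5, 0, 4, 5], [10, 9, 4, 0, 3], [9, 8, 5, 3, 0]]
d = {u: {v: D[i][j] for j, v in enumerate(names)} for i, u in enumerate(names)}
print(splinter_split(d, names))   # (['c', 'd', 'e'], ['a', 'b'])
```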
Hierarchical Divisive Method
Dissimilarity matrix for the objects a, b, c, d, e:

        a      b      c      d      e
  a    0.0    2.0    6.0   10.0    9.0
  b    2.0    0.0    5.0    9.0    8.0
  c    6.0    5.0    0.0    4.0    5.0
  d   10.0    9.0    4.0    0.0    3.0
  e    9.0    8.0    5.0    3.0    0.0

Each object's average dissimilarity to the other objects:
  a: (2.0 + 6.0 + 10.0 + 9.0)/4 = 6.75
  b: (2.0 + 5.0 + 9.0 + 8.0)/4 = 6.00
  c: (6.0 + 5.0 + 4.0 + 5.0)/4 = 5.00
  d: (10.0 + 9.0 + 4.0 + 3.0)/4 = 6.50
  e: (9.0 + 8.0 + 5.0 + 3.0)/4 = 5.25

In this example, object a is chosen to initiate the splinter group.
At this stage we have clusters A = {b, c, d, e} and B = {a}.
Hierarchical Divisive Method
Object | Average dissimilarity to the remaining objects of A | Average dissimilarity to the splinter group B = {a} | Difference
  b    | (5.0 + 9.0 + 8.0)/3 = 7.33 |  2.00 |  5.33
  c    | (5.0 + 4.0 + 5.0)/3 = 4.67 |  6.00 | -1.33
  d    | (9.0 + 4.0 + 3.0)/3 = 5.33 | 10.00 | -4.67
  e    | (8.0 + 5.0 + 3.0)/3 = 5.33 |  9.00 | -3.67

Therefore object b changes sides, so the new splinter group is B = {a, b} and the
remaining group becomes A = {c, d, e}.

Object | Average dissimilarity to the remaining objects of A | Average dissimilarity to the splinter group B = {a, b} | Difference
  c    | (4.0 + 5.0)/2 = 4.50 | (6.0 + 5.0)/2 = 5.50   | -1.00
  d    | (4.0 + 3.0)/2 = 3.50 | (10.0 + 9.0)/2 = 9.50  | -6.00
  e    | (5.0 + 3.0)/2 = 4.00 | (9.0 + 8.0)/2 = 8.50   | -4.50

All differences are now negative, so the process stops with A = {c, d, e} and B = {a, b}.
Partition / Objective Function Clustering
Develop and optimize a partition matrix so that a certain performance
index (objective function) is minimized.
[Diagram: objective function -> minimization -> structure.]
Partition / Objective Function Clustering
It depends on the minimization of a performance index Q:
\[ Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \,\|x_k - v_i\|^2 = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, d_{ik}^{2} \]
c – number of clusters
U – partition matrix
v_i – prototypes (cluster centers)
Clustering: Representation
How do we represent clusters? With a partition matrix:
N data points, c clusters; data -> partition matrix
\[ U = \begin{bmatrix} 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 \end{bmatrix} \]
Partition Matrix
1 0 0 1 1 0 0 1
0 1 1 0 0 0 0 0 

U
0 0 0 0 0 1 1 0 




U  {U | 0 
N
u
k 1
cluster1 : {1,4,5,8}
cluster2 : {2,3}
cluster3 : {6,7}
c
ik
 N, 0   u ik 1
}
=
i1
Partition / Objective Function Clustering
Clustering is guided by the minimization of some
objective function Q.
Representation of structure is in the form of:
• Partition matrix U = [u_ik], i = 1,2,…,c; k = 1,2,…,N
\[ 0 < \sum_{k=1}^{N} u_{ik} < N \quad \text{for } i = 1,2,\dots,c \]
\[ \sum_{i=1}^{c} u_{ik} = 1 \quad \text{for } k = 1,2,\dots,N \]
• Prototypes v_i, i = 1,2,…,c
Partition / Objective Function Clustering
Given: the (guessed!) number of clusters (c), and the
chosen similarity function (and value of the power
factor (m) for fuzzy clustering only)
Compute the prototypes (v) and update the
partition matrix (U) based on the conditions of the
minimized objective function
Result: partition matrix and prototypes
K-Means
Proposed about 60 years ago independently by a few researchers:
– Steinhaus, H., 1956. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. IV (C1.III), 801–804
– Lloyd, S., 1982. Least squares quantization in PCM. IEEE Trans. Inform. Theory 28, 129–137. Originally an unpublished Bell Laboratories technical note (1957)
– Ball, G., Hall, D., 1965. ISODATA, a novel method of data analysis and pattern classification. Technical report NTIS AD 699616. Stanford Research Institute, Stanford, CA
– MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, pp. 281–297
K-Means
It is the simplest and most frequently used clustering algorithm because of
ease of implementation, efficiency and many empirical successes.
Input:
  c = number of clusters
  d = distance function
  m = power (fuzziness) factor (not used in hard K-means)
  t = termination criterion (amount of movement between clusters)
  v = initial cluster centers (randomly chosen each run)
Output:
  U = partition matrix
  V = cluster centers
K-Means
Given: the number of clusters k and a dataset X with n points
1. Select the initial k means as the first k points of the data (or they
can be user-provided or chosen randomly)
2. Calculate the distances between all the points and the k means
3. Allocate each point to the cluster whose mean is nearest to it
4. Recalculate the means of the k clusters
5. Repeat steps 2–4 until a termination criterion is satisfied
Result: a vector of size k of means (prototypes) of the clusters;
a partition matrix P of size k x n
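A minimal Python sketch of these steps (assumptions: Euclidean distance, initial means taken as the first k points, and a stable assignment or an iteration cap as the termination criterion):

```python
# Hard K-means following steps 1-5 above.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(X, k, max_iter=100):
    means = [list(p) for p in X[:k]]                 # step 1: first k points as means
    assign = None
    for _ in range(max_iter):
        # steps 2-3: allocate each point to the cluster with the nearest mean
        new_assign = [min(range(k), key=lambda i: euclidean(x, means[i])) for x in X]
        if new_assign == assign:                     # step 5: assignments stable -> stop
            break
        assign = new_assign
        # step 4: recalculate the means of the non-empty clusters
        for i in range(k):
            members = [x for x, a in zip(X, assign) if a == i]
            if members:
                means[i] = [sum(col) / len(members) for col in zip(*members)]
    return means, assign

X = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(kmeans(X, 2))
```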
K-Means
Termination criterion
Stop when the summed difference between the old and new
partitions (partition matrices) is less than a threshold:
• Hard:  U_new - U_old == 0
• Fuzzy: ||U_new - U_old|| < a user-chosen small number (e.g., 0.0001)
K-Means
Objective function
\[ Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik} \, \|x_k - v_i\|^2 \]
Minimize Q, the sum of squared errors of N samples over c clusters,
subject to these constraints:
u_ik = 0 or 1 (membership value: sample k belongs to cluster i or not)
\[ 0 < \sum_{k=1}^{N} u_{ik} < N \quad \text{for } i = 1,2,\dots,c \]
\[ \sum_{i=1}^{c} u_{ik} = 1 \quad \text{for } k = 1,2,\dots,N \]
K-Means: Example
Anil Jain. Data Clustering: 50 years beyond K-means; (http://biometrics.cse.msu.edu/Publications/Clustering/JainClustering_PRL10.pdf)
K-Means
Strengths:
Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
• By comparison: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
Weaknesses:
– Finds only a local optimum; the global optimum may be found using deterministic annealing or genetic algorithms
– Applicable only to numeric data (so that a mean can be calculated)
– The user needs to specify k, the number of clusters
– Unable to handle noisy data and/or outliers
– Cannot discover clusters with non-convex shapes
From Data Mining: Concepts and Techniques
K-Means: Selecting K
• K-means is an NP-hard problem and, as a greedy algorithm, converges to a local minimum
• To mitigate the local-minima problem, run the algorithm with different starting points for a given K and pick the result with the smallest squared error
• Different values of K lead to different clustering results:
– Run the K-means clustering algorithm with different values of K and pick the result that is the most meaningful to the domain expert (!)
– Perform silhouette analysis (a sketch follows on the next slide)
Silhouette Plot Analysis
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0006937
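A hedged sketch of silhouette-based selection of K, reusing the kmeans() and euclidean() helpers from the earlier K-means sketch (the data set is a made-up toy example; a larger average silhouette is better):

```python
# Average silhouette coefficient of a clustering: for each point, a = mean
# intra-cluster distance, b = smallest mean distance to another cluster,
# s = (b - a) / max(a, b); singleton clusters get 0.
def silhouette(X, assign, k):
    scores = []
    for idx, x in enumerate(X):
        own = assign[idx]
        per_cluster = {i: [] for i in range(k)}
        for jdx, y in enumerate(X):
            if jdx != idx:
                per_cluster[assign[jdx]].append(euclidean(x, y))
        if not per_cluster[own]:
            scores.append(0.0)
            continue
        a = sum(per_cluster[own]) / len(per_cluster[own])
        b = min(sum(ds) / len(ds) for i, ds in per_cluster.items() if i != own and ds)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

X = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
for k in (2, 3, 4):
    _, assign = kmeans(X, k)
    print(k, round(silhouette(X, assign, k), 3))   # pick the K with the largest value
```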
Fuzzy C-Means
How do we deal with (quantify) data that are in-between clusters?
Consider partial membership in clusters – the emergence of fuzzy sets.
[Figure: elements with partial membership lying between two clusters.]
Fuzzy C-Means
Allows for partial membership in clusters -> fuzzy partition matrix U
Objective function
\[ Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, \|x_k - v_i\|^2 \]
U = [u_ik]; u_ik – degree of membership of the k-th data point in the i-th cluster
m – fuzzification coefficient, m > 1
||·|| – distance function
Fuzzy C-Means: Optimization
\[ Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, \|x_k - v_i\|^2 \;\longrightarrow\; \min_{\text{prototypes},\; U \in \mathbf{U}} Q \]
Minimize Q with respect to the prototypes, \( \nabla_{\text{prototypes}} Q = 0 \), that is
\[ \frac{\partial Q}{\partial v_{ij}} = 0, \quad i = 1,2,\dots,c, \;\; j = 1,2,\dots,n \]
Minimize Q with respect to the partition matrix, \( \nabla_U Q = 0 \), that is
\[ \frac{\partial Q}{\partial u_{ik}} = 0, \quad i = 1,2,\dots,c, \;\; k = 1,2,\dots,N \]
Constraint: U is a partition matrix!
Fuzzy C-Means: Optimization
\[ \min_{\text{prototypes}} \; \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, \|x_k - v_i\|^2 \]
\[ \frac{\partial}{\partial v_i} \sum_{k=1}^{N} u_{ik}^{m} (x_k - v_i)^T (x_k - v_i) = -2 \sum_{k=1}^{N} u_{ik}^{m} (x_k - v_i) \]
Setting \( \sum_{k=1}^{N} u_{ik}^{m} (x_k - v_i) = 0 \) gives
\[ v_i = \frac{\sum_{k=1}^{N} u_{ik}^{m} \, x_k}{\sum_{k=1}^{N} u_{ik}^{m}} \]
Fuzzy C-Means
Initialize: select the number of clusters (c), the stopping value (ε), and the fuzzification coefficient (m).
The distance function is Euclidean or weighted Euclidean. The initial partition matrix consists of random entries.
Repeat
  update the prototypes
  \[ v_i = \frac{\sum_{k=1}^{N} u_{ik}^{m} \, x_k}{\sum_{k=1}^{N} u_{ik}^{m}} \]
  update the partition matrix
  \[ u_{ik} = \frac{1}{\displaystyle\sum_{j=1}^{c} \left( \frac{\|x_k - v_i\|}{\|x_k - v_j\|} \right)^{2/(m-1)}} \]
until a certain stopping criterion has been satisfied
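A minimal fuzzy C-means sketch following the two update rules above (assumptions: Euclidean distance, m = 2, a random initial partition matrix, and the max-change stopping rule of the next slide):

```python
# Fuzzy C-means: alternate prototype and partition-matrix updates until the
# largest membership change falls below eps.
import math, random

def fcm(X, c, m=2.0, eps=1e-4, max_iter=100):
    N, n = len(X), len(X[0])
    U = [[random.random() for _ in range(N)] for _ in range(c)]
    for k in range(N):                                   # normalize columns to sum to 1
        s = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= s
    for _ in range(max_iter):
        # prototype update: weighted means with weights u_ik^m
        V = []
        for i in range(c):
            w = [U[i][k] ** m for k in range(N)]
            V.append([sum(w[k] * X[k][j] for k in range(N)) / sum(w) for j in range(n)])
        # partition matrix update
        U_new = [[0.0] * N for _ in range(c)]
        for k in range(N):
            d = [math.dist(X[k], V[i]) or 1e-12 for i in range(c)]
            for i in range(c):
                U_new[i][k] = 1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1)) for j in range(c))
        delta = max(abs(U_new[i][k] - U[i][k]) for i in range(c) for k in range(N))
        U = U_new
        if delta < eps:
            break
    return V, U

X = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
V, U = fcm(X, 2)
print(V)
```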
Fuzzy C-Means
Design aspects
Stopping criterion: terminate the iterations when
\[ \max_{i,k} \, | u_{ik}(\text{iter}+1) - u_{ik}(\text{iter}) | \le \varepsilon \]
Fuzzification coefficient m > 1 controls the shape of the membership functions:
m = 2.0 – typical value
m close to 1 – set-like (almost crisp) membership functions
m higher than 2.0 – spike-like membership functions
Kernel-Based Clustering
Idea:
The original data points in the n-dimensional space
are transformed, through some mapping φ,
into elements of an m-dimensional space, where m > n.
Objective function in the new space:
\[ Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, \|\varphi(x_k) - \varphi(v_i)\|^2 \]
Kernel-Based Clustering
Given the dimensionality of the new space, m > n, we compute in the
new space a kernel function K(x,v) as a dot product:
\[ K(x,v) = \varphi^T(x)\,\varphi(v) \]
Gaussian kernel:
\[ K(x,v) = \exp(-\|x - v\|^2 / \sigma^2) \]
Since K(x,x) = 1 for the Gaussian kernel,
\[ \|\varphi(x_k) - \varphi(v_i)\|^2 = 2 - 2K(x_k, v_i) \]
Kernel K-Means
K-means cannot separate clusters that are not linearly separable.
To solve this problem the kernel K-means algorithm was designed:
before clustering, all points are mapped into a higher-dimensional
space using some nonlinear function; then the algorithm partitions
(clusters) the points in the new space.
The major difference is that kernel K-means calculates distances
via the kernel method and not, for instance, by simple
Euclidean distance.
Kernel K-means
\[ Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, \|\varphi(x_k) - \varphi(v_i)\|^2
\qquad\qquad
Q = \sum_{j=1}^{c} \sum_{x \in C_j} w(x) \, \|\varphi(x) - m_j\|^2 \]
To calculate the distances between the points in the new space and
the cluster means m_j we use a kernel function that is specified in the kernel matrix K.
Kernel K-means
Input: K – kernel function, k – number of clusters
1. Initialize the k clusters: C1(0), ..., Ck(0)
2. Set t = 0
3. For each point x, find its new cluster by:
   J*(x) = argmin_j ||φ(x) − m_j||²
4. Compute the updated clusters as
   C_j(t+1) = {x : J*(x) = j}
5. If not converged, set t = t + 1 and go to step 3; otherwise, stop
Result: partition into clusters C1, ..., Ck
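A sketch of kernel K-means with a Gaussian kernel (an illustration under my own assumptions, not the authors' code); the feature-space distance to a cluster mean is expanded entirely in terms of the kernel matrix, so the mean m_j is never formed explicitly.

```python
# ||phi(x) - m_c||^2 = K(x,x) - 2/|C| * sum_{y in C} K(x,y)
#                      + 1/|C|^2 * sum_{y,z in C} K(y,z)
import math, random

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma ** 2)

def kernel_kmeans(X, k, sigma=1.0, max_iter=50):
    n = len(X)
    K = [[gaussian_kernel(X[i], X[j], sigma) for j in range(n)] for i in range(n)]
    assign = [random.randrange(k) for _ in range(n)]     # random initial clusters
    for _ in range(max_iter):
        new_assign = []
        for i in range(n):
            best, best_d = 0, float("inf")
            for c in range(k):
                members = [j for j in range(n) if assign[j] == c]
                if not members:
                    continue
                m = len(members)
                d = (K[i][i]
                     - 2.0 / m * sum(K[i][j] for j in members)
                     + 1.0 / m ** 2 * sum(K[a][b] for a in members for b in members))
                if d < best_d:
                    best, best_d = c, d
            new_assign.append(best)
        if new_assign == assign:
            break
        assign = new_assign
    return assign

X = [(0, 0), (0.5, 0), (0, 0.5), (5, 5), (5.5, 5), (5, 5.5)]
print(kernel_kmeans(X, 2))
```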
K-Medoids Clustering
To enhance the robustness of clustering we use medoids instead of
prototype mean values. In the one-dimensional case the medoid is the
median.
Consider an ordered collection of data points
x1 ≤ x2 ≤ … ≤ xN
The median is the central point in the sequence (if N is odd) or the average
of the two points in the middle (if N is even).
[Figure: with an outlier present, the median stays near the bulk of the data while the mean is pulled toward the outlier.]
Median as a Robust Estimator
The median is a solution to the minimization problem:
\[ \min_{ii} \sum_{k=1}^{N} |x_k - x_{ii}| = \sum_{k=1}^{N} |x_k - \mathrm{med}| \]
We increase robustness by using this objective function:
\[ \sum_{i=1}^{c} \sum_{k=1}^{N} \sum_{j=1}^{n} u_{ik} \, |x_{kj} - v_{ij}| \]
Advantage: one of the original points becomes the cluster center.
K-Means vs. K-Medoids
K-means is sensitive to outliers, because a point with an extremely
large value can substantially distort the distribution of the data.
K-medoids: instead of using the means we use the medoids – the
most centrally located points in the clusters.
[Figure: two scatter plots (axes 0–10) illustrating K-means vs. K-medoids clustering on the same data.]
K-Medoids
K-Medoids Clustering: Find representative objects (medoids) in clusters
– PAM (Partitioning Around Medoids)
• Starts from an initial set of medoids and iteratively replaces one of
the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering
• PAM works effectively for small data sets but does not scale up well
due to its computational complexity: O(k(n-k)^2) per iteration,
where n is the number of data points and k the number of clusters
Efficiency improvement on PAM
– CLARA (Clustering LARge Applications) : PAM on sub-samples of data
– CLARANS : Randomized re-sampling
PAM
Medoids – a family of the most centrally positioned data points.
PAM clustering:
• represent the structure in the data by a collection of medoids,
• group each data point around the medoid to which its distance is the shortest.
PAM starts with an arbitrary collection of elements treated as medoids.
At each step of the optimization, we swap a certain data point with one of
the medoids, provided that the swap improves the quality of the clustering.
Limitation – the size of the dataset:
PAM works well for small datasets with a small number of clusters
(about 100 data points and 5 clusters).
PAM
1. Select k representative objects arbitrarily
2. For each pair of a non-selected object h and the
selected object i, calculate the total swapping cost TCih
3. For each pair of i and h,
• If TCih < 0, i is replaced by h
• Then assign each non-selected object to the most similar
representative object
Repeat steps 2-3 until there is no change
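A simplified PAM sketch following steps 1-3 above (assumptions: Euclidean distance, and a swap is accepted whenever it lowers the total clustering cost, which corresponds to TC_ih < 0):

```python
# PAM: iteratively try swapping a medoid with a non-medoid and keep the swap
# if the total distance of the resulting clustering improves.
import math
from itertools import product

def d(x, y):
    return math.dist(x, y)

def total_cost(X, medoids):
    return sum(min(d(x, X[m]) for m in medoids) for x in X)

def pam(X, k):
    medoids = list(range(k))                      # step 1: arbitrary initial medoids
    best = total_cost(X, medoids)
    improved = True
    while improved:                               # repeat steps 2-3 until no change
        improved = False
        for m, h in product(list(medoids), range(len(X))):
            if h in medoids:
                continue
            candidate = [h if i == m else i for i in medoids]
            cost = total_cost(X, candidate)
            if cost < best:                       # the swapping cost is negative
                medoids, best, improved = candidate, cost, True
                break
    assign = [min(medoids, key=lambda j: d(x, X[j])) for x in X]
    return [X[j] for j in medoids], assign

X = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5), (20, 20)]
print(pam(X, 2))
```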
PAM: K-Medoids Algorithm
[Figure: PAM (K-medoids) walk-through with K = 2. Arbitrarily choose k objects as the initial medoids and assign each remaining object to the nearest medoid (total cost = 20). Then, in a loop: randomly select a non-medoid object O_random, compute the total cost of swapping it with a medoid O (here 26), and perform the swap if the quality is improved; repeat until no change.]
PAM: Finding the Best Cluster Center
Four Cases
PAM: Finding the Best Cluster Center
Case 1
Suppose p currently belongs to the cluster represented by O_j and that
d(p, O_i) < d(p, O_random).
If O_j is replaced by O_random, p will belong to the cluster represented by O_i.
Swap cost: C = d(p, O_i) - d(p, O_j)
Case 2
p currently belongs to the cluster represented by O_j and this time
assume that d(p, O_i) > d(p, O_random).
If O_j is replaced by O_random, p will belong to the cluster represented by O_random.
Swap cost: C = d(p, O_random) - d(p, O_j)
Case 3
p currently belongs to a cluster represented by O_i instead of O_j and
d(p, O_i) < d(p, O_random).
If O_j is replaced by O_random, p will still belong to the cluster represented by O_i.
Swap cost: C = 0
Case 4
p currently belongs to the cluster represented by O_i but
d(p, O_i) > d(p, O_random).
If we replace O_j with O_random, p will belong to O_random.
Swap cost: C = d(p, O_random) - d(p, O_i)
CLARA
CLARA (Kaufmann and Rousseeuw, 1990)
– It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the
sample is biased
CLARA
Draw five samples of size 40 + 2K (a small number):
1. For i = 1 to 5, repeat the following steps:
2. Draw a sample of 40 + 2K objects randomly from the entire data set, and call
PAM to find the K medoids of the sample.
3. For each object O_j in the entire data set, determine which of the K medoids is
the most similar to O_j.
4. Calculate the average dissimilarity of the clustering obtained in the previous
step. If this value is less than the current minimum, use this value as the
current minimum, and retain the K medoids found in step 2 as the best set of
medoids obtained so far.
5. Return to step 1 to start the next iteration.
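A CLARA sketch following the five steps above; it assumes the pam() and d() helpers from the earlier PAM sketch and Euclidean dissimilarity.

```python
# CLARA: run PAM on several small random samples and keep the medoid set that
# gives the lowest average dissimilarity over the entire data set.
import random

def clara(X, K, n_samples=5, sample_size=None):
    sample_size = sample_size or min(len(X), 40 + 2 * K)
    best_medoids, best_avg = None, float("inf")
    for _ in range(n_samples):
        sample = random.sample(X, sample_size)            # step 2: sample + PAM
        medoids, _ = pam(sample, K)
        # step 3: assign every object of the full data set to the nearest medoid
        dissim = [min(d(x, m) for m in medoids) for x in X]
        avg = sum(dissim) / len(X)                        # step 4: average dissimilarity
        if avg < best_avg:
            best_medoids, best_avg = medoids, avg
    return best_medoids, best_avg
```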
Model-Based Clustering
A mixture of data as the underlying model:
each component is described by some conditional probability
density function with its own parameters.
\[ p(x \mid \theta_1, \theta_2, \dots, \theta_c) = \sum_{i=1}^{c} p(x \mid \theta_i)\, p_i \]
The task is to estimate the parameters of this mixture from the data.
Mixture of Data Model
Maximum likelihood estimation
Given data x1, x2, …, xN, choose the parameters θ such that
\[ P(X \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta) \]
is maximized.
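For illustration, a tiny Python sketch (with made-up parameter values) that evaluates the logarithm of this likelihood for a two-component one-dimensional Gaussian mixture, the quantity a maximum likelihood estimator would maximize:

```python
# Log-likelihood of a 1-D Gaussian mixture: sum over data of
# log( sum_i p_i * N(x | mu_i, sigma_i) ).
import math

def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(data, components):
    # components: list of (prior p_i, mean, std) for each mixture source
    return sum(math.log(sum(p * gauss(x, mu, s) for p, mu, s in components)) for x in data)

data = [1.0, 1.2, 0.8, 5.1, 4.9, 5.3]
theta = [(0.5, 1.0, 0.3), (0.5, 5.0, 0.3)]
print(log_likelihood(data, theta))
```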
Density-Based Clustering
Clustering based on density (a local cluster criterion), such as density-connected points.
Characteristics:
– Discovers clusters of arbitrary shape
– Handles noise
– One scan of the data
– Needs density parameters to be specified
Example algorithms:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98)
Density-Based Clustering
Two parameters are used:
– Eps: the maximum radius of the neighborhood
– MinPts: the minimum number of points in the Eps-neighborhood of that point
• N_Eps(p) = {q ∈ D | dist(p,q) <= Eps}
• Directly density-reachable: a point p is directly density-reachable
from a point q (w.r.t. Eps, MinPts) if
– p belongs to N_Eps(q), and
– q satisfies the core point condition: |N_Eps(q)| >= MinPts
[Figure: p in the Eps-neighborhood of core point q, with MinPts = 5 and Eps = 1 cm.]
Density-Reachable and Density-Connected
• Density-reachable:
– A point p is density-reachable from a point q (w.r.t. Eps, MinPts) if there is a
chain of points p1, …, pn, with p1 = q and pn = p, such that p_{i+1} is directly
density-reachable from p_i
[Figure: chain q -> p1 -> p.]
• Density-connected:
– A point p is density-connected to a point q (w.r.t. Eps, MinPts) if there is a
point o such that both p and q are density-reachable from o
[Figure: p and q both density-reachable from o.]
DBSCAN: Pseudocode
• Arbitrarily select a point p
• Retrieve all points density-reachable from p (with Eps and
MinPts)
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the data
• Continue the process until all points have been processed
Complexity: O(nlogn)
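A compact DBSCAN sketch following the pseudocode above (assumptions: Euclidean distance; the label -1 marks noise, non-negative integers mark cluster ids):

```python
# DBSCAN: grow a cluster from each unprocessed core point by collecting all
# points that are density-reachable from it.
import math

def region_query(X, i, eps):
    return [j for j in range(len(X)) if math.dist(X[i], X[j]) <= eps]

def dbscan(X, eps, min_pts):
    labels = [None] * len(X)               # None = not yet processed
    cluster = -1
    for i in range(len(X)):
        if labels[i] is not None:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:       # not a core point: mark as noise for now
            labels[i] = -1
            continue
        cluster += 1                       # core point: start a new cluster
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                       # expand with density-reachable points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster        # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(X, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # only core points propagate the cluster
    return labels

X = [(1, 1), (1.1, 1), (1, 1.2), (8, 8), (8.1, 8), (8, 8.2), (4, 5)]
print(dbscan(X, eps=0.5, min_pts=2))       # [0, 0, 0, 1, 1, 1, -1]
```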
DBSCAN: Density-Based Spatial Clustering with Noise
• Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with
noise
[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.]
DBSCAN: Example
http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
OPTICS
OPTICS: Ordering Points To Identify the Clustering Structure (Ankerst
et al., SIGMOD’99)
– Produces an ordering of the data points with respect to the density-based clustering
structure
– This cluster-ordering contains information equivalent to the density-based clusterings
Complexity: O(nlogn)
OPTICS: Extension of DBSCAN
Core Distance (CD) of an object p is the smallest ε' value that makes
p a core object; if p is not a core object, the CD is undefined.
Reachability Distance (RD) of an object o with respect to p is the
maximum of the CD of p and the Euclidean distance d(p, o):
RD(o, p) = max(CD(p), d(p, o))
[Figure: with MinPts = 5 and ε = 3 cm, RD(p1, p) = 2.8 cm and RD(p2, p) = 4 cm.]
OPTICS Reachability Chart
Points are ordered according to their density reachability: points reachable with a
lower ε' are processed first. For each point its RD is calculated, and a graph
(reachability plot) is formed, which visualizes the clustering structure of the data.
[Figure: reachability-distance (RD) plot.]
Grid-Based Clustering
Describes structure in data in the language of generic
geometric constructs – hyperboxes and their combinations
[Figures: a collection of clusters of different geometry; formation of clusters by merging adjacent hyperboxes of the grid.]
Grid-Based Clustering
Hyperboxes {B1, B2, …, Bp} with the following requirements:
a) each B_i is nonempty in the sense that it includes some data points,
b) the hyperboxes are disjoint, that is, B_i ∩ B_j = ∅ if i ≠ j,
c) the union of all hyperboxes covers all the data, that is, \( \bigcup_{i=1}^{p} B_i \supseteq X \),
where X = {x1, x2, …, xN}.
It is also required that such hyperboxes “cover” some maximal number
(say b_max) of data points.
Grid-Based Clustering: Steps
• Formation of the grid structure
• Insertion of data into the grid structure
• Computation of the density index of each hyperbox of the grid structure
• Sorting the hyperboxes with respect to the values of their density index
• Identification of cluster centres (viz. the hyperboxes of the highest density)
• Traversal of neighboring hyperboxes and the merging process
A sketch of these steps appears below.
Choice of the grid:
– too rough a grid may not capture the details of the structure in the data;
– too detailed a grid produces a significant computational overhead.
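A two-dimensional Python sketch of these steps (the cell size and density threshold are assumed values): bin the points into a regular grid, keep the dense cells, and merge adjacent dense cells into clusters.

```python
# Grid-based clustering in 2-D: density index per cell, then flood-fill
# merging of neighboring dense cells.
from collections import defaultdict

def grid_clusters(X, cell=1.0, min_points=2):
    cells = defaultdict(list)                       # grid formation + data insertion
    for p in X:
        cells[(int(p[0] // cell), int(p[1] // cell))].append(p)
    dense = {c for c, pts in cells.items() if len(pts) >= min_points}
    clusters, seen = [], set()
    for start in dense:                             # traverse and merge neighbors
        if start in seen:
            continue
        stack, group = [start], []
        while stack:
            cur = stack.pop()
            if cur in seen or cur not in dense:
                continue
            seen.add(cur)
            group.extend(cells[cur])
            x, y = cur
            stack += [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        clusters.append(group)
    return clusters

X = [(0.1, 0.2), (0.4, 0.5), (0.9, 1.1), (5.0, 5.0), (5.2, 5.1), (9.0, 0.0)]
print(grid_clusters(X))
```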
Grid-Based Clustering
• Clustering highly-dimensional data
• Clusters may exist only in some subspaces
• Methods
– Subspace-clustering: find clusters in all possible subspaces
• Example: CLIQUE
CLIQUE
CLIQUE can be considered as both density-based and grid-based
– It partitions each dimension into the same number of equal
length intervals / rectangular units
– A unit is dense if the fraction of total data points contained in the
unit exceeds the user-specified input parameter
– A cluster is formed as a maximal set of connected dense units
within a subspace
• Uses the monotonicity property: If a collection of points S forms a
cluster in a K-dimensional space, then S is also part of a cluster in
any (K-1) dimensional projections of this space.
• Pruning is used to eliminate outliers that are not dense enough.
This threshold is called the “optimal cut point”
[Figure: CLIQUE example with grids over the (salary, age) and (vacation, age) subspaces; salary in units of $10,000, vacation in weeks, age roughly 20–60.]
References
Han J. and Kamber M., Data Mining: Concepts and Techniques, Morgan Kaufmann
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletFCM.html
Yaling Pei,
http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
http://www.cs.uiuc.edu/~yyz/teaching/InfoVis-s10/gps-clustering.pdf
Yizhou Yu, www.cs.ndsu.nodak.edu/~adenton/datamining/DENCLUE.PPT
Li Cheng, webdocs.cs.ualberta.ca/~zaiane/courses/cmput695-00/.../LiClique.pdf