Clustering
Instructor: Qiang Yang
Hong Kong University of Science and Technology
[email protected]
Thanks: J.W. Han, I. Witten, E. Frank
1
Essentials

• Terminology:
  • Objects = rows = records
  • Variables = attributes = features
• A good clustering method: high on intra-class similarity and low on inter-class similarity
• What is similarity?
  • Based on computation of distance:
    • Between two numerical attributes
    • Between two nominal attributes
    • Between mixed attributes
2
The database

The data set is an n × p matrix, in which object i is the i-th row:

\[
\begin{pmatrix}
x_{11} & \cdots & \cdots & x_{1p} \\
\vdots &        &        & \vdots \\
x_{i1} & \cdots & \cdots & x_{ip} \\
\vdots &        &        & \vdots \\
x_{n1} & \cdots & \cdots & x_{np}
\end{pmatrix}
\]
3
Major clustering methods

• Partition-based (k-means)
  • Produces sphere-like clusters
  • Good when the number of clusters is known and the database is small or medium sized
• Hierarchical methods (agglomerative or divisive)
  • Produce trees of clusters
  • Fast
• Density-based (DBSCAN)
  • Produces arbitrarily shaped clusters
  • Good when dealing with spatial clusters (maps)
• Grid-based
  • Produces clusters based on grids
  • Fast for large, multidimensional databases
• Model-based
  • Based on statistical models
  • Allows objects to belong to several clusters
4
The K-Means Clustering Method: for numerical attributes

Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k non-empty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when no new assignments are made (see the sketch below)
5
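As a rough illustration (not code from the slides), here is a minimal NumPy sketch of these steps; it initializes by arbitrarily choosing k objects as the centers, as in the worked example that follows, and stops when the centers no longer move:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch. X is an (n, p) array of numerical attributes."""
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute seed points as the centroids (mean points) of the clusters
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when reassignment no longer changes the centers
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```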
The mean point

[Figure: scatter plot of a small (X, Y) data set with its mean point, approximately (2.5, 2.75), marked.]

The mean point can be a virtual point
6
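As a tiny sketch (the coordinates below are hypothetical, chosen only to be consistent with the mean marked on the slide), the mean of a set of objects need not coincide with any object:

```python
import numpy as np

# Hypothetical (X, Y) objects; their mean is a "virtual" point
points = np.array([[1, 2], [2, 4], [3, 3], [4, 2]])
print(points.mean(axis=0))  # [2.5  2.75] -- not one of the data points
```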
The K-Means Clustering Method

Example (K = 2):
  1. Arbitrarily choose K objects as the initial cluster centers
  2. Assign each object to the most similar center
  3. Update the cluster means
  4. Reassign the objects and update the cluster means again; repeat until the assignments stop changing

[Figure: a sequence of scatter plots showing the assignments and cluster centers after each assign/update step.]
7
Comments on the K-Means Method

• Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
• Comment: often terminates at a local optimum.
• Weaknesses:
  • Applicable only when the mean is defined; what about categorical data?
  • Need to specify k, the number of clusters, in advance
  • Unable to handle noisy data and outliers too well
  • Not suitable for discovering clusters with non-convex shapes
8
Robustness

[Figure: the same small (X, Y) data set, but with one Y value replaced by an extreme outlier (400); the mean of Y jumps from about 2.75 to about 101.5, plotted on a logarithmic scale. A single extreme value can drag the mean far away from the rest of the data.]
9
Variations of the K-Means Method

• A few variants of k-means differ in:
  • Selection of the initial k means
  • Dissimilarity calculations
  • Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang '98)
  • Replacing means of clusters with modes
  • Using new dissimilarity measures to deal with categorical objects
  • Using a frequency-based method to update modes of clusters (see the sketch below)
• A mixture of categorical and numerical data: the k-prototypes method
10
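To make the k-modes ingredients concrete, here is a minimal sketch of a simple-matching dissimilarity and a frequency-based mode update (illustrative helpers of my own, not Huang's original code):

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Number of attributes on which two categorical objects disagree."""
    return sum(a != b for a, b in zip(x, y))

def mode_of(cluster):
    """Frequency-based mode: the most frequent value of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))

# Tiny example
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_of(cluster))                                             # ('red', 'small')
print(matching_dissimilarity(("red", "small"), ("blue", "large")))  # 2
```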
K-Modes: See J. X. Huang’s paper online
(Data Mining and Knowledge Discovery Journal, Springer)
11
Formalization of K-Means
12
K-Means: Cont.
13
K-Modes: See J. X. Huang’s paper online
(Data Mining and Knowledge Discovery Journal, Springer)
14
K-Modes (Cont.)
15
K-Modes
16
K-Modes: Cost Function
17
Finding K-Modes
18
Mixed Types: K-Prototypes
19
K-Modes: Evaluation Data
20
K-Modes: Evaluation
21
Some Experiments
22
What is the problem of the k-means method?

• The k-means algorithm is sensitive to outliers!
  • An object with an extremely large value may substantially distort the distribution of the data.
• K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

[Figure: two scatter plots contrasting a cluster reference point computed as the mean (pulled toward an outlier) with a medoid (an actual, centrally located object).]
23
The K-Medoids Clustering Method

• Find representative objects, called medoids, in clusters
• Medoids are located in the center of the clusters
• Given data points, how to find the medoid? (See the sketch below.)

[Figure: scatter plot with the medoid highlighted among the data points.]
24
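One simple way to find it (a sketch of my own, assuming Euclidean distance): the medoid is the object whose total distance to all other objects in the cluster is smallest.

```python
import numpy as np

def medoid(points):
    """Most centrally located object: minimizes the sum of distances to all others."""
    X = np.asarray(points, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    return X[dists.sum(axis=1).argmin()]

# Unlike the mean, the medoid is an actual data point and is not dragged away by the outlier
print(medoid([[1, 2], [2, 4], [3, 3], [100, 100]]))  # [3. 3.]
```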
K-Medoids: most centrally located objects
25
CLARA
26
CLASA: Simulated Annealing
27
Sampling based method: MCMRS
28
KMedoids: Evaluation
29
Density-Based Clustering Methods

• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as a termination condition
• Several interesting studies:
  • DBSCAN: Ester, et al. (KDD'96)
  • OPTICS: Ankerst, et al. (SIGMOD'99)
  • DENCLUE: Hinneburg & D. Keim (KDD'98)
  • CLIQUE: Agrawal, et al. (SIGMOD'98)
30
Density-Based Clustering

• Clustering based on density (a local cluster criterion), such as density-connected points
• Each cluster has a considerably higher density of points than the region outside of the cluster
31
Density-Based Clustering: Background

• Two parameters:
  • Eps: maximum radius of the neighbourhood
  • MinPts: minimum number of points in an Eps-neighbourhood of that point
• N_Eps(p) = {q belongs to D | dist(p, q) <= Eps}
• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  • 1) p belongs to N_Eps(q), and
  • 2) the core point condition holds: |N_Eps(q)| >= MinPts

[Figure: point q with p inside its Eps-neighbourhood; MinPts = 5, Eps = 1 cm.]
32
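A small sketch of these two definitions (my own illustration, assuming Euclidean distance over a point array D):

```python
import numpy as np

def eps_neighbourhood(D, p, eps):
    """N_eps(p): indices of all points q in D with dist(p, q) <= eps."""
    dists = np.linalg.norm(D - D[p], axis=1)
    return np.where(dists <= eps)[0]

def directly_density_reachable(D, p, q, eps, min_pts):
    """True if point p is directly density-reachable from point q w.r.t. eps, min_pts."""
    n_q = eps_neighbourhood(D, q, eps)
    # 1) p belongs to N_eps(q), and 2) q satisfies the core point condition
    return p in n_q and len(n_q) >= min_pts
```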
Density-Based Clustering: Background (II)

• Density-reachable:
  • A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi

[Figure: a chain of points leading from q through p1 to p.]

• Density-connected:
  • A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

[Figure: p and q both density-reachable from a common point o.]
33
DBSCAN: Density-Based Spatial Clustering of Applications with Noise

• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5.]
34
DBSCAN: The Algorithm

• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed (see the sketch below)
35
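Here is a compact sketch of the algorithm above (my own illustration, not the original DBSCAN implementation); it labels each point with a cluster index, or -1 for noise:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns one label per point: 0, 1, 2, ... or -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.where(dists[i] <= eps)[0] for i in range(n)]  # Eps-neighbourhoods
    labels = np.full(n, -1)          # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbours[p]) < min_pts:
            continue                 # p is not a core point: leave it as noise (or a border point)
        # p is a core point: grow a new cluster from its neighbourhood
        labels[p] = cluster
        queue = list(neighbours[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # border or core point joins the current cluster
            if not visited[q]:
                visited[q] = True
                if len(neighbours[q]) >= min_pts:
                    queue.extend(neighbours[q])  # q is also a core point: expand further
        cluster += 1
    return labels
```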
DBSCAN Properties

• Generally takes O(n log n) time
• Still requires the user to supply MinPts and Eps
• Advantages:
  • Can find clusters of arbitrary shape
  • Requires only a minimal number (2) of parameters
36
Model-Based Clustering Methods

• Attempt to optimize the fit between the data and some mathematical model
• Statistical and AI approaches:
  • Conceptual clustering
    • A form of clustering in machine learning
    • Produces a classification scheme for a set of unlabeled objects
    • Finds a characteristic description for each concept (class)
  • COBWEB (Fisher '87)
    • A popular and simple method of incremental conceptual learning
    • Creates a hierarchical clustering in the form of a classification tree
    • Each node refers to a concept and contains a probabilistic description of that concept
37
The COBWEB Conceptual Clustering Algorithm (8.8.1)

• The COBWEB algorithm was developed by D. Fisher (1987) for clustering objects in an object-attribute data set.
  • Fisher, Douglas H. (1987). Knowledge Acquisition Via Incremental Conceptual Clustering.
• The COBWEB algorithm yields a classification tree that characterizes each cluster with a probabilistic description
  • Probabilistic description of a node: (fish, prob = 0.92)
• Properties:
  • Incremental clustering algorithm, based on probabilistic categorization trees
  • The search for a good clustering is guided by a quality measure for partitions of data
• COBWEB only supports nominal attributes; CLASSIT is the version that works with nominal and numerical attributes
38
The Classification Tree Generated
by the COBWEB Algorithm
39
Input: A set of data like before

• Can automatically guess the class attribute
  • That is, after clustering, each cluster more or less corresponds to one of the Play = Yes/No categories
  • Example: applied to the vote data set, COBWEB can correctly guess the party of a senator based on the past 14 votes!
40
Clustering: COBWEB

• In the beginning, the tree consists of an empty node
• Instances are added one by one, and the tree is updated appropriately at each stage
• Updating involves finding the right leaf for an instance (possibly restructuring the tree)
• Updating decisions are based on the partition utility and category utility measures
41
Clustering: COBWEB

Notation: $A_i = V_{ij}$ denotes an attribute-value pair; $C_k$ denotes a class.

Intra-class similarity: $P(A_i = V_{ij} \mid C_k)$

• The larger this probability, the greater the proportion of class members sharing the value $V_{ij}$, and the more predictable the value is of class members.
42
Clustering: COBWEB

Inter-class similarity: $P(C_k \mid A_i = V_{ij})$

• The larger this probability, the fewer the objects in other classes that share this value $V_{ij}$, and the more predictive the value is of class $C_k$.
43
Clustering: COBWEB

Partition utility:

\[
PU(C_1, C_2, \ldots, C_n) = \sum_{k=1}^{n} \sum_{i} \sum_{j} P(A_i = V_{ij}) \, P(C_k \mid A_i = V_{ij}) \, P(A_i = V_{ij} \mid C_k)
\]

• The formula is a trade-off between intra-class similarity and inter-class dissimilarity, summed across all classes (k), attributes (i), and values (j).
44
Clustering: COBWEB

Rewriting the equation using Bayes' rule:

\[
P(A_i = V_{ij}) \, P(C_k \mid A_i = V_{ij})
  = P(A_i = V_{ij}) \, \frac{P(C_k, A_i = V_{ij})}{P(A_i = V_{ij})}
  = P(C_k, A_i = V_{ij})
  = P(C_k) \, P(A_i = V_{ij} \mid C_k)
\]

Partition utility can therefore be rewritten as:

\[
PU(C_1, C_2, \ldots, C_n) = \sum_{k=1}^{n} P(C_k) \sum_{i} \sum_{j} P(A_i = V_{ij} \mid C_k)^2
\]
45
Clustering: COBWEB

Category utility:

\[
CU(C_1, C_2, \ldots, C_n) = \sum_{k=1}^{n} P(C_k) \sum_{i} \sum_{j} \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]
\]

• The first squared term is the expected number of attribute values that can be correctly guessed given knowledge of the cluster (posterior probability); the second is the expected number of correct guesses given no such knowledge (prior probability). CU measures the increase.
46
The Category Utility Function

• The COBWEB algorithm operates based on the so-called category utility function (CU), which measures clustering quality.
• If we partition a set of objects into m clusters, then the CU of this particular partition is

\[
CU = \frac{1}{m} \sum_{k=1}^{m} P(C_k) \left[ \sum_{i} \sum_{j} P(A_i = V_{ij} \mid C_k)^2 - \sum_{i} \sum_{j} P(A_i = V_{ij})^2 \right]
\]

• Question: Why divide by m?
  • Hint: if m = #objects, CU is maximal!
47
Insights of the CU Function

• For a given object in cluster C_k, if we guess its attribute values according to their probabilities of occurring, then the expected number of attribute values that we can correctly guess is

\[
\sum_{i} \sum_{j} P(A_i = V_{ij} \mid C_k)^2
\]
48
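To make the CU computation concrete, here is a small sketch (the data layout, a list of clusters of attribute dictionaries, and the helper names are my own illustration, not from the slides):

```python
from collections import Counter

def expected_correct_guesses(objects):
    """sum_i sum_j P(A_i = V_ij)^2, estimated from the given objects."""
    total = 0.0
    for attribute in objects[0]:
        counts = Counter(obj[attribute] for obj in objects)
        total += sum((c / len(objects)) ** 2 for c in counts.values())
    return total

def category_utility(partition):
    """CU of a partition: a list of clusters, each a list of attribute dicts."""
    all_objects = [obj for cluster in partition for obj in cluster]
    n, m = len(all_objects), len(partition)
    baseline = expected_correct_guesses(all_objects)     # prior-only guesses
    cu = 0.0
    for cluster in partition:
        p_k = len(cluster) / n                           # P(C_k)
        cu += p_k * (expected_correct_guesses(cluster) - baseline)
    return cu / m                                        # divide by the number of clusters

# Tiny example with two clusters of nominal objects
clusters = [
    [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"}],
    [{"outlook": "rainy", "windy": "yes"}, {"outlook": "rainy", "windy": "yes"}],
]
print(category_utility(clusters))
```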
Finite mixtures

• Probabilistic clustering algorithms model the data using a mixture of distributions
• Each cluster is represented by one distribution
  • The distribution governs the probabilities of attribute values in the corresponding cluster
• They are called finite mixtures because there is only a finite number of clusters being represented
• Usually the individual distributions are normal distributions
• Distributions are combined using cluster weights
49
A two-class mixture model

data (class, value):
  A 51   A 43   B 62   B 64   A 45   A 42   A 46   A 45   A 45
  B 62   A 47   A 52   B 64   A 51   B 65   A 48   A 49   A 46
  B 64   A 51   A 52   B 62   A 49   A 48   B 62   A 43   A 40
  A 48   B 64   A 51   B 63   A 43   B 65   B 66   B 65   A 46
  A 39   B 62   B 64   A 52   B 63   B 64   A 48   B 64   A 48
  A 51   A 48   B 64   A 42   A 48   A 41

model:
  μ_A = 50, σ_A = 5, p_A = 0.6
  μ_B = 65, σ_B = 2, p_B = 0.4
50
Using the mixture model

• The probability of an instance x belonging to cluster A is:

\[
\Pr[A \mid x] = \frac{\Pr[x \mid A]\,\Pr[A]}{\Pr[x]} = \frac{f(x; \mu_A, \sigma_A)\, p_A}{\Pr[x]}
\]

with

\[
f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\]

• The likelihood of an instance given the clusters is:

\[
\Pr[x \mid \text{the distributions}] = \sum_{i} \Pr[x \mid \text{cluster}_i]\,\Pr[\text{cluster}_i]
\]
51
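A small numeric sketch of these formulas, using the parameters of the two-class model from the data slide (the helper name is my own):

```python
import math

def normal_density(x, mu, sigma):
    """f(x; mu, sigma): normal density for one cluster."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Model parameters: A = N(50, 5) with weight 0.6, B = N(65, 2) with weight 0.4
mu_a, sigma_a, p_a = 50.0, 5.0, 0.6
mu_b, sigma_b, p_b = 65.0, 2.0, 0.4

x = 55.0
pr_x = p_a * normal_density(x, mu_a, sigma_a) + p_b * normal_density(x, mu_b, sigma_b)  # Pr[x]
pr_a_given_x = p_a * normal_density(x, mu_a, sigma_a) / pr_x                            # Pr[A | x]
print(pr_a_given_x)
```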
Learning the clusters

• Assume we know that there are k clusters
• To learn the clusters we need to determine their parameters
  • I.e., their means and standard deviations
• We actually have a performance criterion: the likelihood of the training data given the clusters
• Fortunately, there exists an algorithm that finds a local maximum of the likelihood
52
The EM algorithm

• EM algorithm: expectation-maximization algorithm
  • A generalization of k-means to the probabilistic setting
• Similar iterative procedure:
  1. Calculate the cluster probability for each instance (expectation step)
  2. Estimate the distribution parameters based on the cluster probabilities (maximization step)
• Cluster probabilities are stored as instance weights
53
More on EM

• Estimating parameters from weighted instances:

\[
\mu_A = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}
\]

\[
\sigma_A^2 = \frac{w_1 (x_1 - \mu)^2 + w_2 (x_2 - \mu)^2 + \cdots + w_n (x_n - \mu)^2}{w_1 + w_2 + \cdots + w_n}
\]

• The procedure stops when the log-likelihood saturates
• Log-likelihood (increases with each iteration; we wish it to be as large as possible):

\[
\sum_{i} \log\bigl(p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B]\bigr)
\]
54
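Putting the expectation step and these weighted estimates together, here is a minimal EM sketch for the two-component, one-dimensional Gaussian mixture used above (the initialization and stopping threshold are my own choices):

```python
import numpy as np

def em_two_gaussians(x, max_iter=100, tol=1e-6):
    """Minimal EM sketch for a two-component, one-dimensional Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    # Arbitrary initial parameters: means, standard deviations, and cluster weights
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    p = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: cluster probabilities for each instance, stored as instance weights
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        joint = p * dens                                  # p_k * Pr[x_i | cluster k]
        w = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weighted means, standard deviations, and cluster weights
        totals = w.sum(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / totals
        sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / totals)
        p = totals / len(x)
        # Stop when the log-likelihood saturates
        ll = np.log(joint.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, sigma, p
```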