MIS 542
Data Mining
Concepts and Techniques
— Chapter 5 —
Clustering
2014/2015 Fall
1
Chapter 5. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
2
What is Cluster Analysis?




Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
Cluster analysis
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
Clustering is unsupervised learning: no predefined classes
Typical applications
 As a stand-alone tool to get insight into data
distribution
 As a preprocessing step for other algorithms
3
General Applications of Clustering





Pattern Recognition
Spatial Data Analysis
 create thematic maps in GIS by clustering feature
spaces
 detect spatial clusters and explain them in spatial data
mining
Image Processing
Economic Science (especially market research)
WWW
 Document classification
 Cluster Weblog data to discover groups of similar access
patterns
4
Examples of Clustering Applications





Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters
should be clustered along continent faults
5
What Is Good Clustering?

A good clustering method will produce high quality clusters
with


high intra-class similarity

low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.

The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
6
Requirements of Clustering in Data Mining

Scalability

Ability to deal with different types of attributes

Ability to handle dynamic data

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to
determine input parameters

Able to deal with noise and outliers

Insensitive to order of input records

High dimensionality

Incorporation of user-specified constraints

Interpretability and usability
7
Chapter 5. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
8
Data Structures


Data matrix (two modes): n objects × p variables

  [ x11  ...  x1f  ...  x1p ]
  [ ...  ...  ...  ...  ... ]
  [ xi1  ...  xif  ...  xip ]
  [ ...  ...  ...  ...  ... ]
  [ xn1  ...  xnf  ...  xnp ]

Dissimilarity matrix (one mode): n × n, lower triangular

  [ 0                              ]
  [ d(2,1)  0                      ]
  [ d(3,1)  d(3,2)  0              ]
  [ :       :       :              ]
  [ d(n,1)  d(n,2)  ...        0   ]
9
Properties of Dissimilarity Measures

Properties
d(i,j) ≥ 0 for i ≠ j
 d(i,i) = 0
 d(i,j) = d(j,i)  (symmetry)
 d(i,j) ≤ d(i,k) + d(k,j)  (triangle inequality)

Exercise: Can you find examples where the distance between objects does not obey the symmetry property?
10
Measure the Quality of Clustering





Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
There is a separate “quality” function that measures the
“goodness” of a cluster.
The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
Weights should be associated with different variables
based on applications and data semantics.
It is hard to define “similar enough” or “good enough”

the answer is typically highly subjective.
11
Type of data in clustering analysis

Interval-scaled variables:

Binary variables:

Nominal, ordinal, and ratio variables:

Variables of mixed types:
12
Classification by Scale






Nominal scale: merely distinguishes classes: with respect to objects A and B, either X_A = X_B or X_A ≠ X_B
 e.g.: color {red, blue, green, …}
 gender {male, female}
 occupation {engineering, management, …}
Ordinal scale: indicates an ordering of objects in addition to distinguishing them:
 X_A = X_B or X_A ≠ X_B, and X_A > X_B or X_A < X_B
 e.g.: education {no school < primary sch. < high sch. < undergrad < grad}
 age {young < middle < old}
 income {low < medium < high}
13








Interval scale: assigns a meaningful measure of the difference between two objects
 Not only X_A > X_B, but X_A is X_A – X_B units different from X_B
 e.g.: specific gravity, temperature in °C or °F
 The boiling point of water is 100 °C (or 180 °F) higher than its melting point
Ratio scale: an interval scale with a meaningful zero point
 X_A > X_B, and X_A is X_A / X_B times greater than X_B
 e.g.: height, weight, age (as an integer), temperature in °K or °R
 Water boils at 373 °K and melts at 273 °K
 The boiling point of water is 1.37 times hotter than its melting point
14
Comparison of Scales







The strongest scale is ratio; the weakest is nominal
Ahmet's height is 2.00 meters: H_A
Mehmet's height is 1.50 meters: H_M
 H_A ≠ H_M — nominal: their heights are different
 H_A > H_M — ordinal: Ahmet is taller than Mehmet
 H_A – H_M = 0.50 meters — interval: Ahmet is 50 cm taller than Mehmet
 H_A / H_M = 1.333 — ratio scale, no matter whether height is measured in meters or inches
15
Interval-valued variables

Standardize data

Calculate the mean absolute deviation:

  s_f = (1/n) ( |x_1f – m_f| + |x_2f – m_f| + … + |x_nf – m_f| )

  where m_f = (1/n) (x_1f + x_2f + … + x_nf)

Calculate the standardized measurement (z-score):

  z_if = (x_if – m_f) / s_f

Using the mean absolute deviation is more robust than using the standard deviation
16
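As a quick illustration of this standardization, here is a minimal Python sketch; the function name and the toy data are only for this example.

import numpy as np

def mad_standardize(X):
    """Standardize each column of X (objects x variables) using the mean
    absolute deviation instead of the standard deviation, as on the slide."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                   # m_f
    s = np.abs(X - m).mean(axis=0)       # s_f, mean absolute deviation
    return (X - m) / s                   # z_if

# Example: three objects measured on two interval-scaled variables
X = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
print(mad_standardize(X))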
Other Standardizations







Min–max scaling: scale between 0 and 1, or between –1 and 1
Decimal scaling
For ratio-scaled variables:
 Mean transformation: z_if = x_if / mean_f
  Measures x_if in units of the mean of variable f
 Log transformation: z_if = log(x_if)
17
[Figure: two scatter plots of standardized variables X1 and X2 — case I (uncorrelated) and case II (highly correlated)]
In both cases the variables have zero mean and unit variance after z-score standardization.
X1·X2 = 0 in case I, whereas X1·X2 ≈ 1 in case II.
Shall we use the same distance measure in both cases after obtaining the z-scores?
18
Exercise
[Figure: the same two scatter plots, with a point A marked in each — case I (uncorrelated X1, X2) and case II (correlated X1, X2)]
Suppose d(A, O) = 0.5 in both case I and case II, where O is the origin.
Does this reflect the distance between A and the origin equally well in both cases?
Suggest a transformation that handles the correlation between variables.
19
Similarity and Dissimilarity Between
Objects


Distances are normally used to measure the similarity or dissimilarity between two data objects
Some popular ones include the Minkowski distance:

  d(i, j) = ( |x_i1 – x_j1|^q + |x_i2 – x_j2|^q + … + |x_ip – x_jp|^q )^(1/q)

where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q ≥ 1

If q = 1, d is the Manhattan distance:

  d(i, j) = |x_i1 – x_j1| + |x_i2 – x_j2| + … + |x_ip – x_jp|
20
Similarity and Dissimilarity Between
Objects (Cont.)

If q = 2, d is the Euclidean distance:

  d(i, j) = ( |x_i1 – x_j1|^2 + |x_i2 – x_j2|^2 + … + |x_ip – x_jp|^2 )^(1/2)

Properties
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)

Also, one can use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures
21
Similarity and Dissimilarity Between
Objects (Cont.)

Weights can be assigned to variables:

  d(i, j) = ( w_1 |x_i1 – x_j1|^2 + w_2 |x_i2 – x_j2|^2 + … + w_p |x_ip – x_jp|^2 )^(1/2)

where w_i, i = 1…p, are weights showing the importance of each variable
22
[Figure: two points X_A and X_B — the Manhattan distance follows the axis-parallel (city-block) path between X_A and X_B, while the Euclidean distance is the straight-line segment between them]
23



When q → ∞, the Minkowski distance becomes the Chebyshev distance (the L_∞ metric):

  lim_{q→∞} ( |x_i1 – x_j1|^q + |x_i2 – x_j2|^q + … + |x_ip – x_jp|^q )^(1/q) = max_f |x_if – x_jf|
24
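A compact Python sketch of the Minkowski family of distances discussed on the last few slides; the function names and the sample points are only for illustration.

import numpy as np

def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points.
    q = 1 gives Manhattan, q = 2 gives Euclidean."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return (np.abs(x - y) ** q).sum() ** (1.0 / q)

def chebyshev(x, y):
    """Limit q -> infinity: the largest coordinate-wise difference."""
    return np.abs(np.asarray(x, float) - np.asarray(y, float)).max()

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(a, b, 1))   # Manhattan: 5.0
print(minkowski(a, b, 2))   # Euclidean: ~3.61
print(chebyshev(a, b))      # Chebyshev: 3.0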
Exercise

Take one of these points as the origin and draw the locus of points that are 1, 2 and 3 units away from the origin in two dimensions according to
 Manhattan distance
 Euclidean distance
 Chebyshev distance
25
Binary Variables


Symmetric vs. asymmetric
Symmetric: both of its states are equally valuable and carry the same weight
 gender: male, female
  0 male, 1 female — arbitrarily coded as 0 or 1
Asymmetric variables
 Outcomes are not equally important
 Encoded by 0 and 1
 E.g. patient smoker or not
  1 for smoker, 0 for nonsmoker — asymmetric
 Positive and negative outcomes of a disease test
  HIV positive coded 1, HIV negative coded 0
26
Binary Variables

A contingency table for binary data

              Object j
                1      0   | sum
  Object i  1   a      b   | a+b
            0   c      d   | c+d
      sum     a+c    b+d   | p

Simple matching coefficient (invariant, if the binary variable is symmetric):

  d(i, j) = (b + c) / (a + b + c + d)

Jaccard coefficient (noninvariant if the binary variable is asymmetric):

  d(i, j) = (b + c) / (a + b + c)
27
Dissimilarity between Binary
Variables

Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0

  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
28
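The three dissimilarities above can be checked with a few lines of Python; the encoding (Y/P → 1, N → 0) follows the slide, while the helper name is ours.

# Asymmetric-binary (Jaccard-style) dissimilarity for the Jack / Mary / Jim example
records = {
    "jack": [1, 0, 1, 0, 0, 0],   # Fever, Cough, Test-1..Test-4
    "mary": [1, 0, 1, 0, 1, 0],
    "jim":  [1, 1, 0, 0, 0, 0],
}

def asym_binary_dissim(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)          # d(i,j) = (b+c)/(a+b+c)

print(round(asym_binary_dissim(records["jack"], records["mary"]), 2))  # 0.33
print(round(asym_binary_dissim(records["jack"], records["jim"]), 2))   # 0.67
print(round(asym_binary_dissim(records["jim"], records["mary"]), 2))   # 0.75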
Nominal Variables


A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
 m: # of matches, p: total # of variables

  d(i, j) = (p – m) / p

 Higher weights can be assigned to variables with a large number of states
Method 2: use a large number of binary variables
 creating a new binary variable for each of the M nominal states
29
Example









2 nominal variables: faculty and country for students
Faculty {Eng., Applied Sc., Pure Sc., Admin., …}: 5 distinct values
Country {Turkey, USA, …}: 10 distinct values
p = 2, just two variables
The weight of country may be increased
Student A (Eng., Turkey), student B (Applied Sc., Turkey)
m = 1: A and B match in one variable
d(A, B) = (2 – 1)/2 = 1/2
30
Example cont.


A different binary variable for each faculty
 Eng: 1 if the student is in engineering, 0 otherwise
 AppSc: 1 if the student is in Applied Sc., 0 otherwise
A different binary variable for each country
 Turkey: 1 if the student is Turkish, 0 otherwise
 USA: 1 if the student is from the USA, 0 otherwise
31
Ordinal Variables

An ordinal variable can be discrete or continuous

Order is important, e.g., rank

Can be treated like interval-scaled
 replace x_if by its rank r_if ∈ {1, …, M_f}
 map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

  z_if = (r_if – 1) / (M_f – 1)

 compute the dissimilarity using methods for interval-scaled variables
32
Example




Credit card type: gold > silver > bronze > normal, 4 states
Education: grad > undergrad > high school > primary school > no school, 5 states
Two customers
 A (gold, high school)
 B (normal, no school)
 r_A,card = 1, r_B,card = 4
 r_A,edu = 3, r_B,edu = 5
 z_A,card = (1 – 1)/(4 – 1) = 0
 z_B,card = (4 – 1)/(4 – 1) = 1
 z_A,edu = (3 – 1)/(5 – 1) = 0.5
 z_B,edu = (5 – 1)/(5 – 1) = 1
Use any interval-scale distance measure on the z values
33
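A tiny Python sketch of the rank-to-[0, 1] mapping for this example; the Euclidean distance at the end is just one possible choice.

def ordinal_to_unit(rank, n_states):
    """z = (r - 1) / (M - 1), as on the previous slide."""
    return (rank - 1) / (n_states - 1)

# customer A: (gold, high school), customer B: (normal, no school)
zA = [ordinal_to_unit(1, 4), ordinal_to_unit(3, 5)]   # [0.0, 0.5]
zB = [ordinal_to_unit(4, 4), ordinal_to_unit(5, 5)]   # [1.0, 1.0]

# any interval-scale distance can now be used, e.g. Euclidean:
d_AB = sum((a - b) ** 2 for a, b in zip(zA, zB)) ** 0.5
print(round(d_AB, 3))   # ~1.118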
Exercise

Find an attribute having both ordinal and nominal characteristics
Define a similarity or dissimilarity measure for two objects A and B
34
Ratio-Scaled Variables


Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(–Bt)
Methods:
 treat them like interval-scaled variables — not a good choice! (why? — the scale can be distorted)
 apply a logarithmic transformation: y_if = log(x_if)
 treat them as continuous ordinal data and treat their ranks as interval-scaled
35
Example






Cluster individuals based on age, weight and height
 All are ratio-scale variables
Mean transformation: z_if = x_if / mean_f
 As the absolute zero makes sense, measure distance in units of the mean of each variable
Then you may apply z' = log z
Then use any distance measure for interval scales
36
Example cont.





A weight difference of 0.5 kg is much more important for babies than for adults
d(3 kg, 3.5 kg) = 0.5 and d(71.0 kg, 71.5 kg) = 0.5 — the same interval distance
After the transformation:
 d'(3 kg, 3.5 kg) = (3.5 – 3)/3 — a large percentage difference, approximately log(3.5) – log(3), very significant
 d'(71.0 kg, 71.5 kg) = (71.5 – 71.0)/71.0 — not important, log(71.5) – log(71) is almost zero
37
Examples from Sports











Weight classes (kg) in two sports — an example of ratio-scaled measurements that grow roughly exponentially:

  Boxing   Wrestling
  48       48
  51       52
  54       56
  57       62
  60       68
  63.5     74
  67       82
  71       90
  75       100
  81       130
38
Variables of Mixed Types


A database may contain all the six types of variables
 symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects:

  d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)

 f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled:
  compute ranks r_if and z_if = (r_if – 1) / (M_f – 1)
  and treat z_if as interval-scaled
39



δ_ij^(f) is an indicator variable:
 δ_ij^(f) = 0 if
  f is a binary asymmetric variable and x_if = x_jf = 0
 δ_ij^(f) = 1 otherwise
40
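The weighted mixed-type formula can be sketched in Python as below; the three-variable example objects are hypothetical, and only the variable kinds used here are handled.

def mixed_dissim(x, y, kinds, ranges=None):
    """Weighted mixed-type dissimilarity: average d_ij^(f) over variables
    with delta_ij^(f) = 1 (a sketch, not a general-purpose implementation)."""
    num, den = 0.0, 0.0
    for f, (xf, yf) in enumerate(zip(x, y)):
        kind = kinds[f]
        if kind == "asym_binary" and xf == 0 and yf == 0:
            continue                               # delta = 0: skip this variable
        if kind in ("asym_binary", "nominal"):
            d = 0.0 if xf == yf else 1.0
        elif kind == "interval":
            d = abs(xf - yf) / ranges[f]           # normalized distance
        num += d
        den += 1.0
    return num / den if den else 0.0

x = [1, "red", 30.0]
y = [0, "red", 45.0]
print(mixed_dissim(x, y, ["asym_binary", "nominal", "interval"], ranges={2: 50.0}))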
Exercise



Construct an example data set containing all types of variables
Define the variables in that data set
Compute the distance between two sample objects
41
Chapter 5. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
42
Basic Measures for Clustering

Clustering: Given a database D = {t1, t2, .., tn}, a distance
measure dis(ti, tj) defined between any two objects ti and
tj, and an integer value k, the clustering problem is to
define a mapping f: D  {1, …, k} where each ti is
assigned to one cluster Kf, 1 ≤ f ≤ k, such that tfp,tfq ∈ Kf
and ts ∉ Kf, dis(tfp,tfq) ≤ dis(tfp,ts)

Centroid, radius, diameter

Typical alternatives to calculate the distance between
clusters

Single link, complete link, average, centroid, medoid
43
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)


Centroid: the “middle” of a cluster:

  C_m = Σ_{i=1}^{N} t_ip / N

Radius: square root of the average distance from any point of the cluster to its centroid:

  R_m = sqrt( Σ_{i=1}^{N} (t_ip – C_m)^2 / N )

Diameter: square root of the average mean squared distance between all pairs of points in the cluster:

  D_m = sqrt( Σ_{i=1}^{N} Σ_{q=1}^{N} (t_ip – t_iq)^2 / (N (N – 1)) )
44
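A short numpy sketch of the centroid, radius and diameter formulas; the four sample points are arbitrary.

import numpy as np

def centroid_radius_diameter(points):
    """Centroid, radius and diameter of one cluster of numerical points."""
    X = np.asarray(points, dtype=float)
    N = len(X)
    c = X.mean(axis=0)                                        # centroid
    radius = np.sqrt(((X - c) ** 2).sum(axis=1).mean())       # avg squared dist to centroid
    diffs = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    diameter = np.sqrt(diffs.sum() / (N * (N - 1)))           # avg over all ordered pairs
    return c, radius, diameter

pts = [[1, 1], [2, 1], [1, 2], [2, 2]]
print(centroid_radius_diameter(pts))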
Typical Alternatives to Calculate the
Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min dis(tip, tjq)

Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max dis(tip, tjq)

Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg dis(tip, tjq)

Centroid: distance between the centroids of two clusters, i.e.,
dis(Ki, Kj) = dis(Ci, Cj)

Medoid: distance between the medoids of two clusters, i.e., dis(Ki,
Kj) = dis(Mi, Mj)

Medoid: one chosen, centrally located object in the cluster
45
Major Clustering Approaches

Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion

Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
46
Chapter 5. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
47
Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized:

  E = Σ_{m=1}^{k} Σ_{t_mi ∈ K_m} (C_m – t_mi)^2
Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms

k-means (MacQueen’67): Each cluster is represented by the center
of the cluster

k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
48
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four
steps:




Partition objects into k nonempty subsets
Compute seed points as the centroids of the
clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
Assign each object to the cluster with the nearest
seed point
Go back to Step 2, stop when no more new
assignment
49
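The four steps can be turned into a bare-bones k-means sketch; this is an illustrative implementation under the assumption that a random initial partition is acceptable, not production code.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))              # step 1: arbitrary partition
    for _ in range(n_iter):
        # step 2: centroids of the current partition (re-seed empty clusters)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else X[rng.integers(len(X))] for j in range(k)])
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)                      # step 3: nearest seed point
        if np.array_equal(new_labels, labels):             # step 4: stop when stable
            break
        labels = new_labels
    return labels, centers

X = [[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9], [8.5, 9.5]]
print(kmeans(X, k=2))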
The K-Means Clustering Method

Example (K = 2):
[Figure: a sequence of scatter plots of the data points on a 10×10 grid]
 Arbitrarily choose K objects as initial cluster centers
 Assign each object to the most similar center
 Update the cluster means
 Reassign the objects and update the cluster means again
 Repeat until no object changes cluster
50
Comments on the K-Means Method

Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.


Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
Comment: Often terminates at a local optimum. The global optimum
may be found using techniques such as: deterministic annealing and
genetic algorithms

Weakness

Applicable only when mean is defined, then what about categorical
data?

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers

Not suitable to discover clusters with non-convex shapes
51
Variations of the K-Means Method


A few variants of the k-means which differ in

Selection of the initial k means

Dissimilarity calculations

Strategies to calculate cluster means
Handling categorical data: k-modes (Huang’98)

Replacing means of clusters with modes

Using new dissimilarity measures to deal with categorical objects

Using a frequency-based method to update modes of clusters

A mixture of categorical and numerical data: k-prototype method
52
What is the problem of k-Means Method?

The k-means algorithm is sensitive to outliers !

Since an object with an extremely large value may substantially
distort the distribution of the data.

K-Medoids: Instead of taking the mean value of the object in a
cluster as a reference point, medoids can be used, which is the most
centrally located object in a cluster.
[Figure: two scatter plots of the same data on a 10×10 grid — on the left, the cluster representative chosen as the mean is pulled toward an extreme value; on the right, a medoid (a centrally located object) is used instead]
53
Exercise



Show by designing simple examples that:
 The k-means algorithm may converge to different local optima when started from different initial assignments of objects to clusters
 It is suitable only when the clusters are spherical and of equal size
54
The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids, 1987)

starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering

PAM works effectively for small data sets, but does not scale well
for large data sets

CLARA (Kaufmann & Rousseeuw, 1990)

CLARANS (Ng & Han, 1994): Randomized sampling

Focusing + spatial data structure (Ester et al., 1995)
55
Typical k-medoids algorithm (PAM)

[Figure: PAM on a set of points on a 10×10 grid, K = 2]
 Arbitrarily choose k objects as initial medoids (total cost = 20)
 Assign each remaining object to the nearest medoid
 Randomly select a non-medoid object O_random (the swapped configuration considered here has total cost = 26)
 Do loop, until no change:
  Compute the total cost of swapping a medoid O with O_random
  Swap O and O_random if the quality is improved
56
PAM (Partitioning Around Medoids)
(1987)

PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
Uses real objects to represent the clusters:
 1. Select k representative objects arbitrarily
 2. For each pair of a non-selected object random and a selected object j, calculate the total swapping cost TC_j,random
 3. For each pair of j and random:
  If TC_j,random < 0, j is replaced by random
  (choose the pair with the minimum TC_j,random)
  Then assign each non-selected object to the most similar representative object
 4. Repeat steps 2–3 until there is no change
57
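The swap evaluation at the heart of PAM can be sketched in a few lines: the cost of a configuration is the sum of each object's distance to its nearest medoid, and TC_{j,random} is the change in that cost. The distance matrix below is the five-object example (A=0 … E=4) used on the following slides.

def total_cost(dist, medoids):
    return sum(min(dist[p][m] for m in medoids) for p in range(len(dist)))

def swap_cost(dist, medoids, j, random):
    new_medoids = [random if m == j else m for m in medoids]
    return total_cost(dist, new_medoids) - total_cost(dist, medoids)

# Distance matrix of the five objects A..E (A=0, B=1, C=2, D=3, E=4)
D = [[0, 1, 2, 2, 3],
     [1, 0, 2, 4, 3],
     [2, 2, 0, 1, 5],
     [2, 4, 1, 0, 3],
     [3, 3, 5, 3, 0]]

print(swap_cost(D, [0, 1], j=0, random=2))   # TC_{A,C} = -2, as computed on the later slides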
four possibilities (1 and 2)



p any object, j current medoid random candidate
modoid
(1): p belongs to j but
 p close to iij i any other medoid
 so p is reassigned to i
 d(p,i)<d(p, random) TCp,j.random d(p,i)-d(p,j)= <
0
(2): p belongs to j but
 p close to random
 so p is reassigned to random
 d(p,i)>d(p, random) TCp,j.random=d(p,random)d(p,j)<0
58
Case 1
[Diagram: objects p, i, j, random before and after the swap]
random is a candidate to replace medoid j, but object p is closer to another medoid i: d(p,i) < d(p,random)
So p is separated from j and assigned to i, since p is closer to i than to random
59
Case 2
[Diagram: objects p, i, j, random before and after the swap]
random is a candidate to replace medoid j, and object p is closer to random than to any other medoid i: d(p,random) < d(p,i)
So p is separated from j and assigned to random
60
four possibilities (3 and 4)


(3): p belongs to some other medoid i, i ≠ j, and
 p remains closest to i
 so p still belongs to i
 d(p,i) < d(p,random), and TC_p,j,random = 0
(4): p belongs to some other medoid i, i ≠ j, but
 p is closer to random
 so p is reassigned to random
 d(p,i) > d(p,random), and TC_p,j,random = d(p,random) – d(p,i) < 0
61
Case 3
[Diagram: objects p, i, j, random]
random is a candidate to replace medoid j, but object p is closer to another medoid i: d(p,random) > d(p,i)
So p stays with its previous medoid i
62
Case 4
[Diagram: objects p, i, j, random]
random is a candidate to replace medoid j, and object p is closer to random than to its current medoid i: d(p,random) < d(p,i)
So p is reassigned to random
63
Example









Distance matrix of the five objects:

     A  B  C  D  E
  A  0
  B  1  0
  C  2  2  0
  D  2  4  1  0
  E  3  3  5  3  0

A and B are the initial medoids
Initial clusters: C1 = {A, C, D}, C2 = {B, E}
 Note: the ties are broken by randomly assigning C to C1 and E to C2
The three non-medoids C, D, E are examined to see whether they should replace A or B
64
Example cont.





Six costs to determine: TC_AC, TC_AD, TC_AE, TC_BC, TC_BD, TC_BE
TC_AC = TC_A,AC + TC_B,AC + TC_C,AC + TC_D,AC + TC_E,AC
TC_A,AC: A was a medoid; it is now assigned to either B or C; d(A,B) = 1, d(A,C) = 2, so assign to B
 So TC_A,AC = 1 – 0 = 1
TC_B,AC: B is still a medoid
 So TC_B,AC = 0
TC_C,AC: C is the new medoid; it is assigned nowhere, and it is separated from A, releasing 2
 So TC_C,AC = 0 – 2 = –2
65
Example cont.

TC_D,AC: D was a non-medoid assigned to A
 First it is released from A: –2
 It will be assigned to either B or C
  d(D,B) = 4, d(D,C) = 1
  So assign D to C with a cost of 1
 So TC_D,AC = 1 – 2 = –1
TC_E,AC: E was a non-medoid assigned to B
 It will be assigned to either B or C
  d(E,B) = 3, d(E,C) = 5
  So E stays with B — no change
 So TC_E,AC = 3 – 3 = 0
66
Example cont.



So
TC_AC = TC_A,AC + TC_B,AC + TC_C,AC + TC_D,AC + TC_E,AC = 1 + 0 + (–2) + (–1) + 0 = –2

67
Replacing A with C
[Diagram: assignments before and after the swap]
Before: A and B are medoids; C and D are assigned to A (distances 2 and 2), E to B (distance 3); total cost TC = 7
C is a candidate medoid to replace A
After replacing A with C: B and C are medoids; A is assigned to B (1), D to C (1), E to B (3); TC = 5
TC_AC = 5 – 7 = –2
68
Replacing A with D
[Diagram: assignments before and after the swap]
Before: A and B are medoids; C and D are assigned to A, E to B; TC = 7
D is a candidate medoid to replace A
After replacing A with D: B and D are medoids; A is assigned to B (1), C to D (1), E to B (3); TC = 5
TC_AD = 5 – 7 = –2
69
Replacing A with E
[Diagram: assignments before and after the swap]
Before: A and B are medoids; C and D are assigned to A, E to B; TC = 7
E is a candidate medoid to replace A
After replacing A with E: B and E are medoids; A is assigned to B (1), C to B (2), D to E (3); TC = 6
TC_AE = 6 – 7 = –1
70
Replacing B with C
[Diagram: assignments before and after the swap]
Before: A and B are medoids; C and D are assigned to A, E to B; TC = 7
C is a candidate medoid to replace B
After replacing B with C: A and C are medoids; B is assigned to A (1), D to C (1), E to A (3); TC = 5
TC_BC = 5 – 7 = –2
71
Replacing B with D
[Diagram: assignments before and after the swap]
Before: A and B are medoids; C and D are assigned to A, E to B; TC = 7
D is a candidate medoid to replace B
After replacing B with D: A and D are medoids; B is assigned to A (1), C to D (1), E to A (3); TC = 5
TC_BD = 5 – 7 = –2
72
Replacing B with E
[Diagram: assignments before and after the swap]
Before: A and B are medoids; C and D are assigned to A, E to B; TC = 7
E is a candidate medoid to replace B
After replacing B with E: A and E are medoids; B is assigned to A (1), C to A (2), D to A (2); TC = 5
TC_BE = 5 – 7 = –2
73









TC_AC = –2
TC_AD = –2
TC_AE = –1
TC_BC = –2
TC_BD = –2
TC_BE = –2
Five of the six possibilities give a reduction of 2
Arbitrarily choose one, say AC:
 Replace A with C, so
 B and C are the new medoids, and
 C1 = {B, A, E}, C2 = {C, D}
Continue PAM until there is no reduction in cost
74
What is the problem with PAM?


PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
PAM works efficiently for small data sets but does not
scale well for large data sets.

O(k(n-k)^2) for each iteration, where n is the number of data objects and k is the number of clusters
Sampling based method,
CLARA(Clustering LARge Applications)
75
CLARA (Clustering Large Applications) (1990)

CLARA (Kaufmann and Rousseeuw in 1990)


Built in statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output

Strength: deals with larger data sets than PAM

Weakness:


Efficiency depends on the sample size
A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
76
CLARANS (“Randomized” CLARA) (1994)

CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)





CLARANS draws sample of neighbors dynamically
The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
If the local optimum is found, CLARANS starts with new
randomly selected node in search for a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)
77
Chapter 5. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
78
Hierarchical Clustering

Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
[Diagram: objects a, b, c, d, e merged step by step]
 Agglomerative (AGNES), steps 0 → 4: a and b merge into ab; d and e merge into de; c joins de to form cde; ab and cde merge into abcde
 Divisive (DIANA) follows the same hierarchy in reverse, steps 4 → 0
79
AGNES (Agglomerative Nesting)

Introduced in Kaufmann and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., Splus

Use the Single-Link method and the dissimilarity matrix.

Merge nodes that have the least dissimilarity

Go on in a non-descending fashion

Eventually all nodes belong to the same cluster
[Figure: three scatter plots on a 10×10 grid showing the clusters being merged step by step]
80
A Dendrogram Shows How the
Clusters are Merged Hierarchically
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
81
Example






Distance matrix of the five objects A–E (the same one used in the k-medoids example):

     A  B  C  D  E
  A  0
  B  1  0
  C  2  2  0
  D  2  4  1  0
  E  3  3  5  3  0
82
Single Link Distance Measure
[Figure: single-link merging of the five objects, step by step]
 {A} {B} {C} {D} {E} → {A,B} {C,D} {E} → {A,B,C,D} {E} → {A,B,C,D,E}
 A and B merge first (distance 1), then C and D (distance 1); {A,B} and {C,D} merge at single-link distance 2; E joins last at single-link distance 3
 [Dendrogram over A, B, C, D, E]
83
Complete Link Distance Measure
[Figure: complete-link merging of the five objects, step by step]
 {A} {B} {C} {D} {E} → {A,B} {C,D} {E} → {A,B,E} {C,D} → {A,B,C,D,E}
 A and B merge first (distance 1), then C and D (distance 1); E joins {A,B} at complete-link distance 3; {A,B,E} and {C,D} merge last at complete-link distance 5
 [Dendrogram over A, B, E, C, D]
84
Average Link distance measure
[Figure: average-link merging of the five objects, step by step]
 {A} {B} {C} {D} {E} → {A,B} {C,D} {E} → {A,B,C,D} {E} → {A,B,C,D,E}
 A and B merge first (distance 1), then C and D (distance 1); the average-link distance between {A,B} and {C,D} is 2.5, so they merge next; the average-link distance between {A,B,C,D} and {E} is 3.5, so E joins last
 [Dendrogram over A, B, C, D, E]
85
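For comparison, the three hierarchies above can be reproduced with scipy's hierarchical clustering; the condensed distance vector lists the pairwise distances of A–E in the order shown in the comment.

# condensed order: d(A,B), d(A,C), d(A,D), d(A,E), d(B,C), d(B,D), d(B,E), d(C,D), d(C,E), d(D,E)
from scipy.cluster.hierarchy import linkage, dendrogram

condensed = [1, 2, 2, 3, 2, 4, 3, 1, 5, 3]
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    print(method)
    print(Z)          # each row: the two clusters merged, the merge distance, cluster size
# dendrogram(Z) can be used to draw the trees shown on the slides.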
DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., Splus

Inverse order of AGNES

Eventually each node forms a cluster on its own
[Figure: three scatter plots on a 10×10 grid showing one cluster being split step by step]
86
More on Hierarchical Clustering Methods


Major weakness of agglomerative clustering methods
 do not scale well: time complexity of at least O(n^2), where n is the number of total objects
 can never undo what was done previously
Integration of hierarchical with distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
 CURE (1998): selects well-scattered points from the
cluster and then shrinks them towards the center of the
cluster by a specified fraction
 CHAMELEON (1999): hierarchical clustering using
dynamic modeling
87
BIRCH (1996)


Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering


Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree

Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
Weakness: handles only numeric data, and is sensitive to the order of the data records
88
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
 N: number of data points
 LS: Σ_{i=1}^{N} X_i (linear sum)
 SS: Σ_{i=1}^{N} X_i^2 (square sum)

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give
 CF = (5, (16,30), (54,190))
[Figure: the five points plotted on a 10×10 grid]
89
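The CF vector for these five points, and the centroid and radius derived from it, can be computed directly; a small numpy sketch of the definition.

import numpy as np

def clustering_feature(points):
    """CF = (N, LS, SS) for a set of points."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
N, LS, SS = clustering_feature(pts)
print(N, LS, SS)          # 5 [16. 30.] [ 54. 190.]

# centroid and radius can be derived from the CF alone:
centroid = LS / N
radius = np.sqrt((SS / N - centroid ** 2).sum())
print(centroid, radius)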
CF-Tree in BIRCH

Clustering feature:


summary of the statistics for a given subcluster: the 0-th, 1st and
2nd moments of the subcluster from the statistical point of view.
registers crucial measurements for computing cluster and utilizes
storage efficiently
A CF tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering


A nonleaf node in a tree has descendants or “children”

The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters

Branching factor: specify the maximum number of children.

threshold: max diameter of sub-clusters stored at the leaf nodes
90
CF Tree

[Diagram: a CF tree with branching factor B = 7 and leaf capacity L = 6]
 Root node: entries CF1, CF2, CF3, …, CF6, each with a pointer child1, child2, child3, …, child6
 Non-leaf node: entries CF1, CF2, CF3, …, CF5 with pointers child1, child2, child3, …, child5
 Leaf nodes: doubly linked chains of entries, e.g. (prev, CF1, CF2, …, CF6, next) and (prev, CF1, CF2, …, CF4, next)
91
Contributions of BIRCH



BIRCH is local in that each clustering decision
is made without scanning all data points.
BIRCH exploits the observation that the data
space is usually not uniformly occupied, and
hence not every data point is equally important
for clustering purposes.
BIRCH makes full use of available memory to
derive the finest possible subclusters ( to
ensure accuracy) while minimizing I/O costs (
to ensure efficiency).
92
Background knowledge


Given N d-dimensional data points in a cluster {x_i}, where i = 1, 2, …, N, the centroid, radius and diameter are defined as:

  x0 = Σ_{i=1}^{N} x_i / N

  R = ( Σ_{i=1}^{N} (x_i – x0)^2 / N )^(1/2)

  D = ( Σ_{i=1}^{N} Σ_{j=1}^{N} (x_i – x_j)^2 / (N (N – 1)) )^(1/2)

We treat x0, R and D as properties of a single cluster.
93
Background knowledge (Cont.)

Next, between two clusters, we define 5 alternative distances for measuring their closeness.
Given the centroids of two clusters, x0_1 and x0_2, the two alternative distances D0 and D1 are defined as:

  D0 = ( (x0_1 – x0_2)^2 )^(1/2)

  D1 = Σ_{i=1}^{d} | x0_1^(i) – x0_2^(i) |
94
Background knowledge
(Cont.)

Given N1 d-dimensional data points in a cluster {x_i}, i = 1, 2, …, N1, and N2 data points in another cluster {x_j}, j = N1+1, N1+2, …, N1+N2, the average inter-cluster distance D2, the average intra-cluster distance D3 and the variance increase distance D4 of the two clusters are defined as:

  D2 = ( Σ_{i=1}^{N1} Σ_{j=N1+1}^{N1+N2} (x_i – x_j)^2 / (N1 N2) )^(1/2)

  D3 = ( Σ_{i=1}^{N1+N2} Σ_{j=1}^{N1+N2} (x_i – x_j)^2 / ((N1+N2)(N1+N2–1)) )^(1/2)

  D4 = Σ_{k=1}^{N1+N2} ( x_k – Σ_{l=1}^{N1+N2} x_l / (N1+N2) )^2
       – Σ_{i=1}^{N1} ( x_i – Σ_{l=1}^{N1} x_l / N1 )^2
       – Σ_{j=N1+1}^{N1+N2} ( x_j – Σ_{l=N1+1}^{N1+N2} x_l / N2 )^2

Actually, D3 is the diameter D of the merged cluster.
D0, D1, D2, D3 and D4 are alternative measurements of the closeness of two clusters; any of them can be used to decide whether two clusters are close.
95
Clustering Feature


A Clustering Feature is a triple summarizing the information that we maintain about a cluster.
Definition: Given N d-dimensional data points in a cluster {x_i}, i = 1, 2, …, N, the Clustering Feature (CF) vector of the cluster is defined as the triple

  CF = (N, LS, SS)

where N is the number of data points in the cluster, LS is the linear sum of the N data points, Σ_{i=1}^{N} x_i, and SS is the square sum of the N data points, Σ_{i=1}^{N} x_i^2.
96
Clustering Feature

(Cont.)
CF Additive Theorem:
Assume that CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF vectors of two disjoint clusters. Then the CF vector of the cluster formed by merging the two disjoint clusters is:

  CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

From the CF definition and the additive theorem, the corresponding x0, R, D, D0, D1, D2, D3 and D4 can all be calculated easily.
97
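A short sketch of the additive theorem: merging two sub-clusters of the five points from the earlier CF example just adds the two CF triples, and the centroid distance D0 follows from the CFs alone.

import numpy as np

def merge_cf(cf1, cf2):
    """CF additive theorem; cf = (N, LS, SS) with LS, SS as numpy arrays."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

cf_a = (3, np.array([9.0, 15.0]), np.array([29.0, 77.0]))    # points (3,4), (2,6), (4,5)
cf_b = (2, np.array([7.0, 15.0]), np.array([25.0, 113.0]))   # points (4,7), (3,8)
N, LS, SS = merge_cf(cf_a, cf_b)
print(N, LS, SS)                                # (5, [16. 30.], [54. 190.]) as before

# D0 between the two sub-clusters follows directly from their CFs:
c_a, c_b = cf_a[1] / cf_a[0], cf_b[1] / cf_b[0]
print(np.sqrt(((c_a - c_b) ** 2).sum()))        # centroid (Euclidean) distance D0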
CF Tree

A CF tree is a height-balanced tree with two parameters:
 branching factor B. Each nonleaf node contains at
most B entries of the form[cfi,childi], where i =
1,2,…,B, childi is a pointer to its i-th child node, and CFi
is the CF of the sub-cluster represented by this child. A
leaf node contains at most L entries, each of the form
[CFi], where i = 1,2,…L.
 Initial threshold T. Which is an initial threshold for the
radius ( or diameter ) of any cluster. The value of T is
an upper bound on the radius of any cluster and
controls the number of clusters that the algorithm
discovers.
98
CF Tree


(Cont.)
CF Tree will be built dynamically as new data
objects are inserted. It is used to guide a new
insertion into the correct subcluster for clustering
purpose. Actually, all clusters are formed as each
data point is inserted into the CF Tree.
CF Tree is a very compact representation of the
dataset because each entry in a leaf node is not a
single data point but a subcluster.
99
Insertion into a CF Tree
We now present the algorithm for inserting an entry into a CF tree. Given an entry “Ent”, it proceeds as below:
1. Identifying the appropriate leaf: Starting from the root, according to a chosen distance metric D0 to D4 as defined before, it recursively descends the CF tree by choosing the closest child node.
2. Modifying the leaf: When it reaches a leaf node, it finds the closest leaf entry and tests whether that entry can absorb “Ent” without violating the threshold condition.
 If so, the CF vector for the entry is updated to reflect this.
 If not, a new entry for “Ent” is added to the leaf.
  If there is space on the leaf for this new entry, we are done.
  Otherwise we must split the leaf node. Node splitting is done by choosing the farthest pair of entries as seeds and redistributing the remaining entries based on the closest criterion.
100
Insertion into a CF Tree
(Cont.)
3. Modifying the path to the leaf: After inserting “Ent” into a leaf, we
must update the CF information for each nonleaf entry on the path
to the leaf.


In the absence of a split, this simply involves
adding CF vectors to reflect the additions of “Ent”.
A leaf split requires us to insert a new nonleaf
entry into the parent node, to describe the newly
created leaf.


If the parent has space for this entry, at all higher levels,
we only need to update the CF vectors to reflect the
addition of “Ent”.
Otherwise, we may have to split the parent as well, and
so on up to the root.
101
Insertion into a CF Tree
(Cont.)
4. Merging Refinement: In the presence of skewed data input order, splits can affect the clustering quality and also reduce space utilization. A simple additional merging step often helps ameliorate these problems. Suppose the propagation of one split stops at some nonleaf node Nj, i.e., Nj can accommodate the additional entry resulting from the split:
 1. Scan Nj to find the two closest entries.
 2. If they are not the pair corresponding to the split, merge them.
 3. If there are more entries than one page can hold, split again.
 4. During the resplitting, one of the seeds attracts enough merged entries; the other receives the rest of the entries.
In summary, this improves the distribution of entries in the two closest children, and may even free a node space for later use.
102
Insertion into a CF Tree

(Cont.)
Notes: Some problems which will be resolved in
the later phases of the entire algorithm.

Depending upon the order of data input and
the degree of skew, two subclusters that
should not be in one cluster are kept in the
same node. It will be remedied with a global
algorithm that arranges leaf entries across
nodes.( Phase 3)

If the same data point is inserted twice, but
at two different times, the two copies might
be entered into distinct leaf entries. It will be
addressed with further refinement passes
over the data.(Phase 4)
103
Figure: Effect of Split, Merge and Resplit
[Diagram: a node with entries CF1–CF4 is split into CF1–CF8 across two nodes; after adding CF9, the two closest entries are merged and then resplit, improving the distribution of entries]
104
BIRCH Clustering Algorithm - Overview
Data
Phase1: Load into memory by
building a CF tree
Initial CF tree
Phase2(Optional): Condense into
desirable range by building a
smaller CF tree
Smaller CF tree
Phase3: Global clustering
Good cluster
Phase4(optional and off line):
Clustering refinement
Better Cluster
105
BIRCH Clustering Algorithm – Phase 2,3 & 4

Phase 2: Condense into a desirable range by building a smaller CF tree
 To meet the input size range required by Phase 3
Phase 3: Global clustering
 To solve problem 1 (subclusters that should not be together kept in the same node)
 Approach: re-cluster all subclusters by using existing global or semi-global clustering algorithms
Phase 4: Clustering refinement
 To solve problem 2 (copies of the same point in distinct leaf entries)
 Approach: use the centroids of the clusters produced by Phase 3 as seeds, and redistribute the data points to their closest seed to obtain a set of new clusters; this can use an existing algorithm.
106
BIRCH Clustering Algorithm – Control flow of Phase1
[Flow chart of Phase 1]
 Start a CF tree t1 with the initial threshold T
 Continue scanning the data and inserting into t1
 If memory runs out:
  (1) Increase T
  (2) Rebuild a CF tree t2 with the new T from t1: if a leaf entry of t1 is a potential outlier and disk space is available, write it to disk; otherwise use it to rebuild t2
  (3) t1 ← t2
  If out of disk space, re-absorb potential outliers into t1, then continue scanning
 When the data scan finishes, re-absorb potential outliers into t1
107
CURE (Clustering Using REpresentatives )

CURE: proposed by Guha, Rastogi & Shim, 1998


Stops the creation of a cluster hierarchy if a level
consists of k clusters
Uses multiple representative points to evaluate the
distance between clusters, adjusts well to arbitrary
shaped clusters and avoids single-link effect
108
Drawbacks of Distance-Based Method

Drawbacks of square-error based clustering method


Consider only one point as representative of a cluster
Good only for convex shaped, similar size and density,
and if k can be reasonably estimated
109
Cure: The Algorithm

Draw random sample s.

Partition sample to p partitions with size s/p

Partially cluster partitions into s/pq clusters

Eliminate outliers

By random sampling

If a cluster grows too slow, eliminate it.

Cluster partial clusters.

Label data in disk
110
Data Partitioning and Clustering



s = 50, p = 2, s/p = 25, s/pq = 5
[Figure: the sample is split into two partitions; each partition is partially clustered into 5 clusters, which are then clustered together]
111
Cure: Shrinking Representative Points
[Figure: representative points of a cluster before and after shrinking]
Shrink the multiple representative points towards the gravity center by a fraction α.
Multiple representatives capture the shape of the cluster.
112
Clustering Categorical Data: The ROCK Algorithm




ROCK: RObust Clustering using linKs
 S. Guha, R. Rastogi & K. Shim, ICDE’99
Major ideas
 Use links to measure similarity/proximity
 Not distance-based
 Computational complexity: O(n^2 + n m_m m_a + n^2 log n)
Algorithm: sampling-based clustering
 Draw a random sample
 Cluster with links
 Label data on disk
Experiments
 Congressional voting, mushroom data
113
Similarity Measure in ROCK




Traditional measures for categorical data may not work well, e.g.,
Jaccard coefficient
Example: Two groups (clusters) of transactions

C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c,
e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}

C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Jaccard co-efficient may lead to wrong clustering result

C1: 0.2 ({a, b, c}, {b, d, e}} to 0.5 ({a, b, c}, {a, b, d})

C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
New similarity function: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|

Ex. Let T1 = {a, b, c}, T2 = {c, d, e}:
 Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
114
Link Measure in ROCK

Links: # of common neighbors, given a threshold value θ on the similarity



C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c,
e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}

link(T1, T2) = 4, since they have 4 common neighbors


link(T1, T3) = 3, since they have 3 common neighbors


{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
{a, b, d}, {a, b, e}, {a, b, g}
Thus link is a better measure than Jaccard coefficient
115
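A small Python sketch of the neighbor and link computation for the C1 transactions above; the threshold θ = 0.5 is an assumption chosen so that sharing two of three items counts as a neighbor.

from itertools import combinations

def sim(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def links(transactions, theta):
    neighbors = {i: {j for j in range(len(transactions))
                     if i != j and sim(transactions[i], transactions[j]) >= theta}
                 for i in range(len(transactions))}
    return {(i, j): len(neighbors[i] & neighbors[j])
            for i, j in combinations(range(len(transactions)), 2)}

C1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
      {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
print(links(C1, theta=0.5)[(0, 9)])   # link({a,b,c}, {c,d,e}) = 4 common neighbors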
Goodness Measure to Join Clusters







Goodness measure for merging two clusters Ci and Cj:

  g(Ci, Cj) = links(Ci, Cj) / [ (n_i + n_j)^(1+2f(θ)) – n_i^(1+2f(θ)) – n_j^(1+2f(θ)) ]

 links(Ci, Cj): number of links between the two clusters
 n_i, n_j: number of objects in each cluster
 n_i^(1+2f(θ)): expected number of links within cluster i
 f(θ): a function of the shape of the clusters
116
CHAMELEON (Hierarchical clustering using
dynamic modeling)

CHAMELEON: by G. Karypis, E.H. Han, and V. Kumar’99

Measures the similarity based on a dynamic model



Two clusters are merged only if the interconnectivity and closeness
(proximity) between two clusters are high relative to the internal
interconnectivity of the clusters and closeness of items within the
clusters
Cure ignores information about interconnectivity of the objects,
Rock ignores information about the closeness of two clusters
A two-phase algorithm
1. Use a graph partitioning algorithm: cluster objects into a large
number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the
genuine clusters by repeatedly combining these sub-clusters
117
Overall Framework of CHAMELEON
Construct
Partition the Graph
Sparse Graph
Data Set
Merge Partition
Final Clusters
118
Chapter 5. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
119
Density-Based Clustering Methods



Clustering based on density (local cluster criterion), such
as density-connected points
Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al (SIGMOD’99).
 DENCLUE: Hinneburg & D. Keim (KDD’98)
 CLIQUE: Agrawal, et al. (SIGMOD’98)
120
Density Concepts

Core object (CO): an object with at least ‘M’ objects within its ε-neighborhood (a neighborhood of radius ‘E’)
Directly density-reachable (DDR): y is DDR from x if x is a CO and y is in x’s ε-neighborhood
Density-reachable: there exists a chain of DDR objects from x to y
Density-based cluster: a maximal set of density-connected objects with respect to reachability
121
Density-Based Clustering: Background



Two parameters:
 Eps: maximum radius of the neighbourhood
 MinPts: minimum number of points in an Eps-neighbourhood of that point
N_Eps(p): {q belongs to D | dist(p,q) ≤ Eps}
Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts if
 1) p belongs to N_Eps(q)
 2) core point condition: |N_Eps(q)| ≥ MinPts
[Figure: p in the Eps-neighbourhood of core point q, with MinPts = 5, Eps = 1 cm]
122
Density-Based Clustering: Background (II)

Density-reachable:


p
A point p is density-reachable from
a point q wrt. Eps, MinPts if there
is a chain of points p1, …, pn, p1 =
q, pn = p such that pi+1 is directly
density-reachable from pi
p1
q
Density-connected

A point p is density-connected to a
point q wrt. Eps, MinPts if there is
a point o such that both, p and q
are density-reachable from o wrt.
Eps and MinPts.
p
q
o
123
Let MinPts = 3
[Figure: points P, M, Q and their ε-neighborhoods]
 M and P are core objects, since each is in an ε-neighborhood containing at least 3 points
 Q is directly density-reachable from M; M is directly density-reachable from P and vice versa
 Q is (indirectly) density-reachable from P, since Q is directly density-reachable from M and M is directly density-reachable from P; but P is not density-reachable from Q, since Q is not a core object
124
DBSCAN: Density Based Spatial Clustering of
Applications with Noise


Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases
with noise
Outlier
Border
Eps = 1cm
Core
MinPts = 5
125
DBSCAN: The Algorithm





Arbitrarily select a point p
Retrieve all points density-reachable from p wrt Eps
and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the
database.
Continue the process until all of the points have been
processed.
126
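As a hedged illustration, DBSCAN can be run on toy 2-D data with scikit-learn, where eps and min_samples play the roles of Eps and MinPts above; the data below are made up.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3], [1.1, 0.8],
              [8, 8], [8.1, 8.2], [7.9, 8.1], [8.2, 7.9],
              [4.5, 4.5]])                       # the last point is isolated

labels = DBSCAN(eps=0.6, min_samples=3).fit_predict(X)
print(labels)     # two clusters (0 and 1); the isolated point is labeled -1 (noise)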
OPTICS: A Cluster-Ordering Method (1999)

OPTICS: Ordering Points To Identify the Clustering
Structure
 Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
 Produces a special order of the database wrt its
density-based clustering structure
 This cluster-ordering contains info equiv to the
density-based clusterings corresponding to a broad
range of parameter settings
 Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
 Can be represented graphically or using visualization
techniques
127
OPTICS: Some Extension from
DBSCAN

Index-based:
 k = number of dimensions, N = 20, p = 75%, M = N(1 – p) = 5
 Complexity: O(kN^2)
Core distance of an object o: the smallest ε′ that makes o a core object (undefined if o is not a core object w.r.t. ε and MinPts)
Reachability distance of p from o: max(core-distance(o), d(o, p))
[Figure: with MinPts = 5 and ε = 3 cm, r(p1, o) = 2.8 cm and r(p2, o) = 4 cm]
128
[Figure: reachability plot — reachability-distance (undefined for the first object, with ε and ε′ marked) plotted against the cluster order of the objects; valleys in the plot correspond to clusters]
129
Density-Based Cluster analysis:
OPTICS & Its Applications
130
DENCLUE: Using density functions

DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)

Major features

Solid mathematical foundation

Good for data sets with large amounts of noise



Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
Significant faster than existing algorithm (faster than
DBSCAN by a factor of up to 45)
But needs a large number of parameters
131
Denclue: Technical Essence





Uses grid cells but only keeps information about grid
cells that do actually contain data points and manages
these cells in a tree-based access structure.
Influence function: describes the impact of a data point
within its neighborhood.
Overall density of the data space can be calculated as
the sum of the influence function of all data points.
Clusters can be determined mathematically by
identifying density attractors.
Density attractors are local maximal of the overall
density function.
132
Gradient: The steepness of a slope

Example — Gaussian influence and density functions:

  f_Gaussian(x, y) = e^( – d(x,y)^2 / (2σ^2) )

  f_Gaussian^D(x) = Σ_{i=1}^{N} e^( – d(x, x_i)^2 / (2σ^2) )

  ∇f_Gaussian^D(x, x_i) = Σ_{i=1}^{N} (x_i – x) · e^( – d(x, x_i)^2 / (2σ^2) )
133
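A direct numpy sketch of the Gaussian influence/density functions above; σ and the sample points are arbitrary.

import numpy as np

def gaussian_density(x, data, sigma=1.0):
    """Overall density at x as the sum of Gaussian influence functions
    of all data points."""
    data = np.asarray(data, dtype=float)
    d2 = ((data - np.asarray(x, dtype=float)) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

data = [[1, 1], [1.2, 0.9], [0.8, 1.1], [5, 5]]
print(gaussian_density([1, 1], data))   # high density near the dense group
print(gaussian_density([3, 3], data))   # low density between the groups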
Density Attractor
134
Center-Defined and Arbitrary
135
Chapter 5. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
136
Grid-Based Clustering Method

Using multi-resolution grid data structure

Several interesting methods


STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)


A multi-resolution clustering approach using
wavelet method
CLIQUE: Agrawal, et al. (SIGMOD’98)
137
STING: A Statistical Information
Grid Approach



Wang, Yang and Muntz (VLDB’97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different
levels of resolution
138
STING: A Statistical Information
Grid Approach (2)






Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated from
parameters of lower level cell
 count, mean, s, min, max
 type of distribution—normal, uniform, etc.
Use a top-down approach to answer spatial data queries
Start from a pre-selected layer—typically with a small
number of cells
For each cell in the current level compute the confidence
interval
139
STING: A Statistical Information Grid Approach
(3)




Remove the irrelevant cells from further consideration
When finished examining the current layer, proceed to the next lower level
Repeat this process until the bottom layer is reached
Advantages:



Query-independent, easy to parallelize, incremental
update
O(K), where K is the number of grid cells at the lowest
level
Disadvantages:
 All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
140
WaveCluster (1998)


Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
A multi-resolution clustering approach which applies
wavelet transform to the feature space

A wavelet transform is a signal processing technique
that decomposes a signal into different frequency subband.

Both grid-based and density-based

Input parameters:


# of grid cells for each dimension
the wavelet, and the # of applications of wavelet
transform.
141
WaveCluster (1998)

How to apply wavelet transform to find clusters

Summarizes the data by imposing a multidimensional grid structure onto the data space

These multidimensional spatial data objects are
represented in a n-dimensional feature space

Apply wavelet transform on feature space to find the
dense regions in the feature space

Apply wavelet transform multiple times which result
in clusters at different scales from fine to coarse
143
Wavelet Transform



Decomposes a signal into different frequency
subbands (can be applied to n-dimensional
signals)
Data are transformed to preserve relative
distance between objects at different levels of
resolution
Allows natural clusters to become more
distinguishable
144
What Is Wavelet (2)?
145
Quantization
146
Transformation
147
WaveCluster (1998)


Why is wavelet transformation useful for clustering
 Unsupervised clustering
It uses hat-shape filters to emphasize region where
points cluster, but simultaneously to suppress weaker
information in their boundary
 Effective removal of outliers
 Multi-resolution
 Cost efficiency
Major features:
 Complexity O(N)
 Detect arbitrary shaped clusters at different scales
 Not sensitive to noise, not sensitive to input order
 Only applicable to low dimensional data
148
CLIQUE (Clustering In QUEst)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).

Automatically identifying subspaces of a high dimensional
data space that allow better clustering than original space

CLIQUE can be considered as both density-based and grid-based

It partitions each dimension into the same number of equal-length intervals

It partitions an m-dimensional data space into non-overlapping rectangular units

A unit is dense if the fraction of total data points
contained in the unit exceeds the input model parameter

A cluster is a maximal set of connected dense units
within a subspace
149
CLIQUE: The Major Steps



Partition the data space and find the number of points that
lie inside each cell of the partition.
Identify the subspaces that contain clusters using the
Apriori principle
Identify clusters:



Determine dense units in all subspaces of interests
Determine connected dense units in all subspaces of
interests.
Generate minimal description for the clusters
 Determine maximal regions that cover a cluster of
connected dense units for each cluster
 Determination of minimal cover for each cluster
150
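A minimal sketch of CLIQUE's first steps on 2-D data — partitioning each dimension into equal-length intervals, counting points per unit, and keeping only dense units whose 1-D projections are also dense (the Apriori principle); the grid size xi and threshold tau are assumptions, not values from the method's description.

from collections import Counter
import numpy as np

def dense_units(X, xi=5, tau=3):
    """Return the 2-D grid cells that are dense and whose 1-D projections are dense."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cells = np.minimum(((X - mins) / (maxs - mins + 1e-12) * xi).astype(int), xi - 1)
    dense_1d = {(d, c) for d in range(X.shape[1])
                for c, n in Counter(cells[:, d]).items() if n >= tau}
    counts_2d = Counter(map(tuple, cells))
    return [u for u, n in counts_2d.items() if n >= tau
            and (0, u[0]) in dense_1d and (1, u[1]) in dense_1d]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([25, 3], 1.0, (20, 2)), rng.normal([45, 6], 1.0, (20, 2))])
print(dense_units(X))     # grid cells (bin in dim 0, bin in dim 1) that form dense units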
=3
30
40
Vacation
20
50
Salary
(10,000)
0 1 2 3 4 5 6 7
30
Vacation
(week)
0 1 2 3 4 5 6 7
age
60
20
30
40
50
age
60
50
age
151
Strength and Weakness of CLIQUE


Strength
 It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
 It is insensitive to the order of records in input and
does not presume some canonical data distribution
 It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
Weakness
 The accuracy of the clustering result may be degraded
at the expense of simplicity of the method
152
Chapter 5. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
153
Model-Based Clustering Methods


Attempt to optimize the fit between the data and some
mathematical model
Statistical and AI approach
 Conceptual clustering




A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds characteristic description for each concept (class)
COBWEB (Fisher’87)



A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification
tree
Each node refers to a concept and contains a probabilistic
description of that concept
154
COBWEB

Incremental learning to form hierarchy of concepts
 Tree structure
 Root node represents highest level of concept
generalization





Summary information for all instances
Each object is presented to the existing concept hierarchy
in a sequential manner
In the beginning the tree consists of the root alone
Objects are added one by one
The tree is updated at each stage
155



Create a cluster with the first instance as its only
member
For each new instance
 Place the new instance into an existing cluster
 Create a new concept cluster having the new
instance as its only member
At each level of the hierarchy an evaluation
function is used (category utility)
 To include the object in an existing cluster
 To create a new cluster the object being its
only member
156
Category Utility

Attribute-value format categorical values
 Ai i =1..n attributes
Vij j th value of attribute i
Three probabilities:
 Conditional probability of attribute Ai having
value Vij given in class Ck




P(Ai=Vij|Ck)
Note that if P(Ai = Vij|Ck) = 1

Each instance of class Ck will always have Vij as the
value for attribute Ai
157


Second probability:
P(Ck | Ai = Vij): the conditional probability that an object is in class Ck given that attribute Ai has value Vij
 If P(Ck | Ai = Vij) = 1, then whenever Ai has value Vij the class must be Ck
 Attribute Ai having value Vij is a sufficient condition for defining class Ck
158
 Partition quality:
  Σk Σi Σj P(Ai = Vij) · P(Ck | Ai = Vij) · P(Ai = Vij | Ck)
 Category utility is defined as
  CU = [ Σk=1..K P(Ck) ( Σi Σj P(Ai = Vij | Ck)² − Σi Σj P(Ai = Vij)² ) ] / K
 The first numerator term is the partition-quality expression after applying Bayes' rule
 The second term is the probability of correctly guessing attribute values without any category knowledge
 Division by K, the total number of classes, gives a per-cluster utility
159
 Category utility can thus be expressed as
  CU = [ Σk=1..K P(Ck) ( Σi Σj P(Ai = Vij | Ck)² − Σi Σj P(Ai = Vij)² ) ] / K
 The inner difference, Σi Σj P(Ai = Vij | Ck)² − Σi Σj P(Ai = Vij)², measures how much advantage cluster Ck gives in predicting the values of the attributes of the objects in that cluster
 The double summation therefore captures the improvement in predicting attribute values gained by knowing the clusters
 (A small computational sketch follows.)
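A minimal computational sketch of the formula above for categorical data (illustrative only, not part of the original notes): each object is represented as a dict mapping attribute name to value, and clusters is a list of such object lists.

from collections import Counter

def category_utility(clusters):
    objects = [obj for cluster in clusters for obj in cluster]
    n = len(objects)
    attributes = objects[0].keys()

    # Baseline term: sum over i, j of P(Ai = Vij)^2, ignoring the clusters.
    baseline = sum(
        sum((count / n) ** 2
            for count in Counter(obj[a] for obj in objects).values())
        for a in attributes
    )

    total = 0.0
    for cluster in clusters:
        p_ck = len(cluster) / n
        # Within-cluster term: sum over i, j of P(Ai = Vij | Ck)^2.
        within = sum(
            sum((count / len(cluster)) ** 2
                for count in Counter(obj[a] for obj in cluster).values())
            for a in attributes
        )
        total += p_ck * (within - baseline)
    return total / len(clusters)   # divide by K for the per-cluster utility

Comparing the utility of alternative partitions with a helper like this is the kind of comparison COBWEB makes when deciding where to place a new object, as in the example that follows.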
160
Example
Object  Tails  Color  Nuclei
1       one    light  one
2       two    light  two
3       two    dark   two
4       one    dark   three
5       one    light  two
6       one    light  two
7       one    light  three
161
[Figure: COBWEB classification tree for the seven objects above. Each node stores its probability P(N) together with P(V|C) and P(C|V) for every attribute value (Tails: one/two; Color: light/dark; Nuclei: one/two/three). The root N has P(N) = 7/7; its children have P(N) = 3/7, 2/7, and 2/7; and one of them is further divided into two subconcepts with P(N) = 1/3 and 2/3.]
162
 A new object to be placed in the hierarchy enters at the root node N
  N's statistics are updated
 As the object passes down to the next level of the hierarchy, one of four actions is chosen
 (1) If the new object is similar enough to one of N1, N2, N4:
  It is incorporated into the preferred node
  The process then proceeds to the next level of the hierarchy
 (2) If the new object is unique enough to merit the creation of a new first-level concept node:
  The object becomes a first-level concept node
  The classification process terminates
163
 (3) Merge the two best-scoring nodes into a single node
 (4) Split: remove the best-scoring node from the hierarchy, replacing it with its children
 If (3) or (4) is performed, the new object is once again presented for classification to the modified hierarchy
 This continues at each level of the concept tree until the new object becomes a terminal node
164
Example 2
 The weather data, without using the play attribute
 At the beginning, as new objects are presented to the structure, they each form their own subcluster under the root
165
ID  Outlook   Temperature  Humidity  Windy  Play
a   sunny     hot          high      false  no
b   sunny     hot          high      true   no
c   overcast  hot          high      false  yes
d   rainy     mild         high      false  yes
e   rainy     cool         normal    false  yes
f   rainy     cool         normal    true   no
g   overcast  cool         normal    true   yes
h   sunny     mild         high      false  no
i   sunny     cool         normal    false  yes
j   rainy     mild         normal    false  yes
k   sunny     mild         normal    true   yes
l   overcast  mild         high      true   yes
m   overcast  hot          normal    false  yes
n   rainy     mild         high      true   no
166
[Tree: root {a b c d e} with one leaf per object: a:no, b:no, c:yes, d:yes, e:yes]
For each of the first five objects it is better, in terms of category utility, to form a new leaf for each object
167
[Tree: root {a b c d e f} with leaves a:no, b:no, c:yes, d:yes and a subcluster {e f} containing e:yes, f:no]
It finally becomes beneficial to form a cluster, joining the new object f with the old host e
e and f are very similar, differing only in the windy attribute
168
[Tree: root {a b c d e f g} with leaves a:no, b:no, c:yes, d:yes and a subcluster {e f g} containing e:yes, f:no, g:yes]
g is placed in the same cluster; it differs from e only in outlook
When g is introduced, the algorithm first evaluates which of the five children of the root makes the best host; it turns out to be the rightmost, {e f}
169
[Tree: root {a b c d e f g} with leaves a:no, b:no, c:yes, d:yes and a subcluster {e f g} containing e:yes, f:no, g:yes]
Then, at the second level, the root of the subtree is {e f}; its two children are evaluated to see which will be the better host
It turns out to be best, in terms of category utility, to add g as a subcluster of its own
170
[Tree: root {a b c d e f g h} with a subcluster {a d h} containing a:no, d:yes, h:no, leaves b:no and c:yes, and a subcluster {e f g} containing e:yes, f:no, g:yes]
When h is introduced
171
 Nodes a and d are merged into a single cluster before the new object h is added
 One method:
  Consider all pairs of nodes for merging and evaluate the category utility of each pair
  This is computationally expensive
 COBWEB instead:
  Finds the best host for h (here, a)
  Finds the second-best host for h (here, d)
  Merges a and d into a single node
  Tests the category utility of placing h into the merged node {a d} as well
172
Splitting Operation
 The converse of merging
  Whenever the best host has been identified and merging has not proven beneficial, consider splitting the host node
 Splitting takes a node and replaces it with its children
173
[Tree: splitting example over the objects {a b c d e f g h}, showing the subcluster {a d h} (a:no, d:yes, h:no) and the nodes b:no, c:yes, e:yes, f:no, g:yes]
Splitting example
174
COBWEB Clustering Method
A classification tree
175
More on Statistical-Based Clustering
 Limitations of COBWEB
  The assumption that the attributes are independent of each other is often too strong, because correlations may exist
  Not suitable for clustering large database data: skewed trees and expensive probability distributions
 CLASSIT
  An extension of COBWEB for incremental clustering of continuous data
  Suffers from problems similar to those of COBWEB
 AutoClass (Cheeseman and Stutz, 1996)
  Uses Bayesian statistical analysis to estimate the number of clusters
  Popular in industry
176
Shortcomings of Conceptual Clustering Systems
 Object ordering can have a marked impact on the results of the clustering
  The merge and split operations are used to overcome this problem
 Not useful in all cases
177
Probability Based Clustering
 A mixture is a set of k probability distributions, where each distribution represents a cluster
 The mixture model assigns each object a probability of being a member of cluster k
  Rather than assigning the object to cluster k or not, as in k-means, k-medoids, ...
178
Simplest Case
 K = 2: two clusters
 There is one real-valued attribute X
 Each cluster is normally distributed
 Five parameters to determine:
  Mean and standard deviation for cluster A: μA, σA
  Mean and standard deviation for cluster B: μB, σB
  Sampling probability for cluster A: pA
   Since P(A) + P(B) = 1, pB follows
 (A small data-generation sketch follows.)
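A small data-generation sketch of this two-component mixture (the numeric parameter values are made up for illustration): each object is drawn from cluster A with probability pA and from cluster B otherwise.

import random

def sample_mixture(n, mu_a=5.0, sigma_a=1.0, mu_b=9.0, sigma_b=1.5, p_a=0.6):
    """Draw n values from the mixture pA*N(muA, sigmaA) + (1 - pA)*N(muB, sigmaB)."""
    data = []
    for _ in range(n):
        if random.random() < p_a:                 # object comes from cluster A
            data.append(random.gauss(mu_a, sigma_a))
        else:                                     # object comes from cluster B
            data.append(random.gauss(mu_b, sigma_b))
    return data

xs = sample_mixture(500)   # a sample to which EM can later be applied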
179
 There are two populations, each normally distributed with its own mean and standard deviation
 An object comes from one of them, but which one is unknown
  Assign a probability that an object comes from A or from B
   This is equivalent to assigning a (soft) cluster to each object
180
In parametric statistics:
 Assume that objects come from a distribution, usually a normal distribution, with unknown parameters μ and σ
 Estimate the parameters from the data
 Test hypotheses based on the normality assumption
 Test the normality assumption itself
181
[Figure: two normal distributions form a bimodal density; arbitrary shapes can be captured by a mixture of normals]
182
Unbiased Estimators of Parameters
 If you know which distribution the samples come from, estimating the parameters (mean and standard deviation) is easy:
  X̄ = (x1 + x2 + ... + xn) / n
  s² = [ (x1 − X̄)² + (x2 − X̄)² + ... + (xn − X̄)² ] / (n − 1)
 The n − 1 denominator makes s² the unbiased estimator of the variance
 X̄ and s² are unbiased estimators of the unknown parameters μ and σ²:
  E(X̄) = μ and E(s²) = σ²
 (A tiny numerical illustration follows.)
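A tiny numerical illustration of these estimators with the Python standard library (the sample values are made up): statistics.variance divides by n − 1 (unbiased), while statistics.pvariance divides by n, which is the maximum-likelihood estimate discussed a few slides later.

import statistics

sample = [4.2, 5.1, 4.8, 5.9, 5.0, 4.4]     # hypothetical observations
x_bar = statistics.mean(sample)              # (x1 + ... + xn) / n
s2 = statistics.variance(sample)             # squared deviations / (n - 1), unbiased
s2_ml = statistics.pvariance(sample)         # squared deviations / n, ML (biased) estimate
print(x_bar, s2, s2_ml)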
183
Maximum Likelihood Estimators
 The overall likelihood that the data came from the assumed model, given values for the two parameters and assuming a probability density function, is obtained by multiplying the probabilities of the individual objects i (assuming they are independent):
  L = Πi P(xi | μ, σ), where P(xi | μ, σ) = f(xi; μ, σ)
 The likelihood measures how likely the sample points x1, x2, ..., xn are under the assumed parameters
184
Maximum Likelihood Estimators
 For a continuous variable X, f(X; μ, σ) is not a probability but a probability density, so the likelihood function does not have to lie between 0 and 1
 In practice its logarithm is used:
  log L = Σi log P(xi | μ, σ)
185
ML for Discrete Variables
 Tossing a coin:
  PH: probability of head
  1 − PH: probability of tail
 If PH is unknown, how do you estimate it from sample observations of heads and tails?
 Three tosses: H H T
  The likelihood of that sample is PH · PH · PT = PH² (1 − PH)
 Estimate the probability of head by maximizing the likelihood of that particular sample
 In this case the likelihood is interpreted as the probability of observing that sample (a quick check follows)
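As a quick check (this step is not spelled out on the original slide), maximizing the likelihood reproduces the relative-frequency estimate:
 L(PH) = PH² (1 − PH),  dL/dPH = 2 PH − 3 PH² = 0  ⇒  PH = 2/3,
i.e. the fraction of heads observed in the three tosses.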
186
Maximum Likelihood Estimators
 Observe the sample
 Estimate the values of the unknown parameters μ and σ
  Assuming that the observations come from a normal distribution
 Take the derivatives of the likelihood function with respect to μ and σ and equate them to zero:
  dL/dμ = 0 and dL/dσ = 0, or equivalently
  d log L/dμ = 0 and d log L/dσ = 0
187
Maximum Likelihood Estimators
 X̄ML = (x1 + x2 + ... + xn) / n
 s²ML = [ (x1 − X̄ML)² + (x2 − X̄ML)² + ... + (xn − X̄ML)² ] / n
 Notice that the ML estimate of μ is the same as its point estimate X̄
 The ML estimate of σ² is biased (it divides by n rather than n − 1)
 The data are then modelled as coming from a normal distribution with mean X̄ML and variance s²ML
188
Returning to Mixtures
 Given the five parameters, finding the probabilities that a given object comes from each distribution (cluster) is easy
 Given an object X, the probability that it belongs to cluster A is
  P(A|X) = P(A, X) / P(X) = P(X|A) P(A) / P(X) = f(X; μA, σA) P(A) / P(X)
 and similarly
  P(B|X) = P(B, X) / P(X) = P(X|B) P(B) / P(X) = f(X; μB, σB) P(B) / P(X)
189
 f(X; μA, σA) is the normal density function for cluster A:
  f(X; μ, σ) = (1 / (√(2π) σ)) · exp( −(X − μ)² / (2σ²) )
 The denominator P(X) cancels: calculate the numerators of both P(A|X) and P(B|X) and normalize them by dividing by their sum
190
 P(A|X) or P(B|X) is the conditional probability that the object X, whose data value is known, is in cluster A or B
  Assume a density for each cluster, say a normal distribution
  P(X|A) is then the normal density for cluster A, whose parameters (mean and standard deviation) are unknown and must be estimated
  Similarly for P(X|B)
191
 P(A) or P(B) is the unconditional (prior) probability that a randomly selected object is in cluster A or B, without knowing its data value
 Here P(A) + P(B) = 1, so only one of the two is an unknown parameter
192
 P(A|X) = P(X|A) P(A) / [ P(X|A) P(A) + P(X|B) P(B) ]
 P(B|X) = P(X|B) P(B) / [ P(X|A) P(A) + P(X|B) P(B) ]
 Note that the final outcomes P(A|X) and P(B|X) give the probabilities that the given object X belongs to clusters A and B
 Note that these can be calculated only by knowing the prior probabilities P(A) and P(B)
 (A small computational sketch follows.)
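A minimal sketch of this computation in Python (parameter names are illustrative): normal densities for clusters A and B are combined with the priors and normalized so that P(A|x) + P(B|x) = 1.

import math

def normal_pdf(x, mu, sigma):
    """f(x; mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def responsibilities(x, mu_a, sigma_a, mu_b, sigma_b, p_a):
    num_a = normal_pdf(x, mu_a, sigma_a) * p_a         # P(x|A) * P(A)
    num_b = normal_pdf(x, mu_b, sigma_b) * (1 - p_a)   # P(x|B) * P(B)
    total = num_a + num_b                              # plays the role of P(x)
    return num_a / total, num_b / total                # P(A|x), P(B|x)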
193
EM Algorithm for the Simple Case
 We know neither of these things:
  The distribution that each object came from, i.e. P(X|A) and P(X|B)
  Nor the five parameters of the mixture model: μA, μB, σA, σB, P(A)
 Adapt the procedure used for the k-means algorithm
194
EM Algorithm for the Simple Case
 Initially, guess values for the five parameters μA, μB, σA, σB, P(A)
 Until a specified termination criterion is met:
  Expectation step: compute P(A|Xi) and P(B|Xi) for each Xi, i = 1..n
   These are the class probabilities for each object, computed from the normal densities with the current parameter values
  Maximization step: use these probabilities to re-estimate the parameters μA, μB, σA, σB, P(A), maximizing the likelihood of the distributions
 Iterate until the likelihood converges, i.e. no longer improves appreciably
195
Estimation of new means
 If wiA is the probability that object i belongs to cluster A, i.e. wiA = P(A|Xi) as calculated in the E step:
  μA = (w1A x1 + w2A x2 + ... + wnA xn) / (w1A + w2A + ... + wnA)
 Similarly, with wiB = P(B|Xi) calculated in the E step:
  μB = (w1B x1 + w2B x2 + ... + wnB xn) / (w1B + w2B + ... + wnB)
196
Estimation of new standard deviations
 σ²A = [ w1A (x1 − μA)² + w2A (x2 − μA)² + ... + wnA (xn − μA)² ] / (w1A + w2A + ... + wnA)
 σ²B = [ w1B (x1 − μB)² + w2B (x2 − μB)² + ... + wnB (xn − μB)² ] / (w1B + w2B + ... + wnB)
 The xi range over all the objects, not just those belonging to A or B
 These are maximum likelihood estimators of the variance
  If the weights were all equal, the denominator would sum to n rather than n − 1
197
Estimation of new prior probabilities
 P(A) = [ P(A|X1) + P(A|X2) + ... + P(A|Xn) ] / n
  i.e. pA = (w1A + w2A + ... + wnA) / n
 P(B) = [ P(B|X1) + P(B|X2) + ... + P(B|Xn) ] / n
  i.e. pB = (w1B + w2B + ... + wnB) / n
 Or simply P(B) = 1 − P(A)
198
How to terminate the iterations
 k-means stops when the cluster assignments of the objects do not change from one iteration to the next
 For EM, compute the overall likelihood that the data came from this model, given the values of the five parameters, by multiplying the probabilities of the individual objects i:
  L = Πi ( pA P(Xi|A) + pB P(Xi|B) )
 The probabilities given clusters A and B are determined from the normal density function:
  P(Xi|A) = f(Xi; μA, σA)
 The likelihood is a measure of the goodness of the clustering and increases at each iteration of the EM algorithm
199
 For a continuous variable X, f(X; μA, σA) is not a probability but a probability density, so the likelihood does not have to lie between 0 and 1
 In practice its logarithm is used; since y = log x is a monotonically increasing function, maximizing L is equivalent to maximizing log L:
  log L = Σi log( pA P(Xi|A) + pB P(Xi|B) )
 Iterate until the increase in the likelihood (or log-likelihood) becomes negligible
 (A compact EM sketch for this two-component case follows.)
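A compact, self-contained EM sketch for this two-component case (a hedged illustration under the assumptions above, with no numerical safeguards; the tolerance and iteration cap are arbitrary choices):

import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_two_gaussians(xs, mu_a, sigma_a, mu_b, sigma_b, p_a, tol=1e-6, max_iter=200):
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E step: responsibility of cluster A for every object, P(A|xi).
        w_a = []
        for x in xs:
            na = p_a * normal_pdf(x, mu_a, sigma_a)
            nb = (1 - p_a) * normal_pdf(x, mu_b, sigma_b)
            w_a.append(na / (na + nb))

        # M step: weighted re-estimation of the five parameters.
        sum_a = sum(w_a)
        sum_b = len(xs) - sum_a
        mu_a = sum(w * x for w, x in zip(w_a, xs)) / sum_a
        mu_b = sum((1 - w) * x for w, x in zip(w_a, xs)) / sum_b
        sigma_a = math.sqrt(sum(w * (x - mu_a) ** 2 for w, x in zip(w_a, xs)) / sum_a)
        sigma_b = math.sqrt(sum((1 - w) * (x - mu_b) ** 2 for w, x in zip(w_a, xs)) / sum_b)
        p_a = sum_a / len(xs)

        # Termination: stop when the log-likelihood no longer improves appreciably.
        ll = sum(math.log(p_a * normal_pdf(x, mu_a, sigma_a) +
                          (1 - p_a) * normal_pdf(x, mu_b, sigma_b)) for x in xs)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu_a, sigma_a, mu_b, sigma_b, p_a

Starting values for the five parameters can come from random guesses or, as noted on the next slide, from a preliminary k-means run.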
200
General Considerations
 EM converges to a local maximum of the likelihood score
  It may not be the global maximum
 EM may be run with different initial values of the parameters, and the clustering structure with the highest likelihood is chosen
 Alternatively, an initial k-means run can be used to obtain starting values for the means and standard deviations
201
Other Model-Based Clustering Methods
 Neural network approaches
  Represent each cluster as an exemplar, acting as a "prototype" of the cluster
  New objects are assigned to the cluster whose exemplar is the most similar, according to some distance measure
 Competitive learning
  Involves a hierarchical architecture of several units (neurons)
  Neurons compete in a "winner-takes-all" fashion for the object currently being presented
202
Model-Based Clustering Methods
203
Self-Organizing Feature Maps (SOMs)
 Clustering is also performed by having several units compete for the current object
 The unit whose weight vector is closest to the current object wins
 The winner and its neighbors learn by having their weights adjusted toward the object
 SOMs are believed to resemble processing that can occur in the brain
 Useful for visualizing high-dimensional data in 2-D or 3-D space
 (A small weight-update sketch follows.)
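A minimal sketch of SOM training on a small 2-D grid of units (Euclidean distance, a simple square neighborhood of radius 1, and a fixed learning rate; the grid size and all parameter values are illustrative assumptions, with no decay schedules):

import random

def train_som(data, rows=5, cols=5, dim=3, lr=0.3, radius=1, epochs=20):
    # Initialize each unit's weight vector randomly.
    weights = [[[random.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]

    def dist2(w, x):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

    for _ in range(epochs):
        for x in data:
            # The winner is the unit whose weight vector is closest to the object.
            bi, bj = min(((i, j) for i in range(rows) for j in range(cols)),
                         key=lambda ij: dist2(weights[ij[0]][ij[1]], x))
            # The winner and its grid neighbors adjust their weights toward the object.
            for i in range(rows):
                for j in range(cols):
                    if abs(i - bi) <= radius and abs(j - bj) <= radius:
                        w = weights[i][j]
                        for d in range(dim):
                            w[d] += lr * (x[d] - w[d])
    return weights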
204
Chapter 5. Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
205
What Is Outlier Discovery?
 What are outliers?
  Objects that are considerably dissimilar from the remainder of the data
  Example from sports: Michael Jordan, Wayne Gretzky, ...
 Problem: find the top n outlier points
 Applications:
  Credit card fraud detection
  Telecom fraud detection
  Customer segmentation
  Medical analysis
206
Outlier Discovery: Statistical Approaches
 Assume a model of the underlying distribution that generates the data set (e.g. a normal distribution)
 Use discordancy tests, which depend on
  the data distribution
  the distribution parameters (e.g., mean, variance)
  the number of expected outliers
 Drawbacks
  Most tests are for a single attribute
  In many cases the data distribution may not be known
207
Outlier Discovery: Distance-Based Approach
 Introduced to counter the main limitations imposed by statistical methods
  We need multi-dimensional analysis without knowing the data distribution
 Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O
 Algorithms for mining distance-based outliers (a nested-loop sketch follows):
  Index-based algorithm
  Nested-loop algorithm
  Cell-based algorithm
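A hedged nested-loop sketch of the DB(p, D) definition above (quadratic in the number of objects; taking the fraction p over the other n − 1 objects is an implementation assumption here):

import math

def db_outliers(points, p=0.95, d=2.0):
    """Return the objects that are DB(p, D)-outliers under the definition above."""
    def euclid(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    n = len(points)
    outliers = []
    for o in points:
        far = sum(1 for q in points if q is not o and euclid(o, q) > d)
        if far >= p * (n - 1):      # at least a fraction p of the objects are far from o
            outliers.append(o)
    return outliers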
208
Outlier Discovery: Deviation-Based Approach
 Identifies outliers by examining the main characteristics of the objects in a group
 Objects that "deviate" from this description are considered outliers
 Sequential exception technique
  Simulates the way in which humans distinguish unusual objects from among a series of supposedly similar objects
 OLAP data cube technique
  Uses data cubes to identify regions of anomalies in large multidimensional data
209
Problems and Challenges
 Considerable progress has been made in scalable clustering methods
  Partitioning: k-means, k-medoids, CLARANS
  Hierarchical: BIRCH, CURE
  Density-based: DBSCAN, CLIQUE, OPTICS
  Grid-based: STING, WaveCluster
  Model-based: AutoClass, DENCLUE, COBWEB
 Current clustering techniques do not address all the requirements adequately
 Constraint-based clustering analysis: constraints exist in the data space (e.g. bridges and highways) or in user queries
210
Constraint-Based Clustering Analysis
 Clustering analysis with fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
211
Clustering With Obstacle Objects
[Figure: clustering results when obstacles are not taken into account vs. when obstacles are taken into account]
212
Chapter 5. Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
213
Summary
 Cluster analysis groups objects based on their similarity and has wide applications
 Measures of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
 Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
 There are still lots of research issues in cluster analysis, such as constraint-based clustering
214
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
 P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
 S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
215
References (2)
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
 P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
 W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.
216
www.cs.uiuc.edu/~hanj
Thank you !!!
217