Cluster Analysis:
— Chapter 4 —
BIS 541
2015/2016 Summer
1
What is Cluster Analysis?

- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis
  - Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
- Clustering is unsupervised classification: no predefined classes
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms
  - Measuring the performance of supervised learning algorithms
2
Basic Measures for Clustering

- Clustering: Given a database D = {t1, t2, ..., tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f: D → {1, ..., k} where each ti is assigned to one cluster Kf, 1 ≤ f ≤ k.
3
General Applications of Clustering

- Pattern Recognition
- Spatial Data Analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image Processing
- Economic Science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns
5
Examples of Clustering Applications

- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: Identification of areas of similar land use in an earth observation database
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
- City-planning: Identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
6
Constraint-Based Clustering Analysis

- Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
7
Clustering Cities

- Clustering Turkish cities
- Based on the following characteristics:
  - Political: general elections of 1999 and 2002
  - Demographic: population, urbanization rates
  - Economic: GNP per head, growth rate of GNP
8
What Is Good Clustering?

- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
9
Requirements of Clustering in Data Mining

- Scalability
- Ability to deal with different types of attributes
- Ability to handle dynamic data
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
10
Chapter 5. Cluster Analysis

- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary
11
Partitioning Algorithms: Basic Concept

- Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized:

  $\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$

- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
    - k-means (MacQueen'67): Each cluster is represented by the center of the cluster
    - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster
12
The K-Means Clustering Algorithm

- Choose k, the number of clusters to be determined
- Choose k objects randomly as the initial cluster centers
- repeat
  - Assign each object to its closest cluster center
    - using Euclidean distance
  - Compute new cluster centers
    - calculate mean points
- until
  - no change in cluster centers, or
  - no object changes its cluster
13
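The steps above translate directly into a few lines of NumPy. A minimal sketch, assuming numeric data in a NumPy array X; the function name k_means and its parameters are illustrative, not from any library:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # choose k objects randomly as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each object to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # compute new cluster centers as the mean points of each cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # no change in centers: stop
            break
        centers = new_centers
    return centers, labels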
The K-Means Clustering Method

- Example (K = 2)

[Figure: k-means iterations on a 2-D point set. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the means again, until assignments stabilize.]
14
Example (TBP Sec 3.3, page 84, Table 3.6)

  Instance   X     Y
  1          1.0   1.5
  2          1.0   4.5
  3          2.0   1.5
  4          2.0   3.5
  5          3.0   2.5
  6          5.0   6.0

- k is chosen as 2 (k = 2)
- Choose two points at random as the initial cluster centers
  - Objects 1 and 3 are chosen as cluster centers
15
[Figure: scatter plot of the six instances, numbered 1-6; the initial cluster centers at objects 1 and 3 are marked with *.]
16
Example cont.

- Euclidean distance between points i and j:
  $D(i,j) = \sqrt{(X_i - X_j)^2 + (Y_i - Y_j)^2}$
- Initial cluster centers: C1 = (1.0, 1.5), C2 = (2.0, 1.5)

  Object   D(C1, .)   D(C2, .)   Assigned to
  1        0.00       1.00       C1
  2        3.00       3.16       C1
  3        1.00       0.00       C2
  4        2.24       2.00       C2
  5        2.24       1.41       C2
  6        6.02       5.41       C2

- C1: {1, 2}, C2: {3, 4, 5, 6}
17
[Figure: scatter plot of the six instances after the first assignment; the centers are still at objects 1 and 3.]
18
Example cont.

- Recomputing cluster centers
- For C1:
  - XC1 = (1.0 + 1.0)/2 = 1.0
  - YC1 = (1.5 + 4.5)/2 = 3.0
- For C2:
  - XC2 = (2.0 + 2.0 + 3.0 + 5.0)/4 = 3.0
  - YC2 = (1.5 + 3.5 + 2.5 + 6.0)/4 = 3.375
- Thus the new cluster centers are C1(1.0, 3.0) and C2(3.0, 3.375)
- As the cluster centers have changed, the algorithm performs another iteration
19
[Figure: scatter plot of the six instances with the updated centers C1(1.0, 3.0) and C2(3.0, 3.375) marked with *.]
20
Example cont.

- New cluster centers: C1(1.0, 3.0) and C2(3.0, 3.375)

  Object   D(C1, .)   D(C2, .)   Assigned to
  1        1.50       2.74       C1
  2        1.50       2.29       C1
  3        1.80       2.13       C1
  4        1.12       1.01       C2
  5        2.06       0.88       C2
  6        5.00       3.30       C2

- C1: {1, 2, 3}, C2: {4, 5, 6}
21
Example cont.

- Computing new cluster centers
- For C1:
  - XC1 = (1.0 + 1.0 + 2.0)/3 = 1.33
  - YC1 = (1.5 + 4.5 + 1.5)/3 = 2.50
- For C2:
  - XC2 = (2.0 + 3.0 + 5.0)/3 = 3.33
  - YC2 = (3.5 + 2.5 + 6.0)/3 = 4.00
- Thus the new cluster centers are C1(1.33, 2.50) and C2(3.33, 4.00)
- As the cluster centers have changed, the algorithm performs another iteration
22
Exercise

- Perform the third iteration (a code sketch below can be used to check the result)
23
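A quick way to check the exercise is to run one more assignment/update step in NumPy, starting from the centers computed above (a sketch; the variable names are illustrative):

import numpy as np

X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
centers = np.array([[1.33, 2.50], [3.33, 4.00]])  # centers after the second iteration

# one more k-means step: reassign each object, then recompute the means
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)
new_centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])
print(labels, new_centers)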
Comments

- Each choice of initial cluster centers may end up with a different final cluster configuration
- Finds a local optimum but not necessarily the global optimum
  - based on the sum of squared error (SSE): the differences between objects and their cluster centers
- Choose a terminating criterion such as
  - a maximum acceptable SSE
  - execute the k-means algorithm until the condition is satisfied
24
Comments on the K-Means Method

- Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
  - Comparison: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
- Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
25
Weaknesses of K-Means Algorithm

- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
  - run the algorithm with different k values
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
- Works best when clusters are of approximately equal size
26
Presence of Outliers

[Figure: the same point set, with an outlier, clustered with k = 2 and with k = 3.]

- When k = 2 the two natural clusters are not captured
27
Quality of Clustering Depends on the Unit of Measure

[Figure: the same income/age data clustered twice; the grouping changes when income is measured in YTL rather than TL while age stays in years.]

- So what to do?
28
Variations of the K-Means Method

- A few variants of k-means differ in
  - selection of the initial k means
  - dissimilarity calculations
  - strategies to calculate cluster means
- Handling categorical data: k-modes (Huang'98)
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
  - A mixture of categorical and numerical data: k-prototype method
29
Exercise

- Show by designing simple examples that:
  - a) the k-means algorithm may converge to different local optima starting from different initial assignments of objects into clusters
  - b) in the case of clusters of unequal size, the k-means algorithm may fail to catch the obvious (natural) solution
30
[Figure: two example point sets, each with a small dense cluster next to a large spread-out cluster, illustrating configurations that k-means handles poorly.]
31
How to Choose k

- For reasonable values of k
  - e.g., from 2 to 15
- plot k versus the SSE (sum of squared error)
- visually inspect the plot; as k increases, the SSE falls
- choose the breaking point (see the sketch after the following figure)
32
[Figure: SSE plotted against k = 2, 4, 6, 8, 10, 12; the curve falls steeply and then flattens at the breaking point.]
33
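A sketch of this heuristic, assuming scikit-learn is available; inertia_ is scikit-learn's name for the SSE of a fitted k-means model, and the data array X here is a random placeholder:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data
ks = range(2, 16)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE")
plt.show()  # choose k at the breaking point (elbow) of the curve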
Validation of Clustering

- Partition the data into two equal groups
- Apply clustering to one of these partitions
  - compare its cluster centers with those of the overall data
- or
  - apply clustering to each of the groups
  - compare the cluster centers
34
Basic Measures for Clustering

- Clustering: Given a database D = {t1, t2, ..., tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f: D → {1, ..., k} where each ti is assigned to one cluster Kf, 1 ≤ f ≤ k, such that for all tfp, tfq ∈ Kf and ts ∉ Kf, dis(tfp, tfq) ≤ dis(tfp, ts)
- Centroid, radius, diameter
- Typical alternatives to calculate the distance between clusters
  - Single link, complete link, average, centroid, medoid
35
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

- Centroid: the "middle" of a cluster
  $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
- Radius: square root of the average distance from any point of the cluster to its centroid
  $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
- Diameter: square root of the average mean squared distance between all pairs of points in the cluster
  $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
36
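The three quantities map directly onto NumPy expressions. A sketch for a cluster stored as an N x p array (the function names are illustrative):

import numpy as np

def centroid(points):
    # C_m: component-wise mean of the cluster
    return points.mean(axis=0)

def radius(points):
    # R_m: sqrt of the average squared distance to the centroid
    c = centroid(points)
    return np.sqrt(((points - c) ** 2).sum(axis=1).mean())

def diameter(points):
    # D_m: sqrt of the average squared distance over all ordered pairs
    n = len(points)
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(sq.sum() / (n * (n - 1)))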
Typical Alternatives to Calculate the Distance between Clusters

- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(dis(tip, tjq))
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(dis(tip, tjq))
- Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(dis(tip, tjq))
- Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
- Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
  - Medoid: one chosen, centrally located object in the cluster
37
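These alternatives differ only in how the matrix of pairwise distances between the two clusters is reduced; a sketch (names illustrative):

import numpy as np

def pairwise(K1, K2):
    # matrix of Euclidean distances between every element of K1 and of K2
    return np.linalg.norm(K1[:, None, :] - K2[None, :, :], axis=2)

def single_link(K1, K2):   return pairwise(K1, K2).min()
def complete_link(K1, K2): return pairwise(K1, K2).max()
def average_link(K1, K2):  return pairwise(K1, K2).mean()

def centroid_link(K1, K2):
    return np.linalg.norm(K1.mean(axis=0) - K2.mean(axis=0))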
Major Clustering Approaches

- Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
- Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
38
The K-Medoids Clustering Method

- Find representative objects, called medoids, in clusters
- PAM (Partitioning Around Medoids, 1987)
  - starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well for large data sets
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): randomized sampling
- Focusing + spatial data structure (Ester et al., 1995)
39
CLARA (Clustering Large Applications) (1990)

- CLARA (Kaufmann and Rousseeuw, 1990)
  - Built into statistical analysis packages, such as S+
- It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than PAM
- Weakness:
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
40
CLARANS ("Randomized" CLARA) (1994)

- CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94)
- CLARANS draws a sample of neighbors dynamically
- The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
- If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
- It is more efficient and scalable than both PAM and CLARA
- Focusing techniques and spatial access structures may further improve its performance (Ester et al.'95)
41
Hierarchical Clustering

- Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: objects a, b, c, d, e. Agglomerative (AGNES), steps 0 to 4: a and b merge into ab, d and e into de, then cde, then abcde. Divisive (DIANA) runs the same hierarchy in reverse, steps 4 to 0.]
42
AGNES (Agglomerative Nesting)

- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., S-Plus
- Uses the single-link method and the dissimilarity matrix
- Merge nodes that have the least dissimilarity
- Go on in a non-descending fashion
- Eventually all nodes belong to the same cluster

[Figure: three scatter plots showing the point set being merged into progressively larger clusters.]
43
A Dendrogram Shows How the Clusters are Merged Hierarchically

- Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
44
Example

- Dissimilarity matrix for five objects A-E (lower triangle):

       A   B   C   D   E
  A    0
  B    1   0
  C    2   2   0
  D    2   4   1   0
  E    3   3   5   3   0
45
Single Link Distance Measure

[Figure: single-link clustering of the graph on objects A-E with the edge weights above, alongside the resulting dendrogram.]

- {A}{B}{C}{D}{E} → {AB}{CD}{E} (merges at distance 1) → {ABCD}{E} (single link between {AB} and {CD} is 2) → {ABCDE} (single link to {E} is 3)
- The dendrogram merges A-B and C-D first, then the two pairs, then E.
46
Complete Link Distance Measure

[Figure: complete-link clustering of the same graph on objects A-E, alongside the resulting dendrogram.]

- {A}{B}{C}{D}{E} → {AB}{CD}{E} (merges at distance 1) → {ABE}{CD} (complete link between {AB} and {E} is 3, smaller than {AB}-{CD} at 4) → {ABCDE} (complete link 5)
47
Average Link Distance Measure

[Figure: average-link clustering of the same graph on objects A-E, alongside the resulting dendrogram.]

- {A}{B}{C}{D}{E} → {AB}{CD}{E} (merges at distance 1) → {ABCD}{E} (average link between {AB} and {CD} is 2.5) → {ABCDE} (average link between {ABCD} and {E} is 3.5)
48
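The three merge sequences above can be reproduced with SciPy's hierarchical clustering, assuming SciPy is available; linkage expects the condensed (upper-triangle) form of the dissimilarity matrix:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

for method in ("single", "complete", "average"):
    # each row of the result lists the two clusters merged and the merge distance
    print(method, linkage(squareform(D), method=method))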
DIANA (Divisive Analysis)

- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., S-Plus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own

[Figure: three scatter plots showing one cluster being split into progressively smaller clusters.]
49
More on Hierarchical Clustering Methods

- Major weaknesses of agglomerative clustering methods
  - do not scale well: time complexity of at least O(n^2), where n is the number of total objects
  - can never undo what was done previously
- Integration of hierarchical with distance-based clustering
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  - CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling
50
Data Structures

- Data matrix (two modes):

  $\begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}$

- Dissimilarity matrix (one mode):

  $\begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$
51
Properties of Dissimilarity Measures

- Properties
  - d(i,j) ≥ 0 for i ≠ j
  - d(i,i) = 0
  - d(i,j) = d(j,i)  (symmetry)
  - d(i,j) ≤ d(i,k) + d(k,j)  (triangle inequality)
- Exercise: Can you find examples where distances between objects do not obey the symmetry property?
52
Measure the Quality of Clustering

- Dissimilarity/similarity metric: Similarity is expressed in terms of a distance function, which is typically a metric: d(i, j)
- There is a separate "quality" function that measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"
  - the answer is typically highly subjective.
53
Types of Data in Cluster Analysis

- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
54
Classification by Scale

- Nominal scale: merely distinguishes classes: with respect to A and B, X_A = X_B or X_A ≠ X_B
  - e.g.: color {red, blue, green, ...}, gender {male, female}, occupation {engineering, management, ...}
- Ordinal scale: indicates an ordering of objects in addition to distinguishing them: X_A = X_B or X_A ≠ X_B; X_A > X_B or X_A < X_B
  - e.g.: education {no school < primary sch. < high sch. < undergrad < grad}
  - age {young < middle < old}
  - income {low < medium < high}
55








- Interval scale: assigns a meaningful measure of difference between two objects
  - Not only X_A > X_B, but X_A is X_A - X_B units different from X_B
  - e.g.: specific gravity, temperature in °C or °F
  - The boiling point of water is 100 °C above its melting point (a difference of 180 °F)
- Ratio scale: an interval scale with a meaningful zero point
  - Not only X_A > X_B, but X_A is X_A / X_B times greater than X_B
  - e.g.: height, weight, age (as an integer), temperature in K or °R
  - Water boils at 373 K and melts at 273 K; its boiling point is 1.37 times hotter than its melting point
56
Comparison of Scales

- The strongest scale is ratio; the weakest is nominal
- Ahmet's height is 2.00 meters: H_A
- Mehmet's height is 1.50 meters: H_M
- H_A ≠ H_M (nominal): their heights are different
- H_A > H_M (ordinal): Ahmet is taller than Mehmet
- H_A - H_M = 0.50 meters (interval): Ahmet is 50 cm taller than Mehmet
- H_A / H_M = 1.333 (ratio scale): holds no matter whether height is measured in meters or inches
57
Interval-valued Variables

- Standardize data
- Calculate the mean absolute deviation:
  $s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$
  where
  $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$
- Calculate the standardized measurement (z-score):
  $z_{if} = \frac{x_{if} - m_f}{s_f}$
- Using the mean absolute deviation is more robust than using the standard deviation
58
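A sketch of this standardization for one variable stored as a 1-D NumPy array (the function name is illustrative):

import numpy as np

def z_score_mad(x):
    m = x.mean()                # m_f
    s = np.abs(x - m).mean()    # mean absolute deviation s_f
    return (x - m) / s          # z_if = (x_if - m_f) / s_f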
[Figure: two scatter plots of variables X1 and X2, case I (uncorrelated) and case II (correlated). Both have zero mean and unit scale after z-score standardization; X1·X2 = 0 in case I, whereas X1·X2 ≈ 1 in case II.]

- Shall we use the same distance measure in both cases after obtaining the z-scores?
59
Exercise

[Figure: the same two scatter plots, with a point A marked in each.]

- Suppose d(A, O) = 0.5 in both case I and case II. Does it reflect the distance between A and the origin?
- Suggest a transformation so as to handle correlation between variables
60
Other Standardizations

- Min-max: scale between 0 and 1, or between -1 and 1
- Decimal scaling
- For ratio-scaled variables:
  - Mean transformation: z_{i,f} = x_{i,f} / mean_f
    - measures in terms of the mean of variable f
  - Log transformation: z_{i,f} = log x_{i,f}
61
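Sketches of these alternatives for a 1-D array x (names illustrative; the log transform assumes strictly positive values):

import numpy as np

def min_max(x, lo=0.0, hi=1.0):
    # rescale x linearly into [lo, hi]
    return lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

def mean_transform(x):
    # for ratio-scaled variables: measure in units of the variable's mean
    return x / x.mean()

def log_transform(x):
    return np.log(x)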
Similarity and Dissimilarity Between Objects

- Distances are normally used to measure the similarity or dissimilarity between two data objects
- Some popular ones include the Minkowski distance:
  $d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$
  where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and q ≥ 1
- If q = 1, d is the Manhattan distance:
  $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
62
Similarity and Dissimilarity Between Objects (Cont.)

- If q = 2, d is the Euclidean distance:
  $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
- Properties
  - d(i,j) ≥ 0
  - d(i,i) = 0
  - d(i,j) = d(j,i)
  - d(i,j) ≤ d(i,k) + d(k,j)
- Also, one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures
63
Similarity and Dissimilarity Between Objects (Cont.)

- Weights can be assigned to variables:
  $d(i,j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_p |x_{ip} - x_{jp}|^2}$
  where w_i, i = 1..p, are weights showing the importance of each variable
64
[Figure: two points X_A and X_B; the Manhattan distance follows the axis-parallel path between them, the Euclidean distance the straight line.]
65
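A sketch of the (optionally weighted) Minkowski distance; q = 1 gives Manhattan, q = 2 Euclidean, and the weights default to 1 (function name illustrative):

import numpy as np

def minkowski(i, j, q=2, w=None):
    i, j = np.asarray(i, float), np.asarray(j, float)
    w = np.ones_like(i) if w is None else np.asarray(w, float)
    return (w * np.abs(i - j) ** q).sum() ** (1.0 / q)

print(minkowski([1.0, 1.5], [2.0, 3.5], q=1))  # Manhattan: 3.0
print(minkowski([1.0, 1.5], [2.0, 3.5], q=2))  # Euclidean: ~2.24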
Binary Variables

- Symmetric vs. asymmetric
- Symmetric: both of its states are equally valuable and carry the same weight
  - gender: male / female, arbitrarily coded as 0 or 1
- Asymmetric variables: outcomes are not equally important; encoded by 0 and 1
  - e.g. patient smoker or not: 1 for smoker, 0 for non-smoker
  - positive and negative outcomes of a disease test: HIV positive coded 1, HIV negative 0
66
Binary Variables

- A contingency table for binary data:

                 Object j
                 1      0      sum
  Object i  1    a      b      a+b
            0    c      d      c+d
            sum  a+c    b+d    p

- Simple matching coefficient (invariant, if the binary variable is symmetric):
  $d(i,j) = \frac{b + c}{a + b + c + d}$
- Jaccard coefficient (noninvariant if the binary variable is asymmetric):
  $d(i,j) = \frac{b + c}{a + b + c}$
67
Dissimilarity between Binary Variables

- Example

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

- gender is a symmetric attribute
- the remaining attributes are asymmetric binary
- let the values Y and P be set to 1, and the value N be set to 0

  $d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
  $d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
  $d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
68
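The two coefficients and the worked example can be checked with a short sketch, using 0/1 encodings of the asymmetric attributes only, as above (names illustrative):

def counts(x, y):
    a = sum(u == 1 and v == 1 for u, v in zip(x, y))
    b = sum(u == 1 and v == 0 for u, v in zip(x, y))
    c = sum(u == 0 and v == 1 for u, v in zip(x, y))
    d = sum(u == 0 and v == 0 for u, v in zip(x, y))
    return a, b, c, d

def simple_matching(x, y):
    a, b, c, d = counts(x, y)
    return (b + c) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, _ = counts(x, y)
    return (b + c) / (a + b + c)

# fever, cough, test-1..test-4 with Y/P -> 1, N -> 0
jack, mary, jim = [1,0,1,0,0,0], [1,0,1,0,1,0], [1,1,0,0,0,0]
print(jaccard(jack, mary), jaccard(jack, jim), jaccard(jim, mary))  # 0.33 0.67 0.75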
Nominal Variables

- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: Simple matching
  - m: # of matches, p: total # of variables
    $d(i,j) = \frac{p - m}{p}$
  - Higher weights can be assigned to variables with a large number of states
- Method 2: use a large number of binary variables
  - creating a new binary variable for each of the M nominal states
69
Example

- 2 nominal variables for students: faculty and country
- Faculty {Eng, Applied Sc., Pure Sc., Admin., ...}: 5 distinct values
- Country {Turkey, USA, ...}: 10 distinct values
- p = 2, just two variables
- The weight of country may be increased
- Student A (Eng, Turkey), student B (Applied Sc, Turkey)
- m = 1: A and B are similar in one variable
- d(A, B) = (2 - 1)/2 = 1/2
70
Example cont.

- A different binary variable for each faculty
  - Eng: 1 if the student is in engineering, 0 otherwise
  - AppSc: 1 if the student is in applied sciences, 0 otherwise
- A different binary variable for each country
  - Turkey: 1 if the student is Turkish, 0 otherwise
  - USA: 1 if the student is from the USA, 0 otherwise
71
72
Ordinal Variables

- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Can be treated like interval-scaled
  - replace x_{if} by its rank: $r_{if} \in \{1, ..., M_f\}$
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  - compute the dissimilarity using methods for interval-scaled variables
73
Example

- Credit card type: gold > silver > bronze > normal, 4 states
- Education: grad > undergrad > high school > primary school > no school, 5 states
- Two customers
  - A(gold, high school), B(normal, no school)
  - r_{A,card} = 1, r_{B,card} = 4
  - r_{A,edu} = 3, r_{B,edu} = 5
  - z_{A,card} = (1-1)/(4-1) = 0
  - z_{B,card} = (4-1)/(4-1) = 1
  - z_{A,edu} = (3-1)/(5-1) = 0.5
  - z_{B,edu} = (5-1)/(5-1) = 1
- Use any interval-scale distance measure on the z values
74
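A sketch of the rank-to-[0, 1] mapping applied to the two customers (names illustrative):

def rank_to_unit(r, M):
    # z_if = (r_if - 1) / (M_f - 1)
    return (r - 1) / (M - 1)

# card: gold=1 ... normal=4; education: grad=1 ... no school=5
zA = (rank_to_unit(1, 4), rank_to_unit(3, 5))  # A -> (0.0, 0.5)
zB = (rank_to_unit(4, 4), rank_to_unit(5, 5))  # B -> (1.0, 1.0)
dist = ((zA[0] - zB[0]) ** 2 + (zA[1] - zB[1]) ** 2) ** 0.5  # Euclidean on z values
print(dist)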
Exercise

- Find an attribute having both ordinal and nominal characteristics
- Define a similarity or dissimilarity measure for two objects A and B
75
Ratio-Scaled Variables

- Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt}
- Methods:
  - treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
  - apply a logarithmic transformation: y_{if} = log(x_{if})
  - treat them as continuous ordinal data and treat their rank as interval-scaled
76
Example

- Cluster individuals based on age, weight and height
  - All are ratio-scale variables
- Mean transformation: z_{p,i} = x_{p,i} / mean_p
  - As an absolute zero makes sense, measure distance in units of the mean of each variable
- Then you may apply z' = log z
- Then use any distance measure for interval scales
77
Example cont.

- A weight difference of 0.5 kg is much more important for babies than for adults
- d(3 kg, 3.5 kg) = 0.5 and d(71.0 kg, 71.5 kg) = 0.5: equal on the raw scale
- As a percentage difference, (3.5 - 3)/3 ≈ 17% is very significant, approximately log(3.5) - log(3)
- (71.5 - 71.0)/71.0 ≈ 0.7% is not important; log(71.5) - log(71) is almost zero
78
Examples from Sports

- Weight classes (kg) in two sports; the spacing between classes grows with magnitude, a ratio-scale pattern:

  48     48
  51     52
  54     56
  57     62
  60     68
  63.5   74
  67     82
  71     90
  75     100
  81     130
79
Probability Based Clustering

- A mixture is a set of n probability distributions, where each distribution represents a cluster
- The mixture model assigns each object a probability of being a member of cluster k
  - rather than assigning or not assigning the object to cluster k, as in k-means, k-medoids, ...
  - e.g., P(C1|X)
80
Simplest Case

- K = 2: two clusters
- There is one real-valued attribute X
- Each cluster is normally distributed
- Five parameters to determine:
  - mean and standard deviation for cluster A: μ_A, σ_A
  - mean and standard deviation for cluster B: μ_B, σ_B
  - sampling probability for cluster A: p_A
    - as P(A) + P(B) = 1
81


- There are two populations, each normally distributed with a mean and standard deviation
- An object comes from one of them, but which one is unknown
  - Assign a probability that the object comes from A or B
  - That is equivalent to assigning a cluster to each object
82
In Parametric Statistics

- Assume that objects come from a distribution, usually a normal distribution with unknown parameters μ and σ
- Estimate the parameters from data
- Test hypotheses based on the normality assumption
- Test the normality assumption
83
[Figure: a bimodal density formed by two normal distributions.]

- Arbitrary shapes can be captured by a mixture of normals
84





- Variable: age
- Clusters: C1, C2, C3
- P(C1|age = 25)
- P(C2|age = 35)
- P(C3|age = 65)
85
Returning to Mixtures

- Given the five parameters, finding the probabilities that a given object comes from each distribution (cluster) is easy
- Given an object X, the probability that it belongs to cluster A:
  $P(A|X) = \frac{P(A, X)}{P(X)} = \frac{P(X|A) \, P(A)}{P(X)} = \frac{f(X; \mu_A, \sigma_A) \, P(A)}{P(X)}$
  $P(B|X) = \frac{P(B, X)}{P(X)} = \frac{P(X|B) \, P(B)}{P(X)} = \frac{f(X; \mu_B, \sigma_B) \, P(B)}{P(X)}$
86

- f(X; μ_A, σ_A) is the normal density function for cluster A:
  $f(X; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(X - \mu)^2}{2\sigma^2}}$
- The denominator P(X) will disappear: calculate the numerators for both P(A|X) and P(B|X) and normalize them by dividing by their sum
87






- $P(A|X) = \frac{P(X|A) \, P(A)}{P(X|A) \, P(A) + P(X|B) \, P(B)}$
- $P(B|X) = \frac{P(X|B) \, P(B)}{P(X|A) \, P(A) + P(X|B) \, P(B)}$
- Note that the final outcomes P(A|X) and P(B|X) give the probabilities that the given object X belongs to clusters A and B
- Note that these can be calculated only by knowing the prior probabilities P(A) and P(B)
88
EM Algorithm for the Simple Case

- We know neither of these things:
  - the distribution each object came from: P(X|A), P(X|B)
  - nor the five parameters of the mixture model: μ_A, μ_B, σ_A, σ_B, P(A)
- Adapt the procedure used for the k-means algorithm
89
EM Algorithm for the Simple Case

- Initially guess values for the five parameters: μ_A, μ_B, σ_A, σ_B, P(A)
- Until a specified termination criterion is met:
  - Compute P(A|X_i) and P(B|X_i) for each X_i, i = 1..n
    - the class probabilities for each object, computed from the normal densities
    - the expectation (E) step
  - Use these probabilities to re-estimate the parameters μ_A, μ_B, σ_A, σ_B, P(A)
    - the maximization (M) step: maximizes the likelihood of the distributions
- Iterate until the likelihood converges or no longer improves much
90
Estimation of New Means

- If w_{iA} is the probability that object i belongs to cluster A:
  - w_{iA} = P(A|X_i), calculated in the E step
  $\mu_A = \frac{w_{1A} x_1 + w_{2A} x_2 + \cdots + w_{nA} x_n}{w_{1A} + w_{2A} + \cdots + w_{nA}}$
  - w_{iB} = P(B|X_i), calculated in the E step
  $\mu_B = \frac{w_{1B} x_1 + w_{2B} x_2 + \cdots + w_{nB} x_n}{w_{1B} + w_{2B} + \cdots + w_{nB}}$
91
Estimation of New Standard Deviations

  $\sigma_A^2 = \frac{w_{1A}(X_1 - \mu_A)^2 + w_{2A}(X_2 - \mu_A)^2 + \cdots + w_{nA}(X_n - \mu_A)^2}{w_{1A} + w_{2A} + \cdots + w_{nA}}$
  $\sigma_B^2 = \frac{w_{1B}(X_1 - \mu_B)^2 + w_{2B}(X_2 - \mu_B)^2 + \cdots + w_{nB}(X_n - \mu_B)^2}{w_{1B} + w_{2B} + \cdots + w_{nB}}$

- The X_i are all the objects, not just those belonging to A or B
- These are maximum likelihood estimators for the variance
  - If the weights are equal, the denominator sums to n rather than n - 1
92
Estimation of New Prior Probabilities

  $P(A) = \frac{P(A|X_1) + P(A|X_2) + \cdots + P(A|X_n)}{n}, \quad p_A = \frac{w_{1A} + w_{2A} + \cdots + w_{nA}}{n}$
  $P(B) = \frac{P(B|X_1) + P(B|X_2) + \cdots + P(B|X_n)}{n}, \quad p_B = \frac{w_{1B} + w_{2B} + \cdots + w_{nB}}{n}$

- Or P(B) = 1 - P(A)
93
General Considerations

- EM converges to a local maximum of the likelihood score
  - which may not be the global maximum
- EM may be run with different initial values of the parameters, and the clustering structure with the highest likelihood is chosen
- Or an initial k-means run is conducted to get the mean and standard deviation parameters (a compact sketch of the whole procedure follows below)
94
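A compact sketch of this EM procedure for the two-cluster, one-attribute case, following the update formulas above (all names illustrative, not from a library):

import numpy as np

def em_two_gaussians(x, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=2, replace=False)  # initial guesses for mu_A, mu_B
    sd = np.array([x.std(), x.std()])
    pA = 0.5
    for _ in range(n_iter):
        # E step: w_iA = P(A|x_i) via the normal densities, w_iB = 1 - w_iA
        fA = np.exp(-(x - mu[0])**2 / (2 * sd[0]**2)) / (np.sqrt(2*np.pi) * sd[0])
        fB = np.exp(-(x - mu[1])**2 / (2 * sd[1]**2)) / (np.sqrt(2*np.pi) * sd[1])
        wA = pA * fA / (pA * fA + (1 - pA) * fB)
        wB = 1.0 - wA
        # M step: re-estimate means, standard deviations and the prior
        mu = np.array([(wA * x).sum() / wA.sum(), (wB * x).sum() / wB.sum()])
        sd = np.sqrt(np.array([(wA * (x - mu[0])**2).sum() / wA.sum(),
                               (wB * (x - mu[1])**2).sum() / wB.sum()]))
        pA = wA.mean()
    return mu, sd, pA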
Density-Based Clustering Methods

- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features:
  - Discover clusters of arbitrary shape
  - Handle noise
  - One scan
  - Need density parameters as a termination condition
- Several interesting studies:
  - DBSCAN: Ester, et al. (KDD'96)
  - OPTICS: Ankerst, et al. (SIGMOD'99)
  - DENCLUE: Hinneburg & D. Keim (KDD'98)
  - CLIQUE: Agrawal, et al. (SIGMOD'98)
95
Density Concepts

- Core object (CO): an object with at least M objects within its E-neighborhood (radius E)
- Directly density reachable (DDR): x is a CO, y is in x's E-neighborhood
- Density reachable: there exists a chain of DDR objects from x to y
- Density based cluster: a maximal set of density-connected objects w.r.t. reachability
96
Density-Based Clustering: Background

- Two parameters:
  - Eps: maximum radius of the neighbourhood
  - MinPts: minimum number of points in an Eps-neighbourhood of that point
- N_Eps(p) = {q in D | dist(p, q) <= Eps}
- Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  - 1) p belongs to N_Eps(q)
  - 2) core point condition: |N_Eps(q)| >= MinPts

[Figure: point q with p inside its neighborhood; MinPts = 5, Eps = 1 cm.]
97
Density-Based Clustering: Background (II)

- Density-reachable:
  - A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p such that p_{i+1} is directly density-reachable from p_i
- Density-connected:
  - A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

[Figure: a chain q -> p1 -> p illustrating density-reachability, and points p, q both reachable from o illustrating density-connectivity.]
98
Let MinPts = 3

[Figure: a set of points with labeled objects P, M and Q; P and M each have at least 3 points in their ε-neighborhoods, Q does not.]

- M and P are core objects, since each is in an ε-neighborhood containing at least 3 points
- Q is directly density-reachable from M; M is directly density-reachable from P and vice versa
- Q is indirectly density-reachable from P, since Q is directly density-reachable from M and M is directly density-reachable from P; but P is not density-reachable from Q, since Q is not a core object
99
DBSCAN: Density Based Spatial Clustering of Applications with Noise

- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border and outlier points for Eps = 1 cm, MinPts = 5.]
100
DBSCAN: The Algorithm

- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed
101
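A sketch using scikit-learn's DBSCAN implementation (assumed available); eps and min_samples correspond to Eps and MinPts above, and the label -1 marks noise points:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(sorted(set(labels)))  # cluster ids; -1 = noise/outliers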