Cluster Analysis:
— Chapter 4 —
BIS 541
2013/2014 Summer
1
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Measuring the performance of supervised learning
algorithms
2
Basic Measures for Clustering
Clustering: Given a database D = {t1, t2, .., tn}, a distance
measure dis(ti, tj) defined between any two objects ti and
tj, and an integer value k, the clustering problem is to
define a mapping f: D → {1, …, k} where each ti is
assigned to one cluster Kf, 1 ≤ f ≤ k.
3
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
Clustering is unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms
4
General Applications of Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature
spaces
detect spatial clusters and explain them in spatial data
mining
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns
5
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters
should be clustered along continent faults
6
Constraint-Based Clustering Analysis
Clustering analysis: fewer parameters but more user-desired
constraints, e.g., an ATM allocation problem
7
Clustering Cities
Clustering Turkish cities
Based on
Political
Demographic
Economic
Characteristics
Political: general elections 1999, 2002
Demographic: population, urbanization rates
Economic: GNP per head, growth rate of GNP
8
What Is Good Clustering?
A good clustering method will produce high quality clusters
with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
9
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
10
Chapter 5. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
11
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters, s.t. the sum of squared distances is minimized:
E = Σ_{m=1}^{k} Σ_{tmi ∈ Km} (Cm − tmi)²
Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
12
The K-Means Clustering Algorithm
Choose k, the number of clusters to be determined
Choose k objects randomly as the initial cluster centers
repeat
Assign each object to its closest cluster center (using Euclidean distance)
Compute new cluster centers (calculate the mean points)
until
No change in cluster centers, or
No object changes its cluster
13
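The pseudocode above can be sketched directly in code. Below is a minimal NumPy version; the function name k_means, the random initialization, and the sample data are illustrative rather than taken from the slides.

```python
# Minimal k-means sketch following the pseudocode above (assumes NumPy).
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # choose k objects randomly as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each object to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # compute new cluster centers as the mean of the assigned objects
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # no change in cluster centers
            break
        centers = new_centers
    return labels, centers

# the six instances of the worked example below, with k = 2
X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
print(k_means(X, k=2))
```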
The K-Means Clustering Method
Example
[Figure: K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the means again, repeating until the assignments stabilize.]
14
Example TBP Sec 3.3 page 84 Table 3.6
Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0
k is chosen as 2 (k = 2)
Choose two points at random as the initial cluster centers
Objects 1 and 3 are chosen as cluster centers
15
[Figure: scatter plot of the six instances; instances 1 and 3 (marked *) are the initial cluster centers.]
16
Example cont.
Euclidean distance between points i and j:
D(i, j) = ((Xi − Xj)² + (Yi − Yj)²)^(1/2)
Initial cluster centers
C1:(1.0,1.5) C2:(2.0,1.5)
Instance   D(C1, i)   D(C2, i)   Assigned to
1          0.00       1.00       C1
2          3.00       3.16       C1
3          1.00       0.00       C2
4          2.24       2.00       C2
5          2.24       1.41       C2
6          6.02       5.41       C2
Resulting clusters: C1: {1, 2}   C2: {3, 4, 5, 6}
17
[Figure: scatter plot of the six instances with the current cluster assignments; the initial centers (instances 1 and 3) are marked *.]
18
Example cont.
Recomputing cluster centers
For C1:
XC1 = (1.0+1.0)/2 = 1.0
YC1 = (1.5+4.5)/2 = 3.0
For C2:
XC2 = (2.0+2.0+3.0+5.0)/4 = 3.0
YC2 = (1.5+3.5+2.5+6.0)/4 = 3.375
Thus the new cluster centers are
C1(1.0,3.0) and C2(3.0,3.375)
As the cluster centers have changed
The algorithm performs another iteration
19
[Figure: scatter plot of the six instances with the new cluster centers C1(1.0, 3.0) and C2(3.0, 3.375) marked *.]
20
Example cont.
New cluster centers
C1(1.0,3.0) and C2(3.0,3.375)
Instance   D(C1, i)   D(C2, i)   Assigned to
1          1.50       2.74       C1
2          1.50       2.29       C1
3          1.80       2.13       C1
4          1.12       1.01       C2
5          2.06       0.88       C2
6          5.00       3.30       C2
Resulting clusters: C1: {1, 2, 3}   C2: {4, 5, 6}
21
Example cont.
computing new cluster centers
For C1:
XC1 = (1.0+1.0+2.0)/3 = 1.33
YC1 = (1.5+4.5+1.5)/3 = 2.50
For C2:
XC2 = (2.0+3.0+5.0)/3 = 3.33
YC2 = (3.5+2.5+6.0)/3 = 4.00
Thus the new cluster centers are
C1(1.33, 2.50) and C2(3.33, 4.00)
As the cluster centers have changed
The algorithm performs another iteration
22
Exercise
Perform the third iteration
23
Comments
Each choice of initial cluster centers may end up with a
different final cluster configuration
Finds a local optimum but not necessarily the
global optimum
Quality is based on the sum of squared error (SSE): the squared
differences between objects and their cluster centers
Choose a terminating criterion, such as
a maximum acceptable SSE
Execute the K-means algorithm until the condition
is satisfied
24
Comments on the K-Means Method
Strength: Relatively efficient: O(tkn), where n is #
objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
Comment: Often terminates at a local optimum.
The global optimum may be found using
techniques such as: deterministic annealing and
genetic algorithms
25
Weaknesses of K-Means Algorithm
Applicable only when mean is defined, then what
about categorical data?
Need to specify k, the number of clusters, in advance
run the algorithm with different k values
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex
shapes
Works best when clusters are of approximately equal size
26
Presence of Outliers
[Figure: a set of points forming two natural clusters of unequal size, clustered with k = 2 and with k = 3.]
When k = 2 the two natural clusters are not captured
27
Quality of clustering depends on
unit of measure
[Figure: two income-versus-age scatter plots; in one, income is measured in YTL and age in years, in the other income is measured in TL and age in years, giving different impressions of the clusters.]
So what to do?
28
Variations of the K-Means Method
A few variants of the k-means which differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes (Huang’98)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
29
Exercise
Show by designing simple examples:
a) K-means algorithm may converge to
different local optima starting from different
initial assignments of objects into different
clusters
b) In the case of clusters of unequal size, the K-means
algorithm may fail to catch the obvious (natural) solution
30
[Figure: example point configurations for the exercise.]
31
How to choose K
For reasonable values of k
(e.g. from 2 to 15),
plot k versus SSE (sum of squared error)
Visually inspect the plot:
as k increases, SSE falls;
choose the breaking point
32
[Figure: plot of SSE versus k for k = 2 to 12; SSE decreases as k grows, and the breaking point indicates the choice of k.]
33
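A small sketch of this procedure, assuming scikit-learn and matplotlib are available; KMeans.inertia_ is the SSE of a fitted model, and the synthetic data are illustrative.

```python
# Sketch of the elbow heuristic: SSE for k = 2..15, then visual inspection.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(200, 2))   # illustrative data
ks = range(2, 16)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE")
plt.show()   # choose k at the breaking point of the curve
```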
Validation of clustering
Partition the data into two equal groups
Apply clustering to one of these partitions
Compare its cluster centers with those of the overall data
or
Apply clustering to each of these groups
Compare the cluster centers
34
Basic Measures for Clustering
Clustering: Given a database D = {t1, t2, .., tn}, a distance
measure dis(ti, tj) defined between any two objects ti and
tj, and an integer value k, the clustering problem is to
define a mapping f: D → {1, …, k} where each ti is
assigned to one cluster Kf, 1 ≤ f ≤ k, such that for all tfp, tfq ∈ Kf
and ts ∉ Kf, dis(tfp, tfq) ≤ dis(tfp, ts)
Centroid, radius, diameter
Typical alternatives to calculate the distance between
clusters
Single link, complete link, average, centroid, medoid
35
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
Centroid: the “middle” of a cluster:
Cm = (Σ_{i=1}^{N} t_ip) / N
Radius: square root of the average distance from any point of the
cluster to its centroid:
Rm = sqrt( Σ_{i=1}^{N} (t_ip − Cm)² / N )
Diameter: square root of the average mean squared distance between
all pairs of points in the cluster:
Dm = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_ip − t_jq)² / (N (N − 1)) )
36
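A small NumPy sketch of the three definitions above; the function name and the sample cluster are illustrative.

```python
# Centroid, radius, and diameter of one cluster, as defined above.
import numpy as np

def cluster_stats(points):
    points = np.asarray(points, dtype=float)
    n = len(points)
    centroid = points.mean(axis=0)                              # Cm
    radius = np.sqrt(((points - centroid) ** 2).sum() / n)      # Rm
    diffs = points[:, None, :] - points[None, :, :]             # all pairwise differences
    diameter = np.sqrt((diffs ** 2).sum() / (n * (n - 1)))      # Dm
    return centroid, radius, diameter

print(cluster_stats([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5]]))
```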
Typical Alternatives to Calculate the
Distance between Clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = min dis(tip, tjq)
Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = max dis(tip, tjq)
Average: average distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = avg dis(tip, tjq)
Centroid: distance between the centroids of two clusters, i.e.,
dis(Ki, Kj) = dis(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dis(Ki,
Kj) = dis(Mi, Mj)
Medoid: one chosen, centrally located object in the cluster
37
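A small NumPy sketch of these alternatives for two clusters of numerical objects; the medoid variant is omitted, and the function name and sample clusters are illustrative.

```python
# Single link, complete link, average, and centroid distances between two clusters.
import numpy as np

def cluster_distances(Ki, Kj):
    Ki, Kj = np.asarray(Ki, float), np.asarray(Kj, float)
    pair = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)  # all pairwise distances
    return {
        "single":   pair.min(),                                     # smallest pairwise distance
        "complete": pair.max(),                                     # largest pairwise distance
        "average":  pair.mean(),                                    # average pairwise distance
        "centroid": np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)),
    }

print(cluster_distances([[1.0, 1.5], [1.0, 4.5]], [[2.0, 1.5], [2.0, 3.5]]))
```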
Major Clustering Approaches
Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the
clusters and the idea is to find the best fit of the data to
that model
38
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well
for large data sets
CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
Focusing + spatial data structure (Ester et al., 1995)
39
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw in 1990)
Built in statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
40
CLARANS (“Randomized” CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
CLARANS draws sample of neighbors dynamically
The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
If the local optimum is found, CLARANS starts with new
randomly selected node in search for a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)
41
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
[Figure: over steps 0 to 4, AGNES (agglomerative) merges a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; DIANA (divisive) follows the same steps in reverse order.]
42
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Use the Single-Link method and the dissimilarity matrix.
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
[Figure: three scatter plots showing the objects being merged step by step into a single cluster.]
43
A Dendrogram Shows How the
Clusters are Merged Hierarchically
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; each connected
component then forms a cluster.
44
Example
Dissimilarity matrix for the objects A, B, C, D, E:

    A   B   C   D   E
A   0
B   1   0
C   2   2   0
D   2   4   1   0
E   3   3   5   3   0
45
Single Link Distance Measure
Using single link (minimum) distances, the clusters merge as:
{A}{B}{C}{D}{E} → {AB}{CD}{E} (A–B and C–D merge at distance 1)
→ {ABCD}{E} (single-link distance 2) → {ABCDE} (single-link distance 3)
[Figure: the distance graph at each step and the resulting dendrogram over A, B, C, D, E.]
46
Complete Link Distance Measure
Using complete link (maximum) distances, the clusters merge as:
{A}{B}{C}{D}{E} → {AB}{CD}{E} (A–B and C–D merge at distance 1)
→ {ABE}{CD} (complete-link distance 3) → {ABCDE} (complete-link distance 5)
[Figure: the distance graph at each step and the resulting dendrogram.]
47
Average Link distance measure
Using average link distances, the clusters merge as:
{A}{B}{C}{D}{E} → {AB}{CD}{E} (A–B and C–D merge at distance 1)
→ {ABCD}{E} (average link between {AB} and {CD} is 2.5)
→ {ABCDE} (average link between {ABCD} and {E} is 3.5)
[Figure: the distance graph at each step and the resulting dendrogram.]
48
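The three merge sequences above can be reproduced with SciPy's hierarchical clustering, assuming that library is available; the distance matrix is the example matrix for A, B, C, D, E given earlier.

```python
# Single, complete, and average link on the example dissimilarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)   # objects A, B, C, D, E

condensed = squareform(D)                       # flatten the symmetric matrix
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)       # each row: [cluster1, cluster2, distance, size]
    print(method)
    print(Z)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding tree.
```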
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own
[Figure: three scatter plots showing a single cluster being split step by step into individual objects.]
49
More on Hierarchical Clustering Methods
Major weakness of agglomerative clustering methods
do not scale well: time complexity of at least O(n²),
where n is the number of total objects
can never undo what was done previously
Integration of hierarchical with distance-based clustering
BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
CURE (1998): selects well-scattered points from the
cluster and then shrinks them towards the center of the
cluster by a specified fraction
CHAMELEON (1999): hierarchical clustering using
dynamic modeling
50
Data Structures
Data matrix (two modes): n objects by p variables

x11  ...  x1f  ...  x1p
...       ...       ...
xi1  ...  xif  ...  xip
...       ...       ...
xn1  ...  xnf  ...  xnp

Dissimilarity matrix (one mode): n by n, lower triangular

0
d(2,1)   0
d(3,1)   d(3,2)   0
:        :        :
d(n,1)   d(n,2)   ...   0
51
Properties of Dissimilarity Measures
Properties
d(i,j) ≥ 0 for i ≠ j
d(i,i) = 0
d(i,j) = d(j,i)  (symmetry)
d(i,j) ≤ d(i,k) + d(k,j)  (triangular inequality)
Exercise: Can you find examples where the distances
between objects do not obey the symmetry property?
52
Measure the Quality of Clustering
Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
There is a separate “quality” function that measures the
“goodness” of a cluster.
The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
Weights should be associated with different variables
based on applications and data semantics.
It is hard to define “similar enough” or “good enough”
the answer is typically highly subjective.
53
Type of data in clustering analysis
Interval-scaled variables:
Binary variables:
Nominal, ordinal, and ratio variables:
Variables of mixed types:
54
Classification by Scale
Nominal scale: merely distinguishes classes: with respect
to objects A and B, XA = XB or XA ≠ XB
e.g.: color {red, blue, green, …}
gender {male, female}
occupation {engineering, management, …}
Ordinal scale: indicates ordering of objects in addition
to distinguishing them
XA = XB or XA ≠ XB; XA > XB or XA < XB
e.g.: education {no school < primary sch. < high sch.
< undergrad < grad}
age {young < middle < old}
income {low < medium < high}
55
Interval scale: assigns a meaningful measure of
difference between two objects
Not only XA > XB, but XA is XA − XB units different from XB
e.g.: specific gravity
temperature in °C or °F
The boiling point of water is 100 °C (or 180 °F) different
from its melting point
Ratio scale: an interval scale with a meaningful zero
point
XA > XB, and XA is XA/XB times greater than XB
e.g.: height, weight, age (as an integer)
temperature in K or °R
Water boils at 373 K and melts at 273 K
The boiling point of water is 1.37 times hotter than its melting point
56
Comparison of Scales
The strongest scale is ratio, the weakest is nominal
Ahmet's height is 2.00 meters (HA)
Mehmet's height is 1.50 meters (HM)
HA ≠ HM  nominal: their heights are different
HA > HM  ordinal: Ahmet is taller than Mehmet
HA − HM = 0.50 meters  interval: Ahmet is 50 cm
taller than Mehmet
HA / HM = 1.333  ratio scale: no matter whether height is
measured in meters or inches
57
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:
s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)
where
m_f = (1/n) (x_1f + x_2f + ... + x_nf)
Calculate the standardized measurement (z-score):
z_if = (x_if − m_f) / s_f
Using the mean absolute deviation is more robust than using
the standard deviation
58
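A minimal NumPy sketch of this standardization for one variable f; the function name and the sample values are illustrative.

```python
# z-scores using the mean absolute deviation, as defined above.
import numpy as np

def z_scores(x):
    x = np.asarray(x, dtype=float)
    m_f = x.mean()                      # m_f
    s_f = np.abs(x - m_f).mean()        # mean absolute deviation s_f
    return (x - m_f) / s_f              # z_if = (x_if - m_f) / s_f

print(z_scores([1.0, 1.0, 2.0, 2.0, 3.0, 5.0]))
```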
[Figure: two scatter plots of variables X1 and X2 (case I: uncorrelated, case II: correlated).]
Both have zero mean, and after standardizing by
z-scores the spreads of X1 and X2 are unity in both cases I and II
X1·X2 = 0 in case I, whereas X1·X2 ≈ 1 in case II
Shall we use the same distance measure in both cases
after obtaining the z-scores?
59
Exercise
[Figure: the same two scatter plots of X1 and X2 (cases I and II), with a point A marked in each.]
Suppose d(A, O) = 0.5 in cases I and II, where O is the origin
Does it reflect the distance between A and the origin?
Suggest a transformation so as to handle correlation
between variables
60
Other Standardizations
Min-max: scale between 0 and 1, or −1 and 1
Decimal scaling
For ratio-scaled variables:
Mean transformation:
z_if = x_if / mean_f
measures in terms of the mean of variable f
Log transformation:
z_if = log x_if
61
Similarity and Dissimilarity Between
Objects
Distances are normally used to measure the similarity or
dissimilarity between two data objects
Some popular ones include the Minkowski distance:
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q)
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are
two p-dimensional data objects, and q >= 1
If q = 1, d is the Manhattan distance:
d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
62
Similarity and Dissimilarity Between
Objects (Cont.)
If q = 2, d is the Euclidean distance:
d(i, j) = (|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)^(1/2)
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
Also, one can use a weighted distance, parametric
Pearson product moment correlation, or other
dissimilarity measures
63
Similarity and Dissimilarity Between
Objects (Cont.)
Weights can be assigned to variables:
d(i, j) = (w_1 |x_i1 − x_j1|² + w_2 |x_i2 − x_j2|² + ... + w_p |x_ip − x_jp|²)^(1/2)
where w_i, i = 1…p, are weights showing the
importance of each variable
64
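A minimal NumPy sketch of the Minkowski distance with optional weights, covering the Manhattan (q = 1), Euclidean (q = 2), and weighted cases above; the function name and sample points are illustrative.

```python
# Minkowski distance between two p-dimensional objects, with optional weights.
import numpy as np

def minkowski(i, j, q=2, w=None):
    i, j = np.asarray(i, float), np.asarray(j, float)
    w = np.ones_like(i) if w is None else np.asarray(w, float)
    return float((w * np.abs(i - j) ** q).sum() ** (1.0 / q))

x, y = [1.0, 1.5], [2.0, 3.5]
print(minkowski(x, y, q=1))                 # Manhattan distance
print(minkowski(x, y, q=2))                 # Euclidean distance
print(minkowski(x, y, q=2, w=[2.0, 0.5]))   # weighted Euclidean distance
```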
[Figure: Manhattan distance versus Euclidean distance between two points XA and XB.]
65
Binary Variables
Symmetric vs. asymmetric
Symmetric: both of its states are equally valuable and
carry the same weight
gender: male / female, arbitrarily coded as 0 or 1
Asymmetric variables:
outcomes are not equally important,
encoded by 0 and 1
e.g. patient smoker or not:
1 for smoker, 0 for nonsmoker (asymmetric)
positive and negative outcomes of a disease test:
HIV positive coded 1, HIV negative coded 0
66
Binary Variables
A contingency table for binary data (object i versus object j):

                object j = 1   object j = 0   sum
object i = 1         a              b         a + b
object i = 0         c              d         c + d
sum                a + c          b + d         p

Simple matching coefficient (invariant, if the binary
variable is symmetric):
d(i, j) = (b + c) / (a + b + c + d)
Jaccard coefficient (noninvariant if the binary variable is
asymmetric):
d(i, j) = (b + c) / (a + b + c)
67
Dissimilarity between Binary
Variables
Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
68
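A small plain-Python sketch reproducing the three Jaccard dissimilarities above over the six asymmetric attributes (Y/P coded 1, N coded 0); variable names are illustrative.

```python
# Jaccard dissimilarity d(i, j) = (b + c) / (a + b + c) for asymmetric binary data.
def jaccard_dissimilarity(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

#        Fever Cough T1 T2 T3 T4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```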
Nominal Variables
A generalization of the binary variable in that it can take more than 2
states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
d(i, j) = (p − m) / p
Higher weights can be assigned to variables with large number of
states
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
69
Example
2 nominal variables
Faculty and country for students
Faculty: {Eng., Applied Sc., Pure Sc., Admin., …}, 5
distinct values
Country: {Turkey, USA, …}, 10 distinct values
p = 2, just two variables
The weight of country may be increased
Student A (Eng., Turkey), student B (Applied Sc., Turkey)
m = 1: A and B match in one variable
d(A, B) = (2 − 1) / 2 = 1/2
70
Example cont.
Different binary variables for each faculty:
Eng: 1 if the student is in engineering, 0 otherwise
AppSc: 1 if the student is in applied science, 0 otherwise
Different binary variables for each country:
Turkey: 1 if the student is Turkish, 0 otherwise
USA: 1 if the student is from the USA, 0 otherwise
71
72
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
replace x_if by its rank r_if ∈ {1, ..., M_f}
map the range of each variable onto [0, 1] by replacing
the i-th object in the f-th variable by
z_if = (r_if − 1) / (M_f − 1)
compute the dissimilarity using methods for interval-scaled variables
73
Example
Credit Card type: gold > silver > bronze > normal, 4
states
Education: grad > undergrad > highschool > primary
school > no school, 5 states
Two customers
A(gold,highschool)
B(normal,no school)
rA,card = 1, rB,card = 4
rA,edu = 3, rB,edu = 5
zA,card = (1 − 1)/(4 − 1) = 0
zB,card = (4 − 1)/(4 − 1) = 1
zA,edu = (3 − 1)/(5 − 1) = 0.5
zB,edu = (5 − 1)/(5 − 1) = 1
Use any interval scale distance measure on z values
74
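A small sketch of the rank mapping above applied to the two customers; the helper name is illustrative, and Euclidean distance is used as the interval-scale measure.

```python
# Map ordinal ranks to [0, 1] and compare the two customers with Euclidean distance.
def z_from_rank(r, M):
    return (r - 1) / (M - 1)

# card: gold = 1 ... normal = 4; education: grad = 1 ... no school = 5
zA = (z_from_rank(1, 4), z_from_rank(3, 5))   # customer A -> (0.0, 0.5)
zB = (z_from_rank(4, 4), z_from_rank(5, 5))   # customer B -> (1.0, 1.0)

d = sum((a - b) ** 2 for a, b in zip(zA, zB)) ** 0.5
print(round(d, 3))   # about 1.118
```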
Exercise
Find an attribute having both ordinal and nominal
characteristics
Define a similarity or dissimilarity measure for two
objects A and B
75
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale,
such as Ae^(Bt) or Ae^(−Bt)
Methods:
treat them like interval-scaled variables—not a good
choice! (why?—the scale can be distorted)
apply logarithmic transformation
yif = log(xif)
treat them as continuous ordinal data and treat their
rank as interval-scaled
76
Example
Cluster individuals based on age, weight, and height
All are ratio-scaled variables
Mean transformation:
z_if = x_if / mean_f
As an absolute zero makes sense, measure distance
in units of the mean of each variable
Then you may apply z' = log z
Then use any distance measure for interval scales
77
Example cont.
A weight difference of 0.5 kg is much more
important for babies than for adults
d(3 kg, 3.5 kg) = 0.5 and d(71.0 kg, 71.5 kg) = 0.5 in absolute terms
d'(3 kg, 3.5 kg) = (3.5 − 3)/3 as a percentage difference:
very significant, approximately log(3.5) − log(3)
d'(71.0 kg, 71.5 kg) = (71.5 − 71.0)/71.0:
not important, log(71.5) − log(71) is almost zero
78
Examples from Sports
Weight classes (kg) in two sports:
48, 51, 54, 57, 60, 63.5, 67, 71, 75, 81
48, 52, 56, 62, 68, 74, 82, 90, 100, 130
79
Probability Based Clustering
A mixture is a set of k probability distributions,
where each distribution represents a cluster
The mixture model assigns each object a
probability of being a member of cluster k,
rather than assigning the object to cluster k or not,
as in k-means, k-medoids, …
P(C1|X)
80
Simplest Case
K = 2, two clusters
There is one real-valued attribute X
Each cluster is normally distributed
Five parameters to determine:
mean μA and standard deviation σA for cluster A
mean μB and standard deviation σB for cluster B
sampling probability pA for cluster A
since P(A) + P(B) = 1
81
There are two populations, each normally
distributed with its own mean and standard deviation
An object comes from one of them, but which one
is unknown
Assign a probability that an object comes from
A or B
That is equivalent to assigning a cluster to each object
82
In parametric statistics
Assume that objects come from a distribution, usually a
normal distribution with unknown parameters μ and σ
Estimate the parameters from data
Test hypotheses based on the normality assumption
Test the normality assumption
83
Two normal distributions: bimodal
Arbitrary shapes can be captured by a mixture of normals
84
Variable age
Clusters C1, C2, C3
P(C1 | age = 25)
P(C2 | age = 35)
P(C3 | age = 65)
85
Returning to Mixtures
Given the five parameters
Finding the probabilities that a given object comes
from each distribution or cluster is easy
Given an object X,
The probability that it belongs to cluster A
P(A|X) = P(A, X) / P(X) = P(X|A)·P(A) / P(X) = f(X; μA, σA)·P(A) / P(X)
P(B|X) = P(B, X) / P(X) = P(X|B)·P(B) / P(X) = f(X; μB, σB)·P(B) / P(X)
86
f(X; μA, σA) is the normal density function for cluster A:
f(X; μ, σ) = (1 / (√(2π) σ)) e^(−(X − μ)² / (2σ²))
The denominator P(X) cancels:
calculate the numerators for both P(A|X) and P(B|X)
and normalize them by dividing each by their sum
87
P(A|X) = P(X|A)·P(A) / (P(X|A)·P(A) + P(X|B)·P(B))
P(B|X) = P(X|B)·P(B) / (P(X|A)·P(A) + P(X|B)·P(B))
Note that the final outcomes P(A|X) and P(B|X)
give the probabilities that the given object X belongs
to cluster A and to cluster B
Note that these can be calculated only by
knowing the prior probabilities P(A) and P(B)
88
EM Algorithm for the Simple Case
We know neither of these things:
the distribution that each object came from,
i.e., P(X|A) and P(X|B),
nor the five parameters of the mixture model,
μA, μB, σA, σB, P(A)
Adapt the procedure used for the k-means algorithm
89
EM Algorithm for the Simple Case
Initially, guess values for the five parameters
μA, μB, σA, σB, P(A)
Until a specified termination criterion is met:
Compute P(A|Xi) and P(B|Xi) for each Xi, i = 1..n,
i.e., the class probabilities for each object, using the
normal distributions (the expectation step)
Use these probabilities to re-estimate the parameters
μA, μB, σA, σB, P(A)
(maximization of the likelihood of the distributions)
Repeat until the likelihood converges or no longer improves much
90
Estimation of new means
If wiA is the probability that object i belongs to
cluster A,
wiA = P(A|Xi), calculated in the E step:
μA = (w1A·x1 + w2A·x2 + … + wnA·xn) / (w1A + w2A + … + wnA)
Similarly, with wiB = P(B|Xi) calculated in the E step:
μB = (w1B·x1 + w2B·x2 + … + wnB·xn) / (w1B + w2B + … + wnB)
91
Estimation of new standard
deviations
σ²A = (w1A(X1 − μA)² + w2A(X2 − μA)² + … + wnA(Xn − μA)²) / (w1A + w2A + … + wnA)
σ²B = (w1B(X1 − μB)² + w2B(X2 − μB)² + … + wnB(Xn − μB)²) / (w1B + w2B + … + wnB)
The Xi are all the objects, not just those assigned to A or B
These are maximum likelihood estimators for the variance
If the weights were all equal, the denominator would sum
to n rather than n − 1
92
Estimation of new prior probabilities
P(A) = (P(A|X1) + P(A|X2) + … + P(A|Xn)) / n
pA = (w1A + w2A + … + wnA) / n
P(B) = (P(B|X1) + P(B|X2) + … + P(B|Xn)) / n
pB = (w1B + w2B + … + wnB) / n
or P(B) = 1 − P(A)
93
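A minimal NumPy sketch of the E and M steps above for a two-component one-dimensional Gaussian mixture; the initial guesses, iteration count, and synthetic data are illustrative.

```python
# EM for a mixture of two 1-D normal distributions, following the update formulas above.
import numpy as np

def em_two_gaussians(x, n_iter=50):
    x = np.asarray(x, dtype=float)
    mu_a, mu_b = x.min(), x.max()            # crude initial guesses for the means
    sd_a = sd_b = x.std()
    p_a = 0.5
    for _ in range(n_iter):
        # E step: class probabilities wA, wB for every object
        fa = np.exp(-(x - mu_a) ** 2 / (2 * sd_a ** 2)) / (np.sqrt(2 * np.pi) * sd_a)
        fb = np.exp(-(x - mu_b) ** 2 / (2 * sd_b ** 2)) / (np.sqrt(2 * np.pi) * sd_b)
        wa = p_a * fa / (p_a * fa + (1 - p_a) * fb)
        wb = 1.0 - wa
        # M step: re-estimate means, standard deviations, and the prior
        mu_a, mu_b = (wa * x).sum() / wa.sum(), (wb * x).sum() / wb.sum()
        sd_a = np.sqrt((wa * (x - mu_a) ** 2).sum() / wa.sum())
        sd_b = np.sqrt((wb * (x - mu_b) ** 2).sum() / wb.sum())
        p_a = wa.mean()
    return mu_a, sd_a, mu_b, sd_b, p_a

rng = np.random.default_rng(0)
x = np.r_[rng.normal(0.0, 1.0, 100), rng.normal(5.0, 1.0, 100)]   # two hidden clusters
print(em_two_gaussians(x))
```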
General Considerations
EM converges to a local maximum of the
likelihood score
It may not be the global maximum
EM may be run with different initial values of the
parameters, and the clustering structure with the
highest likelihood is chosen
Or an initial k-means run is used to obtain the
mean and standard deviation parameters
94
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such
as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98)
95
Density Concepts
Core object (CO): an object with at least MinPts objects within
its Eps-neighborhood
Directly density-reachable (DDR): x is a CO, y is in x's Eps-neighborhood
Density-reachable: there exists a chain of DDR objects
from x to y
Density-based cluster: a set of density-connected objects
that is maximal w.r.t. reachability
96
Density-Based Clustering: Background
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density-reachable
from a point q wrt. Eps, MinPts if
1) p belongs to NEps(q)
2) core point condition:
|NEps(q)| >= MinPts
[Figure: p inside the Eps-neighbourhood of core point q, with MinPts = 5 and Eps = 1 cm.]
97
Density-Based Clustering: Background (II)
Density-reachable:
A point p is density-reachable from
a point q wrt. Eps, MinPts if there
is a chain of points p1, …, pn, with p1 = q
and pn = p, such that pi+1 is directly
density-reachable from pi
[Figure: a chain of points from q to p, each directly density-reachable from the previous one.]
Density-connected:
A point p is density-connected to a
point q wrt. Eps, MinPts if there is
a point o such that both p and q
are density-reachable from o wrt.
Eps and MinPts.
[Figure: points p and q, both density-reachable from a point o.]
98
Let MinPts = 3
[Figure: a set of points including P, M, and Q with their Eps-neighborhoods.]
M and P are core objects, since each has an Eps-neighborhood
containing at least 3 points
Q is directly density-reachable from M;
M is directly density-reachable from P and vice versa
Q is indirectly density-reachable from P, since
Q is directly density-reachable from M and M is directly density-reachable
from P; but P is not density-reachable from Q, since Q is not a core object
99
DBSCAN: Density Based Spatial Clustering of
Applications with Noise
Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5.]
100
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p wrt Eps
and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the
database.
Continue the process until all of the points have been
processed.
101
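Rather than implementing the steps above from scratch, here is a hedged usage sketch with scikit-learn's DBSCAN (assumed available): eps and min_samples correspond to Eps and MinPts, and the label −1 marks noise/outlier points; the data are illustrative.

```python
# DBSCAN on two dense blobs plus one obvious outlier.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)),
               rng.normal(3.0, 0.3, (100, 2)),
               [[10.0, 10.0]]])                    # outlier far from both clusters

model = DBSCAN(eps=0.5, min_samples=5).fit(X)      # Eps = 0.5, MinPts = 5
print(sorted(set(model.labels_)))                  # e.g. [-1, 0, 1]; -1 is noise
```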