Download Categorical Clustering

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Inverse problem wikipedia , lookup

Geographic information system wikipedia , lookup

Neuroinformatics wikipedia , lookup

Computational phylogenetics wikipedia , lookup

Theoretical computer science wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Data analysis wikipedia , lookup

Corecursion wikipedia , lookup

Pattern recognition wikipedia , lookup

Data assimilation wikipedia , lookup

Transcript
Clustering Categorical Data
Pasi Fränti
18.2.2016
K-means clustering
Definitions and data
Set of N data points:
X={x1, x2, …, xN}
Partition of the data:
P={p1, p2, …, pM},
Set of M cluster prototypes (centroids):
C={c1, c2, …, cM},
Distance and cost function
Euclidean distance of data vectors:
d ( xi , x j ) 
 x
K
k 1
k
i
x

k 2
j
Mean square error:
1
MSE (C , P) 
N
N
x
i 1
i
 c pi
2
Clustering result as partition
Partition of data
Illustrated by
Voronoi diagram
Cluster prototypes
Illustrated by
Convex hulls
Duality of partition and centroids
Partition of data
Partition by nearest
prototype mapping
Cluster prototypes
Centroids
as prototypes
Categorical data
Categorical clustering
Three attributes
t1 (Godfather
II)
t2 (Good
Fellas)
t3 (Vertigo)
t4 (N by NW)
director actor
genre
Coppol
De
Crime
a
Niro
Scorses
De
Crime
e
Niro
Hitchco Stewar Thriller
ck
t
Hitchco Grant Thriller
Categorical clustering
Sample 2-d data: color and shape
Model A
Model B
Model C
Hamming Distance
(Binary and categorical data)
•
•
•
•
Number of different attribute values.
Distance of (1011101) and (1001001) is 2.
Distance (2143896) and (2233796)
Distance between (toned) and (roses) is 3.
3-bit binary cube
100->011 has distance 3 (red path)
010->111 has distance 2 (blue path)
K-means variants
Methods:
•
•
•
•
•
•
k-modes
k-medoids
k-distributions
k-histograms
k-populations
k-representatives
Histogram-based methods:
Entropy-based cost functions
Category utility:
Entropy of data set:
Entropies of the clusters relative to the data:
Iterative algorithms
K-modes clustering
Distance function
K-modes clustering
Prototype of cluster
K-medoids clustering
Prototype of cluster
Vector with minimal total distance to every other
2
A
C
E
3
B
C
F
2
Medoid:
B
D
G
B
C
F
2+3=5 2+2=4 2+3=5
K-medoids
Example
K-medoids
Calculation
K-histograms
D 2/3
F 1/3
K-distributions
Cost function with ε addition
Example of cluster allocation
Change of entropy
Problem of non-convergence
Non-convergence
Results with Census dataset
Literature
Modified k-modes + k-histograms:
M. Ng, M.J. Li, J. Z. Huang and Z. He, On the Impact of Dissimilarity Measure in k-Modes
Clustering Algorithm, IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), 503-507,
March, 2007.
ACE:
K. Chen and L. Liu, The “Best k'' for entropy-based categorical dataclustering, Int. Conf. on Scientific
and Statistical Database Management (SSDBM'2005), pp. 253-262, Berkeley, USA, 2005.
ROCK:
S. Guha, R. Rastogi and K. Shim, “Rock: A robust clustering algorithm for categorical attributes”,
Information Systems, Vol. 25, No. 5, pp. 345-366, 200x.
K-medoids:
L. Kaufman and P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis, John
Wiley Sons, New York, 1990.
K-modes:
Z. Huang, Extensions to k-means algorithm for clustering large data sets with categorical values, Data
mining knowledge discovery, Vol. 2, No. 3, pp. 283-304, 1998.
K-distributions:
Z. Cai, D. Wang and L. Jiang, K-Distributions: A New Algorithm for Clustering Categorical Data, Int.
Conf. on Intelligent Computing (ICIC 2007), pp. 436-443, Qingdao, China, 2007.
K-histograms:
Zengyou He, Xiaofei Xu, Shengchun Deng and Bin Dong, K-Histograms: An Efficient Clustering
Algorithm for Categorical Dataset, CoRR, abs/cs/0509033, http://arxiv.org/abs/cs/0509033, 2005.