Clustering II
Data Mining Overview

Data Mining
 Data warehouses and OLAP (On-Line Analytical Processing)
 Association Rules Mining
 Clustering: Hierarchical and Partitional approaches
 Classification: Decision Trees and Bayesian classifiers
 Sequential Patterns Mining
 Advanced topics: outlier detection, web mining
Major Clustering Approaches

Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion

Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion

Density-based algorithms: based on connectivity and density functions

Model-based: A model is hypothesized for each of the clusters and the
idea is to find the best fit of the data to the given model
Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters

Given k, find a partition into k clusters that optimizes the chosen
partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms


k-means (MacQueen’67): Each cluster is represented by the center of the
cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in the
cluster
Optimization problem


The goal is to optimize a score function
The most commonly used is the square error criterion:
$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$

where $m_i$ is the mean of cluster $C_i$.
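For concreteness, a minimal sketch of this criterion in Python (assuming NumPy arrays for the points and the cluster means, and a label array giving each point's cluster):

```python
import numpy as np

def square_error(points, means, labels):
    """E = sum over clusters i, and points p in C_i, of ||p - m_i||^2."""
    diffs = points - means[labels]      # offset of every point from its cluster mean
    return float(np.sum(diffs ** 2))
```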
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4
steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of
the current partition. The centroid is the center (mean
point) of the cluster.
 Assign each object to the cluster with the nearest seed
point.
 Go back to Step 2; stop when no new assignments are made.
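A minimal NumPy sketch of these four steps (illustrative only: the initial partition is random and empty clusters are not handled):

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(points))    # Step 1: random initial partition
    for _ in range(max_iter):
        # Step 2: seed points = centroids (mean points) of the current clusters
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: assign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # Step 4: stop when nothing changes
            break
        labels = new_labels
    return centroids, labels
```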
The K-Means Clustering Method
[Figure: a sequence of 2-D scatter plots (axes 0-10) illustrating successive k-means iterations on a small example.]
Comments on the K-Means Method


Strength
 Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
 Often terminates at a local optimum. The global optimum may
be found using techniques such as deterministic annealing and
genetic algorithms
Weakness
 Applicable only when the mean is defined; what about
categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids, 1987)

starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it improves
the total distance of the resulting clustering

PAM works effectively for small data sets, but does not scale
well for large data sets

CLARA (Kaufmann & Rousseeuw, 1990)

CLARANS (Ng & Han, 1994): Randomized sampling

Focusing + spatial data structure (Ester et al., 1995)
PAM (Partitioning Around Medoids)

PAM (Kaufman and Rousseeuw, 1987), built into S-Plus

Uses real objects to represent the clusters

Select k representative objects (medoids) arbitrarily

For each pair of a non-selected object h and a selected object i,
calculate the total swapping cost TCih

For each pair of i and h:
 If TCih < 0, i is replaced by h
 Then assign each non-selected object to the most
similar representative object

Repeat steps 2-3 until there is no change
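A rough, self-contained sketch of this swap-based search (an illustrative Python/NumPy version using Euclidean distance; the brute-force evaluation of every (i, h) pair is what keeps PAM practical only for small data sets):

```python
import numpy as np

def pam(points, k, max_iter=50, seed=0):
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)  # pairwise distances
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))    # select k medoids arbitrarily

    def total_cost(meds):
        # total distance of each object to its most similar medoid
        return dist[:, meds].min(axis=1).sum()

    for _ in range(max_iter):
        best_swap, best_delta = None, 0.0
        for i in medoids:                                    # selected object i
            for h in range(n):                               # non-selected object h
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                delta = total_cost(candidate) - total_cost(medoids)   # TC_ih
                if delta < best_delta:                       # TC_ih < 0: swap improves the clustering
                    best_swap, best_delta = (i, h), delta
        if best_swap is None:                                # no improving swap: stop
            break
        i, h = best_swap
        medoids = [h if m == i else m for m in medoids]
    labels = dist[:, medoids].argmin(axis=1)                 # assign objects to the nearest medoid
    return medoids, labels
```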
PAM Clustering: Total swapping cost $TC_{ih} = \sum_j C_{jih}$

[Figure: four 2-D examples (axes 0-10) showing how the cost contribution C_jih of an object j depends on where j ends up when medoid i is swapped with non-medoid h; t denotes another current medoid.]

 If j moves from medoid i to the new medoid h: C_jih = d(j, h) - d(j, i)
 If j stays with another medoid t: C_jih = 0
 If j moves from medoid i to another medoid t: C_jih = d(j, t) - d(j, i)
 If j moves from medoid t to the new medoid h: C_jih = d(j, h) - d(j, t)
CLARA (Clustering Large Applications) (1990)

CLARA (Kaufmann and Rousseeuw in 1990)

Built into statistical analysis packages, such as S-Plus

It draws multiple samples of the data set, applies PAM on each
sample, and gives the best clustering as the output

Strength: deals with larger data sets than PAM

Weakness:

Efficiency depends on the sample size

A good clustering based on samples will not necessarily represent a
good clustering of the whole data set if the sample is biased
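A compact, self-contained sketch of the CLARA idea (assumptions: Euclidean distance, illustrative parameter values, and a simple greedy medoid selection on each sample standing in for a full PAM run):

```python
import numpy as np

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

    def cost(medoids):
        # total distance of every object in the WHOLE data set to its nearest medoid
        return dist[:, medoids].min(axis=1).sum()

    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):                       # draw multiple samples
        sample = rng.choice(len(points), size=min(sample_size, len(points)), replace=False)
        # stand-in for running PAM on the sample: greedily pick medoids
        # from the sample that most reduce the cost over the sample
        medoids = [int(sample[0])]
        while len(medoids) < k:
            candidates = [int(c) for c in sample if int(c) not in medoids]
            gains = [dist[sample][:, medoids + [c]].min(axis=1).sum() for c in candidates]
            medoids.append(candidates[int(np.argmin(gains))])
        if cost(medoids) < best_cost:                # keep the best clustering over all samples
            best_medoids, best_cost = medoids, cost(medoids)
    return best_medoids, best_cost
```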
CLARANS (“Randomized” CLARA)

CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)

CLARANS draws a sample of neighbors dynamically

The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids

If a local optimum is found, CLARANS starts with a new
randomly selected node in search of a new local optimum

It is more efficient and scalable than both PAM and
CLARA
Hierarchical Clustering

Uses a distance matrix as the clustering criterion. This method does not
require the number of clusters k as an input, but needs a termination
condition
[Figure: objects a-e are merged step by step (a, b → ab; d, e → de; c, de → cde; ab, cde → abcde). Read left to right (Step 0 → Step 4) for agglomerative clustering (AGNES); read right to left (Step 4 → Step 0) for divisive clustering (DIANA).]
AGNES (Agglomerative Nesting)

Agglomerative, Bottom-up approach

Merge nodes that have the least dissimilarity

Go on in a non-descending fashion

Eventually all nodes belong to the same cluster
[Figure: three 2-D scatter plots (axes 0-10) showing nearby objects being merged into progressively larger clusters.]
A Dendrogram Shows How the
Clusters are Merged Hierarchically
Decompose data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster.
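As an illustration, a small SciPy sketch that builds the merge tree and then cuts it at a chosen level (single linkage and three clusters are arbitrary choices here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy 2-D objects
points = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

Z = linkage(points, method='single')      # agglomerative merge tree (the dendrogram)

# cut the dendrogram so that at most 3 clusters remain;
# each connected component below the cut becomes one cluster
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)                             # e.g. [1 1 2 2 3]
```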
DIANA (Divisive Analysis)

Top-down approach

Inverse order of AGNES

Eventually each node forms a cluster on its own
[Figure: three 2-D scatter plots (axes 0-10) showing one cluster being split into progressively smaller clusters.]
More on Hierarchical Clustering Methods


Major weaknesses of vanilla agglomerative clustering
methods
 do not scale well: time complexity of at least O(n²), where n
is the total number of objects
 can never undo what was done previously
Integration of hierarchical with distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts the
quality of sub-clusters
 CURE (1998): selects well-scattered points from the cluster
and then shrinks them towards the center of the cluster by a
specified fraction
BIRCH

Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)

Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering

Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve the
inherent clustering structure of the data)

Phase 2: use an arbitrary clustering algorithm to cluster the
leaf nodes of the CF-tree
BIRCH

Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans

Weakness: handles only numeric data, and is sensitive to
the order of the data records.
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points, $\sum_{i=1}^{N} X_i$
SS: square sum of the N points, $\sum_{i=1}^{N} X_i^2$

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give
CF = (5, (16,30), (54,190))

[Figure: the five points plotted on a 2-D grid (axes 0-10).]
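A tiny NumPy check of this example (SS is kept per dimension here, which is what the (54, 190) in the example corresponds to):

```python
import numpy as np

points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N  = len(points)                 # number of data points
LS = points.sum(axis=0)          # linear sum per dimension  -> [16 30]
SS = (points ** 2).sum(axis=0)   # square sum per dimension  -> [54 190]

print(N, LS, SS)                 # 5 [16 30] [54 190], i.e. CF = (5, (16,30), (54,190))
```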
CF Tree
B = 7 (branching factor), L = 6 (maximum number of entries in a leaf node)

[Figure: a CF tree. The root holds entries CF1 ... CF6, each with a child pointer; a non-leaf node holds entries CF1 ... CF5 with child pointers; leaf nodes hold CF entries (e.g. CF1 ... CF6) and are chained together with prev/next pointers.]
CURE (Clustering Using REpresentatives)

CURE: proposed by Guha, Rastogi & Shim, 1998

Stops the creation of a cluster hierarchy if a level consists of
k clusters

Uses multiple representative points to evaluate the distance
between clusters; adjusts well to arbitrarily shaped clusters
and avoids the single-link effect
Drawbacks of Distance-Based Method

Drawbacks of square-error based clustering method

Considers only one point as the representative of a cluster

Good only for clusters of convex shape, similar size and density, and
only if k can be reasonably estimated
Cure: The Algorithm

Draw a random sample s

Partition the sample into p partitions, each of size s/p

Partially cluster each partition into s/(pq) clusters

Eliminate outliers

By random sampling

If a cluster grows too slowly, eliminate it

Cluster the partial clusters

Label the data on disk
Data Partitioning and Clustering
 s = 50
 p = 2
 s/p = 25
 s/pq = 5

[Figure: 2-D (x, y) scatter plots showing the sampled points being partitioned and each partition being partially clustered.]
Cure: Shrinking Representative Points
[Figure: a cluster's representative points before and after shrinking.]

Shrink the multiple representative points towards the
gravity center by a fraction α

Multiple representatives capture the shape of the cluster
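A small sketch of the shrinking step (illustrative; the mean of the representative points stands in for the cluster's gravity center, and alpha is the shrinking fraction α):

```python
import numpy as np

def shrink_representatives(reps, alpha=0.3):
    """Move each representative point toward the gravity center by fraction alpha."""
    reps = np.asarray(reps, dtype=float)
    center = reps.mean(axis=0)                 # gravity center of the representatives
    return reps + alpha * (center - reps)      # each point moves alpha of the way to the center

# example: well-scattered representatives of one cluster
reps = [[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]]
print(shrink_representatives(reps, alpha=0.3))
```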
Clustering Categorical Data: ROCK


ROCK: Robust Clustering using linKs,
by S. Guha, R. Rastogi, K. Shim (ICDE’99).
 Use links to measure similarity/proximity
 Not distance based
 Computational complexity: $O(n^2 + n\,m_m m_a + n^2 \log n)$, where
$m_a$ and $m_m$ are the average and maximum number of neighbors of a point
Basic ideas:
 Similarity function and neighbors:
$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$

Let T1 = {1,2,3}, T2 = {3,4,5}:

$Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2$
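A minimal sketch of this set-based similarity:

```python
def sim(t1, t2):
    """Jaccard similarity between two transactions (sets of items)."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

print(sim({1, 2, 3}, {3, 4, 5}))   # |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
```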
Rock: Algorithm


Links: the number of common neighbors of two points.

Example: among the transactions {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5},
{1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}, with a similarity threshold of 0.5,
link({1,2,3}, {1,2,4}) = 3: the two transactions share the three neighbors
{1,2,5}, {1,3,4} and {2,3,4}.
Algorithm
 Draw random sample
 Cluster with links
 Label the data on disk
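A short sketch reproducing the link count in the example above (the 0.5 similarity threshold is the assumption noted there):

```python
from itertools import combinations

def sim(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

transactions = [frozenset(c) for c in combinations({1, 2, 3, 4, 5}, 3)]
theta = 0.5

def neighbors(t):
    # neighbors of t: all other transactions with similarity >= theta
    return {u for u in transactions if u != t and sim(t, u) >= theta}

def link(t1, t2):
    # number of common neighbors of t1 and t2 (excluding t1 and t2 themselves)
    return len((neighbors(t1) & neighbors(t2)) - {t1, t2})

print(link(frozenset({1, 2, 3}), frozenset({1, 2, 4})))   # -> 3
```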
Density-Based Clustering Methods



Clustering based on density (local cluster criterion), such as
density-connected points
Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al (SIGMOD’99).
 DENCLUE: Hinneburg & D. Keim (KDD’98)
 CLIQUE: Agrawal, et al. (SIGMOD’98)
CLIQUE (Clustering In QUEst)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).

Automatically identifies subspaces of a high-dimensional data space
that allow better clustering than the original space

CLIQUE can be considered as both density-based and grid-based

It partitions each dimension into the same number of equal-length intervals

It partitions an m-dimensional data space into non-overlapping rectangular
units

A unit is dense if the fraction of total data points contained in the unit
exceeds an input model parameter (the density threshold)

A cluster is a maximal set of connected dense units within a subspace
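A toy sketch of the dense-unit step in one (sub)space (the number of intervals and the density threshold are illustrative parameters, not values from the paper):

```python
import numpy as np
from collections import Counter

def dense_units(points, n_intervals=10, threshold=0.05):
    """Partition each dimension into equal-length intervals and return the grid
    cells whose fraction of all points exceeds the density threshold."""
    points = np.asarray(points, dtype=float)
    mins, maxs = points.min(axis=0), points.max(axis=0)
    widths = (maxs - mins) / n_intervals
    # map each point to the index of its grid cell (clipped so max values stay in range)
    cells = np.minimum(((points - mins) / widths).astype(int), n_intervals - 1)
    counts = Counter(map(tuple, cells))
    return [cell for cell, c in counts.items() if c / len(points) > threshold]

rng = np.random.default_rng(0)
data = np.vstack([rng.normal([2, 2], 0.3, (100, 2)), rng.normal([7, 7], 0.3, (100, 2))])
print(dense_units(data, n_intervals=10, threshold=0.05))
```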
CLIQUE: The Major Steps

Partition the data space and find the number of points that
lie inside each cell of the partition.

Identify the subspaces that contain clusters using the
Apriori principle

Identify clusters:
 Determine dense units in all subspaces of interest
 Determine connected dense units in all subspaces of interest

Generate a minimal description for the clusters:
 Determine maximal regions that cover each cluster of connected
dense units
 Determine the minimal cover for each cluster
=3
30
40
Vacation
20
50
Salary
(10,000)
0 1 2 3 4 5 6 7
30
Vacation
(week)
0 1 2 3 4 5 6 7
age
60
20
30
40
50
age
50
age
60
Strength and Weakness of CLIQUE


Strength
 It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in those
subspaces
 It is insensitive to the order of records in input and does not
presume some canonical data distribution
 It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
Weakness
The accuracy of the clustering result may be degraded, as a
trade-off for the simplicity of the method
Model based clustering
 Assume the data are generated from K probability distributions,
typically Gaussian
 A soft (probabilistic) version of k-means clustering
 Need to find the distribution parameters
 Use the EM Algorithm
EM Algorithm


Initialize K cluster centers
Iterate between two steps

Expectation step: assign points to clusters
$P(d_i \in c_k) = \frac{w_k \Pr(d_i \mid c_k)}{\sum_j w_j \Pr(d_i \mid c_j)}$, with cluster weights $w_k = \frac{\sum_i P(d_i \in c_k)}{N}$

 Maximization step: estimate the model parameters

$\mu_k = \frac{1}{m} \sum_{i=1}^{m} \frac{d_i \, P(d_i \in c_k)}{\sum_j P(d_i \in c_j)}$
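A compact sketch of these two steps for a one-dimensional Gaussian mixture (illustrative assumptions: fixed unit variances, and the usual responsibility-normalized mean update):

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(data, K=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    means = rng.choice(data, size=K, replace=False)      # initialize K cluster centers
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # Expectation step: P(d_i in c_k) proportional to w_k * Pr(d_i | c_k)
        resp = weights * norm.pdf(data[:, None], loc=means, scale=1.0)
        resp /= resp.sum(axis=1, keepdims=True)
        # Maximization step: re-estimate weights and means from the soft assignments
        weights = resp.mean(axis=0)                      # w_k = sum_i P(d_i in c_k) / N
        means = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
    return weights, means

data = np.concatenate([np.random.default_rng(1).normal(0, 1, 200),
                       np.random.default_rng(2).normal(6, 1, 200)])
print(em_gaussian_mixture(data, K=2))
```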