Literature Survey of Clustering
Algorithms
Bill Andreopoulos
Biotec, TU Dresden, Germany, and
Department of Computer Science and Engineering
York University, Toronto, Ontario, Canada
June 27, 2006
Outline
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Supervised Classification
Objective of clustering
algorithms for categorical data
• Partition the objects into groups.
– Objects with similar categorical attribute
values are placed in the same group.
– Objects in different groups contain
dissimilar categorical attribute values.
• An issue with clustering in general is defining the goals.
Papadimitriou et al. (2000) propose:
– Seek k groups G1,…,Gk and a policy Pi for each group i.
– Pi : a vector of categorical attribute values.
– Maximize:  Σ_{i=1}^{k} Σ_{c ∈ Gi} Pi ∘ c
– '∘' is the "overlap" operator between 2 vectors.
• Clustering problem is NP-complete:
– Ideally, search all possible clusters and all assignments of objects.
– The best clustering is the one maximizing a quality measure.
General Applications of
Clustering
• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature
spaces
– detect spatial clusters and explain them in spatial data
mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar
access patterns
Examples of Clustering Applications
• Software clustering: cluster files in software systems based on their
functionality
• Intrusion detection: Discover instances of anomalous (intrusive) user
behavior in large system log files
• Gene expression data: Discover genes with similar functions in DNA
microarray data.
• Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering depends on:
– Appropriateness of method for dataset.
– The (dis)similarity measure used
– Its implementation.
• The quality of a clustering method is also measured by its ability to
discover some or all of the hidden patterns.
Requirements of Clustering in
Data Mining
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to
determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Scalability to High dimensions
• Interpretability and usability
• Incorporation of user-specified constraints
Outline
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Supervised Classification
Data Structures
• Data matrix (n objects described by p variables):

    x11  ...  x1f  ...  x1p
    ...       ...       ...
    xi1  ...  xif  ...  xip
    ...       ...       ...
    xn1  ...  xnf  ...  xnp

• Dissimilarity matrix (n-by-n, lower triangular):

    0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    :       :       :
    d(n,1)  d(n,2)  ...  0
Measure the Quality of
Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Type of data in clustering
analysis
• Nominal (Categorical)
• Interval-scaled variables
• Ordinal (Numerical)
• Binary variables
• Mixed types
Nominal (categorical)
• A generalization of the binary variable in that it
can take more than 2 states, e.g., red, yellow,
blue, green
• Method 1: Simple matching
– m: # of matches, p: total # of variables
  d(i, j) = (p − m) / p
• Method 2: use a large number of binary variables
– creating a new binary variable for each of the M
nominal states
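
To make Method 1 concrete, here is a minimal Python sketch of the simple-matching dissimilarity (not part of the original slides; the attribute values are invented for illustration):

def simple_matching_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p, where m is the number of matching nominal
    attribute values and p is the total number of variables."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Hypothetical objects described by three nominal attributes (colour, shape, size).
print(simple_matching_dissimilarity(("red", "round", "small"),
                                    ("red", "square", "small")))   # 1/3, about 0.33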
Interval-scaled variables
• Standardize data
– Calculate the mean absolute deviation:
  s_f = (1/n) ( |x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f| ),
  where m_f = (1/n) (x_1f + x_2f + ... + x_nf).
– Calculate the standardized measurement (z-score):
  z_if = (x_if − m_f) / s_f
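
A small Python sketch of this standardization step (illustrative only; the sample values are made up):

def standardize(values):
    """Standardize one interval-scaled variable using the mean absolute
    deviation s_f (more robust to outliers than the standard deviation)."""
    n = len(values)
    m_f = sum(values) / n                              # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n        # mean absolute deviation
    return [(x - m_f) / s_f for x in values]           # z-scores

print(standardize([2.0, 4.0, 6.0, 8.0]))   # [-1.5, -0.5, 0.5, 1.5]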
Ordinal (numerical)
• An ordinal variable can be discrete or continuous
• order is important, e.g., rank
• Can be treated like interval-scaled
– replace x_if by its rank r_if ∈ {1, ..., M_f}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  z_if = (r_if − 1) / (M_f − 1)
– compute the dissimilarity using methods for interval-scaled variables
Similarity and Dissimilarity
Between Objects
• Distances are normally used to measure the similarity or
dissimilarity between two data objects
• Some popular ones include: Minkowski distance:
  d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q )^(1/q)
where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two
p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:
  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
Similarity and Dissimilarity
Between Objects (Cont.)
• If q = 2, d is Euclidean distance:
  d(i, j) = sqrt( |x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|² )
– Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)
• Also one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures.
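
The following illustrative Python sketch computes the Minkowski distance and its Manhattan (q = 1) and Euclidean (q = 2) special cases (the example objects are made up):

def minkowski(x, y, q=2):
    """Minkowski distance of order q between two p-dimensional objects.
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(i, j, q=1))   # Manhattan: 3 + 4 + 0 = 7.0
print(minkowski(i, j, q=2))   # Euclidean: sqrt(9 + 16 + 0) = 5.0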
Binary Variables
• A contingency table for binary data (rows: object i, columns: object j):

                object j
                1      0      sum
  object i  1   a      b      a+b
            0   c      d      c+d
      sum       a+c    b+d    p

• Simple matching coefficient (invariant, if the binary variable is symmetric):
  d(i, j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (noninvariant if the binary variable is asymmetric):
  d(i, j) = (b + c) / (a + b + c)
Dissimilarity between Binary
Variables
• Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

– gender is a symmetric attribute
– the remaining attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0

  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
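
The slide's numbers can be reproduced with a short Python sketch of the Jaccard (asymmetric binary) dissimilarity; the encoding of Y/P as 1 and N as 0 follows the slide:

def asymmetric_binary_dissimilarity(x, y):
    """Jaccard dissimilarity d = (b + c) / (a + b + c) over asymmetric binary
    attributes; negative matches (the d cell of the contingency table) are ignored."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 .. Test-4, with Y/P coded as 1 and N as 0.
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))   # 0.75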
Variables of Mixed Types
• A database may contain all the six types of
variables
– symmetric binary, asymmetric binary, nominal, ordinal,
interval and ratio.
• One may use a weighted formula to combine their effects:

  d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f)  /  Σ_{f=1}^{p} δ_ij^(f)

– f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, otherwise d_ij^(f) = 1
– f is interval-based: use the normalized distance
– f is ordinal: compute ranks r_if and z_if = (r_if − 1) / (M_f − 1), and treat z_if as interval-scaled
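
An illustrative Python sketch of the weighted mixed-type formula above (the variable types, the ordinal scale M_f and the example objects are assumptions made for the example; interval values are taken to be already normalized to [0, 1]):

def mixed_dissimilarity(x, y, types, M=None):
    """d(i,j) = sum_f delta_f * d_f / sum_f delta_f over variables of mixed types.
    types[f] is 'nominal', 'ordinal' or 'interval'; M[f] is the number of
    ordinal states M_f for ordinal variables."""
    num, den = 0.0, 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        delta = 1.0                       # assume no missing values in this sketch
        if types[f] == 'nominal':
            d = 0.0 if a == b else 1.0
        elif types[f] == 'ordinal':       # ranks mapped onto [0, 1]
            za = (a - 1) / (M[f] - 1)
            zb = (b - 1) / (M[f] - 1)
            d = abs(za - zb)
        else:                             # interval-scaled, already normalized
            d = abs(a - b)
        num += delta * d
        den += delta
    return num / den

# Hypothetical object: one nominal, one ordinal (rank out of M=3), one interval value.
print(mixed_dissimilarity(('red', 1, 0.2), ('blue', 3, 0.7),
                          types=('nominal', 'ordinal', 'interval'), M={1: 3}))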
Clustering of genomic data sets
Clustering of gene expression
data sets
Clustering of synthetic mutant
lethality data sets
Clustering applied to yeast data sets
• Clustering the yeast genes in response to environmental
changes
• Clustering the cell cycle-regulated yeast genes
• Functional analysis of the yeast genome : Finding gene
functions - Functional prediction
Software clustering
• Group system files such that files with similar functionality are in the
same cluster, while files in different clusters perform dissimilar
functions.
– Each object is a file x of the software system.
• Both categorical and numerical data sets:
• Categorical data set on a software system: for each file, which other
files it may invoke during runtime.
– After the filename x there is a list of the other filenames that x may invoke.
• Numerical data set on a software system: the results of a profiling of
the execution of the system, how many times each file invoked other
files during the run time.
– After the file name x there is a list of the other filenames that x invoked and: how
many times x invoked them during the run time.
Outline
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Supervised Classification
Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
• Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.
Outline
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Supervised Classification
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a
database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
• Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen’67): Each cluster is represented by the
center of the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
The k-Means Clustering Method
• Given k, the k-Means algorithm is
implemented in 4 steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the
clusters of the current partition. The centroid is the
center (mean point) of the cluster.
– Assign each object to the cluster with the nearest
seed point.
– Go back to Step 2; stop when no more new assignments are made.
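
A minimal Python sketch of these four steps (illustrative only; random initial seeds, squared Euclidean distance, and a small made-up data set are assumptions of the sketch, not part of the original slides):

import random

def k_means(points, k, iters=100):
    """Minimal k-means: random initial centroids, then alternate assignment
    to the nearest centroid and recomputation of the cluster means."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assignment step
            idx = min(range(k), key=lambda i: sum((a - b) ** 2
                      for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)]              # update step
        if new_centroids == centroids:                     # no more changes
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(pts, k=2)
print(centroids)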
The K-Means Clustering Method
• Example
(Figure: four scatter plots, axes 0–10, showing the objects being re-assigned to the nearest seed point and the cluster means being updated over successive k-means iterations.)
Comments on the K-Means
Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum
may be found using techniques such as: deterministic
annealing and genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
• A few variants of the k-means which differ in:
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces one of
the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering
– PAM works effectively for small data sets, but does not scale well
for large data sets
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): Randomized sampling
PAM (Partitioning Around Medoids)
(1987)
• PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
• Use real object to represent the cluster
– Select k representative objects arbitrarily
– For each pair of non-selected object h and selected object i,
calculate the total swapping cost TCih
– For each pair of i and h,
• If TCih < 0, i is replaced by h
• Then assign each non-selected object to the most similar
representative object
– repeat steps 2-3 until there is no change
PAM Clustering: Total swapping cost
TC_ih = Σ_j C_jih
(Figure: four scatter plots illustrating the four cases for the cost C_jih of reassigning a non-selected object j when medoid i is swapped with candidate h; t denotes another current medoid.)
  Case 1: C_jih = d(j, h) − d(j, i)
  Case 2: C_jih = 0
  Case 3: C_jih = d(j, t) − d(j, i)
  Case 4: C_jih = d(j, h) − d(j, t)
CLARA (Clustering Large
Applications) (1990)
• CLARA (Kaufmann and Rousseeuw in 1990)
• It draws a sample of the data set, applies PAM on the sample, and
gives the best clustering as the output.
• Strength: deals with larger data sets than PAM
• Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the sample is
biased
CLARANS (“Randomized”
CLARA) (1994)
• CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
• CLARANS draws multiple samples dynamically.
• A different sample can be chosen at each loop.
• The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
• It is more efficient and scalable than both PAM and
CLARA
Squeezer / Single-linkage clustering
Not the most effective or accurate clustering algorithm that exists, but it is efficient, with a complexity of O(n), where n is the number of data objects [Portnoy01].
1) Initialize the set of clusters, S, to the empty set.
2) Obtain an object d from the data set. If S is empty, then
create a cluster with d and add it to S. Otherwise, find
the cluster in S that is closest to this object. In other
words, find the closest cluster C to d in S.
3) If the distance between d and C is less than or equal to
a user specified threshold W then associate d with the
cluster C. Else, create a new cluster for d in S.
4) Repeat steps 2 and 3 until no objects are left in the
data set.
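
A minimal Python sketch of the single-pass procedure above (illustrative only; taking the distance between an object and a cluster to be the distance to the cluster centroid is an assumption of the sketch, as are the example points and the threshold):

def single_pass_cluster(objects, distance, W):
    """Each object joins the closest existing cluster if it is within
    threshold W of it; otherwise it starts a new cluster."""
    clusters = []                                  # list of lists of objects
    for d in objects:
        if not clusters:
            clusters.append([d])
            continue
        def centroid(cl):
            return tuple(sum(x) / len(cl) for x in zip(*cl))
        best = min(clusters, key=lambda cl: distance(d, centroid(cl)))
        if distance(d, centroid(best)) <= W:
            best.append(d)                         # associate d with cluster C
        else:
            clusters.append([d])                   # create a new cluster for d
    return clusters

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
print(single_pass_cluster([(0, 0), (0, 1), (10, 10)], euclid, W=2.0))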
Fuzzy k-Means
• Clusters produced by k-Means are "hard" or "crisp" clusters, since any feature vector x either is or is not a member of a particular cluster.
• In contrast, in "soft" or "fuzzy" clusters a feature vector x can have a degree of membership in each cluster.
• The fuzzy k-means procedure of Bezdek [Bezdek81, Dembele03, Gasch02] allows each feature vector x to have a degree of membership in Cluster i.
Fuzzy k-Means
Fuzzy k-Means algorithm
• Choose the number of classes k, with 1 < k < n.
• Choose a value for the fuzziness exponent f, with f > 1.
• Choose a definition of distance in the variable-space.
• Choose a value for the stopping criterion e (e = 0.001 gives reasonable convergence).
• Make initial guesses for the means m1, m2, ..., mk.
• Until there are no changes in any mean:
  – Use the estimated means to find the degree of membership u(j,i) of xj in Cluster i. For example, if a(j,i) = exp(−||xj − mi||²), one might use u(j,i) = a(j,i) / sum_j a(j,i).
  – For i from 1 to k: replace mi with the fuzzy mean of all of the examples for Cluster i.
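
Below is an illustrative Python sketch of one membership/mean update that follows the slide's formulas literally (a(j,i) = exp(−||xj − mi||²), u normalized over the examples j, so a cluster's memberships can be used directly as weights for its fuzzy mean); it is a sketch of this particular reading, not Bezdek's exact fuzzy c-means update:

import math

def fuzzy_kmeans_step(points, means):
    """One update: compute a(j,i), normalize to memberships u(j,i), then
    replace each mean with the u-weighted average of all examples."""
    k, dim = len(means), len(points[0])
    a = [[math.exp(-sum((x - m) ** 2 for x, m in zip(p, mi)))
          for p in points] for mi in means]            # a[i][j]
    new_means = []
    for i in range(k):
        total = sum(a[i])
        u = [aij / total for aij in a[i]]               # degrees of membership
        new_means.append(tuple(sum(u[j] * points[j][d] for j in range(len(points)))
                               for d in range(dim)))
    return new_means

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(fuzzy_kmeans_step(pts, means=[(0.0, 0.5), (5.0, 5.0)]))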
K-Modes for categorical data
(Huang’98)
• Variation of the K-Means Method
– Replacing means of clusters with modes
– Using new dissimilarity measures to deal with
categorical objects
– Using a frequency-based method to update modes of
clusters
– A mixture of categorical and numerical data: k-prototypes method
K-Modes algorithm
• K-Modes deals with categorical attributes.
• Insert the first K objects into K new clusters.
• Calculate the initial K modes for the K clusters.
• Repeat {
    For (each object O) {
      Calculate the similarity between object O and the modes of all clusters.
      Insert object O into the cluster C whose mode is the least dissimilar to object O.
    }
    Recalculate the cluster modes so that the cluster similarity between mode and objects is maximized.
  } until (no or few objects change clusters).
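
A minimal Python sketch of the procedure above (illustrative only; the dissimilarity is the number of mismatched attribute values, and the small categorical data set is made up):

from collections import Counter

def k_modes(objects, k, iters=10):
    """Minimal k-modes: mismatch-count dissimilarity; each cluster is
    represented by its mode (the most frequent value per attribute)."""
    def mismatches(a, b):
        return sum(1 for x, y in zip(a, b) if x != y)
    def mode(cluster):
        return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

    modes = [objects[i] for i in range(k)]      # first k objects as initial modes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for o in objects:
            idx = min(range(k), key=lambda i: mismatches(o, modes[i]))
            clusters[idx].append(o)
        new_modes = [mode(cl) if cl else modes[i] for i, cl in enumerate(clusters)]
        if new_modes == modes:                  # no objects change clusters
            break
        modes = new_modes
    return modes, clusters

data = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
print(k_modes(data, k=2))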
Fuzzy K-Modes
• The fuzzy k-modes algorithm contains extensions to the
fuzzy k-means algorithm for clustering categorical data.
Bunch
•
Bunch is a clustering tool intended to aid the software
developer and maintainer in understanding and
maintaining source code [Mancoridis99].
• Input : Module Dependency Graph (MDG).
• Bunch “good partition" : highly interdependent modules
are grouped in the same subsystems (clusters) .
– Independent modules are assigned to separate subsystems.
• Figure b shows a “good” partitioning of Figure a.
• Finding a good graph partition involves:
– systematically navigating through a very large search space of all
possible partitions for that graph.
Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
• Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.
Hierarchical Clustering
• Use distance matrix as clustering criteria. This
method does not require the number of clusters
k as an input, but needs a termination condition
(Figure: objects a, b, c, d, e are merged step by step — a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde. Agglomerative clustering (AGNES) follows these steps from Step 0 to Step 4; divisive clustering (DIANA) follows the same steps in reverse, from Step 4 back to Step 0.)
AGNES (Agglomerative
Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
(Figure: three scatter plots, axes 0–10, showing clusters being progressively merged by AGNES.)
A Dendrogram Shows How the
Clusters are Merged Hierarchically
Decompose data objects into several levels of nested
partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level, then each connected
component forms a cluster.
DIANA (Divisive Analysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
(Figure: three scatter plots, axes 0–10, showing one cluster being progressively split by DIANA.)
More on Hierarchical Clustering
Methods
• Weaknesses of agglomerative clustering methods
– do not scale well: time complexity of at least O(n²), where n is the number of total objects.
– can never undo what was done previously.
More on Hierarchical Clustering
Methods
• Next…
• Integration of hierarchical with distance-based clustering
– BIRCH (1996): uses CF-tree and incrementally
adjusts the quality of sub-clusters
– CURE (1998): selects well-scattered points from
the cluster and then shrinks them towards the
center of the cluster by a specified fraction
– CHAMELEON (1999): hierarchical clustering
using dynamic modeling
BIRCH (1996)
• Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
• Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
– Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
– Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
  N: number of data points
  LS: linear sum of the N data points, Σ_{i=1}^{N} X_i
  SS: square sum of the N data points, Σ_{i=1}^{N} X_i²
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
  CF = (5, (16,30), (54,190))
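
A short Python sketch that computes a CF vector and reproduces the slide's example (illustrative only):

def clustering_feature(points):
    """CF = (N, LS, SS) for a set of d-dimensional points: N = number of points,
    LS = per-dimension linear sum, SS = per-dimension square sum. CFs are
    additive, so two sub-clusters can be merged by adding their CFs."""
    N = len(points)
    LS = tuple(sum(coords) for coords in zip(*points))
    SS = tuple(sum(c * c for c in coords) for coords in zip(*points))
    return N, LS, SS

print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# -> (5, (16, 30), (54, 190)), as on the slide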
CF Tree - A nonleaf node in this tree contains summaries of the CFs of its children. A CF tree is a multilevel summary of the data that preserves the inherent structure of the data.
(Figure: a CF tree with leaf capacity L = 6; the root and the non-leaf nodes hold CF entries CF1, CF2, ... with pointers child1, child2, ... to their children, and the leaf nodes hold CF entries chained together by prev/next pointers.)
BIRCH (1996)
Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
Weakness: handles only numeric data, and is sensitive to the order of the data records.
CURE (Clustering Using
REpresentatives)
• CURE: proposed by Guha, Rastogi & Shim, 1998
– CURE goes a step beyond BIRCH by not favoring
clusters with spherical shape – thus being able to
discover clusters with arbitrary shape. CURE is also
more robust with respect to outliers [Guha98].
– Uses multiple representative points to evaluate the
distance between clusters, adjusts well to arbitrary
shaped clusters and avoids single-link effect.
Drawbacks of distance-based Methods
– Consider only one point as representative of a cluster
– Good only for convex shaped, similar size and
density, and if k can be reasonably estimated
CURE: The Algorithm
– Draw a random sample s.
– Partition the sample into p partitions of size s/p. Each partition is a partial cluster.
– Eliminate outliers by random sampling. If a cluster grows too slowly, eliminate it.
– Cluster the partial clusters.
– The representative points falling in each new cluster are "shrunk", or moved toward the cluster center, by a user-specified shrinking factor. These objects then represent the shape of the newly formed cluster.
Data Partitioning and Clustering
Figure 11 – Clustering a set of objects using CURE. (a) A random sample of objects. (b) Partial clusters. Representative points for each cluster are marked with a "+". (c) The partial clusters are further clustered. The representative points are moved toward the cluster center. (d) The final clusters are nonspherical.
Cure: Shrinking Representative
Points
(Figure: representative points of two clusters being shrunk toward their gravity centers.)
• Shrink the multiple representative points towards the gravity center by a fraction of α.
• Multiple representatives capture the shape of the
cluster
Clustering Categorical Data:
ROCK
• ROCK: Robust Clustering using linKs,
by S. Guha, R. Rastogi, K. Shim (ICDE’99).
– Use links to measure similarity/proximity
– Computational complexity (cubic in the worst case):
  O(n² + n·m_m·m_a + n² log n),
  where m_a and m_m are the average and maximum numbers of neighbours.
Rock: Algorithm
• Links: the number of common neighbours between two points.
  Example data:
  {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5},
  {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}
  link({1,2,3}, {1,2,4}) = 3, since the two points have 3 common neighbours.
• Initially, each tuple is assigned to a separate cluster and
then clusters are merged repeatedly according to the
closeness between clusters.
• The closeness between clusters is defined as the sum of
the number of “links” between all pairs of tuples, where the
number of “links” represents the number of common
neighbors between two clusters.
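
An illustrative Python sketch of the link computation; the Jaccard similarity threshold of 0.5 used to decide neighbourhood is an assumption of the sketch (it is not stated on the slide), but it reproduces the value link({1,2,3}, {1,2,4}) = 3 shown above:

from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def links(data, theta=0.5):
    """Number of common neighbours for every pair of points; two points are
    neighbours if their Jaccard similarity is at least theta."""
    neighbours = {frozenset(t): {frozenset(u) for u in data
                                 if u != t and jaccard(set(t), set(u)) >= theta}
                  for t in data}
    return {(t, u): len(neighbours[frozenset(t)] & neighbours[frozenset(u)])
            for t, u in combinations(data, 2)}

data = [(1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 3, 4), (1, 3, 5),
        (1, 4, 5), (2, 3, 4), (2, 3, 5), (2, 4, 5), (3, 4, 5)]
print(links(data)[((1, 2, 3), (1, 2, 4))])   # 3 common neighbours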
CHAMELEON
• CHAMELEON: hierarchical clustering using
dynamic modeling, by G. Karypis, E.H. Han
and V. Kumar’99
• Measures the similarity based on a dynamic
model
– Two clusters are merged only if the
interconnectivity and closeness (proximity)
between two clusters are high relative to the
internal interconnectivity of the clusters and
closeness of items within the clusters
CHAMELEON
• A two phase algorithm
– Use a graph partitioning algorithm: cluster objects into a
large number of relatively small sub-clusters
– Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining these
sub-clusters
Overall Framework of CHAMELEON
(Figure: Data Set → Construct a Sparse Graph → Partition the Graph → Merge the Partitions → Final Clusters.)
Eisen’s hierarchical clustering of
gene expression data
• The hierarchical clustering algorithm by Eisen et al. [Eisen98, Eisen99]
– commonly used for cancer clustering
– clustering of genomic (gene expression) data sets in general.
• For a set of n genes, compute an upper-diagonal similarity matrix
– containing similarity scores for all pairs of genes
– use a similarity metric
• Scan to find the highest value, representing the pair of genes with the
most similar interaction patterns
• The two most similar genes are grouped in a cluster
– the similarity matrix is recomputed, using the average properties of both or all
genes in the cluster
• More genes are progressively added to the initial pairs to form clusters
of genes [Eisen98]
– process is repeated until all genes have been grouped into clusters
LIMBO
• LIMBO, introduced in [Andritsos04], is a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering.
– LIMBO uses the IB framework to define a distance measure for
categorical tuples.
– LIMBO handles large data sets by producing a memory
bounded summary model for the data.
Outline
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Supervised Classification
Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
• Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.
Density-Based Clustering Methods
• Clustering based on density (local cluster
criterion), such as density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– Need user-specified parameters
– One scan
Density-Based Clustering:
Background
• Two parameters:
– Eps: maximum radius of the neighbourhood
– MinPts: minimum number of points in an Eps-neighbourhood of that point
(Figure: two points p and q illustrating the parameters MinPts = 5 and Eps = 1 cm.)
Density-Based Clustering:
Background (II)
• Density-reachable:
  – A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi.
• Density-connected:
  – A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
(Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.)
DBSCAN: The Algorithm
– Check the Eps-neighborhood of each object in the database.
– If the Eps-neighborhood of an object o contains more than MinPts objects, a new cluster with o as a core object is created.
– Iteratively collect directly density-reachable objects
from these core objects, which may involve the merge
of a few density-reachable clusters.
– Terminate the process when no new object can be
added to any cluster.
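
A compact Python sketch of these steps (illustrative only; the Eps and MinPts values and the tiny data set are made up, and noise points are labelled -1):

def dbscan(points, eps, min_pts, dist):
    """Grow a cluster from each unvisited core object by repeatedly collecting
    density-reachable points; unreachable non-core points are left as noise."""
    labels = {p: None for p in points}
    def neighbours(p):
        return [q for q in points if dist(p, q) <= eps]
    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:          # not a core object (for now): noise
            labels[p] = -1
            continue
        labels[p] = cluster_id
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id    # border point, previously marked as noise
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neigh = neighbours(q)
            if len(q_neigh) >= min_pts:   # q is also a core object: keep expanding
                queue.extend(q_neigh)
        cluster_id += 1
    return labels

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
print(dbscan(pts, eps=1.5, min_pts=3, dist=euclid))   # (10, 10) comes out as noise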
OPTICS: A Cluster-Ordering
Method (1999)
• OPTICS: Ordering Points To Identify the
Clustering Structure
– Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
– Produces a special order of the database wrt its
density-based clustering structure
– Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
OPTICS: An Extension from DBSCAN
– OPTICS was developed to overcome
the difficulty of selecting appropriate
parameter values for DBSCAN
[Ankerst99].
– The OPTICS algorithm finds clusters
using the following steps:
1) Create an ordering of the objects in a
database, storing the core-distance and
a suitable reachability-distance for each
object. Clusters with highest density will
be finished first.
2) Based on the ordering information
produced by OPTICS, use another
algorithm to extract clusters.
3) Extract density-based clusters with
respect to any distance e’ that is smaller
than the distance e used in generating
the order.
Figure 15a – OPTICS. The core-distance of p is the distance e', between p and the fourth closest object. The reachability-distance of q1 with respect to p is the core-distance of p (e' = 3mm), since this is greater than the distance between p and q1. The reachability-distance of q2 with respect to p is the distance between p and q2, since this is greater than the core-distance of p (e' = 3mm). Adopted from [Ankerst99].
(Figure: reachability plot — the reachability-distance of the objects plotted in the cluster order produced by OPTICS.)
DENCLUE: using density functions
• DENsity-based CLUstEring by Hinneburg & Keim
(KDD’98)
• Major features:
– Solid mathematical foundation
– Good for data sets with large amounts of noise
– Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
– Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
– But needs a large number of parameters
DENCLUE: Technical Essence
• Influence function: describes the impact of a data point
within its neighborhood.
• Overall density of the data space can be calculated as the
sum of the influence function of all data points.
• Clusters can be determined mathematically by identifying
density attractors.
• Density attractors are local maxima of the overall density function.
(Figure: density attractors for center-defined and arbitrary-shaped clusters.)
CACTUS Categorical Clustering
• CACTUS is presented in [Ganti99].
• Distinguishing sets: clusters are uniquely identified by a
core set of attribute values that occur in no other cluster.
• A distinguishing number represents the minimum size
of the distinguishing sets – i.e. attribute value sets that
uniquely occur within only one cluster.
• While this assumption may hold true for many real
world datasets, it is unnatural and unnecessary for the
clustering process.
COOLCAT Categorical Clustering
• COOLCAT is introduced in [Barbara02] as an entropy-based
algorithm for categorical clustering.
• COOLCAT starts with a sample of data objects
– identifies a set of k initial tuples such that the minimum pairwise
distance among them is maximized.
• All remaining tuples of the data set are placed in one of the
clusters such that
– at each step, the increase in the entropy of the resulting clustering is
minimized.
CLOPE Categorical Clustering
• Let's take a small market basket database with 5 transactions {(apple, banana), (apple, banana, cake), (apple, cake, dish), (dish, egg), (dish, egg, fish)}.
  – For simplicity, transaction (apple, banana) is abbreviated to ab, etc.
• For this small database, we want to compare the following two clusterings:
  – (1) { {ab, abc, acd}, {de, def} } and (2) { {ab, abc}, {acd, de, def} }.
• Histograms of the two clusterings (adopted from [Yang2002]):
  – clustering (1): {ab, abc, acd} has H=2.0, W=4; {de, def} has H=1.67, W=3
  – clustering (2): {ab, abc} has H=1.67, W=3; {acd, de, def} has H=1.6, W=5
• We judge the qualities of these two clusterings by analyzing the heights and widths of the clusters. Leaving out the two identical histograms for cluster {de, def} and cluster {ab, abc}, the other two histograms are of different quality.
• The histogram for cluster {ab, abc, acd} has H/W=0.5, but the one for cluster {acd, de, def} has H/W=0.32.
  – Clustering (1) is better since we prefer more overlapping among transactions in the same cluster.
  – Thus, a larger height-to-width ratio of the histogram means better intra-cluster similarity.
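
The H, W and H/W values above can be reproduced with a short Python sketch (illustrative only; each transaction is written as a string of single-character items, following the slide's abbreviations):

from collections import Counter

def histogram_stats(cluster):
    """Height and width of a CLOPE cluster histogram: S is the total number of
    item occurrences, W the number of distinct items, H = S / W."""
    counts = Counter(item for transaction in cluster for item in transaction)
    S = sum(counts.values())
    W = len(counts)
    return S / W, W

for cluster in [("ab", "abc", "acd"), ("de", "def"), ("ab", "abc"), ("acd", "de", "def")]:
    H, W = histogram_stats(cluster)
    print(cluster, "H=%.2f" % H, "W=%d" % W, "H/W=%.2f" % (H / W))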
Outline
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Supervised Classification
Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
• Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.
Grid-Based Clustering Method
• Using multi-resolution grid data structure
• Several interesting methods
– STING (a STatistical INformation Grid approach)
by Wang, Yang and Muntz (1997)
– WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)
• A multi-resolution clustering approach using
wavelet method
– CLIQUE: Agrawal, et al. (SIGMOD’98)
STING: A Statistical
Information Grid Approach
• Wang, Yang and Muntz (VLDB’97)
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to
different levels of resolution
STING: A Statistical Information
Grid Approach
– Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
– Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
– Parameters of higher level cells can be easily calculated
from parameters of lower level cell
• count, mean, s, min, max
• type of distribution—normal, uniform, etc.
– Use a top-down approach to answer spatial data queries
– Start from a pre-selected layer—typically with a small
number of cells
STING: A Statistical Information
Grid Approach
– When finished examining the current layer, proceed to the next lower level
– Repeat this process until the bottom layer is reached
– Advantages:
• O(K), where K is the number of grid cells at the
lowest level
– Disadvantages:
• All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
WaveCluster (1998)
• Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
• A multi-resolution clustering approach which
applies wavelet transform to the feature space
– A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands.
• Input parameters:
– the wavelet, and the # of applications of wavelet transform.
WaveCluster (1998)
• How to apply wavelet transform to find clusters
– Summarize the data by imposing a multidimensional grid structure onto the data space
– These multidimensional spatial data objects are represented in an n-dimensional feature space
– Apply the wavelet transform on the feature space to find the dense regions in the feature space
– Apply the wavelet transform multiple times, which results in clusters at different scales, from fine to coarse
What Is Wavelet (2)?
WaveCluster (1998)
• Why is wavelet transformation useful for
clustering
– Unsupervised clustering
It uses hat-shaped filters to emphasize regions where points cluster, while simultaneously suppressing weaker information at their boundaries
– Effective removal and detection of outliers
– Multi-resolution
– Cost efficiency
• Major features:
  – Complexity O(N)
  – Detect arbitrary shaped clusters at different scales
  – Not sensitive to noise, not sensitive to input order
  – Only applicable to low dimensional data
(Figure: quantization of the feature space, followed by the wavelet transformation.)
CLIQUE (CLustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan
(SIGMOD’98).
• CLIQUE can be considered as both density-based and grid-based.
  – It partitions each dimension into the same number of equal-length intervals
  – It partitions an m-dimensional data space into non-overlapping rectangular units
  – A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
  – A cluster is a maximal set of connected dense units within a subspace
(Figure: CLIQUE example with density threshold τ = 3, showing grids over salary (×$10,000) vs. age (20–60) and vacation (weeks) vs. age; dense units found in these subspaces are intersected to locate candidate clusters.)
Strength and Weakness of CLIQUE
• Strengths
– It is insensitive to the order of records in input
and does not presume some canonical data
distribution
– It scales linearly with the size of input and has
good scalability as the number of dimensions
in the data increases
• Weakness
– The accuracy of the clustering result may be
degraded
Outline
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Supervised Classification
Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
• Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.
Model-Based Clustering Methods
• Attempt to optimize the fit between the data and some
mathematical model
• Statistical and AI approach
– Conceptual clustering
• A form of clustering in machine learning
• Produces a classification scheme for a set of unlabeled objects
• Finds characteristic description for each concept (class)
– COBWEB (Fisher’87)
• A popular and simple method of incremental conceptual learning
• Creates a hierarchical clustering in the form of a classification tree
• Each node refers to a concept and contains a probabilistic description
of that concept
COBWEB Clustering
Method
A classification tree
More on COBWEB Clustering
• Limitations of COBWEB
– The assumption that the attributes are
independent of each other is often too strong
because correlation may exist
– Not suitable for clustering large database
data – skewed tree and expensive
probability distributions
AutoClass (Cheeseman and Stutz, 1996)
• E is the attributes of a data item, that are given to us by the data set. For example, if each data item is a coin, the evidence E might be represented as follows for a coin i:
  – Ei = {"land tail"}, meaning that in one trial the coin i landed as tail.
• If there were many attributes, then the evidence E might be represented as follows for a coin i:
  – Ei = {"land tail", "land tail", "land head"}, meaning that in 3 separate trials the coin i landed as tail, tail and head.
• H is a hypothesis about the classification of a data item.
  – For example, H might state that coin i belongs in the class "two-headed coin".
• We usually do not know the H for a data set. Thus, AutoClass tests many hypotheses.

  π(H | E) = L(E | H) π(H) / π(E) = L(E | H) π(H) / Σ_H L(E | H) π(H)

• AutoClass uses a Bayesian method for determining the optimal class H for each object.
  – Prior distribution for each attribute, symbolizing the prior beliefs of the user about the attribute.
  – Change the classifications of items in clusters and change the means and variances of the distributions in each cluster, until the means and variances stabilize.
• Normal distribution for an attribute in a cluster.
Neural Networks and Self-Organizing Maps
• Neural network approaches
– Represent each cluster as an exemplar, acting as a
“prototype” of the cluster
– New objects are distributed to the cluster whose
exemplar is the most similar according to some
distance measure
• Competitive learning
– Involves a hierarchical architecture of several units
(neurons)
– Neurons compete in a “winner-takes-all” fashion for
the object currently being presented
Self-organizing feature maps (SOMs): a type of neural network
• Several units (clusters) compete for the current object
• The unit whose weight vector is closest to the current object wins
• The winner and its neighbors learn by having their weights adjusted
• SOMs are believed to resemble processing that can occur in the brain
Outline
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Supervised Classification
Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
• Unsupervised vs. Supervised: clustering may or may not be based on
prior knowledge of the correct classification.
Supervised classification
• Bases the classification on prior
knowledge about the correct classification
of the objects.
Support Vector Machines
• Support Vector Machines (SVMs) were invented by Vladimir
Vapnik for classification based on prior knowledge [Vapnik98,
Burges98].
• Create classification functions from a set of labeled training
data.
– The output is binary: is the input in a category?
• SVMs find a hypersurface in the space of possible inputs.
– Split the positive examples from the negative examples.
– The split is chosen to have the largest distance from the hypersurface to
the nearest of the positive and negative examples.
• Training an SVM on a large data set can be slow.
– Testing data should be near the training data.
CCA-S
• Clustering and Classification Algorithm – Supervised
(CCA-S)
– for detecting intrusions into computer network systems [Ye01].
• CCA-S learns signature patterns of both normal and
intrusive activities in the training data.
• Then, classify the activities in the testing data as normal or
intrusive based on the learned signature patterns of
normal and intrusive activities.
Classification (supervised) applied
to cancer tumor data sets
• Class Discovery: dividing tumor samples into groups with similar
behavioral properties and molecular characteristics.
– Previously unknown tumor subtypes may be identified this way
• Class Prediction: determining the correct class for a new tumor
sample, given a set of known classes.
  – Correct class prediction may suggest whether a patient will benefit from treatment, and how he or she will respond after treatment with a certain drug
• Examples of cancer tumor classification:
  – Classification of acute leukemias
  – Classification of diffuse large B-cell lymphoma tumors
  – Classification of 60 cancer cell lines derived from a variety of tumors
Chapter 8. Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Supervised Classification
• Outlier Analysis
What Is Outlier Discovery?
• What are outliers?
  – A set of objects that are considerably dissimilar from the remainder of the data
  – Example: Sports: Michael Jordan, Wayne Gretzky, ...
• Problem
  – Find the top n outlier points
• Applications:
  – Credit card fraud detection
  – Telecom fraud detection
  – Customer segmentation
  – Medical analysis
Outlier Discovery:
Statistical
Approaches
Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
• Use discordancy tests depending on
– data distribution
– distribution parameter (e.g., mean, variance)
– number of expected outliers
• Drawbacks
– Most distribution tests are for single attribute
– In many cases, data distribution may not be known
Outlier Discovery:
Distance-Based Approach
• Introduced to counter the main limitations imposed by
statistical methods
– We need multi-dimensional analysis without knowing data
distribution.
• Distance-based outlier: A (p,D)-outlier is an object O in
a dataset T such that at least a fraction p of the
objects in T lies at a distance greater than D from O.
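
An illustrative Python sketch of this test (one detail is an assumption: the fraction p is computed here over the other objects in T, excluding O itself):

def is_pd_outlier(o, data, p, D, dist):
    """An object o is a (p, D)-outlier if at least a fraction p of the
    objects in the data set lie at a distance greater than D from o."""
    others = [x for x in data if x != o]
    far = sum(1 for x in others if dist(o, x) > D)
    return far / len(others) >= p

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(is_pd_outlier((10, 10), data, p=0.95, D=5.0, dist=euclid))  # True
print(is_pd_outlier((0, 0),  data, p=0.95, D=5.0, dist=euclid))   # False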
Outlier Discovery:
Deviation-Based Approach
• Identifies outliers by examining the main
characteristics of objects in a group
• Objects that “deviate” from this description are
considered outliers
• Sequential exception technique
– simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
• OLAP data cube technique
– uses data cubes to identify regions of anomalies in
large multidimensional data
Chapter 8. Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Outlier Analysis
• Summary
Problems and Challenges
• Considerable progress has been made in
scalable clustering methods
– Partitioning: k-means, k-medoids, CLARANS
– Hierarchical: BIRCH, CURE, LIMBO
– Density-based: DBSCAN, CLIQUE, OPTICS
– Grid-based: STING, WaveCluster
– Model-based: Autoclass, Denclue, Cobweb
• Current clustering techniques do not address all
the requirements adequately
• Constraint-based clustering analysis: Constraints
exist in data space (bridges and highways) or in
user queries
Constraint-Based Clustering
Analysis
•
Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
Summary
• Cluster analysis groups objects based on their similarity
and has wide applications
• Measure of similarity can be computed for various types
of data
• Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
• Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
• There are still lots of research issues on cluster analysis,
such as constraint-based clustering
References (1)
•
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
•
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
•
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to
identify the clustering structure, SIGMOD’99.
• P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
• M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
• M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
•
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine
Learning, 2:139-172, 1987.
•
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
•
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
•
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References (2)
•
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
•
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.
•
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
•
P. Michaud. Clustering techniques. Future Generation Computer systems, 13,
1997.
•
R. Ng and J. Han. Efficient and effective clustering method for spatial data
mining. VLDB'94.
•
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very
large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
•
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
•
W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to
Spatial Data Mining, VLDB’97.
•
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering
method for very large databases. SIGMOD'96.
Thank you !!!
Ratio-Scaled Variables
• Ratio-scaled variable: a positive measurement on
a nonlinear scale, approximately at exponential
scale, such as AeBt or Ae-Bt
• Methods:
– treat them like interval-scaled variables — not a good
choice! (why?)
– apply logarithmic transformation
yif = log(xif)
– treat them as continuous ordinal data and treat their rank as interval-scaled.
Intrusion detection systems
•
Clustering is applied to files with logged behavior of system users over time, to separate instances
of normal activity from instances of abusive or attack activity. Clustering is primarily used in
Intrusion Detection Systems, especially when the log files of system user behavior are too large for
a human expert to analyze.
•
• Unsupervised clustering: When unsupervised clustering is applied to Intrusion Detection Systems data on which no prior knowledge exists, the data is first clustered and then the different clusters are labeled as either 'normal' or 'intrusions' based on the size of the cluster.
Given a new object d, classification proceeds by finding a cluster C which is closest to d and
classifying d according to the label of C as either normal or anomalous [Li04, Portnoy01].
IDSs based on unsupervised clustering focus on activity data that may contain features of an
individual TCP connection, such as its duration, protocol type, number of bytes transferred and a
flag indicating the connection’s normal or error status. Other features may include the number of
file creation operations, number of failed login attempts, whether root shell was obtained, the
number of connections to the same host as the current connection within the past two seconds, the
number of connections to the same service as the current connection within the past two seconds
and others.
Intrusion detection systems
•
Clustering is applied to files with logged behavior of system users over time, to separate instances
of normal activity from instances of abusive or attack activity. Clustering is primarily used in
Intrusion Detection Systems, especially when the log files of system user behavior are too large for
a human expert to analyze.
•
Supervised clustering: Supervised clustering is applied to Intrusion Detection Systems for
Signature Recognition purposes. This means that signatures representing patterns of previous
intrusion activities are learned by the system. Then, new patterns of activity are compared to the
previously learned signatures to determine if the activities are normal or if they may be intrusions
[Ye01].
•
IDS based on signature recognition focus on two main types of activity data: network traffic data
and computer audit data. Some categorical activity attributes that can be obtained from this data
include the user id, event type, process id, command, remote IP address. Some numerical activity
attributes that can be obtained from this data include the time stamp, CPU time, etc.
Squeezer
• Squeezer is introduced in [He2002] as a one-pass algorithm. Squeezer repeatedly reads tuples from the data set one by one. When the first tuple arrives, it forms a cluster alone. The subsequent tuples are either put into an existing cluster or rejected by all existing clusters to form a new cluster, according to the given similarity function.
• Squeezer clustering consists of the following steps:
1) Initialize the set of clusters, S, to the empty set.
2) Obtain an object d from the data set. If S is empty, then create a cluster
with d and add it to S. Otherwise, find the cluster in S that is closest
to this object. In other words, find the closest cluster C to d in S.
3) If the distance between d and C is less than or equal to a user specified
threshold W then associate d with the cluster C. Else, create a new
cluster for d in S.
4) Repeat steps 2 and 3 until no objects are left in the data set.
Farthest First Traversal k-center Algorithm
• The farthest-first traversal k-center algorithm (FFT) is a fast, greedy algorithm that minimizes the maximum cluster radius [Hochbaum85]. In FFT, k points are first selected as cluster centers. The remaining points are added to the cluster whose center is the closest. The first center is chosen randomly. The second center is greedily chosen as the point furthest from the first. Each remaining center is determined by greedily choosing the point farthest from the set of already chosen centers, where the farthest point x from a set D is defined as max_x { min { d(x, j) : j ∈ D } }.

Farthest_first_traversal(D: data set, k: integer) {
  randomly select first center;
  // select centers
  for (i = 2, ..., k) {
    for (each remaining point) { calculate distance to the current center set; }
    select the point with maximum distance as new center;
  }
  // assign remaining points
  for (each remaining point) {
    calculate the distance to each cluster center;
    put it in the cluster with minimum distance;
  }
}
Clustering Feature
Vector
BIRCH’s operation is based on clustering features (CF)
and clustering feature trees. These are used to
summarize the information in a cluster.
A CF is a triplet summarizing information about
subclusters of objects.
A CF typically holds the following information about a
subcluster: the number of objects N in the subcluster, a
vector holding the linear sum of the N objects in the
subcluster, and a vector holding the square sum of the
N objects in the subcluster.
Thus, a CF is a summary of statistics for the given
subcluster
Gradient: The steepness of a slope
• Example (Gaussian influence function):

  f_Gaussian(x, y) = e^( −d(x, y)² / (2σ²) )

  f^D_Gaussian(x) = Σ_{i=1}^{N} e^( −d(x, x_i)² / (2σ²) )

  ∇f^D_Gaussian(x, x_i) = Σ_{i=1}^{N} (x_i − x) · e^( −d(x, x_i)² / (2σ²) )
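
A small Python sketch of the Gaussian influence function, the overall density, and its gradient (illustrative only; the sample points and σ are made up):

import math

def gaussian_influence(x, y, sigma):
    """f_Gaussian(x, y) = exp(-d(x, y)^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Overall density at x: the sum of the influence of every data point."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

def density_gradient(x, data, sigma):
    """Gradient of the overall density at x (used to climb toward a density attractor)."""
    dim = len(x)
    return tuple(sum((xi[d] - x[d]) * gaussian_influence(x, xi, sigma) for xi in data)
                 for d in range(dim))

data = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0)]
print(density((1.1, 1.0), data, sigma=0.5))
print(density_gradient((1.1, 1.0), data, sigma=0.5))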
Mutual Information Clustering
• Mutual information refers to a measure of correlation between the information content of two information elements [Michaels98]. For example, when clustering gene expression data, the formula M(A,B) = H(A) + H(B) − H(A,B) represents the mutual information M that is shared by two temporal gene expression patterns A and B.
• H refers to the Shannon entropy. H(A) is a measure of the number of expression levels exhibited by the gene A expression pattern over a given time period. The higher the value of H, the more expression levels gene A exhibits over a temporal expression pattern, and thus the more information the expression pattern contains.
• An H of zero means that a gene expression pattern over a time course was completely flat, or constant, and thus the expression pattern carries no information.
• H(A,B) is a measure of the number of expression levels observed in either gene expression pattern A or B.
• Thus, the mutual information M(A,B) = H(A) + H(B) − H(A,B) is defined as the number of expression levels that are observed in both gene expression patterns A and B, over a given time period [26].
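
An illustrative Python sketch of M(A,B) = H(A) + H(B) − H(A,B) for two discretized expression patterns (the example level sequences are made up):

import math
from collections import Counter

def entropy(levels):
    """Shannon entropy H of a sequence of (discretized) expression levels."""
    counts = Counter(levels)
    n = len(levels)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def mutual_information(A, B):
    """M(A, B) = H(A) + H(B) - H(A, B), where H(A, B) is the entropy of the
    joint pattern (pairs of levels observed at the same time point)."""
    return entropy(A) + entropy(B) - entropy(list(zip(A, B)))

gene_a = ["low", "low", "high", "high"]
gene_b = ["low", "low", "high", "low"]
flat   = ["low", "low", "low", "low"]
print(mutual_information(gene_a, gene_b))   # > 0: the patterns share information
print(entropy(flat))                        # 0.0: a flat pattern carries no information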
CACTUS Categorical Clustering
•
•
•
•
•
CACTUS is presented in [Ganti99], introducing a novel formalization of a
cluster for categorical attributes by generalizing a definition of a cluster for
numerical data.
CACTUS is a summary-based algorithm that utilizes summary information
of the dataset. It characterizes the detected categorical clusters by
building summaries.
The authors assume the existence of a distinguishing number that
represents the minimum size of the distinguishing sets – i.e. attribute value
sets that uniquely occur within only one cluster.
The distinguishing sets in CACTUS rely on the assumption that clusters
are uniquely identified by a core set of attribute values that occur in no
other cluster.
While this assumption may hold true for many real world datasets, it is
unnatural and unnecessary for the clustering process. The distinguishing
sets are then extended to cluster projections.
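As a loose illustration of the kind of summary information CACTUS computes (not its actual data structures), the sketch below counts how often pairs of attribute values from different attributes co-occur and keeps the pairs whose support exceeds a multiple alpha of what independence would predict; alpha and all names are assumptions for illustration.

from collections import Counter
from itertools import combinations

def inter_attribute_summary(rows, alpha=2.0):
    n, k = len(rows), len(rows[0])
    value_counts = [Counter(row[i] for row in rows) for i in range(k)]
    strong = {}
    for i, j in combinations(range(k), 2):
        pair_counts = Counter((row[i], row[j]) for row in rows)
        for (a, b), support in pair_counts.items():
            expected = value_counts[i][a] * value_counts[j][b] / n   # count expected under independence
            if support > alpha * expected:
                strong[(i, a, j, b)] = support                       # a and b are "strongly connected"
    return strong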
COOLCAT Categorical Clustering
• COOLCAT is introduced in [Barbara02] as an entropy-based
algorithm for categorical clustering. The COOLCAT algorithm is
based on the idea of entropy reduction within the generated
clusters.
• It first bootstraps itself using a sample of maximally dissimilar
points from the dataset to create initial clusters.
• COOLCAT starts with a sample of data objects and identifies a set
of k initial tuples such that the minimum pairwise distance among
them is maximized.
• The remaining points are then added incrementally. Clusters are
created by "cooling" them down, i.e. reducing their entropy.
• All remaining tuples of the data set are placed in one of the
clusters such that, at each step, the increase in the entropy of the
resulting clustering is minimized.
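A hedged sketch of the incremental step: each remaining object goes to the cluster whose entropy grows the least. The per-attribute entropy sum (treating attributes as independent) and the greedy size-weighted increase are simplifications for illustration.

import math
from collections import Counter

def cluster_entropy(rows):
    # sum of the entropies of the individual categorical attributes
    if not rows:
        return 0.0
    n, k = len(rows), len(rows[0])
    total = 0.0
    for i in range(k):
        counts = Counter(row[i] for row in rows)
        total -= sum((c / n) * math.log2(c / n) for c in counts.values())
    return total

def place(obj, clusters):
    # pick the cluster whose size-weighted entropy increases least when obj is added
    def increase(c):
        return (len(c) + 1) * cluster_entropy(c + [obj]) - len(c) * cluster_entropy(c)
    best = min(range(len(clusters)), key=lambda i: increase(clusters[i]))
    clusters[best].append(obj)
    return best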
STIRR Categorical Clustering
STIRR produces a dynamical system from a table of categorical data. STIRR begins with a table of relational
data, consisting of k fields (or columns), each of which can assume one of many possible values. Figure 15
shows an example of the data representation. Each possible value in each field is represented by a node and
the data is represented as a set of tuples – each tuple consists of nodes, with one node for each field.
A configuration is an assignment of a weight w to each node. A normalization function N() is needed to rescale
the weights of the nodes associated with each field so that their squares add up to 1.
Figure 15 - Representation of a collection of tuples. Adopted from [Gibson98].
A dynamical system is the repeated application of a function f on some set of values. A fixed point of a
dynamical system is a point u for which f(u) = u. So it is a point which remains the same under the repeated
application of f.
The dynamical system for a set of tuples is based on a function f, which maps a current configuration of weights to a new configuration. To define f, a combiner function is chosen (for example, the sum or the product of its arguments). Each weight wv is then updated by taking every tuple that contains the node v, applying the combiner function to the weights of the other nodes in that tuple, and summing the results over all such tuples; finally, the weights of the nodes in each field are rescaled with the normalization function N().
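A hedged sketch of one such update, using an additive combiner (the sum of the other nodes' weights) purely for illustration; the node representation and function names are assumptions.

import math
from collections import defaultdict

def stirr_iteration(tuples, weights):
    # tuples: list of tuples of (field, value) nodes; weights: dict node -> weight
    new_w = defaultdict(float)
    for t in tuples:
        for node in t:
            # additive combiner over the other nodes of the tuple
            new_w[node] += sum(weights[u] for u in t if u != node)
    # normalization N(): per field, the squared weights sum to 1
    by_field = defaultdict(list)
    for node in new_w:
        by_field[node[0]].append(node)
    for nodes in by_field.values():
        norm = math.sqrt(sum(new_w[n] ** 2 for n in nodes)) or 1.0
        for n in nodes:
            new_w[n] /= norm
    return dict(new_w)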
CLICK Categorical Clustering
CLICK finds clusters in categorical datasets based on a search method for k-partite maximal
cliques [Peters04].
Figure 16 – The CLICK algorithm. Adopted from [Peters04].
The basic CLICK approach consists of three principal stages, shown in Figure 16, as follows (a sketch of the pre-processing step appears after this list):
– Pre-Processing: the k-partite graph Γ(D) is created from the input database D, and the attributes are ranked for efficiency reasons.
– Clique Detection: given Γ(D), all the maximal k-partite cliques in the graph are enumerated.
– Post-Processing: the support of the candidate cliques within the original dataset is verified to form the final clusters. Moreover, the final clusters are optionally merged to partially relax the strict cluster conditions.
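The sketch below illustrates only the pre-processing idea: building a k-partite graph whose vertices are (attribute, value) pairs and whose edges connect values of different attributes that co-occur often enough. The simple min_support threshold and all names are assumptions for illustration, not CLICK's exact construction.

from collections import Counter
from itertools import combinations

def build_k_partite_graph(rows, min_support=2):
    edge_counts = Counter()
    for row in rows:
        nodes = [(i, v) for i, v in enumerate(row)]    # one partition per attribute
        for u, w in combinations(nodes, 2):
            edge_counts[(u, w)] += 1
    edges = {e for e, c in edge_counts.items() if c >= min_support}
    vertices = {n for e in edges for n in e}
    return vertices, edges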
CLOPE Categorical Clustering
CLOPE proposes a novel global criterion function that tries to increase the intra-cluster overlapping of
transaction items by increasing the height-to-width ratio of the cluster histogram.
Let's take a small market basket database with 5 transactions {(apple, banana), (apple, banana, cake), (apple,
cake, dish), (dish, egg), (dish, egg, fish)}. For simplicity, transaction (apple, banana) is abbreviated to ab, etc.
For this small database, we want to compare the following two clustering (1) { {ab, abc, acd}, {de, def} } and (2)
{ {ab, abc}, {acd, de, def} }. For each cluster, we count the occurrence of every distinct item, and then obtain the
height (H) and width (W) of the cluster. For example, cluster {ab, abc, acd} has the occurrences of a:3, b:2, c:2,
and d:1, with H=2.0 and W=4. H is computed by summing the numbers of occurrences and dividing by the
number of distinct items. Figure 17 shows the values of these results as histograms, with items sorted in reverse
order of their occurrences, only for the sake of easier visual interpretation.
clustering (1): cluster {ab, abc, acd} has H=2.0, W=4; cluster {de, def} has H=1.67, W=3
clustering (2): cluster {ab, abc} has H=1.67, W=3; cluster {acd, de, def} has H=1.6, W=5
Figure 17 - Histograms of the two clusterings. Adopted from [Yang2002].
We judge the quality of these two clusterings by analyzing the heights and widths of the clusters. Leaving
out the two identical histograms for cluster {de, def} and cluster {ab, abc}, the other two histograms are of
different quality. The histogram for cluster {ab, abc, acd} has H/W=0.5, but the one for cluster {acd, de, def} has
H/W=0.32. Clearly, clustering (1) is better since we prefer more overlapping among transactions in the same
cluster. From the above example, we can see that a larger height-to-width ratio of the histogram means better
intra-cluster similarity. This intuition is the basis of CLOPE and defines the global criterion function using the
geometric properties of the cluster histograms.
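A minimal sketch of the histogram statistics used in the example above; the representation of transactions as item sets and the function name are illustrative.

from collections import Counter

def histogram_stats(cluster):
    # cluster: list of transactions, each a set of items
    counts = Counter(item for txn in cluster for item in txn)
    S = sum(counts.values())      # total item occurrences (histogram area)
    W = len(counts)               # number of distinct items (width)
    H = S / W                     # height
    return H, W, H / W

clustering_1 = [[{"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"}], [{"d", "e"}, {"d", "e", "f"}]]
for cluster in clustering_1:
    print(histogram_stats(cluster))   # (2.0, 4, 0.5) and (~1.67, 3, ~0.56)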
WaveCluster (1998)
WaveCluster uses wavelet transformations to cluster data.
It uses a wavelet transformation to transform the original feature space, finding
dense regions in the transformed space.
A wavelet transform is a signal processing technique that decomposes a signal into
different frequency subbands.
The wavelet model can be applied to a data set with n dimensions by applying a
one-dimensional wavelet n times.
Each time that a wavelet transform is applied, data are transformed so as to
preserve the relative distance between objects at different levels of resolution,
although the absolute distance increases.
This allows the natural clusters in the data to become more distinguishable.
Clusters are then identified by searching for dense regions in the new domain.
Wavelets emphasize regions where the objects cluster, but suppress less dense
regions outside of clusters.
The clusters in the data stand out because they are more dense and “clear” the
regions around them [Sheikholeslami98].
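As a loose illustration of the underlying idea (not WaveCluster's actual transform), a one-level 1-D Haar wavelet transform separates a binned signal into a coarse approximation, in which dense regions remain prominent, and a detail part carrying the fine fluctuations; applying such a transform along each dimension of the quantized feature space gives the coarser view in which clusters stand out.

import math

def haar_1d(signal):
    # one decomposition level for an even-length signal: (approximation, detail)
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

counts = [0, 1, 8, 9, 7, 1, 0, 0]      # points per grid cell along one dimension
approx, detail = haar_1d(counts)
print(approx)                           # the dense region still dominates the coarse view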
CLIQUE: The Major Steps
– finds subspaces of the highest dimensionality such that high
density clusters exist in those subspaces
• Partition the data space and find the number of points that
lie inside each cell of the partition.
• Identify clusters:
– Determine dense units in all subspaces of interest.
– Determine connected dense units in all subspaces of interest.
• Generate minimal description for the clusters
– Determine maximal regions that cover a cluster of connected
dense units for each cluster
– Determination of minimal cover for each cluster
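A hedged sketch of the first step only, done here in the full data space for simplicity (CLIQUE itself proceeds bottom-up over subspaces): partition each dimension into xi equal-width intervals, count the points per grid cell, and keep the cells whose count reaches a density threshold tau. xi, tau and the function name are illustrative parameters.

from collections import Counter

def dense_units(points, xi=10, tau=5):
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    widths = [(highs[d] - lows[d]) / xi or 1.0 for d in range(dims)]

    def cell(p):
        # index of the grid cell that contains point p
        return tuple(min(int((p[d] - lows[d]) / widths[d]), xi - 1) for d in range(dims))

    counts = Counter(cell(p) for p in points)
    return {c for c, n in counts.items() if n >= tau}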
ACDC
ACDC works in a different way than the algorithms mentioned above [Tzerpos00]. ACDC
performs the task of clustering in two stages. In the first one it creates a skeleton of the final
decomposition by identifying subsystems using a pattern-driven approach. There are many
patterns that have been used in ACDC. Figure 24 shows some of these patterns.
Depending on the pattern used, the subsystems are given appropriate names. In the second
stage ACDC completes the decomposition by using an extended version of a technique known
as Orphan Adoption. Orphan Adoption is an incremental clustering technique based on the
assumption that the existing structure is well established. It attempts to place each newly
introduced resource (called an orphan) in the subsystem that seems “more appropriate”. This is
usually a subsystem that has a larger amount of connectivity to the orphan than any other
subsystem [Tzerpos00].
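A minimal sketch of the Orphan Adoption idea described above: the orphan is placed in the existing subsystem to which it has the most connectivity. The dependency-edge representation and the tie handling are illustrative assumptions.

def adopt_orphan(orphan, subsystems, edges):
    # subsystems: dict name -> set of resources; edges: set of (resource, resource) dependency pairs
    def connectivity(members):
        return sum(1 for m in members if (orphan, m) in edges or (m, orphan) in edges)
    best = max(subsystems, key=lambda name: connectivity(subsystems[name]))
    subsystems[best].add(orphan)
    return best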
ACDC