EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The UNIVERSITY of Kansas
Administrative
Project assignments have been distributed
Schedule a meeting this week with the instructor to start working on your projects.
Overview
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. Types of Clusters
4. A Categorization of Major Clustering Methods
5. Partitioning Methods
6. Hierarchical Methods
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated
to) the objects in other groups
[Figure: a clustering in which intra-cluster distances are minimized and inter-cluster distances are maximized]
Applications of Cluster Analysis
Understanding
Group related documents for browsing, group genes and
proteins that have similar functionality, or group stocks with
similar price fluctuations
Summarization
Reduce the size of large data sets
[Figure: clustering precipitation in Australia]
Multidisciplinary Efforts of Clustering
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters and support other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
Bioinformatics:
Phylogenetic tree
Microarray analysis
What is not Cluster Analysis?
Supervised classification
Have class label information
Simple segmentation
Dividing students into different registration groups alphabetically, by
last name
Results of a query
Groupings are a result of an external specification
Graph partitioning
Some mutual relevance and synergy, but areas are not identical
Terms in Cluster Analysis
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the characteristics found in the
data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
So what?
Clustering can be used as a stand-alone tool to get insight into data distribution
Clustering can be used as a preprocessing step for other algorithms such as
discretization
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the similarity
measure used by the method and its implementation
The quality of a clustering method is also measured by its ability to
discover some or all of the hidden patterns
Measure the Quality of Clustering
Dissimilarity/Similarity metric: Similarity is expressed in terms of a
distance function, typically metric: d(i, j)
There is a separate “quality” function that measures the “goodness”
of a cluster.
The definitions of distance functions are usually very different for
boolean, categorical, ordinal, interval, ratio, and vector variables.
Weights should be associated with different variables based on
applications and data semantics.
It is hard to define “similar enough” or “good enough”
the answer is typically highly subjective.
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input
parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to handle high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Data Matrix
Data matrix
Also called object-by-variable structure

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

$X_f = (x_{1f}, x_{2f}, \ldots, x_{nf})'$ is the f-th variable
Data Structure
Dissimilarity matrix
Also called object-by-object structure
d(i,j): dissimilarity between objects i and j
Nonnegative
Close to 0: similar

$$
\begin{bmatrix}
0      &        &        &        \\
d(2,1) & 0      &        &        \\
d(3,1) & d(3,2) & 0      &        \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
Data Structures
Variance-covariance matrix

$$
\begin{bmatrix}
v(X_1)      &             &        &        \\
v(X_2, X_1) & v(X_2)      &        &        \\
v(X_3, X_1) & v(X_3, X_2) & v(X_3) &        \\
\vdots      & \vdots      &        & \ddots \\
v(X_n, X_1) & v(X_n, X_2) & \cdots & \cdots & v(X_n)
\end{bmatrix}
$$
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are (we are in
the business of comparing apples and oranges!).
Is higher when objects are more alike.
Often falls in the range [0,1]
Dissimilarity
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to a similarity or dissimilarity
Similarity (Dissimilarity) is Measured at Different Levels
Similarity between a single measurement for two objects
Similarity between two objects (across a group of
measurements)
Similarity between an object and a cluster
Similarity between two clusters
Type of data in clustering analysis
Binary variables
Nominal
Ordinal
Interval
Ratio variables
Variables of mixed types
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
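As a rough illustration (an assumption, not taken verbatim from the slides), typical per-attribute-type proximity definitions in Python:

```python
# A minimal sketch (an assumption): common similarity/dissimilarity
# definitions for a single attribute of each type.

def proximity(p, q, kind, n_values=None):
    """Return (dissimilarity, similarity) for one pair of attribute values."""
    if kind == "nominal":
        d = 0.0 if p == q else 1.0
        return d, 1.0 - d
    if kind == "ordinal":                  # values mapped to ranks 0 .. n_values-1
        d = abs(p - q) / (n_values - 1)
        return d, 1.0 - d
    if kind == "interval":                 # also used for ratio attributes
        d = abs(p - q)
        return d, 1.0 / (1.0 + d)          # one common similarity transform
    raise ValueError(kind)

print(proximity("red", "blue", "nominal"))
print(proximity(1, 3, "ordinal", n_values=5))
print(proximity(2.0, 7.5, "interval"))
```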
Binary Variables
A binary variable is symmetric if both of its states are
equally valuable and carry the same weight (e.g. gender).
The dissimilarity between objects i and j (using a group of
symmetric variables) is defined with the simple matching
coefficient (SMC):
For two objects i and j, the 2x2 contingency table is:

                Object j
                 1       0      sum
  Object i  1    a       b      a + b
            0    c       d      c + d
          sum  a + c   b + d      p

$$d(i, j) = \frac{b + c}{a + b + c + d}$$
Binary Variables
A binary variable is asymmetric if the outcomes of the
states are not equally important.
The distance between two objects on asymmetric binary
variables is defined using Jaccard distance
$$d(i, j) = \frac{b + c}{a + b + c}$$
Simple Matching versus Jaccard
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

Simple match distance = (M10 + M01) / (M01 + M10 + M11 + M00) = (1 + 2) / (2 + 1 + 0 + 7) = 0.3
Jaccard distance J = (M10 + M01) / (M01 + M10 + M11) = (1 + 2) / (2 + 1 + 0) = 1
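A minimal Python sketch (not from the slides) that reproduces this example:

```python
# Simple matching vs. Jaccard distance for binary vectors.

def binary_counts(p, q):
    """Count the four match/mismatch cases for two binary vectors."""
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    return m11, m10, m01, m00

def smc_distance(p, q):
    m11, m10, m01, m00 = binary_counts(p, q)
    return (m10 + m01) / (m11 + m10 + m01 + m00)

def jaccard_distance(p, q):
    m11, m10, m01, m00 = binary_counts(p, q)
    return (m10 + m01) / (m11 + m10 + m01)   # m00 is ignored (asymmetric case)

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_distance(p, q))       # 0.3
print(jaccard_distance(p, q))   # 1.0
```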
Nominal Variables
A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
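A minimal Python sketch (illustrative, not from the slides) of Method 1, simple matching for nominal variables:

```python
# Simple-matching dissimilarity for nominal variables: d(i, j) = (p - m) / p.

def nominal_dissimilarity(obj_i, obj_j):
    """obj_i, obj_j: equal-length sequences of nominal values."""
    p = len(obj_i)                                       # total number of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)   # number of matches
    return (p - m) / p

print(nominal_dissimilarity(["red", "small", "round"],
                            ["red", "large", "round"]))  # 1/3
```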
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
map the range of each variable onto [0, 1] by replacing the i-th object
in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
compute the dissimilarity using methods for interval variables
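A small Python sketch (illustrative, not from the slides) of the rank-to-[0, 1] mapping:

```python
# Map ordinal values to ranks, then onto [0, 1], so that interval-based
# distance measures can be applied.

levels = ["freshman", "sophomore", "junior", "senior"]   # ordered states
rank = {state: r for r, state in enumerate(levels, start=1)}
M = len(levels)

def ordinal_to_unit(value):
    """z = (r - 1) / (M - 1), as on the slide."""
    return (rank[value] - 1) / (M - 1)

print(ordinal_to_unit("freshman"), ordinal_to_unit("senior"))  # 0.0 1.0
```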
Interval variables
Standardize data
Calculate the mean absolute deviation:
$$s_f = \tfrac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where
$$m_f = \tfrac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
Calculate the standardized measurement (z-score)
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
Using mean absolute deviation is more robust than using
standard deviation
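A short Python sketch (illustrative, not from the slides) of this standardization:

```python
# Standardize an interval-scaled variable using the mean absolute deviation.

def standardize(values):
    n = len(values)
    m = sum(values) / n                         # mean m_f
    s = sum(abs(x - m) for x in values) / n     # mean absolute deviation s_f
    return [(x - m) / s for x in values]        # z-scores

print(standardize([2.0, 4.0, 4.0, 6.0]))        # [-2.0, 0.0, 0.0, 2.0]
```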
Similarity and Dissimilarity Between Objects
Some popular ones include: Minkowski distance:
$$d(i, j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\,}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
Minkowski Distance: Examples
q = 1: City block (Manhattan, taxicab, L1 norm) distance.
A common example of this is the Hamming distance, which is
just the number of bits that differ between two binary vectors
q = 2: Euclidean distance
q → ∞: "supremum" (Lmax norm, L∞ norm) distance.
This is the maximum difference between any component of the
vectors
Do not confuse the parameter q with the number of dimensions p; all these
distances are defined for any number of dimensions.
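A small Python sketch (illustrative, not from the slides) of the three special cases:

```python
# Minkowski distance for q = 1, q = 2, and the supremum (q -> infinity) limit.

def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def supremum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))   # 7.0  (city block)
print(minkowski(x, y, 2))   # 5.0  (Euclidean)
print(supremum(x, y))       # 4    (L-max)
```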
Distance Defines our Perception of the World
What is a circle if we use Manhattan distance?
The definition of a circle is that all the points on a circle
have equal distance to a fixed point (the center); under the Manhattan
distance, this set of points is a diamond, i.e., a square rotated 45 degrees
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale,
approximately exponential, such as $Ae^{Bt}$ or $Ae^{-Bt}$, where
A is the original measurement
B is a constant
t is a parameter of the nonlinear scale
Methods:
treat them like interval-scaled variables—not a good choice! (why?—the scale
can be distorted)
apply logarithmic transformation
$y_{if} = \log(x_{if})$
treat them as continuous ordinal data and treat their ranks as interval-scaled
Variables of Mixed Types
A database may contain all six types of variables:
symmetric binary, asymmetric binary, nominal, ordinal, interval,
and ratio
One may use a weighted formula to combine their effects
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
where $d_{ij}^{(f)}$ is the distance on the f-th variable for objects i and j,
and $\delta_{ij}^{(f)}$ is the weight of the f-th variable
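A Gower-style Python sketch of this weighted combination; the per-type distance choices below are assumptions, not taken from the slides:

```python
# Combined distance over mixed variable types:
# d(i, j) = sum_f delta_f * d_f / sum_f delta_f.

def mixed_distance(obj_i, obj_j, types, ranges):
    num = den = 0.0
    for f, t in enumerate(types):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:          # missing value: delta_f = 0
            continue
        if t == "nominal":
            d_f = 0.0 if a == b else 1.0
        elif t == "interval":               # normalize by the variable's range
            d_f = abs(a - b) / ranges[f]
        elif t == "asymmetric":             # skip 0-0 matches entirely
            if a == 0 and b == 0:
                continue
            d_f = 0.0 if a == b else 1.0
        num += d_f                          # delta_f = 1 for every used variable
        den += 1.0
    return num / den

obj_i = ["red", 3.0, 1]
obj_j = ["blue", 5.0, 0]
print(mixed_distance(obj_i, obj_j,
                     ["nominal", "interval", "asymmetric"],
                     ranges=[None, 10.0, None]))
```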
Vector Objects
Vector objects: keywords in documents, gene features in
micro-arrays, etc.
Broad applications: information retrieval, biological
taxonomy, etc.
Cosine measure
A variant: Tanimoto coefficient
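A minimal Python sketch (illustrative, not from the slides) of the cosine measure and the Tanimoto coefficient:

```python
# Cosine similarity and the Tanimoto coefficient for two vectors.
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

x, y = [3, 2, 0, 5], [1, 0, 0, 0]
print(cosine(x, y), tanimoto(x, y))
```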
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Model based
Types of Clusters: Well-Separated
Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is closer (or more
similar) to every other point in the cluster than to any point not in the
cluster.
[Figure: 3 well-separated clusters]
Types of Clusters: Center-Based
Center-based
A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most
“representative” point of a cluster
[Figure: 4 center-based clusters]
Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.
[Figure: 8 contiguous clusters]
Types of Clusters: Density-Based
Density-based
A cluster is a dense region of points, separated from other
high-density regions by regions of low density.
Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
[Figure: 6 density-based clusters]
Types of Clusters: Model Based
Shared Property or Conceptual Clusters
Finds clusters that share some common property or represent a
particular model.
[Figure: ratings of movies 1-7 by viewers 1-5; clusters group viewers with a shared rating pattern (2 overlapping circles)]
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some
criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Partitional Clustering
[Figure: original points and a partitional clustering of them]
Hierarchical Clustering
[Figure: a traditional hierarchical clustering of points p1, p2, p3, p4 with its dendrogram, and a non-traditional hierarchical clustering of the same points with its dendrogram]
Major Clustering Approaches (II)
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based:
A model is hypothesized for each of the clusters, and the method tries to
find the best fit of the data to the given model
Typical methods: EM, SOM, COBWEB
Typical Alternatives to Calculate the Distance between Clusters
Single link: smallest distance between an element in one cluster and an
element in the other, i.e., $\mathrm{dis}(K_i, K_j) = \min_{t_{ip} \in K_i,\, t_{jq} \in K_j} d(t_{ip}, t_{jq})$
Complete link: largest distance between an element in one cluster and an
element in the other, i.e., $\mathrm{dis}(K_i, K_j) = \max_{t_{ip} \in K_i,\, t_{jq} \in K_j} d(t_{ip}, t_{jq})$
Average: average distance between an element in one cluster and an element in the
other, i.e., $\mathrm{dis}(K_i, K_j) = \mathrm{avg}_{t_{ip} \in K_i,\, t_{jq} \in K_j}\, d(t_{ip}, t_{jq})$
Centroid: distance between the centroids of two clusters, i.e., $\mathrm{dis}(K_i, K_j) = d(C_i, C_j)$
Medoid: distance between the medoids of two clusters, i.e., $\mathrm{dis}(K_i, K_j) = d(M_i, M_j)$
Medoid: one chosen, centrally located object in the cluster
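A small Python sketch (illustrative, not from the slides) of the first three alternatives:

```python
# Single, complete, and average link distances between two clusters.
import math

def single_link(ki, kj):
    return min(math.dist(p, q) for p in ki for q in kj)

def complete_link(ki, kj):
    return max(math.dist(p, q) for p in ki for q in kj)

def average_link(ki, kj):
    return sum(math.dist(p, q) for p in ki for q in kj) / (len(ki) * len(kj))

ki = [(0, 0), (1, 0)]
kj = [(4, 0), (5, 0)]
print(single_link(ki, kj), complete_link(ki, kj), average_link(ki, kj))  # 3 5 4
```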
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
Centroid: the “mean” point of a cluster
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$
Radius: square root of the average distance from any point of the cluster to its
centroid
$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$$
Diameter: square root of the average mean squared distance between all pairs of
points in the cluster
$$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N (N - 1)}}$$
Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n
objects into a set of k clusters that minimizes the sum of squared distances
$$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$
Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87):
Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the clusters of the current
partition (the centroid is the mean point, of the cluster)
Assign each object to the cluster with the nearest seed point
Go back to Step 2; stop when no new assignments are made
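A compact Python sketch (illustrative, not from the slides) of this four-step loop for 2-D points:

```python
# The k-means loop: choose initial centers, assign, update means, repeat.
import math
import random

def kmeans(points, k, max_iter=100):
    # Step 1: arbitrarily choose k objects as the initial centroids.
    centroids = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:     # stop when nothing changes
            break
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5)]
print(kmeans(pts, k=2))
```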
The K-Means Clustering Method
[Figure: the k-means iterations with K = 2; arbitrarily choose K objects as initial cluster centers, assign each object to the most similar center, update the cluster means, and repeat until no assignment changes]
Comments on the K-Means Method
Strength
Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
Comparing: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k))
Comment: Often terminates at a local optimum.
The global optimum may be found using techniques such as
genetic algorithms
Comments on the K-Means Method
Weakness
Applicable only when mean is defined, then what about
categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
A few variants of k-means differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes (Huang’98)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
A Problem of K-means
Sensitive to outliers
Outlier: objects with extremely large values
May substantially distort the distribution of the data
K-medoids: the most centrally located object in a cluster
[Figure: two scatter plots showing how a single outlier drags the k-means centroid (+) away from the bulk of the cluster]
A Problem of K-means: Differing Density
[Figure: original points and the k-means result (3 clusters) on data with differing density]
A Problem of K-means: Non-globular Shapes
[Figure: original points and the k-means result (2 clusters) on data with non-globular shapes]
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of the
resulting clustering
PAM works effectively for small data sets, but does not scale well for large
data sets
CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
A Typical K-Medoids Algorithm (PAM)
[Figure: a run of PAM with K = 2; arbitrarily choose k objects as initial medoids, assign each remaining object to the nearest medoid, randomly select a non-medoid object O_random, compute the total cost of swapping, swap if quality is improved, and loop until no change (total costs of 20 and 26 are shown for two configurations)]
PAM (Partitioning Around Medoids) (1987)
PAM (Kaufman and Rousseeuw, 1987), built into S-PLUS
Use real objects to represent the clusters
Select k representative objects arbitrarily
For each pair of non-selected object h and selected object i, calculate the total
swapping cost TCih
For each pair of i and h,
If TCih < 0, i is replaced by h
Then assign each non-selected object to the most similar
representative object
repeat steps 2-3 until there is no change
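A minimal Python sketch (illustrative, not from the slides) of this swap-based search:

```python
# PAM's swap step: try replacing a medoid with a non-medoid object and
# keep the swap when the total swapping cost TC_ih is negative.
import math

def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = points[:k]                      # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for i in range(k):                    # each selected object i
            for h in points:                  # each non-selected object h
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                tc = total_cost(points, candidate) - total_cost(points, medoids)
                if tc < 0:                    # TC_ih < 0: i is replaced by h
                    medoids, improved = candidate, True
    return medoids

pts = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
print(pam(pts, k=2))
```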
What Is the Problem with PAM?
PAM is more robust than k-means in the presence of noise and
outliers because a medoid is less influenced by outliers or other
extreme values than a mean
PAM works efficiently for small data sets but does not scale well to
large data sets:
O(k(n-k)²) per iteration,
where n is the number of data points and k is the number of clusters
Sampling-based method:
CLARA (Clustering LARge Applications)
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw in 1990)
Built into statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on each
sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased
CLARANS
Clustering Large Applications based upon RANdomized Search
The problem space: a graph of clusterings
A vertex is a choice of k medoids out of the n objects, so there are
$\binom{n}{k}$ vertices in total
PAM searches the whole graph, examining every neighbor of the current clustering
CLARA searches some random sub-graphs
CLARANS climbs mountains (randomized hill climbing)
Randomly sample a set and select k medoids
Consider neighbors of medoids as candidate for new medoids
Use the sample set to verify
Repeat multiple times to avoid bad samples
Hierarchical Clustering
Use distance matrix as clustering criteria. This method does not
require the number of clusters k as an input, but needs a termination
condition
[Figure: agglomerative clustering (AGNES) proceeds in steps 0-4, merging a, b, c, d, e into ab, de, cde, and finally abcde; divisive clustering (DIANA) runs the same steps in reverse]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Use the Single-Link method and the dissimilarity matrix.
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
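A short Python sketch (illustrative, not from the slides) of single-link agglomerative merging:

```python
# Single-link agglomerative clustering: repeatedly merge the two clusters
# with the least dissimilarity until one cluster remains.
import math

def single_link(ki, kj):
    return min(math.dist(p, q) for p in ki for q in kj)

def agnes(points):
    clusters = [[p] for p in points]          # start: every point is a cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges                              # the merge order defines the dendrogram

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for a, b in agnes(pts):
    print(a, "+", b)
```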
[Figure: three snapshots of AGNES merging nearby points into progressively larger clusters]