4. Clustering Methods
• Concepts
• Partitional (k-Means, k-Medoids)
• Hierarchical (Agglomerative & Divisive, COBWEB)
• Density-based (DBSCAN, CLIQUE)
• Large data sets (STING, BIRCH, CURE)
The Clustering Problem
• The clustering problem is about grouping a set of data tuples into a number of
clusters, such that data in the same cluster are highly similar to each other and
data in different clusters are highly dissimilar from each other.
• About clusters
– Inter-cluster distance → maximization
– Intra-cluster distance → minimization
• Clustering vs. classification
– Which one is more difficult? Why?
– There are many possible ways of clustering the same data; which way is the best?
Different ways of representing clusters
• Division with boundaries
• Venn diagram or spheres
• Probabilistic, e.g. a table of membership probabilities:
        Cluster 1   Cluster 2   Cluster 3
  I1       0.5         0.2         0.3
  I2        …
  …
  In
• Dendrograms
• Trees
• Rules
Major Categories of Algorithms
• Partitioning: Divide into k partitions (k fixed);
regroup to get better clustering.
• Hierarchical: Divide into different numbers of
partitions in layers, by merging (bottom-up) or
splitting (top-down).
• Density-based: Continue to grow a cluster as long as
the density of the cluster exceeds a threshold.
• Grid-based: First divide space into grids, then
perform clustering on the grids.
k-Means
• Algorithm
1. Given k
2. Randomly pick k instances as the initial centers
3. Assign each of the remaining instances to the closest of the k clusters
4. Recalculate the mean of each cluster
5. Repeat steps 3 & 4 until the means no longer change
• How good are the clusters?
– Initial and final clusters
– Within-cluster variation: Σ diff(x, mean)²
– Why don’t we consider inter-cluster distance?
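A minimal Python sketch of the loop above for 1-dimensional data; the names (kmeans_1d, within_cluster_variation) and the optional init argument are illustrative, not part of any standard API.

```python
import random

def kmeans_1d(points, k, init=None, seed=0):
    """Steps 1-5 of the slide, for a list of numbers."""
    random.seed(seed)
    # step 2: pick k instances as the initial centers (or use the ones supplied)
    centers = list(init) if init is not None else random.sample(points, k)
    while True:
        # step 3: assign every point to the closest center
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[nearest].append(x)
        # step 4: recompute the mean of each (non-empty) cluster
        new_centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        # step 5: stop once the means no longer change
        if new_centers == centers:
            return clusters, centers
        centers = new_centers

def within_cluster_variation(clusters, centers):
    """Sum over all clusters of diff(x, mean)^2."""
    return sum((x - m) ** 2 for c, m in zip(clusters, centers) for x in c)
```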
Example
• For simplicity, 1-dimensional objects and k = 2.
• Objects: 1, 2, 5, 6, 7
• k-Means:
– Randomly select 5 and 6 as initial centroids;
– => two clusters {1, 2, 5} and {6, 7}; meanC1 = 8/3, meanC2 = 6.5
– => {1, 2}, {5, 6, 7}; meanC1 = 1.5, meanC2 = 6
– => no change.
– Aggregate dissimilarity = 0.5² + 0.5² + 1² + 1² = 2.5
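Continuing the kmeans_1d sketch from the previous slide, the same numbers can be reproduced by forcing the initial centroids to 5 and 6 instead of sampling them:

```python
points = [1, 2, 5, 6, 7]
clusters, centers = kmeans_1d(points, k=2, init=[5, 6])   # functions from the sketch above
print(clusters)                                     # [[1, 2], [5, 6, 7]]
print(centers)                                      # [1.5, 6.0]
print(within_cluster_variation(clusters, centers))  # 2.5
```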
Discussions
• Limitations:
– Means cannot be defined for categorical attributes;
– Choice of k;
– Sensitive to outliers;
– Crisp clustering
• Variants of k-Means exist:
– Using modes to deal with categorical attributes
• How about distance measures?
• Is it similar to or different from k-NN?
– With and without learning
k-Medoids
• k-Means algorithm is sensitive to outliers
– Is this true? How to prove it?
• Medoid – the most centrally located point in a
cluster, used as a representative point of the cluster.
• In contrast, a centroid is not necessarily an actual
point of the cluster.
• An example [figure: initial medoids]
Partition Around Medoids
•
PAM:
1. Given k
2. Randomly pick k instances as initial medoids
3. Assign each instance to the nearest medoid
4. Calculate the objective function
• the sum of dissimilarities of all instances to their
nearest medoids
5. Randomly select a non-medoid instance y
6. Swap a medoid x with y if the swap reduces
the objective function
7. Repeat (3-6) until no change
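A rough Python sketch of PAM as listed above, with one liberty: instead of randomly picking candidate y's (step 5), it tries every non-medoid in turn (still a greedy swap search). The function name pam and the dist parameter are illustrative.

```python
import random

def pam(points, k, dist, seed=0):
    """Partition Around Medoids, brute-force version of steps 1-7."""
    random.seed(seed)
    medoids = random.sample(range(len(points)), k)   # step 2: k random instances as medoids

    def cost(meds):
        # step 4: sum of dissimilarities of all instances to their nearest medoid
        return sum(min(dist(p, points[m]) for m in meds) for p in points)

    best = cost(medoids)
    improved = True
    while improved:                                  # step 7: repeat until no change
        improved = False
        for y in range(len(points)):                 # step 5: candidate replacement y
            if y in medoids:
                continue
            for i in range(k):                       # step 6: swap medoid i with y if it helps
                trial = medoids[:i] + [y] + medoids[i + 1:]
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
                    break
    # step 3: assign each instance to its nearest medoid
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]])) for p in points]
    return [points[m] for m in medoids], labels, best
```

Because medoids must be actual data points and the objective sums dissimilarities rather than squared distances to a mean, a single far-away outlier distorts the result much less than in k-Means.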
k-Means and k-Medoids
• The key difference lies in how they update the means
or medoids
• Both require distance calculation and reassignment
of instances
• Time complexity
– Which one is more costly?
• Dealing with outliers [figure: an outlier 100 units away]
EM (Expectation Maximization)
• Moving away from crisp clusters as in k-Means by
allowing an instance to belong to several clusters
• Finite mixtures – a statistical clustering model
– A mixture is a set of k probability distributions,
representing k clusters
– The simplest finite mixture: one numeric feature, with a
Gaussian distribution per cluster
– When k = 2, we need to estimate 5 parameters: μA, σA,
μB, σB, and pA (where pB = 1 − pA)
• EM
– Estimate the parameters using the instances
– Maximize the overall likelihood that the data came from
the mixture model
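A compact sketch of EM for that simplest mixture (one numeric feature, two Gaussian clusters A and B). The 5 parameters match the slide; the function name and the initialization choices are illustrative.

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, iters=100):
    # initial guesses for the 5 parameters: muA, sigmaA, muB, sigmaB, pA (pB = 1 - pA)
    muA, muB = min(xs), max(xs)
    sigmaA = sigmaB = (max(xs) - min(xs)) / 4 or 1.0
    pA = 0.5
    for _ in range(iters):
        # E step: soft membership - probability that each instance came from cluster A
        wA = []
        for x in xs:
            a = pA * gauss(x, muA, sigmaA)
            b = (1 - pA) * gauss(x, muB, sigmaB)
            wA.append(a / (a + b) if a + b > 0 else 0.5)
        # M step: re-estimate the parameters from the weighted instances
        sA = sum(wA)
        sB = len(xs) - sA
        muA = sum(w * x for w, x in zip(wA, xs)) / sA
        muB = sum((1 - w) * x for w, x in zip(wA, xs)) / sB
        sigmaA = math.sqrt(sum(w * (x - muA) ** 2 for w, x in zip(wA, xs)) / sA) or 1e-6
        sigmaB = math.sqrt(sum((1 - w) * (x - muB) ** 2 for w, x in zip(wA, xs)) / sB) or 1e-6
        pA = sA / len(xs)
    return muA, sigmaA, muB, sigmaB, pA
```

Unlike k-Means, every instance keeps a degree of membership in both clusters instead of a crisp assignment.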
Agglomerative
• Each object is viewed as a cluster (bottom up).
• Repeat until the number of clusters is small
enough:
– Choose the closest pair of clusters
– Merge the two into one
• Defining “closest”: centroid (cluster mean) distance,
(average) sum of pairwise distances, …
– Refer to the Evaluation part
• A dendrogram is a tree that shows the clustering
process.
Dendrogram
• Cluster 1, 2, 4, 5, 6, 7 into two clusters (centroid
distance)
[Dendrogram figure: leaves 1, 2, 4, 5, 6, 7 merged bottom-up into two clusters]
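A small Python sketch of the bottom-up procedure with centroid (mean-of-cluster) distance, run on the six points above; the function name is illustrative, and the recorded merges are exactly what a dendrogram draws.

```python
def agglomerative_centroid(points, target_k):
    """Bottom-up merging of 1-D points using centroid (mean) distance between clusters."""
    clusters = [[p] for p in points]                 # every object starts as its own cluster
    merges = []                                      # the merge history = the dendrogram
    while len(clusters) > target_k:
        # find the closest pair of clusters by the distance between their means
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(sum(clusters[i]) / len(clusters[i]) - sum(clusters[j]) / len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))    # record the merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

clusters, merges = agglomerative_centroid([1, 2, 4, 5, 6, 7], target_k=2)
print(clusters)    # [[1, 2], [4, 5, 6, 7]]
```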
An example to show different links
• Distance matrix:
      A   B   C   D   E
  A   0   1   2   2   3
  B   1   0   2   4   3
  C   2   2   0   1   5
  D   2   4   1   0   3
  E   3   3   5   3   0
• Single link
– Merge the nearest clusters, measured by
the shortest edge between the two
– (((A B) (C D)) E)
• Complete link
– Merge the nearest clusters, measured by
the longest edge between the two
– (((A B) E) (C D))
• Average link
– Merge the nearest clusters, measured by
the average edge length between the two
– (((A B) (C D)) E)
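A Python sketch that replays the merges on the distance matrix above; passing min, max, or an averaging function as the link gives single, complete, and average link respectively, and the nested tuples mirror the parenthesized groupings on the slide.

```python
labels = ["A", "B", "C", "D", "E"]
# distance matrix from the slide (symmetric, zero diagonal)
D = {("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
     ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
     ("C", "D"): 1, ("C", "E"): 5, ("D", "E"): 3}

def d(a, b):
    return 0 if a == b else D.get((a, b), D.get((b, a)))

def avg(ds):
    ds = list(ds)
    return sum(ds) / len(ds)

def hierarchical(labels, link):
    """Agglomerate down to one cluster; `link` combines the pairwise point distances."""
    clusters = [(lab, (lab,)) for lab in labels]          # (tree so far, member points)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = link(d(p, q) for p in clusters[i][1] for q in clusters[j][1])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        _, i, j = best
        clusters[i] = ((clusters[i][0], clusters[j][0]),   # merge the closest pair
                       clusters[i][1] + clusters[j][1])
        del clusters[j]
    return clusters[0][0]

print(hierarchical(labels, min))   # single link:   ((('A', 'B'), ('C', 'D')), 'E')
print(hierarchical(labels, max))   # complete link: ((('A', 'B'), 'E'), ('C', 'D'))
print(hierarchical(labels, avg))   # average link:  ((('A', 'B'), ('C', 'D')), 'E')
```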
Divisive
• All instances belong to one cluster (top-down)
• To find an optimal division at each layer (especially
the top one) is computationally prohibitive.
• One heuristic method is based on the Minimum
Spanning Tree (MST) algorithm
– Connect all instances with an MST (O(N²))
– Repeatedly cut the longest remaining edge at each
iteration, until some stopping criterion is met or until one
instance remains in each cluster.
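A Python sketch of this heuristic, assuming only a pairwise distance function: build the MST with Prim's algorithm, then cut the longest edges until the desired number of clusters remains. All names are illustrative.

```python
def mst_divisive(points, dist, k):
    """Divisive clustering: build an MST, then remove the k-1 longest edges."""
    n = len(points)
    # Prim's algorithm: grow the MST from point 0
    in_tree = {0}
    edges = []                                    # (weight, u, v) edges of the MST
    while len(in_tree) < n:
        w, u, v = min((dist(points[u], points[v]), u, v)
                      for u in in_tree for v in range(n) if v not in in_tree)
        edges.append((w, u, v))
        in_tree.add(v)
    # keep the n-k shortest edges, i.e. cut the k-1 longest ones
    kept = sorted(edges)[: n - k]
    # collect the connected components of the remaining forest
    neighbors = {i: set() for i in range(n)}
    for _, u, v in kept:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.append(points[node])
            stack.extend(neighbors[node])
        clusters.append(comp)
    return clusters

print(mst_divisive([1, 2, 5, 6, 7, 20], dist=lambda a, b: abs(a - b), k=3))
# [[1, 2], [5, 6, 7], [20]]
```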
COBWEB
• Building a conceptual hierarchy incrementally
• Each cluster has a probabilistic description
• Category Utility:
CU = Σ_k Σ_i Σ_j P(f_i = v_ij) · P(f_i = v_ij | c_k) · P(c_k | f_i = v_ij)
– summed over all categories c_k, all features f_i, and all feature values v_ij
• It attempts to maximize both the probability that
two objects in the same category have values in
common and the probability that objects in
different categories will have different property
values
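A small Python sketch that evaluates the category-utility sum above for a partition of categorical instances; the animal-like data and the helper name are made up for illustration, and the expression is taken exactly as written on the slide (some formulations additionally subtract a baseline term and divide by the number of categories).

```python
from collections import Counter

def category_utility(partition):
    """partition: list of clusters; each cluster is a list of {feature: value} dicts.
    Computes sum over k, i, j of P(fi=vij) * P(fi=vij | ck) * P(ck | fi=vij)."""
    instances = [inst for cluster in partition for inst in cluster]
    n = len(instances)
    # overall counts of each (feature, value) pair
    overall = Counter((f, v) for inst in instances for f, v in inst.items())
    total = 0.0
    for cluster in partition:
        within = Counter((f, v) for inst in cluster for f, v in inst.items())
        for (f, v), count_k in within.items():
            p_fv = overall[(f, v)] / n                 # P(fi = vij)
            p_fv_given_c = count_k / len(cluster)      # P(fi = vij | ck)
            p_c_given_fv = count_k / overall[(f, v)]   # P(ck | fi = vij)
            total += p_fv * p_fv_given_c * p_c_given_fv
    return total

data = [{"cover": "fur", "legs": "4"}, {"cover": "fur", "legs": "4"},
        {"cover": "feathers", "legs": "2"}, {"cover": "feathers", "legs": "2"}]
print(category_utility([data[:2], data[2:]]))                        # homogeneous split: 2.0
print(category_utility([[data[0], data[2]], [data[1], data[3]]]))    # mixed split: 1.0
```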
A tree of clusters produced by COBWEB: [figure omitted]
• Processing one instance at a time by choosing best among
– Placing the instance in the best existing category
– Adding a new category containing only the instance
– Merging of two existing categories into a new one and adding the
instance to that category
– Splitting of an existing category into two and placing the instance
in the best new resulting category
[Figure: under a common grandparent node, two children may be merged into one, or a parent may be split into two children, before placing the instance]
Cobweb Demo
http://kiew.cs.uni-dortmund.de:8001/mlnet/instances/81d91eaae317b2bebb
Density-based
• DBSCAN – Density-Based Spatial Clustering of
Applications with Noise
• It grows regions with sufficiently high density into
clusters and can discover clusters of arbitrary
shape in spatial databases with noise.
– Many existing clustering algorithms find spherical
shapes of clusters
• DBSCAN defines a cluster as a maximal set of
density-connected points.
• Defining density and connection
– ε-neighborhood of an object x (core object) (M, P, Q)
– MinPts objects within the ε-neighborhood (say, 3)
– directly density-reachable (Q from M, M from P)
– density-reachable (Q from P, but P not from Q) [asymmetric]
– density-connected (O, R, S) [symmetric] <for border points>
• What is the relationship between DR and DC?
[Figure: points M, P, Q, R, S, O illustrating the definitions]
• Clustering with DBSCAN
– Search for clusters by checking the ε-neighborhood of
each instance x
– If the ε-neighborhood of x contains more than MinPts objects,
create a new cluster with x as a core object
– Iteratively collect directly density-reachable objects from
these core objects and merge density-reachable clusters
– Terminate when no new point can be added to any cluster
• DBSCAN is sensitive to the density thresholds (ε, MinPts),
but it is many times faster than CLARANS
• Time complexity: O(N log N) if a spatial index is
used, O(N²) otherwise
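A brute-force Python sketch of the procedure above (hence the O(N²) behaviour without a spatial index); eps and min_pts stand for ε and MinPts, and a label of -1 marks noise. The names and the point/distance representation are illustrative.

```python
def dbscan(points, eps, min_pts, dist):
    """Label each point with a cluster id; -1 marks noise."""
    n = len(points)
    labels = [None] * n                       # None = not yet visited

    def neighbors(i):
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:              # not a core object (may become a border point later)
            labels[i] = -1
            continue
        labels[i] = cluster_id                # i is a core object: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:                          # expand by collecting density-reachable points
            j = queue.pop()
            if labels[j] == -1:               # previously marked noise -> border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also a core object: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (25, 25)]
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(dbscan(pts, eps=1.5, min_pts=3, dist=euclid))   # [0, 0, 0, 0, 1, 1, 1, -1]
```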
Dealing with Large Data
• Key ideas
– Reduce the number of instances to be maintained, while
still preserving the data distribution
– Identify relevant subspaces where clusters possibly exist
– Use summarized information to avoid repeated data
access
• Sampling
– CLARA (Clustering LARge Applications) working on
samples instead of the whole data
– CLARANS (Clustering Large Applications based on
RANdomized Search)
• Grid: STING (STatistical INformation Grid)
– Statistical parameters of higher-level cells can easily be
computed from those of lower-level cells
• Attribute-independent: count
• Attribute-dependent: mean, standard deviation, min, max
• Type of distribution: normal, uniform, exponential, or
unknown
– Irrelevant cells can be removed
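A minimal sketch, assuming each cell stores (count, mean, std, min, max), of how a parent cell's statistics can be computed from its child cells without revisiting the raw data; the dictionary layout is illustrative, and std is the population standard deviation.

```python
import math

def combine_cells(children):
    """Compute (count, mean, std, min, max) of a parent cell from its child cells."""
    n = sum(c["count"] for c in children)
    mean = sum(c["count"] * c["mean"] for c in children) / n
    # pool variances via E[x^2] = var + mean^2, accumulated over the children
    ex2 = sum(c["count"] * (c["std"] ** 2 + c["mean"] ** 2) for c in children) / n
    std = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return {"count": n, "mean": mean, "std": std,
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}

children = [{"count": 10, "mean": 2.0, "std": 0.5, "min": 1.0, "max": 3.0},
            {"count": 30, "mean": 6.0, "std": 1.0, "min": 4.0, "max": 9.0}]
print(combine_cells(children))   # count 40, mean 5.0, std ~1.95, min 1.0, max 9.0
```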
Representatives
• BIRCH, using Clustering Features (CF) and a CF tree
– A clustering feature is a triplet summarizing a sub-cluster of
instances: (N, LS, SS)
• N – the number of instances, LS – the linear sum, SS – the square sum
– Two thresholds: branching factor (the max number of
children per non-leaf node) and a diameter threshold
– Two phases
1. Build an initial in-memory CF tree
2. Apply a clustering algorithm to cluster the leaf nodes of the CF tree
• CURE (Clustering Using REpresentatives) is another
example
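A small Python sketch of the CF triplet for 1-D values, showing the additivity BIRCH relies on when it absorbs instances or merges sub-clusters; the class name and the radius helper are illustrative (LS and SS become vectors for multi-dimensional data).

```python
import math

class CF:
    """Clustering Feature (N, LS, SS) for 1-D values."""
    def __init__(self, values=()):
        self.n = len(values)
        self.ls = sum(values)                 # linear sum of the instances
        self.ss = sum(v * v for v in values)  # square sum of the instances

    def merge(self, other):
        """CF additivity: the CF of a merged sub-cluster is the component-wise sum."""
        out = CF()
        out.n, out.ls, out.ss = self.n + other.n, self.ls + other.ls, self.ss + other.ss
        return out

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # root-mean-square distance of members to the centroid: sqrt(SS/N - (LS/N)^2)
        return math.sqrt(max(self.ss / self.n - (self.ls / self.n) ** 2, 0.0))

a, b = CF([1, 2, 5]), CF([6, 7])
merged = a.merge(b)
print(merged.n, merged.centroid(), merged.radius())   # 5 4.2 2.315...
```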
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf threshold L = 6 (max diameter of sub-clusters at leaf nodes). The root and non-leaf nodes hold CF entries (CF1, CF2, …), each pointing to a child node; leaf nodes hold CF entries and are chained together with prev/next pointers.]
• Taking advantage of a monotonicity property of density
– If a region is dense in a higher-dimensional subspace, its
projections onto lower-dimensional subspaces are also dense
– CLIQUE (CLustering In QUEst)
• With high-dimensional data, most subspaces are void
• Using this property, we can start from dense
lower-dimensional units and work upward
• CLIQUE is a density-based method that can automatically find
subspaces of the highest dimensionality such that high-density
clusters exist in those subspaces
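A rough Python sketch of the bottom-up idea: grid each dimension, keep the dense 1-D units, and form candidate 2-D units only from dense 1-D units. The grid width and density threshold are illustrative parameters, and this covers only the first two levels of the level-wise search.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, width, threshold):
    """Find dense 1-D units per dimension, then dense 2-D units built only from them."""
    dims = range(len(points[0]))
    # 1-D pass: count points per (dimension, grid cell) and keep the dense ones
    counts_1d = Counter((d, int(p[d] // width)) for p in points for d in dims)
    dense_1d = {u for u, c in counts_1d.items() if c >= threshold}
    # 2-D pass: a 2-D unit is a candidate only if both of its 1-D projections are dense
    counts_2d = Counter()
    for p in points:
        for d1, d2 in combinations(dims, 2):
            u1, u2 = (d1, int(p[d1] // width)), (d2, int(p[d2] // width))
            if u1 in dense_1d and u2 in dense_1d:
                counts_2d[(u1, u2)] += 1
    dense_2d = {u for u, c in counts_2d.items() if c >= threshold}
    return dense_1d, dense_2d

pts = [(1.0, 1.2), (1.1, 1.4), (1.3, 1.1), (5.0, 9.0), (5.2, 1.3)]
print(dense_units(pts, width=1.0, threshold=3))
# only the 2-D unit around (1, 1) is dense; the scattered points never form a candidate
```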
Drawbacks of Distance-Based Method
• Drawbacks of square-error based clustering methods
– They consider only one point as the representative of a cluster
– They work well only for clusters that are convex in shape and similar in
size and density, and only when k can be reasonably estimated
Chameleon
• A hierarchical clustering algorithm using
dynamic modeling
– Motivated by observations on the weaknesses of purely
distance-based methods
• Basic steps:
– Build a k-nearest-neighbor graph
– Partition the graph
– Merge the “strongly connected” partitions, based on the
strength of the connections between partitions
Summary
• There are many clustering algorithms
• Good clustering algorithms maximize inter-cluster
dissimilarity and intra-cluster similarity
• Without prior knowledge, it is difficult to choose
the best clustering algorithm.
• Clustering is an important tool for outlier analysis.