K-Means Cluster Analysis
Chapter 3
PPDM Class
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
  – Inter-cluster distances are maximized
  – Intra-cluster distances are minimized

Applications of Cluster Analysis

• Understanding
  – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

  Discovered Clusters | Industry Group
  1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN | Technology1-DOWN
  2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN | Technology2-DOWN
  3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN | Financial-DOWN
  4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP | Oil-UP

• Summarization
  – Reduce the size of large data sets
  [Figure: clustering precipitation in Australia]

What is not Cluster Analysis?

• Supervised classification
  – Have class label information
• Simple segmentation
  – Dividing students into different registration groups alphabetically, by last name
• Results of a query
  – Groupings are a result of an external specification
• Graph partitioning
  – Some mutual relevance and synergy, but areas are not identical

Notion of a Cluster can be Ambiguous

[Figure: how many clusters? The same set of points can be seen as two clusters, four clusters, or six clusters.]

Types of Clusterings

• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional Clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree

Partitional Clustering

[Figure: Original Points and A Partitional Clustering]

Hierarchical Clustering

[Figure: points p1–p4 shown as a Traditional Hierarchical Clustering with its Traditional Dendrogram, and as a Non-traditional Hierarchical Clustering with its Non-traditional Dendrogram]

Other Distinctions Between Sets of Clusters

• Exclusive versus non-exclusive
  – In non-exclusive clusterings, points may belong to multiple clusters
  – Can represent multiple classes or 'border' points
• Fuzzy versus non-fuzzy
  – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
  – Weights must sum to 1
  – Probabilistic clustering has similar characteristics
• Partial versus complete
  – In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
  – Clusters of widely different sizes, shapes, and densities

Types of Clusters

• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or Conceptual
• Described by an Objective Function

Types of Clusters: Well-Separated

• Well-Separated Clusters:
  – A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

[Figure: 3 well-separated clusters]

Types of Clusters: Center-Based

• Center-based
  – A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of a cluster than to the center of any other cluster
  – The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster

[Figure: 4 center-based clusters]

Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or Transitive)
  – A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

[Figure: 8 contiguous clusters]

Types of Clusters: Density-Based

• Density-based
  – A cluster is a dense region of points, separated from other regions of high density by low-density regions.
  – Used when the clusters are irregular or intertwined, and when noise and outliers are present.

[Figure: 6 density-based clusters]

Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters
  – Finds clusters that share some common property or represent a particular concept.

[Figure: 2 overlapping circles]

Types of Clusters: Objective Function

• Clusters Defined by an Objective Function
  – Finds clusters that minimize or maximize an objective function.
  – Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters by using the given objective function. (NP-hard)
  – Can have global or local objectives.
    · Hierarchical clustering algorithms typically have local objectives
    · Partitional algorithms typically have global objectives
  – A variation of the global objective function approach is to fit the data to a parameterized model.
    · Parameters for the model are determined from the data.
    · Mixture models assume that the data is a 'mixture' of a number of statistical distributions.

Types of Clusters: Objective Function …

• Map the clustering problem to a different domain and solve a related problem in that domain
  – Proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points
  – Clustering is equivalent to breaking the graph into connected components, one for each cluster.
  – Want to minimize the edge weight between clusters and maximize the edge weight within clusters

Characteristics of the Input Data Are Important

• Type of proximity or density measure
  – This is a derived measure, but central to clustering
• Sparseness
  – Dictates type of similarity
  – Adds to efficiency
• Attribute type
  – Dictates type of similarity
• Type of Data
  – Dictates type of similarity
  – Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and Outliers
• Type of Distribution

Clustering Algorithms

• K-means and its variants
• Hierarchical clustering
• Density-based clustering

K-means Clustering

• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
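The basic algorithm (select K initial centroids; assign each point to its closest centroid; recompute centroids; repeat until centroids stop changing) can be sketched in Python roughly as follows. This is a minimal illustration; the function and parameter names are mine, not from the slides:

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Basic K-means: assign points to nearest centroid, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # Choose K initial centroids randomly from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: index of the closest centroid for each point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # Converged: centroids no longer change.
        centroids = new_centroids
    return centroids, labels
```

On two well-separated blobs this reliably recovers the blobs; as the later slides stress, the result for harder data depends on the random initial centroids.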

K-means Clustering – Details

• Initial centroids are often chosen randomly.
  – Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
  – Often the stopping condition is changed to 'Until relatively few points change clusters'
• Complexity is O( n * K * I * d )
  – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

Two different K-means Clusterings

[Figure: the same set of Original Points clustered two ways, an Optimal Clustering and a Sub-optimal Clustering]

Importance of Choosing Initial Centroids

[Figure: K-means iterations 1 through 6 for one choice of initial centroids]

Importance of Choosing Initial Centroids

[Figure: the clustering at each of iterations 1 through 6; with these initial centroids, K-means reaches the optimal clustering]

Evaluating K-means Clusters

• Most common measure is Sum of Squared Error (SSE)
  – For each point, the error is the distance to the nearest cluster
  – To get SSE, we square these errors and sum them:

        SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist²(m_i, x)

  – x is a data point in cluster C_i and m_i is the representative point for cluster C_i
    · Can show that m_i corresponds to the center (mean) of the cluster
  – Given two clusterings, we can choose the one with the smallest error
  – One easy way to reduce SSE is to increase K, the number of clusters
    · A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
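Given points, labels, and centroids (from any K-means run), the SSE formula above is a few lines of code; the names here are illustrative, and Euclidean distance is assumed:

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster's centroid."""
    diffs = points - centroids[labels]  # per-point offset from its own centroid
    return float((diffs ** 2).sum())    # sum of squared Euclidean errors
```

Comparing two clusterings with the same K, the one with the smaller SSE is the tighter one.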

Importance of Choosing Initial Centroids …

[Figure: K-means iterations 1 through 5 for a different choice of initial centroids]

Importance of Choosing Initial Centroids …

[Figure: the clustering at each of iterations 1 through 5; with these initial centroids, K-means ends in a sub-optimal clustering]

Problems with Selecting Initial Points

• If there are K 'real' clusters then the chance of selecting one centroid from each cluster is small.
  – Chance is relatively small when K is large
  – If clusters are the same size, n, then

        P = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids)
          = (K! * n^K) / (K*n)^K = K! / K^K

  – For example, if K = 10, then probability = 10!/10^10 = 0.00036
  – Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't
  – Consider an example of five pairs of clusters
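The quoted 0.00036 can be checked directly; this snippet just evaluates K!/K^K, the simplified form of the probability above, with an illustrative function name:

```python
import math

def prob_one_centroid_per_cluster(k):
    """Probability that K randomly chosen initial centroids land in K
    equal-size clusters exactly once each: K! * n^K / (K*n)^K = K! / K^K."""
    return math.factorial(k) / k ** k

print(round(prob_one_centroid_per_cluster(10), 5))  # 0.00036
```

The probability falls off rapidly with K, which is why multiple random restarts alone stop being a reliable fix for large K.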

10 Clusters Example

[Figure: iteration 4 of K-means on five pairs of clusters]
Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

[Figure: the clustering at each of iterations 1 through 4]
Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

[Figure: iteration 4 of K-means on five pairs of clusters]
Starting with some pairs of clusters having three initial centroids, while others have only one.

10 Clusters Example

[Figure: the clustering at each of iterations 1 through 4]
Starting with some pairs of clusters having three initial centroids, while others have only one.

Solutions to Initial Centroids Problem

• Multiple runs
  – Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial centroids
• Select more than k initial centroids and then select among these initial centroids
  – Select most widely separated
• Postprocessing
• Bisecting K-means
  – Not as susceptible to initialization issues

Handling Empty Clusters

• Basic K-means algorithm can yield empty clusters
• Several strategies for choosing a replacement centroid
  – Choose the point that contributes most to SSE
  – Choose a point from the cluster with the highest SSE
  – If there are several empty clusters, the above can be repeated several times.
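The first strategy above, re-seeding an empty cluster with the point that contributes most to SSE, might look like this (illustrative names, Euclidean distance assumed):

```python
import numpy as np

def replace_empty_centroid(points, labels, centroids, empty_idx):
    """Re-seed an empty cluster's centroid with the point farthest from its
    own centroid, i.e. the single biggest contributor to SSE."""
    sq_err = ((points - centroids[labels]) ** 2).sum(axis=1)  # per-point squared error
    worst = int(sq_err.argmax())              # biggest SSE contributor
    centroids = centroids.copy()
    centroids[empty_idx] = points[worst]      # that point becomes the new centroid
    labels = labels.copy()
    labels[worst] = empty_idx                 # move it into the revived cluster
    return centroids, labels
```

Removing the worst point both fills the empty cluster and reduces total SSE, which is why this is a natural default.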

Updating Centers Incrementally

• In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid
• An alternative is to update the centroids after each assignment (incremental approach)
  – Each assignment updates zero or two centroids
  – More expensive
  – Introduces an order dependency
  – Never get an empty cluster
  – Can use "weights" to change the impact
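When a point changes clusters, only the losing and gaining centroids move, and each can be updated in O(d) from a running count rather than recomputed from scratch. A sketch of the update for the gaining cluster (illustrative, not from the slides):

```python
import numpy as np

def add_point(centroid, count, x):
    """Incrementally fold point x into a centroid that currently averages
    `count` points: new mean = old mean + (x - old mean) / (count + 1)."""
    count += 1
    centroid = centroid + (x - centroid) / count
    return centroid, count
```

The losing cluster uses the symmetric update with count - 1; weighting the correction term is one way to implement the "weights" mentioned above.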

Pre-processing and Post-processing

• Pre-processing
  – Normalize the data
  – Eliminate outliers
• Post-processing
  – Eliminate small clusters that may represent outliers
  – Split 'loose' clusters, i.e., clusters with relatively high SSE
  – Merge clusters that are 'close' and that have relatively low SSE
  – Can use these steps during the clustering process
    · ISODATA

Bisecting K-means

• Bisecting K-means algorithm
  – Variant of K-means that can produce a partitional or a hierarchical clustering
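The standard form of the algorithm starts with all points in one cluster and repeatedly splits a chosen cluster with 2-means until K clusters remain. A minimal sketch, with hypothetical helper names, that always splits the cluster with the largest SSE:

```python
import numpy as np

def two_means(points, seed=0):
    """One K-means run with K = 2; returns a boolean mask for one half."""
    rng = np.random.default_rng(seed)
    c = points[rng.choice(len(points), size=2, replace=False)]
    for _ in range(50):
        mask = ((points - c[0]) ** 2).sum(1) <= ((points - c[1]) ** 2).sum(1)
        new_c = np.array([points[mask].mean(0), points[~mask].mean(0)])
        if np.allclose(new_c, c):
            break  # centroids stable
        c = new_c
    return mask

def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters."""
    clusters = [points]
    while len(clusters) < k:
        sses = [((p - p.mean(0)) ** 2).sum() for p in clusters]
        target = clusters.pop(int(np.argmax(sses)))  # loosest cluster
        mask = two_means(target)
        clusters += [target[mask], target[~mask]]
    return clusters
```

Recording the sequence of splits yields a hierarchy; keeping only the final list of clusters yields a partitional clustering, matching the bullet above. In practice the split step is often repeated several times and the lowest-SSE split kept.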

Bisecting K-means Example

[Figure: bisecting K-means splitting the data step by step]

Limitations of K-means

• K-means has problems when clusters are of differing
  – Sizes
  – Densities
  – Non-globular shapes
• K-means has problems when the data contains outliers.

Limitations of K-means: Differing Sizes

[Figure: Original Points vs. K-means (3 Clusters)]

Limitations of K-means: Differing Density

[Figure: Original Points vs. K-means (3 Clusters)]

Limitations of K-means: Non-globular Shapes

[Figure: Original Points vs. K-means (2 Clusters)]

Overcoming K-means Limitations

[Figure: Original Points vs. K-means Clusters]
One solution is to use many clusters.
Find parts of clusters, but need to put together.

Overcoming K-means Limitations

[Figure: Original Points vs. K-means Clusters]

Overcoming K-means Limitations

[Figure: Original Points vs. K-means Clusters]