Data Mining
Cluster Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 8, Introduction to Data Mining
by Tan, Steinbach, Kumar
What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

(Figure: inter-cluster distances are maximized; intra-cluster distances are minimized.)
Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Applications of Cluster Analysis

• Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Discovered clusters and their industry group:
Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

• Summarization
– Reduce the size of large data sets

(Figure: clustering precipitation in Australia.)
Requirements of Clustering in Data Mining

• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Measure the Quality of Clustering

• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
• There is a separate “quality” function that measures the “goodness” of a cluster
• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
• Weights should be associated with different variables based on the application and data semantics
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective
Data Structures

• Data matrix (two modes):
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

• Dissimilarity matrix (one mode):
\[
\begin{bmatrix}
0      &        &        &        & \\
d(2,1) & 0      &        &        & \\
d(3,1) & d(3,2) & 0      &        & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
Type of data in clustering analysis

• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
Interval-valued variables

• Standardize data
– Calculate the mean absolute deviation:
\[
s_f = \tfrac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
\]
where
\[
m_f = \tfrac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
\]
– Calculate the standardized measurement (z-score):
\[
z_{if} = \frac{x_{if} - m_f}{s_f}
\]
• Using the mean absolute deviation is more robust than using the standard deviation
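As a concrete illustration, here is a minimal Python sketch of this standardization (not from the original slides; the function name is illustrative):

```python
# Minimal sketch of MAD-based standardization, as defined above.

def standardize(values):
    """Return z_if = (x_if - m_f) / s_f, where m_f is the column mean
    and s_f its mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                            # column mean
    s_f = sum(abs(x - m_f) for x in values) / n      # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

print(standardize([2.0, 4.0, 4.0, 4.0, 6.0]))  # [-2.5, 0.0, 0.0, 0.0, 2.5]
```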
Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects
• Some popular ones include the Minkowski distance:
\[
d(i, j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\,}
\]
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:
\[
d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
\]
Similarity and Dissimilarity Between Objects (Cont.)

• If q = 2, d is the Euclidean distance:
\[
d(i, j) = \sqrt{\,|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2\,}
\]
– Properties
• d(i, j) ≥ 0
• d(i, i) = 0
• d(i, j) = d(j, i)
• d(i, j) ≤ d(i, k) + d(k, j)
• Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
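A minimal sketch of these distances (assumed Python helper, not from the slides):

```python
# Minkowski distance for two p-dimensional objects; q = 1 gives the
# Manhattan distance and q = 2 the Euclidean distance.

def minkowski(x, y, q=2):
    """d(i, j) = (sum_f |x_f - y_f|^q)^(1/q)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(i, j, q=1))  # Manhattan: 7.0
print(minkowski(i, j, q=2))  # Euclidean: 5.0
```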
Binary Variables

• A contingency table for binary data:

              Object j
                1        0        sum
  Object i  1   a        b        a + b
            0   c        d        c + d
          sum   a + c    b + d    p

• Distance measure for symmetric binary variables:
\[
d(i, j) = \frac{b + c}{a + b + c + d}
\]
• Distance measure for asymmetric binary variables:
\[
d(i, j) = \frac{b + c}{a + b + c}
\]
• Jaccard coefficient (similarity measure for asymmetric binary variables):
\[
sim_{Jaccard}(i, j) = \frac{a}{a + b + c}
\]
Dissimilarity between Binary Variables

• Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

– gender is a symmetric attribute
– the remaining attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0
01
 0.33
2 01
11
d ( jack , jim ) 
 0.67
111
1 2
d ( jim , mary ) 
 0.75
11 2
d ( jack , mary ) 
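The numbers above can be reproduced with a small sketch (hypothetical helper, not from the slides) that codes Y/P as 1 and N as 0 over the six asymmetric attributes:

```python
# Asymmetric binary dissimilarity d(i, j) = (b + c) / (a + b + c),
# applied to the Fever, Cough, Test-1..Test-4 columns above (Y/P -> 1, N -> 0).

def asym_binary_dissimilarity(x, y):
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # 1-1 matches
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(asym_binary_dissimilarity(jack, mary))  # 0.33
print(asym_binary_dissimilarity(jack, jim))   # 0.67
print(asym_binary_dissimilarity(jim, mary))   # 0.75
```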
Nominal Variables

• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
– m: # of matches, p: total # of variables
\[
d(i, j) = \frac{p - m}{p}
\]
• Method 2: use a large number of binary variables
– creating a new binary variable for each of the M nominal states
Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables
– replace x_if by its rank r_if ∈ {1, …, M_f}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
\[
z_{if} = \frac{r_{if} - 1}{M_f - 1}
\]
– compute the dissimilarity using methods for interval-scaled variables
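For example, a sketch of this rank mapping (illustrative helper, not from the slides):

```python
# Map an ordinal rank r_if in {1, ..., M_f} onto [0, 1] so it can be
# treated like an interval-scaled variable.

def ordinal_to_interval(rank, m_f):
    return (rank - 1) / (m_f - 1)

# e.g. a 5-level rating scale
print([ordinal_to_interval(r, 5) for r in (1, 3, 5)])  # [0.0, 0.5, 1.0]
```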
Ratio-Scaled Variables

• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt)
• Methods:
– treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
– apply a logarithmic transformation: y_if = log(x_if)
– treat them as continuous ordinal data and treat their rank as interval-scaled
Variables of Mixed Types

• A database may contain all six types of variables
– symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
• One may use a weighted formula to combine their effects:
\[
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
\]
– f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
– f is interval-based: use the normalized distance
– f is ordinal or ratio-scaled: compute the ranks r_if, set
\[
z_{if} = \frac{r_{if} - 1}{M_f - 1},
\]
and treat z_if as interval-scaled
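A minimal sketch of the weighted combination (assumed Python helper; the per-variable distances and indicators are taken as precomputed inputs, which the slides do not spell out):

```python
# d(i, j) = sum_f delta_ij^(f) * d_ij^(f) / sum_f delta_ij^(f),
# where d_f holds the per-variable distances and delta_f the 0/1 indicators
# marking whether variable f contributes (e.g. not missing).

def mixed_dissimilarity(d_f, delta_f):
    den = sum(delta_f)
    return sum(delta * d for delta, d in zip(delta_f, d_f)) / den if den else 0.0

# three contributing variables: a nominal mismatch, a normalized interval
# distance, and a rank-based ordinal distance
print(mixed_dissimilarity([1.0, 0.4, 0.25], [1, 1, 1]))  # 0.55
```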
Vector Objects

• Vector objects: keywords in documents, gene features in micro-arrays, etc.
• Broad applications: information retrieval, biologic taxonomy, etc.
• Cosine measure
• A variant: Tanimoto coefficient
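A sketch of both measures for dense vectors (not from the slides; the Tanimoto form shown is the common dot-product variant and is an assumption here):

```python
import math

# Cosine measure and Tanimoto coefficient for two vector objects,
# e.g. term-frequency vectors of documents.

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

d1, d2 = [5, 0, 3, 0, 2], [3, 0, 2, 0, 1]
print(cosine(d1, d2), tanimoto(d1, d2))
```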
Types of Clusterings

• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Partitional Clustering

(Figure: original points and a partitional clustering of them.)
Hierarchical Clustering

(Figure: a traditional hierarchical clustering of points p1–p4 with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram.)
Types of Clusters

• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or conceptual
• Described by an objective function
Types of Clusters: Well-Separated

• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

(Figure: 3 well-separated clusters.)
Types of Clusters: Center-Based

• Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of the cluster than to the center of any other cluster
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster

(Figure: 4 center-based clusters.)
Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or Transitive)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

(Figure: 8 contiguous clusters.)
Types of Clusters: Density-Based

• Density-based
– A cluster is a dense region of points that is separated from other regions of high density by low-density regions.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.

(Figure: 6 density-based clusters.)
Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters
– Finds clusters that share some common property or represent a particular concept.

(Figure: 2 overlapping circles.)
Clustering Algorithms

• K-means and its variants
• Hierarchical clustering
• Density-based clustering
K-means Clustering

• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
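A minimal pure-Python sketch of that basic algorithm (illustrative only, not the book's reference implementation): pick K initial centroids, assign each point to the closest centroid, recompute centroids, and repeat until the assignments stop changing.

```python
import math, random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means on a list of n-dimensional tuples."""
    random.seed(seed)
    centroids = random.sample(points, k)            # initial centroids
    for _ in range(max_iter):
        # assignment step: index of the closest centroid for each point
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # update step: each centroid becomes the mean of its assigned points
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members))
                                 if members else centroids[c])
        if new_centroids == centroids:              # converged
            break
        centroids = new_centroids
    return labels, centroids

pts = [(1, 1), (1.5, 2), (1, 0.6), (8, 8), (9, 11), (8, 9)]
print(kmeans(pts, k=2))
```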
Comments on the K-Means Method

• Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
• Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
• Weakness
– Applicable only when the mean is defined; what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable for discovering clusters with non-convex shapes
Evaluating K-means Clusters

• Most common measure is the Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
– To get SSE, we square these errors and sum them:
\[
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)
\]
– x is a data point in cluster C_i and m_i is the representative point for cluster C_i
• It can be shown that m_i corresponds to the center (mean) of the cluster
– Given two clusterings, we can choose the one with the smaller error
– One easy way to reduce SSE is to increase K, the number of clusters
• A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
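A short sketch of this measure (hypothetical helper, not from the slides), given the points, their cluster labels, and the centroids m_i:

```python
import math

def sse(points, labels, centroids):
    """SSE = sum over clusters i of sum over x in C_i of dist^2(m_i, x)."""
    return sum(math.dist(centroids[l], p) ** 2 for p, l in zip(points, labels))

# e.g. reuse the labels and centroids returned by the k-means sketch above
```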
Limitations of K-means

• K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes
• K-means has problems when the data contains outliers.
Limitations of K-means: Differing Sizes

(Figure: original points vs. K-means with 3 clusters.)

Limitations of K-means: Differing Density

(Figure: original points vs. K-means with 3 clusters.)

Limitations of K-means: Non-globular Shapes

(Figure: original points vs. K-means with 2 clusters.)
Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits

(Figure: a nested clustering of six points and the corresponding dendrogram.)
Strengths of Hierarchical Clustering

• Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• They may correspond to meaningful taxonomies
– Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering

• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
Hierarchical Clustering

• Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

(Figure: over steps 0–4, agglomerative clustering (AGNES) merges the singleton clusters a, b, c, d, e into ab, de, then cde, and finally abcde; divisive clustering (DIANA) performs the same splits in the reverse direction.)
Agglomerative Clustering Algorithm

• More popular hierarchical clustering technique
• Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between clusters distinguish the different algorithms
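A naive O(N^3) sketch of this algorithm, using MIN (single link) as the cluster proximity (illustrative only, not an efficient implementation):

```python
import math

def agglomerative(points, num_clusters=1):
    clusters = [[p] for p in points]          # step 2: each point is a cluster
    while len(clusters) > num_clusters:       # step 3: repeat ...
        # step 4: find the two closest clusters under MIN (single link)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge the pair; proximities
                                              # (step 5) are recomputed above
    return clusters

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], num_clusters=2))
```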
How to Define Inter-Cluster Similarity

(Figure: two candidate clusters of points p1–p5 and the corresponding proximity matrix, with the inter-cluster similarity left as a question.)

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward’s Method uses squared error
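Sketches of four of these definitions for two clusters given as point lists (assumed helpers, not from the slides):

```python
import math

def min_link(c1, c2):        # MIN (single link)
    return min(math.dist(a, b) for a in c1 for b in c2)

def max_link(c1, c2):        # MAX (complete link)
    return max(math.dist(a, b) for a in c1 for b in c2)

def group_average(c1, c2):   # average pairwise distance
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid_distance(c1, c2):
    mean = lambda c: tuple(sum(v) / len(c) for v in zip(*c))
    return math.dist(mean(c1), mean(c2))
```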
Hierarchical Clustering: Group Average

• Compromise between Single and Complete Link
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
Hierarchical Clustering: Time and Space Requirements

• O(N^2) space, since it uses the proximity matrix
– N is the number of points
• O(N^3) time in many cases
– There are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched
– Complexity can be reduced to O(N^2 log(N)) time for some approaches
Hierarchical Clustering: Problems and Limitations

• Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
– Sensitivity to noise and outliers
– Difficulty handling different sized clusters and convex shapes
– Breaking large clusters
Cluster Validity

• For supervised classification we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall
• For cluster analysis, the analogous question is: how do we evaluate the “goodness” of the resulting clusters?
• But “clusters are in the eye of the beholder”!
• Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Quality: What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Internal Measures: Cohesion and Separation

• Cluster Cohesion: measures how closely related the objects in a cluster are
– Example: SSE
• Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
– Cohesion is measured by the within-cluster sum of squares (SSE):
\[
WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2
\]
– Separation is measured by the between-cluster sum of squares:
\[
BSS = \sum_{i} |C_i| \, (m - m_i)^2
\]
– where |C_i| is the size of cluster i, m_i is its mean, and m is the overall mean
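A short sketch computing both quantities for a clustering given as lists of points (illustrative helper; m is taken as the overall mean, the usual convention):

```python
import math

def mean(points):
    return tuple(sum(v) / len(points) for v in zip(*points))

def wss_bss(clusters):
    m = mean([p for c in clusters for p in c])   # overall mean
    wss = sum(math.dist(p, mean(c)) ** 2 for c in clusters for p in c)
    bss = sum(len(c) * math.dist(m, mean(c)) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([[(1, 1), (2, 2)], [(8, 8), (9, 9)]]))  # (2.0, 98.0)
```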
Internal Measures: Cohesion and Separation

• A proximity-graph-based approach can also be used for cohesion and separation.
– Cluster cohesion is the sum of the weights of all links within a cluster.
– Cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.

(Figure: cohesion as within-cluster links, separation as between-cluster links.)