4. Cluster Analysis
• Hierarchical clustering
• k-means clustering
• Classification maximum likelihood (model-based clustering)
• How many clusters?
Advanced Course in Statistics. Lecturer: Amparo Baíllo
The basic objective in cluster analysis is to discover natural
homogeneous groupings (clusters) in the observed vectors
x1 , . . . , xn . There is no dependent random variable: none of the
components of the observed vector X is treated differently from
the others. No assumptions are made on the number of groups.
The notion of a cluster cannot be precisely defined, which is why
there are many clustering models and algorithms. Grouping is done
on the basis of similarities or distances (dissimilarities) between the
observations.
Data clustering has been used for three main purposes:
• Underlying structure: to gain insight into data, generate
hypotheses, detect anomalies, and identify salient features.
• Natural classification: to identify the degree of similarity among
individuals.
• Compression: as a method for summarizing the data through
cluster prototypes.
Cluster analysis is also called unsupervised classification or
unsupervised pattern recognition to distinguish it from discriminant
analysis (supervised classification).
A good review of the historical evolution of clustering procedures and of current perspectives is Jain (2010).
Clustering methods can be organized into the following taxonomy (reconstructed from the diagram on this slide):
• Hierarchical: divisive, agglomerative.
• Partitional: centroid-based (k-means), model-based, graph-theoretic, spectral, Bayesian (decision-based, nonparametric).
4.1. Hierarchical clustering
Hierarchical clustering seeks to build a hierarchy of clusters: higher levels of the hierarchy contain the lower ones, so the procedure generates a nested sequence of clusterings. The nesting can be built from the bottom up or from the top down.
Agglomerative hierarchical methods start with the individual data
points as singleton clusters (initially there are as many clusters as
objects). Then the most similar elements are grouped, and
afterwards the nearest or most similar of these groups are merged.
This is repeated until, eventually, all the subgroups are merged into
a single cluster. It is a “bottom up” approach: each observation
starts in its own cluster, and pairs of clusters are merged as one
moves up the hierarchy.
Divisive hierarchical methods start from the single, global group
containing all the data points, which is divided into two subgroups
such that the objects in one subgroup are far from the objects in
the other. These subgroups are then further split into dissimilar
subgroups. The process is repeated until each individual in the
sample forms a group. It is a “top down” approach: all
observations start in one cluster, and splits are performed as one
moves down the hierarchy.
Divisive methods are often more computationally demanding than agglomerative ones because, at each step, all possible ways of splitting a cluster into two must in principle be considered. For a description of
divisive hierarchical methods, see Everitt et al. (2011), Kaufman
and Rousseeuw (2005).
It is conventional to limit all merges to two input clusters and all
splits to two output clusters.
Hierarchical clustering methods work for many types of variables
(nominal, ordinal, quantitative,...). The original observations are
not needed: only the matrix of pairwise dissimilarities.
In order to decide which clusters should be combined (for
agglomerative), or where a cluster should be split (for divisive), a
measure of dissimilarity between sets of observations is required.
The choice of a similarity measure is subjective and depends, for
instance, on the nature of the variables. For numerical variables,
we can consider the Minkowski metric:

d(x, y) = \left( \sum_{j=1}^{p} |x_j - y_j|^m \right)^{1/m}.
For m = 1, d(x, y) is the city-block or Manhattan metric. For m = 2, d is the Euclidean metric, the one most commonly used in clustering.
Once the dissimilarity between individual observations has been chosen, we must define the dissimilarity between groups of observations.
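As a quick, hedged illustration (the two-point toy matrix below is an assumption, not part of any example in these notes), R's dist() function implements these metrics:

x = matrix(c(0, 0,
             3, 4), nrow = 2, byrow = TRUE)   # two assumed toy points
dist(x, method = "manhattan")                 # m = 1: city-block distance, 3 + 4 = 7
dist(x, method = "euclidean")                 # m = 2: Euclidean distance, sqrt(3^2 + 4^2) = 5
dist(x, method = "minkowski", p = 3)          # general m (here m = 3)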
The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between individual objects. We list three well-known linkage criteria, with the cluster distances they produce in the slide's example of two clusters {1, 2} and {3, 4, 5}:
• Single linkage: the distance between two groups is the distance between their nearest members → minimum distance or nearest neighbour clustering. Cluster distance = d_{24}.
• Complete linkage: the distance between two groups is the distance between their farthest members → maximum distance or farthest neighbour clustering. Cluster distance = d_{15}.
• Average linkage: the distance between two groups is the average distance between their members → mean or average distance clustering. Cluster distance = (d_{13} + d_{14} + d_{15} + d_{23} + d_{24} + d_{25}) / 6.
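As an illustrative sketch (the simulated points below are an assumption, not data used elsewhere in these notes), the three linkage criteria can be compared in R on the same dissimilarity matrix:

set.seed(1)
x = matrix(rnorm(20), ncol = 2)                  # 10 toy points in R^2
Dtoy = dist(x)
hc.single   = hclust(Dtoy, method = "single")    # nearest neighbour linkage
hc.complete = hclust(Dtoy, method = "complete")  # farthest neighbour linkage
hc.average  = hclust(Dtoy, method = "average")   # average linkage
par(mfrow = c(1, 3))
plot(hc.single); plot(hc.complete); plot(hc.average)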
Agglomerative hierarchical clustering algorithm
1. We have n observations x1 , . . . , xn . Start with n clusters, each
containing one observation, and an n × n matrix D of distances
or similarities.
2. Search the matrix D for the nearest pair of clusters (say, U and
V ).
3. Merge clusters U and V into a new cluster UV . Update the
entries in the distance matrix by deleting the rows and columns
corresponding to clusters U and V and adding a row and
column giving the distance between cluster UV and the rest of
clusters.
4. Repeat Steps 2 and 3 until all the observations are in a single
cluster. Record the identity of the merged clusters and the
distances or similarities at which the mergers take place.
The results of any hierarchical method are displayed in a
dendrogram, a tree-like diagram whose branches represent clusters.
The positions of the nodes along the distance axis indicate the distances at which the mergers take place.
Example 4.1. Data on protein consumption in twenty-five
European countries for nine food groups (Weber 1973).
Table = read.table('ProteinConsumptionEurope.txt', header=TRUE)
c = ncol(Table)
Data = Table[,2:c]
D = dist(Data)

(Printing D displays the lower triangle of the 25 × 25 matrix of pairwise Euclidean distances between the countries; the full numerical output is omitted here.)
hc = hclust(D, method='complete')
plot(hc)

(The plot is a cluster dendrogram of the 25 countries, labelled by observation number, with the Height axis showing the distance at which each merger takes place.)
hc = hclust(D, method='complete')
plot(hc, labels=Table[,1])

(Cluster dendrogram of the 25 countries labelled by name. Leaf order: Romania, Bulgaria, Yugoslavia, Finland, Norway, Denmark, Sweden, UK, Belgium, France, Austria, Ireland, Switzerland, Netherlands, WGermany, EGermany, Portugal, Spain, Hungary, Czechoslovakia, Poland, Albania, USSR, Greece, Italy.)
Some results are surprising. We standardize the data and carry out
the analysis again.
n = nrow(Data)   # sample size (needed below)
S = var(Data)
m = colMeans(Data)
mMat = matrix(rep(m, times=n), nrow=n, byrow=T)
MinusSqrtD = diag(diag(S)^(-0.5))
DataStandar = as.matrix(Data - mMat) %*% MinusSqrtD
DStandar = dist(DataStandar)
hcStandar = hclust(DStandar, method='complete')
plot(hcStandar, labels=Table[,1])

(Cluster dendrogram of the standardized data. Leaf order: Hungary, USSR, Poland, Czechoslovakia, EGermany, Finland, Norway, Denmark, Sweden, France, UK, Ireland, Belgium, WGermany, Switzerland, Austria, Netherlands, Portugal, Spain, Greece, Italy, Albania, Bulgaria, Romania, Yugoslavia.)
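The same standardization (each variable centered and divided by its standard deviation) can also be obtained with R's built-in scale() function; the lines below are an equivalent shortcut for the code above:

DataStandar2 = scale(Data)                           # center and scale each column
hcStandar2 = hclust(dist(DataStandar2), method='complete')
plot(hcStandar2, labels=Table[,1])                   # same dendrogram as above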
It is difficult to give a clear-cut rule for choosing the "best" number of clusters, k, whatever clustering method is used. Hence, the analyst has to interpret the output of the clustering procedure, and several solutions may be equally interesting.
We can choose to cut the dendrogram where the clusters are too far apart (i.e. where there is a big jump between two adjacent merging levels). If n denotes the sample size, a simple rule of thumb (see Mardia et al. 1989) is k ≈ \sqrt{n/2}.
In the "elbow method" we compute the within-cluster sum of squares for k clusters,

\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x_i \in \text{cluster } j} \| x_i - \bar{x}_j \|^2,

where \bar{x}_j is the sample mean in cluster j. We plot WCSS(k) against the number k of clusters and choose k at the elbow of the graph.
## determine a "good" k using the elbow method
library("GMD")
css.obj <- css.hclust(D, hc)
par(mar = c(5, 10, 3, 1) + 0.1)
plot(css.obj$k, css.obj$tss - css.obj$totbss, pch=19,
     ylab="WCSS(k)", xlab="number of clusters k", cex.lab=1.5)

(Plot of WCSS(k) against the number of clusters k.)
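If the GMD package is not available, WCSS(k) can be computed directly from the hierarchical tree with cutree(). The helper function below is a sketch under that assumption (wcss is not a standard R function), reusing Data and hc from the code above:

wcss = function(data, labels) {     # within-cluster sum of squares for a given labelling
  sum(sapply(split(as.data.frame(data), labels), function(g) {
    sum(scale(g, center = TRUE, scale = FALSE)^2)
  }))
}
wss.k = sapply(1:15, function(k) wcss(Data, cutree(hc, k = k)))
plot(1:15, wss.k, type = "b", xlab = "number of clusters k", ylab = "WCSS(k)")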
4.2. k-means Clustering
In contrast to hierarchical techniques, which give a nested
sequence of clusterings, usually based on a dissimilarity, partitional
clustering procedures start with an initial clustering (a partition of
the data) and refine it iteratively, typically until the final clustering
satisfies an optimality criterion. In general, the clusterings in the
sequence are not nested, so no dendrogram is possible. Hierarchical
methods are based on a dissimilarity, while partitional procedures
are usually based on an objective function.
Objective functions are of two types. They can serve to maximize
the homogeneity of the points in each cluster (internal criteria). In
this case the objective function measures the within-groups spread.
Or they can focus on how clusters are different from each other, in
which case the objective function measures the between-groups
variability.
4.2.1 Introduction
k-means clustering is a nonhierarchical centroid-based clustering
technique designed to group the n individuals in the sample into k
clusters. Compared to hierarchical techniques, nonhierarchical ones
can be applied to larger datasets, because a matrix of distances
does not have to be determined.
Let us assume that the number of clusters, k(≤ n), is specified in
advance. Given the sample x1 , . . . , xn of vectors taking values in
Rd , k-means clustering aims to partition this collection of
observations into k groups, G = {G1 , . . . , Gk }, so as to minimize
(over all possible G’s) the within-cluster sum of squares
\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in G_j} \| x_i - \bar{x}_j \|^2,

where \bar{x}_j := \sum_{x_i \in G_j} x_i / \mathrm{card}(G_j) is the sample mean in G_j and \| \cdot \| is the Euclidean norm.
Since \bar{x} = \arg\min_{a \in R^p} \sum_{i=1}^{n} \| x_i - a \|^2, minimizing WCSS is equivalent to minimizing

\frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le k} \| x_i - a_j \|^2     (1)

over all possible collections {a_1, ..., a_k} in R^p. Then each x_i, i = 1, ..., n, is assigned to its nearest cluster center. The sample means \bar{x}_1, ..., \bar{x}_k of the resulting groups, G_1, ..., G_k, are precisely the arguments a_1, ..., a_k minimizing (1).

Remark: For each j = 1, ..., k, the set of points of R^p closer to \bar{x}_j than to any other cluster center is the Voronoi cell of \bar{x}_j.

Remark: Let F_n denote the empirical probability distribution on R^d corresponding to the sample. The problem of minimizing (1) can be restated as that of minimizing

\Phi(A, F_n) := \int \min_{a \in A} \| x - a \|^2 \, dF_n(x)     (2)

over all choices of the set A ⊂ R^d containing k (or fewer) points.
4.2.2 Asymptotic behaviour of k-means clustering
For any probability measure Q on R^p and any finite subset A of R^p define

\Phi(A, Q) := \int \min_{a \in A} \| x - a \|^2 \, dQ(x)

and m_k(Q) := \inf\{ \Phi(A, Q) : A contains k or fewer points \}.

For a given k, the set A_n = A_n(k) of optimal sample cluster centers satisfies \Phi(A_n, F_n) = m_k(F_n), and the optimal population cluster centers \bar{A} = \bar{A}(k) satisfy \Phi(\bar{A}, F) = m_k(F), where F is the probability distribution of X.

If E\|X\|^2 = \int \| x \|^2 \, dF(x) < \infty, then \Phi(A, F) < \infty for each A, since for each a \in R^p

\int \| x - a \|^2 \, dF(x) < \infty.

Hence these minimization problems make sense.
We define the Hausdorff distance between two sets A and B in R^d as

d_H(A, B) = \inf\{ \varepsilon > 0 : A \subset B^{\varepsilon}, \ B \subset A^{\varepsilon} \},

where A^{\varepsilon} := \{ x \in R^d : \| x - a \| < \varepsilon \text{ for some } a \in A \}.

Theorem 4.1 (Consistency of k-means, Pollard 1981): Suppose that E\|X\|^2 < \infty and that, for each j = 1, ..., k, there is a unique set \bar{A}(j) for which \Phi(\bar{A}(j), F) = m_j(F). Then

d_H(A_n, \bar{A}(k)) \to 0 a.s. as n \to \infty

and

\Phi(A_n, F_n) \to m_k(F) a.s. as n \to \infty.

The uniqueness condition on \bar{A}(j) implies that m_1(F) > m_2(F) > \dots > m_k(F) and that \bar{A}(j) contains exactly j distinct points.
Consistency of k-means in dimension p = 1

Let \mathcal{F} be a class of measurable functions defined on R (equipped with the Borel σ-algebra) on which a probability measure F is defined. Assume that, for each g \in \mathcal{F},

F|g| := E_F |g(X)| = \int |g(x)| \, dF(x) < \infty.

Let X_1, ..., X_n be a sample of i.i.d. random variables from F. We define the empirical (probability) measure F_n on R, which assigns weight 1/n to each X_i in the sample, i = 1, ..., n:

F_n g := E_{F_n}(g(X)) = \int g(x) \, dF_n(x) = \frac{1}{n} \sum_{i=1}^{n} g(X_i).

Then the strong law of large numbers (SLLN) states that, for each fixed g \in \mathcal{F},

F_n g \to F g almost surely (a.s.) as n \to \infty.
We have the following one-sided uniformity result:

Theorem II.3 in Pollard (1984): Suppose that, for each \varepsilon > 0, there exists a finite class \mathcal{F}_{\varepsilon} of functions such that, for each g \in \mathcal{F}, there exists a g_{\varepsilon} \in \mathcal{F}_{\varepsilon} satisfying g_{\varepsilon} \le g and F g - F g_{\varepsilon} \le \varepsilon. Then

\liminf_{n \to \infty} \ \inf_{\mathcal{F}} (F_n g - F g) \ge 0 almost surely.

For simplicity, assume that we want to partition the sample X_1, ..., X_n into k = 2 groups. The k-means method minimizes the within-cluster sum of squares. In this case, this is equivalent to minimizing

\frac{1}{n} \sum_{i=1}^{n} \min\{ (X_i - a)^2, (X_i - b)^2 \}     (3)

over all (a, b) \in R^2. If we denote

g_{a,b}(x) := \min\{ (x - a)^2, (x - b)^2 \},

then minimizing (3) is equivalent to minimizing, over all (a, b) \in R^2,

W(a, b, F_n) := F_n g_{a,b}.     (4)
By the SLLN, for each fixed (a, b),

W(a, b, F_n) = F_n g_{a,b} \to W(a, b, F) = F g_{a,b} a.s. as n \to \infty.

This suggests that the (a_n, b_n) minimizing W(\cdot, \cdot, F_n) might converge to the (a^*, b^*) minimizing W(\cdot, \cdot, F). Let us adopt the convention that a \le b in any pair (a, b) (so that (b^*, a^*) is not a distinct minimizing pair of W(\cdot, \cdot, F)).

Theorem (Example II.4 of Pollard 1984): Assume that F x^2 < \infty and that W(\cdot, \cdot, F) achieves its unique minimum at (a^*, b^*). Then

(a_n, b_n) \to (a^*, b^*) a.s. as n \to \infty.

Sketch of the proof:

• There exists a sufficiently large constant M > 0 such that (a^*, b^*) belongs to the set

C := ([-M, M] \times R) \cup (R \times [-M, M])

and, for n large enough, (a_n, b_n) \in C a.s. too. From now on we assume that this is the case.
• Construction of the finite approximating class \mathcal{F}_{\varepsilon}:

First note that, for (a, b) \in C,

g_{a,b}(x) \le g_M(x) := (x - M)^2 + (x + M)^2

and that F g_M < \infty. Then we can find a constant D > M > 0 such that F( g_M 1_{[-D,D]^c} ) < \varepsilon. Thus we only have to worry about the approximation of g_{a,b}(x) for x \in [-D, D]. We may also assume that both a and b lie in the interval [-3D, 3D].

Let C_{\varepsilon} be a finite subset of [-3D, 3D]^2 such that each (a, b) in that square has an (a', b') \in C_{\varepsilon} with |a - a'| < \varepsilon/D and |b - b'| < \varepsilon/D. Then, for each x \in [-D, D],

|g_{a,b}(x) - g_{a',b'}(x)| \le 16\varepsilon.

We take the class

\mathcal{F}_{\varepsilon} := \{ (g_{a',b'}(x) - 16\varepsilon) 1_{\{|x| \le D\}} : (a', b') \in C_{\varepsilon} \}.

Then Theorem II.3 in Pollard (1984) implies that

\liminf_{n \to \infty} \ \inf_{(a,b) \in C} (F_n g_{a,b} - F g_{a,b}) \ge 0 a.s.
Thus,

\liminf_{n \to \infty} \big( W(a_n, b_n, F_n) - W(a_n, b_n, F) \big) \ge 0 a.s.

• Now observe that this uniformity result allows us to transfer the optimality of (a_n, b_n) for F_n to asymptotic optimality for F:

W(a_n, b_n, F_n) \le W(a^*, b^*, F_n)   because (a_n, b_n) is optimal for F_n,
W(a^*, b^*, F_n) \to W(a^*, b^*, F) a.s.   by the SLLN,
W(a^*, b^*, F) \le W(a_n, b_n, F)   because (a^*, b^*) is optimal for F.

This implies that

W(a_n, b_n, F) \to W(a^*, b^*, F) a.s. as n \to \infty.

Since (a^*, b^*) is the unique minimizer of W(\cdot, \cdot, F), it follows that (a_n, b_n) converges a.s. to (a^*, b^*).
4.2.3 k-means algorithm
The k-means clustering procedure is also used in quantization.
Quantization is the procedure of constraining values from a large continuous set (such as the real numbers) to a relatively small discrete set (such as the integers).
In our context, a k-level quantizer is a map q : Rp → {a1 , . . . , ak },
where aj ∈ Rp , j = 1, . . . , k, are called the prototype vectors. Such
a map can be used to convert a p-dimensional input signal X into
a code q(X) that can take on at most k different values
(fixed-length code). This technique was originally used in signal
processing for data compression.
An optimal k-level quantizer for a probability distribution F on R^p minimizes the distortion as measured by the mean-squared error

E\| X - q(X) \|^2 = \int \| x - q(x) \|^2 \, dF(x).
Thus, searching for an optimal k-level quantizer is equivalent to
the k-means problem. For more information on quantization see,
e.g., Gersho and Gray (1992), Sayood (2005).
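As a hedged illustration of this connection (the simulated one-dimensional "signal" below is an assumption), kmeans() can be used directly as a k-level quantizer:

set.seed(1)
signal = rnorm(1000)                           # toy 1-D signal
q = kmeans(signal, centers = 8, nstart = 10)   # 8-level quantizer
quantized = q$centers[q$cluster]               # replace each value by its prototype
mean((signal - quantized)^2)                   # empirical distortion (mean-squared error)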
Finding the minimum of (1) is computationally difficult, but several
fast algorithms have been proposed that approximate the optimum.
The most popular algorithm was proposed by Lloyd (1982) in the quantization context and is known as the k-means algorithm or Lloyd's algorithm. The name "k-means" was coined by MacQueen (1967).
A history of the k-means algorithm can be found in Bock (2007).
The k-means algorithm employs an iterative refinement similar to
that of the EM algorithm for mixtures of distributions.
k-means algorithm

Take k initial centers a_1^{(1)}, ..., a_k^{(1)} \in R^p. For instance, the Forgy method randomly chooses k observations from the sample and uses them as the initial centers.

Assignment or expectation step: Cluster the data points x_1, ..., x_n around the centers a_1^{(\ell)}, ..., a_k^{(\ell)} into k sets such that the j-th set G_j^{(\ell)} contains the x_i's that are closer to a_j^{(\ell)} than to any other center. This amounts to assigning the observations to the Voronoi cells determined by a_1^{(\ell)}, ..., a_k^{(\ell)}.

Update or maximization step: Determine the new centers a_1^{(\ell+1)}, ..., a_k^{(\ell+1)} as the averages of the data within the clusters:

a_j^{(\ell+1)} = \frac{1}{\mathrm{card}(G_j^{(\ell)})} \sum_{x_i \in G_j^{(\ell)}} x_i.

The two steps are iterated until there are no changes in the cluster centers. Each step of the algorithm decreases the empirical squared Euclidean distance error.
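The following is a minimal R sketch of these two steps, for illustration only (the function name lloyd_kmeans and its arguments are assumptions; this is not the implementation used by R's kmeans()):

lloyd_kmeans = function(X, k, max.iter = 100) {
  X = as.matrix(X)
  centers = X[sample(nrow(X), k), , drop = FALSE]         # Forgy initialization
  for (iter in 1:max.iter) {
    # Assignment step: send each point to its nearest center (its Voronoi cell)
    d2 = as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]  # n x k matrix of distances
    cluster = apply(d2, 1, which.min)
    # Update step: recompute each center as the mean of its cluster
    new.centers = centers
    for (j in 1:k)
      if (any(cluster == j))
        new.centers[j, ] = colMeans(X[cluster == j, , drop = FALSE])
    if (all(new.centers == centers)) break                # no change in the centers: stop
    centers = new.centers
  }
  list(cluster = cluster, centers = centers)
}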
k-means with R:

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
The algorithm of Hartigan and Wong (1979) is used by default: it
generally does a better job than those of Lloyd (1957), Forgy
(1965) or MacQueen (1967).
Observe that there is no guarantee that the k-means algorithm will converge to the global optimum, and the result (a local optimum) may depend on the initial clusters. As the algorithm is usually very fast, trying several random starts (nstart > 1) is often recommended.
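For instance (reusing DataStandar from Example 4.1; the argument values below are illustrative):

fit0 = kmeans(DataStandar, centers = 5, iter.max = 50, nstart = 25)
fit0$tot.withinss   # lowest within-cluster sum of squares found over the 25 starts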
Example 4.1 (protein consumption in Europe): A plot of the within-groups sum of squares against the number of clusters extracted can help determine the appropriate number of clusters. We look for a bend in the plot, similar to a scree plot in PCA or factor analysis.

n = nrow(Data) # Sample size
wss = (n-1)*sum(apply(DataStandar,2,var))
for (i in 2:15) {
  wss[i] = sum(kmeans(DataStandar,centers=i)$withinss)
}
plot(1:15, wss, type="b",
     xlab="Number of Clusters",
     ylab="Within groups sum of squares")

(The plot shows the within-groups sum of squares for k = 1, ..., 15 clusters.)
fit = kmeans(DataStandar, centers = 5, nstart = 10)
aggregate(DataStandar, by=list(fit$cluster), FUN=mean)

  Group.1        V1        V2        V3        V4        V5        V6        V7        V8        V9
1       1  0.006572 -0.229015  0.191478  1.345874  1.158254 -0.872272  0.167678 -0.955339 -1.114804
2       2 -0.570049  0.580387 -0.085897 -0.460493 -0.453779  0.318183  0.785760 -0.267918  0.068739
3       3 -0.807569 -0.871935 -1.553305 -1.078332 -1.038637  1.720033 -1.423426  0.996131 -0.643604
4       4 -0.508801 -1.108800 -0.412484 -0.832041  0.981915  0.130025 -0.184201  1.310884  1.629244
5       5  1.011180  0.742133  0.940841  0.570058 -0.267153 -0.687758  0.228874 -0.508389  0.021619

NewTable = data.frame(Table, fit$cluster)
head(NewTable)

         Country RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts FrVeg fit.cluster
1        Albania    10.1       1.4  0.5  8.9  0.2    42.3    0.6  5.5   1.7           5
2        Austria     8.9      14.0  4.3 19.9  2.1    28.0    3.6  1.3   4.3           2
3        Belgium    13.5       9.3  4.1 17.5  4.5    26.6    5.7  2.1   4.0           2
4       Bulgaria     7.8       6.0  1.6  8.3  1.2    56.7    1.1  3.7   4.2           5
5 Czechoslovakia     9.7      11.4  2.8 12.5  2.0    34.3    5.0  1.1   4.0           3
6        Denmark    10.6      10.8  3.7 25.0  9.9    21.9    4.8  0.7   2.4           1

o = order(fit$cluster)
data.frame(NewTable$Country[o], NewTable$fit.cluster[o])

   NewTable.Country.o. NewTable.fit.cluster.o.
1              Denmark                       1
2              Finland                       1
3               Norway                       1
4               Sweden                       1
5       Czechoslovakia                       3
6             EGermany                       3
7              Hungary                       3
8               Poland                       3
9                 USSR                       3
10             Albania                       5
11            Bulgaria                       5
12             Romania                       5
13          Yugoslavia                       5
14              Greece                       4
15               Italy                       4
16            Portugal                       4
17               Spain                       4
18             Austria                       2
19             Belgium                       2
20              France                       2
21             Ireland                       2
22         Netherlands                       2
23         Switzerland                       2
24                  UK                       2
25            WGermany                       2
4.2.4 Generalizations and modifications of the k-means algorithm

• Clustering with non-Euclidean divergences (see, e.g., Banerjee et al. 2005). Instead of using the squared Euclidean norm in the objective function

\Phi(A, Q) = \int \min_{a \in A} \| x - a \|^2 \, dQ(x),

we can use a more general distortion function \phi = \phi(x), continuous, non-decreasing and satisfying certain properties. For instance, k-medians clustering is obtained with \phi(x) = |x_1| + \dots + |x_p|, the norm associated with the Manhattan metric (a minimal sketch is given after this list).

• Modifications of k-means can be found in Jain (2010).
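A minimal k-medians sketch in R is given below (illustrative only; the function name kmedians and its arguments are assumptions). With the Manhattan distortion, the optimal "center" of a cluster is its componentwise median rather than its mean:

kmedians = function(X, k, max.iter = 100) {
  X = as.matrix(X)
  centers = X[sample(nrow(X), k), , drop = FALSE]            # random initial centers
  for (iter in 1:max.iter) {
    # n x k matrix of Manhattan distances from each point to each center
    d1 = apply(centers, 1, function(a) rowSums(abs(sweep(X, 2, a))))
    cluster = apply(d1, 1, which.min)                        # assign to nearest center
    new.centers = centers
    for (j in 1:k)
      if (any(cluster == j))
        new.centers[j, ] = apply(X[cluster == j, , drop = FALSE], 2, median)
    if (all(new.centers == centers)) break
    centers = new.centers
  }
  list(cluster = cluster, centers = centers)
}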
4.3. Classification Maximum Likelihood (ML)

Here we regard the clusters as representing local modes of the underlying distribution, which is then modeled as a mixture of components. One way to represent the uncertainty about whether a point lies in a cluster is to back off from determining absolute membership of each x_i in a cluster and to ask only for a degree of membership (soft assignment to the clusters). Then we say that x_i is in cluster G_j with probability \pi_{ij}, with \sum_{j=1}^{k} \pi_{ij} = 1. Hard clustering can be recovered by assigning x_i to the cluster G_j such that j = \arg\max_{l} \pi_{il}.

The idea of soft membership can be formalized through a parametric mixture model

f(x; \psi) = \sum_{j=1}^{k} \pi_j f_j(x; \theta_j),     (5)

where \psi = (\pi_1, ..., \pi_{k-1}, \xi')' and \xi is the vector containing all the parameters in \theta_1, ..., \theta_k known a priori to be different.
Denote by z_i = (z_{i1}, ..., z_{ik})' a k-dimensional vector with z_{ij} = 1 or 0 according to whether x_i did or did not arise from the j-th component of the mixture, i = 1, ..., n, j = 1, ..., k.

The complete data sample is x_1^c, ..., x_n^c, where x_i^c = (x_i, z_i). The complete-data log-likelihood for \psi is

\log L_c(\psi) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \big( \log \pi_j + \log f_j(x_i; \theta_j) \big).

In the classification likelihood approach the parameter vector \psi and the unknown component-indicator vectors z_1, ..., z_n are chosen to maximize L_c(\psi). That is, the vectors z_1, ..., z_n are treated as parameters to be estimated along with \psi.

The classification ML approach is also called Gaussian Mixture Model (GMM) classification. A more detailed account can be found in McLachlan and Basford (1988).
Unless n is small, however, searching over all possible combinations of values of the z_i's is computationally prohibitive. McLachlan (1982) proved that a solution corresponding to a local maximum can be computed iteratively via the so-called classification EM algorithm, in which a classification step is inserted between the E-step and the M-step of the original EM algorithm.

The algorithm starts with an initial guess \psi^{(0)} for \psi. Let \psi^{(g)} be the approximate value of \psi after the g-th iteration.

E-Step: Compute the posterior probability that the i-th member of the sample, X_i, belongs to the j-th component of the mixture:

\tau_{ij}^{(g)} := \frac{ \pi_j^{(g)} f_j(x_i; \theta_j^{(g)}) }{ \sum_{l=1}^{k} \pi_l^{(g)} f_l(x_i; \theta_l^{(g)}) }.
Classification Step: Set z_i^{(g)} = (z_{i1}^{(g)}, ..., z_{ik}^{(g)})' with

z_{ij}^{(g)} = 1 if j = \arg\max_{l = 1, ..., k} \tau_{il}^{(g)}, and z_{ij}^{(g)} = 0 otherwise.

M-Step for Gaussian mixtures:

\pi_j^{(g+1)} = \frac{1}{n} \sum_{i=1}^{n} z_{ij}^{(g)}, \qquad
\mu_j^{(g+1)} = \frac{ \sum_{i=1}^{n} z_{ij}^{(g)} x_i }{ \sum_{i=1}^{n} z_{ij}^{(g)} }

and

\Sigma_j^{(g+1)} = \frac{ \sum_{i=1}^{n} z_{ij}^{(g)} (x_i - \mu_j^{(g+1)}) (x_i - \mu_j^{(g+1)})' }{ \sum_{i=1}^{n} z_{ij}^{(g)} }.

The three steps above are repeated until the difference L_c(\psi^{(g+1)}) - L_c(\psi^{(g)}) is small enough.
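A minimal R sketch of the classification EM iteration for a Gaussian mixture is given below (illustrative only: it starts from a random hard classification instead of an initial ψ^(0), the function name cem_gauss is an assumption, the mvtnorm package provides the Gaussian densities, and no safeguards against empty or degenerate clusters are included):

library(mvtnorm)
cem_gauss = function(X, k, max.iter = 100, tol = 1e-8) {
  X = as.matrix(X); n = nrow(X)
  z = sample(1:k, n, replace = TRUE)              # random initial hard classification
  loglik.old = -Inf
  for (iter in 1:max.iter) {
    # M-step: proportions, means and covariances from the current classification
    pi.hat = tabulate(z, k) / n
    mu  = lapply(1:k, function(j) colMeans(X[z == j, , drop = FALSE]))
    Sig = lapply(1:k, function(j) {
      Xc = sweep(X[z == j, , drop = FALSE], 2, mu[[j]])
      crossprod(Xc) / sum(z == j)
    })
    # E-step: pi_j f_j(x_i; theta_j), proportional to the posterior probabilities tau_ij
    dens = sapply(1:k, function(j) pi.hat[j] * dmvnorm(X, mu[[j]], Sig[[j]]))
    # Classification step: hard assignment to the most probable component
    z = apply(dens, 1, which.max)
    # Complete-data log-likelihood of the current classification
    loglik = sum(log(dens[cbind(1:n, z)]))
    if (loglik - loglik.old < tol) break
    loglik.old = loglik
  }
  list(cluster = z, pi = pi.hat, mean = mu, Sigma = Sig, loglik = loglik)
}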
Celeux and Govaert (1993) compared the classification EM
algorithm and the original EM algorithm and found that the first
approach tends to be better for small samples, whereas the mixture
likelihood approach is preferable for large samples (in this case, the
classification likelihood approach usually yields inconsistent
parameter estimates).
The Classification EM procedure can be shown to be equivalent to
some commonly used clustering criteria under the assumption of
Gaussian populations with various constraints on their covariance
matrices. For instance, if we assume equal priors
π1 = . . . = πk = 1/k and Σ1 = . . . = Σk = Σ with Σ = σ²Ip (a scalar matrix), the GMM classification is exactly k-means clustering. This model is referred to as isotropic or spherical, since the variance is equal in all directions.
With the availability of powerful MCMC methods, there is renewed
interest in the Bayesian approach to model-based clustering. In the
Bayesian Maximum A Posteriori (MAP) classification approach, a
prior distribution is placed on the unknown parameters ξ of the
mixture.
4.4. How many clusters?
In k-means and in GMM classification the number k of clusters or
components in the mixture model is assumed to be known in
advance. However, choosing k is a problem that still receives
considerable attention in the literature (see McLachlan and Peel
2000; Sugar and James 2003; Oliveira-Brochado and Martins
2005).
Here we focus on choosing the order of a Gaussian mixture model.
Specifically, we consider the estimation of the number of
components in the mixture based on a penalized form of the
likelihood. As the likelihood increases with the addition of a component to the mixture model, the log-likelihood is penalized by subtracting a term that accounts for the number of parameters in the model. This penalized log-likelihood yields the so-called information criteria for the choice of k.
4.4.1 Information criteria in model selection

Let f and g denote two probability densities in R^d (in general, two probability distribution models). The Kullback-Leibler information between models f and g is defined as

I(f, g) := \int f(x) \log \frac{f(x)}{g(x)} \, dx = E_f \left( \log \frac{f(X)}{g(X)} \right)     (6)

and represents the information lost when g is used to approximate f (Kullback and Leibler 1951).

The Kullback-Leibler information can be conceptualized as a directed "distance" between the two models f and g (Kullback 1959). Strictly speaking, I(f, g) is a measure of "discrepancy", not a distance, because it is not symmetric. The K-L distance has also been called the K-L discrepancy, divergence and number. It happens to be the negative of Boltzmann's (1877) concept of generalized entropy in physics and thermodynamics.
The K-L discrepancy (6) can be expressed as

I(f, g) = E_f(\log f(X)) - E_f(\log g(X)).

Typically, f = f(x) is the true density of the random vector of interest, X, and g is some estimate, \hat{f}, of f. In our framework, \hat{f} is a density from a parametric family, namely a mixture of continuous distributions f(x; \psi) = \sum_{j=1}^{k} \pi_j f_j(x; \theta_j). Intuitively, we would like to find a number k of components and a value of \psi such that the K-L divergence of f relative to the parametric model f(\cdot; \psi),

I(f, f(\cdot; \psi)) = E_f(\log f(X)) - E_f(\log f(X; \psi)),     (7)

is small.

We would need to know f, k and \psi to fully compute (7), but f is usually unknown. However, these requirements can be relaxed by noting that we only want to minimize (7) with respect to k and \psi, and that C_f := E_f(\log f(X)) is a constant that does not depend on these parameters, only on the unknown, true f.
Hence, we would like to find the values of k and \psi minimizing the relative information

\eta(\psi, F) := I(f, f(\cdot; \psi)) - C_f = -E_f(\log f(X; \psi)) = -\int \log f(x; \psi) \, dF(x),     (8)

where F denotes the distribution of X.

Let x_1, ..., x_n be a sample from X. A natural estimator of the probability distribution F of X is the empirical distribution F_n, a discrete probability measure giving probability mass 1/n to each sample point. Plugging F_n in place of the unknown F gives an estimator of (8),

\eta(\psi, F_n) = -\int \log f(x; \psi) \, dF_n(x) = -\frac{1}{n} \sum_{i=1}^{n} \log f(x_i; \psi) = -\frac{1}{n} \log L(\psi; x_1, ..., x_n),

i.e., minus the log-likelihood of the sample divided by n.
It is tempting simply to estimate \psi by its maximum likelihood estimator (m.l.e.). However, Akaike (1973) showed that \eta(\hat{\psi}, F_n), i.e. minus the maximized log-likelihood divided by n, is a downward-biased estimator of (8).

Given the m.l.e. \hat{\psi} = \hat{\psi}(x_1, ..., x_n) of \psi, the bias of \eta(\hat{\psi}, F_n) is the functional

b(F) := E_f\big( \eta(\hat{\psi}, F_n) \big) - \eta(\psi, F) = -E_f \left[ \frac{1}{n} \sum_{i=1}^{n} \log f(X_i; \hat{\psi}) - \int \log f(x; \psi) \, dF(x) \right].

The information-criterion approach to model selection seeks the model (in our case, the mixture order) minimizing the bias-corrected relative information which, multiplied by 2n (the factor 2 is kept for "historical reasons"), becomes

2n \big( \eta(\hat{\psi}, F_n) - b(F) \big) = -2 \big( \log L(\hat{\psi}) + n \, b(F) \big).

Here -2 \log L(\hat{\psi}) measures the model's lack of fit, while the second term acts as a penalty measuring the model's complexity, so the whole expression is a bias-corrected (negative) log-likelihood. In practice an appropriate estimate of the bias b(F) is used; different bias estimators give rise to different information criteria.
Akaike's Information Criterion (AIC)

Akaike (1973, 1974) approximated the bias b(F) by -d/n, where d is the total number of unknown parameters in the model. Thus AIC selects the model minimizing

AIC = -2 \log L(\hat{\psi}) + 2d.

Many authors (see McLachlan and Peel 2000 and references therein) have observed that AIC tends to overestimate the correct number of components in a mixture and thus leads to overfitting.

Bayesian Information Criterion (BIC)

The BIC was derived within a Bayesian framework for model selection, but it can also be applied in a non-Bayesian context. Schwarz (1978) proposed minimizing the negative penalized log-likelihood

BIC = -2 \log L(\hat{\psi}) + d \log n.

For n > 8, the penalty term of BIC penalizes complex models more heavily than AIC does. BIC works well and is popular in practice.
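As a hedged illustration with the mclust package (using R's built-in faithful data; fit$df is mclust's count of estimated parameters), AIC and BIC can be computed by hand from the fitted model's log-likelihood:

library(mclust)
fit = Mclust(faithful, G = 2, modelNames = "VVV")  # 2-component Gaussian mixture
d = fit$df                                         # number of estimated parameters
AIC = -2 * fit$loglik + 2 * d
BIC = -2 * fit$loglik + d * log(nrow(faithful))
c(AIC = AIC, BIC = BIC, mclust.BIC = fit$bic)      # mclust reports 2*logL - d*log(n)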
Example: Times between Old Faithful eruptions (Y) and duration of eruptions (X).

Datos = read.table('Datos-geyser.txt', header=TRUE)
XY = cbind(Datos$X, Datos$Y)
library(mclust)
faithfulBIC = mclustBIC(XY)
faithfulSummary = summary(faithfulBIC, data = XY)
faithfulSummary

classification table:
 1  2
80 27

best BIC values:
     VVV,2      EEE,2      EEI,2
 -962.3372  -962.5509  -966.3832

Note: in the package mclust, BIC = 2 log L(ψ̂) − d log n, so the largest values are best. The faithful dataset included in the R language distribution has a larger sample size than Datos-geyser.txt; see the help documentation of the mclust package.

faithfulBIC

BIC:
       EII      VII      EEI      VEI      EVI      VVI      EEE      EEV      VEV      VVV
1 -1569.04 -1569.04 -1179.96 -1179.96 -1179.96 -1179.96 -1041.77 -1041.77 -1041.77 -1041.77
2 -1379.31 -1377.37  -966.38  -968.38  -971.03  -973.01  -962.55  -966.84  -967.29  -962.33
3 -1319.98 -1322.97  -970.24  -978.44  -978.04  -981.76  -969.12  -973.32  -972.45  -966.92
4 -1308.77 -1292.39  -983.88  -987.56  -993.71 -1001.38  -979.89  -984.59  -991.81 -1002.93
5 -1277.10 -1260.01  -976.17  -974.21  -987.11  -989.63  -976.21  -981.17  -986.05 -1005.03
6 -1266.59 -1267.84  -990.17  -980.26       NA       NA  -990.27  -999.86 -1004.22       NA
7 -1229.99 -1235.25 -1015.08 -1007.07       NA       NA -1003.13 -1033.46 -1022.75       NA
8 -1213.58 -1238.59 -1001.58 -1011.37       NA       NA -1015.10 -1052.19 -1035.70       NA
9 -1213.43       NA -1006.84       NA       NA       NA -1029.14 -1070.88       NA       NA

Top 3 models based on the BIC criterion:
     VVV,2      EEE,2      EEI,2
 -962.3372  -962.5509  -966.3832
4.4.2 Number of clusters = mixture order?
Any continuous distribution can be approximated arbitrarily well by
a finite mixture of normal densities with common covariance
matrix (see McLachlan and Peel 2000). But, for instance, more
than one component of the mixture might be needed to model a
skewed shape of the density in a neighbourhood of a local mode:
(Figure: density of a mixture of 3 Gaussian distributions, plotted over the interval [−2, 4], illustrating that several components may be needed to model a single skewed mode.)
Thus, when a GMM is used for clustering, the number of mixture
components is not necessarily the same as the number of clusters.
There are some recent proposals to overcome this problem. Most
of them are algorithms for hierarchically merging mixture
components into the same cluster, once the “optimal” number of
components in the mixture has been chosen (for example, using
BIC). See Hennig (2010).
One of the difficulties is that the statistician has to decide under which conditions different Gaussian mixture components should be regarded as forming a common cluster: there is no unique, objective definition of what a "true" cluster is. Should clusters be identified with regions surrounding local modes?
Ray and Lindsay (2005) state “the modes are the dominant
feature, and are themselves potentially symptomatic of underlying
population structures” and prove that there is a
(k − 1)-dimensional surface, the ridgeline manifold, which includes
all the critical points (modes, antimodes and saddlepoints).
4.5. Clustering high-dimensional data
Some references:
Bouveyron, C. and Brunet-Saumard, C. (2014). Model-based clustering of
high-dimensional data: A review. Computational Statistics and Data Analysis, 56,
4462–4469.
Bouveyron, C., Girard, S. and Schmid, C. (2007). High-dimensional data clustering.
Computational Statistics and Data Analysis, 52, 502–519.
Kriegel, H.-P., Kröger, P. and Zimek, A. (2009). Clustering high-dimensional data: A
survey on subspace clustering, pattern-based clustering, and correlation clustering.
ACM Transactions on Knowledge Discovery from Data, 3, Article 1.
Sun, W., Wang, J. and Fang, Y. (2012). Regularized k-means clustering of
high-dimensional data and its asymptotic consistency. Electronic Journal of
Statistics, 6, 148–167.
Tomasev, N., Radovanović, M., Mladenić, D. and Ivanović, M. (2014). The role of
hubness in clustering high-dimensional data. IEEE Transactions on Knowledge and
Data Engineering, 26, 739–751.
References
Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In B.N. Petrov and F. Csaki (eds.), Second International Symposium on Information Theory, Akademiai Kiado, Budapest, pp. 267-281.
Akaike, H. (1974). A new look at the statistical model identification. IEEE
Transactions on Automatic Control, 19, 716-723.
Banerjee, A., Merugu, S., Dhillon, I.S. and Ghosh, J. (2005). Clustering with Bregman
divergences. Journal of Machine Learning Research, 6, 1705–1749.
Bock, H.-H. (2007). Clustering methods: a history of k-means algorithms. In Selected
Contributions in Data Analysis and Classification, Springer, pp. 161–172.
Boltzmann, L. (1877). Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respective den Sätzen über das Wärmegleichgewicht. Wiener Berichte, 76, 373-435.
Celeux, G. and G. Govaert (1993). Comparison of the mixture and the classification
maximum likelihood in cluster analysis. Journal of Statistical Computation and
Simulation, 47, 127–146.
Everitt, B.S., Landau, S., Leese, M. and Stahl, D. (2011). Cluster Analysis. Wiley.
Forgy, E.W. (1965). Cluster analysis of multivariate data: efficiency versus
interpretability of classifications. Biometrics, 21, 768-769.
Gersho, A. and Gray, R.M. (1992). Vector Quantization and Signal Compression.
Springer.
Hartigan, J.A. and Wong, M.A. (1979). Algorithm AS 136: a k-means clustering
algorithm. Journal of the Royal Statistical Society, Series C, 28, 100-108.
Hennig, C. (2010). Methods for merging Gaussian mixture components. Advances in
Data Analysis and Classification, 4, 3-34.
Jain, A.K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition
Letters, 31, 651–666.
Kaufman, L. and Rousseeuw, P.J. (2005). Finding Groups in Data: An Introduction to
Cluster Analysis. Wiley.
Kullback, S. (1959). Information theory and statistics. Wiley.
Kullback, S. and Leibler, R.A. (1951). On information and sufficiency. Annals of
Mathematical Statistics, 22, 79–86.
Lloyd, S.P. (1982). Least squares quantization in PCM. IEEE Transactions on
Information Theory, 28, 129-137.
Mardia, K.V., Kent, J.T. and Bibby, J.M. (1989). Multivariate analysis. Academic
Press.
McLachlan, G. (1982). The classification and mixture maximum likelihood approaches
to cluster analysis. In Handbook of Statistics, vol. 2, eds. P.R. Krishnaiah and L.
Kanal, North-Holland., pp. 199–208.
McLachlan, G. and Basford, K.E. (1988). Mixture Models: Inference and Applications
to Clustering. Marcel Dekker.
McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley.
MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate
observations. In Proceedings of 5th Berkeley Symposium on Mathematical
Statistics and Probability. University of California Press, pp. 281-297.
Oliveira-Brochado, A. and Martins, F.V. (2005). Assessing the number of components
in mixture models: a review. FEP Working Papers 194.
Pollard, D. (1981). Strong consistency of k-means clustering. The Annals of Statistics, 9, 135-140.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer.
Downloadable at www.stat.yale.edu/~pollard/.
Ray, S. and Lindsay, B.G. (2005). The topography of multivariate normal mixtures.
The Annals of Statistics, 33, 2042-2065.
Sayood, K. (2005). Introduction to Data Compression. Elsevier.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6,
461-464.
Sugar, C.A. and James, G.M. (2003). Finding the number of clusters in a dataset: an
information-theoretic approach. Journal of the American Statistical Association,
98, 750–763.
Weber, A. (1973). Agrarpolitik im Spannungsfeld der internationalen
Ernaehrungspolitik. Institut fuer Agrarpolitik und marktlehre, Kiel.