Clustering
CSE 6363 – Machine Learning
Vassilis Athitsos
Computer Science and Engineering Department
University of Texas at Arlington
Supervised vs. Unsupervised Learning
• In supervised learning, the training data is a set of pairs (x_i, y_i), where x_i is an example input, and y_i is the target output for that input.
– The goal is to learn a general function H(x) that maps inputs to outputs.
– The majority of the methods we have covered this semester fall under the category of supervised learning.
• In unsupervised learning, the training data is a set of values x_i.
– There is no associated target value for any x_i.
– Because of that, data values x_i are called unlabeled data.
Supervised vs. Unsupervised Learning
• The goal in supervised learning is regression or classification.
• The goal in unsupervised learning is to discover hidden structure in the data.
• We have already seen some examples of unsupervised learning in this class.
– Can you think of any such examples?
Supervised vs. Unsupervised Learning
• The goal in supervised learning is regression or classification.
• The goal in unsupervised learning is to discover hidden structure in the data.
• We have already seen some examples of unsupervised learning in this class.
– Learning of mixtures of Gaussians using EM.
  • The hidden structure is the decomposition of the overall distribution into a combination of multiple Gaussians.
– PCA: The hidden structure here is the representation of the data as a linear combination of a few eigenvectors plus noise.
– SVD: The hidden structure here is the representation of the data as dot products of low-dimensional vectors.
– HMMs: Baum-Welch can be seen as unsupervised learning, as it estimates probabilities of hidden states given the training observation sequences.
Clustering
• The goal of clustering is to separate the data into coherent subgroups.
– Data within a cluster should be more similar to each other than to data in other clusters.
• For example:
– Can you identify clusters in this dataset?
– How many? Which ones?
Clustering
• The goal of clustering is to separate the data into coherent subgroups.
– Data within a cluster should be more similar to each other than to data in other clusters.
• For example:
– Can you identify clusters in this dataset?
– Many people would identify three clusters.
Clustering
• Sometimes the clusters may not be as obvious.
• Identifying clusters can be even harder with high-dimensional data.
Applications of Clustering
• Finding subgroups of similar items is useful in many fields:
– In biology, clustering is used to identify relationships between organisms, and to uncover possible evolutionary links.
– In marketing, clustering is used to identify segments of the population that would be specific targets for specific products.
– For anomaly detection, anomalous data can be identified as data that cannot be assigned to any of the "normal" clusters.
– In search engines and recommender systems, clustering can be used to group similar items together.
K-Means Clustering
• K-means clustering is a simple and widely used clustering method.
• First, clusters are initialized by assigning each object randomly to a cluster.
• Then, the algorithm alternates between:
– Re-assigning objects to clusters based on distances from each object to the mean of each current cluster.
– Re-computing the means of the clusters.
• The number of clusters must be chosen manually.
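The loop above translates directly into code. Below is a minimal NumPy sketch of k-means as just described; the function name kmeans, the seed parameter, and the handling of empty clusters are illustrative assumptions rather than anything prescribed by the slides.

    import numpy as np

    def kmeans(X, K, max_iters=100, seed=0):
        """X: (N, D) array of points.  K: number of clusters (chosen manually)."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        # Initialization: assign each object randomly to one of the K clusters.
        assignments = rng.integers(0, K, size=N)
        means = np.zeros((K, D))
        for _ in range(max_iters):
            # Re-compute the mean of each current cluster.
            for k in range(K):
                members = X[assignments == k]
                if len(members) > 0:          # keep the old mean if a cluster goes empty
                    means[k] = members.mean(axis=0)
            # Re-assign each object to the cluster with the nearest mean.
            dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # (N, K)
            new_assignments = dists.argmin(axis=1)
            if np.array_equal(new_assignments, assignments):
                break                         # assignments did not change: stop
            assignments = new_assignments
        return assignments, means

A call such as labels, means = kmeans(X, K=3) mirrors the walkthrough on the following slides, up to the randomness of the initialization.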
K-Means Clustering:
Initialization
• To start the iterative process, we first need to provide some initial values.
• We manually pick the number of clusters.
– In the example shown, we pick 3 as the number of clusters.
• We randomly assign objects to clusters.
K-Means Clustering:
Initialization
• To start the iterative process, we first need to provide some initial values.
• We manually pick the number of clusters.
– In the example shown, we pick 3 as the number of clusters.
• We randomly assign objects to clusters.
• In the example figure, cluster membership is indicated with color.
– There is a red cluster, a green cluster, and a brown cluster.
K-Means Clustering:
Initialization
• To start the iterative process, we first need to provide some initial values.
• We manually pick the number of clusters.
– In the example shown, we pick 3 as the number of clusters.
• We randomly assign objects to clusters.
• The means of these random clusters are shown with × marks.
K-Means Clustering:
Main Loop
• The main loop alternates between:
– Computing new assignments of objects to clusters, based on distances from each object to each cluster mean.
– Computing new means for the clusters, using the current cluster assignments.
• At this point, we have provided some random initial assignments of points to clusters.
• What is the next step?
K-Means Clustering:
Main Loop
• The main loop alternates between:
– Computing new assignments of objects to clusters, based on distances from each object to each cluster mean.
– Computing new means for the clusters, using the current cluster assignments.
• The next step is to compute new assignments of objects to clusters.
K-Means Clustering:
Main Loop
• The main loop alternates between:
– Computing new assignments of objects to clusters, based on distances from each object to each cluster mean.
– Computing new means for the clusters, using the current cluster assignments.
• The next step is to compute new assignments of objects to clusters.
– The figure shows the new assignments.
K-Means Clustering:
Main Loop
• The main loop alternates between:
– Computing new assignments of objects to clusters, based on distances from each object to each cluster mean.
– Computing new means for the clusters, using the current cluster assignments.
• Next, we move on to compute the new means.
– Again, the new means are shown with × marks.
K-Means Clustering:
Main Loop
• The main loop alternates between:
– Computing new assignments of objects to clusters, based on distances from each object to each cluster mean.
– Computing new means for the clusters, using the current cluster assignments.
• Next, we again compute new assignments.
– We end up with the same assignments as before.
– When this happens, we can terminate the algorithm.
K-Means Clustering:
Another Example
• Here is a more difficult example, where there are no obvious clusters.
• Again, we specify manually that we want to find 3 clusters.
• First step: ???
K-Means Clustering:
Another Example
• Here is a more difficult example, where there are no obvious clusters.
• Again, we specify manually that we want to find 3 clusters.
• First step: randomly assign points to the three clusters.
– We see the random assignments and the resulting means for the three clusters.
K-Means Clustering:
Another Example
• Here is a more difficult example, where there are no obvious clusters.
• Again, we specify manually that we want to find 3 clusters.
• Main loop, iteration 1: recompute cluster assignments.
K-Means Clustering:
Another Example
• Here is a more difficult example, where there are no obvious clusters.
• Again, we specify manually that we want to find 3 clusters.
• Main loop, iteration 2: recompute cluster assignments.
K-Means Clustering:
Another Example
• Here is a more difficult example, where there are no obvious clusters.
• Again, we specify manually that we want to find 3 clusters.
• Main loop, iteration 3: recompute cluster assignments.
K-Means Clustering:
Another Example
• Here is a more difficult example, where there are no obvious clusters.
• Again, we specify manually that we want to find 3 clusters.
• Main loop, iteration 4: recompute cluster assignments.
K-Means Clustering:
Another Example
• Here is a more difficult example, where there are no obvious clusters.
• Again, we specify manually that we want to find 3 clusters.
• Main loop, iteration 5: recompute cluster assignments.
K-Means Clustering:
Another Example
• Here is a more difficult example, where there are no obvious clusters.
• Again, we specify manually that we want to find 3 clusters.
• Main loop, iteration 6: recompute cluster assignments.
– The cluster assignments do not change, compared to iteration 5, so we can stop.
K-Means Clustering - Theory
• Let X = {x_1, x_2, …, x_n} be our set of points.
– We assume that X is a subset of the D-dimensional vector space ℝ^D.
– Thus, each x_i is a D-dimensional vector.
• Let G_1, G_2, …, G_K be some set of clusters for X.
– The union of all clusters should be X:

  ⋃_{k=1}^{K} G_k = X

– The intersection of any two distinct clusters should be empty:

  ∀i ≠ j: G_i ∩ G_j = ∅
K-Means Clustering - Theory
• Let X = {x_1, x_2, …, x_n} be our set of points.
• Let G_1, G_2, …, G_K be some set of clusters for X.
• Let μ_k be the mean of all elements of G_k.
• The error measure for such a set of clusters is defined as:

  E(G_1, G_2, …, G_K) = Σ_{k=1}^{K} Σ_{x_i ∈ G_k} ‖x_i − μ_k‖

  where ‖x_i − μ_k‖ is the Euclidean distance between x_i and μ_k.

• The optimal set of clusters is the one that minimizes this error measure.
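As a concrete companion to this definition, here is a small NumPy sketch that evaluates E for a given clustering; representing the clusters as an integer label array and the name clustering_error are assumptions made for illustration.

    import numpy as np

    def clustering_error(X, assignments, K):
        """Sum, over all clusters, of Euclidean distances from each point to its cluster mean."""
        E = 0.0
        for k in range(K):
            members = X[assignments == k]
            if len(members) == 0:
                continue
            mu_k = members.mean(axis=0)
            E += np.linalg.norm(members - mu_k, axis=1).sum()
        return E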
K-Means Clustering - Theory
• The error measure for a set of clusters G_1, G_2, …, G_K is defined as:

  E(G_1, G_2, …, G_K) = Σ_{k=1}^{K} Σ_{x_i ∈ G_k} ‖x_i − μ_k‖

• The optimal set of clusters is the one that minimizes this error measure.
• Finding the optimal set of clusters is NP-hard.
– Solving this problem takes time O(n^(KD+1)), where:
  • n is the number of objects.
  • D is the number of dimensions.
  • K is the number of clusters.
K-Means Clustering - Theory
• The error measure for a set of clusters G_1, G_2, …, G_K is defined as:

  E(G_1, G_2, …, G_K) = Σ_{k=1}^{K} Σ_{x_i ∈ G_k} ‖x_i − μ_k‖

• The iterative algorithm we described earlier decreases the error E at each iteration.
• Thus, this iterative algorithm converges.
– However, it only converges to a local optimum.
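Since only a local optimum is reached, a common practical remedy (not part of the slides themselves) is to run the algorithm from several random initializations and keep the clustering with the lowest error E. A sketch, reusing the hypothetical kmeans and clustering_error functions from the earlier examples:

    # X is assumed to be an (N, D) NumPy array of data points.
    best_E, best_labels = float("inf"), None
    for seed in range(10):                       # 10 random restarts
        labels, _ = kmeans(X, K=3, seed=seed)
        E = clustering_error(X, labels, K=3)
        if E < best_E:
            best_E, best_labels = E, labels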
K-Medoid Clustering
• The main loop in k-means clustering is:
– Computing new assignments of objects to clusters, based on distances from each object to each cluster mean.
– Computing new means for the clusters, using the current cluster assignments.
• K-medoid clustering is a variation of k-means, where we use medoids instead of means.
• The main loop in k-medoid clustering is:
– Computing new assignments of objects to clusters, based on distances from each object to each cluster medoid.
– Computing new medoids for the clusters, using the current cluster assignments.
K-Medoid Clustering
• To complete the definition of k-medoid clustering, we need to define what a medoid of a set is.
• Let G be a set, and let F be a distance measure that evaluates distances between any two elements of G.
• Then, the medoid m_G of G is defined as:

  m_G = argmin_{s ∈ G} Σ_{x ∈ G} F(s, x)

• In words, m_G is the object in G with the smallest sum of distances to all other objects in G.
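The definition above, combined with the main loop from the previous slide, gives the following minimal sketch. It assumes the data is provided as a precomputed N × N matrix dist of pairwise distances (e.g., DTW or edit distances); the function names and the iteration cap are illustrative choices.

    import numpy as np

    def medoid(dist, members):
        """The member with the smallest sum of distances to all other members."""
        sub = dist[np.ix_(members, members)]
        return members[sub.sum(axis=1).argmin()]

    def kmedoid(dist, K, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        N = dist.shape[0]
        assignments = rng.integers(0, K, size=N)   # random initial clusters
        medoids = np.zeros(K, dtype=int)
        for _ in range(max_iters):
            # Re-compute the medoid of each current cluster.
            for k in range(K):
                members = np.where(assignments == k)[0]
                if len(members) > 0:
                    medoids[k] = medoid(dist, members)
            # Re-assign each object to the cluster with the nearest medoid.
            new_assignments = dist[:, medoids].argmin(axis=1)
            if np.array_equal(new_assignments, assignments):
                break
            assignments = new_assignments
        return assignments, medoids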
K-Medoid vs. K-means
• K-medoid clustering can be applied to non-vector data with non-Euclidean distance measures.
• For example:
– K-medoid can be used to cluster a set of time series objects, using DTW as the distance measure.
– K-medoid can be used to cluster a set of strings, using the edit distance as the distance measure.
• K-means cannot be used in such cases.
– Means may not make sense for non-vector data.
– For example, it does not make sense to talk about the mean of a set of strings. However, we can define (and find) the medoid of a set of strings, under the edit distance.
K-Medoid vs. K-means
• K-medoid clustering can be applied to non-vector data with non-Euclidean distance measures.
• K-medoid clustering is more robust to outliers.
– A single outlier can dominate the mean of a cluster, but it typically has only a small influence on the medoid.
• The k-means algorithm can be proven to converge to a local optimum.
– The k-medoid algorithm may not converge.
EM for Clustering
• Earlier in the course, we saw how to use the Expectation-Maximization (EM) algorithm to estimate the parameters of a mixture of Gaussians.
• This EM algorithm can be viewed as a clustering algorithm.
– Each Gaussian defines a cluster.
EM Clustering:
Initialization
• Let's assume that each x_i is a D-dimensional column vector.
• We want to learn the parameters of K Gaussians N_1, N_2, …, N_K.
• Gaussian N_k is defined by these parameters:
– Mean μ_k, which is a D-dimensional column vector.
– Covariance matrix Σ_k, which is a D × D matrix.
• We initialize each μ_k and each Σ_k to random values.
EM Clustering:
Initialization
• We initialize each μ_k and each Σ_k to random values.
• The figure shows those initial assignments.
– For every object x_i, we give it the color of the cluster k for which p_ik is the highest.
EM Clustering:
Main Loop
• The main loop alternates between:
– Computing new assignment probabilities p_ik that object x_i belongs to Gaussian N_k.
– Computing, for each Gaussian N_k, a new mean μ_k, a new covariance matrix Σ_k, and a new weight w_k, using the current assignment probabilities p_ik.
• Here is the result after one iteration of the main loop:
EM Clustering:
Main Loop
• The main loop alternates between:
– Computing new assignment probabilities p_ik that object x_i belongs to Gaussian N_k.
– Computing, for each Gaussian N_k, a new mean μ_k, a new covariance matrix Σ_k, and a new weight w_k, using the current assignment probabilities p_ik.
• Here is the result after two iterations of the main loop:
EM Clustering:
Main Loop
• The main loop alternates between:
– Computing new assignment probabilities p_ik that object x_i belongs to Gaussian N_k.
– Computing, for each Gaussian N_k, a new mean μ_k, a new covariance matrix Σ_k, and a new weight w_k, using the current assignment probabilities p_ik.
• Here is the result after three iterations of the main loop:
EM Clustering:
Main Loop
• The main loop alternates between:
– Computing new assignment probabilities p_ik that object x_i belongs to Gaussian N_k.
– Computing, for each Gaussian N_k, a new mean μ_k, a new covariance matrix Σ_k, and a new weight w_k, using the current assignment probabilities p_ik.
• Here is the result after four iterations of the main loop:
EM Clustering:
Main Loop
• The main loop alternates between:
– Computing new assignment probabilities p_ik that object x_i belongs to Gaussian N_k.
– Computing, for each Gaussian N_k, a new mean μ_k, a new covariance matrix Σ_k, and a new weight w_k, using the current assignment probabilities p_ik.
• Here is the result after five iterations of the main loop.
EM Clustering:
Main Loop
• Here is the result after six iterations of the main loop.
• The results have not changed, so we stop.
• Note that, for this example, EM has not found the same clusters that k-means produced.
– In general, k-means and EM may perform better or worse, depending on the nature of the data we want to cluster, and our criteria for what defines a good clustering result.
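For reference, here is a compact sketch of the main loop described on the preceding slides, with one Gaussian per cluster and full covariance matrices, using NumPy and SciPy's multivariate normal density. Initializing the means at randomly chosen data points, using identity covariances, and adding a small regularization term to each covariance matrix are implementation choices, not something specified on the slides.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_clustering(X, K, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        N, D = X.shape
        mu = X[rng.choice(N, size=K, replace=False)].astype(float)   # initial means
        Sigma = np.array([np.eye(D) for _ in range(K)])               # initial covariances
        w = np.full(K, 1.0 / K)                                       # initial weights
        for _ in range(iters):
            # E-step: p[i, k] = probability that x_i belongs to Gaussian N_k.
            p = np.column_stack([
                w[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                for k in range(K)
            ])
            p /= p.sum(axis=1, keepdims=True)
            # M-step: new mean, covariance matrix, and weight for each Gaussian.
            for k in range(K):
                Nk = p[:, k].sum()
                mu[k] = (p[:, k, None] * X).sum(axis=0) / Nk
                diff = X - mu[k]
                Sigma[k] = (p[:, k, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(axis=0) / Nk
                Sigma[k] += 1e-6 * np.eye(D)    # small regularization for numerical stability
                w[k] = Nk / N
        return p, mu, Sigma, w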
EM Clustering:
Another Example
• Here is another example.
• What clusters would you identify here, if we were looking for two clusters?
EM Clustering:
Another Example
• Here is another example.
• What clusters would you identify here, if we were looking for two clusters?
• In general, there are no absolute criteria for what defines a "good" clustering result, and different people may give different answers.
• However, for many people, the two clusters are:
– A central cluster of points densely crowded together.
– A peripheral cluster of points spread out far from the center.
EM Clustering:
Another Example
• Let's see how EM does on this dataset.
• Here we see the initialization.
EM Clustering:
Another Example
• Let's see how EM does on this dataset.
• Here we see the result after one iteration of the main loop.
EM Clustering:
Another Example
• Let's see how EM does on this dataset.
• Here we see the result after two iterations of the main loop.
EM Clustering:
Another Example
• Let's see how EM does on this dataset.
• Here we see the result after three iterations of the main loop.
EM Clustering:
Another Example
• Let's see how EM does on this dataset.
• Here we see the result after six iterations of the main loop.
EM Clustering:
Another Example
• Let's see how EM does on this dataset.
• Here we see the result after nine iterations of the main loop.
• This is the final result; subsequent iterations do not lead to any changes.
EM vs. K-Means
• The k-means result (on the left) is different from the EM result (on the right).
• EM can assign an object to cluster A even if the object is closer to the mean of cluster B.
EM vs. K-Means
• Another important difference between K-means clustering and EM clustering is:
– K-means makes hard assignments: an object belongs to one and only one cluster.
– EM makes soft assignments: weights p_ik specify how much each object x_i belongs to cluster k.
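The contrast is easy to see in code: the soft weights produced by EM can always be collapsed into hard, k-means-style labels, but the reverse direction loses information. A tiny sketch, reusing the hypothetical em_clustering function from the earlier example:

    # X is assumed to be an (N, D) NumPy array of data points.
    p, mu, Sigma, w = em_clustering(X, K=3)   # soft assignments: p[i, k] in [0, 1]
    hard_labels = p.argmax(axis=1)            # hard assignments: one cluster per object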
Agglomerative
Clustering
• In agglomerative clustering, there are different levels of clustering.
– At the top level, every object is its own cluster.
– Under the top level, each level is obtained by merging two clusters from the previous level.
– The bottom level has just one cluster, covering the entire dataset.
[Figure: a dataset of seven points A–G, and the corresponding dendrogram.]
Agglomerative
Clustering
• Under the top level, each level is obtained by merging two clusters from the previous level.
• There are different variants of agglomerative clustering.
• Each variant is specified by the method for selecting which two clusters to merge at each level.
[Figure: the seven points A–G, and the dendrogram.]
Agglomerative
Clustering
• Suppose that, at each level, we merge the two clusters from the previous level that have the smallest minimum distance.
• Let D be a distance measure between objects in our data.
• The minimum distance D_min(A, B) between two sets A and B is defined as:

  D_min(A, B) = min_{x ∈ A, y ∈ B} D(x, y)

[Figure: the seven points A–G; at the top level of the dendrogram, every object is its own cluster.]
• What do we merge at the next level?
Agglomerative
Clustering
• Suppose that, at each level, we merge the two clusters from the previous level that have the smallest minimum distance.
• Let D be a distance measure between objects in our data.
• The minimum distance D_min(A, B) between two sets A and B is defined as:

  D_min(A, B) = min_{x ∈ A, y ∈ B} D(x, y)

[Figure: the dendrogram after the first merge, which creates the cluster {F,G}.]
• Clusters {F} and {G} are the closest to each other. Next?
Agglomerative
Clustering
• Suppose that, at each level, we merge the two clusters from the previous level that have the smallest minimum distance.
• Let D be a distance measure between objects in our data.
• The minimum distance D_min(A, B) between two sets A and B is defined as:

  D_min(A, B) = min_{x ∈ A, y ∈ B} D(x, y)

[Figure: the dendrogram after the second merge, which creates the cluster {A,D}.]
• Clusters {A} and {D} are the closest to each other. Next?
Agglomerative
Clustering
• Suppose that, at each level, we merge the two clusters from the previous level that have the smallest minimum distance.
• Let D be a distance measure between objects in our data.
• The minimum distance D_min(A, B) between two sets A and B is defined as:

  D_min(A, B) = min_{x ∈ A, y ∈ B} D(x, y)

[Figure: the dendrogram after the third merge, which creates the cluster {B,C}.]
• Clusters {B} and {C} are the closest to each other. Next?
Agglomerative
Clustering
• Suppose that, at each level, we merge the two clusters from the previous level that have the smallest minimum distance.
• Let D be a distance measure between objects in our data.
• The minimum distance D_min(A, B) between two sets A and B is defined as:

  D_min(A, B) = min_{x ∈ A, y ∈ B} D(x, y)

[Figure: the dendrogram after the fourth merge, which creates the cluster {A,D,B,C}.]
• Clusters {A,D} and {B,C} are the closest to each other. Next?
Agglomerative
Clustering
• Suppose that, at each level, we merge the two clusters from the previous level that have the smallest minimum distance.
• Let D be a distance measure between objects in our data.
• The minimum distance D_min(A, B) between two sets A and B is defined as:

  D_min(A, B) = min_{x ∈ A, y ∈ B} D(x, y)

[Figure: the dendrogram after the fifth merge, which creates the cluster {A,D,B,C,E}.]
• Merging {A,D,B,C} and {E}. Next?
Agglomerative
Clustering
• Suppose that, at each level, we merge the two clusters from the previous level that have the smallest minimum distance.
• Let D be a distance measure between objects in our data.
• The minimum distance D_min(A, B) between two sets A and B is defined as:

  D_min(A, B) = min_{x ∈ A, y ∈ B} D(x, y)

[Figure: the complete dendrogram, ending with the single cluster {A,D,B,C,E,F,G}.]
• Merging {A,D,B,C,E} and {F,G}.
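The merging procedure that was just walked through (single-link agglomerative clustering, using the minimum distance D_min) can be sketched as follows. Clusters are kept as sets of indices into a precomputed pairwise distance matrix dist; all names are illustrative.

    import numpy as np

    def d_min(dist, A, B):
        """Minimum distance between two clusters, given as sets of point indices."""
        return min(dist[x, y] for x in A for y in B)

    def agglomerative(dist):
        """Return the list of clustering levels, from all singletons down to one cluster."""
        clusters = [frozenset([i]) for i in range(dist.shape[0])]
        levels = [list(clusters)]
        while len(clusters) > 1:
            # Find the pair of clusters with the smallest minimum distance, and merge it.
            i, j = min(
                ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                key=lambda ab: d_min(dist, clusters[ab[0]], clusters[ab[1]]),
            )
            merged = clusters[i] | clusters[j]
            clusters = [c for m, c in enumerate(clusters) if m not in (i, j)] + [merged]
            levels.append(list(clusters))
        return levels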
Agglomerative
Clustering
• Instead of using D_min(A, B), we could use other definitions of the distance between two sets.
• For example:

  D_max(A, B) = max_{x ∈ A, y ∈ B} D(x, y)

  D_mean(A, B) = mean_{x ∈ A, y ∈ B} D(x, y)

• Each of these definitions can lead to different merging choices at each level.
[Figure: the seven points A–G and the full dendrogram.]
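In practice, these set distances are available off the shelf: SciPy's hierarchical clustering module implements the minimum, maximum, and mean criteria under the linkage names "single", "complete", and "average". A brief sketch, assuming X is an (N, D) NumPy array:

    from scipy.cluster.hierarchy import linkage, fcluster

    Z_single   = linkage(X, method="single")     # uses D_min: smallest pairwise distance
    Z_complete = linkage(X, method="complete")   # uses D_max: largest pairwise distance
    Z_average  = linkage(X, method="average")    # uses D_mean: mean pairwise distance

    # Cut one of the dendrograms into, e.g., 3 clusters and inspect the labels.
    labels = fcluster(Z_single, t=3, criterion="maxclust")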
Hierarchical
Clustering
• Agglomerative clustering is an example of what we call hierarchical clustering.
• In hierarchical clustering, there are different levels of clustering.
– Each level is obtained by merging, or splitting, clusters from the previous level.
• If we merge clusters from the previous level, we get agglomerative clustering.
• If we split clusters from the previous level, it is called divisive clustering.
[Figure: the seven points A–G and the full dendrogram.]
Clustering - Recap
• The goal in clustering is to split a set of objects into groups of similar objects.
• There is no single criterion for measuring the quality of a clustering result.
• The number of clusters typically needs to be specified in advance.
– Methods (that we have not seen) do exist for trying to find the number of clusters automatically.
– In hierarchical clustering (e.g., agglomerative clustering), we do not need to pick a number of clusters.
• A large variety of clustering methods exist.
• We saw a few examples of such methods:
– K-means, k-medoid, EM, agglomerative clustering.