Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Unsupervised Learning & Clustering
Introduction
Mixture Densities and Identifiability
Maximum-Likelihood Estimates
Application to Normal Mixtures
Data Description and Clustering
Criterion Functions for Clustering
Hierarchical Clustering
Component Analysis
Introduction

• Previously, all our training samples were labeled: those samples were said to be “supervised”.
• We now investigate “unsupervised” procedures using unlabeled samples. There are at least five reasons for this:
  1. Collecting and labeling a large set of samples can be costly.
  2. We can train with large amounts of (less expensive) unlabeled data, and only then use supervision to label the groupings found; this is appropriate for large “data mining” applications.
  3. This is also appropriate in many applications where the characteristics of the patterns can change slowly with time.
  4. We can use unsupervised methods to identify features that will then be useful for categorization.
  5. We gain some insight into the nature (or structure) of the data.
Mixture Densities and Identifiability
• We begin with the assumption that the functional forms for
the underlying probability densities are known and that the
only thing that must be learned is the value of an unknown
parameter vector
• We make the following assumptions:
  1. The samples come from a known number c of classes
  2. The prior probabilities P(ω_j) for each class are known, j = 1, …, c
  3. The forms of the class-conditional densities p(x | ω_j, θ_j), j = 1, …, c, are known
  4. The values of the c parameter vectors θ_1, θ_2, …, θ_c are unknown
  5. The category labels are unknown
• The mixture density is

  $$ p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j), \qquad \theta = (\theta_1, \theta_2, \ldots, \theta_c)^t $$

  where the conditional densities p(x | ω_j, θ_j) are called the component densities and the prior probabilities P(ω_j) are called the mixing parameters.
• This density function is called a mixture density.
• Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ.
• Once θ is known, we can decompose the mixture into its components and use a maximum a posteriori (MAP) classifier on the derived densities.
• Definition: a density p(x | θ) is said to be identifiable if θ ≠ θ′ implies that there exists an x such that

  $$ p(x \mid \theta) \neq p(x \mid \theta') $$

  i.e., a density is identifiable only if we can recover unique parameters from it.
• Most mixtures of commonly encountered real-world density functions are identifiable, especially continuous densities.
• The textbook gives some unidentifiable examples.
• Although identifiability can be a problem, we always assume that the densities we are dealing with are identifiable!
Maximum-Likelihood Estimates

• Suppose that we have a set D = {x1, …, xn} of n unlabeled samples drawn independently from the mixture density

  $$ p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j) $$

  where θ is fixed but unknown.
• To estimate θ, take the gradient of the log-likelihood with respect to θ_i and set it to zero.
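For reference, this is the standard maximum-likelihood setup for mixtures (not spelled out on the slide; it assumes each θ_i appears only in its own component density):

$$ l(\theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta), \qquad \nabla_{\theta_i} l = \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i) = 0 $$

where $P(\omega_i \mid x_k, \theta) = \dfrac{p(x_k \mid \omega_i, \theta_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \theta_j)\, P(\omega_j)}$ is the posterior probability of class ω_i given sample x_k.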
Applications to Normal Mixtures

p(x | ω_i, θ_i) ~ N(μ_i, Σ_i)

  Case    μ_i    Σ_i    P(ω_i)    c
   1       ?      x      x        x
   2       ?      ?      ?        x
   3       ?      ?      ?        ?

  x = known, ? = unknown

Case 1 is the simplest case.
Case 1: Unknown Mean Vectors

• This “simplest” case is not easy, and the textbook obtains an iterative gradient-ascent (hill-climbing) procedure to maximize the log-likelihood function.
• Example: consider the simple two-component, one-dimensional normal mixture

  $$ p(x \mid \mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x - \mu_1)^2\right] + \frac{2}{3\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x - \mu_2)^2\right] $$

  Set μ_1 = -2, μ_2 = 2 and draw 25 samples from this mixture. The log-likelihood function is

  $$ l(\mu_1, \mu_2) = \sum_{k=1}^{n} \ln p(x_k \mid \mu_1, \mu_2) $$
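The slide only references the iterative procedure; stated from the usual derivation, its fixed-point form for the unknown means is

$$ \hat{\mu}_i(t+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(t))\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(t))} $$

i.e., each new mean estimate is a weighted average of the samples, weighted by the current posterior probability that each sample came from class ω_i.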
[Figure: maximum-likelihood solution for the two-component mixture example]
k-Means Clustering

• A popular approximation method to estimate the c mean vectors μ_1, μ_2, …, μ_c.
• Replace the squared Mahalanobis distance by the squared Euclidean distance.
• Find the mean μ̂_m nearest to x_k and approximate the posterior as

  $$ \hat{P}(\omega_i \mid x_k, \hat{\theta}) = \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases} $$

• Use the iterative scheme to find μ̂_1, μ̂_2, …, μ̂_c.
• The number of iterations is usually much less than the number of samples.
• If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

  Begin
    initialize n, c, μ1, μ2, …, μc (randomly selected)
    do
      classify the n samples according to the nearest μi
      recompute μi
    until no change in μi
    return μ1, μ2, …, μc
  End
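A minimal NumPy sketch of this loop (not the textbook's code; the function name and defaults are my own, and the means are initialized from randomly selected samples as the pseudocode suggests):

```python
import numpy as np

def k_means(X, c, max_iter=100, seed=None):
    """Cluster the rows of X (n samples, d features) into c groups."""
    rng = np.random.default_rng(seed)
    # initialize the c means with randomly selected samples
    means = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(max_iter):
        # classify each sample according to the nearest mean (squared Euclidean distance)
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each mean from the samples currently assigned to it
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
                              for i in range(c)])
        if np.allclose(new_means, means):  # stop when no mean changes
            break
        means = new_means
    return means, labels
```

For example, `means, labels = k_means(data, c=3)` returns the c estimated mean vectors together with the final assignment of each sample.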
[Figure: two-class k-means example; compare with the maximum-likelihood result in the previous figure]

[Figure: three-class k-means example; convergence in three iterations]
Data Description and Clustering

• The structure of multidimensional patterns is important for clustering.
• If we know that the data come from a specific distribution, the data can be represented by a compact set of parameters (sufficient statistics).
  • E.g., the mean and covariance matrix for a normal distribution.
• If the samples are assumed to come from a specific distribution but actually do not, these statistics give a misleading representation of the data (next figure).
[Figure: a data set whose sample mean and covariance matrix give a misleading description of its structure]
• A mixture of normal distributions can approximate a large variety of situations; in principle, it can approximate virtually any density function.
• In these cases, one can use parametric methods to estimate the parameters of the mixture density.
• However, if little prior knowledge can be assumed, the assumption of a parametric form is meaningless: we are actually imposing structure on the data, not finding structure in it!
• In these cases, one can use nonparametric methods to estimate the unknown mixture density.
• If the goal is to find subclasses, one can use a clustering procedure to identify groups of data points having strong internal similarities.
Similarity measures

• The question is how to evaluate whether the samples in one cluster are more similar among themselves than to samples in other clusters.
• Two issues:
  • How do we measure the similarity between samples?
  • How do we evaluate a partitioning of a set into clusters?
• The most obvious measure of similarity between two samples is the distance between them, i.e., we define a metric.
• Once this measure is defined, one would expect the distance between samples of the same cluster to be significantly less than the distance between samples in different classes.
• The Euclidean distance is a possible metric: a possible criterion is to consider samples as belonging to the same cluster if their distance is less than a threshold d0.
• Clusters defined by Euclidean distance are invariant to translations and rotations of the feature space, but not to general transformations that distort the distance relationships.
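A small sketch of this threshold criterion (my own illustration, not from the text): treat two samples as linked when their Euclidean distance is below d0, and take the connected components of the resulting graph as the clusters.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def threshold_clusters(X, d0):
    """Group samples whose pairwise Euclidean distance is below the threshold d0."""
    dist = squareform(pdist(X))              # n x n matrix of pairwise Euclidean distances
    adjacency = dist < d0                    # link two samples if they are closer than d0
    np.fill_diagonal(adjacency, False)       # ignore self-links
    n_clusters, labels = connected_components(csr_matrix(adjacency), directed=False)
    return n_clusters, labels
```

The choice of d0 controls how aggressively samples merge: a small threshold leaves many singleton clusters, while a large one merges everything into a single cluster.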
[Figure: scaling axes]

[Figure: scaling for unit variance may be undesirable]
Criterion Functions for Clustering

• The second issue: how do we evaluate a partitioning of a set into clusters?
• Clustering can be posed as the optimization of a criterion function, e.g.:
  1. The sum-of-squared-error criterion and its variants
  2. Scatter criteria

The Sum-of-Squared-Error Criterion

  $$ J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^2
     \qquad \text{where} \qquad
     m_i = \frac{1}{n_i} \sum_{x \in D_i} x $$
• This criterion defines clusters by their mean vectors m_i, in the sense that it minimizes the sum of the squared lengths of the errors x - m_i.
• The optimal partition is defined as the one that minimizes J_e; it is also called the minimum-variance partition.
• It works well when the clusters form well-separated compact clouds, less well when there are great differences in the number of samples in different clusters.
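A short sketch of computing J_e for a given partition (my own helper; it takes a label assignment such as the one produced by the k-means sketch above):

```python
import numpy as np

def sum_squared_error(X, labels, c):
    """J_e: sum over clusters of squared distances to the cluster mean."""
    J_e = 0.0
    for i in range(c):
        cluster = X[labels == i]
        if len(cluster) == 0:
            continue
        m_i = cluster.mean(axis=0)              # cluster mean vector
        J_e += ((cluster - m_i) ** 2).sum()     # squared error for this cluster
    return J_e
```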
Scatter Criteria

• Scatter matrices used in multiple discriminant analysis, i.e., the within-cluster scatter matrix S_W and the between-cluster scatter matrix S_B:

  S_T = S_B + S_W    (S_T = total scatter matrix)

  (Note: S_T depends only on the set of samples and not on the partitioning.)
• Various criteria can be used to minimize the within-cluster scatter or maximize the between-cluster scatter.
• The trace criterion (minimize tr[S_W]) is the simplest scalar measure of the scatter matrix; it is proportional to the sum of the variances in the coordinate directions, and is in practice the sum-of-squared-error criterion:

  $$ \mathrm{tr}[S_W] = \sum_{i=1}^{c} \mathrm{tr}[S_i] = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^2 = J_e $$

• Since tr[S_T] = tr[S_W] + tr[S_B] and tr[S_T] is independent of the partitioning, no new results can be derived by maximizing tr[S_B], where

  $$ \mathrm{tr}[S_B] = \sum_{i=1}^{c} n_i \lVert m_i - m \rVert^2 $$

• However, minimizing the within-cluster criterion J_e = tr[S_W] is equivalent to maximizing the between-cluster criterion tr[S_B].
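A sketch of the scatter matrices for a labeled partition (my own helper, not from the text; it also checks the identity S_T = S_W + S_B numerically):

```python
import numpy as np

def scatter_matrices(X, labels, c):
    """Within-cluster (S_W), between-cluster (S_B), and total (S_T) scatter matrices."""
    m = X.mean(axis=0)                                    # total mean vector
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for i in range(c):
        cluster = X[labels == i]
        if len(cluster) == 0:
            continue
        m_i = cluster.mean(axis=0)
        diff = cluster - m_i
        S_W += diff.T @ diff                              # scatter of cluster i about its mean
        S_B += len(cluster) * np.outer(m_i - m, m_i - m)  # scatter of the cluster means about the total mean
    diff_T = X - m
    S_T = diff_T.T @ diff_T                               # total scatter, independent of the partition
    assert np.allclose(S_T, S_W + S_B)
    return S_W, S_B, S_T
```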
Hierarchical Clustering

• Many times clusters are not disjoint, but may have subclusters, which in turn have sub-subclusters, etc.
• Consider a sequence of partitions of the n samples into c clusters:
  • The first is a partition into n clusters, each one containing exactly one sample.
  • The second is a partition into n-1 clusters, the third into n-2, and so on, until the n-th, in which there is only one cluster containing all of the samples.
  • At level k in the sequence, c = n-k+1.
• Given any two samples x and x', they will be grouped together at some level, and if they are grouped at level k, they remain grouped for all higher levels.
• Hierarchical clustering can be represented by a tree called a dendrogram.
• The similarity values may help to determine whether the groupings are natural or forced; but if they are evenly distributed, no information can be gained.
• Another representation is based on Venn diagrams.
• Hierarchical clustering can be divided into agglomerative and divisive methods.
• Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by successively merging clusters.
• Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters.
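A sketch of agglomerative clustering using SciPy (assuming SciPy is available; the data, the single-linkage method, and the cut into three clusters are illustrative choices of mine):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in (-2, 0, 2)])

Z = linkage(X, method='single')                    # agglomerative: merge the nearest clusters first
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the tree into 3 clusters

# dendrogram(Z) draws the tree; the merge heights are the similarity (distance) values
```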
The problem of the number of clusters

• Typically the number of clusters is known.
• When it is not, there are several ways to proceed:
  • When clustering is done by extremizing a criterion function, a common approach is to repeat the clustering with c=1, c=2, c=3, etc., and examine how the criterion changes (see the sketch below).
  • Another approach is to set a threshold for the creation of a new cluster.
• These approaches are similar to model-selection procedures, typically used to determine the topology and number of states (e.g., clusters, parameters) of a model, given a specific application.
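One way to carry out the first approach, sketched here with SciPy's kmeans2 (the synthetic data, the range of c, and the use of J_e as the criterion are illustrative assumptions of mine):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (-3, 0, 3)])

for c in range(1, 7):
    means, labels = kmeans2(X, c, minit='++')
    # sum-of-squared-error criterion J_e for this value of c
    J_e = sum(((X[labels == i] - means[i]) ** 2).sum() for i in range(c))
    print(f"c = {c}: J_e = {J_e:.1f}")  # look for the c beyond which J_e stops dropping sharply
```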
Component Analysis

• Combine features to reduce the dimension of the feature space.
• Linear combinations are simple to compute and tractable.
• Project high-dimensional data onto a lower-dimensional space.
• Two classical approaches for finding an “optimal” linear transformation:
  • PCA (Principal Component Analysis): “the projection that best represents the data in a least-squares sense”
  • MDA (Multiple Discriminant Analysis): “the projection that best separates the data in a least-squares sense” (a generalization of Fisher’s Linear Discriminant to more than two classes)
• PCA (Principal Component Analysis): “the projection that best represents the data in a least-squares sense”.
• The scatter matrix of the cloud of samples is the same (up to the scale factor n) as the maximum-likelihood estimate of the covariance matrix.
  • Unlike a covariance matrix, however, the scatter matrix includes samples from all classes!
• The least-squares projection solution (maximum scatter) is simply the subspace defined by the d' < d eigenvectors of the covariance matrix that correspond to its d' largest eigenvalues.
  • Because the scatter matrix is real and symmetric, its eigenvectors are orthogonal.
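A minimal PCA sketch along these lines (my own illustration: eigendecomposition of the scatter matrix, keeping the d' leading eigenvectors):

```python
import numpy as np

def pca_project(X, d_prime):
    """Project X onto the d' eigenvectors of the scatter matrix with the largest eigenvalues."""
    m = X.mean(axis=0)
    centered = X - m
    scatter = centered.T @ centered                 # scatter matrix (n times the ML covariance estimate)
    eigvals, eigvecs = np.linalg.eigh(scatter)      # real symmetric => orthogonal eigenvectors
    order = np.argsort(eigvals)[::-1][:d_prime]     # indices of the d' largest eigenvalues
    W = eigvecs[:, order]                           # d x d' projection matrix
    return centered @ W, W
```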
Fisher Linear Discriminant

• While PCA seeks directions that are efficient for representation, discriminant analysis seeks directions that are efficient for discrimination.
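For reference (the standard formulation, not spelled out on this slide): the Fisher linear discriminant chooses the projection direction w that maximizes the ratio of between-class to within-class scatter of the projected samples,

$$ J(w) = \frac{w^t S_B\, w}{w^t S_W\, w} $$

and for two classes the maximizing direction is proportional to $S_W^{-1}(m_1 - m_2)$.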
Multiple Discriminant Analysis

• A generalization of Fisher's Linear Discriminant to more than two classes.
Videos

• http://www.youtube.com/watch?v=aiJ8II94qck
  • Seeds are data samples
• https://class.coursera.org/ml-003/lecture/78
  • Seeds are random points and not data samples