Course: Intelligent Information Processing (《智能信息处理》)
Lecture 4: Fuzzy Information Processing Techniques (4)
Principles of Fuzzy Clustering
October 17, 2008
(Friday, periods 3-4, Room 理教110)
1
Fuzzy Clustering
What’s clustering?
Some concepts
Clustering Algorithms
K-means method
Fuzzy C-means (FCM) clustering method
Hierarchical Clustering Algorithms
Mixture of Gaussians
Homework
2
What’s clustering ?
Clustering can be considered the most important
unsupervised learning problem, it deals with
finding a structure in a collection of unlabeled
data.
Definition of clustering
The process of organizing objects into groups whose
members
are
similar
in
some
way.
A cluster is a collection of objects which are
“similar” between them and are “dissimilar” to
the objects belonging to other clusters.
3
a graphical example of clustering
4
It is easy to identify the four clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are "close" according to a given distance (in this case, geometric distance). This is called distance-based clustering.
Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all of them.
5
Vehicle Example

Vehicle | Top speed (km/h) | Colour | Air resistance | Weight (kg)
V1      | 220              | red    | 0.30           | 1300
V2      | 230              | black  | 0.32           | 1400
V3      | 260              | red    | 0.29           | 1500
V4      | 140              | gray   | 0.35           | 800
V5      | 155              | blue   | 0.33           | 950
V6      | 130              | white  | 0.40           | 600
V7      | 100              | black  | 0.50           | 3000
V8      | 105              | red    | 0.60           | 2500
V9      | 110              | gray   | 0.55           | 3500
6
Vehicle Clusters
[Scatter plot of Weight (kg) against Top speed (km/h): the vehicles fall into three groups, labelled Sports cars, Medium market cars, and Lorries.]
7
Terminology
[The same Weight (kg) vs. Top speed (km/h) plot, annotated with the basic terms: feature, feature space, object (data point), label, and cluster.]
8
The Goals of Clustering
To determine the intrinsic grouping in a
set of unlabeled data.
How to decide what constitutes a good
clustering?
9
The Goals of Clustering(2)
It can be shown that there is no absolute "best" criterion that would be independent of the final aim of the clustering.
Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
10
The Goals of Clustering(3)
For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection).
11
Rich Applications of Clustering
Pattern Recognition
Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters and use them in other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
 Document classification
 Cluster Weblog data to discover groups of similar access
patterns
12
Examples of Clustering
Applications
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
13
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
14
Requirements of
a clustering algorithm
scalability;
dealing with different types of attributes;
discovering clusters with arbitrary shape;
minimal requirements for domain knowledge to
determine input parameters;
ability to deal with noise and outliers;
insensitivity to order of input records;
ability to handle high dimensionality;
interpretability and usability.
15
Quality: What Is Good Clustering?
A good clustering method will produce high
quality clusters with
 high intra-class similarity
 low inter-class similarity
The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation
The quality of a clustering method is also
measured by its ability to discover some or all of
the hidden patterns
16
Problems
 current clustering techniques do not address all the
requirements adequately (and concurrently);
 dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity;
 the effectiveness of the method depends on the definition
of “distance” (for distance-based clustering);
 if an obvious distance measure doesn’t exist we must
“define” it, which is not always easy, especially in multidimensional spaces;
 the result of the clustering algorithm (that in many cases
can be arbitrary itself) can be interpreted in different
ways.
17
Clustering Algorithms
Clustering algorithms may be classified as
listed below:
Exclusive Clustering
Overlapping Clustering
Hierarchical Clustering
Probabilistic Clustering
18
Exclusive Clustering
Data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster it cannot be included in another cluster.
A simple example of this is shown in the figure below, where the separation of points is achieved by a straight line in a two-dimensional plane.
19
20
Overlapping clustering
Overlapping clustering uses fuzzy sets to
cluster data, so that each point may belong
to two or more clusters with different
degrees of membership.
In this case, data will be associated with an appropriate membership value.
21
Hierarchical Clustering
A hierarchical clustering algorithm is based on the union of the two nearest clusters. The initial condition is realized by setting every datum as its own cluster. After a few iterations, the desired final clusters are reached.
22
Probabilistic Clustering
Probabilistic clustering uses a completely
probabilistic approach for clustering the
data in hand.
23
Four most used clustering
algorithms
K-means
Fuzzy C-means
Hierarchical clustering
Mixture of Gaussians
24
Distance Measure
An important component of a clustering
algorithm is the distance measure between
data points.
If the components of the data instance
vectors are all in the same physical units
then it is possible that the simple Euclidean
distance metric is sufficient to successfully
group similar data instances.
However, even in this case the Euclidean
distance can sometimes be misleading.
25
26
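To make the scale issue concrete, here is a minimal sketch (NumPy, not part of the original slides; the helper name `euclidean` and the example vectors are illustrative) of how raw Euclidean distance on the vehicle features is dominated by weight, and how standardizing each feature first can help.

```python
import numpy as np

# Two features on very different scales: top speed (km/h) and weight (kg),
# taken from the vehicle example above.
v1 = np.array([220.0, 1300.0])   # sports car (V1)
v3 = np.array([260.0, 1500.0])   # sports car (V3)
v7 = np.array([100.0, 3000.0])   # lorry (V7)

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# Raw distances are dominated by weight, because its numeric range is larger.
print(euclidean(v1, v3), euclidean(v1, v7))

# One common remedy: standardize each feature (zero mean, unit variance)
# before computing distances, so no single unit dominates.
X = np.vstack([v1, v3, v7])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(euclidean(Xs[0], Xs[1]), euclidean(Xs[0], Xs[2]))
```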
K-Means Clustering
 K-means (MacQueen, 1967) is one of the simplest
unsupervised learning algorithms that solve the well
known clustering problem.
 The procedure follows a simple and easy way to classify a
given data set through a certain number of clusters
(assume k clusters) fixed a priori.
 The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because a different location leads to a different result; the better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data-set points and the nearest new centroid. A loop has thus been generated; as a result of this loop, the k centroids change their location step by step until no more changes are made, i.e. the centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error function:
J = \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i^{(j)} - c_j \|^2
where \| x_i^{(j)} - c_j \|^2 is the squared distance between a data point x_i^{(j)} assigned to cluster j and the cluster centre c_j.
27
Partitioning Algorithms: Basic
Concept
 Partitioning method: Construct a partition of a database D of
n objects into a set of k clusters, such that the sum of squared
distances is minimized:
E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2
 Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by the center of
the cluster
 k-medoids (Kaufman & Rousseeuw’87): Each cluster is represented by
one of the objects in the cluster
29
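As an illustration of the partitioning criterion above, the following small sketch (NumPy assumed; the helper name `sse` is hypothetical) computes the sum of squared distances of every object to the centroid of its cluster for a given partition.

```python
import numpy as np

def sse(clusters):
    """Sum of squared distances of each point to its cluster centroid.

    `clusters` is a list of (n_m, d) arrays, one per cluster K_m; the
    centroid C_m of each cluster is taken to be its mean point.
    """
    total = 0.0
    for pts in clusters:
        centroid = pts.mean(axis=0)
        total += np.sum((pts - centroid) ** 2)
    return total

# Toy 2-D partition with two clusters.
partition = [np.array([[1.0, 1.0], [1.2, 0.8]]),
             np.array([[5.0, 5.0], [5.5, 4.5], [4.8, 5.2]])]
print(sse(partition))
```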
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest seed point
 Go back to Step 2; stop when no new assignments are made (a code sketch follows below)
30
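A minimal sketch of these steps, assuming NumPy; the function name `kmeans` and its defaults are hypothetical, and it is meant to illustrate the procedure rather than be an optimized implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the steps above.

    X is an (n, d) data matrix and k the number of clusters fixed a priori.
    Returns the final centroids and the cluster index of each point.
    """
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no assignment changed: the centroids have stopped moving
        labels = new_labels
        # Recompute each centroid as the barycenter (mean) of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

For data such as the vehicle example, one would typically standardize the features first (as in the distance-measure sketch) before calling something like `kmeans(X, k=3)`.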
The K-Means Clustering Method
Example
[Figure: with K = 2, two objects are arbitrarily chosen as the initial cluster centers; each object is assigned to the most similar center; the cluster means are updated; objects are then reassigned and the means updated again, until no assignment changes.]
31
Comments on the K-Means
Method
 Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
 Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
 Comment: Often terminates at a local optimum. The global
optimum may be found using techniques such as: deterministic
annealing and genetic algorithms
 Weakness
 Applicable only when mean is defined, then what about categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
32
Fuzzy C-Means Clustering
 Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. It is based on minimization of the following objective function:
J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \, \| x_i - c_j \|^2, \qquad 1 \le m < \infty
 where m is any real number greater than 1, u_{ij} is the degree of membership of x_i in cluster j, x_i is the i-th item of the d-dimensional measured data, c_j is the d-dimensional center of the cluster, and ||·|| is any norm expressing the similarity between any measured datum and the center.
33
Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the membership u_{ij} and the cluster centers c_j updated by
u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\|x_i - c_j\|}{\|x_i - c_k\|} \right)^{2/(m-1)}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^m \, x_i}{\sum_{i=1}^{N} u_{ij}^m}
This iteration stops when \max_{ij} |u_{ij}^{(k+1)} - u_{ij}^{(k)}| < \varepsilon, where \varepsilon is a termination criterion between 0 and 1 and k is the iteration step.
34
FCM’s Steps
1. Initialize the membership matrix U = [u_{ij}], U^{(0)}.
2. At step k: calculate the center vectors C^{(k)} = [c_j] using U^{(k)}.
3. Update U^{(k)} to U^{(k+1)}.
4. If ||U^{(k+1)} - U^{(k)}|| < \varepsilon then STOP; otherwise return to step 2 (a code sketch follows below).
35
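A minimal sketch of steps 1-4, assuming NumPy; the function name `fcm` and its default parameters are hypothetical, and the update rules are the standard FCM formulas given above.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-3, max_iter=100, seed=0):
    """Minimal fuzzy c-means sketch following steps 1-4 above.

    X: (N, d) data, c: number of clusters, m: fuzziness coefficient (> 1).
    Returns the (N, c) membership matrix U and the (c, d) cluster centers.
    """
    rng = np.random.default_rng(seed)
    # Step 1: initialize U = [u_ij] with random rows that sum to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: compute the center vectors c_j from the current U.
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Step 3: update U from the distances to the new centers.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)          # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Step 4: stop when the change in U drops below eps.
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return U, centers
```

This is also essentially the algorithm the homework asks you to implement for a 2-D data set.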
Remarks
As already told, data are bound to each
cluster by means of a Membership
Function, which represents the fuzzy
behavior of this algorithm. To do that, we
simply have to build an appropriate matrix
named U whose factors are numbers
between 0 and 1, and represent the degree
of membership between data and centers of
clusters.
36
A 1-D example
37
matrix U
Now, instead of using a graphical representation, we introduce a matrix U whose entries are the values taken from the membership functions:
[example matrices (a) and (b) shown on the slide]
The number of rows and columns depends on how many data and clusters we are considering. More exactly, we have C = 2 columns (C = 2 clusters) and N rows.
38
Other properties
• Every entry of U lies between 0 and 1.
• For every datum, the memberships over all clusters sum to 1: \sum_{j=1}^{C} u_{ij} = 1.
39
A 1-D application of the FCM
Figures below show the membership value for each datum
and for each cluster.
40
In the simulation, we have used a fuzziness coefficient m = 2 and we have also imposed that the algorithm terminate when the maximum change in membership falls below a threshold \varepsilon.
The first picture shows the initial condition, where the fuzzy distribution depends on the particular position of the clusters. No step has been performed yet, so the clusters are not identified very well.
Now we can run the algorithm until the stop condition is verified. The figure below shows the final condition, reached at the 8th step with m = 2 and \varepsilon = 0.3:
41
Is it possible to do better?
Certainly: we could use a higher accuracy, but we would also have to pay for it with a greater computational effort.
In the figure below we can see a better result obtained with the same initial conditions and \varepsilon = 0.01, but 37 steps were needed!
42
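Assuming the `fcm` sketch given earlier, the accuracy/effort trade-off described above could be explored like this (illustrative usage only):

```python
import numpy as np

# Synthetic 1-D data with two well-separated groups.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 0.5, 50), rng.normal(3, 0.5, 50)])[:, None]

U_coarse, _ = fcm(x, c=2, m=2.0, eps=0.3)    # looser threshold: fewer steps
U_fine, _   = fcm(x, c=2, m=2.0, eps=0.01)   # tighter threshold: more steps
```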
Hierarchical Clustering
Algorithms
Given a set of N items to be clustered, and an N*N distance
(or similarity) matrix, the basic process of hierarchical
clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one fewer cluster.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
43
Algorithm Steps
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s),
according to
d[(r),(s)] = min d[(i),(j)]
where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number : m = m +1. Merge clusters (r) and (s) into a single
cluster to form the next clustering m. Set the level of this clustering to
L(m) = d[(r),(s)]
4. Update the proximity matrix, D, by deleting the rows and columns corresponding to
clusters (r) and (s) and adding a row and column corresponding to the newly formed
cluster. The proximity between the new cluster, denoted (r,s) and old cluster (k) is
defined in this way:
d[(k),(r,s)] = min{ d[(k),(r)], d[(k),(s)] }
5. If all objects are in one cluster, stop. Else, go to step 2.
44
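The following naive sketch (NumPy assumed; the helper name `single_link` is hypothetical) follows steps 1-5 directly, shrinking the proximity matrix after each merge; it is meant to illustrate the procedure, not to be efficient.

```python
import numpy as np

def single_link(D, labels):
    """Naive sketch of the agglomerative single-link steps listed above.

    D: symmetric (N, N) distance matrix; labels: names of the N items.
    Returns the merge history as (label_r, label_s, level) tuples.
    """
    D = D.astype(float)
    labels = list(labels)
    merges = []
    while len(labels) > 1:
        # Step 2: find the least dissimilar pair of clusters (r), (s).
        np.fill_diagonal(D, np.inf)
        r, s = np.unravel_index(np.argmin(D), D.shape)
        r, s = min(r, s), max(r, s)
        merges.append((labels[r], labels[s], D[r, s]))
        # Step 4: proximity of the new cluster (r,s) to any old cluster (k)
        # is d[(k),(r,s)] = min{ d[(k),(r)], d[(k),(s)] }.
        D[r, :] = np.minimum(D[r, :], D[s, :])
        D[:, r] = D[r, :]
        labels[r] = labels[r] + "/" + labels[s]
        # Delete the row and column of cluster (s).
        D = np.delete(np.delete(D, s, axis=0), s, axis=1)
        del labels[s]
    return merges
```

Applied to the Italian-cities distance matrix of the next slides, it reproduces the sequence of merges shown there (MI/TO first, then NA/RM, and so on).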
agglomerative / divisive
This kind of hierarchical clustering is
called agglomerative because it merges
clusters iteratively.
There is also a divisive hierarchical
clustering which does the reverse by
starting with all objects in one cluster and
subdividing them into smaller pieces.
45
Example
a hierarchical clustering of distances in
kilometers between some Italian cities
46
Input distance matrix (km)

      BA    FI    MI    NA    RM    TO
BA     0   662   877   255   412   996
FI   662     0   295   468   268   400
MI   877   295     0   754   564   138
NA   255   468   754     0   219   869
RM   412   268   564   219     0   669
TO   996   400   138   869   669     0
47
MI and TO are merged into MI/TO

        BA    FI  MI/TO    NA    RM
BA       0   662    877   255   412
FI     662     0    295   468   268
MI/TO  877   295      0   754   564
NA     255   468    754     0   219
RM     412   268    564   219     0
48
NA and RM are merged into a new NA/RM cluster

        BA    FI  MI/TO  NA/RM
BA       0   662    877    255
FI     662     0    295    268
MI/TO  877   295      0    564
NA/RM  255   268    564      0
49
Finally, after BA merges with NA/RM (at distance 255) and then FI joins them (at 268), only two clusters remain:

              BA/FI/NA/RM  MI/TO
BA/FI/NA/RM            0    295
MI/TO                295      0
50
Hierarchical tree
[Dendrogram of the six cities: MI and TO join at 138, NA and RM at 219, BA joins NA/RM at 255, FI joins them at 268, and finally MI/TO joins the rest at 295.]
51
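For comparison, the same single-link clustering of the city distances can be obtained with SciPy; this is an illustrative usage sketch (plotting the dendrogram additionally requires matplotlib).

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

# linkage() expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="single")
print(Z)                      # merge levels: 138, 219, 255, 268, 295
dendrogram(Z, labels=cities)  # the hierarchical tree shown on the slide
```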
Clustering as a Mixture of
Gaussians
This is a model-based approach, which consists in using certain models for the clusters and attempting to optimize the fit between the data and the model.
Each cluster can be mathematically represented
by a parametric distribution, like a Gaussian
(continuous) or a Poisson (discrete). The entire
data set is therefore modeled by a mixture of
these distributions. An individual distribution
used to model a specific cluster is often referred
to as a component distribution
52
A mixture model with high likelihood tends to have
the following traits:
 component distributions have high “peaks” (data in one
cluster are tight);
 the mixture model “covers” the data well (dominant
patterns in the data are captured by component
distributions).
Main advantages of model-based clustering:
 well-studied statistical inference techniques available;
 flexibility in choosing the component distribution;
 obtain a density estimation for each cluster;
 a “soft” classification is available.
53
Mixture of Gaussians
54
The algorithm works in the following way:
• it chooses one component (one of the Gaussians) at random, with probability equal to that component's mixing weight;
• it samples a point from the chosen Gaussian.
Suppose we have a sample x1, x2, ..., xN generated in this way, while the parameters of the mixture (the centres of the Gaussians) are unknown.
We can then write the likelihood of the sample; what we really want to maximise is the probability of the data given the centres of the Gaussians.
55
This probability is the basis for writing the likelihood function.
Now we should maximise the likelihood function by taking its derivatives with respect to the parameters and setting them to zero, but doing this directly would be too difficult.
That is why we use a simpler iterative procedure, the EM (Expectation-Maximization) algorithm.
56
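As a hedged illustration of this EM-based approach (not from the original slides), scikit-learn's GaussianMixture fits such a mixture and returns the "soft" memberships mentioned earlier; the synthetic data below are purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from two Gaussian components.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
               rng.normal(loc=[4, 4], scale=1.0, size=(100, 2))])

# Fit a 2-component mixture with the EM algorithm.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)

print(gmm.means_)            # estimated component centres
print(gmm.weights_)          # estimated mixing probabilities
soft = gmm.predict_proba(X)  # "soft" cluster membership of each point
```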
References
 Tariq Rashid: "Clustering". http://www.cs.bris.ac.uk/home/tr1690/documentation/fuzzy_clustering_initial_report/node11.html
 Osmar R. Zaïane: "Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering". http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter8/index.html
 Pier Luca Lanzi: "Ingegneria della Conoscenza e Sistemi Esperti - Lezione 2: Apprendimento non supervisionato". http://www.elet.polimi.it/upload/lanzi/corsi/icse/2002/Lezione%202%20%20Apprendimento%20non%20supervisionato.pdf
 J. C. Dunn (1973): "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics 3: 32-57.
 J. C. Bezdek (1981): "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York.
 Hans-Joachim Mucha and Hizir Sofyan: "Nonhierarchical Clustering". http://www.quantlet.com/mdstat/scripts/xag/html/xaghtmlframe149.html
 A. P. Dempster, N. M. Laird, and D. B. Rubin (1977): "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, 1:1-38.
 Jia Li: "Data Mining - Clustering by Mixture Models". http://www.stat.psu.edu/~jiali/course/stat597e/notes/mix.pd
57
Homework
1. Why do we need cluster analysis? What is it useful for?
2. List some application areas of clustering algorithms and briefly explain them.
3. Implement the FCM algorithm and use it on a 2-D data clustering problem; give the experimental results.
58
Thank you!
59