DATA MINING
From Data to Information
Ronald Westra
Dept. of Mathematics
Maastricht University
CLUSTERING AND
CLUSTER ANALYSIS
Data Mining Lecture IV
[Chapter 8, section 8.4, and Chapter 9 from Principles of Data Mining by Hand, Mannila, Smyth]
1. Clustering versus Classification
• classification: assign a pre-determined label to a sample
• clustering: derive the relevant labels for classification from the structure of a given dataset
• clustering: maximal intra-cluster similarity and maximal inter-cluster dissimilarity
• Objectives:
  1. segmentation of space
  2. find natural subclasses
Examples of Clustering and Classification:
1. Computer Vision
[figure slides: computer vision examples]
Examples of Clustering and Classification:
2. Types of chemical reactions
[figure slides: chemical reaction examples]
Voronoi Clustering
Georgy Fedoseevich Voronoy
1868 - 1908
Voronoi Clustering
A Voronoi diagram (also called a Voronoi tessellation, Voronoi decomposition, or Dirichlet tessellation) is a special kind of decomposition of a metric space determined by distances to a specified discrete set of objects in the space, e.g., by a discrete set of points.
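As a minimal sketch (not from the slides), the Voronoi cell of a seed point contains every point that is closer to that seed than to any other seed; the Python fragment below assigns query points to their nearest seed with plain Euclidean distances (seed and query coordinates are made up):

import math

def nearest_seed(point, seeds):
    # Index of the seed closest to `point` (Euclidean distance).
    return min(range(len(seeds)), key=lambda k: math.dist(point, seeds[k]))

# Hypothetical seeds and query points, purely for illustration.
seeds = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
points = [(1.0, 0.5), (3.5, 0.2), (2.0, 2.0)]
for p in points:
    print(p, "lies in the Voronoi cell of seed", nearest_seed(p, seeds))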
Voronoi Clustering
[figure slides: examples of Voronoi tessellations]
Partitional Clustering [book section 9.4]
• score functions
• centroid
• intra-cluster distance
• inter-cluster distance
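As a hedged reminder (standard definitions, not quoted from the book), typical score functions for partitional clustering combine a within-cluster and a between-cluster term, with rk the centroid of cluster Ck:
within-cluster (intra-cluster) distance: wc(C) = Σk Σ(x ∈ Ck) d(x, rk)²
between-cluster (inter-cluster) distance: bc(C) = Σ(j < k) d(rj, rk)²
A good partition has small wc(C) and large bc(C).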
C-means [book page 303]
k-means clustering (also: C-means)
The k-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster, i.e., its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
k-means clustering (also: C-means)
Example: The data set has three dimensions
and the cluster has two points: X = (x1, x2, x3)
and Y = (y1, y2, y3).
Then the centroid Z becomes Z = (z1, z2, z3),
where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and
z3 = (x3 + y3)/2
k-means clustering (also: C-means)
This is the basic structure of the algorithm (J.
MacQueen, 1967):
•Randomly generate k clusters and determine the
cluster centers or directly generate k seed points as
cluster centers
•Assign each point to the nearest cluster center.
•Recompute the new cluster centers.
•Repeat until some convergence criterion is met
(usually that the assignment hasn't changed).
C-means [book page 303]
while changes in cluster Ck
  % form clusters
  for k=1,…,K do
    Ck = {x | ||x – rk|| < ||x – rl|| for all l ≠ k}
  end
  % compute new cluster centroids
  for k=1,…,K do
    rk = mean({x | x ∈ Ck})
  end
end
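A minimal runnable version of the same loop in Python (illustrative only: the toy data points are made up and plain lists are used instead of the book's notation):

import math
import random

def kmeans(points, k, max_iter=100):
    # Plain k-means: assign each point to its nearest centroid, then recompute centroids.
    centroids = random.sample(points, k)              # seed with k random data points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # form clusters: index of the nearest centroid for every point
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # recompute centroids as the mean of the assigned points
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])    # keep the old centroid for an empty cluster
        if new_centroids == centroids:                # converged: centroids no longer change
            break
        centroids = new_centroids
    return labels, centroids

# Toy 2-D data with two obvious groups (hypothetical numbers).
data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
print(kmeans(data, k=2))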
k-means clustering (also: C-means)
The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Yet it does not systematically yield the same result with each run of the algorithm; rather, the resulting clusters depend on the initial assignments. The k-means algorithm maximizes inter-cluster (or, equivalently, minimizes intra-cluster) variance, but does not ensure that the solution it finds is not merely a local minimum of the variance.
k-means clustering (also: C-means)
[figure slides: k-means clustering examples]
Fuzzy c-means
One of the problems of the k-means algorithm is that it gives a hard partitioning of the data, that is to say that each point is attributed to one and only one cluster. But points on the edge of a cluster, or near another cluster, may not be as much in the cluster as points in the center of the cluster.
Fuzzy c-means
Therefore, in fuzzy clustering, each point does not pertain to exactly one cluster, but has a degree of belonging to each cluster, as in fuzzy logic. For each point x we have a coefficient uk(x) giving the degree of being in the k-th cluster. Usually the sum of those coefficients is required to be one, so that uk(x) can be read as a probability of belonging to the k-th cluster:
Σk uk(x) = 1
Fuzzy c-means
With fuzzy c-means, the centroid of a cluster is computed as the mean of all points, weighted by their degree of belonging to the cluster, that is:
ck = Σx uk(x)^m x / Σx uk(x)^m
Fuzzy c-means
The degree of being in a certain cluster is related to the inverse of the distance to the cluster centroid:
uk(x) = 1 / d(ck, x)
The coefficients are then normalized and fuzzified with a real parameter m > 1 so that their sum is 1. So:
uk(x) = 1 / Σj ( d(ck, x) / d(cj, x) )^(2/(m–1))
Fuzzy c-means
For m equal to 2, this is equivalent to normalising the coefficients linearly to make their sum 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm becomes similar to k-means.
Fuzzy c-means
The fuzzy c-means algorithm is very similar to the k-means algorithm:
Fuzzy c-means
• Choose a number of clusters.
• Assign randomly to each point coefficients for being in the clusters.
• Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
  • Compute the centroid for each cluster, using the formula above.
  • For each point, compute its coefficients of being in the clusters, using the formula above.
Fuzzy C-means
ujk is the membership of sample j in cluster k
ck is the centroid of cluster k
while changes in cluster Ck
  % compute new memberships
  for k=1,…,K do
    for j=1,…,N do
      ujk = f(||xj – ck||)
    end
  end
  % compute new cluster centroids (weighted mean)
  for k=1,…,K do
    ck = Σj ujk^m xj / Σj ujk^m
  end
end
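A compact runnable sketch of these two updates in Python (illustrative only, with made-up toy data; the membership update uses the inverse-distance formula with fuzzifier m shown above):

import math
import random

def fuzzy_c_means(points, k, m=2.0, n_iter=50):
    # Fuzzy c-means: alternate membership updates and weighted-centroid updates.
    centroids = random.sample(points, k)
    u = []
    for _ in range(n_iter):
        # membership update: ujk = 1 / sum_l ( d(xj, ck) / d(xj, cl) )^(2/(m-1))
        u = []
        for p in points:
            d = [max(math.dist(p, c), 1e-12) for c in centroids]   # avoid division by zero
            u.append([1.0 / sum((d[a] / d[b]) ** (2.0 / (m - 1.0)) for b in range(k))
                      for a in range(k)])
        # centroid update: ck = sum_j ujk^m xj / sum_j ujk^m
        centroids = []
        for a in range(k):
            w = [u[j][a] ** m for j in range(len(points))]
            total = sum(w)
            centroids.append(tuple(sum(w[j] * points[j][dim] for j in range(len(points))) / total
                                   for dim in range(len(points[0]))))
    return u, centroids

# Hypothetical 2-D toy data.
data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.8)]
memberships, centers = fuzzy_c_means(data, k=2)
print(centers)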
Fuzzy c-means
The fuzzy c-means algorithm minimizes intra-cluster variance as well, but has the same problems as k-means: the minimum found is a local minimum, and the results depend on the initial choice of weights.
Fuzzy c-means
[figure: trajectories of the fuzzy multivariate centroids during the iterations]
The Correct Number of Clusters
Algorithms like C-means and fuzzy C-means
need the “correct” number K of clusters in
your data set.
In realistic cases it is mostly impossible to
define what this number K should be.
Therefore, the following approach is often
used.
The Correct Number of Clusters
The sum of all distances between points and their respective centroid:
E = Σk Σ(i ∈ Ck) ||xi – ck||
The Correct Number of Clusters
Now plot this error E as a function of the number of clusters K.
[figure: E versus K, showing a "shoulder" in the curve]
The Correct Number of Clusters
Note that the error E keeps decreasing as K grows, but that the decrease levels off (the "shoulder" in the plot) once K reaches the natural number of clusters in your data set.
Now, how to define the error of your clustering? A solution is to sum all distances between points and their respective centroid, as above.
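A short sketch of this procedure in Python (it reuses the hypothetical kmeans function sketched earlier in these notes and made-up data; it is not part of the original slides):

import math

def clustering_error(points, labels, centroids):
    # E = sum over all points of the distance to their cluster centroid.
    return sum(math.dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Hypothetical data; kmeans() is the sketch given after the C-means pseudocode above.
data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
for K in range(1, 6):
    labels, centroids = kmeans(data, K)
    print(K, clustering_error(data, labels, centroids))
# Plotted against K, the error drops sharply up to the natural number of
# clusters (here 2) and then flattens out: the "shoulder".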
Hierarchical Clustering [book section 9.5]
One major problem with partitional clustering is that the number of clusters (= number of classes) must be pre-specified!
This poses the question: what IS the real number of clusters in a given set of data?
Answer: it depends!
• Agglomerative methods: bottom-up
• Divisive methods: top-down
Hierarchical Clustering
Agglomerative hierarchical clustering
[figure slides: dendrogram examples]
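A minimal sketch of agglomerative (bottom-up) clustering with single linkage in Python (a generic illustration with made-up data, not the book's pseudocode): start with every point in its own cluster and repeatedly merge the two closest clusters.

import math

def single_linkage(points, target_k):
    # Agglomerative clustering: merge the two closest clusters until target_k remain.
    clusters = [[p] for p in points]                      # every point starts as its own cluster
    while len(clusters) > target_k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-linkage distance: smallest point-to-point distance between the clusters
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])                   # merge cluster b into cluster a
        del clusters[b]
    return clusters

data = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]
print(single_linkage(data, target_k=3))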
Introduction to Bioinformatics
7.3 Inferring Trees
• n taxa {t1,…,tn}
• D: matrix of pairwise genetic distances (+ JC-correction)
• Additive distances: the distance over the path from i → j is d(i,j)
• (Total) length of a tree: sum of all branch lengths
Introduction to Bioinformatics
7.3 Inferring Trees
Finding branch lengths: the three-point formula.
For a star tree in which the centre joins leaves A, B and C by branches of lengths Lx, Ly and Lz:
Lx + Ly = dAB
Lx + Lz = dAC
Ly + Lz = dBC
Solving gives:
Lx = (dAB + dAC – dBC)/2
Ly = (dAB + dBC – dAC)/2
Lz = (dAC + dBC – dAB)/2
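A tiny Python sketch of the three-point formula with hypothetical distances (purely illustrative):

def three_point(d_ab, d_ac, d_bc):
    # Branch lengths Lx (to A), Ly (to B), Lz (to C) from the pairwise distances.
    lx = (d_ab + d_ac - d_bc) / 2
    ly = (d_ab + d_bc - d_ac) / 2
    lz = (d_ac + d_bc - d_ab) / 2
    return lx, ly, lz

# Hypothetical additive distances; result is approximately (0.5, 0.1, 0.3).
print(three_point(d_ab=0.6, d_ac=0.8, d_bc=0.4))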
Introduction to Bioinformatics
7.3 Inferring Trees
Four-point formula:
Four-point condition: when (1,2) and (i,j) are neighbour-couples,
d(1,2) + d(i,j) < d(i,1) + d(2,j)
Define:
Ri = Σj d(ti, tj)
M(i,j) = (n – 2) d(i,j) – Ri – Rj
Minimizing M(i,j) minimizes d(i,j) AND the total distance in the tree.
If i and j are neighbours, then:
M(i,j) < M(i,k) for all k not equal to j
NJ algorithm:
Input: n×n distance matrix D and an outgroup
Output: rooted phylogenetic tree T
Step 1: Compute the new table M from D; select the smallest value of M to select the two taxa to join.
Step 2: Join the two taxa ti and tj to a new vertex V; use the three-point formula to calculate the updated distance matrix D’ in which ti and tj are replaced by V.
Step 3: Compute the branch lengths from tk to V using the three-point formula, with T(V,1) = ti and T(V,2) = tj and TD(ti) = L(ti,V) and TD(tj) = L(tj,V).
Step 4: The distance matrix D’ now contains n – 1 taxa. If there are more than 2 taxa left, go to step 1. If two taxa are left, join them by a branch of length d(ti,tj).
Step 5: Define the root node as the branch connecting the outgroup to the rest of the tree. (Alternatively, determine the so-called “mid-point”.)
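A rough Python sketch of the pair-selection step (Step 1) under the M criterion above, with a hypothetical distance matrix; a full neighbour-joining implementation would also update D and record the branch lengths:

def nj_pick_pair(D):
    # Pick the pair (i, j) minimizing M(i,j) = (n-2)*d(i,j) - Ri - Rj.
    n = len(D)
    R = [sum(row) for row in D]                 # Ri = sum over j of d(ti, tj)
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            m = (n - 2) * D[i][j] - R[i] - R[j]
            if best is None or m < best[0]:
                best = (m, i, j)
    return best[1], best[2]

# Hypothetical additive distance matrix for 4 taxa.
D = [[0, 3, 7, 8],
     [3, 0, 6, 7],
     [7, 6, 0, 5],
     [8, 7, 5, 0]]
print(nj_pick_pair(D))   # the pair with the smallest M value is joined first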
Introduction to Bioinformatics
7.3 Inferring Trees
UPGMA and ultrametric trees:
If the distance from the root to all leaves is equal, the tree is ultrametric.
In that case we can use D instead of M, and the algorithm is called UPGMA (Unweighted Pair Group Method with Arithmetic mean).
Ultrametricity must hold for the real tree, but due to noise this condition is rarely met exactly, so in practice it will generate erroneous trees.