DATA MINING
from data to information
Ronald Westra
Dept. of Mathematics
Maastricht University
CLUSTERING AND
CLUSTER ANALYSIS
Data Mining Lecture IV
[Chapter 8, section 8.4, and Chapter 9 from Principles of Data Mining by Hand, Mannila, Smyth]
1. Clustering versus Classification
• classification: assign a pre-determined label to a sample
• clustering: derive the relevant labels for classification from the structure of a given dataset
• clustering: maximal intra-cluster similarity and maximal inter-cluster dissimilarity
• Objectives: 1. segmentation of space; 2. find natural subclasses
Examples of Clustering and Classification
1. Computer Vision [example figures]
2. Types of chemical reactions [example figures]
Voronoi Clustering
Georgy Fedoseevich Voronoy
1868 - 1908
Voronoi Clustering
A Voronoi diagram (also called a Voronoi tessellation, Voronoi decomposition, or Dirichlet tessellation) is a special kind of decomposition of a metric space determined by the distances to a specified discrete set of objects in the space, e.g., a discrete set of points.
Voronoi Clustering [example figures]
Partitional Clustering [book section 9.4]
• score functions
• centroid
• intra-cluster distance
• inter-cluster distance
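The score functions themselves are not written out in this transcript; commonly used intra- and inter-cluster scores of the kind meant here (with r_k the centroid of cluster C_k) are:

$$ wc(C) = \sum_{k=1}^{K} \sum_{x \in C_k} \| x - r_k \|^2, \qquad bc(C) = \sum_{1 \le j < k \le K} \| r_j - r_k \|^2 $$

Partitional clustering then seeks a partition that minimizes the within-cluster score and/or maximizes the between-cluster score.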
C-means [book page 303]
k-means clustering (also: C-means)
The k-means algorithm assigns each point to the cluster whose center (also called the centroid) is nearest. The center is the average of all the points in the cluster, i.e., its coordinates are the arithmetic mean over all the points in the cluster, for each dimension separately.
k-means clustering (also: C-means)
Example: The data set has three dimensions
and the cluster has two points: X = (x1, x2, x3)
and Y = (y1, y2, y3).
Then the centroid Z becomes Z = (z1, z2, z3),
where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and
z3 = (x3 + y3)/2
k-means clustering (also: C-means)
This is the basic structure of the algorithm (J. MacQueen, 1967):
• Randomly generate k clusters and determine the cluster centers, or directly generate k seed points as cluster centers.
• Assign each point to the nearest cluster center.
• Recompute the new cluster centers.
• Repeat until some convergence criterion is met (usually that the assignments no longer change).
C-means [book page 303]
while changes in cluster Ck
    % form clusters
    for k = 1,…,K do
        Ck = { x : ||x – rk|| < ||x – rl|| for all l ≠ k }
    end
    % compute new cluster centroids
    for k = 1,…,K do
        rk = mean({ x : x ∈ Ck })
    end
end
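The pseudocode above translates into a short NumPy sketch; the function name and interface below are illustrative, not from the lecture or the book.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal k-means sketch; X is an (N, p) data matrix."""
    rng = np.random.default_rng(seed)
    r = X[rng.choice(len(X), size=K, replace=False)]        # K random points as initial centroids
    for _ in range(max_iter):
        # form clusters: assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - r[None, :, :], axis=2)   # (N, K) distance matrix
        labels = d.argmin(axis=1)
        # compute new cluster centroids (keep the old centroid if a cluster went empty)
        new_r = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else r[k]
                          for k in range(K)])
        if np.allclose(new_r, r):                            # convergence: centroids stopped moving
            break
        r = new_r
    return labels, r

# e.g. labels, centroids = kmeans(np.random.rand(200, 2), K=3)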
k-means clustering (also: C-means)
The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Yet it does not systematically yield the same result with each run: the resulting clusters depend on the initial assignments. The k-means algorithm maximizes inter-cluster variance (equivalently, minimizes intra-cluster variance), but it does not ensure that the solution it finds is a global rather than a local optimum of the variance.
k-means clustering (also: C-means) [example figures]
Fuzzy c-means
One of the problems of the k-means algorithm is that it gives a hard partitioning of the data, that is to say, each point is attributed to one and only one cluster. But points on the edge of a cluster, or near another cluster, may belong to the cluster to a lesser degree than points in the center of the cluster.
Fuzzy c-means
Therefore, in fuzzy clustering, each point does not belong exclusively to one cluster but has a degree of membership in each cluster, as in fuzzy logic. For each point x we have a coefficient uk(x) giving its degree of membership in the k-th cluster. Usually the sum of those coefficients is required to be one, so that uk(x) can be interpreted as a probability of belonging to a certain cluster: Σk uk(x) = 1.
Fuzzy c-means
With fuzzy c-means, the centroid of a cluster is computed as the mean of all points, weighted by their degree of membership in the cluster (see the standard formulas below).
Fuzzy c-means
The degree of membership in a certain cluster is related to the inverse of the distance to the cluster center; the coefficients are then normalized and "fuzzified" with a real parameter m > 1 so that their sum is 1 (see the standard formulas below).
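The centroid and membership formulas referred to on the last two slides are not reproduced in this transcript; in their standard fuzzy c-means form they read:

$$ c_k = \frac{\sum_x u_k(x)^m\, x}{\sum_x u_k(x)^m}, \qquad u_k(x) = \frac{1}{\sum_{j=1}^{K} \left( \| x - c_k \| / \| x - c_j \| \right)^{2/(m-1)}} $$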
Fuzzy c-means
For m equal to 2, this is equivalent to normalising the coefficients linearly so that their sum is 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm becomes similar to k-means.
Fuzzy c-means
The fuzzy c-means algorithm is very similar to the k-means algorithm:
Fuzzy c-means
• Choose a number of clusters.
• Assign coefficients randomly to each point for being in the clusters.
• Repeat until the algorithm has converged (that is, the change in the coefficients between two iterations is no more than ε, the given sensitivity threshold):
  • Compute the centroid for each cluster, using the formula above.
  • For each point, compute its coefficients of being in the clusters, using the formula above.
Fuzzy C-means
ujk is the membership of sample j in cluster k
ck is the centroid of cluster k
while changes in cluster Ck
    % compute new memberships
    for k = 1,…,K do
        for j = 1,…,N do
            ujk = f(xj – ck)    % e.g. the normalized inverse-distance membership above
        end
    end
    % compute new cluster centroids
    for k = 1,…,K do
        % weighted mean
        ck = Σj ujk^m xj / Σj ujk^m
    end
end
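A matching NumPy sketch of the fuzzy c-means loop, assuming Euclidean distance and fuzzifier m; again the names and interface are illustrative.

import numpy as np

def fuzzy_cmeans(X, K, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means sketch; X is an (N, p) data matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), K))
    U /= U.sum(axis=1, keepdims=True)                 # memberships sum to 1 for each point
    for _ in range(max_iter):
        Um = U ** m
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]      # membership-weighted centroids
        # membership update: normalized inverse distances, fuzzified by m
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:             # epsilon convergence criterion
            U = U_new
            break
        U = U_new
    return U, C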
Fuzzy c-means
The fuzzy c-means algorithm minimizes intra-cluster variance as well, but it has the same problems as k-means: the minimum it finds may be only a local minimum, and the results depend on the initial choice of weights.
Fuzzy c-means [figure: trajectory of fuzzy multivariate centroids]
Hierarchical Clustering [book section 9.5]
One major problem with partitional clustering is that the number of clusters (= number of classes) must be pre-specified!
This poses the question: what IS the real number of clusters in a given set of data?
Answer: it depends!
• Agglomerative methods: bottom-up
• Divisive methods: top-down
Hierarchical Clustering
Agglomerative hierarchical clustering [example figures]
DATA ANALYSIS AND
UNCERTAINTY
Data Mining Lecture V
[Chapter 4, Hand, Mannila, Smyth]
RANDOM VARIABLES
Random Variables [4.3]
- multivariate random variables
- marginal density
- conditional density & dependency: p(x|y) = p(x,y) / p(y)
- example: supermarket purchases
RANDOM VARIABLES
Example: supermarket purchases
X = n customers × p products;
X(i,j) = Boolean variable: "Has customer #i bought a product of type j?"
nA = sum(X(:,A)) is the number of customers that bought product A
nB = sum(X(:,B)) is the number of customers that bought product B
nAB = sum(X(:,A).*X(:,B)) is the number of customers that bought both product A and product B
*** Demo: matlab: conditionaldensity
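The conditionaldensity MATLAB demo itself is not part of the transcript; a rough NumPy equivalent of the computation it refers to might look like this (the data are synthetic and the names illustrative).

import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5                          # customers x products
X = rng.random((n, p)) < 0.3            # Boolean purchase matrix (synthetic data)

A, B = 0, 1                             # column indices of products A and B
nA  = X[:, A].sum()                     # number of customers that bought A
nAB = (X[:, A] & X[:, B]).sum()         # number of customers that bought both A and B

p_B_given_A = nAB / nA                  # estimate of p(B | A) = p(A, B) / p(A)
p_B         = X[:, B].mean()            # marginal estimate of p(B)
print(p_B_given_A, p_B)                 # roughly equal here, since the columns were generated independently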
RANDOM VARIABLES
Independence: p(x,y) = p(x)·p(y), i.e. p(x|y) = p(x)
Conditional independence (given z): p(x,y|z) = p(x|z)·p(y|z)
RANDOM VARIABLES
Simpson's paradox
Observation of two different treatments for several categories of patients:

Category   Treatment A   Treatment B
Old        0.20          0.33
Young      0.53          1.00
Total      0.50          0.40
RANDOM VARIABLES
Explanation: the sets are POOLED (the underlying counts):

Category   Treatment A        Treatment B
Old        2 / 10   = 0.20    30 / 90   = 0.33
Young      48 / 90  = 0.53    10 / 10   = 1.00
Total      50 / 100 = 0.50    40 / 100  = 0.40

For both Old and Young individually, treatment B seems best, but in total treatment A seems superior.
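The reversal is just weighted-average arithmetic: each "Total" figure is a weighted mean of the category rates, and the two treatments have very different weights (treatment A treated mostly Young patients, treatment B mostly Old ones):

$$ \text{A: } \frac{2 + 48}{10 + 90} = \frac{50}{100} = 0.50, \qquad \text{B: } \frac{30 + 10}{90 + 10} = \frac{40}{100} = 0.40 $$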
RANDOM VARIABLES
- 1st-order Markov processes [example: text analysis and reconstruction with Markov chains] (see the sketch below)
  Demo:
  dir *.txt
  type 'werther.txt'
  [m0,m1,m2] = Markov2Analysis('werther.txt');
  CreateMarkovText(m0,m1,m2,3,80,20);
  Dutch, German, French, English, Matlab, European
- Correlation, dependency, causation
- Example 4.3
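The Markov2Analysis / CreateMarkovText scripts are not reproduced here; a minimal character-level sketch of the same idea (first-order Markov analysis and reconstruction of text) could look like this.

import random
from collections import Counter, defaultdict

def markov_analyze(text):
    """Count first-order character transitions in the text."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def markov_generate(counts, length=200):
    """Generate text by sampling each next character from the transition counts."""
    rng = random.Random(0)
    c = rng.choice(list(counts))
    out = [c]
    for _ in range(length - 1):
        nxt = counts.get(c)
        if not nxt:                                  # dead end: restart from a random character
            c = rng.choice(list(counts))
        else:
            chars, weights = zip(*nxt.items())
            c = rng.choices(chars, weights=weights, k=1)[0]
        out.append(c)
    return "".join(out)

# text = open("werther.txt", encoding="utf-8").read()   # e.g. the demo's sample text
# print(markov_generate(markov_analyze(text)))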
SAMPLING
Sampling [4.4]
- Large numbers of observations: central limit theorem: estimates are approximately normally distributed
- Small sample sizes: modelling: OK; pattern recognition: not
- Figure 4.1
Role of model parameters
Probability of observing data D = { x(1), x(2), …, x(n) } with independent observations:

$$ p(D \mid M, \theta) = \prod_{i=1}^{n} p(\mathbf{x}(i) \mid M, \theta) $$

with a fixed parameter set θ.
ESTIMATION
Estimation [4.5]
Properties:
1. $\hat{\theta}$ is an estimator of θ; it depends on the sample, so it is itself a stochastic variable with expectation $E[\hat{\theta}]$, etc.
2. Bias: $\mathrm{bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$; unbiased: $\mathrm{bias}(\hat{\theta}) = 0$, i.e. no systematic bias
3. Consistent estimator: asymptotically unbiased
4. Variance: $\mathrm{var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$
5. Best unbiased estimator: the unbiased estimator with the smallest variance
Maximum Likelihood Estimation
- Expression 4.7: $L(\theta \mid D) = \prod_{i=1}^{n} f(\mathbf{x}(i) \mid \theta)$
- a scalar function in θ
- drop all terms not containing θ
- the value of θ with the highest L(θ) is the maximum likelihood estimate (MLE):
  $\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\, L(\theta)$
- Example 4.4 + figure 4.2
- Demo: MLbinomial
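The MLbinomial demo is not included in the transcript; a rough stand-in that evaluates a binomial log-likelihood on a grid and picks its maximum might look like this (the data are illustrative).

import numpy as np

n, r = 20, 7                                        # n Bernoulli trials, r successes (synthetic)
theta = np.linspace(1e-3, 1 - 1e-3, 999)            # grid of candidate parameter values
log_L = r * np.log(theta) + (n - r) * np.log(1 - theta)   # binomial log-likelihood, constant term dropped

theta_mle = theta[np.argmax(log_L)]
print(theta_mle, r / n)                             # the grid maximum is (close to) the closed-form MLE r/n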
Maximum Likelihood Estimation
- Example 4.5: Normal distribution, log L(θ)
- Example 4.6: Sufficient statistic: a nice idea, but often practically infeasible
- MLE: biased but consistent, with bias O(1/n)
- In practice the log-likelihood l(θ) = log L(θ) is more attractive to work with
- Multiple parameters: complex but relevant: EM algorithm [treated later]
- Determination of confidence levels: example 4.8
Maximum Likelihood Estimation
- Sampling using the central limit theorem (CLT) with large samples
- Bootstrap method (Example 4.9): i. use the observed data as if it were the real distribution F; ii. generate many sub-samples from it; iii. estimate the properties of interest in each sub-sample using this "known" distribution F; iv. compare the sub-sample estimates with the "real" estimate and apply the CLT.
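A minimal bootstrap sketch along the lines of Example 4.9; the statistic (the sample mean) and the numbers are illustrative, not taken from the book.

import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=50)          # the observed sample (synthetic)

B = 2000                                            # number of bootstrap sub-samples
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()   # resample from the data itself
    for _ in range(B)
])

estimate = data.mean()
lo, hi = np.percentile(boot_means, [2.5, 97.5])     # simple percentile interval for the mean
print(estimate, (lo, hi))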
BAYESIAN ESTIMATION
Bayesian estimation
- Not the frequentist approach but a subjective approach: D is given and the parameters θ are random variables.
- p(θ) represents our degree of belief in the value of θ; in practice this means that we have to make all sorts of (wild) assumptions about p.
- Prior distribution: p(θ)
- Posterior distribution: p(θ|D)
- Modification of the prior into the posterior by Bayes' theorem:

$$ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta} $$
BAYESIAN ESTIMATION
- MAP (Maximum A Posteriori) method: take ⟨θ⟩, the mean or the mode of the posterior distribution
- MAP is closely related to MLE
- Example 4.10
- Bayesian approach: rather than point estimates, keep full knowledge of the uncertainties in the model(s)
- This requires massive computation -> hence it has only become popular in the last decade
BAYESIAN ESTIMATION
- Sequential updating:
  $p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$   (equation 4.10)
  $p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(D_1 \mid \theta)\, p(\theta)$   (equation 4.15)
- This is reminiscent of a Markov process
- Algorithm:
  * start with p(θ)
  * new data D: use eq. 4.10 to obtain the posterior distribution
  * new data D2: use eq. 4.15 to obtain the new posterior distribution
  * repeat ad nauseam
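A small sketch of sequential updating for a Bernoulli parameter on a grid (a standard conjugate-style example, not the one from the book): each batch of data turns the previous posterior into the prior for the next batch.

import numpy as np

theta = np.linspace(1e-3, 1 - 1e-3, 999)
prior = np.ones_like(theta)                       # start with a flat prior p(theta)
prior /= prior.sum()

def update(prior, successes, failures):
    """Eq. 4.10-style update on a grid: posterior ∝ likelihood × prior."""
    likelihood = theta**successes * (1 - theta)**failures
    post = likelihood * prior
    return post / post.sum()

post1 = update(prior, successes=3, failures=7)    # first batch D1
post2 = update(post1, successes=6, failures=4)    # second batch D2, with the previous posterior as prior
print(theta[np.argmax(post2)])                    # posterior mode after both batches, near 9/20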
BAYESIAN ESTIMATION
- Large problem: different observers may define different priors p(θ); the choice is subjective, hence: different results! The analysis depends on the choice of p(θ)
- Often a consensus prior is used, e.g. Jeffreys' prior (equations 4.16 and 4.17)
- Computation of credibility intervals
- Markov Chain Monte Carlo (MCMC) methods for estimating posterior distributions
- Similarly: Bayesian Belief Networks (BBN)
BAYESIAN ESTIMATION
- Bayesian approach: Uncertainty_Model + Uncertainty_Params (uncertainty in both the model and its parameters)
- MLE: the aim is a point estimate
- Bayesian: the aim is the posterior distribution
- Bayesian estimate = a weighted estimate over all models M and all parameters θ, where the weights are the likelihoods of the parameters in the different models
- PROBLEM: these weights are difficult to estimate
PROBABILISTIC MODEL-BASED
CLUSTERING USING MIXTURE
MODELS
Data Mining Lecture VI
[Sections 4.5, 8.4, 9.2, 9.6, Hand, Mannila, Smyth]
Probabilistic Model-Based
Clustering using Mixture Models
A probability mixture model
A mixture model is a formalism for modeling a
probability density function as a sum of
parameterized functions. In mathematical
terms:
A probability mixture model
where pX(x) is the modeled probability density function, K is the number of components in the mixture model, and ak is the mixture proportion of component k. By definition, 0 < ak < 1 for all k = 1…K, and:
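The two formulas referred to above ("In mathematical terms:" and the constraint on the ak) are not reproduced in this transcript; in their standard form they read:

$$ p_X(x) = \sum_{k=1}^{K} a_k\, h(x \mid \lambda_k), \qquad \sum_{k=1}^{K} a_k = 1 $$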
A probability mixture model
h(x | λk) is a probability distribution parameterized by λk.
Mixture models are often used when we know h(x) and
we can sample from pX(x), but we would like to
determine the ak and λk values.
Such situations can arise in studies in which we sample
from a population that is composed of several distinct
subpopulations.
A common approach for ‘decomposing’ a
mixture model
It is common to think of mixture modeling as a
missing data problem. One way to understand this is
to assume that the data points under consideration
have "membership" in one of the distributions we
are using to model the data. When we start, this
membership is unknown, or missing. The job of
estimation is to devise appropriate parameters for
the model functions we choose, with the connection
to the data points being represented as their
membership in the individual model distributions.
Probabilistic Model-Based
Clustering using Mixture Models
The EM-algorithm [book section 8.4]
Mixture Decomposition:
The ‘Expectation-Maximization’ Algorithm
The Expectation-maximization algorithm
computes the missing memberships of data points
in our chosen distribution model.
It is an iterative procedure, where we start with initial
parameters for our model distribution (the ak's and
λk's of the model listed above).
The estimation process proceeds iteratively in two
steps, the Expectation Step, and the Maximization
Step.
The ‘Expectation-Maximization’ Algorithm
The expectation step
With initial guesses for the parameters in our
mixture model, we compute "partial membership"
of each data point in each constituent distribution.
This is done by calculating expectation values for
the membership variables of each data point.
The ‘Expectation-Maximization’ Algorithm
The maximization step
With the expectation values in hand for group
membership, we can recompute plug-in estimates
of our distribution parameters.
For the mixing coefficient ak, this is simply the fractional membership of all data points in the k-th distribution.
EM-algorithm for Clustering
Suppose we have data D with a model with parameters θ and hidden variables H.
Interpretation: H = the class label.
Log-likelihood of the observed data:

$$ l(\theta) = \log p(D \mid \theta) = \log \sum_{H} p(D, H \mid \theta) $$
EM-algorithm for Clustering
With p the probability over the data D, let Q be the (unknown) distribution over the hidden variables H. Then the log-likelihood is:

$$ l(\theta) = \log \sum_{H} p(D, H \mid \theta) $$

$$ l(\theta) = \log \sum_{H} Q(H)\, \frac{p(D, H \mid \theta)}{Q(H)} $$

$$ l(\theta) \ge \sum_{H} Q(H) \log \frac{p(D, H \mid \theta)}{Q(H)} \qquad \text{[Jensen's inequality]} $$

$$ l(\theta) \ge \sum_{H} Q(H) \log p(D, H \mid \theta) + \sum_{H} Q(H) \log \frac{1}{Q(H)} \;=\; F(Q, \theta) $$
Jensen’s inequality
For a concave (concave-down) function, the expected value of the function is less than or equal to the function of the expected value. [Figure: the gray rectangle along the horizontal axis represents the probability distribution of x, uniform here for simplicity, but the general idea applies to any distribution.]
EM-algorithm
So F(Q, θ) is a lower bound on the log-likelihood function l(θ).
EM alternates between:
E-step: maximising F with respect to Q for fixed θ, and
M-step: maximising F with respect to θ for fixed Q.
EM-algorithm

E-step: $Q^{k+1} = \arg\max_{Q} F(Q, \theta^{k})$

M-step: $\theta^{k+1} = \arg\max_{\theta} F(Q^{k+1}, \theta)$
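A compact one-dimensional Gaussian-mixture EM sketch of these two steps (simplified to 1-D; the initialization and the example data are illustrative, not from the book).

import numpy as np

def em_gmm_1d(x, K=2, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture: returns weights, means, variances, responsibilities."""
    rng = np.random.default_rng(seed)
    a = np.full(K, 1.0 / K)                       # mixing proportions a_k
    mu = rng.choice(x, size=K, replace=False)     # initial means picked from the data
    var = np.full(K, x.var())
    for _ in range(n_iter):
        # E-step: responsibilities = posterior membership of each point in each component
        dens = np.exp(-(x[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        R = a * dens
        R /= R.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the fractional memberships
        Nk = R.sum(axis=0)
        a = Nk / len(x)
        mu = (R * x[:, None]).sum(axis=0) / Nk
        var = (R * (x[:, None] - mu)**2).sum(axis=0) / Nk
    return a, mu, var, R

# Example: two well-separated clusters
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])
a, mu, var, R = em_gmm_1d(x, K=2)
print(a, mu, var)      # mixing proportions near 0.4 / 0.6, means near -2 and 3 (component order may vary)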
Probabilistic Model-Based Clustering using Gaussian Mixtures [example figures]
Gaussian Mixture Decomposition
Gaussian mixture decomposition is a good classifier. It allows supervised as well as unsupervised learning (finding how many classes are optimal, how they should be defined, ...). But training is iterative and time consuming.
The idea is to set the position and width of the Gaussian distribution(s) so as to optimize the coverage of the learning samples.
Probabilistic Model-Based Clustering using Mixture Models [example figures]
The End