ICS 278: Data Mining
Lectures 12,13: Clustering Algorithms
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine
Clustering
• “automated detection of group structure in data”
– Typically: partition N data points into K groups (clusters) such that the
points in each group are more similar to each other than to points in
other groups
– descriptive technique (contrast with predictive)
– for real-valued vectors, clusters can be thought of as clouds of points in
p-dimensional space
Data Mining Lectures
Lectures 12,13: Clustering
Padhraic Smyth, UC Irvine
References
• Chapter 9 (Sections 9.3 through 9.6) in the text
• A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988 (a bit outdated but has many useful ideas and references on clustering)
• B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis (4th ed.), Arnold Publishers, 2001 (broad overview of clustering methods)
• C. Fraley and A. E. Raftery, "How many clusters? Which clustering method? Answers via model-based cluster analysis," The Computer Journal, 1998 (good overview article on probabilistic model-based clustering; available from the class Web page)
Clustering
Sometimes easy
Sometimes impossible
and sometimes in between
Why is Clustering useful?
• "Discovery" of new knowledge from data
  – Contrast with supervised classification (where labels are known)
  – Long history in the sciences of categories, taxonomies, etc.
  – Can be very useful for summarizing large data sets
    • For large n and/or high dimensionality
• Applications of clustering
  – Clustering of documents produced by a search engine
  – Segmentation of customers for an e-commerce store
  – Discovery of new types of galaxies in astronomical data
  – Clustering of genes with similar expression profiles
  – Clustering pixels in an image into regions of similar intensity
  – … many more
General Issues in Clustering
• Clustering algorithm = Representation + Score + Optimization
• Cluster Representation:
  – What types or "shapes" of clusters are we looking for? What defines a cluster?
• Score:
  – A clustering = an assignment of n objects to K clusters
  – Score = quantitative criterion used to evaluate different clusterings
• Optimization and Search:
  – Finding the optimal (minimal/maximal score) clustering is typically NP-hard
  – Greedy algorithms to optimize the score are widely used
• Other issues
  – The distance function D[x(i), x(j)] is a critical aspect of clustering, both for
    • distances between individual pairs of objects
    • distances of individual objects from clusters
  – How is K selected?
  – Different types of data
    • Real-valued versus categorical
    • Attribute-value vectors vs. an n² distance matrix
Different Types of Clustering Algorithms
• partition-based clustering
– e.g. K-means
• probabilistic model-based clustering
– e.g. mixture models
[both of the above work with measurement data, e.g., feature vectors]
• hierarchical clustering
– e.g. hierarchical agglomerative clustering
• graph-based clustering
– E.g., min-cut algorithms
[both of the above work with distance data, e.g., distance matrix]
Partition-Based Clustering
• input: n data points X = {x(1) … x(n)}
• output: C = {C1 … CK} = specification of K clusters
  – implicit representation:
    • each x(i) is assigned to a unique Cj (hard assignment)
  – explicit representation:
    • each Cj is specified in some manner, e.g., as a mean or a region in input space
• Optimization algorithm
  – require that score[C] is minimized (or maximized)
    • e.g., sum-of-squares of within-cluster distances
  – exhaustive search is intractable
  – combinatorial optimization problem: assign n objects to K classes
  – large search space: the number of possible clusterings is approximately K^n / K!
    • so, use a greedy iterative method
    • will be subject to local optima
Score Functions for Partition-Based Clustering
• want compact clusters
  – minimize within-cluster distances wc(C)
• want different clusters far apart
  – maximize between-cluster distances bc(C)
• given a cluster partitioning C, find centers c1 … cK
  – e.g., for vectors, use the centroids of the points in cluster Ci:
    c_i = (1/n_i) Σ_{x ∈ C_i} x
  – wc(C) = sum-of-squares within-cluster distance (minimize):
    wc(C) = Σ_{i=1}^{K} wc(C_i) = Σ_{i=1}^{K} Σ_{x ∈ C_i} d(x, c_i)²
  – bc(C) = distance between clusters (maximize):
    bc(C) = Σ_{i,j=1}^{K} d(c_i, c_j)²
• Can define scores as combinations, i.e., score[C] = f[wc(C), bc(C)]
  – See discussion on pp. 297-300 in the text
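These two score functions are easy to compute directly for real-valued vectors. Below is a minimal NumPy sketch, assuming Euclidean distance and a hard partition given by an integer label array; the function name cluster_scores and its arguments are illustrative, not from the lecture.

```python
import numpy as np

def cluster_scores(X, labels):
    """Within-cluster (wc) and between-cluster (bc) sum-of-squares scores
    for a hard partition of real-valued vectors X (n x p), given integer
    cluster labels in {0, ..., K-1}."""
    K = labels.max() + 1
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # wc(C): sum over clusters of squared distances to the cluster centroid
    wc = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(K))
    # bc(C): sum of squared distances between cluster centroids
    diffs = centers[:, None, :] - centers[None, :, :]
    bc = (diffs ** 2).sum() / 2.0   # each unordered pair counted once
    return wc, bc
```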
K-means Clustering
• basic idea:
  – Score = wc(C) = sum-of-squares within-cluster distance
  – start with randomly chosen cluster centers c1 … cK
  – repeat until no cluster memberships change:
    • assign each point x to the cluster with the nearest center
      – i.e., find the smallest d(x, c_i) over all c1 … cK
    • recompute the cluster centers over the data assigned to them
      – c_i = (1/n_i) Σ_{x ∈ C_i} x
• the algorithm terminates (finite number of steps)
  – Score(C) decreases at each iteration (if any membership changes)
• converges to at least a local minimum of Score(C)
  – not necessarily the global minimum …
  – different initial centers (seeds) can lead to different local minima
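The loop just described is the standard Lloyd-style K-means iteration. A minimal NumPy sketch follows, assuming Euclidean distance and randomly chosen data points as the initial seeds; the names kmeans and max_iter are illustrative.

```python
import numpy as np

def kmeans(X, K, max_iter=100, rng=None):
    """Basic K-means on an (n x p) array X, minimizing the within-cluster
    sum-of-squares score wc(C)."""
    rng = np.random.default_rng(rng)
    # start with K randomly chosen data points as the initial centers
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # assignment step: each point goes to the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no membership changes: done
            break
        labels = new_labels
        # update step: recompute each center as the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels
```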
K-means Complexity
• time complexity = O(I e n K) << exhaustive ~ K^n / K!
  – I = number of iterations (steps)
  – e = cost of one distance computation (e = p for Euclidean distance)
• Approximations/speed-up tricks for very large n
  – use the x(i)'s nearest to the means as cluster centers instead of the actual means
    • allows reuse of cached distances from the n² distance matrix D (lowers the effective "e")
  – "condense": reduce n by replacing a group of points with a prototype
  – Additional references:
    • D. Pelleg and A. Moore, Accelerating Exact k-means Algorithms with Geometric Reasoning, ACM SIGKDD Conference, 1999.
    • C. Elkan, Using the Triangle Inequality to Accelerate k-means, Proceedings of the Twentieth International Conference on Machine Learning (ICML'03), 2003.
K-means: step-by-step example
(Example is courtesy of Andrew Moore, CMU)
1. Ask the user how many clusters they'd like (e.g., K = 5).
2. Randomly guess K cluster center locations.
3. Each data point finds out which center it is closest to (thus each center "owns" a set of data points).
4. Each center finds the centroid of the points it owns.
5. New centers => new boundaries.
6. Repeat until no change.
[Figure: K-means clustering of RGB (3-value) pixel color intensities, K = 11 segments; original image and clusters on color (from David Forsyth, UC Berkeley)]
Issues in K-means clustering
• Simple, but useful
  – tends to select compact, "isotropic" cluster shapes
  – can be useful for initializing more complex methods
  – many algorithmic variations on the basic theme
    • e.g., in signal processing/data compression it is similar to vector quantization
• Choice of distance measure
  – Euclidean distance
  – Weighted Euclidean distance
  – Many others possible
• Selection of K
  – "scree diagram": plot SSE versus K and look for a "knee" in the curve (see the sketch below)
  – Limitation: there may not be any clear K value
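To illustrate the scree/elbow idea, the hypothetical helper below runs K-means for a range of K values and records the within-cluster SSE; it assumes a kmeans() function like the sketch given earlier in these notes.

```python
import numpy as np

def sse_vs_k(X, k_values, kmeans_fn):
    """Run K-means for each K and record the within-cluster SSE,
    to look for a 'knee' in the curve (scree/elbow diagram)."""
    sse = []
    for K in k_values:
        centers, labels = kmeans_fn(X, K)
        sse.append(((X - centers[labels]) ** 2).sum())
    return sse

# illustrative usage, assuming a kmeans() like the earlier sketch:
#   import matplotlib.pyplot as plt
#   sse = sse_vs_k(X, range(1, 11), kmeans)
#   plt.plot(range(1, 11), sse, marker='o')
#   plt.xlabel('K'); plt.ylabel('within-cluster SSE')
```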
Finite Mixture Models

p(x) = Σ_{k=1}^{K} p(x, c_k)
     = Σ_{k=1}^{K} p(x | c_k) p(c_k)
     = Σ_{k=1}^{K} p(x | c_k, θ_k) w_k

where p(x | c_k, θ_k) is the k-th component model, θ_k its parameters, and w_k its weight.
[Figures: plots of p(x) for two one-dimensional component densities (Component 1, Component 2) and the resulting mixture model; and a second example showing the component models and the corresponding mixture model.]
Interpretation of Mixtures
1. C has a direct (physical) interpretation
   e.g., C ∈ {age of fish}, C = {male, female}
2. C is a convenient hidden variable (i.e., the cluster variable)
   – focuses attention on subsets of the data, e.g., for visualization, clustering, etc.
   – C might have a physical/real interpretation, but not necessarily so
Probabilistic Clustering: Mixture Models
• assume a probabilistic model for each component cluster
• mixture model: f(x) = Σ_{k=1}^{K} w_k f_k(x; θ_k)
• the w_k are the K mixing weights
  – 0 ≤ w_k ≤ 1 and Σ_{k=1}^{K} w_k = 1
• the K component densities f_k(x; θ_k) can be:
  – Gaussian
  – Poisson
  – exponential
  – ...
• Note:
  – Assumes a model for the data (advantages and disadvantages)
  – Results in probabilistic membership: p(cluster k | x)
Gaussian Mixture Models (GMM)
• the model for the k-th component is normal, N(μ_k, Σ_k)
  – often assume a diagonal covariance: Σ_jj = σ_j², Σ_ij = 0 for i ≠ j
  – or sometimes even simpler: Σ_jj = σ², Σ_ij = 0
• f(x) = Σ_{k=1}^{K} w_k f_k(x; θ_k) with θ_k = <μ_k, Σ_k> or <μ_k, σ_k>
• generative model (a sampling sketch follows below):
  – randomly choose a component
    • component k is selected with probability w_k
  – generate x ~ N(μ_k, Σ_k)
  – note: μ_k and σ_k are both d-dimensional vectors
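A minimal sketch of the generative model just described, assuming diagonal covariances so each component is specified by a mean vector and a vector of standard deviations; sample_gmm and its arguments are illustrative names, not from the lecture.

```python
import numpy as np

def sample_gmm(n, weights, means, sigmas, rng=None):
    """Draw n points from a diagonal-covariance Gaussian mixture:
    pick component k with probability w_k, then draw x ~ N(mu_k, diag(sigma_k^2))."""
    rng = np.random.default_rng(rng)
    weights = np.asarray(weights, float)
    means, sigmas = np.asarray(means, float), np.asarray(sigmas, float)
    # choose a component for each point according to the mixing weights
    comps = rng.choice(len(weights), size=n, p=weights)
    # then generate from the chosen component's Gaussian
    X = means[comps] + sigmas[comps] * rng.standard_normal(means[comps].shape)
    return X, comps

# e.g., two 1-D components (means/sigmas have shape (2, 1)):
#   X, z = sample_gmm(500, [0.4, 0.6], [[0.0], [5.0]], [[1.0], [2.0]])
```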
Learning Mixture Models from Data
• Score function = log-likelihood L(θ)
  – L(θ) = log p(X | θ) = log Σ_H p(X, H | θ)
  – H = hidden variables (the cluster membership of each x)
  – L(θ) cannot be optimized directly
• EM Procedure
  – General technique for maximizing the log-likelihood with missing data
  – For mixtures:
    • E-step: compute "memberships" p(k | x) = w_k f_k(x; θ_k) / f(x)
    • M-step: pick a new θ to maximize the expected data log-likelihood
    • Iterate: guaranteed to climb to a (local) maximum of L(θ) (a sketch for Gaussian mixtures follows below)
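A minimal EM sketch for a diagonal-covariance Gaussian mixture, following the E-step/M-step recipe above; the function names, the random-data-point initialization, and the small variance floor (1e-6) are illustrative choices, not part of the lecture.

```python
import numpy as np

def log_gauss_diag(X, mu, var):
    """Log N(x; mu, diag(var)) for each row of X."""
    return -0.5 * (np.log(2 * np.pi * var).sum()
                   + ((X - mu) ** 2 / var).sum(axis=1))

def em_gmm(X, K, n_iter=100, rng=None):
    """EM for a K-component, diagonal-covariance Gaussian mixture (a sketch)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    w = np.full(K, 1.0 / K)                                 # mixing weights
    mu = X[rng.choice(n, K, replace=False)].astype(float)   # initial means
    var = np.tile(X.var(axis=0), (K, 1)) + 1e-6             # initial variances
    for _ in range(n_iter):
        # E-step: memberships p(k | x_i) proportional to w_k f_k(x_i; theta_k)
        log_r = np.stack([np.log(w[k]) + log_gauss_diag(X, mu[k], var[k])
                          for k in range(K)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)           # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from the memberships
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var, r
```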
The E (Expectation) Step
E step: given the current K clusters and parameters, and the n data points, compute p(data point i is in group k).
The M (Maximization) Step
M step: given the n data points and memberships, compute new parameters θ for the K clusters.
Complexity of EM for mixtures
With K component models and n data points, complexity per iteration scales as O(n K f(p)).
Comments on Mixtures and EM Learning
• Complexity of each EM iteration
  – Depends on the probabilistic model being used
    • e.g., for Gaussians, the E-step is O(nK) and the M-step is O(nKp²)
  – Sometimes the E-step or M-step is not closed form
    • => can require numerical optimization or sampling within each iteration
    • Generalized EM (GEM): instead of maximizing the likelihood, just increase it
    • EM can be thought of as hill-climbing with direction and step-size provided automatically
• K-means as a special case of EM
  – Gaussian mixtures with isotropic (diagonal, equi-variance) Σ_k's
  – Approximate the E-step by choosing the most likely cluster (instead of using membership probabilities)
• Generalizations…
  – Mixtures of multinomials for text data
  – Mixtures of Markov chains for Web sequences
  – + more
  – Will be discussed later in lectures on text and Web data
[Figures: EM fitting of a Gaussian mixture to the anemia patients and controls data (red blood cell volume vs. red blood cell hemoglobin concentration). Successive plots show the fitted model at EM iterations 1, 3, 5, 10, 15, and 25, followed by the anemia data with true labels, and a plot of the log-likelihood as a function of EM iteration.]
Selecting K in mixture models
• cannot just choose the K that maximizes the likelihood
  – the likelihood L(θ) is always larger for larger K
• Model selection alternatives:
  – 1) Penalize complexity
    • e.g., BIC = L(θ) – (d/2) log n, where d = # parameters (Bayesian Information Criterion; see the sketch below)
    • Asymptotically correct under certain assumptions
    • Often used in practice for mixture models even though the assumptions for the theory are not met
  – 2) Bayesian: compute posteriors p(k | data)
    • p(k | data) requires computation of p(data | k) = the marginal likelihood
    • Can be tricky to compute for mixture models
    • Recent work on Dirichlet process priors has made this more practical
  – 3) (Cross-)validation:
    • Score different models by log p(Xtest | θ)
    • Split the data into train and validation sets
    • Works well on large data sets
    • Can be noisy on small data sets (the log-likelihood is sensitive to outliers)
  – Note: all of these methods evaluate the quality of the clustering as a density estimator, rather than with any explicit notion of clustering
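A small sketch of the BIC computation using the slide's definition (larger is better), assuming the log-likelihood comes from a fitted mixture such as the EM sketch above; the parameter count shown is for a diagonal-covariance GMM, and the function names are illustrative.

```python
import numpy as np

def bic_score(log_lik, n_params, n):
    """BIC as defined on the slide: L(theta) - (d/2) log n (larger is better)."""
    return log_lik - 0.5 * n_params * np.log(n)

def gmm_n_params(K, p):
    """Parameter count d for a K-component, diagonal-covariance GMM in p dims:
    (K-1) free mixing weights + K*p means + K*p variances."""
    return (K - 1) + K * p + K * p

# e.g., fit mixtures for K = 1..6 with the EM sketch above, compute the data
# log-likelihood for each fit, and pick the K with the largest bic_score(...).
```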
Example of BIC Score for Red-Blood Cell Data
[Figure: BIC score versus the number of mixture components for the red-blood cell data; the true number of classes (2) is selected by BIC.]
Hierarchical Clustering
• Representation: a tree of nested clusters
• Works from a distance matrix
  – advantage: the x's can be any type of object
  – disadvantage: computation
• two basic approaches:
  – merge points (agglomerative)
  – divide superclusters (divisive)
• visualize both via "dendrograms"
  – shows the nesting structure
  – merges or splits = tree nodes
• Applications
  – e.g., clustering of gene expression data
  – Useful for seeing hierarchical structure, for relatively small data sets
Simple example of hierarchical clustering
Agglomerative Methods: Bottom-Up
• algorithm, based on distances between clusters (a naive sketch follows below):
  – for i = 1 to n let Ci = { x(i) }, i.e., start with n singletons
  – while more than one cluster is left:
    • let Ci and Cj be the cluster pair with minimum distance dist[Ci, Cj]
    • merge them, via Ci = Ci ∪ Cj, and remove Cj
• time complexity = O(n²) to O(n³)
  – n iterations (start: n clusters; end: 1 cluster)
  – 1st iteration: O(n²) to find the nearest singleton pair
• space complexity = O(n²)
  – accesses all distances between the x(i)'s
• interpreting a large-n dendrogram is difficult anyway (like decision trees)
  – large-n idea: partition-based clusters at the leaves
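The sketch below is a deliberately naive version of the merge loop above using single-link distances; it recomputes cluster distances from scratch on every pass (far slower than the O(n²)-O(n³) implementations discussed), and the names agglomerate and num_clusters are illustrative.

```python
import numpy as np

def agglomerate(D, num_clusters=1):
    """Naive bottom-up agglomerative clustering from an (n x n) distance
    matrix D, using the single-link (minimum) distance between clusters.
    Returns the remaining clusters (as sets of indices) and the merge history."""
    clusters = [{i} for i in range(len(D))]
    merges = []
    while len(clusters) > num_clusters:
        # find the pair of clusters with minimum single-link distance
        best = (0, 1, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(D[a, b] for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, d = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] | clusters[j]   # merge: Ci = Ci union Cj
        del clusters[j]                           # remove Cj
    return clusters, merges
```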
Distances Between Clusters
• single link / nearest-neighbor measure:
  – D(Ci, Cj) = min { d(x, y) | x ∈ Ci, y ∈ Cj }
  – can be outlier/noise sensitive
• complete link / furthest-neighbor measure:
  – D(Ci, Cj) = max { d(x, y) | x ∈ Ci, y ∈ Cj }
  – enforces more "compact" clusters
• intermediates between those extremes:
  – average link: D(Ci, Cj) = avg { d(x, y) | x ∈ Ci, y ∈ Cj }
  – centroid: D(Ci, Cj) = d(ci, cj), where ci, cj are the cluster centroids
  – Ward's SSE measure (for vector data): merge the clusters that minimize the increase in the within-cluster sum of squared distances
  – Note that centroid and Ward require that a centroid (vector mean) can be defined
• Which to choose? Different methods may be used for exploratory purposes; it depends on the goals and application (small sketches of these measures follow below)
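Minimal sketches of the single, complete, and average link measures, assuming an (n x n) pairwise distance matrix D and clusters given as index collections; in practice a library routine such as scipy.cluster.hierarchy.linkage (with method='single', 'complete', 'average', or 'ward') would typically be used instead.

```python
import numpy as np

def single_link(D, Ci, Cj):
    """D(Ci, Cj) = min over pairs (x in Ci, y in Cj) of d(x, y)."""
    return D[np.ix_(list(Ci), list(Cj))].min()

def complete_link(D, Ci, Cj):
    """D(Ci, Cj) = max over pairs of d(x, y)."""
    return D[np.ix_(list(Ci), list(Cj))].max()

def average_link(D, Ci, Cj):
    """D(Ci, Cj) = average over pairs of d(x, y)."""
    return D[np.ix_(list(Ci), list(Cj))].mean()
```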
Dendrogram Using Single-Link Method
[Figure: Old Faithful eruption duration vs. wait data. Notice how single-link tends to "chain". The dendrogram y-axis = the crossbar's distance score.]
Dendrogram Using Ward's SSE Distance
[Figure: Old Faithful eruption duration vs. wait data. More balanced than single-link.]
Divisive Methods: Top-Down
• algorithm:
– begin with single cluster containing all data
– split into components, repeat until clusters = single points
• two major types:
– monothetic:
• split by one variable at a time -- restricts search space
• analogous to decision trees
– polythetic
• splits by all variables at once -- many choices make this difficult
• less commonly used than agglomerative methods
– generally more computationally intensive
• more choices in search space
MATLAB Demo
• In-class Matlab demo comparing different hierarchical algorithms
and k-means, on the same data sets
Spectral/Graph-based Clustering
Idea:
• think of the distance matrix as a weighted graph where
  – objects are nodes
  – edges exist for objects with high similarity
• finding a good grouping of the objects is somewhat equivalent to finding low-weight subgraphs
  – can be reduced to "min-cut" algorithms
  – related to the eigenstructure of the distance matrix
  – e.g., Shi and Malik, 1997, for image segmentation
(a minimal two-way spectral partition sketch follows below)
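The sketch below illustrates the graph view above in its simplest two-way form: turn the distance matrix into edge weights, form the graph Laplacian, and split by the sign of the second-smallest eigenvector (the Fiedler vector). The Gaussian-kernel weighting and the parameter sigma are illustrative assumptions, not the Shi-Malik formulation itself.

```python
import numpy as np

def spectral_bipartition(D, sigma=1.0):
    """Two-way spectral partition of objects given an (n x n) distance matrix D.
    Builds a similarity (weighted-graph) matrix, forms the graph Laplacian,
    and splits by the sign of the second-smallest eigenvector, giving an
    approximate min-cut style partition."""
    W = np.exp(-(D ** 2) / (2 * sigma ** 2))   # edge weights: high for similar objects
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W             # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)       # eigenvectors sorted by eigenvalue
    fiedler = eigvecs[:, 1]                    # second-smallest eigenvector
    return (fiedler > 0).astype(int)           # cluster labels 0 / 1
```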
Clustering non-vector objects
• E.g., sequences, images, documents, etc.
  – Can be of varying lengths, sizes
• Distance matrix approach
  – E.g., compute edit distances/transformations for pairs of sequences (a sketch follows below)
  – Apply clustering (e.g., hierarchical) based on the distance matrix
  – However… does not scale well computationally
• "Vectorization"
  – Represent each object as a vector
  – Cluster the resulting vectors using a vector-space algorithm
  – However… can lose (e.g., sequence) information by going to a vector space
• Probabilistic model-based clustering
  – Treat as a mixture of stochastic models (e.g., Markov models)
  – Can naturally handle variable lengths/sizes
  – Will discuss application to Web session clustering later in the quarter
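A small sketch of the distance-matrix approach for sequences: a dynamic-programming edit (Levenshtein) distance plus a pairwise distance matrix that could then be fed to a distance-based method such as the agglomerative sketch earlier; the function names are illustrative.

```python
import numpy as np

def edit_distance(s, t):
    """Levenshtein edit distance between two sequences (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute
        prev = cur
    return prev[-1]

def edit_distance_matrix(seqs):
    """Symmetric matrix of pairwise edit distances between the sequences."""
    n = len(seqs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = edit_distance(seqs[i], seqs[j])
    return D
```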
Clustering of Tropical Cyclones, Western North Pacific, 1983-2002
[Figure: from Scott Gaffney's PhD thesis, 2004; uses mixture-of-regressions clustering]
K-Means Clustering
Task: Clustering
Representation: Partition based on K centers
Score Function: Within-cluster sum of squared errors
Search/Optimization: Iterative greedy search
Data Management: None specified
Models, Parameters: K centers
Probabilistic Model-Based Clustering
Task: Clustering
Representation: Mixture of probability components
Score Function: Log-likelihood
Search/Optimization: EM (iterative)
Data Management: None specified
Models, Parameters: Probability model
Single-Link Hierarchical Clustering
Task: Clustering
Representation: Tree of nested groupings
Score Function: No global score
Search/Optimization: Iterative merging of nearest neighbors
Data Management: None specified
Models, Parameters: Dendrogram
Summary
• General comments:
  – Many different approaches and algorithms
  – What type of cluster structure are you looking for?
  – Computational complexity may be an issue for large n
  – Dimensionality is also an issue
  – Validation/selection of K is often an ill-posed problem
• Reading
  – Chapter 9
    • Covers all of the clustering methods discussed here (except graph/spectral clustering)
  – See also the earlier reading references