Transcript
Clustering
Math Department, UMN
Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore
What is Clustering?

Partitioning data into classes with:
- high intra-class similarity
- low inter-class similarity

Is it well-defined?

What is Similarity?

Clearly a subjective or problem-dependent measure.

How Similar Are the Clusters?

Ex. 1: Two clusters or one cluster?

How Similar Are the Clusters?

Ex. 2: A cluster or outliers?
Sum-Squares Intra-class Similarity

Given a cluster $S_1 = \{x_1, \ldots, x_{N_1}\}$:

Mean: $\mu_1 = \frac{1}{N_1} \sum_{x_i \in S_1} x_i$

Within Cluster Sum of Squares:
$\mathrm{WCSS}(S_1) = \sum_{x_i \in S_1} \| x_i - \mu_1 \|^2$, where $\| y \|^2 = \sum_{j=1}^{D} (y)_j^2$

Note that $\mu_1 = \operatorname{argmin}_c \sum_{x_i \in S_1} \| x_i - c \|^2$
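The definitions above translate directly to NumPy; the following is a minimal sketch (the function name `wcss_single` is ours):

```python
import numpy as np

def wcss_single(X):
    """Within Cluster Sum of Squares for a single cluster.

    X is an (N, D) array of points; the cluster mean is the
    center that minimizes the sum of squared distances.
    """
    mu = X.mean(axis=0)                  # cluster mean mu_1
    return float(np.sum((X - mu) ** 2))  # sum of squared distances to mu_1
```

For example, for the cluster {(0, 0), (2, 0)} the mean is (1, 0) and the WCSS is 2.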
Within Cluster Sum of Squares

For a set of clusters $S = \{S_1, \ldots, S_K\}$:

$\mathrm{WCSS}(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \| x_i - \mu_j \|^2$

Can use $\| y \|_1 = \sum_{j=1}^{D} |(y)_j|$ instead, so get the Within Clusters Manhattan Distance:

$\mathrm{WCMD}(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \| x_i - m_j \|_1$, where $m_j = \operatorname{argmin}_c \sum_{x_i \in S_j} \| x_i - c \|_1$

Question: how to compute/estimate $c$?
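On the question of computing the centers: for the squared Euclidean distance the minimizer is the cluster mean, while for the Manhattan distance it is the coordinate-wise median. A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def wcss(clusters):
    # Sum over clusters of squared distances to each cluster's mean
    # (the mean minimizes the squared-distance criterion).
    return float(sum(np.sum((S - S.mean(axis=0)) ** 2) for S in clusters))

def wcmd(clusters):
    # Manhattan analogue: the minimizing center m_j is the
    # coordinate-wise median of the cluster, not its mean.
    return float(sum(np.sum(np.abs(S - np.median(S, axis=0))) for S in clusters))
```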
Minimizing WCSS

Precise minimization is "NP-hard".
- Approximate minimization for WCSS by K-means
- Approximate minimization for WCMD by K-medians

The K-means Algorithm

- Input: data & number of clusters (K)
- Randomly guess locations of the K cluster centers
- Assign each point to its nearest center; move each center to the mean of its assigned points
- Repeat till convergence
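The steps above can be sketched as Lloyd's algorithm in NumPy (a minimal illustration; initializing by sampling K distinct data points is one common choice we assume here, and the result is only a local minimum of WCSS):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm for K-means (finds a local minimum of WCSS).

    X: (N, D) data array; K: number of clusters.
    Returns the final centers and the cluster label of each point.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers as K distinct data points (assumed choice).
    centers = X[rng.choice(len(X), K, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (empty clusters keep their previous center).
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):   # converged
            break
        centers = new
    return centers, labels
```

Replacing the mean update with a coordinate-wise median (and the Euclidean distance with the Manhattan distance) gives K-medians.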
Demonstration: K-means/medians

Applet
K-means: Pros and Cons

Pros
- Often fast
- Often terminates at a local minimum

Cons
- May not obtain the global minimum
- Depends on initialization
- Need to specify K
- Sensitive to outliers
- Sensitive to variations in sizes and densities of clusters
- Not suitable for non-convex shapes
- Does not apply directly to categorical data
Spectral Clustering

Idea: embed data for easy clustering
- Construct weights based on proximity:
  $W_{ij} = e^{-\| x_i - x_j \|^2 / \sigma^2}$ if $i \neq j$, and $0$ otherwise
  (normalize $W$)
- Embed using eigenvectors of $W$
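A minimal sketch of the embedding step, assuming the Gaussian weights above with bandwidth `sigma` and the symmetric normalization $D^{-1/2} W D^{-1/2}$ (one common choice; the slide leaves the normalization unspecified):

```python
import numpy as np

def spectral_embed(X, sigma=1.0, dim=2):
    """Embed points via top eigenvectors of a normalized affinity matrix."""
    # Gaussian weights: W_ij = exp(-||x_i - x_j||^2 / sigma^2), W_ii = 0.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization D^{-1/2} W D^{-1/2} (assumed choice).
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    W_norm = W * inv_sqrt[:, None] * inv_sqrt[None, :]
    # np.linalg.eigh returns eigenvalues in ascending order, so the
    # eigenvectors of the `dim` largest eigenvalues are the last columns.
    _, vecs = np.linalg.eigh(W_norm)
    return vecs[:, -dim:]
```

Running K-means on the embedded coordinates then completes the spectral clustering pipeline.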
Clustering vs. Classification

- Clustering: find classes in an unsupervised way (often K is given though)
- Classification: labels of clusters are given for some data points (supervised learning)
Data 1: Face images

- Facial images (e.g., of persons 5, 8, 10) live on different "planes" in the "image space"
- They are often well-separated so that simple clustering can apply to them (but not always…)
- Question: What is the high-dimensional image space?
- Question: How can we present high-dim. data in 3D?
Data 2: Iris Data Set

Setosa, Versicolor, Virginica

- 50 samples from each of 3 species
- 4 features per sample: length & width of sepal and petal
Data 2: Iris Data Set

- Setosa is clearly separated from the 2 others
- Can't separate Virginica and Versicolor (need a training set, as done by Fisher in 1936)
- Question: What are other ways to visualize?
Data 3: Color-based Compression of Images

Applet
- Question: What are the actual data points?
- Question: What does the error mean?
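Here the data points are the RGB vectors of the individual pixels, and the error is the mean squared distance between each original pixel and the representative color it is mapped to (i.e., WCSS divided by the number of pixels). A minimal K-means sketch of this (the function name is ours):

```python
import numpy as np

def quantize_colors(pixels, K, n_iter=50, seed=0):
    """Compress an image's palette to K representative colors.

    pixels: (N, 3) array of RGB vectors -- these are the data points.
    Returns the quantized pixels and the mean squared error.
    """
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each pixel to its nearest representative color.
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each representative to the mean of its pixels.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = pixels[labels == k].mean(axis=0)
    quantized = centers[labels]
    mse = float(np.mean(np.sum((pixels - quantized) ** 2, axis=1)))
    return quantized, mse
```

Storing only the K colors plus one label per pixel is what gives the compression.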
Some methods for # of Clusters (with online codes)

- Gap statistics
- Model-based clustering
- G-means
- X-means
- Data-spectroscopic clustering
- Self-tuning clustering
algorithms, codes)
Focus: methods for determining # of clusters

- Understand details
- Compare using artificial and real data
- Conclude good/bad scenarios for each (prove?)
- Come up with new/improved methods
- Summarize info: literature survey and possibly new/improved demos/applets
We can suggest additional questions tailored to