Clustering
Gilad Lerman, Math Department, UMN
Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore

What is Clustering?
* Partitioning data into classes with
  - high intra-class similarity
  - low inter-class similarity
* Is it well-defined?

What is Similarity?
* Clearly, a subjective measure, or problem-dependent

How Similar Are Clusters?
* Ex1: Two clusters or one cluster?
* Ex2: A cluster or outliers?

Sum-of-Squares Intra-class Similarity
* Given a cluster S_1 = {x_1, ..., x_{N_1}}
* Mean: \mu_1 = \frac{1}{N_1} \sum_{x_i \in S_1} x_i
* Within Cluster Sum of Squares:
  WCSS(S_1) = \sum_{x_i \in S_1} \|x_i - \mu_1\|_2^2, where \|y\|_2 = (\sum_{j=1}^{D} y_j^2)^{1/2}
* Note that \mu_1 = argmin_c \sum_{x_i \in S_1} \|x_i - c\|_2^2

Within Cluster Sum of Squares
* For a set of clusters S = {S_1, ..., S_K}:
  WCSS(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \|x_i - \mu_j\|_2^2
* Can use \|y\|_1 = \sum_{j=1}^{D} |y_j| instead of \|y\|_2
* So we get the Within Clusters Manhattan Distance:
  WCMD(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \|x_i - m_j\|_1, where m_j = argmin_c \sum_{x_i \in S_j} \|x_i - c\|_1
* Question: how to compute/estimate m_j?

Minimizing WCSS
* Precise minimization is "NP-hard"
* Approximate minimization of WCSS by K-means
* Approximate minimization of WCMD by K-medians

The K-means Algorithm
* Input: data & number of clusters (K)
* Randomly guess locations of the K cluster centers
* Assign each point to its nearest center
* Recompute each center as the mean of its assigned points
* Repeat till convergence

Demonstration: K-means/medians Applet

K-means: Pros and Cons
* Pros
  - Often fast
  - Often terminates at a local minimum
* Cons
  - May not obtain the global minimum
  - Depends on initialization
  - Need to specify K
  - Sensitive to outliers
  - Sensitive to variations in the sizes and densities of clusters
  - Not suitable for non-convex shapes
  - Does not apply directly to categorical data

Spectral Clustering
* Idea: embed the data so that clustering becomes easy
* Construct weights based on proximity:
  W_{ij} = e^{-\|x_i - x_j\|^2 / \sigma^2} if i \neq j, and 0 otherwise
* (Normalize W)
* Embed using eigenvectors of W

Clustering vs. Classification
* Clustering: find classes in an unsupervised way (often K is given, though)
* Classification: labels of clusters are given for some data points (supervised learning)

Data 1: Face Images
* Facial images (e.g., of persons 5, 8, 10) live on different "planes" in the "image space"
* They are often well-separated, so simple clustering can apply to them (but not always...)
* Question: What is the high-dimensional image space?
* Question: How can we present high-dim. data in 3D?

Data 2: Iris Data Set
* Three species: Setosa, Versicolor, Virginica
* 50 samples from each of the 3 species
* 4 features per sample: length & width of sepal and petal
* Setosa is clearly separated from the 2 others
* Can't separate Virginica and Versicolor (need a training set, as done by Fisher in 1936)
* Question: What are other ways to visualize?

Data 3: Color-based Compression of Images
* Applet
* Question: What are the actual data points?
* Question: What does the error mean?

Some Methods for # of Clusters (with online codes)
* Gap statistics
* Model-based clustering
* G-means
* X-means
* Data-spectroscopic clustering
* Self-tuning clustering

Your Mission
* Learn about clustering (theoretical results, algorithms, codes)
* Focus: methods for determining the # of clusters
* Understand the details
* Compare the methods using artificial and real data
* Conclude good/bad scenarios for each method (prove?)
* Come up with new/improved methods
* Summarize the info: a literature survey and possibly new/improved demos/applets
* We can suggest additional questions tailored to your interest
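The WCSS and WCMD objectives defined above can be sketched in NumPy. This is a minimal illustration; the function names `wcss` and `wcmd` are mine, not from the slides:

```python
import numpy as np

def wcss(clusters):
    """Within Clusters Sum of Squares: sum of squared L2 distances
    of each point to its cluster mean."""
    total = 0.0
    for S in clusters:
        S = np.asarray(S, dtype=float)
        mu = S.mean(axis=0)        # the mean minimizes the sum of squared distances
        total += ((S - mu) ** 2).sum()
    return total

def wcmd(clusters):
    """Within Clusters Manhattan Distance: sum of L1 distances of each
    point to its cluster's coordinate-wise median (the L1 minimizer m_j)."""
    total = 0.0
    for S in clusters:
        S = np.asarray(S, dtype=float)
        m = np.median(S, axis=0)   # coordinate-wise median minimizes the L1 sum
        total += np.abs(S - m).sum()
    return total
```

This also answers the slide's question for the L1 case: `m_j` can be estimated coordinate-wise as the median of each coordinate.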
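The K-means loop described above (random centers, assign, recompute, repeat) can be sketched as follows. This is a bare-bones version for illustration, not a production implementation; as noted in the cons, the result depends on the random initialization:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Lloyd-style K-means sketch: alternate nearest-center assignment
    and mean updates until the centers stop moving."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # randomly guess K cluster centers by picking K distinct data points
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (squared L2 distances)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned points;
        # keep the old center if a cluster ends up empty
        new_centers = np.array([X[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):   # converged to a local minimum
            break
        centers = new_centers
    return labels, centers
```

Replacing the mean update with a coordinate-wise median (and L1 distances in the assignment step) gives the K-medians variant for WCMD.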
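The spectral embedding step above can be sketched as well. The slide only says "(Normalize W)"; the symmetric normalization D^{-1/2} W D^{-1/2} used here is one common choice, and `sigma` and `dim` are illustrative parameters:

```python
import numpy as np

def spectral_embedding(X, sigma=1.0, dim=2):
    """Build the affinity W_ij = exp(-||x_i - x_j||^2 / sigma^2) for i != j
    (0 on the diagonal), normalize it, and embed the data using the
    eigenvectors of the largest eigenvalues."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / sigma**2)
    np.fill_diagonal(W, 0.0)              # W_ij = 0 if i == j
    deg = W.sum(axis=1)                   # node degrees
    Wn = W / np.sqrt(np.outer(deg, deg))  # D^{-1/2} W D^{-1/2}
    vals, vecs = np.linalg.eigh(Wn)       # eigenvalues in ascending order
    return vecs[:, -dim:]                 # top-dim eigenvectors as coordinates
```

Running K-means on the rows of the returned embedding is the usual final step, which is how spectral clustering handles the non-convex shapes that plain K-means cannot.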