Clustering
Gilad Lerman, Math Department, UMN
Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore

What is Clustering?
* Partitioning data into classes with
  - high intra-class similarity
  - low inter-class similarity
* Is it well-defined?

What is Similarity?
* Clearly, a subjective measure, or problem-dependent

How Similar Are Clusters?
* Ex1: Two clusters or one cluster?
* Ex2: A cluster or outliers?

Sum-of-Squares Intra-class Similarity
* Given a cluster S_1 = {x_1, ..., x_{N_1}}
* Mean: \mu_1 = \frac{1}{N_1} \sum_{x_i \in S_1} x_i
* Within Cluster Sum of Squares:
  WCSS(S_1) = \sum_{x_i \in S_1} \|x_i - \mu_1\|_2^2, where \|y\|_2 = (\sum_{j=1}^{D} y_j^2)^{1/2}
* Note that \mu_1 = argmin_c \sum_{x_i \in S_1} \|x_i - c\|_2^2

Within Cluster Sum of Squares
* For a set of clusters S = {S_1, ..., S_K}:
  WCSS(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \|x_i - \mu_j\|_2^2
* Can use \|y\|_1 = \sum_{j=1}^{D} |y_j| instead of \|y\|_2
* So we get the Within Clusters Manhattan Distance:
  WCMD(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \|x_i - m_j\|_1, where m_j = argmin_c \sum_{x_i \in S_j} \|x_i - c\|_1
* Question: how to compute/estimate m_j?

Minimizing WCSS
* Precise minimization is "NP-hard"
* Approximate minimization of WCSS by K-means
* Approximate minimization of WCMD by K-medians

The K-means Algorithm
* Input: data & number of clusters (K)
* Randomly guess locations of the K cluster centers
* Assign each point to its nearest center
* Recompute each center as the mean of its assigned points
* Repeat till convergence

Demonstration: K-means/medians Applet

K-means: Pros and Cons
* Pros
  - Often fast
  - Often terminates at a local minimum
* Cons
  - May not obtain the global minimum
  - Depends on initialization
  - Need to specify K
  - Sensitive to outliers
  - Sensitive to variations in the sizes and densities of clusters
  - Not suitable for non-convex shapes
  - Does not apply directly to categorical data

Spectral Clustering
* Idea: embed the data so that clustering becomes easy
* Construct weights based on proximity:
  W_{ij} = e^{-\|x_i - x_j\|^2 / \sigma^2} if i \neq j, and 0 otherwise
* (Normalize W)
* Embed using eigenvectors of W

Clustering vs. Classification
* Clustering: find classes in an unsupervised way (often K is given, though)
* Classification: labels of clusters are given for some data points (supervised learning)

Data 1: Face Images
* Facial images (e.g., of persons 5, 8, 10) live on different "planes" in the "image space"
* They are often well-separated, so simple clustering can apply to them (but not always...)
* Question: What is the high-dimensional image space?
* Question: How can we present high-dim. data in 3D?

Data 2: Iris Data Set
* Three species: Setosa, Versicolor, Virginica
* 50 samples from each of the 3 species
* 4 features per sample: length & width of sepal and petal
* Setosa is clearly separated from the 2 others
* Can't separate Virginica and Versicolor (need a training set, as done by Fisher in 1936)
* Question: What are other ways to visualize?

Data 3: Color-based Compression of Images
* Applet
* Question: What are the actual data points?
* Question: What does the error mean?

Some Methods for # of Clusters (with online codes)
* Gap statistics
* Model-based clustering
* G-means
* X-means
* Data-spectroscopic clustering
* Self-tuning clustering

Your Mission
* Learn about clustering (theoretical results, algorithms, codes)
* Focus: methods for determining the # of clusters
* Understand the details
* Compare the methods using artificial and real data
* Conclude good/bad scenarios for each method (prove?)
* Come up with new/improved methods
* Summarize the info: a literature survey and possibly new/improved demos/applets
* We can suggest additional questions tailored to your interest
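The WCSS and WCMD objectives defined above can be sketched in NumPy. This is a minimal illustration; the function names `wcss` and `wcmd` are mine, not from the slides:

```python
import numpy as np

def wcss(clusters):
    """Within Clusters Sum of Squares: sum of squared L2 distances
    of each point to its cluster mean."""
    total = 0.0
    for S in clusters:
        S = np.asarray(S, dtype=float)
        mu = S.mean(axis=0)        # the mean minimizes the sum of squared distances
        total += ((S - mu) ** 2).sum()
    return total

def wcmd(clusters):
    """Within Clusters Manhattan Distance: sum of L1 distances of each
    point to its cluster's coordinate-wise median (the L1 minimizer m_j)."""
    total = 0.0
    for S in clusters:
        S = np.asarray(S, dtype=float)
        m = np.median(S, axis=0)   # coordinate-wise median minimizes the L1 sum
        total += np.abs(S - m).sum()
    return total
```

This also answers the slide's question for the L1 case: `m_j` can be estimated coordinate-wise as the median of each coordinate.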
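The K-means loop described above (random centers, assign, recompute, repeat) can be sketched as follows. This is a bare-bones version for illustration, not a production implementation; as noted in the cons, the result depends on the random initialization:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Lloyd-style K-means sketch: alternate nearest-center assignment
    and mean updates until the centers stop moving."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # randomly guess K cluster centers by picking K distinct data points
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (squared L2 distances)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned points;
        # keep the old center if a cluster ends up empty
        new_centers = np.array([X[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):   # converged to a local minimum
            break
        centers = new_centers
    return labels, centers
```

Replacing the mean update with a coordinate-wise median (and L1 distances in the assignment step) gives the K-medians variant for WCMD.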
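The spectral embedding step above can be sketched as well. The slide only says "(Normalize W)"; the symmetric normalization D^{-1/2} W D^{-1/2} used here is one common choice, and `sigma` and `dim` are illustrative parameters:

```python
import numpy as np

def spectral_embedding(X, sigma=1.0, dim=2):
    """Build the affinity W_ij = exp(-||x_i - x_j||^2 / sigma^2) for i != j
    (0 on the diagonal), normalize it, and embed the data using the
    eigenvectors of the largest eigenvalues."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / sigma**2)
    np.fill_diagonal(W, 0.0)              # W_ij = 0 if i == j
    deg = W.sum(axis=1)                   # node degrees
    Wn = W / np.sqrt(np.outer(deg, deg))  # D^{-1/2} W D^{-1/2}
    vals, vecs = np.linalg.eigh(Wn)       # eigenvalues in ascending order
    return vecs[:, -dim:]                 # top-dim eigenvectors as coordinates
```

Running K-means on the rows of the returned embedding is the usual final step, which is how spectral clustering handles the non-convex shapes that plain K-means cannot.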