Clustering
Gilad Lerman, Math Department, UMN
Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore

What is Clustering?
- Partitioning data into classes with
  - high intra-class similarity
  - low inter-class similarity
- Is it well-defined?

What is Similarity?
- Clearly a subjective measure, or problem-dependent

How Similar Are Clusters?
- Ex. 1: Two clusters or one cluster?
- Ex. 2: A cluster, or outliers?

Sum-of-Squares Intra-Class Similarity
Given a cluster S_1 = {x_1, …, x_{N_1}}:
- Mean: μ_1 = (1/N_1) Σ_{x_i ∈ S_1} x_i
- Within-cluster sum of squares:
  WCSS(S_1) = Σ_{x_i ∈ S_1} ‖x_i − μ_1‖₂², where ‖y‖₂² = Σ_{j=1}^D (y_j)²
- Note that μ_1 = argmin_c Σ_{x_i ∈ S_1} ‖x_i − c‖₂²

Within-Cluster Sum of Squares
For a set of clusters S = {S_1, …, S_K}:
  WCSS(S) = Σ_{j=1}^K Σ_{x_i ∈ S_j} ‖x_i − μ_j‖₂²
One can use ‖y‖₁ = Σ_{j=1}^D |y_j| instead of ‖y‖₂², which gives the within-clusters Manhattan distance:
  WCMD(S) = Σ_{j=1}^K Σ_{x_i ∈ S_j} ‖x_i − m_j‖₁, where m_j = argmin_c Σ_{x_i ∈ S_j} ‖x_i − c‖₁
Question: how do we compute/estimate m_j?

Minimizing WCSS
- Precise minimization is "NP-hard"
- Approximate minimization of WCSS: K-means
- Approximate minimization of WCMD: K-medians

The K-means Algorithm
- Input: data & number of clusters (K)
- Randomly guess the locations of the K cluster centers
- Assign each point to its nearest center
- Update the centers and repeat till convergence

Demonstration: K-means/medians applet

K-means: Pros and Cons
Pros:
- Often fast
- Often terminates at a local minimum
Cons:
- May not obtain the global minimum
- Depends on the initialization
- Need to specify K
- Sensitive to outliers
- Sensitive to variations in the sizes and densities of clusters
- Not suitable for non-convex shapes
- Does not apply directly to categorical data

Spectral Clustering
Idea: embed the data so that it is easy to cluster
- Construct weights based on proximity:
  W_ij = exp(−‖x_i − x_j‖² / σ²) if i ≠ j, and 0 otherwise
- (Normalize W)
- Embed using the eigenvectors of W
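The K-means loop described above (random initial centers, nearest-center assignment, center updates until convergence) can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions of my own (initial centers drawn from the data, a fixed iteration cap, empty clusters left in place), not the applet's implementation:

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal Lloyd's-algorithm sketch: alternate nearest-center
    assignment and mean updates until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly guess K centers from the data
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for each point.
        labels = [min(range(k), key=lambda j: dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members)
                                         for c in zip(*members)))
            else:  # an emptied cluster keeps its old center
                new_centers.append(centers[j])
        if new_centers == centers:  # converged (centers no longer move)
            break
        centers = new_centers
    # WCSS of the final partition, as defined above.
    total = sum(dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
    return labels, centers, total

# Toy usage: two well-separated 2-D blobs.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels, centers, total = kmeans(pts, k=2)
```

Since the algorithm only reaches a local minimum, a common refinement is to rerun it with several random initializations and keep the run with the smallest WCSS.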
Clustering vs. Classification
- Clustering: find classes in an unsupervised way (though K is often given)
- Classification: labels of clusters are given for some data points (supervised learning)

Data 1: Face Images
- Facial images (e.g., of persons 5, 8, 10) live on different "planes" in the "image space"
- They are often well-separated, so simple clustering can apply to them (but not always…)
- Question: What is the high-dimensional image space?
- Question: How can we present high-dimensional data in 3D?

Data 2: Iris Data Set
- Three species: Setosa, Versicolor, Virginica
- 50 samples from each of the 3 species
- 4 features per sample: length & width of sepal and petal
- Setosa is clearly separated from the other two
- Virginica and Versicolor can't be separated (a training set is needed, as done by Fisher in 1936)
- Question: What are other ways to visualize?

Data 3: Color-Based Compression of Images (applet)
- Question: What are the actual data points?
- Question: What does the error mean?

Some Methods for the Number of Clusters (with online code)
- Gap statistics
- Model-based clustering
- G-means
- X-means
- Data-spectroscopic clustering
- Self-tuning clustering

Your Mission
- Learn about clustering (theoretical results, algorithms, code)
- Focus: methods for determining the number of clusters
- Understand the details
- Compare the methods using artificial and real data
- Conclude good/bad scenarios for each method (prove?)
- Come up with new/improved methods
- Summarize the information: a literature survey and possibly new/improved demos/applets
- We can suggest additional questions tailored to your interests
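None of the listed methods is implemented here; as a minimal, self-contained illustration of the quantity they reason about, the sketch below computes the exact minimum WCSS by brute force over all labelings (feasible only for a handful of points, which is exactly why the NP-hard minimization is approximated by K-means) and shows that WCSS keeps decreasing as K grows, with the sharpest drop at the true number of clusters. The toy data and function names are my own:

```python
from itertools import product

def wcss(points, labels, k):
    """Within-cluster sum of squares for one labeling of the points."""
    total = 0.0
    for j in range(k):
        cluster = [p for p, l in zip(points, labels) if l == j]
        if not cluster:
            continue
        mean = [sum(c) / len(cluster) for c in zip(*cluster)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mean))
                     for p in cluster)
    return total

def best_wcss(points, k):
    """Exact minimum WCSS by exhaustive search over all k^n labelings.
    Exponential cost -- only for tiny examples; this is the NP-hard
    problem that K-means/K-medians approximate."""
    return min(wcss(points, labels, k)
               for labels in product(range(k), repeat=len(points)))

# Two well-separated blobs, so the true number of clusters is 2.
pts = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2),
       (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
curve = [best_wcss(pts, k) for k in (1, 2, 3)]
# WCSS always decreases as K grows; the sharp drop ("elbow") from
# K = 1 to K = 2 is the simplest heuristic for choosing K.
```

The methods listed on the slide (gap statistics, G-means, X-means, …) can be read as principled replacements for eyeballing this elbow.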