Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Hierarchical Stability Based Model Selection for Data Clustering Bing Yin Advisor: Greg Hamerly 5/22/2017 1 Roadmap What is clustering? What is model selection for clustering algorithms? Stability Based Model Selection: Proposals and Problems Hierarchical Stability Based Model Selection ● Algorithm ● Unimodality Test ● Experiments Future work Main Contribution ● Extended the concept of stability to hierarchical stability. ● Solved the symmetric data sets problem. ● Make stability a competing tool for model selection. 5/22/2017 2 What is clustering? Given: ● data set of “objects” ● some relations between those objects: similarities, distances, neighborhoods, connections,… Goal: Find meaningful groups of objects s. t. ● objects in the same group are “similar” ● objects in different groups are “dissimilar” Clustering is: ● a form of unsupervised learning ● a method of data exploration 5/22/2017 3 What is clustering? An Example Image Segmentation: Micro array Analysis: Serum Stimulation of Human Fibroblasts (Eisen,Spellman,PNAS,1998) ● 9800 spots representing 8600 genes ● 12 samples taken over 24 hour period Document Clustering Post-search Grouping Data Mining Social Network Analysis Gene Family Grouping … 5/22/2017 ● Clusters can be roughly categorized as gene involved in A: cholesterol biosynthesis B: the cell cycle C: the immediate-early response D: signaling and angiogenesis E: wound healing and tissue remodeling 4 What is clustering? An Algorithm K-Means algorithm (Lloyd, 1957) Given: data points X1,…,Xn d, number K clusters to find. 1. Randomly initialize the centers m10,…,mK0. 2. Iterate until convergence: 2.1 Assign each point to the closest center according to Euclidean distance, i.e., define clusters C1i+1,…,CKi+1 by Xs Cki+1 where ||Xs-mki||2 < ||Xs-mli||2, l=1 to K 2.2 Compute the new cluster centers by mki+1 = Xs / |Cki+1| What is optimized? Minimizing within-cluster distances: 5/22/2017 5 What is model selection? Clustering algorithms need to know the K before running. The correct answer of K for a given data is unknown So we need a better way to find this K and also the positions of the K centers This can be intuitively called model selection for clustering algorithms. Existing model selection method: ● Bayesian Information Criterion ● Gap statistics ● Projection Test … ● Stability based approach 5/22/2017 6 Stability Based Model Selection The basic idea: ● scientific truth should be reproducible in experiments. Repeatedly run a clustering algorithm on the same data with parameter K and get a collection of clustering: ● If K is the correct model, clustering should be similar to each other ● If K is a wrong model, clustering may be quite different from each other This fact is referred as the stability of K (Ulrike von Luxburg,2007) 5/22/2017 7 Stability Based Model Selection(2) Example on the toy data: If we can mathematically define this stability score for K, then stability can be used to find the correct model for the given data. 5/22/2017 8 Define the Stability Variation of Information (VI) ● Clustering C1: X1,…,Xk and Clustering C2: X’1,…,X’k on date X ● The prob. point p belongs in Xi is : ● The entropy of C1: ● The joint prob. p in Xi and X’j is P(i,j) with entropy: ● The VI is defined as: VI indicates a distance between two clustering. 5/22/2017 9 Define the stability (2) Calculate the VI score for a single K ● Clustering the data using K-Means for K clusters, run M times ● Calculate pair wise VI of these M clustering. ● Average the VI and use it as the VI score for K The calculated VI score for K indicates instability of K Try this over different K The K with lowest VI score/instability is chosen as the correct model 5/22/2017 10 Define the Stability(3) An good example of Stability An bad example of Stability: symmetric data Why? Because Clustering data into 9 clusters apparently has more grouping choices than clustering them into 3. 5/22/2017 11 Hierarchical Stability Problems with the concept of stability introduced above: ● Symmetric Data Set ● Only local optimization – the smaller K Proposed solution ● Analyze the stability in an hierarchical manner ● Do Unimodality Test to detect the termination of the recursion 5/22/2017 12 Hierarchical Stability Given: Data set X HS-means: ● 1. Test if X is not a unimodal cluster ● 2. If yes, find the optimal K for X by analyzing stability; otherwise, X is a single cluster, return. ● 3. Partition X into K subsets ● 4. For each subset, recursively perform this algorithm from step 1 ● 5. Merge answers from each subset as answer for current data 5/22/2017 13 Unimodality Test - 2 Unimodality test Fact: sum of squared Gaussians follows 2 distribution. ● If x1,…,xd are d independent Gaussian variables, then S = x12+…+xd2 follows 2 distribution of degree d. For given data set X, calculate Si=Xi12+…+Xid2 ● If X is a single Gaussian, then S follows 2 of degree d ● Otherwise, S is not a 2 distribution. 5/22/2017 14 Unimodality Test - Gap Test Fact: the within cluster dispersion drops most apparently with the correct K (Tibshirani, 2000) Given: Data set X, candidate k ● cluster X to k clusters and get within cluster dispersion Wk ● generate uniform data sets, cluster to k clusters, calculate W*k (averaged) ● gap(k) = W*k – Wk ● select smallest k s. t. gap(k)>gap(k+1) ● we use it in another way: just ask k=1? 5/22/2017 15 Experiments Synthetic data ● Both Gaussian Distribution and Uniform Distribution ● In dimensions from 2 up to 20 ● c-separation between each cluster center and its nearest neighbor is 4 ● 200 points in each cluster, 10 clusters in total Handwritten Digits ● U.S. Postal Service handwritten digits ● 9298 instances in 256 dimensions ● 10 true clusters (maybe!) KDDD Control Curves ● 600 instances in 60 dimensions ● 6 true clusters, each has 100 instances Synthetic Gaussian (10 true clusters) Synthetic Uniform (10 true clusters) Handwritten Digits (10 true clusters) KDDD Control Curves (6 true clusters) HS-means 101 101 60 6.50.5 Lange Stability 6.51.5 71 20 30 PG-means 101 19.51.5 201 171 5/22/2017 16 Experiments – symmetric data HS-means 5/22/2017 Lange Stability 17 Future Work ● Better Unimodality Testing approach. ● More detailed comparison on the performance with existing method like within cluster distance, VI metric and so on. ● Improve the speed of the algorithm. 5/22/2017 18 Questions and Comments Thank you! 5/22/2017 19