Gaussian Kernel Width Exploration and Cone Cluster Labeling for Support Vector Clustering
Sei-Hyung Lee and Karen Daniels
Department of Computer Science, University of Massachusetts Lowell
Nov. 28, 2007

Outline
• Clustering Overview
• SVC Background and Related Work
• Selection of Gaussian Kernel Widths
• Cone Cluster Labeling
• Comparisons
• Contributions
• Future Work

Clustering Overview
• Clustering: discovering natural groups in data
• Clustering problems arise in
  – bioinformatics (e.g., patterns of gene expression)
  – data mining and compression
  – pattern recognition and classification

Definition of Clustering
• Definition by Everitt (1974):
  – "A cluster is a set of entities which are alike, and entities from different clusters are not alike."
  – If we assume that the objects to be clustered are represented as points in the measurement space, then "clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points."

(Example figures.)

Sample Clustering Taxonomy (Zaiane 1999)
• Partitioning (fixed number of clusters k)
• Hierarchical
• Density-based
• Grid-based
• Model-based: Statistical (COBWEB), Neural Network (SOM)
• Hybrids are also possible.
http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/ (Chapter 8)

Strengths and Weaknesses
• Partitioning
  – Typical strength: relatively efficient, O(ikN)
  – Weaknesses: splits large clusters and merges small clusters; finds only spherical shapes; sensitive to outliers (k-means); requires a choice of k; sensitive to the initial selection
• Hierarchical
  – Typical strength: does not require a choice of k
  – Weaknesses: merge/split decisions can never be undone; requires a termination condition; does not scale well
• Density-based
  – Typical strength: discovers clusters of arbitrary shape
  – Weakness: sensitive to parameters
• Grid-based
  – Typical strength: fast processing time
  – Weaknesses: sensitive to parameters; cannot find clusters of arbitrary shape
• Model-based
  – Typical strength: exploits the underlying data distribution
  – Weaknesses: the model assumption is not always true; expensive to update; difficult for large data sets; slow

Comparison of Clustering Techniques

| Category | Method | Scalability | Arbitrary shape | Handles noise | Order dependency | High dimension | Time complexity |
|---|---|---|---|---|---|---|---|
| Partitional | k-means | YES | NO | NO | NO | YES | O(ikN) |
| Partitional | k-medoids | YES | NO | Outliers | NO | YES | O(ikN) |
| Partitional | CLARANS | YES | NO | Outliers | NO | NO | O(N^2) |
| Hierarchical | BIRCH | YES | NO | ? | NO | NO | O(N) |
| Hierarchical | CURE | YES | YES | YES | NO | NO | O(N^2 log N) |
| – | SVC | ? | YES | YES | NO | YES | O((N - Nbsv) Nsv) |
| Density-based | DBSCAN | YES | YES | YES | NO | NO | O(N log N) |
| Grid-based | STING | YES | NO | ? | NO | NO | O(N) |
| Model-based | COBWEB | NO | ? | ? | YES | NO | ? |

k = number of clusters, i = number of iterations, N = number of data points, Nsv = number of support vectors, Nbsv = number of bounded support vectors. SVC time is for a single combination of parameters.
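To make the "Arbitrary shape" column concrete, here is a small illustrative sketch that is not part of the original slides: it assumes scikit-learn is available and contrasts a partitioning method (k-means) with a density-based method (DBSCAN) on the classic non-spherical "two moons" data set. The parameter values are illustrative choices, not values from the talk.

```python
# Illustrative only: k-means (partitioning) vs. DBSCAN (density-based) on
# non-spherical "two moons" data, echoing the Arbitrary shape column above.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# DBSCAN typically recovers the two crescents; k-means tends to cut them in half.
print("k-means ARI:", round(adjusted_rand_score(y, kmeans_labels), 2))
print("DBSCAN  ARI:", round(adjusted_rand_score(y, dbscan_labels), 2))
```

The same contrast motivates SVC, which the table credits with both arbitrary-shape clusters and noise handling.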
Jain et al. Taxonomy (1999): Cross-Cutting Issues
• Agglomerative vs. divisive
• Monothetic vs. polythetic (sequential feature consideration)
• Hard vs. fuzzy
• Deterministic vs. stochastic
• Incremental vs. non-incremental
• Single link: distance between two clusters = minimum of the distances between all inter-cluster pairs.
• Complete link: distance between two clusters = maximum of the distances between all inter-cluster pairs.

More Recent Clustering Surveys
• Clustering Large Datasets (Mercer 2003)
  – Hybrid methods, e.g., a distribution-based clustering algorithm for clustering large spatial datasets (Xu et al. 1998): a hybrid of model-based, density-based, and grid-based clustering
• Doctoral thesis (Lee 2005), covering boundary-detecting methods:
  – AUTOCLUST (Estivill-Castro et al. 2000): Voronoi modeling and Delaunay triangulation
  – Random Walks (Harel et al. 2001): Delaunay triangulation modeling and k-nearest neighbors; random walk in a weighted graph
  – Support Vector Clustering (Ben-Hur et al. 2001): one-class Support Vector Machine plus cluster labeling

Overview of SVM
• Map non-linearly separable data into a feature space where they are linearly separable.
• Class of hyperplanes: $f(x) = \omega \cdot \Phi(x) + b = 0$, where $\omega$ is the normal vector of a hyperplane, $b$ is the offset from the origin, and $\Phi$ is the non-linear mapping.

Overview of SVC
• Support Vector Clustering (SVC): a clustering algorithm built on the (one-class) SVM
• Able to handle arbitrarily shaped clusters
• Able to handle outliers
• Able to handle high dimensions, but...
• Needs input parameters:
  – the width $q$ of the Gaussian kernel $K(x, y) = e^{-q\|x - y\|^2}$, the kernel function that defines the inner product in feature space
  – the soft margin $C$, which controls outliers

SVC Main Idea
• "Attract" the hyper-plane onto the data points instead of "repelling" it. The data space contours are not explicitly available.
• R: radius of the minimal hyper-sphere; a: center of the sphere; R(x): distance between $\Phi(x)$ and a
• BSV (bounded support vector): data x outside the sphere, R(x) > R; the number of BSVs is controlled by C
• SV (support vector): data x on the surface of the sphere, R(x) = R; the number of SVs is controlled by q
• Others: data x inside the sphere, R(x) < R
(Figure: data space points x mapped by $\Phi$ onto the unit ball in feature space, showing R, a, SVs, and BSVs.)

Find the Minimal Hyper-sphere (with BSVs)
• Primal problem: minimize $R^2$ subject to $\|\Phi(x_j) - a\|^2 \le R^2 + \xi_j$ and $\xi_j \ge 0$ for all $j$.
• Lagrangian (maximize over the multipliers, minimize over $R$, $a$, $\xi$), where $\beta_j \ge 0$ and $\mu_j \ge 0$ are Lagrange multipliers, $C$ is a constant, and $C\sum_j \xi_j$ is the penalty term for BSVs:
  $L = R^2 - \sum_j (R^2 + \xi_j - \|\Phi(x_j) - a\|^2)\,\beta_j - \sum_j \xi_j \mu_j + C\sum_j \xi_j$
• Setting the derivatives of $L$ with respect to $R$, $a$, and $\xi_j$ to zero gives
  $\sum_j \beta_j = 1$, $\quad a = \sum_j \beta_j \Phi(x_j)$, $\quad \beta_j = C - \mu_j$.
• KKT complementarity conditions:
  $\xi_j \mu_j = 0$, $\quad (R^2 + \xi_j - \|\Phi(x_j) - a\|^2)\,\beta_j = 0$.
• Wolfe dual form (only points on the boundary contribute): maximize
  $W = \sum_j \beta_j K(x_j, x_j) - \sum_{i,j} \beta_i \beta_j K(x_i, x_j)$
  subject to $0 \le \beta_j \le C$ and $\sum_j \beta_j = 1$, where $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$.
• Use $\beta_j$ to classify each data point:
  – SV: $0 < \beta_j < C$, $\xi_j = 0$ (on the surface of the sphere)
  – BSV: $\beta_j = C$, $\xi_j > 0$, $\mu_j = 0$ (outside the sphere)
  – Others: $\beta_j = 0$, $\xi_j = 0$ (inside the sphere)

Relationship Between the Minimal Hyper-sphere and Cluster Contours
• $R^2(x) = \|\Phi(x) - a\|^2 = K(x, x) - 2\sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)$
• The cluster contours are the data space points that map onto the surface of the minimal sphere: $\{x \mid R(x) = R\}$.
• Challenge: the contour boundaries are not explicitly available.
• The number of clusters increases with increasing q.

SVC High-Level Pseudo-Code

SVC(X)
  q ← initial value; C ← initial value (C = 1)
  loop
    K ← computeKernel(X, q)
    β ← solveLagrangian(K, C)
    clusterLabeling(X, β)
    if the clustering result is satisfactory, exit
    choose a new q and/or C
  end loop
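The pseudo-code above leaves computeKernel and solveLagrangian abstract. The following is a minimal illustrative sketch, not the authors' implementation: it builds the Gaussian kernel matrix, solves the Wolfe dual as a box-constrained quadratic program with SciPy's SLSQP solver, and evaluates the contour function R^2(x). The function names, solver choice, and toy data are my own assumptions for the example.

```python
# Minimal sketch (not the authors' code) of computeKernel / solveLagrangian and
# the contour function R^2(x) = K(x,x) - 2*sum_j beta_j K(x_j,x) + sum_ij beta_i beta_j K(x_i,x_j).
import numpy as np
from scipy.optimize import minimize

def compute_kernel(X, Y, q):
    """Gaussian kernel K(x, y) = exp(-q * ||x - y||^2)."""
    d2 = (np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-q * np.maximum(d2, 0.0))

def solve_lagrangian(K, C):
    """Maximize W(beta) = sum_j beta_j K_jj - beta^T K beta
    subject to 0 <= beta_j <= C and sum_j beta_j = 1."""
    N = K.shape[0]
    diag = np.diag(K)
    objective = lambda b: -(diag @ b - b @ K @ b)     # negate for a minimizer
    gradient = lambda b: -(diag - 2.0 * K @ b)
    constraint = {"type": "eq", "fun": lambda b: b.sum() - 1.0}
    result = minimize(objective, np.full(N, 1.0 / N), jac=gradient,
                      bounds=[(0.0, C)] * N, constraints=[constraint],
                      method="SLSQP")
    return result.x

def contour_r2(x, X, beta, q, K):
    """Squared feature-space distance between Phi(x) and the sphere center a."""
    k_x = compute_kernel(X, np.atleast_2d(x), q).ravel()
    return 1.0 - 2.0 * beta @ k_x + beta @ K @ beta   # K(x, x) = 1 for Gaussian

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two small, well-separated blobs purely for illustration.
    X = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(3.0, 0.1, (10, 2))])
    q, C = 1.0, 1.0
    K = compute_kernel(X, X, q)
    beta = solve_lagrangian(K, C)
    sv = np.where((beta > 1e-6) & (beta < C - 1e-6))[0]   # support vectors
    i0 = sv[0] if sv.size else int(np.argmax(beta))
    print("support vectors:", sv)
    print("R^2 =", round(float(contour_r2(X[i0], X, beta, q, K)), 4))
```

Any box-constrained QP solver could stand in for SLSQP here; the only structural requirements are the bounds 0 ≤ β_j ≤ C and the single equality constraint Σ_j β_j = 1.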
Previous Work on SVC
• Tax and Duin (1999): novelty detection using a (one-class) SVM
• SVC suggested by A. Ben-Hur, V. Vapnik, et al. (2001)
  – Complete Graph
  – Support Vector Graph
• J. Yang et al. (2002): Proximity Graph
• J. Park et al. (2004): Spectral Graph Partitioning
• J. Lee et al. (2005): Gradient Descent
• W. Puma-Villanueva et al. (2005): Ensembles
• S. Lee and K. Daniels (2004, 2005, 2006, 2007): kernel width exploration and fast cluster labeling

Previous Work on Cluster Labeling
• Complete Graph (CG): test all pairs (xi, xj) in X
• Support Vector Graph (SVG): test all pairs (xi, xj) where xi or xj is a SV
• Proximity Graph (PG): test all pairs (xi, xj) where xi and xj are linked in a proximity graph

Gradient Descent (GD)
(Figure: support vectors, non-SV data points, and stable equilibrium points.)

Traditional Sample Points Technique
• CG, SVG, PG, and GD all use this technique: sample points are placed along the line segment between a pair of data points, and the pair is connected only if every sample point y on the segment satisfies R(y) ≤ R.
(Figure: three segment cases, two disconnected and one connected.)

Problems of the Sample Points Technique
• Sampling along the segment can wrongly disconnect points that belong to the same cluster (false negative) or wrongly connect points from different clusters (false positive).
(Figure: false negative and false positive examples with sample points.)

CG Result (C = 1)
(Figure.)

Problems of SVC
• Difficult to find appropriate q and C
  – no guidance for choosing q and C
  – too much trial and error
• Slow cluster labeling
  – O(N^2 Nsv m) time for the CG method, where m is the number of sample points on the line segment connecting a pair of data points
  – the general size of a Delaunay triangulation in d dimensions is $\Theta(N^{\lceil d/2 \rceil})$
• Bad performance in high dimensions
  – as the number of principal components is increased, there is a performance degradation

Our q Exploration
• Lemmas
  – If q = 0, then R^2 = 0.
  – If q = ∞, then β_i = 1/N for all i ∈ {1, ..., N}.
  – If q = ∞, then R^2 = 1 - 1/N.
  – R^2 = 1 iff q = ∞ and N = ∞.
  – If N is finite, then R^2 ≤ 1 - 1/N < 1.
• Theorem
  – Under certain circumstances, R^2 is a monotonically nondecreasing function of q.
  – This motivates a secant-like algorithm for generating the list of q values.

q-list Length Analysis
• Estimated q-list length ≈ $\lg(\max_{i,j} \|x_i - x_j\|^2) - \lg(\min_{i,j} \|x_i - x_j\|^2)$
• The estimate depends only on spatial characteristics of the data set, not on its dimensionality or on the number of data points.
• 89% accuracy w.r.t. the actual q-list length for all datasets considered.

Our Recent q Exploration Work
• The curve of R^2 versus q typically has one critical radius of curvature at q*.
• Approximate q* to yield q̂* (without cluster labeling).
• Use q̂* as the starting q value in the sequence.

q Exploration Results
• 2D: on average, the actual number of q values is
  – 32% of the estimate
  – 22% of the secant length
• Higher dimensions: on average, the actual number is
  – 112% of the estimate
  – 82% of the secant length
(Table of data sets with dimensions 3, 4, 9, 25, and 200.)

2D q Exploration Results
(Figures.)

Higher Dimensional q Exploration Results
(Figures.)

Cone Cluster Labeling (CCL)
• Motivation: avoid line segment sampling.
• Approach: leverage the geometry of the feature space. For the Gaussian kernel $K(x, y) = e^{-q\|x - y\|^2}$:
  – the images of all data points lie on the surface of the unit ball in feature space;
  – a hyper-sphere in data space corresponds to a cone in feature space with its apex at the origin.
(Figures: sample 2D data space; low-dimensional view of the high-dimensional feature space with the unit ball.)

Cone Cluster Labeling
• P: the intersection between the surface of the unit ball and the minimal hyper-sphere in feature space.
• Support vector cone for v_i: the cone with apex at the origin and axis through $\Phi(v_i)$, with base angle θ.
• Covering of P: $P \subseteq \bigcup_{v_i \in V} \mathrm{cone}(v_i)$, where V is the set of support vectors.
• The cone base angles are all equal to θ.
• The cones have a′ in common.
• The Pythagorean theorem holds in feature space.
• To derive the data space hyper-sphere radius, use
  $\cos\theta = \dfrac{\Phi(v_i) \cdot a'}{\|\Phi(v_i)\|\,\|a'\|} \approx \sqrt{1 - R^2}$.
• P′: the mapping of P into the data space. Each support vector v_i corresponds to a support vector hyper-sphere $S_{v_i}$ centered at v_i with radius
  $Z = \sqrt{-\ln(\cos\theta)/q} = \sqrt{-\ln\!\big(\sqrt{1 - R^2}\big)/q}$.
• $\bigcup_{v_i \in V} S_{v_i}$ approximately covers P′.
(Figures: coverings P′ for q = 0.003 and q = 0.137.)

Cone Cluster Labeling Pseudo-Code

ConeClusterLabeling(X, Q, V)
  for each q ∈ Q
    compute Z for q
    AdjacencyMatrix ← ConstructConnectivity(V, Z)
    Labels ← FindConnComponents(AdjacencyMatrix)
    for each x ∈ X where x ∉ V
      idx ← index of the nearest SV to x
      Labels(x) ← Labels(x_idx)
    end for
    print Labels
  end for
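Following the pseudo-code above, here is a hedged sketch of the labeling step; it is not the authors' code. It assumes that ConstructConnectivity joins two support vectors whenever their data-space hyper-spheres of radius Z overlap (||v_i - v_j|| ≤ 2Z), which is one natural reading of the slides, and the helper names are mine. Non-SV points inherit the label of their nearest support vector, as in the pseudo-code.

```python
# Hedged sketch of Cone Cluster Labeling; assumes the overlap test ||v_i - v_j|| <= 2Z.
import numpy as np

def ccl_radius(R2, q):
    """Z = sqrt(-ln(cos(theta)) / q) with cos(theta) = sqrt(1 - R^2)."""
    return np.sqrt(-np.log(np.sqrt(1.0 - R2)) / q)

def cone_cluster_labeling(X, sv_idx, R2, q):
    """Label every point in X given the support vector indices, R^2, and q."""
    V = X[sv_idx]
    Z = ccl_radius(R2, q)
    # Connect support vectors whose hyper-spheres S_{v_i} overlap.
    d = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)
    adjacency = d <= 2.0 * Z
    # Find connected components of the SV adjacency graph (simple traversal).
    sv_labels = np.full(len(V), -1)
    label = 0
    for start in range(len(V)):
        if sv_labels[start] != -1:
            continue
        stack = [start]
        while stack:
            i = stack.pop()
            if sv_labels[i] != -1:
                continue
            sv_labels[i] = label
            stack.extend(np.where(adjacency[i] & (sv_labels == -1))[0].tolist())
        label += 1
    # Non-SV points take the label of their nearest support vector.
    nearest = np.argmin(np.linalg.norm(X[:, None, :] - V[None, :, :], axis=-1), axis=1)
    return sv_labels[nearest]
```

With a single radius Z, the connectivity test reduces to a fixed distance threshold between support vectors, which is why the adjacency construction in the comparison table below costs only O(Nsv^2).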
2D CCL Results (C = 1)
(Figures.)

Sample Higher Dimensional CCL Results in "Heat Map" Form
• N = 12, d = 9: 3 clusters
• N = 30, d = 25: 5 clusters
• N = 205, d = 200: 5 clusters

Comparison – Cluster Labeling Algorithms

| Step | CG | SVG | PG | GD | CCL |
|---|---|---|---|---|---|
| Construct adjacency matrix | O(N^2 Nsv m) | O(N Nsv^2 m) | O(N(log N + Nsv m)) | O(m(N^2 i + Nsv Nsep^2)) | O(Nsv^2) |
| Find connected components | O(N^2) | O(N Nsv) | O(N^2) | O(Nsep^2) | O(Nsv^2) |
| Non-SV labeling | N/A | O((N - Nsv) Nsv) | O((N - Nsv) Nsv) | O(N - Nsep) | O((N - Nsv) Nsv) |
| TOTAL | O(N^2 Nsv m) | O(N Nsv^2 m) | O(N^2 + N Nsv m) | O(m(N^2 i + Nsv Nsep^2)) | O(N Nsv) |

m: the number of sample points; i: the number of iterations for convergence. Times are for a single (q, C) combination.

Comparisons – 2D
(Charts: construct adjacency matrix, find connected components, non-SV labeling, and total time for cluster labeling.)

Comparisons – HD
(Charts: construct adjacency matrix, find connected components, and non-SV labeling.)

Contributions
• Automatically generate Gaussian kernel width values
  – the generated q-list includes appropriate width values for our test data sets
  – some reasonable cluster results are obtained from the q-list
• Faster cluster labeling method
  – faster than the other SVC cluster labeling algorithms
  – good clustering quality

Future Work
"The presence or absence of robust, efficient parallel clustering techniques will determine the success or failure of cluster analysis in large-scale data mining applications in the future." - Jain et al. 1999
Make SVC scalable!

End