Support Vector Machines: Hype or Hallelujah?
Kristin Bennett
Mathematical Sciences Department, Rensselaer Polytechnic Institute
http://www.rpi.edu/~bennek
M2000, October 2-4, 2000

Outline
- Support Vector Machines for Classification
  - Linear Discrimination
  - Nonlinear Discrimination
- Extensions
- Application in Drug Design
- Hallelujah
- Hype

Support Vector Machines (SVM)
A methodology for inference based on Vapnik's Statistical Learning Theory.
Key ideas:
- "Maximize Margins"
- "Do the Dual"
- "Construct Kernels"

Best Linear Separator?
(Figures: a sequence of slides showing the same two-class data set with several candidate separating planes, asking which is best.)

Find Closest Points in Convex Hulls
(Figure: the convex hulls of the two classes, with closest points c and d.)

Plane Bisects Closest Points
(Figure: the separating plane $x \cdot w = b$ that bisects the segment joining the closest points, with normal $w = c - d$.)

Find Closest Points Using a Quadratic Program
$$\min_{\alpha}\ \tfrac{1}{2}\,\|c-d\|^{2}\quad\text{s.t.}\quad c=\sum_{i\in\text{Class }1}\alpha_i x_i,\quad d=\sum_{i\in\text{Class }-1}\alpha_i x_i,$$
$$\sum_{i\in\text{Class }1}\alpha_i=1,\quad\sum_{i\in\text{Class }-1}\alpha_i=1,\quad\alpha_i\ge 0,\ \ i=1,\dots,\ell.$$
Many existing and new solvers.

Best Linear Separator: Supporting Plane Method
Maximize the distance between two parallel supporting planes
$$x\cdot w=b+1\qquad\text{and}\qquad x\cdot w=b-1.$$
Distance = "margin" = $2/\|w\|$.

Maximize the Margin Using a Quadratic Program
$$\min_{w,b}\ \tfrac{1}{2}\|w\|^{2}\quad\text{s.t.}\quad x_i\cdot w\ge b+1\ \ (i\in\text{Class }1),\qquad x_i\cdot w\le b-1\ \ (i\in\text{Class }-1).$$

Dual of the Closest Points Method Is the Supporting Plane Method
With labels $y_i=+1$ for $i\in\text{Class }1$ and $y_i=-1$ for $i\in\text{Class }-1$, the closest-points program
$$\min_{\alpha}\ \tfrac{1}{2}\Big\|\sum_{i=1}^{\ell}y_i\alpha_i x_i\Big\|^{2}\quad\text{s.t.}\quad\sum_{i\in\text{Class }1}\alpha_i=1,\ \sum_{i\in\text{Class }-1}\alpha_i=1,\ \alpha_i\ge 0$$
is dual to the supporting plane program
$$\min_{w,b}\ \tfrac{1}{2}\|w\|^{2}\quad\text{s.t.}\quad y_i(x_i\cdot w-b)-1\ge 0,\ \ i=1,\dots,\ell.$$
The solution depends only on the support vectors: $w=\sum_{i=1}^{\ell}y_i\alpha_i x_i$.

Statistical Learning Theory
- Misclassification error and function complexity together bound the generalization error.
- Maximizing margins minimizes complexity and thus "eliminates" overfitting.
- The solution depends only on the support vectors, not on the number of attributes.

Margins and Complexity
(Figures: a skinny margin is more flexible and thus more complex; a fat margin is less complex.)

Linearly Inseparable Case
(Figure: the convex hulls intersect, so the same argument won't work.)

Reduced Convex Hulls Don't Intersect
Shrink each hull by adding an upper bound $D$ on the multipliers:
$$d=\sum_{i\in\text{Class }1}\alpha_i x_i,\qquad\sum_{i\in\text{Class }1}\alpha_i=1,\qquad 0\le\alpha_i\le D,$$
e.g., $D=\tfrac{1}{2}$.

Find the Closest Points, Then Bisect
$$\min_{\alpha}\ \tfrac{1}{2}\Big\|\sum_{i\in\text{Class }1}\alpha_i x_i-\sum_{i\in\text{Class }-1}\alpha_i x_i\Big\|^{2}\quad\text{s.t.}\quad\sum_{i\in\text{Class }1}\alpha_i=1,\ \sum_{i\in\text{Class }-1}\alpha_i=1,\ 0\le\alpha_i\le D.$$
No change except for $D$, and $D$ determines the number of support vectors.

Linearly Inseparable Case: Supporting Plane Method
Just add a non-negative error vector $z$:
$$\min_{w,b,z}\ \tfrac{1}{2}\|w\|^{2}+C\sum_{i=1}^{\ell}z_i\quad\text{s.t.}\quad y_i(x_i\cdot w-b)+z_i\ge 1,\ z_i\ge 0,\ \ i=1,\dots,\ell.$$

Dual of the Closest Points Method Is the Supporting Plane Method (Inseparable Case)
$$\min_{\alpha}\ \tfrac{1}{2}\Big\|\sum_{i=1}^{\ell}y_i\alpha_i x_i\Big\|^{2}\quad\text{s.t.}\quad\sum_{i\in\text{Class }1}\alpha_i=1,\ \sum_{i\in\text{Class }-1}\alpha_i=1,\ 0\le\alpha_i\le D$$
is dual to
$$\min_{w,b,z}\ \tfrac{1}{2}\|w\|^{2}+C\sum_{i=1}^{\ell}z_i\quad\text{s.t.}\quad y_i(x_i\cdot w-b)+z_i\ge 1,\ z_i\ge 0,\ \ i=1,\dots,\ell.$$
Again the solution depends only on the support vectors: $w=\sum_{i=1}^{\ell}y_i\alpha_i x_i$.

Nonlinear Classification
For $x=(a,b)$, the linear discriminant uses $x\cdot w=w_1 a+w_2 b$.
Mapping with $\Phi(x)=(a,\ b,\ ab,\ a^{2},\ b^{2})$ gives $\Phi(x)\cdot w=w_1 a+w_2 b+w_3 ab+w_4 a^{2}+w_5 b^{2}$.

Nonlinear Classification: Map to a Higher-Dimensional Space
IDEA: Map each point to a higher-dimensional feature space and construct a linear discriminant in that space.
Define $\Phi:x\mapsto\Phi(x)$, $\mathbb{R}^{n}\to\mathbb{R}^{n'}$, $n'\gg n$. The dual SVM becomes:
$$\min_{\alpha}\ \tfrac{1}{2}\sum_{i=1}^{\ell}\sum_{j=1}^{\ell}y_i y_j\alpha_i\alpha_j\,\Phi(x_i)\cdot\Phi(x_j)-\sum_{i=1}^{\ell}\alpha_i\quad\text{s.t.}\quad\sum_{i=1}^{\ell}y_i\alpha_i=0,\ \alpha_i\ge 0,\ \ i=1,\dots,\ell.$$
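The feature-map idea above can be checked numerically. Below is a minimal sketch (my illustration, not from the talk; NumPy and the random test vectors are assumptions) using a slightly reweighted quadratic map: for it, the feature-space inner product equals the degree-2 polynomial kernel $(u\cdot v+1)^{2}$ introduced on the next slide, so the dual QP never needs $\Phi$ explicitly.

```python
import numpy as np

def phi(x):
    # Weighted quadratic feature map for 2-D input x = (a, b).
    # Unlike the slide's unweighted map (a, b, ab, a^2, b^2), the
    # constant 1 and the sqrt(2) factors make the kernel identity exact.
    a, b = x
    return np.array([1.0, np.sqrt(2) * a, np.sqrt(2) * b,
                     np.sqrt(2) * a * b, a ** 2, b ** 2])

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)

explicit = phi(u) @ phi(v)     # inner product after mapping to R^6
implicit = (u @ v + 1) ** 2    # kernel evaluation, no mapping needed
assert np.isclose(explicit, implicit)

# The dual QP touches the data only through these inner products, so the
# l-by-l Gram matrix K[i, j] = K(x_i, x_j) is all a solver ever needs.
```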
Generalized Inner Products via Hilbert-Schmidt Kernels (Courant and Hilbert, 1953)
$$\Phi(u)\cdot\Phi(v)=K(u,v)\ \text{ for certain }\Phi\text{ and }K,$$
e.g.:
- Degree-$d$ polynomial: $K(u,v)=(u\cdot v+1)^{d}$
- Radial basis function machine: $K(u,v)=\exp\!\big(-\|u-v\|^{2}/(2\sigma^{2})\big)$
- Two-layer neural network: $K(u,v)=\operatorname{sigmoid}\big(\eta\,(u\cdot v)+c\big)$

Final Classification via Kernels
The dual SVM becomes:
$$\min_{\alpha}\ \tfrac{1}{2}\sum_{i=1}^{\ell}\sum_{j=1}^{\ell}y_i y_j\alpha_i\alpha_j K(x_i,x_j)-\sum_{i=1}^{\ell}\alpha_i\quad\text{s.t.}\quad\sum_{i=1}^{\ell}y_i\alpha_i=0,\ \alpha_i\ge 0,\ \ i=1,\dots,\ell.$$

Final SVM Algorithm
- Solve the dual SVM QP.
- Recover the primal variable $b$.
- Classify a new $x$:
$$f(x)=\operatorname{sign}\Big(\sum_{i:\,\alpha_i>0}\alpha_i y_i K(x,x_i)-b\Big).$$
The solution depends only on the support vectors.

Support Vector Machines (SVM)
Key formulation ideas:
- "Maximize Margins"
- "Do the Dual"
- "Construct Kernels"
Plus generalization error bounds and practical algorithms.

SVM Extensions
- Regression
- Variable Selection
- Boosting
- Density Estimation
- Unsupervised Learning
  - Novelty/Outlier Detection
  - Feature Detection
  - Clustering

Example in Drug Design
- Goal: predict the bio-reactivity of molecules in order to decrease drug development time.
- Target: predict the logarithm of the inhibition concentration for site "A" on the cholecystokinin (CCK) molecule.
- Constructs a quantitative structure-activity relationship (QSAR) model.

SVM Regression: $\varepsilon$-Insensitive Loss Function
(Figure: the regression estimate $x\cdot w-b$ with a tube of half-width $\varepsilon$ around it; only points outside the tube, where $|(x\cdot w-b)-y|>\varepsilon$, incur loss.)

SVM Minimizes Underestimate + Overestimate
$$\min_{w,b,z,z^{*}}\ \|w\|_{1}+C\sum_{i=1}^{\ell}\big(z_i+z_i^{*}\big)$$
$$\text{s.t.}\quad\sum_{j=1}^{\ell}w_j K(x_i,x_j)-b-y_i\le z_i,\qquad y_i-\sum_{j=1}^{\ell}w_j K(x_i,x_j)+b\le z_i^{*},\qquad z_i,z_i^{*}\ge 0,\ \ i=1,\dots,\ell.$$

LCCKA Problem
- Training data: 66 molecules.
- 323 original attributes, which are wavelet coefficients of TAE descriptors.
- A 39-attribute subset selected by a linear 1-norm SVM (with no kernels).
- For details, see the DDASSL project link off of http://www.rpi.edu/~bennek.
- Test-set results reported.

LCCKA Prediction
(Figure: scatter plot of predicted versus true values for the LCCKA test set estimates; $Q^{2}=0.25$.)

Many Other Applications
- Speech Recognition
- Database Marketing
- Quark Flavors in High-Energy Physics
- Dynamic Object Recognition
- Knock Detection in Engines
- Protein Sequence Problem
- Text Categorization
- Breast Cancer Diagnosis
See: http://www.clopinet.com/isabelle/Projects/SVM/applist.html

Hallelujah!
- Generalization theory and practice meet.
- A general methodology for many types of problems.
- Same program + new kernel = new method.
- No problems with local minima.
- Few model parameters; selects capacity.
- Robust optimization methods.
- Successful applications.
BUT…

Hype?
- Will SVMs beat my best hand-tuned method Z for problem X?
- Do SVMs scale to massive datasets?
- How to choose $C$ and the kernel?
- What is the effect of attribute scaling?
- How to handle categorical variables?
- How to incorporate domain knowledge?
- How to interpret the results?

Support Vector Machine Resources
- http://www.support-vector.net/
- http://www.kernel-machines.org/
- Links off my web page: http://www.rpi.edu/~bennek
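To close the loop on "Same program + new kernel = new method", here is a minimal sketch using scikit-learn's SVC (a modern library, not part of the original talk); the toy dataset, $C$, and kernel parameters are arbitrary choices for illustration.

```python
from sklearn import svm
from sklearn.datasets import make_moons

# Hypothetical two-class toy data, just to exercise the pipeline.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# One program, three kernels: linear, (u.v + 1)^2, and RBF.
for params in ({"kernel": "linear"},
               {"kernel": "poly", "degree": 2, "gamma": 1.0, "coef0": 1.0},
               {"kernel": "rbf", "gamma": 0.5}):
    clf = svm.SVC(C=1.0, **params).fit(X, y)
    # The fitted model keeps only the support vectors; its decision rule
    # is sign(sum_i alpha_i y_i K(x, x_i) + b), as on the "Final SVM
    # Algorithm" slide (scikit-learn's sign convention for b differs).
    print(params, "support vectors:", int(clf.n_support_.sum()),
          "train accuracy:", clf.score(X, y))
```

Swapping the kernel changes the method (linear, polynomial, or RBF machine) while the optimization program stays the same.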