Support Vector Machines for Data Fitting and Classification
David R. Musicant, with Olvi L. Mangasarian
UW-Madison Data Mining Institute Annual Review, June 2, 2000

Overview
Regression and its role in data mining
Robust support vector regression
– Our general formulation
Tolerant support vector regression
– Our contributions
– Massive support vector regression
– Integration with data mining tools
Active support vector machines
Other research and future directions

What is regression?
Regression forms a rule for predicting an unknown numerical feature from known ones.
Example: Predicting purchase habits. Can we use...
– age, income, level of education
To predict...
– purchasing patterns?
And simultaneously...
– avoid the "pitfalls" that standard statistical regression falls into?

Regression example
Can we use:
Age   Income         Years of Education   $ spent on software
30    $56,000 / yr   16                   $800
50    $60,000 / yr   12                   $0
16    $2,000 / yr    11                   $200
To predict:
Age   Income         Years of Education   $ spent on software
40    $48,000 / yr   17                   ?
29    $60,000 / yr   18                   ?

Role in data mining
Goal: Find new relationships in data
– e.g. customer behavior, scientific experimentation
Regression explores the importance of each known feature in predicting the unknown one.
– Feature selection
Regression is a form of supervised learning
– Use data where the predictive value is known for given instances to form a rule
Massive datasets
Regression is a fundamental task in data mining.

Part I: Robust Regression
a.k.a. Huber Regression
[Plot: Huber loss curve, quadratic on [-g, g] and linear outside.]

"Standard" Linear Regression
ŷ = Aw + be
[Plot: data points A, fitted line ŷ = Aw + be, intercept b.]
Find w, b such that Aw + be ≈ y.

Optimization problem
Find w, b such that Aw + be ≈ y.
Bound the error by s:  -s ≤ Aw + be - y ≤ s
Minimize the error:
  min Σ sᵢ²   subject to   -s ≤ Aw + be - y ≤ s
Traditional approach: minimize the squared error.

Examining the loss function
Standard regression uses a squared-error loss function.
– Points which are far from the predicted line (outliers) are overemphasized.
[Plot: squared-error loss function.]

Alternative loss function
Instead of the squared error, try the absolute value of the error:
  min Σ |sᵢ|   subject to   -s ≤ Aw + be - y ≤ s
This is called the 1-norm loss function.
[Plot: absolute-value (1-norm) loss function.]

1-Norm Problems and Solution
– Overemphasizes error on points close to the predicted line
Solution: the Huber loss function, a hybrid approach
[Plot: Huber loss, quadratic near zero and linear in the tails.]
Many practitioners prefer the Huber loss function.

Mathematical Formulation
g indicates the switchover from quadratic to linear:
  ρ(t) = t²/2         if |t| ≤ g
  ρ(t) = g|t| - g²/2   if |t| > g
Larger g means "more quadratic."
[Plot: ρ(t) with switchover at ±g.]

Regression Approach Summary
Quadratic loss function
– Standard method in statistics
– Over-emphasizes outliers
Linear loss function (1-norm)
– Formulates well as a linear program
– Over-emphasizes small errors
Huber loss function (hybrid approach)
– Appropriate emphasis on both large and small errors
[Plots: quadratic, 1-norm, and Huber loss functions.]

Previous attempts complicated
Earlier efforts to solve Huber regression:
– Huber: Gauss-Seidel method
– Madsen/Nielsen: Newton method
– Li: Conjugate gradient method
– Smola: Dual quadratic program
Our new approach: a convex quadratic program
  min over w ∈ R^d, z ∈ R^l, t ∈ R^l of   Σ zᵢ²/2 + g Σ |tᵢ|
  subject to   z - t ≤ Aw + be - y ≤ z + t
Our new approach is simpler and faster.
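To make the loss functions above concrete, here is a minimal NumPy sketch (not the authors' quadratic-programming solver) of the Huber loss ρ(t) from the Mathematical Formulation slide, compared against the squared and 1-norm losses; the function name and the example residual values are illustrative only.

```python
# Minimal sketch (not the authors' QP solver): the Huber loss rho(t) from the
# "Mathematical Formulation" slide, compared with squared and 1-norm losses.
import numpy as np

def huber_loss(t, g=1.345):
    """rho(t) = t^2/2 for |t| <= g, and g*|t| - g^2/2 otherwise."""
    t = np.asarray(t, dtype=float)
    quadratic = 0.5 * t**2
    linear = g * np.abs(t) - 0.5 * g**2
    return np.where(np.abs(t) <= g, quadratic, linear)

if __name__ == "__main__":
    residuals = np.array([-3.0, -1.0, 0.0, 0.5, 4.0])  # hypothetical errors Aw + be - y
    print("squared:", 0.5 * residuals**2)
    print("1-norm :", np.abs(residuals))
    print("huber  :", huber_loss(residuals, g=1.345))  # quadratic near 0, linear in the tails
```

Large residuals grow only linearly under ρ, which is why outliers are not over-emphasized, while small residuals keep the quadratic behavior that the pure 1-norm loss lacks.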
Experimental Results: Census20k
20,000 points, 11 features
[Bar chart: time to solution (CPU sec, 0-600) for the Li, Madsen/Nielsen, Huber, Smola, and MM methods at g = 1.345, 1, and 0.1; MM is faster.]

Experimental Results: CPUSmall
8,192 points, 12 features
[Bar chart: time to solution (CPU sec, 0-200) for the Li, Madsen/Nielsen, Huber, Smola, and MM methods at g = 1.345, 1, and 0.1; MM is faster.]

Introduce nonlinear kernel!
Begin with the previous formulation:
  min over w ∈ R^d, z ∈ R^l, t ∈ R^l of   Σ zᵢ²/2 + g Σ |tᵢ|
  subject to   z - t ≤ Aw + be - y ≤ z + t
Substitute w = A'α and minimize over α instead:
  z - t ≤ AA'α + be - y ≤ z + t
Substitute K(A,A') for AA':
  z - t ≤ K(A,A')α + be - y ≤ z + t
A kernel is a nonlinear function.

Nonlinear results
Dataset          Kernel     Training Accuracy   Testing Accuracy
CPUSmall         Linear     94.50%              94.06%
                 Gaussian   97.26%              95.90%
Boston Housing   Linear     85.60%              83.81%
                 Gaussian   92.36%              88.15%
Nonlinear kernels improve accuracy.

Part II: Tolerant Regression
a.k.a. Tolerant Training

Regression Approach Summary (recap)
Quadratic loss: standard in statistics, over-emphasizes outliers.
Linear (1-norm) loss: formulates well as a linear program, over-emphasizes small errors.
Huber loss (hybrid): appropriate emphasis on both large and small errors.

Optimization problem (1-norm)
Find w, b such that Aw + be ≈ y.
Bound the error by s:  -s ≤ Aw + be - y ≤ s
Minimize the error:
  min Σ |sᵢ|   subject to   -s ≤ Aw + be - y ≤ s
Minimize the magnitude of the error.

The overfitting issue
Noisy training data can be fitted "too well"
– leads to poor generalization on future data
[Plot: noisy data A and fitted line ŷ = Aw + be.]
Prefer simpler regressions, i.e. where
– some w coefficients are zero
– the line is "flatter"

Reducing overfitting
To achieve both goals, also minimize the magnitude of the w vector:
  min Σ |wᵢ| + C Σ sᵢ   subject to   -s ≤ Aw + be - y ≤ s
C is a parameter to balance the two goals
– Chosen by experimentation
Reduces overfitting due to points far from the surface.

Overfitting again: "close" points
"Close" points may be wrong due to noise only
– The line should be influenced by "real" data, not noise
[Plot: fitted line ŷ = Aw + be with a tolerance band of width ε around it.]
Ignore errors from those points which are close!

Tolerant regression
Allow an interval of size ε with uniform error:
  min Σ |wᵢ| + C Σ sᵢ   subject to   -s ≤ Aw + be - y ≤ s,   εe ≤ s
How large should ε be?
– As large as possible, while preserving accuracy:
  min Σ |wᵢ| + C Σ sᵢ - Cμε   subject to   -s ≤ Aw + be - y ≤ s,   εe ≤ s
How about a nonlinear surface?

Introduce nonlinear kernel!
Begin with the previous formulation:
  min Σ |wᵢ| + C Σ sᵢ - Cμε   subject to   -s ≤ Aw + be - y ≤ s,   εe ≤ s
Substitute w = A'α and minimize over α instead:
  -s ≤ AA'α + be - y ≤ s
Substitute K(A,A') for AA':
  -s ≤ K(A,A')α + be - y ≤ s
A kernel is a nonlinear function.

Our improvements
This formulation and interpretation is new!
– Improves intuition from prior results
– Uses fewer variables
– Solves faster!
Computational tests run on DMI Locop2
– Dell PowerEdge 6300 server with four gigabytes of memory and 36 gigabytes of disk space
– Windows NT Server 4.0
– CPLEX 6.5 solver
Donated to UW by Microsoft Corporation.

Comparison Results
                                     μ = 0    μ = 0.1   ...   Total   Time Improvement
Census           Tuning set error    5.10%    4.74%
                 ε                   0.00     0.02
                 SSR time (sec)      980      935             5086    Max 79.7%
                 MM time (sec)       199      294             3765    Avg 26.0%
CompActiv        Tuning set error    6.60%    6.32%
                 ε                   0.00     3.09
                 SSR time (sec)      1364     1286            7604    Max 65.7%
                 MM time (sec)       468      660             6533    Avg 14.1%
Boston Housing   Tuning set error    14.69%   14.62%
                 ε                   0.00     0.42
                 SSR time (sec)      36       34              170     Max 52.0%
                 MM time (sec)       17       23              140     Avg 17.6%
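The "Introduce nonlinear kernel!" slides above substitute K(A,A') for AA'. Here is a minimal NumPy sketch (assumed names and placeholder values, not the authors' code) of building a Gaussian kernel matrix and forming the nonlinear prediction ŷ = K(B, A')α + be; in practice α and b would come from solving the tolerant-regression linear program, and the kernel width gamma is illustrative.

```python
# Minimal sketch: Gaussian kernel K and the nonlinear prediction
# y_hat = K(B_new, A_train') @ alpha + b. alpha, b, and gamma are placeholders.
import numpy as np

def gaussian_kernel(A, B, gamma=0.1):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * sq_dists)

def predict(A_train, alpha, b, B_new, gamma=0.1):
    """y_hat = K(B_new, A_train') alpha + b e."""
    return gaussian_kernel(B_new, A_train, gamma) @ alpha + b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 11))   # 100 training points, 11 features
    alpha = rng.normal(size=100)     # placeholder dual coefficients
    b = 0.5                          # placeholder offset
    B = rng.normal(size=(5, 11))     # new points to predict
    print(predict(A, alpha, b, B, gamma=0.1))
```

Because K(A,A') is an m × m matrix, forming it for tens of thousands of points is exactly what makes the nonlinear problem O(m²), which motivates the scaling discussion that follows.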
Problem size concerns
How does the problem scale?
– m = number of points
– n = number of features
For a linear kernel the problem size is O(mn):
  -s ≤ Aw + be - y ≤ s
For a nonlinear kernel the problem size is O(m²):
  -s ≤ K(A,A')α + be - y ≤ s
Thousands of data points ==> massive problem!
Need an algorithm that will scale well.

Chunking approach
Idea: Use a chunking method
– Bring as much into memory as possible
– Solve this subset of the problem
– Retain the solution and integrate it into the next subset
Explored in depth by Paul Bradley and O.L. Mangasarian for linear kernels.
Solve in pieces, one chunk at a time.

Row-Column Chunking
Why column chunking also?
– If a nonlinear kernel is used, chunks are very wide.
– A wide chunk must have a small number of rows to fit in memory.
Both these chunks use the same memory!

Chunking Experimental Results
Dataset: 16,000-point subset of Census in R^11 + noise
Kernel: Gaussian radial basis kernel
LP size: 32,000 nonsparse rows and columns
Problem size: 1.024 billion nonzero values
Time to termination: 18.8 days
Number of SVs: 1,621 support vectors
Solution variables: 33 nonzero components
Final tuning set error: 9.8%
Tuning set error on first chunk (1,000 points): 16.2%

Objective Value & Tuning Set Error for Billion-Element Matrix
[Plots: objective value and tuning set error versus row-column chunk iteration number (0-2000) and time in days (0-18); the tuning set error falls from about 16% to under 10%.]
Given enough time, we find the right answer!

Integration into data mining tools
Method runs as a stand-alone application, with data resident on disk.
With minimal effort, it could sit on top of an RDBMS to manage data input/output
– Queries select a subset of data - easily SQLable
Database queries occur "infrequently"
– Data mining can be performed on a different machine from the one maintaining the DBMS
Licensing of a linear program solver is necessary.
The algorithm can integrate with data mining tools.

Part III: Active Support Vector Machines
a.k.a. ASVM

The Classification Problem
Bounding planes: x'w = γ + 1 (for the class A+) and x'w = γ - 1 (for the class A-)
Separating surface: x'w = γ, with normal w
Find the surface that best separates the two classes.

Active Support Vector Machine
Features
– Solves classification problems
– No special software tools necessary! No LP or QP!
– FAST.
– Works on very large problems.
Web page: www.cs.wisc.edu/~musicant/asvm
• Available for download and can be integrated into data mining tools
• MATLAB integration already provided

# of points   Features   Iterations   Time (CPU min)
4 million     32         5            38.04
7 million     32         5            95.57

Summary and Future Work
Summary
– Robust regression can be modeled simply and efficiently as a quadratic program
– Tolerant regression can be used to solve massive regression problems
– ASVM can solve massive classification problems quickly
Future work
– Parallel approaches
– Distributed approaches
– ASVM for various types of regression

Questions?
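As a closing illustration of the classification setup in Part III, here is a minimal NumPy sketch (placeholder values, not the ASVM solver itself) of applying a learned separating surface x'w = γ to new points; w and γ are stand-ins for the quantities ASVM would compute.

```python
# Minimal sketch: classifying points with the separating surface x'w = gamma
# from Part III. w and gamma are placeholders, not an ASVM solution.
import numpy as np

def classify(X, w, gamma):
    """Return +1 for points with x'w >= gamma, -1 otherwise."""
    return np.where(X @ w >= gamma, 1, -1)

if __name__ == "__main__":
    w = np.array([1.0, -2.0])      # placeholder normal vector
    gamma = 0.5                    # placeholder threshold
    X = np.array([[2.0, 0.0],      # x'w =  2.0 -> +1
                  [0.0, 1.0]])     # x'w = -2.0 -> -1
    print(classify(X, w, gamma))   # [ 1 -1]
```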