Data Mining and Machine Learning via Support Vector Machines
Dave Musicant
[Figure: graphic generated with Lucent Technologies Demonstration 2-D Pattern Recognition Applet at http://svm.research.bell-labs.com/SVT/SVMsvt.html]

Outline
The Supervised Learning Classification Problem
The Support Vector Machine for Classification (linear approaches)
Nonlinear SVM approaches
Active learning techniques for SVMs
Iterative algorithms for solving SVMs
SVM Regression
Wrapup

Basic Definitions
Data Mining
  – “nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” -- Usama Fayyad
  – Utilizes techniques from machine learning, databases, and statistics
Machine Learning
  – “concerned with the question of how to construct computer programs that automatically improve with experience.” -- Tom Mitchell
  – Fits under Artificial Intelligence umbrella

Supervised Learning Classification
Example: Cancer diagnosis

Training Set:
Patient ID   # of Tumors   Avg Area   Avg Density   Diagnosis
1            5             20         118           Malignant
2            3             15         130           Benign
3            7             10         52            Benign
4            2             30         100           Malignant

Use this training set to learn how to classify patients where diagnosis is not known:

Test Set:
Patient ID   # of Tumors   Avg Area   Avg Density   Diagnosis
101          4             16         95            ?
102          9             22         125           ?
103          1             14         80            ?

The attribute columns (# of Tumors, Avg Area, Avg Density) are the Input Data; the Diagnosis column is the Classification.
The input data is often easily obtained, whereas the classification is not.

Classification Problem
Goal: Use training set + some learning method to produce a predictive model.
Use this predictive model to classify new data.
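The train-then-classify workflow described in the goal can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the talk: it reuses the toy patient tables from the cancer diagnosis example and plugs in a 1-nearest-neighbor rule as a stand-in for "some learning method". The names `train`, `predict`, and `squared_distance` are hypothetical.

```python
# Toy training set from the cancer-diagnosis example:
# (# of tumors, avg area, avg density) -> diagnosis
TRAINING_SET = [
    ((5, 20, 118), "Malignant"),
    ((3, 15, 130), "Benign"),
    ((7, 10, 52), "Benign"),
    ((2, 30, 100), "Malignant"),
]

# Test set: patients whose diagnosis is not known.
TEST_SET = {101: (4, 16, 95), 102: (9, 22, 125), 103: (1, 14, 80)}


def squared_distance(p, q):
    """Squared Euclidean distance between two attribute tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def train(training_set):
    """'Learn' a predictive model.

    Assumption: 1-nearest-neighbor is used as a stand-in learner,
    so the model is simply the stored training examples.
    """
    return list(training_set)


def predict(model, point):
    """Classify a point with the label of its closest training example."""
    _, label = min(model, key=lambda example: squared_distance(example[0], point))
    return label


model = train(TRAINING_SET)
predictions = {pid: predict(model, x) for pid, x in TEST_SET.items()}
```

Swapping in a different learner only changes `train` and `predict`; the training set, test set, and overall flow stay the same. The raw attributes are used unscaled here, which a real application would likely want to revisit.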
Sample applications:

Application                     Input Data                Classification
Medical Diagnosis               Noninvasive tests         Results from invasive measurements
Optical Character Recognition   Scanned bitmaps           Letter A-Z
Protein Folding                 Amino acid construction   Protein shape (helices, loops, sheets)
Research Paper Acceptance       Words in paper title      Paper accepted or rejected

Application: Breast Cancer Diagnosis
Research by Mangasarian, Street, Wolberg

Breast Cancer Diagnosis Separation
Research by Mangasarian, Street, Wolberg

Application: Document Classification
The Federalist Papers
  – Written in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade residents of the State of New York to ratify the U.S. Constitution
  – All written under the pseudonym “Publius”
Who wrote which of them?
  – Hamilton wrote 56 papers
  – Madison wrote 50 papers
  – 12 disputed papers, generally understood to be written by Hamilton or Madison, but not known which
Research by Bosch, Smith

Federalist Papers Classification
Graphic by Fung
Research by Bosch, Smith

Application: Face Detection
Training data is a collection of Faces and NonFaces
Rotation and Mirroring added in to provide robustness
Image obtained from work by Osuna, Freund, and Girosi at http://www.ai.mit.edu/projects/cbcl/res-area/object-detection/face-detection.html

Face Detection Results
Image obtained from "Support Vector Machines: Training and Applications" by Osuna, Freund, and Girosi.

Face Detection Results
Image obtained from work by Osuna, Freund, and Girosi at http://www.ai.mit.edu/projects/cbcl/res-area/object-detection/face-detection.html

Simple Linear Perceptron
[Figure: training points labeled Class -1 and Class 1]
Goal: Find the best line (or hyperplane) to separate the training data.
How to formalize?
  – In two dimensions, the equation of the line is given by: [equation not preserved]
  – Better notation for n dimensions: treat each data point and the coefficients as vectors. Then the equation is given by: [equation not preserved]

Simple Linear Perceptron (cont.)
The Simple Linear Perceptron is a classifier as shown in the picture
  – Points that fall on the right are classified as “1”
  – Points that fall on the left are classified as “-1”
Therefore: using the training set, find a hyperplane (line) so that [equation not preserved]
This is a good starting point. But we can do better!
[Figure: Class -1, Class 1]

Finding the Best Plane
Not all planes are equal. Which of the two planes shown is better?
Both planes accurately classify the training set.
The solid green plane is the better choice, since it is more likely to do well on future test data.
The solid green plane is further away from the data.

Separating the planes
Construct the bounding planes:
  – Draw two planes parallel to the classification plane.
  – Push them as far apart as possible, until they hit data points.
  – The classification plane with bounding planes furthest apart is the best one.
[Figure: Class -1, Class 1]

Recap: Finding the Best Plane
Details
  – All points in class 1 should be to the right of bounding plane 1.
  – All points in class -1 should be to the left of bounding plane -1.
  – Pick y_i to be +1 or -1 depending on the classification. Then the above two inequalities can be written as one: [equation not preserved]
  – The distance between bounding planes should be maximized.
  – The distance between bounding planes is given by: [equation not preserved]
[Figure: Class -1, Class 1]

The Optimization Problem
The previous slide can be rewritten as: [equation not preserved]
This is a mathematical program.
  – Optimization problem subject to constraints
  – More specifically, this is a quadratic program
  – There are high-powered software tools for solving this kind of problem (both commercial and academic)
  – These general-purpose tools are slow for this particular problem

Data Which is Not Linearly Separable
What if a separating plane does not exist?
[Figure label: error]
Find the plane that maximizes the margin and minimizes the errors on the training points.
Take original inequality and add a slack variable to measure error: [equation not preserved]

The Support Vector Machine
Push the planes apart and minimize the error at the same time: [equation not preserved]
C is a positive number that is chosen to balance these two goals.
This problem is called a Support Vector Machine, or SVM.

Terminology
Those points that touch the bounding plane, or lie on the wrong side, are called support vectors.
If all the data points except the support vectors were removed, the solution would turn out the same.
The SVM is mathematically equivalent to force and torque equilibrium (hence the name support vectors).

Example from Carleton College
1850 students
4 year undergraduate liberal arts college
Ranked 5th in the nation by US News and World Report
15-20 computer science majors per year
All research assistants are full-time undergraduates

Student Research Example
Goal: automatically generate “frequently asked questions” list from discussion groups
Subgoal #1: Given a corpus of discussion group postings, identify those messages that contain questions
  – Recruit student volunteers to identify questions
  – Learn classification
Work by students Sarah Allen, Janet Campbell, Ester Gubbrud, Rachel Kirby, Lillie Kittredge

Building A Training Set
Which sentences are questions in the following text?
From: [email protected] (Wonko the Sane)
I was recently talking to a possible employer ( mine! :-) ) and he made a reference to a 48-bit graphics computer/image processing system. I seem to remember it being called IMAGE or something akin to that. Anyway, he claimed it had 48-bit color + a 12-bit alpha channel. That's 60 bits of info--what could that possibly be for? Specifically the 48-bit color? That's 280 trillion colors, many more than the human eye can resolve. Is this an anti-aliasing thing? Or is this just some magic number to make it work better with a certain processor.

Representing the training set
Each document is a point
Each potential word is a column (bag of words)

Document ID   aardvark   bit   i   Question?
1             0          4     2   Y
...

Other pre-processing tricks
  – Remove punctuation
  – Remove "stop words" such as "is", "a", etc.
  – Use stemming to remove "ing" and "ed", etc. from similar words

Results
If you just guess brain-dead: "every message contains a question", get 55% right
If you use a Support Vector Machine, get 66.5% of them right
What words do you think were strong indicators of questions?
  – anyone, does, any, what, thanks, how, help, know, there, do, question
What words do you think were strong contraindicators of questions?
  – re, sale, m, references, not, your

Beyond lines
Some datasets may not be best separated by a plane.
SVMs can be extended to nonlinear surfaces also.
[Figure: generated with Lucent Technologies Demonstration 2-D Pattern Recognition Applet at http://svm.research.bell-labs.com/SVT/SVMsvt.html]

Finding nonlinear surfaces
How to modify algorithm to find nonlinear surfaces?
First idea (simple and effective): map each data point into a higher dimensional space, and find a linear fit there
Example: Find a quadratic surface for

x1   x2   x3
3    5    7
4    6    2

Use new coordinates in regular linear SVM

z1 = x1^2   z2 = x2^2   z3 = x3^2   z4 = x1 x2   z5 = x1 x3   z6 = x2 x3   z7 = x1   z8 = x2   z9 = x3
9           25          49          15           21           35           3         5         7
16          36          4           24           8            12           4         6         2

A plane in this quadratic space is equivalent to a quadratic surface in our original space

Problems with this method
If dimensionality of space is high, lots of calculations
  – For a high polynomial space, the number of coordinate combinations explodes
  – Need to do all these calculations for all training points, and for each testing point
  – Infinite dimensional spaces impossible
Nonlinear surfaces can be used without these problems through the use of a kernel function.

The Dual Problem
The dual SVM is an alternative approach.
  – Wrap a “string” around the data points of each class.
  – Find the two points, one on each “string”, which are closest together. Connect the dots.
  – The perpendicular bisector to this connection is the best classification plane.
[Figure: Class -1, Class 1]

The Dual Variable, or “Importance”
Every point on the “string” is a linear combination of the points inside the string.
[Figure: points x1, x2, x3]
In general: [equation not preserved]
a’s are referred to as dual variables, and represent the “importance” of each data point.

Two Equivalent Approaches
Primal Problem:
  – Find best separating plane
  – Variables: w, b
Dual Problem:
  – Find closest points on “strings”
  – Variables: a
Both problems yield the same classification plane.
  – w, b can be expressed in terms of a
  – a can be expressed in terms of w, b
[Figure: Class -1, Class 1 (one panel for each problem)]

How to generalize nonlinear fits
Traditional SVM: [equation not preserved]
Dual formulation: [equation not preserved]
Can find w and b in terms of a.
But note: don't need any x_i individually, just scalar products between points.
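The remark that only scalar products between points are needed can be checked numerically. The sketch below is an illustration under assumptions, not taken from the talk: it maps the two example points (3, 5, 7) and (4, 6, 2) into a quadratic space and compares the scalar product computed there with one computed directly from the original coordinates. The map uses the same kinds of monomials as z1..z9 but adds sqrt(2) scalings and a constant coordinate, an assumption made here so that the scalar product has the closed form (x·y + 1)^2. The names `quadratic_features` and `quadratic_kernel` are hypothetical.

```python
import math


def quadratic_features(x):
    """Map a 3-D point to scaled quadratic coordinates.

    Same monomials as the z1..z9 example (squares, cross terms, originals),
    plus sqrt(2) scalings and a constant 1. The scaling is an assumption
    chosen so the scalar product below has a closed form.
    """
    x1, x2, x3 = x
    r = math.sqrt(2.0)
    return [
        x1 * x1, x2 * x2, x3 * x3,
        r * x1 * x2, r * x1 * x3, r * x2 * x3,
        r * x1, r * x2, r * x3,
        1.0,
    ]


def dot(u, v):
    """Ordinary scalar product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def quadratic_kernel(x, y):
    """Scalar product in the mapped space, computed from the original points."""
    return (dot(x, y) + 1.0) ** 2


x = (3.0, 5.0, 7.0)
y = (4.0, 6.0, 2.0)

explicit = dot(quadratic_features(x), quadratic_features(y))  # 10 coordinates each
shortcut = quadratic_kernel(x, y)                             # 3 coordinates each
print(explicit, shortcut)  # both are 3249 up to floating-point rounding
```

The shortcut never builds the higher dimensional coordinates, so its cost depends on the original dimension rather than on the size of the mapped space.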
Kernel function
Dual formulation again: [equation not preserved]
Substitute scalar product with kernel function: [equation not preserved]
Using a kernel corresponds to having mapped the data into some high dimensional space, possibly an infinite one.

Traditional kernels
Linear
Polynomial
Gaussian

Another interpretation
Kernels can be thought of as a distance metric.
Linear SVM: determine class by sign of [equation not preserved]
Nonlinear SVM: determine class by sign of [equation not preserved]
Those support vectors that x is "closest to" influence its class selection.

Example: Checkerboard
[Figure not preserved]

k-Nearest Neighbor Algorithm
[Figure not preserved]

SVM on Checkerboard
[Figure not preserved]

Active Learning with SVMs
Given a set of unlabeled points that I can label at will, how do I choose which one to label next?
Common answer: choose a point that is on or close to the current separating hyperplane (Campbell, Cristianini, Smola; Tong & Koller; Schohn & Cohn)
Why?

On the hyperplane: Spin 1
Assume data is linearly separable.
A point which is on the hyperplane (or at least in the margin) is guaranteed to change the results. (Schohn & Cohn)

On the hyperplane: Spin 2
Intuition suggests that one should grab the point that is most wrong
Problem: don't know the class of the point yet
If you grab a point that is far from the hyperplane, and it is classified wrong, this would be wonderful
But: points which are far from the hyperplane are the ones which are most likely to be correctly classified (Campbell, Cristianini, Smola)

Active Learning in Batches
What if you want to choose a number of points to label at once?
(Brinker)
  – Could choose the n closest points to the hyperplane, but this is not optimal

Heuristic approach instead
Assumption: all hyperplanes go through origin
  – authors claim that this can be compensated for with appropriate choice of kernel
To have maximal effect on direction of hyperplane, choose points with largest angle

Defining angle
Let Φ = mapping to feature space
Angle between points x and y: [equation not preserved]

Approach for maximizing angle
Introduce artificial point normal to existing hyperplane.
Choose next point to be one that maximizes angle with this one.
Choose each successive point to be the one that maximizes the minimum angle to the previous points (i.e., minimizes the maximum cosine value)

What happened to distance?
In practice, use both measures:
  – want points closest to plane
  – want points with largest angular separation from others
Iterative greedy algorithm:
  value = λ * distance to hyperplane + (1 - λ) * (largest cosine measure to an already existing point)
Choose the next point to be the one that minimizes this value
Paper has results: fairly robust to varying λ

Iterative Algorithms
Maintain the “importance,” or dual variable, associated with all data points.
  – This is small, since it is a single dimensional array of size m.
Algorithm
  – Look at each point sequentially.
  – Update its importance. (How?)
  – Repeat until no further improvements in goal.

Importance a   Data Points
               Attribute 1   Attribute 2   Attribute 3
?              5             20            118
?              3             15            130
?              7             10            52
?              2             30            100
[Figure: Class -1, Class 1]

Iterative Framework
LSVM, ASVM, SOR, etc. are iterative algorithms on the dual variables.
Algorithm: (Assume that we have m data points.)

    for (i = 0; i < m; i++)
        a[i] = 0;  // Initialize dual variables
    while (distance between strings continues to shorten)
        for (i = 0; i < m; i++) {
            Update a[i] according to the update rule (not shown here).
        }

Bottleneck: Repeated scans through the dataset.
  – Many of these data points are unimportant

Iterative Framework (Optimized)
Optimization: Apply algorithm only to active points, i.e. those points that appear to be support vectors, as long as progress is being made.
Optimized Algorithm:

    while (strings continue to shorten) {
        run the unoptimized algorithm for one iteration
        while (strings continue to shorten)
            for (all i corresponding to active points) {
                Update a[i].
                If a[i] > 0, keep this data point active. Otherwise, remove it.
            }
    }

This results in more loops, but the inner loops are so much faster that it pays off significantly.

Regression
Support vector machines can also be used to solve regression problems.

The Regression Problem
“Close points” may be wrong due to noise only
  – Line should be influenced by “real” data, not noise
  – Ignore errors from those points which are close!

Support Vector Regression
Traditional support vector regression:
  – Minimize the error made outside of the tube
  – Regularize the fitted plane by minimizing the norm of w
  – The parameter C balances two competing goals

My current research
Collaborating with:
  – Deborah Gross, Carleton College (chemistry)
  – Raghu Ramakrishnan, UW-Madison (computer sciences)
  – Jamie Schauer, UW-Madison (atmospheric sciences)
Analyzing data from Aerosol Time-of-Flight Mass Spectrometer (ATOFMS)
  – Aerosol: "small particle of gunk in air"
Questions we want to answer:
  – How can we classify safe vs. dangerous?
  – Can we determine when a sudden change in the air stream has happened?
  – Can we identify what substances are present in a particular particle?

Questions?