SUPPORT VECTOR MACHINES
Class 23
CSC 600: Data Mining
Today…

Support Vector Machines
Multiclass Problems

Feature Selection

Support Vector Machines (SVM)

What are they?
• Developed in the 1990s in the computer science community
• Very popular

Performance:
• Often considered one of the best “out of the box” classifiers
• Applications: handwritten digit recognition, text categorization

Support Vector Machines (SVM)

Comparing to other statistical learning methods:
• SVMs work well with high-dimensional data

Unique:
• Represents the “decision boundary” using a subset of the training examples
Terminology
1. Maximal Margin Classifier
2. Support Vector Classifier
3. Support Vector Machine
Often all three are referred to as “Support Vector Machine”.
The Path Ahead
1. Maximal Margin Classifier
2. Support Vector Classifier
   • Generalization of the Maximal Margin Classifier
3. Support Vector Machine
   • Generalization of the Support Vector Classifier
Maximal Margin Classifier

First, we need to define a hyperplane.

What is a hyperplane?
• A hyperplane has p-1 dimensions in a p-dimensional space.
• Example: in 2-dimensional space, a hyperplane has 1 dimension (and thus is a line).
Hyperplane Mathematical Definition

For two dimensions, the hyperplane is defined as:
B0 + B1X1 + B2X2 = 0
• B0, B1, B2 are parameters.
• X1, X2 are variables.

Note that this equation is a line (the hyperplane is one-dimensional). Renaming X2 as Y and solving:
B0 + B1X1 + B2Y = 0
B2Y = -B1X1 - B0
Y = (-B1X1 - B0) / B2
Hyperplane Mathematical Definition
B0 + B1X1 + B2 X2 = 0


We’re going to “find” values for B0, B1, B2.
Then, for any values X1 and X2:
1. If B0 + B1X1 + B2X2 = 0, the point is on the line.
Hyperplane Mathematical Definition
B0 + B1X1 + B2 X2 = 0


We’re going to “find” values for B0, B1, B2.
Then, for any values X1 and X2:
2. If B0 + B1X1 + B2X2 > 0, the point is not on the line; it is on one side of the line.
3. If B0 + B1X1 + B2X2 < 0, the point is on the other side of the line.
Hyperplane

… divides 2-dimensional space into two halves by a line.
Separating Hyperplane
Note: a separating hyperplane means zero training errors.
Dataset with two classes:
1. Squares
2. Circles
Can find a separating hyperplane with all squares on one side and all circles on the other.
Infinitely many such hyperplanes are possible.
Classification Using a Separating Hyperplane

For a new test instance, which side of the line is it on?
B0 + B1X1 + B2X2 > 0
B0 + B1X1 + B2X2 < 0
Classification Using a Separating Hyperplane

Standard SVM approach:
• Label class data as either +1 or -1, depending on which class an instance belongs to.
• Prediction:
  yi = +1, if B0 + B1x1 + B2x2 + ... + Bnxn > 0
  yi = -1, if B0 + B1x1 + B2x2 + ... + Bnxn < 0
Classification Using a Separating Hyperplane

For a new test instance, which side of the line is it on?
B0 + B1X1 + B2X2 > 0
B0 + B1X1 + B2X2 < 0

Can also look at the magnitude:
• How far from zero? A greater magnitude means a more confident prediction.
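As a rough illustration (not from the slides), this sign-and-magnitude rule can be written directly in Python; the hyperplane coefficients below are made-up values:

```python
import numpy as np

# Hypothetical hyperplane parameters: B0 (intercept) and the weights B1, B2
B0 = -0.5
B = np.array([1.2, -0.8])

def predict(x):
    """Return the +1/-1 class label and the magnitude as a rough confidence."""
    score = B0 + np.dot(B, x)          # B0 + B1*x1 + B2*x2
    label = 1 if score > 0 else -1     # which side of the hyperplane?
    return label, abs(score)           # larger magnitude = farther from the boundary

label, confidence = predict(np.array([2.0, 1.0]))
print(label, confidence)
```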

Some Concerns with this Approach:
• Datasets with more than 2 target classes
• What if a “separating hyperplane” can’t be formed?
• Data with more than two dimensions
• Regression instead of classification
SVMs can deal with each of these.
What if the Data Has More than 2 Dimensions?

The mathematical definition of a hyperplane generalizes to n dimensions:
B0 + B1X1 + B2X2 = 0
B0 + B1X1 + B2X2 + ... + BnXn = 0
B0 + B1X1 + B2X2 + ... + BnXn > 0
B0 + B1X1 + B2X2 + ... + BnXn < 0
Maximum Margin Hyperplane

What’s the best separating hyperplane?
• Intuition: the one that is farthest from the training observations.
• Called the maximum margin hyperplane.
The Margin

B1 and B2 are each separating hyperplanes.
• B1 is better.
Margin: the smallest distance from the hyperplane to the training data.
• Represents the mid-line of the widest “slab” that can be inserted between the two classes.
Maximal Margin Hyperplane

We want the hyperplane that has the greatest margin.
• That is, B1 instead of B2, or any of the other infinitely many separating hyperplanes.
Maximal Margin Hyperplane

Support Vectors: the points in the data that, if moved, would cause the maximal margin hyperplane to move as well.
Moving any of the other data points would not affect the model.
Figuring Out the Maximal Margin Classifier



• Don’t worry, data mining toolkits do it automatically.
• It is an optimization problem.
• Involves calculus.
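For a sense of what such a toolkit call looks like, here is a minimal sketch assuming scikit-learn is available; a linear SVC with a very large C behaves approximately like a maximal margin classifier on separable data (the toy points are invented for the example):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration)
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C leaves almost no tolerance for violations,
# so the fit approximates the maximal margin hyperplane.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # the fitted hyperplane parameters
print(clf.support_vectors_)        # the support vectors found by the optimizer
```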
Support Vector Classifier

The Maximal Margin Classifier is a natural way to perform classification if a separating hyperplane exists.
• Perfect segmentation between the two classes
In many cases, no separating hyperplane will exist.
• Find a hyperplane that almost perfectly segments the classes
• This generalization is called the support vector classifier

Support Vector Classifier


• Maximal Margin Classifier: no training errors allowed
• Support Vector Classifier: tolerates training errors
  • Approach: soft margin
  • Allows construction of a linear decision boundary even when the classes are not linearly separable
Support Vector Classifier
Additional motivation:
• A new data point is added.
• The maximum margin classifier still perfectly segments the training data.
• But there is a dramatic shift in the maximal margin hyperplane.
• The model has high variance when trying to maintain perfect segmentation.
Support Vector Classifier

So, we are interested in:
• Greater robustness to individual data instances
• Better classification of most of the training data

Some misclassifications are permitted:
• “Soft” margin: because the margin can be violated by some of the instances

Red instances (see figure):
• 3, 4, 5, 6 are on the correct side of the margin
• 2 is on the margin
• 1 is on the wrong side of the margin
• 11 is on the wrong side of the hyperplane
Using the Support Vector Classifier for Classification

• Same as before: which side of the line is the test instance on?
Constructing the Support Vector Classifier

• More interesting: how much “softness” (misclassifications) in the soft margin is ideal?
• Complicated math, but Python figures it out.
• Specification of a nonnegative tuning parameter C
  • Generally chosen by the analyst following cross-validation
  • Large C: wider margin; more instances violate the margin
  • Small C: narrower margin; less tolerance for instances that violate the margin

[Figure: the same data points fit with C ranging from larger (wider margin, lower variance) to smaller (narrower margin, higher variance).]
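A rough sketch of choosing the softness by cross-validation with scikit-learn. One assumption worth flagging: scikit-learn’s C parameter is, loosely speaking, inverse to the “budget” C described above, so small scikit-learn C values give the softer, wider margin. The grid values and synthetic data are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data, just for the sketch
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Try several softness settings and keep the one with the best cross-validated accuracy
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```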
Support Vector Machines

What if a non-linear decision boundary is needed?
• Poor performance using this (linear) decision boundary.
Support Vector Machines

Idea: transform the data from its original coordinate space X into a new space Φ(X) so that a linear decision boundary can separate the two classes.
• Φ: a nonlinear transformation

Huh?
• Instead of fitting a support vector classifier using n features:
  X1, X2, …, Xn
• … use 2n features:
  X1, X1², X2, X2², …, Xn, Xn²
Support Vector Machines



• Enlarged “feature space” compared to the original “feature space”
• Can even extend to higher-order polynomial terms.
• Downside: can easily end up with a huge number of features → overfitting

Attribute Transformation
Φ: (x1, x2) → (x1², x2², √2·x1, √2·x2, 1)
Learning a Nonlinear SVM Model

Once again:
• Complicated, but Python does it for us.
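For instance, a minimal sketch assuming scikit-learn: a degree-2 polynomial kernel corresponds roughly to the quadratic transformation above, and the radial (RBF) kernel is another common nonlinear choice; the ring-shaped toy data is generated just for illustration:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Data where one class forms a ring around the other: not linearly separable
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Polynomial kernel (degree 2): works implicitly in the enlarged feature space
poly_svm = SVC(kernel="poly", degree=2, C=1.0).fit(X, y)

# Radial basis function ("radial kernel") alternative
rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)

print(poly_svm.score(X, y), rbf_svm.score(X, y))
```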
Other extensions to SVMs:
• Regression instead of classification
• Categorical variables instead of continuous
• Multiclass problems instead of binary
• Radial “kernels”
  • A circle instead of a hyperplane
Today…

Support Vector Machines
Multiclass Problems

Feature Selection

Multiclass Problems


• Scenario: the target class has more than 2 categories
• Motivation: some machine learning algorithms are designed for binary classification
  • Example: Support Vector Machines (SVM)
• How to extend binary classifiers to handle multiclass problems? Ideas?
#1 - Multiclass: One-Against-Rest



• Assume a multiclass dataset with K target classes
• Decompose into K binary problems
• Idea: for each target class yi, create a single binary problem with classifier Ci:
  • Class yi: positive
  • All other classes: negative
• Training: use all instances; each instance is used in training each of the Ci classifiers
• Testing:
  • Run the test instance through each classifier
  • Record votes for each class yi (a negative prediction is a vote for all other classes)
  • The class with the most votes is the predicted class
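A minimal one-against-rest sketch, assuming scikit-learn and the iris dataset (3 classes) as an example; the wrapper trains the K binary classifiers and aggregates their votes/scores for us:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)    # K = 3 target classes

ovr = OneVsRestClassifier(SVC(kernel="linear"))
ovr.fit(X, y)

print(len(ovr.estimators_))          # 3 underlying binary classifiers, one per class
print(ovr.predict(X[:5]))
```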
#2 - Multiclass: One-Against-One



• Assume a multiclass dataset with K target classes
• Train K(K-1)/2 binary classifiers (many more than One-Against-Rest)
• Idea: each classifier distinguishes between a pair of classes (yi, yj)
  • The classifier ignores records that don’t belong to yi or yj
• Training: use all instances; each instance is only used in training its “relevant classifiers” (K-1 of them)
• Testing:
  • Run the test instance through each classifier
  • Record votes for each class yi
  • The class with the most votes is the predicted class
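Similarly, a one-against-one sketch under the same assumptions (scikit-learn, iris as an example dataset):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)    # K = 3 classes

ovo = OneVsOneClassifier(SVC(kernel="linear"))
ovo.fit(X, y)

print(len(ovo.estimators_))          # K(K-1)/2 = 3 pairwise classifiers
print(ovo.predict(X[:5]))
```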
Today…

Support Vector Machines
Multiclass Problems

Feature Selection

High Dimensionality
• Feature Subset Selection
• Adjusted R² Statistic

High Dimensionality


… can be bad.
• Datasets can have a large number of features
  • Example: big data
  • Example: stock prices (time series)
    • Each stock is an individual instance
    • Features/variables are the closing price on a given day
    • Imagine 30 years’ worth of closing prices (30 x 365 features)
Why is it a Problem?

Often data mining algorithms work better if there is not an overwhelming number of attributes.
• BAD: p > n, where p = # of features and n = # of instances
• We want the “dimensionality” to be lower

“The Curse of Dimensionality”
• As dimensionality increases (more features), the data becomes increasingly sparse in the “feature space” that it occupies.
• Not enough data objects for the number of features that are present
• Result: reduced classification model accuracy

Other Benefits to Dimensionality Reduction
1. More understandable models
   • Learned model may involve fewer attributes
2. Better visualizations
   • Fewer attributes = fewer variables to plot
3. Computational time
   • Fewer attributes = quicker model learning?
4. Elimination of irrelevant features
Techniques for Dimensionality Reduction
1. Advanced: Linear Algebra Techniques
   • Automatic approaches
   • Project data from a high-dimensional space into a lower-dimensional space
     1. Principal Components Analysis (PCA)
     2. Singular Value Decomposition (SVD)
   • Not necessarily interested in “losing information”; rather, eliminate some of the sparsity
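A brief sketch of such a projection with PCA in scikit-learn; the dataset and the choice of 2 components are arbitrary, for illustration only:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 4 original features

# Project from 4 dimensions down to 2 (an arbitrary choice for the sketch)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance retained by each component
```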
Techniques for Dimensionality Reduction
2. Feature Construction
   • Example: combining two separate features (# of full baths, # of half baths) into one feature (“total baths”)
   • Example: combining the features (mass) and (volume) into one feature (density), where density = mass / volume
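For example, a small feature-construction sketch with pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical data with separate mass and volume columns
df = pd.DataFrame({"mass": [10.0, 4.0, 7.5], "volume": [2.0, 1.0, 3.0]})

# Construct the combined feature and drop the originals
df["density"] = df["mass"] / df["volume"]
df = df.drop(columns=["mass", "volume"])

print(df)
```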
Techniques for Dimensionality Reduction
3. Feature Subset Selection
   • Reducing the number of features by only using a subset of the features
   • How many should be in the subset?
   • Are we losing information if we only consider a subset of features?
     • Redundant features (example: (1) purchase price and (2) sales tax)
     • Irrelevant features (example: student ID numbers)
   • By eliminating unnecessary features, we hope for a better model.
Eliminating Redundant and Irrelevant Features
1. Manually via Data Analyst
   • Intuition about the problem domain
2. Systematic Approach
   • Try all possible combinations of feature subsets?
     • See which combination results in the best model
     • For n features, there are 2^n possible combinations of subsets
     • Infeasible to try each of them
Three Systematic Approaches
1. Embedded Approaches
2. Filter Approaches
3. Wrapper Approaches
Embedded Approaches


• Algorithm specific
• Occurs naturally as part of the data mining algorithm
  • Example: present in decision tree induction
    • Only a certain subset of features is used in the final decision tree
  • Example: not present in linear regression
    • The fitted model contains a coefficient for each predictor variable
Filter Approaches

• Features are selected before the data mining algorithm is run
• The filter approach is independent of the data mining task
• Example (trying to eliminate redundant features):
  1. Look at the pairwise correlation between variables; pick a subset of variables that each have low pairwise correlation.
  2. Then use only that subset in the Linear Regression model.
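A rough sketch of such a correlation filter with pandas; the 0.9 threshold, the helper name low_correlation_subset, and the synthetic columns are all invented for illustration:

```python
import numpy as np
import pandas as pd

def low_correlation_subset(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Greedily drop one feature from every highly correlated pair."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return [col for col in df.columns if col not in to_drop]

# Hypothetical data frame where x2 is nearly a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + rng.normal(scale=0.01, size=100),
                   "x3": rng.normal(size=100)})
print(low_correlation_subset(df))   # expected to keep something like ['x1', 'x3']
```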
Wrapper Approaches


• The data mining algorithm is a “black box” for finding the best subset of features
• Tries different combinations of subsets
  • Typically will never enumerate all 2^n possible combinations
  • Will search a feature space that is much smaller
• The final model uses the specific subset that evaluates the best
Top-Down Wrapper


Assuming n features…
1. Start with no attributes.
   • Train the classifier n times, each time with a different single feature.
   • Each classifier only has a single predictor.
   • See which of the n classifiers performs the best.
2. Add to the best classifier: recursively use the remaining attributes to find which attribute improves performance the most.
   • Keep including the best attribute.
   • Stopping criterion: stop if there is no improvement to classifier performance, or the increase in classifier performance is less than some threshold.
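As a sketch, scikit-learn’s SequentialFeatureSelector (in recent versions) implements this kind of greedy forward wrapper; the estimator, dataset, and fixed subset size below are illustrative choices rather than the lecture’s own setup:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # 30 candidate features

# Forward selection: start with no features and greedily add the one whose addition
# gives the biggest cross-validated improvement. Here we simply fix the subset size
# instead of using a score-improvement threshold.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
selector = SequentialFeatureSelector(clf, n_features_to_select=5,
                                     direction="forward", cv=5)
selector.fit(X, y)
print(selector.get_support(indices=True))    # indices of the selected features
```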
Bottom-Up Wrapper



Assuming n features…
• Start with all n attributes in the model.
• Create n models, each with a different predictor omitted.
  • Each classifier has n-1 predictors.
  • See which of the n classifiers affects performance the least; throw that attribute out.
• Recursively find the attribute that affects performance the least.
• Stopping criterion: stop if classifier performance begins to degrade.
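The same utility sketches this backward-elimination variant; only the direction changes (again an illustrative configuration, not the lecture’s code):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Backward elimination: start from all 30 features and repeatedly drop the one
# whose removal hurts the cross-validated score the least.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
selector = SequentialFeatureSelector(clf, n_features_to_select=5,
                                     direction="backward", cv=5)
selector.fit(X, y)
print(selector.get_support(indices=True))
```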
Other Wrappers

• Bi-Directional
  • Combining Top-Down and Bottom-Up
• Greedy Search with Backtracking
  • (if you’re familiar with AI)
• …
Adjusted R² Statistic

Recall the R² statistic that we saw in Linear Regression:
• Measured the proportion of variance explained by the model
• Always a value between 0 and 1; higher is better
• Always increases as more variables are added to the model

R² = (TSS - RSS) / TSS = 1 - RSS / TSS
TSS = Σ(yi - ȳ)²
RSS = Σ(yi - ŷi)²
Adjusted R² Statistic

In contrast to R², Adjusted R² penalizes for unnecessary variables in the model.
• d = number of predictors
• n = number of instances

Adjusted R² = 1 - [RSS / (n - d - 1)] / [TSS / (n - 1)]
TSS = Σ(yi - ȳ)²
RSS = Σ(yi - ŷi)²
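A small sketch computing both statistics directly from these definitions; the numbers are made up for illustration:

```python
import numpy as np

def r2_and_adjusted_r2(y, y_hat, d):
    """Compute R^2 and Adjusted R^2 for predictions y_hat from a model with d predictors."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    rss = np.sum((y - y_hat) ** 2)         # residual sum of squares
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return r2, adj_r2

# Made-up example: 6 observations, a model with d = 2 predictors
y     = [3.0, 4.5, 5.0, 6.5, 8.0, 9.0]
y_hat = [3.2, 4.1, 5.3, 6.4, 7.6, 9.2]
print(r2_and_adjusted_r2(y, y_hat, d=2))
```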