Class 23
CSC 600: Data Mining
Support Vector Machines
Multiclass Problems
Feature Selection
Support Vector Machines (SVM)
What are they?
Developed in the 1990s
• Computer Science community
• Very popular
Often considered one of the best “out of the box” classifiers
• Applications: handwritten digit recognition, text
Support Vector Machines (SVM)
Comparing to other statistical learning methods:
• SVMs work well with high-dimensional data
• Represents the “decision boundary” using a subset of the training examples
Maximal Margin Classifier
Support Vector Classifier
Support Vector Machine
Often all three are referred to as “Support Vector Machines”
The Path Ahead
Maximal Margin Classifier
Support Vector Classifier
Generalization of Maximal Margin Classifier
Support Vector Machine
Generalization of Support Vector Classifier
Maximal Margin Classifier
First, need to define a hyperplane
What is a hyperplane?
• A hyperplane has p-1 dimensions in a p-dimensional space
• Example: in 2-dimensional space, a hyperplane has 1 dimension (and thus, is a line)
Hyperplane Mathematical Definition
For two dimensions, hyperplane defined as:
B0 + B1X1 + B2 X2 = 0
B0, B1, B2 are parameters.
X1, X2 are variables.
Note that this equation is a line:
• Hyperplane is one-dimensional
Writing Y for X2 (and assuming B2 is not zero):
B0 + B1 X1 + B2 Y = 0
B2 Y = -B1 X1 - B0
Y = (-B1 X1 - B0) / B2
Hyperplane Mathematical Definition
B0 + B1X1 + B2X2 = 0
We’re going to “find” values for B0, B1, B2.
Then, for any values X1 and X2:
if B0 + B1X1 + B2X2 = 0
• Point is on the line.
if B0 + B1X1 + B2X2 > 0
• Point is not on the line. It is on one side of the line.
if B0 + B1X1 + B2X2 < 0
• Point is on the other side of the line.
The hyperplane divides 2-dimensional space into two halves by a line.
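The three cases above can be sketched in a few lines of Python. This is only an illustration: the function name `side_of_line` and the parameter values are made up for the example and are not from the slides.

```python
def side_of_line(b0, b1, b2, x1, x2):
    """Return 0 if (x1, x2) is on the line, otherwise +1 or -1 for the two sides."""
    value = b0 + b1 * x1 + b2 * x2
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0

# Hypothetical parameters: the line -1 + X1 + X2 = 0
print(side_of_line(-1, 1, 1, 0.5, 0.5))  # point on the line
print(side_of_line(-1, 1, 1, 2, 2))      # point on one side
print(side_of_line(-1, 1, 1, 0, 0))      # point on the other side
```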
Separating Hyperplane
Note: a separating hyperplane means zero training errors.
Dataset with two classes:
1. Squares
2. Circles
Can find a separating hyperplane with all squares on one side and all circles on the other.
Infinitely many such hyperplanes possible.
Classification Using a Separating Hyperplane
For a new test instance, which side of the line is it on?
B0 + B1X1 + B2X2 > 0
B0 + B1X1 + B2X2 < 0
Classification Using a Separating Hyperplane
Standard SVM approach:
• Label class data as either +1 or -1, depending on which class an instance belongs to.
• Prediction:
$$y_i = \begin{cases} 1, & \text{if } B_0 + B_1 x_1 + B_2 x_2 + \dots + B_n x_n > 0 \\ -1, & \text{if } B_0 + B_1 x_1 + B_2 x_2 + \dots + B_n x_n < 0 \end{cases}$$
Can also look at the magnitude of B0 + B1X1 + B2X2.
How far from zero?
• Greater magnitude means more confident prediction.
Some Concerns with this Approach:
Datasets with more than 2 target classes
What if a “separating hyperplane” can’t be formed?
Data is more than two dimensions
Regression instead of classification
SVMs can deal with each of these.
What if Data is more than 2-Dimensions?
Mathematical definition of hyperplane generalizes
to n-dimensions:
B0 + B1 X1 + B2 X2 = 0
B0 + B1 X1 + B2 X2 +... + Bn X n = 0
B0 + B1 X1 + B2 X2 +... + Bn X n > 0
B0 + B1 X1 + B2 X2 +... + Bn X n < 0
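A minimal sketch of the n-dimensional rule, combined with the earlier remark that the magnitude of the value indicates confidence. The function names and the parameter values are hypothetical. A value of exactly zero (a point on the hyperplane) is arbitrarily assigned to -1 here.

```python
def decision_value(b0, coefs, x):
    """Compute B0 + B1*x1 + ... + Bn*xn for one instance x."""
    return b0 + sum(b * xi for b, xi in zip(coefs, x))

def predict(b0, coefs, x):
    """Return (label, magnitude): label is +1 or -1, magnitude is |value|."""
    value = decision_value(b0, coefs, x)
    label = 1 if value > 0 else -1  # a value of exactly 0 falls to -1 in this sketch
    return label, abs(value)

# Hypothetical 3-dimensional hyperplane: -1 + X1 + X2 + X3 = 0
b0, coefs = -1, [1, 1, 1]
print(predict(b0, coefs, (3, 2, 1)))  # far from the hyperplane: more confident
print(predict(b0, coefs, (1, 0, 1)))  # same side, but closer to zero
print(predict(b0, coefs, (0, 0, 0)))  # other side
```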
Maximum Margin Hyperplane
What’s the best separating hyperplane?
Intuition: the one that is farthest from the training observations.
Called the maximum margin hyperplane.
The Margin
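B1 and B2 are each separating hyperplanes (here B1 and B2 label two candidate hyperplanes in the figure, not coefficients)
• B1 is better
Margin: the smallest distance from the hyperplane to the training data
The maximal margin hyperplane represents the mid-line of the widest “slab” that can be inserted between the two classes.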
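The margin definition can be sketched as follows. The sketch uses the standard point-to-hyperplane distance |B0 + B1x1 + ... + Bnxn| / sqrt(B1^2 + ... + Bn^2), which the slides do not spell out. The data points and the two candidate hyperplanes are invented for illustration, and both candidates are assumed to already separate the classes.

```python
import math

def distance_to_hyperplane(b0, coefs, x):
    """Perpendicular distance from point x to the hyperplane b0 + coefs . x = 0."""
    value = b0 + sum(b * xi for b, xi in zip(coefs, x))
    return abs(value) / math.sqrt(sum(b * b for b in coefs))

def margin(b0, coefs, points):
    """Margin: smallest distance from the hyperplane to any training point."""
    return min(distance_to_hyperplane(b0, coefs, x) for x in points)

# Hypothetical training data: two squares, then two circles
points = [(1, 1), (2, 1), (4, 4), (5, 4)]

margin_a = margin(-5, (1, 1), points)  # candidate A: -5 + X1 + X2 = 0
margin_b = margin(-3, (1, 0), points)  # candidate B: -3 + X1 = 0
print(margin_a, margin_b)
print("A is better" if margin_a > margin_b else "B is better")
```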
Maximal Margin Hyperplane
We want the hyperplane that has the greatest margin.
• That is, B1 instead of B2 or any of the other infinitely many separating hyperplanes
Maximal Margin Hyperplane
Support Vectors: the points in the data that, if moved, would cause the maximal margin hyperplane to move as well.
Moving any of the other data points would not affect the model, as long as the point does not cross the margin.
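One way to picture support vectors: for the maximal margin hyperplane, they are the training points that lie exactly at the margin distance. The sketch below assumes the hyperplane is already known. The toy data is invented, chosen so that the line X1 = 2 is the maximal margin hyperplane, and `find_support_vectors` is a hypothetical helper name.

```python
import math

def find_support_vectors(b0, coefs, points, tol=1e-9):
    """Return the points whose distance to the hyperplane equals the margin."""
    norm = math.sqrt(sum(b * b for b in coefs))
    dists = [abs(b0 + sum(b * xi for b, xi in zip(coefs, x))) / norm for x in points]
    smallest = min(dists)
    return [x for x, d in zip(points, dists) if d - smallest <= tol]

# Hypothetical data: three squares on the left, three circles on the right
points = [(0, 0), (1, 0), (0, 2), (3, 0), (4, 0), (4, -1)]
b0, coefs = -2, (1, 0)  # the line -2 + X1 = 0

print(find_support_vectors(b0, coefs, points))

# Moving a point that is not a support vector (away from the margin)
# leaves the set of support vectors unchanged.
moved = [(-1, 0) if p == (0, 0) else p for p in points]
print(find_support_vectors(b0, coefs, moved))
```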
Figuring Out the Maximal Margin Classifier
Don’t worry, data mining toolkits do it automatically.
Optimization problem.
Involves calculus.
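For reference, the optimization problem is commonly written as below. The slides do not state it, so this is the standard textbook formulation rather than something taken from the deck. M is the margin, y_i is the +1/-1 label of training instance i, and x_{i1}, ..., x_{in} are its feature values.

```latex
\max_{B_0, B_1, \dots, B_n, M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{n} B_j^2 = 1,
\qquad
y_i \left( B_0 + B_1 x_{i1} + B_2 x_{i2} + \dots + B_n x_{in} \right) \ge M
\quad \text{for every training instance } i
```

The second constraint says every training instance is on the correct side of the hyperplane and at least distance M away from it.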
Support Vector Classifier
Maximum Margin Classifier is a natural way to perform classification if a separating hyperplane exists.
• Perfect segmentation between two classes
In many cases, no separating hyperplane will exist
• Find a hyperplane that almost perfectly segments the classes
• This generalization is called: support vector classifier
Support Vector Classifier
Maximal Margin Classifier: no training errors
Support Vector Classifier: tolerate training errors
• Approach: soft margin
• Will allow construction of a linear decision boundary even when classes are not linearly separable.
Support Vector Classifier
Additional motivation:
Maximum margin classifier perfectly segments the training data.
New data point added: dramatic shift in the maximal margin hyperplane.
Model has high variance when trying to maintain perfect segmentation.
Support Vector Classifier
So, interested in:
• Greater robustness to individual data instances
• Better classification of most of the training data
Some misclassifications permitted:
• “Soft” margin: because margin can be violated by some of the instances
Red Instances:
• 3 4 5 6 on correct side of margin
• 2 is on the margin
• 1 is on the wrong side of the margin
• 11 is on the wrong side of the hyperplane
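The four situations listed for the red instances can be told apart numerically. The sketch below assumes the common convention that the hyperplane is scaled so the two margin boundaries sit at values +1 and -1. That scaling, the function name, and the example points are assumptions for illustration, not taken from the slides.

```python
def margin_status(b0, coefs, x, y):
    """Classify one training instance (x, y) relative to a soft margin.

    Assumes the hyperplane is scaled so the margin boundaries are at +1 and -1.
    y is the true label, +1 or -1.
    """
    score = y * (b0 + sum(b * xi for b, xi in zip(coefs, x)))
    if score > 1:
        return "correct side of margin"
    if score == 1:  # exact comparison is fine for this toy data only
        return "on the margin"
    if score > 0:
        return "wrong side of margin"
    return "wrong side of hyperplane"  # score <= 0: on or past the hyperplane

# Hypothetical hyperplane -3 + X1 = 0, with class +1 to the right
b0, coefs = -3, (1, 0)
for x in [(6, 0), (4, 1), (3.5, 0), (2, 0)]:
    print(x, margin_status(b0, coefs, x, +1))
```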
Using Support Vector Classifier for Classification
Same as before.
Which side of the line is the test instance on?
Constructing the Support Vector Classifier
More interesting.
How much “softness” (misclassifications) in the soft margin is allowed?
Complicated math, but Python figures it out.
Specification of nonnegative tuning parameter C
Generally chosen by analyst following cross-validation
• Large C: wider margin; more instances violate margin
• Small C: narrower margin; less tolerance for instances that violate the margin
Same data points, fit with larger C through smaller C:
• Larger C: lower variance.
• Smaller C: higher variance.
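A hedged sketch of the effect of the tuning parameter, assuming scikit-learn and NumPy are available (the slides only say "python"). One caution: in these slides a large C means more tolerance for violations, but scikit-learn's `C` argument is a penalty on violations, so it works in the opposite direction. A small scikit-learn `C` gives the wider margin. The data here is randomly generated for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, overlapping two-class data (not from the slides)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
               rng.normal(2.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [1] * 40)

def margin_width(penalty):
    """Fit a linear support vector classifier; return (margin width, number of support vectors)."""
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    w = clf.coef_[0]
    return 2.0 / np.linalg.norm(w), int(clf.n_support_.sum())

# scikit-learn's C is a penalty: small values tolerate more violations
for penalty in (0.01, 1.0, 100.0):
    width, n_sv = margin_width(penalty)
    print(f"sklearn C={penalty}: margin width={width:.2f}, support vectors={n_sv}")
```

In practice the value would be chosen by cross-validation, as the slide notes.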
Support Vector Machines
What if a non-linear decision boundary is needed?
Poor performance using a linear decision boundary.
Support Vector Machines
Idea: transform the data from its original coordinate space in X into a new space Φ(X) so that a linear decision boundary can separate the two classes
• Φ: nonlinear transformation
Instead of fitting a support vector classifier using n features:
X1, X2, …, Xn
… use 2n features:
X1, X1^2, X2, X2^2, …, Xn, Xn^2
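The 2n-feature expansion can be sketched directly. The example also shows why it helps: a circular boundary in the original space becomes a linear rule in the enlarged space. The circle of radius 2 and the coefficient values are made up for illustration.

```python
def add_squared_features(x):
    """Map (X1, ..., Xn) to (X1, X1^2, ..., Xn, Xn^2)."""
    expanded = []
    for value in x:
        expanded.extend([value, value ** 2])
    return expanded

print(add_squared_features([2, -3]))  # [2, 4, -3, 9]

# Hypothetical linear rule in the enlarged space: -4 + X1^2 + X2^2 > 0
# In the original space this is the circle of radius 2.
b0, coefs = -4, [0, 1, 0, 1]

def side(x):
    """Apply the linear rule to the expanded features; +1 outside the circle, -1 otherwise."""
    features = add_squared_features(x)
    return 1 if b0 + sum(b * f for b, f in zip(coefs, features)) > 0 else -1

print(side([0, 1]))  # inside the circle
print(side([3, 0]))  # outside the circle
```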
Support Vector Machines
Enlarged “feature space” compared to original “feature space”
Can even extend to higher-order polynomial terms.
Downside: can easily end up with huge number of features
• overfitting
Attribute Transformation
$$\Phi : (x_1, x_2) \rightarrow (x_1^2, x_2^2, \sqrt{2}\,x_1, \sqrt{2}\,x_2, 1)$$
Learning a Nonlinear SVM Model
Once again:
• Complicated, but Python does it for us.
Other extensions to SVMs:
Regression instead of classification
Categorical variables instead of continuous
Multiclass problems instead of binary
Radial “kernels”
• Circle instead of hyperplane
Support Vector Machines
Multiclass Problems
Feature Selection
Multiclass Problems
Scenario: target class has more than 2 categories
Motivation: some machine learning algorithms are designed for binary classification
• Example: Support Vector Machines (SVM)
How to extend binary classifiers to handle multiclass problems?
#1 - Multiclass: One-Against-Rest
Assume multiclass dataset with K target classes
Decompose into K binary problems
Idea: For each target class yi create a single binary problem, with classifier Ci:
Class yi: positive
All other classes: negative
Training: use all instances; each instance is used in training each of the Ci classifiers
Testing:
• run testing instance through each classifier
• record votes for each yi class (negative prediction is a vote for all other classes)
• class with most votes is the predicted class
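The voting scheme just described can be sketched as follows. The binary classifiers are stand-ins: simple hand-written rules on a single number, invented for the example, in place of trained SVMs. Ties are broken by class order, which is an arbitrary choice of this sketch.

```python
def one_against_rest_predict(classifiers, classes, x):
    """classifiers maps each class to a function returning True for a positive prediction."""
    votes = {c: 0 for c in classes}
    for target, clf in classifiers.items():
        if clf(x):
            votes[target] += 1           # positive: vote for this class
        else:
            for other in classes:        # negative: vote for all other classes
                if other != target:
                    votes[other] += 1
    predicted = max(classes, key=lambda c: votes[c])  # ties go to the earlier class
    return predicted, votes

# Hypothetical K = 3 problem with one binary classifier per class
classes = ["a", "b", "c"]
classifiers = {
    "a": lambda x: x < 1,
    "b": lambda x: 1 <= x < 2,
    "c": lambda x: x >= 2,
}
print(one_against_rest_predict(classifiers, classes, 1.5))
```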
#2 - Multiclass: One-Against-One
Assume multiclass dataset with K target classes
Train K(K-1)/2 binary classifiers (many more than One-Against-Rest)
Idea: Each classifier distinguishes between pair of classes (yi, yj)
Classifier ignores records that don’t belong to yi or yj
Training: use all instances; each instance is only used in training its “relevant classifiers” (K-1 of them)
Testing:
• run testing instance through each classifier
• record votes for each yi class
• class with most votes is the predicted class
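A matching sketch for One-Against-One. Again the pairwise classifiers are invented stand-ins (each picks whichever of its two classes has the nearer made-up "center" on a number line) rather than trained models.

```python
from itertools import combinations

def one_against_one_predict(pair_classifiers, classes, x):
    """pair_classifiers maps each pair (yi, yj) to a function returning yi or yj."""
    votes = {c: 0 for c in classes}
    for pair in combinations(classes, 2):
        winner = pair_classifiers[pair](x)
        votes[winner] += 1
    predicted = max(classes, key=lambda c: votes[c])  # ties go to the earlier class
    return predicted, votes

# Hypothetical K = 4 problem: each class has a center on the number line
centers = {"a": 0, "b": 1, "c": 2, "d": 3}
classes = list(centers)

def make_pair_classifier(p, q):
    """Stand-in binary classifier: choose the class with the nearer center."""
    return lambda x: p if abs(x - centers[p]) <= abs(x - centers[q]) else q

pair_classifiers = {pair: make_pair_classifier(*pair)
                    for pair in combinations(classes, 2)}

print(len(pair_classifiers))  # K(K-1)/2 = 6 classifiers for K = 4
print(one_against_one_predict(pair_classifiers, classes, 1.2))
```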
Support Vector Machines
Multiclass Problems
Feature Selection
• High Dimensionality
• Feature Subset Selection
• Adjusted R2 Statistic
High Dimensionality
… can be bad
Datasets can have a large number of features
• Example: big data
• Example: stock prices (time series)
• Each stock is an individual instance
• Features/variables are the closing price on a given day
Imagine 30 years’ worth of closing prices (30 x 365 = 10,950 features)
Why is it a Problem?
Often data mining algorithms work better if there is not an overwhelming number of attributes
• That is, if the “dimensionality” is lower
BAD: p > n, where
p = # of features
n = # of instances
“The Curse of Dimensionality”
As dimensionality increases (more features), the data becomes increasingly sparse in the “feature space” that it occupies.
Not enough data objects for the number of features that are present
• Reduced classification model accuracy
Other Benefits to Dimensionality Reduction
More understandable models
• Learned model may involve fewer attributes
Better visualizations
• Fewer attributes = fewer variables to plot
Computational time
• Fewer attributes = quicker model learning?
Elimination of irrelevant features
Techniques for Dimensionality Reduction
Advanced: Linear Algebra Techniques
• Automatic
• Project data from high-dimensional space into a lower-dimensional space
1. Principal Components Analysis (PCA)
2. Singular Value Decomposition (SVD)
• Not necessarily interested in “losing information”; rather eliminate some of the sparsity
Techniques for Dimensionality Reduction
Feature Construction
• Example: combining two separate features (# of full baths, # of half baths) into one feature (“total baths”)
• Example: combining features (mass) and (volume) into one feature (density), where density = mass / volume
Techniques for Dimensionality Reduction
Feature Subset Selection
Reducing number of features by only using a subset of features
How many should be in the subset?
Losing information if we only consider a subset of features?
Redundant features
• Example: (1) purchase price and (2) sales tax
Irrelevant features
• Example: student id numbers
By eliminating unnecessary features, we hope for a better model
Eliminating Redundant and Irrelevant Features
Manually via Data Analyst
• Intuition about problem domain
Systematic Approach
• Try all possible combinations of feature subsets?
• See which combination results in best model
For n features, there are 2^n possible combinations of features
• Infeasible to try each of them
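A small sketch of why exhaustive search is infeasible: enumerating every subset is easy to write, but the count doubles with each added feature. The feature names are hypothetical.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Yield every subset of the feature list, including the empty subset."""
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            yield subset

features = ["price", "tax", "baths"]  # hypothetical feature names
subsets = list(all_feature_subsets(features))
print(len(subsets))  # 8, which is 2**3

# The number of subsets grows exponentially with the number of features
for n in (10, 20, 30):
    print(n, 2 ** n)
```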
Three Systematic Approaches
Embedded Approaches
Filter Approaches
Wrapper Approaches
Embedded Approaches
Algorithm specific
Occurs naturally as part of the data mining algorithm
• Example: present in decision tree induction
• Only a certain subset of features is used in the final decision tree
• Example: not present in linear regression
• Fitted model contains a coefficient for each predictor variable
Filter Approaches
Features are selected before the data mining algorithm is run
Filter approach is independent of the data mining task
Example: (trying to eliminate redundant features)
Look at pairwise correlation between variables
• Pick subset of variables that each have low pairwise correlation
Then use only that subset in the Linear Regression model.
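One possible filter along these lines, sketched in plain Python. The greedy keep-or-drop order, the 0.9 threshold, and the toy columns are all assumptions for illustration. The slides only say to pick variables with low pairwise correlation. A constant column would cause a division by zero in this sketch.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def filter_by_correlation(columns, threshold=0.9):
    """Greedily keep a feature only if its |correlation| with every kept feature is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: sales tax is exactly 6% of purchase price (redundant)
columns = {
    "purchase_price": [100, 200, 300, 400, 500],
    "sales_tax": [6, 12, 18, 24, 30],
    "square_feet": [900, 700, 1200, 800, 1000],
}
print(filter_by_correlation(columns))
```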
Wrapper Approaches
Data mining algorithm is a “black box” for finding the best subset of features
Tries different combinations of subsets
• Typically will never enumerate all 2^n possible subsets
• Will search a feature space that is much smaller
Final model uses the specific subset that evaluates the best
Top-Down Wrapper
Assuming n number of features…
Start with no attributes
Train classifier n times, each time with a different feature
• Each classifier only has a single predictor
See which of the n classifiers performs the best
Add that attribute to the model. Recursively use the remaining attributes to find which attribute improves performance the most
Keep including best attribute
Stopping criterion: Stop if no improvement to classifier performance, or increase in classifier performance is less than some threshold
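The top-down procedure (often called forward selection) can be sketched with the classifier treated as a black-box scoring function. The `score` function below is a made-up stand-in for "train a classifier on this subset and measure its performance". Its numbers are invented.

```python
def top_down_select(features, score, min_gain=0):
    """Greedy forward selection; score(subset) returns a number, higher is better."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        # Try adding each remaining feature and keep the one that scores best
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score - best_score <= min_gain:
            break  # stopping criterion: no (or too little) improvement
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Hypothetical scoring: accuracy in percent, with a small cost per feature
USEFULNESS = {"a": 30, "b": 20, "c": 0}

def score(subset):
    return 50 + sum(USEFULNESS[f] for f in subset) - len(subset)

print(top_down_select(["a", "b", "c"], score))
```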
Bottom-Up Wrapper
Assuming n number of features…
Start with all n attributes in model
Create n models, each with a different predictor omitted.
Each classifier has n-1 predictors
See which of the n classifiers affects performance the least
Throw that attribute out
Recursively find the attribute that affects performance the least
Stopping criterion: Stop if classifier performance begins to degrade
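A matching sketch of the bottom-up procedure (often called backward elimination), again with an invented `score` function standing in for training and evaluating a classifier on a feature subset.

```python
def bottom_up_eliminate(features, score):
    """Greedy backward elimination; score(subset) returns a number, higher is better."""
    selected = list(features)
    best_score = score(selected)
    while len(selected) > 1:
        # Find the feature whose omission hurts performance the least
        candidate = max(selected,
                        key=lambda f: score([g for g in selected if g != f]))
        candidate_score = score([g for g in selected if g != candidate])
        if candidate_score < best_score:
            break  # stopping criterion: performance begins to degrade
        selected.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Hypothetical scoring: accuracy in percent, with a small cost per feature
USEFULNESS = {"a": 30, "b": 20, "c": 0}

def score(subset):
    return 50 + sum(USEFULNESS[f] for f in subset) - len(subset)

print(bottom_up_eliminate(["a", "b", "c"], score))
```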
Other Wrappers
• Combining Top-Down and Bottom-Up
• Greedy Search with Backtracking (if you’re familiar with AI)
Adjusted R2 Statistic
Recall the R2 statistic that we saw in Linear Regression
• Measured the proportion of variance explained by the model
• Always a value between 0 and 1
• Higher is better
• Never decreases as more variables are added to the model.
$$R^2 = 1 - \frac{RSS}{TSS}$$
$$TSS = \sum_i (y_i - \bar{y})^2$$
$$RSS = \sum_i (y_i - \hat{y}_i)^2$$
Adjusted R2 Statistic
In contrast to R2, Adjusted R2 penalizes for unnecessary variables in the model.
d = number of predictors
n = number of instances
$$\text{Adjusted } R^2 = 1 - \frac{RSS / (n - d - 1)}{TSS / (n - 1)}$$
$$TSS = \sum_i (y_i - \bar{y})^2$$
$$RSS = \sum_i (y_i - \hat{y}_i)^2$$
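Both statistics can be computed in a few lines. The sketch uses R2 = 1 - RSS/TSS and the adjusted form 1 - (RSS/(n-d-1)) / (TSS/(n-1)). The observed and fitted values are invented, and the same fitted values are reused with different d only to show how the penalty behaves.

```python
def tss_rss(y, y_hat):
    """Return (TSS, RSS) for observed values y and fitted values y_hat."""
    mean_y = sum(y) / len(y)
    tss = sum((yi - mean_y) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return tss, rss

def r_squared(y, y_hat):
    tss, rss = tss_rss(y, y_hat)
    return 1 - rss / tss

def adjusted_r_squared(y, y_hat, d):
    """d is the number of predictors used by the model that produced y_hat."""
    n = len(y)
    tss, rss = tss_rss(y, y_hat)
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

# Hypothetical observed and fitted values
y = [1, 2, 3, 4, 5]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]

print(r_squared(y, y_hat))
print(adjusted_r_squared(y, y_hat, d=1))
print(adjusted_r_squared(y, y_hat, d=3))  # same fit, more predictors: lower value
```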