SUPPORT VECTOR MACHINES
Class 23
CSC 600: Data Mining
Today…

Support Vector Machines
Multiclass Problems

Feature Selection

Support Vector Machines (SVM)

What are they?
• Developed in the 1990s in the computer science community
• Very popular

Performance:
• Often considered one of the best “out of the box” classifiers
• Applications: handwritten digit recognition, text categorization

Support Vector Machines (SVM)

Comparing to other statistical learning methods:
• SVMs work well with high-dimensional data

Unique:
• Represents the “decision boundary” using a subset of the training examples
Terminology
1. Maximal Margin Classifier
2. Support Vector Classifier
3. Support Vector Machine
Often all three are referred to as “Support Vector Machine”.
The Path Ahead
1. Maximal Margin Classifier
2. Support Vector Classifier
   • Generalization of the Maximal Margin Classifier
3. Support Vector Machine
   • Generalization of the Support Vector Classifier
Maximal Margin Classifier

First, we need to define a hyperplane.

What is a hyperplane?
• A hyperplane has p-1 dimensions in a p-dimensional space.
• Example: in 2-dimensional space, a hyperplane has 1 dimension (and thus is a line).
Hyperplane Mathematical Definition

For two dimensions, the hyperplane is defined as:
B0 + B1X1 + B2X2 = 0
• B0, B1, B2 are parameters.
• X1, X2 are variables.

Note that this equation is a line (the hyperplane is one-dimensional). Renaming X2 as Y and solving:
B0 + B1X1 + B2Y = 0
B2Y = -B1X1 - B0
Y = (-B1X1 - B0) / B2
Hyperplane Mathematical Definition
B0 + B1X1 + B2 X2 = 0


We’re going to “find” values for B0, B1, B2.
Then, for any values X1 and X2:
1. If B0 + B1X1 + B2X2 = 0, the point is on the line.
Hyperplane Mathematical Definition
B0 + B1X1 + B2 X2 = 0


We’re going to “find” values for B0, B1, B2.
Then, for any values X1 and X2:
2. If B0 + B1X1 + B2X2 > 0, the point is not on the line; it is on one side of the line.
3. If B0 + B1X1 + B2X2 < 0, the point is on the other side of the line.
Hyperplane

… divides 2-dimensional space into two halves by a line.
Separating Hyperplane
Note: a separating hyperplane means zero training errors.
Dataset with two classes:
1. Squares
2. Circles
Can find a separating hyperplane with all squares on one side and all circles on the other.
Infinitely many such hyperplanes are possible.
Classification Using a Separating Hyperplane

For a new test instance, which side of the line is it on?
B0 + B1X1 + B2X2 > 0
B0 + B1X1 + B2X2 < 0
Classification Using a Separating Hyperplane

Standard SVM approach:
• Label class data as either +1 or -1, depending on which class an instance belongs to.
• Prediction:
  yi = +1, if B0 + B1x1 + B2x2 + ... + Bnxn > 0
  yi = -1, if B0 + B1x1 + B2x2 + ... + Bnxn < 0
Classification Using a Separating Hyperplane

For a new test instance, which side of the line is it on?
B0 + B1X1 + B2X2 > 0
B0 + B1X1 + B2X2 < 0

Can also look at the magnitude:
• How far from zero? A greater magnitude means a more confident prediction.
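As a rough illustration (not from the slides), this sign-and-magnitude rule can be written directly in Python; the hyperplane coefficients below are made-up values:

```python
import numpy as np

# Hypothetical hyperplane parameters: B0 (intercept) and the weights B1, B2
B0 = -0.5
B = np.array([1.2, -0.8])

def predict(x):
    """Return the +1/-1 class label and the magnitude as a rough confidence."""
    score = B0 + np.dot(B, x)          # B0 + B1*x1 + B2*x2
    label = 1 if score > 0 else -1     # which side of the hyperplane?
    return label, abs(score)           # larger magnitude = farther from the boundary

label, confidence = predict(np.array([2.0, 1.0]))
print(label, confidence)
```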

Some Concerns with this Approach:
• Datasets with more than 2 target classes
• What if a “separating hyperplane” can’t be formed?
• Data with more than two dimensions
• Regression instead of classification
SVMs can deal with each of these.
What if the Data Has More than 2 Dimensions?

The mathematical definition of a hyperplane generalizes to n dimensions:
B0 + B1X1 + B2X2 = 0
B0 + B1X1 + B2X2 + ... + BnXn = 0
B0 + B1X1 + B2X2 + ... + BnXn > 0
B0 + B1X1 + B2X2 + ... + BnXn < 0
Maximum Margin Hyperplane

What’s the best separating hyperplane?
• Intuition: the one that is farthest from the training observations.
• Called the maximum margin hyperplane.
The Margin

B1 and B2 are each separating hyperplanes.
• B1 is better.
Margin: the smallest distance from the hyperplane to the training data.
• Represents the mid-line of the widest “slab” that can be inserted between the two classes.
Maximal Margin Hyperplane

We want the hyperplane that has the greatest margin.
• That is, B1 instead of B2, or any of the other infinitely many separating hyperplanes.
Maximal Margin Hyperplane

Support Vectors: the points in the data that, if moved, would cause the maximal margin hyperplane to move as well.
Moving any of the other data points would not affect the model.
Figuring Out the Maximal Margin Classifier



• Don’t worry, data mining toolkits do it automatically.
• It is an optimization problem.
• Involves calculus.
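For a sense of what such a toolkit call looks like, here is a minimal sketch assuming scikit-learn is available; a linear SVC with a very large C behaves approximately like a maximal margin classifier on separable data (the toy points are invented for the example):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration)
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C leaves almost no tolerance for violations,
# so the fit approximates the maximal margin hyperplane.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # the fitted hyperplane parameters
print(clf.support_vectors_)        # the support vectors found by the optimizer
```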
Support Vector Classifier

The Maximal Margin Classifier is a natural way to perform classification if a separating hyperplane exists.
• Perfect segmentation between the two classes
In many cases, no separating hyperplane will exist.
• Find a hyperplane that almost perfectly segments the classes
• This generalization is called the support vector classifier

Support Vector Classifier


• Maximal Margin Classifier: no training errors allowed
• Support Vector Classifier: tolerates training errors
  • Approach: soft margin
  • Allows construction of a linear decision boundary even when the classes are not linearly separable
Support Vector Classifier
Additional motivation:
• A new data point is added.
• The maximum margin classifier still perfectly segments the training data.
• But there is a dramatic shift in the maximal margin hyperplane.
• The model has high variance when trying to maintain perfect segmentation.
Support Vector Classifier

So, we are interested in:
• Greater robustness to individual data instances
• Better classification of most of the training data

Some misclassifications are permitted:
• “Soft” margin: because the margin can be violated by some of the instances

Red instances (see figure):
• 3, 4, 5, 6 are on the correct side of the margin
• 2 is on the margin
• 1 is on the wrong side of the margin
• 11 is on the wrong side of the hyperplane
Using the Support Vector Classifier for Classification

• Same as before: which side of the line is the test instance on?
Constructing the Support Vector Classifier

• More interesting: how much “softness” (misclassifications) in the soft margin is ideal?
• Complicated math, but Python figures it out.
• Specification of a nonnegative tuning parameter C
  • Generally chosen by the analyst following cross-validation
  • Large C: wider margin; more instances violate the margin
  • Small C: narrower margin; less tolerance for instances that violate the margin

[Figure: the same data points fit with C ranging from larger (wider margin, lower variance) to smaller (narrower margin, higher variance).]
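A rough sketch of choosing the softness by cross-validation with scikit-learn. One assumption worth flagging: scikit-learn’s C parameter is, loosely speaking, inverse to the “budget” C described above, so small scikit-learn C values give the softer, wider margin. The grid values and synthetic data are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data, just for the sketch
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Try several softness settings and keep the one with the best cross-validated accuracy
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```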
Support Vector Machines

What if a non-linear decision boundary is needed?
• Poor performance using this (linear) decision boundary.
Support Vector Machines

Idea: transform the data from its original coordinate space X into a new space Φ(X) so that a linear decision boundary can separate the two classes.
• Φ: a nonlinear transformation

Huh?
• Instead of fitting a support vector classifier using n features:
  X1, X2, …, Xn
• … use 2n features:
  X1, X1², X2, X2², …, Xn, Xn²
Support Vector Machines



• Enlarged “feature space” compared to the original “feature space”
• Can even extend to higher-order polynomial terms.
• Downside: can easily end up with a huge number of features → overfitting

Attribute Transformation
Φ: (x1, x2) → (x1², x2², √2·x1, √2·x2, 1)
Learning a Nonlinear SVM Model

Once again:
• Complicated, but Python does it for us.
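For instance, a minimal sketch assuming scikit-learn: a degree-2 polynomial kernel corresponds roughly to the quadratic transformation above, and the radial (RBF) kernel is another common nonlinear choice; the ring-shaped toy data is generated just for illustration:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Data where one class forms a ring around the other: not linearly separable
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Polynomial kernel (degree 2): works implicitly in the enlarged feature space
poly_svm = SVC(kernel="poly", degree=2, C=1.0).fit(X, y)

# Radial basis function ("radial kernel") alternative
rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)

print(poly_svm.score(X, y), rbf_svm.score(X, y))
```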
Other extensions to SVMs:
• Regression instead of classification
• Categorical variables instead of continuous
• Multiclass problems instead of binary
• Radial “kernels”
  • A circle instead of a hyperplane
Today…

Support Vector Machines
Multiclass Problems

Feature Selection

Multiclass Problems


• Scenario: the target class has more than 2 categories
• Motivation: some machine learning algorithms are designed for binary classification
  • Example: Support Vector Machines (SVM)
• How to extend binary classifiers to handle multiclass problems? Ideas?
#1 - Multiclass: One-Against-Rest



• Assume a multiclass dataset with K target classes
• Decompose into K binary problems
• Idea: for each target class yi, create a single binary problem with classifier Ci:
  • Class yi: positive
  • All other classes: negative
• Training: use all instances; each instance is used in training each of the Ci classifiers
• Testing:
  • Run the test instance through each classifier
  • Record votes for each class yi (a negative prediction is a vote for all other classes)
  • The class with the most votes is the predicted class
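A minimal one-against-rest sketch, assuming scikit-learn and the iris dataset (3 classes) as an example; the wrapper trains the K binary classifiers and aggregates their votes/scores for us:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)    # K = 3 target classes

ovr = OneVsRestClassifier(SVC(kernel="linear"))
ovr.fit(X, y)

print(len(ovr.estimators_))          # 3 underlying binary classifiers, one per class
print(ovr.predict(X[:5]))
```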
#2 - Multiclass: One-Against-One



• Assume a multiclass dataset with K target classes
• Train K(K-1)/2 binary classifiers (many more than One-Against-Rest)
• Idea: each classifier distinguishes between a pair of classes (yi, yj)
  • The classifier ignores records that don’t belong to yi or yj
• Training: use all instances; each instance is only used in training its “relevant classifiers” (K-1 of them)
• Testing:
  • Run the test instance through each classifier
  • Record votes for each class yi
  • The class with the most votes is the predicted class
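Similarly, a one-against-one sketch under the same assumptions (scikit-learn, iris as an example dataset):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)    # K = 3 classes

ovo = OneVsOneClassifier(SVC(kernel="linear"))
ovo.fit(X, y)

print(len(ovo.estimators_))          # K(K-1)/2 = 3 pairwise classifiers
print(ovo.predict(X[:5]))
```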
Today…

Support Vector Machines
Multiclass Problems

Feature Selection

High Dimensionality
• Feature Subset Selection
• Adjusted R² Statistic

High Dimensionality


… can be bad.
• Datasets can have a large number of features
  • Example: big data
  • Example: stock prices (time series)
    • Each stock is an individual instance
    • Features/variables are the closing price on a given day
    • Imagine 30 years’ worth of closing prices (30 x 365 features)
Why is it a Problem?

Often data mining algorithms work better if there is not an overwhelming number of attributes.
• BAD: p > n, where p = # of features and n = # of instances
• We want the “dimensionality” to be lower

“The Curse of Dimensionality”
• As dimensionality increases (more features), the data becomes increasingly sparse in the “feature space” that it occupies.
• Not enough data objects for the number of features that are present
• Result: reduced classification model accuracy

Other Benefits to Dimensionality Reduction
1. More understandable models
   • Learned model may involve fewer attributes
2. Better visualizations
   • Fewer attributes = fewer variables to plot
3. Computational time
   • Fewer attributes = quicker model learning?
4. Elimination of irrelevant features
Techniques for Dimensionality Reduction
1. Advanced: Linear Algebra Techniques
   • Automatic approaches
   • Project data from a high-dimensional space into a lower-dimensional space
     1. Principal Components Analysis (PCA)
     2. Singular Value Decomposition (SVD)
   • Not necessarily interested in “losing information”; rather, eliminate some of the sparsity
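A brief sketch of such a projection with PCA in scikit-learn; the dataset and the choice of 2 components are arbitrary, for illustration only:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 4 original features

# Project from 4 dimensions down to 2 (an arbitrary choice for the sketch)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance retained by each component
```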
Techniques for Dimensionality Reduction
2. Feature Construction
   • Example: combining two separate features (# of full baths, # of half baths) into one feature (“total baths”)
   • Example: combining the features (mass) and (volume) into one feature (density), where density = mass / volume
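For example, a small feature-construction sketch with pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical data with separate mass and volume columns
df = pd.DataFrame({"mass": [10.0, 4.0, 7.5], "volume": [2.0, 1.0, 3.0]})

# Construct the combined feature and drop the originals
df["density"] = df["mass"] / df["volume"]
df = df.drop(columns=["mass", "volume"])

print(df)
```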
Techniques for Dimensionality Reduction
3. Feature Subset Selection
   • Reducing the number of features by only using a subset of the features
   • How many should be in the subset?
   • Are we losing information if we only consider a subset of features?
     • Redundant features (example: (1) purchase price and (2) sales tax)
     • Irrelevant features (example: student ID numbers)
   • By eliminating unnecessary features, we hope for a better model.
Eliminating Redundant and Irrelevant Features
1. Manually via Data Analyst
   • Intuition about the problem domain
2. Systematic Approach
   • Try all possible combinations of feature subsets?
     • See which combination results in the best model
     • For n features, there are 2^n possible combinations of subsets
     • Infeasible to try each of them
Three Systematic Approaches
1. Embedded Approaches
2. Filter Approaches
3. Wrapper Approaches
Embedded Approaches


• Algorithm specific
• Occurs naturally as part of the data mining algorithm
  • Example: present in decision tree induction
    • Only a certain subset of features is used in the final decision tree
  • Example: not present in linear regression
    • The fitted model contains a coefficient for each predictor variable
Filter Approaches

• Features are selected before the data mining algorithm is run
• The filter approach is independent of the data mining task
• Example (trying to eliminate redundant features):
  1. Look at the pairwise correlation between variables; pick a subset of variables that each have low pairwise correlation.
  2. Then use only that subset in the Linear Regression model.
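A rough sketch of such a correlation filter with pandas; the 0.9 threshold, the helper name low_correlation_subset, and the synthetic columns are all invented for illustration:

```python
import numpy as np
import pandas as pd

def low_correlation_subset(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Greedily drop one feature from every highly correlated pair."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return [col for col in df.columns if col not in to_drop]

# Hypothetical data frame where x2 is nearly a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + rng.normal(scale=0.01, size=100),
                   "x3": rng.normal(size=100)})
print(low_correlation_subset(df))   # expected to keep something like ['x1', 'x3']
```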
Wrapper Approaches


• The data mining algorithm is a “black box” for finding the best subset of features
• Tries different combinations of subsets
  • Typically will never enumerate all 2^n possible combinations
  • Will search a feature space that is much smaller
• The final model uses the specific subset that evaluates the best
Top-Down Wrapper


Assuming n features…
1. Start with no attributes.
   • Train the classifier n times, each time with a different single feature.
   • Each classifier only has a single predictor.
   • See which of the n classifiers performs the best.
2. Add to the best classifier: recursively use the remaining attributes to find which attribute improves performance the most.
   • Keep including the best attribute.
   • Stopping criterion: stop if there is no improvement to classifier performance, or the increase in classifier performance is less than some threshold.
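As a sketch, scikit-learn’s SequentialFeatureSelector (in recent versions) implements this kind of greedy forward wrapper; the estimator, dataset, and fixed subset size below are illustrative choices rather than the lecture’s own setup:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # 30 candidate features

# Forward selection: start with no features and greedily add the one whose addition
# gives the biggest cross-validated improvement. Here we simply fix the subset size
# instead of using a score-improvement threshold.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
selector = SequentialFeatureSelector(clf, n_features_to_select=5,
                                     direction="forward", cv=5)
selector.fit(X, y)
print(selector.get_support(indices=True))    # indices of the selected features
```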
Bottom-Up Wrapper



Assuming n features…
• Start with all n attributes in the model.
• Create n models, each with a different predictor omitted.
  • Each classifier has n-1 predictors.
  • See which of the n classifiers affects performance the least; throw that attribute out.
• Recursively find the attribute that affects performance the least.
• Stopping criterion: stop if classifier performance begins to degrade.
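The same utility sketches this backward-elimination variant; only the direction changes (again an illustrative configuration, not the lecture’s code):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Backward elimination: start from all 30 features and repeatedly drop the one
# whose removal hurts the cross-validated score the least.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
selector = SequentialFeatureSelector(clf, n_features_to_select=5,
                                     direction="backward", cv=5)
selector.fit(X, y)
print(selector.get_support(indices=True))
```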
Other Wrappers

• Bi-Directional
  • Combining Top-Down and Bottom-Up
• Greedy Search with Backtracking
  • (if you’re familiar with AI)
• …
Adjusted R² Statistic

Recall the R² statistic that we saw in Linear Regression:
• Measured the proportion of variance explained by the model
• Always a value between 0 and 1; higher is better
• Always increases as more variables are added to the model

R² = (TSS - RSS) / TSS = 1 - RSS / TSS
TSS = Σ(yi - ȳ)²
RSS = Σ(yi - ŷi)²
Adjusted R² Statistic

In contrast to R², Adjusted R² penalizes for unnecessary variables in the model.
• d = number of predictors
• n = number of instances

Adjusted R² = 1 - [RSS / (n - d - 1)] / [TSS / (n - 1)]
TSS = Σ(yi - ȳ)²
RSS = Σ(yi - ŷi)²
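A small sketch computing both statistics directly from these definitions; the numbers are made up for illustration:

```python
import numpy as np

def r2_and_adjusted_r2(y, y_hat, d):
    """Compute R^2 and Adjusted R^2 for predictions y_hat from a model with d predictors."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    rss = np.sum((y - y_hat) ** 2)         # residual sum of squares
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return r2, adj_r2

# Made-up example: 6 observations, a model with d = 2 predictors
y     = [3.0, 4.5, 5.0, 6.5, 8.0, 9.0]
y_hat = [3.2, 4.1, 5.3, 6.4, 7.6, 9.2]
print(r2_and_adjusted_r2(y, y_hat, d=2))
```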