CS 1674: Intro to Computer Vision
Support Vector Machines
Prof. Adriana Kovashka
University of Pittsburgh
October 31, 2016
Plan for today
• Support vector machines
– Separable case / non-separable case
– Linear / non-linear (kernels)
• The importance of generalization
– The bias-variance trade-off (applies to all
classifiers)
Lines in R2
Let w = [a, c]ᵀ and x = [x, y]ᵀ. Then the line ax + cy + b = 0 can be written compactly as w · x + b = 0.
The distance from a point (x0, y0) to the line is
D = |a x0 + c y0 + b| / sqrt(a² + c²) = |w · x0 + b| / ||w||   (distance from point to line)
Kristen Grauman
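As a quick check of this distance formula, here is a minimal Python sketch (the function name and the example line are my own, purely illustrative):

```python
import numpy as np

def point_to_line_distance(w, b, x0):
    """Distance from point x0 to the line (hyperplane) w . x + b = 0."""
    return abs(np.dot(w, x0) + b) / np.linalg.norm(w)

# Example: the line x + y - 1 = 0, i.e., w = [1, 1], b = -1.
w = np.array([1.0, 1.0])
b = -1.0
print(point_to_line_distance(w, b, np.array([0.0, 0.0])))  # 1/sqrt(2) ≈ 0.707
```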
Linear classifiers
• Find linear function to separate positive and
negative examples
xi positive:  xi · w + b ≥ 0
xi negative:  xi · w + b < 0
Which line
is best?
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Support vector machines
• Discriminative
classifier based on
optimal separating
line (for 2d case)
• Maximize the
margin between the
positive and
negative training
examples
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Support vector machines
• Want line that maximizes the margin.
xi positive (yi = 1):   xi · w + b ≥ 1
xi negative (yi = −1):  xi · w + b ≤ −1
For support vectors:    xi · w + b = ±1
Distance between point and line:  |xi · w + b| / ||w||
For support vectors, wᵀx + b = ±1, so their distance to the line is 1 / ||w||.
Therefore, the margin is 2 / ||w||.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Finding the maximum margin line
1. Maximize margin 2/||w||
2. Correctly classify all training data points:
xi positive ( yi  1) :
xi  w  b  1
xi negative ( yi  1) :
xi  w  b  1
Quadratic optimization problem:
1 T
Minimize
w w
2
Subject to yi(w·xi+b) ≥ 1
One constraint for each
training point.
Note sign trick.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
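As an aside (not from the slides), this quadratic program can be handed to an off-the-shelf solver and the margin checked against 2/||w||. The sketch below assumes scikit-learn's SVC with a linear kernel and a very large C to approximate the hard-margin case, on a made-up separable toy set:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable 2-D toy set (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],   # positives
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.5]])  # negatives
y = np.array([1, 1, 1, -1, -1, -1])

# Large C ~ hard margin: minimize (1/2) w.w subject to yi(w.xi + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

print("w =", w, " b =", b)
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```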
Finding the maximum margin line
• Solution: w = Σi αi yi xi
  (αi: learned weight; xi: support vector)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Finding the maximum margin line
• Solution: w = Σi αi yi xi
b = yi – w·xi (for any support vector)
• Classification function:
f ( x)  sign (w  x  b)
 sign
 y x  x  b
i i i i
If f(x) < 0, classify as negative, otherwise classify as positive.
• Notice that it relies on an inner product between the test
point x and the support vectors xi
• (Solving the optimization problem also involves
computing the inner products xi · xj between all pairs of
training points)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
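As a minimal sketch of this decision rule in code (the support vectors, weights αi, and bias below are hypothetical numbers, just to show the computation):

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b):
    """f(x) = sum_i alpha_i * y_i * (x_i . x) + b, as in the slide above."""
    return np.sum(alphas * labels * (support_vectors @ x)) + b

def svm_classify(x, support_vectors, alphas, labels, b):
    return 1 if svm_decision(x, support_vectors, alphas, labels, b) >= 0 else -1

# Hypothetical learned quantities, just to show the shapes involved.
sv = np.array([[2.0, 2.0], [1.0, 0.5]])
alphas = np.array([0.5, 0.5])
labels = np.array([1, -1])
b = -1.0
print(svm_classify(np.array([3.0, 3.0]), sv, alphas, labels, b))  # prints 1
```

Note that only inner products between the test point and the support vectors are needed, which is exactly what makes the kernel substitution in the next slides possible.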
Nonlinear SVMs
• Datasets that are linearly separable work out great.
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space.
[Figures: 1-D data on an x axis that is not linearly separable; after mapping to (x, x²) it becomes separable]
Andrew Moore
Nonlinear SVMs
• General idea: the original input space can always be
mapped to some higher-dimensional feature space
where the training set is separable:
Φ: x → φ(x)
Andrew Moore
Nonlinear kernel: Example
• Consider the mapping φ(x) = (x, x²)
φ(x) · φ(y) = (x, x²) · (y, y²) = xy + x²y²
K(x, y) = xy + x²y²
Svetlana Lazebnik
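A quick numerical check of this example (my own sketch): the kernel value equals the dot product of the lifted features.

```python
import numpy as np

def phi(x):
    """Explicit lifting map from the slide: x -> (x, x^2)."""
    return np.array([x, x**2])

def K(x, y):
    """Kernel from the slide: K(x, y) = x*y + x^2 * y^2."""
    return x * y + (x**2) * (y**2)

x, y = 2.0, 3.0
print(np.dot(phi(x), phi(y)))  # 2*3 + 4*9 = 42
print(K(x, y))                 # also 42: the kernel equals the lifted dot product
```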
The “Kernel Trick”
• The linear classifier relies on dot product between
vectors K(xi ,xj) = xi · xj
• If every data point is mapped into high-dimensional
space via some transformation Φ: xi → φ(xi ), the dot
product becomes: K(xi ,xj) = φ(xi ) · φ(xj)
• A kernel function is a similarity function that corresponds
to an inner product in some expanded feature space
• The kernel trick: instead of explicitly computing the
lifting transformation φ(x), define a kernel function K
such that: K(xi ,xj) = φ(xi ) · φ(xj)
Andrew Moore
Examples of kernel functions
• Linear:  K(xi, xj) = xiᵀxj
• Polynomials of degree up to d:  K(xi, xj) = (xiᵀxj + 1)^d
• Gaussian RBF:  K(xi, xj) = exp(−||xi − xj||² / (2σ²))
• Histogram intersection:  K(xi, xj) = Σk min(xi(k), xj(k))
Andrew Moore / Carlos Guestrin
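For concreteness, here is a small Python sketch of these four kernels (my own illustration; the parameters d and sigma follow the formulas above):

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, d=3):
    return (np.dot(xi, xj) + 1) ** d

def gaussian_rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def histogram_intersection_kernel(xi, xj):
    # xi, xj are histograms (e.g., bag-of-words counts); sum of bin-wise minima.
    return np.sum(np.minimum(xi, xj))
```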
Allowing misclassifications: Before
The w that minimizes (1/2) wᵀw
(maximize margin)
Allowing misclassifications: After
The w that minimizes (1/2) wᵀw + C Σi ξi, summing over the n data samples
(C = misclassification cost, ξi = slack variables)
(maximize margin + minimize misclassification)
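To see what the misclassification cost C does in practice, here is a small sketch (my own illustration, assuming scikit-learn's SVC, which optimizes this C-weighted soft-margin objective): a small C tolerates slack and keeps a wide margin, a large C penalizes violations and narrows it.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, overlapping 2-D blobs (made up for illustration).
X = np.vstack([rng.normal([2, 2], 1.0, (50, 2)),
               rng.normal([0, 0], 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin={margin:.3f}, #support vectors={len(clf.support_)}")
```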
What about multi-class SVMs?
• Unfortunately, there is no “definitive” multi-class SVM
formulation
• In practice, we have to obtain a multi-class SVM by
combining multiple two-class SVMs
• One vs. others
– Training: learn an SVM for each class vs. the others
– Testing: apply each SVM to the test example, and assign it to the
class of the SVM that returns the highest decision value
• One vs. one
– Training: learn an SVM for each pair of classes
– Testing: each learned SVM “votes” for a class to assign to the
test example
Svetlana Lazebnik
Multi-class problems
• One-vs-all (a.k.a. one-vs-others)
– Train K classifiers
– In each, pos = data from class i, neg = data from
classes other than i
– The class with the most confident prediction wins
– Example:
• You have 4 classes, train 4 classifiers
• 1 vs others: score 3.5
• 2 vs others: score 6.2
• 3 vs others: score 1.4
• 4 vs others: score 5.5
• Final prediction: class 2
Multi-class problems
• One-vs-one (a.k.a. all-vs-all)
– Train K(K-1)/2 binary classifiers (all pairs of classes)
– They all vote for the label
– Example (see the sketch after this slide):
• You have 4 classes, so you train 6 classifiers
• 1 vs 2, 1 vs 3, 1 vs 4, 2 vs 3, 2 vs 4, 3 vs 4
• Votes: 1, 1, 4, 2, 4, 4
• Final prediction is class 4
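A tiny sketch of both decision rules, reusing the made-up scores from the one-vs-all example and the votes above (my own illustration):

```python
from collections import Counter

# One-vs-all: pick the class whose classifier is most confident.
ova_scores = {1: 3.5, 2: 6.2, 3: 1.4, 4: 5.5}
print("one-vs-all prediction:", max(ova_scores, key=ova_scores.get))  # class 2

# One-vs-one: each pairwise classifier votes; the most-voted class wins.
ovo_votes = [1, 1, 4, 2, 4, 4]   # votes from the 6 pairwise classifiers
print("one-vs-one prediction:", Counter(ovo_votes).most_common(1)[0][0])  # class 4
```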
SVMs for recognition
1. Define your representation for each
example.
2. Select a kernel function.
3. Compute pairwise kernel values
between labeled examples
4. Use this “kernel matrix” to solve for
SVM support vectors & weights.
5. To classify a new example: compute
kernel values between new input
and support vectors, apply weights,
check sign of output.
Kristen Grauman
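A minimal sketch of steps 1-5, assuming scikit-learn's SVC with a precomputed kernel matrix (the random "histogram" features and the histogram-intersection kernel are illustrative choices of mine, not prescribed by the slide):

```python
import numpy as np
from sklearn.svm import SVC

def hist_intersection(A, B):
    """Pairwise histogram-intersection kernel matrix between rows of A and B."""
    return np.array([[np.sum(np.minimum(a, b)) for b in B] for a in A])

# 1-2. Represent each example (here: random "histograms") and pick a kernel.
rng = np.random.default_rng(0)
X_train = rng.random((20, 8)); y_train = rng.integers(0, 2, 20) * 2 - 1
X_test = rng.random((5, 8))

# 3-4. Compute the kernel matrix between labeled examples and train the SVM.
K_train = hist_intersection(X_train, X_train)
clf = SVC(kernel="precomputed").fit(K_train, y_train)

# 5. Classify new examples: kernel values between test inputs and training examples.
K_test = hist_intersection(X_test, X_train)
print(clf.predict(K_test))
```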
Example: learning gender with SVMs
Moghaddam and Yang, Learning Gender with Support Faces,
TPAMI 2002.
Moghaddam and Yang, Face & Gesture 2000.
Kristen Grauman
Learning gender with SVMs
• Training examples:
– 1044 males
– 713 females
• Experiment with various kernels, select Gaussian RBF:
K(xi, xj) = exp(−||xi − xj||² / (2σ²))
Kristen Grauman
Support Faces
Moghaddam and Yang, Learning Gender with Support Faces, TPAMI 2002.
Gender perception experiment:
How well can humans do?
• Subjects:
– 30 people (22 male, 8 female)
– Ages mid-20’s to mid-40’s
• Test data:
– 254 face images (6 males, 4 females)
– Low res and high res versions
• Task:
– Classify as male or female, forced choice
– No time limit
Moghaddam and Yang, Face & Gesture 2000.
Gender perception experiment:
How well can humans do?
[Figures: human classification error rates on the high-resolution and low-resolution test images]
Moghaddam and Yang, Face & Gesture 2000.
Human vs. Machine
• SVMs performed
better than any
single human
test subject, at
either resolution
Kristen Grauman
SVMs: Pros and cons
• Pros
• Many publicly available SVM packages:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
or use built-in Matlab version (but slower)
• Kernel-based framework is very powerful, flexible
• Often a sparse set of support vectors – compact at test time
• Work very well in practice, even with very small training
sample sizes
• Cons
• No “direct” multi-class SVM, must combine two-class SVMs
• Can be tricky to select best kernel function for a problem
• Computation, memory
– During training time, must compute matrix of kernel values for
every pair of examples
– Learning can take a very long time for large-scale problems
Adapted from Lana Lazebnik
Precision / Recall / F-measure
True positives
(images that contain people)
True negatives
(images that do not contain people)
Predicted positives
(images predicted to contain people)
Predicted negatives
(images predicted not to contain people)
• Precision = 2 / 5 = 0.4
• Recall = 2 / 4 = 0.5
• F-measure = 2*0.4*0.5 / (0.4+0.5) = 0.44
Accuracy: 5 / 10 = 0.5
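These numbers follow directly from the counts; here is a tiny sketch using the counts implied by the example above (2 true positives, 3 false positives, 2 false negatives, 3 true negatives out of 10 images):

```python
def prf(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy

# Counts implied by the example above.
print(prf(tp=2, fp=3, fn=2, tn=3))  # (0.4, 0.5, 0.444..., 0.5)
```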
Generalization
Training set (labels known)
Test set (labels
unknown)
• How well does a learned model generalize from
the data it was trained on to a new test set?
Slide credit: L. Lazebnik
Generalization
• Components of generalization error
– Bias: how much the average model over all training sets differs
from the true model
• Error due to inaccurate assumptions/simplifications made by
the model
– Variance: how much models estimated from different training
sets differ from each other
• Underfitting: model is too “simple” to represent all the
relevant class characteristics
– High bias and low variance
– High training error and high test error
• Overfitting: model is too “complex” and fits irrelevant
characteristics (noise) in the data
– Low bias and high variance
– Low training error and high test error
Slide credit: L. Lazebnik
Bias-Variance Trade-off
• Models with too few
parameters are
inaccurate because of a
large bias (not enough
flexibility).
• Models with too many
parameters are
inaccurate because of a
large variance (too much
sensitivity to the sample).
Slide credit: D. Hoiem
Fitting a model
Is this a good fit?
Figures from Bishop
With more training data
Figures from Bishop
Bias-variance tradeoff
[Figure: training error and test error vs. model complexity; the low-complexity end (high bias, low variance) underfits, the high-complexity end (low bias, high variance) overfits]
Slide credit: D. Hoiem
Bias-variance tradeoff
[Figure: test error vs. model complexity, with one curve for few training examples and one for many training examples, again spanning the high bias / low variance to low bias / high variance range]
Slide credit: D. Hoiem
Choosing the trade-off
• Need validation set
• Validation set is separate from the test set
[Figure: training error and validation error vs. model complexity, from the high bias / low variance end to the low bias / high variance end]
Slide credit: D. Hoiem
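A minimal sketch of this selection procedure, under my own assumptions (noisy sine data in the spirit of the Bishop figures, polynomial degree as the complexity knob, plain NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy samples of sin(2*pi*x), as in Bishop-style curve-fitting examples.
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_train, y_train = make_data(15)
x_val, y_val = make_data(15)     # held-out validation set, separate from any test set

best = None
for degree in range(1, 10):      # model complexity
    coeffs = np.polyfit(x_train, y_train, degree)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    if best is None or val_err < best[1]:
        best = (degree, val_err)
print("degree chosen by validation error:", best[0])
```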
Effect of Training Size
[Figure: for a fixed prediction model, testing error, training error, and generalization error as a function of the number of training examples]
Adapted from D. Hoiem
How to reduce variance?
• Choose a simpler classifier
• Use fewer features
• Get more training data
• Regularize the parameters
Slide credit: D. Hoiem
Regularization
[Figures from Bishop: polynomial fit with no regularization vs. with huge regularization]
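To make the regularization knob concrete, here is a tiny ridge-style sketch on degree-9 polynomial features (my own illustration, not the Bishop experiment itself): with λ near zero the weights blow up (the overfitting regime), while a huge λ shrinks them toward zero (the heavily regularized regime).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)
Phi = np.vander(x, 10)                 # degree-9 polynomial features

for lam in [0.0, 1e-3, 1e3]:           # no, mild, huge regularization
    # Ridge solution: minimize ||Phi w - y||^2 + lam * ||w||^2
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)
    print(f"lambda={lam:g}: ||w|| = {np.linalg.norm(w):.2f}")
```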
Characteristics of vision learning problems
• Lots of continuous features
– Spatial pyramid may have ~15,000 features
• Imbalanced classes
– Often limited positive examples, practically infinite
negative examples
• Difficult prediction tasks
• Recently, massive training sets have become available
– If we have a massive training set, we want classifiers
with low bias (high variance is ok) and reasonably
efficient training
Adapted from D. Hoiem
Remember…
• No free lunch: machine learning
algorithms are tools
• Three kinds of error
– Inherent: unavoidable
– Bias: due to over-simplifications
– Variance: due to inability to perfectly estimate parameters
from limited data
• Try simple classifiers first
• Better to have smart features and simple classifiers
than simple features and smart classifiers
• Use increasingly powerful classifiers with more
training data (bias-variance tradeoff)
Adapted from D. Hoiem