Overview
• Recall last class: Boosting is a way of generating a strong classifier as a weighted ensemble of weak ones
• Today: Support Vector Machine (SVM) training generates a strong classifier directly
• Case Study: Dalal and Triggs pedestrian detector

Support Vector Machines
SVM slides from Kristen Grauman, UT-Austin
Other good resources:
Presentation slides from Christoph Lampert:
https://sites.google.com/site/christophlampert/teaching/kernel-methods-forobject-recognition
Simple tutorial document by Chris Williams:
http://www.inf.ed.ac.uk/teaching/courses/iaml/docs/svm.pdf
Video lecture by Pat Winston:
https://www.youtube.com/watch?v=_PwhiWxHK8o
Linear classifiers
• Find linear function to separate positive and negative examples:
  xi positive:  xi · w + b ≥ 0
  xi negative:  xi · w + b < 0
  where the separating line is w · x + b = 0
• Which line is best?

Lines in R2
Let w = [a, c]ᵀ and x = [x, y]ᵀ. Then the line ax + cy + b = 0 can be written as w · x + b = 0.
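To make the decision rule concrete, here is a minimal NumPy sketch (my own illustration, not from the slides; the weight vector w and offset b are made-up values):

```python
import numpy as np

# Hypothetical line ax + cy + b = 0 written as w.x + b = 0 with w = [a, c].
w = np.array([2.0, -1.0])
b = 0.5

def classify(x):
    """Return +1 if x falls on the positive side of the line, -1 otherwise."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([1.0, 1.0])))    # +1  (2 - 1 + 0.5 = 1.5 >= 0)
print(classify(np.array([-2.0, 3.0])))   # -1  (-4 - 3 + 0.5 = -6.5 < 0)
```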
Support Vector Machines (SVMs)
• Discriminative classifier based on the optimal separating line (for the 2-d case)
• Maximize the margin between the positive and negative training examples

Support vector machines
• Want the line that maximizes the margin.
[Figure: separating line w · x + b = 0 with margin boundaries w · x + b = 1 and w · x + b = −1; the points on the margin boundaries are the support vectors]
  xi positive (yi = 1):  xi · w + b ≥ 1
  xi negative (yi = −1): xi · w + b ≤ −1
  For support vectors:   xi · w + b = ±1
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Lines in R2
Let w = [a, c]ᵀ and x = [x, y]ᵀ, so the line ax + cy + b = 0 is w · x + b = 0.
Distance D from a point (x0, y0) to the line:
  D = (a x0 + c y0 + b) / √(a² + c²) = (wᵀx + b) / ‖w‖
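A quick numerical check of the distance formula (my own sketch; the line coefficients and the point are arbitrary):

```python
import numpy as np

a, c, b = 3.0, 4.0, -5.0            # the line 3x + 4y - 5 = 0
w = np.array([a, c])
x0 = np.array([2.0, 1.0])           # the point (x0, y0)

# Component-wise form: (a*x0 + c*y0 + b) / sqrt(a^2 + c^2)
D_explicit = (a * x0[0] + c * x0[1] + b) / np.sqrt(a**2 + c**2)

# Vector form: (w^T x0 + b) / ||w||
D_vector = (np.dot(w, x0) + b) / np.linalg.norm(w)

print(D_explicit, D_vector)          # both print 1.0
```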
Support vector machines
• Want the line that maximizes the margin.
  xi positive (yi = 1):  xi · w + b ≥ 1
  xi negative (yi = −1): xi · w + b ≤ −1
  For support vectors:   xi · w + b = ±1
• Distance between a point and the line:  |xi · w + b| / ‖w‖
• For support vectors, wᵀxi + b = ±1, so each support vector lies at distance 1/‖w‖ from the line, and the margin is
  M = 1/‖w‖ − (−1/‖w‖) = 2/‖w‖
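To see the margin formula in practice, here is a hedged scikit-learn sketch on toy data (my own example, not part of the lecture); a very large C approximates the hard-margin SVM described here:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable 2-d data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.0], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # large C ~ hard margin

w = clf.coef_[0]
print("w =", w, " b =", clf.intercept_[0])
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```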
Support vector machines
• Want the line that maximizes the margin.
  xi positive (yi = 1):  xi · w + b ≥ 1
  xi negative (yi = −1): xi · w + b ≤ −1
  For support vectors:   xi · w + b = ±1
• Distance between a point and the line:  |xi · w + b| / ‖w‖
• Therefore, the margin is 2 / ‖w‖

Finding the maximum margin line
1. Maximize the margin 2/‖w‖
2. Correctly classify all training data points:
   xi positive (yi = 1):  xi · w + b ≥ 1
   xi negative (yi = −1): xi · w + b ≤ −1
This is a quadratic optimization problem:
   Minimize (1/2) wᵀw
   subject to yi (w · xi + b) ≥ 1, with one constraint for each training point (note the sign trick: multiplying by yi folds both inequalities into one).
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
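The quadratic program above can be written almost line-for-line in a generic convex solver. A minimal sketch using cvxpy on toy data (my own formulation; the course itself does not prescribe a solver):

```python
import numpy as np
import cvxpy as cp

# Toy separable training set (xi, yi) with yi in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# Minimize (1/2) w^T w  subject to  yi (w . xi + b) >= 1  for every training point.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin =", 2.0 / np.linalg.norm(w.value))
```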
Finding the maximum margin line
• Solution: w = Σi αi yi xi   (αi: learned weights; xi: support vectors)
• b = yi − w · xi  (for any support vector)
• Classification function:
  w · x + b = Σi αi yi xi · x + b
  f(x) = sign(w · x + b) = sign(Σi αi yi xi · x + b)
  If f(x) < 0, classify as negative; if f(x) > 0, classify as positive.
• Notice that it relies on an inner product between the test point x and the support vectors xi
• (Solving the optimization problem also involves computing the inner products xi · xj between all pairs of training points)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
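The dual expansion above is exactly what SVM packages store after training: the support vectors, the products αi yi, and b. A hedged scikit-learn sketch (toy data of my own) that evaluates f(x) from those pieces and checks it against the library's own decision value:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.5, 2.0], [2.5, 2.5], [-1.5, -2.0], [-2.5, -1.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel='linear', C=1e6).fit(X, y)

x_test = np.array([1.0, 1.0])

# sum_i (alpha_i y_i) (x_i . x) + b over the support vectors only.
# sklearn's dual_coef_ holds alpha_i * y_i for each support vector.
value = np.dot(clf.dual_coef_[0], clf.support_vectors_ @ x_test) + clf.intercept_[0]

print("f(x) =", np.sign(value))
print("matches decision_function:", np.isclose(value, clf.decision_function([x_test])[0]))
```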
Questions
• What if the features are not 2d?
• What if the data is not linearly separable?
• What to do for more than two classes?
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Questions
• What if the features are not 2d?
  – Generalizes to d dimensions: replace the line with a "hyperplane"
• What if the data is not linearly separable?
• What to do for more than two classes?
Planes in R3
Let w = [a, b, c]ᵀ and x = [x, y, z]ᵀ, so the plane ax + by + cz + d = 0 is w · x + d = 0.
Distance D from a point (x0, y0, z0) to the plane:
  D = (a x0 + b y0 + c z0 + d) / √(a² + b² + c²) = (wᵀx + d) / ‖w‖

Hyperplanes in Rn
A hyperplane H is the set of all vectors x ∈ Rⁿ which satisfy:
  w1 x1 + w2 x2 + … + wn xn + b = 0, i.e. wᵀx + b = 0
Distance from a point x to the hyperplane:
  D(H, x) = (wᵀx + b) / ‖w‖
Nonlinear SVMs •  What if the features are not 2d? •  What if the data is not linearly separable? Slide from Andrew Zisserman
The Kernel Trick
• Recall we transformed linear regression into nonlinear regression using a feature vector Φ(x) and, ultimately, the "kernel trick."
• We also use the kernel trick here to transform a linear classifier into a nonlinear one.

Example Kernels
Slide from Andrew Zisserman

Nonlinear SVMs
• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
  K(xi, xj) = φ(xi) · φ(xj)
• This gives a nonlinear decision boundary in the original feature space:
  Σi αi yi K(xi, x) + b
Slide from Andrew Zisserman
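A hedged sketch of this nonlinear decision function with an RBF kernel (one common choice; toy ring-shaped data of my own). It evaluates Σi αi yi K(xi, x) + b from a fitted model's support vectors and checks it against the library:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data that is not linearly separable: positives inside, negatives outside a circle.
rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

gamma = 1.0
clf = SVC(kernel='rbf', gamma=gamma, C=10.0).fit(X, y)

def K(a, b):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

x_test = np.array([0.2, -0.1])
k_vals = np.array([K(sv, x_test) for sv in clf.support_vectors_])

# Nonlinear decision value: sum_i (alpha_i y_i) K(x_i, x) + b.
value = np.dot(clf.dual_coef_[0], k_vals) + clf.intercept_[0]
print(np.sign(value), np.isclose(value, clf.decision_function([x_test])[0]))
```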
Questions
• What if the features are not 2d?
• What if the data is not linearly separable?
• What to do for more than two classes?
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Multi-class SVMs
• Achieve a multi-class classifier by combining a number of binary classifiers
• One vs. all
  – Training: learn an SVM for each class vs. the rest
  – Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value
• One vs. one
  – Training: learn an SVM for each pair of classes
  – Testing: each learned SVM "votes" for a class to assign to the test example
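Both schemes are available off the shelf; a hedged scikit-learn sketch on a toy 3-class problem (my own data and parameter choices):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Toy 3-class data: three Gaussian blobs in 2-d.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + center for center in ([0, 0], [4, 4], [0, 4])])
y = np.repeat([0, 1, 2], 20)

# One vs. all: one SVM per class vs. the rest; highest decision value wins.
ova = OneVsRestClassifier(SVC(kernel='linear', C=1.0)).fit(X, y)

# One vs. one: one SVM per pair of classes; each SVM votes for a class.
ovo = OneVsOneClassifier(SVC(kernel='linear', C=1.0)).fit(X, y)

x_test = np.array([[3.5, 3.8]])
print(ova.predict(x_test), ovo.predict(x_test))   # both should pick the blob near (4, 4)
```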
Software for SVMs

SVMs for recognition
1. Define a vector representation for each example.
2. Select a kernel function.
3. Compute pairwise kernel values between labeled examples.
4. Give this "kernel matrix" to SVM optimization software to identify support vectors & weights.
5. To classify a new example: compute kernel values between the new input and the support vectors, apply weights, check the sign of the output.
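This recipe maps directly onto libraries that accept a precomputed kernel matrix. A hedged scikit-learn sketch (toy features, and an RBF kernel as one possible choice of kernel function):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)

# 1. A vector representation for each labeled example (random stand-in features).
X_train = rng.randn(40, 5)
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 0, 1, -1)
X_new = rng.randn(3, 5)

# 2. A kernel function (here an RBF kernel).
def kernel_matrix(A, B, gamma=0.5):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

# 3. Pairwise kernel values between labeled examples: the "kernel matrix".
K_train = kernel_matrix(X_train, X_train)

# 4. Hand the kernel matrix to the SVM solver to get support vectors & weights.
clf = SVC(kernel='precomputed', C=1.0).fit(K_train, y_train)

# 5. To classify new examples: kernel values between the new inputs and the
#    training examples (only the support-vector columns matter), then take the sign.
print(clf.predict(kernel_matrix(X_new, X_train)))
```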
Case Study: Pedestrian Detector (Dalal and Triggs, CVPR'05)
Navneet Dalal and Bill Triggs, "Histograms of Oriented Gradients for Human Detection," CVPR 2005
• Detect upright pedestrians
• Histogram of oriented gradient (HoG) feature vector
• Linear SVM classifier; sliding window detector (64×128 window)
HoG Feature Extraction
HoG descriptor

HoG Feature Extraction: Cells
• Compute gradients over the 64×128 detection window
• Each cell contains a histogram of gradient orientations, weighted by gradient magnitude
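As a hedged illustration (not Dalal and Triggs' original code), scikit-image's hog function computes this kind of descriptor; the parameters below mirror the cell/block layout described in the lecture:

```python
import numpy as np
from skimage.feature import hog

# Stand-in 64x128 grayscale window (random; in practice a cropped image patch).
window = np.random.rand(128, 64)

descriptor = hog(window,
                 orientations=9,          # 9 orientation bins per cell
                 pixels_per_cell=(8, 8),  # 8x8-pixel cells
                 cells_per_block=(2, 2),  # 2x2 cells per normalization block
                 block_norm='L2-Hys')

# 15 x 7 block positions x 4 cells x 9 bins = 3780 values for a 64x128 window.
print(descriptor.shape)
```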
"Each scalar cell response contributes several components to the final descriptor vector, each normalized with respect to a different block. This may seem redundant but good normalization is critical and including overlap significantly improves the performance." (Dalal & Triggs, CVPR'05)
HoG Feature Extraction: Blocks
[Figure: cell histograms are concatenated and normalized over each block; R-HOG/SIFT layout (2×2 block of cells) vs. C-HOG layout (center bin)]

HoG Design Choices
Parameters:
• Gradient scale
• Orientation bins
• Block overlap area
Other choices:
• RGB or Lab, color/gray
• Block normalization: L2-hys, v ← v / √(‖v‖₂² + ε²), or L1-sqrt, v ← √(v / (‖v‖₁ + ε))
Parameter / design choices were guided by extensive experimentation to determine empirical effects on detector performance (e.g. miss rate).
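A small NumPy sketch of the two normalization schemes on a single 36-dimensional block vector (my own code; ε is an arbitrary small constant, and the 0.2 clipping threshold follows the Lowe-style L2-Hys convention):

```python
import numpy as np

def l2_hys(v, eps=1e-5, clip=0.2):
    """L2-Hys: L2-normalize, clip large components, then renormalize."""
    v = v / np.sqrt(np.sum(v**2) + eps**2)
    v = np.minimum(v, clip)
    return v / np.sqrt(np.sum(v**2) + eps**2)

def l1_sqrt(v, eps=1e-5):
    """L1-sqrt: v <- sqrt(v / (||v||_1 + eps)) for non-negative histogram entries."""
    return np.sqrt(v / (np.sum(np.abs(v)) + eps))

block = np.random.rand(36)     # 2x2 cells x 9 orientation bins
print(l2_hys(block)[:4])
print(l1_sqrt(block)[:4])
```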
Dalal & Triggs Detector
• Default detector configuration:
  – RGB colour space with no gamma correction;
  – [−1, 0, 1] gradient filter with no smoothing;
  – linear gradient voting into 9 orientation bins in 0°–180°;
  – 16×16 pixel blocks of four 8×8 pixel cells;
  – Gaussian spatial window with σ = 8 pixels;
  – L2-Hys (Lowe-style clipped L2 norm) block normalization;
  – block spacing stride of 8 pixels (hence 4-fold coverage of each cell);
  – 64×128 detection window;
  – linear SVM classifier.

Detector Architecture
Learning Phase: create normalised training data set → encode images into feature vectors → learn binary classifier → object/non-object decision
Detection Phase: scan image at all scales and locations → run classifier to obtain object/non-object decisions → fuse multiple detections in 3-D position & scale space → object detections with bounding boxes

Positive and negative examples
+ thousands more… + millions more…
Person detection with HoG & linear SVM
• Soft (C = 0.01) linear SVM trained with SVMLight
[Dalal and Triggs, CVPR 2005]

To detect people at all locations and scales:
• Sliding window using learnt HOG template
• Post-processing using non-maxima suppression
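For a runnable flavour of this pipeline, OpenCV ships a HOG descriptor with a pretrained people-detection SVM in the Dalal-Triggs style. A hedged sketch (the input filename "street.jpg" is a placeholder):

```python
import cv2

# HOG descriptor (64x128 window) with OpenCV's pretrained linear-SVM people detector.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street.jpg")                        # hypothetical input image

# Sliding-window detection over an image pyramid (all locations and scales).
rects, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)

for (x, y, w, h) in rects:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.jpg", img)
```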
Non-maximum Suppression across Scales
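For intuition, here is a generic greedy non-maximum suppression sketch over scored boxes (my own simplification; Dalal and Triggs actually fuse detections in 3-D position and scale space rather than by greedy overlap pruning):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    boxes is an (N, 4) array of (x1, y1, x2, y2) corners."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection-over-union of box i with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou < iou_thresh]
    return keep

boxes = np.array([[10, 10, 74, 138], [14, 12, 78, 140], [200, 50, 264, 178]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # the two overlapping windows collapse to one: [0, 2]
```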
Dalal and Triggs Summary
• HoG feature representation
• Linear SVM classifier; sliding window detector
• Non-maximum suppression across scale
• Use of detector performance metrics to guide tuning of system parameters
• Detection rate 90% at 10⁻⁴ FP per window
• Slower than Viola-Jones detector