Overview
• Recall last class: Boosting is a way of generating a strong classifier as a weighted ensemble of weak ones.
• Today: Support Vector Machine (SVM) training generates a strong classifier directly.
• Case study: the Dalal and Triggs pedestrian detector.

Support Vector Machines
SVM slides from Kristen Grauman, UT-Austin. Other good resources:
• Presentation slides from Christoph Lampert: https://sites.google.com/site/christophlampert/teaching/kernel-methods-forobject-recognition
• Simple tutorial document by Chris Williams: http://www.inf.ed.ac.uk/teaching/courses/iaml/docs/svm.pdf
• Video lecture by Pat Winston: https://www.youtube.com/watch?v=_PwhiWxHK8o

Linear classifiers
• Find a linear function to separate positive and negative examples:
  xi positive: xi · w + b ≥ 0
  xi negative: xi · w + b < 0
• Many separating lines exist — which line is best?

Lines in R2
• Let w = [a, c]^T and x = [x, y]^T. Then ax + cy + b = 0 is the line w · x + b = 0.
• Distance from a point (x0, y0) to the line: D = |ax0 + cy0 + b| / sqrt(a^2 + c^2) = |w^T x0 + b| / ||w||.

Support vector machines
• Discriminative classifier based on the optimal separating line (in the 2D case).
• Want the line that maximizes the margin between the positive and negative training examples:
  xi positive (yi = 1): xi · w + b ≥ 1
  xi negative (yi = −1): xi · w + b ≤ −1
• For the support vectors, xi · w + b = ±1; they lie on the lines w · x + b = 1 and w · x + b = −1.
• Distance between a point and the line: |xi · w + b| / ||w||.
• For support vectors, (w^T x + b) / ||w|| = ±1 / ||w||, so the margin is M = 1/||w|| − (−1/||w||) = 2/||w||.

Finding the maximum margin line
1. Maximize the margin 2/||w||.
2. Correctly classify all training data points:
   xi positive (yi = 1): xi · w + b ≥ 1
   xi negative (yi = −1): xi · w + b ≤ −1
Equivalently, solve the quadratic optimization problem:
   Minimize (1/2) w^T w subject to yi (w · xi + b) ≥ 1, with one constraint per training point.
(Note the sign trick: multiplying by the label yi folds both inequalities into a single constraint.)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998.
A minimal numerical sketch of this training problem follows below.
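The slides do not include code, so the following is only an illustrative sketch of the hard-margin training problem above. It assumes NumPy and scikit-learn are available and uses a made-up, linearly separable toy data set; a very large C is used so that the soft-margin solver approximates the hard-margin QP.

```python
# Minimal sketch of max-margin linear SVM training (illustrative only; toy data).
import numpy as np
from sklearn.svm import SVC

# Toy 2D training set: two linearly separable clusters.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],        # positives
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])  # negatives
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin QP:
#   minimize (1/2) w.w   subject to   y_i (w.x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]           # learned weight vector
b = clf.intercept_[0]      # learned bias
print("w =", w, " b =", b)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)

# Classification: f(x) = sign(w.x + b)
x_test = np.array([0.5, 0.5])
print("f(x_test) =", np.sign(w @ x_test + b))
```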
Finding the maximum margin line
• Solution: w = Σi αi yi xi, a weighted sum of the support vectors (the αi are the learned weights).
• b = yi − w · xi (for any support vector xi).
• Classification function:
  f(x) = sign(w · x + b) = sign(Σi αi yi xi · x + b)
  If f(x) < 0, classify as negative; if f(x) > 0, classify as positive.
• Notice that the classifier relies only on inner products between the test point x and the support vectors xi.
• (Solving the optimization problem also involves computing the inner products xi · xj between all pairs of training points.)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998.

Questions
• What if the features are not 2D?
• What if the data is not linearly separable?
• What to do for more than two classes?

What if the features are not 2D?
• The formulation generalizes to d dimensions: replace the line with a hyperplane.
• Planes in R3: let w = [a, b, c]^T and x = [x, y, z]^T. The plane is ax + by + cz + d = 0, i.e. w · x + d = 0, and the distance from a point (x0, y0, z0) to the plane is D = |ax0 + by0 + cz0 + d| / sqrt(a^2 + b^2 + c^2) = |w^T x0 + d| / ||w||.
• Hyperplanes in Rn: a hyperplane H is the set of all vectors x ∈ Rn satisfying w1 x1 + w2 x2 + … + wn xn + b = 0, i.e. w^T x + b = 0. The distance from a point x to H is D(H, x) = |w^T x + b| / ||w||.

What if the data is not linearly separable? Nonlinear SVMs
(Slides from Andrew Zisserman)
• Recall that we transformed linear regression into nonlinear regression using a feature vector Φ(x) and, ultimately, the "kernel trick." We use the same trick here to turn a linear classifier into a nonlinear one.
• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that K(xi, xj) = φ(xi) · φ(xj).
• This gives a nonlinear decision boundary in the original feature space: Σi αi yi K(xi, x) + b.

Multi-class SVMs
• Achieve a multi-class classifier by combining a number of binary classifiers.
• One vs. all
  – Training: learn an SVM for each class vs. the rest.
  – Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value.
• One vs. one
  – Training: learn an SVM for each pair of classes.
  – Testing: each learned SVM "votes" for a class to assign to the test example.

Software for SVMs
(Slide listing SVM software packages; not reproduced here. The case study below trains with SVMLight.)

SVMs for recognition
1. Define a vector representation for each example.
2. Select a kernel function.
3. Compute pairwise kernel values between labeled examples.
4. Give this "kernel matrix" to SVM optimization software to identify support vectors and weights.
5. To classify a new example: compute kernel values between the new input and the support vectors, apply the weights, and check the sign of the output.
A rough sketch of this pipeline using a precomputed kernel matrix is shown below.
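The lecture gives no code for this recognition pipeline; the following sketch is my own illustration of steps 1–5 using scikit-learn's SVC with a precomputed kernel. The RBF kernel, the gamma value, and the random toy feature vectors are all assumptions made for the example.

```python
# Illustrative sketch of the "SVMs for recognition" pipeline (steps 1-5).
# The RBF kernel and toy 10-D features are example choices, not from the lecture.
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

# 1. Vector representation for each labeled example (toy 10-D features).
X_train = np.vstack([np.random.randn(10, 10) + 2.0,    # class +1
                     np.random.randn(10, 10) - 2.0])   # class -1
y_train = np.array([1] * 10 + [-1] * 10)

# 2-3. Select a kernel and compute pairwise kernel values (the kernel matrix).
K_train = rbf_kernel(X_train, X_train)

# 4. Give the kernel matrix to the SVM solver to find support vectors and weights.
clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y_train)

# 5. To classify a new example: kernel values between the new input and the
#    training points (only the support-vector entries matter), then the sign of
#    sum_i alpha_i y_i K(x_i, x) + b.
x_new = np.random.randn(1, 10)
K_new = rbf_kernel(x_new, X_train)        # shape (1, n_train)
print("predicted label:", clf.predict(K_new)[0])
print("decision value :", clf.decision_function(K_new)[0])
```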
Case Study: Dalal and Triggs Pedestrian Detector
Navneet Dalal and Bill Triggs, "Histograms of Oriented Gradients for Human Detection," CVPR 2005.
• Detect upright pedestrians.
• Histogram of Oriented Gradients (HoG) feature vector computed over a 64×128 detection window.
• Linear SVM classifier; sliding-window detector.

HoG Feature Extraction: Cells
• Compute gradients over the 64×128 window.
• Each cell contains a histogram of gradient orientations, weighted by gradient magnitude.

HoG Feature Extraction: Blocks
• Cells are grouped into overlapping blocks; the cell histograms in each block are concatenated and normalized.
• (Figure: R-HOG/SIFT-style rectangular 2×2 blocks of cells vs. C-HOG circular layout with a center bin.)
• "Each scalar cell response contributes several components to the final descriptor vector, each normalized with respect to a different block. This may seem redundant but good normalization is critical and including overlap significantly improves the performance." (Dalal & Triggs, CVPR 2005)

HoG Design Choices
• Parameters: gradient scale, orientation bins, block overlap area.
• Other choices: RGB or Lab color vs. grayscale; block normalization scheme, e.g. L2-Hys, L2-norm v ← v / sqrt(||v||_2^2 + ε^2), or L1-sqrt v ← sqrt(v / (||v||_1 + ε)).
• Parameter and design choices were guided by extensive experimentation to determine their empirical effect on detector performance (e.g., miss rate).

Dalal & Triggs Detector: Default Configuration
– RGB colour space with no gamma correction;
– [−1, 0, 1] gradient filter with no smoothing;
– linear gradient voting into 9 orientation bins over 0°–180°;
– 16×16 pixel blocks of four 8×8 pixel cells;
– Gaussian spatial window with σ = 8 pixels;
– L2-Hys (Lowe-style clipped L2 norm) block normalization;
– block spacing stride of 8 pixels (hence 4-fold coverage of each cell);
– 64×128 detection window;
– linear SVM classifier.

Detector Architecture
Learning phase:
• Positive and negative examples (+ thousands more positives, + millions more negatives).
• Create a normalised training data set.
• Encode images into feature vectors.
• Learn a binary classifier (object / non-object decision).
Detection phase:
• Scan the image at all scales and locations.
• Encode each window into a feature vector and run the classifier to obtain object/non-object decisions.
• Fuse multiple detections in 3-D position and scale space.
• Output object detections with bounding boxes.
A rough sketch of the HoG feature-extraction step for a single detection window follows below.
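The slides describe descriptor extraction only in prose and figures; as an illustration (my addition, not from the lecture), the sketch below uses scikit-image's hog function with settings chosen to mirror the default configuration above. The random array stands in for a real 64×128 grayscale window, and scikit-image's implementation differs in some details from the original.

```python
# Sketch of HoG descriptor extraction for one 64x128 detection window,
# using scikit-image with parameters matching the default configuration above.
import numpy as np
from skimage.feature import hog

# Stand-in for a 64x128 (width x height) grayscale window: rows=128, cols=64.
window = np.random.rand(128, 64)

descriptor = hog(
    window,
    orientations=9,            # 9 orientation bins over 0-180 degrees
    pixels_per_cell=(8, 8),    # 8x8 pixel cells
    cells_per_block=(2, 2),    # 16x16 pixel blocks of four cells
    block_norm="L2-Hys",       # Lowe-style clipped L2 normalization
    feature_vector=True,
)

# With these settings: 7 x 15 block positions x (2x2 cells) x 9 bins = 3780 values.
print(descriptor.shape)        # (3780,)
```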
Person Detection with HoG & Linear SVM
Soft-margin (C = 0.01) linear SVM trained with SVMLight.
To detect people at all locations and scales:
• Sliding window over the image using the learnt HoG template.
• Post-processing using non-maximum suppression across position and scale.
[Dalal and Triggs, CVPR 2005]
A rough sketch of this sliding-window and non-maximum-suppression loop follows the summary below.

Dalal and Triggs Summary
• HoG feature representation.
• Linear SVM classifier; sliding-window detector.
• Non-maximum suppression across scale.
• Use of detector performance metrics to guide tuning of system parameters.
• Detection rate of 90% at 10^-4 false positives per window.
• Slower than the Viola-Jones detector.
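The lecture shows the detection pipeline only as figures, so the sketch below is my own illustration of a basic multi-scale sliding-window loop with a simple greedy, overlap-based non-maximum suppression. The window stride, the integer downscaling factors, the score threshold, the IoU-based NMS, and the hog_descriptor helper are all assumptions for the example; Dalal and Triggs actually fuse detections with mean-shift in 3-D position-scale space. The weights w and bias b are assumed to come from a trained linear SVM (e.g., clf.coef_[0] and clf.intercept_[0]).

```python
# Illustrative multi-scale sliding-window detector with greedy NMS.
# Stride, scales, threshold, and IoU-based NMS are example choices,
# not the exact Dalal-Triggs settings.
import numpy as np
from skimage.feature import hog

WIN_H, WIN_W = 128, 64          # 64x128 detection window (rows x cols)

def hog_descriptor(patch):
    """HoG descriptor for one 64x128 window (same settings as the earlier sketch)."""
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys", feature_vector=True)

def score_window(patch, w, b):
    """Linear SVM decision value w . phi(patch) + b."""
    return float(w @ hog_descriptor(patch) + b)

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def detect(image, w, b, factors=(1, 2, 3), stride=8, thresh=0.0):
    """Scan the image at several scales (integer downsampling factors) and all
    positions; keep windows scored above threshold, then apply greedy NMS."""
    candidates = []
    for f in factors:
        resized = image[::f, ::f]           # crude nearest-neighbour downscale
        H, W = resized.shape
        for y in range(0, H - WIN_H + 1, stride):
            for x in range(0, W - WIN_W + 1, stride):
                score = score_window(resized[y:y + WIN_H, x:x + WIN_W], w, b)
                if score > thresh:
                    # Map the window back to original-image coordinates.
                    candidates.append(
                        (score, (x * f, y * f, (x + WIN_W) * f, (y + WIN_H) * f)))
    # Greedy non-maximum suppression: keep the highest-scoring boxes and drop
    # later boxes that overlap a kept detection too much.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = []
    for score, box in candidates:
        if all(iou(box, kb) < 0.5 for _, kb in kept):
            kept.append((score, box))
    return kept
```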