Neural networks and support vector machines

Perceptron
• Inputs x1, ..., xD with weights w1, ..., wD; output: sgn(w·x + b)
• Can incorporate the bias as a component of the weight vector by always including a feature with value set to 1
• Loose inspiration: human neurons

Linear separability

Perceptron training algorithm
• Initialize weights
• Cycle through training examples in multiple passes (epochs)
• For each training example:
  – If classified correctly, do nothing
  – If classified incorrectly, update weights

Perceptron update rule
• For each training instance x with label y:
  – Classify with current weights: y' = sgn(w·x)
  – Update weights: w ← w + α(y − y')x
  – α is a learning rate that should decay as a function of epoch t, e.g., 1000/(1000 + t)
  – If y' is correct, then y − y' = 0 and the weights are unchanged
  – Otherwise, consider what happens to the individual weights, wi ← wi + α(y − y')xi:
• If y = 1 and y' = −1, wi is increased if xi is positive and decreased if xi is negative, so w·x gets bigger
• If y = −1 and y' = 1, wi is decreased if xi is positive and increased if xi is negative, so w·x gets smaller
• (A minimal implementation sketch of this training loop appears below, after the multi-class discussion.)

Convergence of the perceptron update rule
• Linearly separable data: converges to a perfect solution
• Non-separable data: converges to a minimum-error solution, assuming the learning rate decays as O(1/t) and examples are presented in random order

Implementation details
• Bias (add a feature dimension with value fixed to 1) vs. no bias
• Initialization of weights: all zeros vs. random
• Learning rate decay function
• Number of epochs (passes through the training data)
• Order of cycling through training examples (random)

Multi-class perceptrons
• One-vs-others framework: keep a weight vector wc for each class c
• Decision rule: c = argmax_c wc·x
• Update rule: suppose an example from class c gets misclassified as c'
  – Update for c: wc ← wc + αx
  – Update for c': wc' ← wc' − αx
• Softmax converts the C perceptron outputs into class probabilities:
  P(c | x) = exp(wc·x) / Σ_{k=1..C} exp(wk·x)

Visualizing linear classifiers
Source: Andrej Karpathy, http://cs231n.github.io/linear-classify/
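Below is a minimal sketch of the binary perceptron training loop described above, assuming NumPy, labels in {−1, +1}, the bias handled as an extra constant-1 feature, and the 1000/(1000 + t) decay schedule from the slide; the function name train_perceptron and the toy data are made up for illustration.

```python
import numpy as np

def train_perceptron(X, y, num_epochs=20):
    """Binary perceptron with labels y in {-1, +1}: classify with sgn(w.x)
    and apply w <- w + alpha*(y - y')*x only on mistakes."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # bias as a feature fixed to 1
    w = np.zeros(Xb.shape[1])                      # all-zeros initialization
    rng = np.random.default_rng(0)
    for t in range(num_epochs):
        alpha = 1000.0 / (1000.0 + t)              # learning-rate decay from the slide
        for i in rng.permutation(len(y)):          # random order of training examples
            y_pred = 1.0 if w @ Xb[i] >= 0 else -1.0
            if y_pred != y[i]:                     # update only on misclassified examples
                w += alpha * (y[i] - y_pred) * Xb[i]
    return w

# Tiny linearly separable toy set
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_perceptron(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))  # should match y
```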
Differentiable perceptron
• Inputs x1, ..., xd with weights w1, ..., wd; output: σ(w·x + b)
• Sigmoid function: σ(t) = 1 / (1 + e^(−t))

Update rule for differentiable perceptron
• Define the total classification error (loss) on the training set:
  E(w) = Σ_{j=1..N} (yj − fw(xj))²,  where fw(xj) = σ(w·xj)
• Update weights by gradient descent: w ← w − α ∂E/∂w
• The gradient is
  ∂E/∂w = −2 Σ_{j=1..N} (yj − fw(xj)) σ'(w·xj) xj = −2 Σ_{j=1..N} (yj − fw(xj)) σ(w·xj)(1 − σ(w·xj)) xj
• For a single training point (absorbing the constant factor into α), the update is:
  w ← w + α (y − fw(x)) σ(w·x)(1 − σ(w·x)) x
  – This is called stochastic gradient descent
• Compare with the update rule for the non-differentiable perceptron:
  w ← w + α (y − f(x)) x

Network with a hidden layer
• Hidden units with weights wjk feed an output unit with weights w'j
• Can learn nonlinear functions (provided each perceptron is nonlinear and differentiable):
  f(x) = σ( Σ_j w'j σ( Σ_k wjk xk ) )
• Hidden layer size controls network capacity
Source: http://cs231n.github.io/neural-networks-1/

Training of multi-layer networks
• Find network weights to minimize the error between the true and estimated labels of the training examples:
  E(w) = Σ_{j=1..N} (yj − fw(xj))²
• Update weights by gradient descent: w ← w − α ∂E/∂w
• This requires perceptrons with a differentiable nonlinearity
  – Sigmoid: g(t) = 1 / (1 + e^(−t))
  – Rectified linear unit (ReLU): g(t) = max(0, t)
• Back-propagation: gradients are computed in the direction from the output to the input layers and combined using the chain rule
• Stochastic gradient descent: compute the weight update w.r.t. one training example (or a small batch of examples) at a time, and cycle through the training examples in random order over multiple epochs
• (A small worked example combining these pieces appears after the pros and cons below.)

Regularization
• It is common to add a penalty on weight magnitudes to the objective function:
  E(f) = Σ_{i=1..N} (yi − f(xi))² + (λ/2) Σ_j wj²
  – This encourages the network to use all of its inputs "a little" rather than a few inputs "a lot"
Source: http://cs231n.github.io/neural-networks-1/

Multi-layer network demo
http://playground.tensorflow.org/

Neural networks: Pros and cons
• Pros
  – Flexible and general function approximation framework
  – Can build extremely powerful models by adding more layers
• Cons
  – Hard to analyze theoretically (e.g., training is prone to local optima)
  – A huge amount of training data and computing power is required to get good performance
  – The space of implementation choices is huge (network architectures, parameters)
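As a concrete illustration of the pieces above (a hidden layer, back-propagation, stochastic gradient descent, and an L2 weight penalty), here is a minimal NumPy sketch; the network size, learning rate, regularization strength, random seed, and the XOR toy data are illustrative assumptions rather than values from the slides.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_mlp(X, y, num_hidden=4, alpha=0.5, lam=1e-4, num_epochs=10000, seed=0):
    """One-hidden-layer network f(x) = sigma(w2 . sigma(W1 x)), trained by
    stochastic gradient descent on the squared-error objective
    E = sum_j (y_j - f(x_j))^2 + (lam/2) * sum(w^2), with gradients obtained
    by back-propagation (chain rule applied from output to input)."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])               # bias as a constant-1 feature
    W1 = rng.normal(scale=0.5, size=(num_hidden, Xb.shape[1]))  # hidden-layer weights
    w2 = rng.normal(scale=0.5, size=num_hidden + 1)             # output weights (+ bias)
    for _ in range(num_epochs):
        for i in rng.permutation(len(y)):
            # Forward pass
            h = sigmoid(W1 @ Xb[i])                        # hidden activations
            hb = np.append(h, 1.0)                         # append bias unit
            f = sigmoid(w2 @ hb)                           # network output
            # Backward pass (output layer first, then hidden layer)
            delta_out = -2.0 * (y[i] - f) * f * (1.0 - f)  # dE/d(output pre-activation)
            grad_w2 = delta_out * hb + lam * w2            # + derivative of (lam/2)*w^2
            delta_hid = delta_out * w2[:-1] * h * (1.0 - h)
            grad_W1 = np.outer(delta_hid, Xb[i]) + lam * W1
            # Stochastic gradient descent step
            w2 -= alpha * grad_w2
            W1 -= alpha * grad_W1
    return W1, w2

def predict(W1, w2, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    Hb = np.hstack([sigmoid(Xb @ W1.T), np.ones((X.shape[0], 1))])
    return sigmoid(Hb @ w2)

# XOR: not linearly separable, but learnable with a hidden layer
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])
W1, w2 = train_mlp(X, y)
print(np.round(predict(W1, w2, X), 2))  # should be close to [0, 1, 1, 0]
```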
Support vector machines
• When the data is linearly separable, there may be more than one separating hyperplane. Which separator is best?
• Find the hyperplane that maximizes the margin between the positive and negative examples; the examples lying on the margin are the support vectors
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

SVM parameter learning
• Separable data:
  min_{w,b} (1/2)||w||²  subject to  yi (w·xi + b) ≥ 1
  (maximize the margin while classifying the training data correctly)
• Non-separable data:
  min_{w,b} (1/2)||w||² + C Σ_{i=1..n} max(0, 1 − yi (w·xi + b))
  (maximize the margin while minimizing classification mistakes; a minimal optimization sketch for this objective appears at the end of this section)
• Demo: http://cs.stanford.edu/people/karpathy/svmjs/demo

Nonlinear SVMs
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
• Example in 1D: a dataset that is not separable on the line x becomes separable after mapping each point x to (x, x²)
Slide credit: Andrew Moore

The kernel trick
• Instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
  K(x, y) = φ(x) · φ(y)
  (to be valid, the kernel function must satisfy Mercer's condition)
• Linear SVM decision function:
  w·x + b = Σ_i αi yi (xi·x) + b
  (αi: learned weights; xi: support vectors)
• Kernel SVM decision function:
  Σ_i αi yi (φ(xi)·φ(x)) + b = Σ_i αi yi K(xi, x) + b
• This gives a nonlinear decision boundary in the original feature space
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Polynomial kernel: K(x, y) = (c + x·y)^d

Gaussian kernel
• Also known as the radial basis function (RBF) kernel:
  K(x, y) = exp(−||x − y||² / (2σ²))

SVMs: Pros and cons
• Pros
  – The kernel-based framework is very powerful and flexible
  – Training is convex optimization, so a globally optimal solution can be found
  – Amenable to theoretical analysis
  – SVMs work very well in practice, even with very small training sample sizes
• Cons
  – No "direct" multi-class SVM; two-class SVMs must be combined (e.g., with one-vs-others)
  – Computation and memory costs can be high (especially for nonlinear SVMs)

Neural networks vs. SVMs (a.k.a. "deep" vs. "shallow" learning)
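To make the non-separable SVM objective above concrete, here is a minimal sketch that optimizes it by batch (sub)gradient descent, assuming NumPy, labels in {−1, +1}, and made-up toy data; the function name, step size, and iteration count are illustrative choices, and practical SVM solvers typically work with the dual problem instead, which is also what enables kernels (see the Burges tutorial).

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, num_iters=2000):
    """Soft-margin linear SVM: minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))
    by batch (sub)gradient descent. Labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(num_iters):
        margins = y * (X @ w + b)
        violated = margins < 1                        # points inside the margin or misclassified
        grad_w = w - C * (y[violated] @ X[violated])  # hinge-term subgradient is -y_i * x_i
        grad_b = -C * np.sum(y[violated])
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy 2D data: two roughly linearly separable clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], scale=0.5, size=(20, 2)),
               rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

w, b = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))  # should be 1.0 on this toy set
```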