Topic 9: Advanced Classification
Neural Networks, Support Vector Machines
Credits: Shawndra Hill, Andrew Moore lecture notes
Data Mining, 2011, Volinsky, Columbia University

Outline
• Special Topics
 – Neural Networks
 – Support Vector Machines

Neural Networks Agenda
• The biological inspiration
• Structure of neural net models
• Using neural net models
• Training neural net models
• Strengths and weaknesses
• An example

What the heck are neural nets?
• A data mining algorithm, inspired by biological processes
• A type of non-linear regression/classification
• An ensemble method
 – Although not usually thought of as such
• A black box!

Inspiration from Biology
• Information processing inspired by biological nervous systems
• Structure of the nervous system: a large number of neurons (information processing units) connected together
• A neuron's response depends on the states of the other neurons it is connected to and on the 'strength' of those connections
• The 'strengths' are learned based on experience

From Real to Artificial
[Figure: a biological neuron and its artificial counterpart]

Nodes: A Closer Look
[Diagram: input values x1, ..., xm with weights w1, ..., wm feed a summing function; together with the bias b, the sum passes through an activation function φ(·) to produce the output y]
A node (neuron) is the basic information processing unit of a neural net. It has:
• A set of inputs with weights w1, w2, ..., wm, along with a default input called the bias
• An adder function (linear combiner) that computes the weighted sum of the inputs: $v = \sum_{j=1}^{m} w_j x_j$
• An activation function (squashing function) that transforms v, usually non-linearly: $y = \varphi(v + b)$

A Simple Node: A Perceptron
• A simple activation function: a signing threshold
  $\varphi(v) = \begin{cases} +1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}$
[Diagram: inputs x1, ..., xn with weights w1, ..., wn and bias b feed the sum v, which is thresholded by φ(v) to give the output y]

Common Activation Functions
• Step function
• Sigmoid (logistic) function: $\varphi(v) = \frac{1}{1 + e^{-v}}$
• Hyperbolic tangent (tanh) function: $\tanh(v) = \frac{e^{v} - e^{-v}}{e^{v} + e^{-v}}$
• The s-shape adds non-linearity. Hornik (1989): combining many of these simple functions is sufficient to approximate any continuous non-linear function arbitrarily well over a compact interval.

Neural Network: Architecture
[Diagram: input layer, hidden layer(s), output layer]
• Big idea: a combination of simple non-linear models working together to model a complex function
• How many layers? How many nodes? What is the function?
 – Magic
 – Luckily, defaults do well
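Not part of the original slides: a minimal R sketch of the node just described (R is the language of the packages recommended later in these notes). The function names and the input, weight, and bias values are made up for illustration.

# A single artificial node: a weighted sum of the inputs plus a bias,
# passed through an activation ("squashing") function.
step_fn    <- function(v) ifelse(v >= 0, 1, -1)                      # signing threshold
sigmoid_fn <- function(v) 1 / (1 + exp(-v))                          # logistic
tanh_fn    <- function(v) (exp(v) - exp(-v)) / (exp(v) + exp(-v))    # same as base R's tanh(v)

node_output <- function(x, w, b, activation = sigmoid_fn) {
  v <- sum(w * x)      # adder / linear combiner: v = sum_j w_j * x_j
  activation(v + b)    # output y = phi(v + b)
}

x <- c(0.5, -1.2, 3.0)    # illustrative inputs
w <- c(0.8,  0.1, -0.4)   # illustrative weights
b <- 0.2                  # bias
node_output(x, w, b)                          # sigmoid output
node_output(x, w, b, activation = step_fn)    # perceptron-style output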
Neural Networks: The Model
• The model has two components
 – A particular architecture
  • Number of hidden layers
  • Number of nodes in the input, output and hidden layers
  • Specification of the activation function(s)
 – The associated set of weights
• Weights and complexity are "learned" from the data
 – Supervised learning, applied iteratively
 – Out-of-sample methods; cross-validation

Fitting a Neural Net: Feed Forward
• Supply attribute values at the input nodes
• Obtain predictions from the output node(s)
 – Predicting classes
  • Two classes: a single output node with a threshold
  • Multiple classes: use multiple output nodes, one for each class; the predicted class is the output node with the highest value
• Multiple-class problems are one of the main uses of neural nets!

A Simple NN: Regression
• A one-node neural network:
 – Called a 'perceptron'
 – Use the identity function as the activation function
 – What's the output? The weighted sum of the inputs
• Logistic regression just changes the activation function to the logistic function

Training a NN: What does it learn?
• It fits/learns the weights that best translate inputs into outputs, given its architecture
• Hidden units can be thought of as learning higher-order regularities or features of the inputs that can be used to predict the outputs ("multi-layer perceptron")

Perceptron Training Rule
• Perceptron = adder + threshold
1. Start with a random set of small weights
2. Calculate an example
3. Change each weight by an amount proportional to the difference between the desired output and the actual output:
  $\Delta w_i = \eta (D - Y) I_i$
  where $\eta$ is the learning rate (step size), $D$ the desired output, $Y$ the actual output, and $I_i$ the input
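A minimal R sketch of the perceptron training rule above, on made-up toy data; the function name, learning rate, and number of epochs are illustrative, not from the slides.

# Perceptron training: delta_w_i = eta * (D - Y) * I_i, one example at a time.
train_perceptron <- function(X, D, eta = 0.1, epochs = 20) {
  w <- runif(ncol(X), -0.05, 0.05)   # start with a random set of small weights
  b <- 0
  for (e in seq_len(epochs)) {
    for (i in seq_len(nrow(X))) {
      v <- sum(w * X[i, ]) + b
      y <- ifelse(v >= 0, 1, -1)           # signing threshold (actual output Y)
      w <- w + eta * (D[i] - y) * X[i, ]   # weight update
      b <- b + eta * (D[i] - y)            # bias update
    }
  }
  list(w = w, b = b)
}

# Toy linearly separable data: the class is the sign of x1 + x2
set.seed(1)
X <- matrix(rnorm(100), ncol = 2)
D <- ifelse(X[, 1] + X[, 2] >= 0, 1, -1)
fit <- train_perceptron(X, D)
fit$w; fit$b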
Training NNs: Back Propagation
• How to train a neural net (find the optimal weights):
 – Present a training sample to the neural network
 – Calculate the error in each output neuron
 – For each neuron, calculate what the output should have been, and a scaling factor: how much lower or higher the output must be adjusted to match the desired output. This is the local error
 – Adjust the weights of each neuron to lower the local error
 – Assign "blame" for the local error to the neurons at the previous level, giving greater responsibility to neurons connected by stronger weights
 – Repeat on the neurons at the previous level, using each one's "blame" as its error
• This 'propagates' the error backward. The sequence of forward and backward fits is called 'back propagation'

Training NNs: How to do it
• A "gradient descent" algorithm is typically used to fit back propagation
• You can imagine a surface in an n-dimensional space such that
 – Each dimension is a weight
 – Each point in this space is a particular combination of weights
 – Each point on the "surface" is the output error that corresponds to that combination of weights
 – You want to minimize error, i.e. find the "valleys" on this surface
 – Note the potential for 'local minima'

Training NNs: Gradient Descent
• Find the gradient in each direction: $\frac{\partial\, \text{Error}}{\partial w_i}$
• Moving according to these gradients gives the move of 'steepest descent'
• Note the potential problem with 'local minima'

Gradient Descent
• The direction of steepest descent can be found mathematically or via computational estimation (via A. Moore)

Neural Nets: Strengths
• Can model very complex functions, very accurately
 – Non-linearity is built into the model
• Handles noisy data quite well
• Provides fast predictions
• Good for multiple-category problems
 – Many-class classification
 – Image detection
 – Speech recognition
 – Financial models
• Good for multiple-stage problems

Neural Nets: Weaknesses
• A black box: hard to explain or gain intuition from
• For complex problems, training time can be quite high
• Many, many training parameters
 – Layers, neurons per layer, output layers, bias, training algorithms, learning rate
• Highly prone to overfitting
 – The balance between complexity and parsimony can be found through cross-validation

Example: Face Detection
• Architecture of the complete system: they use another neural net to estimate the orientation of the face, then rectify it. They search over scales to find bigger/smaller faces.
• Figure from "Rotation invariant neural-network based face detection," H. A. Rowley, S. Baluja and T. Kanade, Proc. Computer Vision and Pattern Recognition, 1998, copyright 1998, IEEE

Rowley, Baluja and Kanade (1998)
• Image size: 20 x 20
• Input layer: 400 units
• Hidden layer: 15 units

Neural Nets: Face Detection
• Goal: detect "face or no face"

Face Detection: Results
[Figure: example detections]

Face Detection Results: A Few Misses
[Figure: a few missed faces]

Neural Nets
• Face detection in action
• For more: see Hastie, et al., Chapter 11
• R packages
 – Basic: nnet
 – Better: AMORE
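Not from the slides: a quick illustration of fitting a small network with the nnet package mentioned above, using R's built-in iris data. The settings (5 hidden units, a small weight decay) are illustrative choices, not recommendations from the course.

library(nnet)

set.seed(1)
# One hidden layer with 5 units; decay adds weight regularization to limit overfitting
fit <- nnet(Species ~ ., data = iris, size = 5, decay = 0.01, maxit = 200)
pred <- predict(fit, iris, type = "class")
table(predicted = pred, actual = iris$Species)   # in-sample confusion matrix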
Support Vector Machines

SVM
• A classification technique
• Start with a BIG assumption
 – The classes can be separated linearly

Linear Classifiers
• Denote the two classes by +1 and -1, and consider classifiers of the form f(x, w, b) = sign(w . x - b) that map an input x to an estimate y_est
• How would you classify this data? (the slides draw several candidate boundaries over the same scatter plot)
• Any of these would be fine... but which is best?

Classifier Margin
• Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint

Maximum Margin
• The maximum margin linear classifier is the linear classifier with the, um, maximum margin
• This is the simplest kind of SVM (called a linear SVM, or LSVM)
• Support vectors are the datapoints that the margin pushes up against

Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it has been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. LOOCV is easy, since the model is immune to removal of any non-support-vector datapoints.
4. There is some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very, very well.

Specifying a line and margin
• The boundary is drawn as three parallel lines: the plus-plane, the classifier boundary, and the minus-plane
• How do we represent this mathematically, in m input dimensions?
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• Classify as
 – +1 if w . x + b >= 1
 – -1 if w . x + b <= -1
 – Universe explodes if -1 < w . x + b < 1
 (a short R sketch of this rule follows the margin derivation below)

Computing the margin width
• Let M be the margin width, the distance between the plus-plane and the minus-plane. How do we compute M in terms of w and b?
• Claim: the vector w is perpendicular to the plus-plane
• Let x- be any point on the minus-plane (any location in R^m, not necessarily a datapoint)
• Let x+ be the closest plus-plane point to x-
• Claim: x+ = x- + λw for some value of λ. Why? The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ you travel some distance in the direction of w.
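As promised above, a minimal R sketch (not from the slides) of the linear decision rule and the plus/minus-plane check; the weights w, offset b, and test point are made-up values. Note that the slides write the classifier as sign(w . x - b) but the margin planes as w . x + b = +/-1; both forms are shown as written.

# Linear classifier f(x, w, b) = sign(w . x - b)
linear_classify <- function(x, w, b) sign(sum(w * x) - b)

# Which side of the plus/minus planes (w . x + b = +1 / -1) does a point fall on?
margin_side <- function(x, w, b) {
  s <- sum(w * x) + b
  if (s >= 1) "+1: at or beyond the plus-plane"
  else if (s <= -1) "-1: at or beyond the minus-plane"
  else "inside the margin"
}

w <- c(2, -1); b <- 0.5        # illustrative values
linear_classify(c(1.5, 0.2), w, b)
margin_side(c(1.5, 0.2), w, b)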
Computing the margin width
• What we know:
 – w . x+ + b = +1
 – w . x- + b = -1
 – x+ = x- + λw
 – |x+ - x-| = M
• It's now easy to get M in terms of w and b:
  $w \cdot (x^- + \lambda w) + b = 1 \;\Rightarrow\; w \cdot x^- + b + \lambda\,(w \cdot w) = 1 \;\Rightarrow\; -1 + \lambda\,(w \cdot w) = 1 \;\Rightarrow\; \lambda = \frac{2}{w \cdot w}$
  $M = |x^+ - x^-| = |\lambda w| = \lambda \sqrt{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}$

Learning the Maximum Margin Classifier
• Given a guess of w and b we can
 – Compute whether all data points are in the correct half-planes
 – Compute the width of the margin
• Search the space of w's and b's to find the widest margin that matches all the datapoints

Uh-oh! (the data are not linearly separable)
• This is going to be a problem! What should we do?
• Idea 1: find the minimum w . w while minimizing the number of training set errors
 – Problemette: two things to minimize makes for an ill-defined optimization
• Idea 1.1: minimize w . w + C (#training errors), where C is a tradeoff parameter
• And: use a trick

Suppose we're in 1 dimension
• What would SVMs do with this data? (points spread along a line around x = 0)
• Not a big surprise: the positive and negative "planes" are just thresholds on the line

Harder 1-dimensional dataset
• What can be done about this? (the classes are interleaved and not separable on the line)
• Embed the data in a higher-dimensional space: $z_k = (x_k, x_k^2)$

SVM Kernel Functions
• Embedding the data in a higher-dimensional space where it is separable is called the "kernel trick"
• Beyond polynomials, there are other very high-dimensional basis functions that can be made practical by finding the right kernel function
 – Radial-basis-style kernel function: $K(a, b) = \exp\!\left(-\frac{(a - b)^2}{2\sigma^2}\right)$
 – Neural-net-style kernel function: $K(a, b) = \tanh(\kappa\, a \cdot b - \delta)$
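Not from the slides: a minimal R sketch of the 1-d embedding z_k = (x_k, x_k^2) and the two kernel functions above. The data and the kernel parameters (sigma, kappa, delta) are illustrative values.

# Embed 1-d points in 2 dimensions so a linear boundary can separate them
x <- c(-3, -2, -0.5, 0.5, 2, 3)    # toy 1-d inputs
z <- cbind(x, x^2)                 # z_k = (x_k, x_k^2)

# Radial-basis-style kernel: K(a, b) = exp(-|a - b|^2 / (2 * sigma^2))
rbf_kernel <- function(a, b, sigma = 1) exp(-sum((a - b)^2) / (2 * sigma^2))

# Neural-net-style kernel: K(a, b) = tanh(kappa * a.b - delta)
tanh_kernel <- function(a, b, kappa = 1, delta = 0) tanh(kappa * sum(a * b) - delta)

rbf_kernel(z[1, ], z[2, ])
tanh_kernel(z[1, ], z[2, ])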
SVM Performance
• Trick: find linear boundaries in an enlarged space
 – These translate to nonlinear boundaries in the original space
• Magic: for more details, see Hastie et al., section 12.3
• Anecdotally they work very, very well indeed
• Example: they are currently the best-known classifier on a well-studied hand-written-character recognition benchmark
• There is a lot of excitement and religious fervor about SVMs
• Despite this, some practitioners are a little skeptical

Doing multi-class classification
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2)
• What can be done?
• Answer: with output arity N, learn N SVMs
 – SVM 1 learns "Output == 1" vs "Output != 1"
 – SVM 2 learns "Output == 2" vs "Output != 2"
 – ...
 – SVM N learns "Output == N" vs "Output != N"
• Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region (a sketch of this one-vs-rest scheme follows the references below)

References
• Hastie, et al., Chapter 11 (NN); Chapter 12 (SVM)
• Andrew Moore's notes on neural nets
• Andrew Moore's notes on SVMs
• Wikipedia has very good pages on both topics
• An excellent tutorial on VC dimension and support vector machines by C. Burges: "A tutorial on support vector machines for pattern recognition." Data Mining and Knowledge Discovery, 2(2):955-974, 1998
• The SVM bible: Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998
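As noted above, a sketch of the one-vs-rest scheme for multi-class SVMs. The slides do not name an implementation; the svm() function from the e1071 package is assumed here, and each class-specific SVM's fitted class probability is used as the score (a stand-in for the decision value the slides describe). All settings are illustrative.

library(e1071)

# Fit one two-class SVM per class: "this class" vs "everything else"
one_vs_rest_fit <- function(X, y) {
  lapply(levels(y), function(cls) {
    target <- factor(ifelse(y == cls, "yes", "no"), levels = c("yes", "no"))
    svm(X, target, kernel = "radial", probability = TRUE)
  })
}

# Predict by scoring each new point under every class-specific SVM and
# picking the class whose SVM pushes it furthest toward "yes"
one_vs_rest_predict <- function(models, classes, newX) {
  scores <- sapply(models, function(m) {
    p <- predict(m, newX, probability = TRUE)
    attr(p, "probabilities")[, "yes"]
  })
  classes[max.col(as.matrix(scores))]
}

X <- as.matrix(iris[, 1:4])
y <- iris$Species
models <- one_vs_rest_fit(X, y)
one_vs_rest_predict(models, levels(y), X[c(1, 51, 101), ])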