CS621/CS449 Artificial Intelligence
Lecture Notes Set 4: 24/08/2004, 25/08/2004, 27/08/2004, 08/09/2004, 10/09/2004
Instructor: Prof. Pushpak Bhattacharyya, IIT Bombay

Outline
• Proof of the theorem: PLC and ~LS equivalence
• Feedforward neural nets
• Sigmoid networks
• Training through gradient descent
• Backpropagation algorithm

Proof of PLC <=> ~LS
We have to prove PLC => ~LS and ~LS => PLC.
• One proof of PLC => ~LS: if a PLC exists, then
  Σ_i P_i Y_i = 0   ... (1), with P_i ≥ 0 and P_j > 0 for at least one j.
  Suppose LS: there exists w such that w · Y_i > 0 for all i   ... (2).
  Taking the dot product of w with (1) gives Σ_i P_i (w · Y_i) = 0   ... (3).
  But by (2) every term P_i (w · Y_i) is non-negative and at least one is strictly positive, so the sum cannot be zero. This is a contradiction. QED.
• The remaining direction, ~LS => PLC, follows from a linear programming argument:
  – Given a set of augmented and preprocessed vectors X_1, X_2, …, X_m, each of dimension n, is there a vector w such that w · X_i > 0 for i = 1, 2, …, m? Equivalently, is there a w such that w · D > [0 0 … 0] (m times), where D = [X_1 X_2 … X_m] is an n × m matrix whose columns X_i are n × 1 vectors?
  – Alternatively, we could ask for a vector w_1 such that w_1 · D > [1 1 … 1] (m times).
  – Consider a linear programming formulation:
    Minimize w_1 · O subject to the constraint w_1 · D > [1 1 … 1]   ... (F1)
    Trivially w_1 · O is 0, so only the feasibility of F1 matters.
  – Any such LP formulation has a dual LP (a consequence of Farkas' lemma). The dual LP says:
    Maximize [1 1 … 1] · P subject to the constraint D · P = 0, where P = [P_1 P_2 … P_m]^T, P_i ≥ 0   ... (F2)
  – Observation: F1 has a feasible solution iff the objective function of F2 is bounded.
  – Suppose a PLC exists. Then there is a P ≠ O with D · P = O. If P is a solution of D · P = O, then λP is also a solution for any λ > 0 (as large as required). Thus the objective function of F2 becomes unbounded => F1 does not have a feasible solution => w_1 does not exist => ~LS.
  – Conversely, suppose ~LS. Then w_1 does not exist => F1 does not have a feasible solution => F2 becomes unbounded => a PLC exists.

Limitations of the perceptron
• Non-linear separability is all pervading.
• A single perceptron does not have enough computing power.
• E.g., XOR cannot be computed by a perceptron.

Solutions
• Tolerate error (e.g., the pocket algorithm used by connectionist expert systems) – try to get the best possible hyperplane using only perceptrons.
• Use higher-dimensional surfaces, e.g., degree-2 surfaces like the parabola.
• Use a layered network.

Pocket Algorithm
• The algorithm evolved in 1985 – it essentially uses the PTA.
• Basic idea: always preserve the best weights obtained so far in the "pocket"; replace them only if better weights are found (i.e., the changed weights result in reduced error).
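The notes give only this informal description; the Python sketch below shows one simplified way the pocket idea might be realized, with the pocket updated once per epoch (the function name, the use of NumPy, the 0/1 class labels, and the epoch-level error check are assumptions, not part of the lecture).

```python
import numpy as np

def pocket_train(X, y, epochs=100, seed=0):
    """Perceptron Training Algorithm with a 'pocket': remember the weight
    vector with the fewest misclassifications on X seen so far.
    X: augmented input patterns (one per row), y: 0/1 class labels."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])                 # current weight vector

    def errors(v):                                  # misclassification count
        return int(np.sum((X @ v > 0).astype(int) != y))

    pocket_w, pocket_err = w.copy(), errors(w)
    for _ in range(epochs):
        for xi, ti in zip(X, y):
            o = 1 if xi @ w > 0 else 0              # perceptron output
            w = w + (ti - o) * xi                   # standard PTA update
        e = errors(w)
        if e < pocket_err:                          # better hyperplane found:
            pocket_w, pocket_err = w.copy(), e      # put it in the pocket
    return pocket_w
```

For non-separable data this returns the best hyperplane encountered rather than endlessly oscillating weights, which is exactly the "tolerate error" idea above.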
Connectionist Expert Systems
• Rules and facts form the Knowledge Base (KB).
• Rules are encoded in a neural network.
• Main problem: "knowledge engineering" – lots of training/experience required.
• Challenges in KE:
  – Obtaining knowledge from experts
  – Encoding this knowledge into the system
• Hybrid AI area – combines symbolic (program-based) AI and neural network AI.

XOR using 2 layers
• x_1 XOR x_2 = OR(AND(x_1, NOT(x_2)), AND(NOT(x_1), x_2))
• A non-LS function expressed as a linearly separable function of individual linearly separable functions.

Example – XOR
• Calculation of AND(NOT(x_1), x_2): a perceptron with threshold θ = 1, weight w_1 = −1 on x_1 and w_2 = 1.5 on x_2. Its truth table:
    x_1  x_2  AND(NOT(x_1), x_2)
     0    0          0
     0    1          1
     1    0          0
     1    1          0
• AND(x_1, NOT(x_2)) is computed symmetrically, with w_1 = 1.5 on x_1, w_2 = −1 on x_2 and θ = 1.
• Calculation of XOR: an output perceptron with threshold θ = 0.5 and weights w_1 = w_2 = 1, taking the outputs of the two hidden perceptrons as its inputs.

Some Terminology
• A multilayer feedforward neural network has
  – an input layer
  – an output layer
  – hidden layer(s) (which do computation)
  Output units and hidden units are called computation units.

Training of the MLP
• Multilayer Perceptron (MLP)
• Question: how to find weights for the hidden layers when no target output is available for them?
• This credit assignment problem is to be solved by "gradient descent".

Gradient Descent Technique
• Let E be the error at the output layer:
  E = (1/2) Σ_{j=1}^{p} Σ_{i=1}^{n} (t_i − o_i)_j²
• t_i = target output; o_i = observed output
• i is the index going over the n neurons in the outermost layer
• j is the index going over the p patterns (1 to p)
• Ex: XOR: p = 4 and n = 1

Weights in a NN
• w_mn is the weight of the connection from the nth neuron to the mth neuron.
• The E vs. W surface is a complex surface in the space defined by the weights w_ij.
• −∂E/∂w_mn gives the direction in which a movement of the operating point in the w_mn coordinate space will result in the maximum decrease in error:
  Δw_mn ∝ −∂E/∂w_mn

Sigmoid neurons
• Gradient descent needs a derivative computation – not possible in the perceptron due to the discontinuous step function used! Sigmoid neurons with easy-to-compute derivatives are required:
  y → 1 as x → ∞
  y → 0 as x → −∞
• Computing power comes from the non-linearity of the sigmoid function.

Derivative of the Sigmoid function
  y = 1 / (1 + e^{−x})
  dy/dx = e^{−x} / (1 + e^{−x})² = [1 / (1 + e^{−x})] · [1 − 1 / (1 + e^{−x})] = y(1 − y)

Training algorithm
• Initialize weights to random values.
• For input x = <x_n, x_{n−1}, …, x_0>, modify weights as follows (target output = t, observed output = o):
  E = (1/2)(t − o)²
  Δw_i = −η ∂E/∂w_i
• Iterate until E < ε (threshold).

Calculation of Δw_i
  ∂E/∂w_i = (∂E/∂net) · (∂net/∂w_i), where net = Σ_{i=0}^{n} w_i x_i
          = (∂E/∂o) · (∂o/∂net) · (∂net/∂w_i)
          = −(t − o) · o(1 − o) · x_i
  Δw_i = −η ∂E/∂w_i   (η = learning constant, 0 ≤ η ≤ 1)
       = η (t − o) o (1 − o) x_i
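A minimal Python sketch of this update rule for a single sigmoid neuron follows; the batch-style summation over all patterns, the learning rate, the stopping threshold and the variable names are illustrative assumptions, not prescribed by the notes.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_sigmoid_neuron(X, t, eta=0.5, eps=1e-3, max_iters=10000):
    """Gradient descent on E = 1/2 * sum (t - o)^2 for one sigmoid neuron.
    X: patterns as rows, already augmented so one component plays the role
    of the threshold input; t: target outputs in [0, 1]."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])                  # random initial weights
    for _ in range(max_iters):
        o = sigmoid(X @ w)                           # observed outputs
        E = 0.5 * np.sum((t - o) ** 2)               # error over all patterns
        if E < eps:                                  # iterate until E < threshold
            break
        # delta w_i = eta * (t - o) * o * (1 - o) * x_i, summed over patterns
        w += eta * (X.T @ ((t - o) * o * (1 - o)))
    return w
```

For XOR (p = 4 patterns, n = 1 output) a single such neuron cannot drive E to zero, which is precisely why the multilayer network and backpropagation discussed next are needed.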
Observations
Does the training technique support our intuition?
• The larger x_i is, the larger Δw_i is – the error burden is borne by the weight values corresponding to large input values.
• If t = o, then Δw_i = 0.
• Δw_i ∝ the departure from the target.
• Saturation behaviour (see below).
• If o < t, Δw_i > 0, and if o > t, Δw_i < 0; this establishes Hebb's law.

Hebb's law
Consider a connection with weight w_ji from neuron n_i to neuron n_j.
• If n_j and n_i are both in the excitatory state (+1),
  – then the change in weight might be such that it enhances the excitation;
  – the change is proportional to both levels of excitation: Δw_ji ∝ e(n_j) e(n_i).
• If n_i and n_j are in a mutual state of inhibition (one is +1 and the other is −1),
  – then the change in weight is such that the inhibition is enhanced (the change in weight is negative).

Saturation behaviour
• The algorithm is iterative and incremental.
• If the weight values or the number of inputs is very large, the output will be large and the neuron will operate in its saturation region.
• The weight values hardly change in the saturation region.

Backpropagation algorithm
Architecture: an input layer (n i/p neurons), one or more hidden layers, and an output layer (m o/p neurons); w_ji is the weight of the connection from neuron i to neuron j in the next layer.
• Fully connected feedforward network.
• Pure FF network (no jumping of connections over layers).

Gradient Descent Equations
  Δw_ji = −η ∂E/∂w_ji   (η = learning rate, 0 ≤ η ≤ 1)
  ∂E/∂w_ji = (∂E/∂net_j) · (∂net_j/∂w_ji)   (net_j = net input at the jth layer)
  Writing δ_j = −∂E/∂net_j and noting ∂net_j/∂w_ji = o_i,
  Δw_ji = η δ_j o_i

Example – Character Recognition
• Output layer – 26 neurons (for the capital letters).
• The first output neuron has the responsibility of detecting all forms of 'A'.
• This is a centralized representation of the outputs.
• In distributed representations, all output neurons participate in the output.

Backpropagation – for the outermost layer
  δ_j = −∂E/∂net_j = −(∂E/∂o_j) · (∂o_j/∂net_j)   (net_j = net input at the jth layer)
  E = (1/2) Σ_{p=1}^{m} (t_p − o_p)²
  Hence, δ_j = (t_j − o_j) o_j (1 − o_j)
  and Δw_ji = η (t_j − o_j) o_j (1 − o_j) o_i

Backpropagation – for the hidden layers
δ_k at the next (outer) layer is propagated backwards to find the value of δ_j.
  Δw_ji = η δ_j o_i
  δ_j = −∂E/∂net_j = −(∂E/∂o_j) · (∂o_j/∂net_j) = −(∂E/∂o_j) · o_j (1 − o_j)
  −∂E/∂o_j = Σ_{k ∈ next layer} −(∂E/∂net_k)(∂net_k/∂o_j) = Σ_{k ∈ next layer} δ_k w_kj
  Hence, δ_j = (Σ_{k ∈ next layer} δ_k w_kj) o_j (1 − o_j)
  and Δw_ji = η (Σ_{k ∈ next layer} δ_k w_kj) o_j (1 − o_j) o_i

General Backpropagation Rule
• General weight updating rule: Δw_ji = η δ_j o_i
• where δ_j = (t_j − o_j) o_j (1 − o_j) for the outermost layer
        δ_j = (Σ_{k ∈ next layer} w_kj δ_k) o_j (1 − o_j) for the hidden layers
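To make the rule concrete, here is a Python sketch of one gradient-descent step for a network with a single hidden layer; the matrix shapes, the learning rate, and the omission of thresholds/biases are simplifying assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hid, W_out, eta=0.25):
    """One weight update for a 1-hidden-layer sigmoid network.
    W_hid[j, i] = w_ji (input i -> hidden j); W_out[k, j] = w_kj (hidden j -> output k).
    Thresholds are omitted to keep the sketch short."""
    # forward pass
    o_hid = sigmoid(W_hid @ x)                       # hidden outputs o_j
    o_out = sigmoid(W_out @ o_hid)                   # network outputs o_k

    # outermost layer: delta_k = (t_k - o_k) * o_k * (1 - o_k)
    delta_out = (t - o_out) * o_out * (1 - o_out)
    # hidden layer: delta_j = (sum_k w_kj * delta_k) * o_j * (1 - o_j)
    delta_hid = (W_out.T @ delta_out) * o_hid * (1 - o_hid)

    # general rule: delta w_ji = eta * delta_j * o_i
    W_out += eta * np.outer(delta_out, o_hid)
    W_hid += eta * np.outer(delta_hid, x)
    return W_hid, W_out
```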
Issues in the training algorithm
• The algorithm is greedy: it always changes the weights such that E reduces.
• The algorithm may get stuck in a local minimum.
• If we observe that E is not getting reduced any more, the following may be the reasons:
  1. Stuck in a local minimum.
  2. Network paralysis (a high −ve or +ve input makes the neurons saturate).
  3. η (the learning rate) is too small.

Diagnostics in action (1)
1) If stuck in a local minimum, try the following:
• Re-initialize the weight vector.
• Increase the learning rate.
• Introduce more neurons in the hidden layer.
2) If it is network paralysis, then increase the number of neurons in the hidden layer.
Problem: how to configure the hidden layer?
Known: one hidden layer seems to be sufficient. [Kolmogorov (1960s)]

Diagnostics in action (2)
Kolmogorov's statement: a feedforward network with three layers (input, hidden and output), with an appropriate I/O relation that can vary from neuron to neuron, is sufficient to compute any function.
More hidden layers reduce the size of the individual layers.

Diagnostics in action (3)
3) Observe the outputs: if they are close to 0 or 1, try the following:
  1. Scale the inputs, or divide by a normalizing factor.
  2. Change the shape and size of the sigmoid.
  3. Introduce a momentum factor β:
     (Δw_ji)_{nth iteration} = η δ_j o_i + β (Δw_ji)_{(n−1)th iteration}
     • Accelerates the movement out of the trough.
     • Dampens oscillation inside the trough.
Choosing β: if β is large, we may jump over the minimum.
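As a small illustration, the momentum-augmented update can be written as below; the function and parameter names and the default values of η and β are assumptions made for the sketch.

```python
def momentum_update(w, delta_j, o_i, prev_dw, eta=0.25, beta=0.9):
    """(delta w_ji)_n = eta * delta_j * o_i + beta * (delta w_ji)_{n-1}.
    Works for scalars or NumPy arrays; prev_dw is the previous step's update."""
    dw = eta * delta_j * o_i + beta * prev_dw        # momentum reuses the last step
    return w + dw, dw                                # keep dw for the next iteration
```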