CS621/CS449 Artificial Intelligence
Lecture Notes Set 4: 24/08/2004, 25/08/2004, 27/08/2004, 08/09/2004, 10/09/2004
Instructor: Prof. Pushpak Bhattacharyya, IIT Bombay

Outline
• Proof of the theorem: PLC and ~LS equivalence
• Feedforward neural nets
• Sigmoid networks
• Training through gradient descent
• Backpropagation algorithm

Proof of PLC <=> ~LS
We have to prove PLC => ~LS and ~LS => PLC.
• One proof of PLC => ~LS: if a PLC exists, then
  Σ_i P_i Y_i = 0   ... (1), with P_i ≥ 0 and P_j > 0 for at least one j.
  Suppose LS: there exists w such that w · Y_i > 0 for all i   ... (2).
  Taking the dot product of w with (1) gives Σ_i P_i (w · Y_i) = 0   ... (3).
  But by (2) every term P_i (w · Y_i) is non-negative and at least one is strictly positive, so the sum cannot be zero. This is a contradiction. QED.
• The remaining direction, ~LS => PLC, follows from a linear programming argument:
  – Given a set of augmented and preprocessed vectors X_1, X_2, …, X_m, each of dimension n, is there a vector w such that w · X_i > 0 for i = 1, 2, …, m? Equivalently, is there a w such that w · D > [0 0 … 0] (m times), where D = [X_1 X_2 … X_m] is an n × m matrix whose columns X_i are n × 1 vectors?
  – Alternatively, we could ask for a vector w_1 such that w_1 · D > [1 1 … 1] (m times).
  – Consider a linear programming formulation:
    Minimize w_1 · O subject to the constraint w_1 · D > [1 1 … 1]   ... (F1)
    Trivially w_1 · O is 0, so only the feasibility of F1 matters.
  – Any such LP formulation has a dual LP (a consequence of Farkas' lemma). The dual LP says:
    Maximize [1 1 … 1] · P subject to the constraint D · P = 0, where P = [P_1 P_2 … P_m]^T, P_i ≥ 0   ... (F2)
  – Observation: F1 has a feasible solution iff the objective function of F2 is bounded.
  – Suppose a PLC exists. Then there is a P ≠ O with D · P = O. If P is a solution of D · P = O, then λP is also a solution for any λ > 0 (as large as required). Thus the objective function of F2 becomes unbounded => F1 does not have a feasible solution => w_1 does not exist => ~LS.
  – Conversely, suppose ~LS. Then w_1 does not exist => F1 does not have a feasible solution => F2 becomes unbounded => a PLC exists.

Limitations of the perceptron
• Non-linear separability is all pervading.
• A single perceptron does not have enough computing power.
• E.g., XOR cannot be computed by a perceptron.

Solutions
• Tolerate error (e.g., the pocket algorithm used by connectionist expert systems) – try to get the best possible hyperplane using only perceptrons.
• Use higher-dimensional surfaces, e.g., degree-2 surfaces like the parabola.
• Use a layered network.

Pocket Algorithm
• The algorithm evolved in 1985 – it essentially uses the PTA.
• Basic idea: always preserve the best weights obtained so far in the "pocket"; replace them only if better weights are found (i.e., the changed weights result in reduced error).
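The notes give only this informal description; the Python sketch below shows one simplified way the pocket idea might be realized, with the pocket updated once per epoch (the function name, the use of NumPy, the 0/1 class labels, and the epoch-level error check are assumptions, not part of the lecture).

```python
import numpy as np

def pocket_train(X, y, epochs=100, seed=0):
    """Perceptron Training Algorithm with a 'pocket': remember the weight
    vector with the fewest misclassifications on X seen so far.
    X: augmented input patterns (one per row), y: 0/1 class labels."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])                 # current weight vector

    def errors(v):                                  # misclassification count
        return int(np.sum((X @ v > 0).astype(int) != y))

    pocket_w, pocket_err = w.copy(), errors(w)
    for _ in range(epochs):
        for xi, ti in zip(X, y):
            o = 1 if xi @ w > 0 else 0              # perceptron output
            w = w + (ti - o) * xi                   # standard PTA update
        e = errors(w)
        if e < pocket_err:                          # better hyperplane found:
            pocket_w, pocket_err = w.copy(), e      # put it in the pocket
    return pocket_w
```

For non-separable data this returns the best hyperplane encountered rather than endlessly oscillating weights, which is exactly the "tolerate error" idea above.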
Connectionist Expert Systems
• Rules and facts form the Knowledge Base (KB).
• Rules are encoded in a neural network.
• Main problem: "knowledge engineering" – lots of training/experience required.
• Challenges in KE:
  – Obtaining knowledge from experts
  – Encoding this knowledge into the system
• Hybrid AI area – combines symbolic (program-based) AI and neural network AI.

XOR using 2 layers
• x_1 XOR x_2 = OR(AND(x_1, NOT(x_2)), AND(NOT(x_1), x_2))
• A non-LS function expressed as a linearly separable function of individual linearly separable functions.

Example – XOR
• Calculation of AND(NOT(x_1), x_2): a perceptron with threshold θ = 1, weight w_1 = −1 on x_1 and w_2 = 1.5 on x_2. Its truth table:
    x_1  x_2  AND(NOT(x_1), x_2)
     0    0          0
     0    1          1
     1    0          0
     1    1          0
• AND(x_1, NOT(x_2)) is computed symmetrically, with w_1 = 1.5 on x_1, w_2 = −1 on x_2 and θ = 1.
• Calculation of XOR: an output perceptron with threshold θ = 0.5 and weights w_1 = w_2 = 1, taking the outputs of the two hidden perceptrons as its inputs.

Some Terminology
• A multilayer feedforward neural network has
  – an input layer
  – an output layer
  – hidden layer(s) (which do computation)
  Output units and hidden units are called computation units.

Training of the MLP
• Multilayer Perceptron (MLP)
• Question: how to find weights for the hidden layers when no target output is available for them?
• This credit assignment problem is to be solved by "gradient descent".

Gradient Descent Technique
• Let E be the error at the output layer:
  E = (1/2) Σ_{j=1}^{p} Σ_{i=1}^{n} (t_i − o_i)_j²
• t_i = target output; o_i = observed output
• i is the index going over the n neurons in the outermost layer
• j is the index going over the p patterns (1 to p)
• Ex: XOR: p = 4 and n = 1

Weights in a NN
• w_mn is the weight of the connection from the nth neuron to the mth neuron.
• The E vs. W surface is a complex surface in the space defined by the weights w_ij.
• −∂E/∂w_mn gives the direction in which a movement of the operating point in the w_mn coordinate space will result in the maximum decrease in error:
  Δw_mn ∝ −∂E/∂w_mn

Sigmoid neurons
• Gradient descent needs a derivative computation – not possible in the perceptron due to the discontinuous step function used! Sigmoid neurons with easy-to-compute derivatives are required:
  y → 1 as x → ∞
  y → 0 as x → −∞
• Computing power comes from the non-linearity of the sigmoid function.

Derivative of the Sigmoid function
  y = 1 / (1 + e^{−x})
  dy/dx = e^{−x} / (1 + e^{−x})² = [1 / (1 + e^{−x})] · [1 − 1 / (1 + e^{−x})] = y(1 − y)

Training algorithm
• Initialize weights to random values.
• For input x = <x_n, x_{n−1}, …, x_0>, modify weights as follows (target output = t, observed output = o):
  E = (1/2)(t − o)²
  Δw_i = −η ∂E/∂w_i
• Iterate until E < ε (threshold).

Calculation of Δw_i
  ∂E/∂w_i = (∂E/∂net) · (∂net/∂w_i), where net = Σ_{i=0}^{n} w_i x_i
          = (∂E/∂o) · (∂o/∂net) · (∂net/∂w_i)
          = −(t − o) · o(1 − o) · x_i
  Δw_i = −η ∂E/∂w_i   (η = learning constant, 0 ≤ η ≤ 1)
       = η (t − o) o (1 − o) x_i
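A minimal Python sketch of this update rule for a single sigmoid neuron follows; the batch-style summation over all patterns, the learning rate, the stopping threshold and the variable names are illustrative assumptions, not prescribed by the notes.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_sigmoid_neuron(X, t, eta=0.5, eps=1e-3, max_iters=10000):
    """Gradient descent on E = 1/2 * sum (t - o)^2 for one sigmoid neuron.
    X: patterns as rows, already augmented so one component plays the role
    of the threshold input; t: target outputs in [0, 1]."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])                  # random initial weights
    for _ in range(max_iters):
        o = sigmoid(X @ w)                           # observed outputs
        E = 0.5 * np.sum((t - o) ** 2)               # error over all patterns
        if E < eps:                                  # iterate until E < threshold
            break
        # delta w_i = eta * (t - o) * o * (1 - o) * x_i, summed over patterns
        w += eta * (X.T @ ((t - o) * o * (1 - o)))
    return w
```

For XOR (p = 4 patterns, n = 1 output) a single such neuron cannot drive E to zero, which is precisely why the multilayer network and backpropagation discussed next are needed.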
Observations
Does the training technique support our intuition?
• The larger x_i is, the larger Δw_i is – the error burden is borne by the weight values corresponding to large input values.
• If t = o, then Δw_i = 0.
• Δw_i ∝ the departure from the target.
• Saturation behaviour (see below).
• If o < t, Δw_i > 0, and if o > t, Δw_i < 0; this establishes Hebb's law.

Hebb's law
Consider a connection with weight w_ji from neuron n_i to neuron n_j.
• If n_j and n_i are both in the excitatory state (+1),
  – then the change in weight might be such that it enhances the excitation;
  – the change is proportional to both levels of excitation: Δw_ji ∝ e(n_j) e(n_i).
• If n_i and n_j are in a mutual state of inhibition (one is +1 and the other is −1),
  – then the change in weight is such that the inhibition is enhanced (the change in weight is negative).

Saturation behaviour
• The algorithm is iterative and incremental.
• If the weight values or the number of inputs is very large, the output will be large and the neuron will operate in its saturation region.
• The weight values hardly change in the saturation region.

Backpropagation algorithm
Architecture: an input layer (n i/p neurons), one or more hidden layers, and an output layer (m o/p neurons); w_ji is the weight of the connection from neuron i to neuron j in the next layer.
• Fully connected feedforward network.
• Pure FF network (no jumping of connections over layers).

Gradient Descent Equations
  Δw_ji = −η ∂E/∂w_ji   (η = learning rate, 0 ≤ η ≤ 1)
  ∂E/∂w_ji = (∂E/∂net_j) · (∂net_j/∂w_ji)   (net_j = net input at the jth layer)
  Writing δ_j = −∂E/∂net_j and noting ∂net_j/∂w_ji = o_i,
  Δw_ji = η δ_j o_i

Example – Character Recognition
• Output layer – 26 neurons (for the capital letters).
• The first output neuron has the responsibility of detecting all forms of 'A'.
• This is a centralized representation of the outputs.
• In distributed representations, all output neurons participate in the output.

Backpropagation – for the outermost layer
  δ_j = −∂E/∂net_j = −(∂E/∂o_j) · (∂o_j/∂net_j)   (net_j = net input at the jth layer)
  E = (1/2) Σ_{p=1}^{m} (t_p − o_p)²
  Hence, δ_j = (t_j − o_j) o_j (1 − o_j)
  and Δw_ji = η (t_j − o_j) o_j (1 − o_j) o_i

Backpropagation – for the hidden layers
δ_k at the next (outer) layer is propagated backwards to find the value of δ_j.
  Δw_ji = η δ_j o_i
  δ_j = −∂E/∂net_j = −(∂E/∂o_j) · (∂o_j/∂net_j) = −(∂E/∂o_j) · o_j (1 − o_j)
  −∂E/∂o_j = Σ_{k ∈ next layer} −(∂E/∂net_k)(∂net_k/∂o_j) = Σ_{k ∈ next layer} δ_k w_kj
  Hence, δ_j = (Σ_{k ∈ next layer} δ_k w_kj) o_j (1 − o_j)
  and Δw_ji = η (Σ_{k ∈ next layer} δ_k w_kj) o_j (1 − o_j) o_i

General Backpropagation Rule
• General weight updating rule: Δw_ji = η δ_j o_i
• where δ_j = (t_j − o_j) o_j (1 − o_j) for the outermost layer
        δ_j = (Σ_{k ∈ next layer} w_kj δ_k) o_j (1 − o_j) for the hidden layers
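To make the rule concrete, here is a Python sketch of one gradient-descent step for a network with a single hidden layer; the matrix shapes, the learning rate, and the omission of thresholds/biases are simplifying assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hid, W_out, eta=0.25):
    """One weight update for a 1-hidden-layer sigmoid network.
    W_hid[j, i] = w_ji (input i -> hidden j); W_out[k, j] = w_kj (hidden j -> output k).
    Thresholds are omitted to keep the sketch short."""
    # forward pass
    o_hid = sigmoid(W_hid @ x)                       # hidden outputs o_j
    o_out = sigmoid(W_out @ o_hid)                   # network outputs o_k

    # outermost layer: delta_k = (t_k - o_k) * o_k * (1 - o_k)
    delta_out = (t - o_out) * o_out * (1 - o_out)
    # hidden layer: delta_j = (sum_k w_kj * delta_k) * o_j * (1 - o_j)
    delta_hid = (W_out.T @ delta_out) * o_hid * (1 - o_hid)

    # general rule: delta w_ji = eta * delta_j * o_i
    W_out += eta * np.outer(delta_out, o_hid)
    W_hid += eta * np.outer(delta_hid, x)
    return W_hid, W_out
```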
Issues in the training algorithm
• The algorithm is greedy: it always changes the weights such that E reduces.
• The algorithm may get stuck in a local minimum.
• If we observe that E is not getting reduced any more, the following may be the reasons:
  1. Stuck in a local minimum.
  2. Network paralysis (a high −ve or +ve input makes the neurons saturate).
  3. η (the learning rate) is too small.

Diagnostics in action (1)
1) If stuck in a local minimum, try the following:
• Re-initialize the weight vector.
• Increase the learning rate.
• Introduce more neurons in the hidden layer.
2) If it is network paralysis, then increase the number of neurons in the hidden layer.
Problem: how to configure the hidden layer?
Known: one hidden layer seems to be sufficient. [Kolmogorov (1960s)]

Diagnostics in action (2)
Kolmogorov's statement: a feedforward network with three layers (input, hidden and output), with an appropriate I/O relation that can vary from neuron to neuron, is sufficient to compute any function.
More hidden layers reduce the size of the individual layers.

Diagnostics in action (3)
3) Observe the outputs: if they are close to 0 or 1, try the following:
  1. Scale the inputs, or divide by a normalizing factor.
  2. Change the shape and size of the sigmoid.
  3. Introduce a momentum factor β:
     (Δw_ji)_{nth iteration} = η δ_j o_i + β (Δw_ji)_{(n−1)th iteration}
     • Accelerates the movement out of the trough.
     • Dampens oscillation inside the trough.
Choosing β: if β is large, we may jump over the minimum.
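As a small illustration, the momentum-augmented update can be written as below; the function and parameter names and the default values of η and β are assumptions made for the sketch.

```python
def momentum_update(w, delta_j, o_i, prev_dw, eta=0.25, beta=0.9):
    """(delta w_ji)_n = eta * delta_j * o_i + beta * (delta w_ji)_{n-1}.
    Works for scalars or NumPy arrays; prev_dw is the previous step's update."""
    dw = eta * delta_j * o_i + beta * prev_dw        # momentum reuses the last step
    return w + dw, dw                                # keep dw for the next iteration
```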