Back-propagation Algorithm:
Supervised Learning
Backpropagation (BP) is amongst the most popular algorithms
for ANNs: it has been estimated by Paul Werbos, who first
worked on the algorithm in the 1970s, that between 40% and
90% of real-world ANN applications use the BP algorithm.
Werbos traces the algorithm to the psychologist Sigmund
Freud’s theory of psychodynamics. Werbos applied the
algorithm in political forecasting.
•David Rumelhart, Geoffrey Hinton and others applied the BP
algorithm in the 1980s to problems related to supervised
learning, particularly pattern recognition.
•The most useful examples of the BP algorithm have been in
dealing with problems related to prediction and control.
1
Back-propagation Algorithm
BASIC DEFINITIONS
1. Backpropagation is a procedure for efficiently calculating
the derivatives of some output quantity of a non-linear
system, with respect to all inputs and parameters of that
system, through calculations proceeding backwards from
outputs to inputs.
2. Backpropagation is any technique for adapting the weights
or parameters of a nonlinear system by somehow using such
derivatives or the equivalent.
According to Paul Werbos there is no such thing as
a “backpropagation network”; he used an ANN
design called a multilayer perceptron.
2
Back-propagation Algorithm
Paul Werbos provided a rule for updating the weights of a
multi-layered network undergoing supervised learning.
It is this weight adaptation rule which is called
backpropagation.
Typically, a fully connected feedforward network is trained
using the BP algorithm: activation in such networks travels
from the input layer towards the output layer, and every unit
in one layer is connected to every unit in the next layer.
There are two sweeps of the fully connected network: forward
sweep and backward sweep.
3
Back-propagation Algorithm
There are two sweeps of the fully connected network: forward
sweep and backward sweep.
Forward Sweep: This sweep is similar to any other
feedforward ANN – the input stimulus is presented to the
network, each unit computes a weighted sum of the outputs of
the units feeding into it and then passes that sum through a
squashing function. The ANN subsequently generates an output.
The ANN may have a number of hidden layers, as in the
multilayer perceptron, and the output from each hidden
layer becomes the input to the next layer forward.
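To make the forward sweep concrete, here is a minimal Python sketch (not from the lecture: the layer sizes, the particular weight values and the logistic squashing function are assumptions made purely for illustration):

    import math

    def squash(v):
        # logistic squashing function
        return 1.0 / (1.0 + math.exp(-v))

    def forward_layer(inputs, weights):
        # weights[j][i] is the weight from input i to unit j of this layer;
        # each unit forms a weighted sum of its inputs and squashes it
        return [squash(sum(w * x for w, x in zip(row, inputs))) for row in weights]

    # hypothetical 2-2-1 network (illustrative weights only)
    hidden_weights = [[0.1, 0.2], [0.3, 0.1]]   # two hidden units, two inputs each
    output_weights = [[0.2, 0.1]]               # one output unit fed by the two hidden units
    x = [0.1, 0.9]
    hidden_output = forward_layer(x, hidden_weights)
    network_output = forward_layer(hidden_output, output_weights)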
4
Back-propagation Algorithm
There are two sweeps of the fully connected network: forward
sweep and backward sweep.
Backward Sweep: This sweep is similar to the forward sweep,
except that what is ‘swept’ are the error values. These values
are essentially the differences between the desired output and
the actual output:
    e_j = d_j − o_j
In the backward sweep the output units send their errors back to
the nearest hidden layer, which in turn passes them on to the next
hidden layer back. No error signal is sent to the input units.
5
Back-propagation Algorithm
For each input vector associate a target output vector
while not STOP
    STOP = TRUE
    for each input vector
        • perform a forward sweep to find the actual output
        • obtain an error vector by comparing the actual and target output
        • if the actual output is not within tolerance, set STOP = FALSE
        • use the backward sweep to determine weight changes
        • update weights
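A self-contained Python sketch of this loop for a network with one hidden layer and a single output unit follows; the logistic activation, the random weight initialisation, the tolerance and the epoch limit are all assumptions made for the sketch rather than part of the lecture:

    import math
    import random

    def squash(v):
        # logistic squashing function
        return 1.0 / (1.0 + math.exp(-v))

    def train_bp(data, n_in, n_hidden, eta=0.25, tolerance=0.1, max_epochs=10000):
        # data: list of (input_vector, target) pairs for a network with one output unit.
        # Each unit stores its weights as [bias, w_1, ..., w_n]; weights start small and random.
        hidden = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        output = [random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        for _ in range(max_epochs):
            stop = True
            for x, d in data:
                # forward sweep
                h = [squash(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in hidden]
                y = squash(output[0] + sum(wi * hi for wi, hi in zip(output[1:], h)))
                if abs(d - y) > tolerance:
                    stop = False
                # backward sweep: local gradients (delta rule)
                delta_out = (d - y) * y * (1 - y)
                delta_hid = [h_j * (1 - h_j) * delta_out * w_j for h_j, w_j in zip(h, output[1:])]
                # update weights
                output[0] += eta * delta_out
                for j, h_j in enumerate(h):
                    output[j + 1] += eta * delta_out * h_j
                for j, d_j in enumerate(delta_hid):
                    hidden[j][0] += eta * d_j
                    for i, x_i in enumerate(x):
                        hidden[j][i + 1] += eta * d_j * x_i
            if stop:
                break
        return hidden, output

    # e.g. hidden_w, output_w = train_bp([([0.1, 0.9], 0.9)], n_in=2, n_hidden=2)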
6
Back-propagation Algorithm
Example: Perform a complete forward and backward sweep of
a 2-2-1 network (2 input units, 2 hidden layer units and 1
output unit) with the following architecture. The target output
is d = 0.9. The input is [0.1 0.9].
[Figure: 2-2-1 network with numbered units and the initial connection weights labelled on the links; target output d = 0.9.]
7
Back-propagation Algorithm
Target value is 0.9, so the error for the output unit is:
    (0.900 − 0.288) × 0.288 × (1 − 0.288) = 0.125
Forward Pass
[Figure: the 2-2-2-1 network annotated with the output from each unit, the input to each unit, and the back-propagated error values; the final network output is 0.288.]
Example of a forward pass and a backward pass through a 2-2-2-1 feedforward network.
Inputs, outputs and errors are shown in boxes.
8
Back-propagation Algorithm
−2 + (0.8 × 0.125) × 1 = −1.900
3 + (0.8 × 0.125) × 0.122 = 3.012
1 + (0.8 × 0.125) × 0.728 = 1.073
3 + (0.8 × 0.04) × 1 = 3.032
−2 + (0.8 × 0.025) × 1 = −1.98
−2 + (0.8 × 0.04) × 0.5 = −1.984
2 + (0.8 × 0.025) × 0.993 = 2.019
2 + (0.8 × 0.025) × 0.5 = 2.010
2 + (0.8 × −0.007) × 1 = 1.994
−2 + (0.8 × −0.007) × 0.1 = −2.001
3 + (0.8 × −0.001) × 0.1 = 2.999
−4 + (0.8 × 0.04) × 0.993 = −3.968
2 + (0.8 × −0.001) × 1 = 1.999
3 + (0.8 × −0.001) × 0.9 = 2.999
−2 + (0.8 × −0.007) × 0.9 = −2.005
New weights calculated following the errors derived above (learning rate 0.8).
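Each line above is an instance of the delta-rule update w_new = w_old + η·δ·y with η = 0.8; a tiny Python check of the first three lines (values copied from above):

    eta = 0.8
    updates = [(-2, 0.125, 1), (3, 0.125, 0.122), (1, 0.125, 0.728)]  # (old weight, delta, input)
    for w_old, delta, y in updates:
        print(round(w_old + eta * delta * y, 3))   # -1.9, 3.012, 1.073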
9
Back-propagation Algorithm
A derivation of the BP algorithm
The error signal at the output of neuron j at the nth
training cycle is given as:
    e_j(n) = d_j(n) − y_j(n)
The instantaneous value of the error energy for neuron j is (1/2) e_j²(n).
The total error energy E(n) can be computed by summing up
the instantaneous energy over all the neurons in the output layer:
    E(n) = (1/2) Σ_{j∈C} e_j²(n)
where the set C includes all output neurons.
10
Back-propagation Algorithm
Derivative or differential coefficient
For a function f(x) at the argument x, the differential coefficient is the limit of
    [f(x + Δx) − f(x)] / Δx
as Δx → 0.
[Figure: the curve y = f(x) with the chord PQ joining P = (x, y) and Q = (x + Δx, y + Δy).]
    dy/dx = lim (Δy / Δx)   as P approaches Q
11
Back-propagation Algorithm
Derivative or differential coefficient
Typically defined for a function of a single
variable: if the left- and right-hand limits
exist and are equal, it is the gradient of the
curve at x, and is the limit of the gradient
of the chord joining the points (x, f(x))
and (x + Δx, f(x + Δx)). The function of x
defined as this limit for each argument x is
the first derivative y′ = f′(x).
12
Back-propagation Algorithm
Partial derivative or partial differential coefficient
The derivative of a function of two or more
variables with respect to one of these
variables, the others being regarded as
constant; written as ∂f/∂x.
13
Back-propagation Algorithm
Total Derivative
The derivative of a function of two or more
variables with respect to a single parameter
in terms of which these variables are
expressed. If z = f(x, y) with parametric equations
x = U(t), y = V(t),
then under appropriate conditions the total
derivative is:
    dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
14
Back-propagation Algorithm
Total or Exact Differential
The differential of a function of two or more
variables with regard to a single parameter in terms
of which these variables are expressed, equal to the
sum of the products of each partial derivative of the
function with the corresponding increment. If
z=f(x,y), x=U(t), y=V(t)
then under appropriate conditions, the total
differential is:
    dz = (∂z/∂x) dx + (∂z/∂y) dy
15
Back-propagation Algorithm
Chain rule of calculus
A theorem that may be used in the differentiation of
a function of a function
    dy/dx = (dy/dt) · (dt/dx)
where y is a differentiable function of t, and t is a
differentiable function of x. This enables a function
y = f(x) to be differentiated by finding a suitable
intermediate variable u such that y is a differentiable
function of u, and u is a differentiable function of x.
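For example (a standard illustration, not taken from the slides): if y = sin(t) and t = x², then dy/dx = (dy/dt)·(dt/dx) = cos(t)·2x = 2x·cos(x²).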
16
Back-propagation Algorithm
Chain rule of calculus
Similarly for partial differentiation:
    ∂f/∂x = (∂f/∂u) · (∂u/∂x)
where f is a function of u and u is a function of x.
17
Back-propagation Algorithm
Chain rule of calculus
Now consider the error signal at the output of a
neuron j at iteration n, i.e. the presentation of the
nth training example:
    e_j(n) = d_j(n) − y_j(n)
where d_j is the desired output and y_j is the actual
output, and the total error energy over all the
neurons in the output layer:
    E(n) = (1/2) Σ_{j∈C} e_j²(n)
where C is the set of all the neurons in the
output layer.
18
Back-propagation Algorithm
Chain rule of calculus
If N is the total number of patterns, the
averaged squared error energy is:
    E_av = (1/N) Σ_{n=1}^{N} E(n)
Note that e_j is a function of y_j and w_ij (the
weights of the connections between neurons in
two adjacent layers i and j).
19
Back-propagation Algorithm
A derivation of the BP algorithm
The total error energy E(n) can be computed by summing up
the instantaneous energy over all the neurons in the output
layer:
    E(n) = (1/2) Σ_{j∈C} e_j²(n)
where the set C includes all the neurons in the output layer.
The averaged squared error energy over all N training patterns is:
    E_av = (1/N) Σ_{n=1}^{N} E(n)
The instantaneous error energy E(n), and therefore the
average energy E_av, is a function of the free parameters of the
network, including the synaptic weights and bias levels.
20
Back-propagation Algorithm
The originators of the BP algorithm suggest
that
    Δw_ij(n) = −η ∂E/∂w_ij
where η is the learning rate parameter of the BP
algorithm. The minus sign indicates the use of
gradient descent in weight space: we seek a direction
for the weight change that reduces the value of E(n).
Written in terms of the local gradient δ_j and the input y_i,
the update is
    Δw_ij(n) = η δ_j(n) y_i(n)
21
Back-propagation Algorithm
A derivation of the BP algorithm
E(n) is a function of e_j(n);
e_j(n) is a function of y_j(n);
y_j(n) is a function of v_j(n);
v_j(n) is a function of w_ji(n).
So E(n) is a function of a function of a function of a function of w_ji(n).
22
Back-propagation Algorithm
A derivation of the BP algorithm
The back-propagation algorithm trains a
multilayer perceptron by propagating back
some measure of responsibility to a hidden
(non-output) unit.
Back-propagation:
• Is a local rule for synaptic adjustment;
• Takes into account the position of a neuron in
a network to indicate how a neuron’s weights are
to change
23
Back-propagation Algorithm
A derivation of the BP algorithm
Layers in back-propagating multi-layer perceptron
1. First layer – comprises fixed input units;
2. Second (and possibly several subsequent layers)
comprise trainable ‘hidden’ units carrying an internal
representation.
3. Last layer – comprises the trainable output units of
the multi-layer perceptron
24
Back-propagation Algorithm
A derivation of the BP algorithm
Modern back-propagation algorithms are based on a
formalism for propagating back the changes in the error
energy E with respect to all the weights w_ij from a unit
(i) to its inputs (j).
More precisely, what is being computed during the
backwards propagation is the rate of change of the error
energy E with respect to the network’s weights. This
computation is also called the computation of the
gradient:
    ∂E/∂w_ij
25
Back-propagation Algorithm
A derivation of the BP algorithm
More precisely, what is being computed during the backwards
propagation is the rate of change of the error energy E with
respect to the network’s weights. This computation is also called
the computation of the gradient:
    ∂E/∂w_ij = ∂[ Σ_j e_j²/2 ] / ∂w_ij
             = ∂[ Σ_j (d_j − y_j)²/2 ] / ∂w_ij
             = (1/2) · Σ_j 2 (d_j − y_j) · ∂(d_j − y_j)/∂w_ij
26
Back-propagation Algorithm
A derivation of the BP algorithm
More precisely, what is being computed during the backwards
propagation is the rate of change of the error energy E with
respect to the network’s weights. This computation is also called
the computation of the gradient:
    ∂E/∂w_ij = Σ_j e_j · [ ∂d_j/∂w_ij − ∂y_j/∂w_ij ]
             = Σ_j e_j · [ 0 − ∂y_j/∂w_ij ]
             = − Σ_j e_j ∂y_j/∂w_ij
27
Back-propagation Algorithm
A derivation of the BP algorithm
The back-propagation learning rule is formulated
to change the connection weights wij so as to
reduce the error energy E by gradient descent
    Δw_ij = w_ij^new − w_ij^old
          = −η ∂E/∂w_ij
          = η Σ_j e_j ∂y_j/∂w_ij
          = η Σ_j (d_j − y_j) ∂y_j/∂w_ij
28
Back-propagation Algorithm
A derivation of the BP algorithm
The error energy E is a function of the error e_j, the output y_j,
the weighted sum of all the inputs v_j, and of the weights w_ji:
    E = E(e_j, w_ji, y_j, v_j)
According to the chain rule then:
    ∂E(n)/∂w_ji = [∂E(n)/∂e_j] · [∂e_j/∂w_ji]
29
Back-propagation Algorithm
A derivation of the BP algorithm
Further applications of the chain rule suggest that:
    ∂E(n)/∂w_ji = [∂E(n)/∂e_j] · [∂e_j/∂y_j] · [∂y_j/∂w_ji]
    ∂E(n)/∂w_ji = [∂E(n)/∂e_j] · [∂e_j/∂y_j] · [∂y_j/∂v_j] · [∂v_j/∂w_ji]
30
Back-propagation Algorithm
A derivation of the BP algorithm
Each of the partial derivatives can be simplified as:
    ∂E(n)/∂e_j = e_j(n)            (from E(n) = (1/2) Σ_{j∈C} e_j²(n))
    ∂e_j/∂y_j = −1                 (from e_j(n) = d_j(n) − y_j(n))
    ∂y_j/∂v_j = φ′(v_j(n))         (from y_j(n) = φ(v_j(n)))
    ∂v_j/∂w_ji = y_i(n)            (from v_j(n) = Σ_i w_ji(n) y_i(n))
31
Back-propagation Algorithm
A derivation of the BP algorithm
The rate of change of the error energy E
with respect to changes in the synaptic
weights is:
    ∂E(n)/∂w_ji = −e_j(n) φ′(v_j(n)) y_i(n)
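This gradient can be sanity-checked numerically; the sketch below uses the logistic φ and arbitrary illustrative values for the weight, input and target (none of these numbers come from the lecture):

    import math

    def phi(v):
        return 1.0 / (1.0 + math.exp(-v))

    def energy(w, y_i, d):
        # error energy for one output neuron j with a single input y_i and weight w
        return 0.5 * (d - phi(w * y_i)) ** 2

    w, y_i, d, eps = 0.3, 0.9, 0.9, 1e-6
    y_j = phi(w * y_i)
    e_j = d - y_j
    analytic = -e_j * y_j * (1 - y_j) * y_i                      # -e_j * phi'(v_j) * y_i
    numeric = (energy(w + eps, y_i, d) - energy(w - eps, y_i, d)) / (2 * eps)
    # analytic and numeric agree to several decimal places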
32
Back-propagation Algorithm
A derivation of the BP algorithm
The so-called delta rule suggests that
    Δw_ji(n) = −η ∂E(n)/∂w_ji
or
    Δw_ji(n) = η δ_j(n) y_i(n)
where δ is called the local gradient.
33
Back-propagation Algorithm
A derivation of the BP algorithm
The local gradient δ is given by
    δ_j(n) = −∂E(n)/∂v_j(n)
or
    δ_j(n) = e_j(n) φ′_j(v_j(n))
34
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j:
    Δw_ji(n) = η e_j(n) φ′_j(v_j(n)) y_i(n)
The weight adjustment requires the computation of the error signal e_j(n).
35
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j: The so-called delta
rule suggests that
    Δw_ji(n) = η e_j(n) φ′_j(v_j(n)) y_i(n)
Logistic function:
    φ(v_j(n)) = 1 / (1 + exp(−v_j(n)))
Derivative of φ(v_j(n)) with respect to v_j(n):
    φ′(v_j(n)) = exp(−v_j(n)) · [1 / (1 + exp(−v_j(n)))]²
               = [1 / (1 + exp(−v_j(n)))] · [1 − 1 / (1 + exp(−v_j(n)))]
36
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j: The so-called delta
rule suggests that
    Δw_ji(n) = η e_j(n) φ′_j(v_j(n)) y_i(n)
Since y_j(n) = φ(v_j(n)), the derivative simplifies to
    φ′(v_j(n)) = y_j(n) [1 − y_j(n)]
and
    Δw_ji(n) = η e_j(n) y_j(n) [1 − y_j(n)] y_i(n)
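The identity φ′(v) = y(1 − y) is easy to verify numerically; a small check in Python (the value of v is arbitrary):

    import math

    phi = lambda v: 1.0 / (1.0 + math.exp(-v))
    v, eps = 0.485, 1e-6
    y = phi(v)
    print(y * (1 - y))                                   # derivative via the identity
    print((phi(v + eps) - phi(v - eps)) / (2 * eps))     # derivative via finite differences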
37
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neuron (j): the delta rule suggests that
    Δw_ji(n) = η e_j(n) φ′_j(v_j(n)) y_i(n)
The weight adjustment requires the computation of
the error signal, but the situation is complicated in
that we do not have a (set of) desired output(s) for
the hidden neurons and consequently we will have
difficulty in computing the error signal ej for the
hidden neuron j.
38
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j):
Recall that the local gradient of the hidden neuron j is given
as δ_j, and that the output y_j equals φ_j(v_j(n)).
    δ_j(n) = −∂E(n)/∂v_j(n) = −[∂E(n)/∂y_j(n)] · [∂y_j(n)/∂v_j(n)]
(we have used the chain rule of calculus here)
    δ_j(n) = −[∂E(n)/∂y_j(n)] · φ′(v_j(n))
39
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): The error energy
related to the hidden neurons is given as
    E(n) = (1/2) Σ_k e_k²(n)
The rate of change of the error energy with respect to
y_j(n), the output of hidden neuron j that feeds the
output-layer neurons k during the backward pass, is given as:
    ∂E(n)/∂y_j(n) = (1/2) Σ_k ∂e_k²(n)/∂y_j(n)
    ∂E(n)/∂y_j(n) = Σ_k e_k(n) · ∂e_k(n)/∂y_j(n)
40
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): Applying the chain rule
once more, the rate of change of the error energy with
respect to y_j(n) is given as:
    ∂E(n)/∂y_j(n) = Σ_k e_k(n) · [∂e_k(n)/∂v_k(n)] · [∂v_k(n)/∂y_j(n)]
41
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): for the neurons k to which
hidden neuron j is connected, the delta rule suggests that
    Δw_kj(n) = η e_k(n) φ′_k(v_k(n)) y_j(n)
The error signal for a hidden neuron has, thus, to be
computed recursively in terms of the error signals
of ALL the neurons to which the hidden neuron j is
directly connected. Therefore, the backpropagation algorithm becomes complicated.
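In practice the recursion boils down to this: the local gradient of hidden neuron j is its own φ′(v_j) times the sum of the downstream local gradients δ_k weighted by the connections w_kj. A hedged Python sketch (the function name and the example numbers, which echo the worked example later in these notes, are illustrative assumptions):

    def hidden_delta(y_j, downstream_deltas, downstream_weights):
        # y_j: output of hidden neuron j (logistic units, so phi'(v_j) = y_j * (1 - y_j))
        # downstream_deltas[k]: local gradient delta_k of each neuron k fed by j
        # downstream_weights[k]: weight w_kj from hidden neuron j to neuron k
        return y_j * (1 - y_j) * sum(d_k * w_kj
                                     for d_k, w_kj in zip(downstream_deltas, downstream_weights))

    # e.g. a hidden unit with output 0.589 feeding one output unit (delta 0.066) through weight 0.3:
    # hidden_delta(0.589, [0.066], [0.3]) is about 0.005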
42
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): The local
gradient δ is given by
    δ_j(n) = −∂E(n)/∂v_j
or
    δ_j(n) = e_j(n) φ′_j(v_j(n))
43
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): The local
gradient δ is given by
    δ_j(n) = −∂E(n)/∂v_j
which can be redefined as
    δ_j(n) = −[∂E(n)/∂y_j] · [∂y_j/∂v_j]
    δ_j(n) = −[∂E(n)/∂y_j] · φ′(v_j(n))
44
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j)
The error energy for the hidden neuron can be
given as
    E(n) = (1/2) Σ_{k∈C} e_k²(n)
Note that we have used the index k instead of
the index j in order to avoid confusion.
45
Examples
1.
(a) Describe why an elementary perceptron cannot solve the
XOR problem.
(b) Consider a neural network with two input nodes and one
output node connected through a single hidden layer with two
neurons, a 2-2-1 network, for solving the XOR problem. Draw the
architectural graph and the signal flow graph of the above-mentioned
network, clearly showing the connection weights
between the input, hidden and output layers of the 2-2-1 network.
(c) Tabulate the inputs and outputs for the 2-2-1 network
showing the hidden layer and output response to the input
vectors [1 1], [1 0], [0 1] and [0 0].
(d) Plot the decision boundaries constructed by hidden neurons
1 and 2 of the 2-2-1 network above and the decision boundaries
by the complete 2-2-1 network.
46
Examples
2.
(a) To which of the two paradigms, ‘learning with a teacher’ and ‘learning
without a teacher’, do the following learning algorithms belong: perceptron
learning; Hebbian learning; the δ-learning rule; and competitive learning? Justify
your answers. Draw block diagrams explaining how a neural network can be
trained using algorithms which learn with or without a teacher.
(b) According to Amari’s general learning rule: the weight vector
w_i = [w_i1, w_i2, ..., w_in]^T increases in proportion to the product of
the input x and the learning signal r. The learning signal r is in general a
function of w_i and x, and for supervised learning networks it is also a
function of the teacher’s signal d_i.
Consider an initial weight vector of
    w_1^T = [1  −1  0  0.5]
and 3 training vectors
    x_1^T = [1  −2  0  −1];  x_2^T = [0  1.5  −0.5  −1];  x_3^T = [−1  1  0.5  −1]
The teacher’s responses for x_1, x_2 and x_3 are d_1 = −1, d_2 = −1 and d_3 = 1 respectively.
47
Examples
2.
(i) Describe how the weight w_1^T will change in one cycle when each of the
vectors x_1, x_2 and x_3 is shown to a discrete bipolar neuron which learns
according to the perceptron learning rule. Assume the learning constant c to be
0.1.
(ii) Describe how the weight w_1^T will change if we used the δ-learning rule
instead of the perceptron learning rule.
(iii) Describe how the weight will change in one cycle for a network which
learns according to the Hebbian learning rule but is being trained with the following
inputs:
    x_1^T = [1  −2  1.5  0];  x_2^T = [1  −0.5  −2  −1.5];  x_3^T = [0  1  −1  1.5]
48
Examples
3.
(a) Define the weight adaptation rule described in the neural networks’ literature as back-propagation
in terms of the two patterns of computation used in the implementation of the rule: the forward and
backward pass.
(b)
Consider a multi-layer feedforward network with a 2-2-1 architecture:
[Figure: 2-2-1 network with input units 1 and 2, hidden units 3 and 4, and output unit 5.]
This network is to be trained using the back-propagation rule on the input pattern
x^T = [0.1  0.9] and a target output d of 0.9.
Perform a complete forward- and backward-pass computation using the input x and the target d for one
cycle. Compute the outputs of the hidden layer neurons 3 and 4 and of the output neuron 5. Compute
the error for the output node and for the hidden nodes.
49
Examples
4.
(a)
A self-organising map has been trained with an input consisting of eight
Gaussian clouds with unit variance but different centres located at eight points around a cube:
(0,0,0,0,0,0,0,0), (4,0,0,0,0,0,0,0) … (0,4,4,0,0,0,0,0) (Figure 1). This input
structure was represented by a two-dimensional lattice of 10 by 10 neurons forming the output layer of a
Kohonen self-organising map (SOM). Table 1 shows the feature map.
[Table 1: Feature Map provided by an SOM (10 × 10 lattice of output neurons; each cell shows the cluster label, 1–8, assigned to that neuron).]
[Fig. 1: Input Pattern (eight Gaussian clouds at the corners of a cube).]
50
Examples
4.
(a) cont.
Do you think that the SOM has classified the Gaussian clouds correctly?
Draw the clusters out in the feature map. What conclusions can be drawn from the
statement that ‘the SOM algorithm preserves the topological relationships that exist in the input
space’?
(b) A self-organised feature map (SOFM) with three input units and four output units is
to be trained using the four training vectors:
    x_1^T = [0.8  0.7  0.4];  x_2^T = [0.6  0.9  0.9];  x_3^T = [0.3  0.4  0.1];  x_4^T = [0.1  0.1  0.3]
The initial weight matrix is:
    [0.5  0.4
     0.6  0.2
     0.8  0.5]
The initial radius is zero and the learning rate α is set to 0.5. Calculate the weight changes
during the first cycle through the training data, taking the training vectors in the order given
above.
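For orientation only, a minimal Python sketch of the kind of update this exercise asks for: the standard Kohonen step with radius zero, so only the winning unit (smallest Euclidean distance to the input) is moved towards the input. The arrays in the usage comment are placeholders, not the exercise data:

    def som_step(weights, x, alpha=0.5):
        # weights[m] is the weight vector of output unit m; radius 0 => update the winner only
        dists = [sum((w_i - x_i) ** 2 for w_i, x_i in zip(w, x)) for w in weights]
        winner = dists.index(min(dists))
        weights[winner] = [w_i + alpha * (x_i - w_i) for w_i, x_i in zip(weights[winner], x)]
        return winner, weights

    # e.g. som_step([[0.5, 0.6, 0.8], [0.4, 0.2, 0.5]], [0.8, 0.7, 0.4])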
51
Backpropagation/Training Algorithm
Unit 6 is the output unit, with units 0 and 3 being the bias units for the input and hidden layers respectively:
net4 = (1.0 × 0.1) + (0.1 × −0.2) + (0.9 × 0.1) = 0.170
o4 = 1 / (1 + exp(−0.17)) = 0.542
net5 = (1.0 × 0.1) + (0.1 × −0.1) + (0.9 × 0.3) = 0.360
o5 = 1 / (1 + exp(−0.36)) = 0.589
Similarly, from the hidden to output layer:
net6 = (1.0 × 0.2) + (0.542 × 0.2) + (0.589 × 0.3) = 0.485
o6 = 1 / (1 + exp(−0.485)) = 0.619
The error for the output node is:
    δ6 = (0.9 − 0.619) × 0.619 × (1 − 0.619) = 0.066
The errors for the hidden nodes are:
    δ5 = 0.589 × (1.0 − 0.589) × (0.066 × 0.3) = 0.005
    δ4 = 0.542 × (1.0 − 0.542) × (0.066 × 0.2) = 0.003
Note that the hidden unit errors are used to update the first layer
of weights. There is no error calculated for the bias unit as no
weights from the first layer connect to the hidden bias.
52
Backpropagation/Training Algorithm
The rate of learning for this example is taken as 0.25. There is no need to give a momentum term for this
first pattern since there are no previous weight changes:
Δw5,6 = 0.25 × 0.066 × 0.589 = 0.010;  the new weight is 0.3 + 0.010 = 0.310
Δw4,6 = 0.25 × 0.066 × 0.542 = 0.009;  then 0.2 + 0.009 = 0.209
Δw3,6 = 0.25 × 0.066 × 1.0 = 0.017;  finally 0.2 + 0.017 = 0.217
The calculation of the new weights for the first layer is left as an exercise.
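The arithmetic of this worked example can be reproduced with a few lines of Python (a verification sketch only; the weight values are read off the example, with units 0 and 3 as the bias units):

    import math

    sig = lambda v: 1.0 / (1.0 + math.exp(-v))

    x1, x2 = 0.1, 0.9
    net4 = 1.0 * 0.1 + x1 * -0.2 + x2 * 0.1          # input layer (bias unit 0) -> unit 4
    net5 = 1.0 * 0.1 + x1 * -0.1 + x2 * 0.3          # input layer (bias unit 0) -> unit 5
    o4, o5 = sig(net4), sig(net5)
    net6 = 1.0 * 0.2 + o4 * 0.2 + o5 * 0.3           # hidden layer (bias unit 3) -> unit 6
    o6 = sig(net6)
    delta6 = (0.9 - o6) * o6 * (1 - o6)
    delta5 = o5 * (1 - o5) * delta6 * 0.3
    delta4 = o4 * (1 - o4) * delta6 * 0.2
    eta = 0.25
    w56_new = 0.3 + eta * delta6 * o5
    w46_new = 0.2 + eta * delta6 * o4
    w36_new = 0.2 + eta * delta6 * 1.0
    # o4 ~ 0.542, o5 ~ 0.589, o6 ~ 0.619, delta6 ~ 0.066, delta5 ~ 0.005, delta4 ~ 0.003,
    # w56_new ~ 0.310, w46_new ~ 0.209, w36_new ~ 0.217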
53
Backpropagation/Training Algorithm
Introduction to ‘artificial’ neural nets
McCulloch-Pitts Neurons
Rosenblatt’s Perceptron
Unsupervised Learning:
Hebbian Learning
Kohonen Maps
Supervised Learning:
(Perceptron Learning)
Backpropagation Algorithm
3 Questions; Answer any 2
54
COURSEWORK
Problem Area; (Related Application Area)
Why Neural Nets (Learning?)
Learning Algorithm
Network Architecture (Total No. of neurons; layers?)
Data: Training Data; Testing Data
Evaluation of Test Results
Conclusions+Opinions
YOUR EVALUATION OF THE
WHOLE SIMULATION,
Max 4 pages
55
ORAL EXAMINATION
Demonstrate your system
Answer questions related to the learning algorithms,
network architecture
Questions on potential applications
How do you construct your input representation?
How do you/your system produce output signals?
How did you choose your test data/training data?
Limitations of neural nets?
56