Financial Informatics – XVI: Supervised Backpropagation Learning
Khurshid Ahmad, Professor of Computer Science, Department of Computer Science, Trinity College, Dublin-2, IRELAND
November 19th, 2008. https://www.cs.tcd.ie/Khurshid.Ahmad/Teaching.html

Preamble

Neural networks 'learn' by adapting in accordance with a training regimen. There are five key algorithms:
• error-correction or performance learning;
• Hebbian or coincidence learning;
• Boltzmann learning (stochastic net learning);
• competitive learning;
• filter learning (Grossberg's nets).

ANN Learning Algorithms

[Figure: error-correction (supervised) learning. A vector describing the environment goes from the ENVIRONMENT to both a TEACHER and the LEARNING SYSTEM; the teacher's desired response is compared with the system's actual response, and the difference is fed back to the learning system as an error signal.]

[Figure: unsupervised learning. A vector describing the state of the environment passes from the ENVIRONMENT directly to the LEARNING SYSTEM; there is no teacher.]

[Figure: reinforcement learning. State-vector input from the ENVIRONMENT reaches a CRITIC, which turns primary reinforcement into heuristic reinforcement for the LEARNING SYSTEM; the system's actions act back on the environment.]

Back-propagation Algorithm: Supervised Learning

Backpropagation (BP) is amongst the most popular algorithms for ANNs: Paul Werbos, who first worked on the algorithm in the 1970s, has estimated that between 40% and 90% of real-world ANN applications use the BP algorithm. Werbos traces the algorithm to the psychologist Sigmund Freud's theory of psychodynamics, and himself applied the algorithm in political forecasting.
• David Rumelhart, Geoffrey Hinton and others applied the BP algorithm in the 1980s to problems related to supervised learning, particularly pattern recognition.
• The most useful applications of the BP algorithm have been in problems related to prediction and control.

Back-propagation Algorithm: Architecture of a BP system

[Figure: architecture of a BP system.]

Back-propagation Algorithm: Basic definitions

1. Backpropagation is a procedure for efficiently calculating the derivatives of some output quantity of a non-linear system, with respect to all inputs and parameters of that system, through calculations proceeding backwards from outputs to inputs.

2.
Backpropagation is any technique for adapting the weights or parameters of a nonlinear system by somehow using such derivatives or the equivalent. According to Paul Werbos there is no such thing as a "backpropagation network"; he used an ANN design called a multilayer perceptron.

Back-propagation Algorithm

Paul Werbos provided a rule for updating the weights of a multi-layered network undergoing supervised learning. It is this weight-adaptation rule which is called backpropagation.

Typically, a fully connected feedforward network is trained using the BP algorithm: activation in such networks travels from the input layer towards the output layer, and every unit in one layer is connected to every unit in the next layer. There are two sweeps of the fully connected network: a forward sweep and a backward sweep.

Forward sweep: This sweep is similar to that of any other feedforward ANN – the input stimulus is presented to the network, each unit computes the weighted sum of its inputs and passes this sum through a squash function, and the network subsequently generates an output. The ANN may have a number of hidden layers, as in the multi-layer perceptron, and the output from each hidden layer becomes the input to the next layer forward.

Backward sweep: This sweep is similar to the forward sweep, except that what is 'swept' are error values – essentially the differences between the actual and desired outputs, $e_j = d_j - o_j$. The errors travel in the opposite direction: the error values leaving one layer become the input to the preceding layer.
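A forward sweep through such a network can be sketched in a few lines of Python (a minimal illustration, not taken from the slides; the logistic squash function and the list-of-weight-matrices layout are my assumptions):

```python
import math

def sigmoid(v):
    """Logistic squash function: maps any weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def forward_sweep(layers, x):
    """One forward sweep through a fully connected feedforward net.

    `layers` is a list of weight matrices, one row of incoming weights
    per unit; the output of each layer becomes the input to the next
    layer forward, as described above.
    """
    activation = x
    for weights in layers:
        activation = [sigmoid(sum(w * a for w, a in zip(row, activation)))
                      for row in weights]
    return activation
```

With all weights at zero every unit outputs sigmoid(0) = 0.5, which makes a handy sanity check.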
In the backward sweep the output unit sends its error back to the first proximate hidden layer, which in turn passes it on to the next hidden layer back. No error signal is sent to the input units.

Back-propagation Algorithm

For each input vector, associate a target output vector.
while not STOP
    STOP = TRUE
    for each input vector
        • perform a forward sweep to find the actual output
        • obtain an error vector by comparing the actual and target outputs
        • if the actual output is not within tolerance, set STOP = FALSE
        • use the backward sweep to determine the weight changes
        • update the weights

Example: Perform a complete forward and backward sweep of a 2-2-1 network (2 input units, 2 hidden units and 1 output unit). The target output is d = 0.9 and the input vector is [0.1 0.9].

[Figure: the 2-2-1 architecture with its initial weights.]

The target value is 0.9, so the error computed for the output unit (whose actual output is 0.288) is:

$(0.900 - 0.288) \times 0.288 \times (1 - 0.288) = 0.125$

[Figure: example of a forward pass and a backward pass through a 2-2-2-1 feedforward network; inputs, outputs and errors are shown in boxes. The errors computed in the backward pass include 0.125 for the output unit and values such as 0.040, 0.025, −0.007 and −0.001 for the hidden units.]
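The training loop above can be turned into running code. The sketch below is a hypothetical minimal implementation for a 2-2-1 network with logistic units and no bias terms; a fixed epoch count stands in for the tolerance test, and all names are mine:

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_2_2_1(samples, eta=0.8, epochs=500, seed=0):
    """Train a 2-2-1 network (2 inputs, 2 hidden units, 1 output) by BP."""
    rnd = random.Random(seed)
    w_h = [[rnd.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # hidden-layer weights
    w_o = [rnd.uniform(-1, 1) for _ in range(2)]                      # output-layer weights
    for _ in range(epochs):
        for x, d in samples:
            # forward sweep
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_h]
            y = sigmoid(sum(w * hi for w, hi in zip(w_o, h)))
            # backward sweep: output error first, then hidden errors
            delta_o = (d - y) * y * (1 - y)
            delta_h = [delta_o * w_o[j] * h[j] * (1 - h[j]) for j in range(2)]
            # weight updates (delta rule)
            for j in range(2):
                w_o[j] += eta * delta_o * h[j]
                for i in range(2):
                    w_h[j][i] += eta * delta_h[j] * x[i]
    return w_h, w_o

def predict(w_h, w_o, x):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_h]
    return sigmoid(sum(w * hi for w, hi in zip(w_o, h)))
```

Training on the single pattern of the example (input [0.1, 0.9], target 0.9) steadily drives the network output towards the target.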
Back-propagation Algorithm

New weights calculated from the errors derived above; each line reads $w^{old} + \eta \times \delta \times y = w^{new}$, with learning rate $\eta = 0.8$:

−2 + (0.8 × 0.125) × 1 = −1.900
3 + (0.8 × 0.125) × 0.122 = 3.012
1 + (0.8 × 0.125) × 0.728 = 1.073
3 + (0.8 × 0.04) × 1 = 3.032
−2 + (0.8 × 0.025) × 1 = −1.980
−2 + (0.8 × 0.04) × 0.5 = −1.984
2 + (0.8 × 0.025) × 0.993 = 2.019
2 + (0.8 × 0.025) × 0.5 = 2.010
2 + (0.8 × −0.007) × 1 = 1.994
−2 + (0.8 × −0.007) × 0.1 = −2.001
3 + (0.8 × −0.001) × 0.1 = 2.999
−4 + (0.8 × 0.04) × 0.993 = −3.968
2 + (0.8 × −0.001) × 1 = 1.999
3 + (0.8 × −0.001) × 0.9 = 2.999
−2 + (0.8 × −0.007) × 0.9 = −2.005

A derivation of the BP algorithm

The error signal at the output of neuron j at the nth training cycle is given as:

$e_j(n) = d_j(n) - y_j(n)$

The instantaneous value of the error energy for neuron j is $\frac{1}{2}e_j^2(n)$.

The total error energy E(n) can be computed by summing the instantaneous energy over all the neurons in the output layer:

$E(n) = \frac{1}{2}\sum_{j \in C} e_j^2(n)$

where the set C includes all output neurons.

Derivative or differential coefficient

For a function f(x) at the argument x, the derivative is the limit of the differential coefficient

$\frac{f(x + \Delta x) - f(x)}{\Delta x} \quad \text{as } \Delta x \to 0$

[Figure: the chord from P = (x, y) to Q = (x + Δx, y + Δy) on the curve y = f(x); dy/dx = lim Δy/Δx as P approaches Q.]

Typically defined for a function of a single variable: if the left- and right-hand limits exist and are equal, the derivative is the gradient of the curve at x, and is the limit of the gradient of the chord joining the points (x, f(x)) and (x + Δx, f(x + Δx)). The function of x defined by this limit at each argument x is the first derivative of y = f(x).
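Each entry in the table of new weights follows the same one-line rule, which can be checked directly (a small sketch; the function name is mine):

```python
def updated_weight(w_old, eta, delta, y_in):
    """Delta-rule weight update: w_new = w_old + eta * delta * y_in."""
    return w_old + eta * delta * y_in

first = updated_weight(-2.0, 0.8, 0.125, 1.0)    # ≈ -1.900, the first entry in the table
second = updated_weight(3.0, 0.8, 0.125, 0.122)  # ≈ 3.012, the second entry
```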
Partial derivative or partial differential coefficient

The derivative of a function of two or more variables with respect to one of these variables, the others being regarded as constant; written as:

$\frac{\partial f}{\partial x}$

Total derivative

The derivative of a function of two or more variables with respect to a single parameter in terms of which these variables are expressed. If z = f(x, y) with parametric equations x = U(t), y = V(t), then under appropriate conditions the total derivative is:

$\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}$

Total or exact differential

The differential of a function of two or more variables expressed in terms of a single parameter, equal to the sum of the products of each partial derivative of the function with the corresponding increment. If z = f(x, y), x = U(t), y = V(t), then under appropriate conditions the total differential is:

$dz = \frac{\partial z}{\partial x}dx + \frac{\partial z}{\partial y}dy$

Chain rule of calculus

A theorem that may be used in the differentiation of a function of a function:

$\frac{dy}{dx} = \frac{dy}{dt} \cdot \frac{dt}{dx}$

where y is a differentiable function of t, and t is a differentiable function of x. This enables a function y = f(x) to be differentiated by finding a suitable intermediate variable u such that y is a differentiable function of u and u is a differentiable function of x.

Similarly for partial differentiation:

$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial x}$

where f is a function of u and u is a function of x.

Now consider the error signal at the output of a neuron j at iteration n – i.e. the
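The chain rule can be verified numerically. In the sketch below (example functions of my own choosing, not from the slides) y = sin(t) with t = x², so dy/dx = cos(x²) · 2x:

```python
import math

x = 0.7
t = x ** 2                      # t = U(x) = x^2
dy_dt = math.cos(t)             # derivative of y = sin(t) with respect to t
dt_dx = 2 * x                   # derivative of t = x^2 with respect to x
chain = dy_dt * dt_dx           # dy/dx by the chain rule

# Central finite difference of the composed function y(x) = sin(x^2)
h = 1e-6
numeric = (math.sin((x + h) ** 2) - math.sin((x - h) ** 2)) / (2 * h)
```

The two values agree to within the finite-difference error.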
presentation of the nth training example:

$e_j(n) = d_j(n) - y_j(n)$

where $d_j$ is the desired output and $y_j$ is the actual output. The total error energy over all the neurons in the output layer is:

$E(n) = \frac{1}{2}\sum_{j \in C} e_j^2(n)$

where C is the set of all the neurons in the output layer.

If N is the total number of patterns, the averaged squared error energy is:

$E_{av} = \frac{1}{N}\sum_{n=1}^{N} E(n)$

Note that $e_j$ is a function of $y_j$ and of $w_{ij}$, the weights of the connections between neurons in two adjacent layers i and j. The instantaneous error energy E(n), and therefore the average energy $E_{av}$, is a function of all the free parameters of the network, including the synaptic weights and bias levels.

The originators of the BP algorithm suggest that

$\Delta w_{ij}(n) = -\eta\,\frac{\partial E}{\partial w_{ij}}$

where $\eta$ is the learning-rate parameter of the BP algorithm. The minus sign indicates the use of gradient descent in weight space: we seek a direction for the weight change that reduces the value of E(n). The resulting update has the form:

$\Delta w_{ij}(n) = \eta\,\delta_j(n)\,y_i(n)$

A derivation of the BP algorithm

E(n) is a function of $e_j(n)$; $e_j(n)$ is a function of $y_j(n)$; $y_j(n)$ is a function of $v_j(n)$; and $v_j(n)$ is a function of $w_{ji}(n)$. Hence E(n) is a function of a function of a function of a function of $w_{ji}(n)$.

The back-propagation algorithm trains a multilayer perceptron by propagating back some measure of responsibility to a hidden (non-output) unit.
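The two energy definitions can be written down directly (a small sketch; the function names are mine):

```python
def instantaneous_energy(errors):
    """E(n) = 0.5 * sum over the output neurons j of e_j(n)^2."""
    return 0.5 * sum(e * e for e in errors)

def average_energy(errors_per_pattern):
    """E_av = (1/N) * sum over the N training patterns of E(n)."""
    return sum(instantaneous_energy(e) for e in errors_per_pattern) / len(errors_per_pattern)
```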
Back-propagation:
• is a local rule for synaptic adjustment;
• takes into account the position of a neuron in the network to indicate how that neuron's weights are to change.

Layers in a back-propagating multi-layer perceptron:
1. The first layer comprises fixed input units.
2. The second (and possibly several subsequent) layers comprise trainable 'hidden' units carrying an internal representation.
3. The last layer comprises the trainable output units of the multi-layer perceptron.

Modern back-propagation algorithms are based on a formalism for propagating back the changes in the error energy E with respect to all the weights $w_{ij}$ from a unit i to its inputs j. More precisely, what is computed during the backward propagation is the rate of change of the error energy E with respect to the network's weights. This computation is also called the computation of the gradient $\partial E / \partial w_{ij}$:

$\frac{\partial E}{\partial w_{ij}} = \frac{\partial\left(\frac{1}{2}\sum_j e_j^2\right)}{\partial w_{ij}} = \frac{\partial\left(\frac{1}{2}\sum_j (d_j - y_j)^2\right)}{\partial w_{ij}} = \frac{1}{2}\cdot 2\sum_j (d_j - y_j)\,\frac{\partial (d_j - y_j)}{\partial w_{ij}} = \sum_j e_j\left(\frac{\partial d_j}{\partial w_{ij}} - \frac{\partial y_j}{\partial w_{ij}}\right)$
Since the desired outputs $d_j$ do not depend on the weights, $\frac{\partial d_j}{\partial w_{ij}} = 0$, and therefore:

$\frac{\partial E}{\partial w_{ij}} = -\sum_j e_j\,\frac{\partial y_j}{\partial w_{ij}}$

The back-propagation learning rule is formulated to change the connection weights $w_{ij}$ so as to reduce the error energy E by gradient descent:

$\Delta w_{ij} = w_{ij}^{new} - w_{ij}^{old} = -\eta\,\frac{\partial E}{\partial w_{ij}} = \eta\sum_j e_j\,\frac{\partial y_j}{\partial w_{ij}} = \eta\sum_j (d_j - y_j)\,\frac{\partial y_j}{\partial w_{ij}}$

The error energy E is a function of the error $e_j$, the output $y_j$, the weighted sum of the inputs $v_j$, and the weights $w_{ji}$:

$E = E(e_j, w_{ji}, y_j, v_j)$

According to the chain rule:

$\frac{\partial E(n)}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j}\,\frac{\partial e_j}{\partial w_{ji}}$

Further applications of the chain rule give:

$\frac{\partial E(n)}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j}\,\frac{\partial e_j}{\partial y_j}\,\frac{\partial y_j}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j}\,\frac{\partial e_j}{\partial y_j}\,\frac{\partial y_j}{\partial v_j}\,\frac{\partial v_j}{\partial w_{ji}}$

Each of the partial derivatives can be simplified:

$\frac{\partial E(n)}{\partial e_j} = e_j(n)$, since $E(n) = \frac{1}{2}\sum_{j\in C} e_j^2(n)$;
$\frac{\partial e_j}{\partial y_j} = -1$, since $e_j = d_j - y_j$;
$\frac{\partial y_j}{\partial v_j} = \varphi_j'(v_j(n))$;
$\frac{\partial v_j}{\partial w_{ji}} = y_i(n)$, since $v_j(n) = \sum_i w_{ji}(n)\,y_i(n)$.

The rate of change of the error energy E with respect to the synaptic weights is therefore:

$\frac{\partial E(n)}{\partial w_{ji}} = -e_j(n)\,\varphi_j'(v_j(n))\,y_i(n)$

The so-called delta rule suggests that

$\Delta w_{ji}(n) = -\eta\,\frac{\partial E(n)}{\partial w_{ji}} = \eta\,\delta_j(n)\,y_i(n)$

where $\delta_j(n)$ is called the local gradient, given by

$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\,\varphi_j'(v_j(n))$

The case of the output neuron j:

$\Delta w_{ji}(n) = \eta\,e_j(n)\,\varphi_j'(v_j(n))\,y_i(n)$

The weight adjustment requires the computation of the error signal $e_j(n)$; for an output neuron this is straightforward, because the desired response $d_j(n)$ is available. For the logistic function:

$\varphi(v_j(n)) = \frac{1}{1 + \exp(-v_j(n))}$
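The factorised gradient $-e_j\,\varphi_j'(v_j)\,y_i$ can be checked against a finite difference for a single logistic neuron (a numeric sketch; the specific numbers are my own):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# One logistic neuron: v = w * y_i, y = sigmoid(v), e = d - y, E = e^2 / 2.
w, y_i, d = 0.3, 0.9, 0.9

def energy(weight):
    y = sigmoid(weight * y_i)
    return 0.5 * (d - y) ** 2

# Analytic gradient from the chain rule above:
# dE/dw = -e * phi'(v) * y_i, with phi'(v) = y * (1 - y) for the logistic.
v = w * y_i
y = sigmoid(v)
e = d - y
analytic = -e * y * (1 - y) * y_i

# Central finite-difference estimate of the same gradient
h = 1e-6
numeric = (energy(w + h) - energy(w - h)) / (2 * h)
```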
The derivative of $\varphi(v_j(n))$ with respect to $v_j(n)$ is:

$\varphi'(v_j(n)) = \frac{\exp(-v_j(n))}{\bigl(1 + \exp(-v_j(n))\bigr)^2} = \frac{1}{1+\exp(-v_j(n))}\left(1 - \frac{1}{1+\exp(-v_j(n))}\right)$

Since $y_j(n) = \varphi(v_j(n))$, this simplifies to

$\varphi'(v_j(n)) = y_j(n)\,\bigl(1 - y_j(n)\bigr)$

and the delta rule for the output neuron becomes

$\Delta w_{ji}(n) = \eta\,e_j(n)\,y_j(n)\bigl(1 - y_j(n)\bigr)\,y_i(n)$

The case of the hidden neuron j: the delta rule suggests that

$\Delta w_{ji}(n) = \eta\,e_j(n)\,\varphi_j'(v_j(n))\,y_i(n)$

The weight adjustment requires the computation of the error signal, but the situation is complicated in that we do not have a (set of) desired output(s) for the hidden neurons, and consequently we have difficulty in computing the error signal $e_j$ for the hidden neuron j.

Recall that the local gradient of the hidden neuron j is given as

$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)}\,\frac{\partial y_j(n)}{\partial v_j(n)}$

(we have used the chain rule of calculus here), so that

$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)}\,\varphi_j'(v_j(n))$

The error energy related to the hidden neuron is given in terms of the errors $e_k$ of the neurons k in the next layer:

$E(n) = \frac{1}{2}\sum_k e_k^2(n)$

The rate of change of the error energy with respect to the hidden neuron's output $y_j(n)$ is:

$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k(n)\,\frac{\partial e_k(n)}{\partial y_j(n)} = \sum_k e_k(n)\,\frac{\partial e_k(n)}{\partial v_k(n)}\,\frac{\partial v_k(n)}{\partial y_j(n)}$

A derivation of the BP
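The identity $\varphi'(v) = y(1 - y)$ used above can be verified against a finite difference (a quick numeric sketch; the chosen point is my own):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

v = 0.4
y = sigmoid(v)
identity = y * (1 - y)                                   # phi'(v) via the identity

h = 1e-6
numeric = (sigmoid(v + h) - sigmoid(v - h)) / (2 * h)    # phi'(v) by finite difference
```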
algorithm: the case of the hidden neurons, continued. For a neuron k in the layer fed by the hidden neuron j, the delta rule suggests that

$\Delta w_{kj}(n) = \eta\,e_k(n)\,\varphi_k'(v_k(n))\,y_j(n)$

The error signal for a hidden neuron thus has to be computed recursively, in terms of the error signals of ALL the neurons to which the hidden neuron j is directly connected. This is what makes the backpropagation algorithm complicated.

The case of the hidden neuron j: the local gradient is given by

$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)}$

which can be redefined, using the chain rule, as

$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j}\,\frac{\partial y_j}{\partial v_j} = -\frac{\partial E(n)}{\partial y_j}\,\varphi_j'(v_j(n))$

The error energy for the hidden neuron can be given as

$E(n) = \frac{1}{2}\sum_{k \in C} e_k^2(n)$

Note that we have used the index k instead of the index j in order to avoid confusion.
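Carrying the recursion above to its standard conclusion (this closing step is the textbook result, not spelled out on the slides), the local gradient of a logistic hidden unit is $\delta_j = y_j(1 - y_j)\sum_k \delta_k w_{kj}$, which can be sketched as:

```python
def hidden_delta(y_j, deltas_next, weights_to_next):
    """Local gradient of a logistic hidden unit j:
    delta_j = phi'(v_j) * sum_k delta_k * w_kj, with phi'(v_j) = y_j * (1 - y_j).
    `deltas_next` are the local gradients of the neurons k fed by j, and
    `weights_to_next` are the corresponding weights w_kj (names are mine)."""
    back_sum = sum(d_k * w_kj for d_k, w_kj in zip(deltas_next, weights_to_next))
    return y_j * (1 - y_j) * back_sum
```

For example, a hidden unit with output 0.5 feeding a single output unit whose local gradient is 0.125 through a weight of 2.0 has δ = 0.25 × 0.25 = 0.0625.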