Topic 1: Neural Networks
Ming-Feng Yeh

OUTLINE
Neural Networks
Cerebellar Model Articulation Controller (CMAC)
Applications
References

Reference: C.L. Lin and H.W. Su, "Intelligent control theory in guidance and control system design: an overview," Proc. Natl. Sci. Counc. ROC (A), pp. 15-30.

1. Neural Networks
As you read these words you are using a complex biological neural network. You have a highly interconnected set of about $10^{11}$ neurons to facilitate your reading, breathing, motion and thinking. In an artificial neural network the neurons are not biological: they are extremely simple abstractions of biological neurons, realized as elements in a program or perhaps as circuits made of silicon.

Biological Inspiration
The human brain consists of a large number (about $10^{11}$) of highly interconnected elements (about $10^4$ connections per element) called neurons. The three principal components of a neuron are the dendrites, the cell body and the axon. The point of contact between neurons is called a synapse.

Biological Neurons
Dendrites: carry electrical signals into the cell body.
Cell body: sums and thresholds the incoming signals.
Axon: carries the signal from the cell body out to other neurons.
Synapse: the point of contact between an axon of one cell and a dendrite of another cell.

Neural Networks
Neural networks are a promising new generation of information processing systems, usually operating in parallel, that demonstrate the ability to learn, recall, and generalize from training patterns or data. Artificial neural networks are collections of mathematical models that emulate some of the observed properties of biological nervous systems and draw on the analogies of adaptive biological learning.

Basic Model ~ 1
[Figure: a single node with inputs $s_1, s_2, \ldots, s_n$ and output $y = f(s_1, s_2, \ldots, s_n)$.]
A neural network is composed of four pieces: nodes, connections between the nodes, nodal functions, and a learning rule for updating the information in the network.
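The four pieces of the basic model can be sketched as follows (an illustrative sketch: the weighted sum and all names are assumptions, since the slides do not specify an implementation).

```python
# Illustrative sketch of a single node in the basic model: connections carry
# weights, and a pluggable nodal function f turns the summed input into y.
# All names here are illustrative, not from the slides.

def nodal_function(s):
    """Example nodal function: +1 if the summed input is non-negative, else -1."""
    return 1 if s >= 0 else -1

def node(inputs, weights, f=nodal_function):
    """y = f(s1, ..., sn): here f is applied to the weighted sum of the inputs."""
    return f(sum(w * s for w, s in zip(weights, inputs)))

y = node([1.0, 0.5, -2.0], [0.4, 0.3, 0.1])   # weighted sum = 0.35 -> y = +1
```

A learning rule would then update `weights` from training data; the following slides make this concrete with the perceptron and the delta rule.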
Basic Model ~ 2
Nodes: a number of nodes, each an elementary processor (EP), is required.
Connectivity: this can be represented by a matrix that shows the connections between the nodes. The number of nodes plus the connectivity define the topology of the network. In the human brain, each neuron is connected to about $10^4$ other neurons. Artificial nets can range from totally connected to a topology where each node is connected only to its nearest neighbors.

Basic Model ~ 3
Elementary processor functions: a node has inputs $s_1, \ldots, s_n$ and an output $y$, and the node generates the output $y$ as a function of the inputs.
A learning rule: there are two types of learning.
Supervised learning: you have to teach the network the "answers."
Unsupervised learning: the network figures out the answers on its own.
All learning rules try to embed information by sampling the environment.

Perceptron Model
Suppose we have a two-class problem. If we can separate these classes with a straight line (decision surface), then they are linearly separable. The question is, how can we find the best line, and what do we mean by "best"? In n dimensions, we have a hyperplane separating the classes. These are all decision surfaces. Another problem is that you may need more than one line to separate the classes.

Decision Surfaces
[Figure: two scatter plots of x- and o-class points; left, linearly separable classes split by a single line; right, a multi-line decision surface.]

Single Layer Perceptron Model
[Figure: a node with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, and output $y$.]
$x_i$: inputs to the node; $y$: output; $w_i$: weights; $\theta$: threshold value; $\mathrm{net} = \sum_i w_i x_i - \theta$.
The output $y$ can be expressed as $y = f\left(\sum_i w_i x_i - \theta\right)$. The function $f$ is called the nodal (transfer) function and is not the same in every application.

Nodal Function
[Figure: three common nodal functions — the hard limiter, the threshold function, and the sigmoid function.]

Single Layer Perceptron Model
Two-input case: the decision surface is the line $w_1 x_1 + w_2 x_2 - \theta = 0$. If we use the hard limiter, then we could say that if the output of the function is +1, the input vector belongs to class A; if the output is −1, the input vector belongs to class B.
XOR: this problem caused the field of neural networks to lose credibility in the late 1960s. The perceptron model could not draw a single line to separate the two classes given by the exclusive-OR.

Exclusive-OR Problem
[Figure: the four XOR points — o at (0,0) and (1,1), x at (0,1) and (1,0) — which no single line can separate.]

Two-Layer Perceptron Model
[Figure: inputs $x_1, x_2$ feed two hidden nodes through weights $w_{11}, w_{12}, w_{21}, w_{22}$; the hidden outputs $y_1, y_2$ feed an output node through weights $w_1', w_2'$.]
The outputs from the two hidden nodes are
$y_1 = f(w_{11} x_1 + w_{12} x_2)$, $y_2 = f(w_{21} x_1 + w_{22} x_2)$.
The network output is $z = f(w_1' y_1 + w_2' y_2)$.

Exclusive-OR Problem
[Figure: a two-input network solving XOR. The hidden unit g receives both inputs with weight +1 and threshold 1.5; the output unit f receives both inputs with weight +1, the hidden unit's output with weight −2, and threshold 0.5. Input patterns 00, 01, 11, 10 map to output patterns 0, 1, 0, 1.]
$g = \mathrm{sgn}(1 \cdot x + 1 \cdot y - 1.5)$
$f = \mathrm{sgn}(1 \cdot x + 1 \cdot y - 2g - 0.5)$
(Here sgn outputs 1 when its argument is positive and 0 otherwise.)
input (0,0): g = 0, f = 0
input (0,1): g = 0, f = 1
input (1,0): g = 0, f = 1
input (1,1): g = 1, f = 0

Multilayer Network
Output units: $o_k = f_k\left(\sum_j w_{kj} o_j\right)$
Internal representation (hidden) units: $o_j = f_j\left(\sum_i w_{ji} o_i\right)$
Input units: $o_i = f_i(i_i)$

Weight Adjustment
Adjust weights by $w_{ji}(l+1) = w_{ji}(l) + \Delta w_{ji}$, where $w_{ji}(l)$ is the weight from unit i to unit j at time l (the l-th iteration) and $\Delta w_{ji}$ is the weight adjustment. The weight change may be computed by the delta rule: $\Delta w_{ji} = \eta \delta_j i_i$, where $\eta$ is a trial-independent learning rate and $\delta_j$ is the error at unit j: $\delta_j = t_j - o_j$, where $t_j$ is the desired output and $o_j$ is the actual output at output unit j. Repeat iterations until convergence.
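The two-layer XOR network above can be verified directly. A minimal sketch, where `step` implements the slides' 0/1-valued sgn:

```python
# The two-layer XOR network from the slides: hidden unit g and output unit f,
# with a unit-step "sgn" that returns 1 for positive arguments and 0 otherwise.

def step(s):
    return 1 if s > 0 else 0

def xor_net(x, y):
    g = step(1 * x + 1 * y - 1.5)          # hidden unit, threshold 1.5
    f = step(1 * x + 1 * y - 2 * g - 0.5)  # output unit, threshold 0.5
    return f

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x, y), "->", xor_net(x, y))
```

The hidden unit fires only on input (1,1); its −2 connection then cancels the two +1 inputs at the output unit, which is what lets the network carve out the non-linearly-separable XOR classes.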
Generalized Delta Rule
$\Delta_p w_{ji} = \eta (t_{pj} - o_{pj}) i_{pi} = \eta \delta_{pj} i_{pi}$
$t_{pj}$: the target output for the j-th component of the output pattern for pattern p.
$o_{pj}$: the j-th element of the actual output pattern produced by the presentation of input pattern p.
$i_{pi}$: the value of the i-th element of the input pattern.
$\delta_{pj} = t_{pj} - o_{pj}$
$\Delta_p w_{ji}$: the change to be made to the weight from the i-th to the j-th unit following presentation of pattern p.

Delta Rule and Gradient Descent
$E_p = \frac{1}{2} \sum_j (t_{pj} - o_{pj})^2$: the error on input/output pattern p.
$E = \sum_p E_p$: the overall measure of the error.
We wish to show that the delta rule implements a gradient descent in E when the units are linear. We will proceed by simply showing that
$-\frac{\partial E_p}{\partial w_{ji}} = \delta_{pj} i_{pi}$,
which is proportional to $\Delta_p w_{ji}$ as prescribed by the delta rule.

When there are no hidden units it is easy to compute the relevant derivative. For this purpose we use the chain rule to write the derivative as the product of two parts: the derivative of the error with respect to the output of the unit times the derivative of the output with respect to the weight:
$\frac{\partial E_p}{\partial w_{ji}} = \frac{\partial E_p}{\partial o_{pj}} \frac{\partial o_{pj}}{\partial w_{ji}}$.
The first part tells how the error changes with the output of the j-th unit, and the second part tells how much changing $w_{ji}$ changes that output.

With no hidden units, it follows from the definition of $E_p$ that
$\frac{\partial E_p}{\partial o_{pj}} = -(t_{pj} - o_{pj}) = -\delta_{pj}$.
The contribution of unit j to the error is simply proportional to $\delta_{pj}$. Since the units are linear, $o_{pj} = \sum_i w_{ji} i_{pi}$, from which we conclude that $\frac{\partial o_{pj}}{\partial w_{ji}} = i_{pi}$. Thus we have
$-\frac{\partial E_p}{\partial w_{ji}} = \delta_{pj} i_{pi}$.

Combining this with the observation that
$\frac{\partial E}{\partial w_{ji}} = \sum_p \frac{\partial E_p}{\partial w_{ji}}$
should lead us to conclude that the net change in $w_{ji}$ after one complete cycle of pattern presentations is proportional to this derivative, and hence that the delta rule implements a gradient descent in E.
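The linear-unit identity just derived, $-\partial E_p / \partial w_{ji} = \delta_{pj} i_{pi}$, can be checked numerically. A small sketch (the weights, inputs, and names are illustrative, not from the slides) comparing a finite-difference gradient of $E_p$ against $\delta_{pj} i_{pi}$:

```python
# Numerical check that, for a linear unit, the delta rule's quantity
# delta * x_i equals -dE_p/dw_i, where E_p = 0.5 * (t - o)^2.

def E_p(w, x, t):
    o = sum(wi * xi for wi, xi in zip(w, x))   # linear unit: o = sum_i w_i x_i
    return 0.5 * (t - o) ** 2

w, x, t = [0.2, -0.5], [1.0, 2.0], 1.0          # illustrative values
o = sum(wi * xi for wi, xi in zip(w, x))
delta = t - o                                   # delta = t - o

eps = 1e-6
for i in range(len(w)):
    w_plus = list(w)
    w_plus[i] += eps
    grad = (E_p(w_plus, x, t) - E_p(w, x, t)) / eps   # dE_p/dw_i, finite difference
    print(-grad, delta * x[i])                         # the two columns should agree
```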
In fact, this is strictly true only if the values of the weights are not changed during this cycle.

Delta Rule for Semilinear Activation Functions in Feedforward Networks
The standard delta rule essentially implements gradient descent in sum-squared error for linear activation functions. Without hidden units, the error surface is shaped like a bowl with only one minimum, so gradient descent is guaranteed to find the best set of weights. With hidden units, however, it is not so obvious how to compute the derivatives, and the error surface is not concave upwards, so there is the danger of getting stuck in a local minimum.

The main theoretical contribution is to show that there is an efficient way of computing the derivatives. The main empirical contribution is to show that the apparently fatal problem of local minima is irrelevant in a wide variety of learning tasks. A semilinear activation function is one in which the output of a unit is a non-decreasing and differentiable function of the net total input,
$\mathrm{net}_{pj} = \sum_i w_{ji} o_{pi}$,
where $o_{pi} = i_{pi}$ if unit i is an input unit. Thus, a semilinear activation function is one in which $o_{pj} = f_j(\mathrm{net}_{pj})$ and $f_j$ is differentiable and non-decreasing.

To get the correct generalization of the delta rule, we must set
$\Delta_p w_{ji} \propto -\frac{\partial E_p}{\partial w_{ji}}$,
where E is the same sum-squared error function defined earlier.

As in the standard delta rule, it is useful to see this derivative as the product of two parts: one part reflecting the change in error as a function of the change in the net input to the unit, and one part representing the effect of changing a particular weight on the net input:
$\frac{\partial E_p}{\partial w_{ji}} = \frac{\partial E_p}{\partial \mathrm{net}_{pj}} \frac{\partial \mathrm{net}_{pj}}{\partial w_{ji}}$.
The second factor is
$\frac{\partial \mathrm{net}_{pj}}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_k w_{jk} o_{pk} = o_{pi}$.

Define
$\delta_{pj} = -\frac{\partial E_p}{\partial \mathrm{net}_{pj}}$.
Thus,
$-\frac{\partial E_p}{\partial w_{ji}} = \delta_{pj} o_{pi}$.
This says that to implement gradient descent in E we should make our weight changes according to
$\Delta_p w_{ji} = \eta \delta_{pj} o_{pi}$,
just as in the standard delta rule. The trick is to figure out what $\delta_{pj}$ should be for each unit $u_j$ in the network.

To compute $\delta_{pj} = -\partial E_p / \partial \mathrm{net}_{pj}$, apply the chain rule:
$\delta_{pj} = -\frac{\partial E_p}{\partial o_{pj}} \frac{\partial o_{pj}}{\partial \mathrm{net}_{pj}}$.
The second factor is
$\frac{\partial o_{pj}}{\partial \mathrm{net}_{pj}} = f_j'(\mathrm{net}_{pj})$,
which is simply the derivative of the function $f_j$ for the j-th unit, evaluated at the net input $\mathrm{net}_{pj}$ to that unit (note $o_{pj} = f_j(\mathrm{net}_{pj})$). To compute the first factor, we consider two cases.

First, assume that unit $u_j$ is an output unit of the network. In this case, it follows from the definition of $E_p$ that
$\frac{\partial E_p}{\partial o_{pj}} = -(t_{pj} - o_{pj})$.
Thus,
$\delta_{pj} = (t_{pj} - o_{pj}) f_j'(\mathrm{net}_{pj})$
for any output unit $u_j$.

If $u_j$ is not an output unit, we use the chain rule to write
$\frac{\partial E_p}{\partial o_{pj}} = \sum_k \frac{\partial E_p}{\partial \mathrm{net}_{pk}} \frac{\partial \mathrm{net}_{pk}}{\partial o_{pj}} = \sum_k \frac{\partial E_p}{\partial \mathrm{net}_{pk}} \frac{\partial}{\partial o_{pj}} \sum_i w_{ki} o_{pi} = -\sum_k \delta_{pk} w_{kj}$.
Thus,
$\delta_{pj} = f_j'(\mathrm{net}_{pj}) \sum_k \delta_{pk} w_{kj}$
whenever $u_j$ is not an output unit.

In summary:
If $u_j$ is an output unit: $\delta_{pj} = (t_{pj} - o_{pj}) f_j'(\mathrm{net}_{pj})$.
If $u_j$ is not an output unit: $\delta_{pj} = f_j'(\mathrm{net}_{pj}) \sum_k \delta_{pk} w_{kj}$.
These two equations give a recursive procedure for computing the $\delta$'s for all units in the network, which are then used to compute the weight changes $\Delta_p w_{ji} = \eta \delta_{pj} o_{pi}$.

The application of the generalized delta rule thus involves two phases.
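The recursive computation of the $\delta$'s can be sketched for a small 2-2-1 sigmoid network. The weights, inputs, and layer sizes below are illustrative assumptions, not from the slides; the two delta equations are exactly the ones derived above.

```python
import math

# A 2-2-1 sigmoid network. Deltas follow the two recursive equations:
#   output unit:  delta_j = (t_j - o_j) * f'(net_j)
#   hidden unit:  delta_j = f'(net_j) * sum_k delta_k * w_kj
# For the logistic sigmoid, f'(net) = o * (1 - o).

def f(n):
    return 1.0 / (1.0 + math.exp(-n))

# Illustrative weights: W1[j][i] is hidden j <- input i; W2[0][j] is output <- hidden j.
W1 = [[0.1, 0.2], [-0.3, 0.4]]
W2 = [[0.5, -0.6]]
x, t = [1.0, 0.5], 1.0
eta = 0.5

# forward pass
net1 = [sum(W1[j][i] * x[i] for i in range(2)) for j in range(2)]
o1 = [f(n) for n in net1]
net2 = sum(W2[0][j] * o1[j] for j in range(2))
o2 = f(net2)

# backward pass (the recursion from the slides)
delta2 = (t - o2) * o2 * (1 - o2)                  # output unit
delta1 = [o1[j] * (1 - o1[j]) * delta2 * W2[0][j]  # hidden units
          for j in range(2)]

# weight changes: Delta w_ji = eta * delta_j * o_i
dW2 = [eta * delta2 * o1[j] for j in range(2)]
dW1 = [[eta * delta1[j] * x[i] for i in range(2)] for j in range(2)]
```

A useful sanity check on such a sketch is to compare `dW2` and `dW1` against finite-difference gradients of $E_p = \frac{1}{2}(t - o_2)^2$; they should agree to within the finite-difference error.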
During the first phase the input is presented and propagated forward through the network to compute the output value $o_{pj}$ for each unit. This output is then compared with the targets, resulting in an error signal $\delta_{pj}$ for each output unit. The second phase involves a backward pass through the network (analogous to the initial forward pass) during which the error signal is passed to each unit in the network and the appropriate weight changes are made.

Ex: Function Approximation
Approximate $g(p) = 1 + \sin\left(\frac{\pi}{4} p\right)$, $-2 \le p \le 2$, with a 1-2-1 network: one input, two log-sigmoid hidden units, and one linear output unit.

Initial Values
$\mathbf{W}^1(0) = \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix}$, $\mathbf{b}^1(0) = \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix}$, $\mathbf{W}^2(0) = \begin{bmatrix} 0.09 & -0.17 \end{bmatrix}$, $b^2(0) = 0.48$.
[Figure: the initial network response $a^2$ plotted against the sine wave over $-2 \le p \le 2$.]

Forward Propagation
Initial input: $a^0 = p = 1$.
Output of the 1st layer:
$\mathbf{a}^1 = \mathbf{f}^1(\mathbf{W}^1 a^0 + \mathbf{b}^1) = \mathrm{logsig}\left(\begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix} [1] + \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix}\right) = \mathrm{logsig}\left(\begin{bmatrix} -0.75 \\ -0.54 \end{bmatrix}\right) = \begin{bmatrix} \frac{1}{1+e^{0.75}} \\ \frac{1}{1+e^{0.54}} \end{bmatrix} = \begin{bmatrix} 0.321 \\ 0.368 \end{bmatrix}$.
Output of the 2nd layer:
$a^2 = f^2(\mathbf{W}^2 \mathbf{a}^1 + b^2) = \mathrm{purelin}\left(\begin{bmatrix} 0.09 & -0.17 \end{bmatrix} \begin{bmatrix} 0.321 \\ 0.368 \end{bmatrix} + 0.48\right) = 0.446$.
Error:
$e = t - a^2 = \left(1 + \sin\left(\frac{\pi}{4} p\right)\right) - a^2 = \left(1 + \sin\frac{\pi}{4}\right) - 0.446 = 1.261$.

Transfer Function Derivatives
$\dot{f}^1(n) = \frac{d}{dn}\left(\frac{1}{1+e^{-n}}\right) = \frac{e^{-n}}{(1+e^{-n})^2} = \left(1 - \frac{1}{1+e^{-n}}\right)\left(\frac{1}{1+e^{-n}}\right) = (1 - a^1)(a^1)$
$\dot{f}^2(n) = \frac{d}{dn}(n) = 1$

Backpropagation
The second-layer sensitivity:
$s^2 = -2 \dot{f}^2(n^2)(t - a^2) = -2 [1] (1.261) = -2.522$.
The first-layer sensitivity:
$\mathbf{s}^1 = \dot{\mathbf{F}}^1(\mathbf{n}^1)(\mathbf{W}^2)^T s^2 = \begin{bmatrix} (1 - a^1_1) a^1_1 & 0 \\ 0 & (1 - a^1_2) a^1_2 \end{bmatrix} \begin{bmatrix} 0.09 \\ -0.17 \end{bmatrix} (-2.522) = \begin{bmatrix} (1-0.321)(0.321) & 0 \\ 0 & (1-0.368)(0.368) \end{bmatrix} \begin{bmatrix} 0.09 \\ -0.17 \end{bmatrix} (-2.522) = \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix}$.
(These layer sensitivities play the role of the $\delta$'s in the generalized delta rule.)

Weight Update
Learning rate $\alpha = 0.1$.
$\mathbf{W}^2(1) = \mathbf{W}^2(0) - \alpha s^2 (\mathbf{a}^1)^T = \begin{bmatrix} 0.09 & -0.17 \end{bmatrix} - 0.1(-2.522)\begin{bmatrix} 0.321 & 0.368 \end{bmatrix} = \begin{bmatrix} 0.171 & -0.0772 \end{bmatrix}$
$b^2(1) = b^2(0) - \alpha s^2 = 0.48 - 0.1(-2.522) = 0.732$
$\mathbf{W}^1(1) = \mathbf{W}^1(0) - \alpha \mathbf{s}^1 (a^0)^T = \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix} [1] = \begin{bmatrix} -0.265 \\ -0.420 \end{bmatrix}$
$\mathbf{b}^1(1) = \mathbf{b}^1(0) - \alpha \mathbf{s}^1 = \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix} = \begin{bmatrix} -0.475 \\ -0.140 \end{bmatrix}$

Choice of Network Structure
Multilayer networks can be used to approximate almost any function, if we have enough neurons in the hidden layers. We cannot say, in general, how many layers or how many neurons are necessary for adequate performance.

Illustrated Example 1
[Figure: 1-3-1 network responses when approximating $g(p) = 1 + \sin\left(\frac{i\pi}{4} p\right)$ for $i = 1, 2, 4, 8$ over $-2 \le p \le 2$.]

Illustrated Example 2
[Figure: responses of 1-2-1, 1-3-1, 1-4-1 and 1-5-1 networks when approximating $g(p) = 1 + \sin\left(\frac{6\pi}{4} p\right)$ over $-2 \le p \le 2$.]

Convergence
[Figure: two training runs for $g(p) = 1 + \sin(\pi p)$, $-2 \le p \le 2$; one converges to the global minimum, the other to a local minimum. The numbers next to each curve indicate the sequence of iterations.]
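The worked 1-2-1 iteration (forward pass, sensitivities, weight update) can be reproduced with a short script. A sketch using only the Python standard library; variable names follow the slides' notation:

```python
import math

# One iteration of backpropagation for the 1-2-1 function-approximation example:
# forward pass, sensitivities s2 and s1, and the weight update with alpha = 0.1.

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

W1, b1 = [-0.27, -0.41], [-0.48, -0.13]   # first-layer weights and biases
W2, b2 = [0.09, -0.17], 0.48              # second-layer weights and bias
alpha, p = 0.1, 1.0
t = 1 + math.sin(math.pi / 4 * p)         # target g(p)

# forward pass
n1 = [W1[j] * p + b1[j] for j in range(2)]
a1 = [logsig(n) for n in n1]              # ~ [0.321, 0.368]
a2 = W2[0] * a1[0] + W2[1] * a1[1] + b2   # ~ 0.446 (linear output unit)
e = t - a2                                # ~ 1.261

# sensitivities
s2 = -2 * 1 * e                           # ~ -2.522 (f2' = 1)
s1 = [(1 - a1[j]) * a1[j] * W2[j] * s2 for j in range(2)]  # ~ [-0.0495, 0.0997]

# weight update
W2 = [W2[j] - alpha * s2 * a1[j] for j in range(2)]   # ~ [0.171, -0.0772]
b2 = b2 - alpha * s2                                   # ~ 0.732
W1 = [W1[j] - alpha * s1[j] * p for j in range(2)]     # ~ [-0.265, -0.420]
b1 = [b1[j] - alpha * s1[j] for j in range(2)]         # ~ [-0.475, -0.140]
```

Wrapping the three steps in a loop over many sample points of $g(p)$ gives the training runs shown in the convergence figure; depending on the initial weights, such a run may settle in the global or a local minimum.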