Neural Network
Fall 2013
Comp3710 Artificial Intelligence
Computing Science
Thompson Rivers University
Course Outline

- Part I – Introduction to Artificial Intelligence
- Part II – Classical Artificial Intelligence
- Part III – Machine Learning
  - Introduction to Machine Learning
  - Neural Networks
  - Probabilistic Reasoning and Bayesian Belief Networks
  - Artificial Life: Learning through Emergent Behavior
- Part IV – Advanced Topics
A new sort of computer

What are (everyday) computer systems good at... and not so good at?

Good at:
- Rule-based systems: doing what the programmer wants them to do

Not so good at:
- Dealing with noisy data
- Dealing with unknown environment data
- Massive parallelism
- Fault tolerance
- Adapting to circumstances

How about humans?
Reference

- Artificial Intelligence Illuminated, Ben Coppin, Jones and Bartlett Illuminated Series
- Many tutorials from the Internet
  - Neural Networks with Java: http://fbim.fh-regensburg.de/~saj39122/jfroehl/diplom/e-main.html
  - Joone – Java Object Oriented Neural Engine: http://www.jooneworld.com/
  - http://www.ai-junkie.com/ai-junkie.html
  - http://www.willamette.edu/~gorr/classes/cs449/intro.html
  - http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
  - ...
Learning Outcomes

- Train a perceptron with a training data set.
- Update the weights in a feed-forward network with error backpropagation.
- Update the weights in a feed-forward network with error backpropagation and the delta rule.
- Implement a feed-forward network for a simple pattern recognition application.
- List two examples of recurrent networks.
- List applications of Hopfield networks, Bidirectional Associative Memories, and Kohonen Maps.
- Update the weights in a Hebbian network.
Chapter Contents

1. Biological Neurons
2. How to model biological neurons – Artificial Neurons
3. The first neural network – Perceptrons
4. How to overcome the problems in Perceptron networks – Multilayer Neural Networks
   - Feed-forward, Backpropagation, Backpropagation with Delta Rule
5. Can an ANN remember? – Recurrent Networks
   - Hopfield Networks
   - Bidirectional Associative Memories
   How to learn without using a training data set:
6. Kohonen Maps
7. Hebbian Learning
8. Fuzzy Neural Networks
9. Evolving Neural Networks
1. Biological Neurons

- The human brain is made up of about 100 billion simple processing units – neurons. -> parallelism; emergent behavior; ...

(Figure: a biological neuron – dendrites, cell body, nucleus, axon, synapse)

- Inputs are received on dendrites, and if the input levels are over a threshold, the neuron fires, passing a signal through the axon to the synapse, which then connects to another neuron.
- The human brain has a property known as plasticity, which means that neurons can change the nature and number of their connections to other neurons in response to events that occur. In this way, the brain is able to learn.
- [Q] How to model?
2. Artificial Neurons

- Artificial neurons are based on biological neurons.
- McCulloch and Pitts (1943)
- Each neuron in the network receives one or more inputs.
- An activation function is applied to the inputs, which determines the output of the neuron – the activation level.
- Weights are associated with synapses.

- Three typical activation functions:
  - Step function: y = 1 if x > t, else 0
  - Sigmoid: y = 1 / (1 + e^(-x))
  - Linear: y = ax + b

- A typical activation function, called the step function, works as follows:

  (Figure: inputs x1..xn with weights w1..wn feeding node Y; here
   w1 = 0.8, w2 = 0.4; x1 = 0.7, x2 = 0.9; t = 0.8)

  X = Σ (i = 1..2) wi xi = ???
  Y = ???

- Each previous node i has a weight, wi, associated with it. The sum of all the weights is 1. The input from previous node i is xi.
- t is the threshold.
- So if the weighted sum of the inputs to the neuron is above the threshold, then the neuron fires.
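The weighted sum above can be checked directly; a quick sketch with the slide's numbers (w1 = 0.8, w2 = 0.4, x1 = 0.7, x2 = 0.9, t = 0.8):

```python
# Step-activation check with the slide's example values.
w = [0.8, 0.4]
x = [0.7, 0.9]
t = 0.8

X = round(sum(wi * xi for wi, xi in zip(w, x)), 2)  # weighted sum of the inputs
Y = 1 if X > t else 0                               # step activation: fire if X > t
print(X, Y)  # 0.92 1 -> the weighted sum exceeds the threshold, so the neuron fires
```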
- There is no central processing or control mechanism.
- The entire network is involved in every piece of computation that takes place.
- The processing time in each artificial neuron is small, unlike in a real biological neuron.
- Parallelism of a massive number of neurons
- The weight associated with each connection (equivalent to a synapse in the biological brain) can be changed in response to particular sets of inputs and events. In this way, an artificial neural network is able to learn.
- [Q] How and when to change the weights?
3. Perceptrons

- A perceptron is a single neuron that classifies a set of inputs into one of two categories (usually 1 or -1).
- Rosenblatt (1958)




- If the inputs are in the form of a grid, a perceptron can be used to recognize visual images of shapes.
- The perceptron usually uses a step function, which returns 1 if the weighted sum of inputs exceeds a threshold, and 0 otherwise:

  X = Σ (i = 1..n) wi xi
  Y = Step(X) = 1 if X > t, 0 if X <= t

  Example: w1 = 0.8; w2 = 0.4; x1 = 0.7; x2 = 0.9; t = 0.8
  X = Σ (i = 1..2) wi xi = ???
  Y = ???

- The perceptron is trained as follows:
  - First, random weights (usually between -0.5 and 0.5) are given.
  - An item of training data is presented.
  - The output classification is observed.
  - If the perceptron misclassifies it, the weights are modified according to:

    wi <- wi + (a * xi * e)

    - a is the learning rate, between 0 and 1.
    - e is the size of the error.
  - Perceptron training rule for e:
    - 0 if the output is correct
    - Positive if the output is too low
    - Negative if the output is too high
  - Training continues until all errors become zero; then the weights are not changed anymore.


- Example of logical OR for two inputs, with t = 0 and a = 0.2
- Random initial weights: w1 = -0.2, w2 = 0.4

  Update rule: wi <- wi + (a * xi * e)

- The first training data: x1 = 0, x2 = 0
  - Expected output is 0 (= 0 OR 0).
  - Y = Step(0 * -0.2 + 0 * 0.4) = 0
  - The output is correct; the weights are not changed.

  wi <- wi + (a * xi * e)

- For the next training data: x1 = 0, x2 = 1 (w1 = -0.2, w2 = 0.4)

  Y = Step(0 * -0.2 + 1 * 0.4) = Step(0.4) = 1    Correct!

- Next: x1 = 1, x2 = 0 (w1 = -0.2, w2 = 0.4)

  Y = Step(1 * -0.2 + 0 * 0.4) = Step(-0.2) = 0   Wrong!
  e = 1 - 0 = 1
  w1 <- -0.2 + (0.2 * 1 * 1) = 0
  w2 <- 0.4 + (0.2 * 0 * 1) = 0.4

- Next: x1 = 1, x2 = 1 (w1 = 0, w2 = 0.4)

  Y = Step(1 * 0 + 1 * 0.4) = Step(0.4) = 1       Correct!
Epoch | x1 | x2 | Expected Y | Actual Y | Error |  w1  |  w2
  1   | 0  | 0  |     0      |    0     |   0   | -0.2 | 0.4
  1   | 0  | 1  |     1      |    1     |   0   | -0.2 | 0.4
  1   | 1  | 0  |     1      |    0     |   1   |  0   | 0.4
  1   | 1  | 1  |     1      |    1     |   0   |  0   | 0.4
  2   | 0  | 0  |     0      |    0     |   0   |  0   | 0.4
  2   | 0  | 1  |     1      |    1     |   0   |  0   | 0.4
  2   | 1  | 0  |     1      |    0     |   1   |  0.2 | 0.4
  2   | 1  | 1  |     1      |    1     |   0   |  0.2 | 0.4
  3   | 0  | 0  |     0      |    0     |   0   |  0.2 | 0.4
  3   | 0  | 1  |     1      |    1     |   0   |  0.2 | 0.4
  3   | 1  | 0  |     1      |    1     |   0   |  0.2 | 0.4
  3   | 1  | 1  |     1      |    1     |   0   |  0.2 | 0.4
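The training loop traced in the table above can be sketched as follows (same parameters as the slides: t = 0, a = 0.2, initial weights -0.2 and 0.4):

```python
# Perceptron training for logical OR, mirroring the slide's trace.
def step(x, t=0.0):
    return 1 if x > t else 0

def train_perceptron(data, w, a=0.2, t=0.0, max_epochs=100):
    for _ in range(max_epochs):
        errors = 0
        for xs, expected in data:
            y = step(sum(wi * xi for wi, xi in zip(w, xs)), t)
            e = expected - y                       # perceptron training rule for e
            if e != 0:
                errors += 1
                w = [wi + a * xi * e for wi, xi in zip(w, xs)]
        if errors == 0:                            # a whole epoch with no errors
            break
    return w

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(or_data, [-0.2, 0.4]))  # [0.2, 0.4], as in the table
```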

- [Q] Can you test the previous system with
  - 0.1, 1
  - -0.7, 0.3
  - 0, -0.2
- [Q] Can you develop a system for the Boolean AND operation?
- [Q] What is learning in Perceptrons?
- Perceptrons can only classify linearly separable functions (such as AND and OR).
- The first of the following graphs shows a linearly separable function (OR).
- The second is not linearly separable (Exclusive-OR).
- Demo – Perceptron Learning Applet
- [Q] How to improve?
4. Multilayer Neural Networks

- Multilayer neural networks can classify a range of functions, including non-linearly separable ones.
- A feed-forward network
  - Not cellular automata
  - Neurons work synchronously
- [Q] How to train?
- Each input-layer neuron connects to all neurons in the hidden layer (or layers).
- The neurons in the hidden layer connect to all neurons in the output layer.

- Demo – XOR, AND, OR
- XOR can be written:

  y = (x1 OR x2) AND (NOT x1 OR NOT x2)
  Therefore, y = (x1 OR x2) AND NOT (x1 AND x2)

- Split:

  y1 = x1 OR x2
  y2 = NOT (x1 AND x2)
  y = y1 AND y2
Backpropagation

- Multilayer neural networks learn in the same way as Perceptrons (wi <- wi + (a * xi * e)), but training takes too long a time. [Q] Why?
- There are many more weights, and it is important to assign credit (or blame) correctly when changing weights.
- Backpropagation networks use the sigmoid activation function, not a step function, as it is easy to differentiate.

- For node j:
  - Xj is the input, yj is the output
  - n is the number of inputs to node j
  - θj is the threshold for j

  Xj = Σ (i = 1..n) wij xi - θj

  yj = 1 / (1 + e^(-Xj))

  (Figure: xi --wij--> node j --wjk--> node k; output of j is yj)

- After values are fed forward through the network, errors are fed back to modify the weights in order to train the network.
- For each node, we calculate an error gradient.




- For a node k in the output layer, the error ek is the difference between the desired output and the actual output.
- The error gradient for k is:

  δk = yk (1 - yk) ek

- Now the weights are updated as follows:

  wjk <- wjk + η yj δk

- Similarly, for a node j in the hidden layer:

  δj = yj (1 - yj) Σk wjk δk
  wij <- wij + η xi δj

- η is the learning rate (a positive number below 1).
- Known as gradient descent.

  (Figure: xi --wij--> node j --wjk--> node k; outputs yj, yk)
- Example with η = 0.2:

  wjk = .45, wjl = .82, yj = .9, δk = .2, δl = -.7

  wjk <- .45 + .2 * .9 * .2 = .486
  wjl <- .82 + .2 * .9 * (-.7) = .694

  δj = yj (1 - yj) (wjk δk + wjl δl)
     = .9 * .1 * (.486 * .2 + .694 * (-.7))
     = -.034974

  wij <- wij + .2 * xi * (-.034974)

  (Figure: xi --wij--> node j; j --wjk--> node k and --wjl--> node l)
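The worked update above can be reproduced in a few lines (η = 0.2; the weights and δ values are the slide's):

```python
# One backpropagation step for hidden node j, reproducing the slide's numbers.
eta = 0.2
y_j = 0.9
w = {"jk": 0.45, "jl": 0.82}           # hidden-to-output weights
delta = {"k": 0.2, "l": -0.7}          # error gradients at output nodes k and l

# Weight update: w_jk <- w_jk + eta * y_j * delta_k
for out in ("k", "l"):
    w["j" + out] += eta * y_j * delta[out]

# Error gradient for hidden node j (the slide uses the updated weights here)
delta_j = y_j * (1 - y_j) * (w["jk"] * delta["k"] + w["jl"] * delta["l"])
print(round(w["jk"], 3), round(w["jl"], 3), round(delta_j, 6))  # 0.486 0.694 -0.034974
```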



- Not likely to occur in the human brain
- Learning is too slow. With some simple problems it can take hundreds or even thousands of epochs to reach a satisfactorily low level of error. [Q] Why?
- Weights are changed too easily. Can we use a sort of the idea of Simulated Annealing?
- [Q] How to improve?
Backpropagation with Delta Rule

- Generalized delta rule: inclusion of momentum, the extent to which a weight was changed on the previous iteration:

  Δwjk(t) = η yj δk + α Δwjk(t-1),
  wjk(t) = wjk(t-1) + Δwjk(t),

  where 0 <= α <= 1, usually very high such as 0.95

- Simple backpropagation: wjk(t) - wjk(t-1) = η yj δk
- Hyperbolic tangent instead of the sigmoid activation function:

  tanh(x) = 2a / (1 + e^(-bx)) - a,

  where a and b are constants, such as a = 1.7 and b = 0.7
- ...
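A sketch of the momentum update; only α = 0.95 and the form of the rule come from the slide, the weight, activation, gradient, and previous-change values are hypothetical:

```python
# Generalized delta rule: the current change includes a fraction (alpha)
# of the previous change, which smooths the trajectory of gradient descent.
eta, alpha = 0.2, 0.95
y_j, delta_k = 0.9, 0.2      # hypothetical activation and error gradient
w_jk, prev_dw = 0.45, 0.01   # hypothetical weight and previous weight change

dw = eta * y_j * delta_k + alpha * prev_dw   # momentum term added
w_jk += dw
print(round(dw, 4), round(w_jk, 4))  # 0.0455 0.4955
```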
Backpropagation

- Advantages:
  - It works!
  - Relatively fast
- Downsides:
  - Requires a training set
  - Can be slow
  - Probably not biologically realistic
- Alternatives to Backpropagation:
  - Hebbian learning
    - Not successful in feed-forward nets
  - Reinforcement learning
    - Only limited success
  - Artificial evolution
    - More general, but can be even slower than backpropagation
Example: Voice Recognition

- Task: Learn to discriminate between two different voices saying "Hello"
- [Q] How to implement?
- Data
  - Sources: Steve Simpson, David Raubenheimer
  - Format: frequency distribution (60 bins)
- Network architecture
  - Feed-forward network
    - 60 inputs (one for each frequency bin)
    - 6 hidden nodes
    - 2 outputs (0-1 for "Steve", 1-0 for "David")
  - [Q] What is the total number of nodes?
- Presenting the data (untrained network)
    Steve: 0.43  0.26
    David: 0.73  0.55
- Calculate error
    Steve: |0.43 - 0| = 0.43,  |0.26 - 1| = 0.74
    David: |0.73 - 1| = 0.27,  |0.55 - 0| = 0.55
- Backpropagate error and adjust weights
    Steve: 0.43 + 0.74 -> overall error 1.17
    David: 0.27 + 0.55 -> overall error 0.82
- Repeat the process (sweep) for all training pairs:
  - Present data
  - Calculate error
  - Backpropagate error
  - Adjust weights
- Repeat the process multiple times
- Presenting the data (trained network)
    Steve: 0.01  0.99
    David: 0.99  0.01

Results – Voice Recognition

- Performance of trained network
  - Discrimination accuracy between known "Hello"s: 100%
  - Discrimination accuracy between new "Hello"s: 100%

Results – Voice Recognition (contd.)

- The network has learnt to generalise from the original data.
- Networks with different weight settings can have the same functionality.
- Trained networks 'concentrate' on lower frequencies.
- The network is robust against non-functioning nodes.
Applications of Feed-forward Nets

- Pattern recognition
  - Character recognition
  - Face recognition
- Sonar mine/rock recognition (Gorman & Sejnowski, 1988)
- Navigation of a car (Pomerleau, 1989)
- Stock-market prediction
- Pronunciation – NETtalk (Sejnowski & Rosenberg, 1987)
5. Recurrent Networks

- Feed-forward networks do not have memory. [Q] What does it mean?
  - Acyclic
  - Once a feed-forward network is trained, its state is fixed and does not alter as new input data is presented to it.
- Recurrent networks, also called feedback networks, can have arbitrary connections between nodes in any layer, even backward from output nodes to input nodes. The internal state can alter as sets of input data are presented to it -> a memory. [Q] What does it mean?
  - Also called memory units






- Biological nervous systems show high levels of recurrency (but feed-forward structures exist too).
- Recurrent networks can be used to solve problems where the solution depends on previous inputs as well as current inputs (e.g. predicting stock market movements).
- Inputs are fed through the network, including feeding data back from outputs to inputs, and this process repeats until the values of the outputs do not change – a state of equilibrium or stability.
- The stable values of the network are called fundamental memories.
- But it is not always the case that a recurrent network reaches a stable state.
- Recurrent networks are also called attractor networks.
Hopfield Networks

- A Hopfield Network is a recurrent network.
- It uses a sign activation function:

  Sign(X) = +1 if X > 0, -1 if X < 0

- If a neuron receives a 0 as an input it does not change state – in other words, it continues to output the same previous value.
- Weights are usually represented as matrices:

  W = Σ (i = 1..N) Xi Xi^t - N I, where
    Xi is an input vector of m input values,
    I is the m x m identity matrix, and
    N is the number of states (Xi) that are to be learned.

- Three states to learn: [Q] How many input values?

  X1 = ( 1  1  1  1  1)^t
  X2 = (-1 -1 -1 -1 -1)^t
  X3 = ( 1 -1  1  1 -1)^t

- Then, the learning step is

  W = Σ (i = 1..3) Xi Xi^t - 3I =

      ( 0  1  3  3  1
        1  0  1  1  3
        3  1  0  3  1
        3  1  3  0  1
        1  3  1  1  0 )

- The three states will be the stable states for the network.
0
1

Output vector

Yi  Sign(WX i   ),
WX 1  3

where  contains the thresholdes
3
1
for each of the five inputs
0
t
X 1  1 1 1 1 1
1

t
WX 2  3
X 2   1 1 1 1 1

t
3
X 3  1 1 1 1 1
1
0 1 3 3 1 
0
1 0 1 1 3 
1

3


t
WX 3  3
W
X i X i  3  I  3 1 0 3 1 



i 1
3
3 1 3 0 1 
1
1 3 1 1 0 

Neural Network
1
0
1
1
3
1
0
1
1
3
1
0
1
1
3
3 3 1  1 8 
1
1
1 1 3  1 6 

0 3 1  1  8   1  Y1  X 1
   

3 0 1  1 8 
1
1
1 1 0  1 6 
3 3 1   1  8 
1 1 3   1  6 
0 3 1   1   8   ?  Y2  ?
   
3 0 1   1  8 
1 1 0   1  6 
3 3 1  1   4 
1 1 3   1 0 
0 3 1  1    4   ?  Y3  ?
   
3 0 1  1   4 
1 1 0   1 0 
 1
44
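A runnable sketch of the storage and recall steps. The three stored states are a reconstruction of the slide's example (minus signs were lost in the transcript), chosen to be consistent with the printed W and W·Xi values:

```python
# Hopfield network: storage (W = sum Xi Xi^t - N*I) and one recall step.
X = [[1, 1, 1, 1, 1],
     [-1, -1, -1, -1, -1],
     [1, -1, 1, 1, -1]]
n, N = len(X[0]), len(X)

W = [[sum(x[a] * x[b] for x in X) - (N if a == b else 0)
      for b in range(n)] for a in range(n)]

def recall(W, x):
    """One synchronous update; a zero activation keeps the previous value."""
    a = [sum(wab * xb for wab, xb in zip(row, x)) for row in W]
    return [1 if ai > 0 else -1 if ai < 0 else xi for ai, xi in zip(a, x)]

print(W[0])             # [0, 1, 3, 3, 1]
print(recall(W, X[0]))  # [1, 1, 1, 1, 1] -> X1 is a stable state
```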

- Output vector: Yi = Sign(W Xi - θ)

- Let X4 = (1 1 -1 1 1)^t. Then Y4?

  W X4 = (2 4 8 2 4)^t  ->  Sign(W X4) = (1 1 1 1 1)^t, so Y4 = X1

- Let X5 = (-1 1 -1 1 1)^t. Then

  W X5 = (? ? ? ? ?)^t  ->  Y5 = ?

  Then Y5 = (1 1 1 -1 1)^t

- Applied again, then Y5 -> X1. [Q] What does this mean?

- Three steps:
  - Training the weights with the attractor states as inputs – a storage or memorization stage
  - Testing
  - Using the network as a memory to retrieve data from its memory
- The network is trained to represent a set of attractors, or stable states. Any input usually will be mapped to an output state that is the attractor closest to the input.
  - The measure of distance is the Hamming distance, which counts the number of elements of the vectors that differ.
- Hence, the Hopfield network is a memory that usually maps an input vector to the memorized vector whose Hamming distance from the input vector is least.
- In fact, it does not always converge to the state closest to the original input.

- [Q] Applications of Hopfield networks?
  - Pattern recognition
- [Q] How?
- Demo
  - Pattern recognition: http://www.cbu.edu/~pong/ai/hopfield/hopfieldapplet.html
- [Q] How to implement the above demo?
- [Q] How is pattern recognition using Hopfield networks different from the one that uses the backpropagation network in the seminar?

- A Hopfield network is autoassociative – it can only associate an item with itself or a similar one.
- However, the human brain is fully associative, or heteroassociative, which means one item is able to cause the brain to recall an entirely different item.
  - E.g., fall makes me think of autumn colors.
- [Q] How to improve Hopfield networks?
Bidirectional Associative Memories

- A BAM (Bidirectional Associative Memory) is a heteroassociative memory:
  - Like the brain, it can learn to associate one item from one set with another completely unrelated item in another set.
  - Similar in structure to the Hopfield network
- The network consists of two fully connected layers of nodes.
  - Every node in one layer is connected to every node in the other layer, not to any node in the same layer.
  - In a Hopfield network, there is only one layer and all nodes are interconnected.
- The BAM is guaranteed to produce a stable output for any given inputs and for any training data.



- The BAM uses a sign activation function.
- Two sets of data are to be learned, so that when an item from set X is presented to the network, it will recall a corresponding item from set Y.
- The weight matrix is

  W = Σ (i = 1..n) Xi Yi^t
  X1 = [ 1  1]^t       Y1 = [ 1 -1  1]^t
  X2 = [-1 -1]^t       Y2 = [-1  1 -1]^t

  W = X1 Y1^t + X2 Y2^t = [ 2 -2  2
                            2 -2  2 ]

  Sign(W^t X1) = Y1      [Q] Can you prove it?
  Sign(W Y1) = X1
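A sketch of BAM storage and bidirectional recall; the exact pattern values are an assumed reconstruction of the slide's example (signs were lost in the transcript):

```python
# BAM: W = sum_i Xi Yi^t; recall runs in both directions through W.
X = [[1, 1], [-1, -1]]
Y = [[1, -1, 1], [-1, 1, -1]]

W = [[sum(x[a] * y[b] for x, y in zip(X, Y)) for b in range(3)]
     for a in range(2)]                     # a 2 x 3 weight matrix

def sign(v):
    return [1 if vi >= 0 else -1 for vi in v]

# Presenting X1 on the X layer recalls Y1 on the Y layer, and vice versa.
recalled_y = sign([sum(W[a][b] * X[0][a] for a in range(2)) for b in range(3)])
recalled_x = sign([sum(W[a][b] * Y[0][b] for b in range(3)) for a in range(2)])
print(W)                       # [[2, -2, 2], [2, -2, 2]]
print(recalled_y, recalled_x)  # [1, -1, 1] [1, 1]
```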
- [Q] Applications of BAM?
  - Pattern recognition
- Demo
  - Pattern recognition: http://www.cbu.edu/~pong/ai/bam/bamapplet.html
6. Kohonen Maps

- Also called SOM (Self-Organizing Feature Map)
- An unsupervised learning system: there is no labelled training data.
- Finds the natural structure inherent in the input data
- The objective of a Kohonen network is to map input vectors (patterns) of arbitrary dimension N onto a discrete map with 1, 2 or 3 dimensions.
- Patterns close to one another in the input space should be close to one another in the map: they should be topologically ordered.


- Two layers of nodes: an input layer and a cluster (output) layer.
- Uses competitive learning (winner-take-all):
  - The number of input nodes is equal to the size of the input vectors.
  - The number of cluster nodes is equal to the number of clusters to find.
  - Every input node is connected to every node in the cluster layer.
  - Every input is compared with the weight vectors of each node in the cluster layer.
  - The node in the cluster layer that most closely matches the input fires. This node is called the winner. This is the clustering of the input.
    - Euclidean distance is used.
  - The winning node has its weight vector modified to be closer to the input vector.

- Learning process:
  - initialize the weights for each cluster unit
  - loop until weight changes are negligible
    - for each input pattern
      - present the input pattern
      - find the winning cluster unit (i.e., the most similar one to the input pattern)
      - find all units in the neighborhood of the winner
      - update the weight vectors for all those units
    - reduce the size of neighborhoods if required

- The cluster node j for which dj is the smallest is the winner:

  dj = Σ (i = 1..n) (wij - xi)^2, where x is the input vector and wj is the weight vector of cluster node j

- Update of the weights of the winner:

  wij <- wij + α (xi - wij)

  => The weights become similar to the input, i.e., the winner becomes similar to the input.

- The learning rate α decreases over time.
- In fact, a neighborhood of neurons around the winner is usually updated together.
- The radius defining the neighborhood decreases over time.
- The training phase terminates when the modification of weights becomes very small.
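The winner-take-all loop above can be sketched as follows; this simplification drops the neighborhood and the decaying learning rate, and the data and parameter values are hypothetical:

```python
# Minimal Kohonen-style competitive learning: find the winner by Euclidean
# distance, then move its weight vector toward the input.
import random

def train_som(patterns, n_clusters, alpha=0.5, epochs=20, dim=2):
    random.seed(1)  # fixed seed so the sketch is repeatable
    w = [[random.random() for _ in range(dim)] for _ in range(n_clusters)]
    for _ in range(epochs):
        for x in patterns:
            j = min(range(n_clusters),
                    key=lambda j: sum((wj - xi) ** 2 for wj, xi in zip(w[j], x)))
            w[j] = [wj + alpha * (xi - wj) for wj, xi in zip(w[j], x)]
    return w

# Two well-separated groups of 2-D points: each group claims one cluster node.
data = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
print(train_som(data, n_clusters=2))
```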
- It has been shown that while self-organizing maps with a small number of nodes behave in a way that is similar to K-means, larger self-organizing maps rearrange data in a way that is fundamentally topological in character.
- High-dimensional data can be mapped down to 2 or 3 dimensions.
- Applications?
  - Clustering
  - Dimension reduction; visualization of high-dimensional data
- Demo
  - http://www.superstable.net/sketches/som/
7. Hebbian Learning

- Hebb's law: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
- Hence, if two neurons that are connected together fire at the same time, the weight of the connection between them is strengthened.
- Conversely, if the neurons fire at different times, the weight of the connection between them is decreased.

  (Figure: node i --wij--> node j; input xi, output yj)


- The activity product rule is used to modify the weight of a connection between two nodes i and j that fire at the same time:

  wij <- wij + α yj xi

  where α is the learning rate, xi is the input to node j from node i, and yj is the output of node j.
- Hebbian networks usually also use a forgetting factor φ, which decreases the weight of the connection between the two nodes if they fire at different times:

  wij <- wij + α yj xi - φ yj wij

  (Figure: node i --wij--> node j; input xi, output yj)
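The update rule above can be sketched directly; α = 0.1 and φ = 0.01 are assumed values chosen only for illustration:

```python
# Activity product rule with a forgetting factor.
alpha, phi = 0.1, 0.01

def hebbian_update(w_ij, x_i, y_j):
    # strengthen the connection when i and j are active together;
    # the forgetting term decays the weight toward zero
    return w_ij + alpha * y_j * x_i - phi * y_j * w_ij

w = hebbian_update(0.5, x_i=1.0, y_j=1.0)
print(round(w, 3))  # 0.5 + 0.1 - 0.005 = 0.595
```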
More interesting demo

- BrainyAliens
8. Fuzzy Neural Networks

- Weights are fuzzy sets
- Takagi-Sugeno fuzzy rules
9. Evolving Neural Networks

- Neural networks can be susceptible to local minima in the error surface.
- Evolutionary methods (genetic algorithms) can be used to determine the starting weights for a neural network, thus avoiding these kinds of problems.
Other Learning Methods

- Clustering
  - The K-Means
  - The Fuzzy C-Means
- Classification
  - Naïve Bayes classifier
  - Support vector machine
  - Decision tree
- Reinforcement learning
- ...

I will be back!