INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Massimo Poesio
LECTURE 15: Artificial Neural Networks

ARTIFICIAL NEURAL NETWORKS
• Analogy to biological neural systems, the most robust learning systems we know.
• Attempt to understand natural biological systems through computational modeling.
• Massive parallelism allows for computational efficiency.
• Help understand the “distributed” nature of neural representations (rather than “localist” representations) that allows robustness and graceful degradation.
• Intelligent behavior as an “emergent” property of a large number of simple units rather than of explicitly encoded symbolic rules and algorithms.

Real Neurons
• Cell structures
  – Cell body
  – Dendrites
  – Axon
  – Synaptic terminals

Neural Communication
• Electrical potential across the cell membrane exhibits spikes called action potentials.
• A spike originates in the cell body, travels down the axon, and causes synaptic terminals to release neurotransmitters.
• The chemical diffuses across the synapse to the dendrites of other neurons.
• Neurotransmitters can be excitatory or inhibitory.
• If the net input of neurotransmitters to a neuron from other neurons is excitatory and exceeds some threshold, it fires an action potential.

NEURAL SPEED CONSTRAINTS
• Neurons have a “switching time” on the order of a few milliseconds, compared to nanoseconds for current computing hardware.
• However, neural systems can perform complex cognitive tasks (vision, speech understanding) in tenths of a second.
• There is only time for about 100 serial steps in this time frame, compared to orders of magnitude more for current computers.
• Must be exploiting “massive parallelism.”
• The human brain has about 10^11 neurons with an average of 10^4 connections each.

Real Neural Learning
• Synapses change size and strength with experience.
• Hebbian learning: when two connected neurons fire at the same time, the strength of the synapse between them increases.
• “Neurons that fire together, wire together.”

ARTIFICIAL NEURAL NETWORKS
• Inputs x_i arrive through presynaptic connections.
• Synaptic efficacy is modeled using real-valued weights w_i.
• The response of the neuron is a nonlinear function f of its weighted inputs.

Inputs To Neurons
• Arise from other neurons or from outside the network.
• Nodes whose inputs arise outside the network are called input nodes and simply copy values.
• An input may excite or inhibit the response of the neuron to which it is applied, depending upon the weight of the connection.

Weights
• Represent synaptic efficacy and may be excitatory or inhibitory.
• Normally, positive weights are considered excitatory while negative weights are thought of as inhibitory.
• Learning is the process of modifying the weights in order to produce a network that performs some function.

ARTIFICIAL NEURAL NETWORK LEARNING
• Learning approach based on modeling adaptation in biological neural systems.
• Perceptron: initial algorithm for learning simple (single-layer) neural networks, developed in the 1950s.
• Backpropagation: more complex algorithm for learning multi-layer neural networks, developed in the 1980s.
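As an illustration, a minimal Python sketch of such an artificial unit: inputs x_i weighted by w_i, passed through a nonlinear function f (here a hard threshold). The function name, the threshold value, and the example inputs are assumptions made for illustration, not part of the slides.

```python
def neuron_output(inputs, weights, threshold):
    """Response of a single artificial unit: a nonlinear function
    (here a hard threshold) of the weighted sum of its inputs."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > threshold else 0

# With equal weights and a threshold just below their sum,
# the unit behaves like an AND gate over binary inputs.
print(neuron_output([1, 1], [0.5, 0.5], threshold=0.9))  # -> 1
print(neuron_output([1, 0], [0.5, 0.5], threshold=0.9))  # -> 0
```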
SINGLE LAYER NETWORKS
• Model the network as a graph with cells as nodes and synaptic connections as weighted edges w_ji from node i to node j.
• Model the net input to cell j as: net_j = Σ_i w_ji o_i
• Cell output is a threshold function: o_j = 1 if net_j > T_j, and o_j = 0 otherwise (T_j is the threshold for unit j).
[Figure: a unit receiving weighted inputs from several other units, and the step function of o_j against net_j with threshold T_j.]

Neural Computation
• McCulloch and Pitts (1943) showed how single-layer neurons could compute logical functions and be used to construct finite-state machines.
• Can be used to simulate logic gates:
  – AND: let all w_ji be T_j/n, where n is the number of inputs.
  – OR: let all w_ji be T_j.
  – NOT: let the threshold be 0, with a single input with a negative weight.
• Can build arbitrary logic circuits, sequential machines, and computers with such gates.
• Given negated inputs, a two-layer network can compute any boolean function using a two-level AND-OR network.

Perceptron Training
• Assume supervised training examples giving the desired output for a unit given a set of known input activations.
• Learn synaptic weights so that the unit produces the correct output for each example.
• The perceptron uses an iterative update algorithm to learn a correct set of weights.

Perceptron Learning Rule
• Update weights by: w_ji = w_ji + η (t_j - o_j) o_i
  where η is the “learning rate” and t_j is the teacher-specified output for unit j.
• Equivalent to the rules:
  – If the output is correct, do nothing.
  – If the output is high, lower the weights on active inputs.
  – If the output is low, increase the weights on active inputs.
• Also adjust the threshold to compensate: T_j = T_j - η (t_j - o_j)

Perceptron Learning Algorithm
• Iteratively update weights until convergence.
    Initialize weights to random values
    Until outputs of all training examples are correct
        For each training pair, E, do:
            Compute current output o_j for E given its inputs
            Compare current output to target value, t_j, for E
            Update synaptic weights and threshold using the learning rule
• Each execution of the outer loop is typically called an epoch.

Perceptron as a Linear Separator
• Since the perceptron uses a linear threshold function, it is searching for a linear separator that discriminates the classes:
  w_12 o_2 + w_13 o_3 > T_1, i.e. o_3 > -(w_12 / w_13) o_2 + T_1 / w_13
• Or a hyperplane in n-dimensional space.

Concept Perceptron Cannot Learn
• Cannot learn exclusive-or, or the parity function in general.
[Figure: in the (o_2, o_3) plane, the positive points (0,1) and (1,0) cannot be separated from the negative points (0,0) and (1,1) by a single line.]

Perceptron Limits
• A system obviously cannot learn concepts it cannot represent.
• Minsky and Papert (1969) wrote a book analyzing the perceptron and demonstrating many functions it could not learn.
• These results discouraged further research on neural nets, and symbolic AI became the dominant paradigm.

Multi-Layer Networks
• Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was thought to be difficult.
• A typical multi-layer network consists of an input, a hidden, and an output layer, each fully connected to the next, with activation feeding forward.
• The weights determine the function computed. Given an arbitrary number of hidden units, any boolean function can be computed with a single hidden layer.
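Before turning to multi-layer learning, here is a minimal Python sketch of the perceptron learning rule and threshold adjustment described above. The learning rate, the OR training set, and the epoch limit are illustrative assumptions.

```python
def train_perceptron(examples, n_inputs, eta=0.1, max_epochs=100):
    """Iterative perceptron learning: w_ji <- w_ji + eta*(t_j - o_j)*o_i,
    with the threshold adjusted in the opposite direction."""
    weights = [0.0] * n_inputs
    threshold = 0.0
    for _ in range(max_epochs):            # each pass over the data is an epoch
        all_correct = True
        for inputs, target in examples:
            net = sum(w * x for w, x in zip(weights, inputs))
            output = 1 if net > threshold else 0
            error = target - output
            if error != 0:
                all_correct = False
                for i, x in enumerate(inputs):
                    weights[i] += eta * error * x
                threshold -= eta * error
        if all_correct:                     # converged: no mistakes in a full epoch
            break
    return weights, threshold

# A linearly separable concept the perceptron can learn: boolean OR.
or_examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
print(train_perceptron(or_examples, n_inputs=2))
```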
Differentiable Output Function
• Need a non-linear output function to move beyond linear functions.
  – A multi-layer linear network is still linear.
• The standard solution is to use the non-linear, differentiable sigmoidal “logistic” function:
  o_j = 1 / (1 + e^-(net_j - T_j))
[Figure: the sigmoid curve of o_j against net_j, centred on the threshold T_j.]
• Can also use tanh or a Gaussian output function.

LEARNING WITH MULTI-LAYER NETWORKS
• The BACKPROPAGATION algorithm makes it possible to modify the weights in a network consisting of several layers by “propagating back” corrections.

Error Backpropagation
• First calculate the error of the output units and use this to change the top layer of weights.
  Current output: o_j = 0.2, correct output: t_j = 1.0
  Error: δ_j = o_j (1 - o_j)(t_j - o_j) = 0.2 (1 - 0.2)(1 - 0.2) = 0.128
  Update the weights into j: Δw_ji = η δ_j o_i

Error Backpropagation
• Next calculate the error for the hidden units based on the errors of the output units they feed into:
  δ_j = o_j (1 - o_j) Σ_k δ_k w_kj

Error Backpropagation
• Finally update the bottom layer of weights based on the errors calculated for the hidden units:
  δ_j = o_j (1 - o_j) Σ_k δ_k w_kj
  Update the weights into j: Δw_ji = η δ_j o_i

Backpropagation Learning Rule
• Each weight is changed by: Δw_ji = η δ_j o_i, where
  δ_j = o_j (1 - o_j)(t_j - o_j)        if j is an output unit
  δ_j = o_j (1 - o_j) Σ_k δ_k w_kj      if j is a hidden unit
  η is a constant called the learning rate, t_j is the correct teacher output for unit j, and δ_j is the error measure for unit j.

A Pseudo-Code Algorithm
• Randomly choose the initial weights
• While the error is too large
  – For each training pattern
    • Apply the inputs to the network
    • Calculate the output for every neuron from the input layer, through the hidden layer(s), to the output layer
    • Calculate the error at the outputs
    • Use the output error to compute error signals for pre-output layers
    • Use the error signals to compute weight adjustments
    • Apply the weight adjustments
  – Periodically evaluate the network performance

Apply Inputs From A Pattern
• Apply the value of each input parameter to each input node.
• Input nodes compute only the identity function.

Calculate Outputs For Each Neuron Based On The Pattern
• The output from neuron j for pattern p is O_pj, where
  O_pj = 1 / (1 + e^-net_j)   and   net_j = bias · W_bias + Σ_k O_pk W_jk
  k ranges over the input indices and W_jk is the weight on the connection from input k to neuron j.

Calculate The Error Signal For Each Output Neuron
• The output neuron error signal δ_pj is given by: δ_pj = (T_pj - O_pj) O_pj (1 - O_pj)
• T_pj is the target value of output neuron j for pattern p.
• O_pj is the actual output value of output neuron j for pattern p.

Calculate The Error Signal For Each Hidden Neuron
• The hidden neuron error signal δ_pj is given by: δ_pj = O_pj (1 - O_pj) Σ_k δ_pk W_kj
  where δ_pk is the error signal of a post-synaptic neuron k and W_kj is the weight of the connection from hidden neuron j to the post-synaptic neuron k.

Calculate And Apply Weight Adjustments
• Compute weight adjustments ΔW_ji by: ΔW_ji = η δ_pj O_pi
• Apply weight adjustments according to: W_ji = W_ji + ΔW_ji

Representational Power
• Boolean functions: any boolean function can be represented by a two-layer network with sufficient hidden units.
• Continuous functions: any bounded continuous function can be approximated with arbitrarily small error by a two-layer network.
  – Sigmoid functions can act as a set of basis functions for composing more complex functions, like sine waves in Fourier analysis.
• Arbitrary functions: any function can be approximated to arbitrary accuracy by a three-layer network.
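As an illustration, a minimal Python sketch of the backpropagation updates given above, for a network with one hidden layer of sigmoid units. The network sizes, learning rate, bias handling, weight initialization, and the XOR training set are assumptions for illustration.

```python
import math
import random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def train_backprop(patterns, n_in, n_hidden, n_out, eta=0.5, epochs=5000):
    """Weight row w[j] holds the weights into unit j, with a bias weight last."""
    rnd = lambda: random.uniform(-0.5, 0.5)
    w_hid = [[rnd() for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rnd() for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in patterns:
            # Forward pass (append 1.0 as the bias input to each layer).
            hid = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hid]
            out = [sigmoid(sum(w * v for w, v in zip(row, hid + [1.0]))) for row in w_out]
            # Output-layer error: delta_j = o_j * (1 - o_j) * (t_j - o_j)
            d_out = [o * (1 - o) * (tj - o) for o, tj in zip(out, t)]
            # Hidden-layer error: delta_j = o_j * (1 - o_j) * sum_k delta_k * w_kj
            d_hid = [h * (1 - h) * sum(d_out[k] * w_out[k][j] for k in range(n_out))
                     for j, h in enumerate(hid)]
            # Weight updates: delta_w_ji = eta * delta_j * o_i
            for j in range(n_out):
                for i, v in enumerate(hid + [1.0]):
                    w_out[j][i] += eta * d_out[j] * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hid[j][i] += eta * d_hid[j] * v
    return w_hid, w_out

# XOR, the function a single perceptron cannot learn; after training, the
# learned weights typically reproduce the 0, 1, 1, 0 targets closely.
xor = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
weights = train_backprop(xor, n_in=2, n_hidden=2, n_out=1)
```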
Sample Learned XOR Network
[Figure: a trained network with inputs X and Y, hidden units A and B, and output unit O, with the learned weights and thresholds shown on the connections.]
• Hidden unit A represents: ¬(X ∧ Y)
• Hidden unit B represents: ¬(X ∨ Y)
• Output O represents: A ∧ ¬B = ¬(X ∧ Y) ∧ (X ∨ Y) = X XOR Y

Hidden Unit Representations
• Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space.
• In many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors.
• However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

Successful Applications
• Text to Speech (NetTalk)
• Fraud detection
• Financial Applications
  – HNC (eventually bought by Fair Isaac)
• Chemical Plant Control
  – Pavilion Technologies
• Automated Vehicles
• Game Playing
  – Neurogammon
• Handwriting recognition

Analysis of ANN Algorithm
• Advantages:
  – Produces good results in complex domains
  – Suitable for both discrete and continuous data (and especially well suited to continuous domains)
  – Testing is very fast
• Disadvantages:
  – Training is relatively slow
  – Learned results are more difficult for users to interpret than learned rules (compared with decision trees)
  – Empirical Risk Minimization (ERM) makes an ANN try to minimize training error, which may lead to overfitting

READINGS
• Mitchell, Machine Learning, ch. 4

ACKNOWLEDGMENTS
• Several slides come from Ray Mooney’s course on AI