Neural Networks
Intelligent Systems – Lecture 13
Copyright 2008 STI INNSBRUCK www.sti-innsbruck.at
1) Motivation
2) Technical solution
   • Definitions
   • Explanations
   • Procedures
   • Illustrations by short examples
3) Illustration by a larger example
4) Extensions (work only sketched)
Motivation
• A main motivation behind neural networks is that symbolic rules do not reflect the reasoning processes actually performed by humans.
• Biological neural systems can capture highly parallel computations based on representations that are distributed over many neurons.
• They learn and generalize from training data, so there is no need to program everything explicitly.
• They are very noise tolerant – more robust than symbolic systems.
• In summary: neural networks can do whatever symbolic or logic systems can do, and more; in practice, however, this advantage is not always obvious.
Motivation
• Neural networks are strong in:
– Learning from a set of examples
– Optimizing solutions via constraints and cost functions
– Classification: grouping elements in classes
– Speech recognition, pattern matching
– Non-parametric statistical analysis and regressions
Introduction: What are Neural Networks?
• Neural networks are networks
of neurons as in the real
biological brain.
• Neurons are highly specialized
cells that transmit impulses within
animals to cause a change in a target
cell such as a muscle effector cell or glandular cell.
• The axon is the primary conduit through which the neuron transmits impulses to neurons downstream in the signal chain.
• Humans: 10^11 neurons of > 20 types, 10^14 synapses, 1 ms–10 ms cycle time
• Signals are noisy “spike trains” of electrical potential
Introduction: What are Neural Networks?
• What we refer to as neural networks in this course are thus mostly Artificial Neural Networks (ANNs).
• ANNs are approximations of biological neural networks; they are built of physical devices or simulated on computers.
• ANNs are parallel computational entities that consist of multiple simple processing units, connected in specific ways in order to perform the desired tasks.
• Remember: ANNs are computationally primitive approximations of real biological brains.
Comparison & Contrast
• Neural networks vs. classical symbolic computing:
  1. Sub-symbolic vs. symbolic
  2. Non-modular vs. modular
  3. Distributed representation vs. localist representation
  4. Bottom-up vs. top-down
  5. Parallel processing vs. sequential processing
• In reality, however, these distinctions are becoming increasingly less obvious.
McCulloch-Pitts "Unit" – Artificial Neurons
• Output is a "squashed" linear function of the inputs: ai ← g(ini) = g(Σj Wj,i aj)
• A gross oversimplification of real neurons, but its purpose is to
develop understanding of what networks of simple units can
do.
Activation Functions
(a) is a step function or threshold function
(b) is a sigmoid function 1/(1 + e^(-x))
• Changing the bias weight w0,i moves the threshold location
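As a concrete illustration (the weight and input values below are invented, not taken from the slides), a minimal NumPy sketch of a single McCulloch-Pitts style unit evaluated with both the step and the sigmoid activation function:

import numpy as np

def step(z):
    """Threshold activation: output 1 when the weighted input sum is >= 0."""
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    """Squashing activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(weights, inputs, g):
    """McCulloch-Pitts style unit: activation g applied to the weighted input sum.
    inputs[0] is the fixed bias input a0 = 1, so weights[0] plays the role of w0."""
    return g(np.dot(weights, inputs))

weights = np.array([-0.5, 1.0, 1.0])   # bias weight and two input weights (made up)
inputs = np.array([1.0, 0.3, 0.4])     # a0 = 1 (bias input), a1 = 0.3, a2 = 0.4

print(unit_output(weights, inputs, step))     # 1.0, since the weighted sum 0.2 >= 0
print(unit_output(weights, inputs, sigmoid))  # ~0.55, the squashed version of the same sum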
Perceptron
• McCulloch-Pitts neurons can be connected together in any
desired way to build an artificial neural network.
• A construct of one input layer of neurons that feed forward to one output layer of neurons is called a perceptron.
(Figure: a simple perceptron)
Example: Logical Functions
• McCulloch and Pitts: every Boolean function can be implemented by combining such artificial neurons.
Example: Finding Weights for AND Operation
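One standard (not the only) choice of weights that makes a step-threshold unit compute AND, assuming a fixed bias input of 1: a bias weight of -1.5 and input weights of 1.0 each. A short sketch checking all four input combinations:

import numpy as np

def step(z):
    return 1 if z >= 0 else 0

# With the bias input fixed to 1, the unit fires only when both inputs are 1
# (weighted sum = -1.5 + 1 + 1 = 0.5 >= 0).
w_and = np.array([-1.5, 1.0, 1.0])

for x1 in (0, 1):
    for x2 in (0, 1):
        out = step(np.dot(w_and, [1, x1, x2]))
        print(f"AND({x1}, {x2}) = {out}")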
Neural Network Structures
• Mathematically, artificial neural networks are represented as weighted directed graphs.
• In more practical terms, a neural network has activations flowing between processing units via one-way connections.
• Three common artificial neural network architectures are:
– Single-Layer Feed-Forward (Perceptron)
– Multi-Layer Feed-Forward
– Recurrent Neural Network
Single-Layer Feed-Forward
• A Single-Layer Feed-Forward structure is a simple perceptron, and thus has
– one input layer
– one output layer
– NO feed-back connections
• Feed-forward networks implement functions and have no internal state (this of course also holds for multi-layer perceptrons).
Multi-Layer Feed-Forward
• Multi-Layer Feed-Forward Structures have:
– one input layer
– one output layer
– one or MORE hidden layers of processing units
• The hidden layers lie between the input and the output layer and are thus hidden from the outside world: no input from the world, no output to the world.
Recurrent Network
• Recurrent networks have at least one feedback connection:
– They thus have directed cycles with delays: they have internal state (like flip-flops), can oscillate, etc.
– The response to an input depends on the initial state, which may depend on previous inputs – they can model short-term memory
– Hopfield networks have symmetric weights (Wij = Wji)
– Boltzmann machines use stochastic activation functions,
≈ MCMC in Bayes nets
Feed-forward example
• Feed-forward network = a parameterized family of nonlinear functions
• Adjusting the weights changes the function: this is where learning occurs!
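As an illustration (all weight values are invented), a 2-input, 2-hidden-unit, 1-output feed-forward network: its output is a nested nonlinear function of the inputs, and changing the weights changes which function the network computes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, W_hidden, w_out):
    """Output = g(w_out . g(W_hidden . x)): a nested nonlinear function of the
    inputs, parameterized entirely by the weights."""
    hidden = sigmoid(W_hidden @ x)
    return sigmoid(w_out @ hidden)

x = np.array([0.5, -1.0])

# Two different weight settings define two different functions of the same input
W1, v1 = np.array([[1.0, -2.0], [0.5, 1.0]]), np.array([1.0, -1.0])
W2, v2 = np.array([[-1.0, 0.0], [2.0, 2.0]]), np.array([0.5, 0.5])

print(feed_forward(x, W1, v1))
print(feed_forward(x, W2, v2))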
Single-layer feed-forward neural networks: perceptrons
• Output units all operate separately – no shared weights (so the study can be limited to single-output perceptrons!)
• Adjusting the weights moves the location, orientation, and steepness of the cliff
Expressiveness of perceptrons (1)
• Consider a perceptron with g = step function (Rosenblatt, 1957, 1960) – it can model a Boolean function, i.e., a classification
• Can represent AND, OR, NOT, majority (threshold n/2, input weights 1, 1, ...), etc. compactly, but not XOR
• Represents a linear separator in input space
• Threshold perceptrons can represent only linearly separable functions (i.e., functions for which such a separating hyperplane exists)
(Figure: linear separation in the x1–x2 input space)
• Limited expressivity, but there exists an algorithm that will fit a threshold perceptron to any linearly separable training set.
Perceptron learning (1)
• Learn by adjusting weights to reduce error on the training set
• The squared error for an example with input x and true output y is
  E = 1/2 Err^2 = 1/2 (y − hW(x))^2
• Perform optimization search by gradient descent:
  ∂E/∂Wj = Err × ∂Err/∂Wj = −Err × g'(in) × xj
• Simple weight update rule:
  Wj ← Wj + α × Err × g'(in) × xj
  – positive error ⇒ increase network output
    ⇒ increase weights on positive inputs, decrease on negative inputs
• The algorithm runs training examples through the net one at a time and adjusts the weights after each example. One such pass through all training examples is called an epoch.
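A minimal sketch of this learning rule for a single sigmoid unit, using an invented toy training set (logical OR, which is linearly separable); the learning rate and epoch count are arbitrary choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_perceptron(examples, alpha=0.5, epochs=2000):
    """Gradient-descent learning for a single sigmoid output unit.
    Each example is (x, y), where x[0] = 1 is the fixed bias input."""
    w = np.zeros(len(examples[0][0]))
    for _ in range(epochs):                  # one epoch = one pass over all examples
        for x, y in examples:
            x = np.asarray(x, dtype=float)
            out = sigmoid(np.dot(w, x))
            err = y - out                    # Err = y - h_w(x)
            # Wj <- Wj + alpha * Err * g'(in) * xj, with g'(in) = out * (1 - out)
            w += alpha * err * out * (1 - out) * x
    return w

data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]   # OR
w = train_perceptron(data)
print([int(sigmoid(np.dot(w, x)) > 0.5) for x, _ in data])   # expected: [0, 1, 1, 1]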
Perceptron learning (2)
• The perceptron learning rule converges to a consistent function for any linearly separable data set
• The perceptron learns the majority function easily, while decision-tree learning (DTL) is hopeless
• DTL learns the restaurant function easily, while the perceptron cannot even represent it
Multilayer perceptrons
• Layers are usually fully connected; the numbers of hidden units are typically chosen by hand
• Hidden layers enlarge the space of hypotheses that the network can represent
• Learning is done by the back-propagation algorithm: errors are back-propagated from the output layer to the hidden layers
Expressiveness of MLPs (1)
• 2 layers can represent all continuous functions
• 3 layers can represent all functions
• Combine two opposite-facing threshold functions to make a ridge
• Combine two perpendicular ridges to make a bump
• Add bumps of various sizes and locations to fit any surface
• The required number of hidden units grows exponentially with the number of inputs (2^n/n for all Boolean functions)
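A small NumPy sketch of this ridge-and-bump construction (the slopes and offsets are arbitrary illustrative choices): two opposite-facing sigmoids form a ridge along one input direction, and combining two perpendicular ridges through another sigmoid yields a localized bump.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Grid over a 2-D input space
x1, x2 = np.meshgrid(np.linspace(-3, 3, 121), np.linspace(-3, 3, 121))

# Two opposite-facing soft threshold functions along x1 form a ridge
ridge_x1 = sigmoid(10 * (x1 + 1)) - sigmoid(10 * (x1 - 1))
# A second ridge, perpendicular to the first, along x2
ridge_x2 = sigmoid(10 * (x2 + 1)) - sigmoid(10 * (x2 - 1))

# Combining the two perpendicular ridges (and thresholding once more)
# yields a bump that is high only near the origin
bump = sigmoid(10 * (ridge_x1 + ridge_x2 - 1.5))

print(bump[60, 60])   # centre of the grid: close to 1
print(bump[0, 0])     # far corner: close to 0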
Expressiveness of MLPs (2)
• The required number of hidden units grows exponentially with the number of inputs (2^n/n for all Boolean functions)
  – more hidden units – more bumps
  – a single, sufficiently large hidden layer – any continuous function of the inputs with arbitrary accuracy
  – two layers – discontinuous functions
  – for any particular network structure, it is harder to characterize exactly which functions can be represented and which ones cannot
Example MLP
• Build a hidden-layer network for the restaurant problem (see Lecture 7, Learning from observations)
  – 10 attributes describing each example
• Restaurant problem: decide whether to wait for a table at a restaurant, based on the following attributes:
  – Alternate: is there an alternative restaurant nearby?
  – Bar: is there a comfortable bar area to wait in?
  – Fri/Sat: is today Friday or Saturday?
  – Hungry: are we hungry?
  – Patrons: number of people in the restaurant (None, Some, Full)
  – Price: price range ($, $$, $$$)
  – Raining: is it raining outside?
  – Reservation: have we made a reservation?
  – Type: kind of restaurant (French, Italian, Thai, Burger)
  – WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
• A multilayer neural network suitable for the restaurant problem:
  – one hidden layer with four hidden units
  – 10 input units
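A minimal forward-pass sketch of exactly this structure (10 inputs, one hidden layer of 4 units, one output unit); the attribute encoding and the random weights are placeholders, since in practice the weights would be learned from training examples:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 10))   # input -> hidden weights (placeholders)
w_out = rng.normal(size=4)            # hidden -> output weights (placeholders)

def will_wait(attributes):
    """attributes: a vector of the 10 restaurant attributes, numerically encoded."""
    hidden = sigmoid(W_hidden @ attributes)
    return sigmoid(w_out @ hidden)

# A hypothetical encoding of one example (values invented purely for illustration)
example = np.array([1, 0, 1, 1, 0.5, 0.33, 0, 1, 0.25, 0.1])
print(will_wait(example))   # a value in (0, 1), read as "wait" if above 0.5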
Learning
• Learning is one of the most powerful features of neural
networks
• Neurons adapt the weights and connections between neurons (the axons in the biological neural network) so that the final output activation reflects the correct answer.
Back-propagation learning (1)
• Output layer: same as for the single-layer perceptron,
  Wj,i ← Wj,i + α × aj × Δi
• where Δi = Erri × g'(ini)
• Hidden layer: back-propagate the error from the output layer:
  Δj = g'(inj) × Σi Wj,i Δi
• Update rule for weights in the hidden layer:
  Wk,j ← Wk,j + α × ak × Δj
• (Most neuroscientists deny that back-propagation occurs in the brain)
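A compact sketch of these update rules for a network with one hidden layer and a single sigmoid output unit (the learning rate, weight values, and toy example are invented):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W_hidden, w_out, alpha=0.1):
    """One back-propagation update. Shapes: x (n_in,), W_hidden (n_hid, n_in), w_out (n_hid,)."""
    # Forward pass
    a_hidden = sigmoid(W_hidden @ x)
    a_out = sigmoid(w_out @ a_hidden)

    # Output layer: Delta_i = Err_i * g'(in_i), with g'(in) = a * (1 - a) for the sigmoid
    delta_out = (y - a_out) * a_out * (1 - a_out)
    # Hidden layer: Delta_j = g'(in_j) * sum_i W_j,i * Delta_i
    delta_hidden = a_hidden * (1 - a_hidden) * (w_out * delta_out)

    # Weight updates: W <- W + alpha * a * Delta
    w_out = w_out + alpha * a_hidden * delta_out
    W_hidden = W_hidden + alpha * np.outer(delta_hidden, x)
    return W_hidden, w_out

# Tiny made-up example: 3 inputs (including a bias input of 1) and 2 hidden units
rng = np.random.default_rng(1)
W_h, w_o = rng.normal(size=(2, 3)), rng.normal(size=2)
W_h, w_o = backprop_step(np.array([1.0, 0.2, -0.4]), 1.0, W_h, w_o)
print(w_o)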
Back-propagation learning (2)
• At each epoch, sum the gradient updates over all examples
• The training curve for 100 restaurant examples converges to a perfect fit to the training data
• Typical problems: slow convergence, local minima
Back-propagation learning (3)
• Learning curve for an MLP with 4 hidden units (as in our restaurant example)
• MLPs are quite good for complex pattern recognition tasks, but the resulting hypotheses cannot be understood easily
Kernel machines
• New family of learning methods: support vector machines (SVMs) or, more generally, kernel machines
• Combine the best of single-layer networks and multi-layer networks:
  – use an efficient training algorithm
  – represent complex, nonlinear functions
• Technique:
  – kernel machines find the optimal linear separator, i.e., the one that has the largest margin between it and the positive examples on one side and the negative examples on the other
  – this is a quadratic programming optimization problem
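To illustrate the underlying idea (with a hand-picked feature map and separator rather than an actual SVM training run): data such as XOR, which is not linearly separable in the original input space, can become linearly separable after a nonlinear mapping; kernel machines apply such mappings only implicitly via a kernel function.

import numpy as np

# XOR is not linearly separable in the original 2-D input space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def phi(x):
    """Illustrative explicit feature map: (x1, x2) -> (x1, x2, x1*x2)."""
    return np.array([x[0], x[1], x[0] * x[1]])

# In the mapped space a plane separates the classes; w and b were chosen by hand.
w, b = np.array([1.0, 1.0, -2.0]), -0.5
pred = [int(np.dot(w, phi(x)) + b > 0) for x in X]
print(pred, list(y))   # [0, 1, 1, 0] in both cases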
Application Example (1)
• Many applications of MLPs: speech, driving, handwriting, fraud
detection, etc.
• Example: handwritten digit recognition
– automated sorting of mail by postal code
– automated reading of checks
– data entry for hand-held computers
Application Example (2)
• Two examples of handwritten digits:
  – Top row: digits 0-9, easy to identify
  – Bottom row: digits 0-9, more difficult to identify
• Different learning approaches applied to the examples:
  – 3-nearest-neighbor classifier = 2.4% error
  – single-hidden-layer neural network: 400-300-10 unit MLP (i.e., 400 input units, 300 hidden units, 10 output units) = 1.6% error
  – specialized neural networks (called "LeNet"): MLPs with 3 hidden layers of 768, 192 and 30 units = 0.9% error
  – current best (kernel machines, vision algorithms) ≈ 0.6% error
• Humans are estimated to have an error rate of about 0.2% on this problem
  – not extensively tested though
  – experience from the United States Postal Service: human errors of 2.5%
Summary
• Most brains have lots of neurons; each neuron approximates a linear-threshold unit
• Perceptrons (one-layer networks) are insufficiently expressive
• Multi-layer networks are sufficiently expressive; can be trained
by gradient descent, i.e., error back-propagation
• Many applications: speech, driving, handwriting, etc.
• Engineering, cognitive modeling, and neural system modeling
subfields have largely diverged