CISC 4631 Data Mining
Lecture 11: Neural Networks

Biological Motivation
• Can we simulate the human learning process? Two schools:
 – modeling the biological learning process
 – obtaining highly effective algorithms, independent of whether these algorithms mirror biological processes (the focus of this course)
• Biological learning system (the brain)
 – a complex network of neurons
 – ANNs are loosely motivated by biological neural systems; however, many features of ANNs are inconsistent with biological systems

Neural Speed Constraints
• Neurons have a "switching time" on the order of a few milliseconds, compared to nanoseconds for current computing hardware.
• However, neural systems can perform complex cognitive tasks (vision, speech understanding) in tenths of a second.
• There is only time for about 100 serial steps in this time frame, compared to orders of magnitude more for current computers.
• The brain must be exploiting "massive parallelism."
• The human brain has about 10^11 neurons with an average of 10^4 connections each.

Artificial Neural Networks (ANN)
• An ANN is a network of simple units with real-valued inputs and outputs.
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically

Neural Network Learning
• Learning approach based on modeling adaptation in biological neural systems.
• Perceptron: the initial algorithm for learning simple (single-layer) neural networks, developed in the 1950s.
• Backpropagation: a more complex algorithm for learning multi-layer neural networks, developed in the 1980s.

Real Neurons
[figure: anatomy of a biological neuron]

How Does Our Brain Work?
• A neuron is connected to other neurons via its input and output links.
• Each incoming neuron has an activation value, and each connection has a weight associated with it.
• The neuron sums the incoming weighted values, and this sum is the input to an activation function.
• The output of the activation function is the output of the neuron.

Neural Communication
• The electrical potential across the cell membrane exhibits spikes called action potentials.
• A spike originates in the cell body, travels down the axon, and causes the synaptic terminals to release neurotransmitters.
• The chemical diffuses across the synapse to the dendrites of other neurons.
• Neurotransmitters can be excitatory or inhibitory.
• If the net input of neurotransmitters to a neuron from other neurons is excitatory and exceeds some threshold, the neuron fires an action potential.

Real Neural Learning
• To model the brain we need to model a neuron.
• Each neuron performs a simple computation:
 – It receives signals from its input links and uses these values to compute the activation level (or output) of the neuron.
 – This value is passed to other neurons via its output links.
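A minimal sketch, in Python, of the neuron computation just described: weighted inputs are summed and the sum is passed through an activation function (here a simple step threshold). The function name, weights, and threshold values are illustrative assumptions, not material from the lecture.

```python
def neuron_output(inputs, weights, threshold):
    """Weighted sum of the inputs followed by a step activation."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= threshold else 0

# Example: a two-input unit that behaves like logical AND.
print(neuron_output([1, 1], [0.6, 0.6], threshold=1.0))  # -> 1
print(neuron_output([1, 0], [0.6, 0.6], threshold=1.0))  # -> 0
```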
Prototypical ANN
• Units interconnected in layers
 – a directed, acyclic graph (DAG)
• Network structure is fixed
 – learning = weight adjustment
 – backpropagation algorithm

Appropriate Problems
• Instances: vectors of attributes
 – discrete or real values
• Target function
 – discrete, real, or vector valued
 – ANNs can handle classification & regression
• Noisy data
• Long training times are acceptable
• Fast evaluation is required
• The learned function need not be readable
 – It is almost impossible to interpret neural networks except for the simplest target functions.

Perceptrons
• The perceptron is a type of artificial neural network which can be seen as the simplest kind of feedforward neural network: a linear classifier.
• Introduced in the late 1950s.
• Perceptron convergence theorem (Rosenblatt, 1962):
 – A perceptron will learn to classify any linearly separable set of inputs.
• A perceptron is a network that is:
 – single-layer
 – feed-forward: data travels in only one direction
[figure: the XOR function admits no linear separation]

ALVINN
• ALVINN drives 70 mph on highways. [ALVINN video]

Artificial Neuron Model
• Model the network as a graph with cells as nodes and synaptic connections as weighted edges from node i to node j, with weight w_ji.
[figure: unit 1 receiving inputs from units 2–6 over weights w_12 ... w_16]
• Model the net input to cell j as: net_j = Σ_i w_ji o_i
• The cell output is: o_j = 0 if net_j < T_j, and o_j = 1 if net_j ≥ T_j (T_j is the threshold for unit j).
[figure: step function plotting o_j against net_j, with the step at T_j]

Perceptron: Artificial Neuron Model
• Model the network as a graph with cells as nodes and synaptic connections as weighted edges from node i to node j, with weight w_ji.
• The input value received by a neuron is calculated by summing the weighted input values from its input links, Σ_{i=0}^{n} w_i x_i, which is then passed through a threshold function.
• Vector notation: w · x

Different Threshold Functions
• Sign function: o(x) = 1 if w · x > 0, and −1 otherwise.
• Step function with threshold t: o(x) = 1 if w · x > t, and 0 otherwise.
• Further variants differ only in how the boundary case w · x = 0 is mapped.
• We must learn the weights w_1, ..., w_n.

Examples (step activation function)
AND:            OR:             NOT:
In1 In2 Out     In1 In2 Out     In  Out
 0   0   0       0   0   0       0   1
 0   1   0       0   1   1       1   0
 1   0   0       1   0   1
 1   1   1       1   1   1
• With an extra input x_0 = 1 and bias weight w_0 = −t, the unit outputs 1 when Σ_{i=0}^{n} w_i x_i > 0.

Neural Computation
• McCulloch and Pitts (1943) showed how such model neurons could compute logical functions and be used to construct finite-state machines.
• They can be used to simulate logic gates:
 – AND: let all w_ji be T_j / n, where n is the number of inputs.
 – OR: let all w_ji be T_j.
 – NOT: let the threshold be 0, with a single input having a negative weight.
• Arbitrary logic circuits, sequential machines, and computers can be built from such gates.

Perceptron Training
• Assume supervised training examples giving the desired output for a unit, given a set of known input activations.
• Goal: learn the weight vector (synaptic weights) that causes the perceptron to produce the correct +/−1 values.
• The perceptron uses an iterative update algorithm to learn a correct set of weights:
 – the perceptron training rule
 – the delta rule
• Both algorithms are guaranteed to converge to somewhat different acceptable hypotheses, under somewhat different conditions.

Perceptron Training Rule
• Update the weights by: w_i ← w_i + Δw_i, where Δw_i = η (t − o) x_i
 – η is the learning rate: a small value (e.g., 0.1), sometimes made to decay as the number of weight-tuning iterations increases.
 – t is the target output for the current training example.
 – o is the unit's output for the current training example.
• Equivalent to the rules:
 – If the output is correct, do nothing.
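A minimal sketch of the perceptron training rule just stated: w_i ← w_i + η (t − o) x_i, applied repeatedly over the training set until every example is classified correctly (which the convergence theorem guarantees will happen for linearly separable data). The function name, the ±1 target encoding, and the default parameter values are illustrative assumptions, not material from the lecture.

```python
def train_perceptron(examples, n_inputs, eta=0.1, max_epochs=100):
    """examples: list of (x, t) pairs, where x is a list of inputs and t is +1 or -1."""
    w = [0.0] * (n_inputs + 1)           # w[0] is the bias weight (paired with x_0 = 1)
    for _ in range(max_epochs):
        all_correct = True
        for x, t in examples:
            x = [1.0] + list(x)          # prepend the constant bias input
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if o != t:
                all_correct = False
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if all_correct:                  # a full epoch with no mistakes: converged
            break
    return w

# Example: learn logical OR, with targets encoded as +1 / -1.
data = [([0, 0], -1), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights = train_perceptron(data, n_inputs=2)
```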
 – If the output is high, lower the weights on active inputs.
 – If the output is low, increase the weights on active inputs.
• It can be proved that this rule converges:
 – if the training data is linearly separable, and
 – η is sufficiently small.

Perceptron Learning Algorithm
• Iteratively update the weights until convergence:

  Initialize weights to random values
  Until the outputs of all training examples are correct:
    For each training pair E:
      Compute the current output o_j for E given its inputs
      Compare the current output to the target value t_j for E
      Update the synaptic weights and threshold using the learning rule

• Each execution of the outer loop is typically called an epoch.

Delta Rule
• Works reasonably with data that is not linearly separable.
• Minimizes error.
• A gradient descent method:
 – the basis of the backpropagation method
 – the basis for methods working in multidimensional continuous spaces
 – A detailed discussion of this rule is beyond the scope of this course.

Perceptron as a Linear Separator
• Since the perceptron uses a linear threshold function, it searches for a linear separator that discriminates the classes:
  w_12 o_2 + w_13 o_3 > T_1, i.e., o_3 > −(w_12 / w_13) o_2 + T_1 / w_13
• In n-dimensional space this is a hyperplane.

A Concept the Perceptron Cannot Learn
• The perceptron cannot learn exclusive-or (XOR), or the parity function in general.
[figure: the four XOR points in the (o_2, o_3) plane; no single line separates the positives from the negatives]

General Structure of an ANN
[figure: a feed-forward network with inputs x_1 ... x_5, an input layer, a hidden layer, and an output layer producing y; each neuron i sums its weighted inputs w_i1 I_1 + w_i2 I_2 + w_i3 I_3 into S_i and emits O_i = g(S_i), where g is an activation function with threshold t]
• Perceptrons have no hidden layers; multilayer perceptrons may have many.

Learning Power of an ANN
• A perceptron is guaranteed to converge if the data is linearly separable.
 – It will learn a hyperplane that separates the classes.
 – The XOR function (page 250 of TSK) is not linearly separable.
• A multilayer ANN has no such guarantee of convergence, but it can learn functions that are not linearly separable.
• An ANN with a hidden layer can learn the XOR function by constructing two hyperplanes (see page 253).

Multilayer Network Example
[figure: a multilayer network and its decision regions]
• The decision surface is highly nonlinear.

Sigmoid Threshold Unit
• The sigmoid unit's output is a nonlinear function of its inputs, but it is also a differentiable function of its inputs.
• We can therefore derive gradient descent rules to train:
 – a single sigmoid unit
 – multilayer networks of sigmoid units → backpropagation

Multi-Layer Networks
• Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was long thought to be difficult to find.
• A typical multi-layer network consists of an input, a hidden, and an output layer, each fully connected to the next, with activation feeding forward.
[figure: input layer → hidden layer → output layer, with activation feeding forward]
• The weights determine the function computed.
• Given an arbitrary number of hidden units, any Boolean function can be computed with a single hidden layer.
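A minimal sketch of the feed-forward pass just described: activation flows from the inputs through one layer of sigmoid hidden units to the output units. The helper names, the 2-2-1 architecture, and the weight values are illustrative assumptions, not taken from the slides.

```python
import math

def sigmoid(s):
    """The differentiable squashing function used by sigmoid units."""
    return 1.0 / (1.0 + math.exp(-s))

def layer(inputs, weight_matrix):
    """Each row of weight_matrix is [bias, w_1, ..., w_n] for one unit in the layer."""
    x = [1.0] + list(inputs)                      # prepend the constant bias input
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in weight_matrix]

def forward(inputs, hidden_weights, output_weights):
    hidden = layer(inputs, hidden_weights)        # hidden-layer activations
    return layer(hidden, output_weights)          # output-layer activations

# Example: 2 inputs -> 2 hidden units -> 1 output unit.
hidden_w = [[-0.5, 1.0, 1.0], [1.5, -1.0, -1.0]]
output_w = [[-1.0, 1.0, 1.0]]
print(forward([0, 1], hidden_w, output_w))
```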
Logistic Function
[figure: a single unit whose inputs are the independent variables Age = 34, Gender = 1, and Stage = 4, with coefficients 0.5, 0.4, and 0.8, producing the prediction 0.6, interpreted as the "probability of being alive"]
• The unit's output is 1 / (1 + e^(−Σ_{i=0}^{n} w_i x_i)).

Neural Network Model
[figure: the same prediction task with a hidden layer between the independent variables (Age, Gender, Stage) and the dependent variable; each connection carries a weight, and the network outputs 0.6 as the "probability of being alive"]

Getting an Answer from a NN
[figures, over three slides: the same network traced first through one hidden unit, then the other, then with both combined, showing how the weighted signals produce the prediction 0.6]

Comments on the Training Algorithm
• Not guaranteed to converge to zero training error; it may converge to local optima or oscillate indefinitely.
• However, in practice it does converge to low error for many large networks on real data.
• Many epochs (thousands) may be required: hours or days of training for large networks.
• To avoid local-minima problems, run several trials starting with different random weights (random restarts):
 – take the result of the trial with the lowest training-set error, or
 – build a committee of the results from multiple trials (possibly weighting votes by training-set accuracy).

Hidden Unit Representations
• Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space; this is a key feature of ANNs.
• In many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors.
• However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

Expressive Power of ANNs
• Boolean functions: any Boolean function can be represented by a two-layer network with a sufficient number of hidden units.
• Continuous functions: any bounded continuous function can be approximated with arbitrarily small error by a two-layer network.
• Arbitrary functions: any function can be approximated to arbitrary accuracy by a three-layer network.

Sample Learned XOR Network
[figure: a learned network with inputs X and Y, hidden units A and B, and output O; the learned weight magnitudes include 7.38, 2.03, 5.57, 3.11, 6.96, 5.24, 3.6, 3.58, and 5.74]

  IN1 IN2 OUT
   0   0   0
   0   1   1
   1   0   1
   1   1   0

• Hidden unit A represents: ¬(X ∧ Y)
• Hidden unit B represents: ¬(X ∨ Y)
• Output O represents: A ∧ ¬B = ¬(X ∧ Y) ∧ (X ∨ Y) = X ⊕ Y
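A small hand-built network mirroring the decomposition on the slide above: hidden unit A computes ¬(X ∧ Y), hidden unit B computes ¬(X ∨ Y), and the output computes A ∧ ¬B, which equals XOR. The weights and thresholds below are illustrative choices that realize those functions with step units; they are not the learned weights shown in the slide's figure.

```python
def step(net):
    return 1 if net > 0 else 0

def xor_net(x, y):
    a = step(1.5 - x - y)        # A = NOT(X AND Y): fires unless both inputs are 1
    b = step(0.5 - x - y)        # B = NOT(X OR Y): fires only if both inputs are 0
    return step(a - b - 0.5)     # O = A AND NOT B  =  X XOR Y

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor_net(x, y))   # prints the XOR truth table
```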
Expressive Power of ANNs (continued)
• Universal function approximator: given enough hidden units, a network can approximate any continuous function f.
• Why not use millions of hidden units?
 – Efficiency (neural network training is slow)
 – Overfitting

Combating Overfitting in Neural Nets
• Many techniques exist. Two popular ones:
 – Early stopping: use "a lot" of hidden units, but just don't over-train.
 – Cross-validation: choose the "right" number of hidden units.

Determining the Best Number of Hidden Units
• Too few hidden units prevent the network from adequately fitting the data.
• Too many hidden units can result in over-fitting.
[figure: error versus number of hidden units; error on the training data keeps decreasing, while error on the test data eventually rises]
• Use internal cross-validation to empirically determine an optimal number of hidden units.

Over-Training Prevention
• Running too many epochs can also result in over-fitting.
[figure: error versus number of training epochs; error on the training data keeps decreasing, while error on the test data eventually rises]
• Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error (see the sketch at the end of this section).
• To avoid losing training data to validation:
 – Use internal 10-fold CV on the training set to compute the average number of epochs that maximizes generalization accuracy.
 – Train the final network on the complete training set for this many epochs.

Summary of Neural Networks
When are neural networks useful?
 – Instances are represented by attribute-value pairs, particularly when attributes are real-valued.
 – The target function is discrete-valued, real-valued, or vector-valued.
 – Training examples may contain errors.
 – Fast evaluation times are necessary.
When are they not?
 – Fast training times are necessary.
 – Understandability of the learned function is required.

Successful Applications
• Text to speech (NETtalk)
• Fraud detection
• Financial applications: HNC (eventually bought by Fair Isaac)
• Chemical plant control: Pavilion Technologies
• Automated vehicles
• Game playing: Neurogammon
• Handwriting recognition

Issues in Neural Nets
• Learning the proper network architecture:
 – Grow the network until it is able to fit the data (Cascade Correlation, Upstart).
 – Shrink a large network until it is unable to fit the data (Optimal Brain Damage).
• Recurrent networks use feedback and can learn finite-state machines with "backpropagation through time."
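A minimal sketch of the early-stopping idea referenced under "Over-Training Prevention": monitor error on a held-out validation set after every epoch and stop once it no longer improves. The train_one_epoch and error callables are placeholders for whatever training procedure (e.g., backpropagation) and error measure are used; they, along with the parameter names, are assumptions for illustration, not part of the lecture.

```python
def train_with_early_stopping(network, train_set, validation_set,
                              train_one_epoch, error,
                              max_epochs=1000, patience=5):
    """Train until validation error stops improving for `patience` consecutive epochs."""
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(network, train_set)          # one pass of weight updates
        val_error = error(network, validation_set)   # error on the held-out data
        if val_error < best_error:
            best_error = val_error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:    # validation error stopped improving
            break
    return network, best_error
```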