CS 416 Artificial Intelligence (University of Virginia, Department of Computer Science)
Lecture 22: Statistical Learning (Chapter 20.5)

Perceptrons
• Each input is binary and has an associated weight
• The inner product of the input vector and the weight vector is computed
• If this sum exceeds a threshold, the perceptron fires

Perceptrons are linear classifiers
Consider a two-input neuron:
• Two weights are "tuned" to fit the data
• The neuron fires when w1*x1 + w2*x2 exceeds the threshold
  – The decision boundary is therefore a line, just like the equation y = mx + b
http://www.compapp.dcu.ie/~humphrys/Notes/Neural/single.neural.html

Linearly separable
Single-layer perceptron networks can classify only linearly separable functions.
• Consider AND:
    x1  x2  x1 AND x2
     1   1      1
     0   1      0
     1   0      0
     0   0      0

Linearly separable - AND
[diagram: x1 and x2, weighted by w1 and w2, feed a summation Σ followed by a threshold θ(x·w); in the x1-x2 plane a single line separates the input with output 1 from the inputs with output 0]

Linearly separable - XOR
• Consider XOR:
    x1  x2  x1 XOR x2
     1   1      0
     0   1      1
     1   0      1
     0   0      0
[same diagram; in the x1-x2 plane no single line separates the inputs with output 1 from those with output 0]
• IMPOSSIBLE!

2nd Class Exercise
• Let x3 = ~x1 and x4 = ~x2
• Find w1, w2, w3, w4, and θ such that θ(x1*w1 + x2*w2 + x3*w3 + x4*w4) = x1 XOR x2
• Or prove that it can't be done

3rd Class Exercise
• Find w1, w2, and f() such that f(x1*w1 + x2*w2) = x1 XOR x2
• Or prove that it can't be done

Limitations of Perceptrons
• Minsky & Papert published Perceptrons (1969), stressing the limitations of perceptrons
• Single-layer perceptrons cannot solve problems that are not linearly separable (e.g., XOR)
• Most interesting problems are not linearly separable
• This killed funding for neural nets for 12-15 years

A brief aside about Marvin Minsky
• Attended the Bronx H.S. of Science
• Served in the U.S. Navy during WW II
• B.A. Harvard and Ph.D. Princeton
• MIT faculty since 1958
• First graphical head-mounted display (1963)
• Co-inventor of Logo (1968)
• Nearly killed during the making of 2001: A Space Odyssey (on which he advised), but survived to write the paper critical of neural networks
• Turing Award 1969
(From wikipedia.org)

Single-layer networks for classification
• Binary classification: a single output, with 0.5 as the dividing line
• n-ary classification: a single output with n-1 dividing lines, or
• n outputs, each with 0.5 as its dividing line

Recent History of Neural Nets
• 1969: Minsky & Papert "kill" neural nets
• 1974: Werbos describes backpropagation
• 1982: Hopfield reinvigorates neural nets
• 1986: Parallel Distributed Processing

Multi-layered Perceptrons
• Input layer, output layer, and "hidden" layers
• Eliminates some of the concerns of Minsky and Papert
• But the weight-modification rules are more complicated!

Why are modification rules more complicated?
(Consider a network where nodes 1 and 2 are inputs, nodes 3 and 4 are hidden, and node 5 is the output.)
• We can calculate the error of the output neuron by comparing it to the training data
• We could use the previous update rule to adjust W3,5 and W4,5 to correct that error
• But how should W1,3, W1,4, W2,3, and W2,4 be adjusted?

First consider error in single-layer neural networks
• Error is the sum of squared errors across the training data
• For one sample: E = 1/2 Err² = 1/2 (y − h_W(x))²
How can we minimize the error?
• Set the derivative equal to zero and solve for the weights
• How is the error affected by each of the weights in the weight vector?

Minimizing the error
What is the derivative?
• The gradient, ∇E
  – Composed of the partial derivatives ∂E/∂W_j, one per weight

Computing the partial
By the chain rule, with g( ) the activation function and in = Σ_j W_j x_j:
  ∂E/∂W_j = Err × ∂Err/∂W_j = −Err × g′(in) × x_j

Computing the partial
g′(in) = derivative of the activation function
       = g(in)(1 − g(in)) in the case of the sigmoid

Minimizing the error
Gradient descent: take repeated small steps against the gradient, i.e., for each weight
  W_j ← W_j + α × Err × g′(in) × x_j,  where α is the learning rate
(A minimal code sketch of this single-layer update appears below.)
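The following is a minimal sketch (not part of the lecture) of the single-layer update above, training one sigmoid unit on the AND function. All specifics here, including the learning rate, iteration count, random seed, and the use of an appended constant input as the threshold, are illustrative choices rather than values from the slides.

```python
import numpy as np

def g(z):
    """Sigmoid activation g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

# AND training data; a constant 1 is appended so the last weight plays the role of the threshold.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

rng = np.random.default_rng(0)
W = rng.normal(size=3)   # initial weights (arbitrary)
alpha = 0.5              # learning rate (arbitrary)

for _ in range(5000):
    for x, target in zip(X, y):
        out = g(np.dot(W, x))            # h_W(x) = g(in)
        err = target - out               # Err = y - h_W(x)
        # W_j <- W_j + alpha * Err * g'(in) * x_j, with g'(in) = g(in)(1 - g(in))
        W += alpha * err * out * (1.0 - out) * x

print(np.round(g(X @ W)))  # should print approximately [0, 0, 0, 1]: the unit has learned AND
```

Running the same loop with XOR targets instead of AND would never converge to a correct classifier, since no single linear unit can represent XOR.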
What changes in a multilayer network?
• The output is not a single value y; it is a vector
• We do not know the correct outputs for the hidden layers
• We will have to propagate errors backwards: back-propagation (backprop)

Backprop at the output layer
• The output-layer error is computed as in the single-layer case, and the weights are updated in the same fashion
• Let Err_i be the ith component of the error vector y − h_W
  – Let Δ_i = Err_i × g′(in_i)

Backprop in the hidden layer
• Each hidden node j is responsible for some fraction of the error Δ_i in each of the output nodes to which it connects
• Δ_i is divided among all hidden nodes that connect to output i, according to the strengths of those connections
• Error at hidden node j: Δ_j = g′(in_j) × Σ_i W_{j,i} Δ_i
• Correction: W_{k,j} ← W_{k,j} + α × a_k × Δ_j, where a_k is the activation of node k

Summary of backprop
1. Compute the Δ values for the output units using the observed error
2. Starting with the output layer, repeat the following for each layer until done:
   • Propagate the Δ values back to the previous layer
   • Update the weights between the two layers

4th Class Exercise
• Find w1, w2, w3, w4, w5, θ1, and θ2 such that the output is x1 XOR x2
• Or prove that it can't be done
(The network has two inputs, one hidden unit, and one output, with the inputs also connected directly to the output; this is the same topology used in the worked example that follows.)

Back-Propagation (xor)
• Initial weights are random
• The threshold is now sigmoidal (the activation function should have a derivative):
    f(x·w) = 1 / (1 + e^(−x·w))
• Initial weights: w1 = 0.90, w2 = −0.54, w3 = 0.21, w4 = −0.03, w5 = 0.78

Back-Propagation (xor)
• Input layer – two units
• Hidden layer – one unit
• Output layer – one unit
• The output is related to the input by a nested application of f:
    F(w, x) = f( w3·x1 + w5·x2 + w4·f(w1·x1 + w2·x2) )
  (w1, w2 feed the hidden unit; w3 and w5 connect the inputs directly to the output; w4 connects the hidden unit to the output)
• Performance is defined as
    P = −1/2 Σ_{(x,c) ∈ T} (F(w, x) − c)²
  over all samples (x, c) in the training set T

Back-Propagation (xor)
• Error at the last layer (hidden→output): δ_o = c − F(w, x)
• Error at the previous layer (input→hidden): δ_j = Σ_k w_{j→k} × o_k(1 − o_k) × δ_k
• Change in weight: Δw_{i→j} = α × Σ_{(x,c) ∈ T} ∂P/∂w_{i→j}
• where ∂P/∂w_{i→j} = o_i × o_j(1 − o_j) × δ_j

Back-Propagation (xor)
• 1st example: (0,0) → 0
• Input to the hidden unit is 0, and sigmoid(0) = 0.5
• Input to the output unit is (0.5)(−0.03) = −0.015
• sigmoid(−0.015) = 0.4963, so the error is δ_o = 0 − 0.4963 = −0.4963
• ∂P/∂w4 = (0.5)(0.4963)(1 − 0.4963)(−0.4963) = −0.0620
• This example's contribution to Δw4 is −0.0062 (the partial scaled by the learning rate, apparently α = 0.1)
• Why are we ignoring the other weight changes? Both inputs are 0 in this example, so every other partial derivative contains a factor of 0.

Back-Propagation (xor)
• 2nd example: (0,1) → 1
• Input to the hidden unit: i_h = −0.54, so o_h = sigmoid(−0.54) = 0.3682
• Input to the output unit: i_o = (0.3682)(−0.03) + 0.78 = 0.769, so o_o = sigmoid(0.769) = 0.6833
• δ_o = 1 − 0.6833 = 0.3167
• ∂P/∂w4 = (0.3682)(0.6833)(1 − 0.6833)(0.3167) = 0.0252
• ∂P/∂w5 = (1)(0.6833)(1 − 0.6833)(0.3167) = 0.0685
• δ_h = (−0.03)(0.6833)(1 − 0.6833)(0.3167) = −0.0021
• ∂P/∂w2 = (1)(0.3682)(1 − 0.3682)(−0.0021) = −0.0005
• and so on for the remaining weights

Back-Propagation (xor)
• Initial performance = −0.2696
• After 100 iterations: w = (0.913, −0.521, 0.036, −0.232, 0.288), performance = −0.2515
• After 100K iterations: w = (15.75, −7.671, 7.146, −7.149, 0.0022), performance = −0.1880
• After 1M iterations: w = (21.38, −10.49, 9.798, −9.798, 0.0002), performance = −0.1875
(A code sketch of this training procedure appears after the function-approximation slides below.)

Some general artificial neural network (ANN) info
• The entire network is a function: g(inputs) = outputs
  – These functions frequently contain sigmoids
  – These functions are frequently differentiable
  – These functions have coefficients (weights)
• Backpropagation networks are simply a way to tune the coefficients of a function so that it produces the desired outputs

Function approximation
Consider fitting a line to data:
• Coefficients: slope and y-intercept
• Training data: some sample points
• Use a least-squares fit
• This is what an ANN does
[plot: sample points and a fitted line in the x-y plane]

Function approximation
A function of two inputs…
• Fit a smooth curve to the available data
  – Quadratic
  – Cubic
  – nth-order
  – ANN!
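To connect the backprop equations with the XOR walkthrough above, here is a minimal sketch (not from the lecture) of the same five-weight network. The wiring of the weights is inferred from the worked numbers above, and the learning rate of 0.1 is an assumption; the slides scale the first example's gradient by 0.1 but never state the rate explicitly.

```python
import numpy as np

def f(z):
    """Sigmoid: f(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

# XOR training set: ((x1, x2), target)
T = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Initial weights from the slides: w1, w2 feed the hidden unit; w3 and w5
# connect the inputs directly to the output; w4 connects hidden -> output.
w1, w2, w3, w4, w5 = 0.90, -0.54, 0.21, -0.03, 0.78
alpha = 0.1  # learning rate (assumed)

def forward(x1, x2):
    oh = f(w1 * x1 + w2 * x2)                 # hidden-unit output
    oo = f(w3 * x1 + w4 * oh + w5 * x2)       # network output F(w, x)
    return oh, oo

def performance():
    """P = -1/2 * sum over T of (F(w, x) - c)^2."""
    return -0.5 * sum((forward(x1, x2)[1] - c) ** 2 for (x1, x2), c in T)

for _ in range(100_000):
    g1 = g2 = g3 = g4 = g5 = 0.0              # accumulate dP/dw over the whole training set
    for (x1, x2), c in T:
        oh, oo = forward(x1, x2)
        delta_o = c - oo                      # output-layer error
        common = oo * (1 - oo) * delta_o      # oo(1 - oo) * delta_o
        g3 += x1 * common
        g4 += oh * common
        g5 += x2 * common
        delta_h = w4 * common                 # error propagated back to the hidden unit
        g1 += x1 * oh * (1 - oh) * delta_h
        g2 += x2 * oh * (1 - oh) * delta_h
    # Gradient ascent on P (equivalently, descent on the squared error)
    w1 += alpha * g1; w2 += alpha * g2; w3 += alpha * g3
    w4 += alpha * g4; w5 += alpha * g5

print("P =", round(performance(), 4),
      "w =", [round(w, 3) for w in (w1, w2, w3, w4, w5)])
```

As the slides' own numbers suggest (performance plateaus near −0.19 rather than reaching 0), this tiny network without threshold terms appears unable to represent XOR exactly; the sketch illustrates the mechanics of the weight updates rather than a successful fit.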
Curve fitting
• A neural network should be able to reproduce the input/output pairs from the training data
• You'd also like it to be smooth (and well-behaved) in the voids between the training data
• There is a risk of overfitting the data (a small sketch illustrating this appears at the end of these notes)

When using ANNs
• Sometimes the output layer feeds back into the input layer – recurrent neural networks
• Backpropagation will tune the weights
• You determine the topology
  – Different topologies have different training outcomes (consider overfitting)
  – Sometimes a genetic algorithm is used to explore the space of neural network topologies

What is the purpose of neural networks?
To create an artificial intelligence?
• Although not an invalid purpose, many people in the AI community think neural networks do not provide anything that cannot be obtained through other techniques
  – It is hard to unravel the "intelligence" behind why an ANN works
To study how the human brain works?
• Ironically, those studying neural networks with this in mind are more likely to contribute to the previous purpose

Some Brain Facts
• Contains ~100,000,000,000 neurons
• The hippocampus CA3 region contains ~3,000,000 neurons
• Each neuron is connected to ~10,000 other neurons
• That is ~10^15 connections in total (10^11 neurons × 10^4 connections each)!
• Consumes ~20-30% of the body's energy
• Yet accounts for only about 2% of the body's mass
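To illustrate the overfitting risk discussed under Curve fitting above, here is a small sketch (not from the lecture; the data and model orders are invented for illustration) that fits polynomials of increasing order to a few noisy samples and compares the error on the training points with the error at held-out points in the voids between them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented example: noisy samples of a smooth underlying function.
def true_f(x):
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 8)
y_train = true_f(x_train) + rng.normal(scale=0.15, size=x_train.size)
x_test = np.linspace(0.03, 0.97, 50)          # held-out points in the "voids"
y_test = true_f(x_test)

for degree in (1, 3, 7):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, held-out MSE {test_err:.4f}")
```

The highest-order polynomial drives the training error to essentially zero yet typically does worse between the samples, which is the same behavior an over-parameterized neural network can show.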