CS 416
Artificial Intelligence
Lecture 22
Statistical Learning
Chapter 20.5
Perceptrons
• Each input is binary and has a weight associated with it
• The inner product of the inputs and weights is computed
• If this sum exceeds a threshold, the perceptron fires (see the sketch below)
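As a minimal sketch of this firing rule (the function name, weights, and threshold below are illustrative, not from the lecture):

```python
def perceptron_fires(inputs, weights, threshold):
    """Return 1 if the weighted sum of the (binary) inputs exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# Two binary inputs with made-up weights: 1*0.7 + 0*0.4 = 0.7 > 0.5, so it fires
print(perceptron_fires([1, 0], [0.7, 0.4], threshold=0.5))  # 1
```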
Perceptrons are linear classifiers
Consider a two-input neuron
• Two weights are “tuned” to fit the data
• The neuron fires when w1*x1 + w2*x2 exceeds the threshold
  – The boundary w1*x1 + w2*x2 = threshold is the equation of a line (like y = mx + b)
http://www.compapp.dcu.ie/~humphrys/Notes/Neural/single.neural.html
Linearly separable
These single-layer perceptron networks can classify linearly separable systems
• Consider a system like AND

  x1   x2   x1 AND x2
   1    1       1
   0    1       0
   1    0       0
   0    0       0

[Figure: the four AND input points plotted in the x1-x2 plane; a single line separates the true case from the false cases]
Linearly separable - AND
• Consider a system like AND

  x1   x2   x1 AND x2
   1    1       1
   0    1       0
   1    0       0
   0    0       0

[Diagram: x1 and x2, weighted by w1 and w2, feed a summation Σ and a threshold unit θ(x·w); the AND points can be separated by a single line]
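One weight assignment that realizes AND (w1 = w2 = 1 with threshold 1.5 is a standard choice, not taken from the slides) can be checked against the truth table above:

```python
def threshold_unit(x1, x2, w1, w2, theta):
    """Fire (return 1) exactly when w1*x1 + w2*x2 exceeds theta."""
    return 1 if w1 * x1 + w2 * x2 > theta else 0

# w1 = w2 = 1 and theta = 1.5 reproduce the AND column of the table
for x1, x2 in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    print(x1, x2, threshold_unit(x1, x2, w1=1, w2=1, theta=1.5))
```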
Linearly separable - XOR
• Consider a system like XOR

  x1   x2   x1 XOR x2
   1    1       0
   0    1       1
   1    0       1
   0    0       0

[Diagram: x1 and x2, weighted by w1 and w2, feed a summation Σ and a threshold unit θ(x·w); no single line separates the true points from the false points]
Linearly separable - XOR
IMPOSSIBLE!
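To make the impossibility concrete, a coarse brute-force search (a sketch only, not a proof; the grid and step size are arbitrary) finds no (w1, w2, θ) that reproduces the XOR column:

```python
import itertools

XOR_TABLE = [((1, 1), 0), ((0, 1), 1), ((1, 0), 1), ((0, 0), 0)]

def computes_xor(w1, w2, theta):
    """True if a single threshold unit matches XOR on all four input pairs."""
    return all((1 if w1 * x1 + w2 * x2 > theta else 0) == target
               for (x1, x2), target in XOR_TABLE)

# Search a coarse grid of weights and thresholds in [-3, 3]
grid = [i / 10 for i in range(-30, 31)]
solutions = [c for c in itertools.product(grid, repeat=3) if computes_xor(*c)]
print(solutions)  # [] -- consistent with XOR not being linearly separable
```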
2nd Class Exercise
• Let x3 = ~x1 and x4 = ~x2
• Find w1, w2, w3, w4, and theta such that
  theta(x1*w1 + x2*w2 + x3*w3 + x4*w4) = x1 XOR x2
• Or, prove that it can’t be done
3rd Class Exercise
• Find w1, w2, and f() such that f(x1*w1 + x2*w2) = x1 XOR x2
• Or, prove that it can’t be done
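For both class exercises, a small checker (the names and the sample candidate below are illustrative, not from the slides) lets you test a candidate unit against the XOR truth table before attempting a proof:

```python
def mismatches(unit, cases):
    """Return (inputs, expected, got) for every case the candidate unit gets wrong."""
    return [(x, c, unit(*x)) for x, c in cases if unit(*x) != c]

# 2nd exercise: inputs are (x1, x2, x3, x4) with x3 = ~x1 and x4 = ~x2
cases = [((x1, x2, 1 - x1, 1 - x2), x1 ^ x2) for x1 in (0, 1) for x2 in (0, 1)]

# A deliberately naive candidate -- substitute your own weights and theta
candidate = lambda x1, x2, x3, x4: 1 if 1*x1 + 1*x2 + 0*x3 + 0*x4 > 1.5 else 0
print(mismatches(candidate, cases))  # a non-empty list means the candidate fails
```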
Limitations of Perceptrons
• Minsky & Papert published “Perceptrons” (1969), stressing the limitations of perceptrons
• Single-layer perceptrons cannot solve problems that are linearly inseparable (e.g., XOR)
• Most interesting problems are linearly inseparable
• Killed funding for neural nets for 12-15 years
A brief aside about Marvin Minsky
• Attended Bronx H.S. of Science
• Served in U.S. Navy during WW II
• B.A. Harvard and Ph.D. Princeton
• MIT faculty since 1958
• First graphical head-mounted display (1963)
• Co-inventor of Logo (1968)
• Nearly killed during the making of 2001: A Space Odyssey, but survived to write the book critical of neural networks
• Turing Award 1970
From wikipedia.org
Single-layer networks for classification
• Single output with 0.5 as the dividing line for binary classification
• Single output with n-1 dividing lines for n-ary classification
• n outputs, each with a 0.5 dividing line, for n-ary classification (sketched below)
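As a rough sketch of the last option (the weights below are made up), each class gets its own sigmoid output and is predicted when that output clears 0.5:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def classify(x, weight_rows):
    """One sigmoid output per class; a class is predicted when its output exceeds 0.5."""
    outputs = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in weight_rows]
    return [i for i, o in enumerate(outputs) if o > 0.5], outputs

# Three classes over two inputs, with invented weight rows
predicted, outs = classify([1.0, 0.0], [[2.0, -1.0], [-1.0, 2.0], [-0.5, 0.5]])
print(predicted, [round(o, 3) for o in outs])  # [0] and the three output values
```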
Recent History of Neural Nets
• 1969 Minsky & Papert “kill” neural nets
• 1974 Werbos describes backpropagation
• 1982 Hopfield reinvigorates neural nets
• 1986 Parallel Distributed Processing
Multi-layered Perceptrons
• Input layer, output layer, and “hidden” layers
• Eliminates some concerns of Minsky and Papert
• Modification rules are more complicated!
Why are modification rules more complicated?
We can calculate the error of the output neuron by comparing to training data
• We could use the previous update rule to adjust W3,5 and W4,5 to correct that error
• But how do W1,3, W1,4, W2,3, and W2,4 adjust?
First consider error in single-layer neural networks
Sum of squared errors (across training data):
  E = 1/2 Σ (y - hW(x))²
For one sample:
  E = 1/2 Err² = 1/2 (y - hW(x))²
How can we minimize the error?
• Set the derivative equal to zero and solve for the weights
• How is that error affected by each of the weights in the weight vector?
Minimizing the error
What is the derivative?
• The gradient, ∇E
  – Composed of the partials ∂E/∂Wj
Computing the partial
By the Chain Rule:
  ∂E/∂Wj = Err × ∂Err/∂Wj = -Err × g′(in) × xj
  g( ) = the activation function
Computing the partial
  g′(in) = derivative of the activation function
         = g(1 - g) in the case of the sigmoid
Minimizing the error
Gradient descent:
  Wj ← Wj + α × Err × g′(in) × xj
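A minimal sketch of this gradient-descent update for a single sigmoid unit (the learning rate, epoch count, and the use of a constant 1 input as the threshold/bias are illustrative choices, not from the slides):

```python
from math import exp

def g(z):
    """Sigmoid activation; note g'(in) = g(in) * (1 - g(in))."""
    return 1.0 / (1.0 + exp(-z))

def train_single_unit(samples, n_inputs, alpha=0.5, epochs=2000):
    """Gradient descent on squared error: w_j <- w_j + alpha * Err * g'(in) * x_j."""
    w = [0.0] * n_inputs
    for _ in range(epochs):
        for x, y in samples:
            out = g(sum(wj * xj for wj, xj in zip(w, x)))
            grad = (y - out) * out * (1.0 - out)          # Err * g'(in)
            w = [wj + alpha * grad * xj for wj, xj in zip(w, x)]
    return w

# Learn AND; the constant 1.0 input plays the role of the threshold
data = [((x1, x2, 1.0), float(x1 and x2)) for x1 in (0, 1) for x2 in (0, 1)]
w = train_single_unit(data, n_inputs=3)
print([round(g(sum(wj * xj for wj, xj in zip(w, x))), 2) for x, _ in data])
# should approach [0, 0, 0, 1]
```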
What changes in multilayer?
Output is not one value, y
• Output is a vector
We do not know the correct outputs for the hidden layers
• We will have to propagate errors backwards
Back propagation (backprop)
Multilayer
Backprop at the output layer
Output layer error is computed as in the single-layer case and weights are updated in the same fashion
• Let Erri be the ith component of the error vector y – hW
  – Let Δi = Erri × g′(ini); the output-layer update is then Wj,i ← Wj,i + α × aj × Δi
Backprop in the hidden layer
Each hidden node is responsible for some fraction of the error Δi in each of the output nodes to which it is connected
• Δi is divided among all hidden nodes that connect to output i according to their strengths
Error at hidden node j: Δj = g′(inj) Σi Wj,i Δi
Backprop in the hidden layer
Error is: Δj = g′(inj) Σi Wj,i Δi
Correction is: Wk,j ← Wk,j + α × ak × Δj
Summary of backprop
1. Compute the Δ values for the output units using the observed error
2. Starting with the output layer, repeat the following for each layer until done:
   • Propagate the Δ values back to the previous layer
   • Update the weights between the two layers
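A sketch of that procedure for a single hidden layer and one output unit (the network size, learning rate, random initialization, and bias handling are illustrative; the Δ computations follow the definitions above):

```python
import random
from math import exp

def g(z):
    return 1.0 / (1.0 + exp(-z))

def forward(x, w_hidden, w_out):
    """Hidden activations (plus a bias unit) and the single network output."""
    h = [g(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden] + [1.0]
    return h, g(sum(w * hj for w, hj in zip(w_out, h)))

def backprop_step(x, y, w_hidden, w_out, alpha=0.5):
    """Steps 1 and 2 above: output Delta from the observed error, Deltas
    propagated back to the hidden layer, then updates for both weight layers."""
    h, out = forward(x, w_hidden, w_out)
    delta_out = (y - out) * out * (1.0 - out)              # error * g'(in)
    delta_hid = [hj * (1.0 - hj) * w_out[j] * delta_out    # unit j's share of the error
                 for j, hj in enumerate(h[:-1])]
    w_out = [w + alpha * delta_out * hj for w, hj in zip(w_out, h)]
    w_hidden = [[w + alpha * delta_hid[j] * xi for w, xi in zip(row, x)]
                for j, row in enumerate(w_hidden)]
    return w_hidden, w_out

# Train on XOR; a constant 1.0 appended to each input serves as the bias
random.seed(1)
data = [((0, 0, 1.0), 0), ((0, 1, 1.0), 1), ((1, 0, 1.0), 1), ((1, 1, 1.0), 0)]
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
w_out = [random.uniform(-1, 1) for _ in range(4)]
for _ in range(20000):
    for x, y in data:
        w_hidden, w_out = backprop_step(x, y, w_hidden, w_out)
print([round(forward(x, w_hidden, w_out)[1], 2) for x, _ in data])
# should approach [0, 1, 1, 0] if training converged
```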
4th Class Exercise
• Find w1, w2, w3, w4, w5, theta1, and theta2 such that the output is x1 XOR x2
• Or, prove that it can’t be done
Back-Propagation (xor)
• Initial weights are random
• The threshold is now sigmoidal (the function should have derivatives)
Initial weights:
  w1 = 0.90, w2 = -0.54
  w3 = 0.21, w4 = -0.03
  w5 = 0.78
Sigmoid: f(x·w) = 1 / (1 + e^(-x·w))
Back-Propagation (xor)
• Input layer – two units
• Hidden layer – one unit
• Output layer – one unit
• Output is related to input by
  F(w, x) = f( f(x·w)·w )
  (a sigmoid of a weighted sum that includes the hidden unit’s sigmoid output)
• Performance is defined as
  P = -1/2 Σ(x,c)∈T ( F(w, x) - c )²
  summed over all samples (x, c) in the training set T
Back-Propagation (xor)
• Error at the last layer (hidden→output) is defined as:
  δo = c - F(w, x)
• Error at the previous layer (input→hidden) is defined as:
  δj = Σk wj→k · ok(1 - ok) · δk
• Change in weight:
  Δwi→j = η Σ(x,c)∈T ∂P(x,c)/∂wi→j
• Where:
  ∂P(x,c)/∂wi→j = oi · oj(1 - oj) · δj
Back-Propagation (xor)
• (0,0) → 0 – 1st example
• Input to the hidden unit is 0; sigmoid(0) = 0.5
• Input to the output unit is (0.5)(-0.03) = -0.015
• sigmoid(-0.015) = 0.4963, so the error is -0.4963
• So, δo = -0.4963
• ∂P/∂w4 = (0.5)(0.4963)(1 - 0.4963)(-0.4963) = -0.0620
• This example’s contribution to Δw4 is –0.0062
Why are we ignoring the other weight changes?
Back-Propagation (xor)
(Using ∂P(x,c)/∂wi→j = oi·oj(1 - oj)·δj and δj = Σk wj→k·ok(1 - ok)·δk from the previous slide.)
• (0,1) → 1 – 2nd example
• ih = -0.54, so oh = sigmoid(-0.54) = 0.3682
• io = (0.3682)(-0.03) + 0.78 = 0.769, so oo = sigmoid(0.769) = 0.6833
• δo = 1 - 0.6833 = 0.3167
• ∂P/∂w4 = (0.3682)(0.6833)(1 - 0.6833)(0.3167) = 0.0252
• ∂P/∂w5 = (1)(0.6833)(1 - 0.6833)(0.3167) = 0.0685
• δh = (-0.03)(0.6833)(1 - 0.6833)(0.3167) = -0.0021
• ∂P/∂w2 = (1)(0.3682)(1 - 0.3682)(-0.0021) = -0.0005
• etc.
Back-Propagation (xor)
• Initial performance = -0.2696
• After 100 iterations: w = (0.913, -0.521, 0.036, -0.232, 0.288), performance = -0.2515
• After 100K iterations: w = (15.75, -7.671, 7.146, -7.149, 0.0022), performance = -0.1880
• After 1M iterations: w = (21.38, -10.49, 9.798, -9.798, 0.0002), performance = -0.1875
Some general artificial neural network (ANN) info
• The entire network is a function g( inputs ) = outputs
  – These functions frequently have sigmoids in them
  – These functions are frequently differentiable
  – These functions have coefficients (weights)
• Backpropagation networks are simply ways to tune the coefficients of a function so it produces desired output
Function approximation
Consider fitting a line to data
• Coefficients: slope and y-intercept
• Training data: some samples
• Use least-squares fit
[Figure: sample points in the x-y plane with a fitted line]
This is what an ANN does
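The line-fitting analogy in code, as a sketch using the standard least-squares formulas (the sample data here is invented):

```python
# Least-squares fit of y = m*x + b to some sample data (the data is made up)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - m * mean_x
print(round(m, 3), round(b, 3))  # the two "tuned" coefficients: slope and y-intercept
```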
Function approximation
A function of two inputs…
• Fit a smooth curve to the available data
  – Quadratic
  – Cubic
  – nth-order
  – ANN!
Curve fitting
• A neural network should be able to generate the input/output pairs from the training data
• You’d like for it to be smooth (and well-behaved) in the voids between the training data
• There are risks of overfitting the data
When using ANNs
• Sometimes the output layer feeds back into the input layer – recurrent neural networks
• The backpropagation will tune the weights
• You determine the topology
  – Different topologies have different training outcomes (consider overfitting)
  – Sometimes a genetic algorithm is used to explore the space of neural network topologies
What is the Purpose of NN?
To create an Artificial Intelligence?
• Although not an invalid purpose, many people in the AI community think neural networks do not provide anything that cannot be obtained through other techniques
  – It is hard to unravel the “intelligence” behind why the ANN works
To study how the human brain works?
• Ironically, those studying neural networks with this in mind are more likely to contribute to the previous purpose
Some Brain Facts
• Contains ~100,000,000,000 neurons
• Hippocampus CA3 region contains ~3,000,000 neurons
• Each neuron is connected to ~10,000 other neurons
• ~10^15 connections (10^11 neurons × 10^4 connections each)!
• Consumes ~20-30% of the body’s energy
• Contains about 2% of the body’s mass