Multilayer Networks
Artificial Neural Networks
[Figure: an artificial neural network with an input layer, a first hidden layer, a second hidden layer and an output layer; input signals enter on the left and output signals leave on the right.]
Commercial ANNs
• Commercial ANNs incorporate three and sometimes four layers, including one or two hidden layers. Each layer can contain from 10 to 1000 neurons. Experimental neural networks may have five or even six layers, including three or four hidden layers, and utilise millions of neurons.
Example
• Character recognition
Problems that are not linearly separable
• The XOR function is not linearly separable.
• Such problems can be solved using multilayer networks with the back-propagation training algorithm.
• There are hundreds of training algorithms for multilayer neural networks.
Multilayer neural networks
• A multilayer perceptron is a feedforward neural network with one or more hidden layers.
• The network consists of an input layer of source neurons, at least one middle or hidden layer of computational neurons, and an output layer of computational neurons.
• The input signals are propagated in a forward direction on a layer-by-layer basis.
Multilayer perceptron with two hidden layers
[Figure: input signals flow from the input layer through the first and second hidden layers to the output layer, which produces the output signals.]
What do the middle layers hide?
A hidden layer “hides” its desired output. Neurons
in the hidden layer cannot be observed through the
input/output behaviour of the network. There is no
obvious way to know what the desired output of the
hidden layer should be.
Back-propagation neural network
• Learning in a multilayer network proceeds the same way as for a perceptron.
• A training set of input patterns is presented to the network.
• The network computes its output pattern, and if there is an error (that is, a difference between the actual and desired output patterns), the weights are adjusted to reduce this error.
• In a back-propagation neural network, the learning algorithm has two phases.
• First, a training input pattern is presented to the network input layer. The network propagates the input pattern from layer to layer until the output pattern is generated by the output layer.
• If this pattern is different from the desired output, an error is calculated and then propagated backwards through the network from the output layer to the input layer. The weights are modified as the error is propagated.
Three-layer back-propagation neural network
[Figure: input signals x1, ..., xi, ..., xn feed the input layer; weights wij connect input neuron i to hidden neuron j and weights wjk connect hidden neuron j to output neuron k; the output layer produces y1, ..., yk, ..., yl; error signals propagate backwards from the output layer towards the input layer.]
The back-propagation training algorithm
Step 1: Initialisation
Set all the weights and threshold levels of the network to random numbers uniformly distributed inside a small range:
$\left( -\dfrac{2.4}{F_i},\; +\dfrac{2.4}{F_i} \right)$
where $F_i$ is the total number of inputs of neuron $i$ in the network. The weight initialisation is done on a neuron-by-neuron basis.
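As a minimal sketch of this initialisation rule in Python (NumPy); the layer sizes and the random seed below are chosen only for illustration and are not part of the slides:

```python
import numpy as np

def init_layer(n_inputs, n_neurons, rng):
    """Initialise one layer's weights and thresholds uniformly in (-2.4/Fi, +2.4/Fi),
    where Fi is the number of inputs feeding each neuron of this layer."""
    limit = 2.4 / n_inputs
    weights = rng.uniform(-limit, limit, size=(n_inputs, n_neurons))
    thresholds = rng.uniform(-limit, limit, size=n_neurons)
    return weights, thresholds

rng = np.random.default_rng(0)                 # illustrative seed
w_hidden, th_hidden = init_layer(2, 2, rng)    # 2 inputs  -> 2 hidden neurons
w_output, th_output = init_layer(2, 1, rng)    # 2 hidden neurons -> 1 output neuron
```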
Three-layer network
w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = -1.2, w45 = 1.1, θ3 = 0.8, θ4 = -0.1 and θ5 = 0.3
[Figure: three-layer network with inputs x1 and x2, hidden neurons 3 and 4, and output neuron 5 producing y5; weights w13, w14, w23, w24, w35, w45 connect the layers, and each threshold θ3, θ4, θ5 is drawn as a weight on a fixed input of -1.]
• The effect of the threshold applied to a neuron in the hidden or output layer is represented by its weight, θ, connected to a fixed input equal to -1.
• The initial weights and threshold levels are set randomly, e.g., as follows:
w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = -1.2, w45 = 1.1, θ3 = 0.8, θ4 = -0.1 and θ5 = 0.3.
Assuming the sigmoid activation function
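The sigmoid used throughout these slides is the logistic function Y = 1/(1 + e^(-X)), as seen in the worked example below; a one-line Python definition (the function name is ours):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid activation: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))
```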
Step 2: Activation
Activate the back-propagation neural network by applying inputs x1(p), x2(p), ..., xn(p) and desired outputs yd,1(p), yd,2(p), ..., yd,n(p).
(a) Calculate the actual outputs of the neurons in the hidden layer:
$y_j(p) = sigmoid\left[ \sum_{i=1}^{n} x_i(p) \cdot w_{ij}(p) - \theta_j \right]$
where n is the number of inputs of neuron j in the hidden layer, and sigmoid is the sigmoid activation function.
Step 2: Activation (continued)
(b) Calculate the actual outputs of the neurons in the output layer:
$y_k(p) = sigmoid\left[ \sum_{j=1}^{m} x_{jk}(p) \cdot w_{jk}(p) - \theta_k \right]$
where m is the number of inputs of neuron k in the output layer.
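A sketch of both activation steps in Python; the array names and shapes are assumptions layered on the initialisation sketch above, not part of the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w_hidden, th_hidden, w_output, th_output):
    """One forward pass; the thresholds are subtracted, matching the slide formulas."""
    y_hidden = sigmoid(x @ w_hidden - th_hidden)          # step 2(a): hidden layer
    y_output = sigmoid(y_hidden @ w_output - th_output)   # step 2(b): output layer
    return y_hidden, y_output
```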
Class Exercise
17
If the sigmoid activation function is used, the outputs of the hidden layer (for the training example x1 = x2 = 1 with desired output yd,5 = 0) are
$y_3 = sigmoid(x_1 w_{13} + x_2 w_{23} - \theta_3) = 1 / \left[ 1 + e^{-(1 \cdot 0.5 + 1 \cdot 0.4 - 1 \cdot 0.8)} \right] = 0.5250$
$y_4 = sigmoid(x_1 w_{14} + x_2 w_{24} - \theta_4) = 1 / \left[ 1 + e^{-(1 \cdot 0.9 + 1 \cdot 1.0 + 1 \cdot 0.1)} \right] = 0.8808$
and the actual output of neuron 5 in the output layer is
$y_5 = sigmoid(y_3 w_{35} + y_4 w_{45} - \theta_5) = 1 / \left[ 1 + e^{-(-0.5250 \cdot 1.2 + 0.8808 \cdot 1.1 - 1 \cdot 0.3)} \right] = 0.5097$
and the error is
$e = y_{d,5} - y_5 = 0 - 0.5097 = -0.5097$
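These numbers are easy to check; a few lines of Python reproduce 0.5250, 0.8808, 0.5097 and -0.5097 from the initial weights listed earlier:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x1, x2 = 1, 1                                  # training example with desired output 0
y3 = sigmoid(x1*0.5 + x2*0.4 - 0.8)            # 0.5250
y4 = sigmoid(x1*0.9 + x2*1.0 - (-0.1))         # 0.8808
y5 = sigmoid(y3*(-1.2) + y4*1.1 - 0.3)         # 0.5097
e  = 0 - y5                                    # -0.5097
print(round(y3, 4), round(y4, 4), round(y5, 4), round(e, 4))
```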
What learning law applies in a multilayer neural network?
$\Delta w_{jk}(p) = \alpha \cdot y_j(p) \cdot \delta_k(p)$
Step 3: Weight training (output layer)
Update the weights in the back-propagation network by propagating backward the errors associated with the output neurons.
(a) Calculate the error
$e_k(p) = y_{d,k}(p) - y_k(p)$
and then the error gradient for the neurons in the output layer:
$\delta_k(p) = y_k(p) \cdot \left[ 1 - y_k(p) \right] \cdot e_k(p)$
Then the weight corrections:
$\Delta w_{jk}(p) = \alpha \cdot y_j(p) \cdot \delta_k(p)$
Then the new weights at the output neurons:
$w_{jk}(p+1) = w_{jk}(p) + \Delta w_{jk}(p)$
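Step 3(a) maps directly to code; a sketch with illustrative names, where the threshold is treated as a weight on a fixed input of -1 as described earlier:

```python
import numpy as np

def output_layer_update(y_hidden, y_output, y_desired, w_output, th_output, alpha=0.1):
    """Delta rule for the output layer: error, gradient, corrections, new weights."""
    e = y_desired - y_output                      # e_k(p)
    delta_out = y_output * (1 - y_output) * e     # delta_k(p)
    dw = alpha * np.outer(y_hidden, delta_out)    # delta w_jk(p)
    dth = alpha * (-1) * delta_out                # delta theta_k(p)
    return w_output + dw, th_output + dth, delta_out
```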
Three-layer network for solving the Exclusive-OR operation
[Figure: the same three-layer network with inputs x1 and x2, hidden neurons 3 and 4, output neuron 5, weights w13, w14, w23, w24, w35, w45, thresholds θ3, θ4, θ5 on fixed inputs of -1, and output y5.]
• The error gradient for neuron 5 in the output layer:
$\delta_5 = y_5 (1 - y_5) \, e = 0.5097 \cdot (1 - 0.5097) \cdot (-0.5097) = -0.1274$
• Determine the weight corrections assuming that the learning rate parameter, α, is equal to 0.1:
$\Delta w_{35} = \alpha \cdot y_3 \cdot \delta_5 = 0.1 \cdot 0.5250 \cdot (-0.1274) = -0.0067$
$\Delta w_{45} = \alpha \cdot y_4 \cdot \delta_5 = 0.1 \cdot 0.8808 \cdot (-0.1274) = -0.0112$
$\Delta \theta_5 = \alpha \cdot (-1) \cdot \delta_5 = 0.1 \cdot (-1) \cdot (-0.1274) = 0.0127$
Apportioning error in the hidden layer
• Error is apportioned in proportion to the weights of the connecting arcs.
• A higher weight indicates a higher responsibility for the error.
Step 3: Weight training (hidden layer)
(b) Calculate the error gradient for the neurons in the hidden layer:
$\delta_j(p) = y_j(p) \cdot \left[ 1 - y_j(p) \right] \cdot \sum_{k=1}^{l} \delta_k(p) \cdot w_{jk}(p)$
Calculate the weight corrections:
$\Delta w_{ij}(p) = \alpha \cdot x_i(p) \cdot \delta_j(p)$
Update the weights at the hidden neurons:
$w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p)$
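The hidden-layer update, Step 3(b), in the same illustrative style; w_output is needed to back-propagate the output-layer gradients:

```python
import numpy as np

def hidden_layer_update(x, y_hidden, delta_out, w_hidden, th_hidden, w_output, alpha=0.1):
    """Propagate the output-layer gradients back through w_jk and update the hidden weights."""
    delta_hid = y_hidden * (1 - y_hidden) * (w_output @ delta_out)   # delta_j(p)
    dw = alpha * np.outer(x, delta_hid)                              # delta w_ij(p)
    dth = alpha * (-1) * delta_hid                                   # delta theta_j(p)
    return w_hidden + dw, th_hidden + dth
```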
• The error gradients for neurons 3 and 4 in the hidden layer:
$\delta_3 = y_3 (1 - y_3) \cdot \delta_5 \cdot w_{35} = 0.5250 \cdot (1 - 0.5250) \cdot (-0.1274) \cdot (-1.2) = 0.0381$
$\delta_4 = y_4 (1 - y_4) \cdot \delta_5 \cdot w_{45} = 0.8808 \cdot (1 - 0.8808) \cdot (-0.1274) \cdot 1.1 = -0.0147$
• Determine the weight corrections:
$\Delta w_{13} = \alpha \cdot x_1 \cdot \delta_3 = 0.1 \cdot 1 \cdot 0.0381 = 0.0038$
$\Delta w_{23} = \alpha \cdot x_2 \cdot \delta_3 = 0.1 \cdot 1 \cdot 0.0381 = 0.0038$
$\Delta \theta_3 = \alpha \cdot (-1) \cdot \delta_3 = 0.1 \cdot (-1) \cdot 0.0381 = -0.0038$
$\Delta w_{14} = \alpha \cdot x_1 \cdot \delta_4 = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015$
$\Delta w_{24} = \alpha \cdot x_2 \cdot \delta_4 = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015$
$\Delta \theta_4 = \alpha \cdot (-1) \cdot \delta_4 = 0.1 \cdot (-1) \cdot (-0.0147) = 0.0015$
• At last, we update all the weights and thresholds:
$w_{13} = w_{13} + \Delta w_{13} = 0.5 + 0.0038 = 0.5038$
$w_{14} = w_{14} + \Delta w_{14} = 0.9 - 0.0015 = 0.8985$
$w_{23} = w_{23} + \Delta w_{23} = 0.4 + 0.0038 = 0.4038$
$w_{24} = w_{24} + \Delta w_{24} = 1.0 - 0.0015 = 0.9985$
$w_{35} = w_{35} + \Delta w_{35} = -1.2 - 0.0067 = -1.2067$
$w_{45} = w_{45} + \Delta w_{45} = 1.1 - 0.0112 = 1.0888$
$\theta_3 = \theta_3 + \Delta \theta_3 = 0.8 - 0.0038 = 0.7962$
$\theta_4 = \theta_4 + \Delta \theta_4 = -0.1 + 0.0015 = -0.0985$
$\theta_5 = \theta_5 + \Delta \theta_5 = 0.3 + 0.0127 = 0.3127$
• The training process is repeated until the sum of squared errors is less than 0.001.
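The backward-pass arithmetic can be checked the same way; this short sketch starts from the rounded forward-pass values, so the final digit can occasionally differ from the slides:

```python
alpha = 0.1
y3, y4, y5 = 0.5250, 0.8808, 0.5097            # from the forward pass above
e = 0 - y5

delta5 = y5 * (1 - y5) * e                     # -0.1274
dw35, dw45 = alpha*y3*delta5, alpha*y4*delta5  # -0.0067, -0.0112
dth5 = alpha * (-1) * delta5                   #  0.0127

delta3 = y3 * (1 - y3) * delta5 * (-1.2)       #  0.0381
delta4 = y4 * (1 - y4) * delta5 * 1.1          # -0.0147
dw13 = dw23 = alpha * 1 * delta3               #  0.0038
dw14 = dw24 = alpha * 1 * delta4               # -0.0015

print(round(-1.2 + dw35, 4), round(1.1 + dw45, 4))   # -1.2067, 1.0888
print(round(0.5 + dw13, 4), round(0.9 + dw14, 4))    #  0.5038, 0.8985
```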
Step 4: Iteration
Increase iteration p by one, go back to Step 2 and repeat the process until the selected error criterion is satisfied.
As an example, we may consider the three-layer back-propagation network. Suppose that the network is required to perform the logical operation Exclusive-OR. Recall that a single-layer perceptron could not do this operation. Now we will apply the three-layer net.
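Putting Steps 1 to 4 together, here is a compact, hedged sketch of the whole algorithm applied to the Exclusive-OR example. The learning rate (0.1) and the stopping criterion (sum of squared errors below 0.001) come from the slides; the seed, layer sizes and epoch cap are illustrative, and the number of epochs needed will differ from the 224 shown on the following slide (an unlucky initialisation can even settle in a local minimum, in which case a different seed is needed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[1, 1], [0, 1], [1, 0], [0, 0]], dtype=float)
Yd = np.array([0, 1, 1, 0], dtype=float)

rng = np.random.default_rng(1)                 # illustrative seed
lim = 2.4 / 2                                  # Step 1: weights in (-2.4/Fi, +2.4/Fi)
w1 = rng.uniform(-lim, lim, (2, 2)); th1 = rng.uniform(-lim, lim, 2)
w2 = rng.uniform(-lim, lim, (2, 1)); th2 = rng.uniform(-lim, lim, 1)
alpha = 0.1

for epoch in range(100_000):                   # illustrative epoch cap
    sse = 0.0
    for x, yd in zip(X, Yd):
        yh = sigmoid(x @ w1 - th1)             # Step 2(a)
        yo = sigmoid(yh @ w2 - th2)            # Step 2(b)
        e = yd - yo
        sse += float(e @ e)
        d_out = yo * (1 - yo) * e              # Step 3(a)
        d_hid = yh * (1 - yh) * (w2 @ d_out)   # Step 3(b)
        w2 += alpha * np.outer(yh, d_out); th2 += alpha * (-1) * d_out
        w1 += alpha * np.outer(x, d_hid);  th1 += alpha * (-1) * d_hid
    if sse < 0.001:                            # Step 4: stop criterion
        break

print(epoch + 1, sse)
print(np.round(sigmoid(sigmoid(X @ w1 - th1) @ w2 - th2), 4))  # outputs for all four patterns
```

On a successful run the four outputs approach the values in the results table below (roughly 0.01 to 0.02 for the "0" targets and 0.98 to 0.99 for the "1" targets).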
Typical Learning Curve
[Figure: "Sum-Squared Network Error for 224 Epochs"; sum-squared error on a log scale (about 10^1 down to 10^-4) versus epoch (0 to beyond 200).]
Final results of three-layer network learning
Inputs (x1, x2)   Desired output yd   Actual output y5   Error e
1, 1              0                   0.0155             -0.0155
0, 1              1                   0.9849              0.0151
1, 0              1                   0.9849              0.0151
0, 0              0                   0.0175             -0.0175
Sum of squared errors: 0.0010
Network represented by McCulloch-Pitts model for solving the Exclusive-OR operation
[Figure: the same three-layer topology (inputs x1 and x2, hidden neurons 3 and 4, output neuron 5 producing y5), drawn with the integer-valued weights and thresholds of the McCulloch-Pitts model.]
Accelerated learning in multilayer neural networks
• A multilayer network learns much faster when the sigmoidal activation function is represented by a hyperbolic tangent:
$Y^{tanh} = \dfrac{2a}{1 + e^{-bX}} - a$
where a and b are constants.
Suitable values for a and b are: a = 1.716 and b = 0.667.
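As a small sketch, the bipolar (tanh-style) activation with the suggested constants:

```python
import numpy as np

def tanh_activation(x, a=1.716, b=0.667):
    """Bipolar sigmoid Y = 2a / (1 + exp(-b*x)) - a, with outputs in (-a, +a)."""
    return 2.0 * a / (1.0 + np.exp(-b * x)) - a
```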
• We can also accelerate training by including a momentum term in the delta rule:
$\Delta w_{jk}(p) = \beta \cdot \Delta w_{jk}(p-1) + \alpha \cdot y_j(p) \cdot \delta_k(p)$
where β is a positive number (0 ≤ β < 1) called the momentum constant. Typically, the momentum constant is set to 0.95.
This equation is called the generalised delta rule.
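A sketch of the generalised delta rule for one weight matrix; the previous correction has to be carried over between iterations, and the names are illustrative:

```python
import numpy as np

def momentum_update(w, prev_dw, y_prev_layer, delta, alpha=0.1, beta=0.95):
    """Generalised delta rule: dw(p) = beta*dw(p-1) + alpha*y_j(p)*delta_k(p)."""
    dw = beta * prev_dw + alpha * np.outer(y_prev_layer, delta)
    return w + dw, dw      # return the new weights and dw for the next iteration
```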
Learning with an adaptive learning rate
To accelerate the convergence and yet avoid the danger of instability, we can apply two heuristics:
• If the error is decreasing, the learning rate, α, should be increased.
• If the error is increasing or remaining constant, the learning rate, α, should be decreased.
• Adapting the learning rate requires some changes in the back-propagation algorithm.
• If the sum of squared errors at the current epoch exceeds the previous value by more than a predefined ratio (typically 1.04), the learning rate parameter is decreased (typically by multiplying by 0.7) and new weights and thresholds are calculated.
• If the error is less than the previous one, the learning rate is increased (typically by multiplying by 1.05).
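These two rules amount to a small adjustment applied once per epoch; a hedged sketch, where the 1.04 ratio and the 0.7/1.05 factors come from the slide and the rollback of the last weight update is indicated only through illustrative names:

```python
def adapt_learning_rate(alpha, sse, prev_sse, weights, prev_weights,
                        ratio=1.04, down=0.7, up=1.05):
    """If the error grew by more than `ratio`, shrink alpha and discard the last update;
    if the error fell, grow alpha. Returns (alpha, weights) for the next epoch."""
    if prev_sse is not None and sse > ratio * prev_sse:
        return alpha * down, prev_weights    # roll back and recalculate with smaller alpha
    if prev_sse is not None and sse < prev_sse:
        return alpha * up, weights
    return alpha, weights
```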
Typical Learning Curve
[Figure: repeated for comparison; "Sum-Squared Network Error for 224 Epochs", sum-squared error on a log scale (about 10^1 down to 10^-4) versus epoch (0 to beyond 200).]
Typical learning with adaptive learning rate
[Figure: two panels for "Training for 103 Epochs"; the upper panel plots sum-squared error on a log scale (about 10^2 down to 10^-4) versus epoch, and the lower panel plots the learning rate (roughly 0 to 1) versus epoch.]
Typical learning with adaptive learning rate plus momentum
[Figure: two panels for "Training for 85 Epochs"; the upper panel plots sum-squared error on a log scale (about 10^2 down to 10^-4) versus epoch, and the lower panel plots the learning rate (roughly 0 to 2.5) versus epoch.]