Learning rules.
The McCulloch-Pitts neuron, regardless of the problems of its technological implementation, provided the basis for a machine (a network of units) capable of storing information and performing logical and arithmetic operations on it. These correspond to the main functions of the brain: to store knowledge and to apply the stored knowledge to solve problems. The next step must be to realise another important function of the brain, which is to acquire new knowledge through experience, i.e. learning. For that reason, let us return to the abstract neuron developed in the First Lecture (Fig.1.5, reproduced here as Fig.3.1):
Fig.3.1. Abstract neuron.
The ideal free parameters to adjust, and thus to realise learning without changing the connection pattern, are the weights of connections $w_{ji}$. The problem is to define a learning rule, i.e. a rule for how to adjust the weights to obtain the desired output.
Much work in ANN focuses on learning rules that change the weights of the connections between neurons to better adapt a network to serve some overall function. As the problem was first formulated in the 1940s, when experimental neuroscience was limited, the classic definitions of these learning rules came not from biology but from the psychological studies of Donald Hebb and Frank Rosenblatt.
Hebb’s rule (1949).
Hebb proposed that a particular type of use-dependent modification of the connection strength of synapses might underlie learning in the nervous system. He introduced a neurophysiological postulate (far in advance of physiological evidence): “When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells, such that A’s efficiency, as one of the cells firing B, is increased.” Only recent explorations of the physiological properties of neuronal connections have revealed the existence of long-term potentiation, a sustained state of increased synaptic efficacy consequent to intense synaptic activity. The conditions that Hebb predicted would lead to changes in synaptic strength have now been found to cause long-term potentiation in some neurons of the hippocampus and other brain areas.
The simplest formalisation of Hebb’s rule is to increase the weight of a connection at every next instant as:

$$w_{ji}^{k+1} = w_{ji}^{k} + \Delta w_{ji}^{k},\qquad(3.1)$$

$$\Delta w_{ji}^{k} = C\, a_i^{k} X_j^{k},\qquad(3.2)$$

where $w_{ji}^{k}$ is the weight of the connection at instant $k$, $w_{ji}^{k+1}$ is the weight of the connection at the following instant $k+1$, $\Delta w_{ji}^{k}$ is the increment by which the weight is enlarged, $C$ is a positive coefficient which determines the learning rate, $a_i^{k}$ is the input value from the presynaptic neuron at instant $k$, and $X_j^{k}$ is the output of the postsynaptic neuron at the same instant $k$. Thus, the weight of a connection changes at the next instant only if both the input via this connection and the resulting output are simultaneously non-zero at the preceding instant. Equation (3.2) emphasises the correlational nature of a Hebbian synapse. It is sometimes referred to as the activity product rule.
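As an illustration (a minimal sketch in NumPy; the function name and the matrix layout are our own, not from the lecture), the update (3.1)-(3.2) for a whole layer of postsynaptic neurons can be written as an outer product of outputs and inputs:

    import numpy as np

    def hebbian_update(w, a, x, C=0.1):
        """One step of the activity product rule (3.1)-(3.2).

        w : (n_out, n_in) weight matrix at instant k
        a : (n_in,)  presynaptic input values a_i at instant k
        x : (n_out,) postsynaptic output values X_j at instant k
        C : positive learning-rate coefficient
        """
        # Delta w_ji = C * a_i * X_j -- nonzero only where both are nonzero
        return w + C * np.outer(x, a)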
Hebb’s original learning rule (3.2) referred exclusively to excitatory synapses, and has the unfortunate property that it can only increase synaptic weights, thus washing out the distinctive performance of different neurons in a network as the connections drive into saturation. However, when the Hebbian rule is augmented by a normalisation rule (e.g. keeping constant the total strength of the synapses upon a given neuron), it tends to “sharpen” a neuron’s predisposition “without a teacher”, causing its firing to become better and better correlated with a cluster of stimulus patterns. For this reason, Hebb’s rule plays an important role in studies of ANN algorithms much “younger” than the rule itself, such as unsupervised learning and self-organisation, which we shall consider later; a normalised variant is sketched below.
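One simple way to implement the normalisation mentioned above is to rescale each neuron’s incoming weights after the Hebbian increment (a sketch under our own assumptions: the “total strength” is taken as the sum of absolute incoming weights per neuron):

    import numpy as np

    def normalised_hebbian_update(w, a, x, C=0.1):
        # Plain Hebbian increment, as in (3.2)
        w_new = w + C * np.outer(x, a)
        # Rescale each row so the total incoming strength per neuron is
        # unchanged; this prevents the unbounded growth of weights
        # discussed in the text.
        old_strength = np.abs(w).sum(axis=1, keepdims=True)
        new_strength = np.abs(w_new).sum(axis=1, keepdims=True)
        return w_new * (old_strength / np.maximum(new_strength, 1e-12))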
Perceptron (1958).
Rosenblatt (1958) explicitly considered the problem of pattern recognition, where a “teacher” is essential. He introduced perceptrons, neural networks that change with “experience”, using an error-correction rule designed to change the weights of each response unit when it makes erroneous responses to stimuli presented to the network.
The simplest architecture of the perceptron comprises two layers of idealised “neurons”, which we shall call “units” of the network (Fig.3.2): one layer of input units and one layer of output units. The two layers are fully interconnected, i.e. every input unit is connected to every output unit. Thus, the processing elements of the perceptron are the abstract neurons (Fig.3.1); each receives the same input, comprising the entire input layer, but produces an individual output via individual connections and therefore individual connection weights.
Fig.3.2. A two-layer perceptron.
The total input to the output unit $j$ is

$$S_j = \sum_{i=0}^{n} w_{ji} a_i,\qquad(3.3)$$

where $a_i$ is the input value from the $i$-th input unit and $w_{ji}$ is the weight of the connection between the $i$-th input and the $j$-th output unit. The sum is taken over all $n+1$ input units connected to the output unit $j$.
Note the special bias input unit depicted at the top left of Fig.3.1: it behaves as an input which always produces the fixed value +1. Its connection to output unit $j$ has a connection weight $w_{j0}$ adjusted in the same way as all the other weights. The bias unit functions as a constant term in the sum (3.3).
The output value $X_j$ of the unit $j$ depends on whether the weighted sum is above or below a threshold value, using the threshold activation function

$$X_j = f(S_j) = \begin{cases} 1, & S_j \ge 0,\\ 0, & S_j < 0. \end{cases}\qquad(3.4)$$

The result of (3.4) is the output of the $j$-th perceptron processing element, corresponding to the $j$-th output unit, which becomes the $j$-th component of the output vector of the network. (Each unit in the output layer generates such an output value.)
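Putting (3.3) and (3.4) together, the forward pass of a single output unit can be sketched as follows (an illustration of the formulas only; the bias is handled by prepending a constant +1 to the inputs):

    import numpy as np

    def perceptron_output(w_j, a):
        """Output of unit j according to (3.3)-(3.4).

        w_j : (n+1,) weights of unit j, w_j[0] being the bias weight w_j0
        a   : (n,)   input values from the input layer
        """
        a_bias = np.concatenate(([1.0], a))  # bias unit always outputs +1
        S_j = np.dot(w_j, a_bias)            # weighted sum, equation (3.3)
        return 1 if S_j >= 0 else 0          # threshold function, equation (3.4)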
The weights $w_{ji}$ of the connections between the two layers of the perceptron are adjustable: they are changed according to some rule so that the network becomes more likely to produce the desired output in response to certain inputs. Such a rule is called the perceptron learning rule, and the process of adjusting the weights is called perceptron “learning” or “training”.
The perceptron is trained by using a training set: a set of input patterns repeatedly presented to the network during training, together with the corresponding desired responses, i.e. the target outputs. Fig.3.3 shows an example training set with binary vector patterns. The target outputs are shown along with each of the training input patterns.
During a training session every input pattern of the set is repeatedly presented to the perceptron. That means that the input layer assumes the values of the components of the training input vector, and every processing element computes an output from the weighted sum and threshold as in (3.3) and (3.4). The network outputs are then compared with the desired outputs specified in the training set; the difference is computed and then used to readjust the values of the connection weights. The readjustment is done in such a way that the network is, on the whole, more likely to give the desired response next time. The goal of the training session is to arrive at a single set of weights that allows each of the mappings in the training set to be performed successfully by the network. After training, the weights are not readjusted between presentations of input patterns. The weight updating may be done by a number of different rules. One of the simplest rules is that the “error”, i.e. the difference between the target response to training pattern $p$ and the instant output, is first computed for all output units:
$$e_{jp} = t_{jp} - X_{jp},\qquad(3.5)$$

where $t_{jp}$ is the target value for output unit $j$ after presentation of pattern $p$, and $X_{jp}$ is the output value produced by output unit $j$ after presentation of pattern $p$.
Fig.3.3. An example training set with desired responses.
For a perceptron that uses only 0/1 as inputs and outputs of its units, the result of (3.5) is zero if the target and output coincide, and +1 or -1 if they differ.
Following the simplest perceptron learning rule, a weight of connection is changed at every training step by a value $\Delta w_{ji}$:

$$w_{ji}^{new} = w_{ji}^{old} + \Delta w_{ji},$$

$$\Delta w_{ji} = C (t_j - X_j) a_i = C e_j a_i,\qquad(3.6)$$

where the error $e_j = t_j - X_j$ of the $j$-th output unit is

$$(t_j - X_j) = \begin{cases} +1 & \text{if } t_j = 1,\ X_j = 0,\\ \;\;\,0 & \text{if } t_j = X_j,\\ -1 & \text{if } t_j = 0,\ X_j = 1; \end{cases}$$

$a_i$ is the input value of the $i$-th input unit, and $C$ is a small constant called the “learning rate”.
According to the perceptron rule (3.6), a weight of connection changes only if the input value $a_i$ is not equal to 0 and the output value $X_j$ does not coincide with the target response $t_j$, i.e. the error of the output unit is not equal to 0. If that condition holds, the parameter $C$, the “learning rate”, is either added to the weight, if the target is higher than the output, or subtracted from the weight, if the target is lower than the output. The value of the learning rate $C$ is usually set below 1 and determines the amount of correction made in a single iteration. The overall learning time of the network is affected by $C$: it is slower for small values and faster for larger values of $C$.
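A single training step for the whole weight matrix, following (3.5) and (3.6), might look like this (a sketch using the same conventions as the snippet above; the function name is ours):

    import numpy as np

    def perceptron_train_step(W, a, t, C=0.1):
        """One presentation of pattern a with targets t, per (3.5)-(3.6).

        W : (n_out, n_in + 1) weight matrix, column 0 holding the bias weights
        a : (n_in,)  binary input pattern
        t : (n_out,) binary target outputs
        """
        a_bias = np.concatenate(([1.0], a))
        X = (W @ a_bias >= 0).astype(float)   # outputs via (3.3)-(3.4)
        e = t - X                             # errors, equation (3.5)
        W = W + C * np.outer(e, a_bias)       # weight change, equation (3.6)
        return W, e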
Fig.3.4. Example learning curve, showing the performance of a two-layer perceptron on the training set of Fig.3.3. $C = 0.1$ for this training session, and the initial weights of connections were random values between -0.05 and 0.05.
The network performance during a training session is measured by a root-mean-square (RMS) error value

$$\mathrm{RMS} = \sqrt{\frac{\sum_{p=1}^{n_p}\sum_{j=1}^{n_o} e_{jp}^2}{n_p n_o}} = \sqrt{\frac{\sum_{p=1}^{n_p}\sum_{j=1}^{n_o} (t_{jp} - X_{jp})^2}{n_p n_o}},\qquad(3.7)$$

where $n_p$ is the number of patterns in the training set and $n_o$ is the number of units in the output layer.
The first sum is taken over all patterns in the training set, and the second sum is taken over all output units. As the target output values $t_{jp}$ and the numbers $n_p$ and $n_o$ are constants, the RMS error is a function of the instant output values $X_{jp}$ only. In turn, the instant outputs $X_{jp}$ are functions of the input values $a_{ip}$, which are also constants, and of the weights of connections $w_{ji}$:

$$X_{jp} = f(S_{jp}) = f\!\left(\sum_{i=0}^{n_i} w_{ji} a_{ip}\right) = f(w_{ji}, a_{ip}).$$
Therefore, the performance of the network as measured by the RMS error is also a function of the connection weights only. The best performance of the network corresponds to the minimum of the RMS error, and we adjust the weights of connections in order to reach that minimum.
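The RMS measure (3.7) is straightforward to compute once the outputs for all patterns have been collected; a minimal sketch:

    import numpy as np

    def rms_error(T, X):
        """RMS error (3.7) over the whole training set.

        T : (n_p, n_o) target outputs for all patterns
        X : (n_p, n_o) actual outputs for all patterns
        """
        n_p, n_o = T.shape
        return np.sqrt(((T - X) ** 2).sum() / (n_p * n_o))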
Fig.3.4 shows a sample learning curve, the dependence of the RMS error on the number of iterations, for the training set of Fig.3.3. Initially, the adaptable weights are all set to small random values, and the network does not perform very well. As the weights are adjusted during training, performance improves; when the error rate is very low, training is stopped and the network is said to have converged. There is a theorem called the perceptron convergence theorem (Rosenblatt 1962) which states the following: if there exists a set of weights that allows the perceptron to respond correctly to all of the training patterns, then the perceptron’s learning method will find that set of weights, and it will do so in a finite number of iterations.
There is another possibility during a training session: eventually performance stops improving, and the RMS error does not get smaller regardless of the number of iterations. That means the network has failed to learn all of the answers correctly.
If the training is successful, the perceptron is said to have undergone supervised learning and is able to classify patterns similar to those of the training set.
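The whole training session described above, repeated presentations of the training set with a stopping criterion on the RMS error, can be summarised in a short loop (again a sketch; the initialisation follows the Fig.3.4 session, while the epoch limit and tolerance are illustrative choices of ours):

    import numpy as np

    def train_perceptron(patterns, targets, C=0.1, max_epochs=1000, tol=1e-3):
        """Train a two-layer perceptron until the RMS error (3.7) is small.

        patterns : (n_p, n_in)  binary input patterns
        targets  : (n_p, n_out) binary target outputs
        """
        rng = np.random.default_rng(0)
        n_p, n_in = patterns.shape
        n_out = targets.shape[1]
        # Small random initial weights, as in the Fig.3.4 training session
        W = rng.uniform(-0.05, 0.05, size=(n_out, n_in + 1))
        rms = np.inf
        for epoch in range(max_epochs):
            outputs = np.empty((n_p, n_out))
            for p in range(n_p):
                a_bias = np.concatenate(([1.0], patterns[p]))
                X = (W @ a_bias >= 0).astype(float)
                W += C * np.outer(targets[p] - X, a_bias)  # rule (3.6)
                outputs[p] = X
            rms = np.sqrt(((targets - outputs) ** 2).sum() / (n_p * n_out))
            if rms < tol:
                break  # the network is said to have converged
        return W, rms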
Linear separability.
Perceptrons became very successful at solving certain types of pattern recognition problems. This led to exaggerated claims about their applicability to a broad range of problems. Marvin Minsky and Seymour Papert in 1969 published a detailed analysis of the capabilities and limitations of perceptrons. The best-known example of a very simple limitation of the perceptron was the impossibility of modelling an XOR (exclusive OR) gate. This is called the XOR problem. To solve it, a model has to learn two weights so that the XOR gate of Table 3.1 can be reproduced:
a1   a2   XOR   OR   AND
 0    0    0     0     0
 0    1    1     1     0
 1    0    1     1     0
 1    1    0     1     1

Table 3.1. XOR, OR, and AND gates.
Dropping the neuron index $j$ for simplicity, the output is then defined by:

$$X(S) = \begin{cases} 1 & \text{if } w_1 a_1 + w_2 a_2 \ge \theta,\\ 0 & \text{if } w_1 a_1 + w_2 a_2 < \theta. \end{cases}$$

Hence, to match the target values, the following four inequalities have to be satisfied:

$$w_1(0) + w_2(0) < \theta \;\Rightarrow\; 0 < \theta,$$
$$w_1(0) + w_2(1) \ge \theta \;\Rightarrow\; w_2 \ge \theta,$$
$$w_1(1) + w_2(0) \ge \theta \;\Rightarrow\; w_1 \ge \theta,$$
$$w_1(1) + w_2(1) < \theta \;\Rightarrow\; w_1 + w_2 < \theta.$$

This is a contradiction, because it is impossible for each individual weight to be greater than or equal to a positive $\theta$ while their sum is less than $\theta$.
Unlike the XOR problem, the OR and AND gates are successfully solved by the perceptron. Taking the inputs as points on the coordinate plane, the input space, the linear separability of the three gates can be demonstrated as in Fig.3.5.
Fig.3.5. Linear separability of the XOR, OR, and AND gates.
Thus, perceptron-computable functions are those for which the points of the input space with function value (output) 0 can be separated from the points with function value 1 by a straight line. The property of linear separability or nonseparability extends to functions of $n$ arguments as well, with the separating line replaced by a hyperplane.
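As a quick numerical check of linear separability (an illustration reusing the train_perceptron sketch above): training on the Table 3.1 targets drives the RMS error to zero for OR and AND, while for XOR it never reaches zero, however many iterations are allowed:

    import numpy as np

    inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    gates = {
        "XOR": np.array([[0.], [1.], [1.], [0.]]),
        "OR":  np.array([[0.], [1.], [1.], [1.]]),
        "AND": np.array([[0.], [0.], [0.], [1.]]),
    }
    for name, t in gates.items():
        W, rms = train_perceptron(inputs, t, C=0.1, max_epochs=1000)
        print(name, "final RMS error:", round(rms, 3))
    # Expected: OR and AND converge to 0.0; XOR stays at a nonzero error.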