Learning rules. The McCulloch-Pitts neuron, regardless of its technological implementation problems, provided the basis for a machine (a network of units) capable of storing information and performing logical and arithmetical operations on it. These correspond to the main functions of the brain: to store knowledge and to apply the stored knowledge to solve problems. The next step must be to realise another important function of the brain, which is to acquire new knowledge through experience, i.e. learning. For that reason, let us come back to the abstract neuron developed in the First Lecture (Fig.1.5, Fig.3.1):

Fig.3.1. Abstract neuron.

The ideal free parameters to adjust, and so to realise learning without changing the connection pattern, are the weights of connections $w_{ji}$. The problem is to define a learning rule, i.e. a rule for how to adjust the weights to obtain the desired output. Much work in ANN focuses on learning rules that change the weights of connections between neurons to better adapt a network to serve some overall function. As the problem was first formulated in the 1940s, when experimental neuroscience was limited, the classic definitions of these learning rules came not from biology, but from the psychological studies of Donald Hebb and Frank Rosenblatt.

Hebb's rule (1949). Hebb proposed that a particular type of use-dependent modification of the connection strength of synapses might underlie learning in the nervous system.
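Hebb's idea of use-dependent strengthening, which the lecture formalises below as the activity-product rule, can be sketched numerically. The sketch below is an illustration only, assuming NumPy; the function name, learning rate, and example values are my own, not from the lecture.

```python
import numpy as np

def hebbian_step(w, a, x, C=0.1):
    """One Hebbian update: each weight grows by C * (presynaptic input)
    * (postsynaptic output), so only co-active connections change."""
    return w + C * a * x

w = np.zeros(3)                    # initial connection weights
a = np.array([1.0, 0.0, 1.0])      # presynaptic inputs at instant k
x = 1.0                            # postsynaptic output at instant k
w = hebbian_step(w, a, x)          # only w[0] and w[2] are strengthened
```

Note that if either the input or the output is zero, the corresponding weight is left unchanged, and that this rule can only ever increase the weights on active connections.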
He introduced a neurophysiological postulate (far in advance of physiological evidence): "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells, such that A's efficiency, as one of the cells firing B, is increased." Only recent explorations of the physiological properties of neuronal connections have revealed the existence of long-term potentiation, a sustained state of increased synaptic efficacy consequent to intense synaptic activity. The conditions that Hebb predicted would lead to changes in synaptic strength have now been found to cause long-term potentiation in some neurons of the hippocampus and other brain areas.

The simplest formalisation of Hebb's rule is to increase the weight of a connection at every next instant as follows:

$$w_{ji}^{k+1} = w_{ji}^{k} + \Delta w_{ji}^{k},$$   (3.1)

$$\Delta w_{ji}^{k} = C\, a_{i}^{k} X_{j}^{k},$$   (3.2)

where $w_{ji}^{k}$ is the weight of the connection at instant $k$, $w_{ji}^{k+1}$ is the weight of the connection at the following instant $k+1$, $\Delta w_{ji}^{k}$ is the increment by which the weight of the connection is enlarged, $C$ is a positive coefficient which determines the learning rate, $a_{i}^{k}$ is the input value from the presynaptic neuron at instant $k$, and $X_{j}^{k}$ is the output of the postsynaptic neuron at the same instant $k$. Thus, the weight of a connection changes at the next instant only if both the preceding input via this connection and the resulting output are simultaneously not equal to 0. Equation (3.2) emphasises the correlational nature of a Hebbian synapse; it is sometimes referred to as the activity product rule. Hebb's original learning rule (3.2) referred exclusively to excitatory synapses and has the unfortunate property that it can only increase synaptic weights, thus washing out the distinctive performance of different neurons in a network as the connections drive into saturation. However, when the Hebbian rule is augmented by a normalisation rule, e.g.
keeping constant the total strength of the synapses upon a given neuron, it tends to "sharpen" a neuron's predisposition "without a teacher", causing its firing to become better and better correlated with a cluster of stimulus patterns. For this reason, Hebb's rule plays an important role in studies of ANN algorithms much "younger" than the rule itself, such as unsupervised learning or self-organisation, which we shall consider later.

Perceptron (1958). Rosenblatt (1958) explicitly considered the problem of pattern recognition, where a "teacher" is essential. He introduced perceptrons: neural networks that change with "experience", using an error-correction rule designed to change the weights of each response unit when it makes an erroneous response to a stimulus presented to the network. The simplest architecture of the perceptron comprises two layers of idealised "neurons", which we shall call "units" of the network (Fig.3.2): one layer of input units and one layer of output units. The two layers are fully interconnected, i.e. every input unit is connected to every output unit. Thus, the processing elements of the perceptron are the abstract neurons (Fig.3.1); each has the same input, comprising the total input layer, but an individual output with individual connections and therefore different weights of connections.

Fig.3.2. A two-layer perceptron.

The total input to the output unit $j$ is

$$S_{j} = \sum_{i=0}^{n} w_{ji} a_{i},$$   (3.3)

where $a_{i}$ is the input value from the $i$-th input unit and $w_{ji}$ is the weight of the connection between the $i$-th input and $j$-th output units. The sum is taken over all $n+1$ input units connected to the output unit $j$. Note the special bias input unit depicted at the top left of Fig.3.2: it behaves as an input which always produces the fixed value +1. Its connection to output unit $j$ has a connection weight $w_{j0}$, adjusted in the same way as all the other weights. The bias unit functions as a constant term in the sum (3.3).
The output value $X_{j}$ of the unit $j$ depends on whether the weighted sum is above or below a threshold value, using the threshold activation function

$$X_{j} = f(S_{j}) = \begin{cases} 1, & S_{j} > 0 \\ 0, & S_{j} \le 0. \end{cases}$$   (3.4)

The result of (3.4) is the output of the $j$-th perceptron processing element, corresponding to the $j$-th output unit, and it becomes the $j$-th component of the output vector of the network. (Each unit in the output layer generates such an output value.) The weights $w_{ji}$ of the connections between the two layers of the perceptron are adjustable; that means they are changed according to some rule, so that the network is more likely to produce the desired output in response to certain inputs. Such a rule is called a perceptron learning rule, and the process of weight adjustment is called the process of perceptron "learning" or "training". The perceptron is trained by using a training set: a set of input patterns repeatedly presented to the network during training, together with the corresponding desired responses, i.e. target outputs. Fig.3.3 shows an example training set with binary vector patterns; the target outputs are shown along with each of the training input patterns. During a training session every input pattern of the set is repeatedly presented to the perceptron. That means that the input layer assumes the values of the components of the training input vector, and every processing element computes an output with the weighted sum and threshold as in (3.3) and (3.4). The network outputs are then compared to the desired outputs specified in the training set; the difference is computed and then used to readjust the values of the connection weights. The readjustment is done in such a way that the network is, on the whole, more likely to give the desired response next time. The goal of the training session is to arrive at a single set of weights that allows each of the mappings in the training set to be performed successfully by the network.
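Equations (3.3) and (3.4) together define the perceptron's forward pass. A minimal sketch in Python, assuming NumPy and treating the bias as input $a_0 = +1$ with its own weight $w_{j0}$, as in the text (the function name and example weights are illustrative assumptions):

```python
import numpy as np

def perceptron_output(W, a):
    """Forward pass of a two-layer perceptron.
    W : (n_outputs, n_inputs + 1) weight matrix, column 0 holding the
        bias weights w_j0; a : input vector of length n_inputs.
    Returns the 0/1 output vector X."""
    a_bias = np.concatenate(([1.0], a))   # prepend the fixed +1 bias input
    S = W @ a_bias                        # weighted sums, equation (3.3)
    return (S > 0).astype(int)            # threshold function, equation (3.4)

W = np.array([[-0.5, 1.0, 1.0]])                 # one output unit (example)
X = perceptron_output(W, np.array([1.0, 0.0]))   # S = -0.5 + 1.0 = 0.5 -> 1
```

The whole output layer is computed in one matrix-vector product; each row of `W` holds the incoming weights of one output unit.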
After training, the weights are not readjusted between the presentations of input patterns. The weight updating might be done by a number of different rules. One of the simplest rules is that the "error", i.e. the difference between the target response to training pattern $p$ and the instant output, is computed for all output units first:

$$e_{jp} = t_{jp} - X_{jp},$$   (3.5)

where $t_{jp}$ is the target value for output unit $j$ after presentation of pattern $p$, and $X_{jp}$ is the output value produced by output unit $j$ after presentation of pattern $p$.

Fig.3.3. Example training set, with desired responses.

For a perceptron that uses only 0/1 as inputs and outputs for its units, the result of (3.5) is zero if the target and output coincide, and +1 or -1 if they are different. Following the simplest perceptron learning rule, a weight of connection is changed at every training step by a value $\Delta w_{ji}$:

$$w_{ji}^{new} = w_{ji}^{old} + \Delta w_{ji}, \qquad \Delta w_{ji} = C (t_{j} - X_{j}) a_{i} = C e_{j} a_{i},$$   (3.6)

where the error $e_{j} = t_{j} - X_{j}$ of the $j$-th output unit is

$$t_{j} - X_{j} = \begin{cases} +1, & t_{j} = 1,\ X_{j} = 0 \\ 0, & t_{j} = X_{j} \\ -1, & t_{j} = 0,\ X_{j} = 1, \end{cases}$$

$a_{i}$ is the input value of the $i$-th input unit, and $C$ is a small constant called the "learning rate". According to the perceptron rule (3.6), a weight of connection changes only if the input value $a_{i}$ is not equal to 0 and the output value $X_{j}$ does not coincide with the target response $t_{j}$, i.e. the error of the output unit is not equal to 0. If that condition is true, then the learning rate $C$ is either added to the weight, if the target is higher than the output, or subtracted from the weight, if the target is lower than the output. The value of the learning rate $C$ is usually set below 1 and determines the amount of correction made in a single iteration. The overall learning time of the network is affected by $C$: slower for small values and faster for larger values of $C$.

Fig.3.4. Example learning curve, showing the performance of a two-layer perceptron on the training set of Fig.3.3.
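A single application of rule (3.6) can be sketched as follows. This is an illustration assuming NumPy and the bias-as-input convention; the function name is my own, not from the lecture.

```python
import numpy as np

def train_step(W, a, t, C=0.1):
    """One perceptron training step on pattern a with target vector t.
    Computes the output via (3.3)-(3.4), the error via (3.5), and
    applies Delta w_ji = C * e_j * a_i from rule (3.6)."""
    a_bias = np.concatenate(([1.0], a))
    X = ((W @ a_bias) > 0).astype(int)    # current 0/1 outputs
    e = t - X                             # errors in {-1, 0, +1}
    W = W + C * np.outer(e, a_bias)       # weights change only where e_j != 0
    return W, e

W = np.zeros((1, 3))
W, e = train_step(W, np.array([1.0, 1.0]), np.array([1]))
# here e == [1], so every weight on an active input rises by C = 0.1
```

The outer product $C\, e_j a_i$ applies the rule to all connections at once: a weight is untouched whenever its input $a_i$ is 0 or its unit's error $e_j$ is 0, exactly as the text describes.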
$C = 0.1$ for this training session, and the initial weights of connections were random values between -0.05 and 0.05.

The network performance during a training session is measured by a root-mean-square (RMS) error value

$$RMS = \sqrt{\frac{\sum_{p=1}^{n_{p}} \sum_{j=1}^{n_{o}} e_{jp}^{2}}{n_{p} n_{o}}} = \sqrt{\frac{\sum_{p=1}^{n_{p}} \sum_{j=1}^{n_{o}} (t_{jp} - X_{jp})^{2}}{n_{p} n_{o}}},$$   (3.7)

where $n_{p}$ is the number of patterns in the training set and $n_{o}$ is the number of units in the output layer. The first sum is taken over all patterns in the training set, and the second sum is taken over all output units. As the target output values $t_{jp}$ and the numbers $n_{p}$ and $n_{o}$ are constants, the RMS error is a function of the instant output values $X_{jp}$ only. In turn, the instant outputs $X_{jp}$ are functions of the input values $a_{ip}$, which also are constants, and of the weights of connections $w_{ji}$:

$$X_{jp} = f(S_{jp}) = f\!\left(\sum_{i=0}^{n_{i}} w_{ji} a_{ip}\right) = f(w_{ji}, a_{ip}).$$

Therefore, the performance of the network measured by the RMS error is also a function of the weights of connections only. The best performance of the network corresponds to the minimum of the RMS error, and we adjust the weights of connections in order to reach that minimum. Fig.3.4 shows a sample learning curve, i.e. the dependence of the RMS error on the number of iterations, for the training set of Fig.3.3. Initially, the adaptable weights are all set to small random values, and the network does not perform very well. As the weights are adjusted during training, performance improves; when the error rate becomes very low, training is stopped and the network is said to have converged. There is a theorem called the perceptron convergence theorem (Rosenblatt 1962) which states the following: if there exists a set of weights that allows the perceptron to respond correctly to all of the training patterns, then the perceptron's learning method will find such a set of weights, and it will do so in a finite number of iterations. There is, however, another possibility during a training session.
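Putting the pieces together, a complete training session with the RMS stopping criterion (3.7) might look like the sketch below. The AND gate is used as an illustrative, linearly separable training set (the lecture's training set of Fig.3.3 is not reproduced here), with $C = 0.1$ and initial weights drawn from (-0.05, 0.05) as in the text; all names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([[0], [0], [0], [1]], dtype=float)   # AND gate
W = rng.uniform(-0.05, 0.05, size=(1, 3))   # bias weight in column 0
C = 0.1

for iteration in range(100):
    errors = []
    for a, t in zip(patterns, targets):
        a_bias = np.concatenate(([1.0], a))
        X = ((W @ a_bias) > 0).astype(float)   # forward pass, (3.3)-(3.4)
        e = t - X                              # error, (3.5)
        W += C * np.outer(e, a_bias)           # weight update, (3.6)
        errors.append(e)
    rms = np.sqrt(np.mean(np.square(errors)))  # RMS error, (3.7)
    if rms == 0.0:
        break          # all patterns answered correctly: converged
```

Because the AND gate is linearly separable, the convergence theorem guarantees this loop stops with zero RMS error after a finite number of iterations; on a nonseparable set the RMS error would instead plateau above zero.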
Eventually performance stops improving, and the RMS error does not get smaller regardless of the number of iterations. That means the network has failed to learn all of the answers correctly. If the training is successful, the perceptron is said to have gone through supervised learning and is able to classify patterns similar to those of the training set.

Linear separability. Perceptrons proved very successful at solving certain types of pattern recognition problems. This led to exaggerated claims about their applicability to a broad range of problems. Marvin Minsky and Seymour Papert in 1969 published a detailed analysis of the capabilities and limitations of perceptrons. The best known example of a very simple limitation of the perceptron was the impossibility of modelling an XOR (exclusive OR) gate. This is called the XOR problem. To solve it, a model has to learn two weights so that the XOR gate from Table 3.1 can be reproduced:

a1   a2   XOR   OR   AND
0    0     0     0    0
0    1     1     1    0
1    0     1     1    0
1    1     0     1    1

Table 3.1. XOR, OR, and AND gates.

Dropping the neuron index $j$ for simplicity and writing $\theta$ for the threshold, the output is defined by:

$$X(S) = \begin{cases} 1, & w_{1} a_{1} + w_{2} a_{2} > \theta \\ 0, & w_{1} a_{1} + w_{2} a_{2} \le \theta. \end{cases}$$

Hence, to match the target values, the following four inequalities have to be satisfied:

$$w_{1}(0) + w_{2}(0) \le \theta \;\Rightarrow\; 0 \le \theta,$$
$$w_{1}(0) + w_{2}(1) > \theta \;\Rightarrow\; w_{2} > \theta,$$
$$w_{1}(1) + w_{2}(0) > \theta \;\Rightarrow\; w_{1} > \theta,$$
$$w_{1}(1) + w_{2}(1) \le \theta \;\Rightarrow\; w_{1} + w_{2} \le \theta.$$

This is a contradiction, because it is impossible for each individual weight to be greater than the non-negative threshold $\theta$ while their sum is less than or equal to $\theta$. Unlike the XOR problem, the OR and AND gates are successfully solved by the perceptron. Taking the inputs as points on the coordinate plane, the input space, the linear separability of the three gates can be demonstrated as in Fig.3.5.

Fig.3.5. Linear separability of the XOR, OR, and AND gates.

Thus, perceptron-computable functions are those for which the points of the input space with function value (output) 0 can be separated from the points with function value 1 using a line.
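The contradiction can also be checked mechanically. The brute-force sketch below (an illustration; the grid and names are my own, not from Minsky and Papert) searches a small grid of weights and threshold for each gate and finds solutions for OR and AND but none for XOR:

```python
def separable(gate):
    """Search a coarse grid of (w1, w2, theta) for a setting that
    reproduces the gate under the rule X = 1 iff w1*a1 + w2*a2 > theta."""
    grid = [k / 4 for k in range(-8, 9)]        # -2.0 .. 2.0 in steps of 0.25
    for w1 in grid:
        for w2 in grid:
            for theta in grid:
                if all((w1 * a1 + w2 * a2 > theta) == bool(y)
                       for (a1, a2), y in gate):
                    return True
    return False

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
OR  = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
# OR and AND are linearly separable; XOR is not
```

For OR a setting such as $w_1 = w_2 = 1$, $\theta = 0.5$ works, and for AND $w_1 = w_2 = 1$, $\theta = 1.5$; for XOR the four inequalities above rule out every choice, so the search is bound to fail.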
The property of linear separability or nonseparability extends to functions of $n$ arguments as well.