I SBAI - UNESP - Rio Claro/SP - Brasil
Computability and Learnability of Weightless Neural Networks
Teresa B. Ludermir*
Departamento de Informática
Universidade Federal de Pernambuco
Recife - PE - Brazil
CEP: 50.739
e-mail: tbl@di.ufpe.br
Abstract
The aim of this paper is to compare what can be computed with what can be learned by
Artificial Neural Networks (ANN). We will approach the learnability problem in ANN by fixing
a particular class of ANN, the weightless neural networks (WNN) ([4]), and restricting ourselves
to a particular learning paradigm: learning to recognise sentences of a formal language. We will
approach the computability problem by comparing the class of languages that can be recognised
by WNN with the classes of languages in the Chomsky hierarchy.
1 Introduction
Undoubtedly the most exciting feature of neural networks (artificial or natural) is their ability to
generalise (adapt) to new situations. After being trained on a number of examples of, say, a function,
they can often induce a completely defined function that interpolates and extrapolates from the
examples in a sensible way. But what is meant by sensible generalisation is often not clear and may
depend on the particular application. In many problems there are almost infinitely many
possible generalisations. We can then say that a neural network learns a function when the induced
completely defined function resembles the function from which the examples were taken. Again,
the notion of resemblance may be application dependent.
Despite the existence of sound results showing that some mathematically well defined classes of
functions can be implemented (computed)¹ in Artificial Neural Networks (ANN), the characterisation
of what can be learned is still open. Thus the importance of the study of learnability of ANN.
This paper is divided into six sections, this being the first. Section 2 presents the weightless
neural models and their main characteristics. Section 3 describes the cut-point node and its recognition method. Section 4 is concerned with computability of WNN. Section 5 is dedicated to the
study of the learnability problem. Section 6 has a summary of the paper.
2 Weightless Neural Models
Definition 1 A RAM Neural Network is an arrangement of a finite number of neurons in any
number of layers, in which the neurons are RAM (Random Access Memory) nodes. The RAM
node is represented in Figure 1.
*Partially supported by Nuffield Foundation and FACEPE (Research Foundation of the State of Pernambuco).
¹To implement a function on an artificial neural network is to present a complete specification of an architecture
(units, connections, etc.) in such a way that the network behaves in exact correspondence with the given function. For
example, (Cybenko [5]) shows how to implement any continuous function on the unit interval; (Hecht-Nielsen [10])
for any L2 function; and (Ludermir [15]) for the characteristic function of any set recognisable by a probabilistic
automaton.
Figure 1: RAM node.
The input may represent external input or the output of neurons from another layer or a
feedback input. The data out may be 0's or 1's. The set of connections is fixed and there are no
weights in such nets. Instead the function performed by the neuron is determined by the contents
of the RAM - its output is the value accessed by the activated memory location. There are 2^(2^N)
different functions which can be performed on N address lines and these correspond exactly to the
2^(2^N) states that the RAM can be in; that is, a single RAM can compute any function of its inputs.
Seeing a RAM node as a look-up table (truth table), the output of the RAM node is described by
equation 1 below:
r = 0    if C[p] = 0
r = 1    if C[p] = 1          (1)
where C[p] is the content of the address position associated with the input pattern p.
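As an illustration, the look-up-table behaviour described by equation 1 can be sketched in a few lines of Python; the class and its interface are illustrative and not part of the original paper.

class RAMNode:
    """Sketch of a RAM node: a look-up table over the 2^N patterns of its N address lines."""

    def __init__(self, n_inputs):
        self.memory = [0] * (2 ** n_inputs)   # one stored bit per addressable location

    def _address(self, pattern):
        # The binary input pattern p is read as an address into the memory.
        return int("".join(str(bit) for bit in pattern), 2)

    def write(self, pattern, bit):
        # Learning in a RAM node is simply writing into the addressed location.
        self.memory[self._address(pattern)] = bit

    def output(self, pattern):
        # Equation 1: r = C[p], the content of the addressed location.
        return self.memory[self._address(pattern)]


# Example: store the AND function of two inputs.
node = RAMNode(2)
node.write((1, 1), 1)
print(node.output((1, 1)), node.output((0, 1)))   # prints: 1 0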
The introduction of a probabilistic element into the weightless node was proposed by Aleksander
(Aleksander [3]) after attempting a rapprochement between Boltzmann machines and weightless
node networks (Aleksander [2]). He called the node with this probabilistic element a probabilistic
logic node (PLN). The main feature of the PLN model is the unknown state, u, which responds
with a randomly generated output for inputs on which it has not been trained. Below is given a
definition of a PLN node.
Definition 2 A PLN Neural Network is an arrangement of a finite number of neurons in any
number of layers, in which the neurons are PLN (Probabilistic Logic Node) nodes.
A PLN node differs from a RAM node in the sense that a 2-bit number (rather than a single bit)
is now stored at the addressed memory location. The content of this location is turned into the
probability of firing (i.e. generating a 1) at the overall output of the node. In other words, a PLN
consists of a RAM node augmented with a probabilistic output generator. As in a simple RAM
node, the I binary inputs to a node form an address into one of the 2^I RAM locations. Simple
RAM nodes then output the stored value directly. In the PLN, the value stored at this address is
passed through the output function which converts it into a binary node output. Thus each stored
value may be interpreted as affecting the probability of outputting a 1 for a given pattern of node
inputs.
The contents of the nodes are 0's, 1's and u's, where u means that the node produces 0 or 1 as
output with the same probability. The output of the PLN node is described by equation 2 below:
r = 0              if C[p] = 0
r = 1              if C[p] = 1          (2)
r = random(0,1)    if C[p] = u
where C[p] is the content of the address position associated with the input pattern p and random(0,1)
is a random function that generates zeros and ones with the same probability.
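To make equation 2 concrete, a PLN node can be sketched as a RAM whose locations hold 0, 1 or u; the Python below is an illustrative sketch, not code from the paper.

import random

class PLNNode:
    """Sketch of a PLN node: each location stores 0, 1 or the unknown value u (equation 2)."""

    U = "u"

    def __init__(self, n_inputs):
        # All locations start at u, so an untrained node answers at random.
        self.memory = [self.U] * (2 ** n_inputs)

    def _address(self, pattern):
        return int("".join(str(bit) for bit in pattern), 2)

    def output(self, pattern):
        stored = self.memory[self._address(pattern)]
        if stored == self.U:
            return random.randint(0, 1)   # random(0,1): 0 or 1 with the same probability
        return stored


node = PLNNode(2)
node.memory[3] = 1              # location addressed by the input pattern (1, 1)
print(node.output((1, 1)))      # deterministic: 1
print(node.output((0, 0)))      # untrained location: random 0 or 1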
The Multiple-valued Probabilistic Logic Neuron (MPLN) (Myers [16]) is simply an extension of
the PLN. A MPLN differs from a PLN node in the sense that a b-bit number (rather than two bits)
is now stored at the addressed memory location. The content of this location is also the probability
of firing at the overall output of the node. Say that b is 3; then the numbers 0 to 7 can be stored
in each location of the MPLN. One way of regarding the actual number stored may be as a direct
representation of the firing probability by treating the number as a fraction of 7. So a stored 2
would cause the output to fire with a probability of 2/7, and so on.
One result of extending the PLN to MPLN is that the node locations may now store output
probabilities which are more finely gradated than in the PLN - for example, it is possible that a
node will output 1 with 14% probability under a certain input. The second result of the extension
is that learning can allow incremental changes in stored values. In this way, one reset does not
erase much information. Erroneous information is discarded only after a series of errors. Similarly,
new information is only acquired after a series of experiences indicate its validity.
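A minimal sketch of an MPLN node along these lines is given below; the b-bit value stored at each location is read as a firing probability (a fraction of 2^b - 1), and the incremental update rule shown is an illustrative choice rather than the algorithm of the paper.

import random

class MPLNNode:
    """Sketch of an MPLN node: each location stores a b-bit integer read as a firing probability."""

    def __init__(self, n_inputs, b=3):
        self.max_value = 2 ** b - 1           # with b = 3 the stored values range over 0..7
        self.memory = [self.max_value // 2] * (2 ** n_inputs)

    def _address(self, pattern):
        return int("".join(str(bit) for bit in pattern), 2)

    def output(self, pattern):
        stored = self.memory[self._address(pattern)]
        p_fire = stored / self.max_value      # e.g. a stored 2 fires with probability 2/7
        return 1 if random.random() < p_fire else 0

    def reinforce(self, pattern, desired, step=1):
        # Illustrative incremental update: nudge the stored value towards the desired output,
        # so that a single error does not erase previously acquired information.
        a = self._address(pattern)
        delta = step if desired == 1 else -step
        self.memory[a] = max(0, min(self.max_value, self.memory[a] + delta))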
3 Description of the cut-point node and the cut-point recognition method
In this section the cut-point node is explained. The cut-point node is an extension of the MPLN
node and it is based on ideas from the theory of probabilistic automata. Probabilistic automata
recognise patterns based on two pieces of information: the final state of the automaton after the pattern
has been submitted to it and the overall probability associated with such a pattern. If the final
state of the automaton belongs to the set of final states of the language to be recognised, then
the probability associated with the pattern is compared with the threshold associated with the
language. Only if the probability is bigger than the threshold is the pattern accepted
by the automaton as belonging to the language in question.
To simulate the behaviour of a probabilistic automaton with neural networks we need to have a
node capable of storing the probability of the path followed by the network up to that node for a
fed pattern. To be able to store the probability of the path, extra memory was given to our node.
That is, besides the n words of b bits (considering a node with n inputs), the node is going to have
an extra word of k bits to store the probability.
With a MPLN node the numbers stored in the n words of b bits represent the 'firing probability'
of the neuron and the node outputs 1/0 depending on this 'firing probability'. Once the node
has output 1/0 there is no further use for the 'firing probability' with a MPLN. The overall output
of the network is only a pattern, without any probability. In our case, the overall output of the
network needs to be the pattern followed by the probability associated with this pattern. The
way the probability is calculated depends on the type of function the node is computing and
it will be explained in more detail later in this paper. Every node needs to have only one
probability associated with it in order to calculate the probability of the whole pattern submitted
to the network. This probability is used basically in the recognition phase of the system.
A different method of achieving pattern recognition is used with our model. This method was
first used for MPLN networks in ([14]), and we call it the cut-point recognition algorithm. The
output of our network consists of two parts: a pattern (which is the final state of the network) and
a probability (which is the probability of the input being recognised by the network). Patterns will
be recognised by the network if the last states of the network, when such patterns were submitted
to it, are in the set of final states of the network for the language in question and the probabilities
associated with the patterns are greater than the threshold.
A short description of the cut-point recognition algorithm is given below, where P is the
probability of the path in each node and p' is the firing probability of the node.
ALGORITHM
1. For all nodes set P = 0 (no path has been followed before the first symbol of each pattern is
fed).
2. Choose an input pattern X = x1 x2 ... xn.
3. Feed the pattern X = x1 x2 ... xn into the network symbol by symbol and calculate the overall
probability associated with the pattern X.
For every symbol xi do
3.1 Feed symbol xi to all external input terminals.
3.2 For all nodes j in the network do
If the node j fires, then calculate the probability P for this node:
3.2.1 If the node j is an and node, P = p'_j * Π (P of the inputs of node j that have fired at time i).
3.2.2 If the node j is an or node, P = Σ (P of the inputs of node j that have fired at time i).
3.2.3 If the node j is a complement or a delay node, P = P of the input of node j if it has fired at time i.
4. If the final state of the net is in the set of final states for the language then compare the overall
probability with the threshold. If the probability is bigger than the threshold then the pattern
X is accepted as belonging to the language.
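A minimal sketch of this bookkeeping is given below. The network structure, the firing test and the firing probabilities p' are illustrative placeholders (the paper does not fix them here); the sketch only shows how the path probability P of each node and the final cut-point test of step 4 fit together.

import math

class CutPointNode:
    def __init__(self, kind, p_fire=1.0, inputs=()):
        self.kind = kind        # "and", "or", "complement" or "delay"
        self.p_fire = p_fire    # p', the firing probability of the node
        self.inputs = inputs    # indices of the nodes feeding this node
        self.P = 0.0            # probability of the path through this node

def recognise(nodes, pattern, fires, final_state_ok, threshold):
    """fires(j, symbol, t) -> bool says whether node j fired at time t (assumed given)."""
    for node in nodes:
        node.P = 0.0                                       # step 1: no path followed yet
    for t, symbol in enumerate(pattern):                   # step 3: feed symbol by symbol
        for j, node in enumerate(nodes):
            if not fires(j, symbol, t):
                continue
            fired = [nodes[i].P for i in node.inputs if fires(i, symbol, t)]
            if node.kind == "and":                         # step 3.2.1: product of the input paths
                # For a node with no fired inputs we take the empty product = 1 (an illustrative convention).
                node.P = node.p_fire * (math.prod(fired) if fired else 1.0)
            elif node.kind == "or":                        # step 3.2.2: sum of the input paths
                node.P = sum(fired)
            else:                                          # step 3.2.3: complement / delay pass-through
                node.P = fired[0] if fired else node.P
    overall = nodes[-1].P                                  # path probability at the output node
    return final_state_ok and overall > threshold          # step 4: compare with the cut point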
This algorithm was based on the way probabilistic automata recognise patterns with thresholds.
The class of patterns recognised by the network depends not only on the structure of the network
but also on the threshold associated with the network. This algorithm not only increases the
computational power of the network, it is also more appropriate for pattern recognition because
it gives a deterministic decision as to whether a pattern X belongs to the language recognised by
a given network or not.
4 Computability of Weightless Neural Networks
The recognition methods used to date with weightless neural networks make them behave like
finite state machines. On the other hand, many neural problems, for instance natural language
understanding, make use of context-free grammars; it is therefore important to be able to implement
in neural networks at least some of the context-free grammars.
The cut-point node and recognition method were defined with the purpose of increasing the
computational power of weightless neural networks from finite state automata to probabilistic
automata.
The relationship between the Chomsky hierarchy and the weighted regular languages (these are
the languages recognised by probabilistic automata) can be illustrated by the following relations: (1)
WRL ⊃ RL; (2) WRL ∩ (CFL\RL) ≠ ∅; (3) WRL ∩ (CSL\CFL) ≠ ∅; (4) WRL ∩ (TZL\CSL) ≠ ∅;
where the abbreviated notation means: WRL is the set of weighted regular languages, RL is the set
of regular languages, CFL is the set of context-free languages, CSL is the set of context-sensitive
languages and TZL is the set of type zero languages.
In order to show that the computational power of a cut-point network and a probabilistic
automaton is the same it is necessary to prove the following theorems:
• (Theorem 1) Let G_w be a weighted regular grammar and L(G_w) be the language generated
by G_w associated with some cut-point λ; then there exists a weightless neural network,
associated with the same cut-point λ, that recognises L(G_w), and
• (Theorem 2) If a set of patterns L is recognised by a weightless neural network associated
with some cut-point λ, then this set can be generated by a weighted regular grammar G_w
associated with the same cut-point λ.
A formal proof of these theorems can be found in ([15]). Here a short description of the proofs
is given.
The proof of theorem 1 is an algorithm for transforming any weighted regular grammar into
a weightless network. The way in which the grammar is transformed into the neural network is
such that all the properties of the grammar are preserved and it is possible to infer the grammar
from the weightless network generated. Networks constructed with four kinds of nodes: p-and,
p-or, complement and delay nodes will be considered. The proof is based on the complexity of the
production rules. Transformation of the production rules into weightless networks starts with
the simplest production rule. The proof is then divided into three cases. The first case deals with
production rules of the form S1 → w(p), where w is a word in the language, S1 is a non-terminal
symbol of the grammar and p is the probability associated with this production rule. The second
case deals with production rules of the form Si → w Sj (p), and the last case deals with production
rules of the form Si → w1 Sj (p1) | Si → w2 Sk (p2), where | denotes the possibility of Si being replaced
by w1 Sj (p1) or by w2 Sk (p2). Each of these cases is divided into sub-cases to simplify
the proof. For every sub-case the circuit which implements it is given. The circuits are
built from the four types of nodes mentioned above.
Although the method described to transform the weighted regular grammar into a network gives a
complete structure - the network and its memory contents - the language recognised by the network
can be changed either by training or by changing the value of the threshold. This is particularly
useful when the exact grammar of the language to be recognised is not known and only an approximation
is known. Training is described in the next section.
The main purpose of theorem 2 is to show that the relationship between weighted regular
languages and weightless networks is an if-and-only-if relation. This proof can be used to determine the
total generalisation of a network after the network has been trained. This method of calculating
the total generalisation is more efficient than many others, for example submitting patterns to the
network to see if they are recognised by the net. It is also better than going through the whole
network in order to calculate the generalisation. The method derived from the theorem also gives
the probability of recognition for each pattern in the class recognised by the network.
Examples of the transformation of the weighted regular grammars into weightless networks and
vice-versa can be found in ([12]).
Our method of generating networks for parsing has many advantages over other models such
as the method in ([11]) and in ([8]). In the model in ([11]) only finite regular grammars without
recursive production rules can be transformed into networks, while ([8]) can deal with some context-free
languages with the use of a stack as an auxiliary memory. The use of stacks for pattern
recognition is not very natural. It is not likely that human beings recognise patterns using stacks.
We can extend the weightless neural network further (slightly and naturally) to obtain an
equivalence with Turing machines. A Turing machine is a combination of a finite state machine
with a device of infinite memory, or cells (tape), on which symbols from a finite set can be printed
and read. The tape is scanned in either the left or right direction one cell at a time. A WNN, like any
ANN, is a finite state automaton. The only missing component for the equivalence is access to a tape
in a similar fashion to Turing machines. This can be easily simulated by (de)coding instructions
from the output of a neural network with feedback and imagining that the input of the network is
fed from a tape. The details and some experiments can be found in ([6]). It is important to notice
that the Turing machine simulation by WNN requires neither an infinite number of neurons nor a
finite number of neurons with unbounded weights, as is the case in the weighted systems ([7], [9], [18]).
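The construction can be pictured with a small sketch: a finite transition table (standing in for the (de)coded output of a feedback WNN) drives an external tape. Everything below - the controller, states and example machine - is illustrative and not taken from ([6]).

def run_tape_controller(transitions, start_state, accept_states, tape, blank="B", max_steps=10000):
    """Drive a one-way-infinite tape from a finite-state controller.

    transitions maps (state, read_symbol) -> (next_state, write_symbol, move),
    with move = -1 (left) or +1 (right).  Any finite map of this kind could be
    realised by a feedback network whose output is a function of its state and input.
    """
    state, head, cells = start_state, 0, list(tape)
    for _ in range(max_steps):
        if state in accept_states:
            return True, "".join(cells)
        symbol = cells[head] if head < len(cells) else blank
        if (state, symbol) not in transitions:
            return False, "".join(cells)
        state, write, move = transitions[(state, symbol)]
        if head >= len(cells):
            cells.append(blank)
        cells[head] = write
        head = max(0, head + move)      # one-way-infinite tape: never move past the left end
    return False, "".join(cells)


# Example: replace every 'a' by 'b', then halt on the first blank.
rules = {("q0", "a"): ("q0", "b", 1), ("q0", "B"): ("halt", "B", 1)}
print(run_tape_controller(rules, "q0", {"halt"}, "aaa"))   # (True, 'bbbB')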
This establishes, at a theoretical level, the relationship between conventional and neural models
of computation. It also allows for the possibility of using ANN in "high-level" cognitive tasks such
as natural language processing and logical or common sense reasoning, which have traditionally been
attacked with the symbolic representation and processing techniques of Artificial Intelligence.
5 Learnability of Weightless Neural Networks
Learning in a RAM node takes place simply by writing into it, which is much simpler than the
adjustment of weights as in a weighted-sum-and-threshold network. The RAM node, as defined in
Section 2, can compute all binary functions of its inputs, while weighted-sum-and-threshold nodes
can only compute linearly separable functions of their inputs. There is no generalisation in the RAM
node itself (the node must store the appropriate response for every possible input), but there is
generalisation in networks composed of RAM nodes (Aleksander [1]). Generalisation in weightless
networks is affected first by the diversity of the patterns in the training set, that is, the more diverse
the patterns in the training set, the greater will be the number of subpatterns seen by each RAM,
resulting in a larger generalisation set. Secondly, the connection of RAMs to common features in
the training set reduces the generalisation set.
Training with PLN networks becomes a process of replacing u's with 0's and 1's so that the
network consistently produces the correct output pattern in response to training pattern inputs. At
the start of training, all stored values in all nodes are initialised to u, and thus the net's behaviour
is completely unbiased. In a fully converged PLN net, every addressed location should contain a 0
or a 1, and the net's behaviour will be completely deterministic.²
By restricting the scope of the tasks to formal language recognition, we can more directly relate
neural networks to conventional models of computation. For example, techniques from Machine
Learning and Inductive and Grammatical Inference now become available for analysis and comparison.
Besides, the restriction is just methodological, since any other task can be reduced to a problem of
language recognition³.
In this paper, we present preliminary results on the learnability problem, considering only
experiments with RAM networks. Now that we have developed the cut-point recognition method
and training strategies, experiments are being made with them and the results will be published on
another occasion.
Initially three different types of systems were developed to analyse learning with RAM feedback
networks ([13]). Besides training the systems with formal languages, other problems were also
studied. The type of network used consisted of a layer of identical RAM nodes, where each of them
had n address terminals, i of them connected to an external matrix of binary elements and f of them
connected to the output terminals of other neurons through clocked delay units (n = i + f).
n-tuples not present in the training set, or to nodes in higher layers if some combinations of lower node outputs never
occur. These unaddressed locations may contain 'U without affecting the status of the net as converged.
3Suppose one wants to learn functions from a set D to another R. Then, learn f is to recognise the language
L = {(x,y)lx = f(y)}·
Figure 2: Sequential Weightless Neural Network.
The structure of the network is represented in Figure 2, where the binary vectors X(t), R(t), R(t-1) and D(t)
represent, respectively at time t, the state of the input matrix, the response, the delayed response and
the "desired" response of the nodes. The input and feedback connections are randomly generated.
The network is trained to anticipate its inputs, i.e. D(t) = X(t+1), such that R(t) = X(t+α)
during the test phase.
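A minimal sketch of such a layer of RAM nodes with random input and feedback connections, trained by simply writing the desired bits D(t) = X(t+1) into the addressed locations, is given below; the sizes, connection choices and training loop are illustrative.

import random

class SequentialRAMNet:
    """Sketch of the network of Figure 2: one RAM node per input bit, i external and f feedback lines each."""

    def __init__(self, input_size, i, f, seed=0):
        rng = random.Random(seed)
        self.in_conn = [rng.sample(range(input_size), i) for _ in range(input_size)]
        self.fb_conn = [rng.sample(range(input_size), f) for _ in range(input_size)]
        self.memory = [[0] * (2 ** (i + f)) for _ in range(input_size)]
        self.prev = [0] * input_size        # R(t-1), the delayed responses

    def _address(self, k, x):
        bits = [x[c] for c in self.in_conn[k]] + [self.prev[c] for c in self.fb_conn[k]]
        return int("".join(map(str, bits)), 2)

    def step(self, x, desired=None):
        # One time step: compute R(t) from X(t) and R(t-1); during training, write D(t) = X(t+1).
        addresses = [self._address(k, x) for k in range(len(self.memory))]
        if desired is not None:
            for k, a in enumerate(addresses):
                self.memory[k][a] = desired[k]
        self.prev = [self.memory[k][a] for k, a in enumerate(addresses)]
        return self.prev


# Train on a short periodic sequence and then ask the net to anticipate it.
seq = [(0, 1, 0), (1, 0, 1), (0, 0, 1), (0, 1, 0)]
net = SequentialRAMNet(input_size=3, i=2, f=1)
for _ in range(5):
    net.prev = [0, 0, 0]
    for t in range(len(seq) - 1):
        net.step(seq[t], desired=seq[t + 1])        # D(t) = X(t+1)
net.prev = [0, 0, 0]
print(net.step(seq[0]))                             # ideally equals seq[1] = (1, 0, 1)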
The experimental results achieved with the systems briefly described above were satisfactory
in the sense that it was possible to distinguish many formal (regular, context-free and context-sensitive)
languages from their complements, but not all the languages trained. Even with regular
languages there were cases where the complete separation of the language from its complement
was not achieved. The systems used were finite state machines and they can compute all regular
languages, but they cannot learn all regular languages. Alternative training strategies are being
investigated and the results obtained will be compared with the experimental results achieved with
the systems described above.
Training cut-point networks can involve:
• only changes in the probabilities stored in the nodes;
• changes in any memory position of the node and
• changes in the value of the threshold.
The languages recognised by a neural network when only the threshold is changed are related
to each other. The higher the value of the threshold, the fewer elements of the language will be
recognised by the network. That is, L1 ⊃ L2 ⊃ ... ⊃ Ln when λ1 < λ2 < ... < λn. A higher threshold
places more restrictions on the path followed in the network: every time a new symbol,
xi, is submitted to the network, the value of the probability of the pattern p(X) will decrease or
will not change, but it will never increase.
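This nesting is easy to check on a toy example; the (pattern, probability) pairs below are made up for illustration.

scored = {"ab": 0.9, "abb": 0.6, "abbb": 0.35, "ba": 0.1}   # hypothetical network outputs

def language(threshold):
    return {w for w, p in scored.items() if p > threshold}

l1, l2, l3 = language(0.2), language(0.5), language(0.8)
print(l1 >= l2 >= l3)    # True: L1 contains L2 contains L3 when the cut points increase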
When changes occur only in the probabilities stored in the nodes, the training algorithm can
be very simple. Given a pattern X, if X ∈ L then reward the network (decrease the probabilities
of the transitions the network followed with X); otherwise punish the network (increase the
probabilities of the transitions the network followed with X).
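A sketch of this reward/punish rule is given below; the network is abstracted to the stored probabilities of the transitions that a pattern visited, and the step size and clipping are illustrative choices.

def train_on_pattern(transition_probs, visited, in_language, step=0.05):
    """Adjust the stored probabilities of the transitions visited while feeding a pattern X.

    Following the rule above: reward (X in L) decreases the visited probabilities,
    punishment (X not in L) increases them; values are clipped to [0, 1].
    """
    delta = -step if in_language else step
    for t in visited:
        transition_probs[t] = min(1.0, max(0.0, transition_probs[t] + delta))


# Example with made-up states and transitions.
probs = {("s0", "a"): 0.5, ("s1", "b"): 0.5}
train_on_pattern(probs, [("s0", "a"), ("s1", "b")], in_language=True)
print(probs)    # both visited transitions decreased by 0.05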
As a cut-point network is similar to a probabilistic automaton ([15]) we consider the behaviour
of small changes in a probabilistic automaton. When the probabilistic state transition of the probabilistic automaton is slightly changed, the probabilistic automaton will, sometimes, recognise a
different language ([17]). Unfortunately, this is not true all of the time. There are some sufficient
conditions for stability in probabilistic automata and there are cases in which stability is not possible. The general problem of stability is still unsolved, which means that if the changes generated
by the training algorithm in the state transition of the network are small there is no guarantee that
the training will change the language recognised by the network.
A training algorithm which allows changes in any memory position in the network is also
possible. In this case the function computed by the network can change completely.
6 Conclusions
This paper reviewed the basic processing mechanisms of weightless neurons and some of their
applications. Results on the computability and learnability of WNN were given.
Many existing learning algorithms for ANN have been applied to the formal language recognition
problem. New algorithms will be developed based on grammatical and inductive inference.
Although RAM networks can compute all regular languages, there is no learning algorithm for
all these functions. Cut-point networks are similar in computational power to probabilistic automata,
but again there is no learning algorithm to train a network to compute all weighted regular languages.
A Turing machine simulation with RAM nodes has been proposed, but once again no algorithm to
learn all recursive languages is known.
References
[1] I. Aleksander. Emergent intelligent properties of progressively structured pattern recognition
nets. Pattern Recognition Letters, 1:375-384, 1983.
[2] I. Aleksander. Adaptive pattern recognition systems and Boltzmann machines: A rapprochement. Pattern Recognition Letters, 1987.
[3] I. Aleksander. The logic of connectionist systems. In R. Eckmiller and C. Malsburg, editors,
Neural Computers, pages 189-197. Springer-Verlag, Berlin, 1988.
[4] I. Aleksander and H. Morton. An Introduction to Neural Computing. Chapman and Hall,
London, 1990.
[5] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303-314, 1989.
[6] W. R. de Oliveira and T. B. Ludermir. Turing machine simulation by logical neural networks.
In I. Aleksander and J. Taylor, editors, Artificial Neural Networks 2, pages 663-668. North-Holland, 1992.
[7] S. Franklin and M. Garzon. Neural computability. Progress in Neural Networks, 1:128-144,
1990. Ablex, Norwood.
[8] C. Giles, G. Chen, H. Chen, and Y. Chen. Higher order recurrent networks and grammatical
inference. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2.
Morgan Kaufmann, San Mateo, 1990.
[9] R. Hartley and H. Szu. A comparison of the computational power of neural network models.
In Proc. IEEE Conf. Neural Networks III, pages 17-22, 1987.
[10] R. Hecht-Nielsen. Theory of the backpropagation neural networks. In International Joint
Conference in Neural Networks, volume 1, pages 593-605, Washington, 1989. IEEE.
[11] S. Lucas and R. Damper. A new learning paradigm for neural networks. In Proceedings First
IEE International Conference on Artificial Neural Networks, 1989.
[12] T. B. Ludermir. Automata theoretic aspects of temporal behavior and computability in Logical
neural networks. PhD thesis, Imperial College, London, UK, December 1990.
[13] T. B. Ludermir. A feedback network for temporal pattern recognition. In R. Eckmiller, editor,
Parallel Processing in Neural Systems and Computers, pages 395-398. North-Holland, 1990.
[14] T. B. Ludermir. A cut-point recognition algorithm using PLN node. In Proceedings of ICANN-91, Helsinki, 1991.
[15] T. B. Ludermir. Computability of logical neural networks. Journal of Intelligent Systems,
2:261-290, 1992.
[16] C. E. Myers. Output functions for probabilistic logic nodes. In Proceedings First IEE International Conference on Artificial Neural Networks, 1989.
[17] M. O. Rabin. Probabilistic automata. Information and Control, 6:230-245, 1963.
[18] G. Sun, H. Chen, Y. Lee, and C. Giles. Turing equivalence of neural networks with second
order connection weights. In Proc. Int. Joint Conf. Neural Networks, 1991.