MirrorBot
IST-2001-35282
Biomimetic multimodal learning in a mirror
neuron-based robot
Results from the Language Experiments and Simulations with
MirrorBot (Workpackage 15.3)
Frédéric Alexandre, Stefan Wermter, Günther Palm, Olivier Ménard Andreas Knoblauch,
Cornelius Weber, Hervé Frezza-Buet Uli Kaufmann, David Muse, and Mark Elshaw
Covering period
MirrorBot Report 20
Report Version: 0
Report Preparation Date: 8 May 2017
Classification: Draft
Contract Start Date: 1st June 2002
Duration: Three Years
Project Co-ordinator: Professor Stefan Wermter
Partners: University of Sunderland, Institut National de Recherche en Informatique et en
Automatique, Universität Ulm, Medical Research Council, Universita degli Studi di
Parma
Project funded by the European Community under the
“Information Society Technologies Programme”
0 TABLE OF CONTENTS
1 INTRODUCTION
2 LANGUAGE INPUT
3. MODEL OF CORTICAL LANGUAGE MODEL
3.1 Model Simulation – Grammatical Sentence
3.2 Model Simulation – Acceptable and Unacceptable Sentences
3.3 Disambiguation using the model
4. SELF-ORGANISING CORTICAL MAP LANGUAGE REPRESENTATION MODEL
5. HIERARCHICAL MULTIMODAL LANGUAGE MODELING
5.1 Hierarchical GLIA architectures
5.2 Representation of action verb instruction word form
5.3 Visual and motor direction representations
5.4 Training algorithm
5.5 Hierarchical GLIA architecture results
5.6 Discussion of hierarchical architecture
6. CONCLUSION
7. REFERENCES
1 INTRODUCTION
In this report we present the results from experiments with language models for three related language components. The first model acts as a front-end component to the other two; the second model recreates the neuroscience findings of our Cambridge partners on how action verbs are represented by body parts; and the third model, based on the mirror neuron system, acts as a grounding of language instruction in actions (GLIA) architecture. Although these three models can be used separately, together they provide a successful approach to modelling language processing. The first model, the cortical language model, takes in language either spoken or entered by pushing buttons and represents this input using different cortical regions. The cortical language model is able to act as a grammar checker for the input as well as to determine whether the input is semantically possible. This model consists of 15 areas, each containing a spiking associative memory of 400 neurons.
The second language model, the self-organising cortical map language representation model, uses multi-modal information processing inspired by cortical maps. We illustrate a phonetic-motor association that shows that the organisation of words can integrate motor constraints, as observed by our Cambridge partners. This model takes the action verbs from the first model. The main computational block of the model is a set of computational units called a map. A map is a sheet made of a tiling of identical units. This sheet has been implemented as a disk, for architectural reasons described further below. When input information is given to the map, each unit shows a level of activity depending on the similarity between the information it receives and the information it specifically detects.
The final model, the grounding of language instruction in actions (GLIA) model, takes three example action verbs represented in the second model and grounds them in actual actions. In doing so the GLIA model learns to perform and recognise three complex behaviours, ‘go’, ‘pick’ and ‘lift’, and to associate these with their action verbs. This hierarchical model has two layers. In the lower hidden layer the Helmholtz machine wake-sleep algorithm is used to learn the relationship between action and vision; the upper layer uses the Kohonen self-organising approach to combine the output of the lower hidden layer with the language input. These models are able to recreate the findings on the mirror neuron system in that the activations in the hidden layers are similar during both performing and recognising. In the hierarchical model the rather separate sensory and motor representations on the lower level are bound to corresponding sensory-motor pairings via the top level, which organises according to the language input. We suggest analogies to the organisation of motor cortical areas F5 and F6 and the mirror neurons therein. Figure 1 shows the overall structure of the language model, with the cortex model passing the action verbs into the self-organising cortical map language representation model, which produces a topological representation of the action verbs using the inspiration of the Cambridge partner, and the final model grounding example language instructions from the second language model in actions.
Figure 1 The overall structure of the language model architecture. The language input feeds Model 1, the cortical language model; the action verbs, together with perceptual information on them, pass to Model 2, the self-organising cortical map language representation model, which produces a topological representation of the action verbs based on body parts; multimodal inputs for three action verbs from those represented in the level 2 model feed Model 3, the mirror neuron system based language model (the GLIA model).
2 LANGUAGE INPUT
It is possible to provide speech and button-pressing input to the MirrorBot. The program for input via button pressing is shown in Figure 2. The speech input uses the CSLU speech recognition and production software, which runs under Windows and has been tested and found to be robust for this task. CSLU is run on a laptop computer on the robot. This was achieved by having the robot act as a server and having the laptop running the speech recognition and production software request a connection to the robot. Figure 3 illustrates the server-client connection via a socket between the robot and the laptop running the CSLU toolkit interface. The recognised input is then introduced into the language model in the appropriate form, as described below.
Figure 2 The push button program for inputting an input sentence.
Figure 3 Communication between the robot and laptop.
3. MODEL OF CORTICAL LANGUAGE MODEL
The first model, the cortical language model, is described in Fay et al. (2004), Knoblauch
et al. (2005a, b), and Markert et al. (2005). It consists of 15 areas. Each of the areas is
modelled as a spiking associative memory of 400 neurons. In each area we defined a
priori a set of binary patterns (i.e., subsets of the neurons) constituting the neural
assemblies. In addition to the local synaptic connections we also modelled extensive
inter-areal hetero- associative connections between the cortical areas (see Figure 4). The
model can roughly be divided into three parts. (1) Auditory cortical areas A1,A2, and
A3: First auditory input is represented in area A1 by primary linguistic features (such as
phonemes), and subsequently classified with respect to function (area A2) and content
(area A3). (2) Language specific areas A4, A5-S, A5-O1-a, A5-O1, A5-O2-a, and A5O2: Area A4 contains information about previously learned sentence structures, for
example that a sentence starts with the subject followed by a predicate. The other areas
contain representations of the different sentence constituents (such as subject, predicate,
or object). (3) Activation fields af-A4, af-A5-S, af-A5-O1, and af-A5-O2: The activation
fields are relatively primitive areas that are connected to the corresponding grammar
areas. They serve to activate or deactivate the grammar areas in a rather unspecific way.
Figure 4 Cortical areas and inter-areal connections. Each of the 15 areas (boxes) was implemented as a spiking associative memory, where patterns (or assemblies) are stored auto-associatively in the local synaptic connections. Each black arrow corresponds to a hetero-associative inter-areal synaptic connection. Blue feedback arrows indicate short-term memory mechanisms.
Each area consists of spiking neurons that are connected by local synaptic feedback.
Previously learned word features, whole words, or sentence structures are represented by
neural assemblies (subgroups of neurons, i.e., binary pattern vectors), and laid down as
long-term memory in the local synaptic connectivity according to Hebbian coincidence
learning (Hebb 1949).
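The storage and retrieval scheme within one area can be pictured as a binary Willshaw/Palm-style associative memory. The following Python sketch is illustrative only (it is not the project code, and the assembly size and retrieval threshold are assumptions); it stores sparse binary assemblies with a Hebbian, clipped outer-product rule and retrieves them by thresholded matrix-vector products.

```python
import numpy as np

class BinaryAssociativeMemory:
    """Minimal sketch of a Willshaw/Palm-style binary associative memory."""

    def __init__(self, n_neurons=400):
        self.n = n_neurons
        self.W = np.zeros((n_neurons, n_neurons), dtype=np.uint8)  # local synapses

    def store(self, pattern):
        """Hebbian coincidence learning: clip the outer product into the weights."""
        p = pattern.astype(np.uint8)
        self.W |= np.outer(p, p)

    def retrieve(self, cue, k_active=20):
        """One associative step: dendritic sums, then keep the k most excited units."""
        sums = self.W.astype(int) @ cue.astype(int)
        threshold = np.sort(sums)[-k_active]
        return (sums >= threshold).astype(np.uint8)

# Hypothetical usage: store a sparse random assembly and complete it from a partial cue.
rng = np.random.default_rng(0)
mem = BinaryAssociativeMemory(400)
assembly = np.zeros(400, dtype=np.uint8)
assembly[rng.choice(400, size=20, replace=False)] = 1
mem.store(assembly)
cue = assembly.copy()
cue[np.flatnonzero(cue)[:10]] = 0                     # damage half of the assembly
print(np.array_equal(mem.retrieve(cue), assembly))    # pattern completion succeeds
```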
When processing a sentence, the most important information stream flows from the primary auditory area A1 via A3 to the grammatical role areas A5-S, A5-P, A5-O1, A5-O1-a, A5-O2, and A5-O2-a. From there other parts of the global model can use the
grammatical information to perform, for example, action planning. The routing of
information from area A3 to the grammatical role areas is guided by the grammatical
sequence area A4 which receives input mainly via the primary auditory areas A1 and A2.
This guiding mechanism of area A4 works by activating the respective grammatical role
areas appropriately via the activation fields af-A5-S, af-A5-P, af-A5-O1, and af-A5-O2.
In the following we will further illustrate the interplay of the different areas when
processing a sentence.
3.1 Model Simulation – Grammatical Sentence
The sentence “Bot put plum green apple” is processed in 36 simulation steps, where in each step an associative update of the activity state in the areas is performed. We observed that processing of a grammatically correct sentence is accomplished only if some temporal constraints are matched by the auditory input. This means that the representation of a word in the primary areas must be active for at least a minimal number of steps. It turned out that a word should be active for at least 4-5 simulation steps. Figure 5 shows the activity state of the model after the 6th simulation step. Solid arrows indicate currently involved connections, dashed arrows indicate relevant associations in previous simulation steps. At the beginning all the activation fields except for af-A4 have the ‘OFF’ assembly activated due to input bias from external neuron populations. This means that the grammatical role areas are initially inactive. Only activation field af-A4 enables area A4 to be activated.
Figure 5. Activation state of the model of cortical language areas after simulation step 6. Initially the OFF-assemblies of the activation fields are active except for af-A4, where the ON-assembly is activated by external input. The processing of a sentence beginning activates the ‘_start’ assembly in area A2, which activates ‘S’ in A4, which activates af-A5-S, which primes A5-S. When subsequently the word “bot” is processed in area A1 this will activate the ‘_word’ assembly in area A2, and the ‘bot’ assemblies in area A3 and area A5-S. Solid arrows indicate currently involved connections, dashed arrows indicate relevant associations in previous simulation steps.
This happens, for example, when auditory input enters area A1. First the ‘_start’
representation in area A1 gets activated indicating the beginning of a new spoken
sentence. This will activate the corresponding ‘_start’ assembly in area A2 which in turn
activates the ‘S’ assembly in the sequence area A4 since the next processed word is
expected to be the subject of the sentence. The ‘S’ assembly therefore will activate the
‘ON’ assembly in the activation field af-A5-S which primes area A5-S such that the next
processed word in area A3 will be routed to A5-S. Indeed, as soon as the word
representation of “Bot” is activated in area A1 it is routed in two steps further via area A3
to area A5-S.
In step 7 the ‘_blank’ representation enters area A1, indicating the border between the words “bot” and “put”. This will activate the ‘_blank’ assembly in area A2, which will activate the ‘OFF’ assembly in activation field af-A4. This results in a deactivation of the sequence area A4. Since the ‘_blank’ assemblies in areas A1 and A2 are only active for one single simulation step and immediately followed by the representations of the word “put”, the ‘ON’ assembly in activation field af-A4 becomes active again one step later and also activates the sequence area A4. However, since the ‘S’ assembly in A4 was intermittently erased, the delayed intra-areal feedback of area A4 will switch activity to the next sequence assembly ‘P’. This means that the next processed word is expected to be the predicate of the sentence. The ‘P’ representation in area A4 activates the ‘ON’ assembly in activation field af-A5-P, which primes area A5-P to be activated by input from area A3. In the meantime the representations of the input word “put” were activated in areas A1 and A3, and from there routed further to area A5-P. This leads to the situation after simulation step 13 as shown in Figure 6.
Figure 6. System state of the model of cortical language areas after simulation step 13.
The ‘_blank’ representing the word border between words “bot” and “put” is recognized
in area A2 which activates the ‘OFF’ representation in af-A4 which deactivates area A4
for one simulation step. Immediately after the ‘_blank’ the word “put” is processed in A1
which activates the ‘_word’ and ‘put’ assemblies in areas A2 and A3, respectively. This
will activate again via af-A4 area A4. Due to the delayed intra-areal feedback this will
activate the ‘P’ representation in area A4 as the next sequence assembly. This activates
the ‘ON’ assembly in af-A5-P which primes A5-P to be activated by the ‘put’
representation in area A3.
The situation after simulation step 25 is illustrated in Figure 7. At this stage the next word border must not switch the sequence assembly in area A4 further to its next part. This is because the second object phrase is actually not yet complete: we have only processed the attribute (“green”) so far. Therefore a rather subtle (but not implausible) mechanism in our model guarantees that the switching is prevented at this stage. The ‘_none’ representation in area A5-O2 has two effects. First, together with the ‘_blank’ assembly in area A2 representing the word border, it activates the ‘OFF_p’ assembly in af-A5-O2. Second, by a projection to activation field af-A4, it prevents the activation of the ‘OFF’ assembly in af-A4 and therefore the switching in the sequence area A4. The ‘OFF_p’ assembly deactivates only area A5-O2 but not A5-O2-a. As soon as A5-O2 is deactivated and the next word “apple” is processed and represented by the ‘_word’ assembly in area A2, input from the ‘vp2_O2’ assembly in area A4 activates again the ‘ON’ assembly in activation field af-A5-O2. Thus the ‘apple’ assembly in area A5-O2 gets activated via areas A1 and A3. Figure 12 shows the situation at this point after simulation step 30.
Figure 7. The system state of the model of cortical language areas after simulation step
25. The word border between “plum” and “green” activates the ‘_blank’ representations
in area A1 and A2 which initiates the switching of the sequence assembly in area A4 to
its next part as described before (see text). The activation of the ‘vp2_O2’ assembly
activates the ‘ON’ assembly in activation field af-A5-O2 which primes areas A5-O2-a
and A5-O2. Then the processing of the word “green” activates the ‘green’ assembly in
area A5-O2-a via areas A1 and A3. Since there exists no corresponding representation in
area A5-O2 here the ‘_none’ assembly is activated.
3.2 Model Simulation – Acceptable and Unacceptable Sentences
Figure 8 illustrates the flow of neural activation in our model when processing a simple sentence such as “Bot lift orange”. In the following it is briefly explained how the model works.
The beginning of a sentence, which is represented in area A2 by assembly _start, will activate assembly S in area A4, which activates assembly _activate in activation field af-A5-S, indicating that the next processed word should be the subject of the sentence. Until step 7 the word “bot” has been processed and the corresponding assembly extends over areas A1, A3 and A5-S. While new ongoing input will activate other assemblies in the primary areas A1, A2, A3, a part of the global “bot” assembly is still activated in area A5-S due to the working memory mechanism.
Before the representation of the next word “lift” enters A1 and A2, the sequence assembly in A4 is switched from S to Pp, guided by contextual input from area A5-S. In the same way as before an assembly representing “lift” is activated in areas A1, A3, and A5-P.
Since “lift” is a verb requiring a “small” object, processing of the next word “orange”
will switch the sequence assembly in area A4 to node OA1s which means that the next
word is expected to be either an attribute or a small object. While a part of the “lift”
assembly remains active in the working memory of area A5-P, processing of “orange”
activates an assembly extending over A1, A3, A5-O1, and A5-O1-a . Activation of
pattern _obj in area A5-O1-a indicates that no attributal information has been processed.
Since “orange” is actually a small object which can be lifted, the sequence assembly in area A4 switches to ok_SPOs. While the assemblies in the primary areas A1, A2, A3 fade, the information about the processed sentence is still present in the subfields of area A5 and is ready for being used by other cortical areas such as the goal areas.
Figure 9 illustrates the processing of the sentence “Bot lift wall”. The initial processing of
“Bot lift...” is the same as illustrated in Figure 8. That means in area A4 there is an
activation of assembly OA1s which activates areas A5-O1-a and A5-O1 via the activation
field af-A5-O1. This indicates that the next word is expected to be a small object. Table 1
and Table 2 provide the results from the cortical language model when the sentences
introduced are acceptable and non-acceptable.
Figure 8: Processing of the sentence “Bot lift orange”. Black letters on white background indicate local assemblies which have been activated in the respective areas (small letters below area names indicate how well the current activity matches the learned assemblies). Arrows indicate the major flow of information.
Since the next word “wall” is not a small object (and cannot be lifted), the sequence in area A4 is switched to the error representation err_OA1s.
Figure 9: Processing of the sentence “Bot lift wall”. Processing of the first two words is the same as in Figure 8. When processing “wall” this activates an error state in area A4 (err_OA1s) since the verb “lift” requires a “small” object which can be lifted.
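The routing behaviour summarised above can be pictured as a small finite-state check over the grammatical roles, with an additional semantic constraint on which objects a verb accepts. The following Python sketch is only an illustration of that idea, not the spiking implementation; the result states mirror the A4 assemblies reported in Tables 1 and 2, while the word lists and verb constraints are assumptions made for the example.

```python
# Illustrative finite-state sketch of the A4 sequence checking (not the spiking model).
SUBJECTS = {"bot", "sam"}
VERBS = {"stop": None, "go": "any", "lift": "small", "pick": "small", "show": "any"}
SMALL_OBJECTS = {"orange", "plum", "apple", "nut", "ball"}

def parse(sentence):
    """Return an A4-style result state and the filled grammatical roles."""
    words = sentence.lower().strip(".").split()
    roles = {"A5-S": "_none", "A5-P": "_none", "A5-O1": "_null"}
    if not words or words[0] not in SUBJECTS:
        return "err_S", roles                      # sentence must start with a subject
    roles["A5-S"] = words[0]
    if len(words) < 2 or words[1] not in VERBS:
        return "err_Pp", roles                     # a predicate must follow
    roles["A5-P"] = verb = words[1]
    if VERBS[verb] is None:                        # e.g. "stop" takes no object
        return ("ok_SP", roles) if len(words) == 2 else ("err_okSP", roles)
    if len(words) < 3:
        return "err_OA1", roles                    # object expected but missing
    obj = words[-1]
    roles["A5-O1"] = obj
    if VERBS[verb] == "small" and obj not in SMALL_OBJECTS:
        return "err_OA1s", roles                   # e.g. "lift wall" is implausible
    return ("ok_SPOs" if VERBS[verb] == "small" else "ok_SPO"), roles

print(parse("Bot lift orange"))   # ('ok_SPOs', ...)
print(parse("Bot lift wall"))     # ('err_OA1s', ...)
```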
More test results produced by our cortical language model are summarized in the
following tables. The first table shows examples of correct sentences.
Table 1 Results of the language module when processing correct sentences

sentence                   | area A4 | A5-S | A5-P      | A5-O1-a | A5-O1 | A5-O2-a | A5-O2
Bot stop.                  | ok_SP   | bot  | stop      | _null   | _null | _null   | _null
Bot go white wall.         | ok_SPO  | bot  | go        | white   | wall  | _null   | _null
Bot move body forward.     | ok_SPA  | bot  | move_body | forward | _null | _null   | _null
Sam turn body right.       | ok_SPA  | sam  | turn_body | right   | _null | _null   | _null
Bot turn head up.          | ok_SPA  | bot  | turn_head | up      | _null | _null   | _null
Bot show blue plum.        | ok_SPO  | bot  | show      | blue    | plum  | _null   | _null
Bot pick brown nut.        | ok_SPOs | bot  | pick      | brown   | nut   | _null   | _null
The second table shows examples where the sentences are grammatically wrong or
represent implausible actions (such as lifting walls).
Table 2 Results of the language model when processing grammatically wrong, incomplete, or implausible sentences

sentence                    | area A4    | A5-S   | A5-P  | A5-O1-a | A5-O1  | A5-O2-a | A5-O2
Bot stop apple.             | err_okSP   | bot    | stop  | _null   | _null  | _null   | _null
Orange bot lift.            | err_Po     | orange | _none | _null   | _null  | _null   | _null
Stop apple.                 | err_S      | _none  | _null | _null   | _null  | _null   | _null
Bot bot lift orange.        | err_Pp     | bot    | _none | _null   | _null  | _null   | _null
Orange lift plum.           | err_Po     | orange | lift  | _null   | _null  | _null   | _null
Sam lift orange apple.      | err_okSPOs | sam    | lift  | _obj    | orange | _null   | _null
This is red go.             | err_OA1    | this   | is    | red     | _none  | _null   | _null
Bot lift wall.              | err_OA1s   | bot    | lift  | _obj    | wall   | _null   | _null
Bot pick white dog.         | err_OA1s   | bot    | pick  | white   | dog    | _null   | _null
Bot put wall (to) red desk. | err_OA1s   | bot    | put   | _obj    | wall   | _null   | _null
3.3 Disambiguation using the model
The cortical model is able to use context for disambiguation. For example, an ambiguous phonetic input such as “bwall”, which is between “ball” and “wall”, is interpreted as “ball” in the context of the verb “lift”, since “lift” requires a small object, even if without this context information the input would have been resolved to “wall”. Thus the sentence “bot lift bwall” is correctly interpreted. As the robot first hears the word “lift” and then immediately uses this information to resolve the ambiguous input “bwall”, we call this “forward disambiguation”. This is shown in Figure 10. Our model is also capable of the more difficult task of “backward disambiguation”, where the ambiguity cannot immediately be resolved because the required information is still missing. Consider for example the artificial ambiguity “bot show/lift wall”, where we assume that the robot could not decide between “show” and “lift”. This ambiguity cannot be resolved until the word “wall” is recognized and assigned to its correct grammatical position, i.e. the verb of the sentence has to stay in an ambiguous state until enough information is gained to resolve the ambiguity. This is achieved by activating superpositions of the different assemblies representing “show” and “lift” in area A5-P, which stores the verb of the current sentence. More subtle information can be represented in the spike times, which allows, for example, remembering which of the alternatives was the more probable one. More details on this system can be found in Fay et al. (2004), Knoblauch et al. (2005a, b), and Markert et al. (2005).
Figure 10 A phonetic ambiguity between “ball” and “wall” can be resolved by using context information. The context “bot lift” implies that the following object has to be of small size. Thus the correct word “ball” is selected.
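As a toy illustration of forward disambiguation only (the spiking superposition mechanism used for backward disambiguation is not reproduced here), the sketch below scores candidate words for an ambiguous input against the constraint imposed by the already-recognised verb. The candidate sets, similarity scores and the small-object constraint are assumptions made for the example.

```python
# Toy sketch of forward disambiguation: verb context constrains an ambiguous noun.
SMALL_OBJECTS = {"ball", "orange", "plum", "apple", "nut"}
VERB_NEEDS_SMALL = {"lift": True, "show": False}

def disambiguate(candidates, verb):
    """Pick the candidate compatible with the verb; fall back to plain similarity."""
    # Hypothetical phonetic similarity scores for the ambiguous input "bwall".
    similarity = {"wall": 0.6, "ball": 0.5}
    allowed = [c for c in candidates
               if not VERB_NEEDS_SMALL.get(verb, False) or c in SMALL_OBJECTS]
    pool = allowed if allowed else candidates
    return max(pool, key=lambda c: similarity.get(c, 0.0))

print(disambiguate(["ball", "wall"], "lift"))  # -> "ball" (lift requires a small object)
print(disambiguate(["ball", "wall"], "show"))  # -> "wall" (no constraint, best phonetic match)
```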
4. SELF-ORGANISING CORTICAL MAP
LANGUAGE REPRESENTATION MODEL
As described in Prototype 7, the cortical language model can act as the front-end to other models produced in the MirrorBot project. For instance, it can be used as the input to a cortical planning, action, and motor processing system so that a camera can be moved to centre a recognized object or to perform docking to an object. Furthermore, the cortical model can be used as the front-end of the second language model, which uses a self-organising model of multi-modal information inspired by cortical maps. This model illustrates a phonetic-motor association that shows that the organisation of words can integrate motor constraints, as observed by our Cambridge partners. This model, called Bijama, is a general-purpose cortically-inspired computational framework that has also been used for rewarded arm control. The main computational block of the model is a set of computational units called a map. A map is a sheet made of a tiling of identical units. This sheet has been implemented as a disk, for architectural reasons described further below. When input information is given to the map, each unit shows a level of activity, depending on the similarity between the information it receives and the information it specifically detects.
When an input is given to the map, the distribution of matching activities among units is a scattered pattern, because tuning curves are not sharp, which allows many units to have non-null activities even if their prototypes don't perfectly match the input. From this activity distribution over the map, a small compact set of units that contains the most active units has to be selected. In order to decide which units are locally the best matching ones inside a map, a local competition mechanism is implemented. It is inspired by theoretical results of the continuum neural field theory (CNFT). The CNFT algorithm tends to choose more often the units that have more connections. Thus, the local connection pattern within the maps must be the same for all units, which is the case for a torus-like lateral connection pattern, with units at one border of the map connected to the opposite border, for example. Here, the field of units in the map computes a distribution of global activities A*, resulting from the competition among the current matching activities A^t. This competition process has been made insensitive to the actual position of units within the map, in spite of heterogeneous local connection patterns at the level of border units in the Bijama model, as detailed further below. The global behaviour of the map, involving adaptive matching processes and learning rules dependent on a competition, is reminiscent of the Kohonen SOM. However, the local competition algorithm used here allows the units to be fed with different inputs.
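A very reduced sketch of this matching-plus-competition step is given below: each unit holds a prototype, computes a Gaussian matching activity for the current input, and a soft local competition concentrates activity around the best-matching region. The prototypes, the Gaussian tuning width and the simple lateral kernel are assumptions for illustration; the actual model uses CNFT dynamics on a disk-shaped map with toric lateral connectivity.

```python
import numpy as np

rng = np.random.default_rng(1)
MAP_SIDE = 16                                       # illustrative square map of units
prototypes = rng.random((MAP_SIDE, MAP_SIDE, 3))    # each unit detects a 3-d input

def matching_activity(x, width=0.3):
    """Gaussian tuning: activity decays with the distance between input and prototype."""
    d = np.linalg.norm(prototypes - x, axis=-1)
    return np.exp(-0.5 * (d / width) ** 2)

def local_competition(A, iterations=10, sigma=2.0):
    """Crude stand-in for the CNFT competition: repeatedly focus activity in a
    Gaussian bubble around the current maximum and subtract global inhibition."""
    ii, jj = np.meshgrid(np.arange(MAP_SIDE), np.arange(MAP_SIDE), indexing="ij")
    out = A.copy()
    for _ in range(iterations):
        wi, wj = np.unravel_index(np.argmax(out), out.shape)
        bubble = np.exp(-((ii - wi) ** 2 + (jj - wj) ** 2) / (2 * sigma ** 2))
        out = np.clip(A * bubble - 0.1 * out.mean(), 0.0, None)
    return out

A_t = matching_activity(rng.random(3))              # scattered matching activities
A_star = local_competition(A_t)                     # compact bubble around the best match
```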
A cortical layer that receives information from another map doesn't receive inputs from all the units of the remote map, but only from one stripe of units. For instance, a map may be connected row-to-row to another map: each unit in any row of the first map is connected to every remote unit in the corresponding row of the other map. These connections are always reciprocal in the model. The combination of self-organization and coherent learning produces what we call joint organization: competition, although locally computed, occurs not only inside any given map, but across all maps. Moreover, the use of connection stripes limits the connectivity, which avoids the combinatorial explosion that would occur if the model were to employ full connectivity between the maps. Thus, coherent learning leads to both efficient data representation in each map and coordination between all connected maps.
A multi-associative model is intended to associate multiple modalities, regardless of how they are related. It must then handle the case where the associations between two modalities are not one-to-one, but rather one-to-many, or even many-to-many. This multi-association problem will now be presented with a simple example.
Let us consider an association between certain objects and the sounds they produce. A car, for example, could be associated with a motor noise. Certain objects produce the same noise; as a result, a single noise will be associated with multiple objects. For instance, a firing gun and some exploding dynamite both produce basically an explosion sound. In our model, let us represent the sounds and the objects as two thalamic modalities on two different cortical maps. Let us now link both of these maps to another one, which we call an associative map. The sound representations and the object representations are now bound together through the associative map (see Figure 11). If we want a single unit to represent the “BANG” sound in the sound map, a single unit in the associative map has to bind together the “BANG” unit with both the gun and the dynamite units. This associative unit must then have the ability to perform multi-associations: it must have a strong cortical connection to two different units (the gun and the dynamite units) in the same cortical stripe (see Figure 11a). If associative units cannot perform multi-associations, the resulting self-organisation process among all maps will duplicate the “BANG” representative unit. The reason is that, in this very case, an associative unit is able to listen to only one unit in each connected module. Each instance of that unit will then be bound, through its own associative unit, either to the dynamite or the gun unit (see Figure 11b). Moreover, the two object units cannot be in the same connectivity stripe, or else the model would try to perform multi-association and fail. Since our model uses a logical “AND” between the different inputs of a unit, that single associative unit must be active each time one of these sounds and one of these objects are active together.
The cortical connections are trained with a learning rule adapted from the Widrow-Hoff rule: if unit i is connected to unit j, the weight w_ij of the cortical connection from i to j is updated according to

δw_ij = α (A*_i − A^c_i) A*_j − ε w_ij,

where A*_i is the cortical activity of unit i, α is the learning rate and ε is the decay rate of cortical connections. Here, the cortical activity A^c_i = Σ_j w_ij A*_j is seen as a predictor of A*_i. When both the local unit i and the remote unit j are active together, w_ij grows if A^c_i is lower than A*_i, and decreases if A^c_i is higher than A*_i. w_ij also decreases slowly over time. Here, the global connection strength of the local unit i for a given cortical stripe is distributed among all active remote units j, and not among all remote units. As with a Hebb/anti-Hebb rule, because of the local competition, the connection strength w_ij between i and an active unit j must be high. However, here, raising w_ij doesn't imply lowering all w_ik for all k in the remote connection stripe: raising w_ij will only lower w_ik if j and k are active at the same time. And since only a small A* activity bubble is present on each map, most remote units in the connection stripe cannot be active at the same time. Thus, the local unit i can bind together multiple units in a given connection stripe with a unit in another connection stripe. This is the reason why the use of the Widrow-Hoff learning rule in our model leads to the multi-map organization shown in Figure 11a. Solving the multi-association problem has one main benefit: the maps need fewer units to represent a certain situation than when multi-association between units is impossible. Moreover, since instances of a given thalamic input are not duplicated in different parts of a cortical map, it is easier for the model to perform a compromise between the local organization and the cortical connectivity requirements, i.e. the joint organization is less constrained.
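A compact sketch of this stripe-wise update is given below. It implements the update δw_ij = α (A*_i − A^c_i) A*_j − ε w_ij described above; the stripe size, learning rate and decay values are illustrative assumptions rather than the model's actual parameters.

```python
import numpy as np

def widrow_hoff_stripe_update(w, A_local, A_remote, alpha=0.05, eps=1e-3):
    """One update of the cortical weights of a stripe.

    w        : (n_local, n_remote) weights from remote stripe units to local units
    A_local  : (n_local,)  global activities A* of the local units
    A_remote : (n_remote,) global activities A* of the remote stripe units
    """
    A_pred = w @ A_remote                       # A^c_i, prediction of the local activity
    dw = alpha * np.outer(A_local - A_pred, A_remote) - eps * w
    return w + dw

# Hypothetical usage: one local unit learns to be driven by two remote units
# (e.g. the "gun" and "dynamite" units) that are never active at the same time.
w = np.zeros((1, 8))
for step in range(2000):
    A_remote = np.zeros(8)
    A_remote[step % 2] = 1.0                    # remote units 0 and 1 alternate
    A_local = np.array([1.0])                   # the local unit is active with either
    w = widrow_hoff_stripe_update(w, A_local, A_remote)
print(np.round(w, 2))                           # both w[0, 0] and w[0, 1] end up high
```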
Several brain imaging studies have shown that word encoding within the brain is not only organized around purely phonetic codes but is also organized around action. How this is done within the brain has not yet been fully explained, but we would like to present how these action-based representations naturally emerge in our model by virtue of solving constraints coming from motor maps. We therefore applied our model to a simple word-action association. A part of the word set from the European MirrorBot project was used in a “phonetic” map, and we tried to associate these words with the body part that performs the corresponding action. One goal of this project is to define multimodal robotic experiments, and the corresponding protocols are consequently well suited for this task. The phonetic coding used in our model is taken from the MirrorBot project. A word is separated into its constituent phonemes. Each phoneme is then coded by a binary vector of length 20. Since the longest word that is used has 4 phonemes, each word is coded by 4 phonemes, and words with fewer phonemes are completed with empty phonemes. The distance between two different phonemes is the Cartesian distance between the coding vectors. The distance between two words is the sum of the distances between their constituent phonemes. While we are well aware that this is a very functional way to represent the phonetic distance between two words, it is sufficient to exhibit the joint organization properties discussed in this report. The actions are coded in the same way as the words: there are 3 different input actions (head action, body action and hand action), and each action is coded as a binary vector of length 3. The distance between two actions is, once again, the Cartesian distance between their representing vectors. Each word is semantically associated to a specific action. The word-action relationship is shown in Figure 12.
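A small sketch of this coding and distance computation follows. The 20 phonetic features per phoneme are drawn at random here purely for illustration (the actual MirrorBot-derived feature vectors are not reproduced, and the phoneme symbols are illustrative); the padding and distance definitions follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative stand-in for the 20-feature binary code of each phoneme ("_" = empty phoneme).
PHONEME_CODE = {p: rng.integers(0, 2, size=20).astype(float)
                for p in ["g", "@", "U", "l", "I", "f", "t", "p", "k", "_"]}

def encode_word(phonemes, max_len=4):
    """A word is 4 phoneme vectors; shorter words are padded with empty phonemes."""
    padded = list(phonemes) + ["_"] * (max_len - len(phonemes))
    return np.stack([PHONEME_CODE[p] for p in padded])       # shape (4, 20)

def word_distance(w1, w2):
    """Sum over phoneme positions of the distance between the feature vectors."""
    return float(np.sum(np.linalg.norm(encode_word(w1) - encode_word(w2), axis=1)))

print(word_distance(["g", "@", "U"], ["p", "I", "k"]))        # e.g. 'go' vs 'pick'
```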
Figure 11 A multi-association: a sound is associated with two different objects, both of which may produce this actual sound. (a) A single sound unit is connected with the two object units by a unit in the associative map that stands at the intersection of the sound and object stripes: this is possible because the unit's global connection strength is not distributed among all connections, but only among active connections (Widrow-Hoff rule). (b) When a unit's global connection strength is distributed among all connections, the sound unit must be duplicated (Hebb rule).
As illustrated in Figure 13, we can clearly see that the topological organization found by the model meets these criteria. Within the word map, words are grouped relative to the body part they represent: body action words are grouped together (stripes), as are hand action words (gray) and head action words (white). However, the phonetic distribution of words remains the most important factor in the phonetic map organization. Each word is represented by a “cluster” of close units, and words whose phonetic representations are close tend to be represented in close clusters of units. For instance, while “Go” and “Show” correspond to different motor actions, their phonetic representations are close, so that their representing clusters are adjacent. This illustrates the fact that the model actually achieves a successful compromise between the local demands, which tend to organize the words phonetically, and the motor demands, which tend to put together the words that correspond to the same action. The joint organization does not destroy the local self-organization, but rather modulates it so that it becomes coherent with the other map organization.
Figure 12 Word-Action Relationship.
The thalamic prototypes (i.e. external inputs) of the motor and the phonetic units are, respectively, coded actions and coded words. However, these do not necessarily correspond to real input words or actions: these prototypes are vectors of float values, not binary ones.
Let us consider three maps: one for word representation, one for action representation and finally an associative one that links words to actions. In our model the higher-level associative map linking the auditory representation with the motor action will use close units to represent words that relate to the same action (head action, for example); see the “look” and “show” positions in Figure 13. As our model deals with an implicit global coherence, it is able to reflect this higher level of association and to overcome the simpler phonetic organization.
Figure 13 Two simulation results of word representation map after coherent learning has
occurred with our model. Word representations are now constrained by the motor map
via the associative map and, as a result, words that correspond to the same action are
grouped together. Nevertheless, phonetic proximity is still kept.
5. HIERARCHICAL MULTIMODAL LANGUAGE
MODELING
The GLIA model operates at a lower level than the other models: it takes three of the action words represented by the second model and grounds the language instructions in actions based on multimodal inputs. A motivation for this study is to allow the student robot to learn from the teacher robot, which performs the three behaviours ‘go’, ‘pick’ and ‘lift’, based on multimodal inputs. Although only three language instructions are used in this initial study, it is our intention in the future to extend the language representation to incorporate the actors, more actions and objects.
Figure 14 The simulated environment for the robot, which is 72 units along the x-axis and 36 along the y-axis. The figure shows the target object of interest at the top, the robot, the 36 units along the x-axis to the front and to the back wall, the 18 units of positive and of negative y either side of the centre, and the 45° region boundaries.
To allow the student robot to learn these behaviours, the teacher robot performs ‘go’,
‘pick’ and ‘lift’ actions one after another in a loop in an environment (Figures 14 and 15)
of 72 units along the x-axis and 36 units along the y-axis. The distance from the nearest
wall to the centre of the area along the x-axis is 36 units. The y-axis of the arena is split
between positive and negative values either side of the central object. The simulation
environment is set to this size as it is the same size as our actual environment for running
experiments on a PeopleBot robot platform, with one unit in the simulator equal to 10cm
in the real environment. This gives the opportunity in the future to use the trained
computational architectures from the simulator as the basis for a real PeopleBot robot
application.
Figure 15 The robot in the simulated environment at coordinates x, y and rotation angle φ, with the target object for the ‘pick’ behaviour and the start position marked. The robot has performed the behaviour for ten time steps and currently turns away from the wall in the learnt ‘pick’ behaviour.
Figure 15 shows the teacher robot in the environment performing the ‘pick’ behaviour at coordinates x, y and angle φ to the nearest wall. The student robot observes the teacher robot performing the behaviours and is trained by receiving multimodal inputs. These multimodal inputs are: (i) high-level visual inputs, which are the x- and y-coordinates and the rotation angle φ of the teacher robot relative to the nearest wall; (ii) the motor directions of the robot (‘forward’, ‘backward’, ‘turn left’ and ‘turn right’); and (iii) a language instruction stating the behaviour the teacher is performing (‘go’, ‘pick’ or ‘lift’). The dashed areas in Figure 14 indicate the boundaries between the regions of the environment associated with a particular wall. Hence, once the robot moves out of a region its coordinates x and y and angle φ are measured relative to the nearest wall. The region associated with the top wall is the shaded area in Figure 15. The regions are made up of isosceles triangles with two internal angles of 45° and two sides of 40 units.
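The wall-relative coordinate frame can be sketched as below. This is only an illustration of the idea of measuring (x, y, φ) relative to the nearest wall of a 72 × 36 arena; the exact region boundaries, sign conventions and the choice of lateral axis used in the report's simulator are not specified here and are assumed.

```python
ARENA_X, ARENA_Y = 72, 36   # arena size in simulator units (10 cm per unit)

def wall_relative_pose(x, y, heading_deg):
    """Return (distance to nearest wall, signed lateral offset, angle phi to that wall).

    Sign conventions and the wall-normal headings are illustrative assumptions.
    """
    walls = {                       # distance of (x, y) to each wall
        "front": ARENA_X - x,
        "back": x,
        "left": y,
        "right": ARENA_Y - y,
    }
    nearest = min(walls, key=walls.get)
    # Assumed heading of the outward normal of each wall, used to express the
    # robot's rotation angle phi relative to the nearest wall.
    wall_heading = {"front": 0.0, "back": 180.0, "left": -90.0, "right": 90.0}
    phi = (heading_deg - wall_heading[nearest] + 180.0) % 360.0 - 180.0
    lateral = y - ARENA_Y / 2 if nearest in ("front", "back") else x - ARENA_X / 2
    return walls[nearest], lateral, phi

print(wall_relative_pose(60.0, 20.0, 10.0))   # robot close to the front wall
```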
In our approach the ‘go’ behaviour involves the teacher robot moving forward in the environment until it reaches a wall and then turning away from it. The distance from the wall at which the robot turns when performing the ‘go’ behaviour is 5 units, to ensure that the robot does not hit the wall. Furthermore, to stop it hitting the side wall when turning, it turns away from that wall. Figure 16 shows an example ‘go’ behaviour where the robot moves towards the front wall and then turns left towards the left wall; once it gets to the left wall it turns right and moves towards the back wall. Although Figure 16 only shows the robot until it moves towards the back wall, it would continue until it is in a position to start the ‘pick’ behaviour.
The second action verb ‘pick’ involves the robot moving toward the target object
depicted in Figure 14 and Figure 15 at the top of the arena and placing itself in a position
to get the object.
Figure 16 An example of the ‘go’ behaviour by the robot, from its start position to its end position.
The docking procedure performed by the teacher that is involved in the ‘pick’ behaviour
is based on a reinforcement learning actor-critic approach of Weber et al. (2003) and
Weber et al. (2004) using the research of Foster et al. (2000). This model uses neural
vision and a reinforcement learning actor-critic approach to move a simulated version of
the PeopleBot robot toward a table so that it can grasp an object. As the robot has a short
non-extendable gripper and wide ‘shoulders’ it must dock in the manner that the real PeopleBot robot does: not hit the table with its shoulders and finish at a position where the gripper is at a 90° angle to the object so that the gripper can grasp it. The
input to the reinforcement neural network is the relative position of the robot to the target
object that is positioned at the border of the table using visual perception (Figure 17). By
using this reinforcement approach to control the behaviour of the teacher robot, the
training of the student robot does not always receive the optimum solution or a correct
solution. However, it fits with the concept of imitation learning by a child from the
parent, as the parent may not provide the optimal way to solve a problem.
In our approach, when the teacher robot is performing the three behaviours in a loop it only moves from the ‘go’ behaviour to the ‘pick’ behaviour if the object is in the field of vision, which is dependent on the angle to the object and the distance d. Using the field of vision associated with the PeopleBot, the value of d is determined using basic trigonometry and must be less than 20 units for the object to be in vision. Figure 18 provides an example of a ‘pick’ behaviour performed by the simulated robot. In this case the robot starts close to the front wall, so it moves backwards so that it can turn and move towards the object. It finishes in a position in front of the object so that the robot can get the object with its grippers.
Figure 17 The figure depicts the table and the target object on it. The robot shown has short black grippers and its field of vision is outlined by the dotted line. Real-world coordinates (x, y, φ) state the position and rotation angle of the robot. The perceived position of the object in the robot’s field of vision is defined by its angle and distance d. (From Weber and Wermter 2004)
Figure 18 A simulated teacher robot performing the ‘pick’ behaviour in the arena, from its start position to its end position.
The final action verb ‘lift’ involves moving backwards to leave the table and then turning around to face toward the middle of the arena (Figure 19). The coordinates x and φ determine how far to move backward and in which direction to turn around. The robot moves backwards until it reaches unit 13 on the x-axis and then turns to a random angle to face the back wall. If the angle that the robot turns to is negative the robot turns to the left; if it is positive it turns to the right.
Figure 19 An example of the ‘lift’ behaviour by the robot, from its start position to its end position.
The ‘go’, ‘pick’ and ‘lift’ behaviours and action verbs were selected as they can be seen as individual behaviours or as the main components in a ‘pick and place’ scenario, where the robot moves around and then picks up and retrieves the object. Furthermore, they provide the situation that the same visual information can produce different or the same motor direction depending on the language instruction. These behaviours fit well with the mirror neuron system principle as they can be seen as goal-related and can either be recognised or performed. Finally, we can combine behaviours produced by reinforcement learning and pre-programming to train architectures that are able to learn these behaviours and so ground language instruction in actions.
The high-level visual inputs which are shared by teacher and student are chosen such that
they could be produced once the multimodal grounding architecture is implemented on a
real PeopleBot robot. The robot will simply be able to use visual processing of the scene
to determine where the teacher robot is in the arena. This shared information will be
gained through processing of the images collected by the student robot and basic
geometry. Hence, this associates simple visual perception with language instruction to
achieve grounding of language instruction in actions.
When receiving the multimodal inputs related to the teacher's actions the student robot is
required to learn these behaviours so that it could perform them from a language
instruction or recognise them. When learning, performing and recognising these
behaviours, ‘forward’ and ‘backward’ movement is at a constant speed of 1 unit per time step and each ‘turn left’ or ‘turn right’ decision turns the robot by 15° per time step. Two neural architectures are considered for this system for grounding language instruction in actions.
5.1 Hierarchical GLIA architectures
The GLIA hierarchical architecture (Figure 20) uses an associator network based on the
Helmholtz machine learning approach [Dayan 2000, Dayan and Hinton 1996, Dayan et
al. 1995, Hinton et al. 1995]. The Helmholtz machine approach is selected as it matches
neuroscience evidence on language processing by using a distributed representation and
unsupervised learning. Furthermore, the Helmholtz learning algorithm is able to
associate different modalities to perform the behaviours and offers the ability to recreate
the missing inputs. This allows the architectures to recreate the language instruction
when performing recognition and the motor directions when instructed to perform a
behaviour.
It was found that the sparse coding of the wake-sleep algorithm of the Helmholtz machine leads to the extraction of independent components in the data, which is not desired since many of these components would not span multiple modalities. Hence the representations of the three modalities would each be positioned at a separate location on the Helmholtz machine hidden layer, and so two different architectural approaches that overcome this are considered: the flat architecture and the hierarchical architecture.
Figure 20 A hierarchical multimodal GLIA architecture, with bottom-up weights W^bu and top-down weights W^td linking the language instruction, high-level vision (coordinate and angle) and motor direction inputs to the hidden layers.
In the hierarchical architecture (Figure 20) the motor and high-level vision inputs are associated in the first hidden layer using the Helmholtz machine learning algorithm, shown on the diagram as the HM area. The activations of the first hidden layer are then associated with the language instruction region input at the second hidden layer based on the self-organising map (SOM) learning algorithm, shown as the SOM area. By using a Kohonen self-organising map such an architecture allows the features produced on the Helmholtz machine hidden layer to relate a specific motor direction with the appropriate high-level visual information for each behaviour, based on the language instruction. Hence this is able to overcome the problem of the Helmholtz machine network extracting independent components that fail to span multiple modalities. The size of the HM area hidden layer is 32 by 32 units and the SOM area layer has 24 by 24 units. These sizes were empirically determined as they offer sufficient space to allow differentiation between the activation patterns representing the different high-level vision, motor and language instruction states.
The number of training behaviour examples is 500,000, to allow time for the flat architecture and the hierarchical architecture to learn the three behaviours. During training the flat and hierarchical architectures receive all the inputs; however, when testing, either the language instruction input or the motor direction input is omitted. The language instruction input is omitted when the student robot architecture is required to take the other inputs, gained from observing the teacher robot, and recognise the behaviour that is performed. When the motor direction input is omitted the student robot is required to perform a behaviour based on a language instruction. The architecture then continuously receives its own current x- and y-coordinates and angle φ and the language instruction of the behaviour to be performed.
Without the motor direction input the grounding architectures have to produce the appropriate motor activations, which they have learnt from observing the teacher, to produce the required behaviour. Recognition is tested by comparing the units which are activated on the language instruction area with the activation pattern belonging to the verbal instruction of the appropriate behaviour at each time step. Behaviour production is tested by comparing the performance of the behaviour at each time step by the student robot with that of the teacher robot. The duration of a single behaviour is dependent on the initial start input conditions and is typically around 25 consecutive time steps before the end condition is met.
5.2 Representation of action verb instruction word form
First we focus on a phoneme scheme for the action verb instruction word forms used by the architectures. Although several machine-readable schemes are available, the one used in our architectures relates to the CELEX database. This CELEX approach uses a feature description of 46 English phonemes, based on the phonemes in the CELEX lexical databases (www.kun.nl/celex/). Table 3 provides the representation of the action verbs in our scenario using this phoneme scheme.
Table 3 The representation of the action verbs in our scenario using CELEX phonemes.

Word | CELEX Phonemes
GO   | g@u
LIFT | l we f t
PICK | p we k
Each phoneme is numerically represented using 20 phonetic features, which produces a
different binary pattern of activation in the language input region for each phoneme. The
diagram representation for the action verbs ‘go’, ‘pick’ and ‘lift’, using a region of 4 rows
by 20 columns, is shown in Figure 21. By using this approach the order in which the
phonemes appear in the word is maintained. Although this approach does maintain this
order, it has the limitation that words that have the same phonemes but at different
locations in the representations may not be seen as similar.
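To make this limitation concrete, the sketch below builds the 4 × 20 language input region for a word from per-phoneme feature vectors (random stand-ins here, not the actual CELEX features) and shows that two words sharing the same phonemes at different positions activate different rows of the region.

```python
import numpy as np

rng = np.random.default_rng(2)
# Random binary stand-ins for the 20 phonetic features of each phoneme (illustrative only).
FEATURES = {p: rng.integers(0, 2, size=20) for p in ["g", "@", "U", "p", "I", "k", "_"]}
FEATURES["_"] = np.zeros(20, dtype=int)        # empty phoneme used for padding

def language_region(phonemes, rows=4):
    """4 x 20 binary input region: one row per phoneme, padded with the empty phoneme."""
    padded = list(phonemes) + ["_"] * (rows - len(phonemes))
    return np.stack([FEATURES[p] for p in padded])

go = language_region(["g", "@", "U"])
og = language_region(["@", "U", "g"])          # same phonemes, shifted by one position
rows_equal = [bool(np.array_equal(go[r], og[r])) for r in range(4)]
print(go.shape, rows_equal)   # (4, 20) [False, False, False, True]: order kept, similarity lost
```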
Figure 21 Distributed phoneme representations for ‘go’ (‘g @ U’), ‘pick’ (‘p we k’) and ‘lift’ (‘l we f t’) based on the 20 features associated with each phoneme.
5.3 Visual and motor direction representations
In this section of the report we describe the representations used for the high-level vision and motor direction inputs. The high-level visual inputs, the coordinates x and y and the angle φ of the robot in the environment (see Figure 15), are represented by two arrays of 37 units and one array of 25 units, respectively. When the robot is close to the nearest wall, the x-position is a Gaussian hill of activation centred near the first unit, while for a robot position near the middle of the arena the Gaussian is centred near the last unit of the first column of 37 units. The next column of 37 units represents the y-coordinate, so that a Gaussian positioned near the middle unit indicates that the robot is in the centre of the environment along the y-axis.
In Figure 22 the y-coordinate column ranges between -18 and +18 to equate to the location sensor readings produced by the PeopleBot robot’s internal odometer. As the robot environment is 36 units on the x-axis from the centre of the arena to the nearest wall and 36 units on the y-axis, these input columns have a one-to-one relationship with the units in the environment (with coordinate 0 represented by a unit). Rotation angles φ from -180° to 180° are represented along 25 units, with the Gaussian centred on the centre unit if φ = 0°, on the first unit for φ = -180° and on the last unit for φ = 180°. The φ column is 25 units in size as the robot turns 15° at each time step, to fit in with the actual movement of our PeopleBot robot.
Figure 22 An example input for the coordinates x and y and angle φ. The x- and y-columns have 37 units (y labelled from -18 to +18) and the φ column has 25 units (labelled from -180° to 180°).
The Gaussian activation levels depend on the equation T_kk' = exp(-0.5 (d_kk' / σ)²), where d_kk' is the distance between the current unit being considered and the centre of the Gaussian. The value of σ is set to 3 for the x- and y-coordinate columns and to 1.5 for the φ column. These values for σ are based on past experience. Figure 22 shows the activations for the x- and y-columns and the φ column; in the example the centre of the Gaussian hill of activation for the x-coordinate is at position 21, for the y-coordinate it is at +7, and for φ the centre of the Gaussian hill of activation is at 30°.
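A short sketch of this population coding follows; it builds the three Gaussian hills with the σ values above. The exact mapping from world coordinates to unit indices is an assumption made for the example.

```python
import numpy as np

def gaussian_hill(n_units, centre, sigma):
    """Population code: activation T_k = exp(-0.5 * ((k - centre) / sigma)**2)."""
    k = np.arange(n_units)
    return np.exp(-0.5 * ((k - centre) / sigma) ** 2)

def encode_pose(x_units, y_units, phi_deg):
    """High-level visual input: x and y over 37 units each, phi over 25 units.

    Assumed index mapping: x_units in [0, 36], y_units in [-18, 18],
    phi_deg in [-180, 180] mapped linearly onto the 25 phi units (15 deg per unit).
    """
    x_code = gaussian_hill(37, x_units, sigma=3.0)
    y_code = gaussian_hill(37, y_units + 18, sigma=3.0)
    phi_code = gaussian_hill(25, (phi_deg + 180.0) / 15.0, sigma=1.5)
    return np.concatenate([x_code, y_code, phi_code])   # 99 input values in [0, 1]

vision_input = encode_pose(x_units=21, y_units=7, phi_deg=30)   # the example in the text
```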
As the final part of the multimodal input the teacher robot’s motor directions are
presented on the 4 motor units (‘forward’, ‘backward’, ‘turn right’ and ‘turn left’) one for
each of the possible actions with only one active at a time. The activation values in all
three input areas are between 0 and 1. Figure 23 provides a representation of the motor
direction input when ‘left’ is activated.
Figure 23 Example motor direction input for the model, with one unit each for ‘forward’, ‘backward’, ‘turn right’ and ‘turn left’.
5.4 Training algorithm
As the hierarchical architecture uses the Helmholtz machine learning algorithm [Dayan 2000, Dayan and Hinton 1996, Dayan et al. 1995, Hinton et al. 1995] and the SOM learning algorithm [Kohonen 1997], internal representations of the input are produced by unsupervised learning. The bottom-up weights W^bu (dark green in Figure 20) produce a hidden representation r of some input data z. The top-down weights W^td (light green in Figure 20) are used to reconstruct an approximation z̃ of the data based on the hidden representation.
The Helmholtz machine learning for the flat architecture and for the HM area of the hierarchical model consists of alternating wake and sleep phases to train the top-down and bottom-up weights, respectively. In the wake phase, a full data point z is presented, which consists of a motor direction, the high-level visual values and, for the flat multimodal architecture, the language instruction. The linear hidden representation r^l = W^bu z is computed first. In the flat architecture a competitive version r^c is obtained by taking the winning unit which has the highest activation in r^l and assigning activation values under a Gaussian envelope to the units around the winner. Thus, r^c is effectively a smoothed localist code. On the HM area of the hierarchical architecture, the linear activation is converted into a sparse representation r^s using the transfer function r^s_j = e^(β x_j) / (e^(β x_j) + n) with x_j = r^l_j, where β = 2 controls the slope and n = 64 the sparseness of firing. These values of β and n were selected based on empirical studies and their ability to produce the most stable learning and hidden representation. The reconstruction of the data is obtained by z̃ = W^td r^(c/s), and the top-down weights from units j to units i are modified according to

Δw^td_ij = ε r^(c/s)_j (z_i − z̃_i)      (1)

with a learning rate ε = 0.001.
The learning rate is increased 5-fold whenever the active motor unit of the teacher changes. This is critical during the ‘go’ behaviour when the robot turns for a while in front of a wall until it takes its first step forward. Without emphasising the ‘forward’ step, the student would learn only the ‘turn’ command which dominates this situation. The ability to detect novel behaviour is useful for learning as it can reduce the amount of information processed and so focus on unusual events [Marsland 2003]. Increasing the learning rate when a student robot recognises significant events, such as a change in direction, and remembering where this occurs is used by Hayes and Demiris (1994). There is also neuroscience evidence to support this approach of increasing the learning rate for novel behaviours, as the cerebral cortex has regions that detect novel or significant behaviour to aid learning [Opitz et al. 1999]. In the wake phase the inputs are scaled to different relative levels in order to identify the scaling that gives the optimum performance.
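The five-fold increase of the wake-phase learning rate on a change of the teacher's active motor unit, described at the start of this paragraph, could be implemented along the following lines; the function name and the way the previous motor vector is stored are illustrative assumptions.

import numpy as np

def wake_learning_rate(motor_now, motor_prev, base_eps=0.001, boost=5.0):
    # Boost the learning rate 5-fold whenever the teacher's active motor unit changes.
    if np.argmax(motor_now) != np.argmax(motor_prev):
        return base_eps * boost
    return base_eps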
In the sleep phase, a random hidden code $\vec{r}^{\,r}$ is produced by assigning activation values under a Gaussian envelope centred on a random position on the hidden layer. Its linear input representation $\vec{z}^{\,r} = W^{td}\vec{r}^{\,r}$ is determined, and then the reconstructed hidden representation $\tilde{\vec{r}}^{\,r} = W^{bu}\vec{z}^{\,r}$. From this, in the flat architecture we obtain the competitive version $\tilde{\vec{r}}^{\,c}$ by assigning activation values under a Gaussian envelope centred on the winning unit with the strongest activation. In the HM area of the hierarchical model, a sparse version $\tilde{\vec{r}}^{\,s}$ is obtained by applying the above transfer function and parameters to the linear representation. The bottom-up weights from input units $i$ to hidden units $j$ are modified according to

$\Delta w_{ji}^{bu} = \epsilon' \,(-\,w_{ji}^{bu} + z_i^{r})\; \tilde{r}_j^{\,c/s}$    (2)
with an empirically determined learning rate $\epsilon' = 0.01$. The learning rates $\epsilon$ and $\epsilon'$ are decreased linearly to zero during the last 25% of the 500000 training data points in order to limit noise. All weights $W^{td}$ and $W^{bu}$ are rectified to be non-negative at every learning step. In the flat architecture the bottom-up weight vector of each hidden unit is normalised to unit length to reduce the impact of their differing sizes. The weights are updated after every 100 training points. In the HM area of the hierarchical model, to ensure that the weights do not grow too large, a weight decay of $-0.015\,w_{ij}^{td}$ is added to Equation (1) and $-0.015\,w_{ji}^{bu}$ to Equation (2).
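The sleep-phase update, including the non-negativity rectification and the HM-area weight decay, could look as follows in numpy. The sign convention follows the reconstruction of Equation (2) above, and the layer sizes, envelope width and random-number usage are placeholders.

import numpy as np

def sparse(r_lin, beta=2.0, n=64.0):
    # Same sparse transfer function as in the wake phase.
    e = np.exp(beta * r_lin)
    return e / (e + n)

def random_hidden_code(n_hidden, sigma=2.0):
    # Gaussian envelope centred on a random position of the hidden layer.
    centre = np.random.randint(n_hidden)
    units = np.arange(n_hidden)
    return np.exp(-0.5 * ((units - centre) / sigma) ** 2)

def sleep_step(W_bu, W_td, eps2=0.01, decay=0.015, hm_area=True):
    # One sleep-phase update of the bottom-up weights, Equation (2).
    r_rand = random_hidden_code(W_bu.shape[0])
    z_rand = W_td @ r_rand                                   # generated input pattern
    r_rec = sparse(W_bu @ z_rand)                            # reconstructed hidden representation
    dW = eps2 * r_rec[:, None] * (z_rand[None, :] - W_bu)    # dw_ji = eps' (z_i^r - w_ji) r~_j
    if hm_area:
        dW -= decay * W_bu                                   # additional weight decay for the HM area
    W_bu += dW
    np.clip(W_bu, 0.0, None, out=W_bu)                       # rectify weights to be non-negative
    return W_bu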
The SOM area of the hierarchical architecture is trained using the Kohonen approach; only the bottom-up weights shown in Figure 19 are trained. Training of the SOM area weights only starts after the HM area has been trained. The SOM learning rate was set to 0.01. The neighbourhood function is a Gaussian, $T_{kk'} = e^{-0.5\,(d_{kk'}/\sigma)^2}$, where $d_{kk'}$ is the distance between unit $k$ and the winning unit $k'$ on the SOM grid. A large neighbourhood of $\sigma = 12$ is used at first to obtain broad topographical learning; it is reduced to $\sigma = 0.1$ during training over 450000 data points. Finer training is then achieved with a smaller neighbourhood width by reducing $\sigma$ from 0.1 to 0.01 over a further 50000 data points.
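A sketch of the SOM-area training described above is given below. The Kohonen update and the Gaussian neighbourhood follow the text; taking the winner as the most active unit and annealing the neighbourhood width linearly are assumptions made for illustration.

import numpy as np

def som_step(W_som, r_hm, grid, eta=0.01, sigma=12.0):
    # One Kohonen update of the SOM bottom-up weights on an HM-area pattern.
    # grid holds the 2-D coordinates of the SOM units, one row per unit.
    winner = np.argmax(W_som @ r_hm)                     # most active SOM unit
    d = np.linalg.norm(grid - grid[winner], axis=1)      # grid distances d_kk'
    T = np.exp(-0.5 * (d / sigma) ** 2)                  # Gaussian neighbourhood T_kk'
    W_som += eta * T[:, None] * (r_hm[None, :] - W_som)  # move weights toward the input
    return W_som

def sigma_schedule(t):
    # Anneal the neighbourhood width: 12 -> 0.1 over 450000 points, then 0.1 -> 0.01.
    if t < 450000:
        return 12.0 + (0.1 - 12.0) * t / 450000.0
    return 0.1 + (0.01 - 0.1) * min(t - 450000, 50000) / 50000.0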
For the flat architecture in the wake phase, although the inputs have different sizes, the best performance was found when no scaling is applied, i.e. the inputs are presented unchanged. For the hierarchical architecture, however, the Helmholtz machine in the wake phase performs better when the motor input is scaled by 4 and the high-level vision input by 2 to increase their impact on the weight vectors and activation patterns. The motor direction input is scaled more strongly than the high-level vision input because each motor action is represented by a single unit, whereas the high-level vision uses Gaussian hills of activation to represent the movement of the robot in the arena.
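A small sketch of how the wake-phase input of the hierarchical HM area could be assembled with these scaling factors; the ordering of the concatenated parts is an assumption.

import numpy as np

def hm_data_point(motor, vision, motor_gain=4.0, vision_gain=2.0):
    # Hierarchical wake-phase input: motor scaled by 4, high-level vision by 2.
    # (For the flat architecture both gains would be 1 and the language
    # instruction would also be appended.)
    return np.concatenate([motor_gain * np.asarray(motor, dtype=float),
                           vision_gain * np.asarray(vision, dtype=float)])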
5.5 Hierarchical GLIA architecture results
First, we trained an HM area to perform a single behaviour, ‘pick’, without the use of a higher-level SOM area. The robot thereby self-imitates a behaviour it has previously learnt by reinforcement (Weber et al. 2004). Example videos of its movements can be seen on-line at: www.his.sunderland.ac.uk/supplements/AI04/.
Figure 24a) shows the total incoming innervation originating from the motor units (left)
and the high-level vision units (right) on the HM area. The figure has been obtained by
activating all four motor units or all high-level vision units, respectively, with activation 1
and by displaying the resulting activation pattern on the HM area.
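For illustration, such projection maps could be computed as below by driving the selected input units with activation 1 and reading the resulting linear innervation of the HM area; the assumption that the motor units occupy the first four input positions is made only for this sketch.

import numpy as np

def total_innervation(W_bu, unit_indices):
    # Activate the listed input units with value 1 and return the resulting
    # total incoming innervation on the HM area (a sum of weight columns).
    z = np.zeros(W_bu.shape[1])
    z[list(unit_indices)] = 1.0
    return W_bu @ z

# e.g. total motor innervation, assuming the first four input units are the motor units:
# motor_map = total_innervation(W_bu, range(4))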
Figure 24 a) Left, the projections of the four motor units onto the HM area. Right, the projections of all high-level vision inputs onto the HM area. b) Four neighbouring SOM units' RFs in the HM area. These selected units are active during the ‘go’ behaviour. Circles indicate that the leftmost units' RFs overlap with those of the ‘left’ motor unit while the rightmost unit's RF overlaps with the RF of the ‘forward’ motor unit.
It can be seen that the patches of motor innervation avoid areas of high-density sensory innervation, and vice versa. This is due to competition between the incoming innervations. It does not mean that motor activation is independent of sensory activation: Figure 24b) shows the innervation of SOM area units on the HM area, which binds regions specialised on motor and sensory input. The leftmost of the four units displayed binds the ‘left’ motor action with some sensory input, while the rightmost binds the ‘forward’ motor action with partially different sensory input. In the cortex we would expect such binding to occur not only via another cortical area (such as the SOM area in our model) but also via horizontal lateral intra-area connections, which we do not model.
The activation patterns of the student robot during recognition of the 10-step action sequence of a ‘go’ behaviour shown in Figure 26 are given in Figure 25, and the activations during the student robot's performance of the ‘go’ behaviour sequence shown in Figure 28 are given in Figure 27. In the examples depicted in Figure 25 to Figure 28 the teacher and student robots are initialised at the same position. When the student is performing the ‘go’ behaviour there is a difference between the two activation patterns of the HM area when it receives vision alone (Figure 27a) and when the HM area receives activation from the SOM area (Figure 27c). The difference arises from the lack of motor direction input in Figure 27a and from the completion of the pattern in Figure 27c to include the motor-direction-induced activation that would occur during full observation. The activation patterns on the HM and SOM areas are very similar between recognition and performance, which indicates that the units display mirror neuron properties. The main difference between performance and recognition can be seen by comparing the activation patterns of the HM area based on high-level vision and motor direction during recognition (Figure 25a) with, during performance, the activation patterns of the HM area based on vision alone (Figure 27a) and the reconstruction of the activation on the HM area that produces the motor action from the SOM area activation (Figure 27c).
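The performance pass described here, in which the vision-driven HM pattern is completed via the SOM area and a motor command is read out, could be sketched as follows. The ordering of motor and vision units in the input vector, the use of the winning SOM unit's receptive field as the completed HM pattern, and the omission of the language bias on the SOM winner are simplifying assumptions.

import numpy as np

def perform_step(vision, W_bu, W_td, W_som, n_motor=4, beta=2.0, n=64.0):
    # Performance pass corresponding to the panels of Figure 27:
    # (a) HM activation from vision alone, (b) SOM winner, (c) top-down
    # reconstruction of the HM pattern, (d) motor command read out from it.
    z = np.concatenate([np.zeros(n_motor), vision])   # motor input is absent
    e = np.exp(beta * (W_bu @ z))
    r_hm = e / (e + n)                                # (a) sparse HM representation
    winner = np.argmax(W_som @ r_hm)                  # (b) winning SOM unit
    r_hm_rec = W_som[winner]                          # (c) winner's RF as the completed HM pattern
    z_rec = W_td @ r_hm_rec                           # project back to the input areas
    return int(np.argmax(z_rec[:n_motor]))            # (d) index of the motor unit to execute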
(a) HM area activation based on high-level vision and motor input; (b) SOM area winner-based classification based only on the HM area input; (c) language classification by the SOM winning unit: ‘go’ for the first nine time steps and ‘pick’ for the last.
Figure 25 Activation sequences during observation of a ‘go’ behaviour, without language
instruction input. Strong activations are depicted dark, and shown at ten time steps from
left to right. Circles mark the bottom-up input of the active motor unit of the teacher
which changes from ‘forward’ in the first 6 steps to ‘turn left’ during the last 4 steps.
Language instruction recognition is correct except for the last time step which is
classified as ‘pick’.
Figure 26 The teacher robot performing the 10 step section of the ‘go’ behaviour related to the activation sequences during observation of the ‘go’ behaviour from Figure 25.
The differences in the HM area unit activation patterns during recognition and performance depend on the effect of the motor inputs. If, during training, the input differs between behaviours only in the motor direction input while the high-level vision input is the same, then the difference in the activation patterns must be large enough to activate a different SOM unit so that the SOM area can distinguish between the behaviours. During performance, however, the absence of the motor input should not have too strong an effect on the HM area representation, because otherwise the winner in the SOM area could differ greatly from the case when all inputs are present and the performance would become unpredictable.
(a) HM area activation based only on high-level vision input; (b) SOM area winner-based classification based on HM area and language input; (c) reconstruction of HM area activation by the SOM winning unit; (d) reconstruction of the motor command from HM area activations.
Figure 27 Activation sequences during performance by the student robot of a ‘go’ behaviour, i.e. without motor input. The performed sequence is visualised in Figure 28. Circles mark the region at each time step which has the decisive influence on the action being performed.
Figure 25c shows the activations of the language instruction area resulting from the top-down influence of the winning SOM area unit during recognition. An error is made at the last time step, despite the HM activation being hardly distinguishable from that of the second-last time step, owing to a change in the SOM area winning unit (Figure 25b). However, it is difficult to count this as a genuine recognition error, since some sections of behaviour are not specific to a single behaviour, the high-level vision and motor inputs being the same. For instance, during ‘go’ and ‘pick’ a forward movement toward the front wall occurs in large areas of the arena at certain orientation angles φ, and a turn movement near the wall toward the centre might also result from either behaviour.
Figure 28 The student robot performing the 10 step section of the ‘go’ behaviour without motor input, related to the activation sequences in Figure 27.
5.6 Discussion of hierarchical architecture
The hierarchical architecture recreates concepts of the regional cerebral cortex
modularity, word representation and the mirror neuron system principles. Considering
regional modularity and the neurocognitive evidence on word representation this
architecture make use of semantic features of the action verbs from the vision and motor
action inputs. There is also the inclusion the regional modularity, and word and action
verb representation principles as this architecture includes distributed representations.
For the hierarchical architecture the distributed representation comes from multiple units
being activated in the Helmholtz hidden area for specific action-perception association
and different units of the SOM area being activated when performing the binding
operation. Furthermore the hierarchical architecture form of distributed representation is
closer to the cerebral cortex as it has active units that are analogous to the word web
assemblies that represent words as described by Pulvermüller (1999) and (2003).
Moreover, the hierarchical model can be considered to be close to the regional modularity
of the cerebral cortex as it offers modules that perform independently on specific
operations in a hierarchical manner.
The ability of the hierarchical architecture to both recognise and perform behaviours shows that it recreates some concepts of the mirror neuron system principle. The GLIA hierarchical architecture achieves a common ‘action understanding’ between the teacher and the student of the behaviour's meaning through the language instruction, which recreates the idea that the mirror neuron system plays a role in the emergence of the human language system.
With regard to the hierarchical architecture it is possible to relate the HM area of the model to area F5 of the primate cortex and the SOM area to area F6. The F5 area contains neurons that can produce a number of different grasping activities. The F6 area performs as a switch to enable or restrict the effects of F5 unit activations so that only the required units in the F5 region are activated to perform or recognise the required action [Arbib et al. 2001]. The F6 area also produces the signal indicating when the action should be performed, thereby activating the appropriate neurons in F5, and is hence responsible for the higher-level control [Seitz et al. 2000]. The F6 area is responsible for controlling the high-level sequence for performing the behaviour [Arbib et al. 2001]. In our hierarchical GLIA architecture the HM area is directly linked to the motor output and contains identifiable groups of neurons that activate specific motor units, while the SOM area represents the channel through which a language instruction must pass in order to reach the motor-related HM units. Hence, the SOM area uses the language instruction to select the appropriate motor actions and to suppress the others. The hierarchical model thus suggests analogies to the organisation of the motor cortical areas F5 and F6, with area F5 containing the mirror neurons and area F6 determining which mirror neurons should be activated.
6. CONCLUSION
In this report we have described three language models that can be combined to achieve
neural multimodal language architecture. The first component of the architecture is able
to process all words in the MirrorBot grammar using spiking associative memory and is
made up of various areas for processing the language. As shown in this report this
language cortical model is able to check the grammar and semantic correctness and make
the language available for other MirrorBot models for performing behaviours such as for
tracking a specific object.
The second model, the self-organising cortical map language representation model, is able to represent action verbs based on the part of the body that performs the action by combining a phoneme representation and a perceptual representation of the word, taking the action verb from the cortical language model. The final model is able to take example action verbs represented by the second model and to ground these verbs in actions at the motor level based on mirror neurons. When developing the GLIA model we have placed emphasis on modelling the mirror neuron system and multimodal language processing in a more neurally realistic manner. Fundamentally, the model is able to bring together inspiration from the two neuroscience partners by representing action verbs based on the part of the body that performs the action and by basing the model on the mirror neuron system.
7. REFERENCES
Arbib, M.A., Fagg, A.H., Grafton, S.T. (2001) Synthetic PET imaging for grasping: From primate neurophysiology to human behavior. In: Explorative Analysis and Data Modelling in Functional Neuroimaging, Sommer, F., Wichert, A., Eds., Cambridge, MA, USA, MIT Press, pp. 231-250.
Arbib, M. (2005) From monkey-like action recognition to human language: An evolutionary framework for neurolinguistics. Behavioral and Brain Sciences, in press.
Dayan, P. (2000) Helmholtz machines and wake-sleep learning. In: The Handbook of Brain Theory and Neural Networks, Arbib, M., Ed., Cambridge, MA, USA, MIT Press.
Dayan, P., Hinton, G. (1996) Varieties of Helmholtz machine. Neural Networks, 9(8), pp. 1385-1403.
Dayan, P., Hinton, G. E., Neal, R., Zemel, R. S. (1995) The Helmholtz machine. Neural
Computation, 7, pp. 1022-1037.
Fay, R., Kaufmann, U., Knoblauch, A., Markert, H., Palm, G. (2004) Integrating Object Recognition, Visual Attention, Language and Action Processing on a Robot in a Neurobiologically Plausible Associative Architecture. In: Proceedings of the NeuroBotics Workshop, 27th German Conference on Artificial Intelligence (KI-2004), Ulm, Germany.
Foster, D.J., Morris, R.G.M., Dayan, P. (2000) A model of hippocampally dependent
navigation, using the temporal difference learning rule. Hippocampus, 10, pp. 1-16.
Hayes, G., Demiris, Y. (1994) A robot controller using learning by imitation. In: Proceedings of the 2nd International Symposium on Intelligent Robotic Systems, Grenoble, France.
Hinton, G.E., Dayan, P., Frey, B.J., Neal, R. (1995) The wake-sleep algorithm for unsupervised neural networks. Science, 268, pp. 1158-1161.
Knoblauch, A., Markert, H., Palm, G. (2005a) An Associative Model of Cortical
Language and Action Processing. In: Modeling Language, Cognition and Action
(Cangelosi, A., Bugmann, G., Borisyuk, R., eds.), Proceedings of the Ninth Neural
Computation and Psychology Workshop, Plymouth 2004. 79-83, World Scientific.
Knoblauch, A., Markert, H., Palm, G. (2005b) An associative cortical model of language understanding and action planning. In: Artificial Intelligence and Knowledge Engineering Applications: A Bioinspired Approach (Mira, J., Alvarez, J.R., eds.), Proceedings of IWINAC 2005, Las Palmas de Gran Canaria, Spain. LNCS 3562, pp. 405-414, Springer, Berlin, Heidelberg.
Markert, H., Knoblauch, A., Palm, G. (2005) Detecting sequences and understanding
language with neural associative memories and cell assemblies. In: Biomimetic
Neural Learning for Intelligent Robots (Wermter, S., Palm, G., Elshaw, M., eds.).
LNAI 3575, 107-117, Springer, Berlin, Heidelberg.
Marsland, S. (2003) Novelty detection in learning systems. Neural Computing Surveys,
3, pp. 157-195.
Opitz, B., Mecklinger, A., Friederici, A.D., von Cramon, D.Y. (1999) The functional
neuroanatomy of novelty processing: Integrating ERP and fMRI results. Cerebral
Cortex, 9(4), pp. 379-391.
Pulvermüller, F. (1999) Words in the brain’s language. Behavioral and Brain Sciences,
22(2), pp. 253-336.
Pulvermüller, F. (2003) The neuroscience of language: On brain circuits of words and serial order. Cambridge, UK, Cambridge University Press.
Rizzolatti, G., Arbib, M. (1998) Language within our grasp. Trends in Neurosciences, 21(5), pp. 188-194.
Seitz, R., Stephan, M., Binkofski, F. (2000) Control of action as mediated by the human
frontal lobe. Experimental Brain Research, 133, pp. 71-80.
Weber, C., Wermter, S., Zochios, A. (2003) Robot docking with neural vision and
reinforcement. Proceedings of the Twenty-third SGAI International Conference on
Innovative Techniques and Applications of Artificial Intelligence, Cambridge, UK.
Weber, C., Wermter, S., Zochios, A. (2004) Robot docking with neural vision and
reinforcement. Knowledge-Based Systems, 17(2-4), pp. 165-172.
Wermter, S., Palm, G., Elshaw, M. (Eds.) (2005) Biomimetic Neural Learning for Intelligent Robots. LNAI 3575, Springer, Berlin, Heidelberg.