MirrorBot
IST-2001-35282
Biomimetic multimodal learning in a mirror
neuron-based robot
Results from the Language Experiments and Simulations with
MirrorBot (Workpackage 15.3)
Frédéric Alexandre, Stefan Wermter, Günther Palm, Olivier Ménard Andreas Knoblauch,
Cornelius Weber, Hervé Frezza-Buet Uli Kaufmann, David Muse, and Mark Elshaw
Covering period
MirrorBot Report 20
Report Version: 0
Report Preparation Date: 8 May 2017
Classification: Draft
Contract Start Date: 1st June 2002
Duration: Three Years
Project Co-ordinator: Professor Stefan Wermter
Partners: University of Sunderland, Institut National de Recherche en Informatique et en
Automatique, Universität Ulm, Medical Research Council, Universita degli Studi di
Parma
Project funded by the European Community under the
“Information Society Technologies Programme”
0 TABLE OF CONTENTS
1 INTRODUCTION
2 LANGUAGE INPUT
3. MODEL OF CORTICAL LANGUAGE MODEL
3.1 Model Simulation – Grammatical Sentence
3.2 Model Simulation – Acceptable and Unacceptable Sentences
3.3 Disambiguation using the model
4. SELF-ORGANISING CORTICAL MAP LANGUAGE REPRESENTATION MODEL
5. HIERARCHICAL MULTIMODAL LANGUAGE MODELING
5.1 Hierarchical GLIA architectures
5.2 Representation of action verb instruction word form
5.3 Visual and motor direction representations
5.4 Training algorithm
5.5 Hierarchical GLIA architecture results
5.6 Discussion of hierarchical architecture
6. CONCLUSION
7. REFERENCES
1 INTRODUCTION
In this report we present the results from experiments with language models for three related language components. The first model acts as a front-end component to the other two; the second model recreates the neuroscience findings of our Cambridge partners on how action verbs are represented by body parts; and the third model, based on the mirror neuron system, acts as a grounding of language instruction in actions (GLIA) architecture. Although these three models can be used separately, together they provide a successful approach to modelling language processing. The first model, the cortical language model, takes in language either spoken or entered by pushing buttons and represents this input using different cortical regions. The cortical language model is able to act as a grammar checker for the input as well as to determine whether the input is semantically possible. This model consists of 15 areas, each containing a spiking associative memory of 400 neurons.
The second language model, the self-organising cortical map language representation model, uses multi-modal information processing inspired by cortical maps. We illustrate a phonetic-motor association that shows that the organisation of words can integrate motor constraints, as observed by our Cambridge partners. This model takes the action verbs from the first model. The main computational block of the model is a set of computational units called a map. A map is a sheet made of a tiling of identical units. This sheet has been implemented as a disk, for architectural reasons described further below. When input information is given to the map, each unit shows a level of activity depending on the similarity between the information it receives and the information it specifically detects.
The final model, the grounding of language instruction in actions (GLIA) model, takes three example action verbs represented in the second model and grounds them in actual actions. In doing so the GLIA model learns to perform and recognise three complex behaviours, ‘go’, ‘pick’ and ‘lift’, and to associate these with their action verbs. This hierarchical model has two layers. In the lower hidden layer the Helmholtz machine wake-sleep algorithm is used to learn the relationship between action and vision; the upper layer uses the Kohonen self-organising approach to combine the output of the lower hidden layer with the language input. These models are able to recreate the findings on the mirror neuron system in that the activations in the hidden layers are similar during both performing and recognising. In the hierarchical model the rather separate sensory and motor representations on the lower level are bound to corresponding sensory-motor pairings via the top level, which organises according to the language input. We suggest analogies to the organisation of motor cortical areas F5 and F6 and the mirror neurons therein. Figure 1 shows the overall structure of the language model, with the cortex model passing the action verbs into the self-organising cortical map language representation model, which produces a topological representation of the action verbs using the inspiration of the Cambridge partner, and the final model grounding example language instructions from the second language model in actions.
Figure 1 The overall structure of the language model architecture. The language input feeds Model 1, the cortical language model; the action verbs, together with perceptual information on them, pass to Model 2, the self-organising cortical map language representation model, which produces a topological representation of the action verbs based on body parts; multimodal inputs for three action verbs from those represented in the level 2 model feed Model 3, the mirror neuron system based language model (the GLIA model).
2 LANGUAGE INPUT
It is possible to provide speech and button-pressing input to the MirrorBot. The program for input via button pressing is shown in Figure 2. The speech input uses the CSLU speech recognition and production software, which runs under Windows and has been tested and found to be robust for this task. CSLU is run on a laptop computer on the robot. This was achieved by having the robot act as a server and having the laptop running the speech recognition and production software request a connection to the robot. Figure 3 illustrates the server-client connection via a socket between the robot and the laptop running the CSLU toolkit interface. The recognised input is then introduced into the language model in the appropriate form, as described below.
Figure 2 The push button program for inputting an input sentence.
Figure 3 Communication between the robot and laptop.
3. MODEL OF CORTICAL LANGUAGE MODEL
The first model, the cortical language model, is described in Fay et al. (2004), Knoblauch
et al. (2005a, b), and Markert et al. (2005). It consists of 15 areas. Each of the areas is
modelled as a spiking associative memory of 400 neurons. In each area we defined a
priori a set of binary patterns (i.e., subsets of the neurons) constituting the neural
assemblies. In addition to the local synaptic connections we also modelled extensive
inter-areal hetero- associative connections between the cortical areas (see Figure 4). The
model can roughly be divided into three parts. (1) Auditory cortical areas A1,A2, and
A3: First auditory input is represented in area A1 by primary linguistic features (such as
phonemes), and subsequently classified with respect to function (area A2) and content
(area A3). (2) Language specific areas A4, A5-S, A5-O1-a, A5-O1, A5-O2-a, and A5O2: Area A4 contains information about previously learned sentence structures, for
example that a sentence starts with the subject followed by a predicate. The other areas
contain representations of the different sentence constituents (such as subject, predicate,
or object). (3) Activation fields af-A4, af-A5-S, af-A5-O1, and af-A5-O2: The activation
fields are relatively primitive areas that are connected to the corresponding grammar
areas. They serve to activate or deactivate the grammar areas in a rather unspecific way.
Figure 4 Cortical areas and inter-areal connections. Each of the 15 areas (boxes) was implemented as a spiking associative memory, where patterns (or assemblies) are stored auto-associatively in the local synaptic connections. Each black arrow corresponds to a hetero-associative inter-areal synaptic connection. Blue feedback arrows indicate short-term memory mechanisms.
Each area consists of spiking neurons that are connected by local synaptic feedback.
Previously learned word features, whole words, or sentence structures are represented by
neural assemblies (subgroups of neurons, i.e., binary pattern vectors), and laid down as
long-term memory in the local synaptic connectivity according to Hebbian coincidence
learning (Hebb 1949).
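The storage and retrieval scheme within one area can be pictured as a binary Willshaw/Palm-style associative memory. The following Python sketch is illustrative only (it is not the project code, and the assembly size and retrieval threshold are assumptions); it stores sparse binary assemblies with a Hebbian, clipped outer-product rule and retrieves them by thresholded matrix-vector products.

```python
import numpy as np

class BinaryAssociativeMemory:
    """Minimal sketch of a Willshaw/Palm-style binary associative memory."""

    def __init__(self, n_neurons=400):
        self.n = n_neurons
        self.W = np.zeros((n_neurons, n_neurons), dtype=np.uint8)  # local synapses

    def store(self, pattern):
        """Hebbian coincidence learning: clip the outer product into the weights."""
        p = pattern.astype(np.uint8)
        self.W |= np.outer(p, p)

    def retrieve(self, cue, k_active=20):
        """One associative step: dendritic sums, then keep the k most excited units."""
        sums = self.W.astype(int) @ cue.astype(int)
        threshold = np.sort(sums)[-k_active]
        return (sums >= threshold).astype(np.uint8)

# Hypothetical usage: store a sparse random assembly and complete it from a partial cue.
rng = np.random.default_rng(0)
mem = BinaryAssociativeMemory(400)
assembly = np.zeros(400, dtype=np.uint8)
assembly[rng.choice(400, size=20, replace=False)] = 1
mem.store(assembly)
cue = assembly.copy()
cue[np.flatnonzero(cue)[:10]] = 0                     # damage half of the assembly
print(np.array_equal(mem.retrieve(cue), assembly))    # pattern completion succeeds
```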
When processing a sentence, the most important information stream flows from the primary auditory area A1 via A3 to the grammatical role areas A5-S, A5-P, A5-O1, A5-O1-a, A5-O2, and A5-O2-a. From there other parts of the global model can use the
grammatical information to perform, for example, action planning. The routing of
information from area A3 to the grammatical role areas is guided by the grammatical
sequence area A4 which receives input mainly via the primary auditory areas A1 and A2.
This guiding mechanism of area A4 works by activating the respective grammatical role
areas appropriately via the activation fields af-A5-S, af-A5-P, af-A5-O1, and af-A5-O2.
In the following we will further illustrate the interplay of the different areas when
processing a sentence.
3.1 Model Simulation – Grammatical Sentence
The sentence “Bot put plum green apple” is processed in 36 simulation steps, where in each step an associative update of the activity state in the areas is performed. We observed that processing of a grammatically correct sentence is accomplished only if some temporal constraints are matched by the auditory input. This means that the representation of a word in the primary areas must be active for at least a minimal number of steps. It turned out that a word should be active for at least 4-5 simulation steps. Figure 5 shows the activity state of the model after the 6th simulation step. Solid arrows indicate currently involved connections, dashed arrows indicate relevant associations in previous simulation steps. At the beginning all the activation fields except for af-A4 have the ‘OFF’ assembly activated due to input bias from external neuron populations. This means that the grammatical role areas are initially inactive. Only activation field af-A4 enables area A4 to be activated.
Figure 5. Activation state of the model of cortical language areas after simulation step 6. Initially the OFF-assemblies of the activation fields are active except for af-A4, where the ON-assembly is activated by external input. The processing of a sentence beginning activates the ‘_start’ assembly in area A2, which activates ‘S’ in A4, which activates af-A5-S, which primes A5-S. When subsequently the word “bot” is processed in area A1 this will activate the ‘_word’ assembly in area A2, and the ‘bot’ assemblies in area A3 and area A5-S. Solid arrows indicate currently involved connections, dashed arrows indicate relevant associations in previous simulation steps.
This happens, for example, when auditory input enters area A1. First the ‘_start’
representation in area A1 gets activated indicating the beginning of a new spoken
sentence. This will activate the corresponding ‘_start’ assembly in area A2 which in turn
activates the ‘S’ assembly in the sequence area A4 since the next processed word is
expected to be the subject of the sentence. The ‘S’ assembly therefore will activate the
‘ON’ assembly in the activation field af-A5-S which primes area A5-S such that the next
processed word in area A3 will be routed to A5-S. Indeed, as soon as the word
representation of “Bot” is activated in area A1 it is routed in two steps further via area A3
to area A5-S.
In step 7 the ‘_blank’ representation enters area A1, indicating the border between the words “bot” and “put”. This will activate the ‘_blank’ assembly in area A2, which will activate the ‘OFF’ assembly in activation field af-A4. This results in a deactivation of the sequence area A4. Since the ‘_blank’ assemblies in areas A1 and A2 are only active for one single simulation step and immediately followed by the representations of the word “put”, the ‘ON’ assembly in activation field af-A4 becomes active again one step later and also activates the sequence area A4. However, since the ‘S’ assembly in A4 was intermittently erased, the delayed intra-areal feedback of area A4 will switch activity to the next sequence assembly ‘P’. This means that the next processed word is expected to be the predicate of the sentence. The ‘P’ representation in area A4 activates the ‘ON’ assembly in activation field af-A5-P, which primes area A5-P to be activated by input from area A3. In the meantime the representations of the input word “put” were activated in areas A1 and A3, and from there routed further to area A5-P. This leads to the situation after simulation step 13 as shown in Figure 6.
Figure 6. System state of the model of cortical language areas after simulation step 13.
The ‘_blank’ representing the word border between words “bot” and “put” is recognized
in area A2 which activates the ‘OFF’ representation in af-A4 which deactivates area A4
for one simulation step. Immediately after the ‘_blank’ the word “put” is processed in A1
which activates the ‘_word’ and ‘put’ assemblies in areas A2 and A3, respectively. This
will activate again via af-A4 area A4. Due to the delayed intra-areal feedback this will
activate the ‘P’ representation in area A4 as the next sequence assembly. This activates
the ‘ON’ assembly in af-A5-P which primes A5-P to be activated by the ‘put’
representation in area A3.
The situation after simulation step 25 is illustrated in Figure 7. At this stage the next word border must not switch the sequence assembly in area A4 further to its next part. This is because the second object phrase is actually not yet complete: we have only processed the attribute (“green”) so far. Therefore a rather subtle (but not implausible) mechanism in our model guarantees that the switching is prevented at this stage. The ‘_none’ representation in area A5-O2 has two effects. First, together with the ‘_blank’ assembly in area A2 representing the word border, it activates the ‘OFF_p’ assembly in af-A5-O2. Second, by a projection to activation field af-A4, it prevents the activation of the ‘OFF’ assembly in af-A4 and therefore the switching in the sequence area A4. The ‘OFF_p’ assembly deactivates only area A5-O2 but not A5-O2-a. As soon as A5-O2 is deactivated and the next word “apple” is processed and represented by the ‘_word’ assembly in area A2, input from the ‘vp2_O2’ assembly in area A4 activates again the ‘ON’ assembly in activation field af-A5-O2. Thus the ‘apple’ assembly in area A5-O2 gets activated via areas A1 and A3. Figure 12 shows the situation at this point after simulation step 30.
Figure 7. The system state of the model of cortical language areas after simulation step
25. The word border between “plum” and “green” activates the ‘_blank’ representations
in area A1 and A2 which initiates the switching of the sequence assembly in area A4 to
its next part as described before (see text). The activation of the ‘vp2_O2’ assembly
activates the ‘ON’ assembly in activation field af-A5-O2 which primes areas A5-O2-a
and A5-O2. Then the processing of the word “green” activates the ‘green’ assembly in
area A5-O2-a via areas A1 and A3. Since there exists no corresponding representation in
area A5-O2 here the ‘_none’ assembly is activated.
3.2 Model Simulation – Acceptable and Unacceptable Sentences
Figure 8 illustrates the flow of neural activation in our model when processing a simple sentence such as “Bot lift orange”. In the following it is briefly explained how the model works.
The beginning of a sentence, which is represented in area A2 by assembly _start, will activate assembly S in area A4, which activates assembly _activate in activation field af-A5-S, indicating that the next processed word should be the subject of the sentence. Until step 7 the word “bot” has been processed and the corresponding assembly extends over areas A1, A3 and A5-S. While new ongoing input will activate other assemblies in the primary areas A1, A2, A3, a part of the global “bot” assembly is still activated in area A5-S due to the working memory mechanism.
Before the representation of the next word “lift” enters A1 and A2, the sequence assembly in A4 is switched from S to Pp, guided by contextual input from area A5-S. In the same way as before an assembly representing “lift” is activated in areas A1, A3, and A5-P.
Since “lift” is a verb requiring a “small” object, processing of the next word “orange”
will switch the sequence assembly in area A4 to node OA1s which means that the next
word is expected to be either an attribute or a small object. While a part of the “lift”
assembly remains active in the working memory of area A5-P, processing of “orange”
activates an assembly extending over A1, A3, A5-O1, and A5-O1-a . Activation of
pattern _obj in area A5-O1-a indicates that no attributal information has been processed.
Since “orange” is actually a small object which can be lifted, the sequence assembly in area A4 switches to ok_SPOs. While the assemblies in the primary areas A1, A2, A3 fade, the information about the processed sentence is still present in the subfields of area A5 and is ready for being used by other cortical areas such as the goal areas.
Figure 9 illustrates the processing of the sentence “Bot lift wall”. The initial processing of
“Bot lift...” is the same as illustrated in Figure 8. That means in area A4 there is an
activation of assembly OA1s which activates areas A5-O1-a and A5-O1 via the activation
field af-A5-O1. This indicates that the next word is expected to be a small object. Table 1
and Table 2 provide the results from the cortical language model when the sentences
introduced are acceptable and non-acceptable.
Figure 8: Processing of the sentence “Bot lift orange”. Black letters on white background indicate local assemblies which have been activated in the respective areas (small letters below area names indicate how well the current activity matches the learned assemblies). Arrows indicate the major flow of information.
Since the next word “wall” is not a small object (and cannot be lifted), the sequence in area A4 is switched to the error representation err_OA1s.
Figure 9: Processing of the sentence “Bot lift wall”. Processing of the first two words is the same as in Figure 8. When processing “wall” this activates an error state in area A4 (err_OA1s) since the verb “lift” requires a “small” object which can be lifted.
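The routing behaviour summarised above can be pictured as a small finite-state check over the grammatical roles, with an additional semantic constraint on which objects a verb accepts. The following Python sketch is only an illustration of that idea, not the spiking implementation; the result states mirror the A4 assemblies reported in Tables 1 and 2, while the word lists and verb constraints are assumptions made for the example.

```python
# Illustrative finite-state sketch of the A4 sequence checking (not the spiking model).
SUBJECTS = {"bot", "sam"}
VERBS = {"stop": None, "go": "any", "lift": "small", "pick": "small", "show": "any"}
SMALL_OBJECTS = {"orange", "plum", "apple", "nut", "ball"}

def parse(sentence):
    """Return an A4-style result state and the filled grammatical roles."""
    words = sentence.lower().strip(".").split()
    roles = {"A5-S": "_none", "A5-P": "_none", "A5-O1": "_null"}
    if not words or words[0] not in SUBJECTS:
        return "err_S", roles                      # sentence must start with a subject
    roles["A5-S"] = words[0]
    if len(words) < 2 or words[1] not in VERBS:
        return "err_Pp", roles                     # a predicate must follow
    roles["A5-P"] = verb = words[1]
    if VERBS[verb] is None:                        # e.g. "stop" takes no object
        return ("ok_SP", roles) if len(words) == 2 else ("err_okSP", roles)
    if len(words) < 3:
        return "err_OA1", roles                    # object expected but missing
    obj = words[-1]
    roles["A5-O1"] = obj
    if VERBS[verb] == "small" and obj not in SMALL_OBJECTS:
        return "err_OA1s", roles                   # e.g. "lift wall" is implausible
    return ("ok_SPOs" if VERBS[verb] == "small" else "ok_SPO"), roles

print(parse("Bot lift orange"))   # ('ok_SPOs', ...)
print(parse("Bot lift wall"))     # ('err_OA1s', ...)
```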
More test results produced by our cortical language model are summarized in the
following tables. The first table shows examples of correct sentences.
Table 1 Results of the language module when processing correct sentences

sentence                   | area A4 | A5-S | A5-P      | A5-O1-a | A5-O1 | A5-O2-a | A5-O2
Bot stop.                  | ok_SP   | bot  | stop      | _null   | _null | _null   | _null
Bot go white wall.         | ok_SPO  | bot  | go        | white   | wall  | _null   | _null
Bot move body forward.     | ok_SPA  | bot  | move_body | forward | _null | _null   | _null
Sam turn body right.       | ok_SPA  | sam  | turn_body | right   | _null | _null   | _null
Bot turn head up.          | ok_SPA  | bot  | turn_head | up      | _null | _null   | _null
Bot show blue plum.        | ok_SPO  | bot  | show      | blue    | plum  | _null   | _null
Bot pick brown nut.        | ok_SPOs | bot  | pick      | brown   | nut   | _null   | _null
The second table shows examples where the sentences are grammatically wrong or
represent implausible actions (such as lifting walls).
Table 2 Results of the language model when processing grammatically wrong, incomplete, or implausible sentences

sentence                    | area A4    | A5-S   | A5-P  | A5-O1-a | A5-O1  | A5-O2-a | A5-O2
Bot stop apple.             | err_okSP   | bot    | stop  | _null   | _null  | _null   | _null
Orange bot lift.            | err_Po     | orange | _none | _null   | _null  | _null   | _null
Stop apple.                 | err_S      | _none  | _null | _null   | _null  | _null   | _null
Bot bot lift orange.        | err_Pp     | bot    | _none | _null   | _null  | _null   | _null
Orange lift plum.           | err_Po     | orange | lift  | _null   | _null  | _null   | _null
Sam lift orange apple.      | err_okSPOs | sam    | lift  | _obj    | orange | _null   | _null
This is red go.             | err_OA1    | this   | is    | red     | _none  | _null   | _null
Bot lift wall.              | err_OA1s   | bot    | lift  | _obj    | wall   | _null   | _null
Bot pick white dog.         | err_OA1s   | bot    | pick  | white   | dog    | _null   | _null
Bot put wall (to) red desk. | err_OA1s   | bot    | put   | _obj    | wall   | _null   | _null
3.3 Disambiguation using the model
The cortical model is able to use context for disambiguation. For example, an ambiguous phonetic input such as “bwall”, which is between “ball” and “wall”, is interpreted as “ball” in the context of the verb “lift”, since “lift” requires a small object, even if without this context information the input would have been resolved to “wall”. Thus the sentence “bot lift bwall” is correctly interpreted. As the robot first hears the word “lift” and then immediately uses this information to resolve the ambiguous input “bwall”, we call this “forward disambiguation”. This is shown in Figure 10. Our model is also capable of the more difficult task of “backward disambiguation”, where the ambiguity cannot immediately be resolved because the required information is still missing. Consider for example the artificial ambiguity “bot show/lift wall”, where we assume that the robot could not decide between “show” and “lift”. This ambiguity cannot be resolved until the word “wall” is recognized and assigned to its correct grammatical position, i.e. the verb of the sentence has to stay in an ambiguous state until enough information is gained to resolve the ambiguity. This is achieved by activating superpositions of the different assemblies representing “show” and “lift” in area A5-P, which stores the verb of the current sentence. More subtle information can be represented in the spike times, which allows, for example, remembering which of the alternatives was the more probable one. More details on this system can be found in Fay et al. (2004), Knoblauch et al. (2005a, b), and Markert et al. (2005).
Figure 10 A phonetic ambiguity between “ball” and “wall” can be resolved by using context information. The context “bot lift” implies that the following object has to be of small size. Thus the correct word “ball” is selected.
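As a toy illustration of forward disambiguation only (the spiking superposition mechanism used for backward disambiguation is not reproduced here), the sketch below scores candidate words for an ambiguous input against the constraint imposed by the already-recognised verb. The candidate sets, similarity scores and the small-object constraint are assumptions made for the example.

```python
# Toy sketch of forward disambiguation: verb context constrains an ambiguous noun.
SMALL_OBJECTS = {"ball", "orange", "plum", "apple", "nut"}
VERB_NEEDS_SMALL = {"lift": True, "show": False}

def disambiguate(candidates, verb):
    """Pick the candidate compatible with the verb; fall back to plain similarity."""
    # Hypothetical phonetic similarity scores for the ambiguous input "bwall".
    similarity = {"wall": 0.6, "ball": 0.5}
    allowed = [c for c in candidates
               if not VERB_NEEDS_SMALL.get(verb, False) or c in SMALL_OBJECTS]
    pool = allowed if allowed else candidates
    return max(pool, key=lambda c: similarity.get(c, 0.0))

print(disambiguate(["ball", "wall"], "lift"))  # -> "ball" (lift requires a small object)
print(disambiguate(["ball", "wall"], "show"))  # -> "wall" (no constraint, best phonetic match)
```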
4. SELF-ORGANISING CORTICAL MAP
LANGUAGE REPRESENTATION MODEL
As described in Prototype 7, the cortical language model can act as the front-end to other models produced in the MirrorBot project. For instance, it can be used as the input to a cortical planning, action, and motor processing system so that a camera can be moved to centre a recognized object or to perform docking to an object. Furthermore, the cortical model can be used as the front-end of the second language model, which uses a self-organising model of multi-modal information inspired by cortical maps. This model illustrates a phonetic-motor association that shows that the organisation of words can integrate motor constraints, as observed by our Cambridge partners. This model, called Bijama, is a general-purpose cortically-inspired computational framework that has also been used for rewarded arm control. The main computational block of the model is a set of computational units called a map. A map is a sheet made of a tiling of identical units. This sheet has been implemented as a disk, for architectural reasons described further below. When input information is given to the map, each unit shows a level of activity, depending on the similarity between the information it receives and the information it specifically detects.
When an input is given to the map, the distribution of matching activities among units is a scattered pattern, because tuning curves are not sharp, which allows many units to have non-null activities even if their prototypes don't perfectly match the input. From this activity distribution over the map, a small compact set of units that contains the most active units has to be selected. In order to decide which units are locally the best matching ones inside a map, a local competition mechanism is implemented. It is inspired by theoretical results of the continuum neural field theory (CNFT). The CNFT algorithm tends to choose more often the units that have more connections. Thus, the local connection pattern within the maps must be the same for all units, which is the case for a torus-like lateral connection pattern, with units at one border of the map connected to the opposite border, for example. Here, the field of units in the map computes a distribution of global activities A*, resulting from the competition among the current matching activities A^t. This competition process has been made insensitive to the actual position of units within the map, in spite of heterogeneous local connection patterns at the level of border units in the Bijama model, as detailed further below. The global behaviour of the map, involving adaptive matching processes and learning rules dependent on a competition, is reminiscent of the Kohonen SOM. However, the local competition algorithm used here allows the units to be fed with different inputs.
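A very reduced sketch of this matching-plus-competition step is given below: each unit holds a prototype, computes a Gaussian matching activity for the current input, and a soft local competition concentrates activity around the best-matching region. The prototypes, the Gaussian tuning width and the simple lateral kernel are assumptions for illustration; the actual model uses CNFT dynamics on a disk-shaped map with toric lateral connectivity.

```python
import numpy as np

rng = np.random.default_rng(1)
MAP_SIDE = 16                                       # illustrative square map of units
prototypes = rng.random((MAP_SIDE, MAP_SIDE, 3))    # each unit detects a 3-d input

def matching_activity(x, width=0.3):
    """Gaussian tuning: activity decays with the distance between input and prototype."""
    d = np.linalg.norm(prototypes - x, axis=-1)
    return np.exp(-0.5 * (d / width) ** 2)

def local_competition(A, iterations=10, sigma=2.0):
    """Crude stand-in for the CNFT competition: repeatedly focus activity in a
    Gaussian bubble around the current maximum and subtract global inhibition."""
    ii, jj = np.meshgrid(np.arange(MAP_SIDE), np.arange(MAP_SIDE), indexing="ij")
    out = A.copy()
    for _ in range(iterations):
        wi, wj = np.unravel_index(np.argmax(out), out.shape)
        bubble = np.exp(-((ii - wi) ** 2 + (jj - wj) ** 2) / (2 * sigma ** 2))
        out = np.clip(A * bubble - 0.1 * out.mean(), 0.0, None)
    return out

A_t = matching_activity(rng.random(3))              # scattered matching activities
A_star = local_competition(A_t)                     # compact bubble around the best match
```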
A cortical layer that receives information from another map doesn't receive inputs from all the units of the remote map, but only from one stripe of units. For instance, a map may be connected row-to-row to another map: each unit in any row of the first map is connected to every remote unit in the corresponding row of the other map. These connections are always reciprocal in the model. The combination of self-organization and coherent learning produces what we call joint organization: competition, although locally computed, occurs not only inside any given map, but across all maps. Moreover, the use of connection stripes limits the connectivity, which avoids the combinatorial explosion that would occur if the model were to employ full connectivity between the maps. Thus, coherent learning leads to both efficient data representation in each map and coordination between all connected maps.
A multi-associative model is intended to associate multiple modalities, regardless of how they are related. It must then handle the case where the associations between two modalities are not one-to-one, but rather one-to-many, or even many-to-many. This multi-association problem will now be presented with a simple example.
Let us consider an association between certain objects and the sounds they produce. A car, for example, could be associated with a motor noise. Certain objects produce the same noise; as a result, a single noise will be associated with multiple objects. For instance, a firing gun and some exploding dynamite both produce basically an explosion sound. In our model, let us represent the sounds and the objects as two thalamic modalities on two different cortical maps. Let us now link both of these maps to another one, which we call an associative map. The sound representations and the object representations are now bound together through the associative map (see Figure 11). If we want a single unit to represent the “BANG” sound in the sound map, a single unit in the associative map has to bind together the “BANG” unit with both the gun and the dynamite units. This associative unit must then have the ability to perform multi-associations: it must have a strong cortical connection to two different units (the gun and the dynamite units) in the same cortical stripe (see Figure 11a). If associative units cannot perform multi-associations, the resulting self-organisation process among all maps will duplicate the “BANG” representative unit. The reason is that, in this very case, an associative unit is able to listen to only one unit in each connected module. Each instance of that unit will then be bound, through its own associative unit, either to the dynamite or the gun unit (see Figure 11b). Moreover, the two object units cannot be in the same connectivity stripe, or else the model would try to perform multi-association and fail. Since our model uses a logical “AND” between the different inputs of a unit, that single associative unit must be active each time one of these sounds and one of these objects are active together.
The cortical connections are trained with a learning rule adapted from the Widrow-Hoff rule: if unit i is connected to unit j, the weight w_ij of the cortical connection from i to j is updated according to

δw_ij = α (A*_i − A^c_i) A*_j − ε w_ij,

where A*_i is the cortical activity of unit i, α is the learning rate and ε is the decay rate of cortical connections. Here, the cortical activity A^c_i = Σ_j w_ij A*_j is seen as a predictor of A*_i. When both the local unit i and the remote unit j are active together, w_ij grows if A^c_i is lower than A*_i, and decreases if A^c_i is higher than A*_i. w_ij also decreases slowly over time. Here, the global connection strength of the local unit i for a given cortical stripe is distributed among all active remote units j, and not among all remote units. As with a Hebb/anti-Hebb rule, because of the local competition, the connection strength w_ij between i and an active unit j must be high. However, here, raising w_ij doesn't imply lowering all w_ik for all k in the remote connection stripe: raising w_ij will only lower w_ik if j and k are active at the same time. And since only a small A* activity bubble is present on each map, most remote units in the connection stripe cannot be active at the same time. Thus, the local unit i can bind together multiple units in a given connection stripe with a unit in another connection stripe. This is the reason why the use of the Widrow-Hoff learning rule in our model leads to the multi-map organization shown in Figure 11a. Solving the multi-association problem has one main benefit: the maps need fewer units to represent a certain situation than when multi-association between units is impossible. Moreover, since instances of a given thalamic input are not duplicated in different parts of a cortical map, it is easier for the model to perform a compromise between the local organization and the cortical connectivity requirements, i.e. the joint organization is less constrained.
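A compact sketch of this stripe-wise update is given below. It implements the update δw_ij = α (A*_i − A^c_i) A*_j − ε w_ij described above; the stripe size, learning rate and decay values are illustrative assumptions rather than the model's actual parameters.

```python
import numpy as np

def widrow_hoff_stripe_update(w, A_local, A_remote, alpha=0.05, eps=1e-3):
    """One update of the cortical weights of a stripe.

    w        : (n_local, n_remote) weights from remote stripe units to local units
    A_local  : (n_local,)  global activities A* of the local units
    A_remote : (n_remote,) global activities A* of the remote stripe units
    """
    A_pred = w @ A_remote                       # A^c_i, prediction of the local activity
    dw = alpha * np.outer(A_local - A_pred, A_remote) - eps * w
    return w + dw

# Hypothetical usage: one local unit learns to be driven by two remote units
# (e.g. the "gun" and "dynamite" units) that are never active at the same time.
w = np.zeros((1, 8))
for step in range(2000):
    A_remote = np.zeros(8)
    A_remote[step % 2] = 1.0                    # remote units 0 and 1 alternate
    A_local = np.array([1.0])                   # the local unit is active with either
    w = widrow_hoff_stripe_update(w, A_local, A_remote)
print(np.round(w, 2))                           # both w[0, 0] and w[0, 1] end up high
```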
Several brain imaging studies have shown that word encoding within the brain is not only organized around purely phonetic codes but is also organized around action. How this is done within the brain has not yet been fully explained, but we would like to present how these action-based representations naturally emerge in our model by virtue of solving constraints coming from motor maps. We therefore applied our model to a simple word-action association. A part of the word set from the European MirrorBot project was used in a “phonetic” map, and we tried to associate these words with the body part that performs the corresponding action. One goal of this project is to define multimodal robotic experiments, and the corresponding protocols are consequently well suited for this task. The phonetic coding used in our model is taken from the MirrorBot project. A word is separated into its constituent phonemes. Each phoneme is then coded by a binary vector of length 20. Since the longest word that is used has 4 phonemes, each word is coded by 4 phonemes, and words with fewer phonemes are completed with empty phonemes. The distance between two different phonemes is the Cartesian distance between the coding vectors. The distance between two words is the sum of the distances between their constituent phonemes. While we are well aware that this is a very functional way to represent the phonetic distance between two words, it is sufficient to exhibit the joint organization properties discussed in this report. The actions are coded in the same way as the words: there are 3 different input actions (head action, body action and hand action), and each action is coded as a binary vector of length 3. The distance between two actions is, once again, the Cartesian distance between their representing vectors. Each word is semantically associated to a specific action. The word-action relationship is shown in Figure 12.
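A small sketch of this coding and distance computation follows. The 20 phonetic features per phoneme are drawn at random here purely for illustration (the actual MirrorBot-derived feature vectors are not reproduced, and the phoneme symbols are illustrative); the padding and distance definitions follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative stand-in for the 20-feature binary code of each phoneme ("_" = empty phoneme).
PHONEME_CODE = {p: rng.integers(0, 2, size=20).astype(float)
                for p in ["g", "@", "U", "l", "I", "f", "t", "p", "k", "_"]}

def encode_word(phonemes, max_len=4):
    """A word is 4 phoneme vectors; shorter words are padded with empty phonemes."""
    padded = list(phonemes) + ["_"] * (max_len - len(phonemes))
    return np.stack([PHONEME_CODE[p] for p in padded])       # shape (4, 20)

def word_distance(w1, w2):
    """Sum over phoneme positions of the distance between the feature vectors."""
    return float(np.sum(np.linalg.norm(encode_word(w1) - encode_word(w2), axis=1)))

print(word_distance(["g", "@", "U"], ["p", "I", "k"]))        # e.g. 'go' vs 'pick'
```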
Figure 11 A multi-association: a sound is associated with two different objects, both of which may produce this actual sound. (a) A single sound unit is connected with the two object units by a unit in the associative map that stands at the intersection of the sound and object stripes: this is possible because the unit's global connection strength is not distributed among all connections, but only among active connections (Widrow-Hoff rule). (b) When a unit's global connection strength is distributed among all connections, the sound unit must be duplicated (Hebb rule).
As illustrated in Figure 13, we can clearly see that the topological organization found by the model meets these criteria. Within the word map, words are grouped relative to the body part they represent: body action words are grouped together (stripes), as are hand action words (gray) and head action words (white). However, the phonetic distribution of words remains the most important factor in the phonetic map organization. Each word is represented by a “cluster” of close units, and words whose phonetic representations are close tend to be represented in close clusters of units. For instance, while “Go” and “Show” correspond to different motor actions, their phonetic representations are close, so that their representing clusters are adjacent. This illustrates the fact that the model actually achieves a successful compromise between the local demands, which tend to organize the words phonetically, and the motor demands, which tend to put together the words that correspond to the same action. The joint organization does not destroy the local self-organization, but rather modulates it so that it becomes coherent with the other map organization.
Figure 12 Word-Action Relationship.
The thalamic prototypes (i.e. external inputs) of the motor and the phonetic units are, respectively, coded actions and coded words. However, these do not necessarily correspond to real input words or actions: these prototypes are vectors of float values, not binary ones.
Let us consider three maps: one for word representation, one for action representation and finally an associative one that links words to actions. In our model the higher-level associative map linking the auditory representation with the motor action will use close units to represent words that relate to the same action (head action, for example); see the “look” and “show” positions in Figure 13. As our model deals with an implicit global coherence, it is able to reflect this higher level of association and to overcome the simpler phonetic organization.
Figure 13 Two simulation results of word representation map after coherent learning has
occurred with our model. Word representations are now constrained by the motor map
via the associative map and, as a result, words that correspond to the same action are
grouped together. Nevertheless, phonetic proximity is still kept.
5. HIERARCHICAL MULTIMODAL LANGUAGE
MODELING
The GLIA model operates at a lower level than the other models: it takes three of the action words represented by the second model and grounds the language instructions in actions based on multimodal inputs. A motivation for this study is to allow the student robot to learn from the teacher robot, which performs the three behaviours ‘go’, ‘pick’ and ‘lift’, based on multimodal inputs. Although only three language instructions are used in this initial study, it is our intention in the future to extend the language representation to incorporate the actors, more actions and objects.
Figure 14 The simulated environment for the robot, which is 72 units along the x-axis and 36 along the y-axis. The figure shows the target object of interest at the top, the robot, the 36 units along the x-axis to the front and to the back wall, the 18 units of positive and of negative y either side of the centre, and the 45° region boundaries.
To allow the student robot to learn these behaviours, the teacher robot performs ‘go’,
‘pick’ and ‘lift’ actions one after another in a loop in an environment (Figures 14 and 15)
of 72 units along the x-axis and 36 units along the y-axis. The distance from the nearest
wall to the centre of the area along the x-axis is 36 units. The y-axis of the arena is split
between positive and negative values either side of the central object. The simulation
environment is set to this size as it is the same size as our actual environment for running
experiments on a PeopleBot robot platform, with one unit in the simulator equal to 10cm
in the real environment. This gives the opportunity in the future to use the trained
computational architectures from the simulator as the basis for a real PeopleBot robot
application.
Figure 15 The robot in the simulated environment at coordinates x, y and rotation angle φ, with the target object for the ‘pick’ behaviour and the start position marked. The robot has performed the behaviour for ten time steps and currently turns away from the wall in the learnt ‘pick’ behaviour.
Figure 15 shows the teacher robot in the environment performing the ‘pick’ behaviour at coordinates x, y and angle φ to the nearest wall. The student robot observes the teacher robot performing the behaviours and is trained by receiving multimodal inputs. These multimodal inputs are: (i) high-level visual inputs, which are the x- and y-coordinates and the rotation angle φ of the teacher robot relative to the nearest wall; (ii) the motor directions of the robot (‘forward’, ‘backward’, ‘turn left’ and ‘turn right’); and (iii) a language instruction stating the behaviour the teacher is performing (‘go’, ‘pick’ or ‘lift’). The dashed areas in Figure 14 indicate the boundaries between the regions of the environment associated with a particular wall. Hence, once the robot moves out of a region its coordinates x and y and angle φ are measured relative to the nearest wall. The region associated with the top wall is the shaded area in Figure 15. The regions are made up of isosceles triangles with two internal angles of 45° and two sides of 40 units.
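The wall-relative coordinate frame can be sketched as below. This is only an illustration of the idea of measuring (x, y, φ) relative to the nearest wall of a 72 × 36 arena; the exact region boundaries, sign conventions and the choice of lateral axis used in the report's simulator are not specified here and are assumed.

```python
ARENA_X, ARENA_Y = 72, 36   # arena size in simulator units (10 cm per unit)

def wall_relative_pose(x, y, heading_deg):
    """Return (distance to nearest wall, signed lateral offset, angle phi to that wall).

    Sign conventions and the wall-normal headings are illustrative assumptions.
    """
    walls = {                       # distance of (x, y) to each wall
        "front": ARENA_X - x,
        "back": x,
        "left": y,
        "right": ARENA_Y - y,
    }
    nearest = min(walls, key=walls.get)
    # Assumed heading of the outward normal of each wall, used to express the
    # robot's rotation angle phi relative to the nearest wall.
    wall_heading = {"front": 0.0, "back": 180.0, "left": -90.0, "right": 90.0}
    phi = (heading_deg - wall_heading[nearest] + 180.0) % 360.0 - 180.0
    lateral = y - ARENA_Y / 2 if nearest in ("front", "back") else x - ARENA_X / 2
    return walls[nearest], lateral, phi

print(wall_relative_pose(60.0, 20.0, 10.0))   # robot close to the front wall
```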
In our approach the ‘go’ behaviour involves the teacher robot moving forward in the environment until it reaches a wall and then turning away from it. The distance from the wall at which the robot turns when performing the ‘go’ behaviour is 5 units, to ensure that the robot does not hit the wall. Furthermore, to stop it hitting the side wall when turning, it turns away from that wall. Figure 16 shows an example ‘go’ behaviour where the robot moves towards the front wall and then turns left towards the left wall; once it gets to the left wall it turns right and moves towards the back wall. Although Figure 16 only shows the robot until it moves towards the back wall, it would continue until it is in a position to start the ‘pick’ behaviour.
The second action verb ‘pick’ involves the robot moving toward the target object
depicted in Figure 14 and Figure 15 at the top of the arena and placing itself in a position
to get the object.
Figure 16 An example of the ‘go’ behaviour by the robot, from its start position to its end position.
The docking procedure performed by the teacher that is involved in the ‘pick’ behaviour
is based on a reinforcement learning actor-critic approach of Weber et al. (2003) and
Weber et al. (2004) using the research of Foster et al. (2000). This model uses neural
vision and a reinforcement learning actor-critic approach to move a simulated version of
the PeopleBot robot toward a table so that it can grasp an object. As the robot has a short
non-extendable gripper and wide ‘shoulders’ it must dock in the manner that the real PeopleBot robot does: not hit the table with its shoulders and finish at a position where the gripper is at a 90° angle to the object so that the gripper can grasp it. The
input to the reinforcement neural network is the relative position of the robot to the target
object that is positioned at the border of the table using visual perception (Figure 17). By
using this reinforcement approach to control the behaviour of the teacher robot, the
training of the student robot does not always receive the optimum solution or a correct
solution. However, it fits with the concept of imitation learning by a child from the
parent, as the parent may not provide the optimal way to solve a problem.
In our approach, when the teacher robot is performing the three behaviours in a loop it only moves from the ‘go’ behaviour to the ‘pick’ behaviour if the object is in the field of vision, which is dependent on the angle to the object and the distance d. Using the field of vision associated with the PeopleBot, the value of d is determined using basic trigonometry and must be less than 20 units for the object to be in vision. Figure 18 provides an example of a ‘pick’ behaviour performed by the simulated robot. In this case the robot starts close to the front wall, so it moves backwards so that it can turn and move towards the object. It finishes in a position in front of the object so that the robot can get the object with its grippers.
Figure 17 The figure depicts the table and the target object on it. The robot shown has short black grippers and its field of vision is outlined by the dotted line. Real-world coordinates (x, y, φ) state the position and rotation angle of the robot. The perceived position of the object in the robot’s field of vision is defined by its angle and distance d. (From Weber and Wermter 2004)
Figure 18 A simulated teacher robot performing the ‘pick’ behaviour in the arena, from its start position to its end position.
The final action verb ‘lift’ involves moving backwards to leave the table and then turning around to face toward the middle of the arena (Figure 19). The coordinates x and φ determine how far to move backward and in which direction to turn around. The robot moves backwards until it reaches unit 13 on the x-axis and then turns to a random angle to face the back wall. If the angle that the robot turns to is negative the robot turns to the left; if it is positive it turns to the right.
Figure 19 An example of the ‘lift’ behaviour by the robot, from its start position to its end position.
The ‘go’, ‘pick’ and ‘lift’ behaviours and action verbs were selected as they can be seen as individual behaviours or as the main components in a ‘pick and place’ scenario, where the robot moves around and then picks up and retrieves the object. Furthermore, they provide the situation that the same visual information can produce different or the same motor direction depending on the language instruction. These behaviours fit well with the mirror neuron system principle as they can be seen as goal-related and can either be recognised or performed. Finally, we can combine behaviours produced by reinforcement learning and pre-programming to train architectures that are able to learn these behaviours and so ground language instruction in actions.
The high-level visual inputs which are shared by teacher and student are chosen such that
they could be produced once the multimodal grounding architecture is implemented on a
real PeopleBot robot. The robot will simply be able to use visual processing of the scene
to determine where the teacher robot is in the arena. This shared information will be
gained through processing of the images collected by the student robot and basic
geometry. Hence, this associates simple visual perception with language instruction to
achieve grounding of language instruction in actions.
When receiving the multimodal inputs related to the teacher's actions the student robot is
required to learn these behaviours so that it could perform them from a language
instruction or recognise them. When learning, performing and recognising these
behaviours, ‘forward’ and ‘backward’ movement is at a constant speed of 1 unit per time step and each ‘turn left’ or ‘turn right’ decision turns the robot by 15° per time step. Two neural architectures are considered for this system for grounding language instruction in actions.
5.1 Hierarchical GLIA architectures
The GLIA hierarchical architecture (Figure 20) uses an associator network based on the
Helmholtz machine learning approach [Dayan 2000, Dayan and Hinton 1996, Dayan et
al. 1995, Hinton et al. 1995]. The Helmholtz machine approach is selected as it matches
neuroscience evidence on language processing by using a distributed representation and
unsupervised learning. Furthermore, the Helmholtz learning algorithm is able to
associate different modalities to perform the behaviours and offers the ability to recreate
the missing inputs. This allows the architectures to recreate the language instruction
when performing recognition and the motor directions when instructed to perform a
behaviour.
It was found that the sparse coding of the wake-sleep algorithm of the Helmholtz machine leads to the extraction of independent components in the data, which is not desired since many of these components would not span multiple modalities. Hence the representations of the three modalities would each be positioned at a separate location on the Helmholtz machine hidden layer, and so two different architectural approaches that overcome this are considered: the flat architecture and the hierarchical architecture.
Figure 20 A hierarchical multimodal GLIA architecture, with bottom-up weights W^bu and top-down weights W^td linking the language instruction, high-level vision (coordinate and angle) and motor direction inputs to the hidden layers.
In the hierarchical architecture (Figure 20) the motor and high-level vision inputs are associated in the first hidden layer using the Helmholtz machine learning algorithm, shown on the diagram as the HM area. The activations of the first hidden layer are then associated with the language instruction region input at the second hidden layer based on the self-organising map (SOM) learning algorithm, shown as the SOM area. By using a Kohonen self-organising map such an architecture allows the features produced on the Helmholtz machine hidden layer to relate a specific motor direction with the appropriate high-level visual information for each behaviour, based on the language instruction. Hence this is able to overcome the problem of the Helmholtz machine network extracting independent components that fail to span multiple modalities. The size of the HM area hidden layer is 32 by 32 units and the SOM area layer has 24 by 24 units. These sizes were empirically determined as they offer sufficient space to allow differentiation between the activation patterns representing the different high-level vision, motor and language instruction states.
The number of training behaviour examples is 500,000, to allow time for the flat architecture and the hierarchical architecture to learn the three behaviours. During training the flat and hierarchical architectures receive all the inputs; however, when testing, either the language instruction input or the motor direction input is omitted. The language instruction input is omitted when the student robot architecture is required to take the other inputs, gained from observing the teacher robot, and recognise the behaviour that is performed. When the motor direction input is omitted the student robot is required to perform a behaviour based on a language instruction. The architecture then continuously receives its own current x- and y-coordinates and angle φ and the language instruction of the behaviour to be performed.
Without the motor direction input the grounding architectures have to produce the appropriate motor activations, which they have learnt from observing the teacher, to produce the required behaviour. Recognition is tested by comparing the units which are activated on the language instruction area with the activation pattern belonging to the verbal instruction of the appropriate behaviour at each time step. Behaviour production is tested by comparing the performance of the behaviour at each time step by the student robot with that of the teacher robot. The duration of a single behaviour is dependent on the initial start input conditions and is typically around 25 consecutive time steps before the end condition is met.
5.2 Representation of action verb instruction word form
First we focus on a phoneme scheme for the action verb instruction word forms used by the architectures. Although several machine-readable schemes are available, the one used in our architectures relates to the CELEX database. This CELEX approach uses a feature description of 46 English phonemes, based on the phonemes in the CELEX lexical databases (www.kun.nl/celex/). Table 3 provides the representation of the action verbs in our scenario using this phoneme scheme.
Table 3 The representation of the action verbs in our scenario using CELEX phonemes.

Word | CELEX Phonemes
GO   | g@u
LIFT | l we f t
PICK | p we k
Each phoneme is numerically represented using 20 phonetic features, which produces a
different binary pattern of activation in the language input region for each phoneme. The
diagram representation for the action verbs ‘go’, ‘pick’ and ‘lift’, using a region of 4 rows
by 20 columns, is shown in Figure 21. By using this approach the order in which the
phonemes appear in the word is maintained. Although this approach does maintain this
order, it has the limitation that words that have the same phonemes but at different
locations in the representations may not be seen as similar.
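To make this limitation concrete, the sketch below builds the 4 × 20 language input region for a word from per-phoneme feature vectors (random stand-ins here, not the actual CELEX features) and shows that two words sharing the same phonemes at different positions activate different rows of the region.

```python
import numpy as np

rng = np.random.default_rng(2)
# Random binary stand-ins for the 20 phonetic features of each phoneme (illustrative only).
FEATURES = {p: rng.integers(0, 2, size=20) for p in ["g", "@", "U", "p", "I", "k", "_"]}
FEATURES["_"] = np.zeros(20, dtype=int)        # empty phoneme used for padding

def language_region(phonemes, rows=4):
    """4 x 20 binary input region: one row per phoneme, padded with the empty phoneme."""
    padded = list(phonemes) + ["_"] * (rows - len(phonemes))
    return np.stack([FEATURES[p] for p in padded])

go = language_region(["g", "@", "U"])
og = language_region(["@", "U", "g"])          # same phonemes, shifted by one position
rows_equal = [bool(np.array_equal(go[r], og[r])) for r in range(4)]
print(go.shape, rows_equal)   # (4, 20) [False, False, False, True]: order kept, similarity lost
```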
Figure 21 Distributed phoneme representations for ‘go’ (‘g @ U’), ‘pick’ (‘p we k’) and ‘lift’ (‘l we f t’) based on the 20 features associated with each phoneme.
5.3 Visual and motor direction representations
In this section of the report we describe the representations used for the high-level vision and motor direction inputs. The high-level visual inputs, the coordinates x and y and the angle φ of the robot in the environment (see Figure 15), are represented by two arrays of 37 units and one array of 25 units, respectively. When the robot is close to the nearest wall, the x-position is a Gaussian hill of activation centred near the first unit, while for a robot position near the middle of the arena the Gaussian is centred near the last unit of the first column of 37 units. The next column of 37 units represents the y-coordinate, so that a Gaussian positioned near the middle unit indicates that the robot is in the centre of the environment along the y-axis.
In Figure 22 the y-coordinate column ranges between -18 and +18 to equate to the location sensor readings produced by the PeopleBot robot’s internal odometer. As the robot environment is 36 units on the x-axis from the centre of the arena to the nearest wall and 36 units on the y-axis, these input columns have a one-to-one relationship with the units in the environment (with coordinate 0 represented by a unit). Rotation angles φ from -180° to 180° are represented along 25 units, with the Gaussian centred on the centre unit if φ = 0°, on the first unit for φ = -180° and on the last unit for φ = 180°. The φ column is 25 units in size as the robot turns 15° at each time step, to fit in with the actual movement of our PeopleBot robot.
Figure 22 An example input for the coordinates x and y and angle φ. The x- and y-columns have 37 units (y labelled from -18 to +18) and the φ column has 25 units (labelled from -180° to 180°).
The Gaussian activation levels depend on the equation T_kk' = exp(-0.5 (d_kk' / σ)²), where d_kk' is the distance between the current unit being considered and the centre of the Gaussian. The value of σ is set to 3 for the x- and y-coordinate columns and to 1.5 for the φ column. These values for σ are based on past experience. Figure 22 shows the activations for the x- and y-columns and the φ column; in the example the centre of the Gaussian hill of activation for the x-coordinate is at position 21, for the y-coordinate it is at +7, and for φ the centre of the Gaussian hill of activation is at 30°.
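A short sketch of this population coding follows; it builds the three Gaussian hills with the σ values above. The exact mapping from world coordinates to unit indices is an assumption made for the example.

```python
import numpy as np

def gaussian_hill(n_units, centre, sigma):
    """Population code: activation T_k = exp(-0.5 * ((k - centre) / sigma)**2)."""
    k = np.arange(n_units)
    return np.exp(-0.5 * ((k - centre) / sigma) ** 2)

def encode_pose(x_units, y_units, phi_deg):
    """High-level visual input: x and y over 37 units each, phi over 25 units.

    Assumed index mapping: x_units in [0, 36], y_units in [-18, 18],
    phi_deg in [-180, 180] mapped linearly onto the 25 phi units (15 deg per unit).
    """
    x_code = gaussian_hill(37, x_units, sigma=3.0)
    y_code = gaussian_hill(37, y_units + 18, sigma=3.0)
    phi_code = gaussian_hill(25, (phi_deg + 180.0) / 15.0, sigma=1.5)
    return np.concatenate([x_code, y_code, phi_code])   # 99 input values in [0, 1]

vision_input = encode_pose(x_units=21, y_units=7, phi_deg=30)   # the example in the text
```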
As the final part of the multimodal input the teacher robot’s motor directions are
presented on the 4 motor units (‘forward’, ‘backward’, ‘turn right’ and ‘turn left’) one for
each of the possible actions with only one active at a time. The activation values in all
three input areas are between 0 and 1. Figure 23 provides a representation of the motor
direction input when ‘left’ is activated.
Figure 23 Example motor direction input for the model, with one unit each for ‘forward’, ‘backward’, ‘turn right’ and ‘turn left’.
5.4 Training algorithm
As the hierarchical architecture uses the Helmholtz machine learning algorithm [Dayan 2000, Dayan and Hinton 1996, Dayan et al. 1995, Hinton et al. 1995] and the SOM learning algorithm [Kohonen 1997], internal representations of the input are produced by unsupervised learning. The bottom-up weights W^bu (dark green in Figure 20) produce a hidden representation r of some input data z. The top-down weights W^td (light green in Figure 20) are used to reconstruct an approximation z̃ of the data based on the hidden representation.
The Helmholtz machine learning for the flat architecture and for the HM area of the hierarchical model consists of alternating wake and sleep phases to train the top-down and bottom-up weights, respectively. In the wake phase, a full data point z is presented, which consists of a motor direction, the high-level visual values and, for the flat multimodal architecture, the language instruction. The linear hidden representation r^l = W^bu z is computed first. In the flat architecture a competitive version r^c is obtained by taking the winning unit which has the highest activation in r^l and assigning activation values under a Gaussian envelope to the units around the winner. Thus, r^c is effectively a smoothed localist code. On the HM area of the hierarchical architecture, the linear activation is converted into a sparse representation r^s using the transfer function r^s_j = e^(β x_j) / (e^(β x_j) + n) with x_j = r^l_j, where β = 2 controls the slope and n = 64 the sparseness of firing. These values of β and n were selected based on empirical studies and their ability to produce the most stable learning and hidden representation. The reconstruction of the data is obtained by z̃ = W^td r^(c/s), and the top-down weights from units j to units i are modified according to

Δw^td_ij = ε r^(c/s)_j (z_i − z̃_i)      (1)

with a learning rate ε = 0.001.
The learning rate is increased 5-fold whenever the active motor unit of the teacher changes. This is critical during the ‘go’ behaviour when the robot turns for a while in front of a wall until it takes its first step forward. Without emphasising the ‘forward’ step, the student would learn only the ‘turn’ command which dominates this situation. The ability to detect novel behaviour is useful for learning as it can reduce the amount of information processed and so focus on unusual events [Marsland 2003]. Increasing the learning rate when a student robot recognises significant events, such as a change in direction, and remembering where this occurs is used by Hayes and Demiris (1994). There is also neuroscience evidence to support this approach of increasing the learning rate for novel behaviours, as the cerebral cortex has regions that detect novel or significant behaviour to aid learning [Opitz et al. 1999]. In the wake phase the inputs are scaled to different relative levels in order to identify the scaling that gives the optimum performance.
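The five-fold increase of the wake-phase learning rate on a change of the teacher's active motor unit, described at the start of this paragraph, could be implemented along the following lines; the function name and the way the previous motor vector is stored are illustrative assumptions.

import numpy as np

def wake_learning_rate(motor_now, motor_prev, base_eps=0.001, boost=5.0):
    # Boost the learning rate 5-fold whenever the teacher's active motor unit changes.
    if np.argmax(motor_now) != np.argmax(motor_prev):
        return base_eps * boost
    return base_eps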
In the sleep phase, a random hidden code $\vec{r}^{\,r}$ is produced by assigning activation values under a Gaussian envelope centred on a random position on the hidden layer. Its linear input representation $\vec{z}^{\,r} = W^{td}\vec{r}^{\,r}$ is determined, and then the reconstructed hidden representation $\tilde{\vec{r}}^{\,r} = W^{bu}\vec{z}^{\,r}$. From this, in the flat architecture we obtain the competitive version $\tilde{\vec{r}}^{\,c}$ by assigning activation values under a Gaussian envelope centred on the winning unit with the strongest activation. In the HM area of the hierarchical model, a sparse version $\tilde{\vec{r}}^{\,s}$ is obtained by applying the above transfer function and parameters to the linear representation. The bottom-up weights from input units $i$ to hidden units $j$ are modified according to

$\Delta w_{ji}^{bu} = \epsilon' \,(-\,w_{ji}^{bu} + z_i^{r})\; \tilde{r}_j^{\,c/s}$    (2)
with an empirically determined learning rate $\epsilon' = 0.01$. The learning rates $\epsilon$ and $\epsilon'$ are decreased linearly to zero during the last 25% of the 500000 training data points in order to limit noise. All weights $W^{td}$ and $W^{bu}$ are rectified to be non-negative at every learning step. In the flat architecture the bottom-up weight vector of each hidden unit is normalised to unit length to reduce the impact of their differing sizes. The weights are updated after every 100 training points. In the HM area of the hierarchical model, to ensure that the weights do not grow too large, a weight decay of $-0.015\,w_{ij}^{td}$ is added to Equation (1) and $-0.015\,w_{ji}^{bu}$ to Equation (2).
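The sleep-phase update, including the non-negativity rectification and the HM-area weight decay, could look as follows in numpy. The sign convention follows the reconstruction of Equation (2) above, and the layer sizes, envelope width and random-number usage are placeholders.

import numpy as np

def sparse(r_lin, beta=2.0, n=64.0):
    # Same sparse transfer function as in the wake phase.
    e = np.exp(beta * r_lin)
    return e / (e + n)

def random_hidden_code(n_hidden, sigma=2.0):
    # Gaussian envelope centred on a random position of the hidden layer.
    centre = np.random.randint(n_hidden)
    units = np.arange(n_hidden)
    return np.exp(-0.5 * ((units - centre) / sigma) ** 2)

def sleep_step(W_bu, W_td, eps2=0.01, decay=0.015, hm_area=True):
    # One sleep-phase update of the bottom-up weights, Equation (2).
    r_rand = random_hidden_code(W_bu.shape[0])
    z_rand = W_td @ r_rand                                   # generated input pattern
    r_rec = sparse(W_bu @ z_rand)                            # reconstructed hidden representation
    dW = eps2 * r_rec[:, None] * (z_rand[None, :] - W_bu)    # dw_ji = eps' (z_i^r - w_ji) r~_j
    if hm_area:
        dW -= decay * W_bu                                   # additional weight decay for the HM area
    W_bu += dW
    np.clip(W_bu, 0.0, None, out=W_bu)                       # rectify weights to be non-negative
    return W_bu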
The SOM area of the hierarchical architecture is trained using the Kohonen approach; only the bottom-up weights shown in Figure 19 are trained. Training of the SOM area weights only starts after the HM area has been trained. The SOM learning rate was set to 0.01. The neighbourhood function is a Gaussian, $T_{kk'} = e^{-0.5\,(d_{kk'}/\sigma)^2}$, where $d_{kk'}$ is the distance between unit $k$ and the winning unit $k'$ on the SOM grid. A large neighbourhood of $\sigma = 12$ is used at first to obtain broad topographical learning; it is reduced to $\sigma = 0.1$ during training over 450000 data points. Finer training is then achieved with a smaller neighbourhood width by reducing $\sigma$ from 0.1 to 0.01 over a further 50000 data points.
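A sketch of the SOM-area training described above is given below. The Kohonen update and the Gaussian neighbourhood follow the text; taking the winner as the most active unit and annealing the neighbourhood width linearly are assumptions made for illustration.

import numpy as np

def som_step(W_som, r_hm, grid, eta=0.01, sigma=12.0):
    # One Kohonen update of the SOM bottom-up weights on an HM-area pattern.
    # grid holds the 2-D coordinates of the SOM units, one row per unit.
    winner = np.argmax(W_som @ r_hm)                     # most active SOM unit
    d = np.linalg.norm(grid - grid[winner], axis=1)      # grid distances d_kk'
    T = np.exp(-0.5 * (d / sigma) ** 2)                  # Gaussian neighbourhood T_kk'
    W_som += eta * T[:, None] * (r_hm[None, :] - W_som)  # move weights toward the input
    return W_som

def sigma_schedule(t):
    # Anneal the neighbourhood width: 12 -> 0.1 over 450000 points, then 0.1 -> 0.01.
    if t < 450000:
        return 12.0 + (0.1 - 12.0) * t / 450000.0
    return 0.1 + (0.01 - 0.1) * min(t - 450000, 50000) / 50000.0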
For the flat architecture in the wake phase, although the inputs have different sizes, the best performance was found when no scaling is applied, i.e. the inputs are presented unchanged. For the hierarchical architecture, however, the Helmholtz machine in the wake phase performs better when the motor input is scaled by 4 and the high-level vision input by 2 to increase their impact on the weight vectors and activation patterns. The motor direction input is scaled more strongly than the high-level vision input because each motor action is represented by a single unit, whereas the high-level vision uses Gaussian hills of activation to represent the movement of the robot in the arena.
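A small sketch of how the wake-phase input of the hierarchical HM area could be assembled with these scaling factors; the ordering of the concatenated parts is an assumption.

import numpy as np

def hm_data_point(motor, vision, motor_gain=4.0, vision_gain=2.0):
    # Hierarchical wake-phase input: motor scaled by 4, high-level vision by 2.
    # (For the flat architecture both gains would be 1 and the language
    # instruction would also be appended.)
    return np.concatenate([motor_gain * np.asarray(motor, dtype=float),
                           vision_gain * np.asarray(vision, dtype=float)])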
5.5 Hierarchical GLIA architecture results
First, we trained an HM area to perform a single behaviour, ‘pick’, without the use of a higher-level SOM area. The robot thereby self-imitates a behaviour it has previously learnt by reinforcement (Weber et al. 2004). Example videos of its movements can be seen on-line at: www.his.sunderland.ac.uk/supplements/AI04/.
Figure 24a) shows the total incoming innervation originating from the motor units (left)
and the high-level vision units (right) on the HM area. The figure has been obtained by
activating all four motor units or all high-level vision units, respectively, with activation 1
and by displaying the resulting activation pattern on the HM area.
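For illustration, such projection maps could be computed as below by driving the selected input units with activation 1 and reading the resulting linear innervation of the HM area; the assumption that the motor units occupy the first four input positions is made only for this sketch.

import numpy as np

def total_innervation(W_bu, unit_indices):
    # Activate the listed input units with value 1 and return the resulting
    # total incoming innervation on the HM area (a sum of weight columns).
    z = np.zeros(W_bu.shape[1])
    z[list(unit_indices)] = 1.0
    return W_bu @ z

# e.g. total motor innervation, assuming the first four input units are the motor units:
# motor_map = total_innervation(W_bu, range(4))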
Figure 24 a) Left, the projections of the four motor units onto the HM area. Right, the projections of all high-level vision inputs onto the HM area. b) Four neighbouring SOM units' RFs in the HM area. These selected units are active during the ‘go’ behaviour. Circles indicate that the leftmost units' RFs overlap with those of the ‘left’ motor unit while the rightmost unit's RF overlaps with the RF of the ‘forward’ motor unit.
It can be seen that the patches of motor innervation avoid areas of high-density sensory innervation, and vice versa. This is due to competition between the incoming innervations. It does not mean that motor activation is independent of sensory activation: Figure 24b) shows the innervation of SOM area units on the HM area, which binds regions specialised on motor and sensory input. The leftmost of the four units displayed binds the ‘left’ motor action with some sensory input, while the rightmost binds the ‘forward’ motor action with partially different sensory input. In the cortex we would expect such binding to occur not only via another cortical area (such as the SOM area in our model) but also via horizontal lateral intra-area connections, which we do not model.
The activation patterns of the student robot during recognition of the 10-step action sequence of a ‘go’ behaviour shown in Figure 26 are given in Figure 25, and the activations during the student robot's performance of the ‘go’ behaviour sequence shown in Figure 28 are given in Figure 27. In the examples depicted in Figure 25 to Figure 28 the teacher and student robots are initialised at the same position. When the student is performing the ‘go’ behaviour there is a difference between the two activation patterns of the HM area when it receives vision alone (Figure 27a) and when the HM area receives activation from the SOM area (Figure 27c). The difference arises from the lack of motor direction input in Figure 27a and from the completion of the pattern in Figure 27c to include the motor-direction-induced activation that would occur during full observation. The activation patterns on the HM and SOM areas are very similar between recognition and performance, which indicates that the units display mirror neuron properties. The main difference between performance and recognition can be seen by comparing the activation patterns of the HM area based on high-level vision and motor direction during recognition (Figure 25a) with, during performance, the activation patterns of the HM area based on vision alone (Figure 27a) and the reconstruction of the activation on the HM area that produces the motor action from the SOM area activation (Figure 27c).
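The performance pass described here, in which the vision-driven HM pattern is completed via the SOM area and a motor command is read out, could be sketched as follows. The ordering of motor and vision units in the input vector, the use of the winning SOM unit's receptive field as the completed HM pattern, and the omission of the language bias on the SOM winner are simplifying assumptions.

import numpy as np

def perform_step(vision, W_bu, W_td, W_som, n_motor=4, beta=2.0, n=64.0):
    # Performance pass corresponding to the panels of Figure 27:
    # (a) HM activation from vision alone, (b) SOM winner, (c) top-down
    # reconstruction of the HM pattern, (d) motor command read out from it.
    z = np.concatenate([np.zeros(n_motor), vision])   # motor input is absent
    e = np.exp(beta * (W_bu @ z))
    r_hm = e / (e + n)                                # (a) sparse HM representation
    winner = np.argmax(W_som @ r_hm)                  # (b) winning SOM unit
    r_hm_rec = W_som[winner]                          # (c) winner's RF as the completed HM pattern
    z_rec = W_td @ r_hm_rec                           # project back to the input areas
    return int(np.argmax(z_rec[:n_motor]))            # (d) index of the motor unit to execute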
(a) HM area activation based on high-level vision and motor input; (b) SOM area winner-based classification based only on the HM area input; (c) language classification by the SOM winning unit: ‘go’ for the first nine time steps and ‘pick’ for the last.
Figure 25 Activation sequences during observation of a ‘go’ behaviour, without language
instruction input. Strong activations are depicted dark, and shown at ten time steps from
left to right. Circles mark the bottom-up input of the active motor unit of the teacher
which changes from ‘forward’ in the first 6 steps to ‘turn left’ during the last 4 steps.
Language instruction recognition is correct except for the last time step which is
classified as ‘pick’.
Figure 26 The teacher robot performing the 10 step section of the ‘go’ behaviour related to the activation sequences during observation of the ‘go’ behaviour from Figure 25.
The differences in the HM area unit activation patterns during recognition and performance depend on the effect of the motor inputs. If, during training, the input differs between behaviours only in the motor direction input while the high-level vision input is the same, then the difference in the activation patterns must be large enough to activate a different SOM unit so that the SOM area can distinguish between the behaviours. During performance, however, the absence of the motor input should not have too strong an effect on the HM area representation, because otherwise the winner in the SOM area could differ greatly from the case when all inputs are present and the performance would become unpredictable.
(a) HM area activation based only on high-level vision input; (b) SOM area winner-based classification based on HM area and language input; (c) reconstruction of HM area activation by the SOM winning unit; (d) reconstruction of the motor command from HM area activations.
Figure 27 Activation sequences during performance by the student robot of a ‘go’ behaviour, i.e. without motor input. The performed sequence is visualised in Figure 28. Circles mark the region at each time step which has the decisive influence on the action being performed.
Figure 25c shows the activations of the language instruction area resulting from the top-down influence of the winning SOM area unit during recognition. An error is made at the last time step, despite the HM activation being hardly distinguishable from that of the second-last time step, owing to a change in the SOM area winning unit (Figure 25b). However, it is difficult to count this as a genuine recognition error, since some sections of behaviour are not specific to a single behaviour, the high-level vision and motor inputs being the same. For instance, during ‘go’ and ‘pick’ a forward movement toward the front wall occurs in large areas of the arena at certain orientation angles φ, and a turn movement near the wall toward the centre might also result from either behaviour.
Figure 28 The student robot performing the 10 step section of the ‘go’ behaviour without motor input, related to the activation sequences in Figure 27.
5.6 Discussion of hierarchical architecture
The hierarchical architecture recreates concepts of the regional cerebral cortex
modularity, word representation and the mirror neuron system principles. Considering
regional modularity and the neurocognitive evidence on word representation this
architecture make use of semantic features of the action verbs from the vision and motor
action inputs. There is also the inclusion the regional modularity, and word and action
verb representation principles as this architecture includes distributed representations.
For the hierarchical architecture the distributed representation comes from multiple units
being activated in the Helmholtz hidden area for specific action-perception association
and different units of the SOM area being activated when performing the binding
operation. Furthermore the hierarchical architecture form of distributed representation is
closer to the cerebral cortex as it has active units that are analogous to the word web
assemblies that represent words as described by Pulvermüller (1999) and (2003).
Moreover, the hierarchical model can be considered to be close to the regional modularity
of the cerebral cortex as it offers modules that perform independently on specific
operations in a hierarchical manner.
The ability of the hierarchical architecture to both recognise and perform behaviours shows that it recreates some concepts of the mirror neuron system principle. The GLIA hierarchical architecture achieves a common ‘action understanding’ between the teacher and the student of the behaviour's meaning through the language instruction, which recreates the idea that the mirror neuron system plays a role in the emergence of the human language system.
With regard to the hierarchical architecture it is possible to relate the HM area of the model to area F5 of the primate cortex and the SOM area to area F6. The F5 area contains neurons that can produce a number of different grasping activities. The F6 area performs as a switch to enable or restrict the effects of F5 unit activations so that only the required units in the F5 region are activated to perform or recognise the required action [Arbib et al. 2001]. The F6 area also produces the signal indicating when the action should be performed, thereby activating the appropriate neurons in F5, and is hence responsible for the higher-level control [Seitz et al. 2000]. The F6 area is responsible for controlling the high-level sequence for performing the behaviour [Arbib et al. 2001]. In our hierarchical GLIA architecture the HM area is directly linked to the motor output and contains identifiable groups of neurons that activate specific motor units, while the SOM area represents the channel through which a language instruction must pass in order to reach the motor-related HM units. Hence, the SOM area uses the language instruction to select the appropriate motor actions and to suppress the others. The hierarchical model thus suggests analogies to the organisation of the motor cortical areas F5 and F6, with area F5 containing the mirror neurons and area F6 determining which mirror neurons should be activated.
6. CONCLUSION
In this report we have described three language models that can be combined to achieve
neural multimodal language architecture. The first component of the architecture is able
to process all words in the MirrorBot grammar using spiking associative memory and is
made up of various areas for processing the language. As shown in this report this
language cortical model is able to check the grammar and semantic correctness and make
the language available for other MirrorBot models for performing behaviours such as for
tracking a specific object.
The second model, the self-organising cortical map language representation model, is able to represent action verbs based on the part of the body that performs the action by combining a phoneme representation and a perceptual representation of the word, taking the action verb from the cortical language model. The final model is able to take example action verbs represented by the second model and to ground these verbs in actions at the motor level based on mirror neurons. When developing the GLIA model we have placed emphasis on modelling the mirror neuron system and multimodal language processing in a more neurally realistic manner. Fundamentally, the model is able to bring together inspiration from the two neuroscience partners by representing action verbs based on the part of the body that performs the action and by basing the model on the mirror neuron system.
7. REFERENCES
Arbib, M.A., Fagg, A.H., Grafton, S.T. (2001) Synthetic PET imaging for grasping: From primate neurophysiology to human behavior. In: Explorative Analysis and Data Modelling in Functional Neuroimaging, Sommer, F., Wichert, A., Eds., Cambridge, MA, USA, MIT Press, pp. 231-250.
Arbib, M. (2005) From monkey-like action recognition to human language: An evolutionary framework for neurolinguistics. Behavioral and Brain Sciences, in press.
Dayan, P. (2000) Helmholtz machines and wake-sleep learning. In: The Handbook of Brain Theory and Neural Networks, Arbib, M., Ed., Cambridge, MA, USA, MIT Press.
Dayan, P., Hinton, G. (1996) Varieties of Helmholtz machine. Neural Networks, 9(8), pp. 1385-1403.
Dayan, P., Hinton, G. E., Neal, R., Zemel, R. S. (1995) The Helmholtz machine. Neural
Computation, 7, pp. 1022-1037.
Fay, R., Kaufmann, U., Knoblauch, A., Markert, H., Palm, G. (2004) Integrating Object Recognition, Visual Attention, Language and Action Processing on a Robot in a Neurobiologically Plausible Associative Architecture. In: Proceedings of the NeuroBotics Workshop, 27th German Conference on Artificial Intelligence (KI-2004), Ulm, Germany.
Foster, D.J., Morris, R.G.M., Dayan, P. (2000) A model of hippocampally dependent
navigation, using the temporal difference learning rule. Hippocampus, 10, pp. 1-16.
Hayes, G., Demiris, Y. (1994) A robot controller using learning by imitation. In: Proceedings of the 2nd International Symposium on Intelligent Robotic Systems, Grenoble, France.
Hinton, G.E., Dayan, P., Frey, B.J., Neal, R. (1995) The wake-sleep algorithm for unsupervised neural networks. Science, 268, pp. 1158-1161.
Knoblauch, A., Markert, H., Palm, G. (2005a) An Associative Model of Cortical
Language and Action Processing. In: Modeling Language, Cognition and Action
(Cangelosi, A., Bugmann, G., Borisyuk, R., eds.), Proceedings of the Ninth Neural
Computation and Psychology Workshop, Plymouth 2004. 79-83, World Scientific.
Knoblauch, A., Markert, H., Palm, G. (2005b) An associative cortical model of language understanding and action planning. In: Artificial Intelligence and Knowledge Engineering Applications: A Bioinspired Approach (Mira, J., Alvarez, J.R., eds.), Proceedings of IWINAC 2005, Las Palmas de Gran Canaria, Spain. LNCS 3562, pp. 405-414, Springer, Berlin, Heidelberg.
Markert, H., Knoblauch, A., Palm, G. (2005) Detecting sequences and understanding
language with neural associative memories and cell assemblies. In: Biomimetic
Neural Learning for Intelligent Robots (Wermter, S., Palm, G., Elshaw, M., eds.).
LNAI 3575, 107-117, Springer, Berlin, Heidelberg.
Marsland, S. (2003) Novelty detection in learning systems. Neural Computing Surveys,
3, pp. 157-195.
Opitz, B., Mecklinger, A., Friederici, A.D., von Cramon, D.Y. (1999) The functional
neuroanatomy of novelty processing: Integrating ERP and fMRI results. Cerebral
Cortex, 9(4), pp. 379-391.
Pulvermüller, F. (1999) Words in the brain’s language. Behavioral and Brain Sciences,
22(2), pp. 253-336.
Pulvermüller, F. (2003) The neuroscience of language: On brain circuits of words and serial order. Cambridge, UK, Cambridge University Press.
Rizzolatti, G., Arbib, M. (1998) Language within our grasp. Trends in Neurosciences, 21(5), pp. 188-194.
Seitz, R., Stephan, M., Binkofski, F. (2000) Control of action as mediated by the human
frontal lobe. Experimental Brain Research, 133, pp. 71-80.
Weber, C., Wermter, S., Zochios, A. (2003) Robot docking with neural vision and
reinforcement. Proceedings of the Twenty-third SGAI International Conference on
Innovative Techniques and Applications of Artificial Intelligence, Cambridge, UK.
Weber, C., Wermter, S., Zochios, A. (2004) Robot docking with neural vision and
reinforcement. Knowledge-Based Systems, 17(2-4), pp. 165-172.
Wermter, S., Palm, G., Elshaw, M. (Eds.) (2005) Biomimetic Neural Learning for Intelligent Robots. LNAI 3575, Springer, Berlin, Heidelberg.